Website Sources#

Automatically crawl and index entire websites to build your agent's knowledge base. AlonChat's crawler discovers pages, extracts clean text content, and indexes it so your agent can answer questions about your site.


Overview#

Website sources allow your agent to learn from:

  • Marketing websites and landing pages
  • Documentation sites
  • Blog posts and articles
  • Product catalogs
  • Help centers and FAQs

The crawler automatically discovers pages, extracts clean text (removing navigation, ads, and other non-content elements), and indexes everything for intelligent retrieval.
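AlonChat's extraction pipeline is internal, but the core idea can be sketched in a few lines of Python: walk the HTML and keep only text that sits outside boilerplate elements. The tag list below is an illustrative assumption, not AlonChat's actual rule set.

```python
from html.parser import HTMLParser

# Tags whose contents are typically boilerplate rather than page content
# (assumed list for illustration).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}

class ContentExtractor(HTMLParser):
    """Collects visible text while skipping boilerplate elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0       # how deep we are inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = """
<html><body>
  <nav><a href="/">Home</a><a href="/docs">Docs</a></nav>
  <main><h1>Refund Policy</h1><p>Refunds are issued within 14 days.</p></main>
  <footer>Example Inc.</footer>
</body></html>
"""

parser = ContentExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # "Refund Policy Refunds are issued within 14 days."
```

Real extractors use much richer heuristics (link density, visual layout), but the principle is the same: content text in, chrome out.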


How Website Crawling Works#

AlonChat uses a two-phase crawling process:

Phase 1: Discovery#

  • Starts from the URL you provide
  • Discovers linked pages automatically
  • Respects robots.txt rules
  • Detects sitemaps (sitemap.xml) for faster, more complete discovery
  • Shows you all discovered URLs for review
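The robots.txt check in the list above can be illustrated with Python's standard `urllib.robotparser`; the robots.txt content here is a hypothetical example, not tied to any real site.

```python
import urllib.robotparser

# A hypothetical robots.txt payload (a crawler would fetch this over HTTP).
robots_txt = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/docs/intro"))   # True
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False
print(rp.site_maps())  # ['https://example.com/sitemap.xml']
```

Note that robots.txt can also declare sitemap locations, which is one of the ways a crawler finds them without guessing standard paths.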

Phase 2: Crawl and Index#

  • You review and approve the discovered URLs
  • The crawler visits each approved page
  • Clean text content is extracted (navigation, ads, and boilerplate are removed)
  • Content is processed and indexed for your agent to use
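As a rough sketch (not AlonChat's implementation), the discovery phase is a breadth-first walk over page links. The example below runs against an in-memory stand-in for a site so it needs no network access:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# In-memory stand-in for a website; a real crawler fetches pages over HTTP.
SITE = {
    "https://example.com/": '<a href="/docs/">Docs</a><a href="/blog/">Blog</a>',
    "https://example.com/docs/": '<a href="/docs/setup">Setup</a>',
    "https://example.com/docs/setup": "<p>Setup guide</p>",
    "https://example.com/blog/": "<p>Posts</p>",
}

class LinkCollector(HTMLParser):
    """Gathers href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def discover(start_url, max_pages=50):
    """Phase 1: breadth-first link discovery from the starting URL."""
    queue, found = [start_url], {start_url}
    while queue and len(found) < max_pages:
        url = queue.pop(0)
        collector = LinkCollector()
        collector.feed(SITE.get(url, ""))
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in found:
                found.add(absolute)
                queue.append(absolute)
    return sorted(found)

urls = discover("https://example.com/")
print(urls)
```

Phase 2 then visits only the URLs you approve from this discovered list.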

Adding a Website Source#

Step 1: Navigate to Sources#

  1. Go to your agent's dashboard
  2. Click Sources in the sidebar
  3. Open Website
  4. Click Add Website

Step 2: Enter Website URL#

Enter the starting URL to crawl:

Code
https://example.com

Tips:

  • Start with the homepage for full site coverage
  • Use a specific section URL for targeted crawling (e.g., https://example.com/docs or https://example.com/blog)
  • You can also provide a direct sitemap.xml URL

Step 3: Configure Crawl Settings#

Maximum Pages#

Set a limit on how many pages to crawl (default: 50, max: 500).

Recommendations:

  • Small site (under 20 pages): Set to 50
  • Medium site (20-100 pages): Set to 100-200
  • Large site (100+ pages): Set to 500 or use pattern filtering to focus on the most relevant sections

URL Pattern Filters#

Include Patterns -- only crawl URLs matching these patterns:

Code
/docs/*          # Only documentation pages
/blog/*          # Only blog posts
/products/*      # Only product pages

Exclude Patterns -- skip URLs matching these patterns:

Code
/admin/*         # Skip admin pages
/cart/*          # Skip shopping cart
*.pdf            # Skip PDF files
/api/*           # Skip API endpoints

Pattern syntax:

  • * matches any characters
  • /docs/* matches all pages under /docs
  • *.pdf matches all PDF file URLs
  • Use one pattern per line
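This wildcard syntax behaves like shell-style globbing, so the filtering logic can be sketched with Python's `fnmatch`. (The exact matching semantics AlonChat applies are an assumption here; this sketch matches patterns against the URL path.)

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def allowed(url, include, exclude):
    """Return True if the URL's path passes the include/exclude filters."""
    path = urlparse(url).path
    # If include patterns exist, the path must match at least one.
    if include and not any(fnmatch(path, p) for p in include):
        return False
    # The path must match no exclude pattern.
    return not any(fnmatch(path, p) for p in exclude)

include = ["/docs/*"]
exclude = ["*.pdf", "/admin/*"]

print(allowed("https://example.com/docs/setup", include, exclude))       # True
print(allowed("https://example.com/blog/post", include, exclude))        # False
print(allowed("https://example.com/docs/manual.pdf", include, exclude))  # False
```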

Step 4: Review Discovered URLs#

After discovery completes, you'll see a list of all discovered URLs:

  • Included: URLs matching your patterns (will be crawled)
  • Excluded: URLs filtered by your exclusion patterns
  • Review: Click to preview page content

Actions:

  • Approve All: Crawl all included URLs
  • Approve Selected: Choose specific pages to crawl
  • Cancel: Start over with different settings

Step 5: Wait for Crawl#

The crawler will:

  1. Visit each approved URL
  2. Extract clean text content
  3. Process and index the content
  4. Update your agent's knowledge base

You can track progress in real time, including pages processed and any errors encountered.


Sitemap Support#

AlonChat automatically detects and uses sitemaps for faster, more complete discovery.

Automatic detection:

  • Checks /sitemap.xml
  • Checks /sitemap_index.xml
  • Parses robots.txt for sitemap references

Manual sitemap URL: You can also enter a sitemap URL directly:

Code
https://example.com/sitemap.xml

Benefits of sitemaps:

  • Faster discovery (no need to follow links to find pages)
  • More complete coverage (finds all published pages)
  • Respects priority hints from the sitemap
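A sitemap is plain XML, so extracting its URLs takes only a few lines. A minimal sketch using Python's `xml.etree` against a hypothetical sitemap payload:

```python
import xml.etree.ElementTree as ET

# A minimal sitemap.xml payload (hypothetical; a crawler fetches this over HTTP).
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><priority>1.0</priority></url>
  <url><loc>https://example.com/docs/</loc><priority>0.8</priority></url>
</urlset>"""

# Sitemap elements live in the sitemaps.org namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(urls)  # ['https://example.com/', 'https://example.com/docs/']
```

Sitemap index files (`sitemap_index.xml`) work the same way, except each entry points to another sitemap to parse in turn.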

Re-Crawling Websites#

Keep your agent's knowledge up to date by re-crawling:

  1. Go to Sources > Website
  2. Find the website source
  3. Click the menu icon, then select Re-crawl

What happens during a re-crawl:

  • New pages added to the site are discovered
  • Content is updated for existing pages
  • Content from deleted pages is removed
  • Your original filter settings are preserved
  • Updates are applied without data loss -- existing content remains available until new content is successfully processed
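Conceptually, a re-crawl reduces to set arithmetic over the previously indexed URLs and the freshly discovered ones. A sketch with hypothetical URL sets:

```python
# URLs indexed during the previous crawl vs. URLs discovered now
# (hypothetical data for illustration).
previous = {"https://example.com/", "https://example.com/old-page", "https://example.com/docs/"}
current  = {"https://example.com/", "https://example.com/docs/", "https://example.com/new-page"}

added   = current - previous   # new pages to crawl and index
removed = previous - current   # stale pages whose content is dropped
kept    = previous & current   # existing pages re-fetched for updated content

print(sorted(added))    # ['https://example.com/new-page']
print(sorted(removed))  # ['https://example.com/old-page']
print(sorted(kept))     # ['https://example.com/', 'https://example.com/docs/']
```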

Pattern Filtering Examples#

Documentation Only#

Code
Include:
/docs/*

Exclude:
/blog/*
/pricing

Multiple Sections#

Code
Include:
/docs/*
/guides/*
/tutorials/*

Exclude:
*/drafts/*
*/archive/*

Blog Posts Only#

Code
Include:
/blog/*

Exclude:
/blog/tag/*
/blog/author/*

Best Practices#

1. Start Small#

  • Test with 10-20 pages first
  • Verify content quality looks good
  • Then expand to the full site

2. Use Pattern Filters#

  • Exclude login pages and admin panels
  • Skip shopping cart and checkout flows
  • Focus on informational content that answers customer questions

3. Monitor Crawl Quality#

  • Review crawled pages on the Website page
  • Check that navigation and ads were properly removed
  • Adjust patterns if you see unwanted content

4. Keep Content Fresh#

  • Re-crawl weekly for frequently updated sites
  • Re-crawl monthly for static sites
  • Set reminders to check for outdated content

Limitations#

  • Maximum pages: 500 pages per website source
  • File types: HTML pages only (for PDFs or Word docs, use File Sources)
  • JavaScript rendering: Limited support for heavily JavaScript-dependent sites
  • Authentication: Cannot crawl password-protected pages
  • Dynamic content: AJAX-loaded content may not be captured

Workarounds:

  • For PDF or Word documents: Use File Sources
  • For protected pages: Export to PDF and upload as a file
  • For JavaScript-heavy sites: Consider using text sources or file uploads instead

Troubleshooting#

"No pages discovered"#

  • Check that the URL is accessible in your browser
  • Verify the site does not block crawlers in its robots.txt
  • Try providing the sitemap URL directly

"Crawl failed"#

  • The site may have anti-bot protection
  • Try crawling fewer pages at once
  • Check if the site requires authentication

"Poor content quality"#

  • The extracted content may include too much navigation or boilerplate
  • Try more specific URL patterns to target content-rich pages
  • Consider using File Sources or Text Sources for better control

"Missing pages"#

  • Check your exclusion patterns -- they may be filtering out pages you want
  • Increase the maximum pages limit
  • Verify the missing pages are linked from other crawled pages (or add them via sitemap)

Next Steps#