Website Sources#

Automatically crawl and index entire websites to build your agent's knowledge base. AlonChat's crawler discovers pages, extracts clean text content, and indexes it so your agent can answer questions about your site.


Overview#

Website sources allow your agent to learn from:

  • Marketing websites and landing pages
  • Documentation sites
  • Blog posts and articles
  • Product catalogs
  • Help centers and FAQs

The crawler automatically discovers pages, extracts clean text (removing navigation, ads, and other non-content elements), and indexes everything for intelligent retrieval.
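AlonChat's extraction pipeline is internal, but the core idea can be sketched in a few lines of Python: walk the HTML and keep only text that sits outside boilerplate elements. The tag list below is an illustrative assumption, not AlonChat's actual rule set.

```python
from html.parser import HTMLParser

# Tags whose contents are typically boilerplate rather than page content
# (assumed list for illustration).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}

class ContentExtractor(HTMLParser):
    """Collects visible text while skipping boilerplate elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0       # how deep we are inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = """
<html><body>
  <nav><a href="/">Home</a><a href="/docs">Docs</a></nav>
  <main><h1>Refund Policy</h1><p>Refunds are issued within 14 days.</p></main>
  <footer>Example Inc.</footer>
</body></html>
"""

parser = ContentExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # "Refund Policy Refunds are issued within 14 days."
```

Real extractors use much richer heuristics (link density, visual layout), but the principle is the same: content text in, chrome out.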


How Website Crawling Works#

AlonChat uses a two-phase crawling process:

Phase 1: Discovery#

  • Starts from the URL you provide
  • Discovers linked pages automatically
  • Respects robots.txt rules
  • Detects sitemaps (sitemap.xml) for faster, more complete discovery
  • Shows you all discovered URLs for review
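The robots.txt check in the list above can be illustrated with Python's standard `urllib.robotparser`; the robots.txt content here is a hypothetical example, not tied to any real site.

```python
import urllib.robotparser

# A hypothetical robots.txt payload (a crawler would fetch this over HTTP).
robots_txt = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/docs/intro"))   # True
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False
print(rp.site_maps())  # ['https://example.com/sitemap.xml']
```

Note that robots.txt can also declare sitemap locations, which is one of the ways a crawler finds them without guessing standard paths.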

Phase 2: Crawl and Index#

  • You review and approve the discovered URLs
  • The crawler visits each approved page
  • Clean text content is extracted (navigation, ads, and boilerplate are removed)
  • Content is processed and indexed for your agent to use
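As a rough sketch (not AlonChat's implementation), the discovery phase is a breadth-first walk over page links. The example below runs against an in-memory stand-in for a site so it needs no network access:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# In-memory stand-in for a website; a real crawler fetches pages over HTTP.
SITE = {
    "https://example.com/": '<a href="/docs/">Docs</a><a href="/blog/">Blog</a>',
    "https://example.com/docs/": '<a href="/docs/setup">Setup</a>',
    "https://example.com/docs/setup": "<p>Setup guide</p>",
    "https://example.com/blog/": "<p>Posts</p>",
}

class LinkCollector(HTMLParser):
    """Gathers href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def discover(start_url, max_pages=50):
    """Phase 1: breadth-first link discovery from the starting URL."""
    queue, found = [start_url], {start_url}
    while queue and len(found) < max_pages:
        url = queue.pop(0)
        collector = LinkCollector()
        collector.feed(SITE.get(url, ""))
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in found:
                found.add(absolute)
                queue.append(absolute)
    return sorted(found)

urls = discover("https://example.com/")
print(urls)
```

Phase 2 then visits only the URLs you approve from this discovered list.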

Adding a Website Source#

Step 1: Navigate to Sources#

  1. Go to your agent's dashboard
  2. Click Sources in the sidebar
  3. Open Website
  4. Click Add Website

Step 2: Enter Website URL#

Enter the starting URL to crawl:

Code
https://example.com

Tips:

  • Start with the homepage for full site coverage
  • Use a specific section URL for targeted crawling (e.g., https://example.com/docs or https://example.com/blog)
  • You can also provide a direct sitemap.xml URL

Step 3: Configure Crawl Settings#

Maximum Pages#

Set a limit on how many pages to crawl (default: 50, max: 500).

Recommendations:

  • Small site (under 20 pages): Set to 50
  • Medium site (20-100 pages): Set to 100-200
  • Large site (100+ pages): Set to 500 or use pattern filtering to focus on the most relevant sections

URL Pattern Filters#

Include Patterns -- only crawl URLs matching these patterns:

Code
/docs/*          # Only documentation pages
/blog/*          # Only blog posts
/products/*      # Only product pages

Exclude Patterns -- skip URLs matching these patterns:

Code
/admin/*         # Skip admin pages
/cart/*          # Skip shopping cart
*.pdf            # Skip PDF files
/api/*           # Skip API endpoints

Pattern syntax:

  • * matches any characters
  • /docs/* matches all pages under /docs
  • *.pdf matches all PDF file URLs
  • Use one pattern per line
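This wildcard syntax behaves like shell-style globbing, so the filtering logic can be sketched with Python's `fnmatch`. (The exact matching semantics AlonChat applies are an assumption here; this sketch matches patterns against the URL path.)

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def allowed(url, include, exclude):
    """Return True if the URL's path passes the include/exclude filters."""
    path = urlparse(url).path
    # If include patterns exist, the path must match at least one.
    if include and not any(fnmatch(path, p) for p in include):
        return False
    # The path must match no exclude pattern.
    return not any(fnmatch(path, p) for p in exclude)

include = ["/docs/*"]
exclude = ["*.pdf", "/admin/*"]

print(allowed("https://example.com/docs/setup", include, exclude))       # True
print(allowed("https://example.com/blog/post", include, exclude))        # False
print(allowed("https://example.com/docs/manual.pdf", include, exclude))  # False
```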

Step 4: Review Discovered URLs#

After discovery completes, you'll see a list of all discovered URLs:

  • Included: URLs matching your patterns (will be crawled)
  • Excluded: URLs filtered by your exclusion patterns
  • Review: Click to preview page content

Actions:

  • Approve All: Crawl all included URLs
  • Approve Selected: Choose specific pages to crawl
  • Cancel: Start over with different settings

Step 5: Wait for Crawl#

The crawler will:

  1. Visit each approved URL
  2. Extract clean text content
  3. Process and index the content
  4. Update your agent's knowledge base

You can track progress in real time, including pages processed and any errors encountered.


Sitemap Support#

AlonChat automatically detects and uses sitemaps for faster, more complete discovery.

Automatic detection:

  • Checks /sitemap.xml
  • Checks /sitemap_index.xml
  • Parses robots.txt for sitemap references

Manual sitemap URL: You can also enter a sitemap URL directly:

Code
https://example.com/sitemap.xml

Benefits of sitemaps:

  • Faster discovery (no need to follow links to find pages)
  • More complete coverage (finds all published pages)
  • Respects priority hints from the sitemap
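A sitemap is plain XML, so extracting its URLs takes only a few lines. A minimal sketch using Python's `xml.etree` against a hypothetical sitemap payload:

```python
import xml.etree.ElementTree as ET

# A minimal sitemap.xml payload (hypothetical; a crawler fetches this over HTTP).
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><priority>1.0</priority></url>
  <url><loc>https://example.com/docs/</loc><priority>0.8</priority></url>
</urlset>"""

# Sitemap elements live in the sitemaps.org namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(urls)  # ['https://example.com/', 'https://example.com/docs/']
```

Sitemap index files (`sitemap_index.xml`) work the same way, except each entry points to another sitemap to parse in turn.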

Re-Crawling Websites#

Keep your agent's knowledge up to date by re-crawling:

  1. Go to Sources > Website
  2. Find the website source
  3. Click the menu icon, then select Re-crawl

What happens during a re-crawl:

  • New pages added to the site are discovered
  • Content is updated for existing pages
  • Content from deleted pages is removed
  • Your original filter settings are preserved
  • Updates are applied without data loss -- existing content remains available until new content is successfully processed
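Conceptually, a re-crawl reduces to set arithmetic over the previously indexed URLs and the freshly discovered ones. A sketch with hypothetical URL sets:

```python
# URLs indexed during the previous crawl vs. URLs discovered now
# (hypothetical data for illustration).
previous = {"https://example.com/", "https://example.com/old-page", "https://example.com/docs/"}
current  = {"https://example.com/", "https://example.com/docs/", "https://example.com/new-page"}

added   = current - previous   # new pages to crawl and index
removed = previous - current   # stale pages whose content is dropped
kept    = previous & current   # existing pages re-fetched for updated content

print(sorted(added))    # ['https://example.com/new-page']
print(sorted(removed))  # ['https://example.com/old-page']
print(sorted(kept))     # ['https://example.com/', 'https://example.com/docs/']
```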

Pattern Filtering Examples#

Documentation Only#

Code
Include:
/docs/*

Exclude:
/blog/*
/pricing

Multiple Sections#

Code
Include:
/docs/*
/guides/*
/tutorials/*

Exclude:
*/drafts/*
*/archive/*

Blog Posts Only#

Code
Include:
/blog/*

Exclude:
/blog/tag/*
/blog/author/*

Best Practices#

1. Start Small#

  • Test with 10-20 pages first
  • Verify content quality looks good
  • Then expand to the full site

2. Use Pattern Filters#

  • Exclude login pages and admin panels
  • Skip shopping cart and checkout flows
  • Focus on informational content that answers customer questions

3. Monitor Crawl Quality#

  • Review crawled pages on the Website page
  • Check that navigation and ads were properly removed
  • Adjust patterns if you see unwanted content

4. Keep Content Fresh#

  • Re-crawl weekly for frequently updated sites
  • Re-crawl monthly for static sites
  • Set reminders to check for outdated content

Limitations#

  • Maximum pages: 500 pages per website source
  • File types: HTML pages only (for PDFs or Word docs, use File Sources)
  • JavaScript rendering: Limited support for heavily JavaScript-dependent sites
  • Authentication: Cannot crawl password-protected pages
  • Dynamic content: AJAX-loaded content may not be captured

Workarounds:

  • For PDF or Word documents: Use File Sources
  • For protected pages: Export to PDF and upload as a file
  • For JavaScript-heavy sites: Consider using text sources or file uploads instead

Troubleshooting#

"No pages discovered"#

  • Check that the URL is accessible in your browser
  • Verify the site does not block crawlers in its robots.txt
  • Try providing the sitemap URL directly

"Crawl failed"#

  • The site may have anti-bot protection
  • Try crawling fewer pages at once
  • Check if the site requires authentication

"Poor content quality"#

  • The extracted content may include too much navigation or boilerplate
  • Try more specific URL patterns to target content-rich pages
  • Consider using File Sources or Text Sources for better control

"Missing pages"#

  • Check your exclusion patterns -- they may be filtering out pages you want
  • Increase the maximum pages limit
  • Verify the missing pages are linked from other crawled pages (or add them via sitemap)

Next Steps#