Website Sources#

Automatically crawl and index entire websites to build your agent's knowledge base. AlonChat's intelligent crawler can discover, extract, and process web content while respecting your preferences.

Overview#

Website sources allow your agent to learn from:

  • Marketing websites and landing pages
  • Documentation sites
  • Blog posts and articles
  • Product catalogs
  • Help centers and FAQs

The crawler automatically discovers pages, extracts clean text, and creates searchable knowledge chunks.

How Website Crawling Works#

AlonChat uses a two-phase progressive crawling system:

Phase 1: Discovery#

  • Starts from your provided URL
  • Discovers linked pages automatically
  • Respects robots.txt rules
  • Finds sitemaps (sitemap.xml)
  • Shows you all discovered URLs for review
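The robots.txt check in the discovery phase can be sketched with Python's standard library (a simplified illustration, not AlonChat's internals; the function and the `AlonChatBot` agent name are assumptions):

```python
from urllib import robotparser

# Illustrative: decide whether a discovered URL may be fetched,
# given the raw text of the site's robots.txt.
def allowed(robots_txt: str, url: str, agent: str = "AlonChatBot") -> bool:
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```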

Phase 2: Crawl & Index#

  • You review and approve discovered URLs
  • Crawler visits approved pages
  • Extracts text content (removes navigation, ads, etc.)
  • Chunks content into searchable pieces
  • Generates embeddings for semantic search

Adding a Website Source#

Step 1: Navigate to Knowledge Base#

  1. Go to your agent's Settings page
  2. Click Knowledge Base tab
  3. Click Add Website

Step 2: Enter Website URL#

Enter the starting URL to crawl:

Code
https://example.com

Tips:

  • Start with the homepage for full site coverage
  • Use a specific section for targeted crawling (e.g., /docs or /blog)
  • You can also provide a direct sitemap.xml URL

Step 3: Configure Crawl Settings#

Maximum Pages#

Limit how many pages to crawl (default: 50, max: 500)

Recommendations:

  • Small site (< 20 pages): Set to 50
  • Medium site (20-100 pages): Set to 100-200
  • Large site (100+ pages): Set to 500 or use pattern filtering

URL Pattern Filters#

Include Patterns - Only crawl URLs matching these patterns:

Code
/docs/*          # Only documentation pages
/blog/*          # Only blog posts
/products/*      # Only product pages

Exclude Patterns - Skip URLs matching these patterns:

Code
/admin/*         # Skip admin pages
/cart/*          # Skip shopping cart
*.pdf            # Skip PDF files
/api/*           # Skip API endpoints

Pattern Syntax:

  • * = Match any characters
  • /docs/* = All pages under /docs
  • *.pdf = All PDF file URLs
  • Use one pattern per line
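The glob syntax above behaves like shell-style wildcards; a minimal sketch of include/exclude filtering, assuming patterns are matched against the URL path (the function name and exact matching rules are illustrative, not a description of AlonChat's implementation):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# Exclude patterns win; if include patterns exist, at least one must match.
def should_crawl(url: str, include: list[str], exclude: list[str]) -> bool:
    path = urlparse(url).path
    if any(fnmatch(path, pat) for pat in exclude):
        return False
    return not include or any(fnmatch(path, pat) for pat in include)
```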

Step 4: Review Discovered URLs#

After discovery completes, you'll see a list of all found URLs:

  • Included: URLs matching your patterns (will be crawled)
  • Excluded: URLs filtered by your exclusion patterns
  • 🔍 Review: Click to preview page content

Actions:

  • Approve All: Crawl all included URLs
  • Approve Selected: Choose specific pages to crawl
  • Cancel: Start over with different settings

Step 5: Wait for Crawl#

The crawler will now:

  1. Visit each approved URL
  2. Extract clean text content
  3. Create 2KB content chunks
  4. Generate embeddings
  5. Index for search
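The chunking step (step 3) can be pictured roughly as follows. Only the ~2KB size comes from the docs; splitting on whitespace so words stay intact is an assumption for illustration:

```python
# Split extracted text into chunks of at most max_bytes UTF-8 bytes,
# breaking only at whitespace so no word is cut in half.
def chunk_text(text: str, max_bytes: int = 2048) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for word in text.split():
        w = len(word.encode("utf-8")) + 1  # +1 for the joining space
        if size + w > max_bytes and current:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += w
    if current:
        chunks.append(" ".join(current))
    return chunks
```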

Progress Tracking:

  • See real-time crawl progress
  • Monitor pages processed
  • View any errors encountered

Sitemap Support#

AlonChat automatically detects and uses sitemaps for faster discovery.

Automatic Detection:

  • Checks /sitemap.xml
  • Checks /sitemap_index.xml
  • Parses robots.txt for sitemap references
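Automatic detection amounts to building a candidate list from the well-known locations plus any `Sitemap:` lines in robots.txt. A hedged sketch (names are illustrative):

```python
from urllib.parse import urljoin

# Collect likely sitemap URLs: the two conventional paths, plus any
# "Sitemap:" directives found in the site's robots.txt text.
def sitemap_candidates(base_url: str, robots_txt: str = "") -> list[str]:
    candidates = [
        urljoin(base_url, "/sitemap.xml"),
        urljoin(base_url, "/sitemap_index.xml"),
    ]
    for line in robots_txt.splitlines():
        if line.lower().startswith("sitemap:"):
            candidates.append(line.split(":", 1)[1].strip())
    return candidates
```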

Manual Sitemap URL:

Code
https://example.com/sitemap.xml

Benefits:

  • ✅ Faster discovery (no need to crawl to find pages)
  • ✅ Complete coverage (finds all published pages)
  • ✅ Respects priority hints

Re-Crawling Websites#

Keep your agent's knowledge up-to-date by re-crawling websites:

  1. Go to Knowledge Base
  2. Find the website source
  3. Click menu → Re-crawl

What happens during re-crawl:

  • Discovers new pages added to the site
  • Updates content for existing pages
  • Removes content from deleted pages
  • Preserves your original filter settings

Zero Data Loss: AlonChat uses an APPEND-THEN-CLEAN pattern: old content stays in place until the new content has been successfully processed.
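The APPEND-THEN-CLEAN ordering can be sketched with a toy in-memory store. The real storage layout is certainly different; this only shows that new chunks are written before old ones are deleted:

```python
# Keys are (source_id, version, chunk_index); values are chunk text.
def recrawl(store: dict, source_id: str, version: int, new_chunks: list[str]) -> None:
    stale = [k for k in store if k[0] == source_id and k[1] != version]
    # Append phase: store every new chunk first.
    for i, chunk in enumerate(new_chunks):
        store[(source_id, version, i)] = chunk
    # Clean phase: only now remove the previous version's chunks.
    for key in stale:
        del store[key]
```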

Pattern Filtering Examples#

Example 1: Documentation Only#

Code
Include:
/docs/*

Exclude:
/blog/*
/pricing

Example 2: Exclude External Links#

Code
Exclude:
http://*
https://external-site.com/*

Example 3: Multiple Sections#

Code
Include:
/docs/*
/guides/*
/tutorials/*

Exclude:
*/drafts/*
*/archive/*

Best Practices#

1. Start Small#

  • Test with 10-20 pages first
  • Verify content quality
  • Then expand to full site

2. Use Pattern Filters#

  • Exclude login pages, admin panels
  • Skip shopping cart, checkout flows
  • Focus on informational content

3. Monitor Crawl Quality#

  • Review crawled pages in Knowledge Base
  • Check if navigation/ads were properly removed
  • Adjust patterns if needed

4. Keep Content Fresh#

  • Re-crawl weekly for frequently updated sites
  • Re-crawl monthly for static sites
  • Set reminders to check for outdated content

5. Respect Rate Limits#

  • AlonChat crawls 2 pages per second (default)
  • Increase for your own sites
  • Keep default for external sites
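A rate of 2 pages per second amounts to one request every 0.5 seconds. A minimal sketch of such a throttle (illustrative only, not AlonChat's crawler code):

```python
import time

# Enforce a minimum interval between calls to wait(), so fetches
# average at most `rate` requests per second.
class RateLimiter:
    def __init__(self, rate: float = 2.0):
        self.interval = 1.0 / rate
        self.last = 0.0

    def wait(self) -> None:
        now = time.monotonic()
        delay = self.last + self.interval - now
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()
```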

Limitations#

  • Maximum pages: 500 pages per website source
  • File types: HTML only (PDF/Word require file upload)
  • JavaScript rendering: Limited support for heavy JS sites
  • Authentication: Cannot crawl password-protected pages
  • Dynamic content: AJAX-loaded content may be missed

Workarounds:

  • For PDF/Word docs: Use File Sources instead
  • For protected pages: Export to PDF and upload
  • For JS-heavy sites: Use RSS feeds or API integration

Troubleshooting#

"No pages discovered"#

  • Check URL is accessible
  • Verify site doesn't block crawlers (robots.txt)
  • Try the sitemap URL directly

"Crawl failed"#

  • Site may have anti-bot protection
  • Try crawling fewer pages at once
  • Check if site requires authentication

"Poor content quality"#

  • Content may have too much navigation/ads
  • Try more specific URL patterns
  • Consider using File Sources for better control

"Missing pages"#

  • Check your exclusion patterns
  • Increase maximum pages limit
  • Verify pages are linked from crawled pages

Next Steps#
