Website Sources
Crawl and index website pages to automatically build your agent's knowledge base
Automatically crawl and index entire websites to build your agent's knowledge base. AlonChat's crawler discovers pages, extracts clean text content, and indexes it so your agent can answer questions about your site.
Overview#
Website sources allow your agent to learn from:
- Marketing websites and landing pages
- Documentation sites
- Blog posts and articles
- Product catalogs
- Help centers and FAQs
The crawler automatically discovers pages, extracts clean text (removing navigation, ads, and other non-content elements), and indexes everything for intelligent retrieval.
How Website Crawling Works#
AlonChat uses a two-phase crawling process:
Phase 1: Discovery#
- Starts from the URL you provide
- Discovers linked pages automatically
- Respects `robots.txt` rules
- Detects sitemaps (`sitemap.xml`) for faster, more complete discovery
- Shows you all discovered URLs for review
Phase 2: Crawl and Index#
- You review and approve the discovered URLs
- The crawler visits each approved page
- Clean text content is extracted (navigation, ads, and boilerplate are removed)
- Content is processed and indexed for your agent to use
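The clean-text extraction step can be sketched with Python's built-in `html.parser`. The set of skipped tags below is an assumption chosen for the example; AlonChat's actual filter rules are not documented here:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping common non-content elements."""
    # Assumed boilerplate tags for this sketch
    SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # nesting depth inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

print(extract_text("<nav>Menu</nav><p>Hello world</p><script>var x=1;</script>"))
# Hello world
```

Navigation and script content are dropped; only the paragraph text survives.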
Adding a Website Source#
Step 1: Navigate to Sources#
- Go to your agent's dashboard
- Click Sources in the sidebar
- Open Website
- Click Add Website
Step 2: Enter Website URL#
Enter the starting URL to crawl:
https://example.com
Tips:
- Start with the homepage for full site coverage
- Use a specific section URL for targeted crawling (e.g., `https://example.com/docs` or `https://example.com/blog`)
- You can also provide a direct `sitemap.xml` URL
Step 3: Configure Crawl Settings#
Maximum Pages#
Set a limit on how many pages to crawl (default: 50, max: 500).
Recommendations:
- Small site (under 20 pages): Set to 50
- Medium site (20-100 pages): Set to 100-200
- Large site (100+ pages): Set to 500 or use pattern filtering to focus on the most relevant sections
URL Pattern Filters#
Include Patterns -- only crawl URLs matching these patterns:
/docs/* # Only documentation pages
/blog/* # Only blog posts
/products/* # Only product pages
Exclude Patterns -- skip URLs matching these patterns:
/admin/* # Skip admin pages
/cart/* # Skip shopping cart
*.pdf # Skip PDF files
/api/* # Skip API endpoints
Pattern syntax:
- `*` matches any characters
- `/docs/*` matches all pages under `/docs`
- `*.pdf` matches all PDF file URLs
- Use one pattern per line
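This style of glob matching can be sketched with Python's `fnmatch` module. The rule that exclusions take precedence over inclusions is an assumption for this example, not a documented guarantee:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_crawl(url, include_patterns, exclude_patterns):
    """Apply include/exclude glob patterns to a URL's path."""
    path = urlparse(url).path
    if any(fnmatch(path, pat) for pat in exclude_patterns):
        return False          # assumed: exclusions always win
    if include_patterns:
        return any(fnmatch(path, pat) for pat in include_patterns)
    return True               # no include patterns: crawl everything

print(should_crawl("https://example.com/docs/intro", ["/docs/*"], ["*.pdf"]))
# True
print(should_crawl("https://example.com/docs/spec.pdf", ["/docs/*"], ["*.pdf"]))
# False
```

Note that `fnmatch`'s `*` matches any characters, including `/`, which matches the pattern semantics described above.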
Step 4: Review Discovered URLs#
After discovery completes, you see a list of all found URLs:
- Included: URLs matching your patterns (will be crawled)
- Excluded: URLs filtered by your exclusion patterns
- Review: Click to preview page content
Actions:
- Approve All: Crawl all included URLs
- Approve Selected: Choose specific pages to crawl
- Cancel: Start over with different settings
Step 5: Wait for Crawl#
The crawler will:
- Visit each approved URL
- Extract clean text content
- Process and index the content
- Update your agent's knowledge base
You can track progress in real time, including pages processed and any errors encountered.
Sitemap Support#
AlonChat automatically detects and uses sitemaps for faster, more complete discovery.
Automatic detection:
- Checks `/sitemap.xml`
- Checks `/sitemap_index.xml`
- Parses `robots.txt` for sitemap references
Manual sitemap URL: You can also enter a sitemap URL directly:
https://example.com/sitemap.xml
Benefits of sitemaps:
- Faster discovery (no need to follow links to find pages)
- More complete coverage (finds all published pages)
- Respects priority hints from the sitemap
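Reading page URLs and priority hints from a sitemap can be sketched with Python's standard `xml.etree.ElementTree`; the default priority of 0.5 follows the sitemaps.org protocol, but the sorting behavior shown is an illustrative assumption:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return (url, priority) pairs, highest priority first."""
    root = ET.fromstring(xml_text)
    pages = []
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", namespaces=SITEMAP_NS)
        priority = float(url_el.findtext("sm:priority", default="0.5",
                                         namespaces=SITEMAP_NS))
        pages.append((loc.strip(), priority))
    # Crawl higher-priority pages first (illustrates "priority hints")
    return sorted(pages, key=lambda p: -p[1])

sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/post</loc></url>
  <url><loc>https://example.com/</loc><priority>1.0</priority></url>
</urlset>"""
print(parse_sitemap(sitemap))
# [('https://example.com/', 1.0), ('https://example.com/blog/post', 0.5)]
```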
Re-Crawling Websites#
Keep your agent's knowledge up to date by re-crawling:
- Go to Sources > Website
- Find the website source
- Click the menu icon, then select Re-crawl
What happens during a re-crawl:
- New pages added to the site are discovered
- Content is updated for existing pages
- Content from deleted pages is removed
- Your original filter settings are preserved
- Updates are applied without data loss -- existing content remains available until new content is successfully processed
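The re-crawl behavior above amounts to a set difference between the previous and current crawl results. A minimal sketch of that bookkeeping (not AlonChat's internal logic):

```python
def diff_crawl(previous_urls, current_urls):
    """Classify pages between two crawls of the same site."""
    old, new = set(previous_urls), set(current_urls)
    return {
        "added": sorted(new - old),      # new pages to index
        "updated": sorted(old & new),    # existing pages to re-extract
        "removed": sorted(old - new),    # deleted pages to drop
    }

changes = diff_crawl(
    ["https://example.com/", "https://example.com/old-promo"],
    ["https://example.com/", "https://example.com/new-feature"],
)
print(changes["added"])    # ['https://example.com/new-feature']
print(changes["removed"])  # ['https://example.com/old-promo']
```

To avoid data loss, the "removed" and "updated" entries would only replace existing content after the new crawl finishes processing successfully, as described above.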
Pattern Filtering Examples#
Documentation Only#
Include:
/docs/*
Exclude:
/blog/*
/pricing
Multiple Sections#
Include:
/docs/*
/guides/*
/tutorials/*
Exclude:
*/drafts/*
*/archive/*
Blog Posts Only#
Include:
/blog/*
Exclude:
/blog/tag/*
/blog/author/*
Best Practices#
1. Start Small#
- Test with 10-20 pages first
- Verify content quality looks good
- Then expand to the full site
2. Use Pattern Filters#
- Exclude login pages and admin panels
- Skip shopping cart and checkout flows
- Focus on informational content that answers customer questions
3. Monitor Crawl Quality#
- Review crawled pages on the Website page
- Check that navigation and ads were properly removed
- Adjust patterns if you see unwanted content
4. Keep Content Fresh#
- Re-crawl weekly for frequently updated sites
- Re-crawl monthly for static sites
- Set reminders to check for outdated content
Limitations#
- Maximum pages: 500 pages per website source
- File types: HTML pages only (for PDFs or Word docs, use File Sources)
- JavaScript rendering: Limited support for heavily JavaScript-dependent sites
- Authentication: Cannot crawl password-protected pages
- Dynamic content: AJAX-loaded content may not be captured
Workarounds:
- For PDF or Word documents: Use File Sources
- For protected pages: Export to PDF and upload as a file
- For JavaScript-heavy sites: Consider using text sources or file uploads instead
Troubleshooting#
"No pages discovered"#
- Check that the URL is accessible in your browser
- Verify the site does not block crawlers in its `robots.txt`
- Try providing the sitemap URL directly
"Crawl failed"#
- The site may have anti-bot protection
- Try crawling fewer pages at once
- Check if the site requires authentication
"Poor content quality"#
- The extracted content may include too much navigation or boilerplate
- Try more specific URL patterns to target content-rich pages
- Consider using File Sources or Text Sources for better control
"Missing pages"#
- Check your exclusion patterns -- they may be filtering out pages you want
- Increase the maximum pages limit
- Verify the missing pages are linked from other crawled pages (or add them via sitemap)
Next Steps#
- Training Your Agent -- Process and activate your knowledge
- File Sources -- Upload documents directly
- Facebook Import -- Import conversation history
- Instagram Import -- Import Instagram conversations