Website Sources#
Automatically crawl and index entire websites to build your agent's knowledge base. AlonChat's intelligent crawler can discover, extract, and process web content while respecting your preferences.
Overview#
Website sources allow your agent to learn from:
- Marketing websites and landing pages
- Documentation sites
- Blog posts and articles
- Product catalogs
- Help centers and FAQs
The crawler automatically discovers pages, extracts clean text, and creates searchable knowledge chunks.
How Website Crawling Works#
AlonChat uses a two-phase progressive crawling system:
Phase 1: Discovery#
- Starts from your provided URL
- Discovers linked pages automatically
- Respects `robots.txt` rules
- Finds sitemaps (`sitemap.xml`)
- Shows you all discovered URLs for review
Phase 2: Crawl & Index#
- You review and approve discovered URLs
- Crawler visits approved pages
- Extracts text content (removes navigation, ads, etc.)
- Chunks content into searchable pieces
- Generates embeddings for semantic search
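The discovery phase can be sketched with Python's standard library alone. This is an illustrative helper, not AlonChat's actual crawler: given a page's URL, its fetched HTML, and the site's `robots.txt` text, it collects same-site links that the robots rules allow.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags in a fetched page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def discover(page_url, html, robots_txt):
    """Return same-site links in `html` that robots.txt allows us to crawl."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    extractor = LinkExtractor()
    extractor.feed(html)
    site = urlparse(page_url).netloc
    found = []
    for href in extractor.links:
        url = urljoin(page_url, href)       # resolve relative links
        if urlparse(url).netloc != site:    # stay on the same site
            continue
        if not robots.can_fetch("*", url):  # honor robots.txt rules
            continue
        found.append(url)
    return found
```

A real crawler would repeat this for each newly discovered page until the page limit is reached.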
Adding a Website Source#
Step 1: Navigate to Knowledge Base#
- Go to your agent's Settings page
- Click Knowledge Base tab
- Click Add Website
Step 2: Enter Website URL#
Enter the starting URL to crawl:
https://example.com
Tips:
- Start with the homepage for full site coverage
- Use a specific section for targeted crawling (e.g., `/docs` or `/blog`)
- Can also provide a direct `sitemap.xml` URL
Step 3: Configure Crawl Settings#
Maximum Pages#
Limit how many pages to crawl (default: 50, max: 500)
Recommendations:
- Small site (< 20 pages): Set to 50
- Medium site (20-100 pages): Set to 100-200
- Large site (100+ pages): Set to 500 or use pattern filtering
URL Pattern Filters#
Include Patterns - Only crawl URLs matching these patterns:
/docs/* # Only documentation pages
/blog/* # Only blog posts
/products/* # Only product pages
Exclude Patterns - Skip URLs matching these patterns:
/admin/* # Skip admin pages
/cart/* # Skip shopping cart
*.pdf # Skip PDF files
/api/* # Skip API endpoints
Pattern Syntax:
- `*` = Match any characters
- `/docs/*` = All pages under `/docs`
- `*.pdf` = All PDF file URLs
- Use one pattern per line
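Glob patterns like these behave much like Python's standard `fnmatch` matching. The `url_allowed` helper below is a hypothetical sketch of how include/exclude filtering could work, not AlonChat's actual matcher:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse


def url_allowed(url, include=(), exclude=()):
    """Apply include/exclude glob patterns to a URL's path."""
    path = urlparse(url).path
    if any(fnmatch(path, pat) for pat in exclude):
        return False          # exclusions always win
    if include:               # with no include patterns, everything passes
        return any(fnmatch(path, pat) for pat in include)
    return True
```

For example, with `include=["/docs/*"]` and `exclude=["*.pdf"]`, `/docs/setup` passes while `/blog/post` and `/docs/manual.pdf` are skipped.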
Step 4: Review Discovered URLs#
After discovery completes, you'll see a list of all found URLs:
- ✅ Included: URLs matching your patterns (will be crawled)
- ❌ Excluded: URLs filtered by your exclusion patterns
- 🔍 Review: Click to preview page content
Actions:
- Approve All: Crawl all included URLs
- Approve Selected: Choose specific pages to crawl
- Cancel: Start over with different settings
Step 5: Wait for Crawl#
The crawler will now:
- Visit each approved URL
- Extract clean text content
- Create 2KB content chunks
- Generate embeddings
- Index for search
Progress Tracking:
- See real-time crawl progress
- Monitor pages processed
- View any errors encountered
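As an illustration of the chunking step, here is a minimal sketch (not AlonChat's implementation) that splits extracted text into chunks of at most roughly 2 KB, preferring paragraph boundaries so that chunks stay coherent:

```python
def chunk_text(text, max_bytes=2048):
    """Split extracted page text into chunks of at most ~max_bytes,
    breaking on paragraph boundaries where possible."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = (current + "\n\n" + para) if current else para
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para  # a single oversize paragraph becomes its own chunk
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then embedded and indexed individually, so the retrieval step can return just the relevant passage rather than a whole page.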
Sitemap Support#
AlonChat automatically detects and uses sitemaps for faster discovery.
Automatic Detection:
- Checks `/sitemap.xml`
- Checks `/sitemap_index.xml`
- Parses `robots.txt` for sitemap references
Manual Sitemap URL:
https://example.com/sitemap.xml
Benefits:
- ✅ Faster discovery (no need to crawl to find pages)
- ✅ Complete coverage (finds all published pages)
- ✅ Respects priority hints
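A standard sitemap can be parsed with Python's built-in `xml.etree.ElementTree`. This sketch (the function name is illustrative) pulls out page URLs with their priority hints and orders high-priority pages first:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Extract (url, priority) pairs from sitemap XML, highest priority first."""
    root = ET.fromstring(xml_text)
    pages = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        priority = url.findtext("sm:priority", default="0.5", namespaces=NS)
        pages.append((loc, float(priority)))
    return sorted(pages, key=lambda p: -p[1])
```

Per the sitemap protocol, `<priority>` defaults to 0.5 when omitted, which the sketch mirrors.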
Re-Crawling Websites#
Keep your agent's knowledge up-to-date by re-crawling websites:
- Go to Knowledge Base
- Find the website source
- Click ⋮ menu → Re-crawl
What happens during re-crawl:
- Discovers new pages added to the site
- Updates content for existing pages
- Removes content from deleted pages
- Preserves your original filter settings
Zero Data Loss: AlonChat uses the APPEND-THEN-CLEAN pattern - old content stays in place until the new content has been successfully processed.
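AlonChat's internals aren't shown here, but the general APPEND-THEN-CLEAN idea can be sketched with a hypothetical in-memory store: write the new crawl as a fresh generation, switch over only once it fully succeeds, and only then delete the old generation. A failure mid-crawl leaves the previous knowledge intact.

```python
class InMemoryStore:
    """Minimal stand-in for the chunk store (illustrative only)."""

    def __init__(self):
        self.gens = {}     # (source_id, generation) -> {url: chunks}
        self.current = {}  # source_id -> current generation

    def current_generation(self, sid):
        return self.current.get(sid, 0)

    def append_chunks(self, sid, gen, url, chunks):
        self.gens.setdefault((sid, gen), {})[url] = chunks

    def set_current_generation(self, sid, gen):
        self.current[sid] = gen

    def delete_generation(self, sid, gen):
        self.gens.pop((sid, gen), None)


def recrawl(store, source_id, fetch_pages):
    """APPEND-THEN-CLEAN: write the new generation first, clean up after."""
    old_gen = store.current_generation(source_id)
    new_gen = old_gen + 1
    try:
        for url, chunks in fetch_pages():
            store.append_chunks(source_id, new_gen, url, chunks)
        store.set_current_generation(source_id, new_gen)  # switch over
        store.delete_generation(source_id, old_gen)       # clean old data
    except Exception:
        store.delete_generation(source_id, new_gen)       # roll back
        raise
```

The key property: at no point is the old generation deleted before the new one is complete, so a crashed re-crawl costs nothing.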
Pattern Filtering Examples#
Example 1: Documentation Only#
Include:
/docs/*
Exclude:
/blog/*
/pricing
Example 2: Exclude External Links#
Exclude:
http://*
https://external-site.com/*
Example 3: Multiple Sections#
Include:
/docs/*
/guides/*
/tutorials/*
Exclude:
*/drafts/*
*/archive/*
Best Practices#
1. Start Small#
- Test with 10-20 pages first
- Verify content quality
- Then expand to full site
2. Use Pattern Filters#
- Exclude login pages, admin panels
- Skip shopping cart, checkout flows
- Focus on informational content
3. Monitor Crawl Quality#
- Review crawled pages in Knowledge Base
- Check if navigation/ads were properly removed
- Adjust patterns if needed
4. Keep Content Fresh#
- Re-crawl weekly for frequently updated sites
- Re-crawl monthly for static sites
- Set reminders to check for outdated content
5. Respect Rate Limits#
- AlonChat crawls 2 pages per second (default)
- Increase for your own sites
- Keep default for external sites
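A simple client-side rate limiter along these lines (an illustrative sketch, not AlonChat's code) spaces requests so no more than a fixed number are issued per second:

```python
import time


class RateLimiter:
    """Spaces calls so at most `rate` happen per second."""

    def __init__(self, rate=2.0):
        self.interval = 1.0 / rate
        self.next_allowed = 0.0

    def wait(self):
        """Block until the next request slot is available."""
        now = time.monotonic()
        if now < self.next_allowed:
            time.sleep(self.next_allowed - now)
        self.next_allowed = max(now, self.next_allowed) + self.interval
```

Calling `limiter.wait()` before each page fetch enforces the 2-pages-per-second default; raising `rate` speeds up crawls of sites you own.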
Limitations#
- Maximum pages: 500 pages per website source
- File types: HTML only (PDF/Word require file upload)
- JavaScript rendering: Limited support for heavy JS sites
- Authentication: Cannot crawl password-protected pages
- Dynamic content: AJAX-loaded content may be missed
Workarounds:
- For PDF/Word docs: Use File Sources instead
- For protected pages: Export to PDF and upload
- For JS-heavy sites: Use RSS feeds or API integration
Troubleshooting#
"No pages discovered"#
- Check URL is accessible
- Verify the site doesn't block crawlers (`robots.txt`)
- Try the sitemap URL directly
"Crawl failed"#
- Site may have anti-bot protection
- Try crawling fewer pages at once
- Check if site requires authentication
"Poor content quality"#
- Content may have too much navigation/ads
- Try more specific URL patterns
- Consider using File Sources for better control
"Missing pages"#
- Check your exclusion patterns
- Increase maximum pages limit
- Verify pages are linked from crawled pages
Next Steps#
- Training Your Agent - Process and activate knowledge
- Testing Your Agent - Verify website content retrieval
- Facebook Sources - Import conversation history