Training Your Agent#
After adding knowledge sources (files, websites, Q&A, etc.), you need to train your agent to process and index the content. Training converts raw content into searchable, AI-ready knowledge.
What Happens During Training#
Training involves multiple automated steps:
1. Content Extraction#
- Files: Extract text from PDFs, Word docs, Excel, etc.
- Websites: Download and clean HTML pages
- Facebook: Unzip and parse message history
- Q&A: Format question-answer pairs
2. Chunking#
Content is split into small, searchable pieces (chunks):
- Default size: 2,000 characters per chunk
- Overlap: 200 characters between chunks (preserves context)
- Smart splitting: Respects paragraphs and sentences
Example:
Source: 10-page PDF (25,000 characters)
Result: ~14 chunks of ~2,000 chars each (the 200-char overlap adds roughly one extra chunk beyond 25,000 ÷ 2,000)
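The split-with-overlap behavior can be sketched in a few lines of Python. This is a simplified illustration, not AlonChat's actual splitter: the defaults mirror the values listed above, but a real splitter also respects paragraph and sentence boundaries.

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping fixed-size chunks (simplified sketch)."""
    chunks = []
    step = chunk_size - overlap  # each chunk starts 1,800 chars after the last
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 25_000      # roughly a 10-page PDF worth of text
chunks = chunk_text(doc)
print(len(chunks))       # 14 overlapping chunks
```

Each chunk shares its first 200 characters with the tail of the previous chunk, which is what preserves context across the split points.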
3. Namespace Assignment#
Chunks are organized into 4 namespaces based on content type:
| Namespace | Contains | Example |
|---|---|---|
| persona | AI communication style | "Mix English-Tagalog, use emojis" |
| examples | Q&A pairs | "Q: Price? A: ₱500" |
| docs | Files, websites, text | Product descriptions, policies |
| conversation | Chat history | Past conversations |
Each namespace has a configurable priority that determines how likely its content is to be retrieved. The system automatically prioritizes the right namespace based on the question type.
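The routing step can be pictured as a simple lookup from source type to namespace. The mapping keys below are illustrative assumptions; AlonChat's internal source-type identifiers may differ.

```python
# Hypothetical source-type -> namespace routing (for illustration only).
NAMESPACE_BY_SOURCE = {
    "persona": "persona",       # AI communication style
    "qa": "examples",           # Q&A pairs
    "file": "docs",             # uploaded files
    "website": "docs",          # crawled pages
    "text": "docs",             # pasted text
    "facebook": "conversation", # imported chat history
}

def assign_namespace(source_type: str) -> str:
    # Fall back to "docs" for unrecognized source types.
    return NAMESPACE_BY_SOURCE.get(source_type, "docs")

print(assign_namespace("qa"))       # examples
print(assign_namespace("website"))  # docs
```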
4. Embedding Generation#
Each chunk is converted to a vector embedding using AI:
- Purpose: Enables semantic search ("How much?" matches "price")
- Embeddings capture the meaning of text, not just keywords
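The idea behind semantic matching can be shown with cosine similarity over toy vectors. Real embeddings come from an AI model and have hundreds or thousands of dimensions; the hand-made 3-D vectors below simply illustrate that similar meanings land close together while keyword overlap is irrelevant.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-D "embeddings": nearby meanings get nearby vectors.
query = [0.9, 0.1, 0.0]  # "How much?"
price = [0.8, 0.2, 0.1]  # "Our price is ₱500"
hours = [0.1, 0.9, 0.2]  # "Open 9am-6pm daily"

# "How much?" shares no keywords with the price chunk, yet it matches.
print(cosine_similarity(query, price) > cosine_similarity(query, hours))  # True
```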
5. Indexing#
Vectors are stored in the database with metadata:
- Source ID, position, namespace
- Priority score, recency timestamp
- Original text content
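A stored chunk record might look like the following sketch. The field names are illustrative assumptions based on the metadata listed above, not AlonChat's actual schema.

```python
from dataclasses import dataclass, field
import time

@dataclass
class ChunkRecord:
    """Illustrative shape of an indexed chunk (field names are assumptions)."""
    source_id: str                 # which knowledge source produced it
    position: int                  # chunk index within that source
    namespace: str                 # persona / examples / docs / conversation
    priority: float                # retrieval priority score
    indexed_at: float              # recency timestamp
    text: str                      # original chunk text
    embedding: list[float] = field(default_factory=list)  # vector from step 4

record = ChunkRecord("src-123", 0, "docs", 1.0, time.time(),
                     "Shipping takes 3-5 business days.")
print(record.namespace)  # docs
```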
How to Train Your Agent#
Step 1: Add Knowledge Sources#
Before training, add at least one source:
- Upload files (PDF, Word, etc.)
- Add website to crawl
- Import Facebook data
- Create Q&A pairs
- Paste text directly
Tip: Add multiple sources at once before training (more efficient).
Step 2: Click "Train Agent"#
- Go to your agent's Settings page
- Click Knowledge Base tab
- Click Train Agent button (top-right)
- Confirm training start
Step 3: Monitor Progress#
Training runs in the background:
- Real-time progress bar - Shows chunks processed
- Step indicators - Current stage (extraction → chunking → embedding)
- Time estimate - Remaining time (approx.)
Processing time:
- 10 pages: ~1-2 minutes
- 100 pages: ~5-10 minutes
- 1,000 pages: ~30-60 minutes
Step 4: Training Complete#
When done, you'll see:
- ✅ "Training Complete" status
- Total chunks: Number of searchable pieces created
- Sources trained: Count of successfully processed sources
What's next: Your agent is now ready to chat!
When to Re-Train#
Re-train your agent when:
1. Adding New Sources#
- Uploaded new files
- Added website pages
- Imported updated Facebook data
- Created new Q&A pairs
2. Editing Existing Sources#
- Modified Q&A answers
- Updated text sources
- Re-crawled website (new content)
3. Deleting Sources#
- Removed outdated files
- Deleted old Q&A pairs
- Archived unused sources
4. Changing Configuration#
- Modified chunking settings
- Adjusted retrieval weights
- Changed priority scores
Note: Training is incremental - only new/changed sources are reprocessed.
Understanding Training Status#
Sources have different status indicators:
| Status | Meaning | Action Needed |
|---|---|---|
| 🟢 Trained | Successfully processed and indexed | None - ready to use |
| 🟡 Pending | Waiting for training | Click "Train Agent" |
| 🔄 Processing | Currently being trained | Wait for completion |
| 🔴 Failed | Training encountered error | Check error message, retry |
| ⏸️ Paused | Training paused by user | Resume or cancel |
Training Options (Advanced)#
Chunking Settings#
Control how content is split (Admin only):
Text Chunk Size (default: 2,000 chars)
- Smaller (1,000): More precise, slower retrieval
- Larger (4,000): More context, less precise
Q&A Chunk Size (default: 10,000 chars)
- Q&A pairs are larger to preserve full answers
Website Chunk Size (default: 2,000 chars)
- Optimized for web content structure
Chunk Overlap (default: 200 chars)
- Preserves context across chunk boundaries
Embedding Model#
The embedding model controls the quality and speed of semantic search. AlonChat uses an optimized default, but advanced users can adjust this in the admin settings.
Namespace Priority#
Adjust retrieval priorities for each namespace (Admin only). Higher priority means content from that namespace is more likely to be retrieved during conversations. The persona namespace is always included by default.
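One way to picture priority weighting: each candidate chunk's similarity score is scaled by its namespace's weight, so equally relevant chunks from higher-priority namespaces rank first. The weights and formula below are illustrative assumptions, not AlonChat's exact algorithm.

```python
# Hypothetical per-namespace weights; higher = retrieved more often.
NAMESPACE_WEIGHTS = {
    "persona": 1.0,       # always included by default
    "examples": 0.9,
    "docs": 0.7,
    "conversation": 0.5,
}

def weighted_score(similarity: float, namespace: str) -> float:
    """Scale a raw similarity score by the namespace's retrieval weight."""
    return similarity * NAMESPACE_WEIGHTS.get(namespace, 0.5)

# Two chunks with equal raw similarity: the "examples" chunk ranks higher.
print(round(weighted_score(0.8, "examples"), 2))  # 0.72
print(round(weighted_score(0.8, "docs"), 2))      # 0.56
```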
Training Performance#
Optimization Tips#
1. Batch Training
- Add multiple sources at once
- Train once instead of individually
- Saves time: 10 sources × 1 min = 10 min vs. one batch ≈ 3 min
2. Use Smaller Files
- Split 100-page PDFs into 10-page sections
- Faster processing, better organization
- Easier to update specific sections
3. Pre-Clean Content
- Remove unnecessary pages (covers, indexes)
- Delete duplicate content
- Fix formatting issues before upload
4. Schedule Off-Peak Training
- Large imports (1,000+ pages) during low traffic
- Avoid peak hours for faster processing
Monitoring System Load#
Training uses background workers:
- Concurrency: 2-3 sources processed simultaneously
- Queue: Remaining sources wait in line
- Priority: Manual training > automatic re-crawls
Check worker status: Admin dashboard shows active jobs
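The worker model can be sketched with a bounded thread pool: a fixed number of workers process sources concurrently while the rest wait in the queue. This illustrates the concurrency pattern only, not the platform's actual job runner.

```python
from concurrent.futures import ThreadPoolExecutor

def train_source(name: str) -> str:
    # Placeholder for the real pipeline: extract -> chunk -> embed -> index.
    return f"{name}: trained"

sources = ["pricing.pdf", "faq.txt", "site-crawl", "policies.docx"]

# max_workers=3 caps concurrency; remaining sources queue until a worker frees up.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(train_source, sources))

print(results)
```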
Troubleshooting Training Issues#
"Training failed"#
Common causes:
- File corrupted - Re-upload file
- Website unreachable - Check URL, try again
- Embedding API error - API key issue, contact support
- Timeout - File too large, split into smaller files
Solution: Click "Retry Training" on failed source
"Training stuck at 50%"#
Possible issues:
- Large file processing (patience needed)
- Worker overload (try again later)
- Network issue (check internet connection)
Fix: Wait 10 minutes, refresh page, check status
"Chunks not appearing in chat"#
Checklist:
- ✅ Training completed successfully
- ✅ Source status shows "Trained"
- ✅ Content matches query (test with exact phrases)
- ✅ Namespace enabled in agent config
Debug: Test chat with exact text from source
"Too many chunks created"#
Symptoms:
- Source created 1,000+ chunks
- Retrieval is slow
- Irrelevant content returned
Solutions:
- Increase chunk size (2,000 → 4,000 chars)
- Remove duplicate content
- Split source into smaller files
- Use pattern filtering for websites
"Training uses too many credits"#
Reduce costs:
- Remove duplicate sources
- Delete unused training runs
- Pre-process files to remove unnecessary content
- Split large files into focused sections
Training Best Practices#
1. Quality Over Quantity#
- 10 high-quality sources > 100 low-quality sources
- Focus on relevant, accurate content
- Remove outdated information regularly
2. Organize by Topic#
- Group related sources (e.g., "Pricing", "Shipping")
- Use consistent naming
- Makes debugging easier
3. Test After Training#
- Ask questions to verify retrieval
- Check if answers are accurate
- Adjust sources if needed
4. Keep Sources Updated#
- Re-train monthly for static content
- Re-train weekly for frequently changing content
- Set reminders for website re-crawls
5. Monitor Training Logs#
- Check for failed sources
- Review warning messages
- Fix issues promptly
Next Steps#
- Testing Your Agent - Verify training results
- Embedding & Deploying - Add agent to website
- Understanding Agents - Learn about RAG retrieval