Training Your Agent#

After adding knowledge sources (files, websites, Q&A, etc.), you need to train your agent to process and index the content. Training converts raw content into searchable, AI-ready knowledge.

What Happens During Training#

Training involves multiple automated steps:

1. Content Extraction#

  • Files: Extract text from PDFs, Word docs, Excel, etc.
  • Websites: Download and clean HTML pages
  • Facebook: Unzip and parse message history
  • Q&A: Format question-answer pairs
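As an illustration of the "download and clean HTML pages" step, here is a minimal sketch that strips markup and keeps only visible text. It uses Python's standard `html.parser` module; the class and function names (`TextExtractor`, `clean_html`) are made up for this example, and AlonChat's actual extractor is not documented here and may work differently.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return " ".join(self._parts)


def clean_html(html: str) -> str:
    """Return the visible text of an HTML page as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Pricing</h1><p>Basic plan: 500 pesos</p>"
    "<script>track()</script></body></html>"
)
print(clean_html(page))  # Pricing Basic plan: 500 pesos
```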

2. Chunking#

Content is split into small, searchable pieces (chunks):

  • Default size: 2,000 characters per chunk
  • Overlap: 200 characters between chunks (preserves context)
  • Smart splitting: Respects paragraphs and sentences

Example:

Source: 10-page PDF (25,000 characters)
Result: ~14-15 chunks of up to 2,000 chars each
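To make the size and overlap settings concrete, the sketch below splits text into fixed 2,000-character windows that share 200 characters with their neighbor. It deliberately leaves out the "smart splitting" on paragraphs and sentences, so it is a simplification rather than the real chunker; the function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows; neighbors share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


document = "x" * 25_000  # stands in for the 10-page PDF
chunks = chunk_text(document)
print(len(chunks))  # 14
```

Because each window advances by 1,800 characters (2,000 minus the 200-character overlap), a 25,000-character source yields 14 windows here, close to the estimate above. Splitting on paragraph or sentence boundaries usually makes chunks a little shorter and the count a little higher.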

3. Namespace Assignment#

Chunks are organized into 4 namespaces based on content type:

| Namespace | Contains | Example |
|---|---|---|
| persona | AI communication style | "Mix English-Tagalog, use 😊" |
| examples | Q&A pairs | "Q: Price? A: ₱500" |
| docs | Files, websites, text | Product descriptions, policies |
| conversation | Chat history | Past conversations |

Each namespace has a configurable priority that determines how likely its content is to be retrieved. The system automatically prioritizes the right namespace based on the question type.
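Namespace assignment can be pictured as a simple lookup from source type to namespace. The sketch below is an assumption-based illustration: the source type keys (`"qa"`, `"file"`, `"facebook"`, and so on) are hypothetical names, and mapping Facebook imports to `conversation` is inferred from the "chat history" description rather than confirmed.

```python
# Hypothetical source type names mapped to the four namespaces described above.
NAMESPACE_BY_SOURCE_TYPE = {
    "persona": "persona",        # AI communication style
    "qa": "examples",            # Q&A pairs
    "file": "docs",              # uploaded files
    "website": "docs",           # crawled pages
    "text": "docs",              # pasted text
    "facebook": "conversation",  # assumed: imported message history
}


def assign_namespace(source_type: str) -> str:
    """Return the namespace for a source type, or raise for unknown types."""
    try:
        return NAMESPACE_BY_SOURCE_TYPE[source_type]
    except KeyError:
        raise ValueError(f"unknown source type: {source_type!r}") from None


print(assign_namespace("qa"))       # examples
print(assign_namespace("website"))  # docs
```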

4. Embedding Generation#

Each chunk is converted to a vector embedding using AI:

  • Purpose: Enables semantic search ("How much?" matches "price")
  • Embeddings capture the meaning of text, not just keywords
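Semantic search typically compares embeddings with a similarity measure such as cosine similarity. The sketch below uses tiny, hand-written 3-dimensional vectors purely for illustration; real embeddings come from a model and have hundreds or thousands of dimensions, and the similarity measure AlonChat uses is an assumption here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors invented for this example, not produced by a real model.
toy_embeddings = {
    "The price is 500 pesos": [0.9, 0.1, 0.0],
    "We ship nationwide": [0.1, 0.9, 0.1],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for the embedding of "How much?"

best_match = max(
    toy_embeddings,
    key=lambda text: cosine_similarity(query_vector, toy_embeddings[text]),
)
print(best_match)  # The price is 500 pesos
```

The query shares no keywords with the winning chunk; it matches only because the two vectors point in nearly the same direction.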

5. Indexing#

Vectors are stored in the database with metadata:

  • Source ID, position, namespace
  • Priority score, recency timestamp
  • Original text content
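One way to picture a stored entry is as a record that carries the vector together with the metadata listed above. The field names in this sketch (`source_id`, `indexed_at`, and so on) are hypothetical; the actual database schema is not described in this guide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ChunkRecord:
    """Illustrative shape of one indexed chunk (field names are assumptions)."""

    source_id: str          # which knowledge source the chunk came from
    position: int           # index of the chunk within its source
    namespace: str          # persona, examples, docs, or conversation
    priority: float         # priority score used during retrieval
    text: str               # original text content
    embedding: list[float]  # vector produced in the embedding step
    indexed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)  # recency timestamp
    )


record = ChunkRecord(
    source_id="pricing-pdf",
    position=0,
    namespace="docs",
    priority=1.0,
    text="Basic plan: 500 pesos per month",
    embedding=[0.9, 0.1, 0.0],
)
print(record.namespace, record.position)  # docs 0
```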

How to Train Your Agent#

Step 1: Add Knowledge Sources#

Before training, add at least one source:

  • Upload files (PDF, Word, etc.)
  • Add website to crawl
  • Import Facebook data
  • Create Q&A pairs
  • Paste text directly

Tip: Add multiple sources at once before training (more efficient).

Step 2: Click "Train Agent"#

  1. Go to your agent's Settings page
  2. Click the Knowledge Base tab
  3. Click the Train Agent button (top-right)
  4. Confirm that you want to start training

Step 3: Monitor Progress#

Training runs in the background:

  • Real-time progress bar - Shows chunks processed
  • Step indicators - Current stage (extraction → chunking → embedding)
  • Time estimate - Remaining time (approx.)

Processing time:

  • 10 pages: ~1-2 minutes
  • 100 pages: ~5-10 minutes
  • 1,000 pages: ~30-60 minutes

Step 4: Training Complete#

When done, you'll see:

  • ✅ "Training Complete" status
  • Total chunks: Number of searchable pieces created
  • Sources trained: Count of successfully processed sources

What's next: Your agent is now ready to chat!

When to Re-Train#

Re-train your agent when:

1. Adding New Sources#

  • Uploaded new files
  • Added website pages
  • Imported updated Facebook data
  • Created new Q&A pairs

2. Editing Existing Sources#

  • Modified Q&A answers
  • Updated text sources
  • Re-crawled website (new content)

3. Deleting Sources#

  • Removed outdated files
  • Deleted old Q&A pairs
  • Archived unused sources

4. Changing Configuration#

  • Modified chunking settings
  • Adjusted retrieval weights
  • Changed priority scores

Note: Training is incremental - only new/changed sources are reprocessed.
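A common way to implement incremental processing is to store a fingerprint of each source at training time and compare it on the next run. This guide does not say how AlonChat detects changes, so treat the sketch below as one plausible approach; the function names are hypothetical.

```python
import hashlib


def content_fingerprint(content: str) -> str:
    """SHA-256 hex digest of the source content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def sources_to_reprocess(
    current_sources: dict[str, str],
    trained_fingerprints: dict[str, str],
) -> list[str]:
    """Return IDs of sources that are new or whose content has changed."""
    return [
        source_id
        for source_id, content in current_sources.items()
        if trained_fingerprints.get(source_id) != content_fingerprint(content)
    ]


# Fingerprints saved during the previous training run.
trained = {
    "faq": content_fingerprint("Q: Price? A: 500 pesos"),
    "policy": content_fingerprint("Returns accepted within 7 days"),
}
# Current state: "faq" was edited, "policy" is unchanged, "shipping" is new.
current = {
    "faq": "Q: Price? A: 600 pesos",
    "policy": "Returns accepted within 7 days",
    "shipping": "We ship nationwide",
}
print(sources_to_reprocess(current, trained))  # ['faq', 'shipping']
```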

Understanding Training Status#

Sources have different status indicators:

| Status | Meaning | Action Needed |
|---|---|---|
| 🟢 Trained | Successfully processed and indexed | None - ready to use |
| 🟡 Pending | Waiting for training | Click "Train Agent" |
| 🔄 Processing | Currently being trained | Wait for completion |
| 🔴 Failed | Training encountered error | Check error message, retry |
| ⏸️ Paused | Training paused by user | Resume or cancel |
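If you track source status in your own scripts or reports, the statuses above map naturally onto an enumeration. The sketch below is illustrative only; the identifiers are hypothetical and do not describe an AlonChat API.

```python
from enum import Enum


class SourceStatus(Enum):
    TRAINED = "trained"
    PENDING = "pending"
    PROCESSING = "processing"
    FAILED = "failed"
    PAUSED = "paused"


# Suggested action for each status, taken from the status list above.
ACTION_NEEDED = {
    SourceStatus.TRAINED: "None - ready to use",
    SourceStatus.PENDING: 'Click "Train Agent"',
    SourceStatus.PROCESSING: "Wait for completion",
    SourceStatus.FAILED: "Check error message, retry",
    SourceStatus.PAUSED: "Resume or cancel",
}


def needs_user_action(status: SourceStatus) -> bool:
    """True when the user has to do something before the source is usable."""
    return status in {SourceStatus.PENDING, SourceStatus.FAILED, SourceStatus.PAUSED}


for status in SourceStatus:
    print(f"{status.value}: {ACTION_NEEDED[status]}")
```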

Training Options (Advanced)#

Chunking Settings#

Control how content is split (Admin only):

Text Chunk Size (default: 2,000 chars)

  • Smaller (1,000): More precise, slower retrieval
  • Larger (4,000): More context, less precise

Q&A Chunk Size (default: 10,000 chars)

  • Q&A pairs are larger to preserve full answers

Website Chunk Size (default: 2,000 chars)

  • Optimized for web content structure

Chunk Overlap (default: 200 chars)

  • Preserves context across chunk boundaries
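The defaults above can be summarized as a small settings object. This is a sketch for reasoning about the values, not the admin settings format; the class name, field names, and source type keys are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkingSettings:
    """Chunking defaults from this guide (names are illustrative)."""

    text_chunk_size: int = 2_000
    qa_chunk_size: int = 10_000
    website_chunk_size: int = 2_000
    chunk_overlap: int = 200

    def __post_init__(self):
        sizes = (self.text_chunk_size, self.qa_chunk_size, self.website_chunk_size)
        if min(sizes) <= 0:
            raise ValueError("chunk sizes must be positive")
        # An overlap as large as a chunk would stop the window from advancing.
        if not 0 <= self.chunk_overlap < min(sizes):
            raise ValueError("chunk_overlap must be smaller than every chunk size")

    def size_for(self, source_type: str) -> int:
        """Return the chunk size for a hypothetical source type key."""
        sizes = {
            "text": self.text_chunk_size,
            "qa": self.qa_chunk_size,
            "website": self.website_chunk_size,
        }
        if source_type not in sizes:
            raise ValueError(f"unknown source type: {source_type!r}")
        return sizes[source_type]


defaults = ChunkingSettings()
larger_text = ChunkingSettings(text_chunk_size=4_000)
print(defaults.size_for("qa"))       # 10000
print(larger_text.size_for("text"))  # 4000
```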

Embedding Model#

The embedding model controls the quality and speed of semantic search. AlonChat uses an optimized default, but advanced users can adjust this in the admin settings.

Namespace Priority#

Adjust retrieval priorities for each namespace (Admin only). Higher priority means content from that namespace is more likely to be retrieved during conversations. The persona namespace is always included by default.
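One simple way to model namespace priority is to multiply each chunk's similarity score by its namespace weight before ranking, while always keeping persona chunks. The actual scoring formula is not documented here, so the multiplication, the weights, and the `rank_chunks` name below are all assumptions.

```python
def rank_chunks(candidates: list[dict], namespace_priority: dict[str, float],
                top_k: int = 3) -> list[dict]:
    """Always keep persona chunks; rank the rest by similarity x priority."""
    persona = [c for c in candidates if c["namespace"] == "persona"]
    others = [c for c in candidates if c["namespace"] != "persona"]
    others.sort(
        key=lambda c: c["similarity"] * namespace_priority.get(c["namespace"], 1.0),
        reverse=True,
    )
    return persona + others[:top_k]


# Example weights and similarity scores, invented for illustration.
priorities = {"examples": 1.5, "docs": 1.0, "conversation": 0.5}
candidates = [
    {"text": "Shipping policy", "namespace": "docs", "similarity": 0.80},
    {"text": "Q: Price? A: 500 pesos", "namespace": "examples", "similarity": 0.60},
    {"text": "Old chat about delivery", "namespace": "conversation", "similarity": 0.90},
    {"text": "Friendly, mixes English and Tagalog", "namespace": "persona", "similarity": 0.10},
]

for chunk in rank_chunks(candidates, priorities, top_k=2):
    print(chunk["namespace"], "-", chunk["text"])
```

With these example weights, the Q&A chunk outranks the docs chunk (0.60 × 1.5 = 0.90 versus 0.80 × 1.0 = 0.80), and the conversation chunk drops out despite having the highest raw similarity (0.90 × 0.5 = 0.45).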

Training Performance#

Optimization Tips#

1. Batch Training

  • Add multiple sources at once
  • Train once instead of individually
  • Saves time: 10 sources × 1 min = 10 min vs 1 batch = 3 min

2. Use Smaller Files

  • Split 100-page PDFs into 10-page sections
  • Faster processing, better organization
  • Easier to update specific sections

3. Pre-Clean Content

  • Remove unnecessary pages (covers, indexes)
  • Delete duplicate content
  • Fix formatting issues before upload

4. Schedule Off-Peak Training

  • Large imports (1,000+ pages) during low traffic
  • Avoid peak hours for faster processing

Monitoring System Load#

Training uses background workers:

  • Concurrency: 2-3 sources processed simultaneously
  • Queue: Remaining sources wait in line
  • Priority: Manual training > automatic re-crawls

Check worker status: Admin dashboard shows active jobs
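The queueing behavior described above (limited concurrency, manual training ahead of automatic re-crawls) can be sketched with a priority queue. This is a simplified model built on Python's `heapq`, not the actual worker implementation; the class name and priority constants are hypothetical.

```python
import heapq
import itertools

# Lower number is served first: manual training outranks automatic re-crawls.
MANUAL = 0
AUTO_RECRAWL = 1


class TrainingQueue:
    """Hands out up to `concurrency` jobs at a time, highest priority first."""

    def __init__(self, concurrency: int = 2):
        self.concurrency = concurrency
        self._heap = []
        self._counter = itertools.count()  # keeps first-in, first-out order per priority

    def submit(self, source_id: str, kind: int) -> None:
        heapq.heappush(self._heap, (kind, next(self._counter), source_id))

    def next_batch(self) -> list[str]:
        """Pop the jobs the workers would pick up next."""
        batch = []
        while self._heap and len(batch) < self.concurrency:
            _, _, source_id = heapq.heappop(self._heap)
            batch.append(source_id)
        return batch


queue = TrainingQueue(concurrency=2)
queue.submit("crawl-1", AUTO_RECRAWL)
queue.submit("file-1", MANUAL)
queue.submit("file-2", MANUAL)
queue.submit("crawl-2", AUTO_RECRAWL)

first_batch = queue.next_batch()
second_batch = queue.next_batch()
print(first_batch)   # ['file-1', 'file-2']
print(second_batch)  # ['crawl-1', 'crawl-2']
```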

Troubleshooting Training Issues#

"Training failed"#

Common causes:

  1. File corrupted - Re-upload file
  2. Website unreachable - Check URL, try again
  3. Embedding API error - API key issue, contact support
  4. Timeout - File too large, split into smaller files

Solution: Click "Retry Training" on failed source

"Training stuck at 50%"#

Possible issues:

  • Large file processing (patience needed)
  • Worker overload (try again later)
  • Network issue (check internet connection)

Fix: Wait 10 minutes, refresh page, check status

"Chunks not appearing in chat"#

Checklist:

  • ✅ Training completed successfully
  • ✅ Source status shows "Trained"
  • ✅ Content matches query (test with exact phrases)
  • ✅ Namespace enabled in agent config

Debug: Test chat with exact text from source

"Too many chunks created"#

Symptoms:

  • Source created 1,000+ chunks
  • Retrieval is slow
  • Irrelevant content returned

Solutions:

  1. Increase chunk size (2,000 → 4,000 chars)
  2. Remove duplicate content
  3. Split source into smaller files
  4. Use pattern filtering for websites
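To see how much a larger chunk size helps, you can estimate the chunk count from the source length. The sketch below assumes fixed-size windows with a 200-character overlap and ignores paragraph-aware splitting, so real counts will differ somewhat; `estimate_chunk_count` is a hypothetical helper.

```python
def estimate_chunk_count(total_chars: int, chunk_size: int, overlap: int = 200) -> int:
    """Estimate the number of fixed-size chunks for a source of `total_chars`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - overlap           # characters gained per extra chunk
    remaining = total_chars - chunk_size  # text not covered by the first chunk
    return 1 + -(-remaining // step)      # ceiling division


source_length = 2_000_000  # example: a very large source
for size in (2_000, 4_000):
    print(size, estimate_chunk_count(source_length, size))
# 2000 1111
# 4000 527
```

In this example, doubling the chunk size cuts the count by a little more than half, because the fixed 200-character overlap takes up a smaller share of each larger chunk.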

"Training uses too many credits"#

Reduce costs:

  1. Remove duplicate sources
  2. Delete unused training runs
  3. Pre-process files to remove unnecessary content
  4. Split large files into focused sections

Training Best Practices#

1. Quality Over Quantity#

  • 10 high-quality sources > 100 low-quality sources
  • Focus on relevant, accurate content
  • Remove outdated information regularly

2. Organize by Topic#

  • Group related sources (e.g., "Pricing", "Shipping")
  • Use consistent naming
  • Makes debugging easier

3. Test After Training#

  • Ask questions to verify retrieval
  • Check if answers are accurate
  • Adjust sources if needed

4. Keep Sources Updated#

  • Re-train monthly for static content
  • Re-train weekly for frequently changing content
  • Set reminders for website re-crawls

5. Monitor Training Logs#

  • Check for failed sources
  • Review warning messages
  • Fix issues promptly
