File Sources#
Upload documents (PDF, Word, Excel, etc.) to your agent's knowledge base. AlonChat extracts text, chunks it, and generates embeddings for intelligent retrieval.
Supported File Formats#
| Format | Extensions | Best For | Notes |
|---|---|---|---|
| PDF | .pdf | Documents, manuals, reports | Text only (images ignored) |
| Microsoft Word | .docx, .doc | Documents, policies | Formatting preserved as text |
| Microsoft Excel | .xlsx, .xls, .csv | Price lists, data tables | Tables converted to text |
| Text | .txt | Plain text documents | Fastest processing |
| Markdown | .md | Technical docs, README files | Formatting preserved |
| JSON | .json | API docs, structured data | Parsed and formatted |
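Before uploading, a client can check a file name against the extensions listed on this page. The sketch below is a minimal illustration, not AlonChat's own validation: `SUPPORTED_FORMATS` and `detectFormat` are hypothetical names, and the real service may also inspect file contents.

```javascript
// Extension-to-format map built from the formats listed on this page.
const SUPPORTED_FORMATS = new Map([
  ['pdf', 'PDF'],
  ['docx', 'Microsoft Word'],
  ['doc', 'Microsoft Word'],
  ['xlsx', 'Microsoft Excel'],
  ['xls', 'Microsoft Excel'],
  ['csv', 'Microsoft Excel'],
  ['txt', 'Text'],
  ['md', 'Markdown'],
  ['json', 'JSON'],
])

// Hypothetical helper: returns the format name, or null when unsupported.
function detectFormat(filename) {
  const dot = filename.lastIndexOf('.')
  // No extension, a leading dot only, or a trailing dot: nothing to match.
  if (dot <= 0 || dot === filename.length - 1) return null
  const ext = filename.slice(dot + 1).toLowerCase()
  return SUPPORTED_FORMATS.get(ext) ?? null
}
```

For example, `detectFormat('Manual.PDF')` returns `'PDF'`, while `detectFormat('photo.png')` returns `null`.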
How It Works#
1. Upload File → Supabase Storage (agent-sources bucket)
↓
2. Queue Job → source-processor worker picks it up
↓
3. Download & Extract Text → FileProcessor.processFile()
↓
4. Chunk Content → ChunkManager.storeChunksWithNamespace()
- Default chunk size: 2000 characters (roughly 300-400 English words)
- Namespace: "docs"
↓
5. Generate Embeddings → embedding-generation queue
- Creates vector embeddings for semantic search
↓
6. Status: Ready → Agent can now use this knowledge
Processing Time:
- Small files (under 1MB): 30-60 seconds
- Medium files (1-10MB): 1-3 minutes
- Large files (10-50MB): 5-10 minutes
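If you want to show users an expected wait, the ranges above can be turned into a small lookup. This is a sketch that assumes 1MB means 1024 × 1024 bytes; `estimateProcessingTime` is a hypothetical helper, and actual times depend on queue load and file complexity.

```javascript
const MB = 1024 * 1024 // assumption: binary megabytes

// Returns the typical processing window (in seconds) quoted on this page,
// or null when the size is above the documented 50MB range.
function estimateProcessingTime(sizeBytes) {
  if (sizeBytes < 1 * MB) return { minSeconds: 30, maxSeconds: 60 }
  if (sizeBytes <= 10 * MB) return { minSeconds: 60, maxSeconds: 180 }
  if (sizeBytes <= 50 * MB) return { minSeconds: 300, maxSeconds: 600 }
  return null
}
```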
Step-by-Step: Adding a File Source#
1. Navigate to Knowledge Base#
- Go to your agent's dashboard
- Click Knowledge Base tab
- Click Add Source → File
2. Upload Your File#
Option A: Drag and Drop
- Drag your file into the upload area
- File name appears when upload completes
Option B: Click to Browse
- Click Choose File
- Select file from your computer
- Click Open
3. Configure Source Settings#
Source Name (required)
- Descriptive name for this source
- Examples: "Product Manual 2024", "Pricing Sheet", "Policy Document"
- This helps you identify sources in the dashboard
Priority (optional)
- Normal (default): Standard retrieval weight
- High: More likely to be retrieved (use for important info)
- Low: Lower retrieval priority (background info only)
Mark as Price Information (checkbox)
- Check this if the file contains pricing information
- Automatically sets priority to 1.0 (highest)
- Pricing questions will prioritize this source
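The way these settings combine can be sketched as follows. Only two facts come from this page: the name is required, and price sources get priority 1.0. The numeric weights for Low, Normal, and High are illustrative assumptions, and `buildSourceSettings` is a hypothetical helper.

```javascript
// ASSUMPTION: these weights are illustrative only. The page states just
// that price sources are set to 1.0, the highest priority.
const PRIORITY_WEIGHTS = { low: 0.2, normal: 0.5, high: 0.8 }

// Hypothetical helper that validates and normalizes source settings.
function buildSourceSettings({ name, priority = 'normal', isPrice = false }) {
  const trimmed = String(name ?? '').trim()
  if (!trimmed) throw new Error('Source name is required')
  if (!Object.keys(PRIORITY_WEIGHTS).includes(priority)) {
    throw new Error(`Unknown priority: ${priority}`)
  }
  return {
    name: trimmed,
    isPrice,
    // Marking a source as price information overrides the chosen priority.
    priority: isPrice ? 1.0 : PRIORITY_WEIGHTS[priority],
  }
}
```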
4. Upload and Train#
- Click Upload File
- Wait for upload to complete (progress bar)
- Source appears in "Processing" status
- Click Train Agent to process the file
- Wait for status to change to "Ready"
Your agent can now answer questions using this file's content!
Best Practices#
File Preparation#
✅ Do:
- Use clear, descriptive filenames
- Keep files under 10MB when possible (faster processing)
- Use searchable PDFs (not scanned images)
- Organize content with headings and sections
- Remove unnecessary pages (covers, blank pages)
❌ Don't:
- Upload scanned images (text can't be extracted)
- Include password-protected files
- Upload corrupted files
- Use special characters in filenames
- Exceed your plan's file size limit (10MB on Free, 50MB on Pro, 100MB on Enterprise)
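One way to avoid special characters is to normalize file names before upload. The sketch below assumes that "special" means anything other than ASCII letters, digits, hyphens, and underscores; this page does not define the exact set, and `sanitizeFilename` is a hypothetical helper.

```javascript
// Hypothetical helper: produces an ASCII-safe file name, keeping the extension.
function sanitizeFilename(filename) {
  const dot = filename.lastIndexOf('.')
  const base = dot > 0 ? filename.slice(0, dot) : filename
  const ext = dot > 0 ? filename.slice(dot).toLowerCase() : ''
  const cleanBase = base
    .normalize('NFKD')                 // split accented letters into base + mark
    .replace(/[\u0300-\u036f]/g, '')   // drop the combining accent marks
    .replace(/[^A-Za-z0-9_-]+/g, '-')  // collapse every other run into one hyphen
    .replace(/^-+|-+$/g, '')           // trim leading and trailing hyphens
  return (cleanBase || 'file') + ext
}
```

For example, `sanitizeFilename('Price List (2024) #final!.PDF')` returns `'Price-List-2024-final.pdf'`.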
PDF Best Practices#
Good PDFs:
- Text-based (created from Word, Google Docs, etc.)
- Clear structure with headings
- Tables with text (not images)
- Searchable (you can select text)
Bad PDFs:
- Scanned documents (images, not text)
- Password-protected
- Heavily image-based
- Complex layouts (newspaper-style columns)
Tip: If you have a scanned PDF, use OCR software (Adobe Acrobat, online OCR tools) to convert it to searchable text first.
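A quick way to spot a scanned PDF is to check how much text extraction actually produced. The heuristic below is an assumption for illustration, not AlonChat's detection logic: `looksScanned` and its 50-characters-per-page threshold are hypothetical, and the text and page count would come from whatever PDF extractor you use.

```javascript
// Hypothetical heuristic: very little visible text per page suggests the
// PDF holds page images rather than selectable text.
function looksScanned(extractedText, pageCount, minCharsPerPage = 50) {
  if (pageCount <= 0) return true // no pages means nothing could be extracted
  const visibleChars = extractedText.replace(/\s/g, '').length
  return visibleChars / pageCount < minCharsPerPage
}
```

A file flagged this way is a good candidate for OCR before upload.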
Excel/CSV Best Practices#
Good Excel files:
- Clear column headers
- One table per sheet
- Simple formatting
- No merged cells
Bad Excel files:
- Multiple tables on one sheet
- Complex formulas
- Heavily merged cells
- Charts and graphs (won't be processed)
Tip: For complex Excel files, consider exporting to CSV or creating a Text source with the important data instead.
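To see why clear headers and one table per sheet matter, consider one plausible way a table becomes text: each row is paired with the header row. This is an illustrative sketch; the exact text AlonChat produces is not documented here, and `tableToText` is a hypothetical helper.

```javascript
// Hypothetical conversion: the first row is the header, every other row
// becomes one line of "Header: value" pairs.
function tableToText(rows) {
  const [headers, ...records] = rows
  return records
    .map((record) =>
      headers.map((header, i) => `${header}: ${record[i] ?? ''}`).join('; ')
    )
    .join('\n')
}
```

With rows `[['Item', 'Price'], ['Widget', '10']]`, the result is `Item: Widget; Price: 10`. Merged cells or a second table on the same sheet would break the header-to-value pairing.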
Word Document Best Practices#
Good Word docs:
- Clear headings (Heading 1, Heading 2, etc.)
- Bulleted/numbered lists
- Simple tables
- Plain text content
Bad Word docs:
- Complex formatting (text boxes, columns)
- Embedded objects (videos, complex charts)
- Track changes enabled (may cause parsing issues)
Updating File Sources#
Editing a File Source#
To change the name or priority:
- Find the source in the Knowledge Base list
- Click Edit (pencil icon)
- Update name, priority, or "Is Price" setting
- Click Save
- Re-train your agent for changes to take effect
To replace the file content:
- Delete the old source (or archive it)
- Upload the new file as a new source
- Train your agent
Re-processing a File#
If processing failed or you suspect issues:
- Find the source in the Knowledge Base list
- Click Reprocess (refresh icon)
- Wait for processing to complete
- Check status
Troubleshooting#
"Upload Failed"#
Causes:
- File too large for your plan (over 10MB on Free, 50MB on Pro, or 100MB on Enterprise)
- Network connection interrupted
- File format not supported
- File corrupted
Solutions:
- Check file size (compress if needed)
- Try uploading again
- Convert to a supported format (e.g., PDF)
- Verify file opens correctly on your computer
"Processing Failed"#
Causes:
- File is password-protected
- File is corrupted
- File format has issues
- Text extraction failed (scanned PDFs)
Solutions:
- Remove password protection
- Try re-exporting the file
- Convert to plain PDF or TXT
- Use OCR for scanned documents
"Agent Not Using File Content"#
Causes:
- Forgot to train after uploading
- File still processing
- Questions not related to file content
- Priority too low
Solutions:
- Click "Train Agent" button
- Wait for "Ready" status
- Test with questions directly from the file
- Increase priority to "High"
"File Processing Takes Too Long"#
Normal times:
- Less than 1MB: 30-60 seconds
- 1-10MB: 1-3 minutes
- 10-50MB: 5-10 minutes
If longer:
- Check worker health (admin panel)
- Large files naturally take time
- Complex PDFs process slower
- Consider splitting large files
Examples#
Example 1: Product Manual#
File: product-manual-v2.pdf (8.5MB, 120 pages)
Settings:
- Name: "Product Manual v2 (2024)"
- Priority: High
- Is Price: No
Expected questions:
- "How do I install the product?"
- "What are the technical specifications?"
- "How do I troubleshoot error codes?"
Processing time: ~2-3 minutes
Example 2: Price List#
File: pricing-2024.xlsx (250KB, 3 sheets)
Settings:
- Name: "2024 Pricing"
- Priority: High
- Is Price: Yes ✓
Expected questions:
- "How much does the Pro plan cost?"
- "What's included in the Enterprise tier?"
- "Do you offer discounts for annual plans?"
Processing time: ~30-45 seconds
Example 3: Company Policies#
File: employee-handbook.docx (2.1MB, 45 pages)
Settings:
- Name: "Employee Handbook 2024"
- Priority: Normal
- Is Price: No
Expected questions:
- "What's the vacation policy?"
- "What are the work-from-home guidelines?"
- "What benefits do employees get?"
Processing time: ~1-2 minutes
Technical Details#
Text Extraction#
AlonChat uses specialized libraries for each format:
- PDF: pdf-parse (Node.js)
- Word: mammoth.js
- Excel: xlsx
- Text: Direct reading
- Markdown: remark parser
- JSON: JSON.parse with formatting
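For the JSON case, "JSON.parse with formatting" can be pictured as a parse followed by pretty-printing. The sketch below assumes 2-space indentation and a simple result object; `formatJsonSource` is a hypothetical name, not the function AlonChat uses.

```javascript
// Hypothetical helper: parse raw JSON text and re-serialize it with
// indentation so nested structure stays readable once chunked.
function formatJsonSource(rawText) {
  try {
    return { ok: true, text: JSON.stringify(JSON.parse(rawText), null, 2) }
  } catch (err) {
    // Invalid JSON is reported rather than thrown.
    return { ok: false, error: err.message }
  }
}
```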
Chunking Strategy#
- Chunk size: 2000 characters (configurable via admin)
- Overlap: 200 characters (10% overlap between chunks)
- Separator priority: Paragraphs > Sentences > Words
Example:
Original: 10,000 character document
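Result: 6 chunks of up to 2000 chars each (1800-char stride)
- Chunk 1: chars 0-2000
- Chunk 2: chars 1800-3800 (200 char overlap)
- Chunk 3: chars 3600-5600
- Chunk 4: chars 5400-7400
- Chunk 5: chars 7200-9200
- Chunk 6: chars 9000-10000 (final, shorter chunk)
Exact boundaries may shift slightly when a chunk is cut at a paragraph or sentence break.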
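The overlap arithmetic can be checked with a fixed-stride sliding window. This sketch ignores the separator priority (it cuts at exact character offsets), so it is a simplification rather than the ChunkManager implementation; `chunkOffsets` and `chunkText` are hypothetical names. With a 2000-character chunk and 200-character overlap the stride is 1800, so a 10,000-character input yields six windows, the last one shorter.

```javascript
// Returns [start, end) character offsets for overlapping chunks.
function chunkOffsets(totalLength, chunkSize = 2000, overlap = 200) {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError('Require chunkSize > 0 and 0 <= overlap < chunkSize')
  }
  const stride = chunkSize - overlap
  const offsets = []
  for (let start = 0; start < totalLength; start += stride) {
    const end = Math.min(start + chunkSize, totalLength)
    offsets.push([start, end])
    // Stop once the text is covered, so no chunk consists only of overlap.
    if (end === totalLength) break
  }
  return offsets
}

// Slices the text at those offsets.
function chunkText(text, chunkSize = 2000, overlap = 200) {
  return chunkOffsets(text.length, chunkSize, overlap)
    .map(([start, end]) => text.slice(start, end))
}
```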
Storage#
- Files: Supabase Storage (agent-sources bucket)
- Chunks: source_chunks table
- Embeddings: embedding column (pgvector)
- Metadata: sources table
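To illustrate what semantic search over stored embeddings means, the sketch below ranks chunks by cosine similarity to a query embedding. It runs in memory for clarity; in practice pgvector would do the comparison inside Postgres. The page does not state which distance metric AlonChat uses, so cosine similarity is an assumption, and `cosineSimilarity`, `topChunks`, and the chunk shape are hypothetical.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Vectors must have the same length')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0 // zero vectors have no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Returns the k chunks most similar to the query, highest score first.
// Each chunk is assumed to look like { id, embedding }.
function topChunks(queryEmbedding, chunks, k = 3) {
  return chunks
    .map((chunk) => ({ ...chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
}
```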
API Reference#
For developers integrating file uploads programmatically:
Endpoint: POST /api/agents/[id]/sources/files
Request:
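// fileBlob, agentId, and apiKey must be defined by the caller
const formData = new FormData()
formData.append('file', fileBlob)        // File or Blob to upload
formData.append('name', 'My Document')   // source name (required)
formData.append('priority', '0.8')       // form fields are sent as strings
formData.append('isPrice', 'false')

// Do not set Content-Type manually; fetch adds the multipart boundary itself
const response = await fetch(`/api/agents/${agentId}/sources/files`, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apiKey}`
  },
  body: formData
})
if (!response.ok) throw new Error(`Upload failed: ${response.status}`)
const result = await response.json()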
Response:
{
  "success": true,
  "source": {
    "id": "source-uuid",
    "name": "My Document",
    "type": "file",
    "status": "processing"
  }
}
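Because the upload response reports a "processing" status, a client usually needs to wait before querying the agent. This page does not document a status endpoint, so the sketch below takes the status lookup as a caller-supplied function instead of inventing a URL. `waitUntilReady` is a hypothetical helper, and the status strings 'ready' and 'failed' are assumptions modeled on the "processing" value in the response.

```javascript
// Polls getStatus() until the source is ready, fails, or the timeout passes.
// getStatus is supplied by the caller and must resolve to a status string.
async function waitUntilReady(getStatus, { intervalMs = 2000, timeoutMs = 600000 } = {}) {
  const deadline = Date.now() + timeoutMs
  while (true) {
    const status = await getStatus()
    if (status === 'ready') return status   // assumed success value
    if (status === 'failed') throw new Error('Source processing failed') // assumed failure value
    if (Date.now() >= deadline) throw new Error('Timed out waiting for source')
    await new Promise((resolve) => setTimeout(resolve, intervalMs))
  }
}
```

The default timeout of 10 minutes matches the upper bound quoted for large files; shorten it for small uploads.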
Limits#
| Plan | Max File Size | Max Sources | Processing Speed |
|---|---|---|---|
| Free | 10MB | 10 sources | Standard queue |
| Pro | 50MB | 100 sources | Priority queue |
| Enterprise | 100MB | Unlimited | Dedicated workers |
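A client can check these limits before uploading to avoid a failed request. The sketch below encodes the table as data; it assumes 1MB is 1024 × 1024 bytes and that the source limit counts existing sources, and `checkPlanLimits` is a hypothetical helper rather than part of the AlonChat API.

```javascript
const MB = 1024 * 1024 // assumption: binary megabytes

// Limits copied from the table above.
const PLAN_LIMITS = new Map([
  ['free', { maxFileBytes: 10 * MB, maxSources: 10 }],
  ['pro', { maxFileBytes: 50 * MB, maxSources: 100 }],
  ['enterprise', { maxFileBytes: 100 * MB, maxSources: Infinity }],
])

// Returns a list of problems; an empty list means the upload fits the plan.
function checkPlanLimits(plan, fileBytes, currentSourceCount) {
  const limits = PLAN_LIMITS.get(plan)
  if (!limits) throw new Error(`Unknown plan: ${plan}`)
  const problems = []
  if (fileBytes > limits.maxFileBytes) problems.push('file exceeds the plan size limit')
  if (currentSourceCount >= limits.maxSources) problems.push('source limit reached')
  return problems
}
```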
Next Steps#
- Text Sources - Add text directly (no file needed)
- Q&A Sources - Create question-answer pairs
- Website Sources - Crawl websites
- Training Your Agent - Best practices for training