File Sources#

Upload documents (PDF, Word, Excel, etc.) to your agent's knowledge base. AlonChat extracts the text, splits it into chunks, and generates embeddings so the agent can retrieve relevant passages by meaning.

Supported File Formats#

| Format | Extensions | Best For | Notes |
| --- | --- | --- | --- |
| PDF | .pdf | Documents, manuals, reports | Text only (images ignored) |
| Microsoft Word | .docx, .doc | Documents, policies | Formatting preserved as text |
| Microsoft Excel | .xlsx, .xls, .csv | Price lists, data tables | Tables converted to text |
| Text | .txt | Plain text documents | Fastest processing |
| Markdown | .md | Technical docs, README files | Formatting preserved |
| JSON | .json | API docs, structured data | Parsed and formatted |

How It Works#

```text
1. Upload File → Supabase Storage (agent-sources bucket)
   ↓
2. Queue Job → source-processor worker picks it up
   ↓
3. Download & Extract Text → FileProcessor.processFile()
   ↓
4. Chunk Content → ChunkManager.storeChunksWithNamespace()
   - Default chunk size: 2000 characters (roughly 300-400 English words)
   - Namespace: "docs"
   ↓
5. Generate Embeddings → embedding-generation queue
   - Creates vector embeddings for semantic search
   ↓
6. Status: Ready → Agent can now use this knowledge
```

Processing Time:

  • Small files (under 1MB): 30-60 seconds
  • Medium files (1-10MB): 1-3 minutes
  • Large files (10-50MB): 5-10 minutes
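The size bands above can be expressed as a small lookup. This is only an illustrative sketch: the function name, the return shape, and the choice of 1MB = 1024 × 1024 bytes are assumptions, and the durations are simply the ranges listed above, not a guarantee.

```typescript
// Rough processing-time estimate based on the documented size bands.
// Assumption: 1MB = 1024 * 1024 bytes; the docs do not say which unit is used.
type TimeEstimate = { minSeconds: number; maxSeconds: number };

const BYTES_PER_MB = 1024 * 1024;

function estimateProcessingTime(sizeBytes: number): TimeEstimate | null {
  if (sizeBytes < 0) return null;
  if (sizeBytes < 1 * BYTES_PER_MB) return { minSeconds: 30, maxSeconds: 60 };
  if (sizeBytes <= 10 * BYTES_PER_MB) return { minSeconds: 60, maxSeconds: 180 };
  if (sizeBytes <= 50 * BYTES_PER_MB) return { minSeconds: 300, maxSeconds: 600 };
  return null; // larger than any documented band
}
```

A `null` result means the size falls outside the ranges documented here, so no estimate is offered.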

Step-by-Step: Adding a File Source#

1. Navigate to Knowledge Base#

  1. Go to your agent's dashboard
  2. Click Knowledge Base tab
  3. Click Add Source → File

2. Upload Your File#

Option A: Drag and Drop

  1. Drag your file into the upload area
  2. File name appears when upload completes

Option B: Click to Browse

  1. Click Choose File
  2. Select file from your computer
  3. Click Open

3. Configure Source Settings#

Source Name (required)

  • Descriptive name for this source
  • Examples: "Product Manual 2024", "Pricing Sheet", "Policy Document"
  • This helps you identify sources in the dashboard

Priority (optional)

  • Normal (default): Standard retrieval weight
  • High: More likely to be retrieved (use for important info)
  • Low: Lower retrieval priority (background info only)

Mark as Price Information (checkbox)

  • Check this if the file contains pricing information
  • Automatically sets priority to 1.0 (highest)
  • Pricing questions will prioritize this source
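One way to picture how these settings might combine into a single retrieval weight is sketched below. Only the value 1.0 for price sources comes from this page; the numeric weights for Low, Normal, and High, and the function name, are hypothetical placeholders and may not match AlonChat's internal values.

```typescript
type PriorityLabel = "low" | "normal" | "high";

// Hypothetical weights for illustration only. The docs state just one number:
// price sources get priority 1.0.
const ASSUMED_WEIGHTS: Record<PriorityLabel, number> = {
  low: 0.3,
  normal: 0.5,
  high: 0.8,
};

// Price sources always resolve to the highest priority, whatever label is chosen.
function resolvePriority(label: PriorityLabel = "normal", isPrice = false): number {
  return isPrice ? 1.0 : ASSUMED_WEIGHTS[label];
}
```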

4. Upload and Train#

  1. Click Upload File
  2. Wait for upload to complete (progress bar)
  3. Source appears in "Processing" status
  4. Click Train Agent to process the file
  5. Wait for status to change to "Ready"
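If you automate this flow, the final "wait for Ready" step could be handled with a simple polling loop. The sketch below is a generic helper, not an AlonChat API: the status strings `"ready"` and `"failed"` and the `getStatus` callback are assumptions, and you would supply your own function for looking up the source's status.

```typescript
// Assumed status values; only "processing" appears in the documented API response.
type SourceStatus = "processing" | "ready" | "failed";

// Polls a caller-supplied status function until the source leaves "processing".
async function waitUntilReady(
  getStatus: () => Promise<SourceStatus>,
  { intervalMs = 5000, timeoutMs = 10 * 60 * 1000 } = {},
): Promise<SourceStatus> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const status = await getStatus();
    if (status !== "processing") return status;
    if (Date.now() + intervalMs > deadline) {
      throw new Error("Timed out waiting for source to finish processing");
    }
    // Sleep before the next check.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The default timeout of ten minutes mirrors the upper end of the processing times listed earlier; adjust it for your own files.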

Your agent can now answer questions using this file's content!

Best Practices#

File Preparation#

Do:

  • Use clear, descriptive filenames
  • Keep files under 10MB when possible (faster processing)
  • Use searchable PDFs (not scanned images)
  • Organize content with headings and sections
  • Remove unnecessary pages (covers, blank pages)

Don't:

  • Upload scanned images (text can't be extracted)
  • Include password-protected files
  • Upload corrupted files
  • Use special characters in filenames
  • Exceed your plan's file size limit (10MB on Free, 50MB on Pro, 100MB on Enterprise)
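Several of these checks can be run before uploading. The following is a minimal client-side sketch; the function name is illustrative, and the definition of "special characters" (anything other than letters, digits, dots, dashes, underscores, and spaces) is an assumption, since the exact rule is not specified here.

```typescript
// Extensions taken from the Supported File Formats table.
const SUPPORTED_EXTENSIONS = [
  ".pdf", ".docx", ".doc", ".xlsx", ".xls", ".csv", ".txt", ".md", ".json",
];

// Returns a list of human-readable problems; an empty list means the file passed.
function validateUpload(fileName: string, sizeBytes: number, maxBytes: number): string[] {
  const problems: string[] = [];
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot).toLowerCase() : "";
  if (!SUPPORTED_EXTENSIONS.includes(ext)) {
    problems.push(`Unsupported format: ${ext || "(none)"}`);
  }
  // Assumed rule for "special characters"; adjust to your own naming policy.
  if (!/^[A-Za-z0-9._\- ]+$/.test(fileName)) {
    problems.push("Filename contains special characters");
  }
  if (sizeBytes > maxBytes) {
    problems.push("File exceeds the size limit");
  }
  return problems;
}
```

Checks such as password protection or file corruption cannot be detected this way and will only surface during processing.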

PDF Best Practices#

Good PDFs:

  • Text-based (created from Word, Google Docs, etc.)
  • Clear structure with headings
  • Tables with text (not images)
  • Searchable (you can select text)

Bad PDFs:

  • Scanned documents (images, not text)
  • Password-protected
  • Heavily image-based
  • Complex layouts (newspaper-style columns)

Tip: If you have a scanned PDF, use OCR software (Adobe Acrobat, online OCR tools) to convert it to searchable text first.
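A quick way to guess whether a PDF is scanned is to look at how much text comes out of it per page. The heuristic below is a sketch under stated assumptions: it takes text and a page count you have already obtained from any extractor, and the threshold of 50 characters per page is an arbitrary starting point rather than a value used by AlonChat.

```typescript
// Heuristic: very little extractable text per page suggests a scanned (image-only) PDF.
function looksScanned(
  extractedText: string,
  pageCount: number,
  minCharsPerPage = 50, // assumed threshold; tune for your documents
): boolean {
  if (pageCount <= 0) return false;
  const visibleChars = extractedText.replace(/\s+/g, "").length;
  return visibleChars / pageCount < minCharsPerPage;
}
```

If this returns `true` for a file, running it through OCR before uploading is likely worthwhile.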

Excel/CSV Best Practices#

Good Excel files:

  • Clear column headers
  • One table per sheet
  • Simple formatting
  • No merged cells

Bad Excel files:

  • Multiple tables on one sheet
  • Complex formulas
  • Heavily merged cells
  • Charts and graphs (won't be processed)

Tip: For complex Excel files, consider exporting to CSV or creating a Text source with the important data instead.
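To see why clear column headers and one table per sheet matter, it can help to picture how a table might be flattened into text. The sketch below pairs each cell with its header; it is an illustration of the general idea only, and AlonChat's actual conversion may differ.

```typescript
// Flattens a table (first row = headers) into one "Header: value" line per data row.
function rowsToText(rows: string[][]): string {
  if (rows.length === 0) return "";
  const [header, ...body] = rows;
  return body
    .map((cells) => header.map((name, i) => `${name}: ${cells[i] ?? ""}`).join(", "))
    .join("\n");
}
```

With merged cells or several tables on one sheet, the header row no longer lines up with the data, and the resulting text loses its meaning.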

Word Document Best Practices#

Good Word docs:

  • Clear headings (Heading 1, Heading 2, etc.)
  • Bulleted/numbered lists
  • Simple tables
  • Plain text content

Bad Word docs:

  • Complex formatting (text boxes, columns)
  • Embedded objects (videos, complex charts)
  • Track changes enabled (may cause parsing issues)

Updating File Sources#

Editing a File Source#

To change the name or priority:

  1. Find the source in the Knowledge Base list
  2. Click Edit (pencil icon)
  3. Update name, priority, or "Is Price" setting
  4. Click Save
  5. Re-train your agent for changes to take effect

To replace the file content:

  1. Delete the old source (or archive it)
  2. Upload the new file as a new source
  3. Train your agent

Re-processing a File#

If processing failed or you suspect issues:

  1. Find the source in the Knowledge Base list
  2. Click Reprocess (refresh icon)
  3. Wait for processing to complete
  4. Check status

Troubleshooting#

"Upload Failed"#

Causes:

  • File too large (over your plan's limit, e.g. 50MB on Pro)
  • Network connection interrupted
  • File format not supported
  • File corrupted

Solutions:

  • Check file size (compress if needed)
  • Try uploading again
  • Convert to a supported format (e.g., PDF)
  • Verify file opens correctly on your computer

"Processing Failed"#

Causes:

  • File is password-protected
  • File is corrupted
  • File format has issues
  • Text extraction failed (scanned PDFs)

Solutions:

  • Remove password protection
  • Try re-exporting the file
  • Convert to plain PDF or TXT
  • Use OCR for scanned documents

"Agent Not Using File Content"#

Causes:

  • Forgot to train after uploading
  • File still processing
  • Questions not related to file content
  • Priority too low

Solutions:

  • Click "Train Agent" button
  • Wait for "Ready" status
  • Test with questions directly from the file
  • Increase priority to "High"

"File Processing Takes Too Long"#

Normal times:

  • Less than 1MB: 30-60 seconds
  • 1-10MB: 1-3 minutes
  • 10-50MB: 5-10 minutes

If longer:

  • Check worker health (admin panel)
  • Large files naturally take time
  • Complex PDFs process slower
  • Consider splitting large files

Examples#

Example 1: Product Manual#

File: product-manual-v2.pdf (8.5MB, 120 pages)

Settings:

  • Name: "Product Manual v2 (2024)"
  • Priority: High
  • Is Price: No

Expected questions:

  • "How do I install the product?"
  • "What are the technical specifications?"
  • "How do I troubleshoot error codes?"

Processing time: ~2-3 minutes

Example 2: Price List#

File: pricing-2024.xlsx (250KB, 3 sheets)

Settings:

  • Name: "2024 Pricing"
  • Priority: High
  • Is Price: Yes ✓

Expected questions:

  • "How much does the Pro plan cost?"
  • "What's included in the Enterprise tier?"
  • "Do you offer discounts for annual plans?"

Processing time: ~30-45 seconds

Example 3: Company Policies#

File: employee-handbook.docx (2.1MB, 45 pages)

Settings:

  • Name: "Employee Handbook 2024"
  • Priority: Normal
  • Is Price: No

Expected questions:

  • "What's the vacation policy?"
  • "What are the work-from-home guidelines?"
  • "What benefits do employees get?"

Processing time: ~1-2 minutes

Technical Details#

Text Extraction#

AlonChat uses specialized libraries for each format:

  • PDF: pdf-parse (Node.js)
  • Word: mammoth.js
  • Excel: xlsx
  • Text: Direct reading
  • Markdown: remark parser
  • JSON: JSON.parse with formatting
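The per-format choice above amounts to a dispatch on file extension. The sketch below only returns the name of the library listed for each format; the map and function names are illustrative, and the extension-to-library pairing simply follows the Supported File Formats table (how legacy .doc files are actually handled is not described here).

```typescript
type Extractor = "pdf-parse" | "mammoth.js" | "xlsx" | "direct read" | "remark" | "JSON.parse";

// Extension-to-extractor pairing, following the format table and the list above.
const EXTRACTOR_BY_EXTENSION: Record<string, Extractor> = {
  ".pdf": "pdf-parse",
  ".docx": "mammoth.js",
  ".doc": "mammoth.js",
  ".xlsx": "xlsx",
  ".xls": "xlsx",
  ".csv": "xlsx",
  ".txt": "direct read",
  ".md": "remark",
  ".json": "JSON.parse",
};

// Returns undefined for extensions that are not in the supported list.
function pickExtractor(fileName: string): Extractor | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return undefined;
  return EXTRACTOR_BY_EXTENSION[fileName.slice(dot).toLowerCase()];
}
```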

Chunking Strategy#

  • Chunk size: 2000 characters (configurable via admin)
  • Overlap: 200 characters (10% overlap between chunks)
  • Separator priority: Paragraphs > Sentences > Words

Example:

```text
Original: 10,000 character document
Result: 6 chunks of up to 2000 chars each (window start advances by 1800)
- Chunk 1: chars 0-2000
- Chunk 2: chars 1800-3800 (200 char overlap)
- Chunk 3: chars 3600-5600
- Chunk 4: chars 5400-7400
- Chunk 5: chars 7200-9200
- Chunk 6: chars 9000-10000
```
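The overlap arithmetic can be reproduced with a fixed-window chunker. This is a simplified sketch, not AlonChat's ChunkManager: it cuts at exact character offsets and ignores the separator priority (paragraphs, sentences, words), so real chunk boundaries will usually differ slightly.

```typescript
// Fixed-size sliding window: each chunk starts (chunkSize - overlap) characters
// after the previous one, so neighbours share `overlap` characters.
function chunkText(text: string, chunkSize = 2000, overlap = 200): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("Require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const step = chunkSize - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}
```

With the defaults, each window starts 1800 characters after the previous one, and the final chunk may be shorter than 2000 characters.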

Storage#

  • Files: Supabase Storage (agent-sources bucket)
  • Chunks: source_chunks table
  • Embeddings: embedding column (pgvector)
  • Metadata: sources table

API Reference#

For developers integrating file uploads programmatically:

Endpoint: POST /api/agents/[id]/sources/files

Request:

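```typescript
// Assumes `fileBlob` (a Blob or File), `agentId`, and `apiKey` are already defined.
const formData = new FormData()
formData.append('file', fileBlob)
formData.append('name', 'My Document')
formData.append('priority', '0.8') // form fields are sent as strings
formData.append('isPrice', 'false')

// Do not set Content-Type yourself; fetch adds the multipart boundary for FormData.
const response = await fetch(`/api/agents/${agentId}/sources/files`, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apiKey}`
  },
  body: formData
})

if (!response.ok) {
  throw new Error(`Upload failed with HTTP ${response.status}`)
}
const result = await response.json()
```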

Response:

```json
{
  "success": true,
  "source": {
    "id": "source-uuid",
    "name": "My Document",
    "type": "file",
    "status": "processing"
  }
}
```
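When consuming this response in TypeScript, a small type guard can confirm the payload has the shape shown above before you rely on it. The interface and function names are illustrative, and the field list is inferred from this one example response, so the real payload may carry additional fields.

```typescript
// Shape inferred from the example response; not an official type definition.
interface UploadResponse {
  success: boolean;
  source: { id: string; name: string; type: string; status: string };
}

// Runtime check that an unknown value matches the example response shape.
function isUploadResponse(value: unknown): value is UploadResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const source = v.source;
  if (typeof source !== "object" || source === null) return false;
  const s = source as Record<string, unknown>;
  return (
    typeof v.success === "boolean" &&
    ["id", "name", "type", "status"].every((key) => typeof s[key] === "string")
  );
}
```

A typical use is to call `isUploadResponse` on the parsed JSON and read `source.id` only if it returns `true`.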

Limits#

| Plan | Max File Size | Max Sources | Processing Speed |
| --- | --- | --- | --- |
| Free | 10MB | 10 sources | Standard queue |
| Pro | 50MB | 100 sources | Priority queue |
| Enterprise | 100MB | Unlimited | Dedicated workers |
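For client-side checks, the plan limits can be encoded as a lookup table. This is a sketch only: the plan keys, the function name, and the assumption that 1MB = 1024 × 1024 bytes are illustrative, and the server remains the authority on what is accepted.

```typescript
type Plan = "free" | "pro" | "enterprise";

const MB_IN_BYTES = 1024 * 1024; // assumed unit; the docs do not specify

// Values copied from the Limits table; "Unlimited" is modelled as Infinity.
const PLAN_LIMITS: Record<Plan, { maxFileBytes: number; maxSources: number }> = {
  free: { maxFileBytes: 10 * MB_IN_BYTES, maxSources: 10 },
  pro: { maxFileBytes: 50 * MB_IN_BYTES, maxSources: 100 },
  enterprise: { maxFileBytes: 100 * MB_IN_BYTES, maxSources: Infinity },
};

// True if one more source of the given size fits within the plan's limits.
function canAddSource(plan: Plan, fileBytes: number, currentSources: number): boolean {
  const limits = PLAN_LIMITS[plan];
  return fileBytes <= limits.maxFileBytes && currentSources < limits.maxSources;
}
```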
