Location: src/calibration/llm_analyzer.py
Purpose: The LLM examines sample emails and assigns each one to a discovered category, creating labeled training data for the ML model.
This is NOT the same as category discovery. Discovery finds WHAT categories exist. Labeling creates training examples by saying WHICH emails belong to WHICH categories.
The "Label Training Emails" phase described as "~3 seconds per email" is INCORRECT.
The actual implementation does NOT label emails individually.
Labels are created as a BYPRODUCT of batch category discovery, not as a separate sequential operation.
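As a rough sketch of that contract (the function name and signature below are hypothetical, not the actual llm_analyzer.py API), each batch call returns both outputs at once:

```python
# Hypothetical sketch: names are illustrative, not the actual
# llm_analyzer.py API. One LLM call per batch returns BOTH outputs.
from typing import Dict, List, Tuple

def analyze_batch(emails: List[dict]) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Analyze one batch of ~20 emails with a single LLM call.

    Returns:
        categories: {category_name: description} discovered from this batch
        labels:     [(email_id, category_name)] for every email in the batch

    Labels fall out of the same call, so there is no second labeling pass.
    """
    raise NotImplementedError  # see the pipeline sketches below
```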
```mermaid
flowchart TD
    Start([Calibration Phase Starts]) --> Sample["Sample 300 emails<br/>stratified by sender"]
    Sample --> BatchSetup["Split into batches of 20 emails<br/>300 ÷ 20 = 15 batches"]
    BatchSetup --> Batch1["Batch 1: Emails 1-20"]
    Batch1 --> Stats1["Calculate batch statistics<br/>domains, keywords, attachments<br/>~0.1 seconds"]
    Stats1 --> BuildPrompt1["Build LLM prompt<br/>Include all 20 email summaries<br/>~0.05 seconds"]
    BuildPrompt1 --> LLMCall1["Single LLM call for entire batch<br/>Discovers categories AND labels all 20<br/>~20 seconds TOTAL for batch"]
    LLMCall1 --> Parse1["Parse JSON response<br/>Extract categories + labels<br/>~0.1 seconds"]
    Parse1 --> Store1["Store results<br/>categories: Dict<br/>labels: List of Tuples"]
    Store1 --> Batch2{More batches?}
    Batch2 -->|Yes| NextBatch["Batch 2: Emails 21-40"]
    Batch2 -->|No| Consolidate
    NextBatch --> Stats2["Same process<br/>15 total batches<br/>~20 seconds each"]
    Stats2 --> Batch2
    Consolidate["Consolidate categories<br/>Merge duplicates<br/>Single LLM call<br/>~5 seconds"]
    Consolidate --> CacheSnap["Snap to cached categories<br/>Match against persistent cache<br/>~0.5 seconds"]
    CacheSnap --> Final["Final output<br/>10-12 categories<br/>300 labeled emails"]
    Final --> End([Labels ready for ML training])

    style LLMCall1 fill:#ff6b6b
    style Consolidate fill:#ff6b6b
    style Stats2 fill:#ffd93d
    style Final fill:#4ec9b0
```
Sequential (WRONG assumption): 300 emails × 3 sec/email = 900 seconds (15 minutes)
Batched (ACTUAL): 15 batches × 20 sec/batch = 300 seconds (5 minutes)
Savings: 10 minutes (a 67% reduction in wall-clock time, i.e. ~3× faster)
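A minimal sketch of that batched loop, building on the hypothetical `analyze_batch` above (variable names are illustrative):

```python
# Illustrative batched driver loop; sampled_emails comes from the
# stratified sampling step and analyze_batch is the sketch above.
BATCH_SIZE = 20

all_categories: dict[str, str] = {}
all_labels: list[tuple[str, str]] = []

for start in range(0, len(sampled_emails), BATCH_SIZE):
    batch = sampled_emails[start:start + BATCH_SIZE]
    categories, labels = analyze_batch(batch)  # one LLM call, ~20 seconds
    all_categories.update(categories)          # raw names; duplicates merged later
    all_labels.extend(labels)                  # 20 (email_id, category) tuples

# 15 iterations × ~20 s ≈ 300 s, versus 300 calls × ~3 s ≈ 900 s one-by-one
```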
```mermaid
flowchart TD
    Start([Batch of 20 emails]) --> Stats["Calculate statistics<br/>~0.1 seconds"]
    Stats --> StatDetails["Domain analysis<br/>Recipient counts<br/>Attachment detection<br/>Keyword extraction"]
    StatDetails --> BuildList["Build email summaries<br/>For each email:<br/>ID + From + Subject + Preview"]
    BuildList --> Prompt["Construct LLM prompt<br/>~2KB text<br/>Contains:<br/>- Statistics summary<br/>- All 20 email summaries<br/>- Instructions<br/>- JSON schema"]
    Prompt --> LLM["LLM call<br/>POST /api/generate<br/>qwen3:4b-instruct-2507-q8_0<br/>temp=0.1, max_tokens=2000<br/>~18-22 seconds"]
    LLM --> Response["LLM response<br/>JSON with:<br/>categories: Dict<br/>labels: List of 20 Tuples"]
    Response --> Parse["Parse JSON<br/>Regex extraction<br/>Brace counting<br/>~0.05 seconds"]
    Parse --> Validate{Valid JSON?}
    Validate -->|Yes| Extract["Extract data<br/>categories: 3-8 new<br/>labels: 20 tuples"]
    Validate -->|No| FallbackParse["Fallback parsing<br/>Try to salvage partial data"]
    FallbackParse --> Extract
    Extract --> Return["Return batch results<br/>categories: Dict str→str<br/>labels: List Tuple str,str"]
    Return --> End([Merge with global results])

    style LLM fill:#ff6b6b
    style Parse fill:#4ec9b0
    style FallbackParse fill:#ffd93d
```
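A minimal sketch of that call, assuming a local Ollama server (the endpoint and option names below are standard Ollama, but the exact client code in llm_analyzer.py may differ; `num_predict` is Ollama's name for the max-token cap):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed local Ollama endpoint
MODEL = "qwen3:4b-instruct-2507-q8_0"

def call_llm(prompt: str) -> str:
    """Single non-streaming generation call for one batch of 20 summaries."""
    resp = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.1,   # near-deterministic output
            "num_predict": 2000,  # Ollama's equivalent of max_tokens
        },
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]
```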
| Operation | Per Batch (20 emails) | Total (15 batches) | % of Total Time |
|---|---|---|---|
| Calculate statistics | 0.1 sec | 1.5 sec | 0.5% |
| Build email summaries | 0.05 sec | 0.75 sec | 0.2% |
| Construct prompt | 0.01 sec | 0.15 sec | 0.05% |
| LLM API call | 18-22 sec | 270-330 sec | 98% |
| Parse JSON response | 0.05 sec | 0.75 sec | 0.2% |
| Merge results | 0.02 sec | 0.3 sec | 0.1% |
| SUBTOTAL: Batch Discovery | | ~300 seconds (5 min) | 98.4% |
| Consolidation LLM call | | 5 seconds | 1.6% |
| Cache snapping (semantic matching) | | 0.5 seconds | 0.2% |
| TOTAL LABELING PHASE | | ~305 seconds (5 min) | 100% |
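The brace-counting fallback in the parse step can be sketched roughly as follows (a simplification: it ignores braces inside JSON string values, which is usually acceptable for salvage parsing):

```python
import json
from typing import Optional

def extract_json_object(text: str) -> Optional[dict]:
    """Salvage the first balanced {...} block from free-form LLM output."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # braces balanced: candidate JSON object
                try:
                    return json.loads(text[start:i + 1])
                except json.JSONDecodeError:
                    return None  # caller falls back to partial salvage
    return None  # braces never balanced: response was truncated
```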
Original estimate: "~3 seconds per email" = 900 seconds for 300 emails
Actual timing: ~20 seconds per batch of 20 = ~305 seconds for 300 emails
Difference: ~3× faster than the original assumption.
Why: batching lets the LLM see context across multiple emails and make better category decisions in a single inference pass per batch.
```mermaid
flowchart LR
    Input[300 sampled emails] --> Discovery["Category Discovery<br/>15 batches × 20 emails"]
    Discovery --> RawCats["Raw Categories<br/>~30-40 discovered<br/>May have duplicates:<br/>Work, work, Business, etc."]
    RawCats --> Consolidate["Consolidation<br/>LLM merges similar<br/>~5 seconds"]
    Consolidate --> Merged["Merged Categories<br/>~12-15 categories<br/>Work, Financial, etc."]
    Merged --> CacheSnap["Cache Snap<br/>Match against persistent cache<br/>~0.5 seconds"]
    CacheSnap --> Final["Final Categories<br/>10-12 categories"]
    Discovery --> RawLabels["Raw Labels<br/>300 tuples:<br/>email_id, category"]
    RawLabels --> UpdateLabels["Update label categories<br/>to match snapped names"]
    UpdateLabels --> FinalLabels["Final Labels<br/>300 training pairs"]
    Final --> Training[Training Data]
    FinalLabels --> Training
    Training --> MLTrain["Train LightGBM Model<br/>~5 seconds"]
    MLTrain --> Model["Trained Model<br/>1.8MB .pkl file"]

    style Discovery fill:#ff6b6b
    style Consolidate fill:#ff6b6b
    style Model fill:#4ec9b0
```
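The "update label categories" step amounts to a dictionary re-mapping over the raw labels (sketch below; `raw_to_canonical` would come from the consolidation and cache-snap results, and `all_labels` is the accumulated list from the batch loop above):

```python
# Illustrative sketch of label re-mapping after consolidation; the mapping
# contents here are an example, not actual output.
raw_to_canonical = {
    "Work": "Work", "work": "Work", "Business": "Work",  # example merge
}

final_labels = [
    (email_id, raw_to_canonical.get(category, category))
    for email_id, category in all_labels
]
# final_labels now holds 300 (email_id, canonical_category) training pairs
```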
| Approach | LLM Calls | Time/Call | Total Time | Quality |
|---|---|---|---|---|
| Sequential (1 email/call) | 300 | 3 sec | 900 sec (15 min) | Poor - no context |
| Small batches (5 emails/call) | 60 | 8 sec | 480 sec (8 min) | Fair - limited context |
| Current (20 emails/call) | 15 | 20 sec | 300 sec (5 min) | Good - sufficient context |
| Large batches (50 emails/call) | 6 | 45 sec | 270 sec (4.5 min) | Risk - may exceed token limits |
| Parameter | Location | Default | Effect on Timing |
|---|---|---|---|
| sample_size | CalibrationConfig | 300 | 300 samples = 15 batches = 5 min |
| batch_size | llm_analyzer.py:62 | 20 | Hardcoded - affects batch count |
| llm_batch_size | CalibrationConfig | 50 | NOT USED for discovery (misleading name) |
| temperature | LLM call | 0.1 | Lower = more deterministic output; negligible direct effect on speed |
| max_tokens | LLM call | 2000 | Higher = potentially slower response |
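A plausible shape for these settings, inferred from the table above (the field names `sample_size` and `llm_batch_size` appear in the table; the rest of the dataclass is an assumption):

```python
# Assumed sketch of CalibrationConfig, reconstructed from the table above;
# the real class in the codebase may hold more fields.
from dataclasses import dataclass

@dataclass
class CalibrationConfig:
    sample_size: int = 300    # 300 samples -> 15 batches -> ~5 min discovery
    llm_batch_size: int = 50  # misleading name: NOT used for discovery, where
                              # a batch size of 20 is hardcoded (llm_analyzer.py:62)
```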
```mermaid
gantt
    title Calibration Phase Timeline (300 samples, 10k total emails)
    dateFormat mm:ss
    axisFormat %M:%S

    section Sampling
    Stratified sample (3% of 10k)   :00:00, 1s

    section Category Discovery
    Batch 1 (emails 1-20)           :00:01, 20s
    Batch 2 (emails 21-40)          :00:21, 20s
    Batch 3 (emails 41-60)          :00:41, 20s
    Batches 4-13 (emails 61-260)    :01:01, 200s
    Batch 14 (emails 261-280)       :04:21, 20s
    Batch 15 (emails 281-300)       :04:41, 20s

    section Consolidation
    LLM category merge              :05:01, 5s
    Cache snap (~0.5s)              :05:06, 1s

    section ML Training
    Feature extraction (300 emails) :05:07, 6s
    LightGBM training               :05:13, 5s
    Validation (100 emails)         :05:18, 2s
    Save model to disk (~0.5s)      :05:20, 1s
```
The LLM creates labels as a byproduct of batch category discovery. There is NO separate "label each email one by one" phase.
Processing 20 emails in a single LLM call (20 sec) is 3× faster than 20 individual calls (60 sec total).
98% of labeling phase time is LLM API calls. Everything else (parsing, merging, caching) is negligible.
Merging 30-40 raw categories into 10-12 final ones takes only ~5 seconds with a single LLM call.
| Optimization | Current | Potential | Tradeoff |
|---|---|---|---|
| Increase batch size | 20 emails/batch | 30-40 emails/batch | May hit token limits, slower per call |
| Reduce sample size | 300 samples (3%) | 200 samples (2%) | Less training data, potentially worse model |
| Parallel batching | Sequential 15 batches | 3-5 concurrent batches | Requires async LLM client, more complex |
| Skip consolidation | Always consolidate if >10 cats | Skip if <15 cats | May leave duplicate categories |
| Cache-first approach | Discover then snap to cache | Snap to cache, only discover new | Less adaptive to new mailbox types |
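For the "parallel batching" row, an async client could look roughly like this (hypothetical sketch: `analyze_batch_async` does not exist in the current code and would have to be written against an async HTTP client):

```python
import asyncio

MAX_CONCURRENT = 4  # 3-5 concurrent batches, per the table above

async def discover_all(batches: list) -> list:
    """Run batch discovery with a bounded number of in-flight LLM calls."""
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def run_one(batch):
        async with sem:  # cap concurrent requests to the LLM server
            return await analyze_batch_async(batch)  # hypothetical async helper

    # 15 batches, 4 in flight: ~4 waves × 20 s ≈ 80 s instead of ~300 s,
    # assuming the LLM server can actually serve 4 requests concurrently.
    return await asyncio.gather(*(run_one(b) for b in batches))
```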