"I want to run ML-only classification on new mailboxes WITHOUT full calibration. Maybe 1 LLM call to verify categories match, then pure ML on embeddings. How can we do this fast for experimentation?"
```mermaid
flowchart TD
    Start([New Mailbox: 10k emails]) --> Check{Model exists?}
    Check -->|No| Calibration["CALIBRATION PHASE<br/>~20 minutes"]
    Check -->|Yes| LoadModel[Load existing model]
    Calibration --> Sample[Sample 300 emails]
    Sample --> Discovery["LLM Category Discovery<br/>15 batches × 20 emails<br/>~5 minutes"]
    Discovery --> Consolidate["Consolidate categories<br/>LLM call<br/>~5 seconds"]
    Consolidate --> Label[Label 300 samples]
    Label --> Extract[Feature extraction]
    Extract --> Train["Train LightGBM<br/>~5 seconds"]
    Train --> SaveModel[Save new model]
    SaveModel --> Classify[CLASSIFICATION PHASE]
    LoadModel --> Classify
    Classify --> Loop{For each email}
    Loop --> Embed["Generate embedding<br/>~0.02 sec"]
    Embed --> TFIDF["TF-IDF features<br/>~0.001 sec"]
    TFIDF --> Predict["ML Prediction<br/>~0.003 sec"]
    Predict --> Threshold{Confidence?}
    Threshold -->|High| MLDone[ML result]
    Threshold -->|Low| LLMFallback["LLM fallback<br/>~4 sec"]
    MLDone --> Next{More?}
    LLMFallback --> Next
    Next -->|Yes| Loop
    Next -->|No| Done[Results]
    style Calibration fill:#ff6b6b
    style Discovery fill:#ff6b6b
    style LLMFallback fill:#ff6b6b
    style MLDone fill:#4ec9b0
```
```mermaid
flowchart TD
    Start([New Mailbox: 10k emails]) --> LoadModel["Load pre-trained model<br/>Categories: 11 known<br/>~0.5 seconds"]
    LoadModel --> OptionalCheck{Verify categories?}
    OptionalCheck -->|Yes| QuickVerify["Single LLM call<br/>Sample 10-20 emails<br/>Check category match<br/>~20 seconds"]
    OptionalCheck -->|Skip| StartClassify
    QuickVerify --> MatchCheck{Categories match?}
    MatchCheck -->|Yes| StartClassify[START CLASSIFICATION]
    MatchCheck -->|No| Warn["Warning: Category mismatch<br/>Continue anyway"]
    Warn --> StartClassify
    StartClassify --> Loop{For each email}
    Loop --> Embed["Generate embedding<br/>all-minilm:l6-v2<br/>384 dimensions<br/>~0.02 sec"]
    Embed --> TFIDF["TF-IDF features<br/>~0.001 sec"]
    TFIDF --> Combine["Combine features<br/>Embedding + TF-IDF vector"]
    Combine --> Predict["LightGBM prediction<br/>~0.003 sec"]
    Predict --> Result["Category + confidence<br/>NO threshold check<br/>NO LLM fallback"]
    Result --> Next{More emails?}
    Next -->|Yes| Loop
    Next -->|No| Done["10k emails classified<br/>Total time: ~4 minutes"]
    style QuickVerify fill:#ffd93d
    style Result fill:#4ec9b0
    style Done fill:#4ec9b0
```
Your trained model contains the LightGBM classifier, the fitted TF-IDF vectorizer, and the 11 learned categories. It can classify ANY email that has the same feature structure (embeddings + TF-IDF).

The all-minilm:l6-v2 model produces 384-dim embeddings for ANY text. It doesn't need to be "trained" on your categories; it simply maps text into a shared semantic space. The same embedding model works on Gmail, Outlook, or any other mailbox.
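A minimal sketch of the feature-combination step (numpy only; the vectors here are illustrative stand-ins for the real embedding model and fitted TF-IDF vectorizer):

```python
import numpy as np

EMBED_DIM = 384  # all-minilm:l6-v2 output size

def combine_features(embedding: np.ndarray, tfidf: np.ndarray) -> np.ndarray:
    """Concatenate the semantic embedding with the TF-IDF features.

    The classifier only cares about this fixed feature structure, not
    where the email came from, which is why the same model can score
    Gmail or Outlook mail.
    """
    assert embedding.shape == (EMBED_DIM,), "unexpected embedding size"
    return np.concatenate([embedding, tfidf])

# Stand-in vectors; a real run would call the embedding model and the
# fitted TF-IDF vectorizer on the email text.
features = combine_features(np.zeros(EMBED_DIM), np.ones(100))
print(features.shape)  # (484,)
```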
The --no-llm-fallback flag is already implemented. When set, low-confidence predictions keep the ML result instead of falling back to the LLM. And if a model exists at src/models/pretrained/classifier.pkl, calibration is skipped entirely.
Scenario: Model trained on Enron (business emails)
New mailbox: Personal Gmail (shopping, social, newsletters)
| Enron Categories (Trained) | Gmail Categories (Natural) | ML Behavior |
|---|---|---|
| Work, Meetings, Financial | Shopping, Social, Travel | Forces Gmail into Enron categories |
| "Operational" | No equivalent | Emails mis-classified as "Operational" |
| "External" | "Newsletters" | May map but semantically different |
Result: Model works, but accuracy drops. Emails get forced into inappropriate categories.
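The forcing behavior is just an argmax over the trained label set; a pure-Python sketch with toy probabilities:

```python
def classify(probs, categories):
    """LightGBM can only answer with one of its trained categories, so
    an out-of-domain email is forced into whichever trained label
    scores highest, typically with a low, flat probability spread."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return categories[best], probs[best]

# A Gmail shopping receipt scored against Enron-era labels: no good
# fit exists, but a label still comes out.
label, conf = classify([0.31, 0.28, 0.22, 0.19],
                       ["Work", "Meetings", "Financial", "Operational"])
print(label, conf)  # Work 0.31
```

That low winning probability is also the signal the LLM-fallback threshold normally catches, which is exactly the check that pure-ML mode disables.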
```mermaid
flowchart TD
    Start([New Mailbox]) --> LoadModel["Load trained model<br/>11 categories known"]
    LoadModel --> Sample["Sample 10-20 emails<br/>Quick random sample<br/>~0.1 seconds"]
    Sample --> BuildPrompt["Build verification prompt<br/>Show trained categories<br/>Show sample emails"]
    BuildPrompt --> LLMCall["Single LLM call<br/>~20 seconds<br/>Task: Are these categories<br/>appropriate for this mailbox?"]
    LLMCall --> Parse["Parse response<br/>Expected: Yes/No + suggestions"]
    Parse --> Decision{Response?}
    Decision -->|"Good match"| Proceed[Proceed with ML-only]
    Decision -->|"Poor match"| Options{User choice}
    Options -->|Continue anyway| Proceed
    Options -->|Full calibration| Calibrate["Run full calibration<br/>Discover new categories"]
    Options -->|Abort| Stop[Stop - manual review]
    Proceed --> FastML["Fast ML Classification<br/>10k emails in 4 minutes"]
    style LLMCall fill:#ffd93d
    style FastML fill:#4ec9b0
    style Calibrate fill:#ff6b6b
```
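The unimplemented verification step reduces to two small pieces: build one prompt and parse a leading yes/no from the single LLM response. A hedged sketch (all names are hypothetical; the actual LLM call is left out):

```python
def build_verification_prompt(categories, sample_subjects):
    """One prompt carrying the trained labels plus a small email sample."""
    lines = ["Trained categories: " + ", ".join(categories), "Sample emails:"]
    lines += [f"- {subject}" for subject in sample_subjects]
    lines.append("Are these categories appropriate for this mailbox? "
                 "Answer Yes or No, then suggest changes.")
    return "\n".join(lines)

def parse_verdict(response: str) -> bool:
    """Treat anything that doesn't clearly start with 'yes' as a
    mismatch, so ambiguous answers fall through to the user-choice
    branch rather than silently proceeding."""
    return response.strip().lower().startswith("yes")
```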
| Feature | Status | Work Required | Time |
|---|---|---|---|
| Option A: Pure ML | ✅ WORKS NOW | None - just use --no-llm-fallback | 0 hours |
| --verify-categories flag | ❌ Needs implementation | Add CLI flag, sample logic, LLM prompt, response parsing | 2-3 hours |
| --quick-calibrate flag | ❌ Needs implementation | Modify calibration workflow, category mapping logic | 4-6 hours |
| Category adapter/mapper | ❌ Needs implementation | Map new categories to existing model categories using embeddings | 6-8 hours |
Step 1: Run on an Enron subset (same domain):

```shell
python -m src.cli run --source enron --limit 5000 --output test_enron/ --no-llm-fallback
```

Expected accuracy: ~78% (baseline)

Step 2: Run on a different Enron mailbox:

```shell
python -m src.cli run --source enron --limit 5000 --output test_enron2/ --no-llm-fallback
```

Expected accuracy: ~70-75% (slight drift)

Step 3: If you have personal Gmail/Outlook data, run there:

```shell
python -m src.cli run --source gmail --limit 5000 --output test_gmail/ --no-llm-fallback
```

Expected accuracy: ~50-65% (significant drift, but still useful)
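When the new mailbox has no labels to measure accuracy against, a cheap drift proxy is the mean prediction confidence. A sketch; the baseline and tolerance below are illustrative assumptions, not measured values:

```python
import statistics

def drift_warning(confidences, baseline_mean=0.82, tolerance=0.10):
    """Flag probable domain drift when mean confidence drops well below
    what the model showed on its training domain. Both thresholds are
    assumptions to tune against your own calibration runs."""
    mean = statistics.mean(confidences)
    return mean < baseline_mean - tolerance, round(mean, 3)

# Flat, low confidences trip the flag.
flagged, mean = drift_warning([0.9, 0.85, 0.4, 0.5, 0.45])
print(flagged, mean)  # True 0.62
```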
| Approach | LLM Calls | Time (10k emails) | Accuracy (Same domain) | Accuracy (Different domain) |
|---|---|---|---|---|
| Full Calibration | ~500 (discovery + labeling + classification fallback) | ~2.5 hours | 92-95% | 92-95% |
| Option A: Pure ML | 0 | ~4 minutes | 75-80% | 50-65% |
| Option B: Verify + ML | 1 (verification) | ~4.5 minutes | 75-80% | 50-65% |
| Option C: Quick Calibrate + ML | ~50 (quick discovery) | ~6 minutes | 80-85% | 65-75% |
| Current: ML + LLM Fallback | ~2100 (21% fallback rate) | ~2.5 hours | 92-95% | 85-90% |
You said: "map it all to our structured embedding and that's how it gets done"
This is exactly right.
Transfer learning works when the new mailbox's content overlaps the trained categories and the feature structure (384-dim embedding + TF-IDF) is identical.
Transfer learning fails when the domains diverge sharply, e.g. Enron business categories forced onto personal Gmail, where accuracy drops to ~50-65%.
I can implement a --verify-categories flag that samples 10-20 emails, makes a single LLM call to check whether the trained categories fit the new mailbox, and on a mismatch warns you or offers full calibration.
Time cost: +20 seconds (1 LLM call)
Value: an automated sanity check before bulk processing