--verify-categories Feature
✅ IMPLEMENTED AND READY TO USE
Feature: A single LLM call to verify that the trained model's categories fit a new mailbox
Cost: ~20 seconds and 1 extra LLM call
Value: A confidence check before bulk ML classification
Usage
Basic usage (with verification):
```bash
python -m src.cli run \
  --source enron \
  --limit 10000 \
  --output verified_test/ \
  --no-llm-fallback \
  --verify-categories
```
Custom verification sample size:
```bash
python -m src.cli run \
  --source enron \
  --limit 10000 \
  --output verified_test/ \
  --no-llm-fallback \
  --verify-categories \
  --verify-sample 30
```
Without verification (fastest):
```bash
python -m src.cli run \
  --source enron \
  --limit 10000 \
  --output fast_test/ \
  --no-llm-fallback
```
How It Works
```mermaid
flowchart TD
    Start(["Run with --verify-categories"]) --> LoadModel["Load trained model<br/>Categories: Updates, Work,<br/>Meetings, etc."]
    LoadModel --> FetchEmails["Fetch all emails<br/>10,000 total"]
    FetchEmails --> CheckFlag{"--verify-categories?"}
    CheckFlag -->|No| SkipVerify["Skip verification<br/>Proceed to classification"]
    CheckFlag -->|Yes| Sample["Sample random emails<br/>Default: 20 emails"]
    Sample --> BuildPrompt["Build verification prompt<br/>Show model categories<br/>Show sample emails"]
    BuildPrompt --> LLMCall["Single LLM call<br/>~20 seconds<br/>Task: Rate category fit"]
    LLMCall --> ParseResponse["Parse JSON response<br/>Extract verdict + confidence"]
    ParseResponse --> Verdict{Verdict?}
    Verdict -->|"GOOD_MATCH<br/>80%+ fit"| LogGood["Log: Categories appropriate<br/>Confidence: 0.8-1.0"]
    Verdict -->|"FAIR_MATCH<br/>60-80% fit"| LogFair["Log: Categories acceptable<br/>Confidence: 0.6-0.8"]
    Verdict -->|"POOR_MATCH<br/>&lt;60% fit"| LogPoor["Log WARNING<br/>Show suggested categories<br/>Recommend calibration<br/>Confidence: 0.0-0.6"]
    LogGood --> Proceed[Proceed with ML classification]
    LogFair --> Proceed
    LogPoor --> Proceed
    SkipVerify --> Proceed
    Proceed --> ClassifyAll["Classify all 10,000 emails<br/>Pure ML, no LLM fallback<br/>~4 minutes"]
    ClassifyAll --> Done[Results saved]

    style LLMCall fill:#ffd93d
    style LogGood fill:#4ec9b0
    style LogPoor fill:#ff6b6b
    style ClassifyAll fill:#4ec9b0
```
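The verification step itself is small: sample a handful of emails, build one prompt, make one LLM call, and parse the JSON reply. Below is a minimal sketch of that flow in Python, assuming a hypothetical `llm_client.complete()` helper and emails represented as dicts with `from`/`subject`/`body` keys; the real module uses its own interfaces.

```python
import json
import random

def verify_categories(model_categories, emails, llm_client, sample_size=20):
    """Sample emails, ask the LLM once whether the trained categories fit,
    and return the parsed verdict. Illustrative sketch only."""
    sample = random.sample(emails, min(sample_size, len(emails)))

    # Build the verification prompt from the categories and sampled emails.
    lines = [f"TRAINED MODEL CATEGORIES ({len(model_categories)} categories):"]
    lines += [f"- {c}" for c in model_categories]
    lines.append(f"\nSAMPLE EMAILS FROM NEW MAILBOX ({len(sample)} total):")
    for i, email in enumerate(sample, 1):
        lines.append(f"{i}. From: {email['from']}")
        lines.append(f"   Subject: {email['subject']}")
        lines.append(f"   Preview: {email['body'][:80]}...")
    lines.append("\nRespond with JSON: verdict, confidence, reasoning, "
                 "fit_percentage, suggested_categories, category_mapping")
    prompt = "\n".join(lines)

    # Single LLM call (~20 seconds); llm_client is an assumed interface.
    response = llm_client.complete(prompt)
    return json.loads(response)
```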
Example Outputs
Scenario 1: GOOD_MATCH (Enron → Enron)
```text
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: GOOD_MATCH (0.85)
Reasoning: The sample emails fit well into the trained categories. Most are work-related correspondence, meetings, and operational updates which align with the model.
Verification: GOOD_MATCH
Confidence: 85%
Model categories look appropriate for this mailbox
================================================================================
Starting classification...
```
Scenario 2: POOR_MATCH (Enron → Personal Gmail)
```text
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: POOR_MATCH (0.45)
Reasoning: Many sample emails are shopping confirmations, social media notifications, and personal correspondence which don't fit the business-focused categories well.
Verification: POOR_MATCH
Confidence: 45%
================================================================================
WARNING: Model categories may not fit this mailbox well
Suggested categories: ['Shopping', 'Social', 'Travel', 'Newsletters', 'Personal']
Consider running full calibration for better accuracy
Proceeding with existing model anyway...
================================================================================
Starting classification...
```
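Note that POOR_MATCH never aborts the run; it only warns and then proceeds with the existing model. A rough sketch of that branch, assuming the parsed result is a plain dict and standard `logging` is used (the project's actual logging code may differ):

```python
import logging

logger = logging.getLogger(__name__)

def report_verification(result):
    """Log the verdict and, for POOR_MATCH, the suggested categories.
    Classification proceeds regardless of the outcome."""
    verdict = result["verdict"]
    confidence = result["confidence"]
    logger.info("Verification: %s (confidence %.0f%%)", verdict, confidence * 100)

    if verdict == "POOR_MATCH":
        logger.warning("Model categories may not fit this mailbox well")
        logger.warning("Suggested categories: %s", result.get("suggested_categories", []))
        logger.warning("Consider running full calibration for better accuracy")
        logger.warning("Proceeding with existing model anyway...")
```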
LLM Prompt Structure
```text
You are evaluating whether pre-trained email categories fit a new mailbox.

TRAINED MODEL CATEGORIES (11 categories):
- Updates
- Work
- Meetings
- External
- Financial
- Test
- Administrative
- Operational
- Technical
- Urgent
- Requests

SAMPLE EMAILS FROM NEW MAILBOX (20 total, showing first 20):

1. From: phillip.allen@enron.com
   Subject: Re: AEC Volumes at OPAL
   Preview: Here are the volumes for today...

2. From: notifications@amazon.com
   Subject: Your order has shipped
   Preview: Your Amazon.com order #123-4567890...

[... 18 more emails ...]

TASK:
Evaluate if the trained categories are appropriate for this mailbox.

Consider:
1. Do the sample emails naturally fit into the trained categories?
2. Are there obvious email types that don't match any category?
3. Are the category names semantically appropriate?
4. Would a user find these categories helpful for THIS mailbox?

Respond with JSON:
{
  "verdict": "GOOD_MATCH" | "FAIR_MATCH" | "POOR_MATCH",
  "confidence": 0.0-1.0,
  "reasoning": "brief explanation",
  "fit_percentage": 0-100,
  "suggested_categories": ["cat1", "cat2", ...],
  "category_mapping": {"old_name": "better_name", ...}
}
```
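Because the reply is plain JSON, parsing reduces to loading the fields the pipeline relies on and clamping the confidence. The sketch below is illustrative only; the `VerificationResult` dataclass and the FAIR_MATCH fallback for malformed replies are assumptions, not the actual implementation.

```python
import json
from dataclasses import dataclass, field

@dataclass
class VerificationResult:
    verdict: str                      # GOOD_MATCH | FAIR_MATCH | POOR_MATCH
    confidence: float                 # 0.0-1.0
    reasoning: str = ""
    fit_percentage: int = 0
    suggested_categories: list = field(default_factory=list)
    category_mapping: dict = field(default_factory=dict)

def parse_verification(raw_response: str) -> VerificationResult:
    """Parse the LLM's JSON reply; fall back to a neutral verdict if malformed."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return VerificationResult(verdict="FAIR_MATCH", confidence=0.5,
                                  reasoning="Could not parse LLM response")
    return VerificationResult(
        verdict=data.get("verdict", "FAIR_MATCH"),
        confidence=max(0.0, min(1.0, float(data.get("confidence", 0.5)))),
        reasoning=data.get("reasoning", ""),
        fit_percentage=int(data.get("fit_percentage", 0)),
        suggested_categories=data.get("suggested_categories", []),
        category_mapping=data.get("category_mapping", {}),
    )
```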
Configuration
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--verify-categories` | Flag | False | Enable category verification |
| `--verify-sample` | Integer | 20 | Number of emails to sample |
| `--no-llm-fallback` | Flag | False | Disable LLM fallback during classification |
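If you need to wire equivalent flags into your own entry point, the definitions are straightforward. The sketch below uses `argparse`; the actual `src.cli` may be built on a different CLI framework, so treat this as an illustration of the options rather than the project's code.

```python
import argparse

parser = argparse.ArgumentParser(prog="src.cli")
subparsers = parser.add_subparsers(dest="command")

run = subparsers.add_parser("run", help="Fetch and classify emails")
run.add_argument("--source", required=True)
run.add_argument("--limit", type=int, default=None)
run.add_argument("--output", required=True)
run.add_argument("--no-llm-fallback", action="store_true",
                 help="Disable LLM fallback during classification")
run.add_argument("--verify-categories", action="store_true",
                 help="Enable category verification (one extra LLM call)")
run.add_argument("--verify-sample", type=int, default=20,
                 help="Number of emails to sample for verification")

args = parser.parse_args()
```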
When Verification Runs
- ✅ Only if the `--verify-categories` flag is set
- ✅ Only if a trained model exists (not mock)
- ✅ After emails are fetched, before calibration/classification
- ❌ Skipped if using the mock model
- ❌ Skipped if the model doesn't exist (calibration will run anyway)
Timing Impact
| Configuration | Time (10k emails) | LLM Calls |
|---------------|-------------------|-----------|
| ML-only (no flags) | ~4 minutes | 0 |
| ML-only + `--verify-categories` | ~4.3 minutes | 1 (verification) |
| Full calibration (no model) | ~25 minutes | ~500 |
| ML + LLM fallback (21%) | ~2.5 hours | ~2100 |
Decision Tree
```mermaid
flowchart TD
    Start([Need to classify emails]) --> HaveModel{"Trained model<br/>exists?"}
    HaveModel -->|No| MustCalibrate["Must run calibration<br/>~20 minutes<br/>~500 LLM calls"]
    HaveModel -->|Yes| SameDomain{"Same domain as<br/>training data?"}
    SameDomain -->|Yes, confident| FastML["Pure ML<br/>4 minutes<br/>0 LLM calls"]
    SameDomain -->|Unsure| VerifyML["ML + Verification<br/>4.3 minutes<br/>1 LLM call"]
    SameDomain -->|No, different| Options{Accuracy needs?}
    Options -->|High accuracy required| MustCalibrate
    Options -->|Speed more important| VerifyML
    Options -->|Experimental| FastML
    MustCalibrate --> Done[Classification complete]
    FastML --> Done
    VerifyML --> Done

    style FastML fill:#4ec9b0
    style VerifyML fill:#ffd93d
    style MustCalibrate fill:#ff6b6b
```
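The same decision tree can be collapsed into a small helper that returns the recommended flags. This is purely illustrative; the function name, inputs, and the omission of the "experimental" branch are simplifications.

```python
def recommend_flags(have_model: bool, same_domain: str, need_high_accuracy: bool) -> list:
    """Map the decision tree above to CLI flags.
    same_domain is one of: "yes", "unsure", "no"."""
    if not have_model:
        return []                                 # calibration will run automatically
    if same_domain == "yes":
        return ["--no-llm-fallback"]              # pure ML, fastest
    if same_domain == "unsure" or not need_high_accuracy:
        return ["--no-llm-fallback", "--verify-categories"]
    return []                                     # different domain + high accuracy: recalibrate
```

For example, `recommend_flags(True, "unsure", False)` returns `["--no-llm-fallback", "--verify-categories"]`, matching the "ML + Verification" path above.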
Quick Start
Test with verification on same domain (Enron → Enron):
```bash
python -m src.cli run \
  --source enron \
  --limit 1000 \
  --output verify_test_same/ \
  --no-llm-fallback \
  --verify-categories
```
Expected: GOOD_MATCH (0.80-0.95)
Time: ~30 seconds
Test without verification for speed comparison:
```bash
python -m src.cli run \
  --source enron \
  --limit 1000 \
  --output no_verify_test/ \
  --no-llm-fallback
```
Expected: Same accuracy, 20 seconds faster
Time: ~10 seconds