--verify-categories Feature

✅ IMPLEMENTED AND READY TO USE

Feature: A single LLM call that checks whether the trained model's categories fit a new mailbox

Cost: +20 seconds, 1 LLM call

Value: Confidence check before bulk ML classification

Usage

Basic usage (with verification):

```bash
python -m src.cli run \
  --source enron \
  --limit 10000 \
  --output verified_test/ \
  --no-llm-fallback \
  --verify-categories
```

Custom verification sample size:

```bash
python -m src.cli run \
  --source enron \
  --limit 10000 \
  --output verified_test/ \
  --no-llm-fallback \
  --verify-categories \
  --verify-sample 30
```

Without verification (fastest):

```bash
python -m src.cli run \
  --source enron \
  --limit 10000 \
  --output fast_test/ \
  --no-llm-fallback
```

How It Works

```mermaid
flowchart TD
    Start(["Run with --verify-categories"]) --> LoadModel["Load trained model<br/>Categories: Updates, Work,<br/>Meetings, etc."]
    LoadModel --> FetchEmails["Fetch all emails<br/>10,000 total"]
    FetchEmails --> CheckFlag{"--verify-categories?"}
    CheckFlag -->|No| SkipVerify["Skip verification<br/>Proceed to classification"]
    CheckFlag -->|Yes| Sample["Sample random emails<br/>Default: 20 emails"]
    Sample --> BuildPrompt["Build verification prompt<br/>Show model categories<br/>Show sample emails"]
    BuildPrompt --> LLMCall["Single LLM call<br/>~20 seconds<br/>Task: Rate category fit"]
    LLMCall --> ParseResponse["Parse JSON response<br/>Extract verdict + confidence"]
    ParseResponse --> Verdict{Verdict?}
    Verdict -->|"GOOD_MATCH<br/>80%+ fit"| LogGood["Log: Categories appropriate<br/>Confidence: 0.8-1.0"]
    Verdict -->|"FAIR_MATCH<br/>60-80% fit"| LogFair["Log: Categories acceptable<br/>Confidence: 0.6-0.8"]
    Verdict -->|"POOR_MATCH<br/>&lt;60% fit"| LogPoor["Log WARNING<br/>Show suggested categories<br/>Recommend calibration<br/>Confidence: 0.0-0.6"]
    LogGood --> Proceed[Proceed with ML classification]
    LogFair --> Proceed
    LogPoor --> Proceed
    SkipVerify --> Proceed
    Proceed --> ClassifyAll["Classify all 10,000 emails<br/>Pure ML, no LLM fallback<br/>~4 minutes"]
    ClassifyAll --> Done[Results saved]
    style LLMCall fill:#ffd93d
    style LogGood fill:#4ec9b0
    style LogPoor fill:#ff6b6b
    style ClassifyAll fill:#4ec9b0
```
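
In code, the verification step is small. Below is a minimal sketch of the flow above, assuming a generic `llm_client.complete()` interface; the function name, email dict keys, and client API are hypothetical, not the project's actual code:

```python
import json
import random

def verify_categories(model_categories, emails, llm_client, sample_size=20):
    """One LLM call: does the trained category set fit these emails?

    Hypothetical sketch -- names and the llm_client interface are assumptions.
    """
    # Sample a handful of emails instead of sending the whole mailbox
    sample = random.sample(emails, min(sample_size, len(emails)))

    # Minimal stand-in prompt; the full structure is shown under
    # "LLM Prompt Structure" below
    prompt = (
        f"Categories: {', '.join(model_categories)}\n"
        + "\n".join(f"- {e['subject']}" for e in sample)
        + '\nDo the categories fit? Respond with JSON: '
          '{"verdict": "GOOD_MATCH"|"FAIR_MATCH"|"POOR_MATCH", '
          '"confidence": 0.0, "suggested_categories": []}'
    )

    result = json.loads(llm_client.complete(prompt))  # the single call, ~20s
    verdict = result.get("verdict", "POOR_MATCH")
    confidence = float(result.get("confidence", 0.0))

    if verdict == "POOR_MATCH":
        print("WARNING: Model categories may not fit this mailbox well")
        print(f"Suggested categories: {result.get('suggested_categories', [])}")

    # Verification never blocks: classification proceeds on any verdict
    return verdict, confidence
```

Note the non-blocking design: even a POOR_MATCH only logs a warning, which matches the "Proceeding with existing model anyway" behavior in the sample output below.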

Example Outputs

Scenario 1: GOOD_MATCH (Enron → Enron)

```
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: GOOD_MATCH (0.85)
Reasoning: The sample emails fit well into the trained categories. Most are work-related correspondence, meetings, and operational updates which align with the model.

Verification: GOOD_MATCH
Confidence: 85%
Model categories look appropriate for this mailbox
================================================================================

Starting classification...
```

Scenario 2: POOR_MATCH (Enron → Personal Gmail)

```
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: POOR_MATCH (0.45)
Reasoning: Many sample emails are shopping confirmations, social media notifications, and personal correspondence which don't fit the business-focused categories well.

Verification: POOR_MATCH
Confidence: 45%
================================================================================
WARNING: Model categories may not fit this mailbox well
Suggested categories: ['Shopping', 'Social', 'Travel', 'Newsletters', 'Personal']
Consider running full calibration for better accuracy
Proceeding with existing model anyway...
================================================================================

Starting classification...
```

LLM Prompt Structure

```
You are evaluating whether pre-trained email categories fit a new mailbox.

TRAINED MODEL CATEGORIES (11 categories):
- Updates
- Work
- Meetings
- External
- Financial
- Test
- Administrative
- Operational
- Technical
- Urgent
- Requests

SAMPLE EMAILS FROM NEW MAILBOX (20 total, showing first 20):

1. From: phillip.allen@enron.com
   Subject: Re: AEC Volumes at OPAL
   Preview: Here are the volumes for today...

2. From: notifications@amazon.com
   Subject: Your order has shipped
   Preview: Your Amazon.com order #123-4567890...

[... 18 more emails ...]

TASK: Evaluate if the trained categories are appropriate for this mailbox.

Consider:
1. Do the sample emails naturally fit into the trained categories?
2. Are there obvious email types that don't match any category?
3. Are the category names semantically appropriate?
4. Would a user find these categories helpful for THIS mailbox?

Respond with JSON:
{
  "verdict": "GOOD_MATCH" | "FAIR_MATCH" | "POOR_MATCH",
  "confidence": 0.0-1.0,
  "reasoning": "brief explanation",
  "fit_percentage": 0-100,
  "suggested_categories": ["cat1", "cat2", ...],
  "category_mapping": {"old_name": "better_name", ...}
}
```
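
A sketch of how a prompt like this could be assembled, assuming the sampled emails are dicts with `from`, `subject`, and `body` keys (assumed field names, not the project's actual schema):

```python
def build_verification_prompt(categories, sample_emails):
    """Assemble the verification prompt shown above (sketch; field names assumed)."""
    lines = ["You are evaluating whether pre-trained email categories fit a new mailbox.", ""]

    lines.append(f"TRAINED MODEL CATEGORIES ({len(categories)} categories):")
    lines += [f"- {c}" for c in categories]

    lines += ["", f"SAMPLE EMAILS FROM NEW MAILBOX ({len(sample_emails)} total):", ""]
    for i, email in enumerate(sample_emails, 1):
        lines.append(f"{i}. From: {email['from']}")
        lines.append(f"   Subject: {email['subject']}")
        lines.append(f"   Preview: {email['body'][:80]}...")  # truncate long bodies

    lines += [
        "",
        "TASK: Evaluate if the trained categories are appropriate for this mailbox.",
        'Respond with JSON: {"verdict": "GOOD_MATCH" | "FAIR_MATCH" | "POOR_MATCH",',
        '"confidence": 0.0-1.0, "reasoning": "...", "fit_percentage": 0-100,',
        '"suggested_categories": [...], "category_mapping": {...}}',
    ]
    return "\n".join(lines)
```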

Configuration

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--verify-categories` | Flag | False | Enable category verification |
| `--verify-sample` | Integer | 20 | Number of emails to sample |
| `--no-llm-fallback` | Flag | False | Disable LLM fallback during classification |
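
If the CLI is built with Click (an assumption; the flag names and defaults mirror the table, everything else is illustrative), the options could be declared like this:

```python
import click

@click.command()
@click.option("--verify-categories", is_flag=True, default=False,
              help="Enable category verification")
@click.option("--verify-sample", type=int, default=20,
              help="Number of emails to sample")
@click.option("--no-llm-fallback", is_flag=True, default=False,
              help="Disable LLM fallback during classification")
def run(verify_categories, verify_sample, no_llm_fallback):
    """Placeholder command body for illustration."""
    click.echo(f"verify={verify_categories} sample={verify_sample} "
               f"no_fallback={no_llm_fallback}")

if __name__ == "__main__":
    run()
```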

When Verification Runs

Verification runs once per invocation: after the trained model is loaded and the emails are fetched, and before bulk classification begins (see the flowchart above). Whatever the verdict, classification proceeds; a POOR_MATCH only logs a warning with suggested categories.
Timing Impact

| Configuration | Time (10k emails) | LLM Calls |
|---------------|-------------------|-----------|
| ML-only (no flags) | ~4 minutes | 0 |
| ML-only + `--verify-categories` | ~4.3 minutes | 1 (verification) |
| Full calibration (no model) | ~25 minutes | ~500 |
| ML + LLM fallback (21%) | ~2.5 hours | ~2100 |
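
The fallback row is simple arithmetic: 21% of 10,000 emails escalate to the LLM, and at an assumed ~4 seconds per call that works out to roughly 2.3 hours:

```python
# Back-of-envelope check on the table above (4 s/call is an assumption)
emails = 10_000
fallback_rate = 0.21
seconds_per_call = 4

calls = int(emails * fallback_rate)       # 2100 LLM calls
hours = calls * seconds_per_call / 3600   # ~2.3 hours
print(calls, round(hours, 1))             # -> 2100 2.3
```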

Decision Tree

```mermaid
flowchart TD
    Start([Need to classify emails]) --> HaveModel{"Trained model<br/>exists?"}
    HaveModel -->|No| MustCalibrate["Must run calibration<br/>~20 minutes<br/>~500 LLM calls"]
    HaveModel -->|Yes| SameDomain{"Same domain as<br/>training data?"}
    SameDomain -->|"Yes, confident"| FastML["Pure ML<br/>4 minutes<br/>0 LLM calls"]
    SameDomain -->|Unsure| VerifyML["ML + Verification<br/>4.3 minutes<br/>1 LLM call"]
    SameDomain -->|"No, different"| Options{Accuracy needs?}
    Options -->|High accuracy required| MustCalibrate
    Options -->|Speed more important| VerifyML
    Options -->|Experimental| FastML
    MustCalibrate --> Done[Classification complete]
    FastML --> Done
    VerifyML --> Done
    style FastML fill:#4ec9b0
    style VerifyML fill:#ffd93d
    style MustCalibrate fill:#ff6b6b
```
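
The same tree as a tiny helper function, purely illustrative (the names and return values are hypothetical, not project code):

```python
def choose_strategy(has_model: bool, same_domain: str, priority: str = "speed") -> str:
    """Encode the decision tree above.

    same_domain: "yes" | "unsure" | "no"
    priority (used only when same_domain == "no"):
        "accuracy" | "speed" | "experimental"
    """
    if not has_model:
        return "calibrate"      # ~20 min, ~500 LLM calls
    if same_domain == "yes":
        return "pure-ml"        # ~4 min, 0 LLM calls
    if same_domain == "unsure":
        return "ml+verify"      # ~4.3 min, 1 LLM call
    # Different domain: trade accuracy against speed
    return {"accuracy": "calibrate",
            "speed": "ml+verify",
            "experimental": "pure-ml"}[priority]

assert choose_strategy(True, "unsure") == "ml+verify"
```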

Quick Start

Test with verification on same domain (Enron → Enron):

```bash
python -m src.cli run \
  --source enron \
  --limit 1000 \
  --output verify_test_same/ \
  --no-llm-fallback \
  --verify-categories
```

Expected: GOOD_MATCH (0.80-0.95)
Time: ~30 seconds

Test without verification for speed comparison:

```bash
python -m src.cli run \
  --source enron \
  --limit 1000 \
  --output no_verify_test/ \
  --no-llm-fallback
```

Expected: same accuracy, ~20 seconds faster
Time: ~10 seconds