--verify-categories Feature
✅ IMPLEMENTED AND READY TO USE
Feature: A single LLM call to verify that the trained model's categories fit a new mailbox
Cost: ~20 seconds and 1 extra LLM call
Value: A confidence check before bulk ML classification
Usage
Basic usage (with verification):
```bash
python -m src.cli run \
  --source enron \
  --limit 10000 \
  --output verified_test/ \
  --no-llm-fallback \
  --verify-categories
```
Custom verification sample size:
```bash
python -m src.cli run \
  --source enron \
  --limit 10000 \
  --output verified_test/ \
  --no-llm-fallback \
  --verify-categories \
  --verify-sample 30
```
Without verification (fastest):
```bash
python -m src.cli run \
  --source enron \
  --limit 10000 \
  --output fast_test/ \
  --no-llm-fallback
```
How It Works
```mermaid
flowchart TD
    Start(["Run with --verify-categories"]) --> LoadModel["Load trained model<br/>Categories: Updates, Work,<br/>Meetings, etc."]
    LoadModel --> FetchEmails["Fetch all emails<br/>10,000 total"]
    FetchEmails --> CheckFlag{"--verify-categories?"}
    CheckFlag -->|No| SkipVerify["Skip verification<br/>Proceed to classification"]
    CheckFlag -->|Yes| Sample["Sample random emails<br/>Default: 20 emails"]
    Sample --> BuildPrompt["Build verification prompt<br/>Show model categories<br/>Show sample emails"]
    BuildPrompt --> LLMCall["Single LLM call<br/>~20 seconds<br/>Task: Rate category fit"]
    LLMCall --> ParseResponse["Parse JSON response<br/>Extract verdict + confidence"]
    ParseResponse --> Verdict{Verdict?}
    Verdict -->|"GOOD_MATCH<br/>80%+ fit"| LogGood["Log: Categories appropriate<br/>Confidence: 0.8-1.0"]
    Verdict -->|"FAIR_MATCH<br/>60-80% fit"| LogFair["Log: Categories acceptable<br/>Confidence: 0.6-0.8"]
    Verdict -->|"POOR_MATCH<br/>&lt;60% fit"| LogPoor["Log WARNING<br/>Show suggested categories<br/>Recommend calibration<br/>Confidence: 0.0-0.6"]
    LogGood --> Proceed[Proceed with ML classification]
    LogFair --> Proceed
    LogPoor --> Proceed
    SkipVerify --> Proceed
    Proceed --> ClassifyAll["Classify all 10,000 emails<br/>Pure ML, no LLM fallback<br/>~4 minutes"]
    ClassifyAll --> Done[Results saved]

    style LLMCall fill:#ffd93d
    style LogGood fill:#4ec9b0
    style LogPoor fill:#ff6b6b
    style ClassifyAll fill:#4ec9b0
```
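The verification step itself is small: sample a handful of emails, build one prompt, make one LLM call, and parse the JSON reply. Below is a minimal sketch of that flow in Python, assuming a hypothetical `llm_client.complete()` helper and emails represented as dicts with `from`/`subject`/`body` keys; the real module uses its own interfaces.

```python
import json
import random

def verify_categories(model_categories, emails, llm_client, sample_size=20):
    """Sample emails, ask the LLM once whether the trained categories fit,
    and return the parsed verdict. Illustrative sketch only."""
    sample = random.sample(emails, min(sample_size, len(emails)))

    # Build the verification prompt from the categories and sampled emails.
    lines = [f"TRAINED MODEL CATEGORIES ({len(model_categories)} categories):"]
    lines += [f"- {c}" for c in model_categories]
    lines.append(f"\nSAMPLE EMAILS FROM NEW MAILBOX ({len(sample)} total):")
    for i, email in enumerate(sample, 1):
        lines.append(f"{i}. From: {email['from']}")
        lines.append(f"   Subject: {email['subject']}")
        lines.append(f"   Preview: {email['body'][:80]}...")
    lines.append("\nRespond with JSON: verdict, confidence, reasoning, "
                 "fit_percentage, suggested_categories, category_mapping")
    prompt = "\n".join(lines)

    # Single LLM call (~20 seconds); llm_client is an assumed interface.
    response = llm_client.complete(prompt)
    return json.loads(response)
```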
Example Outputs
Scenario 1: GOOD_MATCH (Enron → Enron)
```text
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: GOOD_MATCH (0.85)
Reasoning: The sample emails fit well into the trained categories. Most are work-related correspondence, meetings, and operational updates which align with the model.
Verification: GOOD_MATCH
Confidence: 85%
Model categories look appropriate for this mailbox
================================================================================
Starting classification...
```
Scenario 2: POOR_MATCH (Enron → Personal Gmail)
```text
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: POOR_MATCH (0.45)
Reasoning: Many sample emails are shopping confirmations, social media notifications, and personal correspondence which don't fit the business-focused categories well.
Verification: POOR_MATCH
Confidence: 45%
================================================================================
WARNING: Model categories may not fit this mailbox well
Suggested categories: ['Shopping', 'Social', 'Travel', 'Newsletters', 'Personal']
Consider running full calibration for better accuracy
Proceeding with existing model anyway...
================================================================================
Starting classification...
```
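Note that POOR_MATCH never aborts the run; it only warns and then proceeds with the existing model. A rough sketch of that branch, assuming the parsed result is a plain dict and standard `logging` is used (the project's actual logging code may differ):

```python
import logging

logger = logging.getLogger(__name__)

def report_verification(result):
    """Log the verdict and, for POOR_MATCH, the suggested categories.
    Classification proceeds regardless of the outcome."""
    verdict = result["verdict"]
    confidence = result["confidence"]
    logger.info("Verification: %s (confidence %.0f%%)", verdict, confidence * 100)

    if verdict == "POOR_MATCH":
        logger.warning("Model categories may not fit this mailbox well")
        logger.warning("Suggested categories: %s", result.get("suggested_categories", []))
        logger.warning("Consider running full calibration for better accuracy")
        logger.warning("Proceeding with existing model anyway...")
```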
LLM Prompt Structure
```text
You are evaluating whether pre-trained email categories fit a new mailbox.

TRAINED MODEL CATEGORIES (11 categories):
- Updates
- Work
- Meetings
- External
- Financial
- Test
- Administrative
- Operational
- Technical
- Urgent
- Requests

SAMPLE EMAILS FROM NEW MAILBOX (20 total, showing first 20):

1. From: phillip.allen@enron.com
   Subject: Re: AEC Volumes at OPAL
   Preview: Here are the volumes for today...

2. From: notifications@amazon.com
   Subject: Your order has shipped
   Preview: Your Amazon.com order #123-4567890...

[... 18 more emails ...]

TASK:
Evaluate if the trained categories are appropriate for this mailbox.

Consider:
1. Do the sample emails naturally fit into the trained categories?
2. Are there obvious email types that don't match any category?
3. Are the category names semantically appropriate?
4. Would a user find these categories helpful for THIS mailbox?

Respond with JSON:
{
  "verdict": "GOOD_MATCH" | "FAIR_MATCH" | "POOR_MATCH",
  "confidence": 0.0-1.0,
  "reasoning": "brief explanation",
  "fit_percentage": 0-100,
  "suggested_categories": ["cat1", "cat2", ...],
  "category_mapping": {"old_name": "better_name", ...}
}
```
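Because the reply is plain JSON, parsing reduces to loading the fields the pipeline relies on and clamping the confidence. The sketch below is illustrative only; the `VerificationResult` dataclass and the FAIR_MATCH fallback for malformed replies are assumptions, not the actual implementation.

```python
import json
from dataclasses import dataclass, field

@dataclass
class VerificationResult:
    verdict: str                      # GOOD_MATCH | FAIR_MATCH | POOR_MATCH
    confidence: float                 # 0.0-1.0
    reasoning: str = ""
    fit_percentage: int = 0
    suggested_categories: list = field(default_factory=list)
    category_mapping: dict = field(default_factory=dict)

def parse_verification(raw_response: str) -> VerificationResult:
    """Parse the LLM's JSON reply; fall back to a neutral verdict if malformed."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return VerificationResult(verdict="FAIR_MATCH", confidence=0.5,
                                  reasoning="Could not parse LLM response")
    return VerificationResult(
        verdict=data.get("verdict", "FAIR_MATCH"),
        confidence=max(0.0, min(1.0, float(data.get("confidence", 0.5)))),
        reasoning=data.get("reasoning", ""),
        fit_percentage=int(data.get("fit_percentage", 0)),
        suggested_categories=data.get("suggested_categories", []),
        category_mapping=data.get("category_mapping", {}),
    )
```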
Configuration
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--verify-categories` | Flag | False | Enable category verification |
| `--verify-sample` | Integer | 20 | Number of emails to sample |
| `--no-llm-fallback` | Flag | False | Disable LLM fallback during classification |
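If you need to wire equivalent flags into your own entry point, the definitions are straightforward. The sketch below uses `argparse`; the actual `src.cli` may be built on a different CLI framework, so treat this as an illustration of the options rather than the project's code.

```python
import argparse

parser = argparse.ArgumentParser(prog="src.cli")
subparsers = parser.add_subparsers(dest="command")

run = subparsers.add_parser("run", help="Fetch and classify emails")
run.add_argument("--source", required=True)
run.add_argument("--limit", type=int, default=None)
run.add_argument("--output", required=True)
run.add_argument("--no-llm-fallback", action="store_true",
                 help="Disable LLM fallback during classification")
run.add_argument("--verify-categories", action="store_true",
                 help="Enable category verification (one extra LLM call)")
run.add_argument("--verify-sample", type=int, default=20,
                 help="Number of emails to sample for verification")

args = parser.parse_args()
```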
When Verification Runs
- ✅ Only if the `--verify-categories` flag is set
- ✅ Only if a trained model exists (not mock)
- ✅ After emails are fetched, before calibration/classification
- ❌ Skipped if using the mock model
- ❌ Skipped if the model doesn't exist (calibration will run anyway)
Timing Impact
| Configuration | Time (10k emails) | LLM Calls |
|---------------|-------------------|-----------|
| ML-only (no flags) | ~4 minutes | 0 |
| ML-only + `--verify-categories` | ~4.3 minutes | 1 (verification) |
| Full calibration (no model) | ~25 minutes | ~500 |
| ML + LLM fallback (21%) | ~2.5 hours | ~2100 |
Decision Tree
```mermaid
flowchart TD
    Start([Need to classify emails]) --> HaveModel{"Trained model<br/>exists?"}
    HaveModel -->|No| MustCalibrate["Must run calibration<br/>~20 minutes<br/>~500 LLM calls"]
    HaveModel -->|Yes| SameDomain{"Same domain as<br/>training data?"}
    SameDomain -->|Yes, confident| FastML["Pure ML<br/>4 minutes<br/>0 LLM calls"]
    SameDomain -->|Unsure| VerifyML["ML + Verification<br/>4.3 minutes<br/>1 LLM call"]
    SameDomain -->|No, different| Options{Accuracy needs?}
    Options -->|High accuracy required| MustCalibrate
    Options -->|Speed more important| VerifyML
    Options -->|Experimental| FastML
    MustCalibrate --> Done[Classification complete]
    FastML --> Done
    VerifyML --> Done

    style FastML fill:#4ec9b0
    style VerifyML fill:#ffd93d
    style MustCalibrate fill:#ff6b6b
```
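The same decision tree can be collapsed into a small helper that returns the recommended flags. This is purely illustrative; the function name, inputs, and the omission of the "experimental" branch are simplifications.

```python
def recommend_flags(have_model: bool, same_domain: str, need_high_accuracy: bool) -> list:
    """Map the decision tree above to CLI flags.
    same_domain is one of: "yes", "unsure", "no"."""
    if not have_model:
        return []                                 # calibration will run automatically
    if same_domain == "yes":
        return ["--no-llm-fallback"]              # pure ML, fastest
    if same_domain == "unsure" or not need_high_accuracy:
        return ["--no-llm-fallback", "--verify-categories"]
    return []                                     # different domain + high accuracy: recalibrate
```

For example, `recommend_flags(True, "unsure", False)` returns `["--no-llm-fallback", "--verify-categories"]`, matching the "ML + Verification" path above.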
Quick Start
Test with verification on same domain (Enron → Enron):
```bash
python -m src.cli run \
  --source enron \
  --limit 1000 \
  --output verify_test_same/ \
  --no-llm-fallback \
  --verify-categories
```
Expected: GOOD_MATCH (0.80-0.95)
Time: ~30 seconds
Test without verification for speed comparison:
```bash
python -m src.cli run \
  --source enron \
  --limit 1000 \
  --output no_verify_test/ \
  --no-llm-fallback
```
Expected: Same accuracy, 20 seconds faster
Time: ~10 seconds