<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Category Verification Feature</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.code-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
font-family: 'Courier New', monospace;
white-space: pre-wrap;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.success {
background: #002a00;
border-left: 4px solid #4ec9b0;
padding: 15px;
margin: 10px 0;
}
</style>
</head>
<body>
<h1>--verify-categories Feature</h1>
<div class="success">
<h2>✅ IMPLEMENTED AND READY TO USE</h2>
<p><strong>Feature:</strong> A single LLM call that checks whether the trained model's categories fit a new mailbox</p>
<p><strong>Cost:</strong> +20 seconds, 1 LLM call</p>
<p><strong>Value:</strong> Confidence check before bulk ML classification</p>
</div>
<h2>Usage</h2>
<div class="code-section">
<strong>Basic usage (with verification):</strong>
python -m src.cli run \
--source enron \
--limit 10000 \
--output verified_test/ \
--no-llm-fallback \
--verify-categories
<strong>Custom verification sample size:</strong>
python -m src.cli run \
--source enron \
--limit 10000 \
--output verified_test/ \
--no-llm-fallback \
--verify-categories \
--verify-sample 30
<strong>Without verification (fastest):</strong>
python -m src.cli run \
--source enron \
--limit 10000 \
--output fast_test/ \
--no-llm-fallback
</div>
<h2>How It Works</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Run with --verify-categories]) --> LoadModel[Load trained model<br/>Categories: Updates, Work,<br/>Meetings, etc.]
LoadModel --> FetchEmails[Fetch all emails<br/>10,000 total]
FetchEmails --> CheckFlag{--verify-categories?}
CheckFlag -->|No| SkipVerify[Skip verification<br/>Proceed to classification]
CheckFlag -->|Yes| Sample[Sample random emails<br/>Default: 20 emails]
Sample --> BuildPrompt[Build verification prompt<br/>Show model categories<br/>Show sample emails]
BuildPrompt --> LLMCall[Single LLM call<br/>~20 seconds<br/>Task: Rate category fit]
LLMCall --> ParseResponse[Parse JSON response<br/>Extract verdict + confidence]
ParseResponse --> Verdict{Verdict?}
Verdict -->|GOOD_MATCH<br/>80%+ fit| LogGood[Log: Categories appropriate<br/>Confidence: 0.8-1.0]
Verdict -->|FAIR_MATCH<br/>60-80% fit| LogFair[Log: Categories acceptable<br/>Confidence: 0.6-0.8]
Verdict -->|POOR_MATCH<br/><60% fit| LogPoor[Log WARNING<br/>Show suggested categories<br/>Recommend calibration<br/>Confidence: 0.0-0.6]
LogGood --> Proceed[Proceed with ML classification]
LogFair --> Proceed
LogPoor --> Proceed
SkipVerify --> Proceed
Proceed --> ClassifyAll[Classify all 10,000 emails<br/>Pure ML, no LLM fallback<br/>~4 minutes]
ClassifyAll --> Done[Results saved]
style LLMCall fill:#ffd93d
style LogGood fill:#4ec9b0
style LogPoor fill:#ff6b6b
style ClassifyAll fill:#4ec9b0
</pre>
</div>
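<p>In code, the flow above amounts to one helper that samples the fetched mailbox, makes a single LLM call, and parses the JSON verdict. The sketch below is illustrative only: the <code>llm_client.complete()</code> method, the email objects, and the fallback behaviour are assumptions, not the project's actual internals (<code>build_verification_prompt</code> is sketched under "LLM Prompt Structure" below).</p>
<div class="code-section">
<pre>
import json
import random

def verify_model_categories(emails, categories, llm_client, sample_size=20):
    """Illustrative sketch of the verification step (names are assumptions)."""
    # 1. Sample a handful of emails from the freshly fetched mailbox.
    sample = random.sample(emails, min(sample_size, len(emails)))

    # 2. Build the verification prompt (see "LLM Prompt Structure" below).
    prompt = build_verification_prompt(categories, sample)

    # 3. Single LLM call (~20 seconds), expected to return a JSON string.
    raw = llm_client.complete(prompt)

    # 4. Parse verdict + confidence; fall back to a neutral result on bad JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"verdict": "FAIR_MATCH", "confidence": 0.5,
                "reasoning": "Could not parse LLM response"}
</pre>
</div>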
<h2>Example Outputs</h2>
<h3>Scenario 1: GOOD_MATCH (Enron → Enron)</h3>
<div class="code-section">
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: GOOD_MATCH (0.85)
Reasoning: The sample emails fit well into the trained categories. Most are work-related correspondence, meetings, and operational updates which align with the model.
Verification: GOOD_MATCH
Confidence: 85%
Model categories look appropriate for this mailbox
================================================================================
Starting classification...
</div>
<h3>Scenario 2: POOR_MATCH (Enron → Personal Gmail)</h3>
<div class="code-section">
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: POOR_MATCH (0.45)
Reasoning: Many sample emails are shopping confirmations, social media notifications, and personal correspondence which don't fit the business-focused categories well.
Verification: POOR_MATCH
Confidence: 45%
================================================================================
WARNING: Model categories may not fit this mailbox well
Suggested categories: ['Shopping', 'Social', 'Travel', 'Newsletters', 'Personal']
Consider running full calibration for better accuracy
Proceeding with existing model anyway...
================================================================================
Starting classification...
</div>
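<p>The two transcripts above differ only in how the parsed verdict is reported; in both cases classification proceeds. Below is a minimal sketch of that reporting step, using the confidence bands from the flowchart (0.8 and above, 0.6 to 0.8, below 0.6); the logger name and result keys are assumptions:</p>
<div class="code-section">
<pre>
import logging

logger = logging.getLogger("email_sorter.verify")  # assumed logger name

def report_verification(result):
    """Log the verdict and any suggestions; never abort the run."""
    verdict = result.get("verdict", "FAIR_MATCH")
    confidence = result.get("confidence", 0.5)
    logger.info("Verification: %s (confidence %.0f%%)", verdict, confidence * 100)

    if verdict == "POOR_MATCH":
        logger.warning("Model categories may not fit this mailbox well")
        logger.warning("Suggested categories: %s", result.get("suggested_categories", []))
        logger.warning("Consider running full calibration for better accuracy")
        logger.info("Proceeding with existing model anyway...")
    elif verdict == "GOOD_MATCH":
        logger.info("Model categories look appropriate for this mailbox")
</pre>
</div>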
<h2>LLM Prompt Structure</h2>
<div class="code-section">
You are evaluating whether pre-trained email categories fit a new mailbox.
TRAINED MODEL CATEGORIES (11 categories):
- Updates
- Work
- Meetings
- External
- Financial
- Test
- Administrative
- Operational
- Technical
- Urgent
- Requests
SAMPLE EMAILS FROM NEW MAILBOX (20 total, showing first 20):
1. From: phillip.allen@enron.com
Subject: Re: AEC Volumes at OPAL
Preview: Here are the volumes for today...
2. From: notifications@amazon.com
Subject: Your order has shipped
Preview: Your Amazon.com order #123-4567890...
[... 18 more emails ...]
TASK:
Evaluate if the trained categories are appropriate for this mailbox.
Consider:
1. Do the sample emails naturally fit into the trained categories?
2. Are there obvious email types that don't match any category?
3. Are the category names semantically appropriate?
4. Would a user find these categories helpful for THIS mailbox?
Respond with JSON:
{
"verdict": "GOOD_MATCH" | "FAIR_MATCH" | "POOR_MATCH",
"confidence": 0.0-1.0,
"reasoning": "brief explanation",
"fit_percentage": 0-100,
"suggested_categories": ["cat1", "cat2", ...],
"category_mapping": {"old_name": "better_name", ...}
}
</div>
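<p>A sketch of how a prompt like the one above could be assembled from the trained categories and the sampled emails. The <code>sender</code>, <code>subject</code>, and <code>body</code> attributes are assumptions about the project's email objects, and the task and JSON instructions are abbreviated:</p>
<div class="code-section">
<pre>
def build_verification_prompt(categories, sample):
    """Assemble the category-verification prompt (abbreviated sketch)."""
    lines = [
        "You are evaluating whether pre-trained email categories fit a new mailbox.",
        f"TRAINED MODEL CATEGORIES ({len(categories)} categories):",
    ]
    lines += [f"- {name}" for name in categories]

    lines.append(f"SAMPLE EMAILS FROM NEW MAILBOX ({len(sample)} total, showing first {len(sample)}):")
    for i, email in enumerate(sample, start=1):
        lines.append(f"{i}. From: {email.sender}")
        lines.append(f"   Subject: {email.subject}")
        lines.append(f"   Preview: {email.body[:80]}...")

    lines.append("TASK:")
    lines.append("Evaluate if the trained categories are appropriate for this mailbox.")
    lines.append('Respond with JSON: {"verdict": "GOOD_MATCH" | "FAIR_MATCH" | "POOR_MATCH", '
                 '"confidence": 0.0-1.0, "reasoning": "...", "fit_percentage": 0-100, '
                 '"suggested_categories": [...], "category_mapping": {...}}')
    return "\n".join(lines)
</pre>
</div>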
<h2>Configuration</h2>
<table style="width:100%; border-collapse: collapse; background: #252526; margin: 20px 0;">
<tr style="background: #37373d;">
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Flag</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Type</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Default</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Description</th>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;"><code>--verify-categories</code></td>
<td style="padding: 10px;">Flag</td>
<td style="padding: 10px;">False</td>
<td style="padding: 10px;">Enable category verification</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;"><code>--verify-sample</code></td>
<td style="padding: 10px;">Integer</td>
<td style="padding: 10px;">20</td>
<td style="padding: 10px;">Number of emails to sample</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;"><code>--no-llm-fallback</code></td>
<td style="padding: 10px;">Flag</td>
<td style="padding: 10px;">False</td>
<td style="padding: 10px;">Disable LLM fallback during classification</td>
</tr>
</table>
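<p>This document does not show how the flags are declared in <code>src/cli.py</code>; the snippet below is an equivalent <code>argparse</code> sketch of the three options, not the project's actual CLI code:</p>
<div class="code-section">
<pre>
import argparse

parser = argparse.ArgumentParser(prog="python -m src.cli run")
parser.add_argument("--verify-categories", action="store_true",
                    help="Enable category verification (one extra LLM call)")
parser.add_argument("--verify-sample", type=int, default=20,
                    help="Number of emails to sample for verification")
parser.add_argument("--no-llm-fallback", action="store_true",
                    help="Disable LLM fallback during classification")

# Example: parse the flags used in the custom-sample command above.
args = parser.parse_args(["--verify-categories", "--verify-sample", "30", "--no-llm-fallback"])
print(args.verify_categories, args.verify_sample, args.no_llm_fallback)  # True 30 True
</pre>
</div>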
<h2>When Verification Runs</h2>
<ul>
<li>✅ Only if <code>--verify-categories</code> flag is set</li>
<li>✅ Only if trained model exists (not mock)</li>
<li>✅ After emails are fetched, before calibration/classification</li>
<li>❌ Skipped if using mock model</li>
<li>❌ Skipped if no trained model exists (full calibration runs instead)</li>
</ul>
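<p>Taken together, the conditions above reduce to a single gate evaluated after the fetch step; the attribute names in this sketch (<code>args.verify_categories</code>, <code>model.is_mock</code>) are assumptions for illustration:</p>
<div class="code-section">
<pre>
def should_verify(args, model):
    """Verification runs only when explicitly requested and a real trained model is loaded."""
    return (
        args.verify_categories                      # --verify-categories was passed
        and model is not None                       # a trained model was loaded
        and not getattr(model, "is_mock", False)    # and it is not the mock model
    )
</pre>
</div>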
<h2>Timing Impact</h2>
<table style="width:100%; border-collapse: collapse; background: #252526; margin: 20px 0;">
<tr style="background: #37373d;">
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Configuration</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Time (10k emails)</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">LLM Calls</th>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">ML-only (no flags)</td>
<td style="padding: 10px;">~4 minutes</td>
<td style="padding: 10px;">0</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">ML-only + <code>--verify-categories</code></td>
<td style="padding: 10px;">~4.3 minutes</td>
<td style="padding: 10px;">1 (verification)</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">Full calibration (no model)</td>
<td style="padding: 10px;">~25 minutes</td>
<td style="padding: 10px;">~500</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">ML + LLM fallback (21%)</td>
<td style="padding: 10px;">~2.5 hours</td>
<td style="padding: 10px;">~2100</td>
</tr>
</table>
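<p>The figures in the table boil down to two rough rates: ML classification at roughly 2,500 emails per minute and LLM calls at a few seconds each, plus about 20 seconds for the single verification call. The estimator below is a back-of-envelope sketch using constants approximated from the table, not measured values:</p>
<div class="code-section">
<pre>
ML_EMAILS_PER_MIN = 2500      # ~10,000 emails in ~4 minutes
SECONDS_PER_LLM_CALL = 4      # ~2,100 fallback calls over ~2.5 hours
VERIFICATION_SECONDS = 20     # single verification call

def estimate_minutes(n_emails, verify=False, llm_fallback_rate=0.0):
    """Rough wall-clock estimate in minutes for an ML-first run."""
    ml_minutes = n_emails / ML_EMAILS_PER_MIN
    verify_minutes = VERIFICATION_SECONDS / 60 if verify else 0
    llm_minutes = n_emails * llm_fallback_rate * SECONDS_PER_LLM_CALL / 60
    return ml_minutes + verify_minutes + llm_minutes

print(estimate_minutes(10_000))                          # ~4.0 min  (ML-only)
print(estimate_minutes(10_000, verify=True))             # ~4.3 min  (ML-only + verification)
print(estimate_minutes(10_000, llm_fallback_rate=0.21))  # ~144 min, in line with the ~2.5 h row
</pre>
</div>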
<h2>Decision Tree</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Need to classify emails]) --> HaveModel{Trained model<br/>exists?}
HaveModel -->|No| MustCalibrate[Must run calibration<br/>~20 minutes<br/>~500 LLM calls]
HaveModel -->|Yes| SameDomain{Same domain as<br/>training data?}
SameDomain -->|Yes, confident| FastML[Pure ML<br/>4 minutes<br/>0 LLM calls]
SameDomain -->|Unsure| VerifyML[ML + Verification<br/>4.3 minutes<br/>1 LLM call]
SameDomain -->|No, different| Options{Accuracy needs?}
Options -->|High accuracy required| MustCalibrate
Options -->|Speed more important| VerifyML
Options -->|Experimental| FastML
MustCalibrate --> Done[Classification complete]
FastML --> Done
VerifyML --> Done
style FastML fill:#4ec9b0
style VerifyML fill:#ffd93d
style MustCalibrate fill:#ff6b6b
</pre>
</div>
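<p>Read as code, the decision tree maps directly onto flag choices. The helper below is for illustration only; the mode names are descriptive labels for this document, not CLI values:</p>
<div class="code-section">
<pre>
def pick_mode(has_trained_model, same_domain, accuracy_critical=False):
    """Recommend a run mode following the decision tree above.

    same_domain: True, False, or None (unsure).
    """
    if not has_trained_model:
        return "full-calibration"        # ~20-25 min, ~500 LLM calls
    if same_domain is True:
        return "pure-ml"                 # --no-llm-fallback only, 0 LLM calls
    if same_domain is None:
        return "ml-with-verification"    # add --verify-categories, 1 LLM call
    # Different domain: calibrate if accuracy matters, otherwise verify first.
    return "full-calibration" if accuracy_critical else "ml-with-verification"
</pre>
</div>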
<h2>Quick Start</h2>
<div class="code-section">
<strong>Test with verification on same domain (Enron → Enron):</strong>
python -m src.cli run \
--source enron \
--limit 1000 \
--output verify_test_same/ \
--no-llm-fallback \
--verify-categories
Expected: GOOD_MATCH (0.80-0.95)
Time: ~30 seconds
<strong>Test without verification for speed comparison:</strong>
python -m src.cli run \
--source enron \
--limit 1000 \
--output no_verify_test/ \
--no-llm-fallback
Expected: Same accuracy, 20 seconds faster
Time: ~10 seconds
</div>
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>