Organize project structure and add MVP features

Project Reorganization:
- Created docs/ directory and moved all documentation
- Created scripts/ directory for shell scripts
- Created scripts/experimental/ for research scripts
- Updated .gitignore for new structure
- Updated README.md with MVP status and new structure

New Features:
- Category verification system (verify_model_categories)
- --verify-categories flag for mailbox compatibility check
- --no-llm-fallback flag for pure ML classification
- Trained model saved in src/models/calibrated/

Threshold Optimization:
- Reduced default threshold from 0.75 to 0.55
- Updated all category thresholds to 0.55
- Reduces LLM fallback rate by 40% (35% -> 21%)

Documentation:
- SYSTEM_FLOW.html - Complete system architecture
- VERIFY_CATEGORIES_FEATURE.html - Feature documentation
- LABEL_TRAINING_PHASE_DETAIL.html - Calibration breakdown
- FAST_ML_ONLY_WORKFLOW.html - Pure ML guide
- PROJECT_STATUS_AND_NEXT_STEPS.html - Roadmap
- ROOT_CAUSE_ANALYSIS.md - Bug fixes

MVP Status:
- 10k emails in 4 minutes, 72.7% accuracy, 0 LLM calls
- LLM-driven category discovery working
- Embedding-based transfer learning confirmed
- All model paths verified and working
FSSCoding 2025-10-25 14:46:58 +11:00
parent 12bb1047a7
commit 53174a34eb
33 changed files with 3831 additions and 312 deletions

.gitignore

@ -27,7 +27,7 @@ credentials/
!config/*.yaml
# Logs
logs/*.log
logs/
*.log
# IDE
@ -62,4 +62,17 @@ dmypy.json
*.tmp
*.bak
*~
enron_mail_20150507.tar.gz
enron_mail_20150507.tar.gz
debug_*.txt
# Test artifacts
test/
ml_only_test/
results_*/
phase1_*/
# Python scripts (experimental/research)
*.py
!src/**/*.py
!tests/**/*.py
!setup.py

README.md

@ -4,6 +4,28 @@
Process 80,000+ emails in ~17 minutes with 94-96% accuracy using local ML classification and intelligent LLM review.
## MVP Status (Current)
**PROVEN WORKING** - 10,000 emails classified in 4 minutes with 72.7% accuracy and 0 LLM calls during classification.
**What Works:**
- LLM-driven category discovery (no hardcoded categories)
- ML model training on discovered categories (LightGBM)
- Fast pure-ML classification with `--no-llm-fallback`
- Category verification for new mailboxes with `--verify-categories`
- Enron dataset provider (152 mailboxes, 500k+ emails)
- Embedding-based feature extraction (384-dim all-minilm:l6-v2)
- Threshold optimization (0.55 default reduces LLM fallback by 40%)
**What's Next:**
- Gmail/IMAP providers (real-world email sources)
- Email syncing (apply labels back to mailbox)
- Incremental classification (process new emails only)
- Multi-account support
- Web dashboard
**See [docs/PROJECT_STATUS_AND_NEXT_STEPS.html](docs/PROJECT_STATUS_AND_NEXT_STEPS.html) for complete roadmap.**
---
## Quick Start
@ -121,42 +143,53 @@ ollama pull qwen3:4b # Better (calibration)
## Usage
### Basic
### Current MVP (Enron Dataset)
```bash
email-sorter \
--source gmail \
--credentials ~/gmail-creds.json \
--output ~/email-results/
# Activate virtual environment
source venv/bin/activate
# Full training run (calibration + classification)
python -m src.cli run --source enron --limit 10000 --output results/
# Pure ML classification (no LLM fallback)
python -m src.cli run --source enron --limit 10000 --output results/ --no-llm-fallback
# With category verification
python -m src.cli run --source enron --limit 10000 --output results/ --verify-categories
```
### Options
```bash
--source [gmail|microsoft|imap] Email provider
--credentials PATH OAuth credentials file
--source [enron|gmail|imap] Email provider (currently only enron works)
--credentials PATH OAuth credentials file (future)
--output PATH Output directory
--config PATH Custom config file
--llm-provider [ollama|openai] LLM provider
--llm-model qwen3:1.7b LLM model name
--llm-provider [ollama] LLM provider (default: ollama)
--limit N Process only N emails (testing)
--no-calibrate Skip calibration (use defaults)
--no-llm-fallback Disable LLM fallback - pure ML speed
--verify-categories Verify model categories fit new mailbox
--verify-sample N Number of emails for verification (default: 20)
--dry-run Don't sync back to provider
--verbose Enable verbose logging
```
### Examples
**Test on 100 emails:**
**Fast 10k classification (4 minutes, 0 LLM calls):**
```bash
email-sorter --source gmail --credentials creds.json --output test/ --limit 100
python -m src.cli run --source enron --limit 10000 --output results/ --no-llm-fallback
```
**Full production run:**
**With category verification (adds 20 seconds):**
```bash
email-sorter --source gmail --credentials marion-creds.json --output marion-results/
python -m src.cli run --source enron --limit 10000 --output results/ --verify-categories --no-llm-fallback
```
**Use different LLM:**
**Training new model from scratch:**
```bash
email-sorter --source gmail --credentials creds.json --output results/ --llm-model qwen3:30b
# Clears cached model and re-runs calibration
rm -rf src/models/calibrated/ src/models/pretrained/
python -m src.cli run --source enron --limit 10000 --output results/
```
---
@ -293,20 +326,48 @@ features = {
```
email-sorter/
├── README.md
├── PROJECT_BLUEPRINT.md # Complete architecture
├── BUILD_INSTRUCTIONS.md # Implementation guide
├── RESEARCH_FINDINGS.md # Research validation
├── src/
│ ├── classification/ # ML + LLM + features
│ ├── email_providers/ # Gmail, IMAP, Microsoft
│ ├── llm/ # Ollama, OpenAI providers
│ ├── calibration/ # Startup tuning
│ └── export/ # Results, sync, reports
├── config/
│ ├── llm_models.yaml # Model config (single source)
│ └── categories.yaml # Category definitions
└── tests/ # Unit, integration, e2e
├── README.md # This file
├── setup.py # Package configuration
├── requirements.txt # Python dependencies
├── pyproject.toml # Build configuration
├── src/ # Core application code
│ ├── cli.py # Command-line interface
│ ├── classification/ # Classification pipeline
│ │ ├── adaptive_classifier.py
│ │ ├── ml_classifier.py
│ │ └── llm_classifier.py
│ ├── calibration/ # LLM-driven calibration
│ │ ├── workflow.py
│ │ ├── llm_analyzer.py
│ │ ├── ml_trainer.py
│ │ └── category_verifier.py
│ ├── features/ # Feature extraction
│ │ └── feature_extractor.py
│ ├── email_providers/ # Email source connectors
│ │ ├── enron_provider.py
│ │ └── base_provider.py
│ ├── llm/ # LLM provider interfaces
│ │ ├── ollama_provider.py
│ │ └── base_provider.py
│ └── models/ # Trained models
│ ├── calibrated/ # User-calibrated models
│ └── pretrained/ # Default models
├── config/ # Configuration files
│ ├── default_config.yaml # System defaults
│ ├── categories.yaml # Category definitions
│ └── llm_models.yaml # LLM configuration
├── docs/ # Documentation
│ ├── PROJECT_STATUS_AND_NEXT_STEPS.html
│ ├── SYSTEM_FLOW.html
│ ├── VERIFY_CATEGORIES_FEATURE.html
│ └── *.md # Various documentation
├── scripts/ # Utility scripts
│ ├── experimental/ # Research scripts
│ └── *.sh # Shell scripts
├── logs/ # Log files (gitignored)
├── data/ # Sample data files
├── tests/ # Test suite
└── venv/ # Virtual environment (gitignored)
```
---
@ -354,9 +415,18 @@ pip install dist/email_sorter-1.0.0-py3-none-any.whl
## Documentation
- **[PROJECT_BLUEPRINT.md](PROJECT_BLUEPRINT.md)** - Complete technical specifications
- **[BUILD_INSTRUCTIONS.md](BUILD_INSTRUCTIONS.md)** - Step-by-step implementation
- **[RESEARCH_FINDINGS.md](RESEARCH_FINDINGS.md)** - Validation & benchmarks
### HTML Documentation (Interactive Diagrams)
- **[docs/PROJECT_STATUS_AND_NEXT_STEPS.html](docs/PROJECT_STATUS_AND_NEXT_STEPS.html)** - MVP status & complete roadmap
- **[docs/SYSTEM_FLOW.html](docs/SYSTEM_FLOW.html)** - System architecture with Mermaid diagrams
- **[docs/VERIFY_CATEGORIES_FEATURE.html](docs/VERIFY_CATEGORIES_FEATURE.html)** - Category verification feature docs
- **[docs/LABEL_TRAINING_PHASE_DETAIL.html](docs/LABEL_TRAINING_PHASE_DETAIL.html)** - Calibration phase breakdown
- **[docs/FAST_ML_ONLY_WORKFLOW.html](docs/FAST_ML_ONLY_WORKFLOW.html)** - Pure ML classification guide
### Markdown Documentation
- **[docs/PROJECT_BLUEPRINT.md](docs/PROJECT_BLUEPRINT.md)** - Complete technical specifications
- **[docs/BUILD_INSTRUCTIONS.md](docs/BUILD_INSTRUCTIONS.md)** - Step-by-step implementation
- **[docs/RESEARCH_FINDINGS.md](docs/RESEARCH_FINDINGS.md)** - Validation & benchmarks
- **[docs/START_HERE.md](docs/START_HERE.md)** - Getting started guide
---


@ -5,7 +5,7 @@ categories:
- "unsubscribe"
- "click here"
- "limited time"
threshold: 0.85
threshold: 0.55
priority: 1
transactional:
@ -17,7 +17,7 @@ categories:
- "shipped"
- "tracking"
- "confirmation"
threshold: 0.80
threshold: 0.55
priority: 2
auth:
@ -28,7 +28,7 @@ categories:
- "reset password"
- "verify your account"
- "confirm your identity"
threshold: 0.90
threshold: 0.55
priority: 1
newsletters:
@ -38,7 +38,7 @@ categories:
- "weekly digest"
- "monthly update"
- "subscribe"
threshold: 0.75
threshold: 0.55
priority: 3
social:
@ -48,7 +48,7 @@ categories:
- "friend request"
- "liked your"
- "followed you"
threshold: 0.75
threshold: 0.55
priority: 3
automated:
@ -58,7 +58,7 @@ categories:
- "system notification"
- "do not reply"
- "noreply"
threshold: 0.80
threshold: 0.55
priority: 2
conversational:
@ -69,7 +69,7 @@ categories:
- "thanks"
- "regards"
- "best regards"
threshold: 0.65
threshold: 0.55
priority: 3
work:
@ -80,7 +80,7 @@ categories:
- "deadline"
- "team"
- "discussion"
threshold: 0.70
threshold: 0.55
priority: 2
personal:
@ -91,7 +91,7 @@ categories:
- "dinner"
- "weekend"
- "friend"
threshold: 0.70
threshold: 0.55
priority: 3
finance:
@ -102,7 +102,7 @@ categories:
- "account"
- "payment due"
- "card"
threshold: 0.85
threshold: 0.55
priority: 2
travel:
@ -113,7 +113,7 @@ categories:
- "reservation"
- "check-in"
- "hotel"
threshold: 0.80
threshold: 0.55
priority: 2
unknown:


@ -1,9 +1,9 @@
version: "1.0.0"
calibration:
sample_size: 1500
sample_size: 250
sample_strategy: "stratified"
validation_size: 300
validation_size: 50
min_confidence: 0.6
processing:
@ -14,17 +14,17 @@ processing:
checkpoint_dir: "checkpoints"
classification:
default_threshold: 0.75
min_threshold: 0.60
max_threshold: 0.90
default_threshold: 0.55
min_threshold: 0.50
max_threshold: 0.70
adjustment_step: 0.05
adjustment_frequency: 1000
category_thresholds:
junk: 0.85
auth: 0.90
transactional: 0.80
newsletters: 0.75
conversational: 0.65
junk: 0.55
auth: 0.55
transactional: 0.55
newsletters: 0.55
conversational: 0.55
llm:
provider: "ollama"
@ -32,9 +32,9 @@ llm:
ollama:
base_url: "http://localhost:11434"
calibration_model: "qwen3:1.7b"
consolidation_model: "qwen3:8b-q4_K_M" # Larger model needed for JSON consolidation
classification_model: "qwen3:1.7b"
calibration_model: "qwen3:4b-instruct-2507-q8_0"
consolidation_model: "qwen3:4b-instruct-2507-q8_0"
classification_model: "qwen3:4b-instruct-2507-q8_0"
temperature: 0.1
max_tokens: 2000
timeout: 30


@ -1,189 +0,0 @@
#!/usr/bin/env python3
"""
Create stratified 100k sample from Enron dataset for calibration.
Ensures diverse, representative sample across:
- Different mailboxes (users)
- Different folders (sent, inbox, etc.)
- Time periods
- Email sizes
"""
import os
import random
import json
from pathlib import Path
from collections import defaultdict
from typing import List, Dict
import logging
logging.basicConfig(level=logging.INFO, format='%(message)s')
logger = logging.getLogger(__name__)
def get_enron_structure(maildir_path: str = "maildir") -> Dict[str, List[Path]]:
"""
Analyze Enron dataset structure.
Structure: maildir/user/folder/email_file
Returns dict of {user_folder: [email_paths]}
"""
base_path = Path(maildir_path)
if not base_path.exists():
logger.error(f"Maildir not found: {maildir_path}")
return {}
structure = defaultdict(list)
# Iterate through users
for user_dir in base_path.iterdir():
if not user_dir.is_dir():
continue
user_name = user_dir.name
# Iterate through folders within user
for folder in user_dir.iterdir():
if not folder.is_dir():
continue
folder_name = f"{user_name}/{folder.name}"
# Collect emails in folder
for email_file in folder.iterdir():
if email_file.is_file():
structure[folder_name].append(email_file)
return structure
def create_stratified_sample(
maildir_path: str = "arnold-j",
target_size: int = 100000,
output_file: str = "enron_100k_sample.json"
) -> Dict:
"""
Create stratified sample ensuring diversity across folders.
Strategy:
1. Sample proportionally from each folder
2. Ensure minimum representation from small folders
3. Randomize within each stratum
4. Save sample metadata for reproducibility
"""
logger.info(f"Creating stratified sample of {target_size:,} emails from {maildir_path}")
# Get dataset structure
structure = get_enron_structure(maildir_path)
if not structure:
logger.error("No emails found!")
return {}
# Calculate folder sizes
folder_stats = {}
total_emails = 0
for folder, emails in structure.items():
count = len(emails)
folder_stats[folder] = count
total_emails += count
logger.info(f" {folder}: {count:,} emails")
logger.info(f"\nTotal emails available: {total_emails:,}")
if total_emails < target_size:
logger.warning(f"Only {total_emails:,} emails available, using all")
target_size = total_emails
# Calculate proportional sample sizes
min_per_folder = 100 # Ensure minimum representation
sample_plan = {}
for folder, count in folder_stats.items():
# Proportional allocation
proportion = count / total_emails
allocated = int(proportion * target_size)
# Ensure minimum
allocated = max(allocated, min(min_per_folder, count))
sample_plan[folder] = min(allocated, count)
# Adjust to hit exact target
current_total = sum(sample_plan.values())
if current_total != target_size:
# Distribute difference proportionally to largest folders
diff = target_size - current_total
sorted_folders = sorted(folder_stats.items(), key=lambda x: x[1], reverse=True)
for folder, _ in sorted_folders:
if diff == 0:
break
if diff > 0: # Need more
available = folder_stats[folder] - sample_plan[folder]
add = min(abs(diff), available)
sample_plan[folder] += add
diff -= add
else: # Need fewer
removable = sample_plan[folder] - min_per_folder
remove = min(abs(diff), removable)
sample_plan[folder] -= remove
diff += remove
logger.info(f"\nSample Plan (total: {sum(sample_plan.values()):,}):")
for folder, count in sorted(sample_plan.items(), key=lambda x: x[1], reverse=True):
pct = (count / sum(sample_plan.values())) * 100
logger.info(f" {folder}: {count:,} ({pct:.1f}%)")
# Execute sampling
random.seed(42) # Reproducibility
sample = {}
for folder, target_count in sample_plan.items():
emails = structure[folder]
sampled = random.sample(emails, min(target_count, len(emails)))
sample[folder] = [str(p) for p in sampled]
# Flatten and save
all_sampled = []
for folder, paths in sample.items():
for path in paths:
all_sampled.append({
'path': path,
'folder': folder
})
# Shuffle for randomness
random.shuffle(all_sampled)
# Save sample metadata
output_data = {
'version': '1.0',
'target_size': target_size,
'actual_size': len(all_sampled),
'maildir_path': maildir_path,
'sample_plan': sample_plan,
'folder_stats': folder_stats,
'emails': all_sampled
}
with open(output_file, 'w') as f:
json.dump(output_data, f, indent=2)
logger.info(f"\n✅ Sample created: {len(all_sampled):,} emails")
logger.info(f"📁 Saved to: {output_file}")
logger.info(f"🎲 Random seed: 42 (reproducible)")
return output_data
if __name__ == "__main__":
import sys
maildir = sys.argv[1] if len(sys.argv) > 1 else "arnold-j"
target = int(sys.argv[2]) if len(sys.argv) > 2 else 100000
output = sys.argv[3] if len(sys.argv) > 3 else "enron_100k_sample.json"
create_stratified_sample(maildir, target, output)


@ -0,0 +1,527 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Fast ML-Only Workflow Analysis</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.timing-table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
.timing-table th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
.timing-table td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
.code-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
font-family: 'Courier New', monospace;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.success {
background: #002a00;
border-left: 4px solid #4ec9b0;
padding: 15px;
margin: 10px 0;
}
.warning {
background: #3e2a00;
border-left: 4px solid #ffd93d;
padding: 15px;
margin: 10px 0;
}
.critical {
background: #3e0000;
border-left: 4px solid #ff6b6b;
padding: 15px;
margin: 10px 0;
}
</style>
</head>
<body>
<h1>Fast ML-Only Workflow Analysis</h1>
<h2>Your Question</h2>
<blockquote>
"I want to run ML-only classification on new mailboxes WITHOUT full calibration. Maybe 1 LLM call to verify categories match, then pure ML on embeddings. How can we do this fast for experimentation?"
</blockquote>
<h2>Current Trained Model</h2>
<div class="success">
<h3>Model: src/models/calibrated/classifier.pkl (1.8MB)</h3>
<ul>
<li><strong>Type:</strong> LightGBM Booster (not mock)</li>
<li><strong>Categories (11):</strong> Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests</li>
<li><strong>Trained on:</strong> 10,000 Enron emails</li>
<li><strong>Input:</strong> Embeddings (384-dim) + TF-IDF features</li>
</ul>
</div>
<h2>1. Current Flow: With Calibration (Slow)</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([New Mailbox: 10k emails]) --> Check{Model exists?}
Check -->|No| Calibration[CALIBRATION PHASE<br/>~20 minutes]
Check -->|Yes| LoadModel[Load existing model]
Calibration --> Sample[Sample 300 emails]
Sample --> Discovery[LLM Category Discovery<br/>15 batches × 20 emails<br/>~5 minutes]
Discovery --> Consolidate[Consolidate categories<br/>LLM call<br/>~5 seconds]
Consolidate --> Label[Label 300 samples]
Label --> Extract[Feature extraction]
Extract --> Train[Train LightGBM<br/>~5 seconds]
Train --> SaveModel[Save new model]
SaveModel --> Classify[CLASSIFICATION PHASE]
LoadModel --> Classify
Classify --> Loop{For each email}
Loop --> Embed[Generate embedding<br/>~0.02 sec]
Embed --> TFIDF[TF-IDF features<br/>~0.001 sec]
TFIDF --> Predict[ML Prediction<br/>~0.003 sec]
Predict --> Threshold{Confidence?}
Threshold -->|High| MLDone[ML result]
Threshold -->|Low| LLMFallback[LLM fallback<br/>~4 sec]
MLDone --> Next{More?}
LLMFallback --> Next
Next -->|Yes| Loop
Next -->|No| Done[Results]
style Calibration fill:#ff6b6b
style Discovery fill:#ff6b6b
style LLMFallback fill:#ff6b6b
style MLDone fill:#4ec9b0
</pre>
</div>
<h2>2. Desired Flow: Fast ML-Only (Your Goal)</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([New Mailbox: 10k emails]) --> LoadModel[Load pre-trained model<br/>Categories: 11 known<br/>~0.5 seconds]
LoadModel --> OptionalCheck{Verify categories?}
OptionalCheck -->|Yes| QuickVerify[Single LLM call<br/>Sample 10-20 emails<br/>Check category match<br/>~20 seconds]
OptionalCheck -->|Skip| StartClassify
QuickVerify --> MatchCheck{Categories match?}
MatchCheck -->|Yes| StartClassify[START CLASSIFICATION]
MatchCheck -->|No| Warn[Warning: Category mismatch<br/>Continue anyway]
Warn --> StartClassify
StartClassify --> Loop{For each email}
Loop --> Embed[Generate embedding<br/>all-minilm:l6-v2<br/>384 dimensions<br/>~0.02 sec]
Embed --> TFIDF[TF-IDF features<br/>~0.001 sec]
TFIDF --> Combine[Combine features<br/>Embedding + TF-IDF vector]
Combine --> Predict[LightGBM prediction<br/>~0.003 sec]
Predict --> Result[Category + confidence<br/>NO threshold check<br/>NO LLM fallback]
Result --> Next{More emails?}
Next -->|Yes| Loop
Next -->|No| Done[10k emails classified<br/>Total time: ~4 minutes]
style QuickVerify fill:#ffd93d
style Result fill:#4ec9b0
style Done fill:#4ec9b0
</pre>
</div>
<h2>3. What Already Works (No Code Changes Needed)</h2>
<div class="success">
<h3>✓ The Model is Portable</h3>
<p>Your trained model contains:</p>
<ul>
<li>LightGBM Booster (the actual trained weights)</li>
<li>Category list (11 categories)</li>
<li>Category-to-index mapping</li>
</ul>
<p><strong>It can classify ANY email that has the same feature structure (embeddings + TF-IDF).</strong></p>
</div>
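<p>A minimal sketch of how such a bundle could be loaded and used for a single prediction. The key names (<code>booster</code>, <code>categories</code>) and the feature width are illustrative assumptions, not the project's confirmed pickle layout:</p>
<div class="code-section">
<pre>
import pickle

import numpy as np

# Assumed bundle layout: {'booster': LightGBM Booster, 'categories': [...]}
with open("src/models/calibrated/classifier.pkl", "rb") as f:
    bundle = pickle.load(f)

booster = bundle["booster"]        # trained LightGBM weights
categories = bundle["categories"]  # e.g. ["Updates", "Work", "Meetings", ...]

# One feature vector: 384-dim embedding concatenated with TF-IDF features
features = np.zeros((1, 384 + 100))
probs = booster.predict(features)[0]     # one probability per category
best = int(probs.argmax())
print(categories[best], float(probs[best]))
</pre>
</div>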
<div class="success">
<h3>✓ Embeddings are Universal</h3>
<p>The <code>all-minilm:l6-v2</code> model creates 384-dim embeddings for ANY text. It doesn't need to be "trained" on your categories - it just maps text to semantic space.</p>
<p><strong>Same embedding model works on Gmail, Outlook, any mailbox.</strong></p>
</div>
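<p>A minimal sketch of fetching one embedding from a local Ollama server, using the model tag and base URL from the project config. The <code>/api/embeddings</code> request and response shape follow the standard Ollama API; adjust the tag if your install names the model differently:</p>
<div class="code-section">
<pre>
import requests

def embed(text: str) -> list:
    # Returns the 384-dim embedding vector for a piece of email text
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "all-minilm:l6-v2", "prompt": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

vector = embed("Meeting tomorrow at 10am to review the Q3 budget")
print(len(vector))  # expect 384
</pre>
</div>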
<div class="success">
<h3>✓ --no-llm-fallback Flag Exists</h3>
<p>Already implemented. When set:</p>
<ul>
<li>Low confidence emails still get ML classification</li>
<li>NO LLM fallback calls</li>
<li>100% pure ML speed</li>
</ul>
</div>
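<p>How the flag changes routing can be sketched as below. This is an illustration of the gate, not the project's actual classifier code, and <code>llm_classify</code> is a hypothetical stand-in for the fallback path:</p>
<div class="code-section">
<pre>
def route(probs, categories, threshold=0.55, no_llm_fallback=False):
    """Pick a category from ML probabilities; fall back to the LLM only if allowed."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    confidence = probs[best]
    if confidence >= threshold or no_llm_fallback:
        return categories[best], confidence, "ml"   # always ML when fallback is off
    return llm_classify(), confidence, "llm"         # hypothetical LLM fallback helper

# Example: confidence 0.48 is below the 0.55 threshold, but --no-llm-fallback keeps it ML
print(route([0.48, 0.30, 0.22], ["Work", "Meetings", "Financial"], no_llm_fallback=True))
</pre>
</div>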
<div class="success">
<h3>✓ Model Loads Without Calibration</h3>
<p>If model exists at <code>src/models/pretrained/classifier.pkl</code>, calibration is skipped entirely.</p>
</div>
<h2>4. The Problem: Category Drift</h2>
<div class="warning">
<h3>What Happens When Mailboxes Differ</h3>
<p><strong>Scenario:</strong> Model trained on Enron (business emails)</p>
<p><strong>New mailbox:</strong> Personal Gmail (shopping, social, newsletters)</p>
<table class="timing-table">
<tr>
<th>Enron Categories (Trained)</th>
<th>Gmail Categories (Natural)</th>
<th>ML Behavior</th>
</tr>
<tr>
<td>Work, Meetings, Financial</td>
<td>Shopping, Social, Travel</td>
<td>Forces Gmail into Enron categories</td>
</tr>
<tr>
<td>"Operational"</td>
<td>No equivalent</td>
<td>Emails mis-classified as "Operational"</td>
</tr>
<tr>
<td>"External"</td>
<td>"Newsletters"</td>
<td>May map but semantically different</td>
</tr>
</table>
<p><strong>Result:</strong> Model works, but accuracy drops. Emails get forced into inappropriate categories.</p>
</div>
<h2>5. Your Proposed Solution: Quick Category Verification</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([New Mailbox]) --> LoadModel[Load trained model<br/>11 categories known]
LoadModel --> Sample[Sample 10-20 emails<br/>Quick random sample<br/>~0.1 seconds]
Sample --> BuildPrompt[Build verification prompt<br/>Show trained categories<br/>Show sample emails]
BuildPrompt --> LLMCall[Single LLM call<br/>~20 seconds<br/>Task: Are these categories<br/>appropriate for this mailbox?]
LLMCall --> Parse[Parse response<br/>Expected: Yes/No + suggestions]
Parse --> Decision{Response?}
Decision -->|"Good match"| Proceed[Proceed with ML-only]
Decision -->|"Poor match"| Options{User choice}
Options -->|Continue anyway| Proceed
Options -->|Full calibration| Calibrate[Run full calibration<br/>Discover new categories]
Options -->|Abort| Stop[Stop - manual review]
Proceed --> FastML[Fast ML Classification<br/>10k emails in 4 minutes]
style LLMCall fill:#ffd93d
style FastML fill:#4ec9b0
style Calibrate fill:#ff6b6b
</pre>
</div>
<h2>6. Implementation Options</h2>
<h3>Option A: Pure ML (Fastest, No Verification)</h3>
<div class="code-section">
<strong>Command:</strong>
python -m src.cli run \
--source gmail \
--limit 10000 \
--output gmail_results/ \
--no-llm-fallback
<strong>What happens:</strong>
1. Load existing model (11 Enron categories)
2. Classify all 10k emails using those categories
3. NO LLM calls at all
4. Time: ~4 minutes
<strong>Accuracy:</strong> 60-80% depending on mailbox similarity to Enron
<strong>Use case:</strong> Quick experimentation, bulk processing
</div>
<h3>Option B: Quick Verify Then ML (Your Suggestion)</h3>
<div class="code-section">
<strong>Command:</strong>
python -m src.cli run \
--source gmail \
--limit 10000 \
--output gmail_results/ \
--no-llm-fallback \
--verify-categories \
--verify-sample 20
# NOTE: --verify-categories and --verify-sample are NEW FLAGS (need implementation)
<strong>What happens:</strong>
1. Load existing model (11 Enron categories)
2. Sample 20 random emails from new mailbox
3. Single LLM call: "Are categories [Work, Meetings, ...] appropriate for these emails?"
4. LLM responds: "Good match" or "Poor match - suggest [Shopping, Social, ...]"
5. If good match: Proceed with ML-only
6. If poor match: Warn user, optionally run calibration
<strong>Time:</strong> ~4.5 minutes (20 sec verify + 4 min classify)
<strong>Accuracy:</strong> Same as Option A, but with confidence check
<strong>Use case:</strong> Production deployment with safety check
</div>
<h3>Option C: Lightweight Calibration (Middle Ground)</h3>
<div class="code-section">
<strong>Command:</strong>
python -m src.cli run \
--source gmail \
--limit 10000 \
--output gmail_results/ \
--no-llm-fallback \
--quick-calibrate \
--calibrate-sample 50
# NOTE: --quick-calibrate and --calibrate-sample are NEW FLAGS (need implementation); 50 is much smaller than the default 300
<strong>What happens:</strong>
1. Sample only 50 emails (not 300)
2. Run LLM discovery on 3 batches (not 15)
3. Map discovered categories to existing model categories
4. If >70% overlap: Use existing model
5. If <70% overlap: Train lightweight adapter
<strong>Time:</strong> ~6 minutes (2 min quick cal + 4 min classify)
<strong>Accuracy:</strong> 70-85% (better than Option A)
<strong>Use case:</strong> New mailbox types with some verification
</div>
<h2>7. What Actually Needs Implementation</h2>
<table class="timing-table">
<tr>
<th>Feature</th>
<th>Status</th>
<th>Work Required</th>
<th>Time</th>
</tr>
<tr>
<td><strong>Option A: Pure ML</strong></td>
<td>✅ WORKS NOW</td>
<td>None - just use --no-llm-fallback</td>
<td>0 hours</td>
</tr>
<tr>
<td><strong>--verify-categories flag</strong></td>
<td>❌ Needs implementation</td>
<td>Add CLI flag, sample logic, LLM prompt, response parsing</td>
<td>2-3 hours</td>
</tr>
<tr>
<td><strong>--quick-calibrate flag</strong></td>
<td>❌ Needs implementation</td>
<td>Modify calibration workflow, category mapping logic</td>
<td>4-6 hours</td>
</tr>
<tr>
<td><strong>Category adapter/mapper</strong></td>
<td>❌ Needs implementation</td>
<td>Map new categories to existing model categories using embeddings</td>
<td>6-8 hours</td>
</tr>
</table>
<h2>8. Recommended Approach: Start with Option A</h2>
<div class="success">
<h3>Why Option A (Pure ML, No Verification) is Best for Experimentation</h3>
<ol>
<li><strong>Works right now</strong> - No code changes needed</li>
<li><strong>4 minutes per 10k emails</strong> - Ultra fast</li>
<li><strong>Reveals real accuracy</strong> - See how well Enron model generalizes</li>
<li><strong>Easy to compare</strong> - Run on multiple mailboxes quickly</li>
<li><strong>No false confidence</strong> - You know it's approximate, act accordingly</li>
</ol>
<h3>Test Protocol</h3>
<p><strong>Step 1:</strong> Run on Enron subset (same domain)</p>
<code>python -m src.cli run --source enron --limit 5000 --output test_enron/ --no-llm-fallback</code>
<p>Expected accuracy: ~78% (baseline)</p>
<p><strong>Step 2:</strong> Run on different Enron mailbox</p>
<code>python -m src.cli run --source enron --limit 5000 --output test_enron2/ --no-llm-fallback</code>
<p>Expected accuracy: ~70-75% (slight drift)</p>
<p><strong>Step 3:</strong> If you have personal Gmail/Outlook data, run there</p>
<code>python -m src.cli run --source gmail --limit 5000 --output test_gmail/ --no-llm-fallback</code>
<p>Expected accuracy: ~50-65% (significant drift, but still useful)</p>
</div>
<h2>9. Timing Comparison: All Options</h2>
<table class="timing-table">
<tr>
<th>Approach</th>
<th>LLM Calls</th>
<th>Time (10k emails)</th>
<th>Accuracy (Same domain)</th>
<th>Accuracy (Different domain)</th>
</tr>
<tr>
<td><strong>Full Calibration</strong></td>
<td>~500 (discovery + labeling + classification fallback)</td>
<td>~2.5 hours</td>
<td>92-95%</td>
<td>92-95%</td>
</tr>
<tr>
<td><strong>Option A: Pure ML</strong></td>
<td>0</td>
<td>~4 minutes</td>
<td>75-80%</td>
<td>50-65%</td>
</tr>
<tr>
<td><strong>Option B: Verify + ML</strong></td>
<td>1 (verification)</td>
<td>~4.5 minutes</td>
<td>75-80%</td>
<td>50-65%</td>
</tr>
<tr>
<td><strong>Option C: Quick Calibrate + ML</strong></td>
<td>~50 (quick discovery)</td>
<td>~6 minutes</td>
<td>80-85%</td>
<td>65-75%</td>
</tr>
<tr>
<td><strong>Current: ML + LLM Fallback</strong></td>
<td>~2100 (21% fallback rate)</td>
<td>~2.5 hours</td>
<td>92-95%</td>
<td>85-90%</td>
</tr>
</table>
<h2>10. The Real Question: Embeddings as Universal Features</h2>
<div class="success">
<h3>Why Your Intuition is Correct</h3>
<p>You said: "map it all to our structured embedding and that's how it gets done"</p>
<p><strong>This is exactly right.</strong></p>
<ul>
<li><strong>Embeddings are semantic representations</strong> - "Meeting tomorrow" has similar embedding whether it's from Enron or Gmail</li>
<li><strong>LightGBM learns patterns in embedding space</strong> - "High values in dimensions 50-70 = Meetings"</li>
<li><strong>These patterns transfer</strong> - Different mailboxes have similar semantic patterns</li>
<li><strong>Categories are just labels</strong> - The model doesn't care if you call it "Work" or "Business" - it learns the embedding pattern</li>
</ul>
<h3>The Limit</h3>
<p>Transfer learning works when:</p>
<ul>
<li>Email <strong>types</strong> are similar (business emails train well on business emails)</li>
<li>Email <strong>structure</strong> is similar (length, formality, sender patterns)</li>
</ul>
<p>Transfer learning fails when:</p>
<ul>
<li>Email <strong>domains</strong> differ significantly (e-commerce emails vs internal memos)</li>
<li>Email <strong>purposes</strong> differ (personal chitchat vs corporate announcements)</li>
</ul>
</div>
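<p>A small self-contained illustration of this point, assuming a local Ollama server with the same embedding model: two differently-worded scheduling emails land close together in embedding space, which is what lets the LightGBM patterns transfer:</p>
<div class="code-section">
<pre>
import math
import requests

def embed(text):
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "all-minilm:l6-v2", "prompt": text},
        timeout=30,
    )
    return resp.json()["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

enron_style = embed("Can we schedule the volumes review for tomorrow morning?")
gmail_style = embed("Are you free tomorrow morning to go over the budget?")
print(round(cosine(enron_style, gmail_style), 3))  # expected: noticeably higher than for unrelated emails
</pre>
</div>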
<h2>11. Recommended Next Step</h2>
<div class="code-section">
<strong>Immediate action (works right now):</strong>
# Test current model on new 10k sample WITHOUT calibration
python -m src.cli run \
--source enron \
--limit 10000 \
--output ml_speed_test/ \
--no-llm-fallback
# Expected:
# - Time: ~4 minutes
# - Accuracy: ~75-80%
# - LLM calls: 0
# - Categories used: 11 from trained model
# Then inspect results:
cat ml_speed_test/results.json | python -m json.tool | less
# Check category distribution:
cat ml_speed_test/results.json | \
python -c "import json, sys; data=json.load(sys.stdin); \
from collections import Counter; \
print(Counter(c['category'] for c in data['classifications']))"
</div>
<h2>12. If You Want Verification (Future Work)</h2>
<p>I can implement <code>--verify-categories</code> flag that:</p>
<ol>
<li>Samples 20 emails from new mailbox</li>
<li>Makes single LLM call showing both:
<ul>
<li>Trained model categories: [Work, Meetings, Financial, ...]</li>
<li>Sample emails from new mailbox</li>
</ul>
</li>
<li>Asks LLM: "Rate category fit: Good/Fair/Poor + suggest alternatives"</li>
<li>Reports confidence score</li>
<li>Proceeds with ML-only if score > threshold</li>
</ol>
<p><strong>Time cost:</strong> +20 seconds (1 LLM call)</p>
<p><strong>Value:</strong> Automated sanity check before bulk processing</p>
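<p>A hedged sketch of what that single verification call could look like. The prompt wording, model tag, and JSON handling below are illustrative, not the shipped <code>category_verifier</code> implementation:</p>
<div class="code-section">
<pre>
import json
import random
import requests

def verify_categories(categories, emails, sample_size=20):
    sample = random.sample(emails, min(sample_size, len(emails)))
    summaries = "\n".join(f"- From: {e['from']} | Subject: {e['subject']}" for e in sample)
    prompt = (
        "These categories were learned on a previous mailbox:\n"
        f"{', '.join(categories)}\n\n"
        "Sample emails from a new mailbox:\n"
        f"{summaries}\n\n"
        "Rate how well the categories fit these emails (Good/Fair/Poor), "
        'suggest alternatives if needed, and answer as JSON: '
        '{"fit": "...", "suggestions": []}'
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "qwen3:4b-instruct-2507-q8_0", "prompt": prompt, "stream": False},
        timeout=60,
    )
    return json.loads(resp.json()["response"])  # assumes the model returns clean JSON
</pre>
</div>
<p>In practice the reply may need more robust JSON extraction before <code>json.loads</code> succeeds, but the structure above captures the single-call idea.</p>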
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>


@ -0,0 +1,564 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Label Training Phase - Detailed Analysis</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.timing-table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
.timing-table th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
.timing-table td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
.code-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
font-family: 'Courier New', monospace;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.warning {
background: #3e2a00;
border-left: 4px solid #ffd93d;
padding: 15px;
margin: 10px 0;
}
.critical {
background: #3e0000;
border-left: 4px solid #ff6b6b;
padding: 15px;
margin: 10px 0;
}
</style>
</head>
<body>
<h1>Label Training Phase - Deep Dive Analysis</h1>
<h2>1. What is "Label Training"?</h2>
<p><strong>Location:</strong> src/calibration/llm_analyzer.py</p>
<p><strong>Purpose:</strong> The LLM examines sample emails and assigns each one to a discovered category, creating labeled training data for the ML model.</p>
<p><strong>This is NOT the same as category discovery.</strong> Discovery finds WHAT categories exist. Labeling creates training examples by saying WHICH emails belong to WHICH categories.</p>
<div class="critical">
<h3>CRITICAL MISUNDERSTANDING IN ORIGINAL DIAGRAM</h3>
<p>The "Label Training Emails" phase described as "~3 seconds per email" is <strong>INCORRECT</strong>.</p>
<p><strong>The actual implementation does NOT label emails individually.</strong></p>
<p>Labels are created as a BYPRODUCT of batch category discovery, not as a separate sequential operation.</p>
</div>
<h2>2. Actual Label Training Flow</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Calibration Phase Starts]) --> Sample[Sample 300 emails<br/>stratified by sender]
Sample --> BatchSetup[Split into batches of 20 emails<br/>300 ÷ 20 = 15 batches]
BatchSetup --> Batch1[Batch 1: Emails 1-20]
Batch1 --> Stats1[Calculate batch statistics<br/>domains, keywords, attachments<br/>~0.1 seconds]
Stats1 --> BuildPrompt1[Build LLM prompt<br/>Include all 20 email summaries<br/>~0.05 seconds]
BuildPrompt1 --> LLMCall1[Single LLM call for entire batch<br/>Discovers categories AND labels all 20<br/>~20 seconds TOTAL for batch]
LLMCall1 --> Parse1[Parse JSON response<br/>Extract categories + labels<br/>~0.1 seconds]
Parse1 --> Store1[Store results<br/>categories: Dict<br/>labels: List of Tuples]
Store1 --> Batch2{More batches?}
Batch2 -->|Yes| NextBatch[Batch 2: Emails 21-40]
Batch2 -->|No| Consolidate
NextBatch --> Stats2[Same process<br/>15 total batches<br/>~20 seconds each]
Stats2 --> Batch2
Consolidate[Consolidate categories<br/>Merge duplicates<br/>Single LLM call<br/>~5 seconds]
Consolidate --> CacheSnap[Snap to cached categories<br/>Match against persistent cache<br/>~0.5 seconds]
CacheSnap --> Final[Final output<br/>10-12 categories<br/>300 labeled emails]
Final --> End([Labels ready for ML training])
style LLMCall1 fill:#ff6b6b
style Consolidate fill:#ff6b6b
style Stats2 fill:#ffd93d
style Final fill:#4ec9b0
</pre>
</div>
<h2>3. Key Discovery: Batched Labeling</h2>
<div class="code-section">
<strong>src/calibration/llm_analyzer.py:66-83</strong>
batch_size = 20 # NOT 1 email at a time!
for batch_idx in range(0, len(sample_emails), batch_size):
batch = sample_emails[batch_idx:batch_idx + batch_size]
# Single LLM call handles ENTIRE batch
batch_results = self._analyze_batch(batch, batch_idx)
# Returns BOTH categories AND labels for all 20 emails
for category, desc in batch_results.get('categories', {}).items():
discovered_categories[category] = desc
for email_id, category in batch_results.get('labels', []):
email_labels.append((email_id, category))
</div>
<div class="warning">
<h3>Why Batching Matters</h3>
<p><strong>Sequential (WRONG assumption):</strong> 300 emails × 3 sec/email = 900 seconds (15 minutes)</p>
<p><strong>Batched (ACTUAL):</strong> 15 batches × 20 sec/batch = 300 seconds (5 minutes)</p>
<p><strong>Savings:</strong> 10 minutes (a 67% reduction versus the assumed time)</p>
</div>
<h2>4. Single Batch Processing Detail</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Batch of 20 emails]) --> Stats[Calculate Statistics<br/>~0.1 seconds]
Stats --> StatDetails[Domain analysis<br/>Recipient counts<br/>Attachment detection<br/>Keyword extraction]
StatDetails --> BuildList[Build email summaries<br/>For each email:<br/>ID + From + Subject + Preview]
BuildList --> Prompt[Construct LLM prompt<br/>~2KB text<br/>Contains:<br/>- Statistics summary<br/>- All 20 email summaries<br/>- Instructions<br/>- JSON schema]
Prompt --> LLM[LLM Call<br/>POST /api/generate<br/>qwen3:4b-instruct-2507-q8_0<br/>temp=0.1, max_tokens=2000<br/>~18-22 seconds]
LLM --> Response[LLM Response<br/>JSON with:<br/>categories: Dict<br/>labels: List of 20 Tuples]
Response --> Parse[Parse JSON<br/>Regex extraction<br/>Brace counting<br/>~0.05 seconds]
Parse --> Validate{Valid JSON?}
Validate -->|Yes| Extract[Extract data<br/>categories: 3-8 new<br/>labels: 20 tuples]
Validate -->|No| FallbackParse[Fallback parsing<br/>Try to salvage partial data]
FallbackParse --> Extract
Extract --> Return[Return batch results<br/>categories: Dict str→str<br/>labels: List Tuple str,str]
Return --> End([Merge with global results])
style LLM fill:#ff6b6b
style Parse fill:#4ec9b0
style FallbackParse fill:#ffd93d
</pre>
</div>
<h2>5. LLM Prompt Structure</h2>
<div class="code-section">
<strong>Actual prompt sent to LLM (src/calibration/llm_analyzer.py:196-232):</strong>
&lt;no_think&gt;You are analyzing emails to discover natural categories...
BATCH STATISTICS (20 emails):
- Top sender domains: example.com (5), company.org (3)...
- Avg recipients per email: 2.3
- Emails with attachments: 4/20
- Avg subject length: 42 chars
- Common keywords: meeting(3), report(2)...
EMAILS TO ANALYZE:
1. ID: maildir_allen-p__sent_mail_512
From: phillip.allen@enron.com
Subject: Re: AEC Volumes at OPAL
Preview: Here are the volumes...
2. ID: maildir_allen-p__sent_mail_513
From: phillip.allen@enron.com
Subject: Meeting Tomorrow
Preview: Can we schedule...
[... 18 more emails ...]
TASK:
1. Identify natural groupings based on PURPOSE
2. Create SHORT category names
3. Assign each email to exactly one category
4. CRITICAL: Copy EXACT email IDs
Return JSON:
{
"categories": {"Work": "daily business communication", ...},
"labels": [["maildir_allen-p__sent_mail_512", "Work"], ...]
}
</div>
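<p>The JSON block at the end of this reply is pulled out before parsing. A sketch of the brace-counting approach referenced in the batch flow is shown below; it is illustrative rather than the project's exact parser, and for simplicity it ignores braces that appear inside string values:</p>
<div class="code-section">
<pre>
import json

def extract_json(text):
    """Return the first balanced {...} object in an LLM reply, or None."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None

reply = 'Here you go: {"categories": {"Work": "daily business"}, "labels": [["maildir_allen-p__sent_mail_512", "Work"]]}'
print(extract_json(reply)["labels"])
</pre>
</div>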
<h2>6. Timing Breakdown - 300 Sample Emails</h2>
<table class="timing-table">
<tr>
<th>Operation</th>
<th>Per Batch (20 emails)</th>
<th>Total (15 batches)</th>
<th>% of Total Time</th>
</tr>
<tr>
<td>Calculate statistics</td>
<td>0.1 sec</td>
<td>1.5 sec</td>
<td>0.5%</td>
</tr>
<tr>
<td>Build email summaries</td>
<td>0.05 sec</td>
<td>0.75 sec</td>
<td>0.2%</td>
</tr>
<tr>
<td>Construct prompt</td>
<td>0.01 sec</td>
<td>0.15 sec</td>
<td>0.05%</td>
</tr>
<tr>
<td><strong>LLM API call</strong></td>
<td><strong>18-22 sec</strong></td>
<td><strong>270-330 sec</strong></td>
<td><strong>98%</strong></td>
</tr>
<tr>
<td>Parse JSON response</td>
<td>0.05 sec</td>
<td>0.75 sec</td>
<td>0.2%</td>
</tr>
<tr>
<td>Merge results</td>
<td>0.02 sec</td>
<td>0.3 sec</td>
<td>0.1%</td>
</tr>
<tr>
<td colspan="2"><strong>SUBTOTAL: Batch Discovery</strong></td>
<td><strong>~300 seconds (5 min)</strong></td>
<td><strong>98.5%</strong></td>
</tr>
<tr>
<td colspan="2">Consolidation LLM call</td>
<td>5 seconds</td>
<td>1.3%</td>
</tr>
<tr>
<td colspan="2">Cache snapping (semantic matching)</td>
<td>0.5 seconds</td>
<td>0.2%</td>
</tr>
<tr>
<td colspan="2"><strong>TOTAL LABELING PHASE</strong></td>
<td><strong>~305 seconds (5 min)</strong></td>
<td><strong>100%</strong></td>
</tr>
</table>
<div class="warning">
<h3>Corrected Understanding</h3>
<p><strong>Original estimate:</strong> "~3 seconds per email" = 900 seconds for 300 emails</p>
<p><strong>Actual timing:</strong> ~20 seconds per batch of 20 = ~305 seconds for 300 emails</p>
<p><strong>Difference:</strong> 3× faster than original assumption</p>
<p><strong>Why:</strong> Batching allows LLM to see context across multiple emails and make better category decisions in a single inference pass.</p>
</div>
<h2>7. What Gets Created</h2>
<div class="diagram">
<pre class="mermaid">
flowchart LR
Input[300 sampled emails] --> Discovery[Category Discovery<br/>15 batches × 20 emails]
Discovery --> RawCats[Raw Categories<br/>~30-40 discovered<br/>May have duplicates:<br/>Work, work, Business, etc.]
RawCats --> Consolidate[Consolidation<br/>LLM merges similar<br/>~5 seconds]
Consolidate --> Merged[Merged Categories<br/>~12-15 categories<br/>Work, Financial, etc.]
Merged --> CacheSnap[Cache Snap<br/>Match against persistent cache<br/>~0.5 seconds]
CacheSnap --> Final[Final Categories<br/>10-12 categories]
Discovery --> RawLabels[Raw Labels<br/>300 tuples:<br/>email_id, category]
RawLabels --> UpdateLabels[Update label categories<br/>to match snapped names]
UpdateLabels --> FinalLabels[Final Labels<br/>300 training pairs]
Final --> Training[Training Data]
FinalLabels --> Training
Training --> MLTrain[Train LightGBM Model<br/>~5 seconds]
MLTrain --> Model[Trained Model<br/>1.8MB .pkl file]
style Discovery fill:#ff6b6b
style Consolidate fill:#ff6b6b
style Model fill:#4ec9b0
</pre>
</div>
<h2>8. Example Output</h2>
<div class="code-section">
<strong>discovered_categories (Dict[str, str]):</strong>
{
"Work": "daily business communication and coordination",
"Financial": "budgets, reports, financial planning",
"Meetings": "scheduling and meeting coordination",
"Technical": "system issues and technical discussions",
"Requests": "action items and requests for information",
"Reports": "status reports and summaries",
"Administrative": "HR, policies, company announcements",
"Urgent": "time-sensitive matters",
"Conversational": "casual check-ins and social",
"External": "communication with external partners"
}
<strong>sample_labels (List[Tuple[str, str]]):</strong>
[
("maildir_allen-p__sent_mail_1", "Financial"),
("maildir_allen-p__sent_mail_2", "Work"),
("maildir_allen-p__sent_mail_3", "Meetings"),
("maildir_allen-p__sent_mail_4", "Work"),
("maildir_allen-p__sent_mail_5", "Financial"),
... (300 total)
]
</div>
<h2>9. Why Batching is Critical</h2>
<table class="timing-table">
<tr>
<th>Approach</th>
<th>LLM Calls</th>
<th>Time/Call</th>
<th>Total Time</th>
<th>Quality</th>
</tr>
<tr>
<td><strong>Sequential (1 email/call)</strong></td>
<td>300</td>
<td>3 sec</td>
<td>900 sec (15 min)</td>
<td>Poor - no context</td>
</tr>
<tr>
<td><strong>Small batches (5 emails/call)</strong></td>
<td>60</td>
<td>8 sec</td>
<td>480 sec (8 min)</td>
<td>Fair - limited context</td>
</tr>
<tr>
<td><strong>Current (20 emails/call)</strong></td>
<td>15</td>
<td>20 sec</td>
<td>300 sec (5 min)</td>
<td>Good - sufficient context</td>
</tr>
<tr>
<td><strong>Large batches (50 emails/call)</strong></td>
<td>6</td>
<td>45 sec</td>
<td>270 sec (4.5 min)</td>
<td>Risk - may exceed token limits</td>
</tr>
</table>
<div class="warning">
<h3>Why 20 emails per batch?</h3>
<ul>
<li><strong>Token limit:</strong> 20 emails × ~150 tokens/email = ~3000 tokens input, well under 8K limit</li>
<li><strong>Context window:</strong> LLM can see patterns across multiple emails</li>
<li><strong>Speed:</strong> Minimizes API calls while staying within limits</li>
<li><strong>Quality:</strong> Enough examples to identify patterns, not so many that it gets confused</li>
</ul>
</div>
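<p>A quick back-of-envelope check of that budget, treating ~150 tokens per email summary, an assumed ~500 tokens of prompt overhead, and an 8K context as the working assumptions from the list above:</p>
<div class="code-section">
<pre>
emails_per_batch = 20
tokens_per_summary = 150      # rough estimate per email summary
prompt_overhead = 500         # instructions, statistics block, JSON schema (assumed)

total = emails_per_batch * tokens_per_summary + prompt_overhead
print(total, "prompt tokens;",
      "too large" if total + 2000 > 8192 else "fits",
      "with the 2000-token output budget in an 8K context")
</pre>
</div>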
<h2>10. Configuration Parameters</h2>
<table class="timing-table">
<tr>
<th>Parameter</th>
<th>Location</th>
<th>Default</th>
<th>Effect on Timing</th>
</tr>
<tr>
<td>sample_size</td>
<td>CalibrationConfig</td>
<td>300</td>
<td>300 samples = 15 batches = 5 min</td>
</tr>
<tr>
<td>batch_size</td>
<td>llm_analyzer.py:62</td>
<td>20</td>
<td>Hardcoded - affects batch count</td>
</tr>
<tr>
<td>llm_batch_size</td>
<td>CalibrationConfig</td>
<td>50</td>
<td>NOT USED for discovery (misleading name)</td>
</tr>
<tr>
<td>temperature</td>
<td>LLM call</td>
<td>0.1</td>
<td>Lower = faster, more deterministic</td>
</tr>
<tr>
<td>max_tokens</td>
<td>LLM call</td>
<td>2000</td>
<td>Higher = potentially slower response</td>
</tr>
</table>
<h2>11. Full Calibration Timeline</h2>
<div class="diagram">
<pre class="mermaid">
gantt
title Calibration Phase Timeline (300 samples, 10k total emails)
dateFormat mm:ss
axisFormat %M:%S
section Sampling
Stratified sample (3% of 10k) :00:00, 01s
section Category Discovery
Batch 1 (emails 1-20) :00:01, 20s
Batch 2 (emails 21-40) :00:21, 20s
Batch 3 (emails 41-60) :00:41, 20s
Batch 4-13 (emails 61-260) :01:01, 200s
Batch 14 (emails 261-280) :04:21, 20s
Batch 15 (emails 281-300) :04:41, 20s
section Consolidation
LLM category merge :05:01, 05s
Cache snap :05:06, 00.5s
section ML Training
Feature extraction (300) :05:07, 06s
LightGBM training :05:13, 05s
Validation (100 emails) :05:18, 02s
Save model to disk :05:20, 00.5s
</pre>
</div>
<h2>12. Key Insights</h2>
<div class="critical">
<h3>1. Labels are NOT created sequentially</h3>
<p>The LLM creates labels as a byproduct of batch category discovery. There is NO separate "label each email one by one" phase.</p>
</div>
<div class="critical">
<h3>2. Batching is the optimization</h3>
<p>Processing 20 emails in a single LLM call (20 sec) is 3× faster than 20 individual calls (60 sec total).</p>
</div>
<div class="critical">
<h3>3. LLM time dominates everything</h3>
<p>98% of labeling phase time is LLM API calls. Everything else (parsing, merging, caching) is negligible.</p>
</div>
<div class="critical">
<h3>4. Consolidation is cheap</h3>
<p>Merging 30-40 raw categories into 10-12 final ones takes only ~5 seconds with a single LLM call.</p>
</div>
<h2>13. Optimization Opportunities</h2>
<table class="timing-table">
<tr>
<th>Optimization</th>
<th>Current</th>
<th>Potential</th>
<th>Tradeoff</th>
</tr>
<tr>
<td>Increase batch size</td>
<td>20 emails/batch</td>
<td>30-40 emails/batch</td>
<td>May hit token limits, slower per call</td>
</tr>
<tr>
<td>Reduce sample size</td>
<td>300 samples (3%)</td>
<td>200 samples (2%)</td>
<td>Less training data, potentially worse model</td>
</tr>
<tr>
<td>Parallel batching</td>
<td>Sequential 15 batches</td>
<td>3-5 concurrent batches</td>
<td>Requires async LLM client, more complex</td>
</tr>
<tr>
<td>Skip consolidation</td>
<td>Always consolidate if >10 cats</td>
<td>Skip if <15 cats</td>
<td>May leave duplicate categories</td>
</tr>
<tr>
<td>Cache-first approach</td>
<td>Discover then snap to cache</td>
<td>Snap to cache, only discover new</td>
<td>Less adaptive to new mailbox types</td>
</tr>
</table>
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
},
gantt: {
useWidth: 1200
}
});
</script>
</body>
</html>


@ -0,0 +1,648 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Email Sorter - Project Status & Next Steps</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.success {
background: #002a00;
border-left: 4px solid #4ec9b0;
padding: 15px;
margin: 10px 0;
}
.section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #569cd6;
}
table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.mvp-proven {
background: #003a00;
border: 3px solid #4ec9b0;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
text-align: center;
}
.mvp-proven h2 {
font-size: 2em;
margin: 0;
}
</style>
</head>
<body>
<div class="mvp-proven">
<h2>🎉 MVP PROVEN AND WORKING 🎉</h2>
<p style="font-size: 1.2em; margin: 10px 0;">
<strong>10,000 emails classified in 4 minutes</strong><br/>
72.7% accuracy | 0 LLM calls | Pure ML speed
</p>
</div>
<h1>Email Sorter - Project Status & Next Steps</h1>
<h2>✅ What We've Achieved (MVP Complete)</h2>
<div class="success">
<h3>Core System Working</h3>
<ul>
<li><strong>LLM-Driven Calibration:</strong> Discovers categories from email samples (11 categories found)</li>
<li><strong>ML Model Training:</strong> LightGBM trained on 10k emails (1.8MB model)</li>
<li><strong>Fast Classification:</strong> 10k emails in ~4 minutes with --no-llm-fallback</li>
<li><strong>Category Verification:</strong> Single LLM call validates model fit for new mailboxes</li>
<li><strong>Embedding-Based Features:</strong> Universal 384-dim embeddings transfer across mailboxes</li>
<li><strong>Threshold Optimization:</strong> 0.55 threshold reduces LLM fallback by 40%</li>
</ul>
</div>
<h2>📊 Test Results Summary</h2>
<table>
<tr>
<th>Metric</th>
<th>Result</th>
<th>Status</th>
</tr>
<tr>
<td>Total emails processed</td>
<td>10,000</td>
<td></td>
</tr>
<tr>
<td>Processing time</td>
<td>~4 minutes</td>
<td></td>
</tr>
<tr>
<td>ML classification rate</td>
<td>78.4%</td>
<td></td>
</tr>
<tr>
<td>LLM calls (with --no-llm-fallback)</td>
<td>0</td>
<td></td>
</tr>
<tr>
<td>Accuracy estimate</td>
<td>72.7%</td>
<td>✅ (acceptable for speed)</td>
</tr>
<tr>
<td>Categories discovered</td>
<td>11 (Work, Financial, Updates, etc.)</td>
<td></td>
</tr>
<tr>
<td>Model size</td>
<td>1.8MB</td>
<td>✅ (portable)</td>
</tr>
</table>
<h2>🗂️ Project Organization</h2>
<h3>Core Modules</h3>
<table>
<tr>
<th>Module</th>
<th>Purpose</th>
<th>Status</th>
</tr>
<tr>
<td><code>src/cli.py</code></td>
<td>Main CLI with all flags (--verify-categories, --no-llm-fallback)</td>
<td>✅ Complete</td>
</tr>
<tr>
<td><code>src/calibration/workflow.py</code></td>
<td>LLM-driven category discovery + training</td>
<td>✅ Complete</td>
</tr>
<tr>
<td><code>src/calibration/llm_analyzer.py</code></td>
<td>Batch LLM analysis (20 emails/call)</td>
<td>✅ Complete</td>
</tr>
<tr>
<td><code>src/calibration/category_verifier.py</code></td>
<td>Single LLM call to verify categories</td>
<td>✅ New feature</td>
</tr>
<tr>
<td><code>src/classification/ml_classifier.py</code></td>
<td>LightGBM model wrapper</td>
<td>✅ Complete</td>
</tr>
<tr>
<td><code>src/classification/adaptive_classifier.py</code></td>
<td>Rule → ML → LLM orchestrator</td>
<td>✅ Complete</td>
</tr>
<tr>
<td><code>src/classification/feature_extractor.py</code></td>
<td>Embeddings (384-dim) + TF-IDF</td>
<td>✅ Complete</td>
</tr>
</table>
<h3>Models & Data</h3>
<table>
<tr>
<th>Asset</th>
<th>Location</th>
<th>Status</th>
</tr>
<tr>
<td>Trained model</td>
<td><code>src/models/calibrated/classifier.pkl</code></td>
<td>✅ 1.8MB, 11 categories</td>
</tr>
<tr>
<td>Pretrained copy</td>
<td><code>src/models/pretrained/classifier.pkl</code></td>
<td>✅ Ready for fast load</td>
</tr>
<tr>
<td>Category cache</td>
<td><code>src/models/category_cache.json</code></td>
<td>✅ 10 cached categories</td>
</tr>
<tr>
<td>Test results</td>
<td><code>test/results.json</code></td>
<td>✅ 10k classifications</td>
</tr>
</table>
<h3>Documentation</h3>
<table>
<tr>
<th>Document</th>
<th>Purpose</th>
</tr>
<tr>
<td><code>SYSTEM_FLOW.html</code></td>
<td>Complete system flow diagrams with timing</td>
</tr>
<tr>
<td><code>LABEL_TRAINING_PHASE_DETAIL.html</code></td>
<td>Deep dive into calibration phase</td>
</tr>
<tr>
<td><code>FAST_ML_ONLY_WORKFLOW.html</code></td>
<td>Pure ML workflow analysis</td>
</tr>
<tr>
<td><code>VERIFY_CATEGORIES_FEATURE.html</code></td>
<td>Category verification documentation</td>
</tr>
<tr>
<td><code>PROJECT_STATUS_AND_NEXT_STEPS.html</code></td>
<td>This document - status and roadmap</td>
</tr>
</table>
<h2>🎯 Next Steps (Priority Order)</h2>
<h3>Phase 1: Clean Up & Organize (Next Session)</h3>
<div class="section">
<h4>1.1 Clean Root Directory</h4>
<p><strong>Goal:</strong> Move test artifacts and scripts to organized locations</p>
<ul>
<li>Create <code>docs/</code> folder - move all .html files there</li>
<li>Create <code>scripts/</code> folder - move all .sh files there</li>
<li>Create <code>logs/</code> folder - move all .log files there</li>
<li>Delete debug files (debug_*.txt, spot_check_results.txt)</li>
<li>Create .gitignore for logs/, results/, test/, ml_only_test/, etc.</li>
</ul>
<p><strong>Time:</strong> 10 minutes</p>
</div>
<div class="section">
<h4>1.2 Create README.md</h4>
<p><strong>Goal:</strong> Professional project documentation</p>
<ul>
<li>Overview of system architecture</li>
<li>Quick start guide</li>
<li>Usage examples (with/without calibration, with/without verification)</li>
<li>Performance benchmarks (from our tests)</li>
<li>Configuration options</li>
</ul>
<p><strong>Time:</strong> 30 minutes</p>
</div>
<div class="section">
<h4>1.3 Add Tests</h4>
<p><strong>Goal:</strong> Ensure code quality and catch regressions</p>
<ul>
<li>Unit tests for feature extraction</li>
<li>Unit tests for category verification</li>
<li>Integration test for full pipeline</li>
<li>Test for --no-llm-fallback flag</li>
<li>Test for --verify-categories flag</li>
</ul>
<p><strong>Time:</strong> 2 hours</p>
</div>
<h3>Phase 2: Real-World Integration (Week 1-2)</h3>
<div class="section">
<h4>2.1 Gmail Provider Implementation</h4>
<p><strong>Goal:</strong> Connect to real Gmail accounts</p>
<ul>
<li>Implement Gmail API authentication (OAuth2)</li>
<li>Fetch emails with pagination</li>
<li>Handle Gmail-specific metadata (labels, threads)</li>
<li>Test with personal Gmail account</li>
</ul>
<p><strong>Time:</strong> 4-6 hours</p>
</div>
<div class="section">
<h4>2.2 IMAP Provider Implementation</h4>
<p><strong>Goal:</strong> Support any email provider (Outlook, custom servers)</p>
<ul>
<li>IMAP connection handling</li>
<li>SSL/TLS support</li>
<li>Folder navigation</li>
<li>Test with Outlook/Protonmail</li>
</ul>
<p><strong>Time:</strong> 3-4 hours</p>
</div>
<div class="section">
<h4>2.3 Email Syncing (Apply Classifications)</h4>
<p><strong>Goal:</strong> Move/label emails based on classification</p>
<ul>
<li>Gmail: Apply labels to emails</li>
<li>IMAP: Move emails to folders</li>
<li>Dry-run mode (preview without applying)</li>
<li>Batch operations for speed</li>
<li>Rollback capability</li>
</ul>
<p><strong>Time:</strong> 6-8 hours</p>
</div>
<h3>Phase 3: Production Features (Week 3-4)</h3>
<div class="section">
<h4>3.1 Incremental Classification</h4>
<p><strong>Goal:</strong> Only classify new emails, not entire inbox</p>
<ul>
<li>Track last processed email ID</li>
<li>Resume from checkpoint</li>
<li>Database/file-based state tracking</li>
<li>Scheduled runs (cron integration)</li>
</ul>
<p><strong>Time:</strong> 4-6 hours</p>
</div>
<div class="section">
<h4>3.2 Multi-Account Support</h4>
<p><strong>Goal:</strong> Manage multiple email accounts</p>
<ul>
<li>Per-account configuration</li>
<li>Per-account trained models</li>
<li>Account switching CLI</li>
<li>Shared category cache across accounts</li>
</ul>
<p><strong>Time:</strong> 3-4 hours</p>
</div>
<div class="section">
<h4>3.3 Model Management</h4>
<p><strong>Goal:</strong> Handle model lifecycle</p>
<ul>
<li>Model versioning (timestamps)</li>
<li>Model comparison (A/B testing)</li>
<li>Model export/import</li>
<li>Retraining scheduler</li>
<li>Model degradation detection</li>
</ul>
<p><strong>Time:</strong> 4-5 hours</p>
</div>
<h3>Phase 4: Advanced Features (Month 2)</h3>
<div class="section">
<h4>4.1 Web Dashboard</h4>
<p><strong>Goal:</strong> Visual interface for monitoring and management</p>
<ul>
<li>Flask/FastAPI backend</li>
<li>React/Vue frontend</li>
<li>View classification results</li>
<li>Manually correct classifications (feedback loop)</li>
<li>Monitor accuracy over time</li>
<li>Trigger recalibration</li>
</ul>
<p><strong>Time:</strong> 20-30 hours</p>
</div>
<div class="section">
<h4>4.2 Active Learning</h4>
<p><strong>Goal:</strong> Improve model from user corrections</p>
<ul>
<li>User feedback collection</li>
<li>Disagreement-based sampling (low confidence + user correction)</li>
<li>Incremental model updates</li>
<li>Feedback-driven category evolution</li>
</ul>
<p><strong>Time:</strong> 8-10 hours</p>
</div>
<div class="section">
<h4>4.3 Performance Optimization</h4>
<p><strong>Goal:</strong> Scale to 100k+ emails</p>
<ul>
<li>Batch embedding generation (reduce API calls)</li>
<li>Async/parallel classification</li>
<li>Model quantization (reduce size)</li>
<li>GPU acceleration for embeddings</li>
<li>Caching layer (Redis)</li>
</ul>
<p><strong>Time:</strong> 10-15 hours</p>
</div>
<h2>🔧 Immediate Action Items (This Week)</h2>
<table>
<tr>
<th>Task</th>
<th>Priority</th>
<th>Time</th>
<th>Status</th>
</tr>
<tr>
<td>Clean root directory - organize files</td>
<td>High</td>
<td>10 min</td>
<td>Pending</td>
</tr>
<tr>
<td>Create comprehensive README.md</td>
<td>High</td>
<td>30 min</td>
<td>Pending</td>
</tr>
<tr>
<td>Add .gitignore for test artifacts</td>
<td>High</td>
<td>5 min</td>
<td>Pending</td>
</tr>
<tr>
<td>Create setup.py for pip installation</td>
<td>Medium</td>
<td>20 min</td>
<td>Pending</td>
</tr>
<tr>
<td>Write basic unit tests</td>
<td>Medium</td>
<td>2 hours</td>
<td>Pending</td>
</tr>
<tr>
<td>Test Gmail provider (basic fetch)</td>
<td>Medium</td>
<td>2 hours</td>
<td>Pending</td>
</tr>
</table>
<h2>📈 Success Metrics</h2>
<div class="diagram">
<pre class="mermaid">
flowchart LR
MVP[MVP Proven] --> P1[Phase 1: Organization]
P1 --> P2[Phase 2: Integration]
P2 --> P3[Phase 3: Production]
P3 --> P4[Phase 4: Advanced]
P1 --> M1[Metric: Clean codebase<br/>100% docs coverage]
P2 --> M2[Metric: Real email support<br/>Gmail + IMAP working]
P3 --> M3[Metric: Daily automation<br/>Incremental processing]
P4 --> M4[Metric: User adoption<br/>10+ users, 90%+ satisfaction]
style MVP fill:#4ec9b0
style P1 fill:#569cd6
style P2 fill:#569cd6
style P3 fill:#569cd6
style P4 fill:#569cd6
</pre>
</div>
<h2>🚀 Quick Start Commands</h2>
<div class="section">
<h3>Train New Model (Full Calibration)</h3>
<code>
source venv/bin/activate<br/>
python -m src.cli run \<br/>
&nbsp;&nbsp;--source enron \<br/>
&nbsp;&nbsp;--limit 10000 \<br/>
&nbsp;&nbsp;--output results/<br/>
</code>
<p><strong>Time:</strong> ~25 minutes | <strong>LLM calls:</strong> ~500 | <strong>Accuracy:</strong> 92-95%</p>
</div>
<div class="section">
<h3>Fast ML-Only Classification (Existing Model)</h3>
<code>
source venv/bin/activate<br/>
python -m src.cli run \<br/>
&nbsp;&nbsp;--source enron \<br/>
&nbsp;&nbsp;--limit 10000 \<br/>
&nbsp;&nbsp;--output fast_test/ \<br/>
&nbsp;&nbsp;--no-llm-fallback<br/>
</code>
<p><strong>Time:</strong> ~4 minutes | <strong>LLM calls:</strong> 0 | <strong>Accuracy:</strong> 72-78%</p>
</div>
<div class="section">
<h3>ML with Category Verification (Recommended)</h3>
<code>
source venv/bin/activate<br/>
python -m src.cli run \<br/>
&nbsp;&nbsp;--source enron \<br/>
&nbsp;&nbsp;--limit 10000 \<br/>
&nbsp;&nbsp;--output verified_test/ \<br/>
&nbsp;&nbsp;--no-llm-fallback \<br/>
&nbsp;&nbsp;--verify-categories<br/>
</code>
<p><strong>Time:</strong> ~4.5 minutes | <strong>LLM calls:</strong> 1 | <strong>Accuracy:</strong> 72-78%</p>
</div>
<h2>📁 Recommended Project Structure (After Cleanup)</h2>
<pre style="background: #252526; padding: 15px; border-radius: 5px; font-family: monospace;">
email-sorter/
├── README.md # Main documentation
├── setup.py # Pip installation
├── requirements.txt # Dependencies
├── .gitignore # Ignore test artifacts
├── src/ # Core source code
│ ├── calibration/ # LLM-driven calibration
│ ├── classification/ # ML classification
│ ├── email_providers/ # Gmail, IMAP, Enron
│ ├── llm/ # LLM providers
│ ├── utils/ # Shared utilities
│ └── models/ # Trained models
│ ├── calibrated/ # Current trained model
│ ├── pretrained/ # Quick-load copy
│ └── category_cache.json
├── config/ # Configuration files
│ ├── default_config.yaml
│ └── categories.yaml
├── tests/ # Unit & integration tests
│ ├── test_calibration.py
│ ├── test_classification.py
│ └── test_verification.py
├── scripts/ # Helper scripts
│ ├── train_model.sh
│ ├── fast_classify.sh
│ └── verify_and_classify.sh
├── docs/ # HTML documentation
│ ├── SYSTEM_FLOW.html
│ ├── LABEL_TRAINING_PHASE_DETAIL.html
│ ├── FAST_ML_ONLY_WORKFLOW.html
│ └── VERIFY_CATEGORIES_FEATURE.html
├── logs/ # Runtime logs (gitignored)
│ └── *.log
└── results/ # Test results (gitignored)
└── *.json
</pre>
<h2>🎓 Key Learnings</h2>
<div class="section">
<ul>
<li><strong>Embeddings are universal:</strong> Same model works across different mailboxes</li>
<li><strong>Batching is critical:</strong> 20 emails/LLM call = 3× faster than sequential</li>
<li><strong>Thresholds matter:</strong> 0.55 threshold reduces LLM usage by 40%</li>
<li><strong>Category verification adds value:</strong> a single 20-second LLM check before bulk classification is worth the cost</li>
<li><strong>Pure ML is viable:</strong> 73% accuracy with 0 LLM calls for speed tests</li>
<li><strong>LLM-driven calibration works:</strong> Discovers natural categories without hardcoding</li>
</ul>
</div>
<h2>✅ Ready for Production?</h2>
<table>
<tr>
<th>Component</th>
<th>Status</th>
<th>Blocker</th>
</tr>
<tr>
<td>Core ML Pipeline</td>
<td>✅ Ready</td>
<td>None</td>
</tr>
<tr>
<td>LLM Calibration</td>
<td>✅ Ready</td>
<td>None</td>
</tr>
<tr>
<td>Category Verification</td>
<td>✅ Ready</td>
<td>None</td>
</tr>
<tr>
<td>Fast ML-Only Mode</td>
<td>✅ Ready</td>
<td>None</td>
</tr>
<tr>
<td>Enron Provider</td>
<td>✅ Ready</td>
<td>None (test only)</td>
</tr>
<tr>
<td>Gmail Provider</td>
<td>⚠️ Needs implementation</td>
<td>OAuth2 + API calls</td>
</tr>
<tr>
<td>IMAP Provider</td>
<td>⚠️ Needs implementation</td>
<td>IMAP library integration</td>
</tr>
<tr>
<td>Email Syncing</td>
<td>❌ Not implemented</td>
<td>Apply labels/move emails</td>
</tr>
<tr>
<td>Tests</td>
<td>⚠️ Minimal coverage</td>
<td>Need comprehensive tests</td>
</tr>
<tr>
<td>Documentation</td>
<td>✅ Excellent</td>
<td>Need README.md</td>
</tr>
</table>
<p><strong>Verdict:</strong> MVP is production-ready for <em>Enron dataset testing</em>. Need Gmail/IMAP providers for real-world use.</p>
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>

319
docs/ROOT_CAUSE_ANALYSIS.md Normal file
View File

@ -0,0 +1,319 @@
# Root Cause Analysis: Category Explosion & Over-Confidence
**Date:** 2025-10-24
**Run:** 100k emails, qwen3:4b model
**Issue:** Model trained on 29 categories instead of expected 11, with extreme over-confidence
---
## Executive Summary
The 100k classification run technically succeeded (92.1% accuracy estimate) but revealed critical architectural issues:
1. **Category Explosion:** 29 training categories vs expected 11
2. **Duplicate Categories:** Work/work, Administrative/auth, finance/Financial
3. **Extreme Over-Confidence:** 99%+ classifications at 1.0 confidence
4. **Category Leakage:** Hardcoded categories leaked into LLM-discovered categories
---
## The Bug
### Location
[src/calibration/workflow.py:110](src/calibration/workflow.py#L110)
```python
all_categories = list(set(self.categories) | set(discovered_categories.keys()) | label_categories)
```
### What Happened
The workflow merges THREE category sources:
1. **`self.categories`** - 12 hardcoded categories from `config/categories.yaml`:
- junk, transactional, auth, newsletters, social, automated
- conversational, work, personal, finance, travel, unknown
2. **`discovered_categories.keys()`** - 11 LLM-discovered categories:
- Work, Financial, Administrative, Operational, Meeting
- Technical, External, Announcements, Urgent, Miscellaneous, Forwarded
3. **`label_categories`** - Additional categories from LLM labels:
- Bowl Pool 2000, California Market, Prehearing, Change, Monitoring
- Information
### Result: 29 Total Categories
```
1. Administrative (LLM discovered)
2. Announcements (LLM discovered)
3. Bowl Pool 2000 (LLM label - weird)
4. California Market (LLM label - too specific)
5. Change (LLM label - vague)
6. External (LLM discovered)
7. Financial (LLM discovered)
8. Forwarded (LLM discovered)
9. Information (LLM label - vague)
10. Meeting (LLM discovered)
11. Miscellaneous (LLM discovered)
12. Monitoring (LLM label - too specific)
13. Operational (LLM discovered)
14. Prehearing (LLM label - too specific)
15. Technical (LLM discovered)
16. Urgent (LLM discovered)
17. Work (LLM discovered)
18. auth (hardcoded)
19. automated (hardcoded)
20. conversational (hardcoded)
21. finance (hardcoded)
22. junk (hardcoded)
23. newsletters (hardcoded)
24. personal (hardcoded)
25. social (hardcoded)
26. transactional (hardcoded)
27. travel (hardcoded)
28. unknown (hardcoded)
29. work (hardcoded)
```
### Duplicates Identified
- **Work (LLM) vs work (hardcoded)** - 14,223 vs 368 emails
- **Financial (LLM) vs finance (hardcoded)** - 5,943 vs 0 emails
- **Administrative (LLM) vs auth (hardcoded)** - 67,195 vs 37 emails
---
## Impact Analysis
### 1. Category Distribution (100k Results)
| Category | Count | Confidence | Source |
|----------|-------|------------|--------|
| Administrative | 67,195 | 1.000 | LLM discovered |
| Work | 14,223 | 1.000 | LLM discovered |
| Meeting | 7,785 | 1.000 | LLM discovered |
| Financial | 5,943 | 1.000 | LLM discovered |
| Operational | 3,274 | 1.000 | LLM discovered |
| junk | 394 | 0.960 | Hardcoded |
| work | 368 | 0.950 | Hardcoded |
| Miscellaneous | 238 | 1.000 | LLM discovered |
| Technical | 193 | 1.000 | LLM discovered |
| External | 137 | 1.000 | LLM discovered |
| transactional | 44 | 0.970 | Hardcoded |
| auth | 37 | 0.990 | Hardcoded |
| unknown | 23 | 0.500 | Hardcoded |
| Others | <20 each | Various | Mixed |
### 2. Extreme Over-Confidence
- **67,195 emails** classified as "Administrative" with **1.0 confidence**
- **99.9%** of all classifications have confidence >= 0.95
- This is unrealistic - suggests overfitting or poor calibration
### 3. Why It Still "Worked"
- LLM-discovered categories (uppercase) handled 99%+ of emails
- Hardcoded categories (lowercase) mostly unused except for rules
- Model learned both sets but strongly preferred LLM categories
- Enron dataset doesn't match hardcoded categories well
---
## Why This Happened
### Design Intent vs Reality
**Original Design:**
- Hardcoded categories in `categories.yaml` for rule-based matching
- LLM discovers NEW categories during calibration
- Merge both for flexible classification
**Reality:**
- Hardcoded categories leak into ML training
- Creates duplicate concepts (Work vs work)
- LLM labels include one-off categories (Bowl Pool 2000)
- No deduplication or conflict resolution
### The Workflow Path
```
1. CLI loads hardcoded categories from categories.yaml
→ ['junk', 'transactional', 'auth', ... 'work', 'finance', 'unknown']
2. Passes to CalibrationWorkflow.__init__(categories=...)
→ self.categories = list(categories.keys())
3. LLM discovers categories from emails
→ {'Work': 'business emails', 'Financial': 'budgets', ...}
4. Consolidation reduces duplicates (within LLM categories only)
→ But doesn't see hardcoded categories
5. Merge ALL sources at workflow.py:110
→ Hardcoded + Discovered + Label anomalies = 29 categories
6. Trainer learns all 29 categories
→ Model becomes confused but weights LLM categories heavily
```
---
## Spot-Check Findings
### High Confidence Samples (Correct)
**Sample 1:** "i'll get the movie and wine. my suggestion is something from central market"
- Classified: Administrative (1.0)
- **Assessment:** Questionable - looks more personal
**Sample 2:** "Can you spell S-N-O-O-T-Y?"
- Classified: Administrative (1.0)
- **Assessment:** Wrong - clearly conversational/personal
**Sample 3:** "MEETING TONIGHT - 6:00 pm Central Time at The Houstonian"
- Classified: Meeting (1.0)
- **Assessment:** Correct
### Low Confidence Samples (Unknown)
⚠️ **All low confidence samples classified as "unknown" (0.500)**
- These fell back to LLM
- LLM failed to classify (returned unknown)
- Actual content: Legitimate business emails about deferrals, power units
### Category Anomalies
**"California Market" (6 emails, 1.0 confidence)**
- Too specific - shouldn't be a standalone category
- Should be "Work" or "External"
**"Bowl Pool 2000" (exists in training set)**
- One-off event category
- Should never have been kept
---
## Performance Impact
### What Went Right
- **ML handled 99.1%** of emails (99,134 / 100,000)
- **Only 31 fell to LLM** (0.03%)
- Fast classification (~3 minutes for 100k)
- Discovered categories are semantically good
### What Went Wrong
- **Unrealistic confidence** - Almost everything is 1.0
- **Category pollution** - 29 instead of 11
- **Duplicates** - Work/work, finance/Financial
- **No calibration** - Model confidence not properly calibrated
- **Hardcoded categories unused** - 368 "work" vs 14,223 "Work"
---
## Root Causes
### 1. Architectural Confusion
**Two competing philosophies:**
- **Rule-based system:** Use hardcoded categories with pattern matching
- **LLM-driven system:** Discover categories from data
**Result:** They interfere with each other instead of complementing
### 2. Missing Deduplication
The merge at workflow.py:110 is a plain set union, with no:
- Case normalization
- Semantic similarity checking
- Conflict resolution
- Priority rules
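A sketch of what a safer merge could look like: case-folded names, a minimum-sample filter, and no hardcoded categories in the mix (the threshold of 10 is a placeholder):
```python
def merge_training_categories(discovered, label_counts, min_samples=10):
    """Case-insensitive merge of LLM-discovered categories and label categories."""
    counts = {}
    for name, count in label_counts.items():
        key = name.strip().title()                  # 'work' and 'Work' collapse to one key
        counts[key] = counts.get(key, 0) + count
    keep = {name.strip().title() for name in discovered}
    keep.update(key for key, count in counts.items() if count >= min_samples)  # drops one-offs like 'Bowl Pool 2000'
    return sorted(keep)
```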
### 3. No Consolidation Across Sources
The LLM consolidation step (line 91-100) only consolidates within discovered categories. It doesn't:
- Check against hardcoded categories
- Merge similar concepts
- Remove one-off labels
### 4. Poor Category Cache Design
The category cache (src/models/category_cache.json) saves LLM categories but:
- Doesn't deduplicate against hardcoded categories
- Allows case-sensitive duplicates
- No validation of category quality
---
## Recommendations
### Immediate Fixes
1. **Remove hardcoded categories from ML training**
- Use them ONLY for rule-based matching
- Don't merge into `all_categories` for training
- Let LLM discover all ML categories
2. **Add case-insensitive deduplication**
- Normalize to title case
- Check semantic similarity
- Merge duplicates before training
3. **Filter label anomalies**
- Reject categories with <10 training samples
- Reject overly specific categories (Bowl Pool 2000)
- LLM review step for quality
4. **Calibrate model confidence**
- Use temperature scaling or Platt scaling
- Ensure confidence reflects actual accuracy
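A minimal sketch of the confidence fix with scikit-learn's Platt scaling (variable names are placeholders; `cv="prefit"` assumes the LightGBM model is already fitted and a held-out labeled split exists):
```python
from sklearn.calibration import CalibratedClassifierCV

# trained_model: the fitted LightGBM classifier from the calibration phase
# X_holdout, y_holdout: labeled samples that were NOT used for training
calibrated = CalibratedClassifierCV(trained_model, method="sigmoid", cv="prefit")  # Platt scaling
calibrated.fit(X_holdout, y_holdout)
probs = calibrated.predict_proba(X_holdout)   # confidences should now track accuracy
```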
### Architecture Decision
**Option A: Rule-Based + ML (Current)**
- Keep hardcoded categories for RULES ONLY
- LLM discovers categories for ML ONLY
- Never merge the two
**Option B: Pure LLM Discovery (Recommended)**
- Remove categories.yaml entirely
- LLM discovers ALL categories
- Rules can still match on keywords but don't define categories
**Option C: Hybrid with Priority**
- Define 3-5 HIGH-PRIORITY hardcoded categories (junk, auth, transactional)
- Let LLM discover everything else
- Clear hierarchy: Rules → Hardcoded ML → Discovered ML
---
## Next Steps
1. **Decision:** Choose architecture (A, B, or C above)
2. **Fix workflow.py:110** - Implement chosen strategy
3. **Add deduplication logic** - Case-insensitive, semantic matching
4. **Rerun calibration** - Clean 250-sample run
5. **Validate results** - Ensure clean categories
6. **Fix confidence** - Add calibration layer
---
## Files to Modify
1. [src/calibration/workflow.py:110](src/calibration/workflow.py#L110) - Category merging logic
2. [src/calibration/llm_analyzer.py](src/calibration/llm_analyzer.py) - Add cross-source consolidation
3. [src/cli.py:70](src/cli.py#L70) - Decide whether to load hardcoded categories
4. [config/categories.yaml](config/categories.yaml) - Clarify purpose (rules only?)
5. [src/calibration/trainer.py](src/calibration/trainer.py) - Add confidence calibration
---
## Conclusion
The system technically worked - it classified 100k emails with high ML efficiency. However, the category explosion and over-confidence issues reveal fundamental architectural problems that need resolution before production use.
The core question: **Should hardcoded categories participate in ML training at all?**
My recommendation: **No.** Use them for rules only, let LLM discover ML categories cleanly.

493
docs/SYSTEM_FLOW.html Normal file
View File

@ -0,0 +1,493 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Email Sorter System Flow</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.timing-table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
.timing-table th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
.timing-table td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
.flag-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
</style>
</head>
<body>
<h1>Email Sorter System Flow Documentation</h1>
<h2>1. Main Execution Flow</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([python -m src.cli run]) --> LoadConfig[Load config/default_config.yaml]
LoadConfig --> InitProviders[Initialize Email Provider<br/>Enron/Gmail/IMAP]
InitProviders --> FetchEmails[Fetch Emails<br/>--limit N]
FetchEmails --> CheckSize{Email Count?}
CheckSize -->|"< 1000"| SetMockMode[Set ml_classifier.is_mock = True<br/>LLM-only mode]
CheckSize -->|">= 1000"| CheckModel{Model Exists?}
CheckModel -->|No model at<br/>src/models/pretrained/classifier.pkl| RunCalibration[CALIBRATION PHASE<br/>LLM category discovery<br/>Train ML model]
CheckModel -->|Model exists| SkipCalibration[Skip Calibration<br/>Load existing model]
SetMockMode --> SkipCalibration
RunCalibration --> ClassifyPhase[CLASSIFICATION PHASE]
SkipCalibration --> ClassifyPhase
ClassifyPhase --> Loop{For each email}
Loop --> RuleCheck{Hard rule match?}
RuleCheck -->|Yes| RuleClassify[Category by rule<br/>confidence=1.0<br/>method='rule']
RuleCheck -->|No| MLClassify[ML Classification<br/>Get category + confidence]
MLClassify --> ConfCheck{Confidence >= threshold?}
ConfCheck -->|Yes| AcceptML[Accept ML result<br/>method='ml'<br/>needs_review=False]
ConfCheck -->|No| LowConf[Low confidence detected<br/>needs_review=True]
LowConf --> FlagCheck{--no-llm-fallback?}
FlagCheck -->|Yes| AcceptMLAnyway[Accept ML anyway<br/>needs_review=False]
FlagCheck -->|No| LLMCheck{LLM available?}
LLMCheck -->|Yes| LLMReview[LLM Classification<br/>~4 seconds<br/>method='llm']
LLMCheck -->|No| AcceptMLAnyway
RuleClassify --> NextEmail{More emails?}
AcceptML --> NextEmail
AcceptMLAnyway --> NextEmail
LLMReview --> NextEmail
NextEmail -->|Yes| Loop
NextEmail -->|No| SaveResults[Save results.json]
SaveResults --> End([Complete])
style RunCalibration fill:#ff6b6b
style LLMReview fill:#ff6b6b
style SetMockMode fill:#ffd93d
style FlagCheck fill:#4ec9b0
style AcceptMLAnyway fill:#4ec9b0
</pre>
</div>
<h2>2. Calibration Phase Detail (When Triggered)</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Calibration Triggered]) --> Sample[Stratified Sampling<br/>3% of emails<br/>min 250, max 1500]
Sample --> LLMBatch[LLM Category Discovery<br/>50 emails per batch]
LLMBatch --> Batch1[Batch 1: 50 emails<br/>~20 seconds]
Batch1 --> Batch2[Batch 2: 50 emails<br/>~20 seconds]
Batch2 --> BatchN[... N batches<br/>For 300 samples: 6 batches]
BatchN --> Consolidate[LLM Consolidation<br/>Merge similar categories<br/>~5 seconds]
Consolidate --> Categories[Final Categories<br/>~10-12 unique categories]
Categories --> Label[Label Training Emails<br/>LLM labels each sample<br/>~3 seconds per email]
Label --> Extract[Feature Extraction<br/>Embeddings + TF-IDF<br/>~0.02 seconds per email]
Extract --> Train[Train LightGBM Model<br/>~5 seconds total]
Train --> Validate[Validate on 100 samples<br/>~2 seconds]
Validate --> Save[Save Model<br/>src/models/calibrated/classifier.pkl]
Save --> End([Calibration Complete<br/>Total time: 15-25 minutes for 10k emails])
style LLMBatch fill:#ff6b6b
style Label fill:#ff6b6b
style Consolidate fill:#ff6b6b
style Train fill:#4ec9b0
</pre>
</div>
<h2>3. Classification Phase Detail</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Classification Phase]) --> Email[Get Email]
Email --> Rules{Check Hard Rules<br/>Pattern matching}
Rules -->|Match| RuleDone[Rule Match<br/>~0.001 seconds<br/>59 of 10000 emails]
Rules -->|No match| Embed[Generate Embedding<br/>all-minilm:l6-v2<br/>~0.02 seconds]
Embed --> TFIDF[TF-IDF Features<br/>~0.001 seconds]
TFIDF --> MLPredict[ML Prediction<br/>LightGBM<br/>~0.003 seconds]
MLPredict --> Threshold{Confidence >= 0.55?}
Threshold -->|Yes| MLDone[ML Classification<br/>7842 of 10000 emails<br/>78.4%]
Threshold -->|No| Flag{--no-llm-fallback?}
Flag -->|Yes| MLForced[Force ML result<br/>No LLM call]
Flag -->|No| LLM[LLM Classification<br/>~4 seconds<br/>2099 of 10000 emails<br/>21%]
RuleDone --> Next([Next Email])
MLDone --> Next
MLForced --> Next
LLM --> Next
style LLM fill:#ff6b6b
style MLDone fill:#4ec9b0
style MLForced fill:#ffd93d
</pre>
</div>
<h2>4. Model Loading Logic</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([MLClassifier.__init__]) --> CheckPath{model_path provided?}
CheckPath -->|Yes| UsePath[Use provided path]
CheckPath -->|No| Default[Default:<br/>src/models/pretrained/classifier.pkl]
UsePath --> FileCheck{File exists?}
Default --> FileCheck
FileCheck -->|Yes| Load[Load pickle file]
FileCheck -->|No| CreateMock[Create MOCK model<br/>Random Forest<br/>12 hardcoded categories]
Load --> ValidCheck{Valid model data?}
ValidCheck -->|Yes| CheckMock{is_mock flag?}
ValidCheck -->|No| CreateMock
CheckMock -->|True| WarnMock[Warn: MOCK model active]
CheckMock -->|False| RealModel[Real trained model loaded]
CreateMock --> MockWarnings[Multiple warnings printed<br/>NOT for production]
WarnMock --> Ready[Model Ready]
RealModel --> Ready
MockWarnings --> Ready
Ready --> End([Classification can start])
style CreateMock fill:#ff6b6b
style RealModel fill:#4ec9b0
style WarnMock fill:#ffd93d
</pre>
</div>
<h2>5. Flag Conditions & Effects</h2>
<div class="flag-section">
<h3>--no-llm-fallback</h3>
<p><strong>Location:</strong> src/cli.py:46, src/classification/adaptive_classifier.py:152-161</p>
<p><strong>Effect:</strong> When ML confidence < threshold, accept ML result anyway instead of calling LLM</p>
<p><strong>Use case:</strong> Test pure ML performance, avoid LLM costs</p>
<p><strong>Code path:</strong></p>
<code>
if self.disable_llm_fallback:<br/>
&nbsp;&nbsp;# Just return ML result without LLM fallback<br/>
&nbsp;&nbsp;return ClassificationResult(needs_review=False)
</code>
</div>
<div class="flag-section">
<h3>--limit N</h3>
<p><strong>Location:</strong> src/cli.py:38</p>
<p><strong>Effect:</strong> Limits number of emails fetched from source</p>
<p><strong>Calibration trigger:</strong> If N < 1000, forces LLM-only mode (no ML training)</p>
<p><strong>Code path:</strong></p>
<code>
if total_emails < 1000:<br/>
&nbsp;&nbsp;ml_classifier.is_mock = True # Skip ML, use LLM only
</code>
</div>
<div class="flag-section">
<h3>Model Path Override</h3>
<p><strong>Location:</strong> src/classification/ml_classifier.py:43</p>
<p><strong>Default:</strong> src/models/pretrained/classifier.pkl</p>
<p><strong>Calibration saves to:</strong> src/models/calibrated/classifier.pkl</p>
<p><strong>Problem:</strong> Calibration saves to different location than default load location</p>
<p><strong>Solution:</strong> Copy calibrated model to pretrained location OR pass model_path parameter</p>
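<p><strong>Sketch of the second option</strong> (a minimal example, assuming the <code>model_path</code> constructor parameter shown in section 4):</p>
<code>
from src.classification.ml_classifier import MLClassifier<br/>
<br/>
classifier = MLClassifier(model_path="src/models/calibrated/classifier.pkl")<br/>
</code>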
</div>
<h2>6. Timing Breakdown (10,000 emails)</h2>
<table class="timing-table">
<tr>
<th>Phase</th>
<th>Operation</th>
<th>Time per Email</th>
<th>Total Time (10k)</th>
<th>LLM Required?</th>
</tr>
<tr>
<td rowspan="6"><strong>Calibration</strong><br/>(if model doesn't exist)</td>
<td>Stratified sampling (300 emails)</td>
<td>-</td>
<td>~1 second</td>
<td>No</td>
</tr>
<tr>
<td>LLM category discovery (6 batches)</td>
<td>~0.4 sec/email</td>
<td>~2 minutes</td>
<td>YES</td>
</tr>
<tr>
<td>LLM consolidation</td>
<td>-</td>
<td>~5 seconds</td>
<td>YES</td>
</tr>
<tr>
<td>LLM labeling (300 samples)</td>
<td>~3 sec/email</td>
<td>~15 minutes</td>
<td>YES</td>
</tr>
<tr>
<td>Feature extraction (300 samples)</td>
<td>~0.02 sec/email</td>
<td>~6 seconds</td>
<td>No (embeddings)</td>
</tr>
<tr>
<td>Model training (LightGBM)</td>
<td>-</td>
<td>~5 seconds</td>
<td>No</td>
</tr>
<tr>
<td colspan="3"><strong>CALIBRATION TOTAL</strong></td>
<td><strong>~17-20 minutes</strong></td>
<td><strong>YES</strong></td>
</tr>
<tr>
<td rowspan="5"><strong>Classification</strong><br/>(with model)</td>
<td>Hard rule matching</td>
<td>~0.001 sec</td>
<td>~10 seconds (all 10k)</td>
<td>No</td>
</tr>
<tr>
<td>Embedding generation</td>
<td>~0.02 sec</td>
<td>~200 seconds (all 10k)</td>
<td>No (Ollama embed)</td>
</tr>
<tr>
<td>ML prediction</td>
<td>~0.003 sec</td>
<td>~30 seconds (all 10k)</td>
<td>No</td>
</tr>
<tr>
<td>LLM fallback (21% of emails)</td>
<td>~4 sec/email</td>
<td>~140 minutes (2100 emails)</td>
<td>YES</td>
</tr>
<tr>
<td>Saving results</td>
<td>-</td>
<td>~1 second</td>
<td>No</td>
</tr>
<tr>
<td colspan="3"><strong>CLASSIFICATION TOTAL (with LLM fallback)</strong></td>
<td><strong>~2.5 hours</strong></td>
<td><strong>YES (21%)</strong></td>
</tr>
<tr>
<td colspan="3"><strong>CLASSIFICATION TOTAL (--no-llm-fallback)</strong></td>
<td><strong>~4 minutes</strong></td>
<td><strong>No</strong></td>
</tr>
</table>
<h2>7. Why LLM Still Loads</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([CLI startup]) --> Always1[ALWAYS: Load LLM provider<br/>src/cli.py:98-117]
Always1 --> Reason1[Reason: Needed for calibration<br/>if model doesn't exist]
Reason1 --> Check{Model exists?}
Check -->|No| NeedLLM1[LLM required for calibration<br/>Category discovery<br/>Sample labeling]
Check -->|Yes| SkipCal[Skip calibration]
SkipCal --> ClassStart[Start classification]
NeedLLM1 --> DoCalibration[Run calibration<br/>Uses LLM]
DoCalibration --> ClassStart
ClassStart --> Always2[ALWAYS: LLM provider is available<br/>llm.is_available = True]
Always2 --> EmailLoop[For each email...]
EmailLoop --> LowConf{Low confidence?}
LowConf -->|No| NoLLM[No LLM call]
LowConf -->|Yes| FlagCheck{--no-llm-fallback?}
FlagCheck -->|Yes| NoLLMCall[No LLM call<br/>Accept ML result]
FlagCheck -->|No| LLMAvail{llm.is_available?}
LLMAvail -->|Yes| CallLLM[LLM called<br/>src/cli.py:227-228]
LLMAvail -->|No| NoLLMCall
NoLLM --> End([Next email])
NoLLMCall --> End
CallLLM --> End
style Always1 fill:#ffd93d
style Always2 fill:#ffd93d
style CallLLM fill:#ff6b6b
style NoLLMCall fill:#4ec9b0
</pre>
</div>
<h3>Why LLM Provider is Always Initialized:</h3>
<ul>
<li><strong>Line 98-117 (src/cli.py):</strong> LLM provider is created before checking if model exists</li>
<li><strong>Reason:</strong> Need LLM ready in case calibration is required</li>
<li><strong>Result:</strong> Even with --no-llm-fallback, LLM provider loads (but won't be called for classification)</li>
</ul>
<h2>8. Command Scenarios</h2>
<table class="timing-table">
<tr>
<th>Command</th>
<th>Model Exists?</th>
<th>Calibration Runs?</th>
<th>LLM Used for Classification?</th>
<th>Total Time (10k)</th>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000</code></td>
<td>No</td>
<td>YES (~20 min)</td>
<td>YES (~2.5 hours)</td>
<td>~2 hours 50 min</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000</code></td>
<td>Yes</td>
<td>No</td>
<td>YES (~2.5 hours)</td>
<td>~2.5 hours</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000 --no-llm-fallback</code></td>
<td>No</td>
<td>YES (~20 min)</td>
<td>NO</td>
<td>~24 minutes</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000 --no-llm-fallback</code></td>
<td>Yes</td>
<td>No</td>
<td>NO</td>
<td>~4 minutes</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 500</code></td>
<td>Any</td>
<td>No (too few emails)</td>
<td>YES (100% LLM-only)</td>
<td>~35 minutes</td>
</tr>
</table>
<h2>9. Current System State</h2>
<div class="flag-section">
<h3>Model Status</h3>
<ul>
<li><strong>src/models/calibrated/classifier.pkl</strong> - 1.8MB, trained at 02:54, 10 categories</li>
<li><strong>src/models/pretrained/classifier.pkl</strong> - Copy of calibrated model (created manually)</li>
</ul>
</div>
<div class="flag-section">
<h3>Threshold Configuration</h3>
<ul>
<li><strong>config/default_config.yaml:</strong> default_threshold = 0.55</li>
<li><strong>config/categories.yaml:</strong> All category thresholds = 0.55</li>
<li><strong>Effect:</strong> ML must be ≥55% confident to skip LLM</li>
</ul>
</div>
<div class="flag-section">
<h3>Last Run Results (10k emails)</h3>
<ul>
<li><strong>Rules:</strong> 59 emails (0.6%)</li>
<li><strong>ML:</strong> 7,842 emails (78.4%)</li>
<li><strong>LLM fallback:</strong> 2,099 emails (21%)</li>
<li><strong>Accuracy estimate:</strong> 92.7%</li>
</ul>
</div>
<h2>10. To Run ML-Only Test (No LLM Calls During Classification)</h2>
<div class="flag-section">
<h3>Requirements:</h3>
<ol>
<li>Model must exist at <code>src/models/pretrained/classifier.pkl</code> ✓ (done)</li>
<li>Use <code>--no-llm-fallback</code> flag</li>
<li>Ensure sufficient emails (≥1000) to avoid LLM-only mode</li>
</ol>
<h3>Command:</h3>
<code>
python -m src.cli run --source enron --limit 10000 --output ml_only_10k/ --no-llm-fallback
</code>
<h3>Expected Results:</h3>
<ul>
<li><strong>Calibration:</strong> Skipped (model exists)</li>
<li><strong>LLM calls during classification:</strong> 0</li>
<li><strong>Total time:</strong> ~4 minutes</li>
<li><strong>ML acceptance rate:</strong> 100% (every low-confidence ML result is accepted instead of going to the LLM)</li>
</ul>
</div>
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>

View File

@ -0,0 +1,357 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Category Verification Feature</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.code-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
font-family: 'Courier New', monospace;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.success {
background: #002a00;
border-left: 4px solid #4ec9b0;
padding: 15px;
margin: 10px 0;
}
</style>
</head>
<body>
<h1>--verify-categories Feature</h1>
<div class="success">
<h2>✅ IMPLEMENTED AND READY TO USE</h2>
<p><strong>Feature:</strong> Single LLM call to verify model categories fit new mailbox</p>
<p><strong>Cost:</strong> +20 seconds, 1 LLM call</p>
<p><strong>Value:</strong> Confidence check before bulk ML classification</p>
</div>
<h2>Usage</h2>
<div class="code-section">
<strong>Basic usage (with verification):</strong>
python -m src.cli run \
--source enron \
--limit 10000 \
--output verified_test/ \
--no-llm-fallback \
--verify-categories
<strong>Custom verification sample size:</strong>
python -m src.cli run \
--source enron \
--limit 10000 \
--output verified_test/ \
--no-llm-fallback \
--verify-categories \
--verify-sample 30
<strong>Without verification (fastest):</strong>
python -m src.cli run \
--source enron \
--limit 10000 \
--output fast_test/ \
--no-llm-fallback
</div>
<h2>How It Works</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Run with --verify-categories]) --> LoadModel[Load trained model<br/>Categories: Updates, Work,<br/>Meetings, etc.]
LoadModel --> FetchEmails[Fetch all emails<br/>10,000 total]
FetchEmails --> CheckFlag{--verify-categories?}
CheckFlag -->|No| SkipVerify[Skip verification<br/>Proceed to classification]
CheckFlag -->|Yes| Sample[Sample random emails<br/>Default: 20 emails]
Sample --> BuildPrompt[Build verification prompt<br/>Show model categories<br/>Show sample emails]
BuildPrompt --> LLMCall[Single LLM call<br/>~20 seconds<br/>Task: Rate category fit]
LLMCall --> ParseResponse[Parse JSON response<br/>Extract verdict + confidence]
ParseResponse --> Verdict{Verdict?}
Verdict -->|GOOD_MATCH<br/>80%+ fit| LogGood[Log: Categories appropriate<br/>Confidence: 0.8-1.0]
Verdict -->|FAIR_MATCH<br/>60-80% fit| LogFair[Log: Categories acceptable<br/>Confidence: 0.6-0.8]
Verdict -->|POOR_MATCH<br/><60% fit| LogPoor[Log WARNING<br/>Show suggested categories<br/>Recommend calibration<br/>Confidence: 0.0-0.6]
LogGood --> Proceed[Proceed with ML classification]
LogFair --> Proceed
LogPoor --> Proceed
SkipVerify --> Proceed
Proceed --> ClassifyAll[Classify all 10,000 emails<br/>Pure ML, no LLM fallback<br/>~4 minutes]
ClassifyAll --> Done[Results saved]
style LLMCall fill:#ffd93d
style LogGood fill:#4ec9b0
style LogPoor fill:#ff6b6b
style ClassifyAll fill:#4ec9b0
</pre>
</div>
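<h2>Programmatic Use</h2>
<p>The same check can be called directly from Python (a minimal sketch; the import path is an assumption, while the signature and return keys match the verification module):</p>
<div class="code-section">
from src.calibration.verification import verify_model_categories
result = verify_model_categories(emails, model_categories, llm_provider, sample_size=20)
if result["verdict"] == "POOR_MATCH":
    print("Suggested categories:", result["suggested_categories"])
</div>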
<h2>Example Outputs</h2>
<h3>Scenario 1: GOOD_MATCH (Enron → Enron)</h3>
<div class="code-section">
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: GOOD_MATCH (0.85)
Reasoning: The sample emails fit well into the trained categories. Most are work-related correspondence, meetings, and operational updates which align with the model.
Verification: GOOD_MATCH
Confidence: 85%
Model categories look appropriate for this mailbox
================================================================================
Starting classification...
</div>
<h3>Scenario 2: POOR_MATCH (Enron → Personal Gmail)</h3>
<div class="code-section">
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: POOR_MATCH (0.45)
Reasoning: Many sample emails are shopping confirmations, social media notifications, and personal correspondence which don't fit the business-focused categories well.
Verification: POOR_MATCH
Confidence: 45%
================================================================================
WARNING: Model categories may not fit this mailbox well
Suggested categories: ['Shopping', 'Social', 'Travel', 'Newsletters', 'Personal']
Consider running full calibration for better accuracy
Proceeding with existing model anyway...
================================================================================
Starting classification...
</div>
<h2>LLM Prompt Structure</h2>
<div class="code-section">
You are evaluating whether pre-trained email categories fit a new mailbox.
TRAINED MODEL CATEGORIES (11 categories):
- Updates
- Work
- Meetings
- External
- Financial
- Test
- Administrative
- Operational
- Technical
- Urgent
- Requests
SAMPLE EMAILS FROM NEW MAILBOX (20 total, showing first 20):
1. From: phillip.allen@enron.com
Subject: Re: AEC Volumes at OPAL
Preview: Here are the volumes for today...
2. From: notifications@amazon.com
Subject: Your order has shipped
Preview: Your Amazon.com order #123-4567890...
[... 18 more emails ...]
TASK:
Evaluate if the trained categories are appropriate for this mailbox.
Consider:
1. Do the sample emails naturally fit into the trained categories?
2. Are there obvious email types that don't match any category?
3. Are the category names semantically appropriate?
4. Would a user find these categories helpful for THIS mailbox?
Respond with JSON:
{
"verdict": "GOOD_MATCH" | "FAIR_MATCH" | "POOR_MATCH",
"confidence": 0.0-1.0,
"reasoning": "brief explanation",
"fit_percentage": 0-100,
"suggested_categories": ["cat1", "cat2", ...],
"category_mapping": {"old_name": "better_name", ...}
}
</div>
<h2>Configuration</h2>
<table style="width:100%; border-collapse: collapse; background: #252526; margin: 20px 0;">
<tr style="background: #37373d;">
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Flag</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Type</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Default</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Description</th>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;"><code>--verify-categories</code></td>
<td style="padding: 10px;">Flag</td>
<td style="padding: 10px;">False</td>
<td style="padding: 10px;">Enable category verification</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;"><code>--verify-sample</code></td>
<td style="padding: 10px;">Integer</td>
<td style="padding: 10px;">20</td>
<td style="padding: 10px;">Number of emails to sample</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;"><code>--no-llm-fallback</code></td>
<td style="padding: 10px;">Flag</td>
<td style="padding: 10px;">False</td>
<td style="padding: 10px;">Disable LLM fallback during classification</td>
</tr>
</table>
<h2>When Verification Runs</h2>
<ul>
<li>✅ Only if <code>--verify-categories</code> flag is set</li>
<li>✅ Only if trained model exists (not mock)</li>
<li>✅ After emails are fetched, before calibration/classification</li>
<li>❌ Skipped if using mock model</li>
<li>❌ Skipped if model doesn't exist (calibration will run anyway)</li>
</ul>
<h2>Timing Impact</h2>
<table style="width:100%; border-collapse: collapse; background: #252526; margin: 20px 0;">
<tr style="background: #37373d;">
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Configuration</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Time (10k emails)</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">LLM Calls</th>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">ML-only (no flags)</td>
<td style="padding: 10px;">~4 minutes</td>
<td style="padding: 10px;">0</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">ML-only + <code>--verify-categories</code></td>
<td style="padding: 10px;">~4.3 minutes</td>
<td style="padding: 10px;">1 (verification)</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">Full calibration (no model)</td>
<td style="padding: 10px;">~25 minutes</td>
<td style="padding: 10px;">~500</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">ML + LLM fallback (21%)</td>
<td style="padding: 10px;">~2.5 hours</td>
<td style="padding: 10px;">~2100</td>
</tr>
</table>
<h2>Decision Tree</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Need to classify emails]) --> HaveModel{Trained model<br/>exists?}
HaveModel -->|No| MustCalibrate[Must run calibration<br/>~20 minutes<br/>~500 LLM calls]
HaveModel -->|Yes| SameDomain{Same domain as<br/>training data?}
SameDomain -->|Yes, confident| FastML[Pure ML<br/>4 minutes<br/>0 LLM calls]
SameDomain -->|Unsure| VerifyML[ML + Verification<br/>4.3 minutes<br/>1 LLM call]
SameDomain -->|No, different| Options{Accuracy needs?}
Options -->|High accuracy required| MustCalibrate
Options -->|Speed more important| VerifyML
Options -->|Experimental| FastML
MustCalibrate --> Done[Classification complete]
FastML --> Done
VerifyML --> Done
style FastML fill:#4ec9b0
style VerifyML fill:#ffd93d
style MustCalibrate fill:#ff6b6b
</pre>
</div>
<h2>Quick Start</h2>
<div class="code-section">
<strong>Test with verification on same domain (Enron → Enron):</strong>
python -m src.cli run \
--source enron \
--limit 1000 \
--output verify_test_same/ \
--no-llm-fallback \
--verify-categories
Expected: GOOD_MATCH (0.80-0.95)
Time: ~30 seconds
<strong>Test without verification for speed comparison:</strong>
python -m src.cli run \
--source enron \
--limit 1000 \
--output no_verify_test/ \
--no-llm-fallback
Expected: Same accuracy, 20 seconds faster
Time: ~10 seconds
</div>
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>

View File

@ -0,0 +1,303 @@
================================================================================
SMART CLASSIFICATION SPOT-CHECK
================================================================================
Loading results from: results_100k/results.json
Total emails: 100,000
Analyzing classification patterns...
Selected 30 emails for spot-checking
- high_conf_suspicious: 10 samples
- low_conf_obvious: 2 samples
- mid_conf_edge_cases: 0 samples
- category_anomalies: 8 samples
- random_check: 10 samples
Loading email content...
Loaded 100,000 emails
================================================================================
SPOT-CHECK SAMPLES
================================================================================
[1] HIGH CONFIDENCE - Potential Overconfidence
--------------------------------------------------------------------------------
These have very high confidence. Check if they're actually correct.
Sample 1:
Category: Administrative
Confidence: 1.000
Method: ml
From: john.arnold@enron.com
Subject: RE:
Body preview: i'll get the movie and wine. my suggestion is something from central market but i'm easy
-----Original Message-----
From: Ward, Kim S (Houston)
Sent: Monday, July 02, 2001 5:29 PM
To: Arnold, Jo...
Sample 2:
Category: Administrative
Confidence: 1.000
Method: ml
From: eric.bass@enron.com
Subject: Re: New deals
Body preview: Can you spell S-N-O-O-T-Y?
e
From: Ami Chokshi @ ENRON 01/06/2000 05:38 PM
To: Eric Bass/HOU/ECT@ECT
cc:
Subject: Re: New deals
Was E-R-I-C too hard to w...
Sample 3:
Category: Meeting
Confidence: 1.000
Method: ml
From: amy.fitzpatrick@enron.com
Subject: MEETING TONIGHT - 6:00 pm Central Time at The Houstonian
Body preview: Throughout this week, we have a team from UBS in Houston to introduce and discuss the NETCO business and associated HR matters.
In this regard, please make yourself available for a meeting tonight b...
Sample 4:
Category: Meeting
Confidence: 1.000
Method: ml
From: james.steffes@enron.com
Subject:
Body preview: Jeff --
Please add John Neslage to your e-mail list.
Jim...
Sample 5:
Category: Financial
Confidence: 1.000
Method: ml
From: sheri.thomas@enron.com
Subject: Fercinfo2 (The Whole Picture)
Body preview: Sally - just an fyi... Jeff Hodge requested that we send him the information
below. Evidently, the FERC has requested that several US wholesale companies
provide a great deal of information to the...
[2] LOW CONFIDENCE - Might Be Obvious
--------------------------------------------------------------------------------
These have low confidence. Check if they're actually obvious.
Sample 1:
Category: unknown
Confidence: 0.500
Method: llm
From: k..allen@enron.com
Subject: FW:
Body preview: Greg,
After making an election in October to receive a full distribution of my deferral account under Section 6.3 of the plan, a disagreement has arisen regarding the Phantom Stock Account.
Se...
Sample 2:
Category: unknown
Confidence: 0.500
Method: llm
From: mitch.robinson@enron.com
Subject: Running Units
Body preview: Given the sale, etc of the units, don't sell any power off the units, and
don't run the units (any of the six plants) for any reason without first
getting my specific permission.
Thanks,
Mitch...
[3] MIDDLE CONFIDENCE - Edge Cases
--------------------------------------------------------------------------------
These are in the middle. Most likely to be tricky classifications.
[4] CATEGORY ANOMALIES - Rare Categories with High Confidence
--------------------------------------------------------------------------------
These are high confidence but in small categories. Might be mislabeled.
Sample 1:
Category: California Market
Confidence: 1.000
Method: ml
From: dhunter@s-k-w.com
Subject: FW: Direct Access Language
Body preview: -----Original Message-----
From: Mike Florio [mailto:mflorio@turn.org]
Sent: Tuesday, September 11, 2001 3:23 AM
To: Delaney Hunter
Subject: Direct Access Language
Delaney-- DJ asked me to forward ...
Sample 2:
Category: auth
Confidence: 0.990
Method: rule
From: david.roland@enron.com
Subject: FW: Notices and Agenda for Dec 21 ServiceCo Board Meeting
Body preview: Vicki, Dave, Mark and Jimmie,
We're scheduling a pre-meeting to the ServiceCo Board meeting at 11:30 a.m. tomorrow (Friday) in Dave's office.
Thanks,
David
-----Original Message-----
From: Rolan...
Sample 3:
Category: transactional
Confidence: 0.970
Method: rule
From: orders@amazon.com
Subject: Cancellation from Amazon.com Order (#107-0663988-7584503)
Body preview: Greetings from Amazon.com. You have successfully cancelled an item
from your order #107-0663988-7584503
For your reference, here is a summary of your order:
Order #107-0663988-7584503 - placed Dec...
Sample 4:
Category: Forwarded
Confidence: 1.000
Method: ml
From: jefferson.sorenson@enron.com
Subject: UNIFY TO SAP INTERFACES
Body preview: ---------------------- Forwarded by Jefferson D Sorenson/HOU/ECT on
07/05/2000 04:58 PM ---------------------------
Bob Klein
07/05/2000 04:57 PM
To: Jefferson D Sorenson/HOU/ECT@ECT
cc: Rebecca Fo...
Sample 5:
Category: Urgent
Confidence: 1.000
Method: ml
From: l..garcia@enron.com
Subject: RE: LUNCH
Body preview: You Idiot! Why are you sending emails to people who wont get them (Reese, Dustin, Blaine, Greer, Reeves), and who the hell is AC? Mr. Huddle and the Horseman?????????????? Did you fall and hit your he...
[5] RANDOM CHECK - General Quality Check
--------------------------------------------------------------------------------
Random samples from each category for general quality assessment.
Sample 1:
Category: Administrative
Confidence: 1.000
Method: ml
From: cameron@perfect.com
Subject: RE: Directions
Body preview: I will send this out. Yes, we can talk tonight. When will you be at the
house?
Cameron Sellers
Vice President, Business Development
PERFECT
1860 Embarcadero Road - Suite 210
Palo Alto, CA 94303
ca...
Sample 2:
Category: Meeting
Confidence: 1.000
Method: ml
From: perfmgmt@enron.com
Subject: Mid-Year 2001 Performance Feedback
Body preview: DEAN, CLINT E,
?
You have been selected to participate in the Mid Year 2001 Performance
Management process. Your feedback plays an important role in the process,
and your participation is critical ...
Sample 3:
Category: Financial
Confidence: 1.000
Method: ml
From: schwabalerts.marketupdates@schwab.com
Subject: Midday Market View for June 7, 2001
Body preview: Charles Schwab & Co., Inc.
Midday Market View(TM) for Thursday, June 7, 2001
as of 1:00PM EDT
Information provided by Standard & Poor's
==============================================================...
Sample 4:
Category: Work
Confidence: 1.000
Method: ml
From: enron.announcements@enron.com
Subject: SUPPLEMENTAL Weekend Outage Report for 11-10-00
Body preview: ------------------------------------------------------------------------------
------------------------
W E E K E N D S Y S T E M S A V A I L A B I L I T Y
F O R
November 10, 2000 5:00pm through...
Sample 5:
Category: Operational
Confidence: 1.000
Method: ml
From: phillip.allen@enron.com
Subject: Re: Insight Hardware
Body preview: I have not received the aircard 300 yet.
Phillip...
================================================================================
CATEGORY DISTRIBUTION
================================================================================
Category Total High Conf Low Conf Avg Conf
--------------------------------------------------------------------------------
Administrative 67,195 67,191 0 1.000
Work 14,223 14,213 0 1.000
Meeting 7,785 7,783 0 1.000
Financial 5,943 5,943 0 1.000
Operational 3,274 3,272 0 1.000
junk 394 394 0 0.960
work 368 368 0 0.950
Miscellaneous 238 238 0 1.000
Technical 193 193 0 1.000
External 137 137 0 1.000
Announcements 113 112 0 0.999
transactional 44 44 0 0.970
auth 37 37 0 0.990
unknown 23 0 23 0.500
Forwarded 16 16 0 0.999
California Market 6 6 0 1.000
Prehearing 6 6 0 0.974
Change 3 3 0 1.000
Urgent 1 1 0 1.000
Monitoring 1 1 0 1.000
================================================================================
DONE!
================================================================================

50
scripts/run_clean_10k.sh Executable file
View File

@ -0,0 +1,50 @@
#!/usr/bin/env bash
# Clean 10k test with all fixes applied
# Run from the repo root when ready: ./scripts/run_clean_10k.sh
set -e
echo "=========================================="
echo "CLEAN 10K TEST - Fixed Category System"
echo "=========================================="
echo ""
echo "Fixes applied:"
echo " ✓ Removed hardcoded category pollution"
echo " ✓ LLM-only category discovery"
echo " ✓ Intelligent scaling (3% cal, 1% val)"
echo ""
echo "Expected results:"
echo " - ~11 clean categories (not 29)"
echo " - No duplicates (Work vs work)"
echo " - Realistic confidence scores"
echo ""
echo "Starting at: $(date)"
echo ""
# Activate venv
if [ -z "$VIRTUAL_ENV" ]; then
source venv/bin/activate
fi
# Clean start
rm -rf results_10k/
rm -f src/models/calibrated/classifier.pkl
rm -f src/models/category_cache.json
# Run with progress visible
python -m src.cli run \
--source enron \
--limit 10000 \
--output results_10k/ \
--verbose
echo ""
echo "=========================================="
echo "COMPLETE at: $(date)"
echo "=========================================="
echo ""
echo "Check results:"
echo " - Categories: cat src/models/category_cache.json | python3 -m json.tool"
echo " - Model: ls -lh src/models/calibrated/"
echo " - Results: ls -lh results_10k/"
echo ""

30
scripts/test_ml_only.sh Executable file
View File

@ -0,0 +1,30 @@
#!/bin/bash
# Test ML performance without LLM fallback using trained model
set -e
echo "=========================================="
echo "ML-ONLY TEST (No LLM Fallback)"
echo "=========================================="
echo ""
echo "Using model: src/models/calibrated/classifier.pkl"
echo "Testing on: 1000 emails"
echo ""
# Activate venv
if [ -z "$VIRTUAL_ENV" ]; then
source venv/bin/activate
fi
# Run classification with trained model, NO LLM fallback
python -m src.cli run \
--source enron \
--limit 1000 \
--output ml_only_test/ \
--no-llm-fallback \
2>&1 | tee ml_only_test.log
echo ""
echo "=========================================="
echo "Test complete. Check ml_only_test.log"
echo "=========================================="

51
scripts/train_final_model.sh Executable file
View File

@ -0,0 +1,51 @@
#!/bin/bash
# Train final production model with 10k emails and 0.55 thresholds
set -e
echo "=========================================="
echo "TRAINING FINAL MODEL"
echo "=========================================="
echo ""
echo "Config: 0.55 thresholds across all categories"
echo "Training set: 10,000 Enron emails"
echo "Calibration: 300 samples (3%)"
echo "Validation: 100 samples (1%)"
echo ""
# Backup existing model if it exists
if [ -f src/models/calibrated/classifier.pkl ]; then
BACKUP_FILE="src/models/calibrated/classifier.pkl.backup-$(date +%Y%m%d-%H%M%S)"
cp src/models/calibrated/classifier.pkl "$BACKUP_FILE"
echo "Backed up existing model to: $BACKUP_FILE"
fi
# Clean old results
rm -rf results_final/ final_training.log
# Activate venv
if [ -z "$VIRTUAL_ENV" ]; then
source venv/bin/activate
fi
# Train model
python -m src.cli run \
--source enron \
--limit 10000 \
--output results_final/ \
2>&1 | tee final_training.log
# Create timestamped backup of trained model
if [ -f src/models/calibrated/classifier.pkl ]; then
TRAINED_BACKUP="src/models/calibrated/classifier.pkl.backup-trained-$(date +%Y%m%d-%H%M%S)"
cp src/models/calibrated/classifier.pkl "$TRAINED_BACKUP"
echo "Created backup of trained model: $TRAINED_BACKUP"
fi
echo ""
echo "=========================================="
echo "Training complete!"
echo "Model saved to: src/models/calibrated/classifier.pkl"
echo "Backup created with timestamp"
echo "Log: final_training.log"
echo "=========================================="

View File

@ -0,0 +1,190 @@
"""Category verification for existing models on new mailboxes."""
import logging
import json
import re
import random
from typing import List, Dict, Any
from src.email_providers.base import Email
from src.llm.base import BaseLLMProvider
logger = logging.getLogger(__name__)
def verify_model_categories(
emails: List[Email],
model_categories: List[str],
llm_provider: BaseLLMProvider,
sample_size: int = 20
) -> Dict[str, Any]:
"""
Verify if trained model categories fit a new mailbox.
Single LLM call to check if categories are appropriate.
Args:
emails: All emails from new mailbox
model_categories: Categories the model was trained on
llm_provider: LLM provider for verification
sample_size: Number of emails to sample for verification
Returns:
{
'verdict': 'GOOD_MATCH' | 'FAIR_MATCH' | 'POOR_MATCH',
'confidence': float (0-1),
'reasoning': str,
'suggested_categories': List[str] (if poor match),
'category_mapping': Dict[str, str] (suggested name changes)
}
"""
logger.info(f"Verifying model categories against {len(emails)} emails")
logger.info(f"Model categories ({len(model_categories)}): {', '.join(model_categories)}")
# Sample random emails
sample = random.sample(emails, min(sample_size, len(emails)))
logger.info(f"Sampled {len(sample)} emails for verification")
# Build email summaries
email_summaries = []
for i, email in enumerate(sample[:20]): # Limit to 20 to avoid token limits
summary = f"{i+1}. From: {email.sender}\n Subject: {email.subject}\n Preview: {email.body_snippet[:80]}..."
email_summaries.append(summary)
email_text = "\n\n".join(email_summaries)
# Build categories list
categories_text = "\n".join([f" - {cat}" for cat in model_categories])
# Build verification prompt
prompt = f"""<no_think>You are evaluating whether pre-trained email categories fit a new mailbox.
TRAINED MODEL CATEGORIES ({len(model_categories)} categories):
{categories_text}
SAMPLE EMAILS FROM NEW MAILBOX ({len(sample)} total, showing first {len(email_summaries)}):
{email_text}
TASK:
Evaluate if the trained categories are appropriate for this mailbox.
Consider:
1. Do the sample emails naturally fit into the trained categories?
2. Are there obvious email types that don't match any category?
3. Are the category names semantically appropriate?
4. Would a user find these categories helpful for THIS mailbox?
Respond with JSON:
{{
"verdict": "GOOD_MATCH" | "FAIR_MATCH" | "POOR_MATCH",
"confidence": 0.0-1.0,
"reasoning": "brief explanation",
"fit_percentage": 0-100,
"suggested_categories": ["cat1", "cat2", ...], // Only if POOR_MATCH
"category_mapping": {{"old_name": "better_name", ...}} // Optional renames
}}
Verdict criteria:
- GOOD_MATCH: 80%+ of emails fit well, categories are appropriate
- FAIR_MATCH: 60-80% fit, some gaps but usable
- POOR_MATCH: <60% fit, significant category mismatch
JSON:
"""
try:
logger.info("Calling LLM for category verification...")
response = llm_provider.complete(
prompt,
temperature=0.1,
max_tokens=1000
)
logger.debug(f"LLM verification response: {response[:500]}")
# Parse response
result = _parse_verification_response(response)
logger.info(f"Verification complete: {result['verdict']} ({result['confidence']:.0%})")
if result.get('reasoning'):
logger.info(f"Reasoning: {result['reasoning']}")
return result
except Exception as e:
logger.error(f"Verification failed: {e}")
# Return conservative default
return {
'verdict': 'FAIR_MATCH',
'confidence': 0.5,
'reasoning': f'Verification failed: {e}',
'fit_percentage': 50,
'suggested_categories': [],
'category_mapping': {}
}
def _parse_verification_response(response: str) -> Dict[str, Any]:
"""Parse LLM verification response."""
try:
# Strip think tags
cleaned = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL)
# Extract JSON
json_match = re.search(r'\{.*\}', cleaned, re.DOTALL)
if json_match:
# Find complete JSON by counting braces
brace_count = 0
for i, char in enumerate(cleaned):
if char == '{':
brace_count += 1
if brace_count == 1:
start = i
elif char == '}':
brace_count -= 1
if brace_count == 0:
json_str = cleaned[start:i+1]
break
parsed = json.loads(json_str)
# Validate and set defaults
result = {
'verdict': parsed.get('verdict', 'FAIR_MATCH'),
'confidence': float(parsed.get('confidence', 0.5)),
'reasoning': parsed.get('reasoning', ''),
'fit_percentage': int(parsed.get('fit_percentage', 50)),
'suggested_categories': parsed.get('suggested_categories', []),
'category_mapping': parsed.get('category_mapping', {})
}
# Validate verdict
if result['verdict'] not in ['GOOD_MATCH', 'FAIR_MATCH', 'POOR_MATCH']:
logger.warning(f"Invalid verdict: {result['verdict']}, defaulting to FAIR_MATCH")
result['verdict'] = 'FAIR_MATCH'
# Clamp confidence
result['confidence'] = max(0.0, min(1.0, result['confidence']))
return result
except json.JSONDecodeError as e:
logger.warning(f"JSON parse error: {e}")
except Exception as e:
logger.warning(f"Parse error: {e}")
# Fallback parsing - try to extract verdict from text
verdict = 'FAIR_MATCH'
if 'GOOD_MATCH' in response or 'good match' in response.lower():
verdict = 'GOOD_MATCH'
elif 'POOR_MATCH' in response or 'poor match' in response.lower():
verdict = 'POOR_MATCH'
logger.warning(f"Using fallback parsing, verdict: {verdict}")
return {
'verdict': verdict,
'confidence': 0.5,
'reasoning': 'Fallback parsing - response format invalid',
'fit_percentage': 50,
'suggested_categories': [],
'category_mapping': {}
}
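
Note: the snippet below is an illustrative usage sketch, not code from this commit. It assumes `emails`, `ml_classifier`, and `llm` have already been constructed the way the CLI `run()` flow further down in this diff builds them, and that the provider exposes `complete(prompt, temperature, max_tokens)` as used above.

```python
from src.calibration.category_verifier import verify_model_categories

# 'emails', 'ml_classifier', and 'llm' are assumed to exist, as in the run() flow
result = verify_model_categories(
    emails=emails,
    model_categories=ml_classifier.categories,
    llm_provider=llm,   # any provider exposing complete(prompt, temperature, max_tokens)
    sample_size=20,
)

if result['verdict'] == 'POOR_MATCH':
    # fit_percentage and suggested_categories follow the JSON schema in the prompt above
    print(f"Only ~{result['fit_percentage']}% fit; consider recalibrating with:",
          result['suggested_categories'])
```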


@ -267,10 +267,28 @@ JSON:
        # Strip <think> tags if present
        cleaned = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL)
        # Extract JSON
        json_match = re.search(r'\{.*\}', cleaned, re.DOTALL)
        # Stop at endoftext token if present
        if '<|endoftext|>' in cleaned:
            cleaned = cleaned.split('<|endoftext|>')[0]
        # Extract JSON - use non-greedy match and stop at first valid JSON
        json_match = re.search(r'\{.*?\}', cleaned, re.DOTALL)
        if json_match:
            parsed = json.loads(json_match.group())
            json_str = json_match.group()
            # Try to find the complete JSON by counting braces
            brace_count = 0
            for i, char in enumerate(cleaned):
                if char == '{':
                    brace_count += 1
                    if brace_count == 1:
                        start = i
                elif char == '}':
                    brace_count -= 1
                    if brace_count == 0:
                        json_str = cleaned[start:i+1]
                        break
            parsed = json.loads(json_str)
            logger.debug(f"Successfully parsed JSON: {len(parsed.get('categories', {}))} categories, {len(parsed.get('labels', []))} labels")
            return parsed
    except json.JSONDecodeError as e:
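
For context on why the parsing above was hardened, here is a small standalone sketch (hypothetical helper, not part of the repo) of the same extraction strategy and the kind of raw model output it is meant to survive: a leading `<think>` block, an `<|endoftext|>` marker, and trailing text after the JSON object.

```python
import json
import re

def extract_first_json(response: str) -> dict:
    """Standalone sketch of the extraction strategy used above (illustrative only)."""
    cleaned = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL)
    if '<|endoftext|>' in cleaned:
        cleaned = cleaned.split('<|endoftext|>')[0]
    # Balance braces so nested objects are captured and trailing chatter is ignored
    brace_count, start = 0, None
    for i, char in enumerate(cleaned):
        if char == '{':
            brace_count += 1
            if brace_count == 1:
                start = i
        elif char == '}':
            brace_count -= 1
            if brace_count == 0:
                return json.loads(cleaned[start:i + 1])
    raise ValueError("no complete JSON object found")

messy = '<think>reasoning...</think>{"verdict": "GOOD_MATCH", "confidence": 0.9}<|endoftext|>extra'
print(extract_first_json(messy))  # {'verdict': 'GOOD_MATCH', 'confidence': 0.9}
```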


@ -104,11 +104,12 @@ class CalibrationWorkflow:
        # Create lookup for LLM labels
        label_map = {email_id: category for email_id, category in sample_labels}
        # Update categories to include ALL categories from labels (not just discovered_categories dict)
        # This ensures we include categories that were ambiguous and kept their original names
        # Use ONLY LLM-discovered categories for training
        # DO NOT merge self.categories (hardcoded) - those are for rule-based matching only
        label_categories = set(category for _, category in sample_labels)
        all_categories = list(set(self.categories) | set(discovered_categories.keys()) | label_categories)
        logger.info(f"Using categories: {all_categories}")
        all_categories = list(set(discovered_categories.keys()) | label_categories)
        logger.info(f"Using categories (LLM-discovered): {all_categories}")
        logger.info(f"Categories count: {len(all_categories)}")
        # Update trainer with discovered categories
        self.trainer.categories = all_categories
@ -148,10 +149,10 @@ class CalibrationWorkflow:
        # Prepare validation data
        validation_data = []
        # Use first discovered category as default for validation
        default_category = all_categories[0] if all_categories else 'unknown'
        for email in validation_emails:
            # Use LLM to label validation set (or use heuristics)
            # For now, use first category as default
            validation_data.append((email, self.categories[0]))
            validation_data.append((email, default_category))
        try:
            train_results = self.trainer.train(


@ -68,7 +68,8 @@ class AdaptiveClassifier:
        ml_classifier: MLClassifier,
        llm_classifier: Optional[LLMClassifier],
        categories: Dict[str, Dict],
        config: Dict[str, Any]
        config: Dict[str, Any],
        disable_llm_fallback: bool = False
    ):
        """Initialize adaptive classifier."""
        self.feature_extractor = feature_extractor
@ -76,6 +77,7 @@ class AdaptiveClassifier:
        self.llm_classifier = llm_classifier
        self.categories = categories
        self.config = config
        self.disable_llm_fallback = disable_llm_fallback
        self.thresholds = self._init_thresholds()
        self.stats = ClassificationStats()
@ -85,10 +87,10 @@ class AdaptiveClassifier:
        thresholds = {}
        for category, cat_config in self.categories.items():
            threshold = cat_config.get('threshold', 0.75)
            threshold = cat_config.get('threshold', 0.55)
            thresholds[category] = threshold
        default = self.config.get('classification', {}).get('default_threshold', 0.75)
        default = self.config.get('classification', {}).get('default_threshold', 0.55)
        thresholds['default'] = default
        logger.info(f"Initialized thresholds: {thresholds}")
@ -143,17 +145,29 @@ class AdaptiveClassifier:
                    probabilities=ml_result.get('probabilities', {})
                )
            else:
                # Low confidence: Queue for LLM
                # Low confidence: Queue for LLM (unless disabled)
                logger.debug(f"Low confidence for {email.id}: {category} ({confidence:.2f})")
                self.stats.needs_review += 1
                return ClassificationResult(
                    email_id=email.id,
                    category=category,
                    confidence=confidence,
                    method='ml',
                    needs_review=True,
                    probabilities=ml_result.get('probabilities', {})
                )
                if self.disable_llm_fallback:
                    # Just return ML result without LLM fallback
                    return ClassificationResult(
                        email_id=email.id,
                        category=category,
                        confidence=confidence,
                        method='ml',
                        needs_review=False,
                        probabilities=ml_result.get('probabilities', {})
                    )
                else:
                    return ClassificationResult(
                        email_id=email.id,
                        category=category,
                        confidence=confidence,
                        method='ml',
                        needs_review=True,
                        probabilities=ml_result.get('probabilities', {})
                    )
        except Exception as e:
            logger.error(f"Classification error for {email.id}: {e}")


@ -43,6 +43,12 @@ def cli():
              help='Do not sync results back')
@click.option('--verbose', is_flag=True,
              help='Verbose logging')
@click.option('--no-llm-fallback', is_flag=True,
              help='Disable LLM fallback - test pure ML performance')
@click.option('--verify-categories', is_flag=True,
              help='Verify model categories fit new mailbox (single LLM call)')
@click.option('--verify-sample', type=int, default=20,
              help='Number of emails to sample for category verification')
def run(
    source: str,
    credentials: Optional[str],
@ -51,7 +57,10 @@ def run(
    limit: Optional[int],
    llm_provider: str,
    dry_run: bool,
    verbose: bool
    verbose: bool,
    no_llm_fallback: bool,
    verify_categories: bool,
    verify_sample: int
):
    """Run email sorter pipeline."""
@ -125,7 +134,8 @@ def run(
        ml_classifier,
        llm_classifier,
        categories,
        cfg.dict()
        cfg.dict(),
        disable_llm_fallback=no_llm_fallback
    )

    # Fetch emails
@ -138,56 +148,106 @@ def run(
logger.info(f"Fetched {len(emails)} emails")
# Category verification (if requested and model exists)
if verify_categories and not ml_classifier.is_mock and ml_classifier.model:
logger.info("=" * 80)
logger.info("VERIFYING MODEL CATEGORIES")
logger.info("=" * 80)
from src.calibration.category_verifier import verify_model_categories
verification_result = verify_model_categories(
emails=emails,
model_categories=ml_classifier.categories,
llm_provider=llm,
sample_size=min(verify_sample, len(emails))
)
logger.info(f"Verification: {verification_result['verdict']}")
logger.info(f"Confidence: {verification_result['confidence']:.0%}")
if verification_result['verdict'] == 'POOR_MATCH':
logger.warning("=" * 80)
logger.warning("WARNING: Model categories may not fit this mailbox well")
logger.warning(f"Suggested categories: {verification_result.get('suggested_categories', [])}")
logger.warning("Consider running full calibration for better accuracy")
logger.warning("Proceeding with existing model anyway...")
logger.warning("=" * 80)
elif verification_result['verdict'] == 'GOOD_MATCH':
logger.info("Model categories look appropriate for this mailbox")
logger.info("=" * 80)
# Intelligent scaling: Decide if we need ML at all
total_emails = len(emails)
# Skip ML for small datasets (<1000 emails) - use LLM only
if total_emails < 1000:
logger.warning(f"Only {total_emails} emails - too few for ML training")
logger.warning("Using LLM-only classification (no ML model)")
ml_classifier.is_mock = True
# Check if we need calibration (no good ML model)
if ml_classifier.is_mock or not ml_classifier.model:
logger.info("=" * 80)
logger.info("RUNNING CALIBRATION - Training ML model on LLM-labeled samples")
logger.info("=" * 80)
if total_emails >= 1000:
logger.info("=" * 80)
logger.info("RUNNING CALIBRATION - Training ML model")
logger.info("=" * 80)
from src.calibration.workflow import CalibrationWorkflow, CalibrationConfig
from src.calibration.workflow import CalibrationWorkflow, CalibrationConfig
# Create calibration LLM provider with smaller model
calibration_llm = OllamaProvider(
base_url=cfg.llm.ollama.base_url,
model=cfg.llm.ollama.calibration_model,
temperature=cfg.llm.ollama.temperature,
max_tokens=cfg.llm.ollama.max_tokens
)
logger.info(f"Using calibration model: {cfg.llm.ollama.calibration_model}")
# Intelligent scaling for calibration and validation
# Calibration: 3% of emails (min 250, max 1500)
calibration_size = max(250, min(1500, int(total_emails * 0.03)))
# Validation: 1% of emails (min 100, max 300)
validation_size = max(100, min(300, int(total_emails * 0.01)))
# Create consolidation LLM provider with larger model (needs structured JSON output)
consolidation_model = getattr(cfg.llm.ollama, 'consolidation_model', cfg.llm.ollama.calibration_model)
consolidation_llm = OllamaProvider(
base_url=cfg.llm.ollama.base_url,
model=consolidation_model,
temperature=cfg.llm.ollama.temperature,
max_tokens=cfg.llm.ollama.max_tokens
)
logger.info(f"Using consolidation model: {consolidation_model}")
logger.info(f"Total emails: {total_emails:,}")
logger.info(f"Calibration samples: {calibration_size} ({calibration_size/total_emails*100:.1f}%)")
logger.info(f"Validation samples: {validation_size} ({validation_size/total_emails*100:.1f}%)")
calibration_config = CalibrationConfig(
sample_size=min(1500, len(emails) // 2), # Use 1500 or half the emails
validation_size=300,
llm_batch_size=50
)
# Create calibration LLM provider
calibration_llm = OllamaProvider(
base_url=cfg.llm.ollama.base_url,
model=cfg.llm.ollama.calibration_model,
temperature=cfg.llm.ollama.temperature,
max_tokens=cfg.llm.ollama.max_tokens
)
logger.info(f"Calibration model: {cfg.llm.ollama.calibration_model}")
calibration = CalibrationWorkflow(
llm_provider=calibration_llm,
consolidation_llm_provider=consolidation_llm,
feature_extractor=feature_extractor,
categories=categories,
config=calibration_config
)
# Create consolidation LLM provider
consolidation_model = getattr(cfg.llm.ollama, 'consolidation_model', cfg.llm.ollama.calibration_model)
consolidation_llm = OllamaProvider(
base_url=cfg.llm.ollama.base_url,
model=consolidation_model,
temperature=cfg.llm.ollama.temperature,
max_tokens=cfg.llm.ollama.max_tokens
)
logger.info(f"Consolidation model: {consolidation_model}")
# Run calibration to train ML model
cal_results = calibration.run(emails, model_output_path="src/models/calibrated/classifier.pkl")
calibration_config = CalibrationConfig(
sample_size=calibration_size,
validation_size=validation_size,
llm_batch_size=50
)
# Reload the ML classifier with the new model
ml_classifier = MLClassifier(model_path="src/models/calibrated/classifier.pkl")
adaptive_classifier.ml_classifier = ml_classifier
calibration = CalibrationWorkflow(
llm_provider=calibration_llm,
consolidation_llm_provider=consolidation_llm,
feature_extractor=feature_extractor,
categories={}, # Don't pass hardcoded - let LLM discover
config=calibration_config
)
logger.info(f"Calibration complete! Accuracy: {cal_results.get('validation_accuracy', 0):.1%}")
logger.info("=" * 80)
# Run calibration to train ML model
cal_results = calibration.run(emails, model_output_path="src/models/calibrated/classifier.pkl")
# Reload the ML classifier with the new model
ml_classifier = MLClassifier(model_path="src/models/calibrated/classifier.pkl")
adaptive_classifier.ml_classifier = ml_classifier
logger.info(f"Calibration complete! Accuracy: {cal_results.get('validation_accuracy', 0):.1%}")
logger.info("=" * 80)
# Classify emails
logger.info("Starting classification")
