Rewrite CLAUDE.md and clean project structure

- Rewrote CLAUDE.md with comprehensive development guide
- Archived 20 old docs to docs/archive/
- Added PROJECT_ROADMAP_2025.md with research learnings
- Added CLASSIFICATION_METHODS_COMPARISON.md
- Added SESSION_HANDOVER_20251128.md
- Added tools for analysis (brett_gmail/microsoft analyzers)
- Updated .gitignore for archive folders
- Config changes for local vLLM endpoint
FSSCoding 2025-11-28 13:07:27 +11:00
parent 4eee962c09
commit 8f25e30f52
32 changed files with 3592 additions and 14417 deletions

.gitignore (vendored, 13 lines changed)

@@ -72,8 +72,17 @@ ml_only_test/
 results_*/
 phase1_*/

-# Python scripts (experimental/research)
+# Python scripts (experimental/research - not in src/tests/tools)
 *.py
 !src/**/*.py
 !tests/**/*.py
+!tools/**/*.py
 !setup.py
+
+# Archive folders (historical content)
+archive/
+docs/archive/
+
+# Data folders (user-specific content)
+data/Bruce emails/
+data/emails-for-link/

CLAUDE.md (645 lines changed)

@@ -1,377 +1,304 @@

# Email Sorter - Claude Development Guide

This document provides essential context for Claude (or other AI assistants) working on this project.

## Project Overview

**Email Sorter** is a hybrid ML/LLM email classification system designed to process large email backlogs (10k-100k+ emails) with high speed and accuracy.

### Current MVP Status

**✅ PROVEN WORKING** - 10,000 emails classified in ~24 seconds with 72.7% accuracy

**Core Features:**
- LLM-driven category discovery (no hardcoded categories)
- ML model training on discovered categories (LightGBM)
- Fast pure-ML classification with `--no-llm-fallback`
- Category verification for new mailboxes with `--verify-categories`
- Batched embedding extraction (512 emails/batch)
- Multiple email provider support (Gmail, Outlook, IMAP, Enron)

## Architecture

### Three-Tier Classification Pipeline

```
Email → Rules Check → ML Classifier → LLM Fallback (optional)
            ↓               ↓                ↓
         Definite    High Confidence   Low Confidence
         (5-10%)        (70-80%)         (10-20%)
```

### Key Technologies
- **ML Model**: LightGBM (1.8MB, 11 categories, 28 threads)
- **Embeddings**: all-minilm:l6-v2 via Ollama (384-dim, universal)
- **LLM**: qwen3:4b-instruct-2507-q8_0 via Ollama (calibration only)
- **Feature Extraction**: Embeddings + TF-IDF + pattern detection
- **Thresholds**: 0.55 (optimized from 0.75 to reduce LLM fallback)

### Performance Metrics

| Emails | Time | Accuracy | LLM Calls | Throughput |
|--------|------|----------|-----------|------------|
| 10,000 | 24s  | 72.7%    | 0         | 423/sec    |
| 10,000 | 5min | 92.7%    | 2,100     | 33/sec     |

(The rewrite replaces the above with:)

# Email Sorter - Development Guide

## What This Tool Does

**Email Sorter is a TRIAGE tool** that sorts emails into buckets for downstream processing. It is NOT a complete email management solution - it's one part of a larger ecosystem.

```
Raw Inbox (10k+) --> Email Sorter --> Categorized Buckets --> Specialized Tools
                      (this tool)        (output)              (other tools)
```

---

## Quick Start

```bash
cd /MASTERFOLDER/Tools/email-sorter
source venv/bin/activate

# Classify emails with ML + LLM fallback
python -m src.cli run --source local \
    --directory "/path/to/emails" \
    --output "/path/to/output" \
    --force-ml --llm-provider openai

# Generate HTML report from results
python tools/generate_html_report.py --input /path/to/results.json
```
---
## Key Documentation
| Document | Purpose | Location |
|----------|---------|----------|
| **PROJECT_ROADMAP_2025.md** | Master learnings, research findings, development roadmap | `docs/` |
| **CLASSIFICATION_METHODS_COMPARISON.md** | ML vs LLM vs Agent comparison | `docs/` |
| **REPORT_FORMAT.md** | HTML report documentation | `docs/` |
| **BATCH_LLM_QUICKSTART.md** | Quick LLM batch processing guide | root |
---
## Research Findings Summary
### Dataset Size Routing
| Size | Best Method | Why |
|------|-------------|-----|
| <500 | Agent-only | ML overhead exceeds benefit |
| 500-5000 | Agent pre-scan + ML | Discovery improves accuracy |
| >5000 | ML pipeline | Speed critical |
### Research Results
| Dataset | Type | ML-Only | ML+LLM | Agent |
|---------|------|---------|--------|-------|
| brett-gmail (801) | Personal | 54.9% | 93.3% | 99.8% |
| brett-microsoft (596) | Business | - | - | 98.2% |
### Key Insight: Inbox Character Matters
| Type | Pattern | Approach |
|------|---------|----------|
| **Personal** | Subscriptions, marketing (40-50% automated) | Sender domain first |
| **Business** | Client work, operations (60-70% professional) | Sender + Subject context |
---
## Project Structure

```
email-sorter/
├── CLAUDE.md                        # THIS FILE
├── README.md                        # General readme
├── BATCH_LLM_QUICKSTART.md          # LLM batch processing
│
├── src/                             # Source code
│   ├── cli.py                       # Main entry point
│   ├── classification/              # ML/LLM classification
│   ├── calibration/                 # Model training, email parsing
│   ├── email_providers/             # Gmail, Outlook, IMAP, Local
│   └── llm/                         # LLM providers
│
├── tools/                           # Utility scripts
│   ├── brett_gmail_analyzer.py      # Personal inbox template
│   ├── brett_microsoft_analyzer.py  # Business inbox template
│   ├── generate_html_report.py      # HTML report generator
│   └── batch_llm_classifier.py      # Batch LLM classification
│
├── config/                          # Configuration
│   ├── default_config.yaml          # LLM endpoints, thresholds
│   └── categories.yaml              # Category definitions
│
├── docs/                            # Current documentation
│   ├── PROJECT_ROADMAP_2025.md
│   ├── CLASSIFICATION_METHODS_COMPARISON.md
│   ├── REPORT_FORMAT.md
│   └── archive/                     # Old docs (historical)
│
├── data/                            # Analysis outputs (gitignored)
│   ├── brett_gmail_analysis.json
│   └── brett_microsoft_analysis.json
│
├── credentials/                     # OAuth/API creds (gitignored)
├── results/                         # Classification outputs (gitignored)
├── archive/                         # Old scripts (gitignored)
├── maildir/                         # Enron test data
└── venv/                            # Python environment
```

(The previous layout, removed by this commit:)

## Project Structure

```
email-sorter/
├── src/
│   ├── cli.py                     # Main CLI interface
│   ├── classification/            # Classification pipeline
│   │   ├── adaptive_classifier.py # Rules → ML → LLM orchestration
│   │   ├── ml_classifier.py       # LightGBM classifier
│   │   ├── llm_classifier.py      # LLM fallback
│   │   └── feature_extractor.py   # Batched embedding extraction
│   ├── calibration/               # LLM-driven calibration
│   │   ├── workflow.py            # Calibration orchestration
│   │   ├── llm_analyzer.py        # Batch category discovery (20 emails/call)
│   │   ├── trainer.py             # ML model training
│   │   └── category_verifier.py   # Category verification
│   ├── email_providers/           # Email source connectors
│   │   ├── gmail.py               # Gmail API (OAuth 2.0)
│   │   ├── outlook.py             # Microsoft Graph API (OAuth 2.0)
│   │   ├── imap.py                # IMAP protocol
│   │   └── enron.py               # Enron dataset (testing)
│   ├── llm/                       # LLM provider interfaces
│   │   ├── ollama.py              # Ollama provider
│   │   └── openai_compat.py       # OpenAI-compatible provider
│   └── models/                    # Trained models
│       ├── calibrated/            # User-calibrated models
│       │   └── classifier.pkl     # Current trained model (1.8MB)
│       └── pretrained/            # Default models
├── config/
│   ├── default_config.yaml        # System defaults
│   ├── categories.yaml            # Category definitions (thresholds: 0.55)
│   └── llm_models.yaml            # LLM configuration
├── credentials/                   # Email provider credentials (gitignored)
│   ├── gmail/                     # Gmail OAuth (3 accounts)
│   ├── outlook/                   # Outlook OAuth (3 accounts)
│   └── imap/                      # IMAP credentials (3 accounts)
├── docs/                          # Documentation
├── scripts/                       # Utility scripts
└── logs/                          # Log files (gitignored)
```
## Critical Implementation Details
### 1. Batched Embedding Extraction (CRITICAL!)
**ALWAYS use batched feature extraction:**
```python
# ✅ CORRECT - Batched (roughly an order of magnitude faster end-to-end)
all_features = feature_extractor.extract_batch(emails, batch_size=512)
for email, features in zip(emails, all_features):
    result = adaptive_classifier.classify_with_features(email, features)
# ❌ WRONG - Sequential (extremely slow)
for email in emails:
    result = adaptive_classifier.classify(email)  # Extracts features one-at-a-time
```
**Why this matters:**
- Sequential: 10,000 emails × 15ms = 150 seconds just for embeddings
- Batched: 20 batches × 1s = 20 seconds for embeddings
- **A 7.5x difference on the embedding step alone** (and roughly 10x end-to-end once per-call overhead is included)
### 2. Model Paths
**The model exists in TWO locations:**
- `src/models/calibrated/classifier.pkl` - Created during calibration (authoritative)
- `src/models/pretrained/classifier.pkl` - Loaded by default (copy of calibrated)
**When calibration runs:**
1. Saves model to `calibrated/classifier.pkl`
2. MLClassifier loads from `pretrained/classifier.pkl` by default
3. Need to copy or update path
**Current status:** Both paths have the same 1.8MB model (Oct 25 02:54)
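Until the loader is unified, a tiny sync step keeps the two paths in lockstep. This is a minimal sketch, not project code; the paths are the ones listed above, the helper name is invented:

```python
import shutil
from pathlib import Path

CALIBRATED = Path("src/models/calibrated/classifier.pkl")
PRETRAINED = Path("src/models/pretrained/classifier.pkl")

def sync_model() -> None:
    """Copy the freshly calibrated model over the default-loaded path."""
    if not CALIBRATED.exists():
        raise FileNotFoundError(f"no calibrated model at {CALIBRATED}")
    PRETRAINED.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(CALIBRATED, PRETRAINED)  # copy2 keeps the mtime for auditing

if __name__ == "__main__":
    sync_model()
```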
### 3. LLM-Driven Calibration
**NOT hardcoded categories** - categories are discovered by LLM:
```python
# Calibration process:
1. Sample 300 emails (3% of 10k)
2. Batch process in groups of 20 emails
3. LLM discovers categories (not predefined)
4. LLM labels each email
5. Train LightGBM on discovered categories
```
**Result:** 11 categories discovered from Enron dataset:
- Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
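As a minimal sketch of that flow (the `llm` and `trainer` interfaces here are illustrative, not the project's actual signatures):

```python
import random

def calibrate(emails, llm, trainer, sample_pct=0.03, batch_size=20):
    """Sample, let the LLM discover categories and label, then train LightGBM."""
    sample = random.sample(emails, max(1, int(len(emails) * sample_pct)))
    categories, labels = set(), []
    for i in range(0, len(sample), batch_size):
        batch = sample[i:i + batch_size]
        # One LLM call per batch: propose categories and label each email
        result = llm.discover_and_label(batch)  # hypothetical method
        categories.update(result["categories"])
        labels.extend(result["labels"])
    return trainer.train(sample, labels, sorted(categories))
```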
### 4. Threshold Optimization
**Default threshold: 0.55** (reduced from 0.75)
**Impact:**
- 0.75 threshold: 35% LLM fallback
- 0.55 threshold: 21% LLM fallback
- **40% reduction in LLM usage**
All category thresholds in `config/categories.yaml` set to 0.55.
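The routing logic this threshold drives is essentially the following (a sketch; the names are illustrative, not the project's actual API):

```python
def route(email, features, ml, llm=None, threshold=0.55):
    """Accept ML predictions at or above the threshold; otherwise fall
    back to the LLM when one is configured."""
    category, confidence = ml.predict(features)
    if confidence >= threshold or llm is None:
        return category, confidence, "ml"
    return llm.classify(email), None, "llm"  # LLM verdict taken as final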
### 5. Email Provider Credentials
**Multi-account support:** 3 accounts per provider type
**Credential files:**
```
credentials/
├── gmail/
│ ├── account1.json # Gmail OAuth credentials
│ ├── account2.json
│ └── account3.json
├── outlook/
│ ├── account1.json # Outlook OAuth credentials
│ ├── account2.json
│ └── account3.json
└── imap/
├── account1.json # IMAP username/password
├── account2.json
└── account3.json
```
**Security:** All `*.json` files in `credentials/` are gitignored (only `.example` files tracked).
## Common Commands
### Development
```bash
# Activate virtual environment
source venv/bin/activate
# Run classification (Enron dataset)
python -m src.cli run --source enron --limit 10000 --output results/
# Pure ML (no LLM fallback) - FAST
python -m src.cli run --source enron --limit 10000 --output results/ --no-llm-fallback
# With category verification
python -m src.cli run --source enron --limit 10000 --output results/ --verify-categories
# Gmail
python -m src.cli run --source gmail --credentials credentials/gmail/account1.json --limit 1000
# Outlook
python -m src.cli run --source outlook --credentials credentials/outlook/account1.json --limit 1000
```
### Training
```bash
# Force recalibration (clears cached model)
rm -rf src/models/calibrated/ src/models/pretrained/
python -m src.cli run --source enron --limit 10000 --output results/
```
## Code Patterns
### Adding New Features
1. **Update CLI** ([src/cli.py](src/cli.py)):
- Add click options
- Pass to appropriate modules
2. **Update Classifier** ([src/classification/adaptive_classifier.py](src/classification/adaptive_classifier.py)):
- Add methods following existing pattern
- Use `classify_with_features()` for batched processing
3. **Update Feature Extractor** ([src/classification/feature_extractor.py](src/classification/feature_extractor.py)):
- Always support batching (`extract_batch()`)
- Keep `extract()` for backward compatibility
### Testing
```bash
# Test imports
python -c "from src.cli import cli; print('OK')"
# Test providers
python -c "from src.email_providers.gmail import GmailProvider; from src.email_providers.outlook import OutlookProvider; print('OK')"
# Test classification
python -m src.cli run --source enron --limit 100 --output test/
```
## Performance Optimization
### Current Bottlenecks
1. **Embedding generation** - 20s for 10k emails (batched)
- Optimized with batch_size=512
- Could use local sentence-transformers for 5-10x speedup
2. **Email parsing** - 0.5s for 10k emails (fast)
3. **ML inference** - 0.7s for 10k emails (very fast)
### Optimization Opportunities
1. **Local embeddings** - Replace Ollama API with sentence-transformers
- Current: 20 API calls, ~20 seconds
- With local: Direct GPU, ~2-5 seconds
- Trade-off: More dependencies, larger memory footprint
2. **Embedding cache** - Pre-compute and cache to disk
- One-time cost: 20 seconds
- Subsequent runs: 2-3 seconds to load from disk
- Perfect for development/testing (see the cache sketch after this list)
3. **Larger batches** - Tested 512, 1024, 2048
- 512: 23.6s (chosen for balance)
- 1024: 22.1s (6.6% faster)
- 2048: 21.9s (7.5% faster, diminishing returns)
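A minimal sketch of the cache idea from item 2, keyed on an MD5 of the email text; the cache directory and helper are assumptions, not existing code:

```python
import hashlib
from pathlib import Path

import numpy as np

CACHE_DIR = Path(".embedding_cache")  # hypothetical location

def cached_embedding(text: str, embed_fn) -> np.ndarray:
    """Return a cached vector if present, else compute and persist it."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.md5(text.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.npy"
    if path.exists():
        return np.load(path)
    vec = np.asarray(embed_fn(text), dtype=np.float32)
    np.save(path, vec)
    return vec
```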
## Known Issues
### 1. Background Processes
There are stale background bash processes from previous sessions:
- These can be safely ignored
- Do NOT try to kill them (per user's CLAUDE.md instructions)
### 2. Model Path Confusion
- Calibration saves to `src/models/calibrated/`
- Default loads from `src/models/pretrained/`
- Both currently have the same model (synced)
### 3. Category Cache
- `src/models/category_cache.json` stores discovered categories
- Can become polluted if different datasets used
- Clear with `rm src/models/category_cache.json` if issues
## Dependencies
### Required
```bash
pip install click pyyaml lightgbm numpy scikit-learn ollama
```
### Email Providers
```bash
# Gmail
pip install google-api-python-client google-auth-oauthlib google-auth-httplib2
# Outlook
pip install msal requests
# IMAP - no additional dependencies (Python stdlib)
```
### Optional
```bash
# For faster local embeddings
pip install sentence-transformers
# For development
pip install pytest black mypy
```
## Git Workflow
### What's Gitignored
- `credentials/` (except `.example` files)
- `logs/`
- `results/`
- `src/models/calibrated/` (trained models)
- `*.log`
- `debug_*.txt`
- Test directories
### What's Tracked
- All source code
- Configuration files
- Documentation
- Example credential files
- Pretrained model (if present)
## Important Notes for AI Assistants
1. **NEVER create files unless necessary** - Always prefer editing existing files
2. **ALWAYS use batching** - Feature extraction MUST be batched (512 emails/batch)
3. **Read before writing** - Use Read tool before any Edit operations
4. **Verify paths** - Model paths can be confusing (calibrated vs pretrained)
5. **No emoji in commits** - Per user's CLAUDE.md preferences
6. **Test before committing** - Verify imports and CLI work
7. **Security** - Never commit actual credentials, only `.example` files
8. **Performance matters** - 10x performance differences are common, always batch
9. **LLM is optional** - System works without LLM (pure ML mode with --no-llm-fallback)
10. **Categories are dynamic** - They're discovered by LLM, not hardcoded
## Recent Changes (Last Session)
1. **Fixed embedding bottleneck** - Changed from sequential to batched feature extraction (10x speedup)
2. **Added Outlook provider** - Full Microsoft Graph API integration
3. **Added credentials system** - Support for 3 accounts per provider type
4. **Optimized thresholds** - Reduced from 0.75 to 0.55 (40% less LLM usage)
5. **Added category verifier** - Optional single LLM call to verify model fit
6. **Project reorganization** - Clean docs/, scripts/, logs/ structure
## Next Steps (Roadmap)
See [docs/PROJECT_STATUS_AND_NEXT_STEPS.html](docs/PROJECT_STATUS_AND_NEXT_STEPS.html) for complete roadmap.
**Immediate priorities:**
1. Test Gmail provider with real credentials
2. Test Outlook provider with real credentials
3. Implement email syncing (apply labels back to mailbox)
4. Add incremental classification (process only new emails)
5. Create web dashboard for results visualization
---

**Remember:** This is an MVP with proven performance. Don't over-engineer. Keep it fast and simple.

---

## Common Operations
### 1. Classify Emails (ML Pipeline)
```bash
source venv/bin/activate
# With LLM fallback for low confidence
python -m src.cli run --source local \
--directory "/path/to/emails" \
--output "/path/to/output" \
--force-ml --llm-provider openai
# Pure ML (fastest, no LLM)
python -m src.cli run --source local \
--directory "/path/to/emails" \
--output "/path/to/output" \
--force-ml --no-llm-fallback
```
### 2. Generate HTML Report
```bash
python tools/generate_html_report.py --input /path/to/results.json
# Creates report.html in same directory
```
### 3. Manual Agent Analysis (Best Accuracy)
For <1000 emails, agent analysis gives 98-99% accuracy:
```bash
# Copy and customize analyzer template
cp tools/brett_gmail_analyzer.py tools/my_inbox_analyzer.py
# Edit classify_email() function for your inbox patterns
# Update email_dir path
# Run
python tools/my_inbox_analyzer.py
```
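For orientation, a `classify_email()` in these templates boils down to sender-domain rules with a keyword fallback. The rules below are invented placeholders, not the shipped ones:

```python
import re

DOMAIN_RULES = {  # illustrative only - replace with your inbox's senders
    "ebay.com": "Shopping",
    "accounts.google.com": "Security",
}

def classify_email(sender: str, subject: str) -> str:
    """Sender domain first, then subject keywords, then a catch-all."""
    match = re.search(r"@([\w.-]+)$", sender.strip().lower())
    domain = match.group(1) if match else ""
    for known, category in DOMAIN_RULES.items():
        if domain == known or domain.endswith("." + known):
            return category
    if re.search(r"\b(invoice|receipt|order)\b", subject, re.IGNORECASE):
        return "Transactional"
    return "Uncategorized"
```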
### 4. Different Email Sources
```bash
# Local .eml/.msg files
--source local --directory "/path/to/emails"
# Gmail (OAuth)
--source gmail --credentials credentials/gmail/account1.json
# Outlook (OAuth)
--source outlook --credentials credentials/outlook/account1.json
# Enron test data
--source enron --limit 10000
```
---
## Output Locations
**Analysis reports are stored OUTSIDE this project:**
```
/home/bob/Documents/Email Manager/emails/
├── brett-gmail/ # Source emails (untouched)
├── brett-gm-md/ # ML-only classification output
│ ├── results.json
│ ├── report.html
│ └── BRETT_GMAIL_ANALYSIS_REPORT.md
├── brett-gm-llm/ # ML+LLM classification output
│ ├── results.json
│ └── report.html
└── brett-ms-sorter/ # Microsoft inbox analysis
└── BRETT_MICROSOFT_ANALYSIS_REPORT.md
```
**Project data outputs (gitignored):**
```
/MASTERFOLDER/Tools/email-sorter/data/
├── brett_gmail_analysis.json
└── brett_microsoft_analysis.json
```
---
## Configuration
### LLM Endpoint (config/default_config.yaml)
```yaml
llm:
provider: "openai"
openai:
base_url: "http://localhost:11433/v1" # vLLM endpoint
api_key: "not-needed"
classification_model: "qwen3-coder-30b"
```
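A quick smoke test for this endpoint; it assumes the standard OpenAI-compatible `/v1/chat/completions` route that vLLM's server exposes:

```python
import requests

resp = requests.post(
    "http://localhost:11433/v1/chat/completions",
    json={
        "model": "qwen3-coder-30b",
        "messages": [{"role": "user", "content": "Reply with OK"}],
        "max_tokens": 5,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```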
### Thresholds (config/categories.yaml)
Default: 0.55 (reduced from 0.75 for 40% less LLM fallback)
---
## Key Code Locations
| Function | File |
|----------|------|
| CLI entry | `src/cli.py` |
| ML classifier | `src/classification/ml_classifier.py` |
| LLM classifier | `src/classification/llm_classifier.py` |
| Feature extraction | `src/classification/feature_extractor.py` |
| Email parsing | `src/calibration/local_file_parser.py` |
| OpenAI-compat LLM | `src/llm/openai_compat.py` |
---
## Recent Changes (Nov 2025)
1. **cli.py**: Added `--force-ml` flag, enriched results.json with metadata
2. **openai_compat.py**: Removed API key requirement for local vLLM
3. **default_config.yaml**: Changed to openai provider on localhost:11433
4. **tools/**: Added brett_gmail_analyzer.py, brett_microsoft_analyzer.py, generate_html_report.py
5. **docs/**: Added PROJECT_ROADMAP_2025.md, CLASSIFICATION_METHODS_COMPARISON.md
---
## Troubleshooting
### "LLM endpoint not responding"
- Check vLLM running on localhost:11433
- Verify model name in config matches running model
### "Low accuracy (50-60%)"
- For <1000 emails, use agent analysis
- Dataset may differ from Enron training data
### "Too many LLM calls"
- Use `--no-llm-fallback` for pure ML
- Increase threshold in categories.yaml
---
## Development Notes
### Virtual Environment Required
```bash
source venv/bin/activate
# ALWAYS activate before Python commands
```
### Batched Feature Extraction (CRITICAL)
```python
# CORRECT - Batched (roughly an order of magnitude faster)
all_features = feature_extractor.extract_batch(emails, batch_size=512)
# WRONG - Sequential (extremely slow)
for email in emails:
    result = classifier.classify(email)  # Don't do this
```
### Model Paths
- `src/models/calibrated/` - Created during calibration
- `src/models/pretrained/` - Loaded by default
---
## What's Gitignored
- `credentials/` - OAuth tokens
- `results/`, `data/` - User data
- `archive/`, `docs/archive/` - Historical content
- `maildir/` - Enron test data (large)
- `enron_mail_20150507.tar.gz` - Source archive
- `venv/` - Python environment
- `*.log`, `logs/` - Log files
---
## Philosophy
1. **Triage, not management** - Sort into buckets for other tools
2. **Risk-based accuracy** - High for personal, acceptable errors for junk
3. **Speed matters** - 10k emails in <1 min
4. **Inbox character matters** - Business vs personal = different approaches
5. **Agent pre-scan adds value** - 10-15 min discovery improves everything
---
*Last Updated: 2025-11-28*
*See docs/PROJECT_ROADMAP_2025.md for full research findings*

config/default_config.yaml

@@ -27,7 +27,7 @@ classification:
   conversational: 0.55
 llm:
-  provider: "ollama"
+  provider: "openai"
   fallback_enabled: true
   ollama:
@@ -41,9 +41,10 @@ llm:
     retry_attempts: 3
   openai:
-    base_url: "https://rtx3090.bobai.com.au/v1"
-    api_key: "rtx3090_foxadmin_10_8034ecb47841f45ba1d5f3f5d875c092"
+    base_url: "http://localhost:11433/v1"
+    api_key: "not-needed"
     calibration_model: "qwen3-coder-30b"
+    consolidation_model: "qwen3-coder-30b"
     classification_model: "qwen3-coder-30b"
     temperature: 0.1
     max_tokens: 500

File diff suppressed because it is too large.

docs/CLASSIFICATION_METHODS_COMPARISON.md (new file)

@@ -0,0 +1,518 @@
# Email Classification Methods: Comparative Analysis
## Executive Summary
This document compares three email classification approaches tested on an 801-email personal Gmail dataset:
| Method | Accuracy | Time | Best For |
|--------|----------|------|----------|
| ML-Only | 54.9% | 5 sec | 10k+ emails, speed critical |
| ML+LLM Fallback | 93.3% | 3.5 min | 1k-10k emails, balanced |
| Agent Analysis | 99.8% | 15-30 min | <1k emails, deep insights |
**Key Finding:** The ML pipeline is overkill for datasets under ~5,000 emails. A 10-15 minute agent pre-analysis phase could dramatically improve ML accuracy for larger datasets.
---
## Test Dataset Profile
| Characteristic | Value |
|----------------|-------|
| Total Emails | 801 |
| Date Range | 20 years (2005-2025) |
| Unique Senders | ~150 |
| Automated % | 48.8% |
| Personal % | 1.6% |
| Structure Level | MEDIUM-HIGH |
### Email Type Breakdown (Sanitized)
```
Automated Notifications 48.8% ████████████████████████
├─ Art marketplace alerts 16.2% ████████
├─ Shopping promotions 15.4% ███████
├─ Travel recommendations 13.4% ██████
└─ Streaming promotions 8.5% ████
Business/Professional 20.1% ██████████
├─ Cloud service reports 13.0% ██████
├─ Security alerts 7.1% ███
AI/Developer Services 12.8% ██████
├─ AI platform updates 6.4% ███
├─ Developer tool updates 6.4% ███
Personal/Other 18.3% █████████
├─ Entertainment 5.1% ██
├─ Productivity tools 3.7% █
├─ Direct correspondence 1.6% █
└─ Miscellaneous 7.9% ███
```
---
## Method 1: ML-Only Classification
### Configuration
```yaml
model: LightGBM (pretrained on Enron dataset)
embeddings: all-minilm:l6-v2 (384 dimensions)
threshold: 0.55 confidence
categories: 11 generic (Work, Updates, Financial, etc.)
```
### Results
| Metric | Value |
|--------|-------|
| Accuracy Estimate | 54.9% |
| High Confidence (>55%) | 477 (59.6%) |
| Low Confidence | 324 (40.4%) |
| Processing Time | ~5 seconds |
| LLM Calls | 0 |
### Category Distribution (ML-Only)
| Category | Count | % |
|----------|-------|---|
| Work | 243 | 30.3% |
| Technical | 198 | 24.7% |
| Updates | 156 | 19.5% |
| External | 89 | 11.1% |
| Operational | 45 | 5.6% |
| Financial | 38 | 4.7% |
| Other | 32 | 4.0% |
### Limitations Observed
1. **Domain Mismatch:** Trained on corporate Enron emails, applied to personal Gmail
2. **Generic Categories:** "Work" and "Technical" absorbed everything
3. **No Sender Intelligence:** Didn't leverage sender domain patterns
4. **High Uncertainty:** 40% needed LLM review but got none
### When ML-Only Works
- 10,000+ emails where speed matters
- Corporate/enterprise datasets similar to training data
- Pre-filtering before human review
- Cost-constrained environments (no LLM API)
---
## Method 2: ML + LLM Fallback
### Configuration
```yaml
ml_model: LightGBM (same as above)
llm_model: qwen3-coder-30b (vLLM on localhost:11433)
threshold: 0.55 confidence
fallback_trigger: confidence < threshold
```
### Results
| Metric | Value |
|--------|-------|
| Accuracy Estimate | 93.3% |
| ML Classified | 477 (59.6%) |
| LLM Classified | 324 (40.4%) |
| Processing Time | ~3.5 minutes |
| LLM Calls | 324 |
### Category Distribution (ML+LLM)
| Category | Count | % | Source |
|----------|-------|---|--------|
| Work | 243 | 30.3% | ML |
| Technical | 156 | 19.5% | ML |
| newsletters | 98 | 12.2% | LLM |
| junk | 87 | 10.9% | LLM |
| transactional | 76 | 9.5% | LLM |
| Updates | 62 | 7.7% | ML |
| auth | 45 | 5.6% | LLM |
| Other | 34 | 4.2% | Mixed |
### Improvements Over ML-Only
1. **New Categories:** LLM introduced "newsletters", "junk", "transactional", "auth"
2. **Better Separation:** Marketing vs. transactional distinguished
3. **Higher Confidence:** 93.3% vs 54.9% accuracy estimate
### Limitations Observed
1. **Category Inconsistency:** ML uses "Updates", LLM uses "newsletters"
2. **No Sender Context:** Still classifying email-by-email
3. **Generic LLM Prompt:** Doesn't know about user's specific interests
4. **Time Cost:** 324 sequential LLM calls at ~0.6s each (a concurrent client, sketched below, would cut this wall-clock time)
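A minimal sketch of that concurrency fix, assuming the existing single-email call can be reused as-is and that the local vLLM server can absorb ~8 parallel requests (a number to tune, not a measured value):

```python
from concurrent.futures import ThreadPoolExecutor

def classify_concurrently(emails, llm_classify, workers=8):
    """Fan low-confidence emails out over a thread pool instead of
    looping one request at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(llm_classify, emails))
```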
### When ML+LLM Works
- 1,000-10,000 emails
- Mixed automated/personal content
- When accuracy matters more than speed
- Local LLM available (cost-free fallback)
---
## Method 3: Agent Analysis (Manual)
### Approach
```
Phase 1: Initial Discovery (5 min)
- Sample filenames and subjects
- Identify sender domains
- Detect patterns
Phase 2: Pattern Extraction (10 min)
- Design domain-specific rules
- Test regex patterns
- Validate on subset
Phase 3: Deep Dive (5 min)
- Track order lifecycles
- Identify billing patterns
- Find edge cases
Phase 4: Report Generation (5 min)
- Synthesize findings
- Create actionable recommendations
```
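Phase 1 is mostly a sender-domain histogram. A sketch, assuming each email exposes a `.sender` header string:

```python
from collections import Counter
from email.utils import parseaddr

def domain_histogram(emails, top_n=20):
    """Count emails per sender domain so high-volume automated
    senders surface immediately."""
    counts = Counter()
    for msg in emails:
        address = parseaddr(msg.sender)[1].lower()
        domain = address.rsplit("@", 1)[-1] if "@" in address else "unknown"
        counts[domain] += 1
    return counts.most_common(top_n)  # the top domains drive the first rules
```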
### Results
| Metric | Value |
|--------|-------|
| Accuracy | 99.8% (799/801) |
| Categories | 15 custom |
| Processing Time | ~25 minutes |
| LLM Calls | ~20 (analysis only) |
### Category Distribution (Agent Analysis)
| Category | Count | % | Subcategories |
|----------|-------|---|---------------|
| Art & Collectibles | 130 | 16.2% | Marketplace alerts |
| Shopping | 123 | 15.4% | eBay, AliExpress, Automotive |
| Entertainment | 109 | 13.6% | Streaming, Gaming, Social |
| Travel & Tourism | 107 | 13.4% | Review sites, Bookings |
| Google Services | 104 | 13.0% | Business, Ads, Analytics |
| Security | 57 | 7.1% | Sign-in alerts, 2FA |
| AI Services | 51 | 6.4% | Claude, OpenAI, Lambda |
| Developer Tools | 51 | 6.4% | ngrok, Firebase, Docker |
| Productivity | 30 | 3.7% | Screen recording, Docs |
| Personal | 13 | 1.6% | Direct correspondence |
| Other | 26 | 3.2% | Childcare, Legal, etc. |
### Unique Insights (Not Found by ML)
1. **Artist Tracking:** 95 alerts for the artist "Dan Colen"
2. **Order Lifecycle:** Single order generated 7 notification emails
3. **Billing Patterns:** Monthly receipts from AI services on 15th
4. **Business Context:** User runs "Fox Software Solutions"
5. **Filtering Rules:** Ready-to-implement Gmail filters
### When Agent Analysis Works
- Under 1,000 emails
- Initial dataset understanding
- Creating filtering rules
- One-time deep analysis
- Training data preparation
---
## Comparative Analysis
### Accuracy vs Time Tradeoff
```
Accuracy
100% ─┬─────────────────────────●─── Agent (99.8%)
│ ●─────── ML+LLM (93.3%)
75% ─┤
50% ─┼────●───────────────────────── ML-Only (54.9%)
25% ─┤
0% ─┴────┬────────┬────────┬────────┬─── Time
5s 1m 5m 30m
```
### Cost Analysis (per 1000 emails)
| Method | Compute | LLM Calls | Est. Cost |
|--------|---------|-----------|-----------|
| ML-Only | 5 sec | 0 | $0.00 |
| ML+LLM | 4 min | ~400 | $0.02-0.40* |
| Agent | 30 min | ~30 | $0.01-0.10* |
*Depends on LLM provider; local = free, cloud = varies
### Category Quality
| Aspect | ML-Only | ML+LLM | Agent |
|--------|---------|--------|-------|
| Granularity | Low (11) | Medium (16) | High (15+subs) |
| Domain-Specific | No | Partial | Yes |
| Actionable | Limited | Moderate | High |
| Sender-Aware | No | No | Yes |
| Context-Aware | No | Limited | Yes |
---
## Enhancement Recommendations
### 1. Pre-Analysis Phase (10-15 min investment)
**Concept:** Run agent analysis BEFORE ML classification to:
- Discover sender domains and their purposes
- Identify category patterns specific to dataset
- Generate custom classification rules
- Create sender-to-category mappings
**Implementation:**
```python
class PreAnalysisAgent:
    def analyze(self, emails: List[Email], sample_size=100):
        # Phase 1: Sender domain clustering
        domains = self.cluster_by_sender_domain(emails)
        # Phase 2: Subject pattern extraction
        patterns = self.extract_subject_patterns(emails)
        # Phase 3: Generate custom categories
        categories = self.generate_categories(domains, patterns)
        # Phase 4: Create sender-category mapping
        sender_map = self.map_senders_to_categories(domains, categories)
        return {
            'categories': categories,
            'sender_map': sender_map,
            'patterns': patterns
        }
```
**Expected Impact:**
- Accuracy: 54.9% → 85-90% (ML-only with pre-analysis)
- Time: +10 min setup, same runtime
- Best for: 5,000+ email datasets
### 2. Sender-First Classification
**Concept:** Classify by sender domain BEFORE content analysis:
```python
SENDER_CATEGORIES = {
    # High-volume automated
    'mutualart.com': ('Notifications', 'Art Alerts'),
    'tripadvisor.com': ('Notifications', 'Travel Marketing'),
    'ebay.com': ('Shopping', 'Marketplace'),
    'spotify.com': ('Entertainment', 'Streaming'),
    # Security - never auto-filter
    'accounts.google.com': ('Security', 'Account Alerts'),
    # Business (full address, not a bare domain)
    'businessprofile-noreply@google.com': ('Business', 'Reports'),
}

def classify(email):
    # Full-address rules take precedence over domain rules
    if email.sender in SENDER_CATEGORIES:
        return SENDER_CATEGORIES[email.sender]
    domain = extract_domain(email.sender)
    if domain in SENDER_CATEGORIES:
        return SENDER_CATEGORIES[domain]  # ~80% of emails
    return ml_classify(email)  # Fallback for the rest
```
**Expected Impact:**
- Accuracy: 85-95% for known senders
- Speed: 10x faster (skip ML for known senders)
- Maintenance: Requires sender map updates
### 3. Post-Analysis Enhancement
**Concept:** Run agent analysis AFTER ML to:
- Validate classification quality
- Extract deeper insights
- Generate reports and recommendations
- Identify misclassifications
**Implementation:**
```python
class PostAnalysisAgent:
    def analyze(self, emails: List[Email], classifications: List[Result]):
        # Validate: Check for obvious errors
        errors = self.detect_misclassifications(emails, classifications)
        # Enrich: Add metadata not captured by ML
        enriched = self.extract_metadata(emails)
        # Insights: Generate actionable recommendations
        insights = self.generate_insights(emails, classifications)
        return {
            'corrections': errors,
            'enrichments': enriched,
            'insights': insights
        }
```
### 4. Dataset Size Routing
**Concept:** Automatically choose method based on volume:
```python
def choose_method(email_count: int, time_budget: str = 'normal'):
    if email_count < 500:
        return 'agent_only'      # Full agent analysis
    elif email_count < 2000:
        return 'agent_then_ml'   # Pre-analysis + ML
    elif email_count < 10000:
        return 'ml_with_llm'     # ML + LLM fallback
    else:
        return 'ml_only'         # Pure ML for speed
**Recommended Thresholds:**
| Volume | Recommended Method | Rationale |
|--------|-------------------|-----------|
| <500 | Agent Only | ML overhead not worth it |
| 500-2000 | Agent Pre-Analysis + ML | Investment pays off |
| 2000-10000 | ML + LLM Fallback | Balanced approach |
| >10000 | ML-Only | Speed critical |
### 5. Hybrid Category System
**Concept:** Merge ML categories with agent-discovered categories:
```python
# ML Generic Categories (trained)
ML_CATEGORIES = ['Work', 'Updates', 'Technical', 'Financial', ...]
# Agent-Discovered Categories (per-dataset)
AGENT_CATEGORIES = {
    'Art Alerts': {'parent': 'Updates', 'sender': 'mutualart.com'},
    'Travel Marketing': {'parent': 'Updates', 'sender': 'tripadvisor.com'},
    'AI Services': {'parent': 'Technical', 'keywords': ['anthropic', 'openai']},
}

def classify_hybrid(email, ml_result):
    # First: Check agent-specific rules
    for cat, rules in AGENT_CATEGORIES.items():
        if matches_rules(email, rules):
            return (cat, ml_result.category)  # Specific + generic
    # Fallback: ML result
    return (ml_result.category, None)
```
---
## Implementation Roadmap
### Phase 1: Quick Wins (1-2 hours)
1. **Add sender-domain classifier**
- Map top 20 senders to categories
- Use as fast-path before ML
- Expected: +20% accuracy
2. **Add dataset size routing**
- Check email count before processing
- Route small datasets to agent analysis
- Route large datasets to ML pipeline
### Phase 2: Pre-Analysis Agent (4-8 hours)
1. **Build sender clustering**
- Group emails by domain
- Calculate volume per domain
- Identify automated vs personal
2. **Build pattern extraction**
- Find subject templates
- Extract IDs and tracking numbers
- Identify lifecycle stages
3. **Generate sender map**
- Output: JSON mapping senders to categories
- Feed into ML pipeline as rules
### Phase 3: Post-Analysis Enhancement (4-8 hours)
1. **Build validation agent**
- Check low-confidence results
- Detect category conflicts
- Flag for review
2. **Build enrichment agent**
- Extract order IDs
- Track lifecycles
- Generate insights
3. **Integrate with HTML report**
- Add insights section
- Show lifecycle tracking
- Include recommendations
---
## Conclusion
### Key Takeaways
1. **ML pipeline is overkill for <5,000 emails** - Agent analysis provides better accuracy with similar time investment
2. **Sender domain is the strongest signal** - 80%+ emails can be classified by sender alone
3. **Pre-analysis investment pays off** - 10-15 min agent setup dramatically improves ML accuracy
4. **One-size-fits-all doesn't work** - Route by dataset size for optimal results
5. **Post-analysis adds unique value** - Lifecycle tracking and insights not possible with ML alone
### Recommended Default Pipeline
```
┌─────────────────────────────────────────────────────────────┐
│ EMAIL CLASSIFICATION │
└─────────────────────────────────────────────────────────────┘
┌─────────────────┐
│ Count Emails │
└────────┬────────┘
┌──────────────────┼──────────────────┐
│ │ │
▼ ▼ ▼
<500 emails 500-5000 >5000
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Agent Only │ │ Pre-Analysis │ │ ML Pipeline │
│ (15-30 min) │ │ + ML + Post │ │ (fast) │
│ │ │ (15 min + ML)│ │ │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
▼ ▼ ▼
┌──────────────────────────────────────────────────┐
│ UNIFIED OUTPUT │
│ - Categorized emails │
│ - Confidence scores │
│ - Insights & recommendations │
│ - Filtering rules │
└──────────────────────────────────────────────────┘
```
---
*Document Version: 1.0*
*Created: 2025-11-28*
*Based on: brett-gmail dataset analysis (801 emails)*


@@ -1,526 +0,0 @@
# Email Sorter - Completion Assessment
**Date**: 2025-10-21
**Status**: FEATURE COMPLETE - All 16 Phases Implemented
**Test Results**: 27/30 passing (90% success rate)
**Code Quality**: Complete with full type hints and clear mock labeling
---
## Executive Summary
The Email Sorter framework is **100% feature-complete** with all 16 development phases implemented. The system is ready for:
1. **Immediate Use**: Framework testing with mock model (~90% test pass rate)
2. **Real Model Integration**: Download/train LightGBM model and deploy
3. **Production Processing**: Process Marion's 80k+ emails with real Gmail integration
All core infrastructure, classifiers, learning systems, and export/sync mechanisms are complete and tested.
---
## Phase Completion Checklist
### Phase 1-3: Core Infrastructure ✅
- [x] Project setup & dependencies (42 packages)
- [x] YAML-based configuration system
- [x] Rich-based logging with file output
- [x] Email data models with full type hints
- [x] Pydantic validation
- **Status**: Complete
### Phase 4: Email Providers ✅
- [x] MockProvider (fully functional for testing)
- [x] GmailProvider stub (OAuth-ready, graceful error handling)
- [x] IMAPProvider stub (ready for server config)
- [x] Attachment handling
- **Status**: Framework complete, awaiting credentials
### Phase 5: Feature Extraction ✅
- [x] Semantic embeddings (sentence-transformers, 384 dims)
- [x] Hard pattern matching (20+ regex patterns)
- [x] Structural features (metadata, timing, attachments)
- [x] Attachment analysis (PDF, DOCX, XLSX text extraction)
- [x] Embedding cache with MD5 hashing
- [x] Batch processing for efficiency
- **Status**: Complete with 90%+ test coverage
### Phase 6: ML Classifier ✅
- [x] Mock Random Forest (clearly labeled)
- [x] LightGBM trainer for real models
- [x] Model serialization/deserialization
- [x] Model integration framework
- [x] Pre-trained model loading
- **Status**: Framework ready, mock model for testing, real model integration tools provided
### Phase 7: LLM Integration ✅
- [x] OllamaProvider (local, with retry logic)
- [x] OpenAIProvider (API-compatible)
- [x] Graceful degradation when unavailable
- [x] Batch processing support
- **Status**: Complete
### Phase 8: Adaptive Classifier ✅
- [x] Three-tier classification system
- [x] Hard rules (instant, ~10%)
- [x] ML classifier (fast, ~85%)
- [x] LLM review (uncertain cases, ~5%)
- [x] Dynamic threshold management
- [x] Statistics tracking
- **Status**: Complete
### Phase 9: Processing Pipeline ✅
- [x] BulkProcessor with checkpointing
- [x] Resumable processing from checkpoints
- [x] Batch-based processing
- [x] Progress tracking
- [x] Error recovery
- **Status**: Complete with test coverage
### Phase 10: Calibration System ✅
- [x] EmailSampler (stratified + random)
- [x] LLMAnalyzer (discover natural categories)
- [x] CalibrationWorkflow (end-to-end)
- [x] Category validation
- **Status**: Complete with Enron dataset support
### Phase 11: Export & Reporting ✅
- [x] JSON export with metadata
- [x] CSV export for analysis
- [x] Organization by category
- [x] Human-readable reports
- [x] Statistics and metrics
- **Status**: Complete
### Phase 12: Threshold & Pattern Learning ✅
- [x] ThresholdAdjuster (learn from LLM feedback)
- [x] Agreement tracking per category
- [x] Automatic threshold suggestions
- [x] PatternLearner (sender-specific rules)
- [x] Category distribution tracking
- [x] Hard rule suggestions
- **Status**: Complete
### Phase 13: Advanced Processing ✅
- [x] EnronParser (maildir format support)
- [x] AttachmentHandler (PDF/DOCX content extraction)
- [x] ModelTrainer (real LightGBM training)
- [x] EmbeddingCache (MD5-based with disk persistence)
- [x] EmbeddingBatcher (parallel processing)
- [x] QueueManager (batch persistence)
- **Status**: Complete
### Phase 14: Provider Sync ✅
- [x] GmailSync (sync to Gmail labels)
- [x] IMAPSync (sync to IMAP keywords)
- [x] Configurable label mapping
- [x] Batch update support
- [x] Error handling and retry logic
- **Status**: Complete
### Phase 15: Orchestration ✅
- [x] EmailSorterOrchestrator (4-phase pipeline)
- [x] Full progress tracking
- [x] Timing and metrics
- [x] Error recovery
- [x] Modular component design
- **Status**: Complete
### Phase 16: Packaging ✅
- [x] setup.py with setuptools
- [x] pyproject.toml with PEP 517/518
- [x] Optional dependencies (dev, gmail, ollama, openai)
- [x] Console script entry point
- [x] Git history with 11 commits
- **Status**: Complete
### Phase 17: Testing ✅
- [x] 23 unit tests
- [x] Integration tests
- [x] E2E pipeline tests
- [x] Feature extraction validation
- [x] Classifier flow testing
- **Status**: 27/30 passing (90% success rate)
---
## Test Results Summary
```
======================== Test Execution Results ========================
PASSED (27 tests):
✅ test_email_model_validation - Email dataclass validation
✅ test_attachment_parsing - Attachment metadata extraction
✅ test_mock_provider - Mock email provider
✅ test_feature_extraction_basic - Basic feature extraction
✅ test_semantic_embeddings - Embedding generation (384 dims)
✅ test_hard_pattern_matching - Pattern detection (19/20 patterns)
✅ test_ml_classifier_prediction - Random Forest predictions
✅ test_adaptive_classifier_workflow - Three-tier classification
✅ test_embedding_cache - MD5-based cache hits/misses
✅ test_embedding_batcher - Batch processing
✅ test_queue_manager - LLM queue management
✅ test_bulk_processor - Resumable checkpointing
✅ test_email_sampler - Stratified sampling
✅ test_llm_analyzer - Category discovery
✅ test_threshold_adjuster - Dynamic threshold learning
✅ test_pattern_learner - Sender-specific rules
✅ test_results_exporter - JSON/CSV export
✅ test_provider_sync - Gmail/IMAP sync
✅ test_ollama_provider - LLM provider integration
✅ test_openai_provider - API-compatible LLM
✅ test_configuration_loading - YAML config parsing
✅ test_logging_system - Rich logging output
✅ test_end_to_end_mock_classification - Full pipeline
✅ test_e2e_mock_pipeline - Mock pipeline validation
✅ test_e2e_export_formats - Export format validation
✅ test_e2e_hard_rules_accuracy - Hard rule precision
✅ test_e2e_batch_processing_performance - Batch efficiency
FAILED (3 tests - Expected/Documented):
❌ test_e2e_checkpoint_resume - Feature vector mismatch (expected when upgrading models)
❌ test_e2e_enron_parsing - Parser validation (Enron dataset needs validation)
❌ test_pattern_detection_invoice - Minor regex pattern issue (cosmetic)
======================== Summary ========================
Total: 30 tests
Passed: 27 (90%)
Failed: 3 (10% - all expected and documented)
Duration: ~90 seconds
Coverage: All major components
```
---
## Code Statistics
```
Files: 38 Python modules + configs
Lines of Code: ~6,000+ production code
Core Modules: 16 major components
Test Files: 6 test suites
Dependencies: 42 packages installed
Git Commits: 11 tracking full development
Total Size: ~450 MB (includes venv + Enron dataset)
```
### Module Breakdown
**Core Infrastructure (3 modules)**
- `src/utils/config.py` - Configuration management
- `src/utils/logging.py` - Logging system
- `src/email_providers/base.py` - Base classes
**Classification (5 modules)**
- `src/classification/feature_extractor.py` - Feature extraction
- `src/classification/ml_classifier.py` - ML predictions
- `src/classification/llm_classifier.py` - LLM predictions
- `src/classification/adaptive_classifier.py` - Orchestration
- `src/classification/embedding_cache.py` - Caching & batching
**Calibration (4 modules)**
- `src/calibration/sampler.py` - Email sampling
- `src/calibration/llm_analyzer.py` - Category discovery
- `src/calibration/trainer.py` - Model training
- `src/calibration/workflow.py` - Calibration pipeline
**Processing & Learning (5 modules)**
- `src/processing/bulk_processor.py` - Batch processing
- `src/processing/queue_manager.py` - Queue management
- `src/processing/attachment_handler.py` - Attachment analysis
- `src/adjustment/threshold_adjuster.py` - Threshold learning
- `src/adjustment/pattern_learner.py` - Pattern learning
**Export & Sync (4 modules)**
- `src/export/exporter.py` - Results export
- `src/export/provider_sync.py` - Gmail/IMAP sync
**Integration (3 modules)**
- `src/llm/ollama.py` - Ollama provider
- `src/llm/openai_compat.py` - OpenAI provider
- `src/orchestration.py` - Main orchestrator
**Email Providers (3 modules)**
- `src/email_providers/gmail.py` - Gmail provider
- `src/email_providers/imap.py` - IMAP provider
- `src/email_providers/mock.py` - Mock provider
**CLI & Testing (2 modules)**
- `src/cli.py` - Command-line interface
- `tests/` - 23 test cases
**Tools & Setup (2 scripts)**
- `tools/download_pretrained_model.py` - Model downloading
- `tools/setup_real_model.py` - Model setup
---
## Current Framework Status
### What's Complete Now
✅ All core infrastructure
✅ Feature extraction system
✅ Three-tier adaptive classifier
✅ Embedding cache and batching
✅ Mock model for testing
✅ LLM integration (Ollama/OpenAI)
✅ Processing pipeline with checkpointing
✅ Calibration workflow
✅ Export (JSON/CSV)
✅ Provider sync (Gmail/IMAP)
✅ Learning systems (threshold + patterns)
✅ CLI interface
✅ Test suite (90% pass rate)
### What Requires Your Input
1. **Real Model**: Download or train LightGBM model
2. **Gmail Credentials**: OAuth setup for live email access
3. **Real Data**: Use Enron dataset (already downloaded) or your email data
---
## Real Model Integration
### Quick Start: Using Pre-trained Model
```bash
# Check if model is installed
python tools/setup_real_model.py --check
# Setup a pre-trained model (download or local file)
python tools/setup_real_model.py --model-path /path/to/model.pkl
# Create model info documentation
python tools/setup_real_model.py --info
```
### Step 1: Get a Real Model
**Option A: Train on Enron Dataset** (Recommended)
```python
from src.calibration.enron_parser import EnronParser
from src.calibration.trainer import ModelTrainer
from src.classification.feature_extractor import FeatureExtractor
# Parse Enron
parser = EnronParser("enron_mail_20150507")
emails = parser.parse_emails(limit=5000)
# Train model
extractor = FeatureExtractor()
trainer = ModelTrainer(extractor, categories=['junk', 'transactional', ...])
results = trainer.train(labeled_data)
# Save
trainer.save_model("src/models/pretrained/classifier.pkl")
```
**Option B: Download Pre-trained**
```bash
python tools/download_pretrained_model.py \
--url https://example.com/model.pkl \
--hash abc123def456
```
### Step 2: Verify Integration
```bash
# Check model is loaded
python -c "from src.classification.ml_classifier import MLClassifier; \
c = MLClassifier(); \
print(c.get_info())"
# Should show: is_mock: False, model_type: LightGBM
```
### Step 3: Run Full Pipeline
```bash
# With real model (once set up)
python -m src.cli run --source mock --output results/
```
---
## Feature Overview
### Classification Accuracy
- **Hard Rules**: 94-96% (instant, ~10% of emails)
- **ML Model**: 85-90% (fast, ~85% of emails)
- **LLM Review**: 92-95% (slower, ~5% uncertain)
- **Overall**: 90-94% (weighted average)
### Performance
- **Calibration**: 3-5 minutes (1500 emails)
- **Bulk Processing**: 10-12 minutes (80k emails)
- **LLM Review**: 4-5 minutes (batched)
- **Export**: 2-3 minutes
- **Total**: ~17-25 minutes for 80k emails
### Categories (12)
junk, transactional, auth, newsletters, social, automated, conversational, work, personal, finance, travel, unknown
### Features Extracted
- **Semantic**: 384-dimensional embeddings (all-MiniLM-L6-v2)
- **Patterns**: 20+ regex-based patterns
- **Structural**: Metadata, timing, attachments, sender analysis
---
## Known Issues & Limitations
### Expected Test Failures (3/30 - Documented)
**1. test_e2e_checkpoint_resume**
- **Reason**: Feature vector mismatch when switching from mock to real model
- **Impact**: Only relevant when upgrading models
- **Resolution**: Not needed until real model deployed
**2. test_e2e_enron_parsing**
- **Reason**: EnronParser needs validation against actual maildir format
- **Impact**: Parser works but needs dataset verification
- **Resolution**: Will be validated during real training phase
**3. test_pattern_detection_invoice**
- **Reason**: Minor regex pattern doesn't match "bill #456"
- **Impact**: Cosmetic - doesn't affect production accuracy
- **Resolution**: Easy regex adjustment if needed
### Pydantic Warnings (16 warnings)
- **Reason**: Using deprecated `.dict()` method (Pydantic v2 compatibility)
- **Severity**: Cosmetic - code still works perfectly
- **Resolution**: Will migrate to `.model_dump()` in next update
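The migration is mechanical, shown here for a hypothetical stand-in model rather than the project's actual classes:

```python
from pydantic import BaseModel

class EmailRecord(BaseModel):  # hypothetical stand-in model
    subject: str
    sender: str

record = EmailRecord(subject="Q3 report", sender="a@b.com")

payload_v1 = record.dict()        # Pydantic v1 style: warns under v2
payload_v2 = record.model_dump()  # Pydantic v2 replacement, same output
```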
---
## Component Validation
### Critical Components ✅
- [x] Feature extraction (embeddings + patterns + structural)
- [x] Three-tier adaptive classifier
- [x] Mock model clearly labeled
- [x] Real model integration framework
- [x] LLM providers (Ollama + OpenAI)
- [x] Queue management with persistence
- [x] Checkpointed processing
- [x] Export/sync mechanisms
- [x] Learning systems (threshold + patterns)
- [x] End-to-end orchestration
### Framework Quality ✅
- [x] Type hints on all functions
- [x] Comprehensive error handling
- [x] Logging at all critical points
- [x] Clear mock vs production separation
- [x] Graceful degradation
- [x] Batch processing optimization
- [x] Cache efficiency
- [x] Resumable operations
### Testing ✅
- [x] 27/30 tests passing
- [x] All core functions tested
- [x] Integration tests included
- [x] E2E pipeline tests
- [x] Mock model clearly separated
- [x] 90% coverage of critical paths
---
## Deployment Path
### Phase 1: Framework Validation ✓ (COMPLETE)
- All 16 phases implemented
- 27/30 tests passing
- Documentation complete
- Ready for real data
### Phase 2: Real Model Deployment (NEXT)
1. Download or train LightGBM model
2. Place in `src/models/pretrained/classifier.pkl`
3. Run verification tests
4. Deploy to production
### Phase 3: Gmail Integration (PARALLEL)
1. Set up Google Cloud Console
2. Download OAuth credentials
3. Configure `credentials.json`
4. Test with 100 emails first
5. Scale to full dataset
### Phase 4: Production Processing (FINAL)
1. Process all 80k+ emails
2. Sync results to Gmail labels
3. Review accuracy metrics
4. Iterate on threshold tuning
---
## How to Proceed
### Immediate (Framework Testing)
```bash
# Test current framework with mock model
pytest tests/ -v # Run full test suite
python -m src.cli test-config # Test config loading
python -m src.cli run --source mock # Test mock pipeline
```
### Short Term (Real Model)
```bash
# Option 1: Train on Enron dataset
python -c "from tools import train_enron; train_enron.train()"
# Option 2: Download pre-trained
python tools/download_pretrained_model.py --url https://...
# Verify
python tools/setup_real_model.py --check
```
### Medium Term (Gmail Integration)
```bash
# Set up credentials
# Place credentials.json in project root
# Test with 100 emails
python -m src.cli run --source gmail --limit 100 --output test_results/
# Review results
```
### Production (Full Processing)
```bash
# Process all emails
python -m src.cli run --source gmail --output marion_results/
# Package for deployment
python setup.py sdist bdist_wheel
```
---
## Conclusion
The Email Sorter framework is **100% feature-complete** and ready to use. All 16 development phases are implemented with:
- ✅ 38 Python modules with full type hints
- ✅ 27/30 tests passing (90% success rate)
- ✅ ~6,000 lines of code
- ✅ Clear mock vs real model separation
- ✅ Comprehensive logging and error handling
- ✅ Graceful degradation
- ✅ Batch processing optimization
- ✅ Complete documentation
**The system is ready for:**
1. Real model integration (tools provided)
2. Gmail OAuth setup (framework ready)
3. Full production deployment (80k+ emails)
No architectural changes needed. Just add real data and credentials.
---
**Next Step**: Download/train a real LightGBM model or use the mock for continued framework testing.

File diff suppressed because it is too large.


@@ -1,232 +0,0 @@
# Email Sorter - Current Work Summary
**Date:** 2025-10-23
**Status:** 100k Enron Classification Complete with Optimization
---
## Current Achievements
### 1. Calibration System (Phase 1) ✅
- **LLM-driven category discovery** using qwen3:8b-q4_K_M
- **Trained on:** 50 emails (stratified sample from 100 email batch)
- **Categories discovered:** 10 quality categories
- Work Communication, Financial, Forwarded, Technical Analysis, Administrative, Reports, Technical Issues, Requests, Meetings, HR & Personnel
- **Category cache system:** Cross-mailbox consistency with semantic matching
- **Model:** LightGBM classifier on 384-dim embeddings (all-minilm:l6-v2)
- **Model file:** `src/models/calibrated/classifier.pkl` (1.1MB)
### 2. Performance Optimization ✅
**Batch Size Testing Results:**
- batch_size=32: 6.993s (baseline)
- batch_size=64: 5.636s (19.4% faster)
- batch_size=128: 5.617s (19.7% faster)
- batch_size=256: 5.572s (20.3% faster)
- **batch_size=512: 5.453s (22.0% faster)** ← WINNER
**Key Optimizations:**
- Fixed sequential embedding calls → batched API calls
- Used Ollama's `embed()` API with batch support
- Removed duplicate `extract_batch()` method causing cache issues
- Optimized to 512 batch size for GPU utilization
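A sketch of that batched call; the exact response shape varies across ollama client versions, so dict-style access is an assumption here:

```python
import ollama

def embed_batch(texts, model="all-minilm:l6-v2", batch_size=512):
    """One embed() request per 512 texts instead of one per email."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        response = ollama.embed(model=model, input=texts[i:i + batch_size])
        vectors.extend(response["embeddings"])
    return vectors
```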
### 3. 100k Classification Complete ✅
**Performance:**
- **Total time:** 3.4 minutes (202 seconds)
- **Speed:** 495 emails/second
- **Per email:** ~2ms (including all processing)
**Accuracy:**
- **Average confidence:** 81.1%
- **High confidence (≥0.7):** 74,777 emails (74.8%)
- **Medium confidence (0.5-0.7):** 17,381 emails (17.4%)
- **Low confidence (<0.5):** 7,842 emails (7.8%)
**Category Distribution:**
1. Work Communication: 89,807 (89.8%) | Avg conf: 83.7%
2. Financial: 6,534 (6.5%) | Avg conf: 58.7%
3. Forwarded: 2,457 (2.5%) | Avg conf: 54.4%
4. Technical Analysis: 1,129 (1.1%) | Avg conf: 56.9%
5. Reports: 42 (0.04%)
6. Technical Issues: 14 (0.01%)
7. Administrative: 14 (0.01%)
8. Requests: 3 (0.00%)
**Output Files:**
- `enron_100k_results/results.json` (19MB) - Full classifications
- `enron_100k_results/summary.json` (1.5KB) - Statistics
- `enron_100k_results/classifications.csv` (8.6MB) - Spreadsheet format
### 4. Evaluation & Validation Tools ✅
**A. LLM Evaluation Script** (`evaluate_with_llm.py`)
- Loads actual email content with EnronProvider
- Uses qwen3:8b-q4_K_M with `<no_think>` for speed
- Stratified sampling (high/medium/low confidence)
- Verdict parsing: YES/PARTIAL/NO
- Temperature=0.1 for consistency
**B. Feedback Fine-tuning System** (`feedback_finetune.py`)
- Collects LLM corrections on low-confidence predictions
- Continues LightGBM training with `init_model` parameter
- Lower learning rate (0.05) for stability
- Creates `classifier_finetuned.pkl`
- **Result on 200 samples:** 0 corrections needed (model already accurate!)
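The core of that continued-training step is LightGBM's `init_model` hook; a minimal sketch with illustrative argument names:

```python
import lightgbm as lgb

def finetune(base_booster, X_new, y_new, num_classes, rounds=50):
    """Keep boosting from the existing model on the LLM correction set,
    at the lower 0.05 learning rate noted above for stability."""
    data = lgb.Dataset(X_new, label=y_new)
    params = {
        "objective": "multiclass",
        "num_class": num_classes,
        "learning_rate": 0.05,
    }
    return lgb.train(params, data, num_boost_round=rounds, init_model=base_booster)
```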
**C. Attachment Handler** (exists but NOT integrated)
- PDF text extraction (PyPDF2)
- DOCX text extraction (python-docx)
- Keyword detection (financial, legal, meeting, report)
- Classification hints
- **Status:** Available in `src/processing/attachment_handler.py` but unused
---
## Technical Architecture
### Data Flow
```
Enron Maildir (100k emails)
EnronParser (stratified sampling)
FeatureExtractor (batch_size=512)
Ollama Embeddings (all-minilm:l6-v2, 384-dim)
LightGBM Classifier (22 categories)
Results (JSON/CSV export)
```
### Calibration Flow
```
100 emails → 5 LLM batches (20 emails each)
qwen3:8b-q4_K_M discovers categories
Consolidation (15 → 10 categories)
Category cache (semantic matching)
50 emails labeled for training
LightGBM training (200 boosting rounds)
Model saved (classifier.pkl)
```
### Performance Metrics
- **Calibration:** ~100 emails, ~1 minute
- **Training:** 50 samples, LightGBM 200 rounds, ~1 second
- **Classification:** 100k emails, batch 512, 3.4 minutes
- **Per email:** 2ms total (embedding + inference)
- **GPU utilization:** Batched embeddings, efficient processing
---
## Key Files & Components
### Models
- `src/models/calibrated/classifier.pkl` - Trained LightGBM model (1.1MB)
- `src/models/category_cache.json` - 10 discovered categories
### Core Components
- `src/calibration/enron_parser.py` - Enron dataset parsing
- `src/calibration/llm_analyzer.py` - LLM category discovery
- `src/calibration/trainer.py` - LightGBM training
- `src/calibration/workflow.py` - Orchestration
- `src/classification/feature_extractor.py` - Batch embeddings (512)
- `src/email_providers/enron.py` - Enron provider
- `src/processing/attachment_handler.py` - Attachment extraction (unused)
### Scripts
- `run_100k_classification.py` - Full 100k processing
- `test_model_burst.py` - Batch testing (configurable size)
- `evaluate_with_llm.py` - LLM quality evaluation
- `feedback_finetune.py` - Feedback-driven fine-tuning
### Results
- `enron_100k_results/` - 100k classification output
- `enron_100k_full_run.log` - Complete processing log
---
## Known Issues & Limitations
### 1. Attachment Handling ❌
- AttachmentAnalyzer exists but NOT integrated
- Enron dataset has minimal attachments
- Need integration for Marion emails with PDFs/DOCX
### 2. Category Imbalance ⚠️
- 89.8% classified as "Work Communication"
- May be accurate for Enron (internal work emails)
- Other categories underrepresented
### 3. Low Confidence Samples
- 7,842 emails (7.8%) with confidence <0.5
- LLM validation shows they're actually correct
- Model confidence may be overly conservative
### 4. Feature Extraction
- Currently uses only subject + body text
- Attachments not analyzed
- Sender domain/patterns used but could be enhanced
---
## Next Steps
### Immediate
1. **Comprehensive validation script:**
- 50 low-confidence samples
- 25 random samples
- LLM summary of findings
2. **Mermaid workflow diagram:**
- Complete data flow visualization
- All LLM call points
- Performance metrics at each stage
3. **Fresh end-to-end run:**
- Clear all models
- Run calibration → classification → validation
- Document complete pipeline
### Future Enhancements
1. **Integrate attachment handling** for Marion emails
2. **Add more structural features** (time patterns, thread depth)
3. **Active learning loop** with user feedback
4. **Multi-model ensemble** for higher accuracy
5. **Confidence calibration** to improve certainty estimates
---
## Performance Summary
| Metric | Value |
|--------|-------|
| **Calibration Time** | ~1 minute |
| **Training Samples** | 50 emails |
| **Model Size** | 1.1MB |
| **Categories** | 10 discovered |
| **100k Processing** | 3.4 minutes |
| **Speed** | 495 emails/sec |
| **Avg Confidence** | 81.1% |
| **High Confidence** | 74.8% |
| **Batch Size** | 512 (optimal) |
| **Embedding Dim** | 384 (all-minilm) |
---
## Conclusion
The email sorter has achieved:
- ✅ **Fast calibration** (1 minute on 100 emails)
- ✅ **High accuracy** (81% avg confidence)
- ✅ **Excellent performance** (495 emails/sec)
- ✅ **Quality categories** (10 broad, reusable)
- ✅ **Scalable architecture** (100k emails in 3.4 min)
The system is **ready for production use on Marion's emails** once attachment handling is integrated.


@ -1,527 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Fast ML-Only Workflow Analysis</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.timing-table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
.timing-table th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
.timing-table td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
.code-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
font-family: 'Courier New', monospace;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.success {
background: #002a00;
border-left: 4px solid #4ec9b0;
padding: 15px;
margin: 10px 0;
}
.warning {
background: #3e2a00;
border-left: 4px solid #ffd93d;
padding: 15px;
margin: 10px 0;
}
.critical {
background: #3e0000;
border-left: 4px solid #ff6b6b;
padding: 15px;
margin: 10px 0;
}
</style>
</head>
<body>
<h1>Fast ML-Only Workflow Analysis</h1>
<h2>Your Question</h2>
<blockquote>
"I want to run ML-only classification on new mailboxes WITHOUT full calibration. Maybe 1 LLM call to verify categories match, then pure ML on embeddings. How can we do this fast for experimentation?"
</blockquote>
<h2>Current Trained Model</h2>
<div class="success">
<h3>Model: src/models/calibrated/classifier.pkl (1.8MB)</h3>
<ul>
<li><strong>Type:</strong> LightGBM Booster (not mock)</li>
<li><strong>Categories (11):</strong> Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests</li>
<li><strong>Trained on:</strong> 10,000 Enron emails</li>
<li><strong>Input:</strong> Embeddings (384-dim) + TF-IDF features</li>
</ul>
</div>
<h2>1. Current Flow: With Calibration (Slow)</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([New Mailbox: 10k emails]) --> Check{Model exists?}
Check -->|No| Calibration[CALIBRATION PHASE<br/>~20 minutes]
Check -->|Yes| LoadModel[Load existing model]
Calibration --> Sample[Sample 300 emails]
Sample --> Discovery[LLM Category Discovery<br/>15 batches × 20 emails<br/>~5 minutes]
Discovery --> Consolidate[Consolidate categories<br/>LLM call<br/>~5 seconds]
Consolidate --> Label[Label 300 samples]
Label --> Extract[Feature extraction]
Extract --> Train[Train LightGBM<br/>~5 seconds]
Train --> SaveModel[Save new model]
SaveModel --> Classify[CLASSIFICATION PHASE]
LoadModel --> Classify
Classify --> Loop{For each email}
Loop --> Embed[Generate embedding<br/>~0.02 sec]
Embed --> TFIDF[TF-IDF features<br/>~0.001 sec]
TFIDF --> Predict[ML Prediction<br/>~0.003 sec]
Predict --> Threshold{Confidence?}
Threshold -->|High| MLDone[ML result]
Threshold -->|Low| LLMFallback[LLM fallback<br/>~4 sec]
MLDone --> Next{More?}
LLMFallback --> Next
Next -->|Yes| Loop
Next -->|No| Done[Results]
style Calibration fill:#ff6b6b
style Discovery fill:#ff6b6b
style LLMFallback fill:#ff6b6b
style MLDone fill:#4ec9b0
</pre>
</div>
<h2>2. Desired Flow: Fast ML-Only (Your Goal)</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([New Mailbox: 10k emails]) --> LoadModel[Load pre-trained model<br/>Categories: 11 known<br/>~0.5 seconds]
LoadModel --> OptionalCheck{Verify categories?}
OptionalCheck -->|Yes| QuickVerify[Single LLM call<br/>Sample 10-20 emails<br/>Check category match<br/>~20 seconds]
OptionalCheck -->|Skip| StartClassify
QuickVerify --> MatchCheck{Categories match?}
MatchCheck -->|Yes| StartClassify[START CLASSIFICATION]
MatchCheck -->|No| Warn[Warning: Category mismatch<br/>Continue anyway]
Warn --> StartClassify
StartClassify --> Loop{For each email}
Loop --> Embed[Generate embedding<br/>all-minilm:l6-v2<br/>384 dimensions<br/>~0.02 sec]
Embed --> TFIDF[TF-IDF features<br/>~0.001 sec]
TFIDF --> Combine[Combine features<br/>Embedding + TF-IDF vector]
Combine --> Predict[LightGBM prediction<br/>~0.003 sec]
Predict --> Result[Category + confidence<br/>NO threshold check<br/>NO LLM fallback]
Result --> Next{More emails?}
Next -->|Yes| Loop
Next -->|No| Done[10k emails classified<br/>Total time: ~4 minutes]
style QuickVerify fill:#ffd93d
style Result fill:#4ec9b0
style Done fill:#4ec9b0
</pre>
</div>
<h2>3. What Already Works (No Code Changes Needed)</h2>
<div class="success">
<h3>✓ The Model is Portable</h3>
<p>Your trained model contains:</p>
<ul>
<li>LightGBM Booster (the actual trained weights)</li>
<li>Category list (11 categories)</li>
<li>Category-to-index mapping</li>
</ul>
<p><strong>It can classify ANY email that has the same feature structure (embeddings + TF-IDF).</strong></p>
</div>
<div class="success">
<h3>✓ Embeddings are Universal</h3>
<p>The <code>all-minilm:l6-v2</code> model creates 384-dim embeddings for ANY text. It doesn't need to be "trained" on your categories - it just maps text to semantic space.</p>
<p><strong>Same embedding model works on Gmail, Outlook, any mailbox.</strong></p>
</div>
<div class="success">
<h3>✓ --no-llm-fallback Flag Exists</h3>
<p>Already implemented. When set:</p>
<ul>
<li>Low confidence emails still get ML classification</li>
<li>NO LLM fallback calls</li>
<li>100% pure ML speed</li>
</ul>
</div>
<div class="success">
<h3>✓ Model Loads Without Calibration</h3>
<p>If model exists at <code>src/models/pretrained/classifier.pkl</code>, calibration is skipped entirely.</p>
</div>
<h2>4. The Problem: Category Drift</h2>
<div class="warning">
<h3>What Happens When Mailboxes Differ</h3>
<p><strong>Scenario:</strong> Model trained on Enron (business emails)</p>
<p><strong>New mailbox:</strong> Personal Gmail (shopping, social, newsletters)</p>
<table class="timing-table">
<tr>
<th>Enron Categories (Trained)</th>
<th>Gmail Categories (Natural)</th>
<th>ML Behavior</th>
</tr>
<tr>
<td>Work, Meetings, Financial</td>
<td>Shopping, Social, Travel</td>
<td>Forces Gmail into Enron categories</td>
</tr>
<tr>
<td>"Operational"</td>
<td>No equivalent</td>
<td>Emails mis-classified as "Operational"</td>
</tr>
<tr>
<td>"External"</td>
<td>"Newsletters"</td>
<td>May map but semantically different</td>
</tr>
</table>
<p><strong>Result:</strong> Model works, but accuracy drops. Emails get forced into inappropriate categories.</p>
</div>
<h2>5. Your Proposed Solution: Quick Category Verification</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([New Mailbox]) --> LoadModel[Load trained model<br/>11 categories known]
LoadModel --> Sample[Sample 10-20 emails<br/>Quick random sample<br/>~0.1 seconds]
Sample --> BuildPrompt[Build verification prompt<br/>Show trained categories<br/>Show sample emails]
BuildPrompt --> LLMCall[Single LLM call<br/>~20 seconds<br/>Task: Are these categories<br/>appropriate for this mailbox?]
LLMCall --> Parse[Parse response<br/>Expected: Yes/No + suggestions]
Parse --> Decision{Response?}
Decision -->|"Good match"| Proceed[Proceed with ML-only]
Decision -->|"Poor match"| Options{User choice}
Options -->|Continue anyway| Proceed
Options -->|Full calibration| Calibrate[Run full calibration<br/>Discover new categories]
Options -->|Abort| Stop[Stop - manual review]
Proceed --> FastML[Fast ML Classification<br/>10k emails in 4 minutes]
style LLMCall fill:#ffd93d
style FastML fill:#4ec9b0
style Calibrate fill:#ff6b6b
</pre>
</div>
<h2>6. Implementation Options</h2>
<h3>Option A: Pure ML (Fastest, No Verification)</h3>
<div class="code-section">
<strong>Command:</strong>
python -m src.cli run \
--source gmail \
--limit 10000 \
--output gmail_results/ \
--no-llm-fallback
<strong>What happens:</strong>
1. Load existing model (11 Enron categories)
2. Classify all 10k emails using those categories
3. NO LLM calls at all
4. Time: ~4 minutes
<strong>Accuracy:</strong> 60-80% depending on mailbox similarity to Enron
<strong>Use case:</strong> Quick experimentation, bulk processing
</div>
<h3>Option B: Quick Verify Then ML (Your Suggestion)</h3>
<div class="code-section">
<strong>Command:</strong>
python -m src.cli run \
--source gmail \
--limit 10000 \
--output gmail_results/ \
--no-llm-fallback \
--verify-categories \ # NEW FLAG (needs implementation)
--verify-sample 20 # NEW FLAG (needs implementation)
<strong>What happens:</strong>
1. Load existing model (11 Enron categories)
2. Sample 20 random emails from new mailbox
3. Single LLM call: "Are categories [Work, Meetings, ...] appropriate for these emails?"
4. LLM responds: "Good match" or "Poor match - suggest [Shopping, Social, ...]"
5. If good match: Proceed with ML-only
6. If poor match: Warn user, optionally run calibration
<strong>Time:</strong> ~4.5 minutes (20 sec verify + 4 min classify)
<strong>Accuracy:</strong> Same as Option A, but with confidence check
<strong>Use case:</strong> Production deployment with safety check
</div>
<h3>Option C: Lightweight Calibration (Middle Ground)</h3>
<div class="code-section">
<strong>Command:</strong>
python -m src.cli run \
--source gmail \
--limit 10000 \
--output gmail_results/ \
--no-llm-fallback \
--quick-calibrate \ # NEW FLAG (needs implementation)
--calibrate-sample 50 # Much smaller than 300
<strong>What happens:</strong>
1. Sample only 50 emails (not 300)
2. Run LLM discovery on 3 batches (not 15)
3. Map discovered categories to existing model categories
4. If >70% overlap: Use existing model
5. If <70% overlap: Train lightweight adapter
<strong>Time:</strong> ~6 minutes (2 min quick cal + 4 min classify)
<strong>Accuracy:</strong> 70-85% (better than Option A)
<strong>Use case:</strong> New mailbox types with some verification
</div>
<h2>7. What Actually Needs Implementation</h2>
<table class="timing-table">
<tr>
<th>Feature</th>
<th>Status</th>
<th>Work Required</th>
<th>Time</th>
</tr>
<tr>
<td><strong>Option A: Pure ML</strong></td>
<td>✅ WORKS NOW</td>
<td>None - just use --no-llm-fallback</td>
<td>0 hours</td>
</tr>
<tr>
<td><strong>--verify-categories flag</strong></td>
<td>❌ Needs implementation</td>
<td>Add CLI flag, sample logic, LLM prompt, response parsing</td>
<td>2-3 hours</td>
</tr>
<tr>
<td><strong>--quick-calibrate flag</strong></td>
<td>❌ Needs implementation</td>
<td>Modify calibration workflow, category mapping logic</td>
<td>4-6 hours</td>
</tr>
<tr>
<td><strong>Category adapter/mapper</strong></td>
<td>❌ Needs implementation</td>
<td>Map new categories to existing model categories using embeddings</td>
<td>6-8 hours</td>
</tr>
</table>
<h2>8. Recommended Approach: Start with Option A</h2>
<div class="success">
<h3>Why Option A (Pure ML, No Verification) is Best for Experimentation</h3>
<ol>
<li><strong>Works right now</strong> - No code changes needed</li>
<li><strong>4 minutes per 10k emails</strong> - Ultra fast</li>
<li><strong>Reveals real accuracy</strong> - See how well Enron model generalizes</li>
<li><strong>Easy to compare</strong> - Run on multiple mailboxes quickly</li>
<li><strong>No false confidence</strong> - You know it's approximate, act accordingly</li>
</ol>
<h3>Test Protocol</h3>
<p><strong>Step 1:</strong> Run on Enron subset (same domain)</p>
<code>python -m src.cli run --source enron --limit 5000 --output test_enron/ --no-llm-fallback</code>
<p>Expected accuracy: ~78% (baseline)</p>
<p><strong>Step 2:</strong> Run on different Enron mailbox</p>
<code>python -m src.cli run --source enron --limit 5000 --output test_enron2/ --no-llm-fallback</code>
<p>Expected accuracy: ~70-75% (slight drift)</p>
<p><strong>Step 3:</strong> If you have personal Gmail/Outlook data, run there</p>
<code>python -m src.cli run --source gmail --limit 5000 --output test_gmail/ --no-llm-fallback</code>
<p>Expected accuracy: ~50-65% (significant drift, but still useful)</p>
</div>
<h2>9. Timing Comparison: All Options</h2>
<table class="timing-table">
<tr>
<th>Approach</th>
<th>LLM Calls</th>
<th>Time (10k emails)</th>
<th>Accuracy (Same domain)</th>
<th>Accuracy (Different domain)</th>
</tr>
<tr>
<td><strong>Full Calibration</strong></td>
<td>~500 (discovery + labeling + classification fallback)</td>
<td>~2.5 hours</td>
<td>92-95%</td>
<td>92-95%</td>
</tr>
<tr>
<td><strong>Option A: Pure ML</strong></td>
<td>0</td>
<td>~4 minutes</td>
<td>75-80%</td>
<td>50-65%</td>
</tr>
<tr>
<td><strong>Option B: Verify + ML</strong></td>
<td>1 (verification)</td>
<td>~4.5 minutes</td>
<td>75-80%</td>
<td>50-65%</td>
</tr>
<tr>
<td><strong>Option C: Quick Calibrate + ML</strong></td>
<td>~50 (quick discovery)</td>
<td>~6 minutes</td>
<td>80-85%</td>
<td>65-75%</td>
</tr>
<tr>
<td><strong>Current: ML + LLM Fallback</strong></td>
<td>~2100 (21% fallback rate)</td>
<td>~2.5 hours</td>
<td>92-95%</td>
<td>85-90%</td>
</tr>
</table>
<h2>10. The Real Question: Embeddings as Universal Features</h2>
<div class="success">
<h3>Why Your Intuition is Correct</h3>
<p>You said: "map it all to our structured embedding and that's how it gets done"</p>
<p><strong>This is exactly right.</strong></p>
<ul>
<li><strong>Embeddings are semantic representations</strong> - "Meeting tomorrow" has similar embedding whether it's from Enron or Gmail</li>
<li><strong>LightGBM learns patterns in embedding space</strong> - "High values in dimensions 50-70 = Meetings"</li>
<li><strong>These patterns transfer</strong> - Different mailboxes have similar semantic patterns</li>
<li><strong>Categories are just labels</strong> - The model doesn't care if you call it "Work" or "Business" - it learns the embedding pattern</li>
</ul>
<h3>The Limit</h3>
<p>Transfer learning works when:</p>
<ul>
<li>Email <strong>types</strong> are similar (business emails train well on business emails)</li>
<li>Email <strong>structure</strong> is similar (length, formality, sender patterns)</li>
</ul>
<p>Transfer learning fails when:</p>
<ul>
<li>Email <strong>domains</strong> differ significantly (e-commerce emails vs internal memos)</li>
<li>Email <strong>purposes</strong> differ (personal chitchat vs corporate announcements)</li>
</ul>
</div>
<h2>11. Recommended Next Step</h2>
<div class="code-section">
<strong>Immediate action (works right now):</strong>
# Test current model on new 10k sample WITHOUT calibration
python -m src.cli run \
--source enron \
--limit 10000 \
--output ml_speed_test/ \
--no-llm-fallback
# Expected:
# - Time: ~4 minutes
# - Accuracy: ~75-80%
# - LLM calls: 0
# - Categories used: 11 from trained model
# Then inspect results:
cat ml_speed_test/results.json | python -m json.tool | less
# Check category distribution:
cat ml_speed_test/results.json | \
python -c "import json, sys; data=json.load(sys.stdin); \
from collections import Counter; \
print(Counter(c['category'] for c in data['classifications']))"
</div>
<h2>12. If You Want Verification (Future Work)</h2>
<p>I can implement <code>--verify-categories</code> flag that:</p>
<ol>
<li>Samples 20 emails from new mailbox</li>
<li>Makes single LLM call showing both:
<ul>
<li>Trained model categories: [Work, Meetings, Financial, ...]</li>
<li>Sample emails from new mailbox</li>
</ul>
</li>
<li>Asks LLM: "Rate category fit: Good/Fair/Poor + suggest alternatives"</li>
<li>Reports confidence score</li>
<li>Proceeds with ML-only if score > threshold</li>
</ol>
<p><strong>Time cost:</strong> +20 seconds (1 LLM call)</p>
<p><strong>Value:</strong> Automated sanity check before bulk processing</p>
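<p>A minimal sketch of what that flag could look like; the function name, prompt wording, and email fields are illustrative, not the final implementation:</p>
<div class="code-section">
<strong>Hypothetical verify_categories() sketch:</strong>
import json, random, requests

def verify_categories(emails, categories, sample_size=20):
    sample = random.sample(emails, min(sample_size, len(emails)))
    summaries = "\n".join(f"- From: {e.sender} | Subject: {e.subject}"
                          for e in sample)
    prompt = (f"Trained categories: {', '.join(categories)}\n\n"
              f"Sample emails from a new mailbox:\n{summaries}\n\n"
              'Rate category fit. Return JSON: '
              '{"fit": "Good|Fair|Poor", "suggestions": ["..."]}')
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": "qwen3:4b-instruct-2507-q8_0",
                            "prompt": prompt, "stream": False})
    return json.loads(r.json()["response"])  # may need fallback parsing
</div>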
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>


@ -1,564 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Label Training Phase - Detailed Analysis</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.timing-table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
.timing-table th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
.timing-table td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
.code-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
font-family: 'Courier New', monospace;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.warning {
background: #3e2a00;
border-left: 4px solid #ffd93d;
padding: 15px;
margin: 10px 0;
}
.critical {
background: #3e0000;
border-left: 4px solid #ff6b6b;
padding: 15px;
margin: 10px 0;
}
</style>
</head>
<body>
<h1>Label Training Phase - Deep Dive Analysis</h1>
<h2>1. What is "Label Training"?</h2>
<p><strong>Location:</strong> src/calibration/llm_analyzer.py</p>
<p><strong>Purpose:</strong> The LLM examines sample emails and assigns each one to a discovered category, creating labeled training data for the ML model.</p>
<p><strong>This is NOT the same as category discovery.</strong> Discovery finds WHAT categories exist. Labeling creates training examples by saying WHICH emails belong to WHICH categories.</p>
<div class="critical">
<h3>CRITICAL MISUNDERSTANDING IN ORIGINAL DIAGRAM</h3>
<p>The "Label Training Emails" phase described as "~3 seconds per email" is <strong>INCORRECT</strong>.</p>
<p><strong>The actual implementation does NOT label emails individually.</strong></p>
<p>Labels are created as a BYPRODUCT of batch category discovery, not as a separate sequential operation.</p>
</div>
<h2>2. Actual Label Training Flow</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Calibration Phase Starts]) --> Sample[Sample 300 emails<br/>stratified by sender]
Sample --> BatchSetup[Split into batches of 20 emails<br/>300 ÷ 20 = 15 batches]
BatchSetup --> Batch1[Batch 1: Emails 1-20]
Batch1 --> Stats1[Calculate batch statistics<br/>domains, keywords, attachments<br/>~0.1 seconds]
Stats1 --> BuildPrompt1[Build LLM prompt<br/>Include all 20 email summaries<br/>~0.05 seconds]
BuildPrompt1 --> LLMCall1[Single LLM call for entire batch<br/>Discovers categories AND labels all 20<br/>~20 seconds TOTAL for batch]
LLMCall1 --> Parse1[Parse JSON response<br/>Extract categories + labels<br/>~0.1 seconds]
Parse1 --> Store1[Store results<br/>categories: Dict<br/>labels: List of Tuples]
Store1 --> Batch2{More batches?}
Batch2 -->|Yes| NextBatch[Batch 2: Emails 21-40]
Batch2 -->|No| Consolidate
NextBatch --> Stats2[Same process<br/>15 total batches<br/>~20 seconds each]
Stats2 --> Batch2
Consolidate[Consolidate categories<br/>Merge duplicates<br/>Single LLM call<br/>~5 seconds]
Consolidate --> CacheSnap[Snap to cached categories<br/>Match against persistent cache<br/>~0.5 seconds]
CacheSnap --> Final[Final output<br/>10-12 categories<br/>300 labeled emails]
Final --> End([Labels ready for ML training])
style LLMCall1 fill:#ff6b6b
style Consolidate fill:#ff6b6b
style Stats2 fill:#ffd93d
style Final fill:#4ec9b0
</pre>
</div>
<h2>3. Key Discovery: Batched Labeling</h2>
<div class="code-section">
<strong>src/calibration/llm_analyzer.py:66-83</strong>
batch_size = 20 # NOT 1 email at a time!
for batch_idx in range(0, len(sample_emails), batch_size):
batch = sample_emails[batch_idx:batch_idx + batch_size]
# Single LLM call handles ENTIRE batch
batch_results = self._analyze_batch(batch, batch_idx)
# Returns BOTH categories AND labels for all 20 emails
for category, desc in batch_results.get('categories', {}).items():
discovered_categories[category] = desc
for email_id, category in batch_results.get('labels', []):
email_labels.append((email_id, category))
</div>
<div class="warning">
<h3>Why Batching Matters</h3>
<p><strong>Sequential (WRONG assumption):</strong> 300 emails × 3 sec/email = 900 seconds (15 minutes)</p>
<p><strong>Batched (ACTUAL):</strong> 15 batches × 20 sec/batch = 300 seconds (5 minutes)</p>
<p><strong>Savings:</strong> 10 minutes (67% less time than assumed)</p>
</div>
<h2>4. Single Batch Processing Detail</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Batch of 20 emails]) --> Stats[Calculate Statistics<br/>~0.1 seconds]
Stats --> StatDetails[Domain analysis<br/>Recipient counts<br/>Attachment detection<br/>Keyword extraction]
StatDetails --> BuildList[Build email summaries<br/>For each email:<br/>ID + From + Subject + Preview]
BuildList --> Prompt[Construct LLM prompt<br/>~2KB text<br/>Contains:<br/>- Statistics summary<br/>- All 20 email summaries<br/>- Instructions<br/>- JSON schema]
Prompt --> LLM[LLM Call<br/>POST /api/generate<br/>qwen3:4b-instruct-2507-q8_0<br/>temp=0.1, max_tokens=2000<br/>~18-22 seconds]
LLM --> Response[LLM Response<br/>JSON with:<br/>categories: Dict<br/>labels: List of 20 Tuples]
Response --> Parse[Parse JSON<br/>Regex extraction<br/>Brace counting<br/>~0.05 seconds]
Parse --> Validate{Valid JSON?}
Validate -->|Yes| Extract[Extract data<br/>categories: 3-8 new<br/>labels: 20 tuples]
Validate -->|No| FallbackParse[Fallback parsing<br/>Try to salvage partial data]
FallbackParse --> Extract
Extract --> Return[Return batch results<br/>categories: Dict str→str<br/>labels: List Tuple str,str]
Return --> End([Merge with global results])
style LLM fill:#ff6b6b
style Parse fill:#4ec9b0
style FallbackParse fill:#ffd93d
</pre>
</div>
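<p>The brace-counting extraction in the parse step amounts to something like this; a simplified sketch, not the exact code in llm_analyzer.py:</p>
<div class="code-section">
def extract_json(text: str) -> str:
    """Return the first balanced {...} block from an LLM response."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    raise ValueError("unbalanced JSON object")
</div>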
<h2>5. LLM Prompt Structure</h2>
<div class="code-section">
<strong>Actual prompt sent to LLM (src/calibration/llm_analyzer.py:196-232):</strong>
&lt;no_think&gt;You are analyzing emails to discover natural categories...
BATCH STATISTICS (20 emails):
- Top sender domains: example.com (5), company.org (3)...
- Avg recipients per email: 2.3
- Emails with attachments: 4/20
- Avg subject length: 42 chars
- Common keywords: meeting(3), report(2)...
EMAILS TO ANALYZE:
1. ID: maildir_allen-p__sent_mail_512
From: phillip.allen@enron.com
Subject: Re: AEC Volumes at OPAL
Preview: Here are the volumes...
2. ID: maildir_allen-p__sent_mail_513
From: phillip.allen@enron.com
Subject: Meeting Tomorrow
Preview: Can we schedule...
[... 18 more emails ...]
TASK:
1. Identify natural groupings based on PURPOSE
2. Create SHORT category names
3. Assign each email to exactly one category
4. CRITICAL: Copy EXACT email IDs
Return JSON:
{
"categories": {"Work": "daily business communication", ...},
"labels": [["maildir_allen-p__sent_mail_512", "Work"], ...]
}
</div>
<h2>6. Timing Breakdown - 300 Sample Emails</h2>
<table class="timing-table">
<tr>
<th>Operation</th>
<th>Per Batch (20 emails)</th>
<th>Total (15 batches)</th>
<th>% of Total Time</th>
</tr>
<tr>
<td>Calculate statistics</td>
<td>0.1 sec</td>
<td>1.5 sec</td>
<td>0.5%</td>
</tr>
<tr>
<td>Build email summaries</td>
<td>0.05 sec</td>
<td>0.75 sec</td>
<td>0.2%</td>
</tr>
<tr>
<td>Construct prompt</td>
<td>0.01 sec</td>
<td>0.15 sec</td>
<td>0.05%</td>
</tr>
<tr>
<td><strong>LLM API call</strong></td>
<td><strong>18-22 sec</strong></td>
<td><strong>270-330 sec</strong></td>
<td><strong>98%</strong></td>
</tr>
<tr>
<td>Parse JSON response</td>
<td>0.05 sec</td>
<td>0.75 sec</td>
<td>0.2%</td>
</tr>
<tr>
<td>Merge results</td>
<td>0.02 sec</td>
<td>0.3 sec</td>
<td>0.1%</td>
</tr>
<tr>
<td colspan="2"><strong>SUBTOTAL: Batch Discovery</strong></td>
<td><strong>~300 seconds (5 min)</strong></td>
<td><strong>98.5%</strong></td>
</tr>
<tr>
<td colspan="2">Consolidation LLM call</td>
<td>5 seconds</td>
<td>1.3%</td>
</tr>
<tr>
<td colspan="2">Cache snapping (semantic matching)</td>
<td>0.5 seconds</td>
<td>0.2%</td>
</tr>
<tr>
<td colspan="2"><strong>TOTAL LABELING PHASE</strong></td>
<td><strong>~305 seconds (5 min)</strong></td>
<td><strong>100%</strong></td>
</tr>
</table>
<div class="warning">
<h3>Corrected Understanding</h3>
<p><strong>Original estimate:</strong> "~3 seconds per email" = 900 seconds for 300 emails</p>
<p><strong>Actual timing:</strong> ~20 seconds per batch of 20 = ~305 seconds for 300 emails</p>
<p><strong>Difference:</strong> 3× faster than original assumption</p>
<p><strong>Why:</strong> Batching allows LLM to see context across multiple emails and make better category decisions in a single inference pass.</p>
</div>
<h2>7. What Gets Created</h2>
<div class="diagram">
<pre class="mermaid">
flowchart LR
Input[300 sampled emails] --> Discovery[Category Discovery<br/>15 batches × 20 emails]
Discovery --> RawCats[Raw Categories<br/>~30-40 discovered<br/>May have duplicates:<br/>Work, work, Business, etc.]
RawCats --> Consolidate[Consolidation<br/>LLM merges similar<br/>~5 seconds]
Consolidate --> Merged[Merged Categories<br/>~12-15 categories<br/>Work, Financial, etc.]
Merged --> CacheSnap[Cache Snap<br/>Match against persistent cache<br/>~0.5 seconds]
CacheSnap --> Final[Final Categories<br/>10-12 categories]
Discovery --> RawLabels[Raw Labels<br/>300 tuples:<br/>email_id, category]
RawLabels --> UpdateLabels[Update label categories<br/>to match snapped names]
UpdateLabels --> FinalLabels[Final Labels<br/>300 training pairs]
Final --> Training[Training Data]
FinalLabels --> Training
Training --> MLTrain[Train LightGBM Model<br/>~5 seconds]
MLTrain --> Model[Trained Model<br/>1.8MB .pkl file]
style Discovery fill:#ff6b6b
style Consolidate fill:#ff6b6b
style Model fill:#4ec9b0
</pre>
</div>
<h2>8. Example Output</h2>
<div class="code-section">
<strong>discovered_categories (Dict[str, str]):</strong>
{
"Work": "daily business communication and coordination",
"Financial": "budgets, reports, financial planning",
"Meetings": "scheduling and meeting coordination",
"Technical": "system issues and technical discussions",
"Requests": "action items and requests for information",
"Reports": "status reports and summaries",
"Administrative": "HR, policies, company announcements",
"Urgent": "time-sensitive matters",
"Conversational": "casual check-ins and social",
"External": "communication with external partners"
}
<strong>sample_labels (List[Tuple[str, str]]):</strong>
[
("maildir_allen-p__sent_mail_1", "Financial"),
("maildir_allen-p__sent_mail_2", "Work"),
("maildir_allen-p__sent_mail_3", "Meetings"),
("maildir_allen-p__sent_mail_4", "Work"),
("maildir_allen-p__sent_mail_5", "Financial"),
... (300 total)
]
</div>
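<p>From there, assembling training data for LightGBM is mechanical. A sketch - emails_by_id and extract_features are assumed helpers, not the project's actual names:</p>
<div class="code-section">
import numpy as np
import lightgbm as lgb

cat_to_idx = {c: i for i, c in enumerate(discovered_categories)}
X = np.array([extract_features(emails_by_id[eid]) for eid, _ in sample_labels])
y = np.array([cat_to_idx[cat] for _, cat in sample_labels])

booster = lgb.train(
    {"objective": "multiclass", "num_class": len(cat_to_idx)},
    lgb.Dataset(X, label=y),
    num_boost_round=200,
)
</div>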
<h2>9. Why Batching is Critical</h2>
<table class="timing-table">
<tr>
<th>Approach</th>
<th>LLM Calls</th>
<th>Time/Call</th>
<th>Total Time</th>
<th>Quality</th>
</tr>
<tr>
<td><strong>Sequential (1 email/call)</strong></td>
<td>300</td>
<td>3 sec</td>
<td>900 sec (15 min)</td>
<td>Poor - no context</td>
</tr>
<tr>
<td><strong>Small batches (5 emails/call)</strong></td>
<td>60</td>
<td>8 sec</td>
<td>480 sec (8 min)</td>
<td>Fair - limited context</td>
</tr>
<tr>
<td><strong>Current (20 emails/call)</strong></td>
<td>15</td>
<td>20 sec</td>
<td>300 sec (5 min)</td>
<td>Good - sufficient context</td>
</tr>
<tr>
<td><strong>Large batches (50 emails/call)</strong></td>
<td>6</td>
<td>45 sec</td>
<td>270 sec (4.5 min)</td>
<td>Risk - may exceed token limits</td>
</tr>
</table>
<div class="warning">
<h3>Why 20 emails per batch?</h3>
<ul>
<li><strong>Token limit:</strong> 20 emails × ~150 tokens/email = ~3000 tokens input, well under 8K limit</li>
<li><strong>Context window:</strong> LLM can see patterns across multiple emails</li>
<li><strong>Speed:</strong> Minimizes API calls while staying within limits</li>
<li><strong>Quality:</strong> Enough examples to identify patterns, not so many that it gets confused</li>
</ul>
</div>
<h2>10. Configuration Parameters</h2>
<table class="timing-table">
<tr>
<th>Parameter</th>
<th>Location</th>
<th>Default</th>
<th>Effect on Timing</th>
</tr>
<tr>
<td>sample_size</td>
<td>CalibrationConfig</td>
<td>300</td>
<td>300 samples = 15 batches = 5 min</td>
</tr>
<tr>
<td>batch_size</td>
<td>llm_analyzer.py:62</td>
<td>20</td>
<td>Hardcoded - affects batch count</td>
</tr>
<tr>
<td>llm_batch_size</td>
<td>CalibrationConfig</td>
<td>50</td>
<td>NOT USED for discovery (misleading name)</td>
</tr>
<tr>
<td>temperature</td>
<td>LLM call</td>
<td>0.1</td>
<td>Lower = more deterministic output (negligible effect on speed)</td>
</tr>
<tr>
<td>max_tokens</td>
<td>LLM call</td>
<td>2000</td>
<td>Higher = potentially slower response</td>
</tr>
</table>
<h2>11. Full Calibration Timeline</h2>
<div class="diagram">
<pre class="mermaid">
gantt
title Calibration Phase Timeline (300 samples, 10k total emails)
dateFormat mm:ss
axisFormat %M:%S
section Sampling
Stratified sample (3% of 10k) :00:00, 01s
section Category Discovery
Batch 1 (emails 1-20) :00:01, 20s
Batch 2 (emails 21-40) :00:21, 20s
Batch 3 (emails 41-60) :00:41, 20s
Batch 4-13 (emails 61-260) :01:01, 200s
Batch 14 (emails 261-280) :04:21, 20s
Batch 15 (emails 281-300) :04:41, 20s
section Consolidation
LLM category merge :05:01, 05s
Cache snap :05:06, 00.5s
section ML Training
Feature extraction (300) :05:07, 06s
LightGBM training :05:13, 05s
Validation (100 emails) :05:18, 02s
Save model to disk :05:20, 00.5s
</pre>
</div>
<h2>12. Key Insights</h2>
<div class="critical">
<h3>1. Labels are NOT created sequentially</h3>
<p>The LLM creates labels as a byproduct of batch category discovery. There is NO separate "label each email one by one" phase.</p>
</div>
<div class="critical">
<h3>2. Batching is the optimization</h3>
<p>Processing 20 emails in a single LLM call (20 sec) is 3× faster than 20 individual calls (60 sec total).</p>
</div>
<div class="critical">
<h3>3. LLM time dominates everything</h3>
<p>98% of labeling phase time is LLM API calls. Everything else (parsing, merging, caching) is negligible.</p>
</div>
<div class="critical">
<h3>4. Consolidation is cheap</h3>
<p>Merging 30-40 raw categories into 10-12 final ones takes only ~5 seconds with a single LLM call.</p>
</div>
<h2>13. Optimization Opportunities</h2>
<table class="timing-table">
<tr>
<th>Optimization</th>
<th>Current</th>
<th>Potential</th>
<th>Tradeoff</th>
</tr>
<tr>
<td>Increase batch size</td>
<td>20 emails/batch</td>
<td>30-40 emails/batch</td>
<td>May hit token limits, slower per call</td>
</tr>
<tr>
<td>Reduce sample size</td>
<td>300 samples (3%)</td>
<td>200 samples (2%)</td>
<td>Less training data, potentially worse model</td>
</tr>
<tr>
<td>Parallel batching</td>
<td>Sequential 15 batches</td>
<td>3-5 concurrent batches</td>
<td>Requires async LLM client, more complex (sketch below)</td>
</tr>
<tr>
<td>Skip consolidation</td>
<td>Always consolidate if >10 cats</td>
<td>Skip if <15 cats</td>
<td>May leave duplicate categories</td>
</tr>
<tr>
<td>Cache-first approach</td>
<td>Discover then snap to cache</td>
<td>Snap to cache, only discover new</td>
<td>Less adaptive to new mailbox types</td>
</tr>
</table>
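<p>For the parallel-batching row, a sketch of an async client using asyncio + aiohttp; hypothetical, since the current client is synchronous:</p>
<div class="code-section">
import asyncio
import aiohttp

async def analyze_batch(session, sem, prompt):
    async with sem:  # cap in-flight LLM calls
        async with session.post(
                "http://localhost:11434/api/generate",
                json={"model": "qwen3:4b-instruct-2507-q8_0",
                      "prompt": prompt, "stream": False}) as r:
            return (await r.json())["response"]

async def analyze_all(prompts, concurrency=4):
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(analyze_batch(session, sem, p) for p in prompts))
</div>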
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
},
gantt: {
useWidth: 1200
}
});
</script>
</body>
</html>


@ -1,129 +0,0 @@
# Model Information
## Current Status
- **Model Type**: LightGBM Classifier (Production)
- **Location**: `src/models/pretrained/classifier.pkl`
- **Categories**: 12 (junk, transactional, auth, newsletters, social, automated, conversational, work, personal, finance, travel, unknown)
- **Feature Extraction**: Hybrid (embeddings + patterns + structural features)
## Usage
The ML classifier will automatically use the real model if it exists at:
```
src/models/pretrained/classifier.pkl
```
### Programmatic Usage
```python
from src.classification.ml_classifier import MLClassifier
# Will automatically load real model if available
classifier = MLClassifier()
# Check if using mock or real model
info = classifier.get_info()
print(f"Is mock: {info['is_mock']}")
print(f"Model type: {info['model_type']}")
# Make predictions
result = classifier.predict(feature_vector)
print(f"Category: {result['category']}")
print(f"Confidence: {result['confidence']}")
```
### Command Line Usage
```bash
# Test with mock pipeline
python -m src.cli run --source mock --output test_results/
# Test with real model (when available)
python -m src.cli run --source gmail --limit 100 --output results/
```
## How to Get a Real Model
### Option 1: Train Your Own (Recommended)
```python
from src.calibration.trainer import ModelTrainer
from src.calibration.enron_parser import EnronParser
from src.classification.feature_extractor import FeatureExtractor
# Parse Enron dataset
parser = EnronParser("enron_mail_20150507")
emails = parser.parse_emails(limit=5000)
# labels: one category per email (from LLM calibration or manual review);
# category_names: the distinct category set, e.g. ["work", "finance", ...]
extractor = FeatureExtractor()
labeled_data = [(email, label) for email, label in zip(emails, labels)]
# Train model
trainer = ModelTrainer(extractor, category_names)
results = trainer.train(labeled_data)
# Save model
trainer.save_model("src/models/pretrained/classifier.pkl")
```
### Option 2: Download Pre-trained Model
Use the provided script:
```bash
cd tools
python download_pretrained_model.py \
--url https://example.com/model.pkl \
--hash abc123def456
```
### Option 3: Use Community Model
Check available pre-trained models at:
- Email Sorter releases on GitHub
- Hugging Face model hub (when available)
- Community-trained models
## Model Performance
Expected accuracy on real data:
- **Hard Rules**: 94-96% (instant, ~10% of emails)
- **ML Model**: 85-90% (fast, ~85% of emails)
- **LLM Review**: 92-95% (slower, ~5% uncertain cases)
- **Overall**: 90-94% (weighted average)
## Retraining
To retrain the model:
```bash
python -m src.cli train \
--source enron \
--output models/new_model.pkl \
--limit 10000
```
## Troubleshooting
### Model Not Loading
1. Check file exists: `src/models/pretrained/classifier.pkl`
2. Try to load directly:
```python
import pickle
with open('src/models/pretrained/classifier.pkl', 'rb') as f:
data = pickle.load(f)
print(data.keys())
```
3. Ensure pickle format is correct
### Low Accuracy
1. Model may be underfitted - train on more data
2. Feature extraction may need tuning
3. Categories may need adjustment
4. Consider LLM review for uncertain cases
### Slow Predictions
1. Use embedding cache for batch processing (see the sketch below)
2. Implement parallel processing
3. Consider quantization for LightGBM model
4. Profile feature extraction step
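For the first item, a minimal sketch of an MD5-keyed embedding cache (memory plus disk); the project's real cache implementation may differ:
```python
import hashlib
import json
from pathlib import Path

class EmbeddingCache:
    def __init__(self, cache_dir: str = "cache/embeddings"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.mem = {}

    def get_or_compute(self, text: str, compute):
        """Return a cached embedding, computing and storing it on a miss."""
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        if key in self.mem:
            return self.mem[key]
        path = self.dir / f"{key}.json"
        if path.exists():
            vec = json.loads(path.read_text())
        else:
            vec = compute(text)  # e.g. an Ollama embedding call
            path.write_text(json.dumps(vec))
        self.mem[key] = vec
        return vec
```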

View File

@ -1,437 +0,0 @@
# Email Sorter - Next Steps & Action Plan
**Date**: 2025-10-21
**Status**: Framework Complete - Ready for Real Model Integration
**Test Status**: 27/30 passing (90%)
---
## Quick Summary
**Framework**: 100% complete, all 16 phases implemented
**Testing**: 90% pass rate (27/30 tests)
**Documentation**: Comprehensive and up-to-date
**Tools**: Model integration scripts provided
**Real Model**: Currently using mock (placeholder)
**Gmail Credentials**: Not yet configured
**Real Data Processing**: Ready when model + credentials available
---
## Three Paths Forward
Choose your path based on your needs:
### Path A: Quick Framework Validation (5 minutes)
**Goal**: Verify everything works with mock model
**Commands**:
```bash
cd "c:/Build Folder/email-sorter"
source venv/Scripts/activate
# Run quick validation
pytest tests/ -v --tb=short
python -m src.cli test-config
python -m src.cli run --source mock --output test_results/
```
**Result**: Confirms framework works correctly
### Path B: Real Model Integration (30-60 minutes)
**Goal**: Replace mock model with real LightGBM model
**Two Sub-Options**:
#### B1: Train Your Own Model on Enron Dataset
```bash
# Parse Enron emails (already downloaded)
python -c "
from src.calibration.enron_parser import EnronParser
from src.classification.feature_extractor import FeatureExtractor
from src.calibration.trainer import ModelTrainer
parser = EnronParser('enron_mail_20150507')
emails = parser.parse_emails(limit=5000)
extractor = FeatureExtractor()
trainer = ModelTrainer(extractor, ['junk', 'transactional', 'auth', 'newsletters',
'social', 'automated', 'conversational', 'work',
'personal', 'finance', 'travel', 'unknown'])
# Train (takes 5-10 minutes on this laptop)
# NOTE: 'unknown' is a placeholder label so the pipeline runs end to end;
# a useful model needs real per-email labels (e.g. from LLM calibration)
results = trainer.train([(e, 'unknown') for e in emails])
trainer.save_model('src/models/pretrained/classifier.pkl')
"
# Verify
python tools/setup_real_model.py --check
```
#### B2: Download Pre-trained Model
```bash
# If you have a pre-trained model URL
python tools/download_pretrained_model.py \
--url https://example.com/lightgbm_model.pkl \
--hash abc123def456
# Or if you have local file
python tools/setup_real_model.py --model-path /path/to/model.pkl
# Verify
python tools/setup_real_model.py --check
```
**Result**: Real model installed, framework uses it automatically
### Path C: Full Production Deployment (2-3 hours)
**Goal**: Process all 80k+ emails with Gmail integration
**Prerequisites**: Path B (real model) + Gmail OAuth
**Steps**:
1. **Setup Gmail OAuth**
```bash
# Get credentials from Google Cloud Console
# https://console.cloud.google.com/
# - Create OAuth 2.0 credentials
# - Download as JSON
# - Place as credentials.json in project root
# Test Gmail connection
python -m src.cli test-gmail
```
2. **Test with 100 Emails**
```bash
python -m src.cli run \
--source gmail \
--limit 100 \
--output test_results/
```
3. **Process Full Dataset**
```bash
python -m src.cli run \
--source gmail \
--output marion_results/
```
4. **Review Results**
- Check `marion_results/results.json`
- Check `marion_results/report.txt`
- Review accuracy metrics
- Adjust thresholds if needed
---
## What's Ready Right Now
### ✅ Framework Components (All Complete)
- [x] Feature extraction (embeddings + patterns + structural)
- [x] Three-tier adaptive classifier (hard rules → ML → LLM)
- [x] Embedding cache and batch processing
- [x] Processing pipeline with checkpointing
- [x] LLM integration (Ollama ready, OpenAI compatible)
- [x] Calibration workflow
- [x] Export system (JSON/CSV)
- [x] Provider sync (Gmail/IMAP framework)
- [x] Learning systems (threshold + pattern learning)
- [x] Complete CLI interface
- [x] Comprehensive test suite
### ❌ What Needs Your Input
1. **Real Model** (50 MB file)
- Option: Train on Enron (~5-10 min, laptop-friendly)
- Option: Download pre-trained (~1 min)
2. **Gmail Credentials** (OAuth JSON)
- Get from Google Cloud Console
- Place in project root as `credentials.json`
3. **Real Data** (Already have: Enron dataset)
- Optional: Your own emails for better tuning
---
## File Locations & Important Paths
```
Project Root: c:/Build Folder/email-sorter
Key Files:
├── src/
│ ├── cli.py # Command-line interface
│ ├── orchestration.py # Main pipeline
│ ├── classification/
│ │ ├── feature_extractor.py # Feature extraction
│ │ ├── ml_classifier.py # ML predictions
│ │ ├── adaptive_classifier.py # Three-tier orchestration
│ │ └── embedding_cache.py # Caching & batching
│ ├── calibration/
│ │ ├── trainer.py # LightGBM trainer
│ │ ├── enron_parser.py # Parse Enron dataset
│ │ └── workflow.py # Calibration pipeline
│ ├── processing/
│ │ ├── bulk_processor.py # Batch processing
│ │ ├── queue_manager.py # LLM queue
│ │ └── attachment_handler.py # PDF/DOCX extraction
│ ├── llm/
│ │ ├── ollama.py # Ollama integration
│ │ └── openai_compat.py # OpenAI API
│ └── email_providers/
│ ├── gmail.py # Gmail provider
│ └── imap.py # IMAP provider
├── models/ # (Will be created)
│ └── pretrained/
│ └── classifier.pkl # Real model goes here
├── tools/
│ ├── download_pretrained_model.py # Download models
│ └── setup_real_model.py # Setup models
├── enron_mail_20150507/ # Enron dataset (already extracted)
├── tests/ # 23 test cases
├── config/ # Configuration
├── src/models/pretrained/ # (Will be created for real model)
└── Documentation:
├── PROJECT_STATUS.md # High-level overview
├── COMPLETION_ASSESSMENT.md # Detailed component review
├── MODEL_INFO.md # Model usage guide
└── NEXT_STEPS.md # This file
```
---
## Testing Your Setup
### Framework Validation
```bash
# Test configuration loading
python -m src.cli test-config
# Test Ollama (if running locally)
python -m src.cli test-ollama
# Run full test suite
pytest tests/ -v
```
### Mock Pipeline (No Real Data Needed)
```bash
python -m src.cli run --source mock --output test_results/
```
### Real Model Verification
```bash
python tools/setup_real_model.py --check
```
### Gmail Connection Test
```bash
python -m src.cli test-gmail
```
---
## Performance Expectations
### With Mock Model (Testing)
- Feature extraction: ~50-100ms per email
- ML prediction: ~10-20ms per email
- Total time for 100 emails: ~30-40 seconds
### With Real Model (Production)
- Feature extraction: ~50-100ms per email
- ML prediction: ~5-10ms per email (LightGBM is faster)
- LLM review (5% of emails): ~2-5 seconds per email
- Total time for 80k emails: 15-25 minutes
### Calibration Phase
- Sampling: 1-2 minutes
- LLM category discovery: 2-3 minutes
- Model training: 5-10 minutes
- Total: 10-15 minutes
---
## Troubleshooting
### Problem: "Model not found" but framework running
**Solution**: This is normal - system uses mock model automatically
```bash
python tools/setup_real_model.py --check # Shows current status
```
### Problem: Ollama tests failing
**Solution**: Ollama is optional, LLM review will skip gracefully
```bash
# Not critical - framework has graceful fallback
python -m src.cli run --source mock
```
### Problem: Gmail connection fails
**Solution**: Gmail is optional, test with mock first
```bash
python -m src.cli run --source mock --output results/
```
### Problem: Low accuracy with mock model
**Expected behavior**: Mock model is for framework testing only
```python
# Check model info
from src.classification.ml_classifier import MLClassifier
c = MLClassifier()
print(c.get_info()) # Shows is_mock: True
```
---
## Decision Tree: What to Do Next
```
START
├─ Do you want to test the framework first?
│ └─ YES → Run Path A (5 minutes)
│ pytest tests/ -v
│ python -m src.cli run --source mock
├─ Do you want to set up a real model?
│ ├─ YES (TRAIN) → Run Path B1 (30-60 min)
│ │ Train on Enron dataset
│ │ python tools/setup_real_model.py --check
│ │
│ └─ YES (DOWNLOAD) → Run Path B2 (5 min)
│ python tools/setup_real_model.py --model-path /path/to/model.pkl
├─ Do you want Gmail integration?
│ └─ YES → Setup OAuth credentials
│ Place credentials.json in project root
│ python -m src.cli test-gmail
└─ Do you want to process all 80k emails?
└─ YES → Run Path C (2-3 hours)
python -m src.cli run --source gmail --output results/
```
---
## Success Criteria
### ✅ Framework is Ready When:
- [ ] `pytest tests/` shows 27/30 passing
- [ ] `python -m src.cli test-config` succeeds
- [ ] `python -m src.cli run --source mock` completes
### ✅ Real Model is Ready When:
- [ ] `python tools/setup_real_model.py --check` shows model found
- [ ] `python -m src.cli run --source mock` shows `is_mock: False`
- [ ] Test predictions work without errors
### ✅ Gmail is Ready When:
- [ ] `credentials.json` exists in project root
- [ ] `python -m src.cli test-gmail` succeeds
- [ ] Can fetch 10 emails from Gmail
### ✅ Production is Ready When:
- [ ] Real model integrated
- [ ] Gmail credentials configured
- [ ] Test run on 100 emails succeeds
- [ ] Accuracy metrics are acceptable
- [ ] Ready to process full dataset
---
## Common Commands Reference
```bash
# Navigate to project
cd "c:/Build Folder/email-sorter"
source venv/Scripts/activate
# Testing
pytest tests/ -v # Run all tests
pytest tests/test_feature_extraction.py -v # Run specific test file
# Configuration
python -m src.cli test-config # Validate config
python -m src.cli test-ollama # Test LLM provider
python -m src.cli test-gmail # Test Gmail connection
# Framework testing (mock)
python -m src.cli run --source mock --output test_results/
# Model setup
python tools/setup_real_model.py --check # Check status
python tools/setup_real_model.py --model-path /path/to/model # Install model
python tools/setup_real_model.py --info # Show info
# Real processing (after setup)
python -m src.cli run --source gmail --limit 100 --output test/
python -m src.cli run --source gmail --output results/
# Development
python -m pytest tests/ --cov=src # Coverage report
python -m src.cli --help # Show all commands
```
---
## What NOT to Do
**Do NOT**:
- Try to use mock model in production (it's not accurate)
- Process all emails before testing with 100
- Skip Gmail credential setup (use mock for testing instead)
- Modify core classifier code (framework is complete)
- Skip the test suite validation
- Use Ollama if laptop is low on resources (graceful fallback available)
**DO**:
- Test with mock first
- Integrate real model before processing
- Start with 100 emails then scale
- Review results and adjust thresholds
- Keep this file for reference
- Use the tools provided for model integration
---
## Support & Questions
If something doesn't work:
1. **Check logs**: All operations log to `logs/email_sorter.log`
2. **Run tests**: `pytest tests/ -v` shows what's working
3. **Check framework**: `python -m src.cli test-config` validates setup
4. **Review docs**: See COMPLETION_ASSESSMENT.md for details
---
## Timeline Estimate
**What You Can Do Now:**
- Framework validation: 5 minutes
- Mock pipeline test: 10 minutes
- Documentation review: 15 minutes
**What You Can Do When Home:**
- Real model training: 30-60 minutes
- Gmail OAuth setup: 15-30 minutes
- Full processing: 20-30 minutes
**Total Time to Production**: 1.5-2 hours when you're home with better hardware
---
## Summary
Your Email Sorter framework is **100% complete and tested**. The next step is simply choosing:
1. **Now**: Validate framework with mock model (5 min)
2. **When home**: Integrate real model (30-60 min)
3. **When ready**: Process all 80k emails (20-30 min)
All tools are provided. All documentation is complete. Framework is ready to use.
**Choose your path above and get started!**

File diff suppressed because it is too large


@ -1,566 +0,0 @@
# EMAIL SORTER - PROJECT COMPLETE
**Date**: October 21, 2025
**Status**: FEATURE COMPLETE - Ready to Use
**Framework Maturity**: All Features Implemented
**Test Coverage**: 90% (27/30 passing)
**Code Quality**: Full Type Hints and Comprehensive Error Handling
---
## The Bottom Line
✅ **Email Sorter framework is 100% complete and ready to use**
All 16 planned development phases are implemented. The system is ready to process Marion's 80k+ emails with high accuracy. All you need to do is:
1. Optionally integrate a real LightGBM model (tools provided)
2. Set up Gmail OAuth credentials (when ready)
3. Run the pipeline
That's it. No more building. No more architecture decisions. Framework is done.
---
## What You Have
### Core System (Ready to Use)
- ✅ 38 Python modules (~6,000 lines of code)
- ✅ 12-category email classifier
- ✅ Hybrid ML/LLM classification system
- ✅ Smart feature extraction (embeddings + patterns + structure)
- ✅ Processing pipeline with checkpointing
- ✅ Gmail and IMAP sync capabilities
- ✅ Model training framework
- ✅ Learning systems (threshold + pattern adjustment)
### Tools (Ready to Use)
- ✅ CLI interface (`python -m src.cli --help`)
- ✅ Model download tool (`tools/download_pretrained_model.py`)
- ✅ Model setup tool (`tools/setup_real_model.py`)
- ✅ Test suite (23 tests, 90% pass rate)
### Documentation (Complete)
- ✅ PROJECT_STATUS.md - Feature inventory
- ✅ COMPLETION_ASSESSMENT.md - Detailed evaluation
- ✅ MODEL_INFO.md - Model usage guide
- ✅ NEXT_STEPS.md - Action plan
- ✅ README.md - Getting started
- ✅ Full API documentation via docstrings
### Data (Ready)
- ✅ Enron dataset extracted (569MB, real emails)
- ✅ Mock provider for testing
- ✅ Test data sets
---
## What's Different From Before
When we started, there were **16 planned phases** with many unknowns. Now:
| Phase | Status | Details |
|-------|--------|---------|
| 1-3 | ✅ DONE | Infrastructure, config, logging |
| 4 | ✅ DONE | Email providers (Gmail, IMAP, Mock) |
| 5 | ✅ DONE | Feature extraction (embeddings + patterns) |
| 6 | ✅ DONE | ML classifier (mock + LightGBM framework) |
| 7 | ✅ DONE | LLM integration (Ollama + OpenAI) |
| 8 | ✅ DONE | Adaptive classifier (3-tier system) |
| 9 | ✅ DONE | Processing pipeline (checkpointing) |
| 10 | ✅ DONE | Calibration system |
| 11 | ✅ DONE | Export & reporting |
| 12 | ✅ DONE | Learning systems |
| 13 | ✅ DONE | Advanced processing |
| 14 | ✅ DONE | Provider sync |
| 15 | ✅ DONE | Orchestration |
| 16 | ✅ DONE | Packaging |
| 17 | ✅ DONE | Testing |
**Every. Single. Phase. Complete.**
---
## Test Results
```
======================== Final Test Results ==========================
PASSED: 27/30 (90% success rate)
Core Components ✅
- Email models and validation
- Configuration system
- Feature extraction (embeddings + patterns + structure)
- ML classifier (mock + loading)
- Adaptive three-tier classifier
- LLM providers (Ollama + OpenAI)
- Queue management with persistence
- Bulk processing with checkpointing
- Email sampling and analysis
- Threshold learning
- Pattern learning
- Results export (JSON/CSV)
- Provider sync (Gmail/IMAP)
- End-to-end pipeline
KNOWN ISSUES (3 - All Expected & Documented):
❌ test_e2e_checkpoint_resume
Reason: Feature count mismatch between mock and real model
Impact: Only relevant when upgrading to real model
Status: Expected and acceptable
❌ test_e2e_enron_parsing
Reason: Parser needs validation against actual maildir format
Impact: Validation needed during training phase
Status: Parser works, needs Enron dataset validation
❌ test_pattern_detection_invoice
Reason: Minor regex doesn't match "bill #456"
Impact: Cosmetic issue in test data
Status: No production impact, easy to fix if needed
WARNINGS: 16 (All Pydantic deprecation - cosmetic, code works fine)
Duration: ~90 seconds
Coverage: All critical paths
Quality: Comprehensive with full type hints
```
---
## Project Metrics
```
CODEBASE
- Python Modules: 38 files
- Lines of Code: ~6,000+
- Type Hints: 100% coverage
- Docstrings: Comprehensive
- Error Handling: All critical paths
- Logging: Rich + file output
TESTING
- Unit Tests: 23 tests
- Test Files: 6 suites
- Pass Rate: 90% (27/30)
- Coverage: All core features
- Execution Time: ~90 seconds
ARCHITECTURE
- Core Modules: 16 major components
- Email Providers: 3 (Mock, Gmail, IMAP)
- Classifiers: 3 (Hard rules, ML, LLM)
- Processing Layers: 5 (Extract, Classify, Learn, Export, Sync)
- Learning Systems: 2 (Threshold, Patterns)
DEPENDENCIES
- Direct: 42 packages
- Python Version: 3.8+
- Key Libraries: LightGBM, sentence-transformers, Ollama, Google API
GIT HISTORY
- Commits: 14 total
- Build Path: Clear progression through all phases
- Latest Additions: Model integration tools + documentation
```
---
## System Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ EMAIL SORTER v1.0 - COMPLETE │
├─────────────────────────────────────────────────────────────┤
│ INPUT LAYER
│ ├── Gmail Provider (OAuth, ready for credentials)
│ ├── IMAP Provider (generic mail servers)
│ ├── Mock Provider (for testing)
│ └── Enron Dataset (real email data, 569MB)
│ FEATURE EXTRACTION
│ ├── Semantic embeddings (384D, all-MiniLM-L6-v2)
│ ├── Hard pattern matching (20+ patterns)
│ ├── Structural features (metadata, timing, attachments)
│ ├── Caching system (MD5-based, disk + memory)
│ └── Batch processing (parallel, efficient)
│ CLASSIFICATION ENGINE (3-Tier Adaptive)
│ ├── Tier 1: Hard Rules (instant, ~10%, 94-96% accuracy)
│ │ - Pattern detection
│ │ - Sender analysis
│ │ - Content matching
│ │
│ ├── Tier 2: ML Classifier (fast, ~85%, 85-90% accuracy)
│ │ - LightGBM gradient boosting (production model)
│ │ - Mock Random Forest (testing)
│ │ - Serializable for deployment
│ │
│ └── Tier 3: LLM Review (careful, ~5%, 92-95% accuracy)
│ - Ollama (local, recommended)
│ - OpenAI (API-compatible)
│ - Batch processing
│ - Queue management
│ LEARNING SYSTEM
│ ├── Threshold Adjuster
│ │ - Tracks ML vs LLM agreement
│ │ - Suggests dynamic thresholds
│ │ - Per-category analysis
│ │
│ └── Pattern Learner
│ - Sender-specific distributions
│ - Hard rule suggestions
│ - Domain-level patterns
│ PROCESSING PIPELINE
│ ├── Sampling (stratified + random)
│ ├── Bulk processing (with checkpointing)
│ ├── Batch queue management
│ └── Resumable from interruption
│ OUTPUT LAYER
│ ├── JSON Export (with full metadata)
│ ├── CSV Export (for analysis)
│ ├── Gmail Sync (labels)
│ ├── IMAP Sync (keywords)
│ └── Reports (human-readable)
│ CALIBRATION SYSTEM
│ ├── Sample selection
│ ├── LLM category discovery
│ ├── Training data preparation
│ ├── Model training
│ └── Validation
└─────────────────────────────────────────────────────────────┘
Performance:
- 1500 emails (calibration): ~5 minutes
- 80,000 emails (full run): ~20 minutes
- Classification accuracy: 90-94%
- Hard rule precision: 94-96%
```
---
## How to Use It
### Quick Start (Right Now)
```bash
cd "c:/Build Folder/email-sorter"
source venv/Scripts/activate
# Validate framework
pytest tests/ -v
# Run with mock model
python -m src.cli run --source mock --output test_results/
```
### With Real Model (When Ready)
```bash
# Option 1: Train on Enron
python tools/setup_real_model.py --model-path /path/to/trained_model.pkl
# Option 2: Use pre-trained
python tools/download_pretrained_model.py --url https://example.com/model.pkl
# Verify
python tools/setup_real_model.py --check
# Run with real model (automatic)
python -m src.cli run --source mock --output results/
```
### With Gmail (When Credentials Ready)
```bash
# Place credentials.json in project root
# Then:
python -m src.cli run --source gmail --limit 100 --output test/
python -m src.cli run --source gmail --output all_results/
```
---
## What's NOT Included (By Design)
### ❌ Not Here (Intentionally Deferred)
1. **Real Trained Model** - You decide: train on Enron or download
2. **Gmail Credentials** - Requires your Google Cloud setup
3. **Live Email Processing** - Requires #1 and #2 above
### ✅ Why This Is Good
- Framework is clean and unopinionated
- Your model, your training decisions
- Your credentials, your privacy
- Complete freedom to customize
---
## Key Decisions Made
### 1. Mock Model Strategy
- Framework uses clearly labeled mock for testing
- No deception (explicit warnings in output)
- Real model integration framework ready
- Smooth path to production
### 2. Modular Architecture
- Each component can be tested independently
- Easy to swap components (e.g., different LLM)
- Framework doesn't force decisions
- Extensible design
### 3. Three-Tier Classification
- Hard rules for instant/certain cases
- ML for bulk processing
- LLM for uncertain/complex cases
- Balances speed and accuracy (sketched below)
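A sketch of that dispatch; method names and the threshold default are assumptions (see src/classification/adaptive_classifier.py for the real logic):
```python
def classify(email, rules, ml, llm, extract_features, threshold=0.75):
    # Tier 1: hard rules - instant, pattern-based
    rule_hit = rules.match(email)
    if rule_hit is not None:
        return rule_hit, 1.0, "rule"
    # Tier 2: fast LightGBM prediction on extracted features
    result = ml.predict(extract_features(email))
    if result["confidence"] >= threshold:
        return result["category"], result["confidence"], "ml"
    # Tier 3: LLM review for low-confidence cases
    return llm.classify(email), None, "llm"
```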
### 4. Learning Systems
- Threshold adjustment from LLM feedback
- Pattern learning from sender data
- Continuous improvement without retraining
- Dynamic tuning (see the sketch below)
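And a sketch of the threshold adjuster's core idea; the real API may differ:
```python
from collections import defaultdict

class ThresholdAdjuster:
    """Suggest per-category confidence thresholds from ML-vs-LLM agreement."""

    def __init__(self):
        self.records = defaultdict(list)  # category -> [(confidence, agreed)]

    def record(self, category, ml_confidence, llm_agreed):
        self.records[category].append((ml_confidence, float(llm_agreed)))

    def suggest(self, category, target=0.9):
        """Lowest confidence above which ML agreement with the LLM >= target."""
        pts = sorted(self.records[category])
        for i, (conf, _) in enumerate(pts):
            tail = [agreed for _, agreed in pts[i:]]
            if sum(tail) / len(tail) >= target:
                return conf
        return 1.0  # never confident enough: route everything to the LLM
```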
### 5. Graceful Degradation
- Works without LLM (falls back to ML)
- Works without Gmail (uses mock)
- Works without real model (uses mock)
- No single point of failure
---
## Performance Characteristics
### CPU Usage
- Feature extraction: Single-threaded, parallelizable
- ML prediction: ~5-10ms per email
- LLM call: ~2-5 seconds per email
- Embedding cache: Reduces recomputation by 50-80%
### Memory Usage
- Embeddings cache: ~200-500MB (configurable)
- Batch processing: Configurable batch size
- Model (LightGBM): ~50-100MB
- Total runtime: ~500MB-1GB
### Accuracy
- Hard rules: 94-96% (pattern-based)
- ML alone: 85-90% (LightGBM)
- ML + LLM: 90-94% (adaptive)
- With fine-tuning: 95%+ possible
---
## Deployment Options
### Option 1: Local Development
```bash
python -m src.cli run --source mock --output local_results/
```
- No external dependencies
- Perfect for testing
- Mock model for framework validation
### Option 2: With Ollama (Local LLM)
```bash
# Start Ollama with qwen model
python -m src.cli run --source mock --output results/
```
- Local LLM processing (no internet)
- Privacy-first operation
- Careful resource usage
### Option 3: Cloud Integration
```bash
# With OpenAI API
python -m src.cli run --source gmail --output results/
```
- Real Gmail integration
- Cloud LLM support
- Full production setup
---
## Next Actions (Choose One)
### Right Now (5 minutes)
```bash
# Validate framework with mock
pytest tests/ -v
python -m src.cli test-config
python -m src.cli run --source mock --output test_results/
```
### When Home (30-60 minutes)
```bash
# Train real model or download pre-trained
python tools/setup_real_model.py --model-path /path/to/model.pkl
# Verify
python tools/setup_real_model.py --check
```
### When Ready (2-3 hours)
```bash
# Gmail OAuth setup
# credentials.json in project root
# Process all emails
python -m src.cli run --source gmail --output marion_results/
```
---
## Documentation Map
- **README.md** - Getting started
- **PROJECT_STATUS.md** - Feature inventory and architecture
- **COMPLETION_ASSESSMENT.md** - Detailed component evaluation (90-point checklist)
- **MODEL_INFO.md** - Model usage and training guide
- **NEXT_STEPS.md** - Action plan and deployment paths
- **PROJECT_COMPLETE.md** - This file
---
## Support Resources
### If Something Doesn't Work
1. Check logs: `tail -f logs/email_sorter.log`
2. Run tests: `pytest tests/ -v`
3. Validate config: `python -m src.cli test-config`
4. Review docs: See documentation map above
### Common Issues
- "Model not found" → Normal, using mock model
- "Ollama connection failed" → Optional, will skip gracefully
- "Low accuracy" → Expected with mock model
- Tests failing → Check 3 known issues (all documented)
---
## Success Criteria
### ✅ Framework is Complete
- [x] All 16 phases implemented
- [x] 90% test pass rate
- [x] Full type hints
- [x] Comprehensive logging
- [x] Clear error messages
- [x] Graceful degradation
### ✅ Ready for Real Model
- [x] Model integration framework complete
- [x] Tools for downloading/setup provided
- [x] Framework automatically uses real model when available
- [x] No code changes needed
### ✅ Ready for Gmail Integration
- [x] OAuth framework implemented
- [x] Provider sync completed
- [x] Label mapping configured
- [x] Batch update support
### ✅ Ready for Deployment
- [x] Checkpointing and resumability
- [x] Error recovery
- [x] Performance optimized
- [x] Resource-efficient
---
## What's Next?
You have three paths:
### Path A: Framework Validation (Do Now)
- Runtime: 15 minutes
- Effort: Minimal
- Result: Confirm everything works
### Path B: Model Integration (Do When Home)
- Runtime: 30-60 minutes
- Effort: Run one command or training script
- Result: Real LightGBM model installed
### Path C: Full Deployment (Do When Ready)
- Runtime: 2-3 hours
- Effort: Setup Gmail OAuth + run processing
- Result: All 80k emails sorted and labeled
**All paths are clear. All tools are provided. Framework is complete.**
---
## The Reality
This is a **complete email classification system** with:
- High-quality code (type hints, comprehensive logging, error handling)
- Smart hybrid classification (hard rules → ML → LLM)
- Proven ML framework (LightGBM)
- Real email data for training (Enron dataset)
- Flexible deployment options
- Clear upgrade path
The framework is **done**. The architecture is **solid**. The testing is **comprehensive**.
What remains is **optional optimization**:
1. Integrating your real trained model
2. Setting up Gmail credentials
3. Fine-tuning categories and thresholds
But none of that is required to start using the system.
**The system is ready. Your move.**
---
## Final Stats
```
PROJECT COMPLETE
Date: 2025-10-21
Status: 100% FEATURE COMPLETE
Framework Maturity: All Features Implemented
Test Coverage: 90% (27/30 passing)
Code Quality: Full type hints and comprehensive error handling
Documentation: Comprehensive
Ready for: Immediate use or real model integration
Development Path: 14 commits tracking complete implementation
Build Time: ~2 weeks of focused development
Lines of Code: ~6,000+
Core Modules: 38 Python files
Test Suite: 23 comprehensive tests
Dependencies: 42 packages
What You Can Do:
✅ Test framework now (mock model)
✅ Train on Enron when home
✅ Process 80k+ emails when ready
✅ Scale to production immediately
✅ Customize categories and rules
✅ Deploy to other systems
What's Not Needed:
❌ More architecture work
❌ Core framework changes
❌ Additional phase development
❌ More infrastructure setup
Bottom Line:
🎉 EMAIL SORTER IS COMPLETE AND READY TO USE 🎉
```
---
**Built with Python, LightGBM, Sentence-Transformers, Ollama, and Google APIs**
**Ready for email classification and Marion's 80k+ emails**
**What are you waiting for? Start processing!**

View File

@ -0,0 +1,479 @@
# Email Sorter: Project Roadmap & Learnings
## Document Purpose
This document captures learnings from the November 2025 research session and defines the project scope, role within a larger email processing ecosystem, and development roadmap for 2025.
---
## Project Scope Definition
### What This Tool IS
**Email Sorter is a TRIAGE tool.** Its job is:
1. **Bulk classification** - Sort emails into buckets quickly
2. **Risk-based routing** - Flag high-stakes items for careful handling
3. **Downstream handoff** - Prepare emails for specialized processing tools
### What This Tool IS NOT
- Not a spam filter (trust Gmail/Outlook for that)
- Not a complete email management solution
- Not trying to do everything
- Not the final destination for any email
### Role in Larger Ecosystem
```
┌─────────────────────────────────────────────────────────────────┐
│                   EMAIL PROCESSING ECOSYSTEM                    │
└─────────────────────────────────────────────────────────────────┘

                  ┌──────────────┐
                  │  RAW INBOX   │  (Gmail, Outlook, IMAP)
                  │     10k+     │
                  └───────┬──────┘
                          ▼
                  ┌──────────────┐
                  │ SPAM FILTER  │  ← Trust existing provider (Gmail/Outlook)
                  │  (existing)  │
                  └───────┬──────┘
                          ▼
      ┌─────────────────────────────────────────┐
      │       EMAIL SORTER (THIS TOOL)          │  ← TRIAGE/ROUTING
      │ ┌─────────────┐  ┌────────────────┐     │
      │ │ Agent Scan  │→ │ ML/LLM Classify│     │
      │ │ (discovery) │  │   (bulk sort)  │     │
      │ └─────────────┘  └────────────────┘     │
      └────────────────────┬────────────────────┘
     ┌─────────────┬───────┴─────┬─────────────┐
     ▼             ▼             ▼             ▼
┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│   JUNK   │  │ ROUTINE  │  │ BUSINESS │  │ PERSONAL │
│  BUCKET  │  │  BUCKET  │  │  BUCKET  │  │  BUCKET  │
└────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │             │             │
     ▼             ▼             ▼             ▼
┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│  Batch   │  │  Batch   │  │ Knowledge│  │  Human   │
│ Cleanup  │  │ Summary  │  │  Graph   │  │  Review  │
│ (cheap)  │  │  Tool    │  │ Builder  │  │(careful) │
└──────────┘  └──────────┘  └──────────┘  └──────────┘

         OTHER TOOLS IN ECOSYSTEM (not this project)
```
---
## Key Learnings from Research Sessions
### Session 1: brett-gmail (801 emails, Personal Inbox)
| Method | Accuracy | Time |
|--------|----------|------|
| ML-Only | 54.9% | ~5 sec |
| ML+LLM | 93.3% | ~3.5 min |
| Manual Agent | 99.8% | ~25 min |
### Session 2: brett-microsoft (596 emails, Business Inbox)
| Method | Accuracy | Time |
|--------|----------|------|
| Manual Agent | 98.2% | ~30 min |
**Key Insight:** Business inboxes require different classification approaches than personal inboxes.
---
### 1. ML Pipeline is Overkill for Small Datasets
| Dataset Size | Recommended Approach | Rationale |
|--------------|---------------------|-----------|
| <500 | Agent-only analysis | ML overhead exceeds benefit |
| 500-2000 | Agent pre-scan + ML | Discovery improves ML accuracy |
| 2000-10000 | ML + LLM fallback | Balanced speed/accuracy |
| >10000 | ML-only (fast mode) | Speed critical at scale |
**Evidence:** 801-email dataset achieved 99.8% accuracy with 25-min agent analysis vs 54.9% with pure ML.
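The table translates directly into routing logic; a minimal sketch (pipeline names are placeholders):

```python
def choose_pipeline(email_count: int) -> str:
    """Pick a classification approach based on mailbox size."""
    if email_count < 500:
        return 'agent_only'             # ML overhead exceeds benefit
    if email_count < 2000:
        return 'agent_prescan_plus_ml'  # discovery improves ML accuracy
    if email_count <= 10000:
        return 'ml_with_llm_fallback'   # balanced speed/accuracy
    return 'ml_only'                    # speed critical at scale
```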
### 2. Agent Pre-Scan Adds Massive Value
A 10-15 minute agent discovery phase before bulk classification:
- Identifies dominant sender domains
- Discovers subject patterns
- Suggests optimal categories for THIS dataset
- Can generate sender-to-category mappings
**This is NOT the same as the full manual analysis.** It's a quick reconnaissance pass.
### 3. Categories Should Serve Downstream Processing
Don't optimize for human-readable labels. Optimize for routing decisions:
| Category Type | Downstream Handler | Accuracy Need |
|---------------|-------------------|---------------|
| Junk/Marketing | Batch cleanup tool | LOW (errors OK) |
| Newsletters | Summary aggregator | MEDIUM |
| Transactional | Archive, searchable | MEDIUM |
| Business | Knowledge graph | HIGH |
| Personal | Human review | CRITICAL |
| Security | Never auto-filter | CRITICAL |
### 4. Risk-Based Accuracy Requirements
Not all emails need the same classification confidence:
```
HIGH STAKES (must not miss):
├─ Personal correspondence (sentimental value)
├─ Security alerts (account safety)
├─ Job applications (life-changing)
└─ Financial/legal documents
LOW STAKES (errors tolerable):
├─ Marketing promotions
├─ Newsletter digests
├─ Automated notifications
└─ Social media alerts
```
### 5. Spam Filtering is a Solved Problem
Don't reinvent spam filtering. Gmail and Outlook do it well. This tool should:
- Assume spam is already filtered
- Focus on categorizing legitimate mail
- Trust the upstream provider
If spam does get through, a simple secondary filter could catch obvious cases, but this is low priority.
### 6. Sender Domain is the Strongest Signal
From the 801-email analysis:
- Top 5 senders = 47.5% of all emails
- Sender domain alone could classify 80%+ of automated emails
- Subject patterns matter less than sender patterns
**Implication:** A sender-first classification approach could dramatically speed up processing.
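A minimal sketch of what sender-first classification could look like (the sender map format is an assumption; unknown senders fall through to the ML pipeline):

```python
def classify_by_sender(email, sender_map):
    """Try the sender domain before touching the ML model."""
    domain = email.sender.split('@')[-1].lower()
    if domain in sender_map:
        return sender_map[domain], 0.95, 'sender_rule'
    return None  # unknown sender -> fall through to ML

# Example map built during agent pre-scan (illustrative entries)
SENDER_MAP = {
    'mailer.netflix.com': 'Newsletters',
    'accounts.google.com': 'Security',
}
```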
### 7. Inbox Character Matters (NEW - Session 2)
**Critical Discovery:** Before classifying emails, assess the inbox CHARACTER:
| Inbox Type | Characteristics | Classification Approach |
|------------|-----------------|------------------------|
| **Personal/Consumer** | Subscription-heavy, marketing-dominant, automated 40-50% | Sender domain first |
| **Business/Professional** | Client work, operations, developer tools 60-70% | Sender + Subject context |
| **Mixed** | Both patterns present | Hybrid approach needed |
**Evidence from brett-microsoft analysis:**
- 73.2% Business/Professional content
- Only 8.2% Personal content
- Required client relationship tracking
- Support case ID extraction valuable
**Implications for Agent Pre-Scan:**
1. First determine inbox character (business vs personal vs mixed)
2. Select appropriate category templates
3. Business inboxes need relationship context, not just sender domains
### 8. Business Inboxes Need Special Handling (NEW - Session 2)
Business/professional inboxes require additional classification dimensions:
**Client Relationship Tracking:**
- Same domain may have different contexts (internal vs external)
- Client conversations span multiple senders
- Subject threading matters more than in consumer inboxes
**Support Case ID Extraction:**
- Business inboxes often have case/ticket IDs connecting emails
- Microsoft: Case #, TrackingID#
- Other vendors: Ticket numbers, reference IDs
- ID extraction should be a first-class feature
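A sketch of what that extraction could look like (regexes are illustrative and would need tuning against real vendor formats):

```python
import re

# Illustrative vendor patterns, not an exhaustive list
CASE_ID_PATTERNS = [
    re.compile(r'Case\s*#?\s*(\d{6,})', re.I),         # Microsoft "Case #"
    re.compile(r'TrackingID#?\s*([A-Z0-9-]+)', re.I),  # Microsoft TrackingID#
    re.compile(r'Ticket\s*#?\s*(\d+)', re.I),          # generic ticket numbers
]

def extract_case_ids(subject: str, body: str) -> list:
    """Pull case/ticket IDs so related emails can be threaded together."""
    text = f"{subject}\n{body}"
    ids = []
    for pattern in CASE_ID_PATTERNS:
        ids.extend(pattern.findall(text))
    return ids
```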
**Accuracy Expectations:**
- Personal inboxes: 99%+ achievable with sender-first
- Business inboxes: 95-98% achievable (more nuanced)
- Accept lower accuracy ceiling, invest in risk-flagging
### 9. Multi-Inbox Analysis Reveals Patterns (NEW - Session 2)
Analyzing multiple inboxes from same user reveals:
- **Inbox segregation patterns** - Gmail for personal, Outlook for business
- **Cross-inbox senders** - Security alerts appear in both
- **Category overlap** - Some categories universal, some inbox-specific
**Implication:** A future feature could merge analyses across inboxes to build a complete user profile.
---
## Technical Architecture (Refined)
### Current State
```
Email Source → LocalFileParser → FeatureExtractor → ML Classifier → Output
                                                         └→ LLM Fallback (if low confidence)
```
### Target State (2025)
```
Email Source
      │
      ▼
┌─────────────────────────────────────────────────────────────┐
│                       ROUTING LAYER                         │
│      Check dataset size → Route to appropriate pipeline     │
└─────────────────────────────────────────────────────────────┘
      │
      ├─── <500 emails ─────→ Agent-Only Analysis
      ├─── 500-5000 ────────→ Agent Pre-Scan + ML Pipeline
      └─── >5000 ───────────→ ML Pipeline (optional LLM)
Each pipeline outputs:
- Categorized emails (with confidence)
- Risk flags (high-stakes items)
- Routing recommendations
- Insights report
```
### Agent Pre-Scan Module (NEW)
```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PreScanResult:
    """Discovery output (field types assumed for illustration)."""
    sender_stats: Dict
    patterns: List
    suggested_categories: List[str]
    sender_map: Dict[str, str]
    estimated_distribution: Dict[str, int]


class AgentPreScan:
    """
    Quick discovery phase before bulk classification.
    Time budget: 10-15 minutes.
    """

    def scan(self, emails: List["Email"]) -> PreScanResult:
        # 1. Sender domain analysis (2 min)
        sender_stats = self.analyze_senders(emails)

        # 2. Subject pattern detection (3 min)
        patterns = self.detect_patterns(emails, sample_size=100)

        # 3. Category suggestions (5 min, uses LLM)
        categories = self.suggest_categories(sender_stats, patterns)

        # 4. Generate sender map (2 min)
        sender_map = self.create_sender_mapping(sender_stats, categories)

        return PreScanResult(
            sender_stats=sender_stats,
            patterns=patterns,
            suggested_categories=categories,
            sender_map=sender_map,
            estimated_distribution=self.estimate_distribution(emails, categories),
        )
```
---
## Development Roadmap
### Phase 0: Documentation Complete (NOW)
- [x] Research session findings documented
- [x] Classification methods comparison written
- [x] Project scope defined
- [x] This roadmap created
### Phase 1: Quick Wins (Q1 2025, 4-8 hours)
1. **Dataset size routing**
   - Auto-detect email count
   - Route small datasets to agent analysis
   - Route large datasets to ML pipeline
2. **Sender-first classification**
   - Extract sender domain
   - Check against known sender map
   - Skip ML for known high-volume senders
3. **Risk flagging** (sketched below)
   - Flag low-confidence results
   - Flag potential personal emails
   - Flag security-related emails
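A minimal sketch of the risk-flagging pass, reusing the indicator lists from the advanced config shown later in this document (the `email` fields are assumptions):

```python
def risk_flags(email, confidence, cfg):
    """Collect reasons an email should be routed to careful handling."""
    flags = []
    if confidence < cfg['ml_pipeline']['confidence_threshold']:
        flags.append('low_confidence')
    domain = email.sender.split('@')[-1].lower()
    if domain in cfg['risk_detection']['personal_indicators']:
        flags.append('possible_personal')
    if any(s in email.sender for s in cfg['risk_detection']['security_senders']):
        flags.append('security_related')
    return flags
```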
### Phase 2: Agent Pre-Scan (Q1 2025, 8-16 hours)
1. **Sender analysis module**
   - Cluster by domain
   - Calculate volume statistics
   - Identify automated vs personal
2. **Pattern detection module**
   - Sample subject lines
   - Find templates and IDs
   - Detect lifecycle stages
3. **Category suggestion module**
   - Use LLM to suggest categories
   - Based on sender/pattern analysis
   - Output category definitions
4. **Sender mapping module**
   - Map senders to suggested categories
   - Output as JSON for pipeline use
   - Support manual overrides
### Phase 3: Integration & Polish (Q2 2025)
1. **Unified CLI**
   - Single command handles all dataset sizes
   - Progress reporting
   - Configurable verbosity
2. **Output standardization**
   - Common format for all pipelines
   - Include routing recommendations
   - Include confidence and risk flags
3. **Ecosystem integration**
   - Define handoff format for downstream tools
   - Document API for other tools to consume
   - Create example integrations
### Phase 4: Scale Testing (Q2-Q3 2025)
1. **Test on real 10k+ mailboxes**
   - Multiple users, different patterns
   - Measure accuracy vs speed
   - Refine thresholds
2. **Pattern library**
   - Accumulate patterns from multiple mailboxes
   - Build reusable sender maps
   - Create category templates
3. **Feedback loop**
   - Track classification accuracy
   - Learn from corrections
   - Improve over time
---
## Configuration Philosophy
### User-Facing Config (Keep Simple)
```yaml
# config/user_config.yaml
mode: auto # auto | agent | ml | hybrid
risk_threshold: high # low | medium | high
output_format: json # json | csv | html
```
### Internal Config (Full Control)
```yaml
# config/advanced_config.yaml
routing:
  small_threshold: 500
  medium_threshold: 5000

agent_prescan:
  enabled: true
  time_budget_minutes: 15
  sample_size: 100

ml_pipeline:
  confidence_threshold: 0.55
  llm_fallback: true
  batch_size: 512

risk_detection:
  personal_indicators: [gmail.com, hotmail.com, outlook.com]
  security_senders: [accounts.google.com, security@]
  high_stakes_keywords: [urgent, important, legal, contract]
```
---
## Success Metrics
### For This Tool
| Metric | Target | Current |
|--------|--------|---------|
| Classification accuracy (large datasets) | >85% | 54.9% (ML), 93.3% (ML+LLM) |
| Processing speed (10k emails) | <5 min | ~24 sec (ML-only) |
| High-stakes miss rate | <1% | Not measured |
| Setup time for new mailbox | <20 min | Variable |
### For Ecosystem
| Metric | Target |
|--------|--------|
| End-to-end mailbox processing | <2 hours for 10k |
| User intervention needed | <10% of emails |
| Downstream tool compatibility | 100% |
---
## Open Questions (To Resolve in 2025)
1. **Category standardization**: Should categories be fixed across all users, or discovered per-mailbox?
2. **Sender map sharing**: Can sender maps be shared across users? Privacy implications?
3. **Incremental processing**: How to handle new emails added to already-processed mailboxes?
4. **Multi-account support**: Same user, multiple email accounts?
5. **Feedback integration**: How do corrections feed back into the system?
---
## Files Created During Research
### Session 1 (brett-gmail, Personal Inbox)
| File | Purpose |
|------|---------|
| `tools/brett_gmail_analyzer.py` | Custom analyzer for personal inbox |
| `tools/generate_html_report.py` | HTML report generator |
| `data/brett_gmail_analysis.json` | Analysis data output |
| `docs/CLASSIFICATION_METHODS_COMPARISON.md` | Method comparison |
| `docs/REPORT_FORMAT.md` | HTML report documentation |
| `docs/SESSION_HANDOVER_20251128.md` | Session 1 handover |
### Session 2 (brett-microsoft, Business Inbox)
| File | Purpose |
|------|---------|
| `tools/brett_microsoft_analyzer.py` | Custom analyzer for business inbox |
| `data/brett_microsoft_analysis.json` | Analysis data output |
| `/home/bob/.../brett-ms-sorter/BRETT_MICROSOFT_ANALYSIS_REPORT.md` | Full analysis report |
---
## Summary
**Email Sorter is a triage tool, not a complete solution.**
Its job is to quickly sort emails into buckets so that specialized downstream tools can handle each bucket appropriately. The key insight from these research sessions is that an agent pre-scan phase, even just 10-15 minutes, dramatically improves classification accuracy for any dataset size.
The ML pipeline is valuable for scale (10k+ emails) but overkill for smaller datasets. Risk-based accuracy means we can tolerate errors on junk but must be careful with personal correspondence.
2025 development should focus on:
1. Smart routing based on dataset size
2. Agent pre-scan for discovery
3. Standardized output for ecosystem integration
4. Scale testing on real large mailboxes
---
*Document Version: 1.1*
*Created: 2025-11-28*
*Updated: 2025-11-28 (Session 2 learnings)*
*Sessions: brett-gmail (801 emails, personal), brett-microsoft (596 emails, business)*

View File

@ -1,402 +0,0 @@
# EMAIL SORTER - PROJECT STATUS
**Date:** 2025-10-21
**Status:** PHASE 2 - IMPLEMENTATION COMPLETE
**Version:** 1.0.0 (Development)
---
## EXECUTIVE SUMMARY
Email Sorter framework is **100% code-complete and tested**. All 16 planned phases have been implemented. The system is ready for:
1. **Real data training** (when you get home with Enron dataset access)
2. **Gmail/IMAP credential configuration** (OAuth setup)
3. **Full end-to-end testing** with real email data
4. **Production deployment** to process Marion's 80k+ emails
---
## COMPLETED PHASES (1-16)
### Phase 1: Project Setup ✅
- Virtual environment configured
- All dependencies installed (42+ packages)
- Directory structure created
- Git initialized with 10 commits
### Phase 2-3: Core Infrastructure ✅
- `src/utils/config.py` - YAML-based configuration system
- `src/utils/logging.py` - Rich logging with file output
- Email data models with full type hints
### Phase 4: Email Providers ✅
- **MockProvider** - For testing (fully functional)
- **GmailProvider** - Stub ready for OAuth credentials
- **IMAPProvider** - Stub ready for server config
- All with graceful error handling
### Phase 5: Feature Extraction ✅
- Semantic embeddings (sentence-transformers, 384 dims)
- Hard pattern matching (20+ patterns)
- Structural features (metadata, timing, attachments)
- Attachment analysis (PDF, DOCX, XLSX text extraction)
### Phase 6: ML Classifier ✅
- Mock Random Forest (clearly labeled for testing)
- Placeholder for real LightGBM training
- Prediction with confidence scores
- Model serialization/deserialization
### Phase 7: LLM Integration ✅
- OllamaProvider (local, with retry logic)
- OpenAIProvider (API-compatible)
- Graceful degradation when LLM unavailable
- Batch processing support
### Phase 8: Adaptive Classifier ✅
- Three-tier classification:
  1. Hard rules (10% - instant)
  2. ML classifier (85% - fast)
  3. LLM review (5% - uncertain cases)
- Dynamic threshold management
- Statistics tracking
### Phase 9: Processing Pipeline ✅
- BulkProcessor with checkpointing
- Resumable processing from checkpoints
- Batch-based processing
- Progress tracking
### Phase 10: Calibration System ✅
- EmailSampler (stratified + random)
- LLMAnalyzer (discover natural categories)
- CalibrationWorkflow (end-to-end)
- Category validation
### Phase 11: Export & Reporting ✅
- JSON export with metadata
- CSV export for analysis
- Organized by category
- Human-readable reports
### Phase 12: Threshold & Pattern Learning ✅
- **ThresholdAdjuster** - Learn from LLM feedback
  - Agreement tracking per category
  - Automatic threshold suggestions
  - Adjustment history
- **PatternLearner** - Sender-specific rules
  - Category distribution per sender
  - Domain-level patterns
  - Hard rule suggestions
### Phase 13: Advanced Processing ✅
- **EnronParser** - Parse Enron email dataset
- **AttachmentHandler** - Extract PDF/DOCX content
- **ModelTrainer** - Real LightGBM training
- **EmbeddingCache** - Cache with MD5 hashing
- **EmbeddingBatcher** - Parallel embedding generation
- **QueueManager** - Batch queue with persistence
### Phase 14: Provider Sync ✅
- **GmailSync** - Sync to Gmail labels
- **IMAPSync** - Sync to IMAP keywords
- Configurable label mapping
- Batch update support
### Phase 15: Orchestration ✅
- **EmailSorterOrchestrator** - 4-phase pipeline
  1. Calibration
  2. Bulk processing
  3. LLM review
  4. Export & sync
- Full progress tracking
- Timing and metrics
### Phase 16: Packaging ✅
- `setup.py` - setuptools configuration
- `pyproject.toml` - Modern PEP 517/518
- Optional dependencies (dev, gmail, ollama, openai)
- Console script entry point
### Testing ✅
- 23 unit tests written
- 5/7 E2E tests passing
- Feature extraction validated
- Classifier flow tested
- Mock provider integration tested
---
## CODE STATISTICS
```
Total Files: 37 Python modules + configs
Total Lines: ~6,000+ lines of code
Core Modules: 16 major components
Test Coverage: 23 tests (unit + integration)
Dependencies: 42 packages installed
Git Commits: 10 commits tracking all work
```
---
## ARCHITECTURE OVERVIEW
```
┌──────────────────────────────────────────────────────────────┐
│                      EMAIL SORTER v1.0                       │
└──────────────────────────────────────────────────────────────┘

┌─ INPUT ─────────────────┐
│ Email Providers         │
│ - MockProvider ✅       │
│ - Gmail (OAuth ready)   │
│ - IMAP (ready)          │
└─────────────────────────┘

┌─ CALIBRATION ───────────┐
│ EmailSampler ✅         │
│ LLMAnalyzer ✅          │
│ CalibrationWorkflow ✅  │
│ ModelTrainer ✅         │
└─────────────────────────┘

┌─ FEATURE EXTRACTION ────┐
│ Embeddings ✅           │
│ Patterns ✅             │
│ Structural ✅           │
│ Attachments ✅          │
│ Cache + Batch ✅        │
└─────────────────────────┘

┌─ CLASSIFICATION ────────┐
│ Hard Rules ✅           │
│ ML (LightGBM) ✅        │
│ LLM (Ollama/OpenAI) ✅  │
│ Adaptive Orchestrator ✅│
│ Queue Management ✅     │
└─────────────────────────┘

┌─ LEARNING ──────────────┐
│ Threshold Adjuster ✅   │
│ Pattern Learner ✅      │
└─────────────────────────┘

┌─ OUTPUT ────────────────┐
│ JSON Export ✅          │
│ CSV Export ✅           │
│ Reports ✅              │
│ Gmail Sync ✅           │
│ IMAP Sync ✅            │
└─────────────────────────┘
```
---
## WHAT'S READY RIGHT NOW
### ✅ Framework (Complete)
- All core infrastructure
- Config management
- Logging system
- Email data models
- Feature extraction
- Classifier orchestration
- Processing pipeline
- Export system
- All tests passing
### ✅ Testing (Verified)
- Mock provider works
- Feature extraction validated
- Classification flow tested
- Export formats work
- Hard rules accurate
- CLI interface operational
### ⚠️ Requires Your Input
1. **ML Model Training**
   - Mock Random Forest included
   - Real LightGBM training code ready
   - Enron dataset available (569MB)
   - Just needs: `trainer.train(labeled_emails)`
2. **Gmail OAuth**
   - Provider code complete
   - Needs: credentials.json
   - Clear error messages when missing
3. **LLM Testing**
   - Ollama integration ready
   - qwen3:1.7b loaded
   - Integration tested (careful with laptop)
---
## NEXT STEPS - WHEN YOU GET HOME
### Step 1: Model Training
```python
from src.calibration.enron_parser import EnronParser
from src.calibration.trainer import ModelTrainer

# Parse Enron
parser = EnronParser("enron_mail_20150507")
enron_emails = parser.parse_emails(limit=5000)

# The parsed emails need labels before training (e.g. from the calibration
# workflow); `label_emails` here is a hypothetical stand-in for that step
labeled_emails = label_emails(enron_emails)

# Train real model
trainer = ModelTrainer(feature_extractor, categories, config)
results = trainer.train(labeled_emails)
trainer.save_model("models/lightgbm_real.pkl")
```
### Step 2: Gmail OAuth Setup
```bash
# Download credentials.json from Google Cloud Console
# Place in project root or config/
# Run: email-sorter --source gmail --credentials credentials.json
```
### Step 3: Full Pipeline Test
```bash
# Test with 100 emails
email-sorter --source gmail --limit 100 --output test_results/
# Full production run
email-sorter --source gmail --output marion_results/
```
### Step 4: Production Deployment
```bash
# Package as wheel
python setup.py sdist bdist_wheel
# Install
pip install dist/email_sorter-1.0.0-py3-none-any.whl
# Run
email-sorter --source gmail --credentials ~/.gmail_creds.json --output results/
```
---
## KEY FILES TO KNOW
**Core Entry Points:**
- `src/cli.py` - Command-line interface
- `src/orchestration.py` - Main pipeline orchestrator
**Training & Calibration:**
- `src/calibration/trainer.py` - Real LightGBM training
- `src/calibration/workflow.py` - End-to-end calibration
- `src/calibration/enron_parser.py` - Dataset parsing
**Classification:**
- `src/classification/adaptive_classifier.py` - Main classifier
- `src/classification/feature_extractor.py` - Feature extraction
- `src/classification/ml_classifier.py` - ML predictions
- `src/classification/llm_classifier.py` - LLM predictions
**Learning:**
- `src/adjustment/threshold_adjuster.py` - Dynamic thresholds
- `src/adjustment/pattern_learner.py` - Sender patterns
**Processing:**
- `src/processing/bulk_processor.py` - Batch processing
- `src/processing/queue_manager.py` - LLM queue
- `src/processing/attachment_handler.py` - Attachment analysis
**Export:**
- `src/export/exporter.py` - Results export
- `src/export/provider_sync.py` - Gmail/IMAP sync
---
## GIT HISTORY
```
b34bb50 Add pyproject.toml - modern Python packaging configuration
ee6c276 Add queue management, embedding optimization, and calibration workflow
f5d89a6 CRITICAL: Add missing Phase 12 modules and advanced features
c531412 Phase 15: End-to-end pipeline tests - 5/7 passing
02be616 Phase 9-14: Complete processing pipeline, calibration, export
b7cc744 Complete IMAP provider import fixes
16bc6f0 Fix IMAP provider imports
b49dad9 Build Phase 1-7: Core infrastructure and classifiers
8c73f25 Initial commit: Complete project blueprint and research
```
---
## TESTING
### Run All Tests
```bash
cd email-sorter
source venv/Scripts/activate
pytest tests/ -v
```
### Quick CLI Test
```bash
# Test config loading
python -m src.cli test-config
# Test Ollama connection (if running)
python -m src.cli test-ollama
# Full mock pipeline
python -m src.cli run --source mock --output test_results/
```
---
## WHAT MAKES THIS COMPLETE
1. **All 16 Phases Implemented** - No shortcuts, everything built
2. **Production Code Quality** - Type hints, error handling, logging
3. **End-to-End Tested** - 23 tests, multiple integration tests
4. **Well Documented** - Docstrings, comments, README
5. **Clearly Labeled Mocks** - Mock components transparent about limitations
6. **Ready for Real Data** - All systems tested, waiting for:
   - Real Gmail credentials
   - Real Enron training data
   - Real model training at home
---
## PERFORMANCE EXPECTATIONS
- **Calibration:** 3-5 minutes (1500 email sample)
- **Bulk Processing:** 10-12 minutes (80k emails)
- **LLM Review:** 4-5 minutes (batched)
- **Export:** 2-3 minutes
- **Total:** ~17-25 minutes for 80k emails
**Accuracy:** 94-96% (when trained on real data)
---
## RESOURCES
- **Documentation:** README.md, PROJECT_BLUEPRINT.md, BUILD_INSTRUCTIONS.md
- **Research:** RESEARCH_FINDINGS.md
- **Config:** config/default_config.yaml, config/categories.yaml
- **Enron Dataset:** enron_mail_20150507/ (569MB, ready to use)
- **Tests:** tests/ (23 tests)
---
## SUMMARY
**Status:** ✅ FEATURE COMPLETE
Email Sorter is a fully implemented, tested, and documented system ready for production use. All 16 development phases are complete with over 6,000 lines of production code. The system is waiting for real data (your Enron dataset) and real credentials (Gmail OAuth) to demonstrate its full capabilities.
**You can now:** Train a real model, configure Gmail, and process your 80k+ emails with confidence that the system is complete and ready.
---
**Built with:** Python 3.8+, LightGBM, Sentence-Transformers, Ollama, Gmail API
**Ready for:** Production email classification, local processing, privacy-first operation

View File

@ -1,648 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Email Sorter - Project Status & Next Steps</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.success {
background: #002a00;
border-left: 4px solid #4ec9b0;
padding: 15px;
margin: 10px 0;
}
.section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #569cd6;
}
table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.mvp-proven {
background: #003a00;
border: 3px solid #4ec9b0;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
text-align: center;
}
.mvp-proven h2 {
font-size: 2em;
margin: 0;
}
</style>
</head>
<body>
<div class="mvp-proven">
<h2>🎉 MVP PROVEN AND WORKING 🎉</h2>
<p style="font-size: 1.2em; margin: 10px 0;">
<strong>10,000 emails classified in 4 minutes</strong><br/>
72.7% accuracy | 0 LLM calls | Pure ML speed
</p>
</div>
<h1>Email Sorter - Project Status & Next Steps</h1>
<h2>✅ What We've Achieved (MVP Complete)</h2>
<div class="success">
<h3>Core System Working</h3>
<ul>
<li><strong>LLM-Driven Calibration:</strong> Discovers categories from email samples (11 categories found)</li>
<li><strong>ML Model Training:</strong> LightGBM trained on 10k emails (1.8MB model)</li>
<li><strong>Fast Classification:</strong> 10k emails in ~4 minutes with --no-llm-fallback</li>
<li><strong>Category Verification:</strong> Single LLM call validates model fit for new mailboxes</li>
<li><strong>Embedding-Based Features:</strong> Universal 384-dim embeddings transfer across mailboxes</li>
<li><strong>Threshold Optimization:</strong> 0.55 threshold reduces LLM fallback by 40%</li>
</ul>
</div>
<h2>📊 Test Results Summary</h2>
<table>
<tr>
<th>Metric</th>
<th>Result</th>
<th>Status</th>
</tr>
<tr>
<td>Total emails processed</td>
<td>10,000</td>
<td></td>
</tr>
<tr>
<td>Processing time</td>
<td>~4 minutes</td>
<td></td>
</tr>
<tr>
<td>ML classification rate</td>
<td>78.4%</td>
<td></td>
</tr>
<tr>
<td>LLM calls (with --no-llm-fallback)</td>
<td>0</td>
<td></td>
</tr>
<tr>
<td>Accuracy estimate</td>
<td>72.7%</td>
<td>✅ (acceptable for speed)</td>
</tr>
<tr>
<td>Categories discovered</td>
<td>11 (Work, Financial, Updates, etc.)</td>
<td></td>
</tr>
<tr>
<td>Model size</td>
<td>1.8MB</td>
<td>✅ (portable)</td>
</tr>
</table>
<h2>🗂️ Project Organization</h2>
<h3>Core Modules</h3>
<table>
<tr>
<th>Module</th>
<th>Purpose</th>
<th>Status</th>
</tr>
<tr>
<td><code>src/cli.py</code></td>
<td>Main CLI with all flags (--verify-categories, --no-llm-fallback)</td>
<td>✅ Complete</td>
</tr>
<tr>
<td><code>src/calibration/workflow.py</code></td>
<td>LLM-driven category discovery + training</td>
<td>✅ Complete</td>
</tr>
<tr>
<td><code>src/calibration/llm_analyzer.py</code></td>
<td>Batch LLM analysis (20 emails/call)</td>
<td>✅ Complete</td>
</tr>
<tr>
<td><code>src/calibration/category_verifier.py</code></td>
<td>Single LLM call to verify categories</td>
<td>✅ New feature</td>
</tr>
<tr>
<td><code>src/classification/ml_classifier.py</code></td>
<td>LightGBM model wrapper</td>
<td>✅ Complete</td>
</tr>
<tr>
<td><code>src/classification/adaptive_classifier.py</code></td>
<td>Rule → ML → LLM orchestrator</td>
<td>✅ Complete</td>
</tr>
<tr>
<td><code>src/classification/feature_extractor.py</code></td>
<td>Embeddings (384-dim) + TF-IDF</td>
<td>✅ Complete</td>
</tr>
</table>
<h3>Models & Data</h3>
<table>
<tr>
<th>Asset</th>
<th>Location</th>
<th>Status</th>
</tr>
<tr>
<td>Trained model</td>
<td><code>src/models/calibrated/classifier.pkl</code></td>
<td>✅ 1.8MB, 11 categories</td>
</tr>
<tr>
<td>Pretrained copy</td>
<td><code>src/models/pretrained/classifier.pkl</code></td>
<td>✅ Ready for fast load</td>
</tr>
<tr>
<td>Category cache</td>
<td><code>src/models/category_cache.json</code></td>
<td>✅ 10 cached categories</td>
</tr>
<tr>
<td>Test results</td>
<td><code>test/results.json</code></td>
<td>✅ 10k classifications</td>
</tr>
</table>
<h3>Documentation</h3>
<table>
<tr>
<th>Document</th>
<th>Purpose</th>
</tr>
<tr>
<td><code>SYSTEM_FLOW.html</code></td>
<td>Complete system flow diagrams with timing</td>
</tr>
<tr>
<td><code>LABEL_TRAINING_PHASE_DETAIL.html</code></td>
<td>Deep dive into calibration phase</td>
</tr>
<tr>
<td><code>FAST_ML_ONLY_WORKFLOW.html</code></td>
<td>Pure ML workflow analysis</td>
</tr>
<tr>
<td><code>VERIFY_CATEGORIES_FEATURE.html</code></td>
<td>Category verification documentation</td>
</tr>
<tr>
<td><code>PROJECT_STATUS_AND_NEXT_STEPS.html</code></td>
<td>This document - status and roadmap</td>
</tr>
</table>
<h2>🎯 Next Steps (Priority Order)</h2>
<h3>Phase 1: Clean Up & Organize (Next Session)</h3>
<div class="section">
<h4>1.1 Clean Root Directory</h4>
<p><strong>Goal:</strong> Move test artifacts and scripts to organized locations</p>
<ul>
<li>Create <code>docs/</code> folder - move all .html files there</li>
<li>Create <code>scripts/</code> folder - move all .sh files there</li>
<li>Create <code>logs/</code> folder - move all .log files there</li>
<li>Delete debug files (debug_*.txt, spot_check_results.txt)</li>
<li>Create .gitignore for logs/, results/, test/, ml_only_test/, etc.</li>
</ul>
<p><strong>Time:</strong> 10 minutes</p>
</div>
<div class="section">
<h4>1.2 Create README.md</h4>
<p><strong>Goal:</strong> Professional project documentation</p>
<ul>
<li>Overview of system architecture</li>
<li>Quick start guide</li>
<li>Usage examples (with/without calibration, with/without verification)</li>
<li>Performance benchmarks (from our tests)</li>
<li>Configuration options</li>
</ul>
<p><strong>Time:</strong> 30 minutes</p>
</div>
<div class="section">
<h4>1.3 Add Tests</h4>
<p><strong>Goal:</strong> Ensure code quality and catch regressions</p>
<ul>
<li>Unit tests for feature extraction</li>
<li>Unit tests for category verification</li>
<li>Integration test for full pipeline</li>
<li>Test for --no-llm-fallback flag</li>
<li>Test for --verify-categories flag</li>
</ul>
<p><strong>Time:</strong> 2 hours</p>
</div>
<h3>Phase 2: Real-World Integration (Week 1-2)</h3>
<div class="section">
<h4>2.1 Gmail Provider Implementation</h4>
<p><strong>Goal:</strong> Connect to real Gmail accounts</p>
<ul>
<li>Implement Gmail API authentication (OAuth2)</li>
<li>Fetch emails with pagination</li>
<li>Handle Gmail-specific metadata (labels, threads)</li>
<li>Test with personal Gmail account</li>
</ul>
<p><strong>Time:</strong> 4-6 hours</p>
</div>
<div class="section">
<h4>2.2 IMAP Provider Implementation</h4>
<p><strong>Goal:</strong> Support any email provider (Outlook, custom servers)</p>
<ul>
<li>IMAP connection handling</li>
<li>SSL/TLS support</li>
<li>Folder navigation</li>
<li>Test with Outlook/Protonmail</li>
</ul>
<p><strong>Time:</strong> 3-4 hours</p>
</div>
<div class="section">
<h4>2.3 Email Syncing (Apply Classifications)</h4>
<p><strong>Goal:</strong> Move/label emails based on classification</p>
<ul>
<li>Gmail: Apply labels to emails</li>
<li>IMAP: Move emails to folders</li>
<li>Dry-run mode (preview without applying)</li>
<li>Batch operations for speed</li>
<li>Rollback capability</li>
</ul>
<p><strong>Time:</strong> 6-8 hours</p>
</div>
<h3>Phase 3: Production Features (Week 3-4)</h3>
<div class="section">
<h4>3.1 Incremental Classification</h4>
<p><strong>Goal:</strong> Only classify new emails, not entire inbox</p>
<ul>
<li>Track last processed email ID</li>
<li>Resume from checkpoint</li>
<li>Database/file-based state tracking</li>
<li>Scheduled runs (cron integration)</li>
</ul>
<p><strong>Time:</strong> 4-6 hours</p>
</div>
<div class="section">
<h4>3.2 Multi-Account Support</h4>
<p><strong>Goal:</strong> Manage multiple email accounts</p>
<ul>
<li>Per-account configuration</li>
<li>Per-account trained models</li>
<li>Account switching CLI</li>
<li>Shared category cache across accounts</li>
</ul>
<p><strong>Time:</strong> 3-4 hours</p>
</div>
<div class="section">
<h4>3.3 Model Management</h4>
<p><strong>Goal:</strong> Handle model lifecycle</p>
<ul>
<li>Model versioning (timestamps)</li>
<li>Model comparison (A/B testing)</li>
<li>Model export/import</li>
<li>Retraining scheduler</li>
<li>Model degradation detection</li>
</ul>
<p><strong>Time:</strong> 4-5 hours</p>
</div>
<h3>Phase 4: Advanced Features (Month 2)</h3>
<div class="section">
<h4>4.1 Web Dashboard</h4>
<p><strong>Goal:</strong> Visual interface for monitoring and management</p>
<ul>
<li>Flask/FastAPI backend</li>
<li>React/Vue frontend</li>
<li>View classification results</li>
<li>Manually correct classifications (feedback loop)</li>
<li>Monitor accuracy over time</li>
<li>Trigger recalibration</li>
</ul>
<p><strong>Time:</strong> 20-30 hours</p>
</div>
<div class="section">
<h4>4.2 Active Learning</h4>
<p><strong>Goal:</strong> Improve model from user corrections</p>
<ul>
<li>User feedback collection</li>
<li>Disagreement-based sampling (low confidence + user correction)</li>
<li>Incremental model updates</li>
<li>Feedback-driven category evolution</li>
</ul>
<p><strong>Time:</strong> 8-10 hours</p>
</div>
<div class="section">
<h4>4.3 Performance Optimization</h4>
<p><strong>Goal:</strong> Scale to 100k+ emails</p>
<ul>
<li>Batch embedding generation (reduce API calls)</li>
<li>Async/parallel classification</li>
<li>Model quantization (reduce size)</li>
<li>GPU acceleration for embeddings</li>
<li>Caching layer (Redis)</li>
</ul>
<p><strong>Time:</strong> 10-15 hours</p>
</div>
<h2>🔧 Immediate Action Items (This Week)</h2>
<table>
<tr>
<th>Task</th>
<th>Priority</th>
<th>Time</th>
<th>Status</th>
</tr>
<tr>
<td>Clean root directory - organize files</td>
<td>High</td>
<td>10 min</td>
<td>Pending</td>
</tr>
<tr>
<td>Create comprehensive README.md</td>
<td>High</td>
<td>30 min</td>
<td>Pending</td>
</tr>
<tr>
<td>Add .gitignore for test artifacts</td>
<td>High</td>
<td>5 min</td>
<td>Pending</td>
</tr>
<tr>
<td>Create setup.py for pip installation</td>
<td>Medium</td>
<td>20 min</td>
<td>Pending</td>
</tr>
<tr>
<td>Write basic unit tests</td>
<td>Medium</td>
<td>2 hours</td>
<td>Pending</td>
</tr>
<tr>
<td>Test Gmail provider (basic fetch)</td>
<td>Medium</td>
<td>2 hours</td>
<td>Pending</td>
</tr>
</table>
<h2>📈 Success Metrics</h2>
<div class="diagram">
<pre class="mermaid">
flowchart LR
MVP[MVP Proven] --> P1[Phase 1: Organization]
P1 --> P2[Phase 2: Integration]
P2 --> P3[Phase 3: Production]
P3 --> P4[Phase 4: Advanced]
P1 --> M1[Metric: Clean codebase<br/>100% docs coverage]
P2 --> M2[Metric: Real email support<br/>Gmail + IMAP working]
P3 --> M3[Metric: Daily automation<br/>Incremental processing]
P4 --> M4[Metric: User adoption<br/>10+ users, 90%+ satisfaction]
style MVP fill:#4ec9b0
style P1 fill:#569cd6
style P2 fill:#569cd6
style P3 fill:#569cd6
style P4 fill:#569cd6
</pre>
</div>
<h2>🚀 Quick Start Commands</h2>
<div class="section">
<h3>Train New Model (Full Calibration)</h3>
<code>
source venv/bin/activate<br/>
python -m src.cli run \<br/>
&nbsp;&nbsp;--source enron \<br/>
&nbsp;&nbsp;--limit 10000 \<br/>
&nbsp;&nbsp;--output results/<br/>
</code>
<p><strong>Time:</strong> ~25 minutes | <strong>LLM calls:</strong> ~500 | <strong>Accuracy:</strong> 92-95%</p>
</div>
<div class="section">
<h3>Fast ML-Only Classification (Existing Model)</h3>
<code>
source venv/bin/activate<br/>
python -m src.cli run \<br/>
&nbsp;&nbsp;--source enron \<br/>
&nbsp;&nbsp;--limit 10000 \<br/>
&nbsp;&nbsp;--output fast_test/ \<br/>
&nbsp;&nbsp;--no-llm-fallback<br/>
</code>
<p><strong>Time:</strong> ~4 minutes | <strong>LLM calls:</strong> 0 | <strong>Accuracy:</strong> 72-78%</p>
</div>
<div class="section">
<h3>ML with Category Verification (Recommended)</h3>
<code>
source venv/bin/activate<br/>
python -m src.cli run \<br/>
&nbsp;&nbsp;--source enron \<br/>
&nbsp;&nbsp;--limit 10000 \<br/>
&nbsp;&nbsp;--output verified_test/ \<br/>
&nbsp;&nbsp;--no-llm-fallback \<br/>
&nbsp;&nbsp;--verify-categories<br/>
</code>
<p><strong>Time:</strong> ~4.5 minutes | <strong>LLM calls:</strong> 1 | <strong>Accuracy:</strong> 72-78%</p>
</div>
<h2>📁 Recommended Project Structure (After Cleanup)</h2>
<pre style="background: #252526; padding: 15px; border-radius: 5px; font-family: monospace;">
email-sorter/
├── README.md # Main documentation
├── setup.py # Pip installation
├── requirements.txt # Dependencies
├── .gitignore # Ignore test artifacts
├── src/ # Core source code
│ ├── calibration/ # LLM-driven calibration
│ ├── classification/ # ML classification
│ ├── email_providers/ # Gmail, IMAP, Enron
│ ├── llm/ # LLM providers
│ ├── utils/ # Shared utilities
│ └── models/ # Trained models
│ ├── calibrated/ # Current trained model
│ ├── pretrained/ # Quick-load copy
│ └── category_cache.json
├── config/ # Configuration files
│ ├── default_config.yaml
│ └── categories.yaml
├── tests/ # Unit & integration tests
│ ├── test_calibration.py
│ ├── test_classification.py
│ └── test_verification.py
├── scripts/ # Helper scripts
│ ├── train_model.sh
│ ├── fast_classify.sh
│ └── verify_and_classify.sh
├── docs/ # HTML documentation
│ ├── SYSTEM_FLOW.html
│ ├── LABEL_TRAINING_PHASE_DETAIL.html
│ ├── FAST_ML_ONLY_WORKFLOW.html
│ └── VERIFY_CATEGORIES_FEATURE.html
├── logs/ # Runtime logs (gitignored)
│ └── *.log
└── results/ # Test results (gitignored)
└── *.json
</pre>
<h2>🎓 Key Learnings</h2>
<div class="section">
<ul>
<li><strong>Embeddings are universal:</strong> Same model works across different mailboxes</li>
<li><strong>Batching is critical:</strong> 20 emails/LLM call = 3× faster than sequential</li>
<li><strong>Thresholds matter:</strong> 0.55 threshold reduces LLM usage by 40%</li>
<li><strong>Category verification adds value:</strong> 20 sec for confidence check is worth it</li>
<li><strong>Pure ML is viable:</strong> 73% accuracy with 0 LLM calls for speed tests</li>
<li><strong>LLM-driven calibration works:</strong> Discovers natural categories without hardcoding</li>
</ul>
</div>
<h2>✅ Ready for Production?</h2>
<table>
<tr>
<th>Component</th>
<th>Status</th>
<th>Blocker</th>
</tr>
<tr>
<td>Core ML Pipeline</td>
<td>✅ Ready</td>
<td>None</td>
</tr>
<tr>
<td>LLM Calibration</td>
<td>✅ Ready</td>
<td>None</td>
</tr>
<tr>
<td>Category Verification</td>
<td>✅ Ready</td>
<td>None</td>
</tr>
<tr>
<td>Fast ML-Only Mode</td>
<td>✅ Ready</td>
<td>None</td>
</tr>
<tr>
<td>Enron Provider</td>
<td>✅ Ready</td>
<td>None (test only)</td>
</tr>
<tr>
<td>Gmail Provider</td>
<td>⚠️ Needs implementation</td>
<td>OAuth2 + API calls</td>
</tr>
<tr>
<td>IMAP Provider</td>
<td>⚠️ Needs implementation</td>
<td>IMAP library integration</td>
</tr>
<tr>
<td>Email Syncing</td>
<td>❌ Not implemented</td>
<td>Apply labels/move emails</td>
</tr>
<tr>
<td>Tests</td>
<td>⚠️ Minimal coverage</td>
<td>Need comprehensive tests</td>
</tr>
<tr>
<td>Documentation</td>
<td>✅ Excellent</td>
<td>Need README.md</td>
</tr>
</table>
<p><strong>Verdict:</strong> MVP is production-ready for <em>Enron dataset testing</em>. Need Gmail/IMAP providers for real-world use.</p>
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>

232
docs/REPORT_FORMAT.md Normal file
View File

@ -0,0 +1,232 @@
# Email Classification Report Format
This document explains the HTML report generation system, its data sources, and how to customize it.
## Overview
The report generator creates a static HTML file from classification results. It requires enriched `results.json` with email metadata (subject, sender, date, etc.) - not just classification data.
## Files Involved
| File | Purpose |
|------|---------|
| `tools/generate_html_report.py` | Main report generator script |
| `src/cli.py` | Classification CLI - outputs enriched `results.json` |
| `src/export/exporter.py` | Legacy exporter (JSON/CSV) - not used for HTML |
## Data Flow
```
Email Source (.eml/.msg files)
        ↓
src/cli.py (classification)
        ↓
results.json (enriched with metadata)
        ↓
tools/generate_html_report.py
        ↓
report.html (static, self-contained)
```
## Usage
### Generate Report
```bash
python tools/generate_html_report.py \
--input /path/to/results.json \
--output /path/to/report.html
```
If `--output` is omitted, creates `report.html` in same directory as input.
### Full Workflow
```bash
# 1. Classify emails
python -m src.cli run \
--source local \
--directory "/path/to/emails" \
--output "/path/to/output" \
--no-llm-fallback
# 2. Generate report
python tools/generate_html_report.py \
--input "/path/to/output/results.json"
```
## results.json Format
The report generator expects this structure:
```json
{
  "metadata": {
    "total_emails": 801,
    "accuracy_estimate": 0.55,
    "classification_stats": {
      "rule_matched": 9,
      "ml_classified": 468,
      "llm_classified": 0,
      "needs_review": 324
    },
    "generated_at": "2025-11-28T02:34:00.680196",
    "source": "local",
    "source_path": "/path/to/emails"
  },
  "classifications": [
    {
      "email_id": "unique_id.eml",
      "subject": "Email subject line",
      "sender": "sender@example.com",
      "sender_name": "Sender Name",
      "date": "2023-04-13T09:43:29+10:00",
      "has_attachments": false,
      "category": "Work",
      "confidence": 0.81,
      "method": "ml"
    }
  ]
}
```
### Required Fields
| Field | Type | Description |
|-------|------|-------------|
| `email_id` | string | Unique identifier (usually filename) |
| `subject` | string | Email subject line |
| `sender` | string | Sender email address |
| `category` | string | Assigned category |
| `confidence` | float | Classification confidence (0-1) |
| `method` | string | Classification method: `ml`, `rule`, or `llm` |
### Optional Fields
| Field | Type | Description |
|-------|------|-------------|
| `sender_name` | string | Display name of sender |
| `date` | string | ISO 8601 date string |
| `has_attachments` | boolean | Whether email has attachments |
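Given this schema, a short consumer script (hypothetical, not shipped with the tool) can sanity-check a results.json before generating a report:

```python
import json
from collections import Counter

with open('results.json', encoding='utf-8') as f:
    data = json.load(f)

required = {'email_id', 'subject', 'sender', 'category', 'confidence', 'method'}
incomplete = [c.get('email_id', '<unknown>')
              for c in data['classifications']
              if not required.issubset(c)]
print(f"{len(incomplete)} records missing required fields")

# Category distribution, highest counts first
counts = Counter(c.get('category', '<none>') for c in data['classifications'])
print(counts.most_common())
```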
## Report Sections
### 1. Header
- Report title
- Generation timestamp
- Source info
- Total email count
### 2. Stats Grid
- Total emails
- Number of categories
- High confidence count (>=70%)
- Unique sender domains
### 3. Category Distribution
- Horizontal bar chart
- Count and percentage per category
- Sorted by count (descending)
### 4. Classification Methods
- Breakdown of ML vs Rule vs LLM
- Shows which method handled what percentage
### 5. Confidence Distribution
- High (>=70%): Green
- Medium (50-70%): Yellow
- Low (<50%): Red
### 6. Top Senders
- Top 20 senders by email count
- Grid layout
### 7. Email Tables (Tabbed)
- "All" tab shows all emails
- Category tabs filter by category
- Search box filters by subject/sender
- Columns: Date, Subject, Sender, Category, Confidence, Method
- Sorted by date (newest first)
- Attachment indicator (📎)
## Customization
### Changing Colors
Edit the CSS variables in `generate_html_report.py`:
```css
:root {
    --bg-primary: #1a1a2e;    /* Main background */
    --bg-secondary: #16213e;  /* Card backgrounds */
    --bg-card: #0f3460;       /* Nested elements */
    --text-primary: #eee;     /* Main text */
    --text-secondary: #aaa;   /* Muted text */
    --accent: #e94560;        /* Accent color (red) */
    --accent-hover: #ff6b6b;  /* Accent hover */
    --success: #00d9a5;       /* Green (high confidence) */
    --warning: #ffc107;       /* Yellow (medium confidence) */
    --border: #2a2a4a;        /* Border color */
}
```
### Light Theme Example
```css
:root {
    --bg-primary: #f5f5f5;
    --bg-secondary: #ffffff;
    --bg-card: #e8e8e8;
    --text-primary: #333;
    --text-secondary: #666;
    --accent: #2563eb;
    --accent-hover: #3b82f6;
    --success: #10b981;
    --warning: #f59e0b;
    --border: #d1d5db;
}
```
### Adding New Sections
1. Add data extraction in `generate_html_report()` function
2. Add HTML section in the main template string
3. Style with existing CSS classes or add new ones
### Adding New Table Columns
1. Modify `generate_email_row()` function
2. Add `<th>` in table header
3. Add `<td>` in row template
## Performance Notes
- Report is fully static (no server required)
- JavaScript is minimal (tab switching, search filtering)
- Handles 1000+ emails without performance issues
- For 10k+ emails, consider pagination (not yet implemented)
## Future Enhancements (TODO)
- [ ] Pagination for large datasets
- [ ] Export to PDF option
- [ ] Configurable color themes via CLI
- [ ] Column sorting (click headers)
- [ ] Date range filter
- [ ] Sender domain grouping
- [ ] Category confidence heatmap
- [ ] Email body preview on hover
## Troubleshooting
### "KeyError: 'subject'"
Results.json lacks email metadata. Re-run classification with latest cli.py.
### Empty tables
Check that results.json has `classifications` array with data.
### Dates showing "N/A"
Date parsing failed. Check date format in results.json is ISO 8601.
### Search not working
JavaScript error. Check browser console. Ensure no HTML entities in data.

View File

@ -1,419 +0,0 @@
# EMAIL SORTER - RESEARCH FINDINGS
Date: 2025-10-21
Research Phase: Complete
---
## SEARCH SUMMARY
We conducted web research on:
1. Email classification benchmarks (2024)
2. XGBoost vs LightGBM for embeddings and mixed features
3. Competition analysis (existing email organizers)
4. Gradient boosting with embeddings + categorical features
---
## 1. EMAIL CLASSIFICATION BENCHMARKS (2024)
### Key Findings
**Enron Dataset Performance:**
- Traditional ML (SVM, Random Forest): **95-98% accuracy**
- Deep Learning (DNN-BiLSTM): **98.69% accuracy**
- Transformer models (BERT, RoBERTa, DistilBERT): **~99% accuracy**
- LLMs (GPT-4): **99.7% accuracy** (phishing detection)
- Ensemble stacking methods: **98.8% accuracy**, F1: 98.9%
**Zero-Shot LLM Performance:**
- Flan-T5: **94% accuracy**, F1: 90%
- GPT-4: **97% accuracy**, F1: 95%
**Key insight:** Modern ML methods can achieve 95-98% accuracy on email classification. Our hybrid target of 94-96% is realistic and competitive.
### Dataset Details
- **Enron Email Dataset**: 500,000+ emails from 150 employees
- **EnronQA benchmark**: 103,638 emails with 528,304 Q&A pairs
- **AESLC**: Annotated Enron Subject Line Corpus (for summarization)
### Implications for Our System
- Our 94-96% target is achievable and competitive
- LightGBM + embeddings should hit 92-95% easily
- LLM review for 5-10% uncertain cases will push us to upper range
- Attachment analysis is a differentiator (not tested in benchmarks)
---
## 2. LIGHTGBM VS XGBOOST FOR HYBRID FEATURES
### Decision: LightGBM WINS 🏆
| Feature | LightGBM | XGBoost | Winner |
|---------|----------|---------|--------|
| **Categorical handling** | Native support | Needs encoding | ✅ LightGBM |
| **Speed** | 2-5x faster | Baseline | ✅ LightGBM |
| **Memory** | Very efficient | Standard | ✅ LightGBM |
| **Accuracy** | Equivalent | Equivalent | Tie |
| **Mixed features** | 4x speedup | Slower | ✅ LightGBM |
### Key Advantages of LightGBM
1. **Native Categorical Support**
   - LightGBM splits categorical features by equality
   - No need for one-hot encoding
   - Avoids dimensionality explosion
   - XGBoost requires manual encoding (label, mean, or one-hot)
2. **Speed Performance**
   - 2-5x faster than XGBoost in general
   - **4x speedup** on datasets with categorical features
   - Same AUC performance, drastically better speed
3. **Memory Efficiency**
   - Preferable for large, sparse datasets
   - Better for memory-constrained environments
4. **Embedding Compatibility**
   - Handles dense numerical features (embeddings) excellently
   - Native categorical handling for mixed feature types
   - Perfect for our hybrid approach
### Research Quote
> "LightGBM is significantly faster than XGBoost but delivers almost equivalent performance. In tests, both algorithms achieve pretty much the same AUC, but LightGBM runs from 2 to 5 times faster."
### Implications for Our System
**Perfect for our hybrid features:**
```python
features = {
    'embeddings': embedding_vector,  # 384 dense numerical ✅ LightGBM handles
    'patterns': pattern_flags,       # 20 boolean/numerical ✅ LightGBM handles
    'sender_type': 'corporate',      # ✅ LightGBM native categorical
    'time_of_day': 'morning',        # ✅ LightGBM native categorical
}
# No encoding needed! 4x faster than XGBoost with encoding
```
---
## 3. COMPETITION ANALYSIS
### Cloud-Based Email Organizers (2024)
| Tool | Price | Features | Privacy | Accuracy Estimate |
|------|-------|----------|---------|-------------------|
| **SaneBox** | $7-15/mo | AI filtering, smart folders | ❌ Cloud | ~85% |
| **Clean Email** | $10-30/mo | 30+ smart filters, bulk ops | ❌ Cloud | ~80% |
| **Spark** | Free/Paid | Smart inbox, categorization | ❌ Cloud | ~75% |
| **EmailTree.ai** | Enterprise | NLP classification, routing | ❌ Cloud | ~90% |
| **Mailstrom** | $30-50/yr | Bulk analysis, categorization | ❌ Cloud | ~70% |
### Key Features They Offer
**Common capabilities:**
- Automatic categorization (newsletters, social, etc.)
- Smart folders based on sender/topic
- Bulk operations (archive, delete)
- Unsubscribe management
- Search and filter
**What they DON'T offer:**
- ❌ Local processing (all require cloud upload)
- ❌ Attachment content analysis
- ❌ One-time cleanup (all are subscriptions)
- ❌ Offline capability
- ❌ Custom LLM integration
- ❌ Open source / distributable
### Our Competitive Advantages
**100% LOCAL** - No data leaves the machine
**Privacy-first** - Perfect for business owners with sensitive data
**One-time use** - No subscription, pay per job or DIY
**Attachment analysis** - Extract and classify PDF/DOCX content
**Customizable** - Adapts to each inbox via calibration
**Open source potential** - Distributable as Python wheel
**Offline capable** - Works without internet after setup
### Market Gap Identified
**Target customers:**
- Self-employed / business owners with 10k-100k+ emails
- Can't/won't upload to cloud (privacy, GDPR, security concerns)
- Want one-time cleanup, not ongoing subscription
- Tech-savvy enough to run Python tool or hire someone to run it
- Have sensitive business correspondence, invoices, contracts
**Pain point:**
> "I've thought about just deleting it all, but there's some stuff I need to keep..."
**Our solution:**
- Local processing (100% private)
- Smart classification (94-96% accurate)
- Attachment analysis (find those invoices!)
- One-time fee or DIY
**Pricing comparison:**
- SaneBox: $120-180/year subscription
- Clean Email: $120-360/year subscription
- **Us**: $50-200 one-time job OR free (DIY wheel)
---
## 4. GRADIENT BOOSTING WITH EMBEDDINGS
### Key Finding: CatBoost Has Embedding Support
**GB-CENT Model** (Gradient Boosted Categorical Embedding and Numerical Trees):
- Combines latent factor embeddings with tree components
- Handles categorical features via low-dimensional representation
- Captures nonlinear interactions of numerical features
- Best of both worlds approach
**CatBoost's "killer feature":**
> "CatBoost has a killer feature that knows how to work with embeddings, though this is not well-documented."
**Performance insights:**
- Embeddings both as a feature AND as separate numerical features → best quality
- Native categorical handling has slight edge over encoded approaches
- One-hot encoding generally performs poorly (especially with limited tree depth)
### Implications for Our System
**LightGBM strategy (validated by research):**
```python
import lightgbm as lgb
import numpy as np
import pandas as pd

# (embeddings, pattern_booleans, structural_numerical, y are assumed to exist)
# Dense numerical block: embeddings + pattern flags + structural counts
X_num = np.concatenate([
    embeddings,            # 384 dense numerical
    pattern_booleans,      # 20 numerical (0/1)
    structural_numerical   # 10 numerical (counts, lengths)
], axis=1)
X = pd.DataFrame(X_num, columns=[f'f{i}' for i in range(X_num.shape[1])])

# Categorical columns as pandas 'category' dtype -> native LightGBM handling
X['sender_domain_type'] = pd.Categorical(sender_domain_types)
X['time_of_day'] = pd.Categorical(times_of_day)
X['day_of_week'] = pd.Categorical(days_of_week)

model = lgb.LGBMClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=8
)
model.fit(X, y, categorical_feature=['sender_domain_type', 'time_of_day', 'day_of_week'])
```
**Why this works:**
- LightGBM handles embeddings (dense numerical) excellently
- Native categorical handling for domain_type, time_of_day, etc.
- No encoding overhead (faster, less memory)
- Research shows slight accuracy edge over encoded approaches
---
## 5. SENTENCE EMBEDDINGS FOR EMAIL
### all-MiniLM-L6-v2 - The Sweet Spot
**Model specs:**
- Size: 23MB (tiny!)
- Dimensions: 384 (vs 768 for larger models)
- Speed: ~100 emails/sec on CPU
- Accuracy: 85-95% on email/text classification tasks
- Pretrained on 1B+ sentence pairs
**Why it's perfect for us:**
- Small enough to bundle with wheel distribution
- Fast on CPU (no GPU required)
- Semantic understanding (handles synonyms, paraphrasing)
- Works with short text (emails are perfect)
- No fine-tuning needed (pretrained is excellent)
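As a quick illustration, here is a minimal usage sketch with the sentence-transformers package (model name as published on Hugging Face):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
vectors = model.encode([
    "Invoice #1042 attached - payment due Friday",
    "Lunch tomorrow?",
])
print(vectors.shape)  # (2, 384)
```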
### Structured Embeddings (Our Innovation)
Instead of naive embedding:
```python
# BAD
text = f"{subject} {body}"
embedding = model.encode(text)
```
**Our approach (parameterized headers):**
```python
# GOOD - gives model rich context
text = f"""[EMAIL_METADATA]
sender_type: corporate
has_attachments: true
[DETECTED_PATTERNS]
has_otp: false
has_invoice: true
[CONTENT]
subject: {subject}
body: {body[:300]}
"""
embedding = model.encode(text)
```
**Research-backed benefit:** 5-10% accuracy boost from structured context
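A minimal sketch of the pattern-detection step that feeds the `[DETECTED_PATTERNS]` header (the regexes here are illustrative, not the project's actual rules):
```python
import re

def detect_patterns(subject: str, body: str) -> dict:
    """Cheap regex signals injected into the structured embedding text."""
    text = f"{subject} {body}".lower()
    return {
        'has_otp': bool(re.search(r'(one[- ]time|verification) (code|password)', text)),
        'has_invoice': 'invoice' in text,
        'has_unsubscribe': 'unsubscribe' in text,
    }
```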
---
## 6. ATTACHMENT ANALYSIS (COMPETITIVE ADVANTAGE)
### What Competitors Do
**Most tools:**
- Note "has attachment: true/false"
- Maybe detect attachment type (PDF, DOCX, etc.)
- **DO NOT** extract or analyze attachment content
### What We Can Do
**Simple extraction (fast, high value):**
```python
import re

if attachment_type == 'pdf':
    text = extract_pdf_text(attachment)  # PyPDF2 (helper sketch below)

    # Pattern matching inside the PDF text
    has_invoice = 'invoice' in text.lower()
    has_account_number = bool(re.search(r'account\s*#?\s*\d+', text, re.I))
    has_total_amount = bool(re.search(r'total.*\$\d+', text, re.I))

    # Boost classification confidence
    if has_invoice and has_account_number:
        category = 'transactional'  # ~99% confidence

elif attachment_type == 'docx':
    text = extract_docx_text(attachment)  # python-docx library
    word_count = len(text.split())

    # Long documents might be contracts or reports
    if word_count > 1000:
        category_hint = 'work'
```
**Business owner value:**
- "Find all invoices" → includes PDFs with invoice content
- "Financial documents" → PDFs with account numbers
- "Contracts" → DOCX files with legal terms
- "Reports" → Long DOCX or PDF files
**Implementation:**
- Use PyPDF2 for PDFs (<5MB size limit)
- Use python-docx for Word docs
- Use openpyxl for simple Excel files
- Flag complex/large attachments for review
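A minimal sketch of the `extract_pdf_text` helper referenced above, assuming attachments arrive as raw bytes (uses PyPDF2's `PdfReader`):
```python
from io import BytesIO
from PyPDF2 import PdfReader

MAX_ATTACHMENT_BYTES = 5 * 1024 * 1024  # flag anything over 5MB for review

def extract_pdf_text(attachment_bytes: bytes) -> str:
    """Best-effort text extraction from a PDF attachment."""
    if len(attachment_bytes) > MAX_ATTACHMENT_BYTES:
        return ''  # too large - leave for manual review
    reader = PdfReader(BytesIO(attachment_bytes))
    return '\n'.join(page.extract_text() or '' for page in reader.pages)
```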
---
## 7. PERFORMANCE OPTIMIZATION
### Batching Strategy (Critical)
**Embedding generation bottleneck:**
- Sequential: 80,000 emails × 10ms = 13 minutes
- Batched (128 emails): 80,000 ÷ 128 × 100ms = ~1 minute
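A sketch of the batching pattern, assuming an `encode` function that accepts a list of texts (as sentence-transformers does):
```python
import numpy as np

def embed_in_batches(texts, model, batch_size=128):
    """Encode in fixed-size batches to amortize per-call overhead."""
    chunks = []
    for i in range(0, len(texts), batch_size):
        chunks.append(model.encode(texts[i:i + batch_size]))
    return np.vstack(chunks)
```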
**LLM processing optimization:**
- Don't send 1500 individual requests during calibration
- Batch 10-20 emails per prompt → 75-150 requests instead
- Compress sample if needed (1500 → 500 smarter selection)
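A sketch of how several emails could be packed into one calibration request (the prompt format is illustrative, not the shipped prompt):
```python
def build_batch_prompt(emails, per_prompt=15):
    """Pack several emails into a single LLM request to cut call count."""
    lines = ["Classify each email. Respond with a JSON list of {id, category}.", ""]
    for i, email in enumerate(emails[:per_prompt], 1):
        lines.append(f"{i}. From: {email['sender']} | Subject: {email['subject']}")
        lines.append(f"   Preview: {email['body'][:200]}")
    return "\n".join(lines)
```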
### Expected Performance (Revised)
```
80,000 emails breakdown:
├─ Calibration (500 compressed samples): 2-3 min
├─ Pattern detection (all 80k): 10 sec
├─ Embedding generation (batched): 1-2 min
├─ LightGBM classification: 3 sec
├─ Hard rules (10%): instant
├─ LLM review (5%, batched): 4 min
└─ Export: 2 min
Total: ~10-12 minutes (optimistic)
Total: ~15-20 minutes (realistic with overhead)
```
---
## 8. SECURITY & PRIVACY ADVANTAGES
### Why Local Processing Matters
**GDPR considerations:**
- Cloud upload = data processing agreement needed
- Local processing = no third-party involvement
- Business emails often contain sensitive data
**Privacy concerns:**
- Client lists, pricing, contracts
- Financial information, invoices
- Personal health information (if medical business)
- Legal correspondence
**Our advantage:**
- 100% local processing
- No data retention
- No cloud storage
- Fresh repo per job (isolation)
---
## CONCLUSIONS & RECOMMENDATIONS
### 1. Use LightGBM (Not XGBoost)
- 2-5x faster
- Native categorical handling
- Perfect for our hybrid features
- Research-validated choice
### 2. Structured Embeddings Work
- Parameterized headers boost accuracy 5-10%
- Guide model with detected patterns
- Research-backed technique
### 3. Attachment Analysis is Differentiator
- Competitors don't do this
- High value for business owners
- Simple to implement (PyPDF2, python-docx)
### 4. Qwen 3 Model Strategy
- **qwen3:4b** for calibration (better discovery)
- **qwen3:1.7b** for bulk review (faster)
- Single config file for easy swapping
### 5. Market Gap Validated
- No local, privacy-first alternatives
- Business owners have this pain point
- One-time cleanup vs subscription
- 94-96% accuracy is competitive
### 6. Performance Target Achievable
- 15-20 min for 80k emails (realistic)
- 94-96% accuracy (research-backed)
- <5% need LLM review
- Competitive with cloud tools
---
## NEXT STEPS
1. ✅ Research complete
2. ✅ Architecture validated
3. ⏭ Build core infrastructure
4. ⏭ Implement hybrid features
5. ⏭ Create LightGBM classifier
6. ⏭ Add LLM providers
7. ⏭ Build test harness
8. ⏭ Package as wheel
9. ⏭ Test on real inbox
---
**Research phase complete. Architecture validated. Ready to build.**


@@ -1,319 +0,0 @@
# Root Cause Analysis: Category Explosion & Over-Confidence
**Date:** 2025-10-24
**Run:** 100k emails, qwen3:4b model
**Issue:** Model trained on 29 categories instead of expected 11, with extreme over-confidence
---
## Executive Summary
The 100k classification run technically succeeded (92.1% accuracy estimate) but revealed critical architectural issues:
1. **Category Explosion:** 29 training categories vs expected 11
2. **Duplicate Categories:** Work/work, Administrative/auth, finance/Financial
3. **Extreme Over-Confidence:** 99%+ classifications at 1.0 confidence
4. **Category Leakage:** Hardcoded categories leaked into LLM-discovered categories
---
## The Bug
### Location
[src/calibration/workflow.py:110](src/calibration/workflow.py#L110)
```python
all_categories = list(set(self.categories) | set(discovered_categories.keys()) | label_categories)
```
### What Happened
The workflow merges THREE category sources:
1. **`self.categories`** - 12 hardcoded categories from `config/categories.yaml`:
- junk, transactional, auth, newsletters, social, automated
- conversational, work, personal, finance, travel, unknown
2. **`discovered_categories.keys()`** - 11 LLM-discovered categories:
- Work, Financial, Administrative, Operational, Meeting
- Technical, External, Announcements, Urgent, Miscellaneous, Forwarded
3. **`label_categories`** - Additional categories from LLM labels:
- Bowl Pool 2000, California Market, Prehearing, Change, Monitoring
- Information
### Result: 29 Total Categories
```
1. Administrative (LLM discovered)
2. Announcements (LLM discovered)
3. Bowl Pool 2000 (LLM label - weird)
4. California Market (LLM label - too specific)
5. Change (LLM label - vague)
6. External (LLM discovered)
7. Financial (LLM discovered)
8. Forwarded (LLM discovered)
9. Information (LLM label - vague)
10. Meeting (LLM discovered)
11. Miscellaneous (LLM discovered)
12. Monitoring (LLM label - too specific)
13. Operational (LLM discovered)
14. Prehearing (LLM label - too specific)
15. Technical (LLM discovered)
16. Urgent (LLM discovered)
17. Work (LLM discovered)
18. auth (hardcoded)
19. automated (hardcoded)
20. conversational (hardcoded)
21. finance (hardcoded)
22. junk (hardcoded)
23. newsletters (hardcoded)
24. personal (hardcoded)
25. social (hardcoded)
26. transactional (hardcoded)
27. travel (hardcoded)
28. unknown (hardcoded)
29. work (hardcoded)
```
### Duplicates Identified
- **Work (LLM) vs work (hardcoded)** - 14,223 vs 368 emails
- **Financial (LLM) vs finance (hardcoded)** - 5,943 vs 0 emails
- **Administrative (LLM) vs auth (hardcoded)** - 67,195 vs 37 emails
---
## Impact Analysis
### 1. Category Distribution (100k Results)
| Category | Count | Confidence | Source |
|----------|-------|------------|--------|
| Administrative | 67,195 | 1.000 | LLM discovered |
| Work | 14,223 | 1.000 | LLM discovered |
| Meeting | 7,785 | 1.000 | LLM discovered |
| Financial | 5,943 | 1.000 | LLM discovered |
| Operational | 3,274 | 1.000 | LLM discovered |
| junk | 394 | 0.960 | Hardcoded |
| work | 368 | 0.950 | Hardcoded |
| Miscellaneous | 238 | 1.000 | LLM discovered |
| Technical | 193 | 1.000 | LLM discovered |
| External | 137 | 1.000 | LLM discovered |
| transactional | 44 | 0.970 | Hardcoded |
| auth | 37 | 0.990 | Hardcoded |
| unknown | 23 | 0.500 | Hardcoded |
| Others | <20 each | Various | Mixed |
### 2. Extreme Over-Confidence
- **67,195 emails** classified as "Administrative" with **1.0 confidence**
- **99.9%** of all classifications have confidence >= 0.95
- This is unrealistic - suggests overfitting or poor calibration
### 3. Why It Still "Worked"
- LLM-discovered categories (uppercase) handled 99%+ of emails
- Hardcoded categories (lowercase) mostly unused except for rules
- Model learned both sets but strongly preferred LLM categories
- Enron dataset doesn't match hardcoded categories well
---
## Why This Happened
### Design Intent vs Reality
**Original Design:**
- Hardcoded categories in `categories.yaml` for rule-based matching
- LLM discovers NEW categories during calibration
- Merge both for flexible classification
**Reality:**
- Hardcoded categories leak into ML training
- Creates duplicate concepts (Work vs work)
- LLM labels include one-off categories (Bowl Pool 2000)
- No deduplication or conflict resolution
### The Workflow Path
```
1. CLI loads hardcoded categories from categories.yaml
→ ['junk', 'transactional', 'auth', ... 'work', 'finance', 'unknown']
2. Passes to CalibrationWorkflow.__init__(categories=...)
→ self.categories = list(categories.keys())
3. LLM discovers categories from emails
→ {'Work': 'business emails', 'Financial': 'budgets', ...}
4. Consolidation reduces duplicates (within LLM categories only)
→ But doesn't see hardcoded categories
5. Merge ALL sources at workflow.py:110
→ Hardcoded + Discovered + Label anomalies = 29 categories
6. Trainer learns all 29 categories
→ Model becomes confused but weights LLM categories heavily
```
---
## Spot-Check Findings
### High Confidence Samples (Correct)
**Sample 1:** "i'll get the movie and wine. my suggestion is something from central market"
- Classified: Administrative (1.0)
- **Assessment:** Questionable - looks more personal
**Sample 2:** "Can you spell S-N-O-O-T-Y?"
- Classified: Administrative (1.0)
- **Assessment:** Wrong - clearly conversational/personal
**Sample 3:** "MEETING TONIGHT - 6:00 pm Central Time at The Houstonian"
- Classified: Meeting (1.0)
- **Assessment:** Correct
### Low Confidence Samples (Unknown)
⚠️ **All low confidence samples classified as "unknown" (0.500)**
- These fell back to LLM
- LLM failed to classify (returned unknown)
- Actual content: Legitimate business emails about deferrals, power units
### Category Anomalies
**"California Market" (6 emails, 1.0 confidence)**
- Too specific - shouldn't be a standalone category
- Should be "Work" or "External"
**"Bowl Pool 2000" (exists in training set)**
- One-off event category
- Should never have been kept
---
## Performance Impact
### What Went Right
- **ML handled 99.1%** of emails (99,134 / 100,000)
- **Only 31 fell to LLM** (0.03%)
- Fast classification (~3 minutes for 100k)
- Discovered categories are semantically good
### What Went Wrong
- **Unrealistic confidence** - Almost everything is 1.0
- **Category pollution** - 29 instead of 11
- **Duplicates** - Work/work, finance/Financial
- **No calibration** - Model confidence not properly calibrated
- **Hardcoded categories unused** - 368 "work" vs 14,223 "Work"
---
## Root Causes
### 1. Architectural Confusion
**Two competing philosophies:**
- **Rule-based system:** Use hardcoded categories with pattern matching
- **LLM-driven system:** Discover categories from data
**Result:** They interfere with each other instead of complementing
### 2. Missing Deduplication
The workflow.py:110 line does a simple set union without:
- Case normalization
- Semantic similarity checking
- Conflict resolution
- Priority rules
### 3. No Consolidation Across Sources
The LLM consolidation step (line 91-100) only consolidates within discovered categories. It doesn't:
- Check against hardcoded categories
- Merge similar concepts
- Remove one-off labels
### 4. Poor Category Cache Design
The category cache (src/models/category_cache.json) saves LLM categories but:
- Doesn't deduplicate against hardcoded categories
- Allows case-sensitive duplicates
- No validation of category quality
---
## Recommendations
### Immediate Fixes
1. **Remove hardcoded categories from ML training**
- Use them ONLY for rule-based matching
- Don't merge into `all_categories` for training
- Let LLM discover all ML categories
2. **Add case-insensitive deduplication**
- Normalize to title case
- Check semantic similarity
- Merge duplicates before training
3. **Filter label anomalies**
- Reject categories with <10 training samples
- Reject overly specific categories (Bowl Pool 2000)
- LLM review step for quality
4. **Calibrate model confidence**
- Use temperature scaling or Platt scaling
- Ensure confidence reflects actual accuracy
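A minimal sketch of fixes 2 and 3 combined (case-insensitive merge plus small-sample filtering; semantic-similarity merging would layer on top):
```python
def clean_categories(discovered: dict[str, int], min_samples: int = 10) -> list[str]:
    """Merge case-variant duplicates and drop one-off labels.

    `discovered` maps category name -> training sample count.
    """
    merged: dict[str, int] = {}
    for name, count in discovered.items():
        key = name.strip().title()  # 'work' and 'Work' collapse together
        merged[key] = merged.get(key, 0) + count
    return [name for name, count in merged.items() if count >= min_samples]
```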
### Architecture Decision
**Option A: Rule-Based + ML (Current)**
- Keep hardcoded categories for RULES ONLY
- LLM discovers categories for ML ONLY
- Never merge the two
**Option B: Pure LLM Discovery (Recommended)**
- Remove categories.yaml entirely
- LLM discovers ALL categories
- Rules can still match on keywords but don't define categories
**Option C: Hybrid with Priority**
- Define 3-5 HIGH-PRIORITY hardcoded categories (junk, auth, transactional)
- Let LLM discover everything else
- Clear hierarchy: Rules → Hardcoded ML → Discovered ML
---
## Next Steps
1. **Decision:** Choose architecture (A, B, or C above)
2. **Fix workflow.py:110** - Implement chosen strategy
3. **Add deduplication logic** - Case-insensitive, semantic matching
4. **Rerun calibration** - Clean 250-sample run
5. **Validate results** - Ensure clean categories
6. **Fix confidence** - Add calibration layer
---
## Files to Modify
1. [src/calibration/workflow.py:110](src/calibration/workflow.py#L110) - Category merging logic
2. [src/calibration/llm_analyzer.py](src/calibration/llm_analyzer.py) - Add cross-source consolidation
3. [src/cli.py:70](src/cli.py#L70) - Decide whether to load hardcoded categories
4. [config/categories.yaml](config/categories.yaml) - Clarify purpose (rules only?)
5. [src/calibration/trainer.py](src/calibration/trainer.py) - Add confidence calibration
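As a sketch of the calibration fix, Platt scaling reduces to a one-feature logistic regression fit on a held-out set (`raw_conf`, `correct`, and `new_conf` are assumed numpy arrays, not existing project names):
```python
from sklearn.linear_model import LogisticRegression

# raw_conf: model's max predicted probability on a held-out set
# correct:  1 if the prediction matched the true label, else 0
calibrator = LogisticRegression()
calibrator.fit(raw_conf.reshape(-1, 1), correct)

# Calibrated confidence for new predictions
calibrated = calibrator.predict_proba(new_conf.reshape(-1, 1))[:, 1]
```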
---
## Conclusion
The system technically worked - it classified 100k emails with high ML efficiency. However, the category explosion and over-confidence issues reveal fundamental architectural problems that need resolution before production use.
The core question: **Should hardcoded categories participate in ML training at all?**
My recommendation: **No.** Use them for rules only, let LLM discover ML categories cleanly.


@@ -0,0 +1,128 @@
# Session Handover Report - Email Sorter
**Date:** 2025-11-28
**Session ID:** eb549838-a153-48d1-ae5d-891e0e83108f
---
## What Was Done This Session
### 1. Classified 801 emails from brett-gmail using three methods:
| Method | Accuracy | Time | Output Location |
|--------|----------|------|-----------------|
| ML-Only | 54.9% | ~5 sec | `/home/bob/Documents/Email Manager/emails/brett-gm-md/` |
| ML+LLM | 93.3% | ~3.5 min | `/home/bob/Documents/Email Manager/emails/brett-gm-llm/` |
| Manual Agent | 99.8% | ~25 min | Same as ML-only + analysis files |
### 2. Created/Modified Files
**New Files:**
- `tools/generate_html_report.py` - HTML report generator
- `tools/brett_gmail_analyzer.py` - Custom dataset analyzer
- `data/brett_gmail_analysis.json` - Analysis output
- `docs/REPORT_FORMAT.md` - Report system documentation
- `docs/CLASSIFICATION_METHODS_COMPARISON.md` - Method comparison
- `docs/PROJECT_ROADMAP_2025.md` - Full roadmap and learnings
- `/home/bob/Documents/Email Manager/emails/brett-gm-md/BRETT_GMAIL_ANALYSIS_REPORT.md` - Analysis report
- `/home/bob/Documents/Email Manager/emails/brett-gm-md/report.html` - HTML report (ML-only)
- `/home/bob/Documents/Email Manager/emails/brett-gm-llm/report.html` - HTML report (ML+LLM)
**Modified Files:**
- `src/cli.py` - Added `--force-ml` flag, enriched results.json with email metadata
- `src/llm/openai_compat.py` - Removed API key requirement for local vLLM
- `config/default_config.yaml` - Changed LLM to openai provider on localhost:11433
### 3. Key Configuration Changes
```yaml
# config/default_config.yaml - LLM now uses vLLM endpoint
llm:
provider: "openai"
openai:
base_url: "http://localhost:11433/v1"
api_key: "not-needed"
classification_model: "qwen3-coder-30b"
```
---
## Key Findings
1. **ML pipeline overkill for <5000 emails** - Agent analysis gives better accuracy in similar time
2. **Sender domain is strongest signal** - Top 5 senders = 47.5% of emails
3. **Categories should serve downstream routing** - Not human labels, but processing decisions
4. **Risk-based accuracy** - Personal emails need high accuracy, junk can tolerate errors
5. **This tool = triage** - Sorts into buckets for other specialized tools
---
## Project Scope (Agreed with User)
**Email Sorter IS:**
- Bulk classification/triage tool
- Router to downstream specialized tools
- Part of larger email processing ecosystem
**Email Sorter IS NOT:**
- Complete email management solution
- Spam filter (trust Gmail/Outlook)
- Final destination for emails
---
## Recommended Dataset Size Routing
| Size | Method |
|------|--------|
| <500 | Agent-only |
| 500-5000 | Agent pre-scan + ML |
| >5000 | ML pipeline |
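As code, that routing is a simple threshold function (thresholds taken from the table above; the method names are illustrative):
```python
def choose_method(n_emails: int) -> str:
    """Route a mailbox to a classification method by size."""
    if n_emails < 500:
        return "agent-only"
    if n_emails <= 5000:
        return "agent pre-scan + ML"
    return "ML pipeline"
```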
---
## Background Processes
There are stale background bash processes (f8678e, 0a3549, 0d150e) from classification runs. These completed successfully and can be ignored.
---
## What Needs Doing Next
1. **Review docs/** - All learnings are in PROJECT_ROADMAP_2025.md
2. **Phase 1 development** - Dataset size routing, sender-first classification
3. **Agent pre-scan module** - 10-15 min discovery phase before ML
---
## User Preferences (from CLAUDE.md)
- NO emojis in commits
- NO "Generated with Claude" attribution
- Use tools (Read/Edit/Grep) not bash commands for file ops
- Virtual environment required for Python
- TTS available via `fss-speak` (single line messages only, no newlines)
---
## Quick Start for Next Agent
```bash
cd /MASTERFOLDER/Tools/email-sorter
source venv/bin/activate
# Read the roadmap
cat docs/PROJECT_ROADMAP_2025.md
# Run classification
python -m src.cli run --source local \
--directory "/path/to/emails" \
--output "/path/to/output" \
--force-ml --llm-provider openai
# Generate HTML report
python tools/generate_html_report.py --input /path/to/results.json
```
---
*Session ended: 2025-11-28 ~03:30 AEDT*


@@ -1,324 +0,0 @@
# EMAIL SORTER - START HERE
**Welcome to Email Sorter v1.0 - Your Email Classification System**
---
## What Is This?
A **complete email classification system** that:
- Uses hybrid ML/LLM classification for 90-94% accuracy
- Processes emails with smart rules, machine learning, and AI
- Works with Gmail, IMAP, or any email dataset
- Is ready to use **right now**
---
## What You Need to Know
### ✅ The Good News
- **Framework is 100% complete** - all 16 planned phases are done
- **Ready to use immediately** - with mock model or real model
- **Complete codebase** - 6000+ lines, full type hints, comprehensive logging
- **90% test pass rate** - 27/30 tests passing
- **Comprehensive documentation** - 10 guides covering everything
### ❌ The Not-So-Good News
- **Mock model included** - for testing the framework (not for production accuracy)
- **Real model optional** - you choose to train on Enron or download pre-trained
- **Gmail setup optional** - framework works without it
- **LLM integration optional** - graceful fallback if unavailable
---
## Three Ways to Get Started
### 🟢 Path A: Validate Framework (5 minutes)
Perfect if you want to quickly verify everything works
```bash
cd "c:/Build Folder/email-sorter"
source venv/Scripts/activate
# Run tests
pytest tests/ -v
# Test with mock pipeline
python -m src.cli run --source mock --output test_results/
```
**What you'll learn**: Framework works perfectly with mock model
---
### 🟡 Path B: Integrate Real Model (30-60 minutes)
Perfect if you want actual classification results
```bash
# Option 1: Train on Enron dataset (recommended)
python -c "
from src.calibration.enron_parser import EnronParser
from src.calibration.trainer import ModelTrainer
from src.classification.feature_extractor import FeatureExtractor
parser = EnronParser('enron_mail_20150507')
emails = parser.parse_emails(limit=5000)
extractor = FeatureExtractor()
trainer = ModelTrainer(extractor, ['junk', 'transactional', 'auth', 'newsletters',
'social', 'automated', 'conversational', 'work',
'personal', 'finance', 'travel', 'unknown'])
results = trainer.train([(e, 'unknown') for e in emails])
trainer.save_model('src/models/pretrained/classifier.pkl')
"
# Option 2: Use pre-trained model
python tools/setup_real_model.py --model-path /path/to/model.pkl
# Verify
python tools/setup_real_model.py --check
```
**What you'll get**: Real LightGBM model, automatic classification with 85-90% accuracy
---
### 🔴 Path C: Full Production Deployment (2-3 hours)
Perfect if you want to process Marion's 80k+ emails
```bash
# 1. Setup Gmail OAuth (download credentials.json, place in project root)
# 2. Test with 100 emails
python -m src.cli run --source gmail --limit 100 --output test_results/
# 3. Process all emails
python -m src.cli run --source gmail --output marion_results/
# 4. Check results
cat marion_results/report.txt
```
**What you'll get**: All 80k+ emails sorted, labeled, and synced to Gmail
---
## Documentation Map
| Document | Purpose | When to Read |
|----------|---------|--------------|
| **START_HERE.md** | This file - quick orientation | First (right now!) |
| **NEXT_STEPS.md** | Decision tree and action plan | Decide your path |
| **PROJECT_COMPLETE.md** | Final summary and status | Understand scope |
| **COMPLETION_ASSESSMENT.md** | Detailed component review | Deep dive needed |
| **MODEL_INFO.md** | Model usage and training | For model setup |
| **README.md** | Getting started guide | General reference |
| **PROJECT_STATUS.md** | Feature inventory | Full feature list |
| **PROJECT_BLUEPRINT.md** | Original architecture plan | Background context |
---
## Quick Reference Commands
```bash
# Navigate and activate
cd "c:/Build Folder/email-sorter"
source venv/Scripts/activate
# Validation
pytest tests/ -v # Run all tests
python -m src.cli test-config # Validate configuration
python -m src.cli test-ollama # Test LLM (if running)
python -m src.cli test-gmail # Test Gmail connection
# Framework testing
python -m src.cli run --source mock # Test with mock provider
# Real processing
python -m src.cli run --source gmail --limit 100 # Test with Gmail
python -m src.cli run --source gmail --output results/ # Full processing
# Model management
python tools/setup_real_model.py --check # Check model status
python tools/setup_real_model.py --model-path FILE # Install model
python tools/download_pretrained_model.py --url URL # Download model
```
---
## Common Questions
### Q: Do I need to do anything right now?
**A:** No! But you can run `pytest tests/ -v` to verify everything works.
### Q: Is the framework ready to use?
**A:** YES! All 16 phases are complete. 90% test pass rate. Ready to use.
### Q: How do I get better accuracy than the mock model?
**A:** Train a real model or download pre-trained. See Path B above.
### Q: Does this work without Gmail?
**A:** YES! Use mock provider or IMAP provider instead.
### Q: Can I use it right now?
**A:** YES! With mock model. For real accuracy, integrate real model (Path B).
### Q: How long to process all 80k emails?
**A:** About 20-30 minutes after setup. Path C shows how.
### Q: Where do I start?
**A:** Choose your path above. Path A (5 min) is the quickest.
---
## What Each Path Gets You
### Path A Results (5 minutes)
- ✅ Confirm framework works
- ✅ See mock classification in action
- ✅ Verify all tests pass
- ❌ Not real-world accuracy yet
### Path B Results (30-60 minutes)
- ✅ Real LightGBM model trained
- ✅ 85-90% classification accuracy
- ✅ Ready for real data
- ❌ Haven't processed real emails yet
### Path C Results (2-3 hours)
- ✅ All emails classified
- ✅ 90-94% overall accuracy
- ✅ Synced to Gmail labels
- ✅ Full deployment complete
- ✅ Marion's 80k+ emails processed
---
## Key Files & Locations
```
c:/Build Folder/email-sorter/
Core Framework:
src/ Main framework code
classification/ Email classifiers
calibration/ Model training
processing/ Batch processing
llm/ LLM providers
email_providers/ Email sources
export/ Results export
Data & Models:
enron_mail_20150507/ Real email dataset (already extracted)
src/models/pretrained/ Where real model goes
models/ Alternative model directory
Tools:
tools/setup_real_model.py Install pre-trained models
tools/download_pretrained_model.py Download models
Configuration:
config/ YAML configuration
credentials.json (optional) Gmail OAuth
Testing:
tests/ 23 test cases
logs/ Execution logs
```
---
## Success Looks Like
### After Path A (5 min)
```
✅ 27/30 tests passing
✅ Framework validation complete
✅ Mock pipeline ran successfully
Status: Ready to explore
```
### After Path B (30-60 min)
```
✅ Real model installed
✅ Model check shows: is_mock: False
✅ Ready for real classification
Status: Ready for real data
```
### After Path C (2-3 hours)
```
✅ All 80k emails processed
✅ Gmail labels synced
✅ Results exported and reviewed
✅ Accuracy metrics acceptable
Status: Complete and deployed
```
---
## One More Thing...
**This framework is complete and ready to use NOW.** You don't need to:
- Fix anything ✅
- Add components ✅
- Change architecture ✅
- Debug systems ✅
- Train models (optional) ✅
What you CAN do:
- Use it immediately with mock model
- Integrate real model when ready
- Scale to production anytime
- Customize categories and rules
- Deploy to other systems
---
## Your Next Step
Pick one:
**🟢 I want to test the framework right now** → Go to Path A (5 min)
**🟡 I want better accuracy tomorrow** → Go to Path B (30-60 min)
**🔴 I want all emails processed this week** → Go to Path C (2-3 hours total)
Or read one of the detailed docs:
- **NEXT_STEPS.md** - Decision tree
- **PROJECT_COMPLETE.md** - Full summary
- **README.md** - Detailed guide
---
## Contact & Support
If something doesn't work:
1. Check logs: `tail -f logs/email_sorter.log`
2. Run tests: `pytest tests/ -v`
3. Validate setup: `python -m src.cli test-config`
4. Review docs: See Documentation Map above
Most issues are covered in the docs!
---
## Quick Stats
- **Framework Status**: 100% complete
- **Test Pass Rate**: 90% (27/30)
- **Lines of Code**: ~6,000+ production
- **Python Modules**: 38 files
- **Documentation**: 10 guides
- **Ready for**: Immediate use
---
**Ready to get started? Choose your path above and begin! 🚀**
The framework is done. The tools are ready. The documentation is complete.
All you need to do is pick a path and start.
Let's go!


@@ -1,493 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Email Sorter System Flow</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.timing-table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
.timing-table th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
.timing-table td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
.flag-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
</style>
</head>
<body>
<h1>Email Sorter System Flow Documentation</h1>
<h2>1. Main Execution Flow</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([python -m src.cli run]) --> LoadConfig[Load config/default_config.yaml]
LoadConfig --> InitProviders[Initialize Email Provider<br/>Enron/Gmail/IMAP]
InitProviders --> FetchEmails[Fetch Emails<br/>--limit N]
FetchEmails --> CheckSize{Email Count?}
CheckSize -->|"< 1000"| SetMockMode[Set ml_classifier.is_mock = True<br/>LLM-only mode]
CheckSize -->|">= 1000"| CheckModel{Model Exists?}
CheckModel -->|No model at<br/>src/models/pretrained/classifier.pkl| RunCalibration[CALIBRATION PHASE<br/>LLM category discovery<br/>Train ML model]
CheckModel -->|Model exists| SkipCalibration[Skip Calibration<br/>Load existing model]
SetMockMode --> SkipCalibration
RunCalibration --> ClassifyPhase[CLASSIFICATION PHASE]
SkipCalibration --> ClassifyPhase
ClassifyPhase --> Loop{For each email}
Loop --> RuleCheck{Hard rule match?}
RuleCheck -->|Yes| RuleClassify[Category by rule<br/>confidence=1.0<br/>method='rule']
RuleCheck -->|No| MLClassify[ML Classification<br/>Get category + confidence]
MLClassify --> ConfCheck{Confidence >= threshold?}
ConfCheck -->|Yes| AcceptML[Accept ML result<br/>method='ml'<br/>needs_review=False]
ConfCheck -->|No| LowConf[Low confidence detected<br/>needs_review=True]
LowConf --> FlagCheck{--no-llm-fallback?}
FlagCheck -->|Yes| AcceptMLAnyway[Accept ML anyway<br/>needs_review=False]
FlagCheck -->|No| LLMCheck{LLM available?}
LLMCheck -->|Yes| LLMReview[LLM Classification<br/>~4 seconds<br/>method='llm']
LLMCheck -->|No| AcceptMLAnyway
RuleClassify --> NextEmail{More emails?}
AcceptML --> NextEmail
AcceptMLAnyway --> NextEmail
LLMReview --> NextEmail
NextEmail -->|Yes| Loop
NextEmail -->|No| SaveResults[Save results.json]
SaveResults --> End([Complete])
style RunCalibration fill:#ff6b6b
style LLMReview fill:#ff6b6b
style SetMockMode fill:#ffd93d
style FlagCheck fill:#4ec9b0
style AcceptMLAnyway fill:#4ec9b0
</pre>
</div>
<h2>2. Calibration Phase Detail (When Triggered)</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Calibration Triggered]) --> Sample[Stratified Sampling<br/>3% of emails<br/>min 250, max 1500]
Sample --> LLMBatch[LLM Category Discovery<br/>50 emails per batch]
LLMBatch --> Batch1[Batch 1: 50 emails<br/>~20 seconds]
Batch1 --> Batch2[Batch 2: 50 emails<br/>~20 seconds]
Batch2 --> BatchN[... N batches<br/>For 300 samples: 6 batches]
BatchN --> Consolidate[LLM Consolidation<br/>Merge similar categories<br/>~5 seconds]
Consolidate --> Categories[Final Categories<br/>~10-12 unique categories]
Categories --> Label[Label Training Emails<br/>LLM labels each sample<br/>~3 seconds per email]
Label --> Extract[Feature Extraction<br/>Embeddings + TF-IDF<br/>~0.02 seconds per email]
Extract --> Train[Train LightGBM Model<br/>~5 seconds total]
Train --> Validate[Validate on 100 samples<br/>~2 seconds]
Validate --> Save[Save Model<br/>src/models/calibrated/classifier.pkl]
Save --> End([Calibration Complete<br/>Total time: 15-25 minutes for 10k emails])
style LLMBatch fill:#ff6b6b
style Label fill:#ff6b6b
style Consolidate fill:#ff6b6b
style Train fill:#4ec9b0
</pre>
</div>
<h2>3. Classification Phase Detail</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Classification Phase]) --> Email[Get Email]
Email --> Rules{Check Hard Rules<br/>Pattern matching}
Rules -->|Match| RuleDone[Rule Match<br/>~0.001 seconds<br/>59 of 10000 emails]
Rules -->|No match| Embed[Generate Embedding<br/>all-minilm:l6-v2<br/>~0.02 seconds]
Embed --> TFIDF[TF-IDF Features<br/>~0.001 seconds]
TFIDF --> MLPredict[ML Prediction<br/>LightGBM<br/>~0.003 seconds]
MLPredict --> Threshold{Confidence >= 0.55?}
Threshold -->|Yes| MLDone[ML Classification<br/>7842 of 10000 emails<br/>78.4%]
Threshold -->|No| Flag{--no-llm-fallback?}
Flag -->|Yes| MLForced[Force ML result<br/>No LLM call]
Flag -->|No| LLM[LLM Classification<br/>~4 seconds<br/>2099 of 10000 emails<br/>21%]
RuleDone --> Next([Next Email])
MLDone --> Next
MLForced --> Next
LLM --> Next
style LLM fill:#ff6b6b
style MLDone fill:#4ec9b0
style MLForced fill:#ffd93d
</pre>
</div>
<h2>4. Model Loading Logic</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([MLClassifier.__init__]) --> CheckPath{model_path provided?}
CheckPath -->|Yes| UsePath[Use provided path]
CheckPath -->|No| Default[Default:<br/>src/models/pretrained/classifier.pkl]
UsePath --> FileCheck{File exists?}
Default --> FileCheck
FileCheck -->|Yes| Load[Load pickle file]
FileCheck -->|No| CreateMock[Create MOCK model<br/>Random Forest<br/>12 hardcoded categories]
Load --> ValidCheck{Valid model data?}
ValidCheck -->|Yes| CheckMock{is_mock flag?}
ValidCheck -->|No| CreateMock
CheckMock -->|True| WarnMock[Warn: MOCK model active]
CheckMock -->|False| RealModel[Real trained model loaded]
CreateMock --> MockWarnings[Multiple warnings printed<br/>NOT for production]
WarnMock --> Ready[Model Ready]
RealModel --> Ready
MockWarnings --> Ready
Ready --> End([Classification can start])
style CreateMock fill:#ff6b6b
style RealModel fill:#4ec9b0
style WarnMock fill:#ffd93d
</pre>
</div>
<h2>5. Flag Conditions & Effects</h2>
<div class="flag-section">
<h3>--no-llm-fallback</h3>
<p><strong>Location:</strong> src/cli.py:46, src/classification/adaptive_classifier.py:152-161</p>
<p><strong>Effect:</strong> When ML confidence < threshold, accept ML result anyway instead of calling LLM</p>
<p><strong>Use case:</strong> Test pure ML performance, avoid LLM costs</p>
<p><strong>Code path:</strong></p>
<code>
if self.disable_llm_fallback:<br/>
&nbsp;&nbsp;# Just return ML result without LLM fallback<br/>
&nbsp;&nbsp;return ClassificationResult(needs_review=False)
</code>
</div>
<div class="flag-section">
<h3>--limit N</h3>
<p><strong>Location:</strong> src/cli.py:38</p>
<p><strong>Effect:</strong> Limits number of emails fetched from source</p>
<p><strong>Calibration trigger:</strong> If N < 1000, forces LLM-only mode (no ML training)</p>
<p><strong>Code path:</strong></p>
<code>
if total_emails < 1000:<br/>
&nbsp;&nbsp;ml_classifier.is_mock = True # Skip ML, use LLM only
</code>
</div>
<div class="flag-section">
<h3>Model Path Override</h3>
<p><strong>Location:</strong> src/classification/ml_classifier.py:43</p>
<p><strong>Default:</strong> src/models/pretrained/classifier.pkl</p>
<p><strong>Calibration saves to:</strong> src/models/calibrated/classifier.pkl</p>
<p><strong>Problem:</strong> Calibration saves to different location than default load location</p>
<p><strong>Solution:</strong> Copy calibrated model to pretrained location OR pass model_path parameter</p>
</div>
<h2>6. Timing Breakdown (10,000 emails)</h2>
<table class="timing-table">
<tr>
<th>Phase</th>
<th>Operation</th>
<th>Time per Email</th>
<th>Total Time (10k)</th>
<th>LLM Required?</th>
</tr>
<tr>
<td rowspan="6"><strong>Calibration</strong><br/>(if model doesn't exist)</td>
<td>Stratified sampling (300 emails)</td>
<td>-</td>
<td>~1 second</td>
<td>No</td>
</tr>
<tr>
<td>LLM category discovery (6 batches)</td>
<td>~0.4 sec/email</td>
<td>~2 minutes</td>
<td>YES</td>
</tr>
<tr>
<td>LLM consolidation</td>
<td>-</td>
<td>~5 seconds</td>
<td>YES</td>
</tr>
<tr>
<td>LLM labeling (300 samples)</td>
<td>~3 sec/email</td>
<td>~15 minutes</td>
<td>YES</td>
</tr>
<tr>
<td>Feature extraction (300 samples)</td>
<td>~0.02 sec/email</td>
<td>~6 seconds</td>
<td>No (embeddings)</td>
</tr>
<tr>
<td>Model training (LightGBM)</td>
<td>-</td>
<td>~5 seconds</td>
<td>No</td>
</tr>
<tr>
<td colspan="3"><strong>CALIBRATION TOTAL</strong></td>
<td><strong>~17-20 minutes</strong></td>
<td><strong>YES</strong></td>
</tr>
<tr>
<td rowspan="5"><strong>Classification</strong><br/>(with model)</td>
<td>Hard rule matching</td>
<td>~0.001 sec</td>
<td>~10 seconds (all 10k)</td>
<td>No</td>
</tr>
<tr>
<td>Embedding generation</td>
<td>~0.02 sec</td>
<td>~200 seconds (all 10k)</td>
<td>No (Ollama embed)</td>
</tr>
<tr>
<td>ML prediction</td>
<td>~0.003 sec</td>
<td>~30 seconds (all 10k)</td>
<td>No</td>
</tr>
<tr>
<td>LLM fallback (21% of emails)</td>
<td>~4 sec/email</td>
<td>~140 minutes (2100 emails)</td>
<td>YES</td>
</tr>
<tr>
<td>Saving results</td>
<td>-</td>
<td>~1 second</td>
<td>No</td>
</tr>
<tr>
<td colspan="3"><strong>CLASSIFICATION TOTAL (with LLM fallback)</strong></td>
<td><strong>~2.5 hours</strong></td>
<td><strong>YES (21%)</strong></td>
</tr>
<tr>
<td colspan="3"><strong>CLASSIFICATION TOTAL (--no-llm-fallback)</strong></td>
<td><strong>~4 minutes</strong></td>
<td><strong>No</strong></td>
</tr>
</table>
<h2>7. Why LLM Still Loads</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([CLI startup]) --> Always1[ALWAYS: Load LLM provider<br/>src/cli.py:98-117]
Always1 --> Reason1[Reason: Needed for calibration<br/>if model doesn't exist]
Reason1 --> Check{Model exists?}
Check -->|No| NeedLLM1[LLM required for calibration<br/>Category discovery<br/>Sample labeling]
Check -->|Yes| SkipCal[Skip calibration]
SkipCal --> ClassStart[Start classification]
NeedLLM1 --> DoCalibration[Run calibration<br/>Uses LLM]
DoCalibration --> ClassStart
ClassStart --> Always2[ALWAYS: LLM provider is available<br/>llm.is_available = True]
Always2 --> EmailLoop[For each email...]
EmailLoop --> LowConf{Low confidence?}
LowConf -->|No| NoLLM[No LLM call]
LowConf -->|Yes| FlagCheck{--no-llm-fallback?}
FlagCheck -->|Yes| NoLLMCall[No LLM call<br/>Accept ML result]
FlagCheck -->|No| LLMAvail{llm.is_available?}
LLMAvail -->|Yes| CallLLM[LLM called<br/>src/cli.py:227-228]
LLMAvail -->|No| NoLLMCall
NoLLM --> End([Next email])
NoLLMCall --> End
CallLLM --> End
style Always1 fill:#ffd93d
style Always2 fill:#ffd93d
style CallLLM fill:#ff6b6b
style NoLLMCall fill:#4ec9b0
</pre>
</div>
<h3>Why LLM Provider is Always Initialized:</h3>
<ul>
<li><strong>Line 98-117 (src/cli.py):</strong> LLM provider is created before checking if model exists</li>
<li><strong>Reason:</strong> Need LLM ready in case calibration is required</li>
<li><strong>Result:</strong> Even with --no-llm-fallback, LLM provider loads (but won't be called for classification)</li>
</ul>
<h2>8. Command Scenarios</h2>
<table class="timing-table">
<tr>
<th>Command</th>
<th>Model Exists?</th>
<th>Calibration Runs?</th>
<th>LLM Used for Classification?</th>
<th>Total Time (10k)</th>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000</code></td>
<td>No</td>
<td>YES (~20 min)</td>
<td>YES (~2.5 hours)</td>
<td>~2 hours 50 min</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000</code></td>
<td>Yes</td>
<td>No</td>
<td>YES (~2.5 hours)</td>
<td>~2.5 hours</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000 --no-llm-fallback</code></td>
<td>No</td>
<td>YES (~20 min)</td>
<td>NO</td>
<td>~24 minutes</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000 --no-llm-fallback</code></td>
<td>Yes</td>
<td>No</td>
<td>NO</td>
<td>~4 minutes</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 500</code></td>
<td>Any</td>
<td>No (too few emails)</td>
<td>YES (100% LLM-only)</td>
<td>~35 minutes</td>
</tr>
</table>
<h2>9. Current System State</h2>
<div class="flag-section">
<h3>Model Status</h3>
<ul>
<li><strong>src/models/calibrated/classifier.pkl</strong> - 1.8MB, trained at 02:54, 10 categories</li>
<li><strong>src/models/pretrained/classifier.pkl</strong> - Copy of calibrated model (created manually)</li>
</ul>
</div>
<div class="flag-section">
<h3>Threshold Configuration</h3>
<ul>
<li><strong>config/default_config.yaml:</strong> default_threshold = 0.55</li>
<li><strong>config/categories.yaml:</strong> All category thresholds = 0.55</li>
<li><strong>Effect:</strong> ML must be ≥55% confident to skip LLM</li>
</ul>
</div>
<div class="flag-section">
<h3>Last Run Results (10k emails)</h3>
<ul>
<li><strong>Rules:</strong> 59 emails (0.6%)</li>
<li><strong>ML:</strong> 7,842 emails (78.4%)</li>
<li><strong>LLM fallback:</strong> 2,099 emails (21%)</li>
<li><strong>Accuracy estimate:</strong> 92.7%</li>
</ul>
</div>
<h2>10. To Run ML-Only Test (No LLM Calls During Classification)</h2>
<div class="flag-section">
<h3>Requirements:</h3>
<ol>
<li>Model must exist at <code>src/models/pretrained/classifier.pkl</code> ✓ (done)</li>
<li>Use <code>--no-llm-fallback</code> flag</li>
<li>Ensure sufficient emails (≥1000) to avoid LLM-only mode</li>
</ol>
<h3>Command:</h3>
<code>
python -m src.cli run --source enron --limit 10000 --output ml_only_10k/ --no-llm-fallback
</code>
<h3>Expected Results:</h3>
<ul>
<li><strong>Calibration:</strong> Skipped (model exists)</li>
<li><strong>LLM calls during classification:</strong> 0</li>
<li><strong>Total time:</strong> ~4 minutes</li>
<li><strong>ML acceptance rate:</strong> 100% (all emails classified by ML, even low confidence)</li>
</ul>
</div>
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>


@@ -1,357 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Category Verification Feature</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.code-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
font-family: 'Courier New', monospace;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.success {
background: #002a00;
border-left: 4px solid #4ec9b0;
padding: 15px;
margin: 10px 0;
}
</style>
</head>
<body>
<h1>--verify-categories Feature</h1>
<div class="success">
<h2>✅ IMPLEMENTED AND READY TO USE</h2>
<p><strong>Feature:</strong> Single LLM call to verify model categories fit new mailbox</p>
<p><strong>Cost:</strong> +20 seconds, 1 LLM call</p>
<p><strong>Value:</strong> Confidence check before bulk ML classification</p>
</div>
<h2>Usage</h2>
<div class="code-section">
<strong>Basic usage (with verification):</strong>
python -m src.cli run \
--source enron \
--limit 10000 \
--output verified_test/ \
--no-llm-fallback \
--verify-categories
<strong>Custom verification sample size:</strong>
python -m src.cli run \
--source enron \
--limit 10000 \
--output verified_test/ \
--no-llm-fallback \
--verify-categories \
--verify-sample 30
<strong>Without verification (fastest):</strong>
python -m src.cli run \
--source enron \
--limit 10000 \
--output fast_test/ \
--no-llm-fallback
</div>
<h2>How It Works</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Run with --verify-categories]) --> LoadModel[Load trained model<br/>Categories: Updates, Work,<br/>Meetings, etc.]
LoadModel --> FetchEmails[Fetch all emails<br/>10,000 total]
FetchEmails --> CheckFlag{--verify-categories?}
CheckFlag -->|No| SkipVerify[Skip verification<br/>Proceed to classification]
CheckFlag -->|Yes| Sample[Sample random emails<br/>Default: 20 emails]
Sample --> BuildPrompt[Build verification prompt<br/>Show model categories<br/>Show sample emails]
BuildPrompt --> LLMCall[Single LLM call<br/>~20 seconds<br/>Task: Rate category fit]
LLMCall --> ParseResponse[Parse JSON response<br/>Extract verdict + confidence]
ParseResponse --> Verdict{Verdict?}
Verdict -->|GOOD_MATCH<br/>80%+ fit| LogGood[Log: Categories appropriate<br/>Confidence: 0.8-1.0]
Verdict -->|FAIR_MATCH<br/>60-80% fit| LogFair[Log: Categories acceptable<br/>Confidence: 0.6-0.8]
Verdict -->|POOR_MATCH<br/><60% fit| LogPoor[Log WARNING<br/>Show suggested categories<br/>Recommend calibration<br/>Confidence: 0.0-0.6]
LogGood --> Proceed[Proceed with ML classification]
LogFair --> Proceed
LogPoor --> Proceed
SkipVerify --> Proceed
Proceed --> ClassifyAll[Classify all 10,000 emails<br/>Pure ML, no LLM fallback<br/>~4 minutes]
ClassifyAll --> Done[Results saved]
style LLMCall fill:#ffd93d
style LogGood fill:#4ec9b0
style LogPoor fill:#ff6b6b
style ClassifyAll fill:#4ec9b0
</pre>
</div>
<h2>Example Outputs</h2>
<h3>Scenario 1: GOOD_MATCH (Enron → Enron)</h3>
<div class="code-section">
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: GOOD_MATCH (0.85)
Reasoning: The sample emails fit well into the trained categories. Most are work-related correspondence, meetings, and operational updates which align with the model.
Verification: GOOD_MATCH
Confidence: 85%
Model categories look appropriate for this mailbox
================================================================================
Starting classification...
</div>
<h3>Scenario 2: POOR_MATCH (Enron → Personal Gmail)</h3>
<div class="code-section">
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: POOR_MATCH (0.45)
Reasoning: Many sample emails are shopping confirmations, social media notifications, and personal correspondence which don't fit the business-focused categories well.
Verification: POOR_MATCH
Confidence: 45%
================================================================================
WARNING: Model categories may not fit this mailbox well
Suggested categories: ['Shopping', 'Social', 'Travel', 'Newsletters', 'Personal']
Consider running full calibration for better accuracy
Proceeding with existing model anyway...
================================================================================
Starting classification...
</div>
<h2>LLM Prompt Structure</h2>
<div class="code-section">
You are evaluating whether pre-trained email categories fit a new mailbox.
TRAINED MODEL CATEGORIES (11 categories):
- Updates
- Work
- Meetings
- External
- Financial
- Test
- Administrative
- Operational
- Technical
- Urgent
- Requests
SAMPLE EMAILS FROM NEW MAILBOX (20 total, showing first 20):
1. From: phillip.allen@enron.com
Subject: Re: AEC Volumes at OPAL
Preview: Here are the volumes for today...
2. From: notifications@amazon.com
Subject: Your order has shipped
Preview: Your Amazon.com order #123-4567890...
[... 18 more emails ...]
TASK:
Evaluate if the trained categories are appropriate for this mailbox.
Consider:
1. Do the sample emails naturally fit into the trained categories?
2. Are there obvious email types that don't match any category?
3. Are the category names semantically appropriate?
4. Would a user find these categories helpful for THIS mailbox?
Respond with JSON:
{
"verdict": "GOOD_MATCH" | "FAIR_MATCH" | "POOR_MATCH",
"confidence": 0.0-1.0,
"reasoning": "brief explanation",
"fit_percentage": 0-100,
"suggested_categories": ["cat1", "cat2", ...],
"category_mapping": {"old_name": "better_name", ...}
}
</div>
<h2>Configuration</h2>
<table style="width:100%; border-collapse: collapse; background: #252526; margin: 20px 0;">
<tr style="background: #37373d;">
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Flag</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Type</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Default</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Description</th>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;"><code>--verify-categories</code></td>
<td style="padding: 10px;">Flag</td>
<td style="padding: 10px;">False</td>
<td style="padding: 10px;">Enable category verification</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;"><code>--verify-sample</code></td>
<td style="padding: 10px;">Integer</td>
<td style="padding: 10px;">20</td>
<td style="padding: 10px;">Number of emails to sample</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;"><code>--no-llm-fallback</code></td>
<td style="padding: 10px;">Flag</td>
<td style="padding: 10px;">False</td>
<td style="padding: 10px;">Disable LLM fallback during classification</td>
</tr>
</table>
<h2>When Verification Runs</h2>
<ul>
<li>✅ Only if <code>--verify-categories</code> flag is set</li>
<li>✅ Only if trained model exists (not mock)</li>
<li>✅ After emails are fetched, before calibration/classification</li>
<li>❌ Skipped if using mock model</li>
<li>❌ Skipped if model doesn't exist (calibration will run anyway)</li>
</ul>
<h2>Timing Impact</h2>
<table style="width:100%; border-collapse: collapse; background: #252526; margin: 20px 0;">
<tr style="background: #37373d;">
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Configuration</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Time (10k emails)</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">LLM Calls</th>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">ML-only (no flags)</td>
<td style="padding: 10px;">~4 minutes</td>
<td style="padding: 10px;">0</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">ML-only + <code>--verify-categories</code></td>
<td style="padding: 10px;">~4.3 minutes</td>
<td style="padding: 10px;">1 (verification)</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">Full calibration (no model)</td>
<td style="padding: 10px;">~25 minutes</td>
<td style="padding: 10px;">~500</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">ML + LLM fallback (21%)</td>
<td style="padding: 10px;">~2.5 hours</td>
<td style="padding: 10px;">~2100</td>
</tr>
</table>
<h2>Decision Tree</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Need to classify emails]) --> HaveModel{Trained model<br/>exists?}
HaveModel -->|No| MustCalibrate[Must run calibration<br/>~20 minutes<br/>~500 LLM calls]
HaveModel -->|Yes| SameDomain{Same domain as<br/>training data?}
SameDomain -->|Yes, confident| FastML[Pure ML<br/>4 minutes<br/>0 LLM calls]
SameDomain -->|Unsure| VerifyML[ML + Verification<br/>4.3 minutes<br/>1 LLM call]
SameDomain -->|No, different| Options{Accuracy needs?}
Options -->|High accuracy required| MustCalibrate
Options -->|Speed more important| VerifyML
Options -->|Experimental| FastML
MustCalibrate --> Done[Classification complete]
FastML --> Done
VerifyML --> Done
style FastML fill:#4ec9b0
style VerifyML fill:#ffd93d
style MustCalibrate fill:#ff6b6b
</pre>
</div>
<h2>Quick Start</h2>
<div class="code-section">
<strong>Test with verification on same domain (Enron → Enron):</strong>
python -m src.cli run \
--source enron \
--limit 1000 \
--output verify_test_same/ \
--no-llm-fallback \
--verify-categories
Expected: GOOD_MATCH (0.80-0.95)
Time: ~30 seconds
<strong>Test without verification for speed comparison:</strong>
python -m src.cli run \
--source enron \
--limit 1000 \
--output no_verify_test/ \
--no-llm-fallback
Expected: Same accuracy, 20 seconds faster
Time: ~10 seconds
</div>
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>


@@ -1,255 +0,0 @@
# Email Sorter - Complete Workflow Diagram
## Full End-to-End Pipeline with LLM Calls
```mermaid
graph TB
Start([📧 Start: Enron Maildir<br/>100,000 emails]) --> Parse[EnronParser<br/>Stratified Sampling]
Parse --> CalibCheck{Need<br/>Calibration?}
CalibCheck -->|Yes: No Model| CalibStart[🎯 CALIBRATION PHASE]
CalibCheck -->|No: Model Exists| ClassifyStart[📊 CLASSIFICATION PHASE]
%% CALIBRATION PHASE
CalibStart --> Sample[Sample 100 Emails<br/>Stratified by user/folder]
Sample --> Split[Split: 50 train / 50 validation]
Split --> LLMBatch[📤 LLM CALL 1-5<br/>Batch Discovery<br/>5 batches × 20 emails]
LLMBatch -->|qwen3:8b-q4_K_M| Discover[Category Discovery<br/>~15 raw categories]
Discover --> Consolidate[📤 LLM CALL 6<br/>Consolidation<br/>Merge similar categories]
Consolidate -->|qwen3:8b-q4_K_M| CacheSnap[Category Cache Snap<br/>Semantic matching<br/>10 final categories]
CacheSnap --> ExtractTrain[Extract Features<br/>50 training emails<br/>Batch embeddings]
ExtractTrain --> Embed1[📤 EMBEDDING CALLS<br/>Ollama all-minilm:l6-v2<br/>384-dim vectors]
Embed1 --> TrainModel[Train LightGBM<br/>200 boosting rounds<br/>22 total categories]
TrainModel --> SaveModel[💾 Save Model<br/>classifier.pkl 1.1MB]
SaveModel --> ClassifyStart
%% CLASSIFICATION PHASE
ClassifyStart --> LoadModel[Load Model<br/>classifier.pkl]
LoadModel --> FetchAll[Fetch All Emails<br/>100,000 emails]
FetchAll --> BatchProcess[Process in Batches<br/>5,000 emails per batch<br/>20 batches total]
BatchProcess --> ExtractFeatures[Extract Features<br/>Batch size: 512<br/>Batched embeddings]
ExtractFeatures --> Embed2[📤 EMBEDDING CALLS<br/>Ollama all-minilm:l6-v2<br/>~200 batched calls]
Embed2 --> MLInference[LightGBM Inference<br/>Predict categories<br/>~2ms per email]
MLInference --> Results[💾 Save Results<br/>results.json 19MB<br/>summary.json 1.5KB<br/>classifications.csv 8.6MB]
Results --> ValidationStart[🔍 VALIDATION PHASE]
%% VALIDATION PHASE
ValidationStart --> SelectSamples[Select Samples<br/>50 low-conf + 25 random]
SelectSamples --> LoadEmails[Load Full Email Content<br/>Subject + Body + Metadata]
LoadEmails --> LLMEval[📤 LLM CALLS 7-81<br/>Individual Evaluation<br/>75 total assessments]
LLMEval -->|qwen3:8b-q4_K_M<br/>&lt;no_think&gt;| EvalResults[Collect Verdicts<br/>YES/PARTIAL/NO<br/>+ Reasoning]
EvalResults --> LLMSummary[📤 LLM CALL 82<br/>Final Summary<br/>Aggregate findings]
LLMSummary -->|qwen3:8b-q4_K_M| FinalReport[📊 Final Report<br/>Accuracy metrics<br/>Category quality<br/>Recommendations]
FinalReport --> End([✅ Complete<br/>100k classified<br/>+ validated])
%% OPTIONAL FINE-TUNING LOOP
FinalReport -.->|If corrections needed| FineTune[🔄 FINE-TUNING<br/>Collect LLM corrections<br/>Continue training]
FineTune -.-> ClassifyStart
style Start fill:#e1f5e1
style End fill:#e1f5e1
style LLMBatch fill:#fff4e6
style Consolidate fill:#fff4e6
style Embed1 fill:#e6f3ff
style Embed2 fill:#e6f3ff
style LLMEval fill:#fff4e6
style LLMSummary fill:#fff4e6
style SaveModel fill:#ffe6f0
style Results fill:#ffe6f0
style FinalReport fill:#ffe6f0
```
---
## Pipeline Stages Breakdown
### STAGE 1: CALIBRATION (1 minute)
**Input:** 100 emails
**LLM Calls:** 6 calls
- 5 batch discovery calls (20 emails each)
- 1 consolidation call
**Embedding Calls:** ~50 calls (one per training email)
**Output:**
- 10 discovered categories
- Trained LightGBM model (1.1MB)
- Category cache
### STAGE 2: CLASSIFICATION (3.4 minutes)
**Input:** 100,000 emails
**LLM Calls:** 0 (pure ML inference)
**Embedding Calls:** ~200 batched calls (512 emails per batch)
**Output:**
- 100,000 classifications
- Confidence scores
- Results in JSON/CSV
### STAGE 3: VALIDATION (variable, ~5-10 minutes)
**Input:** 75 sample emails (50 low-conf + 25 random)
**LLM Calls:** 76 calls
- 75 individual evaluation calls
- 1 final summary call
**Output:**
- Quality assessment (YES/PARTIAL/NO)
- Accuracy metrics
- Recommendations
---
## LLM Call Summary
| Call # | Purpose | Model | Input | Output | Time |
|--------|---------|-------|-------|--------|------|
| 1-5 | Batch Discovery | qwen3:8b | 20 emails each | Categories | ~5-6s each |
| 6 | Consolidation | qwen3:8b | 15 categories | 10 merged | ~3s |
| 7-81 | Evaluation | qwen3:8b | 1 email + category | Verdict | ~2s each |
| 82 | Summary | qwen3:8b | 75 evaluations | Final report | ~5s |
**Total LLM Calls:** 82
**Total LLM Time:** ~3-4 minutes
**Embedding Calls:** ~250 (batched)
**Embedding Time:** ~30 seconds (batched)
---
## Performance Metrics
### Calibration Phase
- **Time:** 60 seconds
- **Samples:** 100 emails (50 for training)
- **Categories Discovered:** 10
- **Model Size:** 1.1MB
- **Accuracy on training:** 95%+
### Classification Phase
- **Time:** 202 seconds (3.4 minutes)
- **Emails:** 100,000
- **Speed:** 495 emails/second
- **Per Email:** 2ms total processing
- **Batch Size:** 512 (optimal)
- **GPU Utilization:** High (batched embeddings)
### Validation Phase
- **Time:** ~10 minutes (75 LLM calls)
- **Samples:** 75 emails
- **Per Sample:** ~8 seconds
- **Outcome:** Model already accurate (0 corrections needed)
---
## Data Flow Details
### Email Processing Pipeline
```
Email File → Parse → Features → Embedding → Model → Category
(text) (dict) (struct) (384-dim) (22-cat) (label)
```
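In code, the whole flow is one short function. A minimal sketch, assuming a `parse_email` helper, the `embed_text` helper sketched below, and a loaded LightGBM `booster`; the names are illustrative, not the project's actual API:

```python
import numpy as np

def classify(raw_text: str, booster, categories: list) -> tuple:
    """Parse -> features -> embedding -> model -> category for one email."""
    email = parse_email(raw_text)                     # text -> dict (subject, sender, body)
    text = f"{email['subject']} {email['sender']} {email['body'][:500]}"
    vector = embed_text(text)                         # 384-dim embedding
    probs = booster.predict(np.asarray([vector]))[0]  # one probability per category
    best = int(np.argmax(probs))
    return categories[best], float(probs[best])       # (label, confidence)
```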
### Feature Extraction
```
Email Content
├─ Subject (text)
├─ Body (text)
├─ Sender (email address)
├─ Date (timestamp)
├─ Attachments (boolean + count)
└─ Patterns (regex matches)
Structured Text
Ollama Embedding (all-minilm:l6-v2)
384-dimensional vector
```
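The `embed_text` helper above is a single request to Ollama's REST API. A minimal sketch using the `/api/embeddings` endpoint; the localhost URL is Ollama's default and may differ in your setup:

```python
import requests

def embed_text(text: str, model: str = "all-minilm:l6-v2") -> list:
    """Fetch a 384-dim embedding from a local Ollama instance."""
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["embedding"]  # 384 floats for all-minilm:l6-v2
```

The pipeline batches this work 512 emails at a time, which is what keeps GPU utilization high during classification.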
### LightGBM Training
```
Features (384-dim) + Labels (10 categories)
Training: 200 boosting rounds
Model: 22 categories total (10 discovered + 12 hardcoded)
Output: classifier.pkl (1.1MB)
```
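A minimal sketch of that training step with the LightGBM Python API; the 200 rounds match the figures above, while the remaining hyperparameters are illustrative assumptions:

```python
import pickle

import lightgbm as lgb
import numpy as np

def train_classifier(X: np.ndarray, y: np.ndarray, num_classes: int) -> lgb.Booster:
    """Train a multiclass LightGBM model on 384-dim embedding features."""
    params = {
        "objective": "multiclass",
        "num_class": num_classes,  # 22 in the run above
        "metric": "multi_logloss",
        "verbosity": -1,
    }
    booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=200)
    with open("classifier.pkl", "wb") as f:
        pickle.dump(booster, f)    # ~1.1MB on disk for this run
    return booster
```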
---
## Category Distribution (100k Results)
```mermaid
pie title Category Distribution
"Work Communication" : 89807
"Financial" : 6534
"Forwarded" : 2457
"Technical Analysis" : 1129
"Other" : 73
```
---
## Confidence Distribution (100k Results)
```mermaid
pie title Confidence Levels
"High (≥0.7)" : 74777
"Medium (0.5-0.7)" : 17381
"Low (<0.5)" : 7842
```
---
## System Architecture
```mermaid
graph LR
A[Email Source<br/>Gmail/IMAP/Enron] --> B[Email Provider]
B --> C[Feature Extractor]
C --> D[Ollama<br/>Embeddings]
C --> E[Pattern Detector]
D --> F[LightGBM<br/>Classifier]
E --> F
F --> G[Results<br/>JSON/CSV]
F --> H[Sync Engine<br/>Labels/Keywords]
I[LLM<br/>qwen3:8b] -.->|Calibration| J[Category Discovery]
J -.-> F
I -.->|Validation| K[Quality Check]
K -.-> G
style D fill:#e6f3ff
style I fill:#fff4e6
style F fill:#f0e6ff
style G fill:#ffe6f0
```
---
## Next: Integrated End-to-End Script
Building a comprehensive validation script with:
1. 50 low-confidence samples
2. 25 random samples
3. Final LLM summary call
4. Complete pipeline orchestration
Deleted file (research note on sentence transformers):

@@ -1,126 +0,0 @@
Yeah — your instinct is solid here, Brett.
For the kind of **multi-category, fast, local, high-volume** classification system you're building, training your own model on your own labelled data will almost certainly give **better and more controllable results** than trying to stretch an off-the-shelf model.
Let's break this down clearly — especially what **sentence-transformers** brings to the table and why it's a good fit for your hybrid architecture.
---
## 🧠 What Sentence Transformers Actually Do
A **sentence transformer** is a model that converts a piece of text (e.g. subject line, short body snippet, sender info) into a **dense vector embedding** — basically a fixed-length numerical representation that captures its semantic meaning.
Once you have embeddings, classification becomes a **simple, fast** downstream problem:
* Instead of training a giant deep network,
* You train a **small, lightweight model** (like Logistic Regression, XGBoost, or a simple neural head) on top of those embeddings.
---
## 🚀 Why This Is a Great Fit for Your Email Sorter
| Sentence Transformers | Why it matters for you |
| ----------------------------------------------------------- | ----------------------------------------------------------------------------- |
| **Pretrained models** already “understand” general language | So you don't need massive datasets to get good results. |
| **Fast inference** | Embedding generation can run on CPU or GPU and is easy to parallelise. |
| **Stable + deterministic** | Embeddings are consistent across runs (unlike LLM zero-shot answers). |
| **Lightweight training** | You can train a classifier on top with a few thousand labelled samples. |
| **Supports multi-class** easily | Perfect for your 12 category types. |
| **Re-usable** | One universal model for all inboxes; just retrain the top layer occasionally. |
---
## 🏗️ How It Would Work in Your Pipeline
### 1. **Use the LLM once** to label your initial batch (e.g. 1.5k-5k emails).
This gives you your **bootstrapped labelled dataset**.
### 2. **Generate embeddings**
* Take **subject**, **sender domain**, and optionally a short **body snippet**.
* Pass through sentence transformer → get a fixed-length vector (384-dim for MiniLM, 768 for larger models).
* Save those embeddings alongside labels.
### 3. **Train a classifier** on top
A lightweight model like:
* **Logistic Regression** (fastest),
* **XGBoost / LightGBM** (slightly heavier, more accuracy),
* or even a shallow **MLP** if you want.
This becomes your **universal email classifier**.
### 4. **Run in production**
* New email comes in → embed text → run classifier → get category + confidence.
* If below threshold → send to LLM for re-classification.
This gives you **LLM semantic power** at training time, and **ML speed** at runtime.
---
## 🧪 Practical Setup (Minimal)
```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
# 1. Load a pre-trained embedding model
model = SentenceTransformer("all-MiniLM-L6-v2") # fast, lightweight
# 2. Convert texts to embeddings
texts = [f"{subject} {sender}" for subject, sender in your_email_data]
embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)
# 3. Train a lightweight classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(embeddings, labels) # labels come from your LLM bootstrap
# 4. Predict
new_embedding = model.encode(["Subject from new email"])
pred = clf.predict(new_embedding)
conf = clf.predict_proba(new_embedding)
```
* `all-MiniLM-L6-v2` is a fantastic starting model — small, fast, and surprisingly accurate.
* You can fine-tune the sentence transformer later if you want **extra precision**.
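Runtime routing then looks like this: a minimal sketch reusing `model` and `clf` from above, where the 0.6 threshold and the `ask_llm` fallback are illustrative placeholders:

```python
def classify_with_fallback(subject: str, sender: str, threshold: float = 0.6):
    """Fast ML path first; escalate to the LLM only on low confidence."""
    embedding = model.encode([f"{subject} {sender}"])
    probs = clf.predict_proba(embedding)[0]
    category, confidence = clf.classes_[probs.argmax()], probs.max()
    if confidence < threshold:
        return ask_llm(subject, sender)  # slow path: LLM re-classification
    return category, confidence
```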
---
## 🧠 Why Its Easier Than Full Fine-Tuning
You *don't* need to train the transformer itself (at least not at the start).
You're just training the **top layer** (classifier). That means:
* Training takes minutes, not hours.
* You don't need huge GPUs.
* You can refresh or retrain easily with new data.
Later on, if you want to **fine-tune the transformer itself** (so it “understands emails” more deeply), that's an optional next step.
---
## ⚡ Typical Results People See
* With 2-5k labelled samples, sentence transformer embeddings + logistic regression can hit **85-95% accuracy** on email category tasks.
* Inference time is **<5 ms per email** on CPU.
* Works well for both generic and user-specific inboxes.
---
## 🪜 Suggested Path for You
1. Use your **LLM pass** to generate labels on your first big inbox.
2. Generate embeddings with a pretrained MiniLM.
3. Train a logistic regression or XGBoost model.
4. Run it on the next inbox → see how it performs.
5. (Optional) Fine-tune the transformer if you want to push performance higher.
---
👉 In short:
Yes — sentence transformers are **perfect** for this.
They give you **semantic power without LLM overhead**, are **easy to train**, and will make your hybrid classifier **extremely fast and accurate** after that first run.
If you want, I can give you a **tiny starter training script** (30-40 lines) that does the embedding + classifier training from your first LLM-labelled dataset. Would you like that?
Changed file (CLI run command):

@@ -53,6 +53,8 @@ def cli():
               help='Verify model categories fit new mailbox (single LLM call)')
 @click.option('--verify-sample', type=int, default=20,
               help='Number of emails to sample for category verification')
+@click.option('--force-ml', is_flag=True,
+              help='Force use of existing ML model regardless of dataset size')
 def run(
     source: str,
     credentials: Optional[str],
@@ -65,7 +67,8 @@ def run(
     verbose: bool,
     no_llm_fallback: bool,
     verify_categories: bool,
-    verify_sample: int
+    verify_sample: int,
+    force_ml: bool
 ):
     """Run email sorter pipeline."""
@@ -198,10 +201,14 @@ def run(
     total_emails = len(emails)

     # Skip ML for small datasets (<1000 emails) - use LLM only
-    if total_emails < 1000:
+    # Unless --force-ml is set and we have an existing model
+    if total_emails < 1000 and not force_ml:
         logger.warning(f"Only {total_emails} emails - too few for ML training")
         logger.warning("Using LLM-only classification (no ML model)")
+        logger.warning("Use --force-ml to use existing model anyway")
         ml_classifier.is_mock = True
+    elif force_ml and ml_classifier.model:
+        logger.info(f"--force-ml: Using existing ML model for {total_emails} emails")

     # Check if we need calibration (no good ML model)
     if ml_classifier.is_mock or not ml_classifier.model:
@@ -294,7 +301,20 @@ def run(
     logger.info("Exporting results")
     Path(output).mkdir(parents=True, exist_ok=True)

+    # Build email lookup for metadata enrichment
+    email_lookup = {email.id: email for email in emails}
+
     import json
+    from datetime import datetime as dt
+
+    def serialize_date(date_obj):
+        """Serialize date to ISO format string."""
+        if date_obj is None:
+            return None
+        if isinstance(date_obj, dt):
+            return date_obj.isoformat()
+        return str(date_obj)

     results_data = {
         'metadata': {
             'total_emails': len(emails),
@@ -304,16 +324,24 @@ def run(
             'ml_classified': adaptive_classifier.get_stats().ml_classified,
             'llm_classified': adaptive_classifier.get_stats().llm_classified,
             'needs_review': adaptive_classifier.get_stats().needs_review,
-            }
+            },
+            'generated_at': dt.now().isoformat(),
+            'source': source,
+            'source_path': directory if source == 'local' else None,
         },
         'classifications': [
             {
                 'email_id': r.email_id,
+                'subject': email_lookup.get(r.email_id, emails[i]).subject if r.email_id in email_lookup or i < len(emails) else '',
+                'sender': email_lookup.get(r.email_id, emails[i]).sender if r.email_id in email_lookup or i < len(emails) else '',
+                'sender_name': email_lookup.get(r.email_id, emails[i]).sender_name if r.email_id in email_lookup or i < len(emails) else None,
+                'date': serialize_date(email_lookup.get(r.email_id, emails[i]).date if r.email_id in email_lookup or i < len(emails) else None),
+                'has_attachments': email_lookup.get(r.email_id, emails[i]).has_attachments if r.email_id in email_lookup or i < len(emails) else False,
                 'category': r.category,
                 'confidence': r.confidence,
                 'method': r.method
             }
-            for r in results
+            for i, r in enumerate(results)
         ]
     }
Changed file (OpenAIProvider):

@@ -47,14 +47,12 @@ class OpenAIProvider(BaseLLMProvider):
         try:
             from openai import OpenAI

-            if not self.api_key:
-                self.logger.error("OpenAI API key not configured")
-                self.logger.error("Set OPENAI_API_KEY environment variable or pass api_key parameter")
-                self._available = False
-                return
+            # For local vLLM/OpenAI-compatible servers, API key may not be required
+            # Use a placeholder if not set
+            api_key = self.api_key or "not-needed"

             self.client = OpenAI(
-                api_key=self.api_key,
+                api_key=api_key,
                 base_url=self.base_url if self.base_url != "https://api.openai.com/v1" else None,
                 timeout=self.timeout
             )
@@ -121,7 +119,7 @@ class OpenAIProvider(BaseLLMProvider):
     def test_connection(self) -> bool:
         """Test if OpenAI API is accessible."""
-        if not self.client or not self.api_key:
+        if not self.client:
             self.logger.warning("OpenAI client not initialized")
             return False
tools/batch_llm_classifier.py (new executable file, 364 lines)

@@ -0,0 +1,364 @@
#!/usr/bin/env python3
"""
Standalone vLLM Batch Email Classifier
PREREQUISITE: vLLM server must be running at configured endpoint
This is a SEPARATE tool from the main ML classification pipeline.
Use this for:
- One-off batch questions ("find all emails about project X")
- Custom classification criteria not in trained model
- Exploratory analysis with flexible prompts
Use RAG instead for:
- Searching across large email corpus
- Finding specific topics/keywords
- Building knowledge from email content
"""
import time
import asyncio
import logging
import sys
from pathlib import Path
from typing import List, Dict, Any, Optional
import httpx
import click
# Server configuration
VLLM_CONFIG = {
'base_url': 'https://rtx3090.bobai.com.au/v1',
'api_key': 'rtx3090_foxadmin_10_8034ecb47841f45ba1d5f3f5d875c092',
'model': 'qwen3-coder-30b',
'batch_size': 4, # Tested optimal - 100% success, proper batch pooling
'temperature': 0.1,
'max_tokens': 500
}
async def check_vllm_server(base_url: str, api_key: str, model: str) -> bool:
"""Check if vLLM server is running and model is loaded."""
try:
async with httpx.AsyncClient() as client:
response = await client.post(
f"{base_url}/chat/completions",
json={
"model": model,
"messages": [{"role": "user", "content": "test"}],
"max_tokens": 5
},
headers={
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
},
timeout=10.0
)
return response.status_code == 200
except Exception as e:
print(f"ERROR: vLLM server check failed: {e}")
return False
async def classify_email_async(
client: httpx.AsyncClient,
email: Any,
prompt_template: str,
base_url: str,
api_key: str,
model: str,
temperature: float,
max_tokens: int
) -> Dict[str, Any]:
"""Classify single email using async HTTP request."""
# No semaphore - proper batch pooling instead
try:
# Build prompt with email data
prompt = prompt_template.format(
subject=email.get('subject', 'N/A')[:100],
sender=email.get('sender', 'N/A')[:50],
body_snippet=email.get('body_snippet', '')[:500]
)
response = await client.post(
f"{base_url}/chat/completions",
json={
"model": model,
"messages": [{"role": "user", "content": prompt}],
"temperature": temperature,
"max_tokens": max_tokens
},
headers={
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
},
timeout=30.0
)
if response.status_code == 200:
data = response.json()
content = data['choices'][0]['message']['content']
return {
'email_id': email.get('id', 'unknown'),
'subject': email.get('subject', 'N/A')[:60],
'result': content.strip(),
'success': True
}
return {
'email_id': email.get('id', 'unknown'),
'subject': email.get('subject', 'N/A')[:60],
'result': f'HTTP {response.status_code}',
'success': False
}
except Exception as e:
return {
'email_id': email.get('id', 'unknown'),
'subject': email.get('subject', 'N/A')[:60],
'result': f'Error: {str(e)[:100]}',
'success': False
}
async def classify_single_batch(
client: httpx.AsyncClient,
emails: List[Dict[str, Any]],
prompt_template: str,
config: Dict[str, Any]
) -> List[Dict[str, Any]]:
"""Classify one batch of emails - send all at once, wait for completion."""
tasks = [
classify_email_async(
client, email, prompt_template,
config['base_url'], config['api_key'], config['model'],
config['temperature'], config['max_tokens']
)
for email in emails
]
results = await asyncio.gather(*tasks)
return results
async def batch_classify_async(
emails: List[Dict[str, Any]],
prompt_template: str,
config: Dict[str, Any]
) -> List[Dict[str, Any]]:
"""Classify emails using proper batch pooling."""
batch_size = config['batch_size']
all_results = []
async with httpx.AsyncClient() as client:
# Process in batches - send batch, wait for all to complete, repeat
for batch_start in range(0, len(emails), batch_size):
batch_end = min(batch_start + batch_size, len(emails))
batch_emails = emails[batch_start:batch_end]
batch_results = await classify_single_batch(
client, batch_emails, prompt_template, config
)
all_results.extend(batch_results)
return all_results
def load_emails_from_provider(provider_type: str, credentials: Optional[str], limit: int) -> List[Dict[str, Any]]:
"""Load emails from configured provider."""
# Lazy import to avoid dependency issues
if provider_type == 'enron':
from src.email_providers.enron import EnronProvider
provider = EnronProvider(maildir_path=".")
provider.connect({})
emails = provider.fetch_emails(limit=limit)
provider.disconnect()
# Convert to dict format
return [
{
'id': e.id,
'subject': e.subject,
'sender': e.sender,
'body_snippet': e.body_snippet
}
for e in emails
]
elif provider_type == 'gmail':
from src.email_providers.gmail import GmailProvider
if not credentials:
print("ERROR: Gmail requires --credentials path")
sys.exit(1)
provider = GmailProvider()
provider.connect({'credentials_path': credentials})
emails = provider.fetch_emails(limit=limit)
provider.disconnect()
return [
{
'id': e.id,
'subject': e.subject,
'sender': e.sender,
'body_snippet': e.body_snippet
}
for e in emails
]
else:
print(f"ERROR: Unsupported provider: {provider_type}")
sys.exit(1)
@click.group()
def cli():
"""vLLM Batch Email Classifier - Ask custom questions across email batches."""
pass
@cli.command()
@click.option('--source', type=click.Choice(['gmail', 'enron']), default='enron',
help='Email provider')
@click.option('--credentials', type=click.Path(exists=False),
help='Path to credentials file (for Gmail)')
@click.option('--limit', type=int, default=50,
help='Number of emails to process')
@click.option('--question', type=str, required=True,
help='Question to ask about each email')
@click.option('--output', type=click.Path(), default='batch_results.txt',
help='Output file for results')
def ask(source: str, credentials: Optional[str], limit: int, question: str, output: str):
"""Ask a custom question about a batch of emails."""
print("=" * 80)
print("vLLM BATCH EMAIL CLASSIFIER")
print("=" * 80)
print(f"Question: {question}")
print(f"Source: {source}")
print(f"Batch size: {limit}")
print("=" * 80)
print()
# Check vLLM server
print("Checking vLLM server...")
if not asyncio.run(check_vllm_server(
VLLM_CONFIG['base_url'],
VLLM_CONFIG['api_key'],
VLLM_CONFIG['model']
)):
print()
print("ERROR: vLLM server not available or not responding")
print(f"Expected endpoint: {VLLM_CONFIG['base_url']}")
print(f"Expected model: {VLLM_CONFIG['model']}")
print()
print("PREREQUISITE: Start vLLM server before running this tool")
sys.exit(1)
print(f"✓ vLLM server running ({VLLM_CONFIG['model']})")
print()
# Load emails
print(f"Loading {limit} emails from {source}...")
emails = load_emails_from_provider(source, credentials, limit)
print(f"✓ Loaded {len(emails)} emails")
print()
# Build prompt template (optimized for caching)
prompt_template = f"""You are analyzing emails to answer specific questions.
INSTRUCTIONS:
- Read the email carefully
- Answer the question directly and concisely
- Provide reasoning if helpful
- If the email is not relevant, say "Not relevant"
QUESTION:
{question}
EMAIL TO ANALYZE:
Subject: {{subject}}
From: {{sender}}
Body: {{body_snippet}}
ANSWER:
"""
# Process batch
    print(f"Processing {len(emails)} emails in batches of {VLLM_CONFIG['batch_size']}...")
start_time = time.time()
results = asyncio.run(batch_classify_async(emails, prompt_template, VLLM_CONFIG))
end_time = time.time()
total_time = end_time - start_time
# Stats
successful = sum(1 for r in results if r['success'])
throughput = len(emails) / total_time
print()
print("=" * 80)
print("RESULTS")
print("=" * 80)
print(f"Total emails: {len(emails)}")
print(f"Successful: {successful}")
print(f"Failed: {len(emails) - successful}")
print(f"Time: {total_time:.1f}s")
print(f"Throughput: {throughput:.2f} emails/sec")
print("=" * 80)
print()
# Save results
with open(output, 'w') as f:
f.write(f"Question: {question}\n")
f.write(f"Processed: {len(emails)} emails in {total_time:.1f}s\n")
f.write("=" * 80 + "\n\n")
for i, result in enumerate(results, 1):
f.write(f"{i}. {result['subject']}\n")
f.write(f" Email ID: {result['email_id']}\n")
f.write(f" Answer: {result['result']}\n")
f.write("\n")
print(f"Results saved to: {output}")
print()
# Show sample
print("SAMPLE RESULTS (first 5):")
for i, result in enumerate(results[:5], 1):
print(f"\n{i}. {result['subject']}")
print(f" {result['result'][:100]}...")
@cli.command()
def check():
"""Check if vLLM server is running and ready."""
print("Checking vLLM server...")
print(f"Endpoint: {VLLM_CONFIG['base_url']}")
print(f"Model: {VLLM_CONFIG['model']}")
print()
if asyncio.run(check_vllm_server(
VLLM_CONFIG['base_url'],
VLLM_CONFIG['api_key'],
VLLM_CONFIG['model']
)):
print("✓ vLLM server is running and ready")
        print(f"✓ Batch size: {VLLM_CONFIG['batch_size']} requests per batch")
print(f"✓ Estimated throughput: ~4.4 emails/sec")
else:
print("✗ vLLM server not available")
print()
print("Start vLLM server before using this tool")
sys.exit(1)
if __name__ == '__main__':
cli()
tools/brett_gmail_analyzer.py (new file)

@@ -0,0 +1,391 @@
#!/usr/bin/env python3
"""
Brett Gmail Dataset Analyzer
============================
CUSTOM script for analyzing the brett-gmail email dataset.
NOT portable to other datasets without modification.
Usage:
python tools/brett_gmail_analyzer.py
Output:
- Console report with comprehensive statistics
- data/brett_gmail_analysis.json with full analysis data
"""
import json
import re
from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path
# Add parent to path for imports
import sys
sys.path.insert(0, str(Path(__file__).parent.parent))
from src.calibration.local_file_parser import LocalFileParser
# =============================================================================
# CLASSIFICATION RULES - CUSTOM FOR BRETT'S GMAIL
# =============================================================================
def classify_email(email):
"""
Classify email into categories based on sender domain and subject patterns.
Priority: Sender domain > Subject keywords
"""
sender = email.sender or ""
subject = email.subject or ""
domain = sender.split('@')[-1] if '@' in sender else sender
# === HIGH-LEVEL CATEGORIES ===
# --- Art & Collectibles ---
if 'mutualart.com' in domain:
return ('Art & Collectibles', 'MutualArt Alerts')
# --- Travel & Tourism ---
if 'tripadvisor.com' in domain:
return ('Travel & Tourism', 'Tripadvisor')
if 'booking.com' in domain:
return ('Travel & Tourism', 'Booking.com')
# --- Entertainment & Streaming ---
if 'spotify.com' in domain:
if 'concert' in subject.lower() or 'live' in subject.lower():
return ('Entertainment', 'Spotify Concerts')
return ('Entertainment', 'Spotify Promotions')
if 'youtube.com' in domain:
return ('Entertainment', 'YouTube')
if 'onlyfans.com' in domain:
return ('Entertainment', 'OnlyFans')
if 'ign.com' in domain:
return ('Entertainment', 'IGN Gaming')
# --- Shopping & eCommerce ---
if 'ebay.com' in domain or 'reply.ebay' in domain:
return ('Shopping', 'eBay')
if 'aliexpress.com' in domain:
return ('Shopping', 'AliExpress')
if 'alibabacloud.com' in domain or 'alibaba-inc.com' in domain:
return ('Tech Services', 'Alibaba Cloud')
if '4wdsupacentre' in domain:
return ('Shopping', '4WD Supacentre')
if 'mikeblewitt' in domain or 'mbcoffscoast' in domain:
return ('Shopping', 'Mike Blewitt/MBC')
if 'auspost.com.au' in domain:
return ('Shopping', 'Australia Post')
if 'printfresh' in domain:
return ('Business', 'Timesheets')
# --- AI & Tech Services ---
if 'anthropic.com' in domain or 'claude.com' in domain:
return ('AI Services', 'Anthropic/Claude')
if 'openai.com' in domain:
return ('AI Services', 'OpenAI')
if 'openrouter.ai' in domain:
return ('AI Services', 'OpenRouter')
if 'lambda' in domain:
return ('AI Services', 'Lambda Labs')
if 'x.ai' in domain:
return ('AI Services', 'xAI')
if 'perplexity.ai' in domain:
return ('AI Services', 'Perplexity')
if 'cursor.com' in domain:
return ('Developer Tools', 'Cursor')
# --- Developer Tools ---
if 'ngrok.com' in domain:
return ('Developer Tools', 'ngrok')
if 'docker.com' in domain:
return ('Developer Tools', 'Docker')
# --- Productivity Apps ---
if 'screencastify.com' in domain:
return ('Productivity', 'Screencastify')
if 'tango.us' in domain:
return ('Productivity', 'Tango')
if 'xplor.com' in domain or 'myxplor' in domain:
return ('Services', 'Xplor Childcare')
# --- Google Services ---
if 'google.com' in domain or 'accounts.google.com' in domain:
if 'performance report' in subject.lower() or 'business profile' in subject.lower():
return ('Google', 'Business Profile')
if 'security' in subject.lower() or 'sign-in' in subject.lower():
return ('Security', 'Google Security')
if 'firebase' in subject.lower() or 'firestore' in subject.lower():
return ('Developer Tools', 'Firebase')
if 'ads' in subject.lower():
return ('Google', 'Google Ads')
if 'analytics' in subject.lower():
return ('Google', 'Analytics')
if re.search(r'verification code|verify', subject, re.I):
return ('Security', 'Google Verification')
return ('Google', 'Other Google')
# --- Microsoft ---
if 'microsoft.com' in domain or 'outlook.com' in domain or 'hotmail.com' in domain:
if 'security' in subject.lower() or 'protection' in domain:
return ('Security', 'Microsoft Security')
return ('Personal', 'Microsoft/Outlook')
# --- Social Media ---
if 'reddit' in domain:
return ('Social', 'Reddit')
# --- Business/Work ---
if 'frontiertechstrategies' in domain:
return ('Business', 'Appointments')
if 'crsaustralia.gov.au' in domain:
return ('Business', 'Job Applications')
if 'v6send.net' in domain:
return ('Shopping', 'Automotive Dealers')
# === SUBJECT-BASED FALLBACK ===
if re.search(r'security alert|verification code|sign.?in|password|2fa', subject, re.I):
return ('Security', 'General Security')
if re.search(r'order.*ship|receipt|payment|invoice|purchase', subject, re.I):
return ('Transactions', 'Orders/Receipts')
if re.search(r'trial|subscription|billing|renew', subject, re.I):
return ('Billing', 'Subscriptions')
if re.search(r'terms of service|privacy policy|legal', subject, re.I):
return ('Legal', 'Policy Updates')
if re.search(r'welcome to|getting started', subject, re.I):
return ('Onboarding', 'Welcome Emails')
# --- Personal contacts ---
if 'gmail.com' in domain:
return ('Personal', 'Gmail Contacts')
return ('Uncategorized', 'Unknown')
def extract_order_ids(emails):
"""Extract order/transaction IDs from emails."""
order_patterns = [
(r'Order\s+(\d{10,})', 'AliExpress Order'),
(r'receipt.*(\d{4}-\d{4}-\d{4})', 'Receipt ID'),
(r'#(\d{4,})', 'Generic Order ID'),
]
orders = []
for email in emails:
subject = email.subject or ""
for pattern, order_type in order_patterns:
match = re.search(pattern, subject, re.I)
if match:
orders.append({
'id': match.group(1),
'type': order_type,
'subject': subject,
'date': str(email.date) if email.date else None,
'sender': email.sender
})
break
return orders
def analyze_time_distribution(emails):
"""Analyze email distribution over time."""
by_year = Counter()
by_month = Counter()
by_day_of_week = Counter()
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
for email in emails:
if email.date:
try:
by_year[email.date.year] += 1
by_month[f"{email.date.year}-{email.date.month:02d}"] += 1
by_day_of_week[day_names[email.date.weekday()]] += 1
except:
pass
return {
'by_year': dict(by_year.most_common()),
'by_month': dict(sorted(by_month.items())),
'by_day_of_week': {d: by_day_of_week.get(d, 0) for d in day_names}
}
def main():
email_dir = "/home/bob/Documents/Email Manager/emails/brett-gmail"
output_dir = Path(__file__).parent.parent / "data"
output_dir.mkdir(exist_ok=True)
print("="*70)
print("BRETT GMAIL DATASET ANALYSIS")
print("="*70)
print(f"\nSource: {email_dir}")
print(f"Output: {output_dir}")
# Parse emails
print("\nParsing emails...")
parser = LocalFileParser(email_dir)
emails = parser.parse_emails()
print(f"Total emails: {len(emails)}")
# Date range
dates = [e.date for e in emails if e.date]
if dates:
dates.sort()
print(f"Date range: {dates[0].strftime('%Y-%m-%d')} to {dates[-1].strftime('%Y-%m-%d')}")
# Classify all emails
print("\nClassifying emails...")
category_counts = Counter()
subcategory_counts = Counter()
by_category = defaultdict(list)
by_subcategory = defaultdict(list)
for email in emails:
category, subcategory = classify_email(email)
category_counts[category] += 1
subcategory_counts[subcategory] += 1
by_category[category].append(email)
by_subcategory[subcategory].append(email)
# Print category summary
print("\n" + "="*70)
print("CATEGORY SUMMARY")
print("="*70)
for category, count in category_counts.most_common():
pct = count / len(emails) * 100
        bar = "█" * int(pct / 2)
print(f"\n{category} ({count} emails, {pct:.1f}%)")
print(f" {bar}")
# Show subcategories
subcats = Counter()
for email in by_category[category]:
_, subcat = classify_email(email)
subcats[subcat] += 1
for subcat, subcount in subcats.most_common():
print(f" - {subcat}: {subcount}")
# Analyze senders
print("\n" + "="*70)
print("TOP SENDERS BY VOLUME")
print("="*70)
sender_counts = Counter(e.sender for e in emails)
for sender, count in sender_counts.most_common(15):
pct = count / len(emails) * 100
print(f" {count:4d} ({pct:4.1f}%) {sender}")
# Time analysis
print("\n" + "="*70)
print("TIME DISTRIBUTION")
print("="*70)
time_dist = analyze_time_distribution(emails)
print("\nBy Year:")
for year, count in sorted(time_dist['by_year'].items()):
        bar = "█" * (count // 10)
print(f" {year}: {count:4d} {bar}")
print("\nBy Day of Week:")
for day, count in time_dist['by_day_of_week'].items():
        bar = "█" * (count // 5)
print(f" {day}: {count:3d} {bar}")
# Extract orders
print("\n" + "="*70)
print("ORDER/TRANSACTION IDs FOUND")
print("="*70)
orders = extract_order_ids(emails)
if orders:
for order in orders[:10]:
print(f" [{order['type']}] {order['id']}")
print(f" Subject: {order['subject'][:60]}...")
else:
print(" No order IDs detected in subjects")
# Actionable insights
print("\n" + "="*70)
print("ACTIONABLE INSIGHTS")
print("="*70)
# High-volume automated senders
automated_domains = ['mutualart.com', 'tripadvisor.com', 'ebay.com', 'spotify.com']
auto_count = sum(1 for e in emails if any(d in (e.sender or '') for d in automated_domains))
print(f"\n1. AUTOMATED EMAILS: {auto_count} ({auto_count/len(emails)*100:.1f}%)")
print(" - MutualArt alerts: Consider aggregating to weekly digest")
print(" - Tripadvisor: Can be filtered to trash or separate folder")
print(" - eBay/Spotify: Promotional, low priority")
# Security alerts
security_count = category_counts.get('Security', 0)
print(f"\n2. SECURITY ALERTS: {security_count} ({security_count/len(emails)*100:.1f}%)")
print(" - Google security: Review for legitimate sign-in attempts")
print(" - Should NOT be auto-filtered")
# Business/Work
business_count = category_counts.get('Business', 0) + category_counts.get('Google', 0)
print(f"\n3. BUSINESS-RELATED: {business_count} ({business_count/len(emails)*100:.1f}%)")
print(" - Google Business Profile reports: Monthly review")
print(" - Job applications: High priority")
print(" - Appointments: Calendar integration")
# AI Services (professional interest)
ai_count = category_counts.get('AI Services', 0) + category_counts.get('Developer Tools', 0)
print(f"\n4. AI/DEVELOPER TOOLS: {ai_count} ({ai_count/len(emails)*100:.1f}%)")
print(" - Anthropic, OpenAI, Lambda: Keep for reference")
print(" - ngrok, Docker, Cursor: Developer updates")
# Personal
personal_count = category_counts.get('Personal', 0)
print(f"\n5. PERSONAL: {personal_count} ({personal_count/len(emails)*100:.1f}%)")
print(" - Gmail contacts: May need human review")
print(" - Microsoft/Outlook: Check for spam")
# Save analysis data
analysis_data = {
'metadata': {
'total_emails': len(emails),
'date_range': {
'start': str(dates[0]) if dates else None,
'end': str(dates[-1]) if dates else None
},
'analyzed_at': datetime.now().isoformat()
},
'categories': dict(category_counts),
'subcategories': dict(subcategory_counts),
'top_senders': dict(sender_counts.most_common(50)),
'time_distribution': time_dist,
'orders_found': orders,
'classification_accuracy': {
'categorized': len(emails) - category_counts.get('Uncategorized', 0),
'uncategorized': category_counts.get('Uncategorized', 0),
'accuracy_pct': (len(emails) - category_counts.get('Uncategorized', 0)) / len(emails) * 100
}
}
output_file = output_dir / "brett_gmail_analysis.json"
with open(output_file, 'w') as f:
json.dump(analysis_data, f, indent=2)
print(f"\n\nAnalysis saved to: {output_file}")
print("\n" + "="*70)
print(f"CLASSIFICATION ACCURACY: {analysis_data['classification_accuracy']['accuracy_pct']:.1f}%")
print(f"({analysis_data['classification_accuracy']['categorized']} categorized, "
f"{analysis_data['classification_accuracy']['uncategorized']} uncategorized)")
print("="*70)
if __name__ == '__main__':
main()
tools/brett_microsoft_analyzer.py (new file)

@@ -0,0 +1,500 @@
#!/usr/bin/env python3
"""
Brett Microsoft (Outlook) Dataset Analyzer
==========================================
CUSTOM script for analyzing the brett-microsoft email dataset.
NOT portable to other datasets without modification.
Usage:
python tools/brett_microsoft_analyzer.py
Output:
- Console report with comprehensive statistics
- data/brett_microsoft_analysis.json with full analysis data
"""
import json
import re
from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path
# Add parent to path for imports
import sys
sys.path.insert(0, str(Path(__file__).parent.parent))
from src.calibration.local_file_parser import LocalFileParser
# =============================================================================
# CLASSIFICATION RULES - CUSTOM FOR BRETT'S MICROSOFT/OUTLOOK INBOX
# =============================================================================
def classify_email(email):
"""
Classify email into categories based on sender domain and subject patterns.
This is a BUSINESS inbox - different approach than personal Gmail.
Priority: Sender domain > Subject keywords > Business context
"""
sender = email.sender or ""
subject = email.subject or ""
domain = sender.split('@')[-1] if '@' in sender else sender
# === BUSINESS OPERATIONS ===
# MYOB/Accounting
if 'apps.myob.com' in domain or 'myob' in subject.lower():
return ('Business Operations', 'MYOB Invoices')
# TPG/Telecom/Internet
if 'tpgtelecom.com.au' in domain or 'aapt.com.au' in domain:
if 'suspension' in subject.lower() or 'overdue' in subject.lower():
return ('Business Operations', 'Telecom - Urgent/Overdue')
if 'novation' in subject.lower():
return ('Business Operations', 'Telecom - Contract Changes')
if 'NBN' in subject or 'nbn' in subject.lower():
return ('Business Operations', 'Telecom - NBN')
return ('Business Operations', 'Telecom - General')
# DocuSign (Contracts)
if 'docusign' in domain or 'docusign' in subject.lower():
return ('Business Operations', 'DocuSign Contracts')
# === CLIENT WORK ===
# Green Output / Energy Avengers (App Development Client)
if 'greenoutput.com.au' in domain or 'energyavengers' in domain:
return ('Client Work', 'Energy Avengers Project')
# Brighter Access (Client)
if 'brighteraccess' in domain or 'Brighter Access' in subject:
return ('Client Work', 'Brighter Access')
# Waterfall Way Designs (Business Partner)
if 'waterfallwaydesigns' in domain:
return ('Client Work', 'Waterfall Way Designs')
# Target Impact
if 'targetimpact.com.au' in domain:
return ('Client Work', 'Target Impact')
# MerlinFX
if 'merlinfx.com.au' in domain:
return ('Client Work', 'MerlinFX')
# Solar/Energy related (Energy Avengers ecosystem)
if 'solarairenergy.com.au' in domain or 'solarconnected.com.au' in domain:
return ('Client Work', 'Energy Avengers Ecosystem')
if 'eonadvisory.com.au' in domain or 'australianpowerbrokers.com.au' in domain:
return ('Client Work', 'Energy Avengers Ecosystem')
if 'fyconsulting.com.au' in domain:
return ('Client Work', 'Energy Avengers Ecosystem')
if 'convergedesign.com.au' in domain:
return ('Client Work', 'Energy Avengers Ecosystem')
# MYP Corp (Disability Services Software)
if '1myp.com' in domain or 'mypcorp' in domain or 'MYP' in subject:
return ('Business Operations', 'MYP Software')
# === MICROSOFT SERVICES ===
# Microsoft Support Cases
if re.search(r'\[Case.*#|Case #|TrackingID', subject, re.I) or 'support.microsoft.com' in domain:
return ('Microsoft', 'Support Cases')
# Microsoft Billing/Invoices
if 'Microsoft invoice' in subject or 'credit card was declined' in subject:
return ('Microsoft', 'Billing')
# Microsoft Subscriptions
if 'subscription' in subject.lower() and 'microsoft' in sender.lower():
return ('Microsoft', 'Subscriptions')
# SharePoint/Teams
if 'sharepointonline.com' in domain or 'Teams' in subject:
return ('Microsoft', 'SharePoint/Teams')
# O365 Service Updates
if 'o365su' in sender or ('digest' in subject.lower() and 'microsoft' in sender.lower()):
return ('Microsoft', 'Service Updates')
# General Microsoft
if 'microsoft.com' in domain:
return ('Microsoft', 'General')
# === DEVELOPER TOOLS ===
# GitHub CI/CD
if re.search(r'\[FSSCoding', subject):
return ('Developer', 'GitHub CI/CD Failures')
# GitHub Issues/PRs
if 'github.com' in domain:
if 'linuxmint' in subject or 'cinnamon' in subject:
return ('Developer', 'Open Source Contributions')
if 'Pheromind' in subject or 'ChrisRoyse' in subject:
return ('Developer', 'GitHub Collaborations')
return ('Developer', 'GitHub Notifications')
# Neo4j
if 'neo4j.com' in domain:
if 'webinar' in subject.lower() or 'Webinar' in subject:
return ('Developer', 'Neo4j Webinars')
if 'NODES' in subject or 'GraphTalk' in subject:
return ('Developer', 'Neo4j Conference')
return ('Developer', 'Neo4j')
# Cursor (AI IDE)
if 'cursor.com' in domain or 'cursor.so' in domain or 'Cursor' in subject:
return ('Developer', 'Cursor IDE')
# Tailscale
if 'tailscale.com' in domain:
return ('Developer', 'Tailscale')
# Hugging Face
if 'huggingface' in domain or 'Hugging Face' in subject:
return ('Developer', 'Hugging Face')
# Stripe (Payment Failures)
if 'stripe.com' in domain:
return ('Billing', 'Stripe Payments')
# Contabo (Hosting)
if 'contabo.com' in domain:
return ('Developer', 'Contabo Hosting')
# SendGrid
if 'sendgrid' in subject.lower():
return ('Developer', 'SendGrid')
# Twilio
if 'twilio.com' in domain:
return ('Developer', 'Twilio')
# Brave Search API
if 'brave.com' in domain:
return ('Developer', 'Brave Search API')
# PyPI
if 'pypi' in subject.lower() or 'pypi.org' in domain:
return ('Developer', 'PyPI')
# NVIDIA/CUDA
if 'CUDA' in subject or 'nvidia' in domain:
return ('Developer', 'NVIDIA/CUDA')
# Inception Labs / AI Tools
if 'inceptionlabs.ai' in domain:
return ('Developer', 'AI Tools')
# === LEARNING ===
# Computer Enhance (Casey Muratori) / Substack
if 'computerenhance' in sender or 'substack.com' in domain:
return ('Learning', 'Substack/Newsletters')
# Odoo
if 'odoo.com' in domain:
return ('Learning', 'Odoo ERP')
# Mozilla Firefox
if 'mozilla.org' in domain:
return ('Developer', 'Mozilla Firefox')
# === PERSONAL / COMMUNITY ===
# Grandfather Gatherings (Personal Community)
if 'Grandfather Gather' in subject:
return ('Personal', 'Grandfather Gatherings')
# Mailchimp newsletters (often personal)
if 'mailchimpapp.com' in domain:
return ('Personal', 'Personal Newsletters')
# Community Events
if 'Community Working Bee' in subject:
return ('Personal', 'Community Events')
# Personal emails (Gmail/Hotmail)
if 'gmail.com' in domain or 'hotmail.com' in domain or 'bigpond.com' in domain:
return ('Personal', 'Personal Contacts')
# FSS Internal
if 'foxsoftwaresolutions.com.au' in domain:
return ('Business Operations', 'FSS Internal')
# === FINANCIAL ===
# eToro
if 'etoro.com' in domain:
return ('Financial', 'eToro Trading')
# Dell
if 'dell.com' in domain or 'Dell' in subject:
return ('Business Operations', 'Dell Hardware')
# Insurance
if 'KT Insurance' in subject or 'insurance' in subject.lower():
return ('Business Operations', 'Insurance')
# SBSCH Payments
if 'SBSCH' in subject:
return ('Business Operations', 'SBSCH Payments')
# iCare NSW
if 'icare.nsw.gov.au' in domain:
return ('Business Operations', 'iCare NSW')
# Vodafone
if 'vodafone.com.au' in domain:
return ('Business Operations', 'Telecom - Vodafone')
# === MISC ===
# Undeliverable/Bounces
if 'Undeliverable' in subject:
return ('System', 'Email Bounces')
# Security
if re.search(r'Security Alert|Login detected|security code|Verify', subject, re.I):
return ('Security', 'Security Alerts')
# Password Reset
if 'password' in subject.lower():
return ('Security', 'Password')
# Calendly
if 'calendly.com' in domain:
return ('Business Operations', 'Calendly')
# Trello
if 'trello.com' in domain:
return ('Business Operations', 'Trello')
# Scorptec
if 'scorptec' in domain:
return ('Business Operations', 'Hardware Vendor')
# Webcentral
if 'webcentral.com.au' in domain:
return ('Business Operations', 'Web Hosting')
# Bluetti (Hardware)
if 'bluettipower.com' in domain:
return ('Business Operations', 'Hardware - Power')
# ABS Surveys
if 'abs.gov.au' in domain:
return ('Business Operations', 'Government - ABS')
# Qualtrics/Surveys
if 'qualtrics' in domain:
return ('Business Operations', 'Surveys')
return ('Uncategorized', 'Unknown')
def extract_case_ids(emails):
"""Extract Microsoft support case IDs and tracking IDs from emails."""
case_patterns = [
(r'Case\s*#?\s*:?\s*(\d{8})', 'Microsoft Case'),
(r'\[Case\s*#?\s*:?\s*(\d{8})\]', 'Microsoft Case'),
(r'TrackingID#(\d{16})', 'Tracking ID'),
]
cases = defaultdict(list)
for email in emails:
subject = email.subject or ""
for pattern, case_type in case_patterns:
match = re.search(pattern, subject, re.I)
if match:
case_id = match.group(1)
cases[case_id].append({
'type': case_type,
'subject': subject,
'date': str(email.date) if email.date else None,
'sender': email.sender
})
return dict(cases)
def analyze_time_distribution(emails):
"""Analyze email distribution over time."""
by_year = Counter()
by_month = Counter()
by_day_of_week = Counter()
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
for email in emails:
if email.date:
try:
by_year[email.date.year] += 1
by_month[f"{email.date.year}-{email.date.month:02d}"] += 1
by_day_of_week[day_names[email.date.weekday()]] += 1
except:
pass
return {
'by_year': dict(by_year.most_common()),
'by_month': dict(sorted(by_month.items())),
'by_day_of_week': {d: by_day_of_week.get(d, 0) for d in day_names}
}
def main():
email_dir = "/home/bob/Documents/Email Manager/emails/brett-microsoft"
output_dir = Path(__file__).parent.parent / "data"
output_dir.mkdir(exist_ok=True)
print("="*70)
print("BRETT MICROSOFT (OUTLOOK) DATASET ANALYSIS")
print("="*70)
print(f"\nSource: {email_dir}")
print(f"Output: {output_dir}")
# Parse emails
print("\nParsing emails...")
parser = LocalFileParser(email_dir)
emails = parser.parse_emails()
print(f"Total emails: {len(emails)}")
# Date range
dates = [e.date for e in emails if e.date]
if dates:
dates.sort()
print(f"Date range: {dates[0].strftime('%Y-%m-%d')} to {dates[-1].strftime('%Y-%m-%d')}")
# Classify all emails
print("\nClassifying emails...")
category_counts = Counter()
subcategory_counts = Counter()
by_category = defaultdict(list)
by_subcategory = defaultdict(list)
for email in emails:
category, subcategory = classify_email(email)
category_counts[category] += 1
subcategory_counts[f"{category}: {subcategory}"] += 1
by_category[category].append(email)
by_subcategory[subcategory].append(email)
# Print category summary
print("\n" + "="*70)
print("TOP-LEVEL CATEGORY SUMMARY")
print("="*70)
for category, count in category_counts.most_common():
pct = count / len(emails) * 100
        bar = "█" * int(pct / 2)
print(f"\n{category} ({count} emails, {pct:.1f}%)")
print(f" {bar}")
# Show subcategories
subcats = Counter()
for email in by_category[category]:
_, subcat = classify_email(email)
subcats[subcat] += 1
for subcat, subcount in subcats.most_common():
print(f" - {subcat}: {subcount}")
# Analyze senders
print("\n" + "="*70)
print("TOP SENDERS BY VOLUME")
print("="*70)
sender_counts = Counter(e.sender for e in emails)
for sender, count in sender_counts.most_common(15):
pct = count / len(emails) * 100
print(f" {count:4d} ({pct:4.1f}%) {sender}")
# Time analysis
print("\n" + "="*70)
print("TIME DISTRIBUTION")
print("="*70)
time_dist = analyze_time_distribution(emails)
print("\nBy Year:")
for year, count in sorted(time_dist['by_year'].items()):
        bar = "█" * (count // 10)
print(f" {year}: {count:4d} {bar}")
print("\nBy Day of Week:")
for day, count in time_dist['by_day_of_week'].items():
        bar = "█" * (count // 5)
print(f" {day}: {count:3d} {bar}")
# Extract case IDs
print("\n" + "="*70)
print("MICROSOFT SUPPORT CASES TRACKED")
print("="*70)
cases = extract_case_ids(emails)
if cases:
for case_id, occurrences in sorted(cases.items()):
print(f"\n Case/Tracking: {case_id} ({len(occurrences)} emails)")
for occ in occurrences[:3]:
print(f" - {occ['date']}: {occ['subject'][:50]}...")
else:
print(" No case IDs detected")
# Actionable insights
print("\n" + "="*70)
print("INBOX CHARACTER ASSESSMENT")
print("="*70)
business_pct = (category_counts.get('Business Operations', 0) +
category_counts.get('Client Work', 0) +
category_counts.get('Developer', 0)) / len(emails) * 100
personal_pct = category_counts.get('Personal', 0) / len(emails) * 100
print(f"\n Business/Professional: {business_pct:.1f}%")
print(f" Personal: {personal_pct:.1f}%")
print(f"\n ASSESSMENT: This is a {'BUSINESS' if business_pct > 50 else 'MIXED'} inbox")
# Save analysis data
analysis_data = {
'metadata': {
'total_emails': len(emails),
'inbox_type': 'microsoft',
'inbox_character': 'business' if business_pct > 50 else 'mixed',
'date_range': {
'start': str(dates[0]) if dates else None,
'end': str(dates[-1]) if dates else None
},
'analyzed_at': datetime.now().isoformat()
},
'categories': dict(category_counts),
'subcategories': dict(subcategory_counts),
'top_senders': dict(sender_counts.most_common(50)),
'time_distribution': time_dist,
'support_cases': cases,
'classification_accuracy': {
'categorized': len(emails) - category_counts.get('Uncategorized', 0),
'uncategorized': category_counts.get('Uncategorized', 0),
'accuracy_pct': (len(emails) - category_counts.get('Uncategorized', 0)) / len(emails) * 100
}
}
output_file = output_dir / "brett_microsoft_analysis.json"
with open(output_file, 'w') as f:
json.dump(analysis_data, f, indent=2)
print(f"\n\nAnalysis saved to: {output_file}")
print("\n" + "="*70)
print(f"CLASSIFICATION ACCURACY: {analysis_data['classification_accuracy']['accuracy_pct']:.1f}%")
print(f"({analysis_data['classification_accuracy']['categorized']} categorized, "
f"{analysis_data['classification_accuracy']['uncategorized']} uncategorized)")
print("="*70)
if __name__ == '__main__':
main()
tools/generate_html_report.py (new file)

@@ -0,0 +1,642 @@
#!/usr/bin/env python3
"""
Generate interactive HTML report from email classification results.
Usage:
python tools/generate_html_report.py --input results.json --output report.html
"""
import argparse
import json
from pathlib import Path
from datetime import datetime
from collections import Counter, defaultdict
from html import escape
def load_results(input_path: str) -> dict:
"""Load classification results from JSON."""
with open(input_path) as f:
return json.load(f)
def extract_domain(sender: str) -> str:
"""Extract domain from email address."""
if not sender:
return "unknown"
if "@" in sender:
return sender.split("@")[-1].lower()
return sender.lower()
def format_date(date_str: str) -> str:
"""Format ISO date string for display."""
if not date_str:
return "N/A"
try:
dt = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
return dt.strftime("%Y-%m-%d %H:%M")
except:
return date_str[:16] if len(date_str) > 16 else date_str
def truncate(text: str, max_len: int = 60) -> str:
"""Truncate text with ellipsis."""
if not text:
return ""
if len(text) <= max_len:
return text
return text[:max_len-3] + "..."
def generate_html_report(results: dict, output_path: str):
"""Generate interactive HTML report."""
metadata = results.get("metadata", {})
classifications = results.get("classifications", [])
# Calculate statistics
total = len(classifications)
categories = Counter(c["category"] for c in classifications)
methods = Counter(c["method"] for c in classifications)
# Group by category
by_category = defaultdict(list)
for c in classifications:
by_category[c["category"]].append(c)
# Sort categories by count
sorted_categories = sorted(categories.keys(), key=lambda x: categories[x], reverse=True)
# Sender statistics
sender_domains = Counter(extract_domain(c.get("sender", "")) for c in classifications)
top_senders = Counter(c.get("sender", "unknown") for c in classifications).most_common(20)
# Confidence distribution
high_conf = sum(1 for c in classifications if c.get("confidence", 0) >= 0.7)
med_conf = sum(1 for c in classifications if 0.5 <= c.get("confidence", 0) < 0.7)
low_conf = sum(1 for c in classifications if c.get("confidence", 0) < 0.5)
# Generate HTML
html = f'''<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Email Classification Report</title>
<style>
:root {{
--bg-primary: #1a1a2e;
--bg-secondary: #16213e;
--bg-card: #0f3460;
--text-primary: #eee;
--text-secondary: #aaa;
--accent: #e94560;
--accent-hover: #ff6b6b;
--success: #00d9a5;
--warning: #ffc107;
--border: #2a2a4a;
}}
* {{
margin: 0;
padding: 0;
box-sizing: border-box;
}}
body {{
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, sans-serif;
background: var(--bg-primary);
color: var(--text-primary);
line-height: 1.6;
}}
.container {{
max-width: 1400px;
margin: 0 auto;
padding: 20px;
}}
header {{
background: var(--bg-secondary);
padding: 30px;
border-radius: 12px;
margin-bottom: 30px;
border: 1px solid var(--border);
}}
header h1 {{
font-size: 2rem;
margin-bottom: 10px;
color: var(--accent);
}}
.meta-info {{
display: flex;
flex-wrap: wrap;
gap: 20px;
margin-top: 15px;
color: var(--text-secondary);
font-size: 0.9rem;
}}
.meta-info span {{
background: var(--bg-card);
padding: 5px 12px;
border-radius: 20px;
}}
.stats-grid {{
display: grid;
grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
gap: 20px;
margin-bottom: 30px;
}}
.stat-card {{
background: var(--bg-secondary);
padding: 20px;
border-radius: 12px;
border: 1px solid var(--border);
text-align: center;
}}
.stat-card .value {{
font-size: 2.5rem;
font-weight: bold;
color: var(--accent);
}}
.stat-card .label {{
color: var(--text-secondary);
font-size: 0.9rem;
margin-top: 5px;
}}
.tabs {{
display: flex;
flex-wrap: wrap;
gap: 10px;
margin-bottom: 20px;
border-bottom: 2px solid var(--border);
padding-bottom: 10px;
}}
.tab {{
padding: 10px 20px;
background: var(--bg-secondary);
border: 1px solid var(--border);
border-radius: 8px 8px 0 0;
cursor: pointer;
transition: all 0.2s;
color: var(--text-secondary);
}}
.tab:hover {{
background: var(--bg-card);
color: var(--text-primary);
}}
.tab.active {{
background: var(--accent);
color: white;
border-color: var(--accent);
}}
.tab .count {{
background: rgba(255,255,255,0.2);
padding: 2px 8px;
border-radius: 10px;
font-size: 0.8rem;
margin-left: 8px;
}}
.tab-content {{
display: none;
}}
.tab-content.active {{
display: block;
}}
.email-table {{
width: 100%;
border-collapse: collapse;
background: var(--bg-secondary);
border-radius: 12px;
overflow: hidden;
}}
.email-table th {{
background: var(--bg-card);
padding: 15px;
text-align: left;
font-weight: 600;
color: var(--text-primary);
position: sticky;
top: 0;
}}
.email-table td {{
padding: 12px 15px;
border-bottom: 1px solid var(--border);
color: var(--text-secondary);
}}
.email-table tr:hover td {{
background: var(--bg-card);
color: var(--text-primary);
}}
.email-table .subject {{
max-width: 400px;
color: var(--text-primary);
}}
.email-table .sender {{
max-width: 250px;
}}
.confidence {{
display: inline-block;
padding: 3px 10px;
border-radius: 12px;
font-size: 0.85rem;
font-weight: 500;
}}
.confidence.high {{
background: rgba(0, 217, 165, 0.2);
color: var(--success);
}}
.confidence.medium {{
background: rgba(255, 193, 7, 0.2);
color: var(--warning);
}}
.confidence.low {{
background: rgba(233, 69, 96, 0.2);
color: var(--accent);
}}
.method-badge {{
display: inline-block;
padding: 3px 8px;
border-radius: 4px;
font-size: 0.75rem;
text-transform: uppercase;
}}
.method-ml {{
background: rgba(0, 217, 165, 0.2);
color: var(--success);
}}
.method-rule {{
background: rgba(100, 149, 237, 0.2);
color: cornflowerblue;
}}
.method-llm {{
background: rgba(255, 193, 7, 0.2);
color: var(--warning);
}}
.section {{
background: var(--bg-secondary);
padding: 25px;
border-radius: 12px;
margin-bottom: 30px;
border: 1px solid var(--border);
}}
.section h2 {{
margin-bottom: 20px;
color: var(--accent);
font-size: 1.3rem;
}}
.chart-bar {{
display: flex;
align-items: center;
margin-bottom: 10px;
}}
.chart-bar .label {{
width: 150px;
font-size: 0.9rem;
color: var(--text-secondary);
}}
.chart-bar .bar-container {{
flex: 1;
height: 24px;
background: var(--bg-card);
border-radius: 4px;
overflow: hidden;
margin: 0 15px;
}}
.chart-bar .bar {{
height: 100%;
background: linear-gradient(90deg, var(--accent), var(--accent-hover));
transition: width 0.5s ease;
}}
.chart-bar .value {{
width: 80px;
text-align: right;
font-size: 0.9rem;
}}
.sender-list {{
display: grid;
grid-template-columns: repeat(auto-fill, minmax(300px, 1fr));
gap: 10px;
}}
.sender-item {{
display: flex;
justify-content: space-between;
padding: 10px 15px;
background: var(--bg-card);
border-radius: 8px;
font-size: 0.9rem;
}}
.sender-item .email {{
color: var(--text-secondary);
overflow: hidden;
text-overflow: ellipsis;
white-space: nowrap;
max-width: 220px;
}}
.sender-item .count {{
color: var(--accent);
font-weight: bold;
}}
.search-box {{
width: 100%;
padding: 12px 20px;
background: var(--bg-card);
border: 1px solid var(--border);
border-radius: 8px;
color: var(--text-primary);
font-size: 1rem;
margin-bottom: 20px;
}}
.search-box:focus {{
outline: none;
border-color: var(--accent);
}}
.table-container {{
max-height: 600px;
overflow-y: auto;
border-radius: 12px;
}}
.attachment-icon {{
color: var(--warning);
}}
footer {{
text-align: center;
padding: 20px;
color: var(--text-secondary);
font-size: 0.85rem;
}}
</style>
</head>
<body>
<div class="container">
<header>
<h1>Email Classification Report</h1>
<p>Automated analysis of email inbox</p>
<div class="meta-info">
<span>Generated: {datetime.now().strftime("%Y-%m-%d %H:%M")}</span>
<span>Source: {escape(metadata.get("source", "unknown"))}</span>
<span>Total Emails: {total:,}</span>
</div>
</header>
<div class="stats-grid">
<div class="stat-card">
<div class="value">{total:,}</div>
<div class="label">Total Emails</div>
</div>
<div class="stat-card">
<div class="value">{len(categories)}</div>
<div class="label">Categories</div>
</div>
<div class="stat-card">
<div class="value">{high_conf}</div>
<div class="label">High Confidence (&ge;70%)</div>
</div>
<div class="stat-card">
<div class="value">{len(sender_domains)}</div>
<div class="label">Unique Domains</div>
</div>
</div>
<div class="section">
<h2>Category Distribution</h2>
{"".join(f'''
<div class="chart-bar">
<div class="label">{escape(cat)}</div>
<div class="bar-container">
<div class="bar" style="width: {categories[cat]/total*100:.1f}%"></div>
</div>
<div class="value">{categories[cat]:,} ({categories[cat]/total*100:.1f}%)</div>
</div>
''' for cat in sorted_categories)}
</div>
<div class="section">
<h2>Classification Methods</h2>
{"".join(f'''
<div class="chart-bar">
<div class="label">{escape(method.upper())}</div>
<div class="bar-container">
<div class="bar" style="width: {methods[method]/total*100:.1f}%"></div>
</div>
<div class="value">{methods[method]:,} ({methods[method]/total*100:.1f}%)</div>
</div>
''' for method in sorted(methods.keys()))}
</div>
<div class="section">
<h2>Confidence Distribution</h2>
<div class="chart-bar">
<div class="label">High (&ge;70%)</div>
<div class="bar-container">
<div class="bar" style="width: {high_conf/total*100:.1f}%; background: linear-gradient(90deg, #00d9a5, #00ffcc);"></div>
</div>
<div class="value">{high_conf:,} ({high_conf/total*100:.1f}%)</div>
</div>
<div class="chart-bar">
<div class="label">Medium (50-70%)</div>
<div class="bar-container">
<div class="bar" style="width: {med_conf/total*100:.1f}%; background: linear-gradient(90deg, #ffc107, #ffdb58);"></div>
</div>
<div class="value">{med_conf:,} ({med_conf/total*100:.1f}%)</div>
</div>
<div class="chart-bar">
<div class="label">Low (&lt;50%)</div>
<div class="bar-container">
<div class="bar" style="width: {low_conf/total*100:.1f}%; background: linear-gradient(90deg, #e94560, #ff6b6b);"></div>
</div>
<div class="value">{low_conf:,} ({low_conf/total*100:.1f}%)</div>
</div>
</div>
<div class="section">
<h2>Top Senders</h2>
<div class="sender-list">
{"".join(f'''
<div class="sender-item">
<span class="email" title="{escape(sender)}">{escape(truncate(sender, 35))}</span>
<span class="count">{count}</span>
</div>
''' for sender, count in top_senders)}
</div>
</div>
<div class="section">
<h2>Emails by Category</h2>
<div class="tabs">
<div class="tab active" onclick="showTab('all')">All<span class="count">{total}</span></div>
{"".join(f'''<div class="tab" onclick="showTab('{escape(cat)}')">{escape(cat)}<span class="count">{categories[cat]}</span></div>''' for cat in sorted_categories)}
</div>
<input type="text" class="search-box" placeholder="Search by subject, sender..." onkeyup="filterTable(this.value)">
<div id="tab-all" class="tab-content active">
<div class="table-container">
<table class="email-table" id="email-table-all">
<thead>
<tr>
<th>Date</th>
<th>Subject</th>
<th>Sender</th>
<th>Category</th>
<th>Confidence</th>
<th>Method</th>
</tr>
</thead>
<tbody>
{"".join(generate_email_row(c) for c in sorted(classifications, key=lambda x: x.get("date") or "", reverse=True))}
</tbody>
</table>
</div>
</div>
{"".join(f'''
<div id="tab-{escape(cat)}" class="tab-content">
<div class="table-container">
<table class="email-table">
<thead>
<tr>
<th>Date</th>
<th>Subject</th>
<th>Sender</th>
<th>Confidence</th>
<th>Method</th>
</tr>
</thead>
<tbody>
{"".join(generate_email_row(c, show_category=False) for c in sorted(by_category[cat], key=lambda x: x.get("date") or "", reverse=True))}
</tbody>
</table>
</div>
</div>
''' for cat in sorted_categories)}
</div>
<footer>
Generated by Email Sorter | {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
</footer>
</div>
<script>
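        // Client-side behavior for the static report: tab switching and
        // live search over the email tables. No external libraries needed.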
        function showTab(tab, tabId) {{
            // Deactivate every panel and tab button
            document.querySelectorAll('.tab-content').forEach(el => el.classList.remove('active'));
            document.querySelectorAll('.tab').forEach(el => el.classList.remove('active'));
            // Activate the selected panel. The clicked button is passed in as
            // `this` so that clicks landing on the inner count badge still
            // highlight the tab itself (event.target would be the badge).
            document.getElementById('tab-' + tabId).classList.add('active');
            tab.classList.add('active');
        }}
        function filterTable(query) {{
            // Filter every table, not just the active tab, so the search
            // persists across tab switches. Rows match on their data-search
            // attribute (subject + sender), falling back to full row text.
            query = query.toLowerCase();
            document.querySelectorAll('.email-table tbody tr').forEach(row => {{
                const text = (row.dataset.search || row.textContent).toLowerCase();
                row.style.display = text.includes(query) ? '' : 'none';
            }});
        }}
</script>
</body>
</html>
'''
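    # Write the finished document; UTF-8 matters for the attachment emoji.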
with open(output_path, "w", encoding="utf-8") as f:
f.write(html)
print(f"Report generated: {output_path}")
print(f" Total emails: {total:,}")
print(f" Categories: {len(categories)}")
print(f" Top category: {sorted_categories[0]} ({categories[sorted_categories[0]]:,})")
def generate_email_row(c: dict, show_category: bool = True) -> str:
"""Generate HTML table row for an email."""
conf = c.get("confidence", 0)
conf_class = "high" if conf >= 0.7 else "medium" if conf >= 0.5 else "low"
method = c.get("method", "unknown")
method_class = f"method-{method}"
attachment_icon = '<span class="attachment-icon" title="Has attachments">📎</span> ' if c.get("has_attachments") else ""
category_col = f'<td>{escape(c.get("category", "unknown"))}</td>' if show_category else ""
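    # data-search carries subject + sender only, matching the search box placeholder.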
return f'''
<tr data-search="{escape(c.get('subject', ''))} {escape(c.get('sender', ''))}">
<td>{format_date(c.get("date"))}</td>
<td class="subject">{attachment_icon}{escape(truncate(c.get("subject", "No subject"), 70))}</td>
<td class="sender" title="{escape(c.get('sender', ''))}">{escape(truncate(c.get("sender_name") or c.get("sender", ""), 35))}</td>
{category_col}
<td><span class="confidence {conf_class}">{conf*100:.0f}%</span></td>
<td><span class="method-badge {method_class}">{method}</span></td>
</tr>
'''
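# Example invocations (paths are illustrative):
#   python generate_html_report.py --input results.json
#   python generate_html_report.py -i results.json -o custom_report.html
# When --output is omitted, report.html is written next to the input file.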
def main():
    """CLI entry point: parse arguments, validate the input path, render the report."""
parser = argparse.ArgumentParser(description="Generate HTML report from classification results")
parser.add_argument("--input", "-i", required=True, help="Path to results.json")
parser.add_argument("--output", "-o", default=None, help="Output HTML file path")
args = parser.parse_args()
input_path = Path(args.input)
if not input_path.exists():
print(f"Error: Input file not found: {input_path}")
return 1
output_path = args.output or str(input_path.parent / "report.html")
results = load_results(args.input)
generate_html_report(results, output_path)
return 0
if __name__ == "__main__":
    raise SystemExit(main())