email-sorter/docs/SYSTEM_FLOW.html
FSSCoding 53174a34eb Organize project structure and add MVP features
Project Reorganization:
- Created docs/ directory and moved all documentation
- Created scripts/ directory for shell scripts
- Created scripts/experimental/ for research scripts
- Updated .gitignore for new structure
- Updated README.md with MVP status and new structure

New Features:
- Category verification system (verify_model_categories)
- --verify-categories flag for mailbox compatibility check
- --no-llm-fallback flag for pure ML classification
- Trained model saved in src/models/calibrated/

Threshold Optimization:
- Reduced default threshold from 0.75 to 0.55
- Updated all category thresholds to 0.55
- Reduces LLM fallback rate by 40% (35% -> 21%)

Documentation:
- SYSTEM_FLOW.html - Complete system architecture
- VERIFY_CATEGORIES_FEATURE.html - Feature documentation
- LABEL_TRAINING_PHASE_DETAIL.html - Calibration breakdown
- FAST_ML_ONLY_WORKFLOW.html - Pure ML guide
- PROJECT_STATUS_AND_NEXT_STEPS.html - Roadmap
- ROOT_CAUSE_ANALYSIS.md - Bug fixes

MVP Status:
- 10k emails in 4 minutes, 72.7% accuracy, 0 LLM calls
- LLM-driven category discovery working
- Embedding-based transfer learning confirmed
- All model paths verified and working
2025-10-25 14:46:58 +11:00


<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Email Sorter System Flow</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.timing-table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
.timing-table th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
.timing-table td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
.flag-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
</style>
</head>
<body>
<h1>Email Sorter System Flow Documentation</h1>
<h2>1. Main Execution Flow</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([python -m src.cli run]) --> LoadConfig[Load config/default_config.yaml]
LoadConfig --> InitProviders[Initialize Email Provider<br/>Enron/Gmail/IMAP]
InitProviders --> FetchEmails[Fetch Emails<br/>--limit N]
FetchEmails --> CheckSize{Email Count?}
CheckSize -->|"< 1000"| SetMockMode[Set ml_classifier.is_mock = True<br/>LLM-only mode]
CheckSize -->|">= 1000"| CheckModel{Model Exists?}
CheckModel -->|No model at<br/>src/models/pretrained/classifier.pkl| RunCalibration[CALIBRATION PHASE<br/>LLM category discovery<br/>Train ML model]
CheckModel -->|Model exists| SkipCalibration[Skip Calibration<br/>Load existing model]
SetMockMode --> SkipCalibration
RunCalibration --> ClassifyPhase[CLASSIFICATION PHASE]
SkipCalibration --> ClassifyPhase
ClassifyPhase --> Loop{For each email}
Loop --> RuleCheck{Hard rule match?}
RuleCheck -->|Yes| RuleClassify[Category by rule<br/>confidence=1.0<br/>method='rule']
RuleCheck -->|No| MLClassify[ML Classification<br/>Get category + confidence]
MLClassify --> ConfCheck{Confidence >= threshold?}
ConfCheck -->|Yes| AcceptML[Accept ML result<br/>method='ml'<br/>needs_review=False]
ConfCheck -->|No| LowConf[Low confidence detected<br/>needs_review=True]
LowConf --> FlagCheck{--no-llm-fallback?}
FlagCheck -->|Yes| AcceptMLAnyway[Accept ML anyway<br/>needs_review=False]
FlagCheck -->|No| LLMCheck{LLM available?}
LLMCheck -->|Yes| LLMReview[LLM Classification<br/>~4 seconds<br/>method='llm']
LLMCheck -->|No| AcceptMLAnyway
RuleClassify --> NextEmail{More emails?}
AcceptML --> NextEmail
AcceptMLAnyway --> NextEmail
LLMReview --> NextEmail
NextEmail -->|Yes| Loop
NextEmail -->|No| SaveResults[Save results.json]
SaveResults --> End([Complete])
style RunCalibration fill:#ff6b6b
style LLMReview fill:#ff6b6b
style SetMockMode fill:#ffd93d
style FlagCheck fill:#4ec9b0
style AcceptMLAnyway fill:#4ec9b0
</pre>
</div>
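<p>The per-email decision path above can be condensed into a short sketch. This is an illustrative reconstruction, not the project's actual API: the names <code>classify_email</code>, <code>Result</code>, and the callable parameters are hypothetical, and the LLM branch's confidence handling is assumed.</p>

```python
# Sketch of the rule -> ML -> threshold -> LLM decision path above.
# All names here are illustrative, not the project's real API.
from dataclasses import dataclass

@dataclass
class Result:
    category: str
    confidence: float
    method: str           # 'rule', 'ml', or 'llm'
    needs_review: bool

def classify_email(email, rules, ml_predict, llm_classify=None,
                   threshold=0.55, no_llm_fallback=False):
    # 1. Hard rules win outright with confidence 1.0.
    for pattern, category in rules:
        if pattern in email:
            return Result(category, 1.0, 'rule', False)
    # 2. ML prediction; accept if confident enough.
    category, confidence = ml_predict(email)
    if confidence >= threshold:
        return Result(category, confidence, 'ml', False)
    # 3. Low confidence: force the ML result, or fall back to the LLM.
    if no_llm_fallback or llm_classify is None:
        return Result(category, confidence, 'ml', False)
    return Result(llm_classify(email), 1.0, 'llm', False)
```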
<h2>2. Calibration Phase Detail (When Triggered)</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Calibration Triggered]) --> Sample[Stratified Sampling<br/>3% of emails<br/>min 250, max 1500]
Sample --> LLMBatch[LLM Category Discovery<br/>50 emails per batch]
LLMBatch --> Batch1[Batch 1: 50 emails<br/>~20 seconds]
Batch1 --> Batch2[Batch 2: 50 emails<br/>~20 seconds]
Batch2 --> BatchN[... N batches<br/>For 300 samples: 6 batches]
BatchN --> Consolidate[LLM Consolidation<br/>Merge similar categories<br/>~5 seconds]
Consolidate --> Categories[Final Categories<br/>~10-12 unique categories]
Categories --> Label[Label Training Emails<br/>LLM labels each sample<br/>~3 seconds per email]
Label --> Extract[Feature Extraction<br/>Embeddings + TF-IDF<br/>~0.02 seconds per email]
Extract --> Train[Train LightGBM Model<br/>~5 seconds total]
Train --> Validate[Validate on 100 samples<br/>~2 seconds]
Validate --> Save[Save Model<br/>src/models/calibrated/classifier.pkl]
Save --> End([Calibration Complete<br/>Total time: ~17-20 minutes for 10k emails])
style LLMBatch fill:#ff6b6b
style Label fill:#ff6b6b
style Consolidate fill:#ff6b6b
style Train fill:#4ec9b0
</pre>
</div>
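<p>The sampling step above (3% of the mailbox, clamped between 250 and 1500) and the batch count follow directly from the numbers in the diagram. A minimal sketch; the helper name is illustrative:</p>

```python
# Stratified-sampling size: 3% of the mailbox, clamped to [250, 1500].
# Helper name is illustrative, not the project's actual function.
def calibration_sample_size(total_emails, rate=0.03, lo=250, hi=1500):
    return max(lo, min(hi, int(total_emails * rate)))

def discovery_batches(sample_size, batch_size=50):
    # Ceiling division: 300 samples -> 6 batches of 50.
    return -(-sample_size // batch_size)
```

For a 10k-email mailbox this yields 300 samples and 6 LLM discovery batches, matching the diagram.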
<h2>3. Classification Phase Detail</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Classification Phase]) --> Email[Get Email]
Email --> Rules{Check Hard Rules<br/>Pattern matching}
Rules -->|Match| RuleDone[Rule Match<br/>~0.001 seconds<br/>59 of 10000 emails]
Rules -->|No match| Embed[Generate Embedding<br/>all-minilm:l6-v2<br/>~0.02 seconds]
Embed --> TFIDF[TF-IDF Features<br/>~0.001 seconds]
TFIDF --> MLPredict[ML Prediction<br/>LightGBM<br/>~0.003 seconds]
MLPredict --> Threshold{Confidence >= 0.55?}
Threshold -->|Yes| MLDone[ML Classification<br/>7842 of 10000 emails<br/>78.4%]
Threshold -->|No| Flag{--no-llm-fallback?}
Flag -->|Yes| MLForced[Force ML result<br/>No LLM call]
Flag -->|No| LLM[LLM Classification<br/>~4 seconds<br/>2099 of 10000 emails<br/>21%]
RuleDone --> Next([Next Email])
MLDone --> Next
MLForced --> Next
LLM --> Next
style LLM fill:#ff6b6b
style MLDone fill:#4ec9b0
style MLForced fill:#ffd93d
</pre>
</div>
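<p>The ML branch above concatenates a dense embedding (all-minilm:l6-v2 produces 384 dimensions) with sparse TF-IDF features before the LightGBM predict. A minimal sketch with numpy; the function name and the TF-IDF vocabulary size are illustrative assumptions:</p>

```python
# Feature vector for the ML path: dense embedding + TF-IDF features.
# Function name and TF-IDF dimensionality are illustrative.
import numpy as np

def build_feature_vector(embedding, tfidf):
    # embedding: 384-dim from all-minilm:l6-v2; tfidf: vocabulary-sized.
    return np.concatenate([np.asarray(embedding, dtype=np.float32),
                           np.asarray(tfidf, dtype=np.float32)])
```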
<h2>4. Model Loading Logic</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([MLClassifier.__init__]) --> CheckPath{model_path provided?}
CheckPath -->|Yes| UsePath[Use provided path]
CheckPath -->|No| Default[Default:<br/>src/models/pretrained/classifier.pkl]
UsePath --> FileCheck{File exists?}
Default --> FileCheck
FileCheck -->|Yes| Load[Load pickle file]
FileCheck -->|No| CreateMock[Create MOCK model<br/>Random Forest<br/>12 hardcoded categories]
Load --> ValidCheck{Valid model data?}
ValidCheck -->|Yes| CheckMock{is_mock flag?}
ValidCheck -->|No| CreateMock
CheckMock -->|True| WarnMock[Warn: MOCK model active]
CheckMock -->|False| RealModel[Real trained model loaded]
CreateMock --> MockWarnings[Multiple warnings printed<br/>NOT for production]
WarnMock --> Ready[Model Ready]
RealModel --> Ready
MockWarnings --> Ready
Ready --> End([Classification can start])
style CreateMock fill:#ff6b6b
style RealModel fill:#4ec9b0
style WarnMock fill:#ffd93d
</pre>
</div>
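<p>The loading logic above reduces to: try the given path, fall back to the default, and only construct a mock model when nothing valid loads. A sketch under those assumptions; the names and the pickled dict layout are illustrative, not the actual <code>MLClassifier</code> implementation:</p>

```python
# Sketch of the model-loading fallback above. Names and the pickled
# dict layout are illustrative, not the real MLClassifier code.
import os
import pickle

DEFAULT_PATH = "src/models/pretrained/classifier.pkl"

def load_model(model_path=None):
    path = model_path or DEFAULT_PATH
    if os.path.exists(path):
        with open(path, "rb") as f:
            data = pickle.load(f)
        if isinstance(data, dict) and "model" in data:
            if data.get("is_mock"):
                print("WARNING: MOCK model active")
            return data
    # No usable file: hardcoded mock model, flagged loudly.
    print("WARNING: creating MOCK model - NOT for production")
    return {"model": None, "is_mock": True, "categories": []}
```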
<h2>5. Flag Conditions & Effects</h2>
<div class="flag-section">
<h3>--no-llm-fallback</h3>
<p><strong>Location:</strong> src/cli.py:46, src/classification/adaptive_classifier.py:152-161</p>
<p><strong>Effect:</strong> When ML confidence &lt; threshold, accept ML result anyway instead of calling LLM</p>
<p><strong>Use case:</strong> Test pure ML performance, avoid LLM costs</p>
<p><strong>Code path:</strong></p>
<code>
if self.disable_llm_fallback:<br/>
&nbsp;&nbsp;# Just return ML result without LLM fallback<br/>
&nbsp;&nbsp;return ClassificationResult(needs_review=False)
</code>
</div>
<div class="flag-section">
<h3>--limit N</h3>
<p><strong>Location:</strong> src/cli.py:38</p>
<p><strong>Effect:</strong> Limits number of emails fetched from source</p>
<p><strong>Calibration trigger:</strong> If N &lt; 1000, forces LLM-only mode (no ML training)</p>
<p><strong>Code path:</strong></p>
<code>
if total_emails &lt; 1000:<br/>
&nbsp;&nbsp;ml_classifier.is_mock = True # Skip ML, use LLM only
</code>
</div>
<div class="flag-section">
<h3>Model Path Override</h3>
<p><strong>Location:</strong> src/classification/ml_classifier.py:43</p>
<p><strong>Default:</strong> src/models/pretrained/classifier.pkl</p>
<p><strong>Calibration saves to:</strong> src/models/calibrated/classifier.pkl</p>
<p><strong>Problem:</strong> Calibration saves to different location than default load location</p>
<p><strong>Solution:</strong> Copy calibrated model to pretrained location OR pass model_path parameter</p>
</div>
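<p>The first workaround above (copying the calibrated model to the default load location) can be done with a few lines of Python. A sketch only; the helper name is illustrative, and passing <code>model_path</code> remains the alternative:</p>

```python
# One way to reconcile the save/load path mismatch described above:
# promote the calibrated model to the default load location.
# Helper name is illustrative.
import shutil
from pathlib import Path

def promote_calibrated_model(
        src="src/models/calibrated/classifier.pkl",
        dst="src/models/pretrained/classifier.pkl"):
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)  # copy2 preserves timestamps
    return dst
```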
<h2>6. Timing Breakdown (10,000 emails)</h2>
<table class="timing-table">
<tr>
<th>Phase</th>
<th>Operation</th>
<th>Time per Email</th>
<th>Total Time (10k)</th>
<th>LLM Required?</th>
</tr>
<tr>
<td rowspan="6"><strong>Calibration</strong><br/>(if model doesn't exist)</td>
<td>Stratified sampling (300 emails)</td>
<td>-</td>
<td>~1 second</td>
<td>No</td>
</tr>
<tr>
<td>LLM category discovery (6 batches)</td>
<td>~0.4 sec/email</td>
<td>~2 minutes</td>
<td>YES</td>
</tr>
<tr>
<td>LLM consolidation</td>
<td>-</td>
<td>~5 seconds</td>
<td>YES</td>
</tr>
<tr>
<td>LLM labeling (300 samples)</td>
<td>~3 sec/email</td>
<td>~15 minutes</td>
<td>YES</td>
</tr>
<tr>
<td>Feature extraction (300 samples)</td>
<td>~0.02 sec/email</td>
<td>~6 seconds</td>
<td>No (embeddings)</td>
</tr>
<tr>
<td>Model training (LightGBM)</td>
<td>-</td>
<td>~5 seconds</td>
<td>No</td>
</tr>
<tr>
<td colspan="3"><strong>CALIBRATION TOTAL</strong></td>
<td><strong>~17-20 minutes</strong></td>
<td><strong>YES</strong></td>
</tr>
<tr>
<td rowspan="5"><strong>Classification</strong><br/>(with model)</td>
<td>Hard rule matching</td>
<td>~0.001 sec</td>
<td>~10 seconds (all 10k)</td>
<td>No</td>
</tr>
<tr>
<td>Embedding generation</td>
<td>~0.02 sec</td>
<td>~200 seconds (all 10k)</td>
<td>No (Ollama embed)</td>
</tr>
<tr>
<td>ML prediction</td>
<td>~0.003 sec</td>
<td>~30 seconds (all 10k)</td>
<td>No</td>
</tr>
<tr>
<td>LLM fallback (21% of emails)</td>
<td>~4 sec/email</td>
<td>~140 minutes (2100 emails)</td>
<td>YES</td>
</tr>
<tr>
<td>Saving results</td>
<td>-</td>
<td>~1 second</td>
<td>No</td>
</tr>
<tr>
<td colspan="3"><strong>CLASSIFICATION TOTAL (with LLM fallback)</strong></td>
<td><strong>~2.5 hours</strong></td>
<td><strong>YES (21%)</strong></td>
</tr>
<tr>
<td colspan="3"><strong>CLASSIFICATION TOTAL (--no-llm-fallback)</strong></td>
<td><strong>~4 minutes</strong></td>
<td><strong>No</strong></td>
</tr>
</table>
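<p>The table's two classification totals can be sanity-checked from the per-email estimates and the 10k-run counts above (59 rule, 7,842 ML, 2,099 LLM):</p>

```python
# Back-of-envelope check of the classification totals in the table,
# using the per-email estimates and 10k-run counts documented above.
rules_s = 10_000 * 0.001   # hard-rule matching, all emails: ~10 s
embed_s = 10_000 * 0.02    # embedding generation, all emails: ~200 s
ml_s    = 10_000 * 0.003   # LightGBM prediction, all emails: ~30 s
llm_s   = 2_100 * 4        # LLM fallback on the ~21% low-confidence tail

ml_only_minutes = (rules_s + embed_s + ml_s) / 60            # ~4 min
with_llm_hours  = (rules_s + embed_s + ml_s + llm_s) / 3600  # ~2.4 h
```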
<h2>7. Why LLM Still Loads</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([CLI startup]) --> Always1[ALWAYS: Load LLM provider<br/>src/cli.py:98-117]
Always1 --> Reason1[Reason: Needed for calibration<br/>if model doesn't exist]
Reason1 --> Check{Model exists?}
Check -->|No| NeedLLM1[LLM required for calibration<br/>Category discovery<br/>Sample labeling]
Check -->|Yes| SkipCal[Skip calibration]
SkipCal --> ClassStart[Start classification]
NeedLLM1 --> DoCalibration[Run calibration<br/>Uses LLM]
DoCalibration --> ClassStart
ClassStart --> Always2[ALWAYS: LLM provider is available<br/>llm.is_available = True]
Always2 --> EmailLoop[For each email...]
EmailLoop --> LowConf{Low confidence?}
LowConf -->|No| NoLLM[No LLM call]
LowConf -->|Yes| FlagCheck{--no-llm-fallback?}
FlagCheck -->|Yes| NoLLMCall[No LLM call<br/>Accept ML result]
FlagCheck -->|No| LLMAvail{llm.is_available?}
LLMAvail -->|Yes| CallLLM[LLM called<br/>src/cli.py:227-228]
LLMAvail -->|No| NoLLMCall
NoLLM --> End([Next email])
NoLLMCall --> End
CallLLM --> End
style Always1 fill:#ffd93d
style Always2 fill:#ffd93d
style CallLLM fill:#ff6b6b
style NoLLMCall fill:#4ec9b0
</pre>
</div>
<h3>Why LLM Provider is Always Initialized:</h3>
<ul>
<li><strong>Line 98-117 (src/cli.py):</strong> LLM provider is created before checking if model exists</li>
<li><strong>Reason:</strong> Need LLM ready in case calibration is required</li>
<li><strong>Result:</strong> Even with --no-llm-fallback, LLM provider loads (but won't be called for classification)</li>
</ul>
<h2>8. Command Scenarios</h2>
<table class="timing-table">
<tr>
<th>Command</th>
<th>Model Exists?</th>
<th>Calibration Runs?</th>
<th>LLM Used for Classification?</th>
<th>Total Time (10k)</th>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000</code></td>
<td>No</td>
<td>YES (~20 min)</td>
<td>YES (~2.5 hours)</td>
<td>~2 hours 50 min</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000</code></td>
<td>Yes</td>
<td>No</td>
<td>YES (~2.5 hours)</td>
<td>~2.5 hours</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000 --no-llm-fallback</code></td>
<td>No</td>
<td>YES (~20 min)</td>
<td>NO</td>
<td>~24 minutes</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000 --no-llm-fallback</code></td>
<td>Yes</td>
<td>No</td>
<td>NO</td>
<td>~4 minutes</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 500</code></td>
<td>Any</td>
<td>No (too few emails)</td>
<td>YES (100% LLM-only)</td>
<td>~35 minutes</td>
</tr>
</table>
<h2>9. Current System State</h2>
<div class="flag-section">
<h3>Model Status</h3>
<ul>
<li><strong>src/models/calibrated/classifier.pkl</strong> - 1.8MB, trained at 02:54, 10 categories</li>
<li><strong>src/models/pretrained/classifier.pkl</strong> - Copy of calibrated model (created manually)</li>
</ul>
</div>
<div class="flag-section">
<h3>Threshold Configuration</h3>
<ul>
<li><strong>config/default_config.yaml:</strong> default_threshold = 0.55</li>
<li><strong>config/categories.yaml:</strong> All category thresholds = 0.55</li>
<li><strong>Effect:</strong> ML must be ≥55% confident to skip LLM</li>
</ul>
</div>
<div class="flag-section">
<h3>Last Run Results (10k emails)</h3>
<ul>
<li><strong>Rules:</strong> 59 emails (0.6%)</li>
<li><strong>ML:</strong> 7,842 emails (78.4%)</li>
<li><strong>LLM fallback:</strong> 2,099 emails (21%)</li>
<li><strong>Accuracy estimate:</strong> 92.7%</li>
</ul>
</div>
<h2>10. To Run ML-Only Test (No LLM Calls During Classification)</h2>
<div class="flag-section">
<h3>Requirements:</h3>
<ol>
<li>Model must exist at <code>src/models/pretrained/classifier.pkl</code> ✓ (done)</li>
<li>Use <code>--no-llm-fallback</code> flag</li>
<li>Ensure sufficient emails (≥1000) to avoid LLM-only mode</li>
</ol>
<h3>Command:</h3>
<code>
python -m src.cli run --source enron --limit 10000 --output ml_only_10k/ --no-llm-fallback
</code>
<h3>Expected Results:</h3>
<ul>
<li><strong>Calibration:</strong> Skipped (model exists)</li>
<li><strong>LLM calls during classification:</strong> 0</li>
<li><strong>Total time:</strong> ~4 minutes</li>
<li><strong>ML acceptance rate:</strong> 100% (all emails classified by ML, even low confidence)</li>
</ul>
</div>
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>