email-sorter/docs/SYSTEM_FLOW.html
FSSCoding 53174a34eb Organize project structure and add MVP features
Project Reorganization:
- Created docs/ directory and moved all documentation
- Created scripts/ directory for shell scripts
- Created scripts/experimental/ for research scripts
- Updated .gitignore for new structure
- Updated README.md with MVP status and new structure

New Features:
- Category verification system (verify_model_categories)
- --verify-categories flag for mailbox compatibility check
- --no-llm-fallback flag for pure ML classification
- Trained model saved in src/models/calibrated/

Threshold Optimization:
- Reduced default threshold from 0.75 to 0.55
- Updated all category thresholds to 0.55
- Reduces LLM fallback rate by 40% (35% -> 21%)

Documentation:
- SYSTEM_FLOW.html - Complete system architecture
- VERIFY_CATEGORIES_FEATURE.html - Feature documentation
- LABEL_TRAINING_PHASE_DETAIL.html - Calibration breakdown
- FAST_ML_ONLY_WORKFLOW.html - Pure ML guide
- PROJECT_STATUS_AND_NEXT_STEPS.html - Roadmap
- ROOT_CAUSE_ANALYSIS.md - Bug fixes

MVP Status:
- 10k emails in 4 minutes, 72.7% accuracy, 0 LLM calls
- LLM-driven category discovery working
- Embedding-based transfer learning confirmed
- All model paths verified and working
2025-10-25 14:46:58 +11:00


<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Email Sorter System Flow</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.timing-table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
.timing-table th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
.timing-table td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
.flag-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
</style>
</head>
<body>
<h1>Email Sorter System Flow Documentation</h1>
<h2>1. Main Execution Flow</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([python -m src.cli run]) --> LoadConfig[Load config/default_config.yaml]
LoadConfig --> InitProviders[Initialize Email Provider<br/>Enron/Gmail/IMAP]
InitProviders --> FetchEmails[Fetch Emails<br/>--limit N]
FetchEmails --> CheckSize{Email Count?}
CheckSize -->|"< 1000"| SetMockMode[Set ml_classifier.is_mock = True<br/>LLM-only mode]
CheckSize -->|">= 1000"| CheckModel{Model Exists?}
CheckModel -->|No model at<br/>src/models/pretrained/classifier.pkl| RunCalibration[CALIBRATION PHASE<br/>LLM category discovery<br/>Train ML model]
CheckModel -->|Model exists| SkipCalibration[Skip Calibration<br/>Load existing model]
SetMockMode --> SkipCalibration
RunCalibration --> ClassifyPhase[CLASSIFICATION PHASE]
SkipCalibration --> ClassifyPhase
ClassifyPhase --> Loop{For each email}
Loop --> RuleCheck{Hard rule match?}
RuleCheck -->|Yes| RuleClassify[Category by rule<br/>confidence=1.0<br/>method='rule']
RuleCheck -->|No| MLClassify[ML Classification<br/>Get category + confidence]
MLClassify --> ConfCheck{Confidence >= threshold?}
ConfCheck -->|Yes| AcceptML[Accept ML result<br/>method='ml'<br/>needs_review=False]
ConfCheck -->|No| LowConf[Low confidence detected<br/>needs_review=True]
LowConf --> FlagCheck{--no-llm-fallback?}
FlagCheck -->|Yes| AcceptMLAnyway[Accept ML anyway<br/>needs_review=False]
FlagCheck -->|No| LLMCheck{LLM available?}
LLMCheck -->|Yes| LLMReview[LLM Classification<br/>~4 seconds<br/>method='llm']
LLMCheck -->|No| AcceptMLAnyway
RuleClassify --> NextEmail{More emails?}
AcceptML --> NextEmail
AcceptMLAnyway --> NextEmail
LLMReview --> NextEmail
NextEmail -->|Yes| Loop
NextEmail -->|No| SaveResults[Save results.json]
SaveResults --> End([Complete])
style RunCalibration fill:#ff6b6b
style LLMReview fill:#ff6b6b
style SetMockMode fill:#ffd93d
style FlagCheck fill:#4ec9b0
style AcceptMLAnyway fill:#4ec9b0
</pre>
</div>
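<p>The per-email decision path above can be condensed into a short sketch. This is an illustrative reconstruction, not the project's actual API: the names <code>classify_email</code>, <code>Result</code>, and the callable parameters are hypothetical, and the LLM branch's confidence handling is assumed.</p>

```python
# Sketch of the rule -> ML -> threshold -> LLM decision path above.
# All names here are illustrative, not the project's real API.
from dataclasses import dataclass

@dataclass
class Result:
    category: str
    confidence: float
    method: str           # 'rule', 'ml', or 'llm'
    needs_review: bool

def classify_email(email, rules, ml_predict, llm_classify=None,
                   threshold=0.55, no_llm_fallback=False):
    # 1. Hard rules win outright with confidence 1.0.
    for pattern, category in rules:
        if pattern in email:
            return Result(category, 1.0, 'rule', False)
    # 2. ML prediction; accept if confident enough.
    category, confidence = ml_predict(email)
    if confidence >= threshold:
        return Result(category, confidence, 'ml', False)
    # 3. Low confidence: force the ML result, or fall back to the LLM.
    if no_llm_fallback or llm_classify is None:
        return Result(category, confidence, 'ml', False)
    return Result(llm_classify(email), 1.0, 'llm', False)
```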
<h2>2. Calibration Phase Detail (When Triggered)</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Calibration Triggered]) --> Sample[Stratified Sampling<br/>3% of emails<br/>min 250, max 1500]
Sample --> LLMBatch[LLM Category Discovery<br/>50 emails per batch]
LLMBatch --> Batch1[Batch 1: 50 emails<br/>~20 seconds]
Batch1 --> Batch2[Batch 2: 50 emails<br/>~20 seconds]
Batch2 --> BatchN[... N batches<br/>For 300 samples: 6 batches]
BatchN --> Consolidate[LLM Consolidation<br/>Merge similar categories<br/>~5 seconds]
Consolidate --> Categories[Final Categories<br/>~10-12 unique categories]
Categories --> Label[Label Training Emails<br/>LLM labels each sample<br/>~3 seconds per email]
Label --> Extract[Feature Extraction<br/>Embeddings + TF-IDF<br/>~0.02 seconds per email]
Extract --> Train[Train LightGBM Model<br/>~5 seconds total]
Train --> Validate[Validate on 100 samples<br/>~2 seconds]
Validate --> Save[Save Model<br/>src/models/calibrated/classifier.pkl]
Save --> End([Calibration Complete<br/>Total time: ~17-20 minutes for 10k emails])
style LLMBatch fill:#ff6b6b
style Label fill:#ff6b6b
style Consolidate fill:#ff6b6b
style Train fill:#4ec9b0
</pre>
</div>
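<p>The sampling step above (3% of the mailbox, clamped between 250 and 1500) and the batch count follow directly from the numbers in the diagram. A minimal sketch; the helper name is illustrative:</p>

```python
# Stratified-sampling size: 3% of the mailbox, clamped to [250, 1500].
# Helper name is illustrative, not the project's actual function.
def calibration_sample_size(total_emails, rate=0.03, lo=250, hi=1500):
    return max(lo, min(hi, int(total_emails * rate)))

def discovery_batches(sample_size, batch_size=50):
    # Ceiling division: 300 samples -> 6 batches of 50.
    return -(-sample_size // batch_size)
```

For a 10k-email mailbox this yields 300 samples and 6 LLM discovery batches, matching the diagram.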
<h2>3. Classification Phase Detail</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Classification Phase]) --> Email[Get Email]
Email --> Rules{Check Hard Rules<br/>Pattern matching}
Rules -->|Match| RuleDone[Rule Match<br/>~0.001 seconds<br/>59 of 10000 emails]
Rules -->|No match| Embed[Generate Embedding<br/>all-minilm:l6-v2<br/>~0.02 seconds]
Embed --> TFIDF[TF-IDF Features<br/>~0.001 seconds]
TFIDF --> MLPredict[ML Prediction<br/>LightGBM<br/>~0.003 seconds]
MLPredict --> Threshold{Confidence >= 0.55?}
Threshold -->|Yes| MLDone[ML Classification<br/>7842 of 10000 emails<br/>78.4%]
Threshold -->|No| Flag{--no-llm-fallback?}
Flag -->|Yes| MLForced[Force ML result<br/>No LLM call]
Flag -->|No| LLM[LLM Classification<br/>~4 seconds<br/>2099 of 10000 emails<br/>21%]
RuleDone --> Next([Next Email])
MLDone --> Next
MLForced --> Next
LLM --> Next
style LLM fill:#ff6b6b
style MLDone fill:#4ec9b0
style MLForced fill:#ffd93d
</pre>
</div>
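<p>The ML branch above concatenates a dense embedding (all-minilm:l6-v2 produces 384 dimensions) with sparse TF-IDF features before the LightGBM predict. A minimal sketch with numpy; the function name and the TF-IDF vocabulary size are illustrative assumptions:</p>

```python
# Feature vector for the ML path: dense embedding + TF-IDF features.
# Function name and TF-IDF dimensionality are illustrative.
import numpy as np

def build_feature_vector(embedding, tfidf):
    # embedding: 384-dim from all-minilm:l6-v2; tfidf: vocabulary-sized.
    return np.concatenate([np.asarray(embedding, dtype=np.float32),
                           np.asarray(tfidf, dtype=np.float32)])
```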
<h2>4. Model Loading Logic</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([MLClassifier.__init__]) --> CheckPath{model_path provided?}
CheckPath -->|Yes| UsePath[Use provided path]
CheckPath -->|No| Default[Default:<br/>src/models/pretrained/classifier.pkl]
UsePath --> FileCheck{File exists?}
Default --> FileCheck
FileCheck -->|Yes| Load[Load pickle file]
FileCheck -->|No| CreateMock[Create MOCK model<br/>Random Forest<br/>12 hardcoded categories]
Load --> ValidCheck{Valid model data?}
ValidCheck -->|Yes| CheckMock{is_mock flag?}
ValidCheck -->|No| CreateMock
CheckMock -->|True| WarnMock[Warn: MOCK model active]
CheckMock -->|False| RealModel[Real trained model loaded]
CreateMock --> MockWarnings[Multiple warnings printed<br/>NOT for production]
WarnMock --> Ready[Model Ready]
RealModel --> Ready
MockWarnings --> Ready
Ready --> End([Classification can start])
style CreateMock fill:#ff6b6b
style RealModel fill:#4ec9b0
style WarnMock fill:#ffd93d
</pre>
</div>
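<p>The loading logic above reduces to: try the given path, fall back to the default, and only construct a mock model when nothing valid loads. A sketch under those assumptions; the names and the pickled dict layout are illustrative, not the actual <code>MLClassifier</code> implementation:</p>

```python
# Sketch of the model-loading fallback above. Names and the pickled
# dict layout are illustrative, not the real MLClassifier code.
import os
import pickle

DEFAULT_PATH = "src/models/pretrained/classifier.pkl"

def load_model(model_path=None):
    path = model_path or DEFAULT_PATH
    if os.path.exists(path):
        with open(path, "rb") as f:
            data = pickle.load(f)
        if isinstance(data, dict) and "model" in data:
            if data.get("is_mock"):
                print("WARNING: MOCK model active")
            return data
    # No usable file: hardcoded mock model, flagged loudly.
    print("WARNING: creating MOCK model - NOT for production")
    return {"model": None, "is_mock": True, "categories": []}
```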
<h2>5. Flag Conditions & Effects</h2>
<div class="flag-section">
<h3>--no-llm-fallback</h3>
<p><strong>Location:</strong> src/cli.py:46, src/classification/adaptive_classifier.py:152-161</p>
<p><strong>Effect:</strong> When ML confidence &lt; threshold, accept ML result anyway instead of calling LLM</p>
<p><strong>Use case:</strong> Test pure ML performance, avoid LLM costs</p>
<p><strong>Code path:</strong></p>
<code>
if self.disable_llm_fallback:<br/>
&nbsp;&nbsp;# Just return ML result without LLM fallback<br/>
&nbsp;&nbsp;return ClassificationResult(needs_review=False)
</code>
</div>
<div class="flag-section">
<h3>--limit N</h3>
<p><strong>Location:</strong> src/cli.py:38</p>
<p><strong>Effect:</strong> Limits number of emails fetched from source</p>
<p><strong>Calibration trigger:</strong> If N &lt; 1000, forces LLM-only mode (no ML training)</p>
<p><strong>Code path:</strong></p>
<code>
if total_emails &lt; 1000:<br/>
&nbsp;&nbsp;ml_classifier.is_mock = True # Skip ML, use LLM only
</code>
</div>
<div class="flag-section">
<h3>Model Path Override</h3>
<p><strong>Location:</strong> src/classification/ml_classifier.py:43</p>
<p><strong>Default:</strong> src/models/pretrained/classifier.pkl</p>
<p><strong>Calibration saves to:</strong> src/models/calibrated/classifier.pkl</p>
<p><strong>Problem:</strong> Calibration saves to different location than default load location</p>
<p><strong>Solution:</strong> Copy calibrated model to pretrained location OR pass model_path parameter</p>
</div>
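<p>The first workaround above (copying the calibrated model to the default load location) can be done with a few lines of Python. A sketch only; the helper name is illustrative, and passing <code>model_path</code> remains the alternative:</p>

```python
# One way to reconcile the save/load path mismatch described above:
# promote the calibrated model to the default load location.
# Helper name is illustrative.
import shutil
from pathlib import Path

def promote_calibrated_model(
        src="src/models/calibrated/classifier.pkl",
        dst="src/models/pretrained/classifier.pkl"):
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)  # copy2 preserves timestamps
    return dst
```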
<h2>6. Timing Breakdown (10,000 emails)</h2>
<table class="timing-table">
<tr>
<th>Phase</th>
<th>Operation</th>
<th>Time per Email</th>
<th>Total Time (10k)</th>
<th>LLM Required?</th>
</tr>
<tr>
<td rowspan="6"><strong>Calibration</strong><br/>(if model doesn't exist)</td>
<td>Stratified sampling (300 emails)</td>
<td>-</td>
<td>~1 second</td>
<td>No</td>
</tr>
<tr>
<td>LLM category discovery (6 batches)</td>
<td>~0.4 sec/email</td>
<td>~2 minutes</td>
<td>YES</td>
</tr>
<tr>
<td>LLM consolidation</td>
<td>-</td>
<td>~5 seconds</td>
<td>YES</td>
</tr>
<tr>
<td>LLM labeling (300 samples)</td>
<td>~3 sec/email</td>
<td>~15 minutes</td>
<td>YES</td>
</tr>
<tr>
<td>Feature extraction (300 samples)</td>
<td>~0.02 sec/email</td>
<td>~6 seconds</td>
<td>No (embeddings)</td>
</tr>
<tr>
<td>Model training (LightGBM)</td>
<td>-</td>
<td>~5 seconds</td>
<td>No</td>
</tr>
<tr>
<td colspan="3"><strong>CALIBRATION TOTAL</strong></td>
<td><strong>~17-20 minutes</strong></td>
<td><strong>YES</strong></td>
</tr>
<tr>
<td rowspan="5"><strong>Classification</strong><br/>(with model)</td>
<td>Hard rule matching</td>
<td>~0.001 sec</td>
<td>~10 seconds (all 10k)</td>
<td>No</td>
</tr>
<tr>
<td>Embedding generation</td>
<td>~0.02 sec</td>
<td>~200 seconds (all 10k)</td>
<td>No (Ollama embed)</td>
</tr>
<tr>
<td>ML prediction</td>
<td>~0.003 sec</td>
<td>~30 seconds (all 10k)</td>
<td>No</td>
</tr>
<tr>
<td>LLM fallback (21% of emails)</td>
<td>~4 sec/email</td>
<td>~140 minutes (2100 emails)</td>
<td>YES</td>
</tr>
<tr>
<td>Saving results</td>
<td>-</td>
<td>~1 second</td>
<td>No</td>
</tr>
<tr>
<td colspan="3"><strong>CLASSIFICATION TOTAL (with LLM fallback)</strong></td>
<td><strong>~2.5 hours</strong></td>
<td><strong>YES (21%)</strong></td>
</tr>
<tr>
<td colspan="3"><strong>CLASSIFICATION TOTAL (--no-llm-fallback)</strong></td>
<td><strong>~4 minutes</strong></td>
<td><strong>No</strong></td>
</tr>
</table>
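<p>The table's two classification totals can be sanity-checked from the per-email estimates and the 10k-run counts above (59 rule, 7,842 ML, 2,099 LLM):</p>

```python
# Back-of-envelope check of the classification totals in the table,
# using the per-email estimates and 10k-run counts documented above.
rules_s = 10_000 * 0.001   # hard-rule matching, all emails: ~10 s
embed_s = 10_000 * 0.02    # embedding generation, all emails: ~200 s
ml_s    = 10_000 * 0.003   # LightGBM prediction, all emails: ~30 s
llm_s   = 2_100 * 4        # LLM fallback on the ~21% low-confidence tail

ml_only_minutes = (rules_s + embed_s + ml_s) / 60            # ~4 min
with_llm_hours  = (rules_s + embed_s + ml_s + llm_s) / 3600  # ~2.4 h
```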
<h2>7. Why LLM Still Loads</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([CLI startup]) --> Always1[ALWAYS: Load LLM provider<br/>src/cli.py:98-117]
Always1 --> Reason1[Reason: Needed for calibration<br/>if model doesn't exist]
Reason1 --> Check{Model exists?}
Check -->|No| NeedLLM1[LLM required for calibration<br/>Category discovery<br/>Sample labeling]
Check -->|Yes| SkipCal[Skip calibration]
SkipCal --> ClassStart[Start classification]
NeedLLM1 --> DoCalibration[Run calibration<br/>Uses LLM]
DoCalibration --> ClassStart
ClassStart --> Always2[ALWAYS: LLM provider is available<br/>llm.is_available = True]
Always2 --> EmailLoop[For each email...]
EmailLoop --> LowConf{Low confidence?}
LowConf -->|No| NoLLM[No LLM call]
LowConf -->|Yes| FlagCheck{--no-llm-fallback?}
FlagCheck -->|Yes| NoLLMCall[No LLM call<br/>Accept ML result]
FlagCheck -->|No| LLMAvail{llm.is_available?}
LLMAvail -->|Yes| CallLLM[LLM called<br/>src/cli.py:227-228]
LLMAvail -->|No| NoLLMCall
NoLLM --> End([Next email])
NoLLMCall --> End
CallLLM --> End
style Always1 fill:#ffd93d
style Always2 fill:#ffd93d
style CallLLM fill:#ff6b6b
style NoLLMCall fill:#4ec9b0
</pre>
</div>
<h3>Why LLM Provider is Always Initialized:</h3>
<ul>
<li><strong>Line 98-117 (src/cli.py):</strong> LLM provider is created before checking if model exists</li>
<li><strong>Reason:</strong> Need LLM ready in case calibration is required</li>
<li><strong>Result:</strong> Even with --no-llm-fallback, LLM provider loads (but won't be called for classification)</li>
</ul>
<h2>8. Command Scenarios</h2>
<table class="timing-table">
<tr>
<th>Command</th>
<th>Model Exists?</th>
<th>Calibration Runs?</th>
<th>LLM Used for Classification?</th>
<th>Total Time (10k)</th>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000</code></td>
<td>No</td>
<td>YES (~20 min)</td>
<td>YES (~2.5 hours)</td>
<td>~2 hours 50 min</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000</code></td>
<td>Yes</td>
<td>No</td>
<td>YES (~2.5 hours)</td>
<td>~2.5 hours</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000 --no-llm-fallback</code></td>
<td>No</td>
<td>YES (~20 min)</td>
<td>NO</td>
<td>~24 minutes</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000 --no-llm-fallback</code></td>
<td>Yes</td>
<td>No</td>
<td>NO</td>
<td>~4 minutes</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 500</code></td>
<td>Any</td>
<td>No (too few emails)</td>
<td>YES (100% LLM-only)</td>
<td>~35 minutes</td>
</tr>
</table>
<h2>9. Current System State</h2>
<div class="flag-section">
<h3>Model Status</h3>
<ul>
<li><strong>src/models/calibrated/classifier.pkl</strong> - 1.8MB, trained at 02:54, 10 categories</li>
<li><strong>src/models/pretrained/classifier.pkl</strong> - Copy of calibrated model (created manually)</li>
</ul>
</div>
<div class="flag-section">
<h3>Threshold Configuration</h3>
<ul>
<li><strong>config/default_config.yaml:</strong> default_threshold = 0.55</li>
<li><strong>config/categories.yaml:</strong> All category thresholds = 0.55</li>
<li><strong>Effect:</strong> ML must be ≥55% confident to skip LLM</li>
</ul>
</div>
<div class="flag-section">
<h3>Last Run Results (10k emails)</h3>
<ul>
<li><strong>Rules:</strong> 59 emails (0.6%)</li>
<li><strong>ML:</strong> 7,842 emails (78.4%)</li>
<li><strong>LLM fallback:</strong> 2,099 emails (21%)</li>
<li><strong>Accuracy estimate:</strong> 92.7%</li>
</ul>
</div>
<h2>10. To Run ML-Only Test (No LLM Calls During Classification)</h2>
<div class="flag-section">
<h3>Requirements:</h3>
<ol>
<li>Model must exist at <code>src/models/pretrained/classifier.pkl</code> ✓ (done)</li>
<li>Use <code>--no-llm-fallback</code> flag</li>
<li>Ensure sufficient emails (≥1000) to avoid LLM-only mode</li>
</ol>
<h3>Command:</h3>
<code>
python -m src.cli run --source enron --limit 10000 --output ml_only_10k/ --no-llm-fallback
</code>
<h3>Expected Results:</h3>
<ul>
<li><strong>Calibration:</strong> Skipped (model exists)</li>
<li><strong>LLM calls during classification:</strong> 0</li>
<li><strong>Total time:</strong> ~4 minutes</li>
<li><strong>ML acceptance rate:</strong> 100% (all emails classified by ML, even low confidence)</li>
</ul>
</div>
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>