Project Reorganization: - Created docs/ directory and moved all documentation - Created scripts/ directory for shell scripts - Created scripts/experimental/ for research scripts - Updated .gitignore for new structure - Updated README.md with MVP status and new structure New Features: - Category verification system (verify_model_categories) - --verify-categories flag for mailbox compatibility check - --no-llm-fallback flag for pure ML classification - Trained model saved in src/models/calibrated/ Threshold Optimization: - Reduced default threshold from 0.75 to 0.55 - Updated all category thresholds to 0.55 - Reduces LLM fallback rate by 40% (35% -> 21%) Documentation: - SYSTEM_FLOW.html - Complete system architecture - VERIFY_CATEGORIES_FEATURE.html - Feature documentation - LABEL_TRAINING_PHASE_DETAIL.html - Calibration breakdown - FAST_ML_ONLY_WORKFLOW.html - Pure ML guide - PROJECT_STATUS_AND_NEXT_STEPS.html - Roadmap - ROOT_CAUSE_ANALYSIS.md - Bug fixes MVP Status: - 10k emails in 4 minutes, 72.7% accuracy, 0 LLM calls - LLM-driven category discovery working - Embedding-based transfer learning confirmed - All model paths verified and working
494 lines
17 KiB
HTML
494 lines
17 KiB
HTML
<!DOCTYPE html>
|
|
<html lang="en">
|
|
<head>
|
|
<meta charset="UTF-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
|
<title>Email Sorter System Flow</title>
|
|
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
|
|
<style>
|
|
body {
|
|
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
|
|
margin: 20px;
|
|
background: #1e1e1e;
|
|
color: #d4d4d4;
|
|
}
|
|
h1, h2, h3 {
|
|
color: #4ec9b0;
|
|
}
|
|
.diagram {
|
|
background: white;
|
|
padding: 20px;
|
|
margin: 20px 0;
|
|
border-radius: 8px;
|
|
}
|
|
.timing-table {
|
|
width: 100%;
|
|
border-collapse: collapse;
|
|
margin: 20px 0;
|
|
background: #252526;
|
|
}
|
|
.timing-table th {
|
|
background: #37373d;
|
|
padding: 12px;
|
|
text-align: left;
|
|
color: #4ec9b0;
|
|
}
|
|
.timing-table td {
|
|
padding: 10px;
|
|
border-bottom: 1px solid #3e3e42;
|
|
}
|
|
.flag-section {
|
|
background: #252526;
|
|
padding: 15px;
|
|
margin: 10px 0;
|
|
border-left: 4px solid #4ec9b0;
|
|
}
|
|
code {
|
|
background: #1e1e1e;
|
|
padding: 2px 6px;
|
|
border-radius: 3px;
|
|
color: #ce9178;
|
|
}
|
|
</style>
|
|
</head>
|
|
<body>
|
|
<h1>Email Sorter System Flow Documentation</h1>
|
|
|
|
<h2>1. Main Execution Flow</h2>
|
|
<div class="diagram">
|
|
<pre class="mermaid">
|
|
flowchart TD
|
|
Start([python -m src.cli run]) --> LoadConfig[Load config/default_config.yaml]
|
|
LoadConfig --> InitProviders[Initialize Email Provider<br/>Enron/Gmail/IMAP]
|
|
InitProviders --> FetchEmails[Fetch Emails<br/>--limit N]
|
|
|
|
FetchEmails --> CheckSize{Email Count?}
|
|
CheckSize -->|"< 1000"| SetMockMode[Set ml_classifier.is_mock = True<br/>LLM-only mode]
|
|
CheckSize -->|">= 1000"| CheckModel{Model Exists?}
|
|
|
|
CheckModel -->|No model at<br/>src/models/pretrained/classifier.pkl| RunCalibration[CALIBRATION PHASE<br/>LLM category discovery<br/>Train ML model]
|
|
CheckModel -->|Model exists| SkipCalibration[Skip Calibration<br/>Load existing model]
|
|
SetMockMode --> SkipCalibration
|
|
|
|
RunCalibration --> ClassifyPhase[CLASSIFICATION PHASE]
|
|
SkipCalibration --> ClassifyPhase
|
|
|
|
ClassifyPhase --> Loop{For each email}
|
|
Loop --> RuleCheck{Hard rule match?}
|
|
RuleCheck -->|Yes| RuleClassify[Category by rule<br/>confidence=1.0<br/>method='rule']
|
|
RuleCheck -->|No| MLClassify[ML Classification<br/>Get category + confidence]
|
|
|
|
MLClassify --> ConfCheck{Confidence >= threshold?}
|
|
ConfCheck -->|Yes| AcceptML[Accept ML result<br/>method='ml'<br/>needs_review=False]
|
|
ConfCheck -->|No| LowConf[Low confidence detected<br/>needs_review=True]
|
|
|
|
LowConf --> FlagCheck{--no-llm-fallback?}
|
|
FlagCheck -->|Yes| AcceptMLAnyway[Accept ML anyway<br/>needs_review=False]
|
|
FlagCheck -->|No| LLMCheck{LLM available?}
|
|
|
|
LLMCheck -->|Yes| LLMReview[LLM Classification<br/>~4 seconds<br/>method='llm']
|
|
LLMCheck -->|No| AcceptMLAnyway
|
|
|
|
RuleClassify --> NextEmail{More emails?}
|
|
AcceptML --> NextEmail
|
|
AcceptMLAnyway --> NextEmail
|
|
LLMReview --> NextEmail
|
|
|
|
NextEmail -->|Yes| Loop
|
|
NextEmail -->|No| SaveResults[Save results.json]
|
|
SaveResults --> End([Complete])
|
|
|
|
style RunCalibration fill:#ff6b6b
|
|
style LLMReview fill:#ff6b6b
|
|
style SetMockMode fill:#ffd93d
|
|
style FlagCheck fill:#4ec9b0
|
|
style AcceptMLAnyway fill:#4ec9b0
|
|
</pre>
|
|
</div>
|
|
|
|
<h2>2. Calibration Phase Detail (When Triggered)</h2>
|
|
<div class="diagram">
|
|
<pre class="mermaid">
|
|
flowchart TD
|
|
Start([Calibration Triggered]) --> Sample[Stratified Sampling<br/>3% of emails<br/>min 250, max 1500]
|
|
Sample --> LLMBatch[LLM Category Discovery<br/>50 emails per batch]
|
|
|
|
LLMBatch --> Batch1[Batch 1: 50 emails<br/>~20 seconds]
|
|
Batch1 --> Batch2[Batch 2: 50 emails<br/>~20 seconds]
|
|
Batch2 --> BatchN[... N batches<br/>For 300 samples: 6 batches]
|
|
|
|
BatchN --> Consolidate[LLM Consolidation<br/>Merge similar categories<br/>~5 seconds]
|
|
Consolidate --> Categories[Final Categories<br/>~10-12 unique categories]
|
|
|
|
Categories --> Label[Label Training Emails<br/>LLM labels each sample<br/>~3 seconds per email]
|
|
Label --> Extract[Feature Extraction<br/>Embeddings + TF-IDF<br/>~0.02 seconds per email]
|
|
Extract --> Train[Train LightGBM Model<br/>~5 seconds total]
|
|
|
|
Train --> Validate[Validate on 100 samples<br/>~2 seconds]
|
|
Validate --> Save[Save Model<br/>src/models/calibrated/classifier.pkl]
|
|
Save --> End([Calibration Complete<br/>Total time: 15-25 minutes for 10k emails])
|
|
|
|
style LLMBatch fill:#ff6b6b
|
|
style Label fill:#ff6b6b
|
|
style Consolidate fill:#ff6b6b
|
|
style Train fill:#4ec9b0
|
|
</pre>
|
|
</div>
|
|
|
|
<h2>3. Classification Phase Detail</h2>
|
|
<div class="diagram">
|
|
<pre class="mermaid">
|
|
flowchart TD
|
|
Start([Classification Phase]) --> Email[Get Email]
|
|
Email --> Rules{Check Hard Rules<br/>Pattern matching}
|
|
|
|
Rules -->|Match| RuleDone[Rule Match<br/>~0.001 seconds<br/>59 of 10000 emails]
|
|
Rules -->|No match| Embed[Generate Embedding<br/>all-minilm:l6-v2<br/>~0.02 seconds]
|
|
|
|
Embed --> TFIDF[TF-IDF Features<br/>~0.001 seconds]
|
|
TFIDF --> MLPredict[ML Prediction<br/>LightGBM<br/>~0.003 seconds]
|
|
|
|
MLPredict --> Threshold{Confidence >= 0.55?}
|
|
Threshold -->|Yes| MLDone[ML Classification<br/>7842 of 10000 emails<br/>78.4%]
|
|
Threshold -->|No| Flag{--no-llm-fallback?}
|
|
|
|
Flag -->|Yes| MLForced[Force ML result<br/>No LLM call]
|
|
Flag -->|No| LLM[LLM Classification<br/>~4 seconds<br/>2099 of 10000 emails<br/>21%]
|
|
|
|
RuleDone --> Next([Next Email])
|
|
MLDone --> Next
|
|
MLForced --> Next
|
|
LLM --> Next
|
|
|
|
style LLM fill:#ff6b6b
|
|
style MLDone fill:#4ec9b0
|
|
style MLForced fill:#ffd93d
|
|
</pre>
|
|
</div>
|
|
|
|
<h2>4. Model Loading Logic</h2>
|
|
<div class="diagram">
|
|
<pre class="mermaid">
|
|
flowchart TD
|
|
Start([MLClassifier.__init__]) --> CheckPath{model_path provided?}
|
|
CheckPath -->|Yes| UsePath[Use provided path]
|
|
CheckPath -->|No| Default[Default:<br/>src/models/pretrained/classifier.pkl]
|
|
|
|
UsePath --> FileCheck{File exists?}
|
|
Default --> FileCheck
|
|
|
|
FileCheck -->|Yes| Load[Load pickle file]
|
|
FileCheck -->|No| CreateMock[Create MOCK model<br/>Random Forest<br/>12 hardcoded categories]
|
|
|
|
Load --> ValidCheck{Valid model data?}
|
|
ValidCheck -->|Yes| CheckMock{is_mock flag?}
|
|
ValidCheck -->|No| CreateMock
|
|
|
|
CheckMock -->|True| WarnMock[Warn: MOCK model active]
|
|
CheckMock -->|False| RealModel[Real trained model loaded]
|
|
|
|
CreateMock --> MockWarnings[Multiple warnings printed<br/>NOT for production]
|
|
WarnMock --> Ready[Model Ready]
|
|
RealModel --> Ready
|
|
MockWarnings --> Ready
|
|
|
|
Ready --> End([Classification can start])
|
|
|
|
style CreateMock fill:#ff6b6b
|
|
style RealModel fill:#4ec9b0
|
|
style WarnMock fill:#ffd93d
|
|
</pre>
|
|
</div>
|
|
|
|
<h2>5. Flag Conditions & Effects</h2>
|
|
|
|
<div class="flag-section">
|
|
<h3>--no-llm-fallback</h3>
|
|
<p><strong>Location:</strong> src/cli.py:46, src/classification/adaptive_classifier.py:152-161</p>
|
|
<p><strong>Effect:</strong> When ML confidence < threshold, accept ML result anyway instead of calling LLM</p>
|
|
<p><strong>Use case:</strong> Test pure ML performance, avoid LLM costs</p>
|
|
<p><strong>Code path:</strong></p>
|
|
<code>
|
|
if self.disable_llm_fallback:<br/>
|
|
# Just return ML result without LLM fallback<br/>
|
|
return ClassificationResult(needs_review=False)
|
|
</code>
|
|
</div>
|
|
|
|
<div class="flag-section">
|
|
<h3>--limit N</h3>
|
|
<p><strong>Location:</strong> src/cli.py:38</p>
|
|
<p><strong>Effect:</strong> Limits number of emails fetched from source</p>
|
|
<p><strong>Calibration trigger:</strong> If N < 1000, forces LLM-only mode (no ML training)</p>
|
|
<p><strong>Code path:</strong></p>
|
|
<code>
|
|
if total_emails < 1000:<br/>
|
|
ml_classifier.is_mock = True # Skip ML, use LLM only
|
|
</code>
|
|
</div>
|
|
|
|
<div class="flag-section">
|
|
<h3>Model Path Override</h3>
|
|
<p><strong>Location:</strong> src/classification/ml_classifier.py:43</p>
|
|
<p><strong>Default:</strong> src/models/pretrained/classifier.pkl</p>
|
|
<p><strong>Calibration saves to:</strong> src/models/calibrated/classifier.pkl</p>
|
|
<p><strong>Problem:</strong> Calibration saves to different location than default load location</p>
|
|
<p><strong>Solution:</strong> Copy calibrated model to pretrained location OR pass model_path parameter</p>
|
|
</div>
|
|
|
|
<h2>6. Timing Breakdown (10,000 emails)</h2>
|
|
|
|
<table class="timing-table">
|
|
<tr>
|
|
<th>Phase</th>
|
|
<th>Operation</th>
|
|
<th>Time per Email</th>
|
|
<th>Total Time (10k)</th>
|
|
<th>LLM Required?</th>
|
|
</tr>
|
|
<tr>
|
|
<td rowspan="6"><strong>Calibration</strong><br/>(if model doesn't exist)</td>
|
|
<td>Stratified sampling (300 emails)</td>
|
|
<td>-</td>
|
|
<td>~1 second</td>
|
|
<td>No</td>
|
|
</tr>
|
|
<tr>
|
|
<td>LLM category discovery (6 batches)</td>
|
|
<td>~0.4 sec/email</td>
|
|
<td>~2 minutes</td>
|
|
<td>YES</td>
|
|
</tr>
|
|
<tr>
|
|
<td>LLM consolidation</td>
|
|
<td>-</td>
|
|
<td>~5 seconds</td>
|
|
<td>YES</td>
|
|
</tr>
|
|
<tr>
|
|
<td>LLM labeling (300 samples)</td>
|
|
<td>~3 sec/email</td>
|
|
<td>~15 minutes</td>
|
|
<td>YES</td>
|
|
</tr>
|
|
<tr>
|
|
<td>Feature extraction (300 samples)</td>
|
|
<td>~0.02 sec/email</td>
|
|
<td>~6 seconds</td>
|
|
<td>No (embeddings)</td>
|
|
</tr>
|
|
<tr>
|
|
<td>Model training (LightGBM)</td>
|
|
<td>-</td>
|
|
<td>~5 seconds</td>
|
|
<td>No</td>
|
|
</tr>
|
|
<tr>
|
|
<td colspan="3"><strong>CALIBRATION TOTAL</strong></td>
|
|
<td><strong>~17-20 minutes</strong></td>
|
|
<td><strong>YES</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td rowspan="5"><strong>Classification</strong><br/>(with model)</td>
|
|
<td>Hard rule matching</td>
|
|
<td>~0.001 sec</td>
|
|
<td>~10 seconds (all 10k)</td>
|
|
<td>No</td>
|
|
</tr>
|
|
<tr>
|
|
<td>Embedding generation</td>
|
|
<td>~0.02 sec</td>
|
|
<td>~200 seconds (all 10k)</td>
|
|
<td>No (Ollama embed)</td>
|
|
</tr>
|
|
<tr>
|
|
<td>ML prediction</td>
|
|
<td>~0.003 sec</td>
|
|
<td>~30 seconds (all 10k)</td>
|
|
<td>No</td>
|
|
</tr>
|
|
<tr>
|
|
<td>LLM fallback (21% of emails)</td>
|
|
<td>~4 sec/email</td>
|
|
<td>~140 minutes (2100 emails)</td>
|
|
<td>YES</td>
|
|
</tr>
|
|
<tr>
|
|
<td>Saving results</td>
|
|
<td>-</td>
|
|
<td>~1 second</td>
|
|
<td>No</td>
|
|
</tr>
|
|
<tr>
|
|
<td colspan="3"><strong>CLASSIFICATION TOTAL (with LLM fallback)</strong></td>
|
|
<td><strong>~2.5 hours</strong></td>
|
|
<td><strong>YES (21%)</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td colspan="3"><strong>CLASSIFICATION TOTAL (--no-llm-fallback)</strong></td>
|
|
<td><strong>~4 minutes</strong></td>
|
|
<td><strong>No</strong></td>
|
|
</tr>
|
|
</table>
|
|
|
|
<h2>7. Why LLM Still Loads</h2>
|
|
|
|
<div class="diagram">
|
|
<pre class="mermaid">
|
|
flowchart TD
|
|
Start([CLI startup]) --> Always1[ALWAYS: Load LLM provider<br/>src/cli.py:98-117]
|
|
Always1 --> Reason1[Reason: Needed for calibration<br/>if model doesn't exist]
|
|
|
|
Reason1 --> Check{Model exists?}
|
|
Check -->|No| NeedLLM1[LLM required for calibration<br/>Category discovery<br/>Sample labeling]
|
|
Check -->|Yes| SkipCal[Skip calibration]
|
|
|
|
SkipCal --> ClassStart[Start classification]
|
|
NeedLLM1 --> DoCalibration[Run calibration<br/>Uses LLM]
|
|
DoCalibration --> ClassStart
|
|
|
|
ClassStart --> Always2[ALWAYS: LLM provider is available<br/>llm.is_available = True]
|
|
Always2 --> EmailLoop[For each email...]
|
|
|
|
EmailLoop --> LowConf{Low confidence?}
|
|
LowConf -->|No| NoLLM[No LLM call]
|
|
LowConf -->|Yes| FlagCheck{--no-llm-fallback?}
|
|
|
|
FlagCheck -->|Yes| NoLLMCall[No LLM call<br/>Accept ML result]
|
|
FlagCheck -->|No| LLMAvail{llm.is_available?}
|
|
|
|
LLMAvail -->|Yes| CallLLM[LLM called<br/>src/cli.py:227-228]
|
|
LLMAvail -->|No| NoLLMCall
|
|
|
|
NoLLM --> End([Next email])
|
|
NoLLMCall --> End
|
|
CallLLM --> End
|
|
|
|
style Always1 fill:#ffd93d
|
|
style Always2 fill:#ffd93d
|
|
style CallLLM fill:#ff6b6b
|
|
style NoLLMCall fill:#4ec9b0
|
|
</pre>
|
|
</div>
|
|
|
|
<h3>Why LLM Provider is Always Initialized:</h3>
|
|
<ul>
|
|
<li><strong>Line 98-117 (src/cli.py):</strong> LLM provider is created before checking if model exists</li>
|
|
<li><strong>Reason:</strong> Need LLM ready in case calibration is required</li>
|
|
<li><strong>Result:</strong> Even with --no-llm-fallback, LLM provider loads (but won't be called for classification)</li>
|
|
</ul>
|
|
|
|
<h2>8. Command Scenarios</h2>
|
|
|
|
<table class="timing-table">
|
|
<tr>
|
|
<th>Command</th>
|
|
<th>Model Exists?</th>
|
|
<th>Calibration Runs?</th>
|
|
<th>LLM Used for Classification?</th>
|
|
<th>Total Time (10k)</th>
|
|
</tr>
|
|
<tr>
|
|
<td><code>python -m src.cli run --source enron --limit 10000</code></td>
|
|
<td>No</td>
|
|
<td>YES (~20 min)</td>
|
|
<td>YES (~2.5 hours)</td>
|
|
<td>~2 hours 50 min</td>
|
|
</tr>
|
|
<tr>
|
|
<td><code>python -m src.cli run --source enron --limit 10000</code></td>
|
|
<td>Yes</td>
|
|
<td>No</td>
|
|
<td>YES (~2.5 hours)</td>
|
|
<td>~2.5 hours</td>
|
|
</tr>
|
|
<tr>
|
|
<td><code>python -m src.cli run --source enron --limit 10000 --no-llm-fallback</code></td>
|
|
<td>No</td>
|
|
<td>YES (~20 min)</td>
|
|
<td>NO</td>
|
|
<td>~24 minutes</td>
|
|
</tr>
|
|
<tr>
|
|
<td><code>python -m src.cli run --source enron --limit 10000 --no-llm-fallback</code></td>
|
|
<td>Yes</td>
|
|
<td>No</td>
|
|
<td>NO</td>
|
|
<td>~4 minutes</td>
|
|
</tr>
|
|
<tr>
|
|
<td><code>python -m src.cli run --source enron --limit 500</code></td>
|
|
<td>Any</td>
|
|
<td>No (too few emails)</td>
|
|
<td>YES (100% LLM-only)</td>
|
|
<td>~35 minutes</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<h2>9. Current System State</h2>
|
|
|
|
<div class="flag-section">
|
|
<h3>Model Status</h3>
|
|
<ul>
|
|
<li><strong>src/models/calibrated/classifier.pkl</strong> - 1.8MB, trained at 02:54, 10 categories</li>
|
|
<li><strong>src/models/pretrained/classifier.pkl</strong> - Copy of calibrated model (created manually)</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<div class="flag-section">
|
|
<h3>Threshold Configuration</h3>
|
|
<ul>
|
|
<li><strong>config/default_config.yaml:</strong> default_threshold = 0.55</li>
|
|
<li><strong>config/categories.yaml:</strong> All category thresholds = 0.55</li>
|
|
<li><strong>Effect:</strong> ML must be ≥55% confident to skip LLM</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<div class="flag-section">
|
|
<h3>Last Run Results (10k emails)</h3>
|
|
<ul>
|
|
<li><strong>Rules:</strong> 59 emails (0.6%)</li>
|
|
<li><strong>ML:</strong> 7,842 emails (78.4%)</li>
|
|
<li><strong>LLM fallback:</strong> 2,099 emails (21%)</li>
|
|
<li><strong>Accuracy estimate:</strong> 92.7%</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<h2>10. To Run ML-Only Test (No LLM Calls During Classification)</h2>
|
|
|
|
<div class="flag-section">
|
|
<h3>Requirements:</h3>
|
|
<ol>
|
|
<li>Model must exist at <code>src/models/pretrained/classifier.pkl</code> ✓ (done)</li>
|
|
<li>Use <code>--no-llm-fallback</code> flag</li>
|
|
<li>Ensure sufficient emails (≥1000) to avoid LLM-only mode</li>
|
|
</ol>
|
|
|
|
<h3>Command:</h3>
|
|
<code>
|
|
python -m src.cli run --source enron --limit 10000 --output ml_only_10k/ --no-llm-fallback
|
|
</code>
|
|
|
|
<h3>Expected Results:</h3>
|
|
<ul>
|
|
<li><strong>Calibration:</strong> Skipped (model exists)</li>
|
|
<li><strong>LLM calls during classification:</strong> 0</li>
|
|
<li><strong>Total time:</strong> ~4 minutes</li>
|
|
<li><strong>ML acceptance rate:</strong> 100% (all emails classified by ML, even low confidence)</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<script>
|
|
mermaid.initialize({
|
|
startOnLoad: true,
|
|
theme: 'default',
|
|
flowchart: {
|
|
useMaxWidth: true,
|
|
htmlLabels: true,
|
|
curve: 'basis'
|
|
}
|
|
});
|
|
</script>
|
|
</body>
|
|
</html>
|