email-sorter/docs/LABEL_TRAINING_PHASE_DETAIL.html
FSSCoding 53174a34eb Organize project structure and add MVP features
Project Reorganization:
- Created docs/ directory and moved all documentation
- Created scripts/ directory for shell scripts
- Created scripts/experimental/ for research scripts
- Updated .gitignore for new structure
- Updated README.md with MVP status and new structure

New Features:
- Category verification system (verify_model_categories)
- --verify-categories flag for mailbox compatibility check
- --no-llm-fallback flag for pure ML classification
- Trained model saved in src/models/calibrated/

Threshold Optimization:
- Reduced default threshold from 0.75 to 0.55
- Updated all category thresholds to 0.55
- Reduces LLM fallback rate by 40% (35% -> 21%)

Documentation:
- SYSTEM_FLOW.html - Complete system architecture
- VERIFY_CATEGORIES_FEATURE.html - Feature documentation
- LABEL_TRAINING_PHASE_DETAIL.html - Calibration breakdown
- FAST_ML_ONLY_WORKFLOW.html - Pure ML guide
- PROJECT_STATUS_AND_NEXT_STEPS.html - Roadmap
- ROOT_CAUSE_ANALYSIS.md - Bug fixes

MVP Status:
- 10k emails in 4 minutes, 72.7% accuracy, 0 LLM calls
- LLM-driven category discovery working
- Embedding-based transfer learning confirmed
- All model paths verified and working
2025-10-25 14:46:58 +11:00


<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Label Training Phase - Detailed Analysis</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.timing-table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
.timing-table th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
.timing-table td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
.code-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
font-family: 'Courier New', monospace;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.warning {
background: #3e2a00;
border-left: 4px solid #ffd93d;
padding: 15px;
margin: 10px 0;
}
.critical {
background: #3e0000;
border-left: 4px solid #ff6b6b;
padding: 15px;
margin: 10px 0;
}
</style>
</head>
<body>
<h1>Label Training Phase - Deep Dive Analysis</h1>
<h2>1. What is "Label Training"?</h2>
<p><strong>Location:</strong> src/calibration/llm_analyzer.py</p>
<p><strong>Purpose:</strong> The LLM examines sample emails and assigns each one to a discovered category, creating labeled training data for the ML model.</p>
<p><strong>This is NOT the same as category discovery.</strong> Discovery finds WHAT categories exist. Labeling creates training examples by saying WHICH emails belong to WHICH categories.</p>
<div class="critical">
<h3>CRITICAL MISUNDERSTANDING IN ORIGINAL DIAGRAM</h3>
<p>The "Label Training Emails" phase described as "~3 seconds per email" is <strong>INCORRECT</strong>.</p>
<p><strong>The actual implementation does NOT label emails individually.</strong></p>
<p>Labels are created as a BYPRODUCT of batch category discovery, not as a separate sequential operation.</p>
</div>
<h2>2. Actual Label Training Flow</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Calibration Phase Starts]) --> Sample[Sample 300 emails<br/>stratified by sender]
Sample --> BatchSetup[Split into batches of 20 emails<br/>300 ÷ 20 = 15 batches]
BatchSetup --> Batch1[Batch 1: Emails 1-20]
Batch1 --> Stats1[Calculate batch statistics<br/>domains, keywords, attachments<br/>~0.1 seconds]
Stats1 --> BuildPrompt1[Build LLM prompt<br/>Include all 20 email summaries<br/>~0.05 seconds]
BuildPrompt1 --> LLMCall1[Single LLM call for entire batch<br/>Discovers categories AND labels all 20<br/>~20 seconds TOTAL for batch]
LLMCall1 --> Parse1[Parse JSON response<br/>Extract categories + labels<br/>~0.1 seconds]
Parse1 --> Store1[Store results<br/>categories: Dict<br/>labels: List of Tuples]
Store1 --> Batch2{More batches?}
Batch2 -->|Yes| NextBatch[Batch 2: Emails 21-40]
Batch2 -->|No| Consolidate
NextBatch --> Stats2[Same process<br/>15 total batches<br/>~20 seconds each]
Stats2 --> Batch2
Consolidate[Consolidate categories<br/>Merge duplicates<br/>Single LLM call<br/>~5 seconds]
Consolidate --> CacheSnap[Snap to cached categories<br/>Match against persistent cache<br/>~0.5 seconds]
CacheSnap --> Final[Final output<br/>10-12 categories<br/>300 labeled emails]
Final --> End([Labels ready for ML training])
style LLMCall1 fill:#ff6b6b
style Consolidate fill:#ff6b6b
style Stats2 fill:#ffd93d
style Final fill:#4ec9b0
</pre>
</div>
<h2>3. Key Discovery: Batched Labeling</h2>
<div class="code-section">
<strong>src/calibration/llm_analyzer.py:66-83</strong>
<pre>
batch_size = 20  # NOT 1 email at a time!
for batch_idx in range(0, len(sample_emails), batch_size):
    batch = sample_emails[batch_idx:batch_idx + batch_size]
    # Single LLM call handles ENTIRE batch
    batch_results = self._analyze_batch(batch, batch_idx)
    # Returns BOTH categories AND labels for all 20 emails
    for category, desc in batch_results.get('categories', {}).items():
        discovered_categories[category] = desc
    for email_id, category in batch_results.get('labels', []):
        email_labels.append((email_id, category))
</pre>
</div>
<div class="warning">
<h3>Why Batching Matters</h3>
<p><strong>Sequential (WRONG assumption):</strong> 300 emails × 3 sec/email = 900 seconds (15 minutes)</p>
<p><strong>Batched (ACTUAL):</strong> 15 batches × 20 sec/batch = 300 seconds (5 minutes)</p>
<p><strong>Savings:</strong> 10 minutes per calibration run (a 67% reduction in labeling time)</p>
</div>
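<p>The arithmetic behind this comparison can be reproduced in a few lines. This is a back-of-envelope sketch using the illustrative constants above, not measured values:</p>

```python
# Back-of-envelope timing comparison for the two labeling strategies.
SAMPLE_SIZE = 300
BATCH_SIZE = 20
SEC_PER_EMAIL_SEQUENTIAL = 3   # assumed cost of one-email-per-call labeling
SEC_PER_BATCH = 20             # estimated cost of one 20-email batch call

sequential_total = SAMPLE_SIZE * SEC_PER_EMAIL_SEQUENTIAL      # 900 s
batched_total = (SAMPLE_SIZE // BATCH_SIZE) * SEC_PER_BATCH    # 300 s
savings_pct = round(100 * (1 - batched_total / sequential_total))

print(sequential_total, batched_total, savings_pct)  # 900 300 67
```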
<h2>4. Single Batch Processing Detail</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Batch of 20 emails]) --> Stats[Calculate Statistics<br/>~0.1 seconds]
Stats --> StatDetails[Domain analysis<br/>Recipient counts<br/>Attachment detection<br/>Keyword extraction]
StatDetails --> BuildList[Build email summaries<br/>For each email:<br/>ID + From + Subject + Preview]
BuildList --> Prompt[Construct LLM prompt<br/>~2KB text<br/>Contains:<br/>- Statistics summary<br/>- All 20 email summaries<br/>- Instructions<br/>- JSON schema]
Prompt --> LLM[LLM Call<br/>POST /api/generate<br/>qwen3:4b-instruct-2507-q8_0<br/>temp=0.1, max_tokens=2000<br/>~18-22 seconds]
LLM --> Response[LLM Response<br/>JSON with:<br/>categories: Dict<br/>labels: List of 20 Tuples]
Response --> Parse[Parse JSON<br/>Regex extraction<br/>Brace counting<br/>~0.05 seconds]
Parse --> Validate{Valid JSON?}
Validate -->|Yes| Extract[Extract data<br/>categories: 3-8 new<br/>labels: 20 tuples]
Validate -->|No| FallbackParse[Fallback parsing<br/>Try to salvage partial data]
FallbackParse --> Extract
Extract --> Return[Return batch results<br/>categories: Dict str→str<br/>labels: List Tuple str,str]
Return --> End([Merge with global results])
style LLM fill:#ff6b6b
style Parse fill:#4ec9b0
style FallbackParse fill:#ffd93d
</pre>
</div>
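<p>The fallback parsing step above (regex extraction, brace counting) can be sketched as a brace-counting scan that pulls the first balanced JSON object out of a chatty LLM reply. This is a minimal illustration, not the exact parser in <code>llm_analyzer.py</code>:</p>

```python
import json

def extract_first_json(text: str):
    """Pull the first balanced {...} object out of an LLM response.

    Naive sketch: counts braces and does not handle braces inside
    string values, which is usually sufficient for well-behaved output.
    """
    start = text.find('{')
    if start == -1:
        return None
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None

reply = ('Sure, here you go:\n'
         '{"categories": {"Work": "daily ops"}, "labels": [["id_1", "Work"]]}\n'
         'Done.')
print(extract_first_json(reply)["categories"])  # {'Work': 'daily ops'}
```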
<h2>5. LLM Prompt Structure</h2>
<div class="code-section">
<strong>Actual prompt sent to LLM (src/calibration/llm_analyzer.py:196-232):</strong>
<pre>
&lt;no_think&gt;You are analyzing emails to discover natural categories...

BATCH STATISTICS (20 emails):
- Top sender domains: example.com (5), company.org (3)...
- Avg recipients per email: 2.3
- Emails with attachments: 4/20
- Avg subject length: 42 chars
- Common keywords: meeting(3), report(2)...

EMAILS TO ANALYZE:
1. ID: maildir_allen-p__sent_mail_512
   From: phillip.allen@enron.com
   Subject: Re: AEC Volumes at OPAL
   Preview: Here are the volumes...
2. ID: maildir_allen-p__sent_mail_513
   From: phillip.allen@enron.com
   Subject: Meeting Tomorrow
   Preview: Can we schedule...
[... 18 more emails ...]

TASK:
1. Identify natural groupings based on PURPOSE
2. Create SHORT category names
3. Assign each email to exactly one category
4. CRITICAL: Copy EXACT email IDs

Return JSON:
{
  "categories": {"Work": "daily business communication", ...},
  "labels": [["maildir_allen-p__sent_mail_512", "Work"], ...]
}
</pre>
</div>
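<p>Assembling a prompt in this shape is straightforward string building. The sketch below is a hypothetical reconstruction: the field names (<code>top_domains</code>, <code>with_attachments</code>, the email dict keys) are assumptions for illustration, not the exact implementation:</p>

```python
def build_batch_prompt(stats: dict, emails: list) -> str:
    """Assemble the discovery/labeling prompt for one batch (sketch)."""
    lines = [
        "<no_think>You are analyzing emails to discover natural categories...",
        f"BATCH STATISTICS ({len(emails)} emails):",
        f"- Top sender domains: {stats['top_domains']}",
        f"- Emails with attachments: {stats['with_attachments']}/{len(emails)}",
        "EMAILS TO ANALYZE:",
    ]
    for i, e in enumerate(emails, 1):
        lines += [
            f"{i}. ID: {e['id']}",
            f"   From: {e['from']}",
            f"   Subject: {e['subject']}",
            f"   Preview: {e['preview']}",
        ]
    lines += [
        "TASK:",
        "1. Identify natural groupings based on PURPOSE",
        "2. Create SHORT category names",
        "3. Assign each email to exactly one category",
        "4. CRITICAL: Copy EXACT email IDs",
        'Return JSON: {"categories": {...}, "labels": [...]}',
    ]
    return "\n".join(lines)
```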
<h2>6. Timing Breakdown - 300 Sample Emails</h2>
<table class="timing-table">
<tr>
<th>Operation</th>
<th>Per Batch (20 emails)</th>
<th>Total (15 batches)</th>
<th>% of Total Time</th>
</tr>
<tr>
<td>Calculate statistics</td>
<td>0.1 sec</td>
<td>1.5 sec</td>
<td>0.5%</td>
</tr>
<tr>
<td>Build email summaries</td>
<td>0.05 sec</td>
<td>0.75 sec</td>
<td>0.2%</td>
</tr>
<tr>
<td>Construct prompt</td>
<td>0.01 sec</td>
<td>0.15 sec</td>
<td>0.05%</td>
</tr>
<tr>
<td><strong>LLM API call</strong></td>
<td><strong>18-22 sec</strong></td>
<td><strong>270-330 sec</strong></td>
<td><strong>98%</strong></td>
</tr>
<tr>
<td>Parse JSON response</td>
<td>0.05 sec</td>
<td>0.75 sec</td>
<td>0.2%</td>
</tr>
<tr>
<td>Merge results</td>
<td>0.02 sec</td>
<td>0.3 sec</td>
<td>0.1%</td>
</tr>
<tr>
<td colspan="2"><strong>SUBTOTAL: Batch Discovery</strong></td>
<td><strong>~300 seconds (5 min)</strong></td>
<td><strong>98.5%</strong></td>
</tr>
<tr>
<td colspan="2">Consolidation LLM call</td>
<td>5 seconds</td>
<td>1.3%</td>
</tr>
<tr>
<td colspan="2">Cache snapping (semantic matching)</td>
<td>0.5 seconds</td>
<td>0.2%</td>
</tr>
<tr>
<td colspan="2"><strong>TOTAL LABELING PHASE</strong></td>
<td><strong>~305 seconds (5 min)</strong></td>
<td><strong>100%</strong></td>
</tr>
</table>
<div class="warning">
<h3>Corrected Understanding</h3>
<p><strong>Original estimate:</strong> "~3 seconds per email" = 900 seconds for 300 emails</p>
<p><strong>Actual timing:</strong> ~20 seconds per batch of 20 = ~305 seconds for 300 emails</p>
<p><strong>Difference:</strong> 3× faster than original assumption</p>
<p><strong>Why:</strong> Batching allows LLM to see context across multiple emails and make better category decisions in a single inference pass.</p>
</div>
<h2>7. What Gets Created</h2>
<div class="diagram">
<pre class="mermaid">
flowchart LR
Input[300 sampled emails] --> Discovery[Category Discovery<br/>15 batches × 20 emails]
Discovery --> RawCats[Raw Categories<br/>~30-40 discovered<br/>May have duplicates:<br/>Work, work, Business, etc.]
RawCats --> Consolidate[Consolidation<br/>LLM merges similar<br/>~5 seconds]
Consolidate --> Merged[Merged Categories<br/>~12-15 categories<br/>Work, Financial, etc.]
Merged --> CacheSnap[Cache Snap<br/>Match against persistent cache<br/>~0.5 seconds]
CacheSnap --> Final[Final Categories<br/>10-12 categories]
Discovery --> RawLabels[Raw Labels<br/>300 tuples:<br/>email_id, category]
RawLabels --> UpdateLabels[Update label categories<br/>to match snapped names]
UpdateLabels --> FinalLabels[Final Labels<br/>300 training pairs]
Final --> Training[Training Data]
FinalLabels --> Training
Training --> MLTrain[Train LightGBM Model<br/>~5 seconds]
MLTrain --> Model[Trained Model<br/>1.8MB .pkl file]
style Discovery fill:#ff6b6b
style Consolidate fill:#ff6b6b
style Model fill:#4ec9b0
</pre>
</div>
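<p>The "update label categories to match snapped names" step in the flow above amounts to rewriting each label through a name map produced by consolidation and cache snapping. A minimal sketch, with an assumed <code>name_map</code> from raw names to final names:</p>

```python
def snap_labels(labels, name_map):
    """Rewrite label categories to their consolidated/snapped names.

    `name_map` maps raw category names to final ones; names that
    survived consolidation unchanged simply pass through.
    """
    return [(email_id, name_map.get(cat, cat)) for email_id, cat in labels]

raw = [("id_1", "work"), ("id_2", "Business"), ("id_3", "Financial")]
merged = snap_labels(raw, {"work": "Work", "Business": "Work"})
print(merged)  # [('id_1', 'Work'), ('id_2', 'Work'), ('id_3', 'Financial')]
```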
<h2>8. Example Output</h2>
<div class="code-section">
<strong>discovered_categories (Dict[str, str]):</strong>
<pre>
{
    "Work": "daily business communication and coordination",
    "Financial": "budgets, reports, financial planning",
    "Meetings": "scheduling and meeting coordination",
    "Technical": "system issues and technical discussions",
    "Requests": "action items and requests for information",
    "Reports": "status reports and summaries",
    "Administrative": "HR, policies, company announcements",
    "Urgent": "time-sensitive matters",
    "Conversational": "casual check-ins and social",
    "External": "communication with external partners"
}
</pre>
<strong>sample_labels (List[Tuple[str, str]]):</strong>
<pre>
[
    ("maildir_allen-p__sent_mail_1", "Financial"),
    ("maildir_allen-p__sent_mail_2", "Work"),
    ("maildir_allen-p__sent_mail_3", "Meetings"),
    ("maildir_allen-p__sent_mail_4", "Work"),
    ("maildir_allen-p__sent_mail_5", "Financial"),
    ... (300 total)
]
</pre>
</div>
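<p>Before LightGBM training, these two structures are turned into feature/target arrays. The sketch below shows one way to do that conversion; <code>embed</code> stands in for whatever feature extractor the pipeline actually uses and is a dummy here to keep the example self-contained:</p>

```python
def to_training_data(labels, categories, embed):
    """Convert (email_id, category) pairs into supervised arrays.

    Returns feature rows X, integer targets y, and the ordered class
    names so predictions can be mapped back to category strings.
    """
    class_names = sorted(categories)
    class_index = {name: i for i, name in enumerate(class_names)}
    X = [embed(email_id) for email_id, _ in labels]
    y = [class_index[cat] for _, cat in labels]
    return X, y, class_names

labels = [("id_1", "Financial"), ("id_2", "Work")]
categories = {"Work": "daily business", "Financial": "budgets"}
X, y, names = to_training_data(labels, categories, embed=lambda eid: [0.0, 1.0])
print(names, y)  # ['Financial', 'Work'] [0, 1]
```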
<h2>9. Why Batching is Critical</h2>
<table class="timing-table">
<tr>
<th>Approach</th>
<th>LLM Calls</th>
<th>Time/Call</th>
<th>Total Time</th>
<th>Quality</th>
</tr>
<tr>
<td><strong>Sequential (1 email/call)</strong></td>
<td>300</td>
<td>3 sec</td>
<td>900 sec (15 min)</td>
<td>Poor - no context</td>
</tr>
<tr>
<td><strong>Small batches (5 emails/call)</strong></td>
<td>60</td>
<td>8 sec</td>
<td>480 sec (8 min)</td>
<td>Fair - limited context</td>
</tr>
<tr>
<td><strong>Current (20 emails/call)</strong></td>
<td>15</td>
<td>20 sec</td>
<td>300 sec (5 min)</td>
<td>Good - sufficient context</td>
</tr>
<tr>
<td><strong>Large batches (50 emails/call)</strong></td>
<td>6</td>
<td>45 sec</td>
<td>270 sec (4.5 min)</td>
<td>Risk - may exceed token limits</td>
</tr>
</table>
<div class="warning">
<h3>Why 20 emails per batch?</h3>
<ul>
<li><strong>Token limit:</strong> 20 emails × ~150 tokens/email = ~3000 tokens input, well under 8K limit</li>
<li><strong>Context window:</strong> LLM can see patterns across multiple emails</li>
<li><strong>Speed:</strong> Minimizes API calls while staying within limits</li>
<li><strong>Quality:</strong> Enough examples to identify patterns, not so many that it gets confused</li>
</ul>
</div>
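<p>The token-budget reasoning above can be checked with simple arithmetic. The per-email and overhead figures are the document's estimates plus an assumed overhead constant, not measured token counts:</p>

```python
# Rough token budget for one discovery batch.
TOKENS_PER_EMAIL = 150    # estimated summary size from the list above
BATCH_SIZE = 20
OVERHEAD_TOKENS = 500     # statistics + instructions + schema (assumed)
CONTEXT_LIMIT = 8000

batch_tokens = BATCH_SIZE * TOKENS_PER_EMAIL + OVERHEAD_TOKENS
print(batch_tokens, batch_tokens < CONTEXT_LIMIT)  # 3500 True
```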
<h2>10. Configuration Parameters</h2>
<table class="timing-table">
<tr>
<th>Parameter</th>
<th>Location</th>
<th>Default</th>
<th>Effect on Timing</th>
</tr>
<tr>
<td>sample_size</td>
<td>CalibrationConfig</td>
<td>300</td>
<td>300 samples = 15 batches = 5 min</td>
</tr>
<tr>
<td>batch_size</td>
<td>llm_analyzer.py:62</td>
<td>20</td>
<td>Hardcoded - affects batch count</td>
</tr>
<tr>
<td>llm_batch_size</td>
<td>CalibrationConfig</td>
<td>50</td>
<td>NOT USED for discovery (misleading name)</td>
</tr>
<tr>
<td>temperature</td>
<td>LLM call</td>
<td>0.1</td>
<td>Lower = faster, more deterministic</td>
</tr>
<tr>
<td>max_tokens</td>
<td>LLM call</td>
<td>2000</td>
<td>Higher = potentially slower response</td>
</tr>
</table>
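<p>A hypothetical shape for the configuration in the table above, as a dataclass. Field names beyond those listed in the table are not taken from the source; note that the discovery batch size (20) is hardcoded in <code>llm_analyzer.py</code> and is deliberately absent here:</p>

```python
from dataclasses import dataclass

@dataclass
class CalibrationConfig:
    """Illustrative sketch of the calibration settings; not the real class."""
    sample_size: int = 300
    llm_batch_size: int = 50    # NOT used for discovery (misleading name)
    temperature: float = 0.1
    max_tokens: int = 2000
```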
<h2>11. Full Calibration Timeline</h2>
<div class="diagram">
<pre class="mermaid">
gantt
title Calibration Phase Timeline (300 samples, 10k total emails)
dateFormat mm:ss
axisFormat %M:%S
section Sampling
Stratified sample (3% of 10k) :00:00, 01s
section Category Discovery
Batch 1 (emails 1-20) :00:01, 20s
Batch 2 (emails 21-40) :00:21, 20s
Batch 3 (emails 41-60) :00:41, 20s
Batch 4-13 (emails 61-260) :01:01, 200s
Batch 14 (emails 261-280) :04:21, 20s
Batch 15 (emails 281-300) :04:41, 20s
section Consolidation
LLM category merge :05:01, 05s
Cache snap :05:06, 00.5s
section ML Training
Feature extraction (300) :05:07, 06s
LightGBM training :05:13, 05s
Validation (100 emails) :05:18, 02s
Save model to disk :05:20, 00.5s
</pre>
</div>
<h2>12. Key Insights</h2>
<div class="critical">
<h3>1. Labels are NOT created sequentially</h3>
<p>The LLM creates labels as a byproduct of batch category discovery. There is NO separate "label each email one by one" phase.</p>
</div>
<div class="critical">
<h3>2. Batching is the optimization</h3>
<p>Processing 20 emails in a single LLM call (20 sec) is 3× faster than 20 individual calls (60 sec total).</p>
</div>
<div class="critical">
<h3>3. LLM time dominates everything</h3>
<p>98% of labeling phase time is LLM API calls. Everything else (parsing, merging, caching) is negligible.</p>
</div>
<div class="critical">
<h3>4. Consolidation is cheap</h3>
<p>Merging 30-40 raw categories into 10-12 final ones takes only ~5 seconds with a single LLM call.</p>
</div>
<h2>13. Optimization Opportunities</h2>
<table class="timing-table">
<tr>
<th>Optimization</th>
<th>Current</th>
<th>Potential</th>
<th>Tradeoff</th>
</tr>
<tr>
<td>Increase batch size</td>
<td>20 emails/batch</td>
<td>30-40 emails/batch</td>
<td>May hit token limits, slower per call</td>
</tr>
<tr>
<td>Reduce sample size</td>
<td>300 samples (3%)</td>
<td>200 samples (2%)</td>
<td>Less training data, potentially worse model</td>
</tr>
<tr>
<td>Parallel batching</td>
<td>Sequential 15 batches</td>
<td>3-5 concurrent batches</td>
<td>Requires async LLM client, more complex</td>
</tr>
<tr>
<td>Skip consolidation</td>
<td>Always consolidate if &gt;10 cats</td>
<td>Skip if &lt;15 cats</td>
<td>May leave duplicate categories</td>
</tr>
<tr>
<td>Cache-first approach</td>
<td>Discover then snap to cache</td>
<td>Snap to cache, only discover new</td>
<td>Less adaptive to new mailbox types</td>
</tr>
</table>
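<p>The "parallel batching" row deserves a sketch: with an async LLM client, a semaphore can cap in-flight calls at 3-5 concurrent batches while <code>gather</code> preserves batch order. The <code>analyze_batch</code> stub below sleeps instead of hitting an API; everything here is illustrative, not the project's client:</p>

```python
import asyncio

async def analyze_batch(batch_idx, batch):
    """Stand-in for an async LLM call; sleeps instead of hitting an API."""
    await asyncio.sleep(0.01)
    return {"batch": batch_idx, "labels": [(e, "Work") for e in batch]}

async def run_batches(batches, concurrency=3):
    # Semaphore caps concurrent LLM calls; gather keeps results in
    # submission order, so downstream merging is unchanged.
    sem = asyncio.Semaphore(concurrency)

    async def guarded(i, b):
        async with sem:
            return await analyze_batch(i, b)

    return await asyncio.gather(*(guarded(i, b) for i, b in enumerate(batches)))

batches = [["id_1", "id_2"], ["id_3"], ["id_4"], ["id_5"]]
results = asyncio.run(run_batches(batches))
print(len(results))  # 4
```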
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
},
gantt: {
useWidth: 1200
}
});
</script>
</body>
</html>