email-sorter/docs/LABEL_TRAINING_PHASE_DETAIL.html
FSSCoding 53174a34eb Organize project structure and add MVP features
Project Reorganization:
- Created docs/ directory and moved all documentation
- Created scripts/ directory for shell scripts
- Created scripts/experimental/ for research scripts
- Updated .gitignore for new structure
- Updated README.md with MVP status and new structure

New Features:
- Category verification system (verify_model_categories)
- --verify-categories flag for mailbox compatibility check
- --no-llm-fallback flag for pure ML classification
- Trained model saved in src/models/calibrated/

Threshold Optimization:
- Reduced default threshold from 0.75 to 0.55
- Updated all category thresholds to 0.55
- Reduces LLM fallback rate by 40% (35% -> 21%)

Documentation:
- SYSTEM_FLOW.html - Complete system architecture
- VERIFY_CATEGORIES_FEATURE.html - Feature documentation
- LABEL_TRAINING_PHASE_DETAIL.html - Calibration breakdown
- FAST_ML_ONLY_WORKFLOW.html - Pure ML guide
- PROJECT_STATUS_AND_NEXT_STEPS.html - Roadmap
- ROOT_CAUSE_ANALYSIS.md - Bug fixes

MVP Status:
- 10k emails in 4 minutes, 72.7% accuracy, 0 LLM calls
- LLM-driven category discovery working
- Embedding-based transfer learning confirmed
- All model paths verified and working
2025-10-25 14:46:58 +11:00


<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Label Training Phase - Detailed Analysis</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.timing-table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
.timing-table th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
.timing-table td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
.code-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
font-family: 'Courier New', monospace;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.warning {
background: #3e2a00;
border-left: 4px solid #ffd93d;
padding: 15px;
margin: 10px 0;
}
.critical {
background: #3e0000;
border-left: 4px solid #ff6b6b;
padding: 15px;
margin: 10px 0;
}
</style>
</head>
<body>
<h1>Label Training Phase - Deep Dive Analysis</h1>
<h2>1. What is "Label Training"?</h2>
<p><strong>Location:</strong> src/calibration/llm_analyzer.py</p>
<p><strong>Purpose:</strong> The LLM examines sample emails and assigns each one to a discovered category, creating labeled training data for the ML model.</p>
<p><strong>This is NOT the same as category discovery.</strong> Discovery finds WHAT categories exist. Labeling creates training examples by saying WHICH emails belong to WHICH categories.</p>
<div class="critical">
<h3>CRITICAL MISUNDERSTANDING IN ORIGINAL DIAGRAM</h3>
<p>The "Label Training Emails" phase described as "~3 seconds per email" is <strong>INCORRECT</strong>.</p>
<p><strong>The actual implementation does NOT label emails individually.</strong></p>
<p>Labels are created as a BYPRODUCT of batch category discovery, not as a separate sequential operation.</p>
</div>
<h2>2. Actual Label Training Flow</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Calibration Phase Starts]) --> Sample[Sample 300 emails<br/>stratified by sender]
Sample --> BatchSetup[Split into batches of 20 emails<br/>300 ÷ 20 = 15 batches]
BatchSetup --> Batch1[Batch 1: Emails 1-20]
Batch1 --> Stats1[Calculate batch statistics<br/>domains, keywords, attachments<br/>~0.1 seconds]
Stats1 --> BuildPrompt1[Build LLM prompt<br/>Include all 20 email summaries<br/>~0.05 seconds]
BuildPrompt1 --> LLMCall1[Single LLM call for entire batch<br/>Discovers categories AND labels all 20<br/>~20 seconds TOTAL for batch]
LLMCall1 --> Parse1[Parse JSON response<br/>Extract categories + labels<br/>~0.1 seconds]
Parse1 --> Store1[Store results<br/>categories: Dict<br/>labels: List of Tuples]
Store1 --> Batch2{More batches?}
Batch2 -->|Yes| NextBatch[Batch 2: Emails 21-40]
Batch2 -->|No| Consolidate
NextBatch --> Stats2[Same process<br/>15 total batches<br/>~20 seconds each]
Stats2 --> Batch2
Consolidate[Consolidate categories<br/>Merge duplicates<br/>Single LLM call<br/>~5 seconds]
Consolidate --> CacheSnap[Snap to cached categories<br/>Match against persistent cache<br/>~0.5 seconds]
CacheSnap --> Final[Final output<br/>10-12 categories<br/>300 labeled emails]
Final --> End([Labels ready for ML training])
style LLMCall1 fill:#ff6b6b
style Consolidate fill:#ff6b6b
style Stats2 fill:#ffd93d
style Final fill:#4ec9b0
</pre>
</div>
<h2>3. Key Discovery: Batched Labeling</h2>
<div class="code-section">
<strong>src/calibration/llm_analyzer.py:66-83</strong>
<pre>
batch_size = 20  # NOT 1 email at a time!
for batch_idx in range(0, len(sample_emails), batch_size):
    batch = sample_emails[batch_idx:batch_idx + batch_size]
    # Single LLM call handles ENTIRE batch
    batch_results = self._analyze_batch(batch, batch_idx)
    # Returns BOTH categories AND labels for all 20 emails
    for category, desc in batch_results.get('categories', {}).items():
        discovered_categories[category] = desc
    for email_id, category in batch_results.get('labels', []):
        email_labels.append((email_id, category))
</pre>
</div>
<div class="warning">
<h3>Why Batching Matters</h3>
<p><strong>Sequential (WRONG assumption):</strong> 300 emails × 3 sec/email = 900 seconds (15 minutes)</p>
<p><strong>Batched (ACTUAL):</strong> 15 batches × 20 sec/batch = 300 seconds (5 minutes)</p>
<p><strong>Savings:</strong> 10 minutes per calibration run (a 67% reduction in labeling time)</p>
</div>
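<p>The arithmetic behind this comparison can be reproduced in a few lines. This is a back-of-envelope sketch using the illustrative constants above, not measured values:</p>

```python
# Back-of-envelope timing comparison for the two labeling strategies.
SAMPLE_SIZE = 300
BATCH_SIZE = 20
SEC_PER_EMAIL_SEQUENTIAL = 3   # assumed cost of one-email-per-call labeling
SEC_PER_BATCH = 20             # estimated cost of one 20-email batch call

sequential_total = SAMPLE_SIZE * SEC_PER_EMAIL_SEQUENTIAL      # 900 s
batched_total = (SAMPLE_SIZE // BATCH_SIZE) * SEC_PER_BATCH    # 300 s
savings_pct = round(100 * (1 - batched_total / sequential_total))

print(sequential_total, batched_total, savings_pct)  # 900 300 67
```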
<h2>4. Single Batch Processing Detail</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Batch of 20 emails]) --> Stats[Calculate Statistics<br/>~0.1 seconds]
Stats --> StatDetails[Domain analysis<br/>Recipient counts<br/>Attachment detection<br/>Keyword extraction]
StatDetails --> BuildList[Build email summaries<br/>For each email:<br/>ID + From + Subject + Preview]
BuildList --> Prompt[Construct LLM prompt<br/>~2KB text<br/>Contains:<br/>- Statistics summary<br/>- All 20 email summaries<br/>- Instructions<br/>- JSON schema]
Prompt --> LLM[LLM Call<br/>POST /api/generate<br/>qwen3:4b-instruct-2507-q8_0<br/>temp=0.1, max_tokens=2000<br/>~18-22 seconds]
LLM --> Response[LLM Response<br/>JSON with:<br/>categories: Dict<br/>labels: List of 20 Tuples]
Response --> Parse[Parse JSON<br/>Regex extraction<br/>Brace counting<br/>~0.05 seconds]
Parse --> Validate{Valid JSON?}
Validate -->|Yes| Extract[Extract data<br/>categories: 3-8 new<br/>labels: 20 tuples]
Validate -->|No| FallbackParse[Fallback parsing<br/>Try to salvage partial data]
FallbackParse --> Extract
Extract --> Return[Return batch results<br/>categories: Dict str→str<br/>labels: List Tuple str,str]
Return --> End([Merge with global results])
style LLM fill:#ff6b6b
style Parse fill:#4ec9b0
style FallbackParse fill:#ffd93d
</pre>
</div>
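<p>The fallback parsing step above (regex extraction, brace counting) can be sketched as a brace-counting scan that pulls the first balanced JSON object out of a chatty LLM reply. This is a minimal illustration, not the exact parser in <code>llm_analyzer.py</code>:</p>

```python
import json

def extract_first_json(text: str):
    """Pull the first balanced {...} object out of an LLM response.

    Naive sketch: counts braces and does not handle braces inside
    string values, which is usually sufficient for well-behaved output.
    """
    start = text.find('{')
    if start == -1:
        return None
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None

reply = ('Sure, here you go:\n'
         '{"categories": {"Work": "daily ops"}, "labels": [["id_1", "Work"]]}\n'
         'Done.')
print(extract_first_json(reply)["categories"])  # {'Work': 'daily ops'}
```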
<h2>5. LLM Prompt Structure</h2>
<div class="code-section">
<strong>Actual prompt sent to LLM (src/calibration/llm_analyzer.py:196-232):</strong>
<pre>
&lt;no_think&gt;You are analyzing emails to discover natural categories...

BATCH STATISTICS (20 emails):
- Top sender domains: example.com (5), company.org (3)...
- Avg recipients per email: 2.3
- Emails with attachments: 4/20
- Avg subject length: 42 chars
- Common keywords: meeting(3), report(2)...

EMAILS TO ANALYZE:
1. ID: maildir_allen-p__sent_mail_512
   From: phillip.allen@enron.com
   Subject: Re: AEC Volumes at OPAL
   Preview: Here are the volumes...
2. ID: maildir_allen-p__sent_mail_513
   From: phillip.allen@enron.com
   Subject: Meeting Tomorrow
   Preview: Can we schedule...
[... 18 more emails ...]

TASK:
1. Identify natural groupings based on PURPOSE
2. Create SHORT category names
3. Assign each email to exactly one category
4. CRITICAL: Copy EXACT email IDs

Return JSON:
{
  "categories": {"Work": "daily business communication", ...},
  "labels": [["maildir_allen-p__sent_mail_512", "Work"], ...]
}
</pre>
</div>
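<p>Assembling a prompt in this shape is straightforward string building. The sketch below is a hypothetical reconstruction: the field names (<code>top_domains</code>, <code>with_attachments</code>, the email dict keys) are assumptions for illustration, not the exact implementation:</p>

```python
def build_batch_prompt(stats: dict, emails: list) -> str:
    """Assemble the discovery/labeling prompt for one batch (sketch)."""
    lines = [
        "<no_think>You are analyzing emails to discover natural categories...",
        f"BATCH STATISTICS ({len(emails)} emails):",
        f"- Top sender domains: {stats['top_domains']}",
        f"- Emails with attachments: {stats['with_attachments']}/{len(emails)}",
        "EMAILS TO ANALYZE:",
    ]
    for i, e in enumerate(emails, 1):
        lines += [
            f"{i}. ID: {e['id']}",
            f"   From: {e['from']}",
            f"   Subject: {e['subject']}",
            f"   Preview: {e['preview']}",
        ]
    lines += [
        "TASK:",
        "1. Identify natural groupings based on PURPOSE",
        "2. Create SHORT category names",
        "3. Assign each email to exactly one category",
        "4. CRITICAL: Copy EXACT email IDs",
        'Return JSON: {"categories": {...}, "labels": [...]}',
    ]
    return "\n".join(lines)
```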
<h2>6. Timing Breakdown - 300 Sample Emails</h2>
<table class="timing-table">
<tr>
<th>Operation</th>
<th>Per Batch (20 emails)</th>
<th>Total (15 batches)</th>
<th>% of Total Time</th>
</tr>
<tr>
<td>Calculate statistics</td>
<td>0.1 sec</td>
<td>1.5 sec</td>
<td>0.5%</td>
</tr>
<tr>
<td>Build email summaries</td>
<td>0.05 sec</td>
<td>0.75 sec</td>
<td>0.2%</td>
</tr>
<tr>
<td>Construct prompt</td>
<td>0.01 sec</td>
<td>0.15 sec</td>
<td>0.05%</td>
</tr>
<tr>
<td><strong>LLM API call</strong></td>
<td><strong>18-22 sec</strong></td>
<td><strong>270-330 sec</strong></td>
<td><strong>98%</strong></td>
</tr>
<tr>
<td>Parse JSON response</td>
<td>0.05 sec</td>
<td>0.75 sec</td>
<td>0.2%</td>
</tr>
<tr>
<td>Merge results</td>
<td>0.02 sec</td>
<td>0.3 sec</td>
<td>0.1%</td>
</tr>
<tr>
<td colspan="2"><strong>SUBTOTAL: Batch Discovery</strong></td>
<td><strong>~300 seconds (5 min)</strong></td>
<td><strong>98.5%</strong></td>
</tr>
<tr>
<td colspan="2">Consolidation LLM call</td>
<td>5 seconds</td>
<td>1.3%</td>
</tr>
<tr>
<td colspan="2">Cache snapping (semantic matching)</td>
<td>0.5 seconds</td>
<td>0.2%</td>
</tr>
<tr>
<td colspan="2"><strong>TOTAL LABELING PHASE</strong></td>
<td><strong>~305 seconds (5 min)</strong></td>
<td><strong>100%</strong></td>
</tr>
</table>
<div class="warning">
<h3>Corrected Understanding</h3>
<p><strong>Original estimate:</strong> "~3 seconds per email" = 900 seconds for 300 emails</p>
<p><strong>Actual timing:</strong> ~20 seconds per batch of 20 = ~305 seconds for 300 emails</p>
<p><strong>Difference:</strong> 3× faster than original assumption</p>
<p><strong>Why:</strong> Batching allows LLM to see context across multiple emails and make better category decisions in a single inference pass.</p>
</div>
<h2>7. What Gets Created</h2>
<div class="diagram">
<pre class="mermaid">
flowchart LR
Input[300 sampled emails] --> Discovery[Category Discovery<br/>15 batches × 20 emails]
Discovery --> RawCats[Raw Categories<br/>~30-40 discovered<br/>May have duplicates:<br/>Work, work, Business, etc.]
RawCats --> Consolidate[Consolidation<br/>LLM merges similar<br/>~5 seconds]
Consolidate --> Merged[Merged Categories<br/>~12-15 categories<br/>Work, Financial, etc.]
Merged --> CacheSnap[Cache Snap<br/>Match against persistent cache<br/>~0.5 seconds]
CacheSnap --> Final[Final Categories<br/>10-12 categories]
Discovery --> RawLabels[Raw Labels<br/>300 tuples:<br/>email_id, category]
RawLabels --> UpdateLabels[Update label categories<br/>to match snapped names]
UpdateLabels --> FinalLabels[Final Labels<br/>300 training pairs]
Final --> Training[Training Data]
FinalLabels --> Training
Training --> MLTrain[Train LightGBM Model<br/>~5 seconds]
MLTrain --> Model[Trained Model<br/>1.8MB .pkl file]
style Discovery fill:#ff6b6b
style Consolidate fill:#ff6b6b
style Model fill:#4ec9b0
</pre>
</div>
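<p>The "update label categories to match snapped names" step in the flow above amounts to rewriting each label through a name map produced by consolidation and cache snapping. A minimal sketch, with an assumed <code>name_map</code> from raw names to final names:</p>

```python
def snap_labels(labels, name_map):
    """Rewrite label categories to their consolidated/snapped names.

    `name_map` maps raw category names to final ones; names that
    survived consolidation unchanged simply pass through.
    """
    return [(email_id, name_map.get(cat, cat)) for email_id, cat in labels]

raw = [("id_1", "work"), ("id_2", "Business"), ("id_3", "Financial")]
merged = snap_labels(raw, {"work": "Work", "Business": "Work"})
print(merged)  # [('id_1', 'Work'), ('id_2', 'Work'), ('id_3', 'Financial')]
```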
<h2>8. Example Output</h2>
<div class="code-section">
<strong>discovered_categories (Dict[str, str]):</strong>
<pre>
{
    "Work": "daily business communication and coordination",
    "Financial": "budgets, reports, financial planning",
    "Meetings": "scheduling and meeting coordination",
    "Technical": "system issues and technical discussions",
    "Requests": "action items and requests for information",
    "Reports": "status reports and summaries",
    "Administrative": "HR, policies, company announcements",
    "Urgent": "time-sensitive matters",
    "Conversational": "casual check-ins and social",
    "External": "communication with external partners"
}
</pre>
<strong>sample_labels (List[Tuple[str, str]]):</strong>
<pre>
[
    ("maildir_allen-p__sent_mail_1", "Financial"),
    ("maildir_allen-p__sent_mail_2", "Work"),
    ("maildir_allen-p__sent_mail_3", "Meetings"),
    ("maildir_allen-p__sent_mail_4", "Work"),
    ("maildir_allen-p__sent_mail_5", "Financial"),
    ... (300 total)
]
</pre>
</div>
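<p>Before LightGBM training, these two structures are turned into feature/target arrays. The sketch below shows one way to do that conversion; <code>embed</code> stands in for whatever feature extractor the pipeline actually uses and is a dummy here to keep the example self-contained:</p>

```python
def to_training_data(labels, categories, embed):
    """Convert (email_id, category) pairs into supervised arrays.

    Returns feature rows X, integer targets y, and the ordered class
    names so predictions can be mapped back to category strings.
    """
    class_names = sorted(categories)
    class_index = {name: i for i, name in enumerate(class_names)}
    X = [embed(email_id) for email_id, _ in labels]
    y = [class_index[cat] for _, cat in labels]
    return X, y, class_names

labels = [("id_1", "Financial"), ("id_2", "Work")]
categories = {"Work": "daily business", "Financial": "budgets"}
X, y, names = to_training_data(labels, categories, embed=lambda eid: [0.0, 1.0])
print(names, y)  # ['Financial', 'Work'] [0, 1]
```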
<h2>9. Why Batching is Critical</h2>
<table class="timing-table">
<tr>
<th>Approach</th>
<th>LLM Calls</th>
<th>Time/Call</th>
<th>Total Time</th>
<th>Quality</th>
</tr>
<tr>
<td><strong>Sequential (1 email/call)</strong></td>
<td>300</td>
<td>3 sec</td>
<td>900 sec (15 min)</td>
<td>Poor - no context</td>
</tr>
<tr>
<td><strong>Small batches (5 emails/call)</strong></td>
<td>60</td>
<td>8 sec</td>
<td>480 sec (8 min)</td>
<td>Fair - limited context</td>
</tr>
<tr>
<td><strong>Current (20 emails/call)</strong></td>
<td>15</td>
<td>20 sec</td>
<td>300 sec (5 min)</td>
<td>Good - sufficient context</td>
</tr>
<tr>
<td><strong>Large batches (50 emails/call)</strong></td>
<td>6</td>
<td>45 sec</td>
<td>270 sec (4.5 min)</td>
<td>Risk - may exceed token limits</td>
</tr>
</table>
<div class="warning">
<h3>Why 20 emails per batch?</h3>
<ul>
<li><strong>Token limit:</strong> 20 emails × ~150 tokens/email = ~3000 tokens input, well under 8K limit</li>
<li><strong>Context window:</strong> LLM can see patterns across multiple emails</li>
<li><strong>Speed:</strong> Minimizes API calls while staying within limits</li>
<li><strong>Quality:</strong> Enough examples to identify patterns, not so many that it gets confused</li>
</ul>
</div>
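<p>The token-budget reasoning above can be checked with simple arithmetic. The per-email and overhead figures are the document's estimates plus an assumed overhead constant, not measured token counts:</p>

```python
# Rough token budget for one discovery batch.
TOKENS_PER_EMAIL = 150    # estimated summary size from the list above
BATCH_SIZE = 20
OVERHEAD_TOKENS = 500     # statistics + instructions + schema (assumed)
CONTEXT_LIMIT = 8000

batch_tokens = BATCH_SIZE * TOKENS_PER_EMAIL + OVERHEAD_TOKENS
print(batch_tokens, batch_tokens < CONTEXT_LIMIT)  # 3500 True
```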
<h2>10. Configuration Parameters</h2>
<table class="timing-table">
<tr>
<th>Parameter</th>
<th>Location</th>
<th>Default</th>
<th>Effect on Timing</th>
</tr>
<tr>
<td>sample_size</td>
<td>CalibrationConfig</td>
<td>300</td>
<td>300 samples = 15 batches = 5 min</td>
</tr>
<tr>
<td>batch_size</td>
<td>llm_analyzer.py:62</td>
<td>20</td>
<td>Hardcoded - affects batch count</td>
</tr>
<tr>
<td>llm_batch_size</td>
<td>CalibrationConfig</td>
<td>50</td>
<td>NOT USED for discovery (misleading name)</td>
</tr>
<tr>
<td>temperature</td>
<td>LLM call</td>
<td>0.1</td>
<td>Lower = faster, more deterministic</td>
</tr>
<tr>
<td>max_tokens</td>
<td>LLM call</td>
<td>2000</td>
<td>Higher = potentially slower response</td>
</tr>
</table>
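<p>A hypothetical shape for the configuration in the table above, as a dataclass. Field names beyond those listed in the table are not taken from the source; note that the discovery batch size (20) is hardcoded in <code>llm_analyzer.py</code> and is deliberately absent here:</p>

```python
from dataclasses import dataclass

@dataclass
class CalibrationConfig:
    """Illustrative sketch of the calibration settings; not the real class."""
    sample_size: int = 300
    llm_batch_size: int = 50    # NOT used for discovery (misleading name)
    temperature: float = 0.1
    max_tokens: int = 2000
```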
<h2>11. Full Calibration Timeline</h2>
<div class="diagram">
<pre class="mermaid">
gantt
title Calibration Phase Timeline (300 samples, 10k total emails)
dateFormat mm:ss
axisFormat %M:%S
section Sampling
Stratified sample (3% of 10k) :00:00, 01s
section Category Discovery
Batch 1 (emails 1-20) :00:01, 20s
Batch 2 (emails 21-40) :00:21, 20s
Batch 3 (emails 41-60) :00:41, 20s
Batch 4-13 (emails 61-260) :01:01, 200s
Batch 14 (emails 261-280) :04:21, 20s
Batch 15 (emails 281-300) :04:41, 20s
section Consolidation
LLM category merge :05:01, 05s
Cache snap :05:06, 00.5s
section ML Training
Feature extraction (300) :05:07, 06s
LightGBM training :05:13, 05s
Validation (100 emails) :05:18, 02s
Save model to disk :05:20, 00.5s
</pre>
</div>
<h2>12. Key Insights</h2>
<div class="critical">
<h3>1. Labels are NOT created sequentially</h3>
<p>The LLM creates labels as a byproduct of batch category discovery. There is NO separate "label each email one by one" phase.</p>
</div>
<div class="critical">
<h3>2. Batching is the optimization</h3>
<p>Processing 20 emails in a single LLM call (20 sec) is 3× faster than 20 individual calls (60 sec total).</p>
</div>
<div class="critical">
<h3>3. LLM time dominates everything</h3>
<p>98% of labeling phase time is LLM API calls. Everything else (parsing, merging, caching) is negligible.</p>
</div>
<div class="critical">
<h3>4. Consolidation is cheap</h3>
<p>Merging 30-40 raw categories into 10-12 final ones takes only ~5 seconds with a single LLM call.</p>
</div>
<h2>13. Optimization Opportunities</h2>
<table class="timing-table">
<tr>
<th>Optimization</th>
<th>Current</th>
<th>Potential</th>
<th>Tradeoff</th>
</tr>
<tr>
<td>Increase batch size</td>
<td>20 emails/batch</td>
<td>30-40 emails/batch</td>
<td>May hit token limits, slower per call</td>
</tr>
<tr>
<td>Reduce sample size</td>
<td>300 samples (3%)</td>
<td>200 samples (2%)</td>
<td>Less training data, potentially worse model</td>
</tr>
<tr>
<td>Parallel batching</td>
<td>Sequential 15 batches</td>
<td>3-5 concurrent batches</td>
<td>Requires async LLM client, more complex</td>
</tr>
<tr>
<td>Skip consolidation</td>
<td>Always consolidate if &gt;10 cats</td>
<td>Skip if &lt;15 cats</td>
<td>May leave duplicate categories</td>
</tr>
<tr>
<td>Cache-first approach</td>
<td>Discover then snap to cache</td>
<td>Snap to cache, only discover new</td>
<td>Less adaptive to new mailbox types</td>
</tr>
</table>
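<p>The "parallel batching" row deserves a sketch: with an async LLM client, a semaphore can cap in-flight calls at 3-5 concurrent batches while <code>gather</code> preserves batch order. The <code>analyze_batch</code> stub below sleeps instead of hitting an API; everything here is illustrative, not the project's client:</p>

```python
import asyncio

async def analyze_batch(batch_idx, batch):
    """Stand-in for an async LLM call; sleeps instead of hitting an API."""
    await asyncio.sleep(0.01)
    return {"batch": batch_idx, "labels": [(e, "Work") for e in batch]}

async def run_batches(batches, concurrency=3):
    # Semaphore caps concurrent LLM calls; gather keeps results in
    # submission order, so downstream merging is unchanged.
    sem = asyncio.Semaphore(concurrency)

    async def guarded(i, b):
        async with sem:
            return await analyze_batch(i, b)

    return await asyncio.gather(*(guarded(i, b) for i, b in enumerate(batches)))

batches = [["id_1", "id_2"], ["id_3"], ["id_4"], ["id_5"]]
results = asyncio.run(run_batches(batches))
print(len(results))  # 4
```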
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
},
gantt: {
useWidth: 1200
}
});
</script>
</body>
</html>