Label Training Phase - Deep Dive Analysis

1. What is "Label Training"?

Location: src/calibration/llm_analyzer.py

Purpose: The LLM examines sample emails and assigns each one to a discovered category, creating labeled training data for the ML model.

This is NOT the same as category discovery. Discovery finds WHAT categories exist. Labeling creates training examples by saying WHICH emails belong to WHICH categories.
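Concretely, the two artifacts have different shapes (values taken from the example output in section 8):

```python
# Discovery output: WHAT categories exist
discovered_categories: dict[str, str] = {
    "Work": "daily business communication and coordination",
    "Meetings": "scheduling and meeting coordination",
}

# Labeling output: WHICH email belongs to WHICH category
sample_labels: list[tuple[str, str]] = [
    ("maildir_allen-p__sent_mail_2", "Work"),
    ("maildir_allen-p__sent_mail_3", "Meetings"),
]
```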

CRITICAL MISUNDERSTANDING IN ORIGINAL DIAGRAM

The "Label Training Emails" phase described as "~3 seconds per email" is INCORRECT.

The actual implementation does NOT label emails individually.

Labels are created as a BYPRODUCT of batch category discovery, not as a separate sequential operation.

2. Actual Label Training Flow

```mermaid
flowchart TD
    Start([Calibration Phase Starts]) --> Sample[Sample 300 emails<br/>stratified by sender]
    Sample --> BatchSetup[Split into batches of 20 emails<br/>300 ÷ 20 = 15 batches]
    BatchSetup --> Batch1[Batch 1: Emails 1-20]
    Batch1 --> Stats1[Calculate batch statistics<br/>domains, keywords, attachments<br/>~0.1 seconds]
    Stats1 --> BuildPrompt1[Build LLM prompt<br/>Include all 20 email summaries<br/>~0.05 seconds]
    BuildPrompt1 --> LLMCall1[Single LLM call for entire batch<br/>Discovers categories AND labels all 20<br/>~20 seconds TOTAL for batch]
    LLMCall1 --> Parse1[Parse JSON response<br/>Extract categories + labels<br/>~0.1 seconds]
    Parse1 --> Store1[Store results<br/>categories: Dict<br/>labels: List of Tuples]
    Store1 --> Batch2{More batches?}
    Batch2 -->|Yes| NextBatch[Batch 2: Emails 21-40]
    Batch2 -->|No| Consolidate
    NextBatch --> Stats2[Same process<br/>15 total batches<br/>~20 seconds each]
    Stats2 --> Batch2
    Consolidate[Consolidate categories<br/>Merge duplicates<br/>Single LLM call<br/>~5 seconds]
    Consolidate --> CacheSnap[Snap to cached categories<br/>Match against persistent cache<br/>~0.5 seconds]
    CacheSnap --> Final[Final output<br/>10-12 categories<br/>300 labeled emails]
    Final --> End([Labels ready for ML training])

    style LLMCall1 fill:#ff6b6b
    style Consolidate fill:#ff6b6b
    style Stats2 fill:#ffd93d
    style Final fill:#4ec9b0
```

3. Key Discovery: Batched Labeling

From src/calibration/llm_analyzer.py:66-83:

```python
batch_size = 20  # NOT 1 email at a time!

for batch_idx in range(0, len(sample_emails), batch_size):
    batch = sample_emails[batch_idx:batch_idx + batch_size]

    # Single LLM call handles the ENTIRE batch
    batch_results = self._analyze_batch(batch, batch_idx)

    # Returns BOTH categories AND labels for all 20 emails
    for category, desc in batch_results.get('categories', {}).items():
        discovered_categories[category] = desc
    for email_id, category in batch_results.get('labels', []):
        email_labels.append((email_id, category))
```

Why Batching Matters

Sequential (WRONG assumption): 300 emails × 3 sec/email = 900 seconds (15 minutes)

Batched (ACTUAL): 15 batches × 20 sec/batch = 300 seconds (5 minutes)

Savings: 10 minutes (67% faster than assumed)

4. Single Batch Processing Detail

```mermaid
flowchart TD
    Start([Batch of 20 emails]) --> Stats[Calculate Statistics<br/>~0.1 seconds]
    Stats --> StatDetails[Domain analysis<br/>Recipient counts<br/>Attachment detection<br/>Keyword extraction]
    StatDetails --> BuildList[Build email summaries<br/>For each email:<br/>ID + From + Subject + Preview]
    BuildList --> Prompt[Construct LLM prompt<br/>~2KB text<br/>Contains:<br/>- Statistics summary<br/>- All 20 email summaries<br/>- Instructions<br/>- JSON schema]
    Prompt --> LLM[LLM Call<br/>POST /api/generate<br/>qwen3:4b-instruct-2507-q8_0<br/>temp=0.1, max_tokens=2000<br/>~18-22 seconds]
    LLM --> Response[LLM Response<br/>JSON with:<br/>categories: Dict<br/>labels: List of 20 Tuples]
    Response --> Parse[Parse JSON<br/>Regex extraction<br/>Brace counting<br/>~0.05 seconds]
    Parse --> Validate{Valid JSON?}
    Validate -->|Yes| Extract[Extract data<br/>categories: 3-8 new<br/>labels: 20 tuples]
    Validate -->|No| FallbackParse[Fallback parsing<br/>Try to salvage partial data]
    FallbackParse --> Extract
    Extract --> Return[Return batch results<br/>categories: Dict str→str<br/>labels: List Tuple str,str]
    Return --> End([Merge with global results])

    style LLM fill:#ff6b6b
    style Parse fill:#4ec9b0
    style FallbackParse fill:#ffd93d
```
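The "Regex extraction / Brace counting" step in the diagram is not shown in the excerpted source; a minimal sketch of the brace-counting technique, assuming the model may wrap its JSON in prose or code fences, could look like this:

```python
import json

def extract_json_object(text: str) -> dict | None:
    """Pull the first balanced {...} object out of an LLM response.

    LLMs sometimes wrap JSON in explanatory text; scanning for the first
    '{' and counting braces until they balance recovers the object without
    requiring the response to be pure JSON.
    """
    start = text.find('{')
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i, ch in enumerate(text[start:], start):
        if escaped:                # skip the character after a backslash
            escaped = False
            continue
        if ch == '\\':
            escaped = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == '{':
                depth += 1
            elif ch == '}':
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        return None  # caller falls back to partial salvage
    return None
```

When this returns None, the fallback path in the diagram tries to salvage partial data rather than discarding the batch.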

5. LLM Prompt Structure

Actual prompt sent to the LLM (src/calibration/llm_analyzer.py:196-232):

```text
<no_think>You are analyzing emails to discover natural categories...

BATCH STATISTICS (20 emails):
- Top sender domains: example.com (5), company.org (3)...
- Avg recipients per email: 2.3
- Emails with attachments: 4/20
- Avg subject length: 42 chars
- Common keywords: meeting(3), report(2)...

EMAILS TO ANALYZE:
1. ID: maildir_allen-p__sent_mail_512
   From: phillip.allen@enron.com
   Subject: Re: AEC Volumes at OPAL
   Preview: Here are the volumes...

2. ID: maildir_allen-p__sent_mail_513
   From: phillip.allen@enron.com
   Subject: Meeting Tomorrow
   Preview: Can we schedule...

[... 18 more emails ...]

TASK:
1. Identify natural groupings based on PURPOSE
2. Create SHORT category names
3. Assign each email to exactly one category
4. CRITICAL: Copy EXACT email IDs

Return JSON:
{
  "categories": {"Work": "daily business communication", ...},
  "labels": [["maildir_allen-p__sent_mail_512", "Work"], ...]
}
```
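The call itself (per the section 4 diagram) is a POST to /api/generate, which matches Ollama's API. A minimal sketch of that request, assuming a local Ollama server; the actual client code is not in the excerpt:

```python
import requests

def call_llm(prompt: str) -> str:
    """Send one batch prompt to the local model; this dominates batch time (~18-22 s)."""
    resp = requests.post(
        "http://localhost:11434/api/generate",  # assumed local Ollama endpoint
        json={
            "model": "qwen3:4b-instruct-2507-q8_0",
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.1,   # near-deterministic output
                "num_predict": 2000,  # Ollama's name for max_tokens
            },
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```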

6. Timing Breakdown - 300 Sample Emails

| Operation | Per Batch (20 emails) | Total (15 batches) | % of Total Time |
|---|---|---|---|
| Calculate statistics | 0.1 sec | 1.5 sec | 0.5% |
| Build email summaries | 0.05 sec | 0.75 sec | 0.2% |
| Construct prompt | 0.01 sec | 0.15 sec | 0.05% |
| LLM API call | 18-22 sec | 270-330 sec | 98% |
| Parse JSON response | 0.05 sec | 0.75 sec | 0.2% |
| Merge results | 0.02 sec | 0.3 sec | 0.1% |
| SUBTOTAL: Batch Discovery | | ~300 seconds (5 min) | 98.5% |
| Consolidation LLM call | | 5 seconds | 1.3% |
| Cache snapping (semantic matching) | | 0.5 seconds | 0.2% |
| TOTAL LABELING PHASE | | ~305 seconds (5 min) | 100% |

Corrected Understanding

Original estimate: "~3 seconds per email" = 900 seconds for 300 emails

Actual timing: ~20 seconds per batch of 20 = ~305 seconds for 300 emails

Difference: 3× faster than original assumption

Why: Batching lets the LLM see context across multiple emails and make better category decisions in a single inference pass.

7. What Gets Created

```mermaid
flowchart LR
    Input[300 sampled emails] --> Discovery[Category Discovery<br/>15 batches × 20 emails]
    Discovery --> RawCats[Raw Categories<br/>~30-40 discovered<br/>May have duplicates:<br/>Work, work, Business, etc.]
    RawCats --> Consolidate[Consolidation<br/>LLM merges similar<br/>~5 seconds]
    Consolidate --> Merged[Merged Categories<br/>~12-15 categories<br/>Work, Financial, etc.]
    Merged --> CacheSnap[Cache Snap<br/>Match against persistent cache<br/>~0.5 seconds]
    CacheSnap --> Final[Final Categories<br/>10-12 categories]
    Discovery --> RawLabels[Raw Labels<br/>300 tuples:<br/>email_id, category]
    RawLabels --> UpdateLabels[Update label categories<br/>to match snapped names]
    UpdateLabels --> FinalLabels[Final Labels<br/>300 training pairs]
    Final --> Training[Training Data]
    FinalLabels --> Training
    Training --> MLTrain[Train LightGBM Model<br/>~5 seconds]
    MLTrain --> Model[Trained Model<br/>1.8MB .pkl file]

    style Discovery fill:#ff6b6b
    style Consolidate fill:#ff6b6b
    style Model fill:#4ec9b0
```
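The cache-snap and label-update steps are not shown in the excerpted source. A minimal sketch of one plausible implementation, assuming the persistent cache is a dict of canonical names to descriptions and matching on normalized names (the actual code reportedly uses semantic matching):

```python
def snap_to_cache(merged: dict[str, str],
                  cache: dict[str, str]) -> dict[str, str]:
    """Map freshly merged category names onto cached canonical names.

    Illustrative only: matches on normalized names; the real
    implementation does semantic matching against the persistent cache.
    """
    by_norm = {name.strip().lower(): name for name in cache}
    return {name: by_norm.get(name.strip().lower(), name) for name in merged}

def update_labels(labels: list[tuple[str, str]],
                  mapping: dict[str, str]) -> list[tuple[str, str]]:
    """Rewrite each (email_id, category) pair to use the snapped name."""
    return [(email_id, mapping.get(cat, cat)) for email_id, cat in labels]
```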

8. Example Output

discovered_categories (Dict[str, str]):

```python
{
    "Work": "daily business communication and coordination",
    "Financial": "budgets, reports, financial planning",
    "Meetings": "scheduling and meeting coordination",
    "Technical": "system issues and technical discussions",
    "Requests": "action items and requests for information",
    "Reports": "status reports and summaries",
    "Administrative": "HR, policies, company announcements",
    "Urgent": "time-sensitive matters",
    "Conversational": "casual check-ins and social",
    "External": "communication with external partners"
}
```

sample_labels (List[Tuple[str, str]]):

```python
[
    ("maildir_allen-p__sent_mail_1", "Financial"),
    ("maildir_allen-p__sent_mail_2", "Work"),
    ("maildir_allen-p__sent_mail_3", "Meetings"),
    ("maildir_allen-p__sent_mail_4", "Work"),
    ("maildir_allen-p__sent_mail_5", "Financial"),
    # ... (300 total)
]
```

9. Why Batching is Critical

| Approach | LLM Calls | Time/Call | Total Time | Quality |
|---|---|---|---|---|
| Sequential (1 email/call) | 300 | 3 sec | 900 sec (15 min) | Poor - no context |
| Small batches (5 emails/call) | 60 | 8 sec | 480 sec (8 min) | Fair - limited context |
| Current (20 emails/call) | 15 | 20 sec | 300 sec (5 min) | Good - sufficient context |
| Large batches (50 emails/call) | 6 | 45 sec | 270 sec (4.5 min) | Risk - may exceed token limits |

Why 20 emails per batch? It is the sweet spot in the table above: enough emails per call for the LLM to see cross-email context when grouping, small enough that the ~2KB prompt and the 2000-token response budget stay comfortably within limits, and only marginally slower end-to-end than larger batches, which risk truncated or malformed JSON.

10. Configuration Parameters

| Parameter | Location | Default | Effect on Timing |
|---|---|---|---|
| sample_size | CalibrationConfig | 300 | 300 samples = 15 batches = 5 min |
| batch_size | llm_analyzer.py:62 | 20 | Hardcoded; determines the batch count |
| llm_batch_size | CalibrationConfig | 50 | NOT USED for discovery (misleading name) |
| temperature | LLM call | 0.1 | Lower = more deterministic output; negligible effect on speed |
| max_tokens | LLM call | 2000 | Caps response length; a higher cap permits longer (slower) responses |
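For reference, a sketch of what the relevant config surface might look like; field names and defaults are from the table above, but the actual class definition is not in the excerpt:

```python
from dataclasses import dataclass

@dataclass
class CalibrationConfig:
    # Number of emails sampled (stratified by sender) for calibration.
    sample_size: int = 300
    # NOTE: not used by category discovery despite the name; the discovery
    # batch size is hardcoded to 20 in llm_analyzer.py:62.
    llm_batch_size: int = 50
```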

11. Full Calibration Timeline

```mermaid
gantt
    title Calibration Phase Timeline (300 samples, 10k total emails)
    dateFormat mm:ss
    axisFormat %M:%S

    section Sampling
    Stratified sample (3% of 10k) :00:00, 01s

    section Category Discovery
    Batch 1 (emails 1-20)         :00:01, 20s
    Batch 2 (emails 21-40)        :00:21, 20s
    Batch 3 (emails 41-60)        :00:41, 20s
    Batch 4-13 (emails 61-260)    :01:01, 200s
    Batch 14 (emails 261-280)     :04:21, 20s
    Batch 15 (emails 281-300)     :04:41, 20s

    section Consolidation
    LLM category merge            :05:01, 05s
    Cache snap                    :05:06, 00.5s

    section ML Training
    Feature extraction (300)      :05:07, 06s
    LightGBM training             :05:13, 05s
    Validation (100 emails)       :05:18, 02s
    Save model to disk            :05:20, 00.5s
```

12. Key Insights

1. Labels are NOT created sequentially

The LLM creates labels as a byproduct of batch category discovery. There is NO separate "label each email one by one" phase.

2. Batching is the optimization

Processing 20 emails in a single LLM call (20 sec) is 3× faster than 20 individual calls (60 sec total).

3. LLM time dominates everything

98% of labeling phase time is LLM API calls. Everything else (parsing, merging, caching) is negligible.

4. Consolidation is cheap

Merging 30-40 raw categories into 10-12 final ones takes only ~5 seconds with a single LLM call.
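A minimal sketch of what that single consolidation call could look like, reusing the `call_llm` sketch from section 5; the prompt wording here is illustrative, only the single-call merge behavior is documented above:

```python
import json

def consolidate_categories(raw: dict[str, str], call_llm) -> dict[str, str]:
    """Merge ~30-40 raw categories into ~10-15 with one LLM call."""
    listing = "\n".join(f"- {name}: {desc}" for name, desc in raw.items())
    prompt = (
        "These email categories were discovered independently and may "
        "overlap (e.g. 'Work' vs 'Business'). Merge duplicates and return "
        '10-15 final categories as JSON: {"name": "description", ...}\n\n'
        + listing
    )
    # A robust parser (like the brace-counting sketch in section 4)
    # would be safer than json.loads on raw model output.
    return json.loads(call_llm(prompt))
```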

13. Optimization Opportunities

| Optimization | Current | Potential | Tradeoff |
|---|---|---|---|
| Increase batch size | 20 emails/batch | 30-40 emails/batch | May hit token limits, slower per call |
| Reduce sample size | 300 samples (3%) | 200 samples (2%) | Less training data, potentially worse model |
| Parallel batching (sketched below) | Sequential 15 batches | 3-5 concurrent batches | Requires async LLM client, more complex |
| Skip consolidation | Always consolidate if >10 cats | Skip if <15 cats | May leave duplicate categories |
| Cache-first approach | Discover then snap to cache | Snap to cache, only discover new | Less adaptive to new mailbox types |
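A minimal sketch of the parallel-batching idea, assuming a hypothetical async-capable client `analyze_batch_async` (the current code is synchronous); a semaphore keeps a few requests in flight so the LLM server is not overwhelmed:

```python
import asyncio

async def discover_parallel(batches, analyze_batch_async, max_concurrent=4):
    """Run batch discovery with a few requests in flight at once.

    `analyze_batch_async` is a hypothetical async version of
    _analyze_batch. With 4 concurrent batches, 15 batches at ~20 s each
    could finish in roughly 4 rounds (~80 s) instead of ~300 s,
    hardware permitting.
    """
    sem = asyncio.Semaphore(max_concurrent)

    async def run_one(idx, batch):
        async with sem:
            return await analyze_batch_async(batch, idx)

    return await asyncio.gather(
        *(run_one(i, b) for i, b in enumerate(batches))
    )
```

Note the caveat: on a single local GPU, concurrent requests may simply queue inside the inference server, so the real speedup depends on whether the backend can batch or parallelize generation.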