Location: src/calibration/llm_analyzer.py
Purpose: The LLM examines sample emails and assigns each one to a discovered category, creating labeled training data for the ML model.
This is NOT the same as category discovery. Discovery finds WHAT categories exist. Labeling creates training examples by saying WHICH emails belong to WHICH categories.
The "Label Training Emails" phase described as "~3 seconds per email" is INCORRECT.
The actual implementation does NOT label emails individually.
Labels are created as a BYPRODUCT of batch category discovery, not as a separate sequential operation.
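As a rough sketch of that contract (the function name and signature below are hypothetical, not the actual llm_analyzer.py API), each batch call returns both outputs at once:

```python
# Hypothetical sketch: names are illustrative, not the actual
# llm_analyzer.py API. One LLM call per batch returns BOTH outputs.
from typing import Dict, List, Tuple

def analyze_batch(emails: List[dict]) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Analyze one batch of ~20 emails with a single LLM call.

    Returns:
        categories: {category_name: description} discovered from this batch
        labels:     [(email_id, category_name)] for every email in the batch

    Labels fall out of the same call, so there is no second labeling pass.
    """
    raise NotImplementedError  # see the pipeline sketches below
```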
```mermaid
flowchart TD
    Start([Calibration Phase Starts]) --> Sample["Sample 300 emails<br/>stratified by sender"]
    Sample --> BatchSetup["Split into batches of 20 emails<br/>300 ÷ 20 = 15 batches"]
    BatchSetup --> Batch1["Batch 1: Emails 1-20"]
    Batch1 --> Stats1["Calculate batch statistics<br/>domains, keywords, attachments<br/>~0.1 seconds"]
    Stats1 --> BuildPrompt1["Build LLM prompt<br/>Include all 20 email summaries<br/>~0.05 seconds"]
    BuildPrompt1 --> LLMCall1["Single LLM call for entire batch<br/>Discovers categories AND labels all 20<br/>~20 seconds TOTAL for batch"]
    LLMCall1 --> Parse1["Parse JSON response<br/>Extract categories + labels<br/>~0.1 seconds"]
    Parse1 --> Store1["Store results<br/>categories: Dict<br/>labels: List of Tuples"]
    Store1 --> Batch2{More batches?}
    Batch2 -->|Yes| NextBatch["Batch 2: Emails 21-40"]
    Batch2 -->|No| Consolidate
    NextBatch --> Stats2["Same process<br/>15 total batches<br/>~20 seconds each"]
    Stats2 --> Batch2
    Consolidate["Consolidate categories<br/>Merge duplicates<br/>Single LLM call<br/>~5 seconds"]
    Consolidate --> CacheSnap["Snap to cached categories<br/>Match against persistent cache<br/>~0.5 seconds"]
    CacheSnap --> Final["Final output<br/>10-12 categories<br/>300 labeled emails"]
    Final --> End([Labels ready for ML training])

    style LLMCall1 fill:#ff6b6b
    style Consolidate fill:#ff6b6b
    style Stats2 fill:#ffd93d
    style Final fill:#4ec9b0
```
Sequential (WRONG assumption): 300 emails × 3 sec/email = 900 seconds (15 minutes)
Batched (ACTUAL): 15 batches × 20 sec/batch = 300 seconds (5 minutes)
Savings: 10 minutes (a 67% reduction in wall-clock time, i.e. ~3× faster)
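A minimal sketch of that batched loop, building on the hypothetical `analyze_batch` above (variable names are illustrative):

```python
# Illustrative batched driver loop; sampled_emails comes from the
# stratified sampling step and analyze_batch is the sketch above.
BATCH_SIZE = 20

all_categories: dict[str, str] = {}
all_labels: list[tuple[str, str]] = []

for start in range(0, len(sampled_emails), BATCH_SIZE):
    batch = sampled_emails[start:start + BATCH_SIZE]
    categories, labels = analyze_batch(batch)  # one LLM call, ~20 seconds
    all_categories.update(categories)          # raw names; duplicates merged later
    all_labels.extend(labels)                  # 20 (email_id, category) tuples

# 15 iterations × ~20 s ≈ 300 s, versus 300 calls × ~3 s ≈ 900 s one-by-one
```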
```mermaid
flowchart TD
    Start([Batch of 20 emails]) --> Stats["Calculate statistics<br/>~0.1 seconds"]
    Stats --> StatDetails["Domain analysis<br/>Recipient counts<br/>Attachment detection<br/>Keyword extraction"]
    StatDetails --> BuildList["Build email summaries<br/>For each email:<br/>ID + From + Subject + Preview"]
    BuildList --> Prompt["Construct LLM prompt<br/>~2KB text<br/>Contains:<br/>- Statistics summary<br/>- All 20 email summaries<br/>- Instructions<br/>- JSON schema"]
    Prompt --> LLM["LLM call<br/>POST /api/generate<br/>qwen3:4b-instruct-2507-q8_0<br/>temp=0.1, max_tokens=2000<br/>~18-22 seconds"]
    LLM --> Response["LLM response<br/>JSON with:<br/>categories: Dict<br/>labels: List of 20 Tuples"]
    Response --> Parse["Parse JSON<br/>Regex extraction<br/>Brace counting<br/>~0.05 seconds"]
    Parse --> Validate{Valid JSON?}
    Validate -->|Yes| Extract["Extract data<br/>categories: 3-8 new<br/>labels: 20 tuples"]
    Validate -->|No| FallbackParse["Fallback parsing<br/>Try to salvage partial data"]
    FallbackParse --> Extract
    Extract --> Return["Return batch results<br/>categories: Dict str→str<br/>labels: List Tuple str,str"]
    Return --> End([Merge with global results])

    style LLM fill:#ff6b6b
    style Parse fill:#4ec9b0
    style FallbackParse fill:#ffd93d
```
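A minimal sketch of that call, assuming a local Ollama server (the endpoint and option names below are standard Ollama, but the exact client code in llm_analyzer.py may differ; `num_predict` is Ollama's name for the max-token cap):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed local Ollama endpoint
MODEL = "qwen3:4b-instruct-2507-q8_0"

def call_llm(prompt: str) -> str:
    """Single non-streaming generation call for one batch of 20 summaries."""
    resp = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.1,   # near-deterministic output
            "num_predict": 2000,  # Ollama's equivalent of max_tokens
        },
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]
```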
| Operation | Per Batch (20 emails) | Total (15 batches) | % of Total Time |
|---|---|---|---|
| Calculate statistics | 0.1 sec | 1.5 sec | 0.5% |
| Build email summaries | 0.05 sec | 0.75 sec | 0.2% |
| Construct prompt | 0.01 sec | 0.15 sec | 0.05% |
| LLM API call | 18-22 sec | 270-330 sec | 98% |
| Parse JSON response | 0.05 sec | 0.75 sec | 0.2% |
| Merge results | 0.02 sec | 0.3 sec | 0.1% |
| SUBTOTAL: Batch Discovery | | ~300 seconds (5 min) | 98.4% |
| Consolidation LLM call | | 5 seconds | 1.6% |
| Cache snapping (semantic matching) | | 0.5 seconds | 0.2% |
| TOTAL LABELING PHASE | | ~305 seconds (5 min) | 100% |
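The brace-counting fallback in the parse step can be sketched roughly as follows (a simplification: it ignores braces inside JSON string values, which is usually acceptable for salvage parsing):

```python
import json
from typing import Optional

def extract_json_object(text: str) -> Optional[dict]:
    """Salvage the first balanced {...} block from free-form LLM output."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # braces balanced: candidate JSON object
                try:
                    return json.loads(text[start:i + 1])
                except json.JSONDecodeError:
                    return None  # caller falls back to partial salvage
    return None  # braces never balanced: response was truncated
```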
Original estimate: "~3 seconds per email" = 900 seconds for 300 emails
Actual timing: ~20 seconds per batch of 20 = ~305 seconds for 300 emails
Difference: ~3× faster than the original assumption.
Why: batching lets the LLM see context across multiple emails and make better category decisions in a single inference pass per batch.
```mermaid
flowchart LR
    Input[300 sampled emails] --> Discovery["Category Discovery<br/>15 batches × 20 emails"]
    Discovery --> RawCats["Raw Categories<br/>~30-40 discovered<br/>May have duplicates:<br/>Work, work, Business, etc."]
    RawCats --> Consolidate["Consolidation<br/>LLM merges similar<br/>~5 seconds"]
    Consolidate --> Merged["Merged Categories<br/>~12-15 categories<br/>Work, Financial, etc."]
    Merged --> CacheSnap["Cache Snap<br/>Match against persistent cache<br/>~0.5 seconds"]
    CacheSnap --> Final["Final Categories<br/>10-12 categories"]
    Discovery --> RawLabels["Raw Labels<br/>300 tuples:<br/>email_id, category"]
    RawLabels --> UpdateLabels["Update label categories<br/>to match snapped names"]
    UpdateLabels --> FinalLabels["Final Labels<br/>300 training pairs"]
    Final --> Training[Training Data]
    FinalLabels --> Training
    Training --> MLTrain["Train LightGBM Model<br/>~5 seconds"]
    MLTrain --> Model["Trained Model<br/>1.8MB .pkl file"]

    style Discovery fill:#ff6b6b
    style Consolidate fill:#ff6b6b
    style Model fill:#4ec9b0
```
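The "update label categories" step amounts to a dictionary re-mapping over the raw labels (sketch below; `raw_to_canonical` would come from the consolidation and cache-snap results, and `all_labels` is the accumulated list from the batch loop above):

```python
# Illustrative sketch of label re-mapping after consolidation; the mapping
# contents here are an example, not actual output.
raw_to_canonical = {
    "Work": "Work", "work": "Work", "Business": "Work",  # example merge
}

final_labels = [
    (email_id, raw_to_canonical.get(category, category))
    for email_id, category in all_labels
]
# final_labels now holds 300 (email_id, canonical_category) training pairs
```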
| Approach | LLM Calls | Time/Call | Total Time | Quality |
|---|---|---|---|---|
| Sequential (1 email/call) | 300 | 3 sec | 900 sec (15 min) | Poor - no context |
| Small batches (5 emails/call) | 60 | 8 sec | 480 sec (8 min) | Fair - limited context |
| Current (20 emails/call) | 15 | 20 sec | 300 sec (5 min) | Good - sufficient context |
| Large batches (50 emails/call) | 6 | 45 sec | 270 sec (4.5 min) | Risk - may exceed token limits |
| Parameter | Location | Default | Effect on Timing |
|---|---|---|---|
| sample_size | CalibrationConfig | 300 | 300 samples = 15 batches = 5 min |
| batch_size | llm_analyzer.py:62 | 20 | Hardcoded - affects batch count |
| llm_batch_size | CalibrationConfig | 50 | NOT USED for discovery (misleading name) |
| temperature | LLM call | 0.1 | Lower = more deterministic output; negligible direct effect on speed |
| max_tokens | LLM call | 2000 | Higher = potentially slower response |
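A plausible shape for these settings, inferred from the table above (the field names `sample_size` and `llm_batch_size` appear in the table; the rest of the dataclass is an assumption):

```python
# Assumed sketch of CalibrationConfig, reconstructed from the table above;
# the real class in the codebase may hold more fields.
from dataclasses import dataclass

@dataclass
class CalibrationConfig:
    sample_size: int = 300    # 300 samples -> 15 batches -> ~5 min discovery
    llm_batch_size: int = 50  # misleading name: NOT used for discovery, where
                              # a batch size of 20 is hardcoded (llm_analyzer.py:62)
```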
```mermaid
gantt
    title Calibration Phase Timeline (300 samples, 10k total emails)
    dateFormat mm:ss
    axisFormat %M:%S

    section Sampling
    Stratified sample (3% of 10k)   :00:00, 1s

    section Category Discovery
    Batch 1 (emails 1-20)           :00:01, 20s
    Batch 2 (emails 21-40)          :00:21, 20s
    Batch 3 (emails 41-60)          :00:41, 20s
    Batches 4-13 (emails 61-260)    :01:01, 200s
    Batch 14 (emails 261-280)       :04:21, 20s
    Batch 15 (emails 281-300)       :04:41, 20s

    section Consolidation
    LLM category merge              :05:01, 5s
    Cache snap (~0.5s)              :05:06, 1s

    section ML Training
    Feature extraction (300 emails) :05:07, 6s
    LightGBM training               :05:13, 5s
    Validation (100 emails)         :05:18, 2s
    Save model to disk (~0.5s)      :05:20, 1s
```
The LLM creates labels as a byproduct of batch category discovery. There is NO separate "label each email one by one" phase.
Processing 20 emails in a single LLM call (20 sec) is 3× faster than 20 individual calls (60 sec total).
98% of labeling phase time is LLM API calls. Everything else (parsing, merging, caching) is negligible.
Merging 30-40 raw categories into 10-12 final ones takes only ~5 seconds with a single LLM call.
| Optimization | Current | Potential | Tradeoff |
|---|---|---|---|
| Increase batch size | 20 emails/batch | 30-40 emails/batch | May hit token limits, slower per call |
| Reduce sample size | 300 samples (3%) | 200 samples (2%) | Less training data, potentially worse model |
| Parallel batching | Sequential 15 batches | 3-5 concurrent batches | Requires async LLM client, more complex |
| Skip consolidation | Always consolidate if >10 cats | Skip if <15 cats | May leave duplicate categories |
| Cache-first approach | Discover then snap to cache | Snap to cache, only discover new | Less adaptive to new mailbox types |
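For the "parallel batching" row, an async client could look roughly like this (hypothetical sketch: `analyze_batch_async` does not exist in the current code and would have to be written against an async HTTP client):

```python
import asyncio

MAX_CONCURRENT = 4  # 3-5 concurrent batches, per the table above

async def discover_all(batches: list) -> list:
    """Run batch discovery with a bounded number of in-flight LLM calls."""
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def run_one(batch):
        async with sem:  # cap concurrent requests to the LLM server
            return await analyze_batch_async(batch)  # hypothetical async helper

    # 15 batches, 4 in flight: ~4 waves × 20 s ≈ 80 s instead of ~300 s,
    # assuming the LLM server can actually serve 4 requests concurrently.
    return await asyncio.gather(*(run_one(b) for b in batches))
```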