# Root Cause Analysis: Category Explosion & Over-Confidence

**Date:** 2025-10-24
**Run:** 100k emails, qwen3:4b model
**Issue:** Model trained on 29 categories instead of the expected 11, with extreme over-confidence

---

## Executive Summary

The 100k classification run technically succeeded (92.1% estimated accuracy) but revealed critical architectural issues:

1. **Category Explosion:** 29 training categories instead of the expected 11
2. **Duplicate Categories:** Work/work, Administrative/auth, finance/Financial
3. **Extreme Over-Confidence:** 99%+ of classifications at 1.0 confidence
4. **Category Leakage:** Hardcoded categories leaked into the LLM-discovered categories

---

## The Bug

### Location

[src/calibration/workflow.py:110](src/calibration/workflow.py#L110)

```python
all_categories = list(set(self.categories) | set(discovered_categories.keys()) | label_categories)
```

### What Happened

The workflow merges THREE category sources:

1. **`self.categories`** - 12 hardcoded categories from `config/categories.yaml`:
   - junk, transactional, auth, newsletters, social, automated
   - conversational, work, personal, finance, travel, unknown

2. **`discovered_categories.keys()`** - 11 LLM-discovered categories:
   - Work, Financial, Administrative, Operational, Meeting
   - Technical, External, Announcements, Urgent, Miscellaneous, Forwarded

3. **`label_categories`** - Additional categories from LLM labels:
   - Bowl Pool 2000, California Market, Prehearing, Change, Monitoring
   - Information

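To make the explosion concrete, here is a minimal, self-contained sketch of that union using only the category names listed in this report (the real code operates on `self.categories`, `discovered_categories`, and `label_categories` inside the workflow):

```python
from collections import defaultdict

# Category names as reported above; the real values come from categories.yaml,
# LLM discovery, and LLM labels respectively.
hardcoded = ["junk", "transactional", "auth", "newsletters", "social", "automated",
             "conversational", "work", "personal", "finance", "travel", "unknown"]
discovered = ["Work", "Financial", "Administrative", "Operational", "Meeting", "Technical",
              "External", "Announcements", "Urgent", "Miscellaneous", "Forwarded"]
label_extras = ["Bowl Pool 2000", "California Market", "Prehearing", "Change",
                "Monitoring", "Information"]

# The plain set union at workflow.py:110 keeps every distinct string.
all_categories = sorted(set(hardcoded) | set(discovered) | set(label_extras))
print(len(all_categories))  # 29

# Grouping case-insensitively exposes the literal duplicates ...
groups = defaultdict(list)
for name in all_categories:
    groups[name.casefold()].append(name)
print({k: v for k, v in groups.items() if len(v) > 1})  # {'work': ['Work', 'work']}
# ... while semantic duplicates (finance vs Financial) survive even case-folding.
```
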
### Result: 29 Total Categories

```
1. Administrative (LLM discovered)
2. Announcements (LLM discovered)
3. Bowl Pool 2000 (LLM label - weird)
4. California Market (LLM label - too specific)
5. Change (LLM label - vague)
6. External (LLM discovered)
7. Financial (LLM discovered)
8. Forwarded (LLM discovered)
9. Information (LLM label - vague)
10. Meeting (LLM discovered)
11. Miscellaneous (LLM discovered)
12. Monitoring (LLM label - too specific)
13. Operational (LLM discovered)
14. Prehearing (LLM label - too specific)
15. Technical (LLM discovered)
16. Urgent (LLM discovered)
17. Work (LLM discovered)
18. auth (hardcoded)
19. automated (hardcoded)
20. conversational (hardcoded)
21. finance (hardcoded)
22. junk (hardcoded)
23. newsletters (hardcoded)
24. personal (hardcoded)
25. social (hardcoded)
26. transactional (hardcoded)
27. travel (hardcoded)
28. unknown (hardcoded)
29. work (hardcoded)
```

### Duplicates Identified

- **Work (LLM) vs work (hardcoded)** - 14,223 vs 368 emails
- **Financial (LLM) vs finance (hardcoded)** - 5,943 vs 0 emails
- **Administrative (LLM) vs auth (hardcoded)** - 67,195 vs 37 emails

---

## Impact Analysis

### 1. Category Distribution (100k Results)

| Category | Count | Confidence | Source |
|----------|-------|------------|--------|
| Administrative | 67,195 | 1.000 | LLM discovered |
| Work | 14,223 | 1.000 | LLM discovered |
| Meeting | 7,785 | 1.000 | LLM discovered |
| Financial | 5,943 | 1.000 | LLM discovered |
| Operational | 3,274 | 1.000 | LLM discovered |
| junk | 394 | 0.960 | Hardcoded |
| work | 368 | 0.950 | Hardcoded |
| Miscellaneous | 238 | 1.000 | LLM discovered |
| Technical | 193 | 1.000 | LLM discovered |
| External | 137 | 1.000 | LLM discovered |
| transactional | 44 | 0.970 | Hardcoded |
| auth | 37 | 0.990 | Hardcoded |
| unknown | 23 | 0.500 | Hardcoded |
| Others | <20 each | Various | Mixed |

### 2. Extreme Over-Confidence

- **67,195 emails** classified as "Administrative" with **1.0 confidence**
- **99.9%** of all classifications have confidence >= 0.95
- This is unrealistic and suggests overfitting or poor calibration

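A quick way to verify these numbers against any future run (a sketch only; the results path and the `category`/`confidence` column names are assumptions about the classifier's output format):

```python
import pandas as pd

# Hypothetical output file and column names -- adjust to the real classification export.
df = pd.read_csv("results/enron_100k_classifications.csv")

print(f"confidence >= 0.95: {(df['confidence'] >= 0.95).mean():.1%}")   # observed: ~99.9%
print(f"confidence == 1.0:  {(df['confidence'] == 1.0).mean():.1%}")

# Per-category counts and mean confidence, mirroring the table above.
summary = (df.groupby("category")["confidence"]
             .agg(["count", "mean"])
             .sort_values("count", ascending=False))
print(summary.head(15))
```
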
### 3. Why It Still "Worked"

- LLM-discovered categories (uppercase) handled 99%+ of emails
- Hardcoded categories (lowercase) went mostly unused except for rule matches
- The model learned both sets but strongly preferred the LLM categories
- The Enron dataset doesn't map well onto the hardcoded categories

---

## Why This Happened

### Design Intent vs Reality

**Original Design:**
- Hardcoded categories in `categories.yaml` for rule-based matching
- LLM discovers NEW categories during calibration
- Merge both for flexible classification

**Reality:**
- Hardcoded categories leak into ML training
- Creates duplicate concepts (Work vs work)
- LLM labels include one-off categories (Bowl Pool 2000)
- No deduplication or conflict resolution

### The Workflow Path

```
1. CLI loads hardcoded categories from categories.yaml
   → ['junk', 'transactional', 'auth', ... 'work', 'finance', 'unknown']

2. Passes to CalibrationWorkflow.__init__(categories=...)
   → self.categories = list(categories.keys())

3. LLM discovers categories from emails
   → {'Work': 'business emails', 'Financial': 'budgets', ...}

4. Consolidation reduces duplicates (within LLM categories only)
   → But doesn't see hardcoded categories

5. Merge ALL sources at workflow.py:110
   → Hardcoded + Discovered + Label anomalies = 29 categories

6. Trainer learns all 29 categories
   → Model becomes confused but weights LLM categories heavily
```

---

## Spot-Check Findings

### High-Confidence Samples

✅ **Sample 1:** "i'll get the movie and wine. my suggestion is something from central market"
- Classified: Administrative (1.0)
- **Assessment:** Questionable - reads more like a personal email

✅ **Sample 2:** "Can you spell S-N-O-O-T-Y?"
- Classified: Administrative (1.0)
- **Assessment:** Wrong - clearly conversational/personal

✅ **Sample 3:** "MEETING TONIGHT - 6:00 pm Central Time at The Houstonian"
- Classified: Meeting (1.0)
- **Assessment:** Correct

### Low-Confidence Samples (Unknown)

⚠️ **All low-confidence samples were classified as "unknown" (0.500)**
- These fell back to the LLM
- The LLM also failed to classify them (returned unknown)
- Actual content: legitimate business emails about deferrals and power units

### Category Anomalies

❌ **"California Market" (6 emails, 1.0 confidence)**
- Too specific - shouldn't be a standalone category
- Should be "Work" or "External"

❌ **"Bowl Pool 2000" (exists in training set)**
- One-off event category
- Should never have been kept

---

## Performance Impact

### What Went Right

- **ML handled 99.1%** of emails (99,134 / 100,000)
- **Only 31 emails fell back to the LLM** (0.03%)
- Fast classification (~3 minutes for 100k)
- Discovered categories are semantically good

### What Went Wrong

- **Unrealistic confidence** - almost everything is 1.0
- **Category pollution** - 29 categories instead of 11
- **Duplicates** - Work/work, finance/Financial
- **No calibration** - model confidence not properly calibrated
- **Hardcoded categories unused** - 368 "work" vs 14,223 "Work"

---

## Root Causes

### 1. Architectural Confusion

**Two competing philosophies:**
- **Rule-based system:** Use hardcoded categories with pattern matching
- **LLM-driven system:** Discover categories from data

**Result:** They interfere with each other instead of complementing each other.

### 2. Missing Deduplication

The merge at workflow.py:110 is a simple set union with no:
- Case normalization
- Semantic similarity checking
- Conflict resolution
- Priority rules

### 3. No Consolidation Across Sources

The LLM consolidation step (lines 91-100) only consolidates within the discovered categories. It doesn't:
- Check against hardcoded categories
- Merge similar concepts
- Remove one-off labels

### 4. Poor Category Cache Design

The category cache (`src/models/category_cache.json`) saves LLM categories but:
- Doesn't deduplicate against hardcoded categories
- Allows case-sensitive duplicates
- Performs no validation of category quality

---

## Recommendations

### Immediate Fixes

1. **Remove hardcoded categories from ML training**
   - Use them ONLY for rule-based matching
   - Don't merge into `all_categories` for training
   - Let the LLM discover all ML categories

2. **Add case-insensitive deduplication** (sketched after this list)
   - Normalize to title case
   - Check semantic similarity
   - Merge duplicates before training

3. **Filter label anomalies** (sketched after this list)
   - Reject categories with <10 training samples
   - Reject overly specific categories (Bowl Pool 2000)
   - Add an LLM review step for quality

4. **Calibrate model confidence** (see the calibration sketch after this list)
   - Use temperature scaling or Platt scaling
   - Ensure confidence reflects actual accuracy

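A minimal sketch of fixes 2 and 3 combined, assuming the discovered categories arrive as a name → description mapping and that labeled-sample counts per category are available (the function name and signature are illustrative, not the project's actual API):

```python
from collections import defaultdict

def clean_categories(discovered: dict[str, str], label_counts: dict[str, int],
                     min_samples: int = 10) -> dict[str, str]:
    """Case-insensitive dedup plus small-category filtering (fixes 2 and 3)."""
    # Pool labeled-sample counts case-insensitively so "Work" and "work" count together.
    folded_counts: dict[str, int] = defaultdict(int)
    for name, count in label_counts.items():
        folded_counts[name.strip().casefold()] += count

    # Group discovered names that collide once case is ignored.
    groups: dict[str, list[str]] = defaultdict(list)
    for name in discovered:
        groups[name.strip().casefold()].append(name)

    cleaned: dict[str, str] = {}
    for key, variants in groups.items():
        # Drop one-off categories (e.g. "Bowl Pool 2000") with too few samples.
        if folded_counts[key] < min_samples:
            continue
        # Keep a single canonical, title-cased name per group.
        cleaned[variants[0].strip().title()] = discovered[variants[0]]
    return cleaned
```

Semantic duplicates such as finance/Financial still need a separate merge step, for example comparing category-name embeddings before training.
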
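For fix 4, a sketch of Platt-style calibration using scikit-learn; the base classifier, embedding dimensionality, and label count are placeholders, not the project's actual trainer:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for email embeddings and LLM-assigned labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 384))      # e.g. sentence-embedding vectors
y = rng.integers(0, 11, size=500)    # e.g. 11 discovered categories

# method="sigmoid" is Platt scaling; "isotonic" is the usual alternative
# when enough calibration samples are available.
clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="sigmoid", cv=5)
clf.fit(X, y)

confidences = clf.predict_proba(X[:5]).max(axis=1)
print(confidences)  # calibrated probabilities rather than values pinned at 1.0
```
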
### Architecture Decision

**Option A: Rule-Based + ML (Current)**
- Keep hardcoded categories for RULES ONLY
- LLM discovers categories for ML ONLY
- Never merge the two

**Option B: Pure LLM Discovery (Recommended)**
- Remove categories.yaml entirely
- LLM discovers ALL categories
- Rules can still match on keywords but don't define categories

**Option C: Hybrid with Priority**
- Define 3-5 HIGH-PRIORITY hardcoded categories (junk, auth, transactional)
- Let LLM discover everything else
- Clear hierarchy: Rules → Hardcoded ML → Discovered ML

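In code terms, Options A and B both reduce to keeping `self.categories` out of the training label space. One possible shape of the change at workflow.py:110, reusing the `clean_categories` sketch above (`label_sample_counts` is a hypothetical name for the per-category sample counts):

```python
# Hardcoded categories (self.categories) stay with the rule-based matcher only.
discovered_clean = clean_categories(discovered_categories, label_sample_counts)
all_categories = sorted(discovered_clean)  # ML training label space: discovered categories only
```
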
---

## Next Steps

1. **Decision:** Choose architecture (A, B, or C above)
2. **Fix workflow.py:110** - Implement chosen strategy
3. **Add deduplication logic** - Case-insensitive, semantic matching
4. **Rerun calibration** - Clean 250-sample run
5. **Validate results** - Ensure clean categories
6. **Fix confidence** - Add calibration layer

---

## Files to Modify

1. [src/calibration/workflow.py:110](src/calibration/workflow.py#L110) - Category merging logic
2. [src/calibration/llm_analyzer.py](src/calibration/llm_analyzer.py) - Add cross-source consolidation
3. [src/cli.py:70](src/cli.py#L70) - Decide whether to load hardcoded categories
4. [config/categories.yaml](config/categories.yaml) - Clarify purpose (rules only?)
5. [src/calibration/trainer.py](src/calibration/trainer.py) - Add confidence calibration

---

## Conclusion

The system technically worked - it classified 100k emails with high ML efficiency. However, the category explosion and over-confidence issues reveal fundamental architectural problems that need resolution before production use.

The core question: **Should hardcoded categories participate in ML training at all?**

My recommendation: **No.** Use them for rules only, and let the LLM discover the ML categories cleanly.