# Root Cause Analysis: Category Explosion & Over-Confidence

**Date:** 2025-10-24
**Run:** 100k emails, qwen3:4b model
**Issue:** Model trained on 29 categories instead of the expected 11, with extreme over-confidence

---

## Executive Summary

The 100k classification run technically succeeded (92.1% estimated accuracy) but revealed critical architectural issues:

1. **Category Explosion:** 29 training categories instead of the expected 11
2. **Duplicate Categories:** Work/work, Administrative/auth, finance/Financial
3. **Extreme Over-Confidence:** 99%+ of classifications at 1.0 confidence
4. **Category Leakage:** Hardcoded categories leaked into the LLM-discovered categories

---

## The Bug

### Location

[src/calibration/workflow.py:110](src/calibration/workflow.py#L110)

```python
all_categories = list(set(self.categories) | set(discovered_categories.keys()) | label_categories)
```

### What Happened

The workflow merges THREE category sources:

1. **`self.categories`** - 12 hardcoded categories from `config/categories.yaml`:
   - junk, transactional, auth, newsletters, social, automated
   - conversational, work, personal, finance, travel, unknown

2. **`discovered_categories.keys()`** - 11 LLM-discovered categories:
   - Work, Financial, Administrative, Operational, Meeting
   - Technical, External, Announcements, Urgent, Miscellaneous, Forwarded

3. **`label_categories`** - Additional categories from LLM labels:
   - Bowl Pool 2000, California Market, Prehearing, Change, Monitoring
   - Information

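To make the explosion concrete, here is a minimal, self-contained sketch of that union using only the category names listed in this report (the real code operates on `self.categories`, `discovered_categories`, and `label_categories` inside the workflow):

```python
from collections import defaultdict

# Category names as reported above; the real values come from categories.yaml,
# LLM discovery, and LLM labels respectively.
hardcoded = ["junk", "transactional", "auth", "newsletters", "social", "automated",
             "conversational", "work", "personal", "finance", "travel", "unknown"]
discovered = ["Work", "Financial", "Administrative", "Operational", "Meeting", "Technical",
              "External", "Announcements", "Urgent", "Miscellaneous", "Forwarded"]
label_extras = ["Bowl Pool 2000", "California Market", "Prehearing", "Change",
                "Monitoring", "Information"]

# The plain set union at workflow.py:110 keeps every distinct string.
all_categories = sorted(set(hardcoded) | set(discovered) | set(label_extras))
print(len(all_categories))  # 29

# Grouping case-insensitively exposes the literal duplicates ...
groups = defaultdict(list)
for name in all_categories:
    groups[name.casefold()].append(name)
print({k: v for k, v in groups.items() if len(v) > 1})  # {'work': ['Work', 'work']}
# ... while semantic duplicates (finance vs Financial) survive even case-folding.
```
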
### Result: 29 Total Categories

```
1. Administrative (LLM discovered)
2. Announcements (LLM discovered)
3. Bowl Pool 2000 (LLM label - weird)
4. California Market (LLM label - too specific)
5. Change (LLM label - vague)
6. External (LLM discovered)
7. Financial (LLM discovered)
8. Forwarded (LLM discovered)
9. Information (LLM label - vague)
10. Meeting (LLM discovered)
11. Miscellaneous (LLM discovered)
12. Monitoring (LLM label - too specific)
13. Operational (LLM discovered)
14. Prehearing (LLM label - too specific)
15. Technical (LLM discovered)
16. Urgent (LLM discovered)
17. Work (LLM discovered)
18. auth (hardcoded)
19. automated (hardcoded)
20. conversational (hardcoded)
21. finance (hardcoded)
22. junk (hardcoded)
23. newsletters (hardcoded)
24. personal (hardcoded)
25. social (hardcoded)
26. transactional (hardcoded)
27. travel (hardcoded)
28. unknown (hardcoded)
29. work (hardcoded)
```

### Duplicates Identified

- **Work (LLM) vs work (hardcoded)** - 14,223 vs 368 emails
- **Financial (LLM) vs finance (hardcoded)** - 5,943 vs 0 emails
- **Administrative (LLM) vs auth (hardcoded)** - 67,195 vs 37 emails

---

## Impact Analysis

### 1. Category Distribution (100k Results)

| Category | Count | Confidence | Source |
|----------|-------|------------|--------|
| Administrative | 67,195 | 1.000 | LLM discovered |
| Work | 14,223 | 1.000 | LLM discovered |
| Meeting | 7,785 | 1.000 | LLM discovered |
| Financial | 5,943 | 1.000 | LLM discovered |
| Operational | 3,274 | 1.000 | LLM discovered |
| junk | 394 | 0.960 | Hardcoded |
| work | 368 | 0.950 | Hardcoded |
| Miscellaneous | 238 | 1.000 | LLM discovered |
| Technical | 193 | 1.000 | LLM discovered |
| External | 137 | 1.000 | LLM discovered |
| transactional | 44 | 0.970 | Hardcoded |
| auth | 37 | 0.990 | Hardcoded |
| unknown | 23 | 0.500 | Hardcoded |
| Others | <20 each | Various | Mixed |

### 2. Extreme Over-Confidence

- **67,195 emails** classified as "Administrative" with **1.0 confidence**
- **99.9%** of all classifications have confidence >= 0.95
- This is unrealistic and suggests overfitting or poor calibration

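A quick way to verify these numbers against any future run (a sketch only; the results path and the `category`/`confidence` column names are assumptions about the classifier's output format):

```python
import pandas as pd

# Hypothetical output file and column names -- adjust to the real classification export.
df = pd.read_csv("results/enron_100k_classifications.csv")

print(f"confidence >= 0.95: {(df['confidence'] >= 0.95).mean():.1%}")   # observed: ~99.9%
print(f"confidence == 1.0:  {(df['confidence'] == 1.0).mean():.1%}")

# Per-category counts and mean confidence, mirroring the table above.
summary = (df.groupby("category")["confidence"]
             .agg(["count", "mean"])
             .sort_values("count", ascending=False))
print(summary.head(15))
```
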
### 3. Why It Still "Worked"

- LLM-discovered categories (uppercase) handled 99%+ of emails
- Hardcoded categories (lowercase) went mostly unused except for rule matches
- The model learned both sets but strongly preferred the LLM categories
- The Enron dataset doesn't map well onto the hardcoded categories

---

## Why This Happened

### Design Intent vs Reality

**Original Design:**
- Hardcoded categories in `categories.yaml` for rule-based matching
- LLM discovers NEW categories during calibration
- Merge both for flexible classification

**Reality:**
- Hardcoded categories leak into ML training
- Creates duplicate concepts (Work vs work)
- LLM labels include one-off categories (Bowl Pool 2000)
- No deduplication or conflict resolution

### The Workflow Path

```
1. CLI loads hardcoded categories from categories.yaml
   → ['junk', 'transactional', 'auth', ... 'work', 'finance', 'unknown']

2. Passes to CalibrationWorkflow.__init__(categories=...)
   → self.categories = list(categories.keys())

3. LLM discovers categories from emails
   → {'Work': 'business emails', 'Financial': 'budgets', ...}

4. Consolidation reduces duplicates (within LLM categories only)
   → But doesn't see hardcoded categories

5. Merge ALL sources at workflow.py:110
   → Hardcoded + Discovered + Label anomalies = 29 categories

6. Trainer learns all 29 categories
   → Model becomes confused but weights LLM categories heavily
```

---

## Spot-Check Findings

### High-Confidence Samples

✅ **Sample 1:** "i'll get the movie and wine. my suggestion is something from central market"
- Classified: Administrative (1.0)
- **Assessment:** Questionable - reads more like a personal email

✅ **Sample 2:** "Can you spell S-N-O-O-T-Y?"
- Classified: Administrative (1.0)
- **Assessment:** Wrong - clearly conversational/personal

✅ **Sample 3:** "MEETING TONIGHT - 6:00 pm Central Time at The Houstonian"
- Classified: Meeting (1.0)
- **Assessment:** Correct

### Low-Confidence Samples (Unknown)

⚠️ **All low-confidence samples were classified as "unknown" (0.500)**
- These fell back to the LLM
- The LLM also failed to classify them (returned unknown)
- Actual content: legitimate business emails about deferrals and power units

### Category Anomalies

❌ **"California Market" (6 emails, 1.0 confidence)**
- Too specific - shouldn't be a standalone category
- Should be "Work" or "External"

❌ **"Bowl Pool 2000" (exists in training set)**
- One-off event category
- Should never have been kept

---

## Performance Impact

### What Went Right

- **ML handled 99.1%** of emails (99,134 / 100,000)
- **Only 31 emails fell back to the LLM** (0.03%)
- Fast classification (~3 minutes for 100k)
- Discovered categories are semantically good

### What Went Wrong

- **Unrealistic confidence** - almost everything is 1.0
- **Category pollution** - 29 categories instead of 11
- **Duplicates** - Work/work, finance/Financial
- **No calibration** - model confidence not properly calibrated
- **Hardcoded categories unused** - 368 "work" vs 14,223 "Work"

---

## Root Causes

### 1. Architectural Confusion

**Two competing philosophies:**
- **Rule-based system:** Use hardcoded categories with pattern matching
- **LLM-driven system:** Discover categories from data

**Result:** They interfere with each other instead of complementing each other.

### 2. Missing Deduplication

The merge at workflow.py:110 is a simple set union with no:
- Case normalization
- Semantic similarity checking
- Conflict resolution
- Priority rules

### 3. No Consolidation Across Sources

The LLM consolidation step (lines 91-100) only consolidates within the discovered categories. It doesn't:
- Check against hardcoded categories
- Merge similar concepts
- Remove one-off labels

### 4. Poor Category Cache Design

The category cache (`src/models/category_cache.json`) saves LLM categories but:
- Doesn't deduplicate against hardcoded categories
- Allows case-sensitive duplicates
- Performs no validation of category quality

---

## Recommendations

### Immediate Fixes

1. **Remove hardcoded categories from ML training**
   - Use them ONLY for rule-based matching
   - Don't merge into `all_categories` for training
   - Let the LLM discover all ML categories

2. **Add case-insensitive deduplication** (sketched after this list)
   - Normalize to title case
   - Check semantic similarity
   - Merge duplicates before training

3. **Filter label anomalies** (sketched after this list)
   - Reject categories with <10 training samples
   - Reject overly specific categories (Bowl Pool 2000)
   - Add an LLM review step for quality

4. **Calibrate model confidence** (see the calibration sketch after this list)
   - Use temperature scaling or Platt scaling
   - Ensure confidence reflects actual accuracy

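A minimal sketch of fixes 2 and 3 combined, assuming the discovered categories arrive as a name → description mapping and that labeled-sample counts per category are available (the function name and signature are illustrative, not the project's actual API):

```python
from collections import defaultdict

def clean_categories(discovered: dict[str, str], label_counts: dict[str, int],
                     min_samples: int = 10) -> dict[str, str]:
    """Case-insensitive dedup plus small-category filtering (fixes 2 and 3)."""
    # Pool labeled-sample counts case-insensitively so "Work" and "work" count together.
    folded_counts: dict[str, int] = defaultdict(int)
    for name, count in label_counts.items():
        folded_counts[name.strip().casefold()] += count

    # Group discovered names that collide once case is ignored.
    groups: dict[str, list[str]] = defaultdict(list)
    for name in discovered:
        groups[name.strip().casefold()].append(name)

    cleaned: dict[str, str] = {}
    for key, variants in groups.items():
        # Drop one-off categories (e.g. "Bowl Pool 2000") with too few samples.
        if folded_counts[key] < min_samples:
            continue
        # Keep a single canonical, title-cased name per group.
        cleaned[variants[0].strip().title()] = discovered[variants[0]]
    return cleaned
```

Semantic duplicates such as finance/Financial still need a separate merge step, for example comparing category-name embeddings before training.
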
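For fix 4, a sketch of Platt-style calibration using scikit-learn; the base classifier, embedding dimensionality, and label count are placeholders, not the project's actual trainer:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for email embeddings and LLM-assigned labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 384))      # e.g. sentence-embedding vectors
y = rng.integers(0, 11, size=500)    # e.g. 11 discovered categories

# method="sigmoid" is Platt scaling; "isotonic" is the usual alternative
# when enough calibration samples are available.
clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="sigmoid", cv=5)
clf.fit(X, y)

confidences = clf.predict_proba(X[:5]).max(axis=1)
print(confidences)  # calibrated probabilities rather than values pinned at 1.0
```
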
### Architecture Decision

**Option A: Rule-Based + ML (Current)**
- Keep hardcoded categories for RULES ONLY
- LLM discovers categories for ML ONLY
- Never merge the two

**Option B: Pure LLM Discovery (Recommended)**
- Remove categories.yaml entirely
- LLM discovers ALL categories
- Rules can still match on keywords but don't define categories

**Option C: Hybrid with Priority**
- Define 3-5 HIGH-PRIORITY hardcoded categories (junk, auth, transactional)
- Let LLM discover everything else
- Clear hierarchy: Rules → Hardcoded ML → Discovered ML

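In code terms, Options A and B both reduce to keeping `self.categories` out of the training label space. One possible shape of the change at workflow.py:110, reusing the `clean_categories` sketch above (`label_sample_counts` is a hypothetical name for the per-category sample counts):

```python
# Hardcoded categories (self.categories) stay with the rule-based matcher only.
discovered_clean = clean_categories(discovered_categories, label_sample_counts)
all_categories = sorted(discovered_clean)  # ML training label space: discovered categories only
```
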
---

## Next Steps

1. **Decision:** Choose architecture (A, B, or C above)
2. **Fix workflow.py:110** - Implement chosen strategy
3. **Add deduplication logic** - Case-insensitive, semantic matching
4. **Rerun calibration** - Clean 250-sample run
5. **Validate results** - Ensure clean categories
6. **Fix confidence** - Add calibration layer

---

## Files to Modify

1. [src/calibration/workflow.py:110](src/calibration/workflow.py#L110) - Category merging logic
2. [src/calibration/llm_analyzer.py](src/calibration/llm_analyzer.py) - Add cross-source consolidation
3. [src/cli.py:70](src/cli.py#L70) - Decide whether to load hardcoded categories
4. [config/categories.yaml](config/categories.yaml) - Clarify purpose (rules only?)
5. [src/calibration/trainer.py](src/calibration/trainer.py) - Add confidence calibration

---

## Conclusion

The system technically worked - it classified 100k emails with high ML efficiency. However, the category explosion and over-confidence issues reveal fundamental architectural problems that need resolution before production use.

The core question: **Should hardcoded categories participate in ML training at all?**

My recommendation: **No.** Use them for rules only, and let the LLM discover the ML categories cleanly.