
# Root Cause Analysis: Category Explosion & Over-Confidence

- **Date:** 2025-10-24
- **Run:** 100k emails, qwen3:4b model
- **Issue:** Model trained on 29 categories instead of the expected 11, with extreme over-confidence


## Executive Summary

The 100k classification run technically succeeded (estimated 92.1% accuracy) but revealed critical architectural issues:

1. **Category Explosion:** 29 training categories vs the expected 11
2. **Duplicate Categories:** Work/work, Administrative/auth, finance/Financial
3. **Extreme Over-Confidence:** 99%+ of classifications at 1.0 confidence
4. **Category Leakage:** Hardcoded categories leaked into LLM-discovered categories

## The Bug

### Location

`src/calibration/workflow.py:110`

```python
all_categories = list(set(self.categories) | set(discovered_categories.keys()) | label_categories)
```
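
Because Python's set union is case-sensitive, near-duplicates like `work` and `Work` both survive the merge. A minimal standalone illustration (not the project's code):

```python
hardcoded = {"work", "finance", "auth"}
discovered = {"Work", "Financial", "Administrative"}

# A plain set union keeps case variants and synonyms as distinct categories.
print(sorted(hardcoded | discovered))
# ['Administrative', 'Financial', 'Work', 'auth', 'finance', 'work']
```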

### What Happened

The workflow merges THREE category sources:

1. `self.categories` - 12 hardcoded categories from `config/categories.yaml`:
   - junk, transactional, auth, newsletters, social, automated
   - conversational, work, personal, finance, travel, unknown
2. `discovered_categories.keys()` - 11 LLM-discovered categories:
   - Work, Financial, Administrative, Operational, Meeting
   - Technical, External, Announcements, Urgent, Miscellaneous, Forwarded
3. `label_categories` - additional categories from LLM labels:
   - Bowl Pool 2000, California Market, Prehearing, Change, Monitoring, Information

### Result: 29 Total Categories

```text
 1. Administrative      (LLM discovered)
 2. Announcements       (LLM discovered)
 3. Bowl Pool 2000      (LLM label - weird)
 4. California Market   (LLM label - too specific)
 5. Change              (LLM label - vague)
 6. External            (LLM discovered)
 7. Financial           (LLM discovered)
 8. Forwarded           (LLM discovered)
 9. Information         (LLM label - vague)
10. Meeting             (LLM discovered)
11. Miscellaneous       (LLM discovered)
12. Monitoring          (LLM label - too specific)
13. Operational         (LLM discovered)
14. Prehearing          (LLM label - too specific)
15. Technical           (LLM discovered)
16. Urgent              (LLM discovered)
17. Work                (LLM discovered)
18. auth                (hardcoded)
19. automated           (hardcoded)
20. conversational      (hardcoded)
21. finance             (hardcoded)
22. junk                (hardcoded)
23. newsletters         (hardcoded)
24. personal            (hardcoded)
25. social              (hardcoded)
26. transactional       (hardcoded)
27. travel              (hardcoded)
28. unknown             (hardcoded)
29. work                (hardcoded)
```

### Duplicates Identified

- Work (LLM) vs work (hardcoded) - 14,223 vs 368 emails
- Financial (LLM) vs finance (hardcoded) - 5,943 vs 0 emails
- Administrative (LLM) vs auth (hardcoded) - 67,195 vs 37 emails

## Impact Analysis

### 1. Category Distribution (100k Results)

| Category       | Count    | Confidence | Source         |
|----------------|----------|------------|----------------|
| Administrative | 67,195   | 1.000      | LLM discovered |
| Work           | 14,223   | 1.000      | LLM discovered |
| Meeting        | 7,785    | 1.000      | LLM discovered |
| Financial      | 5,943    | 1.000      | LLM discovered |
| Operational    | 3,274    | 1.000      | LLM discovered |
| junk           | 394      | 0.960      | Hardcoded      |
| work           | 368      | 0.950      | Hardcoded      |
| Miscellaneous  | 238      | 1.000      | LLM discovered |
| Technical      | 193      | 1.000      | LLM discovered |
| External       | 137      | 1.000      | LLM discovered |
| transactional  | 44       | 0.970      | Hardcoded      |
| auth           | 37       | 0.990      | Hardcoded      |
| unknown        | 23       | 0.500      | Hardcoded      |
| Others         | <20 each | Various    | Mixed          |

### 2. Extreme Over-Confidence

- 67,195 emails classified as "Administrative" with 1.0 confidence
- 99.9% of all classifications have confidence >= 0.95
- This is unrealistic and suggests overfitting or poor calibration

### 3. Why It Still "Worked"

- LLM-discovered categories (uppercase) handled 99%+ of emails
- Hardcoded categories (lowercase) went mostly unused except for rules
- Model learned both sets but strongly preferred LLM categories
- The Enron dataset doesn't match the hardcoded categories well

## Why This Happened

### Design Intent vs Reality

**Original Design:**

- Hardcoded categories in `categories.yaml` for rule-based matching
- LLM discovers NEW categories during calibration
- Merge both for flexible classification

**Reality:**

- Hardcoded categories leak into ML training
- Creates duplicate concepts (Work vs work)
- LLM labels include one-off categories (Bowl Pool 2000)
- No deduplication or conflict resolution

### The Workflow Path

```text
1. CLI loads hardcoded categories from categories.yaml
   → ['junk', 'transactional', 'auth', ... 'work', 'finance', 'unknown']

2. Passes to CalibrationWorkflow.__init__(categories=...)
   → self.categories = list(categories.keys())

3. LLM discovers categories from emails
   → {'Work': 'business emails', 'Financial': 'budgets', ...}

4. Consolidation reduces duplicates (within LLM categories only)
   → But doesn't see hardcoded categories

5. Merge ALL sources at workflow.py:110
   → Hardcoded + Discovered + Label anomalies = 29 categories

6. Trainer learns all 29 categories
   → Model becomes confused but weights LLM categories heavily
```

## Spot-Check Findings

### High-Confidence Samples

**Sample 1:** "i'll get the movie and wine. my suggestion is something from central market"

- Classified: Administrative (1.0)
- Assessment: Questionable - looks more personal

**Sample 2:** "Can you spell S-N-O-O-T-Y?"

- Classified: Administrative (1.0)
- Assessment: Wrong - clearly conversational/personal

**Sample 3:** "MEETING TONIGHT - 6:00 pm Central Time at The Houstonian"

- Classified: Meeting (1.0)
- Assessment: Correct

### Low-Confidence Samples (Unknown)

⚠️ All low-confidence samples were classified as "unknown" (0.500)

- These fell back to the LLM
- The LLM failed to classify them (returned unknown)
- Actual content: legitimate business emails about deferrals and power units

### Category Anomalies

**"California Market"** (6 emails, 1.0 confidence)

- Too specific - shouldn't be a standalone category
- Should be "Work" or "External"

**"Bowl Pool 2000"** (exists in training set)

- One-off event category
- Should never have been kept

## Performance Impact

### What Went Right

- ML handled 99.1% of emails (99,134 / 100,000)
- Only 31 fell back to the LLM (0.03%)
- Fast classification (~3 minutes for 100k)
- Discovered categories are semantically good

### What Went Wrong

- **Unrealistic confidence** - almost everything is 1.0
- **Category pollution** - 29 categories instead of 11
- **Duplicates** - Work/work, finance/Financial
- **No calibration** - model confidence not properly calibrated
- **Hardcoded categories unused** - 368 "work" vs 14,223 "Work"

## Root Causes

### 1. Architectural Confusion

Two competing philosophies:

- **Rule-based system:** use hardcoded categories with pattern matching
- **LLM-driven system:** discover categories from data

Result: they interfere with each other instead of complementing each other.

### 2. Missing Deduplication

The merge at `workflow.py:110` is a simple set union with no:

- Case normalization
- Semantic similarity checking
- Conflict resolution
- Priority rules
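
A minimal sketch of a safer merge, under the assumption that LLM-discovered spellings should win ties; `merge_categories` is a hypothetical helper, not existing project code:

```python
def merge_categories(hardcoded, label_categories, discovered):
    """Case-insensitive union; later sources overwrite earlier ones,
    so LLM-discovered spellings take priority (a design assumption)."""
    merged = {}
    for source in (hardcoded, label_categories, discovered):
        for name in source:
            merged[name.strip().lower()] = name  # key collapses Work/work
    return sorted(merged.values())

# merge_categories(['work', 'finance'], ['Bowl Pool 2000'], ['Work', 'Financial'])
# -> ['Bowl Pool 2000', 'Financial', 'Work', 'finance']
```

Note this only fixes exact case duplicates; pairs like finance/Financial would still need the semantic similarity check above.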

### 3. No Consolidation Across Sources

The LLM consolidation step (`workflow.py` lines 91-100) only consolidates within discovered categories. It doesn't:

- Check against hardcoded categories
- Merge similar concepts
- Remove one-off labels

### 4. Poor Category Cache Design

The category cache (`src/models/category_cache.json`) saves LLM categories but:

- Doesn't deduplicate against hardcoded categories
- Allows case-sensitive duplicates
- Has no validation of category quality
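
A quality gate could be as simple as a sample-count filter before categories reach the cache or the trainer. A sketch assuming the <10-sample cutoff proposed under Recommendations; `filter_rare_categories` is hypothetical:

```python
from collections import Counter

def filter_rare_categories(labels, min_samples=10, fallback="unknown"):
    """Remap categories with too few training examples (e.g. 'Bowl Pool 2000')
    to a fallback bucket before they reach the trainer or the cache."""
    counts = Counter(labels)
    keep = {cat for cat, n in counts.items() if n >= min_samples}
    return [lab if lab in keep else fallback for lab in labels]
```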

## Recommendations

### Immediate Fixes

1. **Remove hardcoded categories from ML training**
   - Use them ONLY for rule-based matching
   - Don't merge them into `all_categories` for training
   - Let the LLM discover all ML categories
2. **Add case-insensitive deduplication**
   - Normalize to title case
   - Check semantic similarity
   - Merge duplicates before training (see the merge sketch above)
3. **Filter label anomalies**
   - Reject categories with <10 training samples (see the filter sketch above)
   - Reject overly specific categories (Bowl Pool 2000)
   - Add an LLM review step for quality
4. **Calibrate model confidence** (see the sketch after this list)
   - Use temperature scaling or Platt scaling
   - Ensure confidence reflects actual accuracy
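
A hedged sketch of fix 4 using scikit-learn's Platt scaling (`method="sigmoid"`); the base classifier and the synthetic features are stand-ins, not the project's actual trainer:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

# Stand-in for email embeddings and labels; the real pipeline would pass
# the trainer's feature matrix instead (an assumption about its interface).
rng = np.random.default_rng(0)
X = rng.random((500, 32))
y = rng.integers(0, 3, size=500)

# Platt scaling fits a sigmoid on held-out folds so predicted probabilities
# track empirical accuracy instead of saturating at 1.0.
base = LogisticRegression(max_iter=1000)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X, y)
print(calibrated.predict_proba(X[:3]).round(3))
```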

### Architecture Decision

**Option A: Rule-Based + ML (Current)**

- Keep hardcoded categories for RULES ONLY
- LLM discovers categories for ML ONLY
- Never merge the two

**Option B: Pure LLM Discovery (Recommended)**

- Remove `categories.yaml` entirely
- LLM discovers ALL categories
- Rules can still match on keywords but don't define categories

**Option C: Hybrid with Priority**

- Define 3-5 HIGH-PRIORITY hardcoded categories (junk, auth, transactional)
- Let the LLM discover everything else
- Clear hierarchy: Rules → Hardcoded ML → Discovered ML (see the sketch below)
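
To make Option C's hierarchy concrete, a minimal sketch; the callables and the 0.55 threshold are assumptions, not existing project code:

```python
def classify(email, rules, ml_model, llm, threshold=0.55):
    """Option C precedence: rules first, then calibrated ML, then LLM fallback."""
    # 1. High-priority hardcoded rules (junk, auth, transactional).
    category = rules(email)
    if category is not None:
        return category, 1.0
    # 2. Calibrated ML over LLM-discovered categories.
    category, confidence = ml_model(email)
    if confidence >= threshold:
        return category, confidence
    # 3. LLM fallback for low-confidence cases.
    return llm(email), confidence
```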

## Next Steps

1. **Decision** - choose an architecture (A, B, or C above)
2. **Fix workflow.py:110** - implement the chosen strategy
3. **Add deduplication logic** - case-insensitive, semantic matching
4. **Rerun calibration** - clean 250-sample run
5. **Validate results** - ensure clean categories
6. **Fix confidence** - add a calibration layer

## Files to Modify

1. `src/calibration/workflow.py:110` - category merging logic
2. `src/calibration/llm_analyzer.py` - add cross-source consolidation
3. `src/cli.py:70` - decide whether to load hardcoded categories
4. `config/categories.yaml` - clarify purpose (rules only?)
5. `src/calibration/trainer.py` - add confidence calibration

## Conclusion

The system technically worked - it classified 100k emails with high ML efficiency. However, the category explosion and over-confidence issues reveal fundamental architectural problems that need resolution before production use.

**The core question:** should hardcoded categories participate in ML training at all?

**My recommendation:** No. Use them for rules only, and let the LLM discover ML categories cleanly.