Root Cause Analysis: Category Explosion & Over-Confidence
Date: 2025-10-24
Run: 100k emails, qwen3:4b model
Issue: Model trained on 29 categories instead of the expected 11, with extreme over-confidence
Executive Summary
The 100k classification run technically succeeded (92.1% accuracy estimate) but revealed critical architectural issues:
- Category Explosion: 29 training categories vs expected 11
- Duplicate Categories: Work/work, Administrative/auth, finance/Financial
- Extreme Over-Confidence: 99%+ classifications at 1.0 confidence
- Category Leakage: Hardcoded categories leaked into LLM-discovered categories
The Bug
Location
src/calibration/workflow.py:110
all_categories = list(set(self.categories) | set(discovered_categories.keys()) | label_categories)
What Happened
The workflow merges THREE category sources:
1. `self.categories` - 12 hardcoded categories from `config/categories.yaml`:
   - junk, transactional, auth, newsletters, social, automated
   - conversational, work, personal, finance, travel, unknown
2. `discovered_categories.keys()` - 11 LLM-discovered categories:
   - Work, Financial, Administrative, Operational, Meeting
   - Technical, External, Announcements, Urgent, Miscellaneous, Forwarded
3. `label_categories` - additional categories from LLM labels:
   - Bowl Pool 2000, California Market, Prehearing, Change, Monitoring
   - Information
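For illustration, here is a minimal sketch of what that union effectively does. The category lists are reconstructed from the counts in this report, not copied from the project's code:

```python
# Minimal reconstruction of the merge at workflow.py:110.
# The three inputs mirror the sources described above.
hardcoded = {
    "junk", "transactional", "auth", "newsletters", "social", "automated",
    "conversational", "work", "personal", "finance", "travel", "unknown",
}  # 12 entries from config/categories.yaml
discovered = {
    "Work", "Financial", "Administrative", "Operational", "Meeting",
    "Technical", "External", "Announcements", "Urgent", "Miscellaneous",
    "Forwarded",
}  # 11 LLM-discovered categories
label_extras = {
    "Bowl Pool 2000", "California Market", "Prehearing", "Change",
    "Monitoring", "Information",
}  # 6 one-off labels that slipped through

# A plain set union is case-sensitive and does no semantic checks, so
# "Work"/"work" and "Financial"/"finance" all survive as separate classes.
all_categories = sorted(hardcoded | discovered | label_extras)
print(len(all_categories))  # 29
```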
Result: 29 Total Categories
1. Administrative (LLM discovered)
2. Announcements (LLM discovered)
3. Bowl Pool 2000 (LLM label - weird)
4. California Market (LLM label - too specific)
5. Change (LLM label - vague)
6. External (LLM discovered)
7. Financial (LLM discovered)
8. Forwarded (LLM discovered)
9. Information (LLM label - vague)
10. Meeting (LLM discovered)
11. Miscellaneous (LLM discovered)
12. Monitoring (LLM label - too specific)
13. Operational (LLM discovered)
14. Prehearing (LLM label - too specific)
15. Technical (LLM discovered)
16. Urgent (LLM discovered)
17. Work (LLM discovered)
18. auth (hardcoded)
19. automated (hardcoded)
20. conversational (hardcoded)
21. finance (hardcoded)
22. junk (hardcoded)
23. newsletters (hardcoded)
24. personal (hardcoded)
25. social (hardcoded)
26. transactional (hardcoded)
27. travel (hardcoded)
28. unknown (hardcoded)
29. work (hardcoded)
Duplicates Identified
- Work (LLM) vs work (hardcoded) - 14,223 vs 368 emails
- Financial (LLM) vs finance (hardcoded) - 5,943 vs 0 emails
- Administrative (LLM) vs auth (hardcoded) - 67,195 vs 37 emails
Impact Analysis
1. Category Distribution (100k Results)
| Category | Count | Confidence | Source |
|---|---|---|---|
| Administrative | 67,195 | 1.000 | LLM discovered |
| Work | 14,223 | 1.000 | LLM discovered |
| Meeting | 7,785 | 1.000 | LLM discovered |
| Financial | 5,943 | 1.000 | LLM discovered |
| Operational | 3,274 | 1.000 | LLM discovered |
| junk | 394 | 0.960 | Hardcoded |
| work | 368 | 0.950 | Hardcoded |
| Miscellaneous | 238 | 1.000 | LLM discovered |
| Technical | 193 | 1.000 | LLM discovered |
| External | 137 | 1.000 | LLM discovered |
| transactional | 44 | 0.970 | Hardcoded |
| auth | 37 | 0.990 | Hardcoded |
| unknown | 23 | 0.500 | Hardcoded |
| Others | <20 each | Various | Mixed |
2. Extreme Over-Confidence
- 67,195 emails classified as "Administrative" with 1.0 confidence
- 99.9% of all classifications have confidence >= 0.95
- This is unrealistic and suggests overfitting or poor calibration
3. Why It Still "Worked"
- LLM-discovered categories (uppercase) handled 99%+ of emails
- Hardcoded categories (lowercase) mostly unused except for rules
- Model learned both sets but strongly preferred LLM categories
- Enron dataset doesn't match hardcoded categories well
Why This Happened
Design Intent vs Reality
Original Design:
- Hardcoded categories in `categories.yaml` for rule-based matching
- LLM discovers NEW categories during calibration
- Merge both for flexible classification
Reality:
- Hardcoded categories leak into ML training
- Creates duplicate concepts (Work vs work)
- LLM labels include one-off categories (Bowl Pool 2000)
- No deduplication or conflict resolution
The Workflow Path
1. CLI loads hardcoded categories from categories.yaml
→ ['junk', 'transactional', 'auth', ... 'work', 'finance', 'unknown']
2. Passes to CalibrationWorkflow.__init__(categories=...)
→ self.categories = list(categories.keys())
3. LLM discovers categories from emails
→ {'Work': 'business emails', 'Financial': 'budgets', ...}
4. Consolidation reduces duplicates (within LLM categories only)
→ But doesn't see hardcoded categories
5. Merge ALL sources at workflow.py:110
→ Hardcoded + Discovered + Label anomalies = 29 categories
6. Trainer learns all 29 categories
→ Model becomes confused but weights LLM categories heavily
Spot-Check Findings
High Confidence Samples (1.0 Confidence)
✅ Sample 1: "i'll get the movie and wine. my suggestion is something from central market"
- Classified: Administrative (1.0)
- Assessment: Questionable - looks more personal
✅ Sample 2: "Can you spell S-N-O-O-T-Y?"
- Classified: Administrative (1.0)
- Assessment: Wrong - clearly conversational/personal
✅ Sample 3: "MEETING TONIGHT - 6:00 pm Central Time at The Houstonian"
- Classified: Meeting (1.0)
- Assessment: Correct
Low Confidence Samples (Unknown)
⚠️ All low confidence samples classified as "unknown" (0.500)
- These fell back to LLM
- LLM failed to classify (returned unknown)
- Actual content: Legitimate business emails about deferrals, power units
Category Anomalies
❌ "California Market" (6 emails, 1.0 confidence)
- Too specific - shouldn't be a standalone category
- Should be "Work" or "External"
❌ "Bowl Pool 2000" (exists in training set)
- One-off event category
- Should never have been kept
Performance Impact
What Went Right
- ML handled 99.1% of emails (99,134 / 100,000)
- Only 31 fell to LLM (0.03%)
- Fast classification (~3 minutes for 100k)
- Discovered categories are semantically good
What Went Wrong
- Unrealistic confidence - Almost everything is 1.0
- Category pollution - 29 instead of 11
- Duplicates - Work/work, finance/Financial
- No calibration - Model confidence not properly calibrated
- Hardcoded categories barely used - 368 "work" vs 14,223 "Work"
Root Causes
1. Architectural Confusion
Two competing philosophies:
- Rule-based system: Use hardcoded categories with pattern matching
- LLM-driven system: Discover categories from data
Result: They interfere with each other instead of complementing
2. Missing Deduplication
The workflow.py:110 line does a simple set union without:
- Case normalization
- Semantic similarity checking
- Conflict resolution
- Priority rules
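As a sketch of what such a pass could look like before the union, here is a minimal deduplication helper. It is illustrative only (not existing project code) and uses string similarity as a cheap stand-in for the semantic check; an embedding-based comparison would be a drop-in replacement and is what it would take to catch pairs like Administrative/auth:

```python
from difflib import SequenceMatcher

def dedupe_categories(categories, threshold=0.75):
    """Collapse case variants and near-duplicate names before training."""
    kept = []
    for name in categories:
        normalized = name.strip().title()  # "work" -> "Work"
        is_duplicate = any(
            SequenceMatcher(None, normalized.lower(), existing.lower()).ratio() >= threshold
            for existing in kept
        )
        if not is_duplicate:
            kept.append(normalized)
    return kept

# "work" collapses into "Work"; "finance" collapses into "Financial".
# Purely semantic duplicates (Administrative vs auth) need embeddings instead.
print(dedupe_categories(["Work", "work", "Financial", "finance", "Meeting"]))
# ['Work', 'Financial', 'Meeting']
```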
3. No Consolidation Across Sources
The LLM consolidation step (lines 91-100) only consolidates within discovered categories. It doesn't:
- Check against hardcoded categories
- Merge similar concepts
- Remove one-off labels
4. Poor Category Cache Design
The category cache (src/models/category_cache.json) saves LLM categories but:
- Doesn't deduplicate against hardcoded categories
- Allows case-sensitive duplicates
- No validation of category quality
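A minimal sketch of the kind of validation the cache loader could enforce is below. The path comes from this report; the cache schema (name -> description) and the 10-sample cutoff are assumptions for illustration:

```python
import json
from collections import Counter

def load_valid_categories(cache_path="src/models/category_cache.json",
                          training_labels=None, min_samples=10):
    """Load cached LLM categories, dropping case duplicates and low-support labels.

    Assumes the cache maps category name -> description; adjust if the
    real schema differs.
    """
    with open(cache_path) as fh:
        cached = json.load(fh)

    support = Counter(training_labels or [])
    seen = set()
    valid = {}
    for name, description in cached.items():
        if name.lower() in seen:
            continue  # case-sensitive duplicate, e.g. "Work" vs "work"
        if training_labels is not None and support[name] < min_samples:
            continue  # one-off labels such as "Bowl Pool 2000"
        seen.add(name.lower())
        valid[name] = description
    return valid
```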
Recommendations
Immediate Fixes
1. Remove hardcoded categories from ML training
   - Use them ONLY for rule-based matching
   - Don't merge into `all_categories` for training
   - Let LLM discover all ML categories
2. Add case-insensitive deduplication
   - Normalize to title case
   - Check semantic similarity
   - Merge duplicates before training
3. Filter label anomalies
   - Reject categories with <10 training samples
   - Reject overly specific categories (Bowl Pool 2000)
   - LLM review step for quality
4. Calibrate model confidence (see the sketch after this list)
   - Use temperature scaling or Platt scaling
   - Ensure confidence reflects actual accuracy
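A minimal Platt-scaling sketch with scikit-learn follows. The synthetic data stands in for email embeddings and labels; the project's actual trainer and feature pipeline may differ:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for email embeddings and category labels.
X, y = make_classification(n_samples=2000, n_features=64, n_informative=20,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# method="sigmoid" is Platt scaling; "isotonic" is an option with more data.
clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                             method="sigmoid", cv=5)
clf.fit(X_train, y_train)

# Calibrated probabilities should track empirical accuracy, so confidences
# pinned at 1.0 across the board would be a red flag.
mean_top_confidence = clf.predict_proba(X_test).max(axis=1).mean()
print(round(float(mean_top_confidence), 3))
```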
Architecture Decision
Option A: Rule-Based + ML (Current)
- Keep hardcoded categories for RULES ONLY
- LLM discovers categories for ML ONLY
- Never merge the two
Option B: Pure LLM Discovery (Recommended)
- Remove categories.yaml entirely
- LLM discovers ALL categories
- Rules can still match on keywords but don't define categories
Option C: Hybrid with Priority
- Define 3-5 HIGH-PRIORITY hardcoded categories (junk, auth, transactional)
- Let LLM discover everything else
- Clear hierarchy: Rules → Hardcoded ML → Discovered ML
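To make the Option C hierarchy concrete, here is an illustrative sketch; the matcher/classifier callables and the 0.55 threshold are placeholders, not existing project APIs:

```python
# Illustrative Option C flow: Rules -> Hardcoded ML -> Discovered ML.
PRIORITY_HARDCODED = {"junk", "auth", "transactional"}

def classify(email, rule_matcher, hardcoded_clf, discovered_clf, threshold=0.55):
    # 1. Rules win outright for the small high-priority set.
    rule_hit = rule_matcher(email)
    if rule_hit in PRIORITY_HARDCODED:
        return rule_hit, 1.0, "rule"

    # 2. Hardcoded ML categories next, gated on the confidence threshold.
    category, confidence = hardcoded_clf(email)
    if category in PRIORITY_HARDCODED and confidence >= threshold:
        return category, confidence, "hardcoded-ml"

    # 3. Everything else falls through to the LLM-discovered category model.
    category, confidence = discovered_clf(email)
    return category, confidence, "discovered-ml"
```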
Next Steps
- Decision: Choose architecture (A, B, or C above)
- Fix workflow.py:110 - Implement chosen strategy
- Add deduplication logic - Case-insensitive, semantic matching
- Rerun calibration - Clean 250-sample run
- Validate results - Ensure clean categories
- Fix confidence - Add calibration layer
Files to Modify
- src/calibration/workflow.py:110 - Category merging logic
- src/calibration/llm_analyzer.py - Add cross-source consolidation
- src/cli.py:70 - Decide whether to load hardcoded categories
- config/categories.yaml - Clarify purpose (rules only?)
- src/calibration/trainer.py - Add confidence calibration
Conclusion
The system technically worked - it classified 100k emails with high ML efficiency. However, the category explosion and over-confidence issues reveal fundamental architectural problems that need resolution before production use.
The core question: Should hardcoded categories participate in ML training at all?
My recommendation: No. Use them for rules only, let LLM discover ML categories cleanly.