The LLM now decides which new categories should be added to the persistent
cache for future mailbox runs and which should remain temporary (run-only).
ENHANCED LLM REVIEW:
- New field: "cache_worthy" (true/false) for each "new" category
- LLM judges: "Is this category useful across different mailboxes?"
- Examples:
  - "Customer Support" → cache_worthy: true (universal)
  - "Project X Updates" → cache_worthy: false (mailbox-specific)
CACHE EVOLUTION:
- cache_worthy=true → Added to persistent cache for future runs
- cache_worthy=false → Used for current run only, not cached
- First run (empty cache) → All categories treated as cache-worthy
- LLM reasoning logged for transparency
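A rough sketch of how the verdict might be applied at run time (the
`cache.add()` call and any name beyond cache_worthy/reasoning are
assumptions, not the real API):

```python
# Hypothetical routing of reviewed categories by the LLM's cache_worthy
# verdict; `cache.add()` stands in for whatever the real cache API is.
import logging

logger = logging.getLogger(__name__)

def apply_review(reviewed, cache, first_run=False):
    run_only = []
    for cat in reviewed:
        # First run (empty cache): treat every category as cache-worthy.
        worthy = first_run or cat.get("cache_worthy", False)
        logger.info("category=%s cache_worthy=%s reason=%s",
                    cat["name"], worthy, cat.get("reasoning", ""))
        if worthy:
            cache.add(cat["name"])        # persisted for future runs
        else:
            run_only.append(cat["name"])  # current run only, not cached
    return run_only
```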
INTELLIGENT GROWTH:
- Cache grows organically with high-quality, reusable categories
- Prevents pollution with mailbox-specific categories
- Maintains cross-mailbox consistency while allowing natural evolution
- LLM balances consistency (snap to existing) vs expansion (add worthy new categories)
SINGLE LLM CALL EFFICIENCY:
- Same ~4 second LLM call now handles:
  1. Snap vs new decision
  2. Cache persistence decision
  3. Reasoning for both
- No additional overhead for cache evolution
Result: Cache evolves intelligently over time, collecting universally
useful categories while filtering out temporary/specific ones.
CategoryCache now uses Ollama embeddings + cosine similarity for
true semantic category matching instead of weak keyword overlap.
Changes:
- src/calibration/category_cache.py: Use embedder.embeddings() API (sketch below)
  - Calculate embeddings for discovered and cached category descriptions
  - Compute cosine similarity between embedding vectors
  - Fall back to partial name matching if embeddings are unavailable
  - Error handling with graceful degradation
- src/calibration/workflow.py: Pass feature_extractor.embedder
  - Provide Ollama client to CalibrationAnalyzer
  - Enables semantic matching during cache snap
- src/calibration/llm_analyzer.py: Accept embedding_model parameter
  - Forward embedder to CategoryCache constructor
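A minimal sketch of the matching step, assuming embedder.embeddings()
takes a text and returns a plain vector (the real signature may differ):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def snap_to_cache(embedder, discovered_desc, cached, threshold=0.7):
    """Return (best_cached_name, score), or (None, score) if below threshold."""
    try:
        query = embedder.embeddings(discovered_desc)
        scored = [(name, cosine_similarity(query, embedder.embeddings(desc)))
                  for name, desc in cached.items()]
        name, score = max(scored, key=lambda s: s[1], default=(None, 0.0))
        return (name, score) if score >= threshold else (None, score)
    except Exception:
        # Graceful degradation: weak partial name matching as a fallback.
        for name in cached:
            if name.lower() in discovered_desc.lower():
                return name, 0.0
        return None, 0.0
```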
Test Results (embedding-based vs keyword):
- "Training Materials" → "Training": 0.72 (was 0.15)
- "Team Updates" → "Work Communication": 0.62 (was 0.24)
- "System Alerts" → "Technical": 0.63 (was 0.12)
- "Meeting Invitations" → "Meetings": 0.75+ (exact match)
Semantic matching now properly identifies similar categories based
on meaning rather than superficial word overlap.
Category Cache System (src/calibration/category_cache.py):
- Persistent storage of discovered categories across mailbox runs
- Semantic matching to snap new categories to existing ones
- Usage tracking for category popularity
- Configurable similarity threshold and new category limits
- JSON-based cache with metadata (created, last_seen, email counts)
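The on-disk layout might look roughly like this (field names beyond
created, last_seen, and the counts are illustrative):

```python
# Illustrative cache entry; the real JSON schema may differ.
cache_example = {
    "categories": {
        "Customer Support": {
            "description": "Requests for help with products or services",
            "created": "2025-01-12T09:30:00Z",
            "last_seen": "2025-02-03T14:05:00Z",
            "email_count": 412,  # emails matched across all runs
            "usage_count": 7,    # runs in which this category was used
        }
    }
}
```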
Discovery Improvements (src/calibration/llm_analyzer.py):
- Calculate batch statistics: sender domains, recipient counts,
  attachments, subject lengths, common keywords (example below)
- Add statistics to LLM discovery prompt for better decisions
- Integrate CategoryCache into CalibrationAnalyzer
- 3-step workflow: Discover → Consolidate → Snap to Cache
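The statistics step can be as simple as the following (Email attribute
names here are assumptions):

```python
from collections import Counter

def batch_stats(emails):
    """Summarize a sample batch for the discovery prompt."""
    if not emails:
        return {}
    domains = Counter(e.sender.split("@")[-1] for e in emails if "@" in e.sender)
    words = Counter(w.lower() for e in emails
                    for w in e.subject.split() if len(w) > 3)
    n = len(emails)
    return {
        "top_sender_domains": domains.most_common(10),
        "avg_recipients": sum(len(e.recipients) for e in emails) / n,
        "pct_with_attachments": 100 * sum(bool(e.attachments) for e in emails) / n,
        "avg_subject_length": sum(len(e.subject) for e in emails) / n,
        "common_keywords": [w for w, _ in words.most_common(15)],
    }
```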
Consolidation Improvements:
- Add cached categories as hints in consolidation prompt
- LLM prefers snapping to established categories
- Maintains cross-mailbox consistency while allowing new categories
Configuration Parameters:
- use_category_cache: Enable/disable caching (default: true)
- cache_similarity_threshold: Min similarity for snap (default: 0.7)
- cache_allow_new: Allow new categories (default: true)
- cache_max_new: Max new categories per run (default: 3)
- category_cache_path: Custom cache location
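Taken together, the defaults amount to something like this (dict form is
for illustration; the actual config format may differ):

```python
calibration_defaults = {
    "use_category_cache": True,         # enable/disable caching
    "cache_similarity_threshold": 0.7,  # min similarity to snap to cache
    "cache_allow_new": True,            # permit brand-new categories
    "cache_max_new": 3,                 # cap on new categories per run
    "category_cache_path": None,        # None -> default cache location
}
```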
Result: Consistent category sets across different mailboxes
with intelligent discovery of new categories when appropriate.
Both discovery and consolidation prompts now explain:
- What the system does (train ML classifier for auto-sorting)
- What makes good categories (broad, timeless, learnable)
- Why this matters (user needs, ML training requirements)
- How to think about the task (user-focused, functional)
Discovery prompt changes:
- Explains goal of identifying natural categories for ML training
- Lists guidelines for good categories (broad, user-focused, learnable)
- Provides concrete examples of functional categories
- Emphasizes PURPOSE over topic
Consolidation prompt changes:
- Explains full system context (LightGBM, auto-labeling, user search)
- Defines what makes categories effective for ML and users
- Provides user-centric thinking framework
- Emphasizes reusability and timelessness
Prompts now give the 8B model the context it needs to deliver strong,
specific category decisions instead of lazy, generic categorization.
Enhanced _consolidate_categories() with comprehensive validation:
- Edge case guards: Skip if ≤5 categories or no labels
- Parameter validation: Clamp ranges for all config values
- 5-stage validation after LLM response:
  1. Structure check (valid dicts)
  2. Reduction check (consolidation must reduce count)
  3. Target compliance (soft 50% overage limit)
  4. Complete mapping (all old categories mapped)
  5. Valid targets (all mappings point to existing categories)
- Auto-repair for common LLM failures:
  - Unmapped categories → map to first consolidated category
  - Invalid mapping targets → create missing categories
  - Failed updates → log with details
- Fallback consolidation using top-N by count (sketch below):
  - Triggered on JSON parse errors and validation failures
  - Heuristic-based, no LLM required
  - Guarantees output even if the LLM fails
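The heuristic fallback reduces to keeping the top-N categories by email
count and remapping everything else onto a survivor; a sketch (names are
illustrative, and the remainder is sent to the most common category):

```python
from collections import Counter

def fallback_consolidate(labels, target=10):
    """LLM-free consolidation: keep top-N categories, remap the rest."""
    counts = Counter(labels)
    keep = [cat for cat, _ in counts.most_common(target)]
    default = keep[0]  # most common category absorbs the remainder
    mapping = {cat: (cat if cat in keep else default) for cat in counts}
    return keep, mapping
```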
All error paths now have proper handling and logging.
Root cause: The pre-trained model was loading successfully, causing the CLI
to skip calibration entirely. The system went straight to classification
with the 35%-accuracy model.
Changes:
- config: Set calibration_model to qwen3:8b-q4_K_M (larger model for better instruction following)
- cli: Create separate calibration_llm provider with 8b model
- llm_analyzer: Improved prompt to force exact email ID copying
- workflow: Merge discovered categories with predefined ones
- workflow: Add detailed error logging for label mismatches
- ml_classifier: Fixed model path checking (was checking a None parameter)
- ml_classifier: Add dual API support (sklearn predict_proba vs LightGBM predict; sketch below)
- ollama: Fixed model list parsing (use m.model not m.get('name'))
- feature_extractor: Switch to Ollama embeddings (instant vs 90s load time)
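The dual-API branch boils down to something like this (helper name is
illustrative): sklearn-style estimators expose predict_proba(), while a
native lightgbm.Booster returns class probabilities directly from predict().

```python
def predict_probabilities(model, features):
    if hasattr(model, "predict_proba"):  # sklearn-style wrapper
        return model.predict_proba(features)
    return model.predict(features)       # native LightGBM Booster
```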
Result: Calibration now runs and generates 16 categories + 50 labels correctly.
Next: Investigate calibration sampling to reduce overfitting on small samples.
PHASE 9: Processing Pipeline & Queue Management (bulk_processor.py)
- BulkProcessor class for batch processing with checkpointing
- ProcessingCheckpoint: Save/resume state for resumable processing
- Handles batches with periodic checkpoints every N emails
- Tracks completed, queued_for_llm, and failed emails
- Progress callbacks for UI integration
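A minimal sketch of the checkpoint shape (field names follow the list
above; serialization details are assumptions):

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ProcessingCheckpoint:
    completed: list = field(default_factory=list)       # processed email IDs
    queued_for_llm: list = field(default_factory=list)  # low-confidence IDs
    failed: list = field(default_factory=list)          # errored email IDs

    def save(self, path):
        with open(path, "w") as f:
            json.dump(asdict(self), f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(**json.load(f))
```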
PHASE 10: Calibration System (sampler.py, llm_analyzer.py)
- EmailSampler: Stratified and random sampling (sketch below)
  - Stratifies by sender domain type for representativeness
- CalibrationAnalyzer: Uses the LLM to discover natural categories
  - Batched analysis to control LLM load
  - Maps discovered categories to a universal schema
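A sketch of the stratified path (the domain-type bucketing rule here is
an assumption, not the real classifier):

```python
import random
from collections import defaultdict

def stratified_sample(emails, n, seed=42):
    """Sample roughly evenly across sender-domain-type buckets."""
    if not emails or n <= 0:
        return []
    buckets = defaultdict(list)
    for e in emails:
        domain = e.sender.split("@")[-1].lower()
        kind = ("freemail" if domain in {"gmail.com", "yahoo.com", "hotmail.com"}
                else "internal" if domain.endswith("enron.com")
                else "external")
        buckets[kind].append(e)
    rng = random.Random(seed)
    per_bucket = max(1, n // len(buckets))
    picked = [e for group in buckets.values()
              for e in rng.sample(group, min(per_bucket, len(group)))]
    return picked[:n]
```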
PHASE 11: Export & Reporting (exporter.py)
- ResultsExporter: Export to JSON, CSV, and by-category output
- ReportGenerator: Generate human-readable text reports
- Category statistics and method breakdown
- Accuracy metrics and processing time tracking
PHASE 13: Enron Dataset Parser (enron_parser.py)
- Parses Enron maildir format into Email objects
- Handles multipart emails and attachments
- Date parsing with fallback for malformed dates (see sketch below)
- Ready to train mock model on real data
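The date fallback can be as small as this (the sentinel value is an
assumption):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_date(raw_header):
    try:
        return parsedate_to_datetime(raw_header)
    except (TypeError, ValueError):
        # Malformed or missing Date header: fall back to a fixed sentinel.
        return datetime(1970, 1, 1, tzinfo=timezone.utc)
```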
PHASE 14: Main Orchestration (orchestration.py)
- EmailSorterOrchestrator: Coordinates entire pipeline
- 4-phase workflow: Calibration → Bulk → LLM → Export
- Lazy initialization of components
- Progress tracking and timing
- Full pipeline runner with resume support
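Skeleton of the 4-phase run (method names are illustrative, bodies elided):

```python
class EmailSorterOrchestrator:
    def run(self, emails, resume=False):
        categories = self._calibrate(emails)                   # Phase 1
        done, queued = self._bulk(emails, categories, resume)  # Phase 2
        done += self._llm_review(queued)                       # Phase 3
        return self._export(done)                              # Phase 4

    # Components below are lazily initialized in the real implementation.
    def _calibrate(self, emails): ...
    def _bulk(self, emails, categories, resume): ...
    def _llm_review(self, queued): ...
    def _export(self, results): ...
```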
Components Now Available:
✅ Sampling (stratified and random)
✅ Calibration (LLM-driven category discovery)
✅ Bulk processing (with checkpointing)
✅ LLM review (batched)
✅ Export (JSON, CSV, by category)
✅ Reporting (text summaries)
✅ Enron parsing (ready for training)
✅ Full orchestration (4 phases)
What's Left (Phases 15-16):
- E2E pipeline tests
- Integration test with Enron data
- Setup.py and wheel packaging
- Deployment documentation