8 Commits

fa09d14e52 Add LLM-driven cache evolution - selective category persistence
The LLM now decides which new categories should be added to the persistent
cache for future mailbox runs and which remain temporary (run-only) categories.

ENHANCED LLM REVIEW:
- New field: "cache_worthy" (true/false) for each "new" category
- LLM judges: "Is this category useful across different mailboxes?"
- Examples:
  - "Customer Support" → cache_worthy: true (universal)
  - "Project X Updates" → cache_worthy: false (mailbox-specific)

CACHE EVOLUTION:
- cache_worthy=true → Added to persistent cache for future runs
- cache_worthy=false → Used for current run only, not cached
- First run (empty cache) → All categories treated as cache-worthy
- LLM reasoning logged for transparency
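
A minimal sketch of how these decisions might be applied once the LLM
response is parsed (the dict shape and cache API here are assumptions,
not the actual CategoryCache interface):

    def apply_cache_decisions(new_categories, cache, first_run=False):
        """Persist only the categories the LLM judged cache-worthy."""
        for cat in new_categories:
            # First run against an empty cache: everything is kept.
            if first_run or cat.get("cache_worthy", False):
                cache[cat["name"]] = cat["description"]
            # cache_worthy=False categories are still used for this run;
            # they just never reach the persistent cache.
        return new_categories

    # e.g. apply_cache_decisions(
    #     [{"name": "Customer Support", "description": "...",
    #       "cache_worthy": True}], cache={})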

INTELLIGENT GROWTH:
- Cache grows organically with high-quality, reusable categories
- Prevents pollution with mailbox-specific categories
- Maintains cross-mailbox consistency while allowing natural evolution
- LLM balances consistency (snap to existing) against expansion (add worthy new categories)

SINGLE LLM CALL EFFICIENCY:
- The same ~4-second LLM call now handles:
  1. Snap vs new decision
  2. Cache persistence decision
  3. Reasoning for both
- No additional overhead for cache evolution

Result: Cache evolves intelligently over time, collecting universally
useful categories while filtering out temporary/specific ones.
2025-10-23 15:36:51 +11:00
eab378409e Add intelligent multi-stage category matching with LLM review
Implements a sophisticated 5-stage matching strategy for the category cache:

MATCHING PIPELINE:
1. Exact name match (1.0) → instant snap
2. High embedding similarity (≥0.7) → confident snap
3. Ambiguous similarity (0.5-0.7) → LLM review
4. Low similarity (<0.5) → accept as new (if slots available)
5. Exceeded max_new → force review/snap
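
Condensed, the routing might look like this (function boundaries and
names are illustrative, not the actual code):

    def match_category(name, best_existing, similarity, slots_left,
                       snap_thr=0.7, review_thr=0.5, llm_review=None):
        """Route one discovered category through the five stages above."""
        if name == best_existing:
            return ("snap", best_existing)        # 1. exact name match
        if similarity >= snap_thr:
            return ("snap", best_existing)        # 2. confident snap
        if similarity >= review_thr:              # 3. ambiguous band
            if llm_review is not None:
                return llm_review(name, best_existing)
            # heuristic fallback: 0.6+ snaps, below becomes new if allowed
            if similarity >= 0.6 or slots_left <= 0:
                return ("snap", best_existing)
            return ("new", name)
        if slots_left > 0:
            return ("new", name)                  # 4. low similarity, new
        return ("snap", best_existing)            # 5. max_new exceeded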

LLM REVIEW FOR AMBIGUOUS CASES:
- Triggered when similarity scores are 0.5-0.7 (too low to snap, too high to ignore)
- LLM decides: snap to existing OR approve as new category
- Considers: semantic overlap, functional distinction, user value
- Conservative bias toward snapping (consistency > fragmentation)
- Respects max_new limit and remaining slots

HEURISTIC FALLBACK:
- If no LLM is available: similarity ≥0.6 snaps, below 0.6 becomes new (if allowed)
- Ensures system always produces valid category mapping

Configuration:
- similarity_threshold: 0.7 (confident match)
- llm_review_threshold: 0.5 (triggers LLM review)
- max_new: 3 (limits new categories per run)

This solves the key problem: embedding similarity alone can't decide
edge cases (0.5-0.7 scores). LLM provides intelligent judgment for
ambiguous matches, accepting valuable new categories while maintaining
cross-mailbox consistency.
2025-10-23 15:19:50 +11:00
288b341f4e Replace keyword heuristics with embedding-based semantic matching
CategoryCache now uses Ollama embeddings + cosine similarity for
true semantic category matching instead of weak keyword overlap.

Changes:
- src/calibration/category_cache.py: Use embedder.embeddings() API
  - Calculate embeddings for discovered and cached category descriptions
  - Compute cosine similarity between embedding vectors (sketched below)
  - Fall back to partial name matching if embeddings unavailable
  - Error handling with graceful degradation

- src/calibration/workflow.py: Pass feature_extractor.embedder
  - Provide Ollama client to CalibrationAnalyzer
  - Enables semantic matching during cache snap

- src/calibration/llm_analyzer.py: Accept embedding_model parameter
  - Forward embedder to CategoryCache constructor
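
The core of the category_cache.py change, roughly; embedder.embeddings()
is the Ollama client call named above, and its exact return shape is an
assumption here:

    import math

    def cosine_similarity(a, b):
        """Cosine similarity between two embedding vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def category_similarity(embedder, desc_a, desc_b):
        # Assumes embedder.embeddings(text) returns a plain float vector.
        return cosine_similarity(embedder.embeddings(desc_a),
                                 embedder.embeddings(desc_b))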

Test Results (embedding-based vs keyword):
- "Training Materials" → "Training": 0.72 (was 0.15)
- "Team Updates" → "Work Communication": 0.62 (was 0.24)
- "System Alerts" → "Technical": 0.63 (was 0.12)
- "Meeting Invitations" → "Meetings": 0.75+ (exact match)

Semantic matching now properly identifies similar categories based
on meaning rather than superficial word overlap.
2025-10-23 15:12:08 +11:00
874caf38bc Add category caching system and analytical data to prompts
Category Cache System (src/calibration/category_cache.py):
- Persistent storage of discovered categories across mailbox runs
- Semantic matching to snap new categories to existing ones
- Usage tracking for category popularity
- Configurable similarity threshold and new category limits
- JSON-based cache with metadata (created, last_seen, email counts)
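
An illustrative cache entry; created, last_seen and email counts come
straight from the list above, the remaining details are assumed:

    # Illustrative on-disk shape of the JSON cache (not the exact schema).
    cache = {
        "Meetings": {
            "description": "Invites, agendas and scheduling threads",
            "created": "2025-10-23T14:25:41+11:00",
            "last_seen": "2025-10-23T14:25:41+11:00",
            "email_count": 42,   # usage tracking for popularity
        },
    }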

Discovery Improvements (src/calibration/llm_analyzer.py):
- Calculate batch statistics: sender domains, recipient counts,
  attachments, subject lengths, common keywords
- Add statistics to LLM discovery prompt for better decisions
- Integrate CategoryCache into CalibrationAnalyzer
- 3-step workflow: Discover → Consolidate → Snap to Cache
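
A sketch of the batch-statistics step (common-keyword extraction omitted
for brevity; the Email attributes used here are assumptions about the
project's data model):

    from collections import Counter

    def batch_statistics(emails):
        """Summarize a calibration batch for the discovery prompt."""
        if not emails:
            return {}
        n = len(emails)
        domains = Counter(e.sender.split("@")[-1] for e in emails)
        return {
            "top_sender_domains": domains.most_common(5),
            "avg_recipients": sum(len(e.recipients) for e in emails) / n,
            "with_attachments": sum(1 for e in emails if e.attachments),
            "avg_subject_length": sum(len(e.subject) for e in emails) / n,
        }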

Consolidation Improvements:
- Add cached categories as hints in consolidation prompt
- LLM prefers snapping to established categories
- Maintains cross-mailbox consistency while allowing new categories

Configuration Parameters:
- use_category_cache: Enable/disable caching (default: true)
- cache_similarity_threshold: Min similarity for snap (default: 0.7)
- cache_allow_new: Allow new categories (default: true)
- cache_max_new: Max new categories per run (default: 3)
- category_cache_path: Custom cache location

Result: Consistent category sets across different mailboxes
with intelligent discovery of new categories when appropriate.
2025-10-23 14:25:41 +11:00
183b12c9b4 Improve LLM prompts with proper context and purpose
Both discovery and consolidation prompts now explain:
- What the system does (train ML classifier for auto-sorting)
- What makes good categories (broad, timeless, learnable)
- Why this matters (user needs, ML training requirements)
- How to think about the task (user-focused, functional)

Discovery prompt changes:
- Explains goal of identifying natural categories for ML training
- Lists guidelines for good categories (broad, user-focused, learnable)
- Provides concrete examples of functional categories
- Emphasizes PURPOSE over topic

Consolidation prompt changes:
- Explains full system context (LightGBM, auto-labeling, user search)
- Defines what makes categories effective for ML and users
- Provides user-centric thinking framework
- Emphasizes reusability and timelessness
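
For flavor, the added context reads something like this excerpt (the
wording is illustrative, not the actual prompt text):

    # Illustrative excerpt only -- not the real prompt string.
    CONSOLIDATION_CONTEXT = """\
    These categories will train a LightGBM classifier that auto-labels
    the user's mailbox. Good categories are broad, timeless and
    learnable: prefer the email's PURPOSE (e.g. "Meetings", "Billing")
    over its topic, and reuse established categories wherever they fit.
    """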

The prompts now give the 8B model the context it needs to deliver
excellent category decisions instead of lazy, generic categorization.
2025-10-23 14:15:17 +11:00
88ef570fed Add robust edge case handling to category consolidation
Enhanced _consolidate_categories() with comprehensive validation:

- Edge case guards: Skip if ≤5 categories or no labels
- Parameter validation: Clamp ranges for all config values
- 5-stage validation after LLM response:
  1. Structure check (valid dicts)
  2. Reduction check (consolidation must reduce count)
  3. Target compliance (soft 50% overage limit)
  4. Complete mapping (all old categories mapped)
  5. Valid targets (all mappings point to existing categories)

- Auto-repair for common LLM failures:
  - Unmapped categories → map to first consolidated category
  - Invalid mapping targets → create missing categories
  - Failed updates → log with details

- Fallback consolidation using top-N by count
  - Triggered on JSON parse errors, validation failures
  - Heuristic-based, no LLM required
  - Guarantees output even if LLM fails

All error paths now have proper handling and logging.
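
In outline, the validate-or-fall-back flow (names, shapes and the exact
check order are illustrative; the soft target-compliance check is
omitted):

    def consolidate_with_validation(llm_result, old_counts, top_n=10):
        """Accept the LLM mapping only if the checks above pass."""
        try:
            cats, mapping = llm_result["categories"], llm_result["mapping"]
            assert isinstance(cats, dict) and isinstance(mapping, dict)  # 1
            assert len(cats) < len(old_counts)        # 2. must actually reduce
            first = next(iter(cats))
            for old in old_counts:
                mapping.setdefault(old, first)        # repair: unmapped -> first
            for target in mapping.values():
                cats.setdefault(target, {})           # repair: create missing
            return cats, mapping
        except (KeyError, AssertionError, StopIteration):
            # Fallback: top-N categories by count, no LLM required.
            top = dict(sorted(old_counts.items(), key=lambda kv: kv[1],
                              reverse=True)[:top_n])
            first = next(iter(top))
            return top, {c: (c if c in top else first) for c in old_counts}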
2025-10-23 14:12:20 +11:00
50ddaa4b39 Fix calibration workflow - LLM now generates categories/labels correctly
Root cause: the pre-trained model was loading successfully, causing the CLI
to skip calibration entirely. The system went straight to classification
with the 35%-accuracy model.

Changes:
- config: Set calibration_model to qwen3:8b-q4_K_M (larger model for better instruction following)
- cli: Create separate calibration_llm provider with 8b model
- llm_analyzer: Improved prompt to force exact email ID copying
- workflow: Merge discovered categories with predefined ones
- workflow: Add detailed error logging for label mismatches
- ml_classifier: Fixed model path checking (was checking None parameter)
- ml_classifier: Add dual API support (sklearn predict_proba vs LightGBM predict)
- ollama: Fixed model list parsing (use m.model not m.get('name'))
- feature_extractor: Switch to Ollama embeddings (instant vs 90s load time)
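
The dual-API branch amounts to roughly this (the hasattr probe is an
assumption about how the code tells the two backends apart):

    import numpy as np

    def predict_probabilities(model, features: np.ndarray) -> np.ndarray:
        """Class probabilities from either backend."""
        if hasattr(model, "predict_proba"):
            return model.predict_proba(features)  # sklearn estimator
        # A raw lightgbm.Booster's predict() already returns probabilities.
        return model.predict(features)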

Result: Calibration now runs and generates 16 categories + 50 labels correctly.
Next: Investigate calibration sampling to reduce overfitting on small samples.
2025-10-23 13:51:09 +11:00
02be616c5c Phase 9-14: Complete processing pipeline, calibration, export, and orchestration
PHASE 9: Processing Pipeline & Queue Management (bulk_processor.py)
- BulkProcessor class for batch processing with checkpointing
- ProcessingCheckpoint: Save/resume state for resumable processing
- Handles batches with periodic checkpoints every N emails
- Tracks completed, queued_for_llm, and failed emails
- Progress callbacks for UI integration
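
A thumbnail of the checkpoint loop; the classify callback, Email.id
attribute, and uncertain flag are assumptions, not the real interfaces:

    import json
    from dataclasses import asdict, dataclass, field

    @dataclass
    class ProcessingCheckpoint:
        completed: list = field(default_factory=list)
        queued_for_llm: list = field(default_factory=list)
        failed: list = field(default_factory=list)

        def save(self, path):
            with open(path, "w") as f:
                json.dump(asdict(self), f)

    def process_batch(emails, classify, ckpt, path, every=100):
        """Classify emails, checkpointing every N so runs can resume."""
        done = set(ckpt.completed)                 # resume: skip finished ids
        for i, email in enumerate(e for e in emails if e.id not in done):
            try:
                result = classify(email)
                bucket = (ckpt.queued_for_llm if result.uncertain
                          else ckpt.completed)
                bucket.append(email.id)
            except Exception:
                ckpt.failed.append(email.id)
            if (i + 1) % every == 0:
                ckpt.save(path)                    # periodic checkpoint
        ckpt.save(path)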

PHASE 10: Calibration System (sampler.py, llm_analyzer.py)
- EmailSampler: Stratified and random sampling
- Stratifies by sender domain type for representativeness
- CalibrationAnalyzer: Uses an LLM to discover natural categories
- Batched analysis to control LLM load
- Maps discovered categories to universal schema
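
A sketch of the stratified path, bucketing on the raw sender domain (the
real sampler groups by domain *type*):

    import random
    from collections import defaultdict

    def stratified_sample(emails, n):
        """Sample proportionally from each sender-domain bucket."""
        if not emails:
            return []
        buckets = defaultdict(list)
        for e in emails:
            buckets[e.sender.split("@")[-1]].append(e)
        sample = []
        for group in buckets.values():
            k = max(1, round(n * len(group) / len(emails)))
            sample.extend(random.sample(group, min(k, len(group))))
        random.shuffle(sample)
        return sample[:n]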

PHASE 11: Export & Reporting (exporter.py)
- ResultsExporter: Export to JSON, CSV, or a by-category layout
- ReportGenerator: Generate human-readable text reports
- Category statistics and method breakdown
- Accuracy metrics and processing time tracking

PHASE 13: Enron Dataset Parser (enron_parser.py)
- Parses Enron maildir format into Email objects
- Handles multipart emails and attachments
- Date parsing with fallback for malformed dates
- Ready to train the mock model on real data

PHASE 14: Main Orchestration (orchestration.py)
- EmailSorterOrchestrator: Coordinates entire pipeline
- 4-phase workflow: Calibration → Bulk → LLM → Export
- Lazy initialization of components
- Progress tracking and timing
- Full pipeline runner with resume support

Components Now Available:
- Sampling (stratified and random)
- Calibration (LLM-driven category discovery)
- Bulk processing (with checkpointing)
- LLM review (batched)
- Export (JSON, CSV, by category)
- Reporting (text summaries)
- Enron parsing (ready for training)
- Full orchestration (4 phases)

What's Left (Phases 15-16):
- E2E pipeline tests
- Integration test with Enron data
- Setup.py and wheel packaging
- Deployment documentation

Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-21 11:52:09 +11:00