The LLM now decides which new categories should be added to the persistent
cache for future mailbox runs and which should remain temporary (run-only).
ENHANCED LLM REVIEW:
- New field: "cache_worthy" (true/false) for each "new" category
- LLM judges: "Is this category useful across different mailboxes?"
- Examples:
  - "Customer Support" → cache_worthy: true (universal)
  - "Project X Updates" → cache_worthy: false (mailbox-specific)
CACHE EVOLUTION:
- cache_worthy=true → Added to persistent cache for future runs
- cache_worthy=false → Used for current run only, not cached
- First run (empty cache) → All categories treated as cache-worthy
- LLM reasoning logged for transparency
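A rough sketch of how the verdict might be applied at run time (the
`cache.add()` call and any name beyond cache_worthy/reasoning are
assumptions, not the real API):

```python
# Hypothetical routing of reviewed categories by the LLM's cache_worthy
# verdict; `cache.add()` stands in for whatever the real cache API is.
import logging

logger = logging.getLogger(__name__)

def apply_review(reviewed, cache, first_run=False):
    run_only = []
    for cat in reviewed:
        # First run (empty cache): treat every category as cache-worthy.
        worthy = first_run or cat.get("cache_worthy", False)
        logger.info("category=%s cache_worthy=%s reason=%s",
                    cat["name"], worthy, cat.get("reasoning", ""))
        if worthy:
            cache.add(cat["name"])        # persisted for future runs
        else:
            run_only.append(cat["name"])  # current run only, not cached
    return run_only
```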
INTELLIGENT GROWTH:
- Cache grows organically with high-quality, reusable categories
- Prevents pollution with mailbox-specific categories
- Maintains cross-mailbox consistency while allowing natural evolution
- LLM balances consistency (snap to existing) vs expansion (add worthy new categories)
SINGLE LLM CALL EFFICIENCY:
- Same ~4 second LLM call now handles:
  1. Snap vs new decision
  2. Cache persistence decision
  3. Reasoning for both
- No additional overhead for cache evolution
Result: Cache evolves intelligently over time, collecting universally
useful categories while filtering out temporary/specific ones.
CategoryCache now uses Ollama embeddings + cosine similarity for
true semantic category matching instead of weak keyword overlap.
Changes:
- src/calibration/category_cache.py: Use embedder.embeddings() API (sketch below)
  - Calculate embeddings for discovered and cached category descriptions
  - Compute cosine similarity between embedding vectors
  - Fall back to partial name matching if embeddings are unavailable
  - Error handling with graceful degradation
- src/calibration/workflow.py: Pass feature_extractor.embedder
  - Provide Ollama client to CalibrationAnalyzer
  - Enables semantic matching during cache snap
- src/calibration/llm_analyzer.py: Accept embedding_model parameter
  - Forward embedder to CategoryCache constructor
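A minimal sketch of the matching step, assuming embedder.embeddings()
takes a text and returns a plain vector (the real signature may differ):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def snap_to_cache(embedder, discovered_desc, cached, threshold=0.7):
    """Return (best_cached_name, score), or (None, score) if below threshold."""
    try:
        query = embedder.embeddings(discovered_desc)
        scored = [(name, cosine_similarity(query, embedder.embeddings(desc)))
                  for name, desc in cached.items()]
        name, score = max(scored, key=lambda s: s[1], default=(None, 0.0))
        return (name, score) if score >= threshold else (None, score)
    except Exception:
        # Graceful degradation: weak partial name matching as a fallback.
        for name in cached:
            if name.lower() in discovered_desc.lower():
                return name, 0.0
        return None, 0.0
```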
Test Results (embedding-based vs keyword):
- "Training Materials" → "Training": 0.72 (was 0.15)
- "Team Updates" → "Work Communication": 0.62 (was 0.24)
- "System Alerts" → "Technical": 0.63 (was 0.12)
- "Meeting Invitations" → "Meetings": 0.75+ (exact match)
Semantic matching now properly identifies similar categories based
on meaning rather than superficial word overlap.
Category Cache System (src/calibration/category_cache.py):
- Persistent storage of discovered categories across mailbox runs
- Semantic matching to snap new categories to existing ones
- Usage tracking for category popularity
- Configurable similarity threshold and new category limits
- JSON-based cache with metadata (created, last_seen, email counts)
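The on-disk layout might look roughly like this (field names beyond
created, last_seen, and the counts are illustrative):

```python
# Illustrative cache entry; the real JSON schema may differ.
cache_example = {
    "categories": {
        "Customer Support": {
            "description": "Requests for help with products or services",
            "created": "2025-01-12T09:30:00Z",
            "last_seen": "2025-02-03T14:05:00Z",
            "email_count": 412,  # emails matched across all runs
            "usage_count": 7,    # runs in which this category was used
        }
    }
}
```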
Discovery Improvements (src/calibration/llm_analyzer.py):
- Calculate batch statistics: sender domains, recipient counts,
  attachments, subject lengths, common keywords (example below)
- Add statistics to LLM discovery prompt for better decisions
- Integrate CategoryCache into CalibrationAnalyzer
- 3-step workflow: Discover → Consolidate → Snap to Cache
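The statistics step can be as simple as the following (Email attribute
names here are assumptions):

```python
from collections import Counter

def batch_stats(emails):
    """Summarize a sample batch for the discovery prompt."""
    if not emails:
        return {}
    domains = Counter(e.sender.split("@")[-1] for e in emails if "@" in e.sender)
    words = Counter(w.lower() for e in emails
                    for w in e.subject.split() if len(w) > 3)
    n = len(emails)
    return {
        "top_sender_domains": domains.most_common(10),
        "avg_recipients": sum(len(e.recipients) for e in emails) / n,
        "pct_with_attachments": 100 * sum(bool(e.attachments) for e in emails) / n,
        "avg_subject_length": sum(len(e.subject) for e in emails) / n,
        "common_keywords": [w for w, _ in words.most_common(15)],
    }
```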
Consolidation Improvements:
- Add cached categories as hints in consolidation prompt
- LLM prefers snapping to established categories
- Maintains cross-mailbox consistency while allowing new categories
Configuration Parameters:
- use_category_cache: Enable/disable caching (default: true)
- cache_similarity_threshold: Min similarity for snap (default: 0.7)
- cache_allow_new: Allow new categories (default: true)
- cache_max_new: Max new categories per run (default: 3)
- category_cache_path: Custom cache location
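Taken together, the defaults amount to something like this (dict form is
for illustration; the actual config format may differ):

```python
calibration_defaults = {
    "use_category_cache": True,         # enable/disable caching
    "cache_similarity_threshold": 0.7,  # min similarity to snap to cache
    "cache_allow_new": True,            # permit brand-new categories
    "cache_max_new": 3,                 # cap on new categories per run
    "category_cache_path": None,        # None -> default cache location
}
```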
Result: Consistent category sets across different mailboxes
with intelligent discovery of new categories when appropriate.
Both discovery and consolidation prompts now explain:
- What the system does (train ML classifier for auto-sorting)
- What makes good categories (broad, timeless, learnable)
- Why this matters (user needs, ML training requirements)
- How to think about the task (user-focused, functional)
Discovery prompt changes:
- Explains goal of identifying natural categories for ML training
- Lists guidelines for good categories (broad, user-focused, learnable)
- Provides concrete examples of functional categories
- Emphasizes PURPOSE over topic
Consolidation prompt changes:
- Explains full system context (LightGBM, auto-labeling, user search)
- Defines what makes categories effective for ML and users
- Provides user-centric thinking framework
- Emphasizes reusability and timelessness
Prompts now give the 8B model the context it needs to deliver strong,
specific category decisions instead of lazy, generic categorization.
Enhanced _consolidate_categories() with comprehensive validation:
- Edge case guards: Skip if ≤5 categories or no labels
- Parameter validation: Clamp ranges for all config values
- 5-stage validation after LLM response:
  1. Structure check (valid dicts)
  2. Reduction check (consolidation must reduce count)
  3. Target compliance (soft 50% overage limit)
  4. Complete mapping (all old categories mapped)
  5. Valid targets (all mappings point to existing categories)
- Auto-repair for common LLM failures:
  - Unmapped categories → map to first consolidated category
  - Invalid mapping targets → create missing categories
  - Failed updates → log with details
- Fallback consolidation using top-N by count (sketch below):
  - Triggered on JSON parse errors and validation failures
  - Heuristic-based, no LLM required
  - Guarantees output even if the LLM fails
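The heuristic fallback reduces to keeping the top-N categories by email
count and remapping everything else onto a survivor; a sketch (names are
illustrative, and the remainder is sent to the most common category):

```python
from collections import Counter

def fallback_consolidate(labels, target=10):
    """LLM-free consolidation: keep top-N categories, remap the rest."""
    counts = Counter(labels)
    keep = [cat for cat, _ in counts.most_common(target)]
    default = keep[0]  # most common category absorbs the remainder
    mapping = {cat: (cat if cat in keep else default) for cat in counts}
    return keep, mapping
```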
All error paths now have proper handling and logging.
Root cause: The pre-trained model was loading successfully, causing the CLI
to skip calibration entirely. The system went straight to classification
with the 35%-accuracy model.
Changes:
- config: Set calibration_model to qwen3:8b-q4_K_M (larger model for better instruction following)
- cli: Create separate calibration_llm provider with 8b model
- llm_analyzer: Improved prompt to force exact email ID copying
- workflow: Merge discovered categories with predefined ones
- workflow: Add detailed error logging for label mismatches
- ml_classifier: Fixed model path checking (was checking a None parameter)
- ml_classifier: Add dual API support (sklearn predict_proba vs LightGBM predict; sketch below)
- ollama: Fixed model list parsing (use m.model not m.get('name'))
- feature_extractor: Switch to Ollama embeddings (instant vs 90s load time)
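The dual-API branch boils down to something like this (helper name is
illustrative): sklearn-style estimators expose predict_proba(), while a
native lightgbm.Booster returns class probabilities directly from predict().

```python
def predict_probabilities(model, features):
    if hasattr(model, "predict_proba"):  # sklearn-style wrapper
        return model.predict_proba(features)
    return model.predict(features)       # native LightGBM Booster
```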
Result: Calibration now runs and generates 16 categories + 50 labels correctly.
Next: Investigate calibration sampling to reduce overfitting on small samples.
PHASE 9: Processing Pipeline & Queue Management (bulk_processor.py)
- BulkProcessor class for batch processing with checkpointing
- ProcessingCheckpoint: Save/resume state for resumable processing
- Handles batches with periodic checkpoints every N emails
- Tracks completed, queued_for_llm, and failed emails
- Progress callbacks for UI integration
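A minimal sketch of the checkpoint shape (field names follow the list
above; serialization details are assumptions):

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ProcessingCheckpoint:
    completed: list = field(default_factory=list)       # processed email IDs
    queued_for_llm: list = field(default_factory=list)  # low-confidence IDs
    failed: list = field(default_factory=list)          # errored email IDs

    def save(self, path):
        with open(path, "w") as f:
            json.dump(asdict(self), f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(**json.load(f))
```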
PHASE 10: Calibration System (sampler.py, llm_analyzer.py)
- EmailSampler: Stratified and random sampling (sketch below)
  - Stratifies by sender domain type for representativeness
- CalibrationAnalyzer: Uses the LLM to discover natural categories
  - Batched analysis to control LLM load
  - Maps discovered categories to a universal schema
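A sketch of the stratified path (the domain-type bucketing rule here is
an assumption, not the real classifier):

```python
import random
from collections import defaultdict

def stratified_sample(emails, n, seed=42):
    """Sample roughly evenly across sender-domain-type buckets."""
    if not emails or n <= 0:
        return []
    buckets = defaultdict(list)
    for e in emails:
        domain = e.sender.split("@")[-1].lower()
        kind = ("freemail" if domain in {"gmail.com", "yahoo.com", "hotmail.com"}
                else "internal" if domain.endswith("enron.com")
                else "external")
        buckets[kind].append(e)
    rng = random.Random(seed)
    per_bucket = max(1, n // len(buckets))
    picked = [e for group in buckets.values()
              for e in rng.sample(group, min(per_bucket, len(group)))]
    return picked[:n]
```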
PHASE 11: Export & Reporting (exporter.py)
- ResultsExporter: Export to JSON, CSV, and by-category output
- ReportGenerator: Generate human-readable text reports
- Category statistics and method breakdown
- Accuracy metrics and processing time tracking
PHASE 13: Enron Dataset Parser (enron_parser.py)
- Parses Enron maildir format into Email objects
- Handles multipart emails and attachments
- Date parsing with fallback for malformed dates (see sketch below)
- Ready to train mock model on real data
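The date fallback can be as small as this (the sentinel value is an
assumption):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_date(raw_header):
    try:
        return parsedate_to_datetime(raw_header)
    except (TypeError, ValueError):
        # Malformed or missing Date header: fall back to a fixed sentinel.
        return datetime(1970, 1, 1, tzinfo=timezone.utc)
```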
PHASE 14: Main Orchestration (orchestration.py)
- EmailSorterOrchestrator: Coordinates entire pipeline
- 4-phase workflow: Calibration → Bulk → LLM → Export
- Lazy initialization of components
- Progress tracking and timing
- Full pipeline runner with resume support
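Skeleton of the 4-phase run (method names are illustrative, bodies elided):

```python
class EmailSorterOrchestrator:
    def run(self, emails, resume=False):
        categories = self._calibrate(emails)                   # Phase 1
        done, queued = self._bulk(emails, categories, resume)  # Phase 2
        done += self._llm_review(queued)                       # Phase 3
        return self._export(done)                              # Phase 4

    # Components below are lazily initialized in the real implementation.
    def _calibrate(self, emails): ...
    def _bulk(self, emails, categories, resume): ...
    def _llm_review(self, queued): ...
    def _export(self, results): ...
```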
Components Now Available:
✅ Sampling (stratified and random)
✅ Calibration (LLM-driven category discovery)
✅ Bulk processing (with checkpointing)
✅ LLM review (batched)
✅ Export (JSON, CSV, by category)
✅ Reporting (text summaries)
✅ Enron parsing (ready for training)
✅ Full orchestration (4 phases)
What's Left (Phases 15-16):
- E2E pipeline tests
- Integration test with Enron data
- Setup.py and wheel packaging
- Deployment documentation