# Email Sorter - Development Guide

## What This Tool Does

**Email Sorter is a TRIAGE tool** that sorts emails into buckets for downstream processing. It is NOT a complete email management solution - it's one part of a larger ecosystem.

```
Raw Inbox (10k+) --> Email Sorter --> Categorized Buckets --> Specialized Tools
                     (this tool)      (output)                (other tools)
```

---

## Quick Start

```bash
cd /MASTERFOLDER/Tools/email-sorter
source venv/bin/activate

# Classify emails with ML + LLM fallback
python -m src.cli run --source local \
    --directory "/path/to/emails" \
    --output "/path/to/output" \
    --force-ml --llm-provider openai

# Generate HTML report from results
python tools/generate_html_report.py --input /path/to/results.json
```

---

## Key Documentation

| Document | Purpose | Location |
|----------|---------|----------|
| **PROJECT_ROADMAP_2025.md** | Master learnings, research findings, development roadmap | `docs/` |
| **CLASSIFICATION_METHODS_COMPARISON.md** | ML vs LLM vs Agent comparison | `docs/` |
| **REPORT_FORMAT.md** | HTML report documentation | `docs/` |
| **BATCH_LLM_QUICKSTART.md** | Quick LLM batch processing guide | root |

---

## Research Findings Summary

### Dataset Size Routing

| Size | Best Method | Why |
|------|-------------|-----|
| <500 | Agent-only | ML overhead exceeds benefit |
| 500-5000 | Agent pre-scan + ML | Discovery improves accuracy |
| >5000 | ML pipeline | Speed critical |

### Research Results

| Dataset | Type | ML-Only | ML+LLM | Agent |
|---------|------|---------|--------|-------|
| brett-gmail (801) | Personal | 54.9% | 93.3% | 99.8% |
| brett-microsoft (596) | Business | - | - | 98.2% |

### Key Insight: Inbox Character Matters

| Type | Pattern | Approach |
|------|---------|----------|
| **Personal** | Subscriptions, marketing (40-50% automated) | Sender domain first |
| **Business** | Client work, operations (60-70% professional) | Sender + Subject context |

---

## Project Structure

```
email-sorter/
├── CLAUDE.md                        # THIS FILE
├── README.md                        # General readme
├── BATCH_LLM_QUICKSTART.md          # LLM batch processing
│
├── src/                             # Source code
│   ├── cli.py                       # Main entry point
│   ├── classification/              # ML/LLM classification
│   ├── calibration/                 # Model training, email parsing
│   ├── email_providers/             # Gmail, Outlook, IMAP, Local
│   └── llm/                         # LLM providers
│
├── tools/                           # Utility scripts
│   ├── brett_gmail_analyzer.py      # Personal inbox template
│   ├── brett_microsoft_analyzer.py  # Business inbox template
│   ├── generate_html_report.py      # HTML report generator
│   └── batch_llm_classifier.py      # Batch LLM classification
│
├── config/                          # Configuration
│   ├── default_config.yaml          # LLM endpoints, thresholds
│   └── categories.yaml              # Category definitions
│
├── docs/                            # Current documentation
│   ├── PROJECT_ROADMAP_2025.md
│   ├── CLASSIFICATION_METHODS_COMPARISON.md
│   ├── REPORT_FORMAT.md
│   └── archive/                     # Old docs (historical)
│
├── data/                            # Analysis outputs (gitignored)
│   ├── brett_gmail_analysis.json
│   └── brett_microsoft_analysis.json
│
├── credentials/                     # OAuth/API creds (gitignored)
├── results/                         # Classification outputs (gitignored)
├── archive/                         # Old scripts (gitignored)
├── maildir/                         # Enron test data
└── venv/                            # Python environment
```

---

## Common Operations

### 1. Classify Emails (ML Pipeline)

```bash
source venv/bin/activate

# With LLM fallback for low confidence
python -m src.cli run --source local \
    --directory "/path/to/emails" \
    --output "/path/to/output" \
    --force-ml --llm-provider openai

# Pure ML (fastest, no LLM)
python -m src.cli run --source local \
    --directory "/path/to/emails" \
    --output "/path/to/output" \
    --force-ml --no-llm-fallback
```

### 2. Generate HTML Report

```bash
python tools/generate_html_report.py --input /path/to/results.json
# Creates report.html in same directory
```

### 3. Manual Agent Analysis (Best Accuracy)

For <1000 emails, agent analysis gives 98-99% accuracy:

```bash
# Copy and customize analyzer template
cp tools/brett_gmail_analyzer.py tools/my_inbox_analyzer.py

# Edit classify_email() function for your inbox patterns
# Update email_dir path

# Run
python tools/my_inbox_analyzer.py
```

### 4. Different Email Sources

```bash
# Local .eml/.msg files
--source local --directory "/path/to/emails"

# Gmail (OAuth)
--source gmail --credentials credentials/gmail/account1.json

# Outlook (OAuth)
--source outlook --credentials credentials/outlook/account1.json

# Enron test data
--source enron --limit 10000
```

---

## Output Locations

**Analysis reports are stored OUTSIDE this project:**

```
/home/bob/Documents/Email Manager/emails/
├── brett-gmail/              # Source emails (untouched)
├── brett-gm-md/              # ML-only classification output
│   ├── results.json
│   ├── report.html
│   └── BRETT_GMAIL_ANALYSIS_REPORT.md
├── brett-gm-llm/             # ML+LLM classification output
│   ├── results.json
│   └── report.html
└── brett-ms-sorter/          # Microsoft inbox analysis
    └── BRETT_MICROSOFT_ANALYSIS_REPORT.md
```

**Project data outputs (gitignored):**

```
/MASTERFOLDER/Tools/email-sorter/data/
├── brett_gmail_analysis.json
└── brett_microsoft_analysis.json
```

---

## Configuration

### LLM Endpoint (config/default_config.yaml)

```yaml
llm:
  provider: "openai"
  openai:
    base_url: "http://localhost:11433/v1"  # vLLM endpoint
    api_key: "not-needed"
    classification_model: "qwen3-coder-30b"
```

### Thresholds (config/categories.yaml)

Default: 0.55 (reduced from 0.75 for 40% less LLM fallback)

---

## Key Code Locations

| Function | File |
|----------|------|
| CLI entry | `src/cli.py` |
| ML classifier | `src/classification/ml_classifier.py` |
| LLM classifier | `src/classification/llm_classifier.py` |
| Feature extraction | `src/classification/feature_extractor.py` |
| Email parsing | `src/calibration/local_file_parser.py` |
| OpenAI-compat LLM | `src/llm/openai_compat.py` |

---

## Recent Changes (Nov 2025)

1. **cli.py**: Added `--force-ml` flag, enriched results.json with metadata
2. **openai_compat.py**: Removed API key requirement for local vLLM
3. **default_config.yaml**: Changed to openai provider on localhost:11433
4. **tools/**: Added brett_gmail_analyzer.py, brett_microsoft_analyzer.py, generate_html_report.py
5. **docs/**: Added PROJECT_ROADMAP_2025.md, CLASSIFICATION_METHODS_COMPARISON.md

---

## Troubleshooting

### "LLM endpoint not responding"

- Check vLLM running on localhost:11433
- Verify model name in config matches running model

### "Low accuracy (50-60%)"

- For <1000 emails, use agent analysis
- Dataset may differ from Enron training data

### "Too many LLM calls"

- Use `--no-llm-fallback` for pure ML
- Increase threshold in categories.yaml

---

## Development Notes

### Virtual Environment Required

```bash
source venv/bin/activate  # ALWAYS activate before Python commands
```

### Batched Feature Extraction (CRITICAL)

```python
# CORRECT - Batched (150x faster)
all_features = feature_extractor.extract_batch(emails, batch_size=512)

# WRONG - Sequential (extremely slow)
for email in emails:
    result = classifier.classify(email)  # Don't do this
```

### Model Paths

- `src/models/calibrated/` - Created during calibration
- `src/models/pretrained/` - Loaded by default

---

## What's Gitignored

- `credentials/` - OAuth tokens
- `results/`, `data/` - User data
- `archive/`, `docs/archive/` - Historical content
- `maildir/` - Enron test data (large)
- `enron_mail_20150507.tar.gz` - Source archive
- `venv/` - Python environment
- `*.log`, `logs/` - Log files

---

## Philosophy

1. **Triage, not management** - Sort into buckets for other tools
2. **Risk-based accuracy** - High for personal, acceptable errors for junk
3. **Speed matters** - 10k emails in <1 min
4. **Inbox character matters** - Business vs personal = different approaches
5. **Agent pre-scan adds value** - 10-15 min discovery improves everything

---

*Last Updated: 2025-11-28*
*See docs/PROJECT_ROADMAP_2025.md for full research findings*
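
As a rough illustration of the Dataset Size Routing table in the Research Findings Summary, the size thresholds could be encoded as a small helper. This is only a sketch: `choose_method` and the returned labels are hypothetical names invented for this example and are not part of the email-sorter codebase. The boundaries follow the table as written (fewer than 500, 500-5000 inclusive, more than 5000).

```python
def choose_method(email_count: int) -> str:
    """Pick a classification approach from the dataset size.

    Hypothetical helper mirroring the Dataset Size Routing table;
    the labels returned here are illustrative, not CLI values.
    """
    if email_count < 0:
        raise ValueError("email_count must be non-negative")
    if email_count < 500:
        # Small inbox: ML overhead exceeds benefit
        return "agent-only"
    if email_count <= 5000:
        # Mid-size inbox: discovery pass first, then ML
        return "agent-prescan+ml"
    # Large inbox: speed is critical
    return "ml-pipeline"
```

For example, `choose_method(801)` falls in the middle band, which matches the size of the brett-gmail dataset listed under Research Results.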