diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..f67d924
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,377 @@
# Email Sorter - Claude Development Guide

This document provides essential context for Claude (or other AI assistants) working on this project.

## Project Overview

**Email Sorter** is a hybrid ML/LLM email classification system designed to process large email backlogs (10k-100k+ emails) with high speed and accuracy.

### Current MVP Status

**✅ PROVEN WORKING** - 10,000 emails classified in ~24 seconds with 72.7% accuracy

**Core Features:**
- LLM-driven category discovery (no hardcoded categories)
- ML model training on discovered categories (LightGBM)
- Fast pure-ML classification with `--no-llm-fallback`
- Category verification for new mailboxes with `--verify-categories`
- Batched embedding extraction (512 emails/batch)
- Multiple email provider support (Gmail, Outlook, IMAP, Enron)

## Architecture

### Three-Tier Classification Pipeline

```
Email → Rules Check → ML Classifier → LLM Fallback (optional)
            ↓               ↓                ↓
         Definite    High Confidence   Low Confidence
         (5-10%)        (70-80%)         (10-20%)
```

### Key Technologies

- **ML Model**: LightGBM (1.8MB, 11 categories, 28 threads)
- **Embeddings**: all-minilm:l6-v2 via Ollama (384-dim, universal)
- **LLM**: qwen3:4b-instruct-2507-q8_0 via Ollama (calibration only)
- **Feature Extraction**: Embeddings + TF-IDF + pattern detection
- **Thresholds**: 0.55 (optimized from 0.75 to reduce LLM fallback)

### Performance Metrics

| Emails | Time | Accuracy | LLM Calls | Throughput |
|--------|------|----------|-----------|------------|
| 10,000 | 24s  | 72.7%    | 0         | 423/sec    |
| 10,000 | 5min | 92.7%    | 2,100     | 33/sec     |

## Project Structure

```
email-sorter/
├── src/
│   ├── cli.py                       # Main CLI interface
│   ├── classification/              # Classification pipeline
│   │   ├── adaptive_classifier.py   # Rules → ML → LLM orchestration
│   │   ├── ml_classifier.py         # LightGBM classifier
│   │   ├── llm_classifier.py        # LLM fallback
│   │   └── feature_extractor.py     # Batched embedding extraction
│   ├── calibration/                 # LLM-driven calibration
│   │   ├── workflow.py              # Calibration orchestration
│   │   ├── llm_analyzer.py          # Batch category discovery (20 emails/call)
│   │   ├── trainer.py               # ML model training
│   │   └── category_verifier.py     # Category verification
│   ├── email_providers/             # Email source connectors
│   │   ├── gmail.py                 # Gmail API (OAuth 2.0)
│   │   ├── outlook.py               # Microsoft Graph API (OAuth 2.0)
│   │   ├── imap.py                  # IMAP protocol
│   │   └── enron.py                 # Enron dataset (testing)
│   ├── llm/                         # LLM provider interfaces
│   │   ├── ollama.py                # Ollama provider
│   │   └── openai_compat.py         # OpenAI-compatible provider
│   └── models/                      # Trained models
│       ├── calibrated/              # User-calibrated models
│       │   └── classifier.pkl       # Current trained model (1.8MB)
│       └── pretrained/              # Default models
├── config/
│   ├── default_config.yaml          # System defaults
│   ├── categories.yaml              # Category definitions (thresholds: 0.55)
│   └── llm_models.yaml              # LLM configuration
├── credentials/                     # Email provider credentials (gitignored)
│   ├── gmail/                       # Gmail OAuth (3 accounts)
│   ├── outlook/                     # Outlook OAuth (3 accounts)
│   └── imap/                        # IMAP credentials (3 accounts)
├── docs/                            # Documentation
├── scripts/                         # Utility scripts
└── logs/                            # Log files (gitignored)
```

## Critical Implementation Details

### 1. Batched Embedding Extraction (CRITICAL!)

**ALWAYS use batched feature extraction:**

```python
# ✅ CORRECT - Batched (~7.5x faster for embeddings)
all_features = feature_extractor.extract_batch(emails, batch_size=512)
for email, features in zip(emails, all_features):
    result = adaptive_classifier.classify_with_features(email, features)

# ❌ WRONG - Sequential (extremely slow)
for email in emails:
    result = adaptive_classifier.classify(email)  # Extracts features one-at-a-time
```

**Why this matters:**
- Sequential: 10,000 emails × 15ms = 150 seconds just for embeddings
- Batched: 20 batches × 1s = 20 seconds for embeddings
- **~7.5x performance difference**

### 2. Model Paths

**The model exists in TWO locations:**
- `src/models/calibrated/classifier.pkl` - Created during calibration (authoritative)
- `src/models/pretrained/classifier.pkl` - Loaded by default (copy of calibrated)

**When calibration runs:**
1. Calibration saves the model to `calibrated/classifier.pkl`
2. MLClassifier loads from `pretrained/classifier.pkl` by default
3. The calibrated model must therefore be copied to `pretrained/`, or the load path updated to point at `calibrated/`

**Current status:** Both paths have the same 1.8MB model (Oct 25 02:54)

### 3. LLM-Driven Calibration

**NOT hardcoded categories** - categories are discovered by the LLM:

```python
# Calibration process:
# 1. Sample 300 emails (3% of 10k)
# 2. Batch process in groups of 20 emails
# 3. LLM discovers categories (not predefined)
# 4. LLM labels each email
# 5. Train LightGBM on discovered categories
```

**Result:** 11 categories discovered from the Enron dataset:
- Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests

### 4. Threshold Optimization

**Default threshold: 0.55** (reduced from 0.75)

**Impact:**
- 0.75 threshold: 35% LLM fallback
- 0.55 threshold: 21% LLM fallback
- **40% reduction in LLM usage**

All category thresholds in `config/categories.yaml` are set to 0.55.

### 5. Email Provider Credentials

**Multi-account support:** 3 accounts per provider type

**Credential files:**
```
credentials/
├── gmail/
│   ├── account1.json    # Gmail OAuth credentials
│   ├── account2.json
│   └── account3.json
├── outlook/
│   ├── account1.json    # Outlook OAuth credentials
│   ├── account2.json
│   └── account3.json
└── imap/
    ├── account1.json    # IMAP username/password
    ├── account2.json
    └── account3.json
```

**Security:** All `*.json` files in `credentials/` are gitignored (only `.example` files are tracked).

## Common Commands

### Development

```bash
# Activate virtual environment
source venv/bin/activate

# Run classification (Enron dataset)
python -m src.cli run --source enron --limit 10000 --output results/

# Pure ML (no LLM fallback) - FAST
python -m src.cli run --source enron --limit 10000 --output results/ --no-llm-fallback

# With category verification
python -m src.cli run --source enron --limit 10000 --output results/ --verify-categories

# Gmail
python -m src.cli run --source gmail --credentials credentials/gmail/account1.json --limit 1000

# Outlook
python -m src.cli run --source outlook --credentials credentials/outlook/account1.json --limit 1000
```

### Training

```bash
# Force recalibration (clears cached model)
rm -rf src/models/calibrated/ src/models/pretrained/
python -m src.cli run --source enron --limit 10000 --output results/
```

## Code Patterns

### Adding New Features

1. **Update CLI** ([src/cli.py](src/cli.py)):
   - Add click options
   - Pass them to the appropriate modules

2. **Update Classifier** ([src/classification/adaptive_classifier.py](src/classification/adaptive_classifier.py)):
   - Add methods following the existing pattern
   - Use `classify_with_features()` for batched processing

3. **Update Feature Extractor** ([src/classification/feature_extractor.py](src/classification/feature_extractor.py)):
   - Always support batching (`extract_batch()`)
   - Keep `extract()` for backward compatibility

### Testing

```bash
# Test imports
python -c "from src.cli import cli; print('OK')"

# Test providers
python -c "from src.email_providers.gmail import GmailProvider; from src.email_providers.outlook import OutlookProvider; print('OK')"

# Test classification
python -m src.cli run --source enron --limit 100 --output test/
```

## Performance Optimization

### Current Bottlenecks

1. **Embedding generation** - 20s for 10k emails (batched)
   - Optimized with batch_size=512
   - Could use local sentence-transformers for a 5-10x speedup

2. **Email parsing** - 0.5s for 10k emails (fast)

3. **ML inference** - 0.7s for 10k emails (very fast)

### Optimization Opportunities

1. **Local embeddings** - Replace the Ollama API with sentence-transformers
   - Current: 20 API calls, ~20 seconds
   - With local: Direct GPU, ~2-5 seconds
   - Trade-off: More dependencies, larger memory footprint

2. **Embedding cache** - Pre-compute and cache to disk
   - One-time cost: 20 seconds
   - Subsequent runs: 2-3 seconds to load from disk
   - Perfect for development/testing

3. **Larger batches** - Tested 512, 1024, 2048
   - 512: 23.6s (chosen for balance)
   - 1024: 22.1s (6.4% faster)
   - 2048: 21.9s (7.2% faster, diminishing returns)

## Known Issues

### 1. Background Processes

There are stale background bash processes from previous sessions:
- These can be safely ignored
- Do NOT try to kill them (per user's CLAUDE.md instructions)

### 2. Model Path Confusion

- Calibration saves to `src/models/calibrated/`
- Default loads from `src/models/pretrained/`
- Both currently have the same model (synced)

### 3. Category Cache

- `src/models/category_cache.json` stores discovered categories
- It can become polluted if different datasets are used
- Clear it with `rm src/models/category_cache.json` if issues arise

## Dependencies

### Required

```bash
pip install click pyyaml lightgbm numpy scikit-learn ollama
```

### Email Providers

```bash
# Gmail
pip install google-api-python-client google-auth-oauthlib google-auth-httplib2

# Outlook
pip install msal requests

# IMAP - no additional dependencies (Python stdlib)
```

### Optional

```bash
# For faster local embeddings
pip install sentence-transformers

# For development
pip install pytest black mypy
```

## Git Workflow

### What's Gitignored

- `credentials/` (except `.example` files)
- `logs/`
- `results/`
- `src/models/calibrated/` (trained models)
- `*.log`
- `debug_*.txt`
- Test directories

### What's Tracked

- All source code
- Configuration files
- Documentation
- Example credential files
- Pretrained model (if present)

## Important Notes for AI Assistants

1. **NEVER create files unless necessary** - Always prefer editing existing files

2. **ALWAYS use batching** - Feature extraction MUST be batched (512 emails/batch)

3. **Read before writing** - Use the Read tool before any Edit operations

4. **Verify paths** - Model paths can be confusing (calibrated vs pretrained)

5. **No emoji in commits** - Per user's CLAUDE.md preferences

6. **Test before committing** - Verify imports and the CLI work

7. **Security** - Never commit actual credentials, only `.example` files

8. **Performance matters** - 10x performance differences are common, always batch

9. **LLM is optional** - The system works without an LLM (pure ML mode with `--no-llm-fallback`)

10. **Categories are dynamic** - They're discovered by the LLM, not hardcoded

## Recent Changes (Last Session)

1. **Fixed embedding bottleneck** - Changed from sequential to batched feature extraction (~7.5x speedup)
2. **Added Outlook provider** - Full Microsoft Graph API integration
3. **Added credentials system** - Support for 3 accounts per provider type
4. **Optimized thresholds** - Reduced from 0.75 to 0.55 (40% less LLM usage)
5. **Added category verifier** - Optional single LLM call to verify model fit
6. **Project reorganization** - Clean docs/, scripts/, logs/ structure

## Next Steps (Roadmap)

See [docs/PROJECT_STATUS_AND_NEXT_STEPS.html](docs/PROJECT_STATUS_AND_NEXT_STEPS.html) for the complete roadmap.

**Immediate priorities:**
1. Test the Gmail provider with real credentials
2. Test the Outlook provider with real credentials
3. Implement email syncing (apply labels back to the mailbox)
4. Add incremental classification (process only new emails)
5. Create a web dashboard for results visualization

---

**Remember:** This is an MVP with proven performance. Don't over-engineer. Keep it fast and simple.