# Email Sorter - Claude Development Guide
This document provides essential context for Claude (or other AI assistants) working on this project.
## Project Overview
Email Sorter is a hybrid ML/LLM email classification system designed to process large email backlogs (10k-100k+ emails) with high speed and accuracy.
### Current MVP Status
✅ PROVEN WORKING - 10,000 emails classified in ~24 seconds with 72.7% accuracy
**Core Features:**
- LLM-driven category discovery (no hardcoded categories)
- ML model training on discovered categories (LightGBM)
- Fast pure-ML classification with `--no-llm-fallback`
- Category verification for new mailboxes with `--verify-categories`
- Batched embedding extraction (512 emails/batch)
- Multiple email provider support (Gmail, Outlook, IMAP, Enron)
## Architecture

### Three-Tier Classification Pipeline

```
Email → Rules Check → ML Classifier → LLM Fallback (optional)
            ↓              ↓                ↓
        Definite    High Confidence   Low Confidence
        (5-10%)       (70-80%)          (10-20%)
```
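The three-tier dispatch above can be sketched in a few lines. This is an illustrative stand-in, not the actual `adaptive_classifier` code: `rules`, `ml_model`, and `llm` are hypothetical callables, and `threshold` mirrors the 0.55 default.

```python
def classify(email, rules, ml_model, llm=None, threshold=0.55):
    """Three-tier dispatch: rules first, ML when confident enough,
    LLM only for the low-confidence remainder (and only if enabled)."""
    rule_hit = rules.get(email["sender_domain"])
    if rule_hit:                                 # tier 1: definite (5-10%)
        return rule_hit, "rules"
    category, confidence = ml_model(email)
    if confidence >= threshold or llm is None:   # tier 2: confident ML (70-80%)
        return category, "ml"
    return llm(email), "llm"                     # tier 3: LLM fallback (10-20%)

# Toy stand-ins to show the routing:
rules = {"billing.example.com": "Financial"}
def toy_ml(email):
    return ("Work", 0.9) if "meeting" in email["body"] else ("Updates", 0.3)
def toy_llm(email):
    return "External"

r1 = classify({"sender_domain": "billing.example.com", "body": ""}, rules, toy_ml)
r2 = classify({"sender_domain": "x", "body": "meeting at 3"}, rules, toy_ml, toy_llm)
r3 = classify({"sender_domain": "x", "body": "hi"}, rules, toy_ml, toy_llm)
```

Passing `llm=None` reproduces `--no-llm-fallback`: low-confidence emails keep the ML label instead of escalating.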
### Key Technologies
- **ML Model:** LightGBM (1.8MB, 11 categories, 28 threads)
- **Embeddings:** `all-minilm:l6-v2` via Ollama (384-dim, universal)
- **LLM:** `qwen3:4b-instruct-2507-q8_0` via Ollama (calibration only)
- **Feature Extraction:** embeddings + TF-IDF + pattern detection
- **Thresholds:** 0.55 (lowered from 0.75 to reduce LLM fallback)
### Performance Metrics
| Mode | Emails | Time | Accuracy | LLM Calls | Throughput |
|---|---|---|---|---|---|
| Pure ML (`--no-llm-fallback`) | 10,000 | 24s | 72.7% | 0 | 423/sec |
| ML + LLM fallback | 10,000 | 5min | 92.7% | 2,100 | 33/sec |
## Project Structure
```
email-sorter/
├── src/
│   ├── cli.py                     # Main CLI interface
│   ├── classification/            # Classification pipeline
│   │   ├── adaptive_classifier.py # Rules → ML → LLM orchestration
│   │   ├── ml_classifier.py       # LightGBM classifier
│   │   ├── llm_classifier.py      # LLM fallback
│   │   └── feature_extractor.py   # Batched embedding extraction
│   ├── calibration/               # LLM-driven calibration
│   │   ├── workflow.py            # Calibration orchestration
│   │   ├── llm_analyzer.py        # Batch category discovery (20 emails/call)
│   │   ├── trainer.py             # ML model training
│   │   └── category_verifier.py   # Category verification
│   ├── email_providers/           # Email source connectors
│   │   ├── gmail.py               # Gmail API (OAuth 2.0)
│   │   ├── outlook.py             # Microsoft Graph API (OAuth 2.0)
│   │   ├── imap.py                # IMAP protocol
│   │   └── enron.py               # Enron dataset (testing)
│   ├── llm/                       # LLM provider interfaces
│   │   ├── ollama.py              # Ollama provider
│   │   └── openai_compat.py       # OpenAI-compatible provider
│   └── models/                    # Trained models
│       ├── calibrated/            # User-calibrated models
│       │   └── classifier.pkl     # Current trained model (1.8MB)
│       └── pretrained/            # Default models
├── config/
│   ├── default_config.yaml        # System defaults
│   ├── categories.yaml            # Category definitions (thresholds: 0.55)
│   └── llm_models.yaml            # LLM configuration
├── credentials/                   # Email provider credentials (gitignored)
│   ├── gmail/                     # Gmail OAuth (3 accounts)
│   ├── outlook/                   # Outlook OAuth (3 accounts)
│   └── imap/                      # IMAP credentials (3 accounts)
├── docs/                          # Documentation
├── scripts/                       # Utility scripts
└── logs/                          # Log files (gitignored)
```
## Critical Implementation Details
### 1. Batched Embedding Extraction (CRITICAL!)

**ALWAYS** use batched feature extraction:
```python
# ✅ CORRECT - Batched (150x faster)
all_features = feature_extractor.extract_batch(emails, batch_size=512)
for email, features in zip(emails, all_features):
    result = adaptive_classifier.classify_with_features(email, features)

# ❌ WRONG - Sequential (extremely slow)
for email in emails:
    result = adaptive_classifier.classify(email)  # extracts features one at a time
```
**Why this matters:**
- Sequential: 10,000 emails × 15ms = 150 seconds just for embeddings
- Batched: 20 batches × 1s = 20 seconds for embeddings
- That is 7.5x less wall-clock time on embeddings, with 10,000 API calls collapsing to 20
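A minimal sketch of what batched extraction buys, with a fake embedder standing in for the Ollama API. `extract_batch` here is illustrative, not the project's actual implementation: one backend call per chunk instead of one per email.

```python
from typing import Callable, List, Sequence

def extract_batch(emails: Sequence[str],
                  embed_fn: Callable[[Sequence[str]], List[List[float]]],
                  batch_size: int = 512) -> List[List[float]]:
    """Embed emails in chunks so the backend sees one call per batch,
    not one call per email (the source of the speedup)."""
    features: List[List[float]] = []
    for start in range(0, len(emails), batch_size):
        chunk = emails[start:start + batch_size]
        features.extend(embed_fn(chunk))  # one backend call per chunk
    return features

# Toy embed function standing in for the real embedding API.
calls = []
def fake_embed(chunk):
    calls.append(len(chunk))
    return [[float(len(text))] for text in chunk]

vectors = extract_batch([f"email {i}" for i in range(1100)], fake_embed, batch_size=512)
```

With 1,100 emails and `batch_size=512`, the backend is hit three times (512 + 512 + 76) rather than 1,100 times.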
### 2. Model Paths

The model exists in **two** locations:
- `src/models/calibrated/classifier.pkl` - created during calibration (authoritative)
- `src/models/pretrained/classifier.pkl` - loaded by default (copy of calibrated)
When calibration runs:
- The model is saved to `calibrated/classifier.pkl`
- `MLClassifier` loads from `pretrained/classifier.pkl` by default
- The file must be copied (or the load path updated) after calibration
**Current status:** Both paths have the same 1.8MB model (Oct 25 02:54).
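One way to keep the two paths in sync after calibration is a small copy step. `sync_model` below is a hypothetical helper, not existing project code, and the demo runs against a temp directory rather than the real `src/models/` tree.

```python
import shutil
import tempfile
from pathlib import Path

def sync_model(calibrated: Path, pretrained: Path) -> Path:
    """Copy the authoritative calibrated model over the default-load path."""
    pretrained.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(calibrated, pretrained)  # preserves mtime for easy inspection
    return pretrained

# Demo: build a fake calibrated model in a temp tree, then sync it.
tmp = Path(tempfile.mkdtemp())
src = tmp / "calibrated" / "classifier.pkl"
src.parent.mkdir(parents=True)
src.write_bytes(b"model-bytes")
dst = sync_model(src, tmp / "pretrained" / "classifier.pkl")
```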
### 3. LLM-Driven Calibration

Categories are **not** hardcoded - they are discovered by the LLM:
1. Sample 300 emails (3% of 10k)
2. Batch-process them in groups of 20 emails
3. The LLM discovers categories (not predefined)
4. The LLM labels each email
5. Train LightGBM on the discovered categories
**Result:** 11 categories discovered from the Enron dataset:
- Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
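The sampling and batching arithmetic above can be sketched as follows. `calibration_batches` is illustrative (the real batching lives in `llm_analyzer.py`): 300 sampled emails in groups of 20 means 15 LLM calls per calibration run.

```python
import random

def calibration_batches(emails, sample_size=300, batch_size=20, seed=0):
    """Yield the batches an LLM analyzer would label: a small random
    sample of the mailbox, split into fixed-size groups."""
    rng = random.Random(seed)  # seeded for reproducible demos
    sample = rng.sample(emails, min(sample_size, len(emails)))
    for start in range(0, len(sample), batch_size):
        yield sample[start:start + batch_size]

batches = list(calibration_batches([f"email {i}" for i in range(10_000)]))
```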
### 4. Threshold Optimization

Default threshold: **0.55** (reduced from 0.75)

Impact:
- 0.75 threshold: 35% LLM fallback
- 0.55 threshold: 21% LLM fallback
- A 40% relative reduction in LLM usage
All category thresholds in `config/categories.yaml` are set to 0.55.
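The fallback decision itself is a one-line comparison; this sketch (an illustration, not the project's code) shows why lowering the threshold keeps mid-confidence predictions in the fast ML tier.

```python
def needs_llm_fallback(ml_confidence: float, threshold: float = 0.55) -> bool:
    """An email drops to the LLM tier only when the ML classifier's
    top-class probability falls below the category threshold."""
    return ml_confidence < threshold

# A prediction at 0.60 confidence escalates under the old 0.75 threshold
# but stays in the ML tier under the new 0.55 default.
```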
### 5. Email Provider Credentials

**Multi-account support:** 3 accounts per provider type
Credential files:

```
credentials/
├── gmail/
│   ├── account1.json    # Gmail OAuth credentials
│   ├── account2.json
│   └── account3.json
├── outlook/
│   ├── account1.json    # Outlook OAuth credentials
│   ├── account2.json
│   └── account3.json
└── imap/
    ├── account1.json    # IMAP username/password
    ├── account2.json
    └── account3.json
```
**Security:** all `*.json` files in `credentials/` are gitignored (only `.example` files are tracked).
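Loading one account's file reduces to a path lookup. `load_credentials` is a hypothetical helper (the real connectors take a `--credentials` path directly); the demo builds a temp tree so no real secrets are touched.

```python
import json
import tempfile
from pathlib import Path

def load_credentials(provider: str, account: str, root: Path) -> dict:
    """Read one account's JSON credential file,
    e.g. credentials/gmail/account1.json."""
    return json.loads((root / provider / f"{account}.json").read_text())

# Demo against a temp tree instead of the real (gitignored) credentials/.
root = Path(tempfile.mkdtemp())
(root / "imap").mkdir(parents=True)
(root / "imap" / "account1.json").write_text(
    '{"user": "a@example.com", "password": "secret"}')
creds = load_credentials("imap", "account1", root)
```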
## Common Commands

### Development

```bash
# Activate virtual environment
source venv/bin/activate

# Run classification (Enron dataset)
python -m src.cli run --source enron --limit 10000 --output results/

# Pure ML (no LLM fallback) - FAST
python -m src.cli run --source enron --limit 10000 --output results/ --no-llm-fallback

# With category verification
python -m src.cli run --source enron --limit 10000 --output results/ --verify-categories

# Gmail
python -m src.cli run --source gmail --credentials credentials/gmail/account1.json --limit 1000

# Outlook
python -m src.cli run --source outlook --credentials credentials/outlook/account1.json --limit 1000
```
### Training

```bash
# Force recalibration (clears cached models)
rm -rf src/models/calibrated/ src/models/pretrained/
python -m src.cli run --source enron --limit 10000 --output results/
```
## Code Patterns

### Adding New Features

1. **Update the CLI** (`src/cli.py`):
   - Add click options
   - Pass them to the appropriate modules
2. **Update the classifier** (`src/classification/adaptive_classifier.py`):
   - Add methods following the existing pattern
   - Use `classify_with_features()` for batched processing
3. **Update the feature extractor** (`src/classification/feature_extractor.py`):
   - Always support batching (`extract_batch()`)
   - Keep `extract()` for backward compatibility
### Testing

```bash
# Test imports
python -c "from src.cli import cli; print('OK')"

# Test providers
python -c "from src.email_providers.gmail import GmailProvider; from src.email_providers.outlook import OutlookProvider; print('OK')"

# Test classification
python -m src.cli run --source enron --limit 100 --output test/
```
## Performance Optimization

### Current Bottlenecks

1. **Embedding generation** - 20s for 10k emails (batched)
   - Optimized with `batch_size=512`
   - Could use local sentence-transformers for a 5-10x speedup
2. **Email parsing** - 0.5s for 10k emails (fast)
3. **ML inference** - 0.7s for 10k emails (very fast)
### Optimization Opportunities

1. **Local embeddings** - replace the Ollama API with sentence-transformers
   - Current: 20 API calls, ~20 seconds
   - With local: direct GPU, ~2-5 seconds
   - Trade-off: more dependencies, larger memory footprint
2. **Embedding cache** - pre-compute and cache to disk
   - One-time cost: 20 seconds
   - Subsequent runs: 2-3 seconds to load from disk
   - Perfect for development/testing
3. **Larger batches** - tested 512, 1024, 2048
   - 512: 23.6s (chosen for balance)
   - 1024: 22.1s (6.6% faster)
   - 2048: 21.9s (7.5% faster, diminishing returns)
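Opportunity 2, the embedding cache, can be sketched as a content-keyed pickle on disk. `cached_embed` is an assumption-level sketch, not existing project code: the cache key hashes the corpus, so re-running on the same emails skips the embedding backend entirely.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

def cached_embed(texts, embed_fn, cache_dir: Path):
    """Load embeddings from disk when the same corpus was embedded
    before; otherwise compute once and persist."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256("\x00".join(texts).encode()).hexdigest()
    path = cache_dir / f"{key}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())  # cache hit: no API calls
    vectors = embed_fn(texts)
    path.write_bytes(pickle.dumps(vectors))
    return vectors

# Demo: the fake embedder is only invoked on the first run.
calls = []
def fake_embed(texts):
    calls.append(len(texts))
    return [[1.0]] * len(texts)

cache = Path(tempfile.mkdtemp())
a = cached_embed(["x", "y"], fake_embed, cache)
b = cached_embed(["x", "y"], fake_embed, cache)
```

The second call returns the same vectors without touching `fake_embed`, which is the "2-3 seconds to load from disk" path described above.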
## Known Issues

### 1. Background Processes
There are stale background bash processes from previous sessions:
- These can be safely ignored
- Do NOT try to kill them (per user's CLAUDE.md instructions)
### 2. Model Path Confusion
- Calibration saves to `src/models/calibrated/`
- The default load path is `src/models/pretrained/`
- Both currently contain the same model (synced)
### 3. Category Cache
- `src/models/category_cache.json` stores discovered categories
- It can become polluted if different datasets are used
- Clear it with `rm src/models/category_cache.json` if issues arise
## Dependencies

### Required

```bash
pip install click pyyaml lightgbm numpy scikit-learn ollama
```
### Email Providers

```bash
# Gmail
pip install google-api-python-client google-auth-oauthlib google-auth-httplib2

# Outlook
pip install msal requests

# IMAP - no additional dependencies (Python stdlib)
```
### Optional

```bash
# For faster local embeddings
pip install sentence-transformers

# For development
pip install pytest black mypy
```
## Git Workflow

### What's Gitignored
- `credentials/` (except `.example` files)
- `logs/`
- `results/`
- `src/models/calibrated/` (trained models)
- `*.log`
- `debug_*.txt`
- Test directories
### What's Tracked
- All source code
- Configuration files
- Documentation
- Example credential files
- Pretrained model (if present)
## Important Notes for AI Assistants

1. **NEVER create files unless necessary** - always prefer editing existing files
2. **ALWAYS use batching** - feature extraction MUST be batched (512 emails/batch)
3. **Read before writing** - use the Read tool before any Edit operations
4. **Verify paths** - model paths can be confusing (calibrated vs pretrained)
5. **No emoji in commits** - per user's CLAUDE.md preferences
6. **Test before committing** - verify that imports and the CLI work
7. **Security** - never commit actual credentials, only `.example` files
8. **Performance matters** - 10x performance differences are common; always batch
9. **LLM is optional** - the system works without an LLM (pure ML mode with `--no-llm-fallback`)
10. **Categories are dynamic** - they are discovered by the LLM, not hardcoded
## Recent Changes (Last Session)
- Fixed embedding bottleneck - Changed from sequential to batched feature extraction (10x speedup)
- Added Outlook provider - Full Microsoft Graph API integration
- Added credentials system - Support for 3 accounts per provider type
- Optimized thresholds - Reduced from 0.75 to 0.55 (40% less LLM usage)
- Added category verifier - Optional single LLM call to verify model fit
- Project reorganization - Clean docs/, scripts/, logs/ structure
## Next Steps (Roadmap)

See `docs/PROJECT_STATUS_AND_NEXT_STEPS.html` for the complete roadmap.

**Immediate priorities:**
- Test Gmail provider with real credentials
- Test Outlook provider with real credentials
- Implement email syncing (apply labels back to mailbox)
- Add incremental classification (process only new emails)
- Create web dashboard for results visualization
**Remember:** This is an MVP with proven performance. Don't over-engineer. Keep it fast and simple.