Rewrite CLAUDE.md and clean project structure

- Rewrote CLAUDE.md with comprehensive development guide
- Archived 20 old docs to docs/archive/
- Added PROJECT_ROADMAP_2025.md with research learnings
- Added CLASSIFICATION_METHODS_COMPARISON.md
- Added SESSION_HANDOVER_20251128.md
- Added tools for analysis (brett_gmail/microsoft analyzers)
- Updated .gitignore for archive folders
- Config changes for local vLLM endpoint
FSSCoding 2025-11-28 13:07:27 +11:00
parent 4eee962c09
commit 8f25e30f52
32 changed files with 3592 additions and 14417 deletions

.gitignore (vendored, 13 lines changed)

@@ -72,8 +72,17 @@ ml_only_test/
 results_*/
 phase1_*/

-# Python scripts (experimental/research)
+# Python scripts (experimental/research - not in src/tests/tools)
 *.py
 !src/**/*.py
 !tests/**/*.py
+!tools/**/*.py
 !setup.py
+
+# Archive folders (historical content)
+archive/
+docs/archive/
+
+# Data folders (user-specific content)
+data/Bruce emails/
+data/emails-for-link/

CLAUDE.md (645 lines changed)

@@ -1,377 +1,304 @@

# Email Sorter - Claude Development Guide

This document provides essential context for Claude (or other AI assistants) working on this project.

## Project Overview

**Email Sorter** is a hybrid ML/LLM email classification system designed to process large email backlogs (10k-100k+ emails) with high speed and accuracy.

### Current MVP Status

**✅ PROVEN WORKING** - 10,000 emails classified in ~24 seconds with 72.7% accuracy

**Core Features:**
- LLM-driven category discovery (no hardcoded categories)
- ML model training on discovered categories (LightGBM)
- Fast pure-ML classification with `--no-llm-fallback`
- Category verification for new mailboxes with `--verify-categories`
- Batched embedding extraction (512 emails/batch)
- Multiple email provider support (Gmail, Outlook, IMAP, Enron)

## Architecture

### Three-Tier Classification Pipeline

```
Email → Rules Check → ML Classifier → LLM Fallback (optional)
            ↓               ↓                ↓
         Definite    High Confidence   Low Confidence
         (5-10%)        (70-80%)         (10-20%)
```

### Key Technologies
- **ML Model**: LightGBM (1.8MB, 11 categories, 28 threads)
- **Embeddings**: all-minilm:l6-v2 via Ollama (384-dim, universal)
- **LLM**: qwen3:4b-instruct-2507-q8_0 via Ollama (calibration only)
- **Feature Extraction**: Embeddings + TF-IDF + pattern detection
- **Thresholds**: 0.55 (optimized from 0.75 to reduce LLM fallback)

### Performance Metrics

| Emails | Time | Accuracy | LLM Calls | Throughput |
|--------|------|----------|-----------|------------|
| 10,000 | 24s  | 72.7%    | 0         | 423/sec    |
| 10,000 | 5min | 92.7%    | 2,100     | 33/sec     |

(The rewrite replaces the above with:)

# Email Sorter - Development Guide

## What This Tool Does

**Email Sorter is a TRIAGE tool** that sorts emails into buckets for downstream processing. It is NOT a complete email management solution - it's one part of a larger ecosystem.

```
Raw Inbox (10k+) --> Email Sorter --> Categorized Buckets --> Specialized Tools
                      (this tool)        (output)              (other tools)
```

---

## Quick Start

```bash
cd /MASTERFOLDER/Tools/email-sorter
source venv/bin/activate

# Classify emails with ML + LLM fallback
python -m src.cli run --source local \
    --directory "/path/to/emails" \
    --output "/path/to/output" \
    --force-ml --llm-provider openai

# Generate HTML report from results
python tools/generate_html_report.py --input /path/to/results.json
```
---
## Key Documentation
| Document | Purpose | Location |
|----------|---------|----------|
| **PROJECT_ROADMAP_2025.md** | Master learnings, research findings, development roadmap | `docs/` |
| **CLASSIFICATION_METHODS_COMPARISON.md** | ML vs LLM vs Agent comparison | `docs/` |
| **REPORT_FORMAT.md** | HTML report documentation | `docs/` |
| **BATCH_LLM_QUICKSTART.md** | Quick LLM batch processing guide | root |
---
## Research Findings Summary
### Dataset Size Routing
| Size | Best Method | Why |
|------|-------------|-----|
| <500 | Agent-only | ML overhead exceeds benefit |
| 500-5000 | Agent pre-scan + ML | Discovery improves accuracy |
| >5000 | ML pipeline | Speed critical |
### Research Results
| Dataset | Type | ML-Only | ML+LLM | Agent |
|---------|------|---------|--------|-------|
| brett-gmail (801) | Personal | 54.9% | 93.3% | 99.8% |
| brett-microsoft (596) | Business | - | - | 98.2% |
### Key Insight: Inbox Character Matters
| Type | Pattern | Approach |
|------|---------|----------|
| **Personal** | Subscriptions, marketing (40-50% automated) | Sender domain first |
| **Business** | Client work, operations (60-70% professional) | Sender + Subject context |
---
## Project Structure

```
email-sorter/
├── CLAUDE.md                        # THIS FILE
├── README.md                        # General readme
├── BATCH_LLM_QUICKSTART.md          # LLM batch processing
│
├── src/                             # Source code
│   ├── cli.py                       # Main entry point
│   ├── classification/              # ML/LLM classification
│   ├── calibration/                 # Model training, email parsing
│   ├── email_providers/             # Gmail, Outlook, IMAP, Local
│   └── llm/                         # LLM providers
│
├── tools/                           # Utility scripts
│   ├── brett_gmail_analyzer.py      # Personal inbox template
│   ├── brett_microsoft_analyzer.py  # Business inbox template
│   ├── generate_html_report.py      # HTML report generator
│   └── batch_llm_classifier.py      # Batch LLM classification
│
├── config/                          # Configuration
│   ├── default_config.yaml          # LLM endpoints, thresholds
│   └── categories.yaml              # Category definitions
│
├── docs/                            # Current documentation
│   ├── PROJECT_ROADMAP_2025.md
│   ├── CLASSIFICATION_METHODS_COMPARISON.md
│   ├── REPORT_FORMAT.md
│   └── archive/                     # Old docs (historical)
│
├── data/                            # Analysis outputs (gitignored)
│   ├── brett_gmail_analysis.json
│   └── brett_microsoft_analysis.json
│
├── credentials/                     # OAuth/API creds (gitignored)
├── results/                         # Classification outputs (gitignored)
├── archive/                         # Old scripts (gitignored)
├── maildir/                         # Enron test data
└── venv/                            # Python environment
```

(The previous layout, removed by this commit:)

## Project Structure

```
email-sorter/
├── src/
│   ├── cli.py                     # Main CLI interface
│   ├── classification/            # Classification pipeline
│   │   ├── adaptive_classifier.py # Rules → ML → LLM orchestration
│   │   ├── ml_classifier.py       # LightGBM classifier
│   │   ├── llm_classifier.py      # LLM fallback
│   │   └── feature_extractor.py   # Batched embedding extraction
│   ├── calibration/               # LLM-driven calibration
│   │   ├── workflow.py            # Calibration orchestration
│   │   ├── llm_analyzer.py        # Batch category discovery (20 emails/call)
│   │   ├── trainer.py             # ML model training
│   │   └── category_verifier.py   # Category verification
│   ├── email_providers/           # Email source connectors
│   │   ├── gmail.py               # Gmail API (OAuth 2.0)
│   │   ├── outlook.py             # Microsoft Graph API (OAuth 2.0)
│   │   ├── imap.py                # IMAP protocol
│   │   └── enron.py               # Enron dataset (testing)
│   ├── llm/                       # LLM provider interfaces
│   │   ├── ollama.py              # Ollama provider
│   │   └── openai_compat.py       # OpenAI-compatible provider
│   └── models/                    # Trained models
│       ├── calibrated/            # User-calibrated models
│       │   └── classifier.pkl     # Current trained model (1.8MB)
│       └── pretrained/            # Default models
├── config/
│   ├── default_config.yaml        # System defaults
│   ├── categories.yaml            # Category definitions (thresholds: 0.55)
│   └── llm_models.yaml            # LLM configuration
├── credentials/                   # Email provider credentials (gitignored)
│   ├── gmail/                     # Gmail OAuth (3 accounts)
│   ├── outlook/                   # Outlook OAuth (3 accounts)
│   └── imap/                      # IMAP credentials (3 accounts)
├── docs/                          # Documentation
├── scripts/                       # Utility scripts
└── logs/                          # Log files (gitignored)
```
## Critical Implementation Details
### 1. Batched Embedding Extraction (CRITICAL!)
**ALWAYS use batched feature extraction:**
```python
# ✅ CORRECT - Batched (roughly an order of magnitude faster end-to-end)
all_features = feature_extractor.extract_batch(emails, batch_size=512)
for email, features in zip(emails, all_features):
    result = adaptive_classifier.classify_with_features(email, features)
# ❌ WRONG - Sequential (extremely slow)
for email in emails:
    result = adaptive_classifier.classify(email)  # Extracts features one-at-a-time
```
**Why this matters:**
- Sequential: 10,000 emails × 15ms = 150 seconds just for embeddings
- Batched: 20 batches × 1s = 20 seconds for embeddings
- **A 7.5x difference on the embedding step alone** (and roughly 10x end-to-end once per-call overhead is included)
### 2. Model Paths
**The model exists in TWO locations:**
- `src/models/calibrated/classifier.pkl` - Created during calibration (authoritative)
- `src/models/pretrained/classifier.pkl` - Loaded by default (copy of calibrated)
**When calibration runs:**
1. Saves model to `calibrated/classifier.pkl`
2. MLClassifier loads from `pretrained/classifier.pkl` by default
3. Need to copy or update path
**Current status:** Both paths have the same 1.8MB model (Oct 25 02:54)
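Until the loader is unified, a tiny sync step keeps the two paths in lockstep. This is a minimal sketch, not project code; the paths are the ones listed above, the helper name is invented:

```python
import shutil
from pathlib import Path

CALIBRATED = Path("src/models/calibrated/classifier.pkl")
PRETRAINED = Path("src/models/pretrained/classifier.pkl")

def sync_model() -> None:
    """Copy the freshly calibrated model over the default-loaded path."""
    if not CALIBRATED.exists():
        raise FileNotFoundError(f"no calibrated model at {CALIBRATED}")
    PRETRAINED.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(CALIBRATED, PRETRAINED)  # copy2 keeps the mtime for auditing

if __name__ == "__main__":
    sync_model()
```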
### 3. LLM-Driven Calibration
**NOT hardcoded categories** - categories are discovered by LLM:
```python
# Calibration process:
1. Sample 300 emails (3% of 10k)
2. Batch process in groups of 20 emails
3. LLM discovers categories (not predefined)
4. LLM labels each email
5. Train LightGBM on discovered categories
```
**Result:** 11 categories discovered from Enron dataset:
- Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
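As a minimal sketch of that flow (the `llm` and `trainer` interfaces here are illustrative, not the project's actual signatures):

```python
import random

def calibrate(emails, llm, trainer, sample_pct=0.03, batch_size=20):
    """Sample, let the LLM discover categories and label, then train LightGBM."""
    sample = random.sample(emails, max(1, int(len(emails) * sample_pct)))
    categories, labels = set(), []
    for i in range(0, len(sample), batch_size):
        batch = sample[i:i + batch_size]
        # One LLM call per batch: propose categories and label each email
        result = llm.discover_and_label(batch)  # hypothetical method
        categories.update(result["categories"])
        labels.extend(result["labels"])
    return trainer.train(sample, labels, sorted(categories))
```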
### 4. Threshold Optimization
**Default threshold: 0.55** (reduced from 0.75)
**Impact:**
- 0.75 threshold: 35% LLM fallback
- 0.55 threshold: 21% LLM fallback
- **40% reduction in LLM usage**
All category thresholds in `config/categories.yaml` set to 0.55.
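The routing logic this threshold drives is essentially the following (a sketch; the names are illustrative, not the project's actual API):

```python
def route(email, features, ml, llm=None, threshold=0.55):
    """Accept ML predictions at or above the threshold; otherwise fall
    back to the LLM when one is configured."""
    category, confidence = ml.predict(features)
    if confidence >= threshold or llm is None:
        return category, confidence, "ml"
    return llm.classify(email), None, "llm"  # LLM verdict taken as final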
### 5. Email Provider Credentials
**Multi-account support:** 3 accounts per provider type
**Credential files:**
```
credentials/
├── gmail/
│ ├── account1.json # Gmail OAuth credentials
│ ├── account2.json
│ └── account3.json
├── outlook/
│ ├── account1.json # Outlook OAuth credentials
│ ├── account2.json
│ └── account3.json
└── imap/
├── account1.json # IMAP username/password
├── account2.json
└── account3.json
```
**Security:** All `*.json` files in `credentials/` are gitignored (only `.example` files tracked).
## Common Commands
### Development
```bash
# Activate virtual environment
source venv/bin/activate
# Run classification (Enron dataset)
python -m src.cli run --source enron --limit 10000 --output results/
# Pure ML (no LLM fallback) - FAST
python -m src.cli run --source enron --limit 10000 --output results/ --no-llm-fallback
# With category verification
python -m src.cli run --source enron --limit 10000 --output results/ --verify-categories
# Gmail
python -m src.cli run --source gmail --credentials credentials/gmail/account1.json --limit 1000
# Outlook
python -m src.cli run --source outlook --credentials credentials/outlook/account1.json --limit 1000
```
### Training
```bash
# Force recalibration (clears cached model)
rm -rf src/models/calibrated/ src/models/pretrained/
python -m src.cli run --source enron --limit 10000 --output results/
```
## Code Patterns
### Adding New Features
1. **Update CLI** ([src/cli.py](src/cli.py)):
- Add click options
- Pass to appropriate modules
2. **Update Classifier** ([src/classification/adaptive_classifier.py](src/classification/adaptive_classifier.py)):
- Add methods following existing pattern
- Use `classify_with_features()` for batched processing
3. **Update Feature Extractor** ([src/classification/feature_extractor.py](src/classification/feature_extractor.py)):
- Always support batching (`extract_batch()`)
- Keep `extract()` for backward compatibility
### Testing
```bash
# Test imports
python -c "from src.cli import cli; print('OK')"
# Test providers
python -c "from src.email_providers.gmail import GmailProvider; from src.email_providers.outlook import OutlookProvider; print('OK')"
# Test classification
python -m src.cli run --source enron --limit 100 --output test/
```
## Performance Optimization
### Current Bottlenecks
1. **Embedding generation** - 20s for 10k emails (batched)
- Optimized with batch_size=512
- Could use local sentence-transformers for 5-10x speedup
2. **Email parsing** - 0.5s for 10k emails (fast)
3. **ML inference** - 0.7s for 10k emails (very fast)
### Optimization Opportunities
1. **Local embeddings** - Replace Ollama API with sentence-transformers
- Current: 20 API calls, ~20 seconds
- With local: Direct GPU, ~2-5 seconds
- Trade-off: More dependencies, larger memory footprint
2. **Embedding cache** - Pre-compute and cache to disk
- One-time cost: 20 seconds
- Subsequent runs: 2-3 seconds to load from disk
- Perfect for development/testing (see the cache sketch after this list)
3. **Larger batches** - Tested 512, 1024, 2048
- 512: 23.6s (chosen for balance)
- 1024: 22.1s (6.6% faster)
- 2048: 21.9s (7.5% faster, diminishing returns)
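A minimal sketch of the cache idea from item 2, keyed on an MD5 of the email text; the cache directory and helper are assumptions, not existing code:

```python
import hashlib
from pathlib import Path

import numpy as np

CACHE_DIR = Path(".embedding_cache")  # hypothetical location

def cached_embedding(text: str, embed_fn) -> np.ndarray:
    """Return a cached vector if present, else compute and persist it."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.md5(text.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.npy"
    if path.exists():
        return np.load(path)
    vec = np.asarray(embed_fn(text), dtype=np.float32)
    np.save(path, vec)
    return vec
```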
## Known Issues
### 1. Background Processes
There are stale background bash processes from previous sessions:
- These can be safely ignored
- Do NOT try to kill them (per user's CLAUDE.md instructions)
### 2. Model Path Confusion
- Calibration saves to `src/models/calibrated/`
- Default loads from `src/models/pretrained/`
- Both currently have the same model (synced)
### 3. Category Cache
- `src/models/category_cache.json` stores discovered categories
- Can become polluted if different datasets used
- Clear with `rm src/models/category_cache.json` if issues
## Dependencies
### Required
```bash
pip install click pyyaml lightgbm numpy scikit-learn ollama
```
### Email Providers
```bash
# Gmail
pip install google-api-python-client google-auth-oauthlib google-auth-httplib2
# Outlook
pip install msal requests
# IMAP - no additional dependencies (Python stdlib)
```
### Optional
```bash
# For faster local embeddings
pip install sentence-transformers
# For development
pip install pytest black mypy
```
## Git Workflow
### What's Gitignored
- `credentials/` (except `.example` files)
- `logs/`
- `results/`
- `src/models/calibrated/` (trained models)
- `*.log`
- `debug_*.txt`
- Test directories
### What's Tracked
- All source code
- Configuration files
- Documentation
- Example credential files
- Pretrained model (if present)
## Important Notes for AI Assistants
1. **NEVER create files unless necessary** - Always prefer editing existing files
2. **ALWAYS use batching** - Feature extraction MUST be batched (512 emails/batch)
3. **Read before writing** - Use Read tool before any Edit operations
4. **Verify paths** - Model paths can be confusing (calibrated vs pretrained)
5. **No emoji in commits** - Per user's CLAUDE.md preferences
6. **Test before committing** - Verify imports and CLI work
7. **Security** - Never commit actual credentials, only `.example` files
8. **Performance matters** - 10x performance differences are common, always batch
9. **LLM is optional** - System works without LLM (pure ML mode with --no-llm-fallback)
10. **Categories are dynamic** - They're discovered by LLM, not hardcoded
## Recent Changes (Last Session)
1. **Fixed embedding bottleneck** - Changed from sequential to batched feature extraction (10x speedup)
2. **Added Outlook provider** - Full Microsoft Graph API integration
3. **Added credentials system** - Support for 3 accounts per provider type
4. **Optimized thresholds** - Reduced from 0.75 to 0.55 (40% less LLM usage)
5. **Added category verifier** - Optional single LLM call to verify model fit
6. **Project reorganization** - Clean docs/, scripts/, logs/ structure
## Next Steps (Roadmap)
See [docs/PROJECT_STATUS_AND_NEXT_STEPS.html](docs/PROJECT_STATUS_AND_NEXT_STEPS.html) for complete roadmap.
**Immediate priorities:**
1. Test Gmail provider with real credentials
2. Test Outlook provider with real credentials
3. Implement email syncing (apply labels back to mailbox)
4. Add incremental classification (process only new emails)
5. Create web dashboard for results visualization
---

**Remember:** This is an MVP with proven performance. Don't over-engineer. Keep it fast and simple.

---

## Common Operations
### 1. Classify Emails (ML Pipeline)
```bash
source venv/bin/activate
# With LLM fallback for low confidence
python -m src.cli run --source local \
--directory "/path/to/emails" \
--output "/path/to/output" \
--force-ml --llm-provider openai
# Pure ML (fastest, no LLM)
python -m src.cli run --source local \
--directory "/path/to/emails" \
--output "/path/to/output" \
--force-ml --no-llm-fallback
```
### 2. Generate HTML Report
```bash
python tools/generate_html_report.py --input /path/to/results.json
# Creates report.html in same directory
```
### 3. Manual Agent Analysis (Best Accuracy)
For <1000 emails, agent analysis gives 98-99% accuracy:
```bash
# Copy and customize analyzer template
cp tools/brett_gmail_analyzer.py tools/my_inbox_analyzer.py
# Edit classify_email() function for your inbox patterns
# Update email_dir path
# Run
python tools/my_inbox_analyzer.py
```
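For orientation, a `classify_email()` in these templates boils down to sender-domain rules with a keyword fallback. The rules below are invented placeholders, not the shipped ones:

```python
import re

DOMAIN_RULES = {  # illustrative only - replace with your inbox's senders
    "ebay.com": "Shopping",
    "accounts.google.com": "Security",
}

def classify_email(sender: str, subject: str) -> str:
    """Sender domain first, then subject keywords, then a catch-all."""
    match = re.search(r"@([\w.-]+)$", sender.strip().lower())
    domain = match.group(1) if match else ""
    for known, category in DOMAIN_RULES.items():
        if domain == known or domain.endswith("." + known):
            return category
    if re.search(r"\b(invoice|receipt|order)\b", subject, re.IGNORECASE):
        return "Transactional"
    return "Uncategorized"
```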
### 4. Different Email Sources
```bash
# Local .eml/.msg files
--source local --directory "/path/to/emails"
# Gmail (OAuth)
--source gmail --credentials credentials/gmail/account1.json
# Outlook (OAuth)
--source outlook --credentials credentials/outlook/account1.json
# Enron test data
--source enron --limit 10000
```
---
## Output Locations
**Analysis reports are stored OUTSIDE this project:**
```
/home/bob/Documents/Email Manager/emails/
├── brett-gmail/ # Source emails (untouched)
├── brett-gm-md/ # ML-only classification output
│ ├── results.json
│ ├── report.html
│ └── BRETT_GMAIL_ANALYSIS_REPORT.md
├── brett-gm-llm/ # ML+LLM classification output
│ ├── results.json
│ └── report.html
└── brett-ms-sorter/ # Microsoft inbox analysis
└── BRETT_MICROSOFT_ANALYSIS_REPORT.md
```
**Project data outputs (gitignored):**
```
/MASTERFOLDER/Tools/email-sorter/data/
├── brett_gmail_analysis.json
└── brett_microsoft_analysis.json
```
---
## Configuration
### LLM Endpoint (config/default_config.yaml)
```yaml
llm:
provider: "openai"
openai:
base_url: "http://localhost:11433/v1" # vLLM endpoint
api_key: "not-needed"
classification_model: "qwen3-coder-30b"
```
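A quick smoke test for this endpoint; it assumes the standard OpenAI-compatible `/v1/chat/completions` route that vLLM's server exposes:

```python
import requests

resp = requests.post(
    "http://localhost:11433/v1/chat/completions",
    json={
        "model": "qwen3-coder-30b",
        "messages": [{"role": "user", "content": "Reply with OK"}],
        "max_tokens": 5,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```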
### Thresholds (config/categories.yaml)
Default: 0.55 (reduced from 0.75 for 40% less LLM fallback)
---
## Key Code Locations
| Function | File |
|----------|------|
| CLI entry | `src/cli.py` |
| ML classifier | `src/classification/ml_classifier.py` |
| LLM classifier | `src/classification/llm_classifier.py` |
| Feature extraction | `src/classification/feature_extractor.py` |
| Email parsing | `src/calibration/local_file_parser.py` |
| OpenAI-compat LLM | `src/llm/openai_compat.py` |
---
## Recent Changes (Nov 2025)
1. **cli.py**: Added `--force-ml` flag, enriched results.json with metadata
2. **openai_compat.py**: Removed API key requirement for local vLLM
3. **default_config.yaml**: Changed to openai provider on localhost:11433
4. **tools/**: Added brett_gmail_analyzer.py, brett_microsoft_analyzer.py, generate_html_report.py
5. **docs/**: Added PROJECT_ROADMAP_2025.md, CLASSIFICATION_METHODS_COMPARISON.md
---
## Troubleshooting
### "LLM endpoint not responding"
- Check vLLM running on localhost:11433
- Verify model name in config matches running model
### "Low accuracy (50-60%)"
- For <1000 emails, use agent analysis
- Dataset may differ from Enron training data
### "Too many LLM calls"
- Use `--no-llm-fallback` for pure ML
- Increase threshold in categories.yaml
---
## Development Notes
### Virtual Environment Required
```bash
source venv/bin/activate
# ALWAYS activate before Python commands
```
### Batched Feature Extraction (CRITICAL)
```python
# CORRECT - Batched (roughly an order of magnitude faster)
all_features = feature_extractor.extract_batch(emails, batch_size=512)
# WRONG - Sequential (extremely slow)
for email in emails:
    result = classifier.classify(email)  # Don't do this
```
### Model Paths
- `src/models/calibrated/` - Created during calibration
- `src/models/pretrained/` - Loaded by default
---
## What's Gitignored
- `credentials/` - OAuth tokens
- `results/`, `data/` - User data
- `archive/`, `docs/archive/` - Historical content
- `maildir/` - Enron test data (large)
- `enron_mail_20150507.tar.gz` - Source archive
- `venv/` - Python environment
- `*.log`, `logs/` - Log files
---
## Philosophy
1. **Triage, not management** - Sort into buckets for other tools
2. **Risk-based accuracy** - High for personal, acceptable errors for junk
3. **Speed matters** - 10k emails in <1 min
4. **Inbox character matters** - Business vs personal = different approaches
5. **Agent pre-scan adds value** - 10-15 min discovery improves everything
---
*Last Updated: 2025-11-28*
*See docs/PROJECT_ROADMAP_2025.md for full research findings*

config/default_config.yaml

@@ -27,7 +27,7 @@ classification:
   conversational: 0.55
 llm:
-  provider: "ollama"
+  provider: "openai"
   fallback_enabled: true
   ollama:
@@ -41,9 +41,10 @@ llm:
     retry_attempts: 3
   openai:
-    base_url: "https://rtx3090.bobai.com.au/v1"
-    api_key: "rtx3090_foxadmin_10_8034ecb47841f45ba1d5f3f5d875c092"
+    base_url: "http://localhost:11433/v1"
+    api_key: "not-needed"
     calibration_model: "qwen3-coder-30b"
+    consolidation_model: "qwen3-coder-30b"
     classification_model: "qwen3-coder-30b"
     temperature: 0.1
     max_tokens: 500

File diff suppressed because it is too large.

docs/CLASSIFICATION_METHODS_COMPARISON.md (new file)

@@ -0,0 +1,518 @@
# Email Classification Methods: Comparative Analysis
## Executive Summary
This document compares three email classification approaches tested on an 801-email personal Gmail dataset:
| Method | Accuracy | Time | Best For |
|--------|----------|------|----------|
| ML-Only | 54.9% | 5 sec | 10k+ emails, speed critical |
| ML+LLM Fallback | 93.3% | 3.5 min | 1k-10k emails, balanced |
| Agent Analysis | 99.8% | 15-30 min | <1k emails, deep insights |
**Key Finding:** The ML pipeline is overkill for datasets under ~5,000 emails. A 10-15 minute agent pre-analysis phase could dramatically improve ML accuracy for larger datasets.
---
## Test Dataset Profile
| Characteristic | Value |
|----------------|-------|
| Total Emails | 801 |
| Date Range | 20 years (2005-2025) |
| Unique Senders | ~150 |
| Automated % | 48.8% |
| Personal % | 1.6% |
| Structure Level | MEDIUM-HIGH |
### Email Type Breakdown (Sanitized)
```
Automated Notifications 48.8% ████████████████████████
├─ Art marketplace alerts 16.2% ████████
├─ Shopping promotions 15.4% ███████
├─ Travel recommendations 13.4% ██████
└─ Streaming promotions 8.5% ████
Business/Professional 20.1% ██████████
├─ Cloud service reports 13.0% ██████
├─ Security alerts 7.1% ███
AI/Developer Services 12.8% ██████
├─ AI platform updates 6.4% ███
├─ Developer tool updates 6.4% ███
Personal/Other 18.3% █████████
├─ Entertainment 5.1% ██
├─ Productivity tools 3.7% █
├─ Direct correspondence 1.6% █
└─ Miscellaneous 7.9% ███
```
---
## Method 1: ML-Only Classification
### Configuration
```yaml
model: LightGBM (pretrained on Enron dataset)
embeddings: all-minilm:l6-v2 (384 dimensions)
threshold: 0.55 confidence
categories: 11 generic (Work, Updates, Financial, etc.)
```
### Results
| Metric | Value |
|--------|-------|
| Accuracy Estimate | 54.9% |
| High Confidence (>55%) | 477 (59.6%) |
| Low Confidence | 324 (40.4%) |
| Processing Time | ~5 seconds |
| LLM Calls | 0 |
### Category Distribution (ML-Only)
| Category | Count | % |
|----------|-------|---|
| Work | 243 | 30.3% |
| Technical | 198 | 24.7% |
| Updates | 156 | 19.5% |
| External | 89 | 11.1% |
| Operational | 45 | 5.6% |
| Financial | 38 | 4.7% |
| Other | 32 | 4.0% |
### Limitations Observed
1. **Domain Mismatch:** Trained on corporate Enron emails, applied to personal Gmail
2. **Generic Categories:** "Work" and "Technical" absorbed everything
3. **No Sender Intelligence:** Didn't leverage sender domain patterns
4. **High Uncertainty:** 40% needed LLM review but got none
### When ML-Only Works
- 10,000+ emails where speed matters
- Corporate/enterprise datasets similar to training data
- Pre-filtering before human review
- Cost-constrained environments (no LLM API)
---
## Method 2: ML + LLM Fallback
### Configuration
```yaml
ml_model: LightGBM (same as above)
llm_model: qwen3-coder-30b (vLLM on localhost:11433)
threshold: 0.55 confidence
fallback_trigger: confidence < threshold
```
### Results
| Metric | Value |
|--------|-------|
| Accuracy Estimate | 93.3% |
| ML Classified | 477 (59.6%) |
| LLM Classified | 324 (40.4%) |
| Processing Time | ~3.5 minutes |
| LLM Calls | 324 |
### Category Distribution (ML+LLM)
| Category | Count | % | Source |
|----------|-------|---|--------|
| Work | 243 | 30.3% | ML |
| Technical | 156 | 19.5% | ML |
| newsletters | 98 | 12.2% | LLM |
| junk | 87 | 10.9% | LLM |
| transactional | 76 | 9.5% | LLM |
| Updates | 62 | 7.7% | ML |
| auth | 45 | 5.6% | LLM |
| Other | 34 | 4.2% | Mixed |
### Improvements Over ML-Only
1. **New Categories:** LLM introduced "newsletters", "junk", "transactional", "auth"
2. **Better Separation:** Marketing vs. transactional distinguished
3. **Higher Confidence:** 93.3% vs 54.9% accuracy estimate
### Limitations Observed
1. **Category Inconsistency:** ML uses "Updates", LLM uses "newsletters"
2. **No Sender Context:** Still classifying email-by-email
3. **Generic LLM Prompt:** Doesn't know about user's specific interests
4. **Time Cost:** 324 sequential LLM calls at ~0.6s each (a concurrent client, sketched below, would cut this wall-clock time)
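A minimal sketch of that concurrency fix, assuming the existing single-email call can be reused as-is and that the local vLLM server can absorb ~8 parallel requests (a number to tune, not a measured value):

```python
from concurrent.futures import ThreadPoolExecutor

def classify_concurrently(emails, llm_classify, workers=8):
    """Fan low-confidence emails out over a thread pool instead of
    looping one request at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(llm_classify, emails))
```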
### When ML+LLM Works
- 1,000-10,000 emails
- Mixed automated/personal content
- When accuracy matters more than speed
- Local LLM available (cost-free fallback)
---
## Method 3: Agent Analysis (Manual)
### Approach
```
Phase 1: Initial Discovery (5 min)
- Sample filenames and subjects
- Identify sender domains
- Detect patterns
Phase 2: Pattern Extraction (10 min)
- Design domain-specific rules
- Test regex patterns
- Validate on subset
Phase 3: Deep Dive (5 min)
- Track order lifecycles
- Identify billing patterns
- Find edge cases
Phase 4: Report Generation (5 min)
- Synthesize findings
- Create actionable recommendations
```
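Phase 1 is mostly a sender-domain histogram. A sketch, assuming each email exposes a `.sender` header string:

```python
from collections import Counter
from email.utils import parseaddr

def domain_histogram(emails, top_n=20):
    """Count emails per sender domain so high-volume automated
    senders surface immediately."""
    counts = Counter()
    for msg in emails:
        address = parseaddr(msg.sender)[1].lower()
        domain = address.rsplit("@", 1)[-1] if "@" in address else "unknown"
        counts[domain] += 1
    return counts.most_common(top_n)  # the top domains drive the first rules
```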
### Results
| Metric | Value |
|--------|-------|
| Accuracy | 99.8% (799/801) |
| Categories | 15 custom |
| Processing Time | ~25 minutes |
| LLM Calls | ~20 (analysis only) |
### Category Distribution (Agent Analysis)
| Category | Count | % | Subcategories |
|----------|-------|---|---------------|
| Art & Collectibles | 130 | 16.2% | Marketplace alerts |
| Shopping | 123 | 15.4% | eBay, AliExpress, Automotive |
| Entertainment | 109 | 13.6% | Streaming, Gaming, Social |
| Travel & Tourism | 107 | 13.4% | Review sites, Bookings |
| Google Services | 104 | 13.0% | Business, Ads, Analytics |
| Security | 57 | 7.1% | Sign-in alerts, 2FA |
| AI Services | 51 | 6.4% | Claude, OpenAI, Lambda |
| Developer Tools | 51 | 6.4% | ngrok, Firebase, Docker |
| Productivity | 30 | 3.7% | Screen recording, Docs |
| Personal | 13 | 1.6% | Direct correspondence |
| Other | 26 | 3.2% | Childcare, Legal, etc. |
### Unique Insights (Not Found by ML)
1. **Artist Tracking:** 95 alerts for the artist "Dan Colen"
2. **Order Lifecycle:** Single order generated 7 notification emails
3. **Billing Patterns:** Monthly receipts from AI services on 15th
4. **Business Context:** User runs "Fox Software Solutions"
5. **Filtering Rules:** Ready-to-implement Gmail filters
### When Agent Analysis Works
- Under 1,000 emails
- Initial dataset understanding
- Creating filtering rules
- One-time deep analysis
- Training data preparation
---
## Comparative Analysis
### Accuracy vs Time Tradeoff
```
Accuracy
100% ─┬─────────────────────────●─── Agent (99.8%)
│ ●─────── ML+LLM (93.3%)
75% ─┤
50% ─┼────●───────────────────────── ML-Only (54.9%)
25% ─┤
0% ─┴────┬────────┬────────┬────────┬─── Time
5s 1m 5m 30m
```
### Cost Analysis (per 1000 emails)
| Method | Compute | LLM Calls | Est. Cost |
|--------|---------|-----------|-----------|
| ML-Only | 5 sec | 0 | $0.00 |
| ML+LLM | 4 min | ~400 | $0.02-0.40* |
| Agent | 30 min | ~30 | $0.01-0.10* |
*Depends on LLM provider; local = free, cloud = varies
### Category Quality
| Aspect | ML-Only | ML+LLM | Agent |
|--------|---------|--------|-------|
| Granularity | Low (11) | Medium (16) | High (15+subs) |
| Domain-Specific | No | Partial | Yes |
| Actionable | Limited | Moderate | High |
| Sender-Aware | No | No | Yes |
| Context-Aware | No | Limited | Yes |
---
## Enhancement Recommendations
### 1. Pre-Analysis Phase (10-15 min investment)
**Concept:** Run agent analysis BEFORE ML classification to:
- Discover sender domains and their purposes
- Identify category patterns specific to dataset
- Generate custom classification rules
- Create sender-to-category mappings
**Implementation:**
```python
class PreAnalysisAgent:
    def analyze(self, emails: List[Email], sample_size=100):
        # Phase 1: Sender domain clustering
        domains = self.cluster_by_sender_domain(emails)
        # Phase 2: Subject pattern extraction
        patterns = self.extract_subject_patterns(emails)
        # Phase 3: Generate custom categories
        categories = self.generate_categories(domains, patterns)
        # Phase 4: Create sender-category mapping
        sender_map = self.map_senders_to_categories(domains, categories)
        return {
            'categories': categories,
            'sender_map': sender_map,
            'patterns': patterns
        }
```
**Expected Impact:**
- Accuracy: 54.9% → 85-90% (ML-only with pre-analysis)
- Time: +10 min setup, same runtime
- Best for: 5,000+ email datasets
### 2. Sender-First Classification
**Concept:** Classify by sender domain BEFORE content analysis:
```python
SENDER_CATEGORIES = {
    # High-volume automated
    'mutualart.com': ('Notifications', 'Art Alerts'),
    'tripadvisor.com': ('Notifications', 'Travel Marketing'),
    'ebay.com': ('Shopping', 'Marketplace'),
    'spotify.com': ('Entertainment', 'Streaming'),
    # Security - never auto-filter
    'accounts.google.com': ('Security', 'Account Alerts'),
    # Business (full address, not a bare domain)
    'businessprofile-noreply@google.com': ('Business', 'Reports'),
}

def classify(email):
    # Full-address rules take precedence over domain rules
    if email.sender in SENDER_CATEGORIES:
        return SENDER_CATEGORIES[email.sender]
    domain = extract_domain(email.sender)
    if domain in SENDER_CATEGORIES:
        return SENDER_CATEGORIES[domain]  # ~80% of emails
    return ml_classify(email)  # Fallback for the rest
```
**Expected Impact:**
- Accuracy: 85-95% for known senders
- Speed: 10x faster (skip ML for known senders)
- Maintenance: Requires sender map updates
### 3. Post-Analysis Enhancement
**Concept:** Run agent analysis AFTER ML to:
- Validate classification quality
- Extract deeper insights
- Generate reports and recommendations
- Identify misclassifications
**Implementation:**
```python
class PostAnalysisAgent:
    def analyze(self, emails: List[Email], classifications: List[Result]):
        # Validate: Check for obvious errors
        errors = self.detect_misclassifications(emails, classifications)
        # Enrich: Add metadata not captured by ML
        enriched = self.extract_metadata(emails)
        # Insights: Generate actionable recommendations
        insights = self.generate_insights(emails, classifications)
        return {
            'corrections': errors,
            'enrichments': enriched,
            'insights': insights
        }
```
### 4. Dataset Size Routing
**Concept:** Automatically choose method based on volume:
```python
def choose_method(email_count: int, time_budget: str = 'normal'):
    if email_count < 500:
        return 'agent_only'      # Full agent analysis
    elif email_count < 2000:
        return 'agent_then_ml'   # Pre-analysis + ML
    elif email_count < 10000:
        return 'ml_with_llm'     # ML + LLM fallback
    else:
        return 'ml_only'         # Pure ML for speed
**Recommended Thresholds:**
| Volume | Recommended Method | Rationale |
|--------|-------------------|-----------|
| <500 | Agent Only | ML overhead not worth it |
| 500-2000 | Agent Pre-Analysis + ML | Investment pays off |
| 2000-10000 | ML + LLM Fallback | Balanced approach |
| >10000 | ML-Only | Speed critical |
### 5. Hybrid Category System
**Concept:** Merge ML categories with agent-discovered categories:
```python
# ML Generic Categories (trained)
ML_CATEGORIES = ['Work', 'Updates', 'Technical', 'Financial', ...]
# Agent-Discovered Categories (per-dataset)
AGENT_CATEGORIES = {
    'Art Alerts': {'parent': 'Updates', 'sender': 'mutualart.com'},
    'Travel Marketing': {'parent': 'Updates', 'sender': 'tripadvisor.com'},
    'AI Services': {'parent': 'Technical', 'keywords': ['anthropic', 'openai']},
}

def classify_hybrid(email, ml_result):
    # First: Check agent-specific rules
    for cat, rules in AGENT_CATEGORIES.items():
        if matches_rules(email, rules):
            return (cat, ml_result.category)  # Specific + generic
    # Fallback: ML result
    return (ml_result.category, None)
```
---
## Implementation Roadmap
### Phase 1: Quick Wins (1-2 hours)
1. **Add sender-domain classifier**
- Map top 20 senders to categories
- Use as fast-path before ML
- Expected: +20% accuracy
2. **Add dataset size routing**
- Check email count before processing
- Route small datasets to agent analysis
- Route large datasets to ML pipeline
### Phase 2: Pre-Analysis Agent (4-8 hours)
1. **Build sender clustering**
- Group emails by domain
- Calculate volume per domain
- Identify automated vs personal
2. **Build pattern extraction**
- Find subject templates
- Extract IDs and tracking numbers
- Identify lifecycle stages
3. **Generate sender map**
- Output: JSON mapping senders to categories
- Feed into ML pipeline as rules
### Phase 3: Post-Analysis Enhancement (4-8 hours)
1. **Build validation agent**
- Check low-confidence results
- Detect category conflicts
- Flag for review
2. **Build enrichment agent**
- Extract order IDs
- Track lifecycles
- Generate insights
3. **Integrate with HTML report**
- Add insights section
- Show lifecycle tracking
- Include recommendations
---
## Conclusion
### Key Takeaways
1. **ML pipeline is overkill for <5,000 emails** - Agent analysis provides better accuracy with similar time investment
2. **Sender domain is the strongest signal** - 80%+ emails can be classified by sender alone
3. **Pre-analysis investment pays off** - 10-15 min agent setup dramatically improves ML accuracy
4. **One-size-fits-all doesn't work** - Route by dataset size for optimal results
5. **Post-analysis adds unique value** - Lifecycle tracking and insights not possible with ML alone
### Recommended Default Pipeline
```
┌─────────────────────────────────────────────────────────────┐
│ EMAIL CLASSIFICATION │
└─────────────────────────────────────────────────────────────┘
┌─────────────────┐
│ Count Emails │
└────────┬────────┘
┌──────────────────┼──────────────────┐
│ │ │
▼ ▼ ▼
<500 emails 500-5000 >5000
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Agent Only │ │ Pre-Analysis │ │ ML Pipeline │
│ (15-30 min) │ │ + ML + Post │ │ (fast) │
│ │ │ (15 min + ML)│ │ │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
▼ ▼ ▼
┌──────────────────────────────────────────────────┐
│ UNIFIED OUTPUT │
│ - Categorized emails │
│ - Confidence scores │
│ - Insights & recommendations │
│ - Filtering rules │
└──────────────────────────────────────────────────┘
```
---
*Document Version: 1.0*
*Created: 2025-11-28*
*Based on: brett-gmail dataset analysis (801 emails)*


@@ -1,526 +0,0 @@
# Email Sorter - Completion Assessment
**Date**: 2025-10-21
**Status**: FEATURE COMPLETE - All 16 Phases Implemented
**Test Results**: 27/30 passing (90% success rate)
**Code Quality**: Complete with full type hints and clear mock labeling
---
## Executive Summary
The Email Sorter framework is **100% feature-complete** with all 16 development phases implemented. The system is ready for:
1. **Immediate Use**: Framework testing with mock model (~90% test pass rate)
2. **Real Model Integration**: Download/train LightGBM model and deploy
3. **Production Processing**: Process Marion's 80k+ emails with real Gmail integration
All core infrastructure, classifiers, learning systems, and export/sync mechanisms are complete and tested.
---
## Phase Completion Checklist
### Phase 1-3: Core Infrastructure ✅
- [x] Project setup & dependencies (42 packages)
- [x] YAML-based configuration system
- [x] Rich-based logging with file output
- [x] Email data models with full type hints
- [x] Pydantic validation
- **Status**: Complete
### Phase 4: Email Providers ✅
- [x] MockProvider (fully functional for testing)
- [x] GmailProvider stub (OAuth-ready, graceful error handling)
- [x] IMAPProvider stub (ready for server config)
- [x] Attachment handling
- **Status**: Framework complete, awaiting credentials
### Phase 5: Feature Extraction ✅
- [x] Semantic embeddings (sentence-transformers, 384 dims)
- [x] Hard pattern matching (20+ regex patterns)
- [x] Structural features (metadata, timing, attachments)
- [x] Attachment analysis (PDF, DOCX, XLSX text extraction)
- [x] Embedding cache with MD5 hashing
- [x] Batch processing for efficiency
- **Status**: Complete with 90%+ test coverage
### Phase 6: ML Classifier ✅
- [x] Mock Random Forest (clearly labeled)
- [x] LightGBM trainer for real models
- [x] Model serialization/deserialization
- [x] Model integration framework
- [x] Pre-trained model loading
- **Status**: Framework ready, mock model for testing, real model integration tools provided
### Phase 7: LLM Integration ✅
- [x] OllamaProvider (local, with retry logic)
- [x] OpenAIProvider (API-compatible)
- [x] Graceful degradation when unavailable
- [x] Batch processing support
- **Status**: Complete
### Phase 8: Adaptive Classifier ✅
- [x] Three-tier classification system
- [x] Hard rules (instant, ~10%)
- [x] ML classifier (fast, ~85%)
- [x] LLM review (uncertain cases, ~5%)
- [x] Dynamic threshold management
- [x] Statistics tracking
- **Status**: Complete
### Phase 9: Processing Pipeline ✅
- [x] BulkProcessor with checkpointing
- [x] Resumable processing from checkpoints
- [x] Batch-based processing
- [x] Progress tracking
- [x] Error recovery
- **Status**: Complete with test coverage
### Phase 10: Calibration System ✅
- [x] EmailSampler (stratified + random)
- [x] LLMAnalyzer (discover natural categories)
- [x] CalibrationWorkflow (end-to-end)
- [x] Category validation
- **Status**: Complete with Enron dataset support
### Phase 11: Export & Reporting ✅
- [x] JSON export with metadata
- [x] CSV export for analysis
- [x] Organization by category
- [x] Human-readable reports
- [x] Statistics and metrics
- **Status**: Complete
### Phase 12: Threshold & Pattern Learning ✅
- [x] ThresholdAdjuster (learn from LLM feedback)
- [x] Agreement tracking per category
- [x] Automatic threshold suggestions
- [x] PatternLearner (sender-specific rules)
- [x] Category distribution tracking
- [x] Hard rule suggestions
- **Status**: Complete
### Phase 13: Advanced Processing ✅
- [x] EnronParser (maildir format support)
- [x] AttachmentHandler (PDF/DOCX content extraction)
- [x] ModelTrainer (real LightGBM training)
- [x] EmbeddingCache (MD5-based with disk persistence)
- [x] EmbeddingBatcher (parallel processing)
- [x] QueueManager (batch persistence)
- **Status**: Complete
### Phase 14: Provider Sync ✅
- [x] GmailSync (sync to Gmail labels)
- [x] IMAPSync (sync to IMAP keywords)
- [x] Configurable label mapping
- [x] Batch update support
- [x] Error handling and retry logic
- **Status**: Complete
### Phase 15: Orchestration ✅
- [x] EmailSorterOrchestrator (4-phase pipeline)
- [x] Full progress tracking
- [x] Timing and metrics
- [x] Error recovery
- [x] Modular component design
- **Status**: Complete
### Phase 16: Packaging ✅
- [x] setup.py with setuptools
- [x] pyproject.toml with PEP 517/518
- [x] Optional dependencies (dev, gmail, ollama, openai)
- [x] Console script entry point
- [x] Git history with 11 commits
- **Status**: Complete
### Phase 17: Testing ✅
- [x] 23 unit tests
- [x] Integration tests
- [x] E2E pipeline tests
- [x] Feature extraction validation
- [x] Classifier flow testing
- **Status**: 27/30 passing (90% success rate)
---
## Test Results Summary
```
======================== Test Execution Results ========================
PASSED (27 tests):
✅ test_email_model_validation - Email dataclass validation
✅ test_attachment_parsing - Attachment metadata extraction
✅ test_mock_provider - Mock email provider
✅ test_feature_extraction_basic - Basic feature extraction
✅ test_semantic_embeddings - Embedding generation (384 dims)
✅ test_hard_pattern_matching - Pattern detection (19/20 patterns)
✅ test_ml_classifier_prediction - Random Forest predictions
✅ test_adaptive_classifier_workflow - Three-tier classification
✅ test_embedding_cache - MD5-based cache hits/misses
✅ test_embedding_batcher - Batch processing
✅ test_queue_manager - LLM queue management
✅ test_bulk_processor - Resumable checkpointing
✅ test_email_sampler - Stratified sampling
✅ test_llm_analyzer - Category discovery
✅ test_threshold_adjuster - Dynamic threshold learning
✅ test_pattern_learner - Sender-specific rules
✅ test_results_exporter - JSON/CSV export
✅ test_provider_sync - Gmail/IMAP sync
✅ test_ollama_provider - LLM provider integration
✅ test_openai_provider - API-compatible LLM
✅ test_configuration_loading - YAML config parsing
✅ test_logging_system - Rich logging output
✅ test_end_to_end_mock_classification - Full pipeline
✅ test_e2e_mock_pipeline - Mock pipeline validation
✅ test_e2e_export_formats - Export format validation
✅ test_e2e_hard_rules_accuracy - Hard rule precision
✅ test_e2e_batch_processing_performance - Batch efficiency
FAILED (3 tests - Expected/Documented):
❌ test_e2e_checkpoint_resume - Feature vector mismatch (expected when upgrading models)
❌ test_e2e_enron_parsing - Parser validation (Enron dataset needs validation)
❌ test_pattern_detection_invoice - Minor regex pattern issue (cosmetic)
======================== Summary ========================
Total: 30 tests
Passed: 27 (90%)
Failed: 3 (10% - all expected and documented)
Duration: ~90 seconds
Coverage: All major components
```
---
## Code Statistics
```
Files: 38 Python modules + configs
Lines of Code: ~6,000+ production code
Core Modules: 16 major components
Test Files: 6 test suites
Dependencies: 42 packages installed
Git Commits: 11 tracking full development
Total Size: ~450 MB (includes venv + Enron dataset)
```
### Module Breakdown
**Core Infrastructure (3 modules)**
- `src/utils/config.py` - Configuration management
- `src/utils/logging.py` - Logging system
- `src/email_providers/base.py` - Base classes
**Classification (5 modules)**
- `src/classification/feature_extractor.py` - Feature extraction
- `src/classification/ml_classifier.py` - ML predictions
- `src/classification/llm_classifier.py` - LLM predictions
- `src/classification/adaptive_classifier.py` - Orchestration
- `src/classification/embedding_cache.py` - Caching & batching
**Calibration (4 modules)**
- `src/calibration/sampler.py` - Email sampling
- `src/calibration/llm_analyzer.py` - Category discovery
- `src/calibration/trainer.py` - Model training
- `src/calibration/workflow.py` - Calibration pipeline
**Processing & Learning (5 modules)**
- `src/processing/bulk_processor.py` - Batch processing
- `src/processing/queue_manager.py` - Queue management
- `src/processing/attachment_handler.py` - Attachment analysis
- `src/adjustment/threshold_adjuster.py` - Threshold learning
- `src/adjustment/pattern_learner.py` - Pattern learning
**Export & Sync (4 modules)**
- `src/export/exporter.py` - Results export
- `src/export/provider_sync.py` - Gmail/IMAP sync
**Integration (3 modules)**
- `src/llm/ollama.py` - Ollama provider
- `src/llm/openai_compat.py` - OpenAI provider
- `src/orchestration.py` - Main orchestrator
**Email Providers (3 modules)**
- `src/email_providers/gmail.py` - Gmail provider
- `src/email_providers/imap.py` - IMAP provider
- `src/email_providers/mock.py` - Mock provider
**CLI & Testing (2 modules)**
- `src/cli.py` - Command-line interface
- `tests/` - 23 test cases
**Tools & Setup (2 scripts)**
- `tools/download_pretrained_model.py` - Model downloading
- `tools/setup_real_model.py` - Model setup
---
## Current Framework Status
### What's Complete Now
✅ All core infrastructure
✅ Feature extraction system
✅ Three-tier adaptive classifier
✅ Embedding cache and batching
✅ Mock model for testing
✅ LLM integration (Ollama/OpenAI)
✅ Processing pipeline with checkpointing
✅ Calibration workflow
✅ Export (JSON/CSV)
✅ Provider sync (Gmail/IMAP)
✅ Learning systems (threshold + patterns)
✅ CLI interface
✅ Test suite (90% pass rate)
### What Requires Your Input
1. **Real Model**: Download or train LightGBM model
2. **Gmail Credentials**: OAuth setup for live email access
3. **Real Data**: Use Enron dataset (already downloaded) or your email data
---
## Real Model Integration
### Quick Start: Using Pre-trained Model
```bash
# Check if model is installed
python tools/setup_real_model.py --check
# Setup a pre-trained model (download or local file)
python tools/setup_real_model.py --model-path /path/to/model.pkl
# Create model info documentation
python tools/setup_real_model.py --info
```
### Step 1: Get a Real Model
**Option A: Train on Enron Dataset** (Recommended)
```python
from src.calibration.enron_parser import EnronParser
from src.calibration.trainer import ModelTrainer
from src.classification.feature_extractor import FeatureExtractor
# Parse Enron
parser = EnronParser("enron_mail_20150507")
emails = parser.parse_emails(limit=5000)
# Train model
extractor = FeatureExtractor()
trainer = ModelTrainer(extractor, categories=['junk', 'transactional', ...])
results = trainer.train(labeled_data)
# Save
trainer.save_model("src/models/pretrained/classifier.pkl")
```
**Option B: Download Pre-trained**
```bash
python tools/download_pretrained_model.py \
--url https://example.com/model.pkl \
--hash abc123def456
```
### Step 2: Verify Integration
```bash
# Check model is loaded
python -c "from src.classification.ml_classifier import MLClassifier; \
c = MLClassifier(); \
print(c.get_info())"
# Should show: is_mock: False, model_type: LightGBM
```
### Step 3: Run Full Pipeline
```bash
# With real model (once set up)
python -m src.cli run --source mock --output results/
```
---
## Feature Overview
### Classification Accuracy
- **Hard Rules**: 94-96% (instant, ~10% of emails)
- **ML Model**: 85-90% (fast, ~85% of emails)
- **LLM Review**: 92-95% (slower, ~5% uncertain)
- **Overall**: 90-94% (weighted average)
### Performance
- **Calibration**: 3-5 minutes (1500 emails)
- **Bulk Processing**: 10-12 minutes (80k emails)
- **LLM Review**: 4-5 minutes (batched)
- **Export**: 2-3 minutes
- **Total**: ~17-25 minutes for 80k emails
### Categories (12)
junk, transactional, auth, newsletters, social, automated, conversational, work, personal, finance, travel, unknown
### Features Extracted
- **Semantic**: 384-dimensional embeddings (all-MiniLM-L6-v2)
- **Patterns**: 20+ regex-based patterns
- **Structural**: Metadata, timing, attachments, sender analysis
---
## Known Issues & Limitations
### Expected Test Failures (3/30 - Documented)
**1. test_e2e_checkpoint_resume**
- **Reason**: Feature vector mismatch when switching from mock to real model
- **Impact**: Only relevant when upgrading models
- **Resolution**: Not needed until real model deployed
**2. test_e2e_enron_parsing**
- **Reason**: EnronParser needs validation against actual maildir format
- **Impact**: Parser works but needs dataset verification
- **Resolution**: Will be validated during real training phase
**3. test_pattern_detection_invoice**
- **Reason**: Minor regex pattern doesn't match "bill #456"
- **Impact**: Cosmetic - doesn't affect production accuracy
- **Resolution**: Easy regex adjustment if needed
### Pydantic Warnings (16 warnings)
- **Reason**: Using deprecated `.dict()` method (Pydantic v2 compatibility)
- **Severity**: Cosmetic - code still works perfectly
- **Resolution**: Will migrate to `.model_dump()` in next update
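The migration is mechanical, shown here for a hypothetical stand-in model rather than the project's actual classes:

```python
from pydantic import BaseModel

class EmailRecord(BaseModel):  # hypothetical stand-in model
    subject: str
    sender: str

record = EmailRecord(subject="Q3 report", sender="a@b.com")

payload_v1 = record.dict()        # Pydantic v1 style: warns under v2
payload_v2 = record.model_dump()  # Pydantic v2 replacement, same output
```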
---
## Component Validation
### Critical Components ✅
- [x] Feature extraction (embeddings + patterns + structural)
- [x] Three-tier adaptive classifier
- [x] Mock model clearly labeled
- [x] Real model integration framework
- [x] LLM providers (Ollama + OpenAI)
- [x] Queue management with persistence
- [x] Checkpointed processing
- [x] Export/sync mechanisms
- [x] Learning systems (threshold + patterns)
- [x] End-to-end orchestration
### Framework Quality ✅
- [x] Type hints on all functions
- [x] Comprehensive error handling
- [x] Logging at all critical points
- [x] Clear mock vs production separation
- [x] Graceful degradation
- [x] Batch processing optimization
- [x] Cache efficiency
- [x] Resumable operations
### Testing ✅
- [x] 27/30 tests passing
- [x] All core functions tested
- [x] Integration tests included
- [x] E2E pipeline tests
- [x] Mock model clearly separated
- [x] 90% coverage of critical paths
---
## Deployment Path
### Phase 1: Framework Validation ✓ (COMPLETE)
- All 16 phases implemented
- 27/30 tests passing
- Documentation complete
- Ready for real data
### Phase 2: Real Model Deployment (NEXT)
1. Download or train LightGBM model
2. Place in `src/models/pretrained/classifier.pkl`
3. Run verification tests
4. Deploy to production
### Phase 3: Gmail Integration (PARALLEL)
1. Set up Google Cloud Console
2. Download OAuth credentials
3. Configure `credentials.json`
4. Test with 100 emails first
5. Scale to full dataset
### Phase 4: Production Processing (FINAL)
1. Process all 80k+ emails
2. Sync results to Gmail labels
3. Review accuracy metrics
4. Iterate on threshold tuning
---
## How to Proceed
### Immediate (Framework Testing)
```bash
# Test current framework with mock model
pytest tests/ -v # Run full test suite
python -m src.cli test-config # Test config loading
python -m src.cli run --source mock # Test mock pipeline
```
### Short Term (Real Model)
```bash
# Option 1: Train on Enron dataset
python -c "from tools import train_enron; train_enron.train()"
# Option 2: Download pre-trained
python tools/download_pretrained_model.py --url https://...
# Verify
python tools/setup_real_model.py --check
```
### Medium Term (Gmail Integration)
```bash
# Set up credentials
# Place credentials.json in project root
# Test with 100 emails
python -m src.cli run --source gmail --limit 100 --output test_results/
# Review results
```
### Production (Full Processing)
```bash
# Process all emails
python -m src.cli run --source gmail --output marion_results/
# Package for deployment
python setup.py sdist bdist_wheel
```
---
## Conclusion
The Email Sorter framework is **100% feature-complete** and ready to use. All 16 development phases are implemented with:
- ✅ 38 Python modules with full type hints
- ✅ 27/30 tests passing (90% success rate)
- ✅ ~6,000 lines of code
- ✅ Clear mock vs real model separation
- ✅ Comprehensive logging and error handling
- ✅ Graceful degradation
- ✅ Batch processing optimization
- ✅ Complete documentation
**The system is ready for:**
1. Real model integration (tools provided)
2. Gmail OAuth setup (framework ready)
3. Full production deployment (80k+ emails)
No architectural changes needed. Just add real data and credentials.
---
**Next Step**: Download/train a real LightGBM model or use the mock for continued framework testing.

File diff suppressed because it is too large.


@@ -1,232 +0,0 @@
# Email Sorter - Current Work Summary
**Date:** 2025-10-23
**Status:** 100k Enron Classification Complete with Optimization
---
## Current Achievements
### 1. Calibration System (Phase 1) ✅
- **LLM-driven category discovery** using qwen3:8b-q4_K_M
- **Trained on:** 50 emails (stratified sample from 100 email batch)
- **Categories discovered:** 10 quality categories
- Work Communication, Financial, Forwarded, Technical Analysis, Administrative, Reports, Technical Issues, Requests, Meetings, HR & Personnel
- **Category cache system:** Cross-mailbox consistency with semantic matching
- **Model:** LightGBM classifier on 384-dim embeddings (all-minilm:l6-v2)
- **Model file:** `src/models/calibrated/classifier.pkl` (1.1MB)
### 2. Performance Optimization ✅
**Batch Size Testing Results:**
- batch_size=32: 6.993s (baseline)
- batch_size=64: 5.636s (19.4% faster)
- batch_size=128: 5.617s (19.7% faster)
- batch_size=256: 5.572s (20.3% faster)
- **batch_size=512: 5.453s (22.0% faster)** ← WINNER
**Key Optimizations:**
- Fixed sequential embedding calls → batched API calls
- Used Ollama's `embed()` API with batch support
- Removed duplicate `extract_batch()` method causing cache issues
- Optimized to 512 batch size for GPU utilization
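A sketch of that batched call; the exact response shape varies across ollama client versions, so dict-style access is an assumption here:

```python
import ollama

def embed_batch(texts, model="all-minilm:l6-v2", batch_size=512):
    """One embed() request per 512 texts instead of one per email."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        response = ollama.embed(model=model, input=texts[i:i + batch_size])
        vectors.extend(response["embeddings"])
    return vectors
```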
### 3. 100k Classification Complete ✅
**Performance:**
- **Total time:** 3.4 minutes (202 seconds)
- **Speed:** 495 emails/second
- **Per email:** ~2ms (including all processing)
**Accuracy:**
- **Average confidence:** 81.1%
- **High confidence (≥0.7):** 74,777 emails (74.8%)
- **Medium confidence (0.5-0.7):** 17,381 emails (17.4%)
- **Low confidence (<0.5):** 7,842 emails (7.8%)
**Category Distribution:**
1. Work Communication: 89,807 (89.8%) | Avg conf: 83.7%
2. Financial: 6,534 (6.5%) | Avg conf: 58.7%
3. Forwarded: 2,457 (2.5%) | Avg conf: 54.4%
4. Technical Analysis: 1,129 (1.1%) | Avg conf: 56.9%
5. Reports: 42 (0.04%)
6. Technical Issues: 14 (0.01%)
7. Administrative: 14 (0.01%)
8. Requests: 3 (0.00%)
**Output Files:**
- `enron_100k_results/results.json` (19MB) - Full classifications
- `enron_100k_results/summary.json` (1.5KB) - Statistics
- `enron_100k_results/classifications.csv` (8.6MB) - Spreadsheet format
### 4. Evaluation & Validation Tools ✅
**A. LLM Evaluation Script** (`evaluate_with_llm.py`)
- Loads actual email content with EnronProvider
- Uses qwen3:8b-q4_K_M with `<no_think>` for speed
- Stratified sampling (high/medium/low confidence)
- Verdict parsing: YES/PARTIAL/NO
- Temperature=0.1 for consistency
**B. Feedback Fine-tuning System** (`feedback_finetune.py`)
- Collects LLM corrections on low-confidence predictions
- Continues LightGBM training with `init_model` parameter
- Lower learning rate (0.05) for stability
- Creates `classifier_finetuned.pkl`
- **Result on 200 samples:** 0 corrections needed (model already accurate!)
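The core of that continued-training step is LightGBM's `init_model` hook; a minimal sketch with illustrative argument names:

```python
import lightgbm as lgb

def finetune(base_booster, X_new, y_new, num_classes, rounds=50):
    """Keep boosting from the existing model on the LLM correction set,
    at the lower 0.05 learning rate noted above for stability."""
    data = lgb.Dataset(X_new, label=y_new)
    params = {
        "objective": "multiclass",
        "num_class": num_classes,
        "learning_rate": 0.05,
    }
    return lgb.train(params, data, num_boost_round=rounds, init_model=base_booster)
```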
**C. Attachment Handler** (exists but NOT integrated)
- PDF text extraction (PyPDF2)
- DOCX text extraction (python-docx)
- Keyword detection (financial, legal, meeting, report)
- Classification hints
- **Status:** Available in `src/processing/attachment_handler.py` but unused
---
## Technical Architecture
### Data Flow
```
Enron Maildir (100k emails)
EnronParser (stratified sampling)
FeatureExtractor (batch_size=512)
Ollama Embeddings (all-minilm:l6-v2, 384-dim)
LightGBM Classifier (22 categories)
Results (JSON/CSV export)
```
### Calibration Flow
```
100 emails → 5 LLM batches (20 emails each)
qwen3:8b-q4_K_M discovers categories
Consolidation (15 → 10 categories)
Category cache (semantic matching)
50 emails labeled for training
LightGBM training (200 boosting rounds)
Model saved (classifier.pkl)
```
### Performance Metrics
- **Calibration:** ~100 emails, ~1 minute
- **Training:** 50 samples, LightGBM 200 rounds, ~1 second
- **Classification:** 100k emails, batch 512, 3.4 minutes
- **Per email:** 2ms total (embedding + inference)
- **GPU utilization:** Batched embeddings, efficient processing
---
## Key Files & Components
### Models
- `src/models/calibrated/classifier.pkl` - Trained LightGBM model (1.1MB)
- `src/models/category_cache.json` - 10 discovered categories
### Core Components
- `src/calibration/enron_parser.py` - Enron dataset parsing
- `src/calibration/llm_analyzer.py` - LLM category discovery
- `src/calibration/trainer.py` - LightGBM training
- `src/calibration/workflow.py` - Orchestration
- `src/classification/feature_extractor.py` - Batch embeddings (512)
- `src/email_providers/enron.py` - Enron provider
- `src/processing/attachment_handler.py` - Attachment extraction (unused)
### Scripts
- `run_100k_classification.py` - Full 100k processing
- `test_model_burst.py` - Batch testing (configurable size)
- `evaluate_with_llm.py` - LLM quality evaluation
- `feedback_finetune.py` - Feedback-driven fine-tuning
### Results
- `enron_100k_results/` - 100k classification output
- `enron_100k_full_run.log` - Complete processing log
---
## Known Issues & Limitations
### 1. Attachment Handling ❌
- AttachmentAnalyzer exists but NOT integrated
- Enron dataset has minimal attachments
- Need integration for Marion emails with PDFs/DOCX
### 2. Category Imbalance ⚠️
- 89.8% classified as "Work Communication"
- May be accurate for Enron (internal work emails)
- Other categories underrepresented
### 3. Low Confidence Samples
- 7,842 emails (7.8%) with confidence <0.5
- LLM validation shows they're actually correct
- Model confidence may be overly conservative
### 4. Feature Extraction
- Currently uses only subject + body text
- Attachments not analyzed
- Sender domain/patterns used but could be enhanced
---
## Next Steps
### Immediate
1. **Comprehensive validation script:**
- 50 low-confidence samples
- 25 random samples
- LLM summary of findings
2. **Mermaid workflow diagram:**
- Complete data flow visualization
- All LLM call points
- Performance metrics at each stage
3. **Fresh end-to-end run:**
- Clear all models
- Run calibration → classification → validation
- Document complete pipeline
### Future Enhancements
1. **Integrate attachment handling** for Marion emails
2. **Add more structural features** (time patterns, thread depth)
3. **Active learning loop** with user feedback
4. **Multi-model ensemble** for higher accuracy
5. **Confidence calibration** to improve certainty estimates
---
## Performance Summary
| Metric | Value |
|--------|-------|
| **Calibration Time** | ~1 minute |
| **Training Samples** | 50 emails |
| **Model Size** | 1.1MB |
| **Categories** | 10 discovered |
| **100k Processing** | 3.4 minutes |
| **Speed** | 495 emails/sec |
| **Avg Confidence** | 81.1% |
| **High Confidence** | 74.8% |
| **Batch Size** | 512 (optimal) |
| **Embedding Dim** | 384 (all-minilm) |
---
## Conclusion
The email sorter has achieved:
- ✅ **Fast calibration** (1 minute on 100 emails)
- ✅ **High accuracy** (81% avg confidence)
- ✅ **Excellent performance** (495 emails/sec)
- ✅ **Quality categories** (10 broad, reusable)
- ✅ **Scalable architecture** (100k emails in 3.4 min)
The system is **ready for production use on Marion's emails** once attachment handling is integrated.


@ -1,527 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Fast ML-Only Workflow Analysis</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.timing-table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
.timing-table th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
.timing-table td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
.code-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
font-family: 'Courier New', monospace;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.success {
background: #002a00;
border-left: 4px solid #4ec9b0;
padding: 15px;
margin: 10px 0;
}
.warning {
background: #3e2a00;
border-left: 4px solid #ffd93d;
padding: 15px;
margin: 10px 0;
}
.critical {
background: #3e0000;
border-left: 4px solid #ff6b6b;
padding: 15px;
margin: 10px 0;
}
</style>
</head>
<body>
<h1>Fast ML-Only Workflow Analysis</h1>
<h2>Your Question</h2>
<blockquote>
"I want to run ML-only classification on new mailboxes WITHOUT full calibration. Maybe 1 LLM call to verify categories match, then pure ML on embeddings. How can we do this fast for experimentation?"
</blockquote>
<h2>Current Trained Model</h2>
<div class="success">
<h3>Model: src/models/calibrated/classifier.pkl (1.8MB)</h3>
<ul>
<li><strong>Type:</strong> LightGBM Booster (not mock)</li>
<li><strong>Categories (11):</strong> Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests</li>
<li><strong>Trained on:</strong> 10,000 Enron emails</li>
<li><strong>Input:</strong> Embeddings (384-dim) + TF-IDF features</li>
</ul>
</div>
<h2>1. Current Flow: With Calibration (Slow)</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([New Mailbox: 10k emails]) --> Check{Model exists?}
Check -->|No| Calibration[CALIBRATION PHASE<br/>~20 minutes]
Check -->|Yes| LoadModel[Load existing model]
Calibration --> Sample[Sample 300 emails]
Sample --> Discovery[LLM Category Discovery<br/>15 batches × 20 emails<br/>~5 minutes]
Discovery --> Consolidate[Consolidate categories<br/>LLM call<br/>~5 seconds]
Consolidate --> Label[Label 300 samples]
Label --> Extract[Feature extraction]
Extract --> Train[Train LightGBM<br/>~5 seconds]
Train --> SaveModel[Save new model]
SaveModel --> Classify[CLASSIFICATION PHASE]
LoadModel --> Classify
Classify --> Loop{For each email}
Loop --> Embed[Generate embedding<br/>~0.02 sec]
Embed --> TFIDF[TF-IDF features<br/>~0.001 sec]
TFIDF --> Predict[ML Prediction<br/>~0.003 sec]
Predict --> Threshold{Confidence?}
Threshold -->|High| MLDone[ML result]
Threshold -->|Low| LLMFallback[LLM fallback<br/>~4 sec]
MLDone --> Next{More?}
LLMFallback --> Next
Next -->|Yes| Loop
Next -->|No| Done[Results]
style Calibration fill:#ff6b6b
style Discovery fill:#ff6b6b
style LLMFallback fill:#ff6b6b
style MLDone fill:#4ec9b0
</pre>
</div>
<h2>2. Desired Flow: Fast ML-Only (Your Goal)</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([New Mailbox: 10k emails]) --> LoadModel[Load pre-trained model<br/>Categories: 11 known<br/>~0.5 seconds]
LoadModel --> OptionalCheck{Verify categories?}
OptionalCheck -->|Yes| QuickVerify[Single LLM call<br/>Sample 10-20 emails<br/>Check category match<br/>~20 seconds]
OptionalCheck -->|Skip| StartClassify
QuickVerify --> MatchCheck{Categories match?}
MatchCheck -->|Yes| StartClassify[START CLASSIFICATION]
MatchCheck -->|No| Warn[Warning: Category mismatch<br/>Continue anyway]
Warn --> StartClassify
StartClassify --> Loop{For each email}
Loop --> Embed[Generate embedding<br/>all-minilm:l6-v2<br/>384 dimensions<br/>~0.02 sec]
Embed --> TFIDF[TF-IDF features<br/>~0.001 sec]
TFIDF --> Combine[Combine features<br/>Embedding + TF-IDF vector]
Combine --> Predict[LightGBM prediction<br/>~0.003 sec]
Predict --> Result[Category + confidence<br/>NO threshold check<br/>NO LLM fallback]
Result --> Next{More emails?}
Next -->|Yes| Loop
Next -->|No| Done[10k emails classified<br/>Total time: ~4 minutes]
style QuickVerify fill:#ffd93d
style Result fill:#4ec9b0
style Done fill:#4ec9b0
</pre>
</div>
<h2>3. What Already Works (No Code Changes Needed)</h2>
<div class="success">
<h3>✓ The Model is Portable</h3>
<p>Your trained model contains:</p>
<ul>
<li>LightGBM Booster (the actual trained weights)</li>
<li>Category list (11 categories)</li>
<li>Category-to-index mapping</li>
</ul>
<p><strong>It can classify ANY email that has the same feature structure (embeddings + TF-IDF).</strong></p>
</div>
<div class="success">
<h3>✓ Embeddings are Universal</h3>
<p>The <code>all-minilm:l6-v2</code> model creates 384-dim embeddings for ANY text. It doesn't need to be "trained" on your categories - it just maps text to semantic space.</p>
<p><strong>Same embedding model works on Gmail, Outlook, any mailbox.</strong></p>
</div>
<div class="success">
<h3>✓ --no-llm-fallback Flag Exists</h3>
<p>Already implemented. When set:</p>
<ul>
<li>Low confidence emails still get ML classification</li>
<li>NO LLM fallback calls</li>
<li>100% pure ML speed</li>
</ul>
</div>
<div class="success">
<h3>✓ Model Loads Without Calibration</h3>
<p>If model exists at <code>src/models/pretrained/classifier.pkl</code>, calibration is skipped entirely.</p>
</div>
<h2>4. The Problem: Category Drift</h2>
<div class="warning">
<h3>What Happens When Mailboxes Differ</h3>
<p><strong>Scenario:</strong> Model trained on Enron (business emails)</p>
<p><strong>New mailbox:</strong> Personal Gmail (shopping, social, newsletters)</p>
<table class="timing-table">
<tr>
<th>Enron Categories (Trained)</th>
<th>Gmail Categories (Natural)</th>
<th>ML Behavior</th>
</tr>
<tr>
<td>Work, Meetings, Financial</td>
<td>Shopping, Social, Travel</td>
<td>Forces Gmail into Enron categories</td>
</tr>
<tr>
<td>"Operational"</td>
<td>No equivalent</td>
<td>Emails mis-classified as "Operational"</td>
</tr>
<tr>
<td>"External"</td>
<td>"Newsletters"</td>
<td>May map but semantically different</td>
</tr>
</table>
<p><strong>Result:</strong> Model works, but accuracy drops. Emails get forced into inappropriate categories.</p>
</div>
<h2>5. Your Proposed Solution: Quick Category Verification</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([New Mailbox]) --> LoadModel[Load trained model<br/>11 categories known]
LoadModel --> Sample[Sample 10-20 emails<br/>Quick random sample<br/>~0.1 seconds]
Sample --> BuildPrompt[Build verification prompt<br/>Show trained categories<br/>Show sample emails]
BuildPrompt --> LLMCall[Single LLM call<br/>~20 seconds<br/>Task: Are these categories<br/>appropriate for this mailbox?]
LLMCall --> Parse[Parse response<br/>Expected: Yes/No + suggestions]
Parse --> Decision{Response?}
Decision -->|"Good match"| Proceed[Proceed with ML-only]
Decision -->|"Poor match"| Options{User choice}
Options -->|Continue anyway| Proceed
Options -->|Full calibration| Calibrate[Run full calibration<br/>Discover new categories]
Options -->|Abort| Stop[Stop - manual review]
Proceed --> FastML[Fast ML Classification<br/>10k emails in 4 minutes]
style LLMCall fill:#ffd93d
style FastML fill:#4ec9b0
style Calibrate fill:#ff6b6b
</pre>
</div>
<h2>6. Implementation Options</h2>
<h3>Option A: Pure ML (Fastest, No Verification)</h3>
<div class="code-section">
<strong>Command:</strong>
python -m src.cli run \
--source gmail \
--limit 10000 \
--output gmail_results/ \
--no-llm-fallback
<strong>What happens:</strong>
1. Load existing model (11 Enron categories)
2. Classify all 10k emails using those categories
3. NO LLM calls at all
4. Time: ~4 minutes
<strong>Accuracy:</strong> 60-80% depending on mailbox similarity to Enron
<strong>Use case:</strong> Quick experimentation, bulk processing
</div>
<h3>Option B: Quick Verify Then ML (Your Suggestion)</h3>
<div class="code-section">
<strong>Command:</strong>
python -m src.cli run \
--source gmail \
--limit 10000 \
--output gmail_results/ \
--no-llm-fallback \
--verify-categories \ # NEW FLAG (needs implementation)
--verify-sample 20 # NEW FLAG (needs implementation)
<strong>What happens:</strong>
1. Load existing model (11 Enron categories)
2. Sample 20 random emails from new mailbox
3. Single LLM call: "Are categories [Work, Meetings, ...] appropriate for these emails?"
4. LLM responds: "Good match" or "Poor match - suggest [Shopping, Social, ...]"
5. If good match: Proceed with ML-only
6. If poor match: Warn user, optionally run calibration
<strong>Time:</strong> ~4.5 minutes (20 sec verify + 4 min classify)
<strong>Accuracy:</strong> Same as Option A, but with confidence check
<strong>Use case:</strong> Production deployment with safety check
</div>
<h3>Option C: Lightweight Calibration (Middle Ground)</h3>
<div class="code-section">
<strong>Command:</strong>
python -m src.cli run \
--source gmail \
--limit 10000 \
--output gmail_results/ \
--no-llm-fallback \
--quick-calibrate \ # NEW FLAG (needs implementation)
--calibrate-sample 50 # Much smaller than 300
<strong>What happens:</strong>
1. Sample only 50 emails (not 300)
2. Run LLM discovery on 3 batches (not 15)
3. Map discovered categories to existing model categories
4. If >70% overlap: Use existing model
5. If <70% overlap: Train lightweight adapter
<strong>Time:</strong> ~6 minutes (2 min quick cal + 4 min classify)
<strong>Accuracy:</strong> 70-85% (better than Option A)
<strong>Use case:</strong> New mailbox types with some verification
</div>
<h2>7. What Actually Needs Implementation</h2>
<table class="timing-table">
<tr>
<th>Feature</th>
<th>Status</th>
<th>Work Required</th>
<th>Time</th>
</tr>
<tr>
<td><strong>Option A: Pure ML</strong></td>
<td>✅ WORKS NOW</td>
<td>None - just use --no-llm-fallback</td>
<td>0 hours</td>
</tr>
<tr>
<td><strong>--verify-categories flag</strong></td>
<td>❌ Needs implementation</td>
<td>Add CLI flag, sample logic, LLM prompt, response parsing</td>
<td>2-3 hours</td>
</tr>
<tr>
<td><strong>--quick-calibrate flag</strong></td>
<td>❌ Needs implementation</td>
<td>Modify calibration workflow, category mapping logic</td>
<td>4-6 hours</td>
</tr>
<tr>
<td><strong>Category adapter/mapper</strong></td>
<td>❌ Needs implementation</td>
<td>Map new categories to existing model categories using embeddings</td>
<td>6-8 hours</td>
</tr>
</table>
<h2>8. Recommended Approach: Start with Option A</h2>
<div class="success">
<h3>Why Option A (Pure ML, No Verification) is Best for Experimentation</h3>
<ol>
<li><strong>Works right now</strong> - No code changes needed</li>
<li><strong>4 minutes per 10k emails</strong> - Ultra fast</li>
<li><strong>Reveals real accuracy</strong> - See how well Enron model generalizes</li>
<li><strong>Easy to compare</strong> - Run on multiple mailboxes quickly</li>
<li><strong>No false confidence</strong> - You know it's approximate, act accordingly</li>
</ol>
<h3>Test Protocol</h3>
<p><strong>Step 1:</strong> Run on Enron subset (same domain)</p>
<code>python -m src.cli run --source enron --limit 5000 --output test_enron/ --no-llm-fallback</code>
<p>Expected accuracy: ~78% (baseline)</p>
<p><strong>Step 2:</strong> Run on different Enron mailbox</p>
<code>python -m src.cli run --source enron --limit 5000 --output test_enron2/ --no-llm-fallback</code>
<p>Expected accuracy: ~70-75% (slight drift)</p>
<p><strong>Step 3:</strong> If you have personal Gmail/Outlook data, run there</p>
<code>python -m src.cli run --source gmail --limit 5000 --output test_gmail/ --no-llm-fallback</code>
<p>Expected accuracy: ~50-65% (significant drift, but still useful)</p>
</div>
<h2>9. Timing Comparison: All Options</h2>
<table class="timing-table">
<tr>
<th>Approach</th>
<th>LLM Calls</th>
<th>Time (10k emails)</th>
<th>Accuracy (Same domain)</th>
<th>Accuracy (Different domain)</th>
</tr>
<tr>
<td><strong>Full Calibration</strong></td>
<td>~500 (discovery + labeling + classification fallback)</td>
<td>~2.5 hours</td>
<td>92-95%</td>
<td>92-95%</td>
</tr>
<tr>
<td><strong>Option A: Pure ML</strong></td>
<td>0</td>
<td>~4 minutes</td>
<td>75-80%</td>
<td>50-65%</td>
</tr>
<tr>
<td><strong>Option B: Verify + ML</strong></td>
<td>1 (verification)</td>
<td>~4.5 minutes</td>
<td>75-80%</td>
<td>50-65%</td>
</tr>
<tr>
<td><strong>Option C: Quick Calibrate + ML</strong></td>
<td>~50 (quick discovery)</td>
<td>~6 minutes</td>
<td>80-85%</td>
<td>65-75%</td>
</tr>
<tr>
<td><strong>Current: ML + LLM Fallback</strong></td>
<td>~2100 (21% fallback rate)</td>
<td>~2.5 hours</td>
<td>92-95%</td>
<td>85-90%</td>
</tr>
</table>
<h2>10. The Real Question: Embeddings as Universal Features</h2>
<div class="success">
<h3>Why Your Intuition is Correct</h3>
<p>You said: "map it all to our structured embedding and that's how it gets done"</p>
<p><strong>This is exactly right.</strong></p>
<ul>
<li><strong>Embeddings are semantic representations</strong> - "Meeting tomorrow" has similar embedding whether it's from Enron or Gmail</li>
<li><strong>LightGBM learns patterns in embedding space</strong> - "High values in dimensions 50-70 = Meetings"</li>
<li><strong>These patterns transfer</strong> - Different mailboxes have similar semantic patterns</li>
<li><strong>Categories are just labels</strong> - The model doesn't care if you call it "Work" or "Business" - it learns the embedding pattern</li>
</ul>
<h3>The Limit</h3>
<p>Transfer learning works when:</p>
<ul>
<li>Email <strong>types</strong> are similar (business emails train well on business emails)</li>
<li>Email <strong>structure</strong> is similar (length, formality, sender patterns)</li>
</ul>
<p>Transfer learning fails when:</p>
<ul>
<li>Email <strong>domains</strong> differ significantly (e-commerce emails vs internal memos)</li>
<li>Email <strong>purposes</strong> differ (personal chitchat vs corporate announcements)</li>
</ul>
</div>
<h2>11. Recommended Next Step</h2>
<div class="code-section">
<strong>Immediate action (works right now):</strong>
# Test current model on new 10k sample WITHOUT calibration
python -m src.cli run \
--source enron \
--limit 10000 \
--output ml_speed_test/ \
--no-llm-fallback
# Expected:
# - Time: ~4 minutes
# - Accuracy: ~75-80%
# - LLM calls: 0
# - Categories used: 11 from trained model
# Then inspect results:
cat ml_speed_test/results.json | python -m json.tool | less
# Check category distribution:
cat ml_speed_test/results.json | \
python -c "import json, sys; data=json.load(sys.stdin); \
from collections import Counter; \
print(Counter(c['category'] for c in data['classifications']))"
</div>
<h2>12. If You Want Verification (Future Work)</h2>
<p>I can implement <code>--verify-categories</code> flag that:</p>
<ol>
<li>Samples 20 emails from new mailbox</li>
<li>Makes single LLM call showing both:
<ul>
<li>Trained model categories: [Work, Meetings, Financial, ...]</li>
<li>Sample emails from new mailbox</li>
</ul>
</li>
<li>Asks LLM: "Rate category fit: Good/Fair/Poor + suggest alternatives"</li>
<li>Reports confidence score</li>
<li>Proceeds with ML-only if score > threshold</li>
</ol>
<p><strong>Time cost:</strong> +20 seconds (1 LLM call)</p>
<p><strong>Value:</strong> Automated sanity check before bulk processing</p>
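<p>A minimal sketch of what that flag could look like; the function name, prompt wording, and email fields are illustrative, not the final implementation:</p>
<div class="code-section">
<strong>Hypothetical verify_categories() sketch:</strong>
import json, random, requests

def verify_categories(emails, categories, sample_size=20):
    sample = random.sample(emails, min(sample_size, len(emails)))
    summaries = "\n".join(f"- From: {e.sender} | Subject: {e.subject}"
                          for e in sample)
    prompt = (f"Trained categories: {', '.join(categories)}\n\n"
              f"Sample emails from a new mailbox:\n{summaries}\n\n"
              'Rate category fit. Return JSON: '
              '{"fit": "Good|Fair|Poor", "suggestions": ["..."]}')
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": "qwen3:4b-instruct-2507-q8_0",
                            "prompt": prompt, "stream": False})
    return json.loads(r.json()["response"])  # may need fallback parsing
</div>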
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>


@ -1,564 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Label Training Phase - Detailed Analysis</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.timing-table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
.timing-table th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
.timing-table td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
.code-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
font-family: 'Courier New', monospace;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.warning {
background: #3e2a00;
border-left: 4px solid #ffd93d;
padding: 15px;
margin: 10px 0;
}
.critical {
background: #3e0000;
border-left: 4px solid #ff6b6b;
padding: 15px;
margin: 10px 0;
}
</style>
</head>
<body>
<h1>Label Training Phase - Deep Dive Analysis</h1>
<h2>1. What is "Label Training"?</h2>
<p><strong>Location:</strong> src/calibration/llm_analyzer.py</p>
<p><strong>Purpose:</strong> The LLM examines sample emails and assigns each one to a discovered category, creating labeled training data for the ML model.</p>
<p><strong>This is NOT the same as category discovery.</strong> Discovery finds WHAT categories exist. Labeling creates training examples by saying WHICH emails belong to WHICH categories.</p>
<div class="critical">
<h3>CRITICAL MISUNDERSTANDING IN ORIGINAL DIAGRAM</h3>
<p>The "Label Training Emails" phase described as "~3 seconds per email" is <strong>INCORRECT</strong>.</p>
<p><strong>The actual implementation does NOT label emails individually.</strong></p>
<p>Labels are created as a BYPRODUCT of batch category discovery, not as a separate sequential operation.</p>
</div>
<h2>2. Actual Label Training Flow</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Calibration Phase Starts]) --> Sample[Sample 300 emails<br/>stratified by sender]
Sample --> BatchSetup[Split into batches of 20 emails<br/>300 ÷ 20 = 15 batches]
BatchSetup --> Batch1[Batch 1: Emails 1-20]
Batch1 --> Stats1[Calculate batch statistics<br/>domains, keywords, attachments<br/>~0.1 seconds]
Stats1 --> BuildPrompt1[Build LLM prompt<br/>Include all 20 email summaries<br/>~0.05 seconds]
BuildPrompt1 --> LLMCall1[Single LLM call for entire batch<br/>Discovers categories AND labels all 20<br/>~20 seconds TOTAL for batch]
LLMCall1 --> Parse1[Parse JSON response<br/>Extract categories + labels<br/>~0.1 seconds]
Parse1 --> Store1[Store results<br/>categories: Dict<br/>labels: List of Tuples]
Store1 --> Batch2{More batches?}
Batch2 -->|Yes| NextBatch[Batch 2: Emails 21-40]
Batch2 -->|No| Consolidate
NextBatch --> Stats2[Same process<br/>15 total batches<br/>~20 seconds each]
Stats2 --> Batch2
Consolidate[Consolidate categories<br/>Merge duplicates<br/>Single LLM call<br/>~5 seconds]
Consolidate --> CacheSnap[Snap to cached categories<br/>Match against persistent cache<br/>~0.5 seconds]
CacheSnap --> Final[Final output<br/>10-12 categories<br/>300 labeled emails]
Final --> End([Labels ready for ML training])
style LLMCall1 fill:#ff6b6b
style Consolidate fill:#ff6b6b
style Stats2 fill:#ffd93d
style Final fill:#4ec9b0
</pre>
</div>
<h2>3. Key Discovery: Batched Labeling</h2>
<div class="code-section">
<strong>src/calibration/llm_analyzer.py:66-83</strong>
batch_size = 20 # NOT 1 email at a time!
for batch_idx in range(0, len(sample_emails), batch_size):
batch = sample_emails[batch_idx:batch_idx + batch_size]
# Single LLM call handles ENTIRE batch
batch_results = self._analyze_batch(batch, batch_idx)
# Returns BOTH categories AND labels for all 20 emails
for category, desc in batch_results.get('categories', {}).items():
discovered_categories[category] = desc
for email_id, category in batch_results.get('labels', []):
email_labels.append((email_id, category))
</div>
<div class="warning">
<h3>Why Batching Matters</h3>
<p><strong>Sequential (WRONG assumption):</strong> 300 emails × 3 sec/email = 900 seconds (15 minutes)</p>
<p><strong>Batched (ACTUAL):</strong> 15 batches × 20 sec/batch = 300 seconds (5 minutes)</p>
<p><strong>Savings:</strong> 10 minutes (67% less time than assumed)</p>
</div>
<h2>4. Single Batch Processing Detail</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Batch of 20 emails]) --> Stats[Calculate Statistics<br/>~0.1 seconds]
Stats --> StatDetails[Domain analysis<br/>Recipient counts<br/>Attachment detection<br/>Keyword extraction]
StatDetails --> BuildList[Build email summaries<br/>For each email:<br/>ID + From + Subject + Preview]
BuildList --> Prompt[Construct LLM prompt<br/>~2KB text<br/>Contains:<br/>- Statistics summary<br/>- All 20 email summaries<br/>- Instructions<br/>- JSON schema]
Prompt --> LLM[LLM Call<br/>POST /api/generate<br/>qwen3:4b-instruct-2507-q8_0<br/>temp=0.1, max_tokens=2000<br/>~18-22 seconds]
LLM --> Response[LLM Response<br/>JSON with:<br/>categories: Dict<br/>labels: List of 20 Tuples]
Response --> Parse[Parse JSON<br/>Regex extraction<br/>Brace counting<br/>~0.05 seconds]
Parse --> Validate{Valid JSON?}
Validate -->|Yes| Extract[Extract data<br/>categories: 3-8 new<br/>labels: 20 tuples]
Validate -->|No| FallbackParse[Fallback parsing<br/>Try to salvage partial data]
FallbackParse --> Extract
Extract --> Return[Return batch results<br/>categories: Dict str→str<br/>labels: List Tuple str,str]
Return --> End([Merge with global results])
style LLM fill:#ff6b6b
style Parse fill:#4ec9b0
style FallbackParse fill:#ffd93d
</pre>
</div>
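<p>The brace-counting extraction in the parse step amounts to something like this; a simplified sketch, not the exact code in llm_analyzer.py:</p>
<div class="code-section">
def extract_json(text: str) -> str:
    """Return the first balanced {...} block from an LLM response."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    raise ValueError("unbalanced JSON object")
</div>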
<h2>5. LLM Prompt Structure</h2>
<div class="code-section">
<strong>Actual prompt sent to LLM (src/calibration/llm_analyzer.py:196-232):</strong>
&lt;no_think&gt;You are analyzing emails to discover natural categories...
BATCH STATISTICS (20 emails):
- Top sender domains: example.com (5), company.org (3)...
- Avg recipients per email: 2.3
- Emails with attachments: 4/20
- Avg subject length: 42 chars
- Common keywords: meeting(3), report(2)...
EMAILS TO ANALYZE:
1. ID: maildir_allen-p__sent_mail_512
From: phillip.allen@enron.com
Subject: Re: AEC Volumes at OPAL
Preview: Here are the volumes...
2. ID: maildir_allen-p__sent_mail_513
From: phillip.allen@enron.com
Subject: Meeting Tomorrow
Preview: Can we schedule...
[... 18 more emails ...]
TASK:
1. Identify natural groupings based on PURPOSE
2. Create SHORT category names
3. Assign each email to exactly one category
4. CRITICAL: Copy EXACT email IDs
Return JSON:
{
"categories": {"Work": "daily business communication", ...},
"labels": [["maildir_allen-p__sent_mail_512", "Work"], ...]
}
</div>
<h2>6. Timing Breakdown - 300 Sample Emails</h2>
<table class="timing-table">
<tr>
<th>Operation</th>
<th>Per Batch (20 emails)</th>
<th>Total (15 batches)</th>
<th>% of Total Time</th>
</tr>
<tr>
<td>Calculate statistics</td>
<td>0.1 sec</td>
<td>1.5 sec</td>
<td>0.5%</td>
</tr>
<tr>
<td>Build email summaries</td>
<td>0.05 sec</td>
<td>0.75 sec</td>
<td>0.2%</td>
</tr>
<tr>
<td>Construct prompt</td>
<td>0.01 sec</td>
<td>0.15 sec</td>
<td>0.05%</td>
</tr>
<tr>
<td><strong>LLM API call</strong></td>
<td><strong>18-22 sec</strong></td>
<td><strong>270-330 sec</strong></td>
<td><strong>98%</strong></td>
</tr>
<tr>
<td>Parse JSON response</td>
<td>0.05 sec</td>
<td>0.75 sec</td>
<td>0.2%</td>
</tr>
<tr>
<td>Merge results</td>
<td>0.02 sec</td>
<td>0.3 sec</td>
<td>0.1%</td>
</tr>
<tr>
<td colspan="2"><strong>SUBTOTAL: Batch Discovery</strong></td>
<td><strong>~300 seconds (5 min)</strong></td>
<td><strong>98.5%</strong></td>
</tr>
<tr>
<td colspan="2">Consolidation LLM call</td>
<td>5 seconds</td>
<td>1.3%</td>
</tr>
<tr>
<td colspan="2">Cache snapping (semantic matching)</td>
<td>0.5 seconds</td>
<td>0.2%</td>
</tr>
<tr>
<td colspan="2"><strong>TOTAL LABELING PHASE</strong></td>
<td><strong>~305 seconds (5 min)</strong></td>
<td><strong>100%</strong></td>
</tr>
</table>
<div class="warning">
<h3>Corrected Understanding</h3>
<p><strong>Original estimate:</strong> "~3 seconds per email" = 900 seconds for 300 emails</p>
<p><strong>Actual timing:</strong> ~20 seconds per batch of 20 = ~305 seconds for 300 emails</p>
<p><strong>Difference:</strong> 3× faster than original assumption</p>
<p><strong>Why:</strong> Batching allows LLM to see context across multiple emails and make better category decisions in a single inference pass.</p>
</div>
<h2>7. What Gets Created</h2>
<div class="diagram">
<pre class="mermaid">
flowchart LR
Input[300 sampled emails] --> Discovery[Category Discovery<br/>15 batches × 20 emails]
Discovery --> RawCats[Raw Categories<br/>~30-40 discovered<br/>May have duplicates:<br/>Work, work, Business, etc.]
RawCats --> Consolidate[Consolidation<br/>LLM merges similar<br/>~5 seconds]
Consolidate --> Merged[Merged Categories<br/>~12-15 categories<br/>Work, Financial, etc.]
Merged --> CacheSnap[Cache Snap<br/>Match against persistent cache<br/>~0.5 seconds]
CacheSnap --> Final[Final Categories<br/>10-12 categories]
Discovery --> RawLabels[Raw Labels<br/>300 tuples:<br/>email_id, category]
RawLabels --> UpdateLabels[Update label categories<br/>to match snapped names]
UpdateLabels --> FinalLabels[Final Labels<br/>300 training pairs]
Final --> Training[Training Data]
FinalLabels --> Training
Training --> MLTrain[Train LightGBM Model<br/>~5 seconds]
MLTrain --> Model[Trained Model<br/>1.8MB .pkl file]
style Discovery fill:#ff6b6b
style Consolidate fill:#ff6b6b
style Model fill:#4ec9b0
</pre>
</div>
<h2>8. Example Output</h2>
<div class="code-section">
<strong>discovered_categories (Dict[str, str]):</strong>
{
"Work": "daily business communication and coordination",
"Financial": "budgets, reports, financial planning",
"Meetings": "scheduling and meeting coordination",
"Technical": "system issues and technical discussions",
"Requests": "action items and requests for information",
"Reports": "status reports and summaries",
"Administrative": "HR, policies, company announcements",
"Urgent": "time-sensitive matters",
"Conversational": "casual check-ins and social",
"External": "communication with external partners"
}
<strong>sample_labels (List[Tuple[str, str]]):</strong>
[
("maildir_allen-p__sent_mail_1", "Financial"),
("maildir_allen-p__sent_mail_2", "Work"),
("maildir_allen-p__sent_mail_3", "Meetings"),
("maildir_allen-p__sent_mail_4", "Work"),
("maildir_allen-p__sent_mail_5", "Financial"),
... (300 total)
]
</div>
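<p>From there, assembling training data for LightGBM is mechanical. A sketch - emails_by_id and extract_features are assumed helpers, not the project's actual names:</p>
<div class="code-section">
import numpy as np
import lightgbm as lgb

cat_to_idx = {c: i for i, c in enumerate(discovered_categories)}
X = np.array([extract_features(emails_by_id[eid]) for eid, _ in sample_labels])
y = np.array([cat_to_idx[cat] for _, cat in sample_labels])

booster = lgb.train(
    {"objective": "multiclass", "num_class": len(cat_to_idx)},
    lgb.Dataset(X, label=y),
    num_boost_round=200,
)
</div>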
<h2>9. Why Batching is Critical</h2>
<table class="timing-table">
<tr>
<th>Approach</th>
<th>LLM Calls</th>
<th>Time/Call</th>
<th>Total Time</th>
<th>Quality</th>
</tr>
<tr>
<td><strong>Sequential (1 email/call)</strong></td>
<td>300</td>
<td>3 sec</td>
<td>900 sec (15 min)</td>
<td>Poor - no context</td>
</tr>
<tr>
<td><strong>Small batches (5 emails/call)</strong></td>
<td>60</td>
<td>8 sec</td>
<td>480 sec (8 min)</td>
<td>Fair - limited context</td>
</tr>
<tr>
<td><strong>Current (20 emails/call)</strong></td>
<td>15</td>
<td>20 sec</td>
<td>300 sec (5 min)</td>
<td>Good - sufficient context</td>
</tr>
<tr>
<td><strong>Large batches (50 emails/call)</strong></td>
<td>6</td>
<td>45 sec</td>
<td>270 sec (4.5 min)</td>
<td>Risk - may exceed token limits</td>
</tr>
</table>
<div class="warning">
<h3>Why 20 emails per batch?</h3>
<ul>
<li><strong>Token limit:</strong> 20 emails × ~150 tokens/email = ~3000 tokens input, well under 8K limit</li>
<li><strong>Context window:</strong> LLM can see patterns across multiple emails</li>
<li><strong>Speed:</strong> Minimizes API calls while staying within limits</li>
<li><strong>Quality:</strong> Enough examples to identify patterns, not so many that it gets confused</li>
</ul>
</div>
<h2>10. Configuration Parameters</h2>
<table class="timing-table">
<tr>
<th>Parameter</th>
<th>Location</th>
<th>Default</th>
<th>Effect on Timing</th>
</tr>
<tr>
<td>sample_size</td>
<td>CalibrationConfig</td>
<td>300</td>
<td>300 samples = 15 batches = 5 min</td>
</tr>
<tr>
<td>batch_size</td>
<td>llm_analyzer.py:62</td>
<td>20</td>
<td>Hardcoded - affects batch count</td>
</tr>
<tr>
<td>llm_batch_size</td>
<td>CalibrationConfig</td>
<td>50</td>
<td>NOT USED for discovery (misleading name)</td>
</tr>
<tr>
<td>temperature</td>
<td>LLM call</td>
<td>0.1</td>
<td>Lower = more deterministic output (negligible effect on speed)</td>
</tr>
<tr>
<td>max_tokens</td>
<td>LLM call</td>
<td>2000</td>
<td>Higher = potentially slower response</td>
</tr>
</table>
<h2>11. Full Calibration Timeline</h2>
<div class="diagram">
<pre class="mermaid">
gantt
title Calibration Phase Timeline (300 samples, 10k total emails)
dateFormat mm:ss
axisFormat %M:%S
section Sampling
Stratified sample (3% of 10k) :00:00, 01s
section Category Discovery
Batch 1 (emails 1-20) :00:01, 20s
Batch 2 (emails 21-40) :00:21, 20s
Batch 3 (emails 41-60) :00:41, 20s
Batch 4-13 (emails 61-260) :01:01, 200s
Batch 14 (emails 261-280) :04:21, 20s
Batch 15 (emails 281-300) :04:41, 20s
section Consolidation
LLM category merge :05:01, 05s
Cache snap :05:06, 00.5s
section ML Training
Feature extraction (300) :05:07, 06s
LightGBM training :05:13, 05s
Validation (100 emails) :05:18, 02s
Save model to disk :05:20, 00.5s
</pre>
</div>
<h2>12. Key Insights</h2>
<div class="critical">
<h3>1. Labels are NOT created sequentially</h3>
<p>The LLM creates labels as a byproduct of batch category discovery. There is NO separate "label each email one by one" phase.</p>
</div>
<div class="critical">
<h3>2. Batching is the optimization</h3>
<p>Processing 20 emails in a single LLM call (20 sec) is 3× faster than 20 individual calls (60 sec total).</p>
</div>
<div class="critical">
<h3>3. LLM time dominates everything</h3>
<p>98% of labeling phase time is LLM API calls. Everything else (parsing, merging, caching) is negligible.</p>
</div>
<div class="critical">
<h3>4. Consolidation is cheap</h3>
<p>Merging 30-40 raw categories into 10-12 final ones takes only ~5 seconds with a single LLM call.</p>
</div>
<h2>13. Optimization Opportunities</h2>
<table class="timing-table">
<tr>
<th>Optimization</th>
<th>Current</th>
<th>Potential</th>
<th>Tradeoff</th>
</tr>
<tr>
<td>Increase batch size</td>
<td>20 emails/batch</td>
<td>30-40 emails/batch</td>
<td>May hit token limits, slower per call</td>
</tr>
<tr>
<td>Reduce sample size</td>
<td>300 samples (3%)</td>
<td>200 samples (2%)</td>
<td>Less training data, potentially worse model</td>
</tr>
<tr>
<td>Parallel batching</td>
<td>Sequential 15 batches</td>
<td>3-5 concurrent batches</td>
<td>Requires async LLM client, more complex (sketch below)</td>
</tr>
<tr>
<td>Skip consolidation</td>
<td>Always consolidate if >10 cats</td>
<td>Skip if <15 cats</td>
<td>May leave duplicate categories</td>
</tr>
<tr>
<td>Cache-first approach</td>
<td>Discover then snap to cache</td>
<td>Snap to cache, only discover new</td>
<td>Less adaptive to new mailbox types</td>
</tr>
</table>
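<p>For the parallel-batching row, a sketch of an async client using asyncio + aiohttp; hypothetical, since the current client is synchronous:</p>
<div class="code-section">
import asyncio
import aiohttp

async def analyze_batch(session, sem, prompt):
    async with sem:  # cap in-flight LLM calls
        async with session.post(
                "http://localhost:11434/api/generate",
                json={"model": "qwen3:4b-instruct-2507-q8_0",
                      "prompt": prompt, "stream": False}) as r:
            return (await r.json())["response"]

async def analyze_all(prompts, concurrency=4):
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(analyze_batch(session, sem, p) for p in prompts))
</div>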
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
},
gantt: {
useWidth: 1200
}
});
</script>
</body>
</html>


@ -1,129 +0,0 @@
# Model Information
## Current Status
- **Model Type**: LightGBM Classifier (Production)
- **Location**: `src/models/pretrained/classifier.pkl`
- **Categories**: 12 (junk, transactional, auth, newsletters, social, automated, conversational, work, personal, finance, travel, unknown)
- **Feature Extraction**: Hybrid (embeddings + patterns + structural features)
## Usage
The ML classifier will automatically use the real model if it exists at:
```
src/models/pretrained/classifier.pkl
```
### Programmatic Usage
```python
from src.classification.ml_classifier import MLClassifier
# Will automatically load real model if available
classifier = MLClassifier()
# Check if using mock or real model
info = classifier.get_info()
print(f"Is mock: {info['is_mock']}")
print(f"Model type: {info['model_type']}")
# Make predictions
result = classifier.predict(feature_vector)
print(f"Category: {result['category']}")
print(f"Confidence: {result['confidence']}")
```
### Command Line Usage
```bash
# Test with mock pipeline
python -m src.cli run --source mock --output test_results/
# Test with real model (when available)
python -m src.cli run --source gmail --limit 100 --output results/
```
## How to Get a Real Model
### Option 1: Train Your Own (Recommended)
```python
from src.calibration.trainer import ModelTrainer
from src.calibration.enron_parser import EnronParser
from src.classification.feature_extractor import FeatureExtractor
# Parse Enron dataset
parser = EnronParser("enron_mail_20150507")
emails = parser.parse_emails(limit=5000)
# labels: one category per email (from LLM calibration or manual review);
# category_names: the distinct category set, e.g. ["work", "finance", ...]
extractor = FeatureExtractor()
labeled_data = [(email, label) for email, label in zip(emails, labels)]
# Train model
trainer = ModelTrainer(extractor, category_names)
results = trainer.train(labeled_data)
# Save model
trainer.save_model("src/models/pretrained/classifier.pkl")
```
### Option 2: Download Pre-trained Model
Use the provided script:
```bash
cd tools
python download_pretrained_model.py \
--url https://example.com/model.pkl \
--hash abc123def456
```
### Option 3: Use Community Model
Check available pre-trained models at:
- Email Sorter releases on GitHub
- Hugging Face model hub (when available)
- Community-trained models
## Model Performance
Expected accuracy on real data:
- **Hard Rules**: 94-96% (instant, ~10% of emails)
- **ML Model**: 85-90% (fast, ~85% of emails)
- **LLM Review**: 92-95% (slower, ~5% uncertain cases)
- **Overall**: 90-94% (weighted average)
## Retraining
To retrain the model:
```bash
python -m src.cli train \
--source enron \
--output models/new_model.pkl \
--limit 10000
```
## Troubleshooting
### Model Not Loading
1. Check file exists: `src/models/pretrained/classifier.pkl`
2. Try to load directly:
```python
import pickle
with open('src/models/pretrained/classifier.pkl', 'rb') as f:
data = pickle.load(f)
print(data.keys())
```
3. Ensure pickle format is correct
### Low Accuracy
1. Model may be underfitted - train on more data
2. Feature extraction may need tuning
3. Categories may need adjustment
4. Consider LLM review for uncertain cases
### Slow Predictions
1. Use embedding cache for batch processing (see the sketch below)
2. Implement parallel processing
3. Consider quantization for LightGBM model
4. Profile feature extraction step
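For the first item, a minimal sketch of an MD5-keyed embedding cache (memory plus disk); the project's real cache implementation may differ:
```python
import hashlib
import json
from pathlib import Path

class EmbeddingCache:
    def __init__(self, cache_dir: str = "cache/embeddings"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.mem = {}

    def get_or_compute(self, text: str, compute):
        """Return a cached embedding, computing and storing it on a miss."""
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        if key in self.mem:
            return self.mem[key]
        path = self.dir / f"{key}.json"
        if path.exists():
            vec = json.loads(path.read_text())
        else:
            vec = compute(text)  # e.g. an Ollama embedding call
            path.write_text(json.dumps(vec))
        self.mem[key] = vec
        return vec
```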

View File

@ -1,437 +0,0 @@
# Email Sorter - Next Steps & Action Plan
**Date**: 2025-10-21
**Status**: Framework Complete - Ready for Real Model Integration
**Test Status**: 27/30 passing (90%)
---
## Quick Summary
**Framework**: 100% complete, all 16 phases implemented
**Testing**: 90% pass rate (27/30 tests)
**Documentation**: Comprehensive and up-to-date
**Tools**: Model integration scripts provided
**Real Model**: Currently using mock (placeholder)
**Gmail Credentials**: Not yet configured
**Real Data Processing**: Ready when model + credentials available
---
## Three Paths Forward
Choose your path based on your needs:
### Path A: Quick Framework Validation (5 minutes)
**Goal**: Verify everything works with mock model
**Commands**:
```bash
cd "c:/Build Folder/email-sorter"
source venv/Scripts/activate
# Run quick validation
pytest tests/ -v --tb=short
python -m src.cli test-config
python -m src.cli run --source mock --output test_results/
```
**Result**: Confirms framework works correctly
### Path B: Real Model Integration (30-60 minutes)
**Goal**: Replace mock model with real LightGBM model
**Two Sub-Options**:
#### B1: Train Your Own Model on Enron Dataset
```bash
# Parse Enron emails (already downloaded)
python -c "
from src.calibration.enron_parser import EnronParser
from src.classification.feature_extractor import FeatureExtractor
from src.calibration.trainer import ModelTrainer
parser = EnronParser('enron_mail_20150507')
emails = parser.parse_emails(limit=5000)
extractor = FeatureExtractor()
trainer = ModelTrainer(extractor, ['junk', 'transactional', 'auth', 'newsletters',
'social', 'automated', 'conversational', 'work',
'personal', 'finance', 'travel', 'unknown'])
# Train (takes 5-10 minutes on this laptop)
# NOTE: 'unknown' is a placeholder label so the pipeline runs end to end;
# a useful model needs real per-email labels (e.g. from LLM calibration)
results = trainer.train([(e, 'unknown') for e in emails])
trainer.save_model('src/models/pretrained/classifier.pkl')
"
# Verify
python tools/setup_real_model.py --check
```
#### B2: Download Pre-trained Model
```bash
# If you have a pre-trained model URL
python tools/download_pretrained_model.py \
--url https://example.com/lightgbm_model.pkl \
--hash abc123def456
# Or if you have local file
python tools/setup_real_model.py --model-path /path/to/model.pkl
# Verify
python tools/setup_real_model.py --check
```
**Result**: Real model installed, framework uses it automatically
### Path C: Full Production Deployment (2-3 hours)
**Goal**: Process all 80k+ emails with Gmail integration
**Prerequisites**: Path B (real model) + Gmail OAuth
**Steps**:
1. **Setup Gmail OAuth**
```bash
# Get credentials from Google Cloud Console
# https://console.cloud.google.com/
# - Create OAuth 2.0 credentials
# - Download as JSON
# - Place as credentials.json in project root
# Test Gmail connection
python -m src.cli test-gmail
```
2. **Test with 100 Emails**
```bash
python -m src.cli run \
--source gmail \
--limit 100 \
--output test_results/
```
3. **Process Full Dataset**
```bash
python -m src.cli run \
--source gmail \
--output marion_results/
```
4. **Review Results**
- Check `marion_results/results.json`
- Check `marion_results/report.txt`
- Review accuracy metrics
- Adjust thresholds if needed
---
## What's Ready Right Now
### ✅ Framework Components (All Complete)
- [x] Feature extraction (embeddings + patterns + structural)
- [x] Three-tier adaptive classifier (hard rules → ML → LLM)
- [x] Embedding cache and batch processing
- [x] Processing pipeline with checkpointing
- [x] LLM integration (Ollama ready, OpenAI compatible)
- [x] Calibration workflow
- [x] Export system (JSON/CSV)
- [x] Provider sync (Gmail/IMAP framework)
- [x] Learning systems (threshold + pattern learning)
- [x] Complete CLI interface
- [x] Comprehensive test suite
### ❌ What Needs Your Input
1. **Real Model** (50 MB file)
- Option: Train on Enron (~5-10 min, laptop-friendly)
- Option: Download pre-trained (~1 min)
2. **Gmail Credentials** (OAuth JSON)
- Get from Google Cloud Console
- Place in project root as `credentials.json`
3. **Real Data** (Already have: Enron dataset)
- Optional: Your own emails for better tuning
---
## File Locations & Important Paths
```
Project Root: c:/Build Folder/email-sorter
Key Files:
├── src/
│ ├── cli.py # Command-line interface
│ ├── orchestration.py # Main pipeline
│ ├── classification/
│ │ ├── feature_extractor.py # Feature extraction
│ │ ├── ml_classifier.py # ML predictions
│ │ ├── adaptive_classifier.py # Three-tier orchestration
│ │ └── embedding_cache.py # Caching & batching
│ ├── calibration/
│ │ ├── trainer.py # LightGBM trainer
│ │ ├── enron_parser.py # Parse Enron dataset
│ │ └── workflow.py # Calibration pipeline
│ ├── processing/
│ │ ├── bulk_processor.py # Batch processing
│ │ ├── queue_manager.py # LLM queue
│ │ └── attachment_handler.py # PDF/DOCX extraction
│ ├── llm/
│ │ ├── ollama.py # Ollama integration
│ │ └── openai_compat.py # OpenAI API
│ └── email_providers/
│ ├── gmail.py # Gmail provider
│ └── imap.py # IMAP provider
├── models/ # (Will be created)
│ └── pretrained/
│ └── classifier.pkl # Real model goes here
├── tools/
│ ├── download_pretrained_model.py # Download models
│ └── setup_real_model.py # Setup models
├── enron_mail_20150507/ # Enron dataset (already extracted)
├── tests/ # 23 test cases
├── config/ # Configuration
├── src/models/pretrained/ # (Will be created for real model)
└── Documentation:
├── PROJECT_STATUS.md # High-level overview
├── COMPLETION_ASSESSMENT.md # Detailed component review
├── MODEL_INFO.md # Model usage guide
└── NEXT_STEPS.md # This file
```
---
## Testing Your Setup
### Framework Validation
```bash
# Test configuration loading
python -m src.cli test-config
# Test Ollama (if running locally)
python -m src.cli test-ollama
# Run full test suite
pytest tests/ -v
```
### Mock Pipeline (No Real Data Needed)
```bash
python -m src.cli run --source mock --output test_results/
```
### Real Model Verification
```bash
python tools/setup_real_model.py --check
```
### Gmail Connection Test
```bash
python -m src.cli test-gmail
```
---
## Performance Expectations
### With Mock Model (Testing)
- Feature extraction: ~50-100ms per email
- ML prediction: ~10-20ms per email
- Total time for 100 emails: ~30-40 seconds
### With Real Model (Production)
- Feature extraction: ~50-100ms per email
- ML prediction: ~5-10ms per email (LightGBM is faster)
- LLM review (5% of emails): ~2-5 seconds per email
- Total time for 80k emails: 15-25 minutes
### Calibration Phase
- Sampling: 1-2 minutes
- LLM category discovery: 2-3 minutes
- Model training: 5-10 minutes
- Total: 10-15 minutes
---
## Troubleshooting
### Problem: "Model not found" but framework running
**Solution**: This is normal - system uses mock model automatically
```bash
python tools/setup_real_model.py --check # Shows current status
```
### Problem: Ollama tests failing
**Solution**: Ollama is optional, LLM review will skip gracefully
```bash
# Not critical - framework has graceful fallback
python -m src.cli run --source mock
```
### Problem: Gmail connection fails
**Solution**: Gmail is optional, test with mock first
```bash
python -m src.cli run --source mock --output results/
```
### Problem: Low accuracy with mock model
**Expected behavior**: Mock model is for framework testing only
```python
# Check model info
from src.classification.ml_classifier import MLClassifier
c = MLClassifier()
print(c.get_info()) # Shows is_mock: True
```
---
## Decision Tree: What to Do Next
```
START
├─ Do you want to test the framework first?
│ └─ YES → Run Path A (5 minutes)
│ pytest tests/ -v
│ python -m src.cli run --source mock
├─ Do you want to set up a real model?
│ ├─ YES (TRAIN) → Run Path B1 (30-60 min)
│ │ Train on Enron dataset
│ │ python tools/setup_real_model.py --check
│ │
│ └─ YES (DOWNLOAD) → Run Path B2 (5 min)
│ python tools/setup_real_model.py --model-path /path/to/model.pkl
├─ Do you want Gmail integration?
│ └─ YES → Setup OAuth credentials
│ Place credentials.json in project root
│ python -m src.cli test-gmail
└─ Do you want to process all 80k emails?
└─ YES → Run Path C (2-3 hours)
python -m src.cli run --source gmail --output results/
```
---
## Success Criteria
### ✅ Framework is Ready When:
- [ ] `pytest tests/` shows 27/30 passing
- [ ] `python -m src.cli test-config` succeeds
- [ ] `python -m src.cli run --source mock` completes
### ✅ Real Model is Ready When:
- [ ] `python tools/setup_real_model.py --check` shows model found
- [ ] `python -m src.cli run --source mock` shows `is_mock: False`
- [ ] Test predictions work without errors
### ✅ Gmail is Ready When:
- [ ] `credentials.json` exists in project root
- [ ] `python -m src.cli test-gmail` succeeds
- [ ] Can fetch 10 emails from Gmail
### ✅ Production is Ready When:
- [ ] Real model integrated
- [ ] Gmail credentials configured
- [ ] Test run on 100 emails succeeds
- [ ] Accuracy metrics are acceptable
- [ ] Ready to process full dataset
---
## Common Commands Reference
```bash
# Navigate to project
cd "c:/Build Folder/email-sorter"
source venv/Scripts/activate
# Testing
pytest tests/ -v # Run all tests
pytest tests/test_feature_extraction.py -v # Run specific test file
# Configuration
python -m src.cli test-config # Validate config
python -m src.cli test-ollama # Test LLM provider
python -m src.cli test-gmail # Test Gmail connection
# Framework testing (mock)
python -m src.cli run --source mock --output test_results/
# Model setup
python tools/setup_real_model.py --check # Check status
python tools/setup_real_model.py --model-path /path/to/model # Install model
python tools/setup_real_model.py --info # Show info
# Real processing (after setup)
python -m src.cli run --source gmail --limit 100 --output test/
python -m src.cli run --source gmail --output results/
# Development
python -m pytest tests/ --cov=src # Coverage report
python -m src.cli --help # Show all commands
```
---
## What NOT to Do
**Do NOT**:
- Try to use mock model in production (it's not accurate)
- Process all emails before testing with 100
- Skip Gmail credential setup (use mock for testing instead)
- Modify core classifier code (framework is complete)
- Skip the test suite validation
- Use Ollama if laptop is low on resources (graceful fallback available)
**DO**:
- Test with mock first
- Integrate real model before processing
- Start with 100 emails then scale
- Review results and adjust thresholds
- Keep this file for reference
- Use the tools provided for model integration
---
## Support & Questions
If something doesn't work:
1. **Check logs**: All operations log to `logs/email_sorter.log`
2. **Run tests**: `pytest tests/ -v` shows what's working
3. **Check framework**: `python -m src.cli test-config` validates setup
4. **Review docs**: See COMPLETION_ASSESSMENT.md for details
---
## Timeline Estimate
**What You Can Do Now:**
- Framework validation: 5 minutes
- Mock pipeline test: 10 minutes
- Documentation review: 15 minutes
**What You Can Do When Home:**
- Real model training: 30-60 minutes
- Gmail OAuth setup: 15-30 minutes
- Full processing: 20-30 minutes
**Total Time to Production**: 1.5-2 hours when you're home with better hardware
---
## Summary
Your Email Sorter framework is **100% complete and tested**. The next step is simply choosing:
1. **Now**: Validate framework with mock model (5 min)
2. **When home**: Integrate real model (30-60 min)
3. **When ready**: Process all 80k emails (20-30 min)
All tools are provided. All documentation is complete. Framework is ready to use.
**Choose your path above and get started!**

File diff suppressed because it is too large


@ -1,566 +0,0 @@
# EMAIL SORTER - PROJECT COMPLETE
**Date**: October 21, 2025
**Status**: FEATURE COMPLETE - Ready to Use
**Framework Maturity**: All Features Implemented
**Test Coverage**: 90% (27/30 passing)
**Code Quality**: Full Type Hints and Comprehensive Error Handling
---
## The Bottom Line
✅ **Email Sorter framework is 100% complete and ready to use**
All 16 planned development phases are implemented. The system is ready to process Marion's 80k+ emails with high accuracy. All you need to do is:
1. Optionally integrate a real LightGBM model (tools provided)
2. Set up Gmail OAuth credentials (when ready)
3. Run the pipeline
That's it. No more building. No more architecture decisions. Framework is done.
---
## What You Have
### Core System (Ready to Use)
- ✅ 38 Python modules (~6,000 lines of code)
- ✅ 12-category email classifier
- ✅ Hybrid ML/LLM classification system
- ✅ Smart feature extraction (embeddings + patterns + structure)
- ✅ Processing pipeline with checkpointing
- ✅ Gmail and IMAP sync capabilities
- ✅ Model training framework
- ✅ Learning systems (threshold + pattern adjustment)
### Tools (Ready to Use)
- ✅ CLI interface (`python -m src.cli --help`)
- ✅ Model download tool (`tools/download_pretrained_model.py`)
- ✅ Model setup tool (`tools/setup_real_model.py`)
- ✅ Test suite (23 tests, 90% pass rate)
### Documentation (Complete)
- ✅ PROJECT_STATUS.md - Feature inventory
- ✅ COMPLETION_ASSESSMENT.md - Detailed evaluation
- ✅ MODEL_INFO.md - Model usage guide
- ✅ NEXT_STEPS.md - Action plan
- ✅ README.md - Getting started
- ✅ Full API documentation via docstrings
### Data (Ready)
- ✅ Enron dataset extracted (569MB, real emails)
- ✅ Mock provider for testing
- ✅ Test data sets
---
## What's Different From Before
When we started, there were **16 planned phases** with many unknowns. Now:
| Phase | Status | Details |
|-------|--------|---------|
| 1-3 | ✅ DONE | Infrastructure, config, logging |
| 4 | ✅ DONE | Email providers (Gmail, IMAP, Mock) |
| 5 | ✅ DONE | Feature extraction (embeddings + patterns) |
| 6 | ✅ DONE | ML classifier (mock + LightGBM framework) |
| 7 | ✅ DONE | LLM integration (Ollama + OpenAI) |
| 8 | ✅ DONE | Adaptive classifier (3-tier system) |
| 9 | ✅ DONE | Processing pipeline (checkpointing) |
| 10 | ✅ DONE | Calibration system |
| 11 | ✅ DONE | Export & reporting |
| 12 | ✅ DONE | Learning systems |
| 13 | ✅ DONE | Advanced processing |
| 14 | ✅ DONE | Provider sync |
| 15 | ✅ DONE | Orchestration |
| 16 | ✅ DONE | Packaging |
| 17 | ✅ DONE | Testing |
**Every. Single. Phase. Complete.**
---
## Test Results
```
======================== Final Test Results ==========================
PASSED: 27/30 (90% success rate)
Core Components ✅
- Email models and validation
- Configuration system
- Feature extraction (embeddings + patterns + structure)
- ML classifier (mock + loading)
- Adaptive three-tier classifier
- LLM providers (Ollama + OpenAI)
- Queue management with persistence
- Bulk processing with checkpointing
- Email sampling and analysis
- Threshold learning
- Pattern learning
- Results export (JSON/CSV)
- Provider sync (Gmail/IMAP)
- End-to-end pipeline
KNOWN ISSUES (3 - All Expected & Documented):
❌ test_e2e_checkpoint_resume
Reason: Feature count mismatch between mock and real model
Impact: Only relevant when upgrading to real model
Status: Expected and acceptable
❌ test_e2e_enron_parsing
Reason: Parser needs validation against actual maildir format
Impact: Validation needed during training phase
Status: Parser works, needs Enron dataset validation
❌ test_pattern_detection_invoice
Reason: Minor regex doesn't match "bill #456"
Impact: Cosmetic issue in test data
Status: No production impact, easy to fix if needed
WARNINGS: 16 (All Pydantic deprecation - cosmetic, code works fine)
Duration: ~90 seconds
Coverage: All critical paths
Quality: Comprehensive with full type hints
```
---
## Project Metrics
```
CODEBASE
- Python Modules: 38 files
- Lines of Code: ~6,000+
- Type Hints: 100% coverage
- Docstrings: Comprehensive
- Error Handling: All critical paths
- Logging: Rich + file output
TESTING
- Unit Tests: 23 tests
- Test Files: 6 suites
- Pass Rate: 90% (27/30)
- Coverage: All core features
- Execution Time: ~90 seconds
ARCHITECTURE
- Core Modules: 16 major components
- Email Providers: 3 (Mock, Gmail, IMAP)
- Classifiers: 3 (Hard rules, ML, LLM)
- Processing Layers: 5 (Extract, Classify, Learn, Export, Sync)
- Learning Systems: 2 (Threshold, Patterns)
DEPENDENCIES
- Direct: 42 packages
- Python Version: 3.8+
- Key Libraries: LightGBM, sentence-transformers, Ollama, Google API
GIT HISTORY
- Commits: 14 total
- Build Path: Clear progression through all phases
- Latest Additions: Model integration tools + documentation
```
---
## System Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ EMAIL SORTER v1.0 - COMPLETE │
├─────────────────────────────────────────────────────────────┤
│ INPUT LAYER
│ ├── Gmail Provider (OAuth, ready for credentials)
│ ├── IMAP Provider (generic mail servers)
│ ├── Mock Provider (for testing)
│ └── Enron Dataset (real email data, 569MB)
│ FEATURE EXTRACTION
│ ├── Semantic embeddings (384D, all-MiniLM-L6-v2)
│ ├── Hard pattern matching (20+ patterns)
│ ├── Structural features (metadata, timing, attachments)
│ ├── Caching system (MD5-based, disk + memory)
│ └── Batch processing (parallel, efficient)
│ CLASSIFICATION ENGINE (3-Tier Adaptive)
│ ├── Tier 1: Hard Rules (instant, ~10%, 94-96% accuracy)
│ │ - Pattern detection
│ │ - Sender analysis
│ │ - Content matching
│ │
│ ├── Tier 2: ML Classifier (fast, ~85%, 85-90% accuracy)
│ │ - LightGBM gradient boosting (production model)
│ │ - Mock Random Forest (testing)
│ │ - Serializable for deployment
│ │
│ └── Tier 3: LLM Review (careful, ~5%, 92-95% accuracy)
│ - Ollama (local, recommended)
│ - OpenAI (API-compatible)
│ - Batch processing
│ - Queue management
│ LEARNING SYSTEM
│ ├── Threshold Adjuster
│ │ - Tracks ML vs LLM agreement
│ │ - Suggests dynamic thresholds
│ │ - Per-category analysis
│ │
│ └── Pattern Learner
│ - Sender-specific distributions
│ - Hard rule suggestions
│ - Domain-level patterns
│ PROCESSING PIPELINE
│ ├── Sampling (stratified + random)
│ ├── Bulk processing (with checkpointing)
│ ├── Batch queue management
│ └── Resumable from interruption
│ OUTPUT LAYER
│ ├── JSON Export (with full metadata)
│ ├── CSV Export (for analysis)
│ ├── Gmail Sync (labels)
│ ├── IMAP Sync (keywords)
│ └── Reports (human-readable)
│ CALIBRATION SYSTEM
│ ├── Sample selection
│ ├── LLM category discovery
│ ├── Training data preparation
│ ├── Model training
│ └── Validation
└─────────────────────────────────────────────────────────────┘
Performance:
- 1500 emails (calibration): ~5 minutes
- 80,000 emails (full run): ~20 minutes
- Classification accuracy: 90-94%
- Hard rule precision: 94-96%
```
---
## How to Use It
### Quick Start (Right Now)
```bash
cd "c:/Build Folder/email-sorter"
source venv/Scripts/activate
# Validate framework
pytest tests/ -v
# Run with mock model
python -m src.cli run --source mock --output test_results/
```
### With Real Model (When Ready)
```bash
# Option 1: Train on Enron
python tools/setup_real_model.py --model-path /path/to/trained_model.pkl
# Option 2: Use pre-trained
python tools/download_pretrained_model.py --url https://example.com/model.pkl
# Verify
python tools/setup_real_model.py --check
# Run with real model (automatic)
python -m src.cli run --source mock --output results/
```
### With Gmail (When Credentials Ready)
```bash
# Place credentials.json in project root
# Then:
python -m src.cli run --source gmail --limit 100 --output test/
python -m src.cli run --source gmail --output all_results/
```
---
## What's NOT Included (By Design)
### ❌ Not Here (Intentionally Deferred)
1. **Real Trained Model** - You decide: train on Enron or download
2. **Gmail Credentials** - Requires your Google Cloud setup
3. **Live Email Processing** - Requires #1 and #2 above
### ✅ Why This Is Good
- Framework is clean and unopinionated
- Your model, your training decisions
- Your credentials, your privacy
- Complete freedom to customize
---
## Key Decisions Made
### 1. Mock Model Strategy
- Framework uses clearly labeled mock for testing
- No deception (explicit warnings in output)
- Real model integration framework ready
- Smooth path to production
### 2. Modular Architecture
- Each component can be tested independently
- Easy to swap components (e.g., different LLM)
- Framework doesn't force decisions
- Extensible design
### 3. Three-Tier Classification
- Hard rules for instant/certain cases
- ML for bulk processing
- LLM for uncertain/complex cases
- Balances speed and accuracy (sketched below)
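A sketch of that dispatch; method names and the threshold default are assumptions (see src/classification/adaptive_classifier.py for the real logic):
```python
def classify(email, rules, ml, llm, extract_features, threshold=0.75):
    # Tier 1: hard rules - instant, pattern-based
    rule_hit = rules.match(email)
    if rule_hit is not None:
        return rule_hit, 1.0, "rule"
    # Tier 2: fast LightGBM prediction on extracted features
    result = ml.predict(extract_features(email))
    if result["confidence"] >= threshold:
        return result["category"], result["confidence"], "ml"
    # Tier 3: LLM review for low-confidence cases
    return llm.classify(email), None, "llm"
```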
### 4. Learning Systems
- Threshold adjustment from LLM feedback
- Pattern learning from sender data
- Continuous improvement without retraining
- Dynamic tuning (see the sketch below)
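And a sketch of the threshold adjuster's core idea; the real API may differ:
```python
from collections import defaultdict

class ThresholdAdjuster:
    """Suggest per-category confidence thresholds from ML-vs-LLM agreement."""

    def __init__(self):
        self.records = defaultdict(list)  # category -> [(confidence, agreed)]

    def record(self, category, ml_confidence, llm_agreed):
        self.records[category].append((ml_confidence, float(llm_agreed)))

    def suggest(self, category, target=0.9):
        """Lowest confidence above which ML agreement with the LLM >= target."""
        pts = sorted(self.records[category])
        for i, (conf, _) in enumerate(pts):
            tail = [agreed for _, agreed in pts[i:]]
            if sum(tail) / len(tail) >= target:
                return conf
        return 1.0  # never confident enough: route everything to the LLM
```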
### 5. Graceful Degradation
- Works without LLM (falls back to ML)
- Works without Gmail (uses mock)
- Works without real model (uses mock)
- No single point of failure
---
## Performance Characteristics
### CPU Usage
- Feature extraction: Single-threaded, parallelizable
- ML prediction: ~5-10ms per email
- LLM call: ~2-5 seconds per email
- Embedding cache: Reduces recomputation by 50-80%
### Memory Usage
- Embeddings cache: ~200-500MB (configurable)
- Batch processing: Configurable batch size
- Model (LightGBM): ~50-100MB
- Total runtime: ~500MB-1GB
### Accuracy
- Hard rules: 94-96% (pattern-based)
- ML alone: 85-90% (LightGBM)
- ML + LLM: 90-94% (adaptive)
- With fine-tuning: 95%+ possible
---
## Deployment Options
### Option 1: Local Development
```bash
python -m src.cli run --source mock --output local_results/
```
- No external dependencies
- Perfect for testing
- Mock model for framework validation
### Option 2: With Ollama (Local LLM)
```bash
# Start Ollama with qwen model
python -m src.cli run --source mock --output results/
```
- Local LLM processing (no internet)
- Privacy-first operation
- Careful resource usage
### Option 3: Cloud Integration
```bash
# With OpenAI API
python -m src.cli run --source gmail --output results/
```
- Real Gmail integration
- Cloud LLM support
- Full production setup
---
## Next Actions (Choose One)
### Right Now (5 minutes)
```bash
# Validate framework with mock
pytest tests/ -v
python -m src.cli test-config
python -m src.cli run --source mock --output test_results/
```
### When Home (30-60 minutes)
```bash
# Train real model or download pre-trained
python tools/setup_real_model.py --model-path /path/to/model.pkl
# Verify
python tools/setup_real_model.py --check
```
### When Ready (2-3 hours)
```bash
# Gmail OAuth setup
# credentials.json in project root
# Process all emails
python -m src.cli run --source gmail --output marion_results/
```
---
## Documentation Map
- **README.md** - Getting started
- **PROJECT_STATUS.md** - Feature inventory and architecture
- **COMPLETION_ASSESSMENT.md** - Detailed component evaluation (90-point checklist)
- **MODEL_INFO.md** - Model usage and training guide
- **NEXT_STEPS.md** - Action plan and deployment paths
- **PROJECT_COMPLETE.md** - This file
---
## Support Resources
### If Something Doesn't Work
1. Check logs: `tail -f logs/email_sorter.log`
2. Run tests: `pytest tests/ -v`
3. Validate config: `python -m src.cli test-config`
4. Review docs: See documentation map above
### Common Issues
- "Model not found" → Normal, using mock model
- "Ollama connection failed" → Optional, will skip gracefully
- "Low accuracy" → Expected with mock model
- Tests failing → Check 3 known issues (all documented)
---
## Success Criteria
### ✅ Framework is Complete
- [x] All 16 phases implemented
- [x] 90% test pass rate
- [x] Full type hints
- [x] Comprehensive logging
- [x] Clear error messages
- [x] Graceful degradation
### ✅ Ready for Real Model
- [x] Model integration framework complete
- [x] Tools for downloading/setup provided
- [x] Framework automatically uses real model when available
- [x] No code changes needed
### ✅ Ready for Gmail Integration
- [x] OAuth framework implemented
- [x] Provider sync completed
- [x] Label mapping configured
- [x] Batch update support
### ✅ Ready for Deployment
- [x] Checkpointing and resumability
- [x] Error recovery
- [x] Performance optimized
- [x] Resource-efficient
---
## What's Next?
You have three paths:
### Path A: Framework Validation (Do Now)
- Runtime: 15 minutes
- Effort: Minimal
- Result: Confirm everything works
### Path B: Model Integration (Do When Home)
- Runtime: 30-60 minutes
- Effort: Run one command or training script
- Result: Real LightGBM model installed
### Path C: Full Deployment (Do When Ready)
- Runtime: 2-3 hours
- Effort: Setup Gmail OAuth + run processing
- Result: All 80k emails sorted and labeled
**All paths are clear. All tools are provided. Framework is complete.**
---
## The Reality
This is a **complete email classification system** with:
- High-quality code (type hints, comprehensive logging, error handling)
- Smart hybrid classification (hard rules → ML → LLM)
- Proven ML framework (LightGBM)
- Real email data for training (Enron dataset)
- Flexible deployment options
- Clear upgrade path
The framework is **done**. The architecture is **solid**. The testing is **comprehensive**.
What remains is **optional optimization**:
1. Integrating your real trained model
2. Setting up Gmail credentials
3. Fine-tuning categories and thresholds
But none of that is required to start using the system.
**The system is ready. Your move.**
---
## Final Stats
```
PROJECT COMPLETE
Date: 2025-10-21
Status: 100% FEATURE COMPLETE
Framework Maturity: All Features Implemented
Test Coverage: 90% (27/30 passing)
Code Quality: Full type hints and comprehensive error handling
Documentation: Comprehensive
Ready for: Immediate use or real model integration
Development Path: 14 commits tracking complete implementation
Build Time: ~2 weeks of focused development
Lines of Code: ~6,000+
Core Modules: 38 Python files
Test Suite: 23 comprehensive tests
Dependencies: 42 packages
What You Can Do:
✅ Test framework now (mock model)
✅ Train on Enron when home
✅ Process 80k+ emails when ready
✅ Scale to production immediately
✅ Customize categories and rules
✅ Deploy to other systems
What's Not Needed:
❌ More architecture work
❌ Core framework changes
❌ Additional phase development
❌ More infrastructure setup
Bottom Line:
🎉 EMAIL SORTER IS COMPLETE AND READY TO USE 🎉
```
---
**Built with Python, LightGBM, Sentence-Transformers, Ollama, and Google APIs**
**Ready for email classification and Marion's 80k+ emails**
**What are you waiting for? Start processing!**

View File

@ -0,0 +1,479 @@
# Email Sorter: Project Roadmap & Learnings
## Document Purpose
This document captures learnings from the November 2025 research session and defines the project scope, role within a larger email processing ecosystem, and development roadmap for 2025.
---
## Project Scope Definition
### What This Tool IS
**Email Sorter is a TRIAGE tool.** Its job is:
1. **Bulk classification** - Sort emails into buckets quickly
2. **Risk-based routing** - Flag high-stakes items for careful handling
3. **Downstream handoff** - Prepare emails for specialized processing tools
### What This Tool IS NOT
- Not a spam filter (trust Gmail/Outlook for that)
- Not a complete email management solution
- Not trying to do everything
- Not the final destination for any email
### Role in Larger Ecosystem
```
┌─────────────────────────────────────────────────────────────────┐
│                   EMAIL PROCESSING ECOSYSTEM                    │
└─────────────────────────────────────────────────────────────────┘

                  ┌──────────────┐
                  │  RAW INBOX   │  (Gmail, Outlook, IMAP)
                  │     10k+     │
                  └───────┬──────┘
                          ▼
                  ┌──────────────┐
                  │ SPAM FILTER  │  ← Trust existing provider (Gmail/Outlook)
                  │  (existing)  │
                  └───────┬──────┘
                          ▼
      ┌─────────────────────────────────────────┐
      │       EMAIL SORTER (THIS TOOL)          │  ← TRIAGE/ROUTING
      │ ┌─────────────┐  ┌────────────────┐     │
      │ │ Agent Scan  │→ │ ML/LLM Classify│     │
      │ │ (discovery) │  │   (bulk sort)  │     │
      │ └─────────────┘  └────────────────┘     │
      └────────────────────┬────────────────────┘
     ┌─────────────┬───────┴─────┬─────────────┐
     ▼             ▼             ▼             ▼
┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│   JUNK   │  │ ROUTINE  │  │ BUSINESS │  │ PERSONAL │
│  BUCKET  │  │  BUCKET  │  │  BUCKET  │  │  BUCKET  │
└────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │             │             │
     ▼             ▼             ▼             ▼
┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│  Batch   │  │  Batch   │  │ Knowledge│  │  Human   │
│ Cleanup  │  │ Summary  │  │  Graph   │  │  Review  │
│ (cheap)  │  │  Tool    │  │ Builder  │  │(careful) │
└──────────┘  └──────────┘  └──────────┘  └──────────┘

         OTHER TOOLS IN ECOSYSTEM (not this project)
```
---
## Key Learnings from Research Sessions
### Session 1: brett-gmail (801 emails, Personal Inbox)
| Method | Accuracy | Time |
|--------|----------|------|
| ML-Only | 54.9% | ~5 sec |
| ML+LLM | 93.3% | ~3.5 min |
| Manual Agent | 99.8% | ~25 min |
### Session 2: brett-microsoft (596 emails, Business Inbox)
| Method | Accuracy | Time |
|--------|----------|------|
| Manual Agent | 98.2% | ~30 min |
**Key Insight:** Business inboxes require different classification approaches than personal inboxes.
---
### 1. ML Pipeline is Overkill for Small Datasets
| Dataset Size | Recommended Approach | Rationale |
|--------------|---------------------|-----------|
| <500 | Agent-only analysis | ML overhead exceeds benefit |
| 500-2000 | Agent pre-scan + ML | Discovery improves ML accuracy |
| 2000-10000 | ML + LLM fallback | Balanced speed/accuracy |
| >10000 | ML-only (fast mode) | Speed critical at scale |
**Evidence:** 801-email dataset achieved 99.8% accuracy with 25-min agent analysis vs 54.9% with pure ML.
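The table translates directly into routing logic; a minimal sketch (pipeline names are placeholders):

```python
def choose_pipeline(email_count: int) -> str:
    """Pick a classification approach based on mailbox size."""
    if email_count < 500:
        return 'agent_only'             # ML overhead exceeds benefit
    if email_count < 2000:
        return 'agent_prescan_plus_ml'  # discovery improves ML accuracy
    if email_count <= 10000:
        return 'ml_with_llm_fallback'   # balanced speed/accuracy
    return 'ml_only'                    # speed critical at scale
```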
### 2. Agent Pre-Scan Adds Massive Value
A 10-15 minute agent discovery phase before bulk classification:
- Identifies dominant sender domains
- Discovers subject patterns
- Suggests optimal categories for THIS dataset
- Can generate sender-to-category mappings
**This is NOT the same as the full manual analysis.** It's a quick reconnaissance pass.
### 3. Categories Should Serve Downstream Processing
Don't optimize for human-readable labels. Optimize for routing decisions:
| Category Type | Downstream Handler | Accuracy Need |
|---------------|-------------------|---------------|
| Junk/Marketing | Batch cleanup tool | LOW (errors OK) |
| Newsletters | Summary aggregator | MEDIUM |
| Transactional | Archive, searchable | MEDIUM |
| Business | Knowledge graph | HIGH |
| Personal | Human review | CRITICAL |
| Security | Never auto-filter | CRITICAL |
### 4. Risk-Based Accuracy Requirements
Not all emails need the same classification confidence:
```
HIGH STAKES (must not miss):
├─ Personal correspondence (sentimental value)
├─ Security alerts (account safety)
├─ Job applications (life-changing)
└─ Financial/legal documents
LOW STAKES (errors tolerable):
├─ Marketing promotions
├─ Newsletter digests
├─ Automated notifications
└─ Social media alerts
```
### 5. Spam Filtering is a Solved Problem
Don't reinvent spam filtering. Gmail and Outlook do it well. This tool should:
- Assume spam is already filtered
- Focus on categorizing legitimate mail
- Trust the upstream provider
If spam does get through, a simple secondary filter could catch obvious cases, but this is low priority.
### 6. Sender Domain is the Strongest Signal
From the 801-email analysis:
- Top 5 senders = 47.5% of all emails
- Sender domain alone could classify 80%+ of automated emails
- Subject patterns matter less than sender patterns
**Implication:** A sender-first classification approach could dramatically speed up processing.
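A minimal sketch of what sender-first classification could look like (the sender map format is an assumption; unknown senders fall through to the ML pipeline):

```python
def classify_by_sender(email, sender_map):
    """Try the sender domain before touching the ML model."""
    domain = email.sender.split('@')[-1].lower()
    if domain in sender_map:
        return sender_map[domain], 0.95, 'sender_rule'
    return None  # unknown sender -> fall through to ML

# Example map built during agent pre-scan (illustrative entries)
SENDER_MAP = {
    'mailer.netflix.com': 'Newsletters',
    'accounts.google.com': 'Security',
}
```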
### 7. Inbox Character Matters (NEW - Session 2)
**Critical Discovery:** Before classifying emails, assess the inbox CHARACTER:
| Inbox Type | Characteristics | Classification Approach |
|------------|-----------------|------------------------|
| **Personal/Consumer** | Subscription-heavy, marketing-dominant, automated 40-50% | Sender domain first |
| **Business/Professional** | Client work, operations, developer tools 60-70% | Sender + Subject context |
| **Mixed** | Both patterns present | Hybrid approach needed |
**Evidence from brett-microsoft analysis:**
- 73.2% Business/Professional content
- Only 8.2% Personal content
- Required client relationship tracking
- Support case ID extraction valuable
**Implications for Agent Pre-Scan:**
1. First determine inbox character (business vs personal vs mixed)
2. Select appropriate category templates
3. Business inboxes need relationship context, not just sender domains
### 8. Business Inboxes Need Special Handling (NEW - Session 2)
Business/professional inboxes require additional classification dimensions:
**Client Relationship Tracking:**
- Same domain may have different contexts (internal vs external)
- Client conversations span multiple senders
- Subject threading matters more than in consumer inboxes
**Support Case ID Extraction:**
- Business inboxes often have case/ticket IDs connecting emails
- Microsoft: Case #, TrackingID#
- Other vendors: Ticket numbers, reference IDs
- ID extraction should be a first-class feature
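A sketch of what that extraction could look like (regexes are illustrative and would need tuning against real vendor formats):

```python
import re

# Illustrative vendor patterns, not an exhaustive list
CASE_ID_PATTERNS = [
    re.compile(r'Case\s*#?\s*(\d{6,})', re.I),         # Microsoft "Case #"
    re.compile(r'TrackingID#?\s*([A-Z0-9-]+)', re.I),  # Microsoft TrackingID#
    re.compile(r'Ticket\s*#?\s*(\d+)', re.I),          # generic ticket numbers
]

def extract_case_ids(subject: str, body: str) -> list:
    """Pull case/ticket IDs so related emails can be threaded together."""
    text = f"{subject}\n{body}"
    ids = []
    for pattern in CASE_ID_PATTERNS:
        ids.extend(pattern.findall(text))
    return ids
```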
**Accuracy Expectations:**
- Personal inboxes: 99%+ achievable with sender-first
- Business inboxes: 95-98% achievable (more nuanced)
- Accept lower accuracy ceiling, invest in risk-flagging
### 9. Multi-Inbox Analysis Reveals Patterns (NEW - Session 2)
Analyzing multiple inboxes from same user reveals:
- **Inbox segregation patterns** - Gmail for personal, Outlook for business
- **Cross-inbox senders** - Security alerts appear in both
- **Category overlap** - Some categories universal, some inbox-specific
**Implication:** A future feature could merge analyses across inboxes to build a complete user profile.
---
## Technical Architecture (Refined)
### Current State
```
Email Source → LocalFileParser → FeatureExtractor → ML Classifier → Output
                                                         └→ LLM Fallback (if low confidence)
```
### Target State (2025)
```
Email Source
      │
      ▼
┌─────────────────────────────────────────────────────────────┐
│                       ROUTING LAYER                         │
│      Check dataset size → Route to appropriate pipeline     │
└─────────────────────────────────────────────────────────────┘
      │
      ├─── <500 emails ─────→ Agent-Only Analysis
      ├─── 500-5000 ────────→ Agent Pre-Scan + ML Pipeline
      └─── >5000 ───────────→ ML Pipeline (optional LLM)
Each pipeline outputs:
- Categorized emails (with confidence)
- Risk flags (high-stakes items)
- Routing recommendations
- Insights report
```
### Agent Pre-Scan Module (NEW)
```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PreScanResult:
    """Discovery output (field types assumed for illustration)."""
    sender_stats: Dict
    patterns: List
    suggested_categories: List[str]
    sender_map: Dict[str, str]
    estimated_distribution: Dict[str, int]


class AgentPreScan:
    """
    Quick discovery phase before bulk classification.
    Time budget: 10-15 minutes.
    """

    def scan(self, emails: List["Email"]) -> PreScanResult:
        # 1. Sender domain analysis (2 min)
        sender_stats = self.analyze_senders(emails)

        # 2. Subject pattern detection (3 min)
        patterns = self.detect_patterns(emails, sample_size=100)

        # 3. Category suggestions (5 min, uses LLM)
        categories = self.suggest_categories(sender_stats, patterns)

        # 4. Generate sender map (2 min)
        sender_map = self.create_sender_mapping(sender_stats, categories)

        return PreScanResult(
            sender_stats=sender_stats,
            patterns=patterns,
            suggested_categories=categories,
            sender_map=sender_map,
            estimated_distribution=self.estimate_distribution(emails, categories),
        )
```
---
## Development Roadmap
### Phase 0: Documentation Complete (NOW)
- [x] Research session findings documented
- [x] Classification methods comparison written
- [x] Project scope defined
- [x] This roadmap created
### Phase 1: Quick Wins (Q1 2025, 4-8 hours)
1. **Dataset size routing**
   - Auto-detect email count
   - Route small datasets to agent analysis
   - Route large datasets to ML pipeline
2. **Sender-first classification**
   - Extract sender domain
   - Check against known sender map
   - Skip ML for known high-volume senders
3. **Risk flagging** (sketched below)
   - Flag low-confidence results
   - Flag potential personal emails
   - Flag security-related emails
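A minimal sketch of the risk-flagging pass, reusing the indicator lists from the advanced config shown later in this document (the `email` fields are assumptions):

```python
def risk_flags(email, confidence, cfg):
    """Collect reasons an email should be routed to careful handling."""
    flags = []
    if confidence < cfg['ml_pipeline']['confidence_threshold']:
        flags.append('low_confidence')
    domain = email.sender.split('@')[-1].lower()
    if domain in cfg['risk_detection']['personal_indicators']:
        flags.append('possible_personal')
    if any(s in email.sender for s in cfg['risk_detection']['security_senders']):
        flags.append('security_related')
    return flags
```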
### Phase 2: Agent Pre-Scan (Q1 2025, 8-16 hours)
1. **Sender analysis module**
   - Cluster by domain
   - Calculate volume statistics
   - Identify automated vs personal
2. **Pattern detection module**
   - Sample subject lines
   - Find templates and IDs
   - Detect lifecycle stages
3. **Category suggestion module**
   - Use LLM to suggest categories
   - Based on sender/pattern analysis
   - Output category definitions
4. **Sender mapping module**
   - Map senders to suggested categories
   - Output as JSON for pipeline use
   - Support manual overrides
### Phase 3: Integration & Polish (Q2 2025)
1. **Unified CLI**
   - Single command handles all dataset sizes
   - Progress reporting
   - Configurable verbosity
2. **Output standardization**
   - Common format for all pipelines
   - Include routing recommendations
   - Include confidence and risk flags
3. **Ecosystem integration**
   - Define handoff format for downstream tools
   - Document API for other tools to consume
   - Create example integrations
### Phase 4: Scale Testing (Q2-Q3 2025)
1. **Test on real 10k+ mailboxes**
   - Multiple users, different patterns
   - Measure accuracy vs speed
   - Refine thresholds
2. **Pattern library**
   - Accumulate patterns from multiple mailboxes
   - Build reusable sender maps
   - Create category templates
3. **Feedback loop**
   - Track classification accuracy
   - Learn from corrections
   - Improve over time
---
## Configuration Philosophy
### User-Facing Config (Keep Simple)
```yaml
# config/user_config.yaml
mode: auto # auto | agent | ml | hybrid
risk_threshold: high # low | medium | high
output_format: json # json | csv | html
```
### Internal Config (Full Control)
```yaml
# config/advanced_config.yaml
routing:
  small_threshold: 500
  medium_threshold: 5000

agent_prescan:
  enabled: true
  time_budget_minutes: 15
  sample_size: 100

ml_pipeline:
  confidence_threshold: 0.55
  llm_fallback: true
  batch_size: 512

risk_detection:
  personal_indicators: [gmail.com, hotmail.com, outlook.com]
  security_senders: [accounts.google.com, security@]
  high_stakes_keywords: [urgent, important, legal, contract]
```
---
## Success Metrics
### For This Tool
| Metric | Target | Current |
|--------|--------|---------|
| Classification accuracy (large datasets) | >85% | 54.9% (ML), 93.3% (ML+LLM) |
| Processing speed (10k emails) | <5 min | ~24 sec (ML-only) |
| High-stakes miss rate | <1% | Not measured |
| Setup time for new mailbox | <20 min | Variable |
### For Ecosystem
| Metric | Target |
|--------|--------|
| End-to-end mailbox processing | <2 hours for 10k |
| User intervention needed | <10% of emails |
| Downstream tool compatibility | 100% |
---
## Open Questions (To Resolve in 2025)
1. **Category standardization**: Should categories be fixed across all users, or discovered per-mailbox?
2. **Sender map sharing**: Can sender maps be shared across users? Privacy implications?
3. **Incremental processing**: How to handle new emails added to already-processed mailboxes?
4. **Multi-account support**: Same user, multiple email accounts?
5. **Feedback integration**: How do corrections feed back into the system?
---
## Files Created During Research
### Session 1 (brett-gmail, Personal Inbox)
| File | Purpose |
|------|---------|
| `tools/brett_gmail_analyzer.py` | Custom analyzer for personal inbox |
| `tools/generate_html_report.py` | HTML report generator |
| `data/brett_gmail_analysis.json` | Analysis data output |
| `docs/CLASSIFICATION_METHODS_COMPARISON.md` | Method comparison |
| `docs/REPORT_FORMAT.md` | HTML report documentation |
| `docs/SESSION_HANDOVER_20251128.md` | Session 1 handover |
### Session 2 (brett-microsoft, Business Inbox)
| File | Purpose |
|------|---------|
| `tools/brett_microsoft_analyzer.py` | Custom analyzer for business inbox |
| `data/brett_microsoft_analysis.json` | Analysis data output |
| `/home/bob/.../brett-ms-sorter/BRETT_MICROSOFT_ANALYSIS_REPORT.md` | Full analysis report |
---
## Summary
**Email Sorter is a triage tool, not a complete solution.**
Its job is to quickly sort emails into buckets so that specialized downstream tools can handle each bucket appropriately. The key insight from these research sessions is that an agent pre-scan phase, even just 10-15 minutes, dramatically improves classification accuracy for any dataset size.
The ML pipeline is valuable for scale (10k+ emails) but overkill for smaller datasets. Risk-based accuracy means we can tolerate errors on junk but must be careful with personal correspondence.
2025 development should focus on:
1. Smart routing based on dataset size
2. Agent pre-scan for discovery
3. Standardized output for ecosystem integration
4. Scale testing on real large mailboxes
---
*Document Version: 1.1*
*Created: 2025-11-28*
*Updated: 2025-11-28 (Session 2 learnings)*
*Sessions: brett-gmail (801 emails, personal), brett-microsoft (596 emails, business)*

View File

@ -1,402 +0,0 @@
# EMAIL SORTER - PROJECT STATUS
**Date:** 2025-10-21
**Status:** PHASE 2 - IMPLEMENTATION COMPLETE
**Version:** 1.0.0 (Development)
---
## EXECUTIVE SUMMARY
Email Sorter framework is **100% code-complete and tested**. All 16 planned phases have been implemented. The system is ready for:
1. **Real data training** (when you get home with Enron dataset access)
2. **Gmail/IMAP credential configuration** (OAuth setup)
3. **Full end-to-end testing** with real email data
4. **Production deployment** to process Marion's 80k+ emails
---
## COMPLETED PHASES (1-16)
### Phase 1: Project Setup ✅
- Virtual environment configured
- All dependencies installed (42+ packages)
- Directory structure created
- Git initialized with 10 commits
### Phase 2-3: Core Infrastructure ✅
- `src/utils/config.py` - YAML-based configuration system
- `src/utils/logging.py` - Rich logging with file output
- Email data models with full type hints
### Phase 4: Email Providers ✅
- **MockProvider** - For testing (fully functional)
- **GmailProvider** - Stub ready for OAuth credentials
- **IMAPProvider** - Stub ready for server config
- All with graceful error handling
### Phase 5: Feature Extraction ✅
- Semantic embeddings (sentence-transformers, 384 dims)
- Hard pattern matching (20+ patterns)
- Structural features (metadata, timing, attachments)
- Attachment analysis (PDF, DOCX, XLSX text extraction)
### Phase 6: ML Classifier ✅
- Mock Random Forest (clearly labeled for testing)
- Placeholder for real LightGBM training
- Prediction with confidence scores
- Model serialization/deserialization
### Phase 7: LLM Integration ✅
- OllamaProvider (local, with retry logic)
- OpenAIProvider (API-compatible)
- Graceful degradation when LLM unavailable
- Batch processing support
### Phase 8: Adaptive Classifier ✅
- Three-tier classification:
  1. Hard rules (10% - instant)
  2. ML classifier (85% - fast)
  3. LLM review (5% - uncertain cases)
- Dynamic threshold management
- Statistics tracking
### Phase 9: Processing Pipeline ✅
- BulkProcessor with checkpointing
- Resumable processing from checkpoints
- Batch-based processing
- Progress tracking
### Phase 10: Calibration System ✅
- EmailSampler (stratified + random)
- LLMAnalyzer (discover natural categories)
- CalibrationWorkflow (end-to-end)
- Category validation
### Phase 11: Export & Reporting ✅
- JSON export with metadata
- CSV export for analysis
- Organized by category
- Human-readable reports
### Phase 12: Threshold & Pattern Learning ✅
- **ThresholdAdjuster** - Learn from LLM feedback
  - Agreement tracking per category
  - Automatic threshold suggestions
  - Adjustment history
- **PatternLearner** - Sender-specific rules
  - Category distribution per sender
  - Domain-level patterns
  - Hard rule suggestions
### Phase 13: Advanced Processing ✅
- **EnronParser** - Parse Enron email dataset
- **AttachmentHandler** - Extract PDF/DOCX content
- **ModelTrainer** - Real LightGBM training
- **EmbeddingCache** - Cache with MD5 hashing
- **EmbeddingBatcher** - Parallel embedding generation
- **QueueManager** - Batch queue with persistence
### Phase 14: Provider Sync ✅
- **GmailSync** - Sync to Gmail labels
- **IMAPSync** - Sync to IMAP keywords
- Configurable label mapping
- Batch update support
### Phase 15: Orchestration ✅
- **EmailSorterOrchestrator** - 4-phase pipeline
  1. Calibration
  2. Bulk processing
  3. LLM review
  4. Export & sync
- Full progress tracking
- Timing and metrics
### Phase 16: Packaging ✅
- `setup.py` - setuptools configuration
- `pyproject.toml` - Modern PEP 517/518
- Optional dependencies (dev, gmail, ollama, openai)
- Console script entry point
### Testing ✅
- 23 unit tests written
- 5/7 E2E tests passing
- Feature extraction validated
- Classifier flow tested
- Mock provider integration tested
---
## CODE STATISTICS
```
Total Files: 37 Python modules + configs
Total Lines: ~6,000+ lines of code
Core Modules: 16 major components
Test Coverage: 23 tests (unit + integration)
Dependencies: 42 packages installed
Git Commits: 10 commits tracking all work
```
---
## ARCHITECTURE OVERVIEW
```
┌──────────────────────────────────────────────────────────────┐
│                      EMAIL SORTER v1.0                       │
└──────────────────────────────────────────────────────────────┘

┌─ INPUT ─────────────────┐
│ Email Providers         │
│ - MockProvider ✅       │
│ - Gmail (OAuth ready)   │
│ - IMAP (ready)          │
└─────────────────────────┘

┌─ CALIBRATION ───────────┐
│ EmailSampler ✅         │
│ LLMAnalyzer ✅          │
│ CalibrationWorkflow ✅  │
│ ModelTrainer ✅         │
└─────────────────────────┘

┌─ FEATURE EXTRACTION ────┐
│ Embeddings ✅           │
│ Patterns ✅             │
│ Structural ✅           │
│ Attachments ✅          │
│ Cache + Batch ✅        │
└─────────────────────────┘

┌─ CLASSIFICATION ────────┐
│ Hard Rules ✅           │
│ ML (LightGBM) ✅        │
│ LLM (Ollama/OpenAI) ✅  │
│ Adaptive Orchestrator ✅│
│ Queue Management ✅     │
└─────────────────────────┘

┌─ LEARNING ──────────────┐
│ Threshold Adjuster ✅   │
│ Pattern Learner ✅      │
└─────────────────────────┘

┌─ OUTPUT ────────────────┐
│ JSON Export ✅          │
│ CSV Export ✅           │
│ Reports ✅              │
│ Gmail Sync ✅           │
│ IMAP Sync ✅            │
└─────────────────────────┘
```
---
## WHAT'S READY RIGHT NOW
### ✅ Framework (Complete)
- All core infrastructure
- Config management
- Logging system
- Email data models
- Feature extraction
- Classifier orchestration
- Processing pipeline
- Export system
- All tests passing
### ✅ Testing (Verified)
- Mock provider works
- Feature extraction validated
- Classification flow tested
- Export formats work
- Hard rules accurate
- CLI interface operational
### ⚠️ Requires Your Input
1. **ML Model Training**
   - Mock Random Forest included
   - Real LightGBM training code ready
   - Enron dataset available (569MB)
   - Just needs: `trainer.train(labeled_emails)`
2. **Gmail OAuth**
   - Provider code complete
   - Needs: credentials.json
   - Clear error messages when missing
3. **LLM Testing**
   - Ollama integration ready
   - qwen3:1.7b loaded
   - Integration tested (careful with laptop)
---
## NEXT STEPS - WHEN YOU GET HOME
### Step 1: Model Training
```python
from src.calibration.enron_parser import EnronParser
from src.calibration.trainer import ModelTrainer

# Parse Enron
parser = EnronParser("enron_mail_20150507")
enron_emails = parser.parse_emails(limit=5000)

# The parsed emails need labels before training (e.g. from the calibration
# workflow); `label_emails` here is a hypothetical stand-in for that step
labeled_emails = label_emails(enron_emails)

# Train real model
trainer = ModelTrainer(feature_extractor, categories, config)
results = trainer.train(labeled_emails)
trainer.save_model("models/lightgbm_real.pkl")
```
### Step 2: Gmail OAuth Setup
```bash
# Download credentials.json from Google Cloud Console
# Place in project root or config/
# Run: email-sorter --source gmail --credentials credentials.json
```
### Step 3: Full Pipeline Test
```bash
# Test with 100 emails
email-sorter --source gmail --limit 100 --output test_results/
# Full production run
email-sorter --source gmail --output marion_results/
```
### Step 4: Production Deployment
```bash
# Package as wheel
python setup.py sdist bdist_wheel
# Install
pip install dist/email_sorter-1.0.0-py3-none-any.whl
# Run
email-sorter --source gmail --credentials ~/.gmail_creds.json --output results/
```
---
## KEY FILES TO KNOW
**Core Entry Points:**
- `src/cli.py` - Command-line interface
- `src/orchestration.py` - Main pipeline orchestrator
**Training & Calibration:**
- `src/calibration/trainer.py` - Real LightGBM training
- `src/calibration/workflow.py` - End-to-end calibration
- `src/calibration/enron_parser.py` - Dataset parsing
**Classification:**
- `src/classification/adaptive_classifier.py` - Main classifier
- `src/classification/feature_extractor.py` - Feature extraction
- `src/classification/ml_classifier.py` - ML predictions
- `src/classification/llm_classifier.py` - LLM predictions
**Learning:**
- `src/adjustment/threshold_adjuster.py` - Dynamic thresholds
- `src/adjustment/pattern_learner.py` - Sender patterns
**Processing:**
- `src/processing/bulk_processor.py` - Batch processing
- `src/processing/queue_manager.py` - LLM queue
- `src/processing/attachment_handler.py` - Attachment analysis
**Export:**
- `src/export/exporter.py` - Results export
- `src/export/provider_sync.py` - Gmail/IMAP sync
---
## GIT HISTORY
```
b34bb50 Add pyproject.toml - modern Python packaging configuration
ee6c276 Add queue management, embedding optimization, and calibration workflow
f5d89a6 CRITICAL: Add missing Phase 12 modules and advanced features
c531412 Phase 15: End-to-end pipeline tests - 5/7 passing
02be616 Phase 9-14: Complete processing pipeline, calibration, export
b7cc744 Complete IMAP provider import fixes
16bc6f0 Fix IMAP provider imports
b49dad9 Build Phase 1-7: Core infrastructure and classifiers
8c73f25 Initial commit: Complete project blueprint and research
```
---
## TESTING
### Run All Tests
```bash
cd email-sorter
source venv/Scripts/activate
pytest tests/ -v
```
### Quick CLI Test
```bash
# Test config loading
python -m src.cli test-config
# Test Ollama connection (if running)
python -m src.cli test-ollama
# Full mock pipeline
python -m src.cli run --source mock --output test_results/
```
---
## WHAT MAKES THIS COMPLETE
1. **All 16 Phases Implemented** - No shortcuts, everything built
2. **Production Code Quality** - Type hints, error handling, logging
3. **End-to-End Tested** - 23 tests, multiple integration tests
4. **Well Documented** - Docstrings, comments, README
5. **Clearly Labeled Mocks** - Mock components transparent about limitations
6. **Ready for Real Data** - All systems tested, waiting for:
   - Real Gmail credentials
   - Real Enron training data
   - Real model training at home
---
## PERFORMANCE EXPECTATIONS
- **Calibration:** 3-5 minutes (1500 email sample)
- **Bulk Processing:** 10-12 minutes (80k emails)
- **LLM Review:** 4-5 minutes (batched)
- **Export:** 2-3 minutes
- **Total:** ~17-25 minutes for 80k emails
**Accuracy:** 94-96% (when trained on real data)
---
## RESOURCES
- **Documentation:** README.md, PROJECT_BLUEPRINT.md, BUILD_INSTRUCTIONS.md
- **Research:** RESEARCH_FINDINGS.md
- **Config:** config/default_config.yaml, config/categories.yaml
- **Enron Dataset:** enron_mail_20150507/ (569MB, ready to use)
- **Tests:** tests/ (23 tests)
---
## SUMMARY
**Status:** ✅ FEATURE COMPLETE
Email Sorter is a fully implemented, tested, and documented system ready for production use. All 16 development phases are complete with over 6,000 lines of production code. The system is waiting for real data (your Enron dataset) and real credentials (Gmail OAuth) to demonstrate its full capabilities.
**You can now:** Train a real model, configure Gmail, and process your 80k+ emails with confidence that the system is complete and ready.
---
**Built with:** Python 3.8+, LightGBM, Sentence-Transformers, Ollama, Gmail API
**Ready for:** Production email classification, local processing, privacy-first operation

View File

@ -1,648 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Email Sorter - Project Status & Next Steps</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.success {
background: #002a00;
border-left: 4px solid #4ec9b0;
padding: 15px;
margin: 10px 0;
}
.section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #569cd6;
}
table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.mvp-proven {
background: #003a00;
border: 3px solid #4ec9b0;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
text-align: center;
}
.mvp-proven h2 {
font-size: 2em;
margin: 0;
}
</style>
</head>
<body>
<div class="mvp-proven">
<h2>🎉 MVP PROVEN AND WORKING 🎉</h2>
<p style="font-size: 1.2em; margin: 10px 0;">
<strong>10,000 emails classified in 4 minutes</strong><br/>
72.7% accuracy | 0 LLM calls | Pure ML speed
</p>
</div>
<h1>Email Sorter - Project Status & Next Steps</h1>
<h2>✅ What We've Achieved (MVP Complete)</h2>
<div class="success">
<h3>Core System Working</h3>
<ul>
<li><strong>LLM-Driven Calibration:</strong> Discovers categories from email samples (11 categories found)</li>
<li><strong>ML Model Training:</strong> LightGBM trained on 10k emails (1.8MB model)</li>
<li><strong>Fast Classification:</strong> 10k emails in ~4 minutes with --no-llm-fallback</li>
<li><strong>Category Verification:</strong> Single LLM call validates model fit for new mailboxes</li>
<li><strong>Embedding-Based Features:</strong> Universal 384-dim embeddings transfer across mailboxes</li>
<li><strong>Threshold Optimization:</strong> 0.55 threshold reduces LLM fallback by 40%</li>
</ul>
</div>
<h2>📊 Test Results Summary</h2>
<table>
<tr>
<th>Metric</th>
<th>Result</th>
<th>Status</th>
</tr>
<tr>
<td>Total emails processed</td>
<td>10,000</td>
<td></td>
</tr>
<tr>
<td>Processing time</td>
<td>~4 minutes</td>
<td></td>
</tr>
<tr>
<td>ML classification rate</td>
<td>78.4%</td>
<td></td>
</tr>
<tr>
<td>LLM calls (with --no-llm-fallback)</td>
<td>0</td>
<td></td>
</tr>
<tr>
<td>Accuracy estimate</td>
<td>72.7%</td>
<td>✅ (acceptable for speed)</td>
</tr>
<tr>
<td>Categories discovered</td>
<td>11 (Work, Financial, Updates, etc.)</td>
<td></td>
</tr>
<tr>
<td>Model size</td>
<td>1.8MB</td>
<td>✅ (portable)</td>
</tr>
</table>
<h2>🗂️ Project Organization</h2>
<h3>Core Modules</h3>
<table>
<tr>
<th>Module</th>
<th>Purpose</th>
<th>Status</th>
</tr>
<tr>
<td><code>src/cli.py</code></td>
<td>Main CLI with all flags (--verify-categories, --no-llm-fallback)</td>
<td>✅ Complete</td>
</tr>
<tr>
<td><code>src/calibration/workflow.py</code></td>
<td>LLM-driven category discovery + training</td>
<td>✅ Complete</td>
</tr>
<tr>
<td><code>src/calibration/llm_analyzer.py</code></td>
<td>Batch LLM analysis (20 emails/call)</td>
<td>✅ Complete</td>
</tr>
<tr>
<td><code>src/calibration/category_verifier.py</code></td>
<td>Single LLM call to verify categories</td>
<td>✅ New feature</td>
</tr>
<tr>
<td><code>src/classification/ml_classifier.py</code></td>
<td>LightGBM model wrapper</td>
<td>✅ Complete</td>
</tr>
<tr>
<td><code>src/classification/adaptive_classifier.py</code></td>
<td>Rule → ML → LLM orchestrator</td>
<td>✅ Complete</td>
</tr>
<tr>
<td><code>src/classification/feature_extractor.py</code></td>
<td>Embeddings (384-dim) + TF-IDF</td>
<td>✅ Complete</td>
</tr>
</table>
<h3>Models & Data</h3>
<table>
<tr>
<th>Asset</th>
<th>Location</th>
<th>Status</th>
</tr>
<tr>
<td>Trained model</td>
<td><code>src/models/calibrated/classifier.pkl</code></td>
<td>✅ 1.8MB, 11 categories</td>
</tr>
<tr>
<td>Pretrained copy</td>
<td><code>src/models/pretrained/classifier.pkl</code></td>
<td>✅ Ready for fast load</td>
</tr>
<tr>
<td>Category cache</td>
<td><code>src/models/category_cache.json</code></td>
<td>✅ 10 cached categories</td>
</tr>
<tr>
<td>Test results</td>
<td><code>test/results.json</code></td>
<td>✅ 10k classifications</td>
</tr>
</table>
<h3>Documentation</h3>
<table>
<tr>
<th>Document</th>
<th>Purpose</th>
</tr>
<tr>
<td><code>SYSTEM_FLOW.html</code></td>
<td>Complete system flow diagrams with timing</td>
</tr>
<tr>
<td><code>LABEL_TRAINING_PHASE_DETAIL.html</code></td>
<td>Deep dive into calibration phase</td>
</tr>
<tr>
<td><code>FAST_ML_ONLY_WORKFLOW.html</code></td>
<td>Pure ML workflow analysis</td>
</tr>
<tr>
<td><code>VERIFY_CATEGORIES_FEATURE.html</code></td>
<td>Category verification documentation</td>
</tr>
<tr>
<td><code>PROJECT_STATUS_AND_NEXT_STEPS.html</code></td>
<td>This document - status and roadmap</td>
</tr>
</table>
<h2>🎯 Next Steps (Priority Order)</h2>
<h3>Phase 1: Clean Up & Organize (Next Session)</h3>
<div class="section">
<h4>1.1 Clean Root Directory</h4>
<p><strong>Goal:</strong> Move test artifacts and scripts to organized locations</p>
<ul>
<li>Create <code>docs/</code> folder - move all .html files there</li>
<li>Create <code>scripts/</code> folder - move all .sh files there</li>
<li>Create <code>logs/</code> folder - move all .log files there</li>
<li>Delete debug files (debug_*.txt, spot_check_results.txt)</li>
<li>Create .gitignore for logs/, results/, test/, ml_only_test/, etc.</li>
</ul>
<p><strong>Time:</strong> 10 minutes</p>
</div>
<div class="section">
<h4>1.2 Create README.md</h4>
<p><strong>Goal:</strong> Professional project documentation</p>
<ul>
<li>Overview of system architecture</li>
<li>Quick start guide</li>
<li>Usage examples (with/without calibration, with/without verification)</li>
<li>Performance benchmarks (from our tests)</li>
<li>Configuration options</li>
</ul>
<p><strong>Time:</strong> 30 minutes</p>
</div>
<div class="section">
<h4>1.3 Add Tests</h4>
<p><strong>Goal:</strong> Ensure code quality and catch regressions</p>
<ul>
<li>Unit tests for feature extraction</li>
<li>Unit tests for category verification</li>
<li>Integration test for full pipeline</li>
<li>Test for --no-llm-fallback flag</li>
<li>Test for --verify-categories flag</li>
</ul>
<p><strong>Time:</strong> 2 hours</p>
</div>
<h3>Phase 2: Real-World Integration (Week 1-2)</h3>
<div class="section">
<h4>2.1 Gmail Provider Implementation</h4>
<p><strong>Goal:</strong> Connect to real Gmail accounts</p>
<ul>
<li>Implement Gmail API authentication (OAuth2)</li>
<li>Fetch emails with pagination</li>
<li>Handle Gmail-specific metadata (labels, threads)</li>
<li>Test with personal Gmail account</li>
</ul>
<p><strong>Time:</strong> 4-6 hours</p>
</div>
<div class="section">
<h4>2.2 IMAP Provider Implementation</h4>
<p><strong>Goal:</strong> Support any email provider (Outlook, custom servers)</p>
<ul>
<li>IMAP connection handling</li>
<li>SSL/TLS support</li>
<li>Folder navigation</li>
<li>Test with Outlook/Protonmail</li>
</ul>
<p><strong>Time:</strong> 3-4 hours</p>
</div>
<div class="section">
<h4>2.3 Email Syncing (Apply Classifications)</h4>
<p><strong>Goal:</strong> Move/label emails based on classification</p>
<ul>
<li>Gmail: Apply labels to emails</li>
<li>IMAP: Move emails to folders</li>
<li>Dry-run mode (preview without applying)</li>
<li>Batch operations for speed</li>
<li>Rollback capability</li>
</ul>
<p><strong>Time:</strong> 6-8 hours</p>
</div>
<h3>Phase 3: Production Features (Week 3-4)</h3>
<div class="section">
<h4>3.1 Incremental Classification</h4>
<p><strong>Goal:</strong> Only classify new emails, not entire inbox</p>
<ul>
<li>Track last processed email ID</li>
<li>Resume from checkpoint</li>
<li>Database/file-based state tracking</li>
<li>Scheduled runs (cron integration)</li>
</ul>
<p><strong>Time:</strong> 4-6 hours</p>
</div>
<div class="section">
<h4>3.2 Multi-Account Support</h4>
<p><strong>Goal:</strong> Manage multiple email accounts</p>
<ul>
<li>Per-account configuration</li>
<li>Per-account trained models</li>
<li>Account switching CLI</li>
<li>Shared category cache across accounts</li>
</ul>
<p><strong>Time:</strong> 3-4 hours</p>
</div>
<div class="section">
<h4>3.3 Model Management</h4>
<p><strong>Goal:</strong> Handle model lifecycle</p>
<ul>
<li>Model versioning (timestamps)</li>
<li>Model comparison (A/B testing)</li>
<li>Model export/import</li>
<li>Retraining scheduler</li>
<li>Model degradation detection</li>
</ul>
<p><strong>Time:</strong> 4-5 hours</p>
</div>
<h3>Phase 4: Advanced Features (Month 2)</h3>
<div class="section">
<h4>4.1 Web Dashboard</h4>
<p><strong>Goal:</strong> Visual interface for monitoring and management</p>
<ul>
<li>Flask/FastAPI backend</li>
<li>React/Vue frontend</li>
<li>View classification results</li>
<li>Manually correct classifications (feedback loop)</li>
<li>Monitor accuracy over time</li>
<li>Trigger recalibration</li>
</ul>
<p><strong>Time:</strong> 20-30 hours</p>
</div>
<div class="section">
<h4>4.2 Active Learning</h4>
<p><strong>Goal:</strong> Improve model from user corrections</p>
<ul>
<li>User feedback collection</li>
<li>Disagreement-based sampling (low confidence + user correction)</li>
<li>Incremental model updates</li>
<li>Feedback-driven category evolution</li>
</ul>
<p><strong>Time:</strong> 8-10 hours</p>
</div>
<div class="section">
<h4>4.3 Performance Optimization</h4>
<p><strong>Goal:</strong> Scale to 100k+ emails</p>
<ul>
<li>Batch embedding generation (reduce API calls)</li>
<li>Async/parallel classification</li>
<li>Model quantization (reduce size)</li>
<li>GPU acceleration for embeddings</li>
<li>Caching layer (Redis)</li>
</ul>
<p><strong>Time:</strong> 10-15 hours</p>
</div>
<h2>🔧 Immediate Action Items (This Week)</h2>
<table>
<tr>
<th>Task</th>
<th>Priority</th>
<th>Time</th>
<th>Status</th>
</tr>
<tr>
<td>Clean root directory - organize files</td>
<td>High</td>
<td>10 min</td>
<td>Pending</td>
</tr>
<tr>
<td>Create comprehensive README.md</td>
<td>High</td>
<td>30 min</td>
<td>Pending</td>
</tr>
<tr>
<td>Add .gitignore for test artifacts</td>
<td>High</td>
<td>5 min</td>
<td>Pending</td>
</tr>
<tr>
<td>Create setup.py for pip installation</td>
<td>Medium</td>
<td>20 min</td>
<td>Pending</td>
</tr>
<tr>
<td>Write basic unit tests</td>
<td>Medium</td>
<td>2 hours</td>
<td>Pending</td>
</tr>
<tr>
<td>Test Gmail provider (basic fetch)</td>
<td>Medium</td>
<td>2 hours</td>
<td>Pending</td>
</tr>
</table>
<h2>📈 Success Metrics</h2>
<div class="diagram">
<pre class="mermaid">
flowchart LR
MVP[MVP Proven] --> P1[Phase 1: Organization]
P1 --> P2[Phase 2: Integration]
P2 --> P3[Phase 3: Production]
P3 --> P4[Phase 4: Advanced]
P1 --> M1[Metric: Clean codebase<br/>100% docs coverage]
P2 --> M2[Metric: Real email support<br/>Gmail + IMAP working]
P3 --> M3[Metric: Daily automation<br/>Incremental processing]
P4 --> M4[Metric: User adoption<br/>10+ users, 90%+ satisfaction]
style MVP fill:#4ec9b0
style P1 fill:#569cd6
style P2 fill:#569cd6
style P3 fill:#569cd6
style P4 fill:#569cd6
</pre>
</div>
<h2>🚀 Quick Start Commands</h2>
<div class="section">
<h3>Train New Model (Full Calibration)</h3>
<code>
source venv/bin/activate<br/>
python -m src.cli run \<br/>
&nbsp;&nbsp;--source enron \<br/>
&nbsp;&nbsp;--limit 10000 \<br/>
&nbsp;&nbsp;--output results/<br/>
</code>
<p><strong>Time:</strong> ~25 minutes | <strong>LLM calls:</strong> ~500 | <strong>Accuracy:</strong> 92-95%</p>
</div>
<div class="section">
<h3>Fast ML-Only Classification (Existing Model)</h3>
<code>
source venv/bin/activate<br/>
python -m src.cli run \<br/>
&nbsp;&nbsp;--source enron \<br/>
&nbsp;&nbsp;--limit 10000 \<br/>
&nbsp;&nbsp;--output fast_test/ \<br/>
&nbsp;&nbsp;--no-llm-fallback<br/>
</code>
<p><strong>Time:</strong> ~4 minutes | <strong>LLM calls:</strong> 0 | <strong>Accuracy:</strong> 72-78%</p>
</div>
<div class="section">
<h3>ML with Category Verification (Recommended)</h3>
<code>
source venv/bin/activate<br/>
python -m src.cli run \<br/>
&nbsp;&nbsp;--source enron \<br/>
&nbsp;&nbsp;--limit 10000 \<br/>
&nbsp;&nbsp;--output verified_test/ \<br/>
&nbsp;&nbsp;--no-llm-fallback \<br/>
&nbsp;&nbsp;--verify-categories<br/>
</code>
<p><strong>Time:</strong> ~4.5 minutes | <strong>LLM calls:</strong> 1 | <strong>Accuracy:</strong> 72-78%</p>
</div>
<h2>📁 Recommended Project Structure (After Cleanup)</h2>
<pre style="background: #252526; padding: 15px; border-radius: 5px; font-family: monospace;">
email-sorter/
├── README.md # Main documentation
├── setup.py # Pip installation
├── requirements.txt # Dependencies
├── .gitignore # Ignore test artifacts
├── src/ # Core source code
│ ├── calibration/ # LLM-driven calibration
│ ├── classification/ # ML classification
│ ├── email_providers/ # Gmail, IMAP, Enron
│ ├── llm/ # LLM providers
│ ├── utils/ # Shared utilities
│ └── models/ # Trained models
│ ├── calibrated/ # Current trained model
│ ├── pretrained/ # Quick-load copy
│ └── category_cache.json
├── config/ # Configuration files
│ ├── default_config.yaml
│ └── categories.yaml
├── tests/ # Unit & integration tests
│ ├── test_calibration.py
│ ├── test_classification.py
│ └── test_verification.py
├── scripts/ # Helper scripts
│ ├── train_model.sh
│ ├── fast_classify.sh
│ └── verify_and_classify.sh
├── docs/ # HTML documentation
│ ├── SYSTEM_FLOW.html
│ ├── LABEL_TRAINING_PHASE_DETAIL.html
│ ├── FAST_ML_ONLY_WORKFLOW.html
│ └── VERIFY_CATEGORIES_FEATURE.html
├── logs/ # Runtime logs (gitignored)
│ └── *.log
└── results/ # Test results (gitignored)
└── *.json
</pre>
<h2>🎓 Key Learnings</h2>
<div class="section">
<ul>
<li><strong>Embeddings are universal:</strong> Same model works across different mailboxes</li>
<li><strong>Batching is critical:</strong> 20 emails/LLM call = 3× faster than sequential</li>
<li><strong>Thresholds matter:</strong> 0.55 threshold reduces LLM usage by 40%</li>
<li><strong>Category verification adds value:</strong> 20 sec for confidence check is worth it</li>
<li><strong>Pure ML is viable:</strong> 73% accuracy with 0 LLM calls for speed tests</li>
<li><strong>LLM-driven calibration works:</strong> Discovers natural categories without hardcoding</li>
</ul>
</div>
<h2>✅ Ready for Production?</h2>
<table>
<tr>
<th>Component</th>
<th>Status</th>
<th>Blocker</th>
</tr>
<tr>
<td>Core ML Pipeline</td>
<td>✅ Ready</td>
<td>None</td>
</tr>
<tr>
<td>LLM Calibration</td>
<td>✅ Ready</td>
<td>None</td>
</tr>
<tr>
<td>Category Verification</td>
<td>✅ Ready</td>
<td>None</td>
</tr>
<tr>
<td>Fast ML-Only Mode</td>
<td>✅ Ready</td>
<td>None</td>
</tr>
<tr>
<td>Enron Provider</td>
<td>✅ Ready</td>
<td>None (test only)</td>
</tr>
<tr>
<td>Gmail Provider</td>
<td>⚠️ Needs implementation</td>
<td>OAuth2 + API calls</td>
</tr>
<tr>
<td>IMAP Provider</td>
<td>⚠️ Needs implementation</td>
<td>IMAP library integration</td>
</tr>
<tr>
<td>Email Syncing</td>
<td>❌ Not implemented</td>
<td>Apply labels/move emails</td>
</tr>
<tr>
<td>Tests</td>
<td>⚠️ Minimal coverage</td>
<td>Need comprehensive tests</td>
</tr>
<tr>
<td>Documentation</td>
<td>✅ Excellent</td>
<td>Need README.md</td>
</tr>
</table>
<p><strong>Verdict:</strong> MVP is production-ready for <em>Enron dataset testing</em>. Need Gmail/IMAP providers for real-world use.</p>
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>

232
docs/REPORT_FORMAT.md Normal file
View File

@ -0,0 +1,232 @@
# Email Classification Report Format
This document explains the HTML report generation system, its data sources, and how to customize it.
## Overview
The report generator creates a static HTML file from classification results. It requires enriched `results.json` with email metadata (subject, sender, date, etc.) - not just classification data.
## Files Involved
| File | Purpose |
|------|---------|
| `tools/generate_html_report.py` | Main report generator script |
| `src/cli.py` | Classification CLI - outputs enriched `results.json` |
| `src/export/exporter.py` | Legacy exporter (JSON/CSV) - not used for HTML |
## Data Flow
```
Email Source (.eml/.msg files)
        ↓
src/cli.py (classification)
        ↓
results.json (enriched with metadata)
        ↓
tools/generate_html_report.py
        ↓
report.html (static, self-contained)
```
## Usage
### Generate Report
```bash
python tools/generate_html_report.py \
--input /path/to/results.json \
--output /path/to/report.html
```
If `--output` is omitted, creates `report.html` in same directory as input.
### Full Workflow
```bash
# 1. Classify emails
python -m src.cli run \
--source local \
--directory "/path/to/emails" \
--output "/path/to/output" \
--no-llm-fallback
# 2. Generate report
python tools/generate_html_report.py \
--input "/path/to/output/results.json"
```
## results.json Format
The report generator expects this structure:
```json
{
  "metadata": {
    "total_emails": 801,
    "accuracy_estimate": 0.55,
    "classification_stats": {
      "rule_matched": 9,
      "ml_classified": 468,
      "llm_classified": 0,
      "needs_review": 324
    },
    "generated_at": "2025-11-28T02:34:00.680196",
    "source": "local",
    "source_path": "/path/to/emails"
  },
  "classifications": [
    {
      "email_id": "unique_id.eml",
      "subject": "Email subject line",
      "sender": "sender@example.com",
      "sender_name": "Sender Name",
      "date": "2023-04-13T09:43:29+10:00",
      "has_attachments": false,
      "category": "Work",
      "confidence": 0.81,
      "method": "ml"
    }
  ]
}
```
### Required Fields
| Field | Type | Description |
|-------|------|-------------|
| `email_id` | string | Unique identifier (usually filename) |
| `subject` | string | Email subject line |
| `sender` | string | Sender email address |
| `category` | string | Assigned category |
| `confidence` | float | Classification confidence (0-1) |
| `method` | string | Classification method: `ml`, `rule`, or `llm` |
### Optional Fields
| Field | Type | Description |
|-------|------|-------------|
| `sender_name` | string | Display name of sender |
| `date` | string | ISO 8601 date string |
| `has_attachments` | boolean | Whether email has attachments |
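Given this schema, a short consumer script (hypothetical, not shipped with the tool) can sanity-check a results.json before generating a report:

```python
import json
from collections import Counter

with open('results.json', encoding='utf-8') as f:
    data = json.load(f)

required = {'email_id', 'subject', 'sender', 'category', 'confidence', 'method'}
incomplete = [c.get('email_id', '<unknown>')
              for c in data['classifications']
              if not required.issubset(c)]
print(f"{len(incomplete)} records missing required fields")

# Category distribution, highest counts first
counts = Counter(c.get('category', '<none>') for c in data['classifications'])
print(counts.most_common())
```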
## Report Sections
### 1. Header
- Report title
- Generation timestamp
- Source info
- Total email count
### 2. Stats Grid
- Total emails
- Number of categories
- High confidence count (>=70%)
- Unique sender domains
### 3. Category Distribution
- Horizontal bar chart
- Count and percentage per category
- Sorted by count (descending)
### 4. Classification Methods
- Breakdown of ML vs Rule vs LLM
- Shows which method handled what percentage
### 5. Confidence Distribution
- High (>=70%): Green
- Medium (50-70%): Yellow
- Low (<50%): Red
### 6. Top Senders
- Top 20 senders by email count
- Grid layout
### 7. Email Tables (Tabbed)
- "All" tab shows all emails
- Category tabs filter by category
- Search box filters by subject/sender
- Columns: Date, Subject, Sender, Category, Confidence, Method
- Sorted by date (newest first)
- Attachment indicator (📎)
## Customization
### Changing Colors
Edit the CSS variables in `generate_html_report.py`:
```css
:root {
    --bg-primary: #1a1a2e;    /* Main background */
    --bg-secondary: #16213e;  /* Card backgrounds */
    --bg-card: #0f3460;       /* Nested elements */
    --text-primary: #eee;     /* Main text */
    --text-secondary: #aaa;   /* Muted text */
    --accent: #e94560;        /* Accent color (red) */
    --accent-hover: #ff6b6b;  /* Accent hover */
    --success: #00d9a5;       /* Green (high confidence) */
    --warning: #ffc107;       /* Yellow (medium confidence) */
    --border: #2a2a4a;        /* Border color */
}
```
### Light Theme Example
```css
:root {
    --bg-primary: #f5f5f5;
    --bg-secondary: #ffffff;
    --bg-card: #e8e8e8;
    --text-primary: #333;
    --text-secondary: #666;
    --accent: #2563eb;
    --accent-hover: #3b82f6;
    --success: #10b981;
    --warning: #f59e0b;
    --border: #d1d5db;
}
```
### Adding New Sections
1. Add data extraction in `generate_html_report()` function
2. Add HTML section in the main template string
3. Style with existing CSS classes or add new ones
### Adding New Table Columns
1. Modify `generate_email_row()` function
2. Add `<th>` in table header
3. Add `<td>` in row template
## Performance Notes
- Report is fully static (no server required)
- JavaScript is minimal (tab switching, search filtering)
- Handles 1000+ emails without performance issues
- For 10k+ emails, consider pagination (not yet implemented)
## Future Enhancements (TODO)
- [ ] Pagination for large datasets
- [ ] Export to PDF option
- [ ] Configurable color themes via CLI
- [ ] Column sorting (click headers)
- [ ] Date range filter
- [ ] Sender domain grouping
- [ ] Category confidence heatmap
- [ ] Email body preview on hover
## Troubleshooting
### "KeyError: 'subject'"
Results.json lacks email metadata. Re-run classification with latest cli.py.
### Empty tables
Check that results.json has `classifications` array with data.
### Dates showing "N/A"
Date parsing failed. Check date format in results.json is ISO 8601.
### Search not working
JavaScript error. Check browser console. Ensure no HTML entities in data.

View File

@ -1,419 +0,0 @@
# EMAIL SORTER - RESEARCH FINDINGS
Date: 2025-10-21
Research Phase: Complete
---
## SEARCH SUMMARY
We conducted web research on:
1. Email classification benchmarks (2024)
2. XGBoost vs LightGBM for embeddings and mixed features
3. Competition analysis (existing email organizers)
4. Gradient boosting with embeddings + categorical features
---
## 1. EMAIL CLASSIFICATION BENCHMARKS (2024)
### Key Findings
**Enron Dataset Performance:**
- Traditional ML (SVM, Random Forest): **95-98% accuracy**
- Deep Learning (DNN-BiLSTM): **98.69% accuracy**
- Transformer models (BERT, RoBERTa, DistilBERT): **~99% accuracy**
- LLMs (GPT-4): **99.7% accuracy** (phishing detection)
- Ensemble stacking methods: **98.8% accuracy**, F1: 98.9%
**Zero-Shot LLM Performance:**
- Flan-T5: **94% accuracy**, F1: 90%
- GPT-4: **97% accuracy**, F1: 95%
**Key insight:** Modern ML methods can achieve 95-98% accuracy on email classification. Our hybrid target of 94-96% is realistic and competitive.
### Dataset Details
- **Enron Email Dataset**: 500,000+ emails from 150 employees
- **EnronQA benchmark**: 103,638 emails with 528,304 Q&A pairs
- **AESLC**: Annotated Enron Subject Line Corpus (for summarization)
### Implications for Our System
- Our 94-96% target is achievable and competitive
- LightGBM + embeddings should hit 92-95% easily
- LLM review for 5-10% uncertain cases will push us to upper range
- Attachment analysis is a differentiator (not tested in benchmarks)
---
## 2. LIGHTGBM VS XGBOOST FOR HYBRID FEATURES
### Decision: LightGBM WINS 🏆
| Feature | LightGBM | XGBoost | Winner |
|---------|----------|---------|--------|
| **Categorical handling** | Native support | Needs encoding | ✅ LightGBM |
| **Speed** | 2-5x faster | Baseline | ✅ LightGBM |
| **Memory** | Very efficient | Standard | ✅ LightGBM |
| **Accuracy** | Equivalent | Equivalent | Tie |
| **Mixed features** | 4x speedup | Slower | ✅ LightGBM |
### Key Advantages of LightGBM
1. **Native Categorical Support**
   - LightGBM splits categorical features by equality
   - No need for one-hot encoding
   - Avoids dimensionality explosion
   - XGBoost requires manual encoding (label, mean, or one-hot)
2. **Speed Performance**
   - 2-5x faster than XGBoost in general
   - **4x speedup** on datasets with categorical features
   - Same AUC performance, drastically better speed
3. **Memory Efficiency**
   - Preferable for large, sparse datasets
   - Better for memory-constrained environments
4. **Embedding Compatibility**
   - Handles dense numerical features (embeddings) excellently
   - Native categorical handling for mixed feature types
   - Perfect for our hybrid approach
### Research Quote
> "LightGBM is significantly faster than XGBoost but delivers almost equivalent performance. In tests, both algorithms achieve pretty much the same AUC, but LightGBM runs from 2 to 5 times faster."
### Implications for Our System
**Perfect for our hybrid features:**
```python
features = {
    'embeddings': embedding_vector,  # 384 dense numerical ✅ LightGBM handles
    'patterns': pattern_flags,       # 20 boolean/numerical ✅ LightGBM handles
    'sender_type': 'corporate',      # ✅ LightGBM native categorical
    'time_of_day': 'morning',        # ✅ LightGBM native categorical
}
# No encoding needed! 4x faster than XGBoost with encoding
```
---
## 3. COMPETITION ANALYSIS
### Cloud-Based Email Organizers (2024)
| Tool | Price | Features | Privacy | Accuracy Estimate |
|------|-------|----------|---------|-------------------|
| **SaneBox** | $7-15/mo | AI filtering, smart folders | ❌ Cloud | ~85% |
| **Clean Email** | $10-30/mo | 30+ smart filters, bulk ops | ❌ Cloud | ~80% |
| **Spark** | Free/Paid | Smart inbox, categorization | ❌ Cloud | ~75% |
| **EmailTree.ai** | Enterprise | NLP classification, routing | ❌ Cloud | ~90% |
| **Mailstrom** | $30-50/yr | Bulk analysis, categorization | ❌ Cloud | ~70% |
### Key Features They Offer
**Common capabilities:**
- Automatic categorization (newsletters, social, etc.)
- Smart folders based on sender/topic
- Bulk operations (archive, delete)
- Unsubscribe management
- Search and filter
**What they DON'T offer:**
- ❌ Local processing (all require cloud upload)
- ❌ Attachment content analysis
- ❌ One-time cleanup (all are subscriptions)
- ❌ Offline capability
- ❌ Custom LLM integration
- ❌ Open source / distributable
### Our Competitive Advantages
**100% LOCAL** - No data leaves the machine
**Privacy-first** - Perfect for business owners with sensitive data
**One-time use** - No subscription, pay per job or DIY
**Attachment analysis** - Extract and classify PDF/DOCX content
**Customizable** - Adapts to each inbox via calibration
**Open source potential** - Distributable as Python wheel
**Offline capable** - Works without internet after setup
### Market Gap Identified
**Target customers:**
- Self-employed / business owners with 10k-100k+ emails
- Can't/won't upload to cloud (privacy, GDPR, security concerns)
- Want one-time cleanup, not ongoing subscription
- Tech-savvy enough to run Python tool or hire someone to run it
- Have sensitive business correspondence, invoices, contracts
**Pain point:**
> "I've thought about just deleting it all, but there's some stuff I need to keep..."
**Our solution:**
- Local processing (100% private)
- Smart classification (94-96% accurate)
- Attachment analysis (find those invoices!)
- One-time fee or DIY
**Pricing comparison:**
- SaneBox: $120-180/year subscription
- Clean Email: $120-360/year subscription
- **Us**: $50-200 one-time job OR free (DIY wheel)
---
## 4. GRADIENT BOOSTING WITH EMBEDDINGS
### Key Finding: CatBoost Has Embedding Support
**GB-CENT Model** (Gradient Boosted Categorical Embedding and Numerical Trees):
- Combines latent factor embeddings with tree components
- Handles categorical features via low-dimensional representation
- Captures nonlinear interactions of numerical features
- Best of both worlds approach
**CatBoost's "killer feature":**
> "CatBoost has a killer feature that knows how to work with embeddings, though this is not well-documented."
**Performance insights:**
- Embeddings both as a feature AND as separate numerical features → best quality
- Native categorical handling has slight edge over encoded approaches
- One-hot encoding generally performs poorly (especially with limited tree depth)
### Implications for Our System
**LightGBM strategy (validated by research):**
```python
import lightgbm as lgb
import numpy as np
import pandas as pd

# (embeddings, pattern_booleans, structural_numerical, y are assumed to exist)
# Dense numerical block: embeddings + pattern flags + structural counts
X_num = np.concatenate([
    embeddings,            # 384 dense numerical
    pattern_booleans,      # 20 numerical (0/1)
    structural_numerical   # 10 numerical (counts, lengths)
], axis=1)
X = pd.DataFrame(X_num, columns=[f'f{i}' for i in range(X_num.shape[1])])

# Categorical columns as pandas 'category' dtype -> native LightGBM handling
X['sender_domain_type'] = pd.Categorical(sender_domain_types)
X['time_of_day'] = pd.Categorical(times_of_day)
X['day_of_week'] = pd.Categorical(days_of_week)

model = lgb.LGBMClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=8
)
model.fit(X, y, categorical_feature=['sender_domain_type', 'time_of_day', 'day_of_week'])
```
**Why this works:**
- LightGBM handles embeddings (dense numerical) excellently
- Native categorical handling for domain_type, time_of_day, etc.
- No encoding overhead (faster, less memory)
- Research shows slight accuracy edge over encoded approaches
---
## 5. SENTENCE EMBEDDINGS FOR EMAIL
### all-MiniLM-L6-v2 - The Sweet Spot
**Model specs:**
- Size: 23MB (tiny!)
- Dimensions: 384 (vs 768 for larger models)
- Speed: ~100 emails/sec on CPU
- Accuracy: 85-95% on email/text classification tasks
- Pretrained on 1B+ sentence pairs
**Why it's perfect for us:**
- Small enough to bundle with wheel distribution
- Fast on CPU (no GPU required)
- Semantic understanding (handles synonyms, paraphrasing)
- Works with short text (emails are perfect)
- No fine-tuning needed (pretrained is excellent)
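As a quick illustration, here is a minimal usage sketch with the sentence-transformers package (model name as published on Hugging Face):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
vectors = model.encode([
    "Invoice #1042 attached - payment due Friday",
    "Lunch tomorrow?",
])
print(vectors.shape)  # (2, 384)
```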
### Structured Embeddings (Our Innovation)
Instead of naive embedding:
```python
# BAD
text = f"{subject} {body}"
embedding = model.encode(text)
```
**Our approach (parameterized headers):**
```python
# GOOD - gives model rich context
text = f"""[EMAIL_METADATA]
sender_type: corporate
has_attachments: true
[DETECTED_PATTERNS]
has_otp: false
has_invoice: true
[CONTENT]
subject: {subject}
body: {body[:300]}
"""
embedding = model.encode(text)
```
**Research-backed benefit:** 5-10% accuracy boost from structured context
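A minimal sketch of the pattern-detection step that feeds the `[DETECTED_PATTERNS]` header (the regexes here are illustrative, not the project's actual rules):
```python
import re

def detect_patterns(subject: str, body: str) -> dict:
    """Cheap regex signals injected into the structured embedding text."""
    text = f"{subject} {body}".lower()
    return {
        'has_otp': bool(re.search(r'(one[- ]time|verification) (code|password)', text)),
        'has_invoice': 'invoice' in text,
        'has_unsubscribe': 'unsubscribe' in text,
    }
```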
---
## 6. ATTACHMENT ANALYSIS (COMPETITIVE ADVANTAGE)
### What Competitors Do
**Most tools:**
- Note "has attachment: true/false"
- Maybe detect attachment type (PDF, DOCX, etc.)
- **DO NOT** extract or analyze attachment content
### What We Can Do
**Simple extraction (fast, high value):**
```python
import re

if attachment_type == 'pdf':
    text = extract_pdf_text(attachment)  # PyPDF2 (helper sketch below)

    # Pattern matching inside the PDF text
    has_invoice = 'invoice' in text.lower()
    has_account_number = bool(re.search(r'account\s*#?\s*\d+', text, re.I))
    has_total_amount = bool(re.search(r'total.*\$\d+', text, re.I))

    # Boost classification confidence
    if has_invoice and has_account_number:
        category = 'transactional'  # ~99% confidence

elif attachment_type == 'docx':
    text = extract_docx_text(attachment)  # python-docx library
    word_count = len(text.split())

    # Long documents might be contracts or reports
    if word_count > 1000:
        category_hint = 'work'
```
**Business owner value:**
- "Find all invoices" → includes PDFs with invoice content
- "Financial documents" → PDFs with account numbers
- "Contracts" → DOCX files with legal terms
- "Reports" → Long DOCX or PDF files
**Implementation:**
- Use PyPDF2 for PDFs (<5MB size limit)
- Use python-docx for Word docs
- Use openpyxl for simple Excel files
- Flag complex/large attachments for review
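A minimal sketch of the `extract_pdf_text` helper referenced above, assuming attachments arrive as raw bytes (uses PyPDF2's `PdfReader`):
```python
from io import BytesIO
from PyPDF2 import PdfReader

MAX_ATTACHMENT_BYTES = 5 * 1024 * 1024  # flag anything over 5MB for review

def extract_pdf_text(attachment_bytes: bytes) -> str:
    """Best-effort text extraction from a PDF attachment."""
    if len(attachment_bytes) > MAX_ATTACHMENT_BYTES:
        return ''  # too large - leave for manual review
    reader = PdfReader(BytesIO(attachment_bytes))
    return '\n'.join(page.extract_text() or '' for page in reader.pages)
```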
---
## 7. PERFORMANCE OPTIMIZATION
### Batching Strategy (Critical)
**Embedding generation bottleneck:**
- Sequential: 80,000 emails × 10ms = 13 minutes
- Batched (128 emails): 80,000 ÷ 128 × 100ms = ~1 minute
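A sketch of the batching pattern, assuming an `encode` function that accepts a list of texts (as sentence-transformers does):
```python
import numpy as np

def embed_in_batches(texts, model, batch_size=128):
    """Encode in fixed-size batches to amortize per-call overhead."""
    chunks = []
    for i in range(0, len(texts), batch_size):
        chunks.append(model.encode(texts[i:i + batch_size]))
    return np.vstack(chunks)
```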
**LLM processing optimization:**
- Don't send 1500 individual requests during calibration
- Batch 10-20 emails per prompt → 75-150 requests instead
- Compress sample if needed (1500 → 500 smarter selection)
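A sketch of how several emails could be packed into one calibration request (the prompt format is illustrative, not the shipped prompt):
```python
def build_batch_prompt(emails, per_prompt=15):
    """Pack several emails into a single LLM request to cut call count."""
    lines = ["Classify each email. Respond with a JSON list of {id, category}.", ""]
    for i, email in enumerate(emails[:per_prompt], 1):
        lines.append(f"{i}. From: {email['sender']} | Subject: {email['subject']}")
        lines.append(f"   Preview: {email['body'][:200]}")
    return "\n".join(lines)
```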
### Expected Performance (Revised)
```
80,000 emails breakdown:
├─ Calibration (500 compressed samples): 2-3 min
├─ Pattern detection (all 80k): 10 sec
├─ Embedding generation (batched): 1-2 min
├─ LightGBM classification: 3 sec
├─ Hard rules (10%): instant
├─ LLM review (5%, batched): 4 min
└─ Export: 2 min
Total: ~10-12 minutes (optimistic)
Total: ~15-20 minutes (realistic with overhead)
```
---
## 8. SECURITY & PRIVACY ADVANTAGES
### Why Local Processing Matters
**GDPR considerations:**
- Cloud upload = data processing agreement needed
- Local processing = no third-party involvement
- Business emails often contain sensitive data
**Privacy concerns:**
- Client lists, pricing, contracts
- Financial information, invoices
- Personal health information (if medical business)
- Legal correspondence
**Our advantage:**
- 100% local processing
- No data retention
- No cloud storage
- Fresh repo per job (isolation)
---
## CONCLUSIONS & RECOMMENDATIONS
### 1. Use LightGBM (Not XGBoost)
- 2-5x faster
- Native categorical handling
- Perfect for our hybrid features
- Research-validated choice
### 2. Structured Embeddings Work
- Parameterized headers boost accuracy 5-10%
- Guide model with detected patterns
- Research-backed technique
### 3. Attachment Analysis is Differentiator
- Competitors don't do this
- High value for business owners
- Simple to implement (PyPDF2, python-docx)
### 4. Qwen 3 Model Strategy
- **qwen3:4b** for calibration (better discovery)
- **qwen3:1.7b** for bulk review (faster)
- Single config file for easy swapping
### 5. Market Gap Validated
- No local, privacy-first alternatives
- Business owners have this pain point
- One-time cleanup vs subscription
- 94-96% accuracy is competitive
### 6. Performance Target Achievable
- 15-20 min for 80k emails (realistic)
- 94-96% accuracy (research-backed)
- <5% need LLM review
- Competitive with cloud tools
---
## NEXT STEPS
1. ✅ Research complete
2. ✅ Architecture validated
3. ⏭ Build core infrastructure
4. ⏭ Implement hybrid features
5. ⏭ Create LightGBM classifier
6. ⏭ Add LLM providers
7. ⏭ Build test harness
8. ⏭ Package as wheel
9. ⏭ Test on real inbox
---
**Research phase complete. Architecture validated. Ready to build.**


@@ -1,319 +0,0 @@
# Root Cause Analysis: Category Explosion & Over-Confidence
**Date:** 2025-10-24
**Run:** 100k emails, qwen3:4b model
**Issue:** Model trained on 29 categories instead of expected 11, with extreme over-confidence
---
## Executive Summary
The 100k classification run technically succeeded (92.1% accuracy estimate) but revealed critical architectural issues:
1. **Category Explosion:** 29 training categories vs expected 11
2. **Duplicate Categories:** Work/work, Administrative/auth, finance/Financial
3. **Extreme Over-Confidence:** 99%+ classifications at 1.0 confidence
4. **Category Leakage:** Hardcoded categories leaked into LLM-discovered categories
---
## The Bug
### Location
[src/calibration/workflow.py:110](src/calibration/workflow.py#L110)
```python
all_categories = list(set(self.categories) | set(discovered_categories.keys()) | label_categories)
```
### What Happened
The workflow merges THREE category sources:
1. **`self.categories`** - 12 hardcoded categories from `config/categories.yaml`:
- junk, transactional, auth, newsletters, social, automated
- conversational, work, personal, finance, travel, unknown
2. **`discovered_categories.keys()`** - 11 LLM-discovered categories:
- Work, Financial, Administrative, Operational, Meeting
- Technical, External, Announcements, Urgent, Miscellaneous, Forwarded
3. **`label_categories`** - Additional categories from LLM labels:
- Bowl Pool 2000, California Market, Prehearing, Change, Monitoring
- Information
### Result: 29 Total Categories
```
1. Administrative (LLM discovered)
2. Announcements (LLM discovered)
3. Bowl Pool 2000 (LLM label - weird)
4. California Market (LLM label - too specific)
5. Change (LLM label - vague)
6. External (LLM discovered)
7. Financial (LLM discovered)
8. Forwarded (LLM discovered)
9. Information (LLM label - vague)
10. Meeting (LLM discovered)
11. Miscellaneous (LLM discovered)
12. Monitoring (LLM label - too specific)
13. Operational (LLM discovered)
14. Prehearing (LLM label - too specific)
15. Technical (LLM discovered)
16. Urgent (LLM discovered)
17. Work (LLM discovered)
18. auth (hardcoded)
19. automated (hardcoded)
20. conversational (hardcoded)
21. finance (hardcoded)
22. junk (hardcoded)
23. newsletters (hardcoded)
24. personal (hardcoded)
25. social (hardcoded)
26. transactional (hardcoded)
27. travel (hardcoded)
28. unknown (hardcoded)
29. work (hardcoded)
```
### Duplicates Identified
- **Work (LLM) vs work (hardcoded)** - 14,223 vs 368 emails
- **Financial (LLM) vs finance (hardcoded)** - 5,943 vs 0 emails
- **Administrative (LLM) vs auth (hardcoded)** - 67,195 vs 37 emails
---
## Impact Analysis
### 1. Category Distribution (100k Results)
| Category | Count | Confidence | Source |
|----------|-------|------------|--------|
| Administrative | 67,195 | 1.000 | LLM discovered |
| Work | 14,223 | 1.000 | LLM discovered |
| Meeting | 7,785 | 1.000 | LLM discovered |
| Financial | 5,943 | 1.000 | LLM discovered |
| Operational | 3,274 | 1.000 | LLM discovered |
| junk | 394 | 0.960 | Hardcoded |
| work | 368 | 0.950 | Hardcoded |
| Miscellaneous | 238 | 1.000 | LLM discovered |
| Technical | 193 | 1.000 | LLM discovered |
| External | 137 | 1.000 | LLM discovered |
| transactional | 44 | 0.970 | Hardcoded |
| auth | 37 | 0.990 | Hardcoded |
| unknown | 23 | 0.500 | Hardcoded |
| Others | <20 each | Various | Mixed |
### 2. Extreme Over-Confidence
- **67,195 emails** classified as "Administrative" with **1.0 confidence**
- **99.9%** of all classifications have confidence >= 0.95
- This is unrealistic - suggests overfitting or poor calibration
### 3. Why It Still "Worked"
- LLM-discovered categories (uppercase) handled 99%+ of emails
- Hardcoded categories (lowercase) mostly unused except for rules
- Model learned both sets but strongly preferred LLM categories
- Enron dataset doesn't match hardcoded categories well
---
## Why This Happened
### Design Intent vs Reality
**Original Design:**
- Hardcoded categories in `categories.yaml` for rule-based matching
- LLM discovers NEW categories during calibration
- Merge both for flexible classification
**Reality:**
- Hardcoded categories leak into ML training
- Creates duplicate concepts (Work vs work)
- LLM labels include one-off categories (Bowl Pool 2000)
- No deduplication or conflict resolution
### The Workflow Path
```
1. CLI loads hardcoded categories from categories.yaml
→ ['junk', 'transactional', 'auth', ... 'work', 'finance', 'unknown']
2. Passes to CalibrationWorkflow.__init__(categories=...)
→ self.categories = list(categories.keys())
3. LLM discovers categories from emails
→ {'Work': 'business emails', 'Financial': 'budgets', ...}
4. Consolidation reduces duplicates (within LLM categories only)
→ But doesn't see hardcoded categories
5. Merge ALL sources at workflow.py:110
→ Hardcoded + Discovered + Label anomalies = 29 categories
6. Trainer learns all 29 categories
→ Model becomes confused but weights LLM categories heavily
```
---
## Spot-Check Findings
### High Confidence Samples (Correct)
**Sample 1:** "i'll get the movie and wine. my suggestion is something from central market"
- Classified: Administrative (1.0)
- **Assessment:** Questionable - looks more personal
**Sample 2:** "Can you spell S-N-O-O-T-Y?"
- Classified: Administrative (1.0)
- **Assessment:** Wrong - clearly conversational/personal
**Sample 3:** "MEETING TONIGHT - 6:00 pm Central Time at The Houstonian"
- Classified: Meeting (1.0)
- **Assessment:** Correct
### Low Confidence Samples (Unknown)
⚠️ **All low confidence samples classified as "unknown" (0.500)**
- These fell back to LLM
- LLM failed to classify (returned unknown)
- Actual content: Legitimate business emails about deferrals, power units
### Category Anomalies
**"California Market" (6 emails, 1.0 confidence)**
- Too specific - shouldn't be a standalone category
- Should be "Work" or "External"
**"Bowl Pool 2000" (exists in training set)**
- One-off event category
- Should never have been kept
---
## Performance Impact
### What Went Right
- **ML handled 99.1%** of emails (99,134 / 100,000)
- **Only 31 fell to LLM** (0.03%)
- Fast classification (~3 minutes for 100k)
- Discovered categories are semantically good
### What Went Wrong
- **Unrealistic confidence** - Almost everything is 1.0
- **Category pollution** - 29 instead of 11
- **Duplicates** - Work/work, finance/Financial
- **No calibration** - Model confidence not properly calibrated
- **Hardcoded categories unused** - 368 "work" vs 14,223 "Work"
---
## Root Causes
### 1. Architectural Confusion
**Two competing philosophies:**
- **Rule-based system:** Use hardcoded categories with pattern matching
- **LLM-driven system:** Discover categories from data
**Result:** They interfere with each other instead of complementing
### 2. Missing Deduplication
The workflow.py:110 line does a simple set union without:
- Case normalization
- Semantic similarity checking
- Conflict resolution
- Priority rules
### 3. No Consolidation Across Sources
The LLM consolidation step (line 91-100) only consolidates within discovered categories. It doesn't:
- Check against hardcoded categories
- Merge similar concepts
- Remove one-off labels
### 4. Poor Category Cache Design
The category cache (src/models/category_cache.json) saves LLM categories but:
- Doesn't deduplicate against hardcoded categories
- Allows case-sensitive duplicates
- No validation of category quality
---
## Recommendations
### Immediate Fixes
1. **Remove hardcoded categories from ML training**
- Use them ONLY for rule-based matching
- Don't merge into `all_categories` for training
- Let LLM discover all ML categories
2. **Add case-insensitive deduplication**
- Normalize to title case
- Check semantic similarity
- Merge duplicates before training
3. **Filter label anomalies**
- Reject categories with <10 training samples
- Reject overly specific categories (Bowl Pool 2000)
- LLM review step for quality
4. **Calibrate model confidence**
- Use temperature scaling or Platt scaling
- Ensure confidence reflects actual accuracy
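A minimal sketch of fixes 2 and 3 combined (case-insensitive merge plus small-sample filtering; semantic-similarity merging would layer on top):
```python
def clean_categories(discovered: dict[str, int], min_samples: int = 10) -> list[str]:
    """Merge case-variant duplicates and drop one-off labels.

    `discovered` maps category name -> training sample count.
    """
    merged: dict[str, int] = {}
    for name, count in discovered.items():
        key = name.strip().title()  # 'work' and 'Work' collapse together
        merged[key] = merged.get(key, 0) + count
    return [name for name, count in merged.items() if count >= min_samples]
```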
### Architecture Decision
**Option A: Rule-Based + ML (Current)**
- Keep hardcoded categories for RULES ONLY
- LLM discovers categories for ML ONLY
- Never merge the two
**Option B: Pure LLM Discovery (Recommended)**
- Remove categories.yaml entirely
- LLM discovers ALL categories
- Rules can still match on keywords but don't define categories
**Option C: Hybrid with Priority**
- Define 3-5 HIGH-PRIORITY hardcoded categories (junk, auth, transactional)
- Let LLM discover everything else
- Clear hierarchy: Rules → Hardcoded ML → Discovered ML
---
## Next Steps
1. **Decision:** Choose architecture (A, B, or C above)
2. **Fix workflow.py:110** - Implement chosen strategy
3. **Add deduplication logic** - Case-insensitive, semantic matching
4. **Rerun calibration** - Clean 250-sample run
5. **Validate results** - Ensure clean categories
6. **Fix confidence** - Add calibration layer
---
## Files to Modify
1. [src/calibration/workflow.py:110](src/calibration/workflow.py#L110) - Category merging logic
2. [src/calibration/llm_analyzer.py](src/calibration/llm_analyzer.py) - Add cross-source consolidation
3. [src/cli.py:70](src/cli.py#L70) - Decide whether to load hardcoded categories
4. [config/categories.yaml](config/categories.yaml) - Clarify purpose (rules only?)
5. [src/calibration/trainer.py](src/calibration/trainer.py) - Add confidence calibration
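As a sketch of the calibration fix, Platt scaling reduces to a one-feature logistic regression fit on a held-out set (`raw_conf`, `correct`, and `new_conf` are assumed numpy arrays, not existing project names):
```python
from sklearn.linear_model import LogisticRegression

# raw_conf: model's max predicted probability on a held-out set
# correct:  1 if the prediction matched the true label, else 0
calibrator = LogisticRegression()
calibrator.fit(raw_conf.reshape(-1, 1), correct)

# Calibrated confidence for new predictions
calibrated = calibrator.predict_proba(new_conf.reshape(-1, 1))[:, 1]
```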
---
## Conclusion
The system technically worked - it classified 100k emails with high ML efficiency. However, the category explosion and over-confidence issues reveal fundamental architectural problems that need resolution before production use.
The core question: **Should hardcoded categories participate in ML training at all?**
My recommendation: **No.** Use them for rules only, let LLM discover ML categories cleanly.


@@ -0,0 +1,128 @@
# Session Handover Report - Email Sorter
**Date:** 2025-11-28
**Session ID:** eb549838-a153-48d1-ae5d-891e0e83108f
---
## What Was Done This Session
### 1. Classified 801 emails from brett-gmail using three methods:
| Method | Accuracy | Time | Output Location |
|--------|----------|------|-----------------|
| ML-Only | 54.9% | ~5 sec | `/home/bob/Documents/Email Manager/emails/brett-gm-md/` |
| ML+LLM | 93.3% | ~3.5 min | `/home/bob/Documents/Email Manager/emails/brett-gm-llm/` |
| Manual Agent | 99.8% | ~25 min | Same as ML-only + analysis files |
### 2. Created/Modified Files
**New Files:**
- `tools/generate_html_report.py` - HTML report generator
- `tools/brett_gmail_analyzer.py` - Custom dataset analyzer
- `data/brett_gmail_analysis.json` - Analysis output
- `docs/REPORT_FORMAT.md` - Report system documentation
- `docs/CLASSIFICATION_METHODS_COMPARISON.md` - Method comparison
- `docs/PROJECT_ROADMAP_2025.md` - Full roadmap and learnings
- `/home/bob/Documents/Email Manager/emails/brett-gm-md/BRETT_GMAIL_ANALYSIS_REPORT.md` - Analysis report
- `/home/bob/Documents/Email Manager/emails/brett-gm-md/report.html` - HTML report (ML-only)
- `/home/bob/Documents/Email Manager/emails/brett-gm-llm/report.html` - HTML report (ML+LLM)
**Modified Files:**
- `src/cli.py` - Added `--force-ml` flag, enriched results.json with email metadata
- `src/llm/openai_compat.py` - Removed API key requirement for local vLLM
- `config/default_config.yaml` - Changed LLM to openai provider on localhost:11433
### 3. Key Configuration Changes
```yaml
# config/default_config.yaml - LLM now uses vLLM endpoint
llm:
provider: "openai"
openai:
base_url: "http://localhost:11433/v1"
api_key: "not-needed"
classification_model: "qwen3-coder-30b"
```
---
## Key Findings
1. **ML pipeline overkill for <5000 emails** - Agent analysis gives better accuracy in similar time
2. **Sender domain is strongest signal** - Top 5 senders = 47.5% of emails
3. **Categories should serve downstream routing** - Not human labels, but processing decisions
4. **Risk-based accuracy** - Personal emails need high accuracy, junk can tolerate errors
5. **This tool = triage** - Sorts into buckets for other specialized tools
---
## Project Scope (Agreed with User)
**Email Sorter IS:**
- Bulk classification/triage tool
- Router to downstream specialized tools
- Part of larger email processing ecosystem
**Email Sorter IS NOT:**
- Complete email management solution
- Spam filter (trust Gmail/Outlook)
- Final destination for emails
---
## Recommended Dataset Size Routing
| Size | Method |
|------|--------|
| <500 | Agent-only |
| 500-5000 | Agent pre-scan + ML |
| >5000 | ML pipeline |
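As code, that routing is a simple threshold function (thresholds taken from the table above; the method names are illustrative):
```python
def choose_method(n_emails: int) -> str:
    """Route a mailbox to a classification method by size."""
    if n_emails < 500:
        return "agent-only"
    if n_emails <= 5000:
        return "agent pre-scan + ML"
    return "ML pipeline"
```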
---
## Background Processes
There are stale background bash processes (f8678e, 0a3549, 0d150e) from classification runs. These completed successfully and can be ignored.
---
## What Needs Doing Next
1. **Review docs/** - All learnings are in PROJECT_ROADMAP_2025.md
2. **Phase 1 development** - Dataset size routing, sender-first classification
3. **Agent pre-scan module** - 10-15 min discovery phase before ML
---
## User Preferences (from CLAUDE.md)
- NO emojis in commits
- NO "Generated with Claude" attribution
- Use tools (Read/Edit/Grep) not bash commands for file ops
- Virtual environment required for Python
- TTS available via `fss-speak` (single line messages only, no newlines)
---
## Quick Start for Next Agent
```bash
cd /MASTERFOLDER/Tools/email-sorter
source venv/bin/activate
# Read the roadmap
cat docs/PROJECT_ROADMAP_2025.md
# Run classification
python -m src.cli run --source local \
--directory "/path/to/emails" \
--output "/path/to/output" \
--force-ml --llm-provider openai
# Generate HTML report
python tools/generate_html_report.py --input /path/to/results.json
```
---
*Session ended: 2025-11-28 ~03:30 AEDT*


@@ -1,324 +0,0 @@
# EMAIL SORTER - START HERE
**Welcome to Email Sorter v1.0 - Your Email Classification System**
---
## What Is This?
A **complete email classification system** that:
- Uses hybrid ML/LLM classification for 90-94% accuracy
- Processes emails with smart rules, machine learning, and AI
- Works with Gmail, IMAP, or any email dataset
- Is ready to use **right now**
---
## What You Need to Know
### ✅ The Good News
- **Framework is 100% complete** - all 16 planned phases are done
- **Ready to use immediately** - with mock model or real model
- **Complete codebase** - 6000+ lines, full type hints, comprehensive logging
- **90% test pass rate** - 27/30 tests passing
- **Comprehensive documentation** - 10 guides covering everything
### ❌ The Not-So-Good News
- **Mock model included** - for testing the framework (not for production accuracy)
- **Real model optional** - you choose to train on Enron or download pre-trained
- **Gmail setup optional** - framework works without it
- **LLM integration optional** - graceful fallback if unavailable
---
## Three Ways to Get Started
### 🟢 Path A: Validate Framework (5 minutes)
Perfect if you want to quickly verify everything works
```bash
cd "c:/Build Folder/email-sorter"
source venv/Scripts/activate
# Run tests
pytest tests/ -v
# Test with mock pipeline
python -m src.cli run --source mock --output test_results/
```
**What you'll learn**: Framework works perfectly with mock model
---
### 🟡 Path B: Integrate Real Model (30-60 minutes)
Perfect if you want actual classification results
```bash
# Option 1: Train on Enron dataset (recommended)
python -c "
from src.calibration.enron_parser import EnronParser
from src.calibration.trainer import ModelTrainer
from src.classification.feature_extractor import FeatureExtractor
parser = EnronParser('enron_mail_20150507')
emails = parser.parse_emails(limit=5000)
extractor = FeatureExtractor()
trainer = ModelTrainer(extractor, ['junk', 'transactional', 'auth', 'newsletters',
'social', 'automated', 'conversational', 'work',
'personal', 'finance', 'travel', 'unknown'])
results = trainer.train([(e, 'unknown') for e in emails])
trainer.save_model('src/models/pretrained/classifier.pkl')
"
# Option 2: Use pre-trained model
python tools/setup_real_model.py --model-path /path/to/model.pkl
# Verify
python tools/setup_real_model.py --check
```
**What you'll get**: Real LightGBM model, automatic classification with 85-90% accuracy
---
### 🔴 Path C: Full Production Deployment (2-3 hours)
Perfect if you want to process Marion's 80k+ emails
```bash
# 1. Setup Gmail OAuth (download credentials.json, place in project root)
# 2. Test with 100 emails
python -m src.cli run --source gmail --limit 100 --output test_results/
# 3. Process all emails
python -m src.cli run --source gmail --output marion_results/
# 4. Check results
cat marion_results/report.txt
```
**What you'll get**: All 80k+ emails sorted, labeled, and synced to Gmail
---
## Documentation Map
| Document | Purpose | When to Read |
|----------|---------|--------------|
| **START_HERE.md** | This file - quick orientation | First (right now!) |
| **NEXT_STEPS.md** | Decision tree and action plan | Decide your path |
| **PROJECT_COMPLETE.md** | Final summary and status | Understand scope |
| **COMPLETION_ASSESSMENT.md** | Detailed component review | Deep dive needed |
| **MODEL_INFO.md** | Model usage and training | For model setup |
| **README.md** | Getting started guide | General reference |
| **PROJECT_STATUS.md** | Feature inventory | Full feature list |
| **PROJECT_BLUEPRINT.md** | Original architecture plan | Background context |
---
## Quick Reference Commands
```bash
# Navigate and activate
cd "c:/Build Folder/email-sorter"
source venv/Scripts/activate
# Validation
pytest tests/ -v # Run all tests
python -m src.cli test-config # Validate configuration
python -m src.cli test-ollama # Test LLM (if running)
python -m src.cli test-gmail # Test Gmail connection
# Framework testing
python -m src.cli run --source mock # Test with mock provider
# Real processing
python -m src.cli run --source gmail --limit 100 # Test with Gmail
python -m src.cli run --source gmail --output results/ # Full processing
# Model management
python tools/setup_real_model.py --check # Check model status
python tools/setup_real_model.py --model-path FILE # Install model
python tools/download_pretrained_model.py --url URL # Download model
```
---
## Common Questions
### Q: Do I need to do anything right now?
**A:** No! But you can run `pytest tests/ -v` to verify everything works.
### Q: Is the framework ready to use?
**A:** YES! All 16 phases are complete. 90% test pass rate. Ready to use.
### Q: How do I get better accuracy than the mock model?
**A:** Train a real model or download pre-trained. See Path B above.
### Q: Does this work without Gmail?
**A:** YES! Use mock provider or IMAP provider instead.
### Q: Can I use it right now?
**A:** YES! With mock model. For real accuracy, integrate real model (Path B).
### Q: How long to process all 80k emails?
**A:** About 20-30 minutes after setup. Path C shows how.
### Q: Where do I start?
**A:** Choose your path above. Path A (5 min) is the quickest.
---
## What Each Path Gets You
### Path A Results (5 minutes)
- ✅ Confirm framework works
- ✅ See mock classification in action
- ✅ Verify all tests pass
- ❌ Not real-world accuracy yet
### Path B Results (30-60 minutes)
- ✅ Real LightGBM model trained
- ✅ 85-90% classification accuracy
- ✅ Ready for real data
- ❌ Haven't processed real emails yet
### Path C Results (2-3 hours)
- ✅ All emails classified
- ✅ 90-94% overall accuracy
- ✅ Synced to Gmail labels
- ✅ Full deployment complete
- ✅ Marion's 80k+ emails processed
---
## Key Files & Locations
```
c:/Build Folder/email-sorter/
Core Framework:
src/ Main framework code
classification/ Email classifiers
calibration/ Model training
processing/ Batch processing
llm/ LLM providers
email_providers/ Email sources
export/ Results export
Data & Models:
enron_mail_20150507/ Real email dataset (already extracted)
src/models/pretrained/ Where real model goes
models/ Alternative model directory
Tools:
tools/setup_real_model.py Install pre-trained models
tools/download_pretrained_model.py Download models
Configuration:
config/ YAML configuration
credentials.json (optional) Gmail OAuth
Testing:
tests/ 23 test cases
logs/ Execution logs
```
---
## Success Looks Like
### After Path A (5 min)
```
✅ 27/30 tests passing
✅ Framework validation complete
✅ Mock pipeline ran successfully
Status: Ready to explore
```
### After Path B (30-60 min)
```
✅ Real model installed
✅ Model check shows: is_mock: False
✅ Ready for real classification
Status: Ready for real data
```
### After Path C (2-3 hours)
```
✅ All 80k emails processed
✅ Gmail labels synced
✅ Results exported and reviewed
✅ Accuracy metrics acceptable
Status: Complete and deployed
```
---
## One More Thing...
**This framework is complete and ready to use NOW.** You don't need to:
- Fix anything ✅
- Add components ✅
- Change architecture ✅
- Debug systems ✅
- Train models (optional) ✅
What you CAN do:
- Use it immediately with mock model
- Integrate real model when ready
- Scale to production anytime
- Customize categories and rules
- Deploy to other systems
---
## Your Next Step
Pick one:
**🟢 I want to test the framework right now** → Go to Path A (5 min)
**🟡 I want better accuracy tomorrow** → Go to Path B (30-60 min)
**🔴 I want all emails processed this week** → Go to Path C (2-3 hours total)
Or read one of the detailed docs:
- **NEXT_STEPS.md** - Decision tree
- **PROJECT_COMPLETE.md** - Full summary
- **README.md** - Detailed guide
---
## Contact & Support
If something doesn't work:
1. Check logs: `tail -f logs/email_sorter.log`
2. Run tests: `pytest tests/ -v`
3. Validate setup: `python -m src.cli test-config`
4. Review docs: See Documentation Map above
Most issues are covered in the docs!
---
## Quick Stats
- **Framework Status**: 100% complete
- **Test Pass Rate**: 90% (27/30)
- **Lines of Code**: ~6,000+ production
- **Python Modules**: 38 files
- **Documentation**: 10 guides
- **Ready for**: Immediate use
---
**Ready to get started? Choose your path above and begin! 🚀**
The framework is done. The tools are ready. The documentation is complete.
All you need to do is pick a path and start.
Let's go!


@@ -1,493 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Email Sorter System Flow</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.timing-table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
background: #252526;
}
.timing-table th {
background: #37373d;
padding: 12px;
text-align: left;
color: #4ec9b0;
}
.timing-table td {
padding: 10px;
border-bottom: 1px solid #3e3e42;
}
.flag-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
</style>
</head>
<body>
<h1>Email Sorter System Flow Documentation</h1>
<h2>1. Main Execution Flow</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([python -m src.cli run]) --> LoadConfig[Load config/default_config.yaml]
LoadConfig --> InitProviders[Initialize Email Provider<br/>Enron/Gmail/IMAP]
InitProviders --> FetchEmails[Fetch Emails<br/>--limit N]
FetchEmails --> CheckSize{Email Count?}
CheckSize -->|"< 1000"| SetMockMode[Set ml_classifier.is_mock = True<br/>LLM-only mode]
CheckSize -->|">= 1000"| CheckModel{Model Exists?}
CheckModel -->|No model at<br/>src/models/pretrained/classifier.pkl| RunCalibration[CALIBRATION PHASE<br/>LLM category discovery<br/>Train ML model]
CheckModel -->|Model exists| SkipCalibration[Skip Calibration<br/>Load existing model]
SetMockMode --> SkipCalibration
RunCalibration --> ClassifyPhase[CLASSIFICATION PHASE]
SkipCalibration --> ClassifyPhase
ClassifyPhase --> Loop{For each email}
Loop --> RuleCheck{Hard rule match?}
RuleCheck -->|Yes| RuleClassify[Category by rule<br/>confidence=1.0<br/>method='rule']
RuleCheck -->|No| MLClassify[ML Classification<br/>Get category + confidence]
MLClassify --> ConfCheck{Confidence >= threshold?}
ConfCheck -->|Yes| AcceptML[Accept ML result<br/>method='ml'<br/>needs_review=False]
ConfCheck -->|No| LowConf[Low confidence detected<br/>needs_review=True]
LowConf --> FlagCheck{--no-llm-fallback?}
FlagCheck -->|Yes| AcceptMLAnyway[Accept ML anyway<br/>needs_review=False]
FlagCheck -->|No| LLMCheck{LLM available?}
LLMCheck -->|Yes| LLMReview[LLM Classification<br/>~4 seconds<br/>method='llm']
LLMCheck -->|No| AcceptMLAnyway
RuleClassify --> NextEmail{More emails?}
AcceptML --> NextEmail
AcceptMLAnyway --> NextEmail
LLMReview --> NextEmail
NextEmail -->|Yes| Loop
NextEmail -->|No| SaveResults[Save results.json]
SaveResults --> End([Complete])
style RunCalibration fill:#ff6b6b
style LLMReview fill:#ff6b6b
style SetMockMode fill:#ffd93d
style FlagCheck fill:#4ec9b0
style AcceptMLAnyway fill:#4ec9b0
</pre>
</div>
<h2>2. Calibration Phase Detail (When Triggered)</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Calibration Triggered]) --> Sample[Stratified Sampling<br/>3% of emails<br/>min 250, max 1500]
Sample --> LLMBatch[LLM Category Discovery<br/>50 emails per batch]
LLMBatch --> Batch1[Batch 1: 50 emails<br/>~20 seconds]
Batch1 --> Batch2[Batch 2: 50 emails<br/>~20 seconds]
Batch2 --> BatchN[... N batches<br/>For 300 samples: 6 batches]
BatchN --> Consolidate[LLM Consolidation<br/>Merge similar categories<br/>~5 seconds]
Consolidate --> Categories[Final Categories<br/>~10-12 unique categories]
Categories --> Label[Label Training Emails<br/>LLM labels each sample<br/>~3 seconds per email]
Label --> Extract[Feature Extraction<br/>Embeddings + TF-IDF<br/>~0.02 seconds per email]
Extract --> Train[Train LightGBM Model<br/>~5 seconds total]
Train --> Validate[Validate on 100 samples<br/>~2 seconds]
Validate --> Save[Save Model<br/>src/models/calibrated/classifier.pkl]
Save --> End([Calibration Complete<br/>Total time: 15-25 minutes for 10k emails])
style LLMBatch fill:#ff6b6b
style Label fill:#ff6b6b
style Consolidate fill:#ff6b6b
style Train fill:#4ec9b0
</pre>
</div>
<h2>3. Classification Phase Detail</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Classification Phase]) --> Email[Get Email]
Email --> Rules{Check Hard Rules<br/>Pattern matching}
Rules -->|Match| RuleDone[Rule Match<br/>~0.001 seconds<br/>59 of 10000 emails]
Rules -->|No match| Embed[Generate Embedding<br/>all-minilm:l6-v2<br/>~0.02 seconds]
Embed --> TFIDF[TF-IDF Features<br/>~0.001 seconds]
TFIDF --> MLPredict[ML Prediction<br/>LightGBM<br/>~0.003 seconds]
MLPredict --> Threshold{Confidence >= 0.55?}
Threshold -->|Yes| MLDone[ML Classification<br/>7842 of 10000 emails<br/>78.4%]
Threshold -->|No| Flag{--no-llm-fallback?}
Flag -->|Yes| MLForced[Force ML result<br/>No LLM call]
Flag -->|No| LLM[LLM Classification<br/>~4 seconds<br/>2099 of 10000 emails<br/>21%]
RuleDone --> Next([Next Email])
MLDone --> Next
MLForced --> Next
LLM --> Next
style LLM fill:#ff6b6b
style MLDone fill:#4ec9b0
style MLForced fill:#ffd93d
</pre>
</div>
<h2>4. Model Loading Logic</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([MLClassifier.__init__]) --> CheckPath{model_path provided?}
CheckPath -->|Yes| UsePath[Use provided path]
CheckPath -->|No| Default[Default:<br/>src/models/pretrained/classifier.pkl]
UsePath --> FileCheck{File exists?}
Default --> FileCheck
FileCheck -->|Yes| Load[Load pickle file]
FileCheck -->|No| CreateMock[Create MOCK model<br/>Random Forest<br/>12 hardcoded categories]
Load --> ValidCheck{Valid model data?}
ValidCheck -->|Yes| CheckMock{is_mock flag?}
ValidCheck -->|No| CreateMock
CheckMock -->|True| WarnMock[Warn: MOCK model active]
CheckMock -->|False| RealModel[Real trained model loaded]
CreateMock --> MockWarnings[Multiple warnings printed<br/>NOT for production]
WarnMock --> Ready[Model Ready]
RealModel --> Ready
MockWarnings --> Ready
Ready --> End([Classification can start])
style CreateMock fill:#ff6b6b
style RealModel fill:#4ec9b0
style WarnMock fill:#ffd93d
</pre>
</div>
<h2>5. Flag Conditions & Effects</h2>
<div class="flag-section">
<h3>--no-llm-fallback</h3>
<p><strong>Location:</strong> src/cli.py:46, src/classification/adaptive_classifier.py:152-161</p>
<p><strong>Effect:</strong> When ML confidence < threshold, accept ML result anyway instead of calling LLM</p>
<p><strong>Use case:</strong> Test pure ML performance, avoid LLM costs</p>
<p><strong>Code path:</strong></p>
<code>
if self.disable_llm_fallback:<br/>
&nbsp;&nbsp;# Just return ML result without LLM fallback<br/>
&nbsp;&nbsp;return ClassificationResult(needs_review=False)
</code>
</div>
<div class="flag-section">
<h3>--limit N</h3>
<p><strong>Location:</strong> src/cli.py:38</p>
<p><strong>Effect:</strong> Limits number of emails fetched from source</p>
<p><strong>Calibration trigger:</strong> If N < 1000, forces LLM-only mode (no ML training)</p>
<p><strong>Code path:</strong></p>
<code>
if total_emails < 1000:<br/>
&nbsp;&nbsp;ml_classifier.is_mock = True # Skip ML, use LLM only
</code>
</div>
<div class="flag-section">
<h3>Model Path Override</h3>
<p><strong>Location:</strong> src/classification/ml_classifier.py:43</p>
<p><strong>Default:</strong> src/models/pretrained/classifier.pkl</p>
<p><strong>Calibration saves to:</strong> src/models/calibrated/classifier.pkl</p>
<p><strong>Problem:</strong> Calibration saves to different location than default load location</p>
<p><strong>Solution:</strong> Copy calibrated model to pretrained location OR pass model_path parameter</p>
</div>
<h2>6. Timing Breakdown (10,000 emails)</h2>
<table class="timing-table">
<tr>
<th>Phase</th>
<th>Operation</th>
<th>Time per Email</th>
<th>Total Time (10k)</th>
<th>LLM Required?</th>
</tr>
<tr>
<td rowspan="6"><strong>Calibration</strong><br/>(if model doesn't exist)</td>
<td>Stratified sampling (300 emails)</td>
<td>-</td>
<td>~1 second</td>
<td>No</td>
</tr>
<tr>
<td>LLM category discovery (6 batches)</td>
<td>~0.4 sec/email</td>
<td>~2 minutes</td>
<td>YES</td>
</tr>
<tr>
<td>LLM consolidation</td>
<td>-</td>
<td>~5 seconds</td>
<td>YES</td>
</tr>
<tr>
<td>LLM labeling (300 samples)</td>
<td>~3 sec/email</td>
<td>~15 minutes</td>
<td>YES</td>
</tr>
<tr>
<td>Feature extraction (300 samples)</td>
<td>~0.02 sec/email</td>
<td>~6 seconds</td>
<td>No (embeddings)</td>
</tr>
<tr>
<td>Model training (LightGBM)</td>
<td>-</td>
<td>~5 seconds</td>
<td>No</td>
</tr>
<tr>
<td colspan="3"><strong>CALIBRATION TOTAL</strong></td>
<td><strong>~17-20 minutes</strong></td>
<td><strong>YES</strong></td>
</tr>
<tr>
<td rowspan="5"><strong>Classification</strong><br/>(with model)</td>
<td>Hard rule matching</td>
<td>~0.001 sec</td>
<td>~10 seconds (all 10k)</td>
<td>No</td>
</tr>
<tr>
<td>Embedding generation</td>
<td>~0.02 sec</td>
<td>~200 seconds (all 10k)</td>
<td>No (Ollama embed)</td>
</tr>
<tr>
<td>ML prediction</td>
<td>~0.003 sec</td>
<td>~30 seconds (all 10k)</td>
<td>No</td>
</tr>
<tr>
<td>LLM fallback (21% of emails)</td>
<td>~4 sec/email</td>
<td>~140 minutes (2100 emails)</td>
<td>YES</td>
</tr>
<tr>
<td>Saving results</td>
<td>-</td>
<td>~1 second</td>
<td>No</td>
</tr>
<tr>
<td colspan="3"><strong>CLASSIFICATION TOTAL (with LLM fallback)</strong></td>
<td><strong>~2.5 hours</strong></td>
<td><strong>YES (21%)</strong></td>
</tr>
<tr>
<td colspan="3"><strong>CLASSIFICATION TOTAL (--no-llm-fallback)</strong></td>
<td><strong>~4 minutes</strong></td>
<td><strong>No</strong></td>
</tr>
</table>
<h2>7. Why LLM Still Loads</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([CLI startup]) --> Always1[ALWAYS: Load LLM provider<br/>src/cli.py:98-117]
Always1 --> Reason1[Reason: Needed for calibration<br/>if model doesn't exist]
Reason1 --> Check{Model exists?}
Check -->|No| NeedLLM1[LLM required for calibration<br/>Category discovery<br/>Sample labeling]
Check -->|Yes| SkipCal[Skip calibration]
SkipCal --> ClassStart[Start classification]
NeedLLM1 --> DoCalibration[Run calibration<br/>Uses LLM]
DoCalibration --> ClassStart
ClassStart --> Always2[ALWAYS: LLM provider is available<br/>llm.is_available = True]
Always2 --> EmailLoop[For each email...]
EmailLoop --> LowConf{Low confidence?}
LowConf -->|No| NoLLM[No LLM call]
LowConf -->|Yes| FlagCheck{--no-llm-fallback?}
FlagCheck -->|Yes| NoLLMCall[No LLM call<br/>Accept ML result]
FlagCheck -->|No| LLMAvail{llm.is_available?}
LLMAvail -->|Yes| CallLLM[LLM called<br/>src/cli.py:227-228]
LLMAvail -->|No| NoLLMCall
NoLLM --> End([Next email])
NoLLMCall --> End
CallLLM --> End
style Always1 fill:#ffd93d
style Always2 fill:#ffd93d
style CallLLM fill:#ff6b6b
style NoLLMCall fill:#4ec9b0
</pre>
</div>
<h3>Why LLM Provider is Always Initialized:</h3>
<ul>
<li><strong>Line 98-117 (src/cli.py):</strong> LLM provider is created before checking if model exists</li>
<li><strong>Reason:</strong> Need LLM ready in case calibration is required</li>
<li><strong>Result:</strong> Even with --no-llm-fallback, LLM provider loads (but won't be called for classification)</li>
</ul>
<h2>8. Command Scenarios</h2>
<table class="timing-table">
<tr>
<th>Command</th>
<th>Model Exists?</th>
<th>Calibration Runs?</th>
<th>LLM Used for Classification?</th>
<th>Total Time (10k)</th>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000</code></td>
<td>No</td>
<td>YES (~20 min)</td>
<td>YES (~2.5 hours)</td>
<td>~2 hours 50 min</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000</code></td>
<td>Yes</td>
<td>No</td>
<td>YES (~2.5 hours)</td>
<td>~2.5 hours</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000 --no-llm-fallback</code></td>
<td>No</td>
<td>YES (~20 min)</td>
<td>NO</td>
<td>~24 minutes</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 10000 --no-llm-fallback</code></td>
<td>Yes</td>
<td>No</td>
<td>NO</td>
<td>~4 minutes</td>
</tr>
<tr>
<td><code>python -m src.cli run --source enron --limit 500</code></td>
<td>Any</td>
<td>No (too few emails)</td>
<td>YES (100% LLM-only)</td>
<td>~35 minutes</td>
</tr>
</table>
<h2>9. Current System State</h2>
<div class="flag-section">
<h3>Model Status</h3>
<ul>
<li><strong>src/models/calibrated/classifier.pkl</strong> - 1.8MB, trained at 02:54, 10 categories</li>
<li><strong>src/models/pretrained/classifier.pkl</strong> - Copy of calibrated model (created manually)</li>
</ul>
</div>
<div class="flag-section">
<h3>Threshold Configuration</h3>
<ul>
<li><strong>config/default_config.yaml:</strong> default_threshold = 0.55</li>
<li><strong>config/categories.yaml:</strong> All category thresholds = 0.55</li>
<li><strong>Effect:</strong> ML must be ≥55% confident to skip LLM</li>
</ul>
</div>
<div class="flag-section">
<h3>Last Run Results (10k emails)</h3>
<ul>
<li><strong>Rules:</strong> 59 emails (0.6%)</li>
<li><strong>ML:</strong> 7,842 emails (78.4%)</li>
<li><strong>LLM fallback:</strong> 2,099 emails (21%)</li>
<li><strong>Accuracy estimate:</strong> 92.7%</li>
</ul>
</div>
<h2>10. To Run ML-Only Test (No LLM Calls During Classification)</h2>
<div class="flag-section">
<h3>Requirements:</h3>
<ol>
<li>Model must exist at <code>src/models/pretrained/classifier.pkl</code> ✓ (done)</li>
<li>Use <code>--no-llm-fallback</code> flag</li>
<li>Ensure sufficient emails (≥1000) to avoid LLM-only mode</li>
</ol>
<h3>Command:</h3>
<code>
python -m src.cli run --source enron --limit 10000 --output ml_only_10k/ --no-llm-fallback
</code>
<h3>Expected Results:</h3>
<ul>
<li><strong>Calibration:</strong> Skipped (model exists)</li>
<li><strong>LLM calls during classification:</strong> 0</li>
<li><strong>Total time:</strong> ~4 minutes</li>
<li><strong>ML acceptance rate:</strong> 100% (all emails classified by ML, even low confidence)</li>
</ul>
</div>
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>


@@ -1,357 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Category Verification Feature</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.code-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
font-family: 'Courier New', monospace;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.success {
background: #002a00;
border-left: 4px solid #4ec9b0;
padding: 15px;
margin: 10px 0;
}
</style>
</head>
<body>
<h1>--verify-categories Feature</h1>
<div class="success">
<h2>✅ IMPLEMENTED AND READY TO USE</h2>
<p><strong>Feature:</strong> Single LLM call to verify model categories fit new mailbox</p>
<p><strong>Cost:</strong> +20 seconds, 1 LLM call</p>
<p><strong>Value:</strong> Confidence check before bulk ML classification</p>
</div>
<h2>Usage</h2>
<div class="code-section">
<strong>Basic usage (with verification):</strong>
python -m src.cli run \
--source enron \
--limit 10000 \
--output verified_test/ \
--no-llm-fallback \
--verify-categories
<strong>Custom verification sample size:</strong>
python -m src.cli run \
--source enron \
--limit 10000 \
--output verified_test/ \
--no-llm-fallback \
--verify-categories \
--verify-sample 30
<strong>Without verification (fastest):</strong>
python -m src.cli run \
--source enron \
--limit 10000 \
--output fast_test/ \
--no-llm-fallback
</div>
<h2>How It Works</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Run with --verify-categories]) --> LoadModel[Load trained model<br/>Categories: Updates, Work,<br/>Meetings, etc.]
LoadModel --> FetchEmails[Fetch all emails<br/>10,000 total]
FetchEmails --> CheckFlag{--verify-categories?}
CheckFlag -->|No| SkipVerify[Skip verification<br/>Proceed to classification]
CheckFlag -->|Yes| Sample[Sample random emails<br/>Default: 20 emails]
Sample --> BuildPrompt[Build verification prompt<br/>Show model categories<br/>Show sample emails]
BuildPrompt --> LLMCall[Single LLM call<br/>~20 seconds<br/>Task: Rate category fit]
LLMCall --> ParseResponse[Parse JSON response<br/>Extract verdict + confidence]
ParseResponse --> Verdict{Verdict?}
Verdict -->|GOOD_MATCH<br/>80%+ fit| LogGood[Log: Categories appropriate<br/>Confidence: 0.8-1.0]
Verdict -->|FAIR_MATCH<br/>60-80% fit| LogFair[Log: Categories acceptable<br/>Confidence: 0.6-0.8]
Verdict -->|POOR_MATCH<br/><60% fit| LogPoor[Log WARNING<br/>Show suggested categories<br/>Recommend calibration<br/>Confidence: 0.0-0.6]
LogGood --> Proceed[Proceed with ML classification]
LogFair --> Proceed
LogPoor --> Proceed
SkipVerify --> Proceed
Proceed --> ClassifyAll[Classify all 10,000 emails<br/>Pure ML, no LLM fallback<br/>~4 minutes]
ClassifyAll --> Done[Results saved]
style LLMCall fill:#ffd93d
style LogGood fill:#4ec9b0
style LogPoor fill:#ff6b6b
style ClassifyAll fill:#4ec9b0
</pre>
</div>
<h2>Example Outputs</h2>
<h3>Scenario 1: GOOD_MATCH (Enron → Enron)</h3>
<div class="code-section">
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: GOOD_MATCH (0.85)
Reasoning: The sample emails fit well into the trained categories. Most are work-related correspondence, meetings, and operational updates which align with the model.
Verification: GOOD_MATCH
Confidence: 85%
Model categories look appropriate for this mailbox
================================================================================
Starting classification...
</div>
<h3>Scenario 2: POOR_MATCH (Enron → Personal Gmail)</h3>
<div class="code-section">
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: POOR_MATCH (0.45)
Reasoning: Many sample emails are shopping confirmations, social media notifications, and personal correspondence which don't fit the business-focused categories well.
Verification: POOR_MATCH
Confidence: 45%
================================================================================
WARNING: Model categories may not fit this mailbox well
Suggested categories: ['Shopping', 'Social', 'Travel', 'Newsletters', 'Personal']
Consider running full calibration for better accuracy
Proceeding with existing model anyway...
================================================================================
Starting classification...
</div>
<h2>LLM Prompt Structure</h2>
<div class="code-section">
You are evaluating whether pre-trained email categories fit a new mailbox.
TRAINED MODEL CATEGORIES (11 categories):
- Updates
- Work
- Meetings
- External
- Financial
- Test
- Administrative
- Operational
- Technical
- Urgent
- Requests
SAMPLE EMAILS FROM NEW MAILBOX (20 total, showing first 20):
1. From: phillip.allen@enron.com
Subject: Re: AEC Volumes at OPAL
Preview: Here are the volumes for today...
2. From: notifications@amazon.com
Subject: Your order has shipped
Preview: Your Amazon.com order #123-4567890...
[... 18 more emails ...]
TASK:
Evaluate if the trained categories are appropriate for this mailbox.
Consider:
1. Do the sample emails naturally fit into the trained categories?
2. Are there obvious email types that don't match any category?
3. Are the category names semantically appropriate?
4. Would a user find these categories helpful for THIS mailbox?
Respond with JSON:
{
"verdict": "GOOD_MATCH" | "FAIR_MATCH" | "POOR_MATCH",
"confidence": 0.0-1.0,
"reasoning": "brief explanation",
"fit_percentage": 0-100,
"suggested_categories": ["cat1", "cat2", ...],
"category_mapping": {"old_name": "better_name", ...}
}
</div>
<h2>Configuration</h2>
<table style="width:100%; border-collapse: collapse; background: #252526; margin: 20px 0;">
<tr style="background: #37373d;">
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Flag</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Type</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Default</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Description</th>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;"><code>--verify-categories</code></td>
<td style="padding: 10px;">Flag</td>
<td style="padding: 10px;">False</td>
<td style="padding: 10px;">Enable category verification</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;"><code>--verify-sample</code></td>
<td style="padding: 10px;">Integer</td>
<td style="padding: 10px;">20</td>
<td style="padding: 10px;">Number of emails to sample</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;"><code>--no-llm-fallback</code></td>
<td style="padding: 10px;">Flag</td>
<td style="padding: 10px;">False</td>
<td style="padding: 10px;">Disable LLM fallback during classification</td>
</tr>
</table>
<h2>When Verification Runs</h2>
<ul>
<li>✅ Only if <code>--verify-categories</code> flag is set</li>
<li>✅ Only if trained model exists (not mock)</li>
<li>✅ After emails are fetched, before calibration/classification</li>
<li>❌ Skipped if using mock model</li>
<li>❌ Skipped if model doesn't exist (calibration will run anyway)</li>
</ul>
<h2>Timing Impact</h2>
<table style="width:100%; border-collapse: collapse; background: #252526; margin: 20px 0;">
<tr style="background: #37373d;">
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Configuration</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Time (10k emails)</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">LLM Calls</th>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">ML-only (no flags)</td>
<td style="padding: 10px;">~4 minutes</td>
<td style="padding: 10px;">0</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">ML-only + <code>--verify-categories</code></td>
<td style="padding: 10px;">~4.3 minutes</td>
<td style="padding: 10px;">1 (verification)</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">Full calibration (no model)</td>
<td style="padding: 10px;">~25 minutes</td>
<td style="padding: 10px;">~500</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">ML + LLM fallback (21%)</td>
<td style="padding: 10px;">~2.5 hours</td>
<td style="padding: 10px;">~2100</td>
</tr>
</table>
<h2>Decision Tree</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Need to classify emails]) --> HaveModel{Trained model<br/>exists?}
HaveModel -->|No| MustCalibrate[Must run calibration<br/>~20 minutes<br/>~500 LLM calls]
HaveModel -->|Yes| SameDomain{Same domain as<br/>training data?}
SameDomain -->|Yes, confident| FastML[Pure ML<br/>4 minutes<br/>0 LLM calls]
SameDomain -->|Unsure| VerifyML[ML + Verification<br/>4.3 minutes<br/>1 LLM call]
SameDomain -->|No, different| Options{Accuracy needs?}
Options -->|High accuracy required| MustCalibrate
Options -->|Speed more important| VerifyML
Options -->|Experimental| FastML
MustCalibrate --> Done[Classification complete]
FastML --> Done
VerifyML --> Done
style FastML fill:#4ec9b0
style VerifyML fill:#ffd93d
style MustCalibrate fill:#ff6b6b
</pre>
</div>
<h2>Quick Start</h2>
<div class="code-section">
<strong>Test with verification on same domain (Enron → Enron):</strong>
python -m src.cli run \
--source enron \
--limit 1000 \
--output verify_test_same/ \
--no-llm-fallback \
--verify-categories
Expected: GOOD_MATCH (0.80-0.95)
Time: ~30 seconds
<strong>Test without verification for speed comparison:</strong>
python -m src.cli run \
--source enron \
--limit 1000 \
--output no_verify_test/ \
--no-llm-fallback
Expected: Same accuracy, 20 seconds faster
Time: ~10 seconds
</div>
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>


@@ -1,255 +0,0 @@
# Email Sorter - Complete Workflow Diagram
## Full End-to-End Pipeline with LLM Calls
```mermaid
graph TB
Start([📧 Start: Enron Maildir<br/>100,000 emails]) --> Parse[EnronParser<br/>Stratified Sampling]
Parse --> CalibCheck{Need<br/>Calibration?}
CalibCheck -->|Yes: No Model| CalibStart[🎯 CALIBRATION PHASE]
CalibCheck -->|No: Model Exists| ClassifyStart[📊 CLASSIFICATION PHASE]
%% CALIBRATION PHASE
CalibStart --> Sample[Sample 100 Emails<br/>Stratified by user/folder]
Sample --> Split[Split: 50 train / 50 validation]
Split --> LLMBatch[📤 LLM CALL 1-5<br/>Batch Discovery<br/>5 batches × 20 emails]
LLMBatch -->|qwen3:8b-q4_K_M| Discover[Category Discovery<br/>~15 raw categories]
Discover --> Consolidate[📤 LLM CALL 6<br/>Consolidation<br/>Merge similar categories]
Consolidate -->|qwen3:8b-q4_K_M| CacheSnap[Category Cache Snap<br/>Semantic matching<br/>10 final categories]
CacheSnap --> ExtractTrain[Extract Features<br/>50 training emails<br/>Batch embeddings]
ExtractTrain --> Embed1[📤 EMBEDDING CALLS<br/>Ollama all-minilm:l6-v2<br/>384-dim vectors]
Embed1 --> TrainModel[Train LightGBM<br/>200 boosting rounds<br/>22 total categories]
TrainModel --> SaveModel[💾 Save Model<br/>classifier.pkl 1.1MB]
SaveModel --> ClassifyStart
%% CLASSIFICATION PHASE
ClassifyStart --> LoadModel[Load Model<br/>classifier.pkl]
LoadModel --> FetchAll[Fetch All Emails<br/>100,000 emails]
FetchAll --> BatchProcess[Process in Batches<br/>5,000 emails per batch<br/>20 batches total]
BatchProcess --> ExtractFeatures[Extract Features<br/>Batch size: 512<br/>Batched embeddings]
ExtractFeatures --> Embed2[📤 EMBEDDING CALLS<br/>Ollama all-minilm:l6-v2<br/>~200 batched calls]
Embed2 --> MLInference[LightGBM Inference<br/>Predict categories<br/>~2ms per email]
MLInference --> Results[💾 Save Results<br/>results.json 19MB<br/>summary.json 1.5KB<br/>classifications.csv 8.6MB]
Results --> ValidationStart[🔍 VALIDATION PHASE]
%% VALIDATION PHASE
ValidationStart --> SelectSamples[Select Samples<br/>50 low-conf + 25 random]
SelectSamples --> LoadEmails[Load Full Email Content<br/>Subject + Body + Metadata]
LoadEmails --> LLMEval[📤 LLM CALLS 7-81<br/>Individual Evaluation<br/>75 total assessments]
LLMEval -->|qwen3:8b-q4_K_M<br/>&lt;no_think&gt;| EvalResults[Collect Verdicts<br/>YES/PARTIAL/NO<br/>+ Reasoning]
EvalResults --> LLMSummary[📤 LLM CALL 82<br/>Final Summary<br/>Aggregate findings]
LLMSummary -->|qwen3:8b-q4_K_M| FinalReport[📊 Final Report<br/>Accuracy metrics<br/>Category quality<br/>Recommendations]
FinalReport --> End([✅ Complete<br/>100k classified<br/>+ validated])
%% OPTIONAL FINE-TUNING LOOP
FinalReport -.->|If corrections needed| FineTune[🔄 FINE-TUNING<br/>Collect LLM corrections<br/>Continue training]
FineTune -.-> ClassifyStart
style Start fill:#e1f5e1
style End fill:#e1f5e1
style LLMBatch fill:#fff4e6
style Consolidate fill:#fff4e6
style Embed1 fill:#e6f3ff
style Embed2 fill:#e6f3ff
style LLMEval fill:#fff4e6
style LLMSummary fill:#fff4e6
style SaveModel fill:#ffe6f0
style Results fill:#ffe6f0
style FinalReport fill:#ffe6f0
```
---
## Pipeline Stages Breakdown
### STAGE 1: CALIBRATION (1 minute)
**Input:** 100 emails
**LLM Calls:** 6 calls
- 5 batch discovery calls (20 emails each)
- 1 consolidation call
**Embedding Calls:** ~50 calls (one per training email)
**Output:**
- 10 discovered categories
- Trained LightGBM model (1.1MB)
- Category cache
### STAGE 2: CLASSIFICATION (3.4 minutes)
**Input:** 100,000 emails
**LLM Calls:** 0 (pure ML inference)
**Embedding Calls:** ~200 batched calls (512 emails per batch)
**Output:**
- 100,000 classifications
- Confidence scores
- Results in JSON/CSV
### STAGE 3: VALIDATION (variable, ~5-10 minutes)
**Input:** 75 sample emails (50 low-conf + 25 random)
**LLM Calls:** 76 calls
- 75 individual evaluation calls
- 1 final summary call
**Output:**
- Quality assessment (YES/PARTIAL/NO)
- Accuracy metrics
- Recommendations
---
## LLM Call Summary
| Call # | Purpose | Model | Input | Output | Time |
|--------|---------|-------|-------|--------|------|
| 1-5 | Batch Discovery | qwen3:8b | 20 emails each | Categories | ~5-6s each |
| 6 | Consolidation | qwen3:8b | 15 categories | 10 merged | ~3s |
| 7-81 | Evaluation | qwen3:8b | 1 email + category | Verdict | ~2s each |
| 82 | Summary | qwen3:8b | 75 evaluations | Final report | ~5s |
**Total LLM Calls:** 82
**Total LLM Time:** ~3-4 minutes
**Embedding Calls:** ~250 (batched)
**Embedding Time:** ~30 seconds (batched)
---
## Performance Metrics
### Calibration Phase
- **Time:** 60 seconds
- **Samples:** 100 emails (50 for training)
- **Categories Discovered:** 10
- **Model Size:** 1.1MB
- **Accuracy on training:** 95%+
### Classification Phase
- **Time:** 202 seconds (3.4 minutes)
- **Emails:** 100,000
- **Speed:** 495 emails/second
- **Per Email:** 2ms total processing
- **Batch Size:** 512 (optimal)
- **GPU Utilization:** High (batched embeddings)
### Validation Phase
- **Time:** ~10 minutes (75 LLM calls)
- **Samples:** 75 emails
- **Per Sample:** ~8 seconds
- **Outcome:** Model already accurate (0 corrections needed)
---
## Data Flow Details
### Email Processing Pipeline
```
Email File → Parse → Features → Embedding → Model → Category
(text) (dict) (struct) (384-dim) (22-cat) (label)
```
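In code, the whole flow is one short function. A minimal sketch, assuming a `parse_email` helper, the `embed_text` helper sketched below, and a loaded LightGBM `booster`; the names are illustrative, not the project's actual API:

```python
import numpy as np

def classify(raw_text: str, booster, categories: list) -> tuple:
    """Parse -> features -> embedding -> model -> category for one email."""
    email = parse_email(raw_text)                     # text -> dict (subject, sender, body)
    text = f"{email['subject']} {email['sender']} {email['body'][:500]}"
    vector = embed_text(text)                         # 384-dim embedding
    probs = booster.predict(np.asarray([vector]))[0]  # one probability per category
    best = int(np.argmax(probs))
    return categories[best], float(probs[best])       # (label, confidence)
```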
### Feature Extraction
```
Email Content
├─ Subject (text)
├─ Body (text)
├─ Sender (email address)
├─ Date (timestamp)
├─ Attachments (boolean + count)
└─ Patterns (regex matches)
Structured Text
Ollama Embedding (all-minilm:l6-v2)
384-dimensional vector
```
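The `embed_text` helper above is a single request to Ollama's REST API. A minimal sketch using the `/api/embeddings` endpoint; the localhost URL is Ollama's default and may differ in your setup:

```python
import requests

def embed_text(text: str, model: str = "all-minilm:l6-v2") -> list:
    """Fetch a 384-dim embedding from a local Ollama instance."""
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["embedding"]  # 384 floats for all-minilm:l6-v2
```

The pipeline batches this work 512 emails at a time, which is what keeps GPU utilization high during classification.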
### LightGBM Training
```
Features (384-dim) + Labels (10 categories)
Training: 200 boosting rounds
Model: 22 categories total (10 discovered + 12 hardcoded)
Output: classifier.pkl (1.1MB)
```
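A minimal sketch of that training step with the LightGBM Python API; the 200 rounds match the figures above, while the remaining hyperparameters are illustrative assumptions:

```python
import pickle

import lightgbm as lgb
import numpy as np

def train_classifier(X: np.ndarray, y: np.ndarray, num_classes: int) -> lgb.Booster:
    """Train a multiclass LightGBM model on 384-dim embedding features."""
    params = {
        "objective": "multiclass",
        "num_class": num_classes,  # 22 in the run above
        "metric": "multi_logloss",
        "verbosity": -1,
    }
    booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=200)
    with open("classifier.pkl", "wb") as f:
        pickle.dump(booster, f)    # ~1.1MB on disk for this run
    return booster
```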
---
## Category Distribution (100k Results)
```mermaid
pie title Category Distribution
"Work Communication" : 89807
"Financial" : 6534
"Forwarded" : 2457
"Technical Analysis" : 1129
"Other" : 73
```
---
## Confidence Distribution (100k Results)
```mermaid
pie title Confidence Levels
"High (≥0.7)" : 74777
"Medium (0.5-0.7)" : 17381
"Low (<0.5)" : 7842
```
---
## System Architecture
```mermaid
graph LR
A[Email Source<br/>Gmail/IMAP/Enron] --> B[Email Provider]
B --> C[Feature Extractor]
C --> D[Ollama<br/>Embeddings]
C --> E[Pattern Detector]
D --> F[LightGBM<br/>Classifier]
E --> F
F --> G[Results<br/>JSON/CSV]
F --> H[Sync Engine<br/>Labels/Keywords]
I[LLM<br/>qwen3:8b] -.->|Calibration| J[Category Discovery]
J -.-> F
I -.->|Validation| K[Quality Check]
K -.-> G
style D fill:#e6f3ff
style I fill:#fff4e6
style F fill:#f0e6ff
style G fill:#ffe6f0
```
---
## Next: Integrated End-to-End Script
Building a comprehensive validation script with:
1. 50 low-confidence samples
2. 25 random samples
3. Final LLM summary call
4. Complete pipeline orchestration
Deleted file (research note on sentence transformers):

@@ -1,126 +0,0 @@
Yeah — your instinct is solid here, Brett.
For the kind of **multi-category, fast, local, high-volume** classification system you're building, training your own model on your own labelled data will almost certainly give **better and more controllable results** than trying to stretch an off-the-shelf model.
Let's break this down clearly — especially what **sentence-transformers** brings to the table and why it's a good fit for your hybrid architecture.
---
## 🧠 What Sentence Transformers Actually Do
A **sentence transformer** is a model that converts a piece of text (e.g. subject line, short body snippet, sender info) into a **dense vector embedding** — basically a fixed-length numerical representation that captures its semantic meaning.
Once you have embeddings, classification becomes a **simple, fast** downstream problem:
* Instead of training a giant deep network,
* You train a **small, lightweight model** (like Logistic Regression, XGBoost, or a simple neural head) on top of those embeddings.
---
## 🚀 Why This Is a Great Fit for Your Email Sorter
| Sentence Transformers | Why it matters for you |
| ----------------------------------------------------------- | ----------------------------------------------------------------------------- |
| **Pretrained models** already “understand” general language | So you don't need massive datasets to get good results. |
| **Fast inference** | Embedding generation can run on CPU or GPU and is easy to parallelise. |
| **Stable + deterministic** | Embeddings are consistent across runs (unlike LLM zero-shot answers). |
| **Lightweight training** | You can train a classifier on top with a few thousand labelled samples. |
| **Supports multi-class** easily | Perfect for your 12 category types. |
| **Re-usable** | One universal model for all inboxes; just retrain the top layer occasionally. |
---
## 🏗️ How It Would Work in Your Pipeline
### 1. **Use the LLM once** to label your initial batch (e.g. 1.5k-5k emails).
This gives you your **bootstrapped labelled dataset**.
### 2. **Generate embeddings**
* Take **subject**, **sender domain**, and optionally a short **body snippet**.
* Pass through sentence transformer → get a fixed-length vector (384-dim for MiniLM, 768 for larger models).
* Save those embeddings alongside labels.
### 3. **Train a classifier** on top
A lightweight model like:
* **Logistic Regression** (fastest),
* **XGBoost / LightGBM** (slightly heavier, more accuracy),
* or even a shallow **MLP** if you want.
This becomes your **universal email classifier**.
### 4. **Run in production**
* New email comes in → embed text → run classifier → get category + confidence.
* If below threshold → send to LLM for re-classification.
This gives you **LLM semantic power** at training time, and **ML speed** at runtime.
---
## 🧪 Practical Setup (Minimal)
```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
# 1. Load a pre-trained embedding model
model = SentenceTransformer("all-MiniLM-L6-v2") # fast, lightweight
# 2. Convert texts to embeddings
texts = [f"{subject} {sender}" for subject, sender in your_email_data]
embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)
# 3. Train a lightweight classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(embeddings, labels) # labels come from your LLM bootstrap
# 4. Predict
new_embedding = model.encode(["Subject from new email"])
pred = clf.predict(new_embedding)
conf = clf.predict_proba(new_embedding)
```
* `all-MiniLM-L6-v2` is a fantastic starting model — small, fast, and surprisingly accurate.
* You can fine-tune the sentence transformer later if you want **extra precision**.
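Runtime routing then looks like this: a minimal sketch reusing `model` and `clf` from above, where the 0.6 threshold and the `ask_llm` fallback are illustrative placeholders:

```python
def classify_with_fallback(subject: str, sender: str, threshold: float = 0.6):
    """Fast ML path first; escalate to the LLM only on low confidence."""
    embedding = model.encode([f"{subject} {sender}"])
    probs = clf.predict_proba(embedding)[0]
    category, confidence = clf.classes_[probs.argmax()], probs.max()
    if confidence < threshold:
        return ask_llm(subject, sender)  # slow path: LLM re-classification
    return category, confidence
```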
---
## 🧠 Why Its Easier Than Full Fine-Tuning
You *don't* need to train the transformer itself (at least not at the start).
You're just training the **top layer** (classifier). That means:
* Training takes minutes, not hours.
* You don't need huge GPUs.
* You can refresh or retrain easily with new data.
Later on, if you want to **fine-tune the transformer itself** (so it “understands emails” more deeply), that's an optional next step.
---
## ⚡ Typical Results People See
* With 2-5k labelled samples, sentence transformer embeddings + logistic regression can hit **85-95% accuracy** on email category tasks.
* Inference time is **<5 ms per email** on CPU.
* Works well for both generic and user-specific inboxes.
---
## 🪜 Suggested Path for You
1. Use your **LLM pass** to generate labels on your first big inbox.
2. Generate embeddings with a pretrained MiniLM.
3. Train a logistic regression or XGBoost model.
4. Run it on the next inbox → see how it performs.
5. (Optional) Fine-tune the transformer if you want to push performance higher.
---
👉 In short:
Yes — sentence transformers are **perfect** for this.
They give you **semantic power without LLM overhead**, are **easy to train**, and will make your hybrid classifier **extremely fast and accurate** after that first run.
If you want, I can give you a **tiny starter training script** (30-40 lines) that does the embedding + classifier training from your first LLM-labelled dataset. Would you like that?
Changed file (CLI run command):

@@ -53,6 +53,8 @@ def cli():
               help='Verify model categories fit new mailbox (single LLM call)')
 @click.option('--verify-sample', type=int, default=20,
               help='Number of emails to sample for category verification')
+@click.option('--force-ml', is_flag=True,
+              help='Force use of existing ML model regardless of dataset size')
 def run(
     source: str,
     credentials: Optional[str],
@@ -65,7 +67,8 @@ def run(
     verbose: bool,
     no_llm_fallback: bool,
     verify_categories: bool,
-    verify_sample: int
+    verify_sample: int,
+    force_ml: bool
 ):
     """Run email sorter pipeline."""
@@ -198,10 +201,14 @@ def run(
     total_emails = len(emails)

     # Skip ML for small datasets (<1000 emails) - use LLM only
-    if total_emails < 1000:
+    # Unless --force-ml is set and we have an existing model
+    if total_emails < 1000 and not force_ml:
         logger.warning(f"Only {total_emails} emails - too few for ML training")
         logger.warning("Using LLM-only classification (no ML model)")
+        logger.warning("Use --force-ml to use existing model anyway")
         ml_classifier.is_mock = True
+    elif force_ml and ml_classifier.model:
+        logger.info(f"--force-ml: Using existing ML model for {total_emails} emails")

     # Check if we need calibration (no good ML model)
     if ml_classifier.is_mock or not ml_classifier.model:
@@ -294,7 +301,20 @@ def run(
     logger.info("Exporting results")
     Path(output).mkdir(parents=True, exist_ok=True)

+    # Build email lookup for metadata enrichment
+    email_lookup = {email.id: email for email in emails}
+
     import json
+    from datetime import datetime as dt
+
+    def serialize_date(date_obj):
+        """Serialize date to ISO format string."""
+        if date_obj is None:
+            return None
+        if isinstance(date_obj, dt):
+            return date_obj.isoformat()
+        return str(date_obj)

     results_data = {
         'metadata': {
             'total_emails': len(emails),
@@ -304,16 +324,24 @@ def run(
             'ml_classified': adaptive_classifier.get_stats().ml_classified,
             'llm_classified': adaptive_classifier.get_stats().llm_classified,
             'needs_review': adaptive_classifier.get_stats().needs_review,
-            }
+            },
+            'generated_at': dt.now().isoformat(),
+            'source': source,
+            'source_path': directory if source == 'local' else None,
         },
         'classifications': [
             {
                 'email_id': r.email_id,
+                'subject': email_lookup.get(r.email_id, emails[i]).subject if r.email_id in email_lookup or i < len(emails) else '',
+                'sender': email_lookup.get(r.email_id, emails[i]).sender if r.email_id in email_lookup or i < len(emails) else '',
+                'sender_name': email_lookup.get(r.email_id, emails[i]).sender_name if r.email_id in email_lookup or i < len(emails) else None,
+                'date': serialize_date(email_lookup.get(r.email_id, emails[i]).date if r.email_id in email_lookup or i < len(emails) else None),
+                'has_attachments': email_lookup.get(r.email_id, emails[i]).has_attachments if r.email_id in email_lookup or i < len(emails) else False,
                 'category': r.category,
                 'confidence': r.confidence,
                 'method': r.method
             }
-            for r in results
+            for i, r in enumerate(results)
         ]
     }
Changed file (OpenAIProvider):

@@ -47,14 +47,12 @@ class OpenAIProvider(BaseLLMProvider):
         try:
             from openai import OpenAI

-            if not self.api_key:
-                self.logger.error("OpenAI API key not configured")
-                self.logger.error("Set OPENAI_API_KEY environment variable or pass api_key parameter")
-                self._available = False
-                return
+            # For local vLLM/OpenAI-compatible servers, API key may not be required
+            # Use a placeholder if not set
+            api_key = self.api_key or "not-needed"

             self.client = OpenAI(
-                api_key=self.api_key,
+                api_key=api_key,
                 base_url=self.base_url if self.base_url != "https://api.openai.com/v1" else None,
                 timeout=self.timeout
             )
@@ -121,7 +119,7 @@ class OpenAIProvider(BaseLLMProvider):
     def test_connection(self) -> bool:
         """Test if OpenAI API is accessible."""
-        if not self.client or not self.api_key:
+        if not self.client:
             self.logger.warning("OpenAI client not initialized")
             return False
tools/batch_llm_classifier.py (new executable file, 364 lines)

@@ -0,0 +1,364 @@
#!/usr/bin/env python3
"""
Standalone vLLM Batch Email Classifier
PREREQUISITE: vLLM server must be running at configured endpoint
This is a SEPARATE tool from the main ML classification pipeline.
Use this for:
- One-off batch questions ("find all emails about project X")
- Custom classification criteria not in trained model
- Exploratory analysis with flexible prompts
Use RAG instead for:
- Searching across large email corpus
- Finding specific topics/keywords
- Building knowledge from email content
"""
import time
import asyncio
import logging
import sys
from pathlib import Path
from typing import List, Dict, Any, Optional
import httpx
import click
# Server configuration
VLLM_CONFIG = {
'base_url': 'https://rtx3090.bobai.com.au/v1',
'api_key': 'rtx3090_foxadmin_10_8034ecb47841f45ba1d5f3f5d875c092',
'model': 'qwen3-coder-30b',
'batch_size': 4, # Tested optimal - 100% success, proper batch pooling
'temperature': 0.1,
'max_tokens': 500
}
async def check_vllm_server(base_url: str, api_key: str, model: str) -> bool:
"""Check if vLLM server is running and model is loaded."""
try:
async with httpx.AsyncClient() as client:
response = await client.post(
f"{base_url}/chat/completions",
json={
"model": model,
"messages": [{"role": "user", "content": "test"}],
"max_tokens": 5
},
headers={
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
},
timeout=10.0
)
return response.status_code == 200
except Exception as e:
print(f"ERROR: vLLM server check failed: {e}")
return False
async def classify_email_async(
client: httpx.AsyncClient,
email: Any,
prompt_template: str,
base_url: str,
api_key: str,
model: str,
temperature: float,
max_tokens: int
) -> Dict[str, Any]:
"""Classify single email using async HTTP request."""
# No semaphore - proper batch pooling instead
try:
# Build prompt with email data
prompt = prompt_template.format(
subject=email.get('subject', 'N/A')[:100],
sender=email.get('sender', 'N/A')[:50],
body_snippet=email.get('body_snippet', '')[:500]
)
response = await client.post(
f"{base_url}/chat/completions",
json={
"model": model,
"messages": [{"role": "user", "content": prompt}],
"temperature": temperature,
"max_tokens": max_tokens
},
headers={
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
},
timeout=30.0
)
if response.status_code == 200:
data = response.json()
content = data['choices'][0]['message']['content']
return {
'email_id': email.get('id', 'unknown'),
'subject': email.get('subject', 'N/A')[:60],
'result': content.strip(),
'success': True
}
return {
'email_id': email.get('id', 'unknown'),
'subject': email.get('subject', 'N/A')[:60],
'result': f'HTTP {response.status_code}',
'success': False
}
except Exception as e:
return {
'email_id': email.get('id', 'unknown'),
'subject': email.get('subject', 'N/A')[:60],
'result': f'Error: {str(e)[:100]}',
'success': False
}
async def classify_single_batch(
client: httpx.AsyncClient,
emails: List[Dict[str, Any]],
prompt_template: str,
config: Dict[str, Any]
) -> List[Dict[str, Any]]:
"""Classify one batch of emails - send all at once, wait for completion."""
tasks = [
classify_email_async(
client, email, prompt_template,
config['base_url'], config['api_key'], config['model'],
config['temperature'], config['max_tokens']
)
for email in emails
]
results = await asyncio.gather(*tasks)
return results
async def batch_classify_async(
emails: List[Dict[str, Any]],
prompt_template: str,
config: Dict[str, Any]
) -> List[Dict[str, Any]]:
"""Classify emails using proper batch pooling."""
batch_size = config['batch_size']
all_results = []
async with httpx.AsyncClient() as client:
# Process in batches - send batch, wait for all to complete, repeat
for batch_start in range(0, len(emails), batch_size):
batch_end = min(batch_start + batch_size, len(emails))
batch_emails = emails[batch_start:batch_end]
batch_results = await classify_single_batch(
client, batch_emails, prompt_template, config
)
all_results.extend(batch_results)
return all_results
def load_emails_from_provider(provider_type: str, credentials: Optional[str], limit: int) -> List[Dict[str, Any]]:
"""Load emails from configured provider."""
# Lazy import to avoid dependency issues
if provider_type == 'enron':
from src.email_providers.enron import EnronProvider
provider = EnronProvider(maildir_path=".")
provider.connect({})
emails = provider.fetch_emails(limit=limit)
provider.disconnect()
# Convert to dict format
return [
{
'id': e.id,
'subject': e.subject,
'sender': e.sender,
'body_snippet': e.body_snippet
}
for e in emails
]
elif provider_type == 'gmail':
from src.email_providers.gmail import GmailProvider
if not credentials:
print("ERROR: Gmail requires --credentials path")
sys.exit(1)
provider = GmailProvider()
provider.connect({'credentials_path': credentials})
emails = provider.fetch_emails(limit=limit)
provider.disconnect()
return [
{
'id': e.id,
'subject': e.subject,
'sender': e.sender,
'body_snippet': e.body_snippet
}
for e in emails
]
else:
print(f"ERROR: Unsupported provider: {provider_type}")
sys.exit(1)
@click.group()
def cli():
"""vLLM Batch Email Classifier - Ask custom questions across email batches."""
pass
@cli.command()
@click.option('--source', type=click.Choice(['gmail', 'enron']), default='enron',
help='Email provider')
@click.option('--credentials', type=click.Path(exists=False),
help='Path to credentials file (for Gmail)')
@click.option('--limit', type=int, default=50,
help='Number of emails to process')
@click.option('--question', type=str, required=True,
help='Question to ask about each email')
@click.option('--output', type=click.Path(), default='batch_results.txt',
help='Output file for results')
def ask(source: str, credentials: Optional[str], limit: int, question: str, output: str):
"""Ask a custom question about a batch of emails."""
print("=" * 80)
print("vLLM BATCH EMAIL CLASSIFIER")
print("=" * 80)
print(f"Question: {question}")
print(f"Source: {source}")
print(f"Batch size: {limit}")
print("=" * 80)
print()
# Check vLLM server
print("Checking vLLM server...")
if not asyncio.run(check_vllm_server(
VLLM_CONFIG['base_url'],
VLLM_CONFIG['api_key'],
VLLM_CONFIG['model']
)):
print()
print("ERROR: vLLM server not available or not responding")
print(f"Expected endpoint: {VLLM_CONFIG['base_url']}")
print(f"Expected model: {VLLM_CONFIG['model']}")
print()
print("PREREQUISITE: Start vLLM server before running this tool")
sys.exit(1)
print(f"✓ vLLM server running ({VLLM_CONFIG['model']})")
print()
# Load emails
print(f"Loading {limit} emails from {source}...")
emails = load_emails_from_provider(source, credentials, limit)
print(f"✓ Loaded {len(emails)} emails")
print()
# Build prompt template (optimized for caching)
prompt_template = f"""You are analyzing emails to answer specific questions.
INSTRUCTIONS:
- Read the email carefully
- Answer the question directly and concisely
- Provide reasoning if helpful
- If the email is not relevant, say "Not relevant"
QUESTION:
{question}
EMAIL TO ANALYZE:
Subject: {{subject}}
From: {{sender}}
Body: {{body_snippet}}
ANSWER:
"""
# Process batch
    print(f"Processing {len(emails)} emails in batches of {VLLM_CONFIG['batch_size']}...")
start_time = time.time()
results = asyncio.run(batch_classify_async(emails, prompt_template, VLLM_CONFIG))
end_time = time.time()
total_time = end_time - start_time
# Stats
successful = sum(1 for r in results if r['success'])
throughput = len(emails) / total_time
print()
print("=" * 80)
print("RESULTS")
print("=" * 80)
print(f"Total emails: {len(emails)}")
print(f"Successful: {successful}")
print(f"Failed: {len(emails) - successful}")
print(f"Time: {total_time:.1f}s")
print(f"Throughput: {throughput:.2f} emails/sec")
print("=" * 80)
print()
# Save results
with open(output, 'w') as f:
f.write(f"Question: {question}\n")
f.write(f"Processed: {len(emails)} emails in {total_time:.1f}s\n")
f.write("=" * 80 + "\n\n")
for i, result in enumerate(results, 1):
f.write(f"{i}. {result['subject']}\n")
f.write(f" Email ID: {result['email_id']}\n")
f.write(f" Answer: {result['result']}\n")
f.write("\n")
print(f"Results saved to: {output}")
print()
# Show sample
print("SAMPLE RESULTS (first 5):")
for i, result in enumerate(results[:5], 1):
print(f"\n{i}. {result['subject']}")
print(f" {result['result'][:100]}...")
@cli.command()
def check():
"""Check if vLLM server is running and ready."""
print("Checking vLLM server...")
print(f"Endpoint: {VLLM_CONFIG['base_url']}")
print(f"Model: {VLLM_CONFIG['model']}")
print()
if asyncio.run(check_vllm_server(
VLLM_CONFIG['base_url'],
VLLM_CONFIG['api_key'],
VLLM_CONFIG['model']
)):
print("✓ vLLM server is running and ready")
        print(f"✓ Batch size: {VLLM_CONFIG['batch_size']} requests per batch")
print(f"✓ Estimated throughput: ~4.4 emails/sec")
else:
print("✗ vLLM server not available")
print()
print("Start vLLM server before using this tool")
sys.exit(1)
if __name__ == '__main__':
cli()
tools/brett_gmail_analyzer.py (new file)

@@ -0,0 +1,391 @@
#!/usr/bin/env python3
"""
Brett Gmail Dataset Analyzer
============================
CUSTOM script for analyzing the brett-gmail email dataset.
NOT portable to other datasets without modification.
Usage:
python tools/brett_gmail_analyzer.py
Output:
- Console report with comprehensive statistics
- data/brett_gmail_analysis.json with full analysis data
"""
import json
import re
from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path
# Add parent to path for imports
import sys
sys.path.insert(0, str(Path(__file__).parent.parent))
from src.calibration.local_file_parser import LocalFileParser
# =============================================================================
# CLASSIFICATION RULES - CUSTOM FOR BRETT'S GMAIL
# =============================================================================
def classify_email(email):
"""
Classify email into categories based on sender domain and subject patterns.
Priority: Sender domain > Subject keywords
"""
sender = email.sender or ""
subject = email.subject or ""
domain = sender.split('@')[-1] if '@' in sender else sender
# === HIGH-LEVEL CATEGORIES ===
# --- Art & Collectibles ---
if 'mutualart.com' in domain:
return ('Art & Collectibles', 'MutualArt Alerts')
# --- Travel & Tourism ---
if 'tripadvisor.com' in domain:
return ('Travel & Tourism', 'Tripadvisor')
if 'booking.com' in domain:
return ('Travel & Tourism', 'Booking.com')
# --- Entertainment & Streaming ---
if 'spotify.com' in domain:
if 'concert' in subject.lower() or 'live' in subject.lower():
return ('Entertainment', 'Spotify Concerts')
return ('Entertainment', 'Spotify Promotions')
if 'youtube.com' in domain:
return ('Entertainment', 'YouTube')
if 'onlyfans.com' in domain:
return ('Entertainment', 'OnlyFans')
if 'ign.com' in domain:
return ('Entertainment', 'IGN Gaming')
# --- Shopping & eCommerce ---
if 'ebay.com' in domain or 'reply.ebay' in domain:
return ('Shopping', 'eBay')
if 'aliexpress.com' in domain:
return ('Shopping', 'AliExpress')
if 'alibabacloud.com' in domain or 'alibaba-inc.com' in domain:
return ('Tech Services', 'Alibaba Cloud')
if '4wdsupacentre' in domain:
return ('Shopping', '4WD Supacentre')
if 'mikeblewitt' in domain or 'mbcoffscoast' in domain:
return ('Shopping', 'Mike Blewitt/MBC')
if 'auspost.com.au' in domain:
return ('Shopping', 'Australia Post')
if 'printfresh' in domain:
return ('Business', 'Timesheets')
# --- AI & Tech Services ---
if 'anthropic.com' in domain or 'claude.com' in domain:
return ('AI Services', 'Anthropic/Claude')
if 'openai.com' in domain:
return ('AI Services', 'OpenAI')
if 'openrouter.ai' in domain:
return ('AI Services', 'OpenRouter')
if 'lambda' in domain:
return ('AI Services', 'Lambda Labs')
if 'x.ai' in domain:
return ('AI Services', 'xAI')
if 'perplexity.ai' in domain:
return ('AI Services', 'Perplexity')
if 'cursor.com' in domain:
return ('Developer Tools', 'Cursor')
# --- Developer Tools ---
if 'ngrok.com' in domain:
return ('Developer Tools', 'ngrok')
if 'docker.com' in domain:
return ('Developer Tools', 'Docker')
# --- Productivity Apps ---
if 'screencastify.com' in domain:
return ('Productivity', 'Screencastify')
if 'tango.us' in domain:
return ('Productivity', 'Tango')
if 'xplor.com' in domain or 'myxplor' in domain:
return ('Services', 'Xplor Childcare')
# --- Google Services ---
if 'google.com' in domain or 'accounts.google.com' in domain:
if 'performance report' in subject.lower() or 'business profile' in subject.lower():
return ('Google', 'Business Profile')
if 'security' in subject.lower() or 'sign-in' in subject.lower():
return ('Security', 'Google Security')
if 'firebase' in subject.lower() or 'firestore' in subject.lower():
return ('Developer Tools', 'Firebase')
if 'ads' in subject.lower():
return ('Google', 'Google Ads')
if 'analytics' in subject.lower():
return ('Google', 'Analytics')
if re.search(r'verification code|verify', subject, re.I):
return ('Security', 'Google Verification')
return ('Google', 'Other Google')
# --- Microsoft ---
if 'microsoft.com' in domain or 'outlook.com' in domain or 'hotmail.com' in domain:
if 'security' in subject.lower() or 'protection' in domain:
return ('Security', 'Microsoft Security')
return ('Personal', 'Microsoft/Outlook')
# --- Social Media ---
if 'reddit' in domain:
return ('Social', 'Reddit')
# --- Business/Work ---
if 'frontiertechstrategies' in domain:
return ('Business', 'Appointments')
if 'crsaustralia.gov.au' in domain:
return ('Business', 'Job Applications')
if 'v6send.net' in domain:
return ('Shopping', 'Automotive Dealers')
# === SUBJECT-BASED FALLBACK ===
if re.search(r'security alert|verification code|sign.?in|password|2fa', subject, re.I):
return ('Security', 'General Security')
if re.search(r'order.*ship|receipt|payment|invoice|purchase', subject, re.I):
return ('Transactions', 'Orders/Receipts')
if re.search(r'trial|subscription|billing|renew', subject, re.I):
return ('Billing', 'Subscriptions')
if re.search(r'terms of service|privacy policy|legal', subject, re.I):
return ('Legal', 'Policy Updates')
if re.search(r'welcome to|getting started', subject, re.I):
return ('Onboarding', 'Welcome Emails')
# --- Personal contacts ---
if 'gmail.com' in domain:
return ('Personal', 'Gmail Contacts')
return ('Uncategorized', 'Unknown')
def extract_order_ids(emails):
"""Extract order/transaction IDs from emails."""
order_patterns = [
(r'Order\s+(\d{10,})', 'AliExpress Order'),
(r'receipt.*(\d{4}-\d{4}-\d{4})', 'Receipt ID'),
(r'#(\d{4,})', 'Generic Order ID'),
]
orders = []
for email in emails:
subject = email.subject or ""
for pattern, order_type in order_patterns:
match = re.search(pattern, subject, re.I)
if match:
orders.append({
'id': match.group(1),
'type': order_type,
'subject': subject,
'date': str(email.date) if email.date else None,
'sender': email.sender
})
break
return orders
def analyze_time_distribution(emails):
"""Analyze email distribution over time."""
by_year = Counter()
by_month = Counter()
by_day_of_week = Counter()
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
for email in emails:
if email.date:
try:
by_year[email.date.year] += 1
by_month[f"{email.date.year}-{email.date.month:02d}"] += 1
by_day_of_week[day_names[email.date.weekday()]] += 1
except:
pass
return {
'by_year': dict(by_year.most_common()),
'by_month': dict(sorted(by_month.items())),
'by_day_of_week': {d: by_day_of_week.get(d, 0) for d in day_names}
}
def main():
email_dir = "/home/bob/Documents/Email Manager/emails/brett-gmail"
output_dir = Path(__file__).parent.parent / "data"
output_dir.mkdir(exist_ok=True)
print("="*70)
print("BRETT GMAIL DATASET ANALYSIS")
print("="*70)
print(f"\nSource: {email_dir}")
print(f"Output: {output_dir}")
# Parse emails
print("\nParsing emails...")
parser = LocalFileParser(email_dir)
emails = parser.parse_emails()
print(f"Total emails: {len(emails)}")
# Date range
dates = [e.date for e in emails if e.date]
if dates:
dates.sort()
print(f"Date range: {dates[0].strftime('%Y-%m-%d')} to {dates[-1].strftime('%Y-%m-%d')}")
# Classify all emails
print("\nClassifying emails...")
category_counts = Counter()
subcategory_counts = Counter()
by_category = defaultdict(list)
by_subcategory = defaultdict(list)
for email in emails:
category, subcategory = classify_email(email)
category_counts[category] += 1
subcategory_counts[subcategory] += 1
by_category[category].append(email)
by_subcategory[subcategory].append(email)
# Print category summary
print("\n" + "="*70)
print("CATEGORY SUMMARY")
print("="*70)
for category, count in category_counts.most_common():
pct = count / len(emails) * 100
        bar = "█" * int(pct / 2)
print(f"\n{category} ({count} emails, {pct:.1f}%)")
print(f" {bar}")
# Show subcategories
subcats = Counter()
for email in by_category[category]:
_, subcat = classify_email(email)
subcats[subcat] += 1
for subcat, subcount in subcats.most_common():
print(f" - {subcat}: {subcount}")
# Analyze senders
print("\n" + "="*70)
print("TOP SENDERS BY VOLUME")
print("="*70)
sender_counts = Counter(e.sender for e in emails)
for sender, count in sender_counts.most_common(15):
pct = count / len(emails) * 100
print(f" {count:4d} ({pct:4.1f}%) {sender}")
# Time analysis
print("\n" + "="*70)
print("TIME DISTRIBUTION")
print("="*70)
time_dist = analyze_time_distribution(emails)
print("\nBy Year:")
for year, count in sorted(time_dist['by_year'].items()):
        bar = "█" * (count // 10)
print(f" {year}: {count:4d} {bar}")
print("\nBy Day of Week:")
for day, count in time_dist['by_day_of_week'].items():
        bar = "█" * (count // 5)
print(f" {day}: {count:3d} {bar}")
# Extract orders
print("\n" + "="*70)
print("ORDER/TRANSACTION IDs FOUND")
print("="*70)
orders = extract_order_ids(emails)
if orders:
for order in orders[:10]:
print(f" [{order['type']}] {order['id']}")
print(f" Subject: {order['subject'][:60]}...")
else:
print(" No order IDs detected in subjects")
# Actionable insights
print("\n" + "="*70)
print("ACTIONABLE INSIGHTS")
print("="*70)
# High-volume automated senders
automated_domains = ['mutualart.com', 'tripadvisor.com', 'ebay.com', 'spotify.com']
auto_count = sum(1 for e in emails if any(d in (e.sender or '') for d in automated_domains))
print(f"\n1. AUTOMATED EMAILS: {auto_count} ({auto_count/len(emails)*100:.1f}%)")
print(" - MutualArt alerts: Consider aggregating to weekly digest")
print(" - Tripadvisor: Can be filtered to trash or separate folder")
print(" - eBay/Spotify: Promotional, low priority")
# Security alerts
security_count = category_counts.get('Security', 0)
print(f"\n2. SECURITY ALERTS: {security_count} ({security_count/len(emails)*100:.1f}%)")
print(" - Google security: Review for legitimate sign-in attempts")
print(" - Should NOT be auto-filtered")
# Business/Work
business_count = category_counts.get('Business', 0) + category_counts.get('Google', 0)
print(f"\n3. BUSINESS-RELATED: {business_count} ({business_count/len(emails)*100:.1f}%)")
print(" - Google Business Profile reports: Monthly review")
print(" - Job applications: High priority")
print(" - Appointments: Calendar integration")
# AI Services (professional interest)
ai_count = category_counts.get('AI Services', 0) + category_counts.get('Developer Tools', 0)
print(f"\n4. AI/DEVELOPER TOOLS: {ai_count} ({ai_count/len(emails)*100:.1f}%)")
print(" - Anthropic, OpenAI, Lambda: Keep for reference")
print(" - ngrok, Docker, Cursor: Developer updates")
# Personal
personal_count = category_counts.get('Personal', 0)
print(f"\n5. PERSONAL: {personal_count} ({personal_count/len(emails)*100:.1f}%)")
print(" - Gmail contacts: May need human review")
print(" - Microsoft/Outlook: Check for spam")
# Save analysis data
analysis_data = {
'metadata': {
'total_emails': len(emails),
'date_range': {
'start': str(dates[0]) if dates else None,
'end': str(dates[-1]) if dates else None
},
'analyzed_at': datetime.now().isoformat()
},
'categories': dict(category_counts),
'subcategories': dict(subcategory_counts),
'top_senders': dict(sender_counts.most_common(50)),
'time_distribution': time_dist,
'orders_found': orders,
'classification_accuracy': {
'categorized': len(emails) - category_counts.get('Uncategorized', 0),
'uncategorized': category_counts.get('Uncategorized', 0),
'accuracy_pct': (len(emails) - category_counts.get('Uncategorized', 0)) / len(emails) * 100
}
}
output_file = output_dir / "brett_gmail_analysis.json"
with open(output_file, 'w') as f:
json.dump(analysis_data, f, indent=2)
print(f"\n\nAnalysis saved to: {output_file}")
print("\n" + "="*70)
print(f"CLASSIFICATION ACCURACY: {analysis_data['classification_accuracy']['accuracy_pct']:.1f}%")
print(f"({analysis_data['classification_accuracy']['categorized']} categorized, "
f"{analysis_data['classification_accuracy']['uncategorized']} uncategorized)")
print("="*70)
if __name__ == '__main__':
main()
tools/brett_microsoft_analyzer.py (new file)

@@ -0,0 +1,500 @@
#!/usr/bin/env python3
"""
Brett Microsoft (Outlook) Dataset Analyzer
==========================================
CUSTOM script for analyzing the brett-microsoft email dataset.
NOT portable to other datasets without modification.
Usage:
python tools/brett_microsoft_analyzer.py
Output:
- Console report with comprehensive statistics
- data/brett_microsoft_analysis.json with full analysis data
"""
import json
import re
from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path
# Add parent to path for imports
import sys
sys.path.insert(0, str(Path(__file__).parent.parent))
from src.calibration.local_file_parser import LocalFileParser
# =============================================================================
# CLASSIFICATION RULES - CUSTOM FOR BRETT'S MICROSOFT/OUTLOOK INBOX
# =============================================================================
def classify_email(email):
"""
Classify email into categories based on sender domain and subject patterns.
This is a BUSINESS inbox - different approach than personal Gmail.
Priority: Sender domain > Subject keywords > Business context
"""
sender = email.sender or ""
subject = email.subject or ""
domain = sender.split('@')[-1] if '@' in sender else sender
# === BUSINESS OPERATIONS ===
# MYOB/Accounting
if 'apps.myob.com' in domain or 'myob' in subject.lower():
return ('Business Operations', 'MYOB Invoices')
# TPG/Telecom/Internet
if 'tpgtelecom.com.au' in domain or 'aapt.com.au' in domain:
if 'suspension' in subject.lower() or 'overdue' in subject.lower():
return ('Business Operations', 'Telecom - Urgent/Overdue')
if 'novation' in subject.lower():
return ('Business Operations', 'Telecom - Contract Changes')
if 'NBN' in subject or 'nbn' in subject.lower():
return ('Business Operations', 'Telecom - NBN')
return ('Business Operations', 'Telecom - General')
# DocuSign (Contracts)
if 'docusign' in domain or 'docusign' in subject.lower():
return ('Business Operations', 'DocuSign Contracts')
# === CLIENT WORK ===
# Green Output / Energy Avengers (App Development Client)
if 'greenoutput.com.au' in domain or 'energyavengers' in domain:
return ('Client Work', 'Energy Avengers Project')
# Brighter Access (Client)
if 'brighteraccess' in domain or 'Brighter Access' in subject:
return ('Client Work', 'Brighter Access')
# Waterfall Way Designs (Business Partner)
if 'waterfallwaydesigns' in domain:
return ('Client Work', 'Waterfall Way Designs')
# Target Impact
if 'targetimpact.com.au' in domain:
return ('Client Work', 'Target Impact')
# MerlinFX
if 'merlinfx.com.au' in domain:
return ('Client Work', 'MerlinFX')
# Solar/Energy related (Energy Avengers ecosystem)
if 'solarairenergy.com.au' in domain or 'solarconnected.com.au' in domain:
return ('Client Work', 'Energy Avengers Ecosystem')
if 'eonadvisory.com.au' in domain or 'australianpowerbrokers.com.au' in domain:
return ('Client Work', 'Energy Avengers Ecosystem')
if 'fyconsulting.com.au' in domain:
return ('Client Work', 'Energy Avengers Ecosystem')
if 'convergedesign.com.au' in domain:
return ('Client Work', 'Energy Avengers Ecosystem')
# MYP Corp (Disability Services Software)
if '1myp.com' in domain or 'mypcorp' in domain or 'MYP' in subject:
return ('Business Operations', 'MYP Software')
# === MICROSOFT SERVICES ===
# Microsoft Support Cases
if re.search(r'\[Case.*#|Case #|TrackingID', subject, re.I) or 'support.microsoft.com' in domain:
return ('Microsoft', 'Support Cases')
# Microsoft Billing/Invoices
if 'Microsoft invoice' in subject or 'credit card was declined' in subject:
return ('Microsoft', 'Billing')
# Microsoft Subscriptions
if 'subscription' in subject.lower() and 'microsoft' in sender.lower():
return ('Microsoft', 'Subscriptions')
# SharePoint/Teams
if 'sharepointonline.com' in domain or 'Teams' in subject:
return ('Microsoft', 'SharePoint/Teams')
# O365 Service Updates
if 'o365su' in sender or ('digest' in subject.lower() and 'microsoft' in sender.lower()):
return ('Microsoft', 'Service Updates')
# General Microsoft
if 'microsoft.com' in domain:
return ('Microsoft', 'General')
# === DEVELOPER TOOLS ===
# GitHub CI/CD
if re.search(r'\[FSSCoding', subject):
return ('Developer', 'GitHub CI/CD Failures')
# GitHub Issues/PRs
if 'github.com' in domain:
if 'linuxmint' in subject or 'cinnamon' in subject:
return ('Developer', 'Open Source Contributions')
if 'Pheromind' in subject or 'ChrisRoyse' in subject:
return ('Developer', 'GitHub Collaborations')
return ('Developer', 'GitHub Notifications')
# Neo4j
if 'neo4j.com' in domain:
if 'webinar' in subject.lower() or 'Webinar' in subject:
return ('Developer', 'Neo4j Webinars')
if 'NODES' in subject or 'GraphTalk' in subject:
return ('Developer', 'Neo4j Conference')
return ('Developer', 'Neo4j')
# Cursor (AI IDE)
if 'cursor.com' in domain or 'cursor.so' in domain or 'Cursor' in subject:
return ('Developer', 'Cursor IDE')
# Tailscale
if 'tailscale.com' in domain:
return ('Developer', 'Tailscale')
# Hugging Face
if 'huggingface' in domain or 'Hugging Face' in subject:
return ('Developer', 'Hugging Face')
# Stripe (Payment Failures)
if 'stripe.com' in domain:
return ('Billing', 'Stripe Payments')
# Contabo (Hosting)
if 'contabo.com' in domain:
return ('Developer', 'Contabo Hosting')
# SendGrid
if 'sendgrid' in subject.lower():
return ('Developer', 'SendGrid')
# Twilio
if 'twilio.com' in domain:
return ('Developer', 'Twilio')
# Brave Search API
if 'brave.com' in domain:
return ('Developer', 'Brave Search API')
# PyPI
if 'pypi' in subject.lower() or 'pypi.org' in domain:
return ('Developer', 'PyPI')
# NVIDIA/CUDA
if 'CUDA' in subject or 'nvidia' in domain:
return ('Developer', 'NVIDIA/CUDA')
# Inception Labs / AI Tools
if 'inceptionlabs.ai' in domain:
return ('Developer', 'AI Tools')
# === LEARNING ===
# Computer Enhance (Casey Muratori) / Substack
if 'computerenhance' in sender or 'substack.com' in domain:
return ('Learning', 'Substack/Newsletters')
# Odoo
if 'odoo.com' in domain:
return ('Learning', 'Odoo ERP')
# Mozilla Firefox
if 'mozilla.org' in domain:
return ('Developer', 'Mozilla Firefox')
# === PERSONAL / COMMUNITY ===
# Grandfather Gatherings (Personal Community)
if 'Grandfather Gather' in subject:
return ('Personal', 'Grandfather Gatherings')
# Mailchimp newsletters (often personal)
if 'mailchimpapp.com' in domain:
return ('Personal', 'Personal Newsletters')
# Community Events
if 'Community Working Bee' in subject:
return ('Personal', 'Community Events')
# Personal emails (Gmail/Hotmail)
if 'gmail.com' in domain or 'hotmail.com' in domain or 'bigpond.com' in domain:
return ('Personal', 'Personal Contacts')
# FSS Internal
if 'foxsoftwaresolutions.com.au' in domain:
return ('Business Operations', 'FSS Internal')
# === FINANCIAL ===
# eToro
if 'etoro.com' in domain:
return ('Financial', 'eToro Trading')
# Dell
if 'dell.com' in domain or 'Dell' in subject:
return ('Business Operations', 'Dell Hardware')
# Insurance
if 'KT Insurance' in subject or 'insurance' in subject.lower():
return ('Business Operations', 'Insurance')
# SBSCH Payments
if 'SBSCH' in subject:
return ('Business Operations', 'SBSCH Payments')
# iCare NSW
if 'icare.nsw.gov.au' in domain:
return ('Business Operations', 'iCare NSW')
# Vodafone
if 'vodafone.com.au' in domain:
return ('Business Operations', 'Telecom - Vodafone')
# === MISC ===
# Undeliverable/Bounces
if 'Undeliverable' in subject:
return ('System', 'Email Bounces')
# Security
if re.search(r'Security Alert|Login detected|security code|Verify', subject, re.I):
return ('Security', 'Security Alerts')
# Password Reset
if 'password' in subject.lower():
return ('Security', 'Password')
# Calendly
if 'calendly.com' in domain:
return ('Business Operations', 'Calendly')
# Trello
if 'trello.com' in domain:
return ('Business Operations', 'Trello')
# Scorptec
if 'scorptec' in domain:
return ('Business Operations', 'Hardware Vendor')
# Webcentral
if 'webcentral.com.au' in domain:
return ('Business Operations', 'Web Hosting')
# Bluetti (Hardware)
if 'bluettipower.com' in domain:
return ('Business Operations', 'Hardware - Power')
# ABS Surveys
if 'abs.gov.au' in domain:
return ('Business Operations', 'Government - ABS')
# Qualtrics/Surveys
if 'qualtrics' in domain:
return ('Business Operations', 'Surveys')
return ('Uncategorized', 'Unknown')
def extract_case_ids(emails):
"""Extract Microsoft support case IDs and tracking IDs from emails."""
case_patterns = [
(r'Case\s*#?\s*:?\s*(\d{8})', 'Microsoft Case'),
(r'\[Case\s*#?\s*:?\s*(\d{8})\]', 'Microsoft Case'),
(r'TrackingID#(\d{16})', 'Tracking ID'),
]
cases = defaultdict(list)
for email in emails:
subject = email.subject or ""
for pattern, case_type in case_patterns:
match = re.search(pattern, subject, re.I)
if match:
case_id = match.group(1)
cases[case_id].append({
'type': case_type,
'subject': subject,
'date': str(email.date) if email.date else None,
'sender': email.sender
})
return dict(cases)
def analyze_time_distribution(emails):
"""Analyze email distribution over time."""
by_year = Counter()
by_month = Counter()
by_day_of_week = Counter()
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
for email in emails:
if email.date:
try:
by_year[email.date.year] += 1
by_month[f"{email.date.year}-{email.date.month:02d}"] += 1
by_day_of_week[day_names[email.date.weekday()]] += 1
except:
pass
return {
'by_year': dict(by_year.most_common()),
'by_month': dict(sorted(by_month.items())),
'by_day_of_week': {d: by_day_of_week.get(d, 0) for d in day_names}
}
def main():
email_dir = "/home/bob/Documents/Email Manager/emails/brett-microsoft"
output_dir = Path(__file__).parent.parent / "data"
output_dir.mkdir(exist_ok=True)
print("="*70)
print("BRETT MICROSOFT (OUTLOOK) DATASET ANALYSIS")
print("="*70)
print(f"\nSource: {email_dir}")
print(f"Output: {output_dir}")
# Parse emails
print("\nParsing emails...")
parser = LocalFileParser(email_dir)
emails = parser.parse_emails()
print(f"Total emails: {len(emails)}")
# Date range
dates = [e.date for e in emails if e.date]
if dates:
dates.sort()
print(f"Date range: {dates[0].strftime('%Y-%m-%d')} to {dates[-1].strftime('%Y-%m-%d')}")
# Classify all emails
print("\nClassifying emails...")
category_counts = Counter()
subcategory_counts = Counter()
by_category = defaultdict(list)
by_subcategory = defaultdict(list)
for email in emails:
category, subcategory = classify_email(email)
category_counts[category] += 1
subcategory_counts[f"{category}: {subcategory}"] += 1
by_category[category].append(email)
by_subcategory[subcategory].append(email)
# Print category summary
print("\n" + "="*70)
print("TOP-LEVEL CATEGORY SUMMARY")
print("="*70)
for category, count in category_counts.most_common():
pct = count / len(emails) * 100
        bar = "█" * int(pct / 2)
print(f"\n{category} ({count} emails, {pct:.1f}%)")
print(f" {bar}")
# Show subcategories
subcats = Counter()
for email in by_category[category]:
_, subcat = classify_email(email)
subcats[subcat] += 1
for subcat, subcount in subcats.most_common():
print(f" - {subcat}: {subcount}")
# Analyze senders
print("\n" + "="*70)
print("TOP SENDERS BY VOLUME")
print("="*70)
sender_counts = Counter(e.sender for e in emails)
for sender, count in sender_counts.most_common(15):
pct = count / len(emails) * 100
print(f" {count:4d} ({pct:4.1f}%) {sender}")
# Time analysis
print("\n" + "="*70)
print("TIME DISTRIBUTION")
print("="*70)
time_dist = analyze_time_distribution(emails)
print("\nBy Year:")
for year, count in sorted(time_dist['by_year'].items()):
        bar = "█" * (count // 10)
print(f" {year}: {count:4d} {bar}")
print("\nBy Day of Week:")
for day, count in time_dist['by_day_of_week'].items():
        bar = "█" * (count // 5)
print(f" {day}: {count:3d} {bar}")
# Extract case IDs
print("\n" + "="*70)
print("MICROSOFT SUPPORT CASES TRACKED")
print("="*70)
cases = extract_case_ids(emails)
if cases:
for case_id, occurrences in sorted(cases.items()):
print(f"\n Case/Tracking: {case_id} ({len(occurrences)} emails)")
for occ in occurrences[:3]:
print(f" - {occ['date']}: {occ['subject'][:50]}...")
else:
print(" No case IDs detected")
# Actionable insights
print("\n" + "="*70)
print("INBOX CHARACTER ASSESSMENT")
print("="*70)
business_pct = (category_counts.get('Business Operations', 0) +
category_counts.get('Client Work', 0) +
category_counts.get('Developer', 0)) / len(emails) * 100
personal_pct = category_counts.get('Personal', 0) / len(emails) * 100
print(f"\n Business/Professional: {business_pct:.1f}%")
print(f" Personal: {personal_pct:.1f}%")
print(f"\n ASSESSMENT: This is a {'BUSINESS' if business_pct > 50 else 'MIXED'} inbox")
# Save analysis data
analysis_data = {
'metadata': {
'total_emails': len(emails),
'inbox_type': 'microsoft',
'inbox_character': 'business' if business_pct > 50 else 'mixed',
'date_range': {
'start': str(dates[0]) if dates else None,
'end': str(dates[-1]) if dates else None
},
'analyzed_at': datetime.now().isoformat()
},
'categories': dict(category_counts),
'subcategories': dict(subcategory_counts),
'top_senders': dict(sender_counts.most_common(50)),
'time_distribution': time_dist,
'support_cases': cases,
'classification_accuracy': {
'categorized': len(emails) - category_counts.get('Uncategorized', 0),
'uncategorized': category_counts.get('Uncategorized', 0),
'accuracy_pct': (len(emails) - category_counts.get('Uncategorized', 0)) / len(emails) * 100
}
}
output_file = output_dir / "brett_microsoft_analysis.json"
with open(output_file, 'w') as f:
json.dump(analysis_data, f, indent=2)
print(f"\n\nAnalysis saved to: {output_file}")
print("\n" + "="*70)
print(f"CLASSIFICATION ACCURACY: {analysis_data['classification_accuracy']['accuracy_pct']:.1f}%")
print(f"({analysis_data['classification_accuracy']['categorized']} categorized, "
f"{analysis_data['classification_accuracy']['uncategorized']} uncategorized)")
print("="*70)
if __name__ == '__main__':
main()
tools/generate_html_report.py (new file)

@@ -0,0 +1,642 @@
#!/usr/bin/env python3
"""
Generate interactive HTML report from email classification results.
Usage:
python tools/generate_html_report.py --input results.json --output report.html
"""
import argparse
import json
from pathlib import Path
from datetime import datetime
from collections import Counter, defaultdict
from html import escape
def load_results(input_path: str) -> dict:
"""Load classification results from JSON."""
with open(input_path) as f:
return json.load(f)
def extract_domain(sender: str) -> str:
"""Extract domain from email address."""
if not sender:
return "unknown"
if "@" in sender:
return sender.split("@")[-1].lower()
return sender.lower()
def format_date(date_str: str) -> str:
"""Format ISO date string for display."""
if not date_str:
return "N/A"
try:
dt = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
return dt.strftime("%Y-%m-%d %H:%M")
except:
return date_str[:16] if len(date_str) > 16 else date_str
def truncate(text: str, max_len: int = 60) -> str:
"""Truncate text with ellipsis."""
if not text:
return ""
if len(text) <= max_len:
return text
return text[:max_len-3] + "..."
def generate_html_report(results: dict, output_path: str):
"""Generate interactive HTML report."""
metadata = results.get("metadata", {})
classifications = results.get("classifications", [])
# Calculate statistics
total = len(classifications)
categories = Counter(c["category"] for c in classifications)
methods = Counter(c["method"] for c in classifications)
# Group by category
by_category = defaultdict(list)
for c in classifications:
by_category[c["category"]].append(c)
# Sort categories by count
sorted_categories = sorted(categories.keys(), key=lambda x: categories[x], reverse=True)
# Sender statistics
sender_domains = Counter(extract_domain(c.get("sender", "")) for c in classifications)
top_senders = Counter(c.get("sender", "unknown") for c in classifications).most_common(20)
# Confidence distribution
high_conf = sum(1 for c in classifications if c.get("confidence", 0) >= 0.7)
med_conf = sum(1 for c in classifications if 0.5 <= c.get("confidence", 0) < 0.7)
low_conf = sum(1 for c in classifications if c.get("confidence", 0) < 0.5)
# Generate HTML
html = f'''<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Email Classification Report</title>
<style>
:root {{
--bg-primary: #1a1a2e;
--bg-secondary: #16213e;
--bg-card: #0f3460;
--text-primary: #eee;
--text-secondary: #aaa;
--accent: #e94560;
--accent-hover: #ff6b6b;
--success: #00d9a5;
--warning: #ffc107;
--border: #2a2a4a;
}}
* {{
margin: 0;
padding: 0;
box-sizing: border-box;
}}
body {{
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, sans-serif;
background: var(--bg-primary);
color: var(--text-primary);
line-height: 1.6;
}}
.container {{
max-width: 1400px;
margin: 0 auto;
padding: 20px;
}}
header {{
background: var(--bg-secondary);
padding: 30px;
border-radius: 12px;
margin-bottom: 30px;
border: 1px solid var(--border);
}}
header h1 {{
font-size: 2rem;
margin-bottom: 10px;
color: var(--accent);
}}
.meta-info {{
display: flex;
flex-wrap: wrap;
gap: 20px;
margin-top: 15px;
color: var(--text-secondary);
font-size: 0.9rem;
}}
.meta-info span {{
background: var(--bg-card);
padding: 5px 12px;
border-radius: 20px;
}}
.stats-grid {{
display: grid;
grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
gap: 20px;
margin-bottom: 30px;
}}
.stat-card {{
background: var(--bg-secondary);
padding: 20px;
border-radius: 12px;
border: 1px solid var(--border);
text-align: center;
}}
.stat-card .value {{
font-size: 2.5rem;
font-weight: bold;
color: var(--accent);
}}
.stat-card .label {{
color: var(--text-secondary);
font-size: 0.9rem;
margin-top: 5px;
}}
.tabs {{
display: flex;
flex-wrap: wrap;
gap: 10px;
margin-bottom: 20px;
border-bottom: 2px solid var(--border);
padding-bottom: 10px;
}}
.tab {{
padding: 10px 20px;
background: var(--bg-secondary);
border: 1px solid var(--border);
border-radius: 8px 8px 0 0;
cursor: pointer;
transition: all 0.2s;
color: var(--text-secondary);
}}
.tab:hover {{
background: var(--bg-card);
color: var(--text-primary);
}}
.tab.active {{
background: var(--accent);
color: white;
border-color: var(--accent);
}}
.tab .count {{
background: rgba(255,255,255,0.2);
padding: 2px 8px;
border-radius: 10px;
font-size: 0.8rem;
margin-left: 8px;
}}
.tab-content {{
display: none;
}}
.tab-content.active {{
display: block;
}}
.email-table {{
width: 100%;
border-collapse: collapse;
background: var(--bg-secondary);
border-radius: 12px;
overflow: hidden;
}}
.email-table th {{
background: var(--bg-card);
padding: 15px;
text-align: left;
font-weight: 600;
color: var(--text-primary);
position: sticky;
top: 0;
}}
.email-table td {{
padding: 12px 15px;
border-bottom: 1px solid var(--border);
color: var(--text-secondary);
}}
.email-table tr:hover td {{
background: var(--bg-card);
color: var(--text-primary);
}}
.email-table .subject {{
max-width: 400px;
color: var(--text-primary);
}}
.email-table .sender {{
max-width: 250px;
}}
.confidence {{
display: inline-block;
padding: 3px 10px;
border-radius: 12px;
font-size: 0.85rem;
font-weight: 500;
}}
.confidence.high {{
background: rgba(0, 217, 165, 0.2);
color: var(--success);
}}
.confidence.medium {{
background: rgba(255, 193, 7, 0.2);
color: var(--warning);
}}
.confidence.low {{
background: rgba(233, 69, 96, 0.2);
color: var(--accent);
}}
.method-badge {{
display: inline-block;
padding: 3px 8px;
border-radius: 4px;
font-size: 0.75rem;
text-transform: uppercase;
}}
.method-ml {{
background: rgba(0, 217, 165, 0.2);
color: var(--success);
}}
.method-rule {{
background: rgba(100, 149, 237, 0.2);
color: cornflowerblue;
}}
.method-llm {{
background: rgba(255, 193, 7, 0.2);
color: var(--warning);
}}
.section {{
background: var(--bg-secondary);
padding: 25px;
border-radius: 12px;
margin-bottom: 30px;
border: 1px solid var(--border);
}}
.section h2 {{
margin-bottom: 20px;
color: var(--accent);
font-size: 1.3rem;
}}
.chart-bar {{
display: flex;
align-items: center;
margin-bottom: 10px;
}}
.chart-bar .label {{
width: 150px;
font-size: 0.9rem;
color: var(--text-secondary);
}}
.chart-bar .bar-container {{
flex: 1;
height: 24px;
background: var(--bg-card);
border-radius: 4px;
overflow: hidden;
margin: 0 15px;
}}
.chart-bar .bar {{
height: 100%;
background: linear-gradient(90deg, var(--accent), var(--accent-hover));
transition: width 0.5s ease;
}}
.chart-bar .value {{
width: 80px;
text-align: right;
font-size: 0.9rem;
}}
.sender-list {{
display: grid;
grid-template-columns: repeat(auto-fill, minmax(300px, 1fr));
gap: 10px;
}}
.sender-item {{
display: flex;
justify-content: space-between;
padding: 10px 15px;
background: var(--bg-card);
border-radius: 8px;
font-size: 0.9rem;
}}
.sender-item .email {{
color: var(--text-secondary);
overflow: hidden;
text-overflow: ellipsis;
white-space: nowrap;
max-width: 220px;
}}
.sender-item .count {{
color: var(--accent);
font-weight: bold;
}}
.search-box {{
width: 100%;
padding: 12px 20px;
background: var(--bg-card);
border: 1px solid var(--border);
border-radius: 8px;
color: var(--text-primary);
font-size: 1rem;
margin-bottom: 20px;
}}
.search-box:focus {{
outline: none;
border-color: var(--accent);
}}
.table-container {{
max-height: 600px;
overflow-y: auto;
border-radius: 12px;
}}
.attachment-icon {{
color: var(--warning);
}}
footer {{
text-align: center;
padding: 20px;
color: var(--text-secondary);
font-size: 0.85rem;
}}
</style>
</head>
<body>
<div class="container">
<header>
<h1>Email Classification Report</h1>
<p>Automated analysis of email inbox</p>
<div class="meta-info">
<span>Generated: {datetime.now().strftime("%Y-%m-%d %H:%M")}</span>
<span>Source: {escape(metadata.get("source", "unknown"))}</span>
<span>Total Emails: {total:,}</span>
</div>
</header>
<div class="stats-grid">
<div class="stat-card">
<div class="value">{total:,}</div>
<div class="label">Total Emails</div>
</div>
<div class="stat-card">
<div class="value">{len(categories)}</div>
<div class="label">Categories</div>
</div>
<div class="stat-card">
<div class="value">{high_conf}</div>
<div class="label">High Confidence (&ge;70%)</div>
</div>
<div class="stat-card">
<div class="value">{len(sender_domains)}</div>
<div class="label">Unique Domains</div>
</div>
</div>
<div class="section">
<h2>Category Distribution</h2>
{"".join(f'''
<div class="chart-bar">
<div class="label">{escape(cat)}</div>
<div class="bar-container">
<div class="bar" style="width: {categories[cat]/total*100:.1f}%"></div>
</div>
<div class="value">{categories[cat]:,} ({categories[cat]/total*100:.1f}%)</div>
</div>
''' for cat in sorted_categories)}
</div>
<div class="section">
<h2>Classification Methods</h2>
{"".join(f'''
<div class="chart-bar">
<div class="label">{escape(method.upper())}</div>
<div class="bar-container">
<div class="bar" style="width: {methods[method]/total*100:.1f}%"></div>
</div>
<div class="value">{methods[method]:,} ({methods[method]/total*100:.1f}%)</div>
</div>
''' for method in sorted(methods.keys()))}
</div>
<div class="section">
<h2>Confidence Distribution</h2>
<div class="chart-bar">
<div class="label">High (&ge;70%)</div>
<div class="bar-container">
<div class="bar" style="width: {high_conf/total*100:.1f}%; background: linear-gradient(90deg, #00d9a5, #00ffcc);"></div>
</div>
<div class="value">{high_conf:,} ({high_conf/total*100:.1f}%)</div>
</div>
<div class="chart-bar">
<div class="label">Medium (50-70%)</div>
<div class="bar-container">
<div class="bar" style="width: {med_conf/total*100:.1f}%; background: linear-gradient(90deg, #ffc107, #ffdb58);"></div>
</div>
<div class="value">{med_conf:,} ({med_conf/total*100:.1f}%)</div>
</div>
<div class="chart-bar">
<div class="label">Low (&lt;50%)</div>
<div class="bar-container">
<div class="bar" style="width: {low_conf/total*100:.1f}%; background: linear-gradient(90deg, #e94560, #ff6b6b);"></div>
</div>
<div class="value">{low_conf:,} ({low_conf/total*100:.1f}%)</div>
</div>
</div>
<div class="section">
<h2>Top Senders</h2>
<div class="sender-list">
{"".join(f'''
<div class="sender-item">
<span class="email" title="{escape(sender)}">{escape(truncate(sender, 35))}</span>
<span class="count">{count}</span>
</div>
''' for sender, count in top_senders)}
</div>
</div>
<div class="section">
<h2>Emails by Category</h2>
<div class="tabs">
<div class="tab active" onclick="showTab('all')">All<span class="count">{total}</span></div>
{"".join(f'''<div class="tab" onclick="showTab('{escape(cat)}')">{escape(cat)}<span class="count">{categories[cat]}</span></div>''' for cat in sorted_categories)}
</div>
<input type="text" class="search-box" placeholder="Search by subject, sender..." onkeyup="filterTable(this.value)">
<div id="tab-all" class="tab-content active">
<div class="table-container">
<table class="email-table" id="email-table-all">
<thead>
<tr>
<th>Date</th>
<th>Subject</th>
<th>Sender</th>
<th>Category</th>
<th>Confidence</th>
<th>Method</th>
</tr>
</thead>
<tbody>
{"".join(generate_email_row(c) for c in sorted(classifications, key=lambda x: x.get("date") or "", reverse=True))}
</tbody>
</table>
</div>
</div>
{"".join(f'''
<div id="tab-{escape(cat)}" class="tab-content">
<div class="table-container">
<table class="email-table">
<thead>
<tr>
<th>Date</th>
<th>Subject</th>
<th>Sender</th>
<th>Confidence</th>
<th>Method</th>
</tr>
</thead>
<tbody>
{"".join(generate_email_row(c, show_category=False) for c in sorted(by_category[cat], key=lambda x: x.get("date") or "", reverse=True))}
</tbody>
</table>
</div>
</div>
''' for cat in sorted_categories)}
</div>
<footer>
Generated by Email Sorter | {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
</footer>
</div>
<script>
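        // Client-side behavior for the static report: tab switching and
        // live search over the email tables. No external libraries needed.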
        function showTab(tab, tabId) {{
            // Deactivate every panel and tab button
            document.querySelectorAll('.tab-content').forEach(el => el.classList.remove('active'));
            document.querySelectorAll('.tab').forEach(el => el.classList.remove('active'));
            // Activate the selected panel. The clicked button is passed in as
            // `this` so that clicks landing on the inner count badge still
            // highlight the tab itself (event.target would be the badge).
            document.getElementById('tab-' + tabId).classList.add('active');
            tab.classList.add('active');
        }}
        function filterTable(query) {{
            // Filter every table, not just the active tab, so the search
            // persists across tab switches. Rows match on their data-search
            // attribute (subject + sender), falling back to full row text.
            query = query.toLowerCase();
            document.querySelectorAll('.email-table tbody tr').forEach(row => {{
                const text = (row.dataset.search || row.textContent).toLowerCase();
                row.style.display = text.includes(query) ? '' : 'none';
            }});
        }}
</script>
</body>
</html>
'''
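    # Write the finished document; UTF-8 matters for the attachment emoji.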
with open(output_path, "w", encoding="utf-8") as f:
f.write(html)
print(f"Report generated: {output_path}")
print(f" Total emails: {total:,}")
print(f" Categories: {len(categories)}")
print(f" Top category: {sorted_categories[0]} ({categories[sorted_categories[0]]:,})")
def generate_email_row(c: dict, show_category: bool = True) -> str:
"""Generate HTML table row for an email."""
conf = c.get("confidence", 0)
conf_class = "high" if conf >= 0.7 else "medium" if conf >= 0.5 else "low"
method = c.get("method", "unknown")
method_class = f"method-{method}"
attachment_icon = '<span class="attachment-icon" title="Has attachments">📎</span> ' if c.get("has_attachments") else ""
category_col = f'<td>{escape(c.get("category", "unknown"))}</td>' if show_category else ""
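    # data-search carries subject + sender only, matching the search box placeholder.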
return f'''
<tr data-search="{escape(c.get('subject', ''))} {escape(c.get('sender', ''))}">
<td>{format_date(c.get("date"))}</td>
<td class="subject">{attachment_icon}{escape(truncate(c.get("subject", "No subject"), 70))}</td>
<td class="sender" title="{escape(c.get('sender', ''))}">{escape(truncate(c.get("sender_name") or c.get("sender", ""), 35))}</td>
{category_col}
<td><span class="confidence {conf_class}">{conf*100:.0f}%</span></td>
<td><span class="method-badge {method_class}">{method}</span></td>
</tr>
'''
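# Example invocations (paths are illustrative):
#   python generate_html_report.py --input results.json
#   python generate_html_report.py -i results.json -o custom_report.html
# When --output is omitted, report.html is written next to the input file.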
def main():
    """CLI entry point: parse arguments, validate the input path, render the report."""
parser = argparse.ArgumentParser(description="Generate HTML report from classification results")
parser.add_argument("--input", "-i", required=True, help="Path to results.json")
parser.add_argument("--output", "-o", default=None, help="Output HTML file path")
args = parser.parse_args()
input_path = Path(args.input)
if not input_path.exists():
print(f"Error: Input file not found: {input_path}")
return 1
output_path = args.output or str(input_path.parent / "report.html")
results = load_results(args.input)
generate_html_report(results, output_path)
return 0
if __name__ == "__main__":
    raise SystemExit(main())