- Rewrote CLAUDE.md with comprehensive development guide
- Archived 20 old docs to docs/archive/
- Added PROJECT_ROADMAP_2025.md with research learnings
- Added CLASSIFICATION_METHODS_COMPARISON.md
- Added SESSION_HANDOVER_20251128.md
- Added tools for analysis (brett_gmail/microsoft analyzers)
- Updated .gitignore for archive folders
- Config changes for local vLLM endpoint

# Email Sorter - Development Guide

## What This Tool Does

**Email Sorter is a TRIAGE tool** that sorts emails into buckets for downstream processing. It is NOT a complete email management solution - it's one part of a larger ecosystem.

```
Raw Inbox (10k+) --> Email Sorter --> Categorized Buckets --> Specialized Tools
                     (this tool)      (output)                (other tools)
```

---

## Quick Start

```bash
cd /MASTERFOLDER/Tools/email-sorter
source venv/bin/activate

# Classify emails with ML + LLM fallback
python -m src.cli run --source local \
    --directory "/path/to/emails" \
    --output "/path/to/output" \
    --force-ml --llm-provider openai

# Generate HTML report from results
python tools/generate_html_report.py --input /path/to/results.json
```

---

## Key Documentation

| Document | Purpose | Location |
|----------|---------|----------|
| **PROJECT_ROADMAP_2025.md** | Master learnings, research findings, development roadmap | `docs/` |
| **CLASSIFICATION_METHODS_COMPARISON.md** | ML vs LLM vs Agent comparison | `docs/` |
| **REPORT_FORMAT.md** | HTML report documentation | `docs/` |
| **BATCH_LLM_QUICKSTART.md** | Quick LLM batch processing guide | root |

---

## Research Findings Summary

### Dataset Size Routing

| Size | Best Method | Why |
|------|-------------|-----|
| <500 | Agent-only | ML overhead exceeds benefit |
| 500-5000 | Agent pre-scan + ML | Discovery improves accuracy |
| >5000 | ML pipeline | Speed critical |

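
The routing table above can be read as a simple lookup. Here is a minimal sketch, assuming the boundaries are inclusive as written (500 and 5000 both fall in the middle band). `choose_method` and the returned labels are illustrative names, not part of the Email Sorter codebase:

```python
def choose_method(email_count: int) -> str:
    """Suggest a classification method from the dataset size (hypothetical helper)."""
    if email_count < 0:
        raise ValueError("email_count must be non-negative")
    if email_count < 500:
        return "agent-only"          # ML overhead exceeds benefit
    if email_count <= 5000:
        return "agent-prescan+ml"    # discovery pass first, then ML
    return "ml-pipeline"             # speed is critical at this size


# An inbox the size of brett-gmail (801 emails) lands in the middle band.
print(choose_method(801))
```
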
### Research Results

Classification accuracy by method (the number in parentheses is the email count):

| Dataset (emails) | Inbox Type | ML-Only | ML+LLM | Agent |
|------------------|------------|---------|--------|-------|
| brett-gmail (801) | Personal | 54.9% | 93.3% | 99.8% |
| brett-microsoft (596) | Business | - | - | 98.2% |

### Key Insight: Inbox Character Matters

| Type | Pattern | Approach |
|------|---------|----------|
| **Personal** | Subscriptions, marketing (40-50% automated) | Sender domain first |
| **Business** | Client work, operations (60-70% professional) | Sender + Subject context |

---

## Project Structure

```
email-sorter/
├── CLAUDE.md                          # THIS FILE
├── README.md                          # General readme
├── BATCH_LLM_QUICKSTART.md            # LLM batch processing
│
├── src/                               # Source code
│   ├── cli.py                         # Main entry point
│   ├── classification/                # ML/LLM classification
│   ├── calibration/                   # Model training, email parsing
│   ├── email_providers/               # Gmail, Outlook, IMAP, Local
│   └── llm/                           # LLM providers
│
├── tools/                             # Utility scripts
│   ├── brett_gmail_analyzer.py        # Personal inbox template
│   ├── brett_microsoft_analyzer.py    # Business inbox template
│   ├── generate_html_report.py        # HTML report generator
│   └── batch_llm_classifier.py        # Batch LLM classification
│
├── config/                            # Configuration
│   ├── default_config.yaml            # LLM endpoints, thresholds
│   └── categories.yaml                # Category definitions
│
├── docs/                              # Current documentation
│   ├── PROJECT_ROADMAP_2025.md
│   ├── CLASSIFICATION_METHODS_COMPARISON.md
│   ├── REPORT_FORMAT.md
│   └── archive/                       # Old docs (historical)
│
├── data/                              # Analysis outputs (gitignored)
│   ├── brett_gmail_analysis.json
│   └── brett_microsoft_analysis.json
│
├── credentials/                       # OAuth/API creds (gitignored)
├── results/                           # Classification outputs (gitignored)
├── archive/                           # Old scripts (gitignored)
├── maildir/                           # Enron test data
└── venv/                              # Python environment
```

---

## Common Operations

### 1. Classify Emails (ML Pipeline)

```bash
source venv/bin/activate

# With LLM fallback for low confidence
python -m src.cli run --source local \
    --directory "/path/to/emails" \
    --output "/path/to/output" \
    --force-ml --llm-provider openai

# Pure ML (fastest, no LLM)
python -m src.cli run --source local \
    --directory "/path/to/emails" \
    --output "/path/to/output" \
    --force-ml --no-llm-fallback
```

### 2. Generate HTML Report

```bash
python tools/generate_html_report.py --input /path/to/results.json
# Creates report.html in same directory
```

### 3. Manual Agent Analysis (Best Accuracy)

For <1000 emails, agent analysis gives 98-99% accuracy:

```bash
# Copy and customize analyzer template
cp tools/brett_gmail_analyzer.py tools/my_inbox_analyzer.py

# Edit classify_email() function for your inbox patterns
# Update email_dir path
# Run
python tools/my_inbox_analyzer.py
```

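
When customizing `classify_email()`, a sender-domain-first rule set with a subject-keyword fallback is one possible shape. The sketch below is not the code in `tools/brett_gmail_analyzer.py`. The signature, rule tables, and category names are all hypothetical and should be replaced with patterns found in your own inbox:

```python
from email.utils import parseaddr

# Hypothetical rules for illustration only.
DOMAIN_RULES = {
    "github.com": "Development",
    "linkedin.com": "Social",
}
SUBJECT_RULES = [
    ("invoice", "Finance"),
    ("receipt", "Finance"),
]


def classify_email(sender: str, subject: str) -> str:
    """Classify by sender domain first, then fall back to subject keywords."""
    _, address = parseaddr(sender)
    domain = address.rpartition("@")[2].lower()

    # Match the domain itself or any subdomain (mail.github.com -> github.com).
    for rule_domain, category in DOMAIN_RULES.items():
        if domain == rule_domain or domain.endswith("." + rule_domain):
            return category

    # No domain rule matched: look for keywords in the subject line.
    lowered = subject.lower()
    for keyword, category in SUBJECT_RULES:
        if keyword in lowered:
            return category

    return "Uncategorized"
```

Checking the domain before the subject follows the "sender domain first" approach noted for personal inboxes. A business inbox would likely need the sender and subject rules combined.
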
### 4. Different Email Sources

```bash
# Local .eml/.msg files
--source local --directory "/path/to/emails"

# Gmail (OAuth)
--source gmail --credentials credentials/gmail/account1.json

# Outlook (OAuth)
--source outlook --credentials credentials/outlook/account1.json

# Enron test data
--source enron --limit 10000
```

---

## Output Locations

**Analysis reports are stored OUTSIDE this project:**

```
/home/bob/Documents/Email Manager/emails/
├── brett-gmail/              # Source emails (untouched)
├── brett-gm-md/              # ML-only classification output
│   ├── results.json
│   ├── report.html
│   └── BRETT_GMAIL_ANALYSIS_REPORT.md
├── brett-gm-llm/             # ML+LLM classification output
│   ├── results.json
│   └── report.html
└── brett-ms-sorter/          # Microsoft inbox analysis
    └── BRETT_MICROSOFT_ANALYSIS_REPORT.md
```

**Project data outputs (gitignored):**
```
/MASTERFOLDER/Tools/email-sorter/data/
├── brett_gmail_analysis.json
└── brett_microsoft_analysis.json
```

---

## Configuration

### LLM Endpoint (config/default_config.yaml)

```yaml
llm:
  provider: "openai"
  openai:
    base_url: "http://localhost:11433/v1"  # vLLM endpoint
    api_key: "not-needed"
    classification_model: "qwen3-coder-30b"
```

### Thresholds (config/categories.yaml)

Default confidence threshold: 0.55 (reduced from 0.75 for 40% less LLM fallback)

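
As a rough illustration of why a lower threshold reduces LLM traffic: assuming an email is sent to the LLM only when its ML confidence is below the threshold (the exact comparison used in the pipeline is not shown here), the fallback share can be estimated from a list of confidence scores. The function names and sample scores are made up for this sketch:

```python
DEFAULT_THRESHOLD = 0.55  # default quoted above


def needs_llm_fallback(ml_confidence: float, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Assumed rule: fall back to the LLM when ML confidence is below the threshold."""
    return ml_confidence < threshold


def fallback_rate(confidences: list[float], threshold: float = DEFAULT_THRESHOLD) -> float:
    """Fraction of emails that would be routed to the LLM at this threshold."""
    if not confidences:
        return 0.0
    fallbacks = sum(needs_llm_fallback(c, threshold) for c in confidences)
    return fallbacks / len(confidences)


# Made-up ML confidence scores for six emails.
scores = [0.30, 0.50, 0.60, 0.70, 0.80, 0.90]
print(fallback_rate(scores, threshold=0.75))  # 4 of 6 emails go to the LLM
print(fallback_rate(scores, threshold=0.55))  # 2 of 6 emails go to the LLM
```
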

---

## Key Code Locations

| Function | File |
|----------|------|
| CLI entry | `src/cli.py` |
| ML classifier | `src/classification/ml_classifier.py` |
| LLM classifier | `src/classification/llm_classifier.py` |
| Feature extraction | `src/classification/feature_extractor.py` |
| Email parsing | `src/calibration/local_file_parser.py` |
| OpenAI-compat LLM | `src/llm/openai_compat.py` |

---

## Recent Changes (Nov 2025)

1. **cli.py**: Added `--force-ml` flag, enriched results.json with metadata
2. **openai_compat.py**: Removed API key requirement for local vLLM
3. **default_config.yaml**: Changed to openai provider on localhost:11433
4. **tools/**: Added brett_gmail_analyzer.py, brett_microsoft_analyzer.py, generate_html_report.py
5. **docs/**: Added PROJECT_ROADMAP_2025.md, CLASSIFICATION_METHODS_COMPARISON.md

---

## Troubleshooting

### "LLM endpoint not responding"
- Check that vLLM is running on localhost:11433
- Verify that the model name in the config matches the running model

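
Both checks can be scripted. The following is a minimal sketch, assuming the server exposes the OpenAI-compatible `GET /v1/models` route (vLLM's OpenAI-compatible server does). The helper names are hypothetical and are not part of `src/llm/openai_compat.py`:

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from the configured base_url."""
    return base_url.rstrip("/") + "/models"


def list_served_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return the model ids reported by the endpoint.

    Raises urllib.error.URLError (an OSError) if nothing is listening.
    """
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as response:
        payload = json.load(response)
    return [model["id"] for model in payload.get("data", [])]


def model_is_served(configured_model: str, served_models: list[str]) -> bool:
    """Check that the model named in the config is among the served models."""
    return configured_model in served_models
```

For the configuration shown earlier, `list_served_models("http://localhost:11433/v1")` would be expected to include `qwen3-coder-30b`. A connection error points to the first bullet, and a missing model id points to the second.
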
### "Low accuracy (50-60%)"
- For <1000 emails, use agent analysis
- Your dataset may differ from the Enron training data

### "Too many LLM calls"
- Use `--no-llm-fallback` for pure ML
- Lower the confidence threshold in categories.yaml (the default was reduced from 0.75 to 0.55 for the same reason)

---

## Development Notes

### Virtual Environment Required
```bash
source venv/bin/activate
# ALWAYS activate before Python commands
```

### Batched Feature Extraction (CRITICAL)
```python
# CORRECT - Batched (150x faster)
all_features = feature_extractor.extract_batch(emails, batch_size=512)

# WRONG - Sequential (extremely slow)
for email in emails:
    result = classifier.classify(email)  # Don't do this
```

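
To see where the savings come from, the sketch below splits a list into chunks of `batch_size`. It is not the project's `extract_batch` implementation, only a generic illustration of batching. `iter_batches` is a hypothetical helper:

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 512) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for 10,000 parsed emails.
emails = list(range(10_000))
batches = list(iter_batches(emails, batch_size=512))
print(len(batches))      # number of model invocations needed
print(len(batches[-1]))  # size of the final, partial batch
```

With 10,000 emails and `batch_size=512`, this yields 20 batches (19 full ones plus a final batch of 272), so the model is invoked 20 times instead of 10,000.
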
### Model Paths
- `src/models/calibrated/` - Created during calibration
- `src/models/pretrained/` - Loaded by default

---

## What's Gitignored

- `credentials/` - OAuth tokens
- `results/`, `data/` - User data
- `archive/`, `docs/archive/` - Historical content
- `maildir/` - Enron test data (large)
- `enron_mail_20150507.tar.gz` - Source archive
- `venv/` - Python environment
- `*.log`, `logs/` - Log files

---

## Philosophy

1. **Triage, not management** - Sort into buckets for other tools
2. **Risk-based accuracy** - High for personal, acceptable errors for junk
3. **Speed matters** - 10k emails in <1 min
4. **Inbox character matters** - Business vs personal = different approaches
5. **Agent pre-scan adds value** - 10-15 min discovery improves everything

---

*Last Updated: 2025-11-28*
*See docs/PROJECT_ROADMAP_2025.md for full research findings*