# Email Sorter - Development Guide
## What This Tool Does
**Email Sorter is a TRIAGE tool** that sorts emails into buckets for downstream processing. It is NOT a complete email management solution - it's one part of a larger ecosystem.
```
Raw Inbox (10k+) --> Email Sorter --> Categorized Buckets --> Specialized Tools
                     (this tool)      (output)                (other tools)
```
---
## Quick Start
```bash
cd /MASTERFOLDER/Tools/email-sorter
source venv/bin/activate
# Classify emails with ML + LLM fallback
python -m src.cli run --source local \
    --directory "/path/to/emails" \
    --output "/path/to/output" \
    --force-ml --llm-provider openai
# Generate HTML report from results
python tools/generate_html_report.py --input /path/to/results.json
```
---
## Key Documentation
| Document | Purpose | Location |
|----------|---------|----------|
| **PROJECT_ROADMAP_2025.md** | Master learnings, research findings, development roadmap | `docs/` |
| **CLASSIFICATION_METHODS_COMPARISON.md** | ML vs LLM vs Agent comparison | `docs/` |
| **REPORT_FORMAT.md** | HTML report documentation | `docs/` |
| **BATCH_LLM_QUICKSTART.md** | Quick LLM batch processing guide | root |
---
## Research Findings Summary
### Dataset Size Routing
| Emails | Best Method | Why |
|------|-------------|-----|
| <500 | Agent-only | ML overhead exceeds benefit |
| 500-5000 | Agent pre-scan + ML | Discovery improves accuracy |
| >5000 | ML pipeline | Speed critical |
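
A minimal sketch of this routing rule, using a hypothetical `choose_method()` helper (the project does not necessarily expose one):

```python
def choose_method(email_count: int) -> str:
    """Pick a classification approach based on dataset size (hypothetical helper)."""
    if email_count < 500:
        # ML setup cost exceeds its benefit for small inboxes
        return "agent-only"
    elif email_count <= 5000:
        # Agent pre-scan discovers inbox-specific patterns, then ML handles the bulk
        return "agent-prescan + ml"
    else:
        # Throughput dominates: pure ML pipeline with optional LLM fallback
        return "ml-pipeline"
```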
### Research Results (classification accuracy)
| Dataset | Type | ML-Only | ML+LLM | Agent |
|---------|------|---------|--------|-------|
| brett-gmail (801) | Personal | 54.9% | 93.3% | 99.8% |
| brett-microsoft (596) | Business | - | - | 98.2% |
### Key Insight: Inbox Character Matters
| Type | Pattern | Approach |
|------|---------|----------|
| **Personal** | Subscriptions, marketing (40-50% automated) | Sender domain first |
| **Business** | Client work, operations (60-70% professional) | Sender + Subject context |
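
To make "sender domain first" concrete, a hedged sketch with example rules (the category names and domain hints are illustrative, not the project's actual ones):

```python
AUTOMATED_SENDER_HINTS = ("mailchimp", "sendgrid", "noreply", "newsletter")

def quick_bucket(sender: str, subject: str, inbox_type: str) -> str:
    """Very rough first-pass bucketing; illustrative only."""
    sender = sender.lower()
    subject = subject.lower()
    if inbox_type == "personal":
        # Personal inboxes are 40-50% automated mail, so the sender alone is a strong signal
        if any(hint in sender for hint in AUTOMATED_SENDER_HINTS):
            return "marketing/subscription"
    if inbox_type == "business":
        # Business inboxes need sender plus subject context before deciding
        if "invoice" in subject or "receipt" in subject:
            return "operations"
    return "needs-full-classification"
```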
---
## Project Structure
```
email-sorter/
├── CLAUDE.md                        # THIS FILE
├── README.md                        # General readme
├── BATCH_LLM_QUICKSTART.md          # LLM batch processing
├── src/                             # Source code
│   ├── cli.py                       # Main entry point
│   ├── classification/              # ML/LLM classification
│   ├── calibration/                 # Model training, email parsing
│   ├── email_providers/             # Gmail, Outlook, IMAP, Local
│   └── llm/                         # LLM providers
├── tools/                           # Utility scripts
│   ├── brett_gmail_analyzer.py      # Personal inbox template
│   ├── brett_microsoft_analyzer.py  # Business inbox template
│   ├── generate_html_report.py      # HTML report generator
│   └── batch_llm_classifier.py      # Batch LLM classification
├── config/                          # Configuration
│   ├── default_config.yaml          # LLM endpoints, thresholds
│   └── categories.yaml              # Category definitions
├── docs/                            # Current documentation
│   ├── PROJECT_ROADMAP_2025.md
│   ├── CLASSIFICATION_METHODS_COMPARISON.md
│   ├── REPORT_FORMAT.md
│   └── archive/                     # Old docs (historical)
├── data/                            # Analysis outputs (gitignored)
│   ├── brett_gmail_analysis.json
│   └── brett_microsoft_analysis.json
├── credentials/                     # OAuth/API creds (gitignored)
├── results/                         # Classification outputs (gitignored)
├── archive/                         # Old scripts (gitignored)
├── maildir/                         # Enron test data
└── venv/                            # Python environment
```
---
## Common Operations
### 1. Classify Emails (ML Pipeline)
```bash
source venv/bin/activate
# With LLM fallback for low confidence
python -m src.cli run --source local \
    --directory "/path/to/emails" \
    --output "/path/to/output" \
    --force-ml --llm-provider openai

# Pure ML (fastest, no LLM)
python -m src.cli run --source local \
    --directory "/path/to/emails" \
    --output "/path/to/output" \
    --force-ml --no-llm-fallback
```
### 2. Generate HTML Report
```bash
python tools/generate_html_report.py --input /path/to/results.json
# Creates report.html in same directory
```
### 3. Manual Agent Analysis (Best Accuracy)
For <1000 emails, agent analysis gives 98-99% accuracy:
```bash
# Copy and customize analyzer template
cp tools/brett_gmail_analyzer.py tools/my_inbox_analyzer.py
# Edit classify_email() function for your inbox patterns
# Update email_dir path
# Run
python tools/my_inbox_analyzer.py
```
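For orientation, a hedged sketch of the kind of rules a customized `classify_email()` might contain (the actual signature and categories in the analyzer templates may differ):

```python
def classify_email(sender: str, subject: str) -> str:
    """Example per-inbox rules; adapt to the senders and subjects you actually see."""
    sender = sender.lower()
    subject = subject.lower()
    if sender.endswith("@github.com"):
        return "notifications"
    if "unsubscribe" in subject or "newsletter" in subject:
        return "marketing"
    if "invoice" in subject or "receipt" in subject:
        return "finance"
    return "unclassified"  # hand off to ML/LLM or manual review
```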
### 4. Different Email Sources
```bash
# Local .eml/.msg files
--source local --directory "/path/to/emails"
# Gmail (OAuth)
--source gmail --credentials credentials/gmail/account1.json
# Outlook (OAuth)
--source outlook --credentials credentials/outlook/account1.json
# Enron test data
--source enron --limit 10000
```
---
## Output Locations
**Analysis reports are stored OUTSIDE this project:**
```
/home/bob/Documents/Email Manager/emails/
├── brett-gmail/             # Source emails (untouched)
├── brett-gm-md/             # ML-only classification output
│   ├── results.json
│   ├── report.html
│   └── BRETT_GMAIL_ANALYSIS_REPORT.md
├── brett-gm-llm/            # ML+LLM classification output
│   ├── results.json
│   └── report.html
└── brett-ms-sorter/         # Microsoft inbox analysis
    └── BRETT_MICROSOFT_ANALYSIS_REPORT.md
```
**Project data outputs (gitignored):**
```
/MASTERFOLDER/Tools/email-sorter/data/
├── brett_gmail_analysis.json
└── brett_microsoft_analysis.json
```
---
## Configuration
### LLM Endpoint (config/default_config.yaml)
```yaml
llm:
  provider: "openai"
  openai:
    base_url: "http://localhost:11433/v1"   # vLLM endpoint
    api_key: "not-needed"
    classification_model: "qwen3-coder-30b"
```
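For a quick end-to-end check of this endpoint from Python, a minimal sketch using the `openai` client package (model name and URL taken from the config above; not part of the project's code):

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the client requires an api_key even if the server ignores it
client = OpenAI(base_url="http://localhost:11433/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3-coder-30b",
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
)
print(response.choices[0].message.content)
```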
### Thresholds (config/categories.yaml)
Default confidence threshold: 0.55 (reduced from 0.75, which cuts LLM fallback calls by roughly 40%).
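
A minimal sketch of how a confidence threshold like this gates LLM fallback, using hypothetical `ml_classifier` / `llm_classifier` objects rather than the project's actual interfaces:

```python
CONFIDENCE_THRESHOLD = 0.55  # from config/categories.yaml

def classify_with_fallback(email, ml_classifier, llm_classifier):
    """Use the fast ML prediction unless its confidence falls below the threshold."""
    category, confidence = ml_classifier.predict(email)  # hypothetical API
    if confidence >= CONFIDENCE_THRESHOLD:
        return category
    # Low-confidence cases go to the slower but more accurate LLM
    return llm_classifier.classify(email)  # hypothetical API
```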
---
## Key Code Locations
| Function | File |
|----------|------|
| CLI entry | `src/cli.py` |
| ML classifier | `src/classification/ml_classifier.py` |
| LLM classifier | `src/classification/llm_classifier.py` |
| Feature extraction | `src/classification/feature_extractor.py` |
| Email parsing | `src/calibration/local_file_parser.py` |
| OpenAI-compat LLM | `src/llm/openai_compat.py` |
---
## Recent Changes (Nov 2025)
1. **cli.py**: Added `--force-ml` flag, enriched results.json with metadata
2. **openai_compat.py**: Removed API key requirement for local vLLM
3. **default_config.yaml**: Changed to openai provider on localhost:11433
4. **tools/**: Added brett_gmail_analyzer.py, brett_microsoft_analyzer.py, generate_html_report.py
5. **docs/**: Added PROJECT_ROADMAP_2025.md, CLASSIFICATION_METHODS_COMPARISON.md
---
## Troubleshooting
### "LLM endpoint not responding"
- Check that vLLM is running on localhost:11433
- Verify the model name in the config matches the model the server is actually serving (a quick check is sketched below)
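
A quick way to confirm both points from Python, assuming the server exposes the standard OpenAI-compatible `/v1/models` listing (vLLM does by default):

```python
import requests

# Ask the local vLLM server which models it is actually serving
resp = requests.get("http://localhost:11433/v1/models", timeout=5)
resp.raise_for_status()
print([model["id"] for model in resp.json()["data"]])
```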
### "Low accuracy (50-60%)"
- For <1000 emails, use agent analysis
- Dataset may differ from Enron training data
### "Too many LLM calls"
- Use `--no-llm-fallback` for pure ML
- Increase threshold in categories.yaml
---
## Development Notes
### Virtual Environment Required
```bash
source venv/bin/activate
# ALWAYS activate before Python commands
```
### Batched Feature Extraction (CRITICAL)
```python
# CORRECT - Batched (150x faster)
all_features = feature_extractor.extract_batch(emails, batch_size=512)

# WRONG - Sequential (extremely slow)
for email in emails:
    result = classifier.classify(email)  # Don't do this
```
### Model Paths
- `src/models/calibrated/` - Created during calibration
- `src/models/pretrained/` - Loaded by default
---
## What's Gitignored
- `credentials/` - OAuth tokens
- `results/`, `data/` - User data
- `archive/`, `docs/archive/` - Historical content
- `maildir/` - Enron test data (large)
- `enron_mail_20150507.tar.gz` - Source archive
- `venv/` - Python environment
- `*.log`, `logs/` - Log files
---
## Philosophy
1. **Triage, not management** - Sort into buckets for other tools
2. **Risk-based accuracy** - High for personal, acceptable errors for junk
3. **Speed matters** - 10k emails in <1 min
4. **Inbox character matters** - Business vs personal = different approaches
5. **Agent pre-scan adds value** - 10-15 min discovery improves everything
---
*Last Updated: 2025-11-28*
*See docs/PROJECT_ROADMAP_2025.md for full research findings*