# Email Sorter - Development Guide
## What This Tool Does
**Email Sorter is a TRIAGE tool** that sorts emails into buckets for downstream processing. It is NOT a complete email management solution - it's one part of a larger ecosystem.
```
Raw Inbox (10k+) --> Email Sorter --> Categorized Buckets --> Specialized Tools
                     (this tool)      (output)                (other tools)
```
---
## Quick Start
```bash
cd /MASTERFOLDER/Tools/email-sorter
source venv/bin/activate
# Classify emails with ML + LLM fallback
python -m src.cli run --source local \
    --directory "/path/to/emails" \
    --output "/path/to/output" \
    --force-ml --llm-provider openai
# Generate HTML report from results
python tools/generate_html_report.py --input /path/to/results.json
```
---
## Key Documentation
| Document | Purpose | Location |
|----------|---------|----------|
| **PROJECT_ROADMAP_2025.md** | Master learnings, research findings, development roadmap | `docs/` |
| **CLASSIFICATION_METHODS_COMPARISON.md** | ML vs LLM vs Agent comparison | `docs/` |
| **REPORT_FORMAT.md** | HTML report documentation | `docs/` |
| **BATCH_LLM_QUICKSTART.md** | Quick LLM batch processing guide | root |
---
## Research Findings Summary
### Dataset Size Routing
| Emails | Best Method | Why |
|------|-------------|-----|
| <500 | Agent-only | ML overhead exceeds benefit |
| 500-5000 | Agent pre-scan + ML | Discovery improves accuracy |
| >5000 | ML pipeline | Speed critical |
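
A minimal sketch of this routing rule, using a hypothetical `choose_method()` helper (the project does not necessarily expose one):

```python
def choose_method(email_count: int) -> str:
    """Pick a classification approach based on dataset size (hypothetical helper)."""
    if email_count < 500:
        # ML setup cost exceeds its benefit for small inboxes
        return "agent-only"
    elif email_count <= 5000:
        # Agent pre-scan discovers inbox-specific patterns, then ML handles the bulk
        return "agent-prescan + ml"
    else:
        # Throughput dominates: pure ML pipeline with optional LLM fallback
        return "ml-pipeline"
```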
### Research Results (classification accuracy)
| Dataset | Type | ML-Only | ML+LLM | Agent |
|---------|------|---------|--------|-------|
| brett-gmail (801) | Personal | 54.9% | 93.3% | 99.8% |
| brett-microsoft (596) | Business | - | - | 98.2% |
### Key Insight: Inbox Character Matters
| Type | Pattern | Approach |
|------|---------|----------|
| **Personal** | Subscriptions, marketing (40-50% automated) | Sender domain first |
| **Business** | Client work, operations (60-70% professional) | Sender + Subject context |
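
To make "sender domain first" concrete, a hedged sketch with example rules (the category names and domain hints are illustrative, not the project's actual ones):

```python
AUTOMATED_SENDER_HINTS = ("mailchimp", "sendgrid", "noreply", "newsletter")

def quick_bucket(sender: str, subject: str, inbox_type: str) -> str:
    """Very rough first-pass bucketing; illustrative only."""
    sender = sender.lower()
    subject = subject.lower()
    if inbox_type == "personal":
        # Personal inboxes are 40-50% automated mail, so the sender alone is a strong signal
        if any(hint in sender for hint in AUTOMATED_SENDER_HINTS):
            return "marketing/subscription"
    if inbox_type == "business":
        # Business inboxes need sender plus subject context before deciding
        if "invoice" in subject or "receipt" in subject:
            return "operations"
    return "needs-full-classification"
```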
---
## Project Structure
```
email-sorter/
├── CLAUDE.md                        # THIS FILE
├── README.md                        # General readme
├── BATCH_LLM_QUICKSTART.md          # LLM batch processing
├── src/                             # Source code
│   ├── cli.py                       # Main entry point
│   ├── classification/              # ML/LLM classification
│   ├── calibration/                 # Model training, email parsing
│   ├── email_providers/             # Gmail, Outlook, IMAP, Local
│   └── llm/                         # LLM providers
├── tools/                           # Utility scripts
│   ├── brett_gmail_analyzer.py      # Personal inbox template
│   ├── brett_microsoft_analyzer.py  # Business inbox template
│   ├── generate_html_report.py      # HTML report generator
│   └── batch_llm_classifier.py      # Batch LLM classification
├── config/                          # Configuration
│   ├── default_config.yaml          # LLM endpoints, thresholds
│   └── categories.yaml              # Category definitions
├── docs/                            # Current documentation
│   ├── PROJECT_ROADMAP_2025.md
│   ├── CLASSIFICATION_METHODS_COMPARISON.md
│   ├── REPORT_FORMAT.md
│   └── archive/                     # Old docs (historical)
├── data/                            # Analysis outputs (gitignored)
│   ├── brett_gmail_analysis.json
│   └── brett_microsoft_analysis.json
├── credentials/                     # OAuth/API creds (gitignored)
├── results/                         # Classification outputs (gitignored)
├── archive/                         # Old scripts (gitignored)
├── maildir/                         # Enron test data
└── venv/                            # Python environment
```
---
## Common Operations
### 1. Classify Emails (ML Pipeline)
```bash
source venv/bin/activate
# With LLM fallback for low confidence
python -m src.cli run --source local \
    --directory "/path/to/emails" \
    --output "/path/to/output" \
    --force-ml --llm-provider openai

# Pure ML (fastest, no LLM)
python -m src.cli run --source local \
    --directory "/path/to/emails" \
    --output "/path/to/output" \
    --force-ml --no-llm-fallback
```
### 2. Generate HTML Report
```bash
python tools/generate_html_report.py --input /path/to/results.json
# Creates report.html in same directory
```
### 3. Manual Agent Analysis (Best Accuracy)
For <1000 emails, agent analysis gives 98-99% accuracy:
```bash
# Copy and customize analyzer template
cp tools/brett_gmail_analyzer.py tools/my_inbox_analyzer.py
# Edit classify_email() function for your inbox patterns
# Update email_dir path
# Run
python tools/my_inbox_analyzer.py
```
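For orientation, a hedged sketch of the kind of rules a customized `classify_email()` might contain (the actual signature and categories in the analyzer templates may differ):

```python
def classify_email(sender: str, subject: str) -> str:
    """Example per-inbox rules; adapt to the senders and subjects you actually see."""
    sender = sender.lower()
    subject = subject.lower()
    if sender.endswith("@github.com"):
        return "notifications"
    if "unsubscribe" in subject or "newsletter" in subject:
        return "marketing"
    if "invoice" in subject or "receipt" in subject:
        return "finance"
    return "unclassified"  # hand off to ML/LLM or manual review
```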
### 4. Different Email Sources
```bash
# Local .eml/.msg files
--source local --directory "/path/to/emails"
# Gmail (OAuth)
--source gmail --credentials credentials/gmail/account1.json
# Outlook (OAuth)
--source outlook --credentials credentials/outlook/account1.json
# Enron test data
--source enron --limit 10000
```
---
## Output Locations
**Analysis reports are stored OUTSIDE this project:**
```
/home/bob/Documents/Email Manager/emails/
├── brett-gmail/             # Source emails (untouched)
├── brett-gm-md/             # ML-only classification output
│   ├── results.json
│   ├── report.html
│   └── BRETT_GMAIL_ANALYSIS_REPORT.md
├── brett-gm-llm/            # ML+LLM classification output
│   ├── results.json
│   └── report.html
└── brett-ms-sorter/         # Microsoft inbox analysis
    └── BRETT_MICROSOFT_ANALYSIS_REPORT.md
```
**Project data outputs (gitignored):**
```
/MASTERFOLDER/Tools/email-sorter/data/
├── brett_gmail_analysis.json
└── brett_microsoft_analysis.json
```
---
## Configuration
### LLM Endpoint (config/default_config.yaml)
```yaml
llm:
  provider: "openai"
  openai:
    base_url: "http://localhost:11433/v1"   # vLLM endpoint
    api_key: "not-needed"
    classification_model: "qwen3-coder-30b"
```
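For a quick end-to-end check of this endpoint from Python, a minimal sketch using the `openai` client package (model name and URL taken from the config above; not part of the project's code):

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the client requires an api_key even if the server ignores it
client = OpenAI(base_url="http://localhost:11433/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3-coder-30b",
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
)
print(response.choices[0].message.content)
```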
### Thresholds (config/categories.yaml)
Default confidence threshold: 0.55 (reduced from 0.75, which cuts LLM fallback calls by roughly 40%).
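
A minimal sketch of how a confidence threshold like this gates LLM fallback, using hypothetical `ml_classifier` / `llm_classifier` objects rather than the project's actual interfaces:

```python
CONFIDENCE_THRESHOLD = 0.55  # from config/categories.yaml

def classify_with_fallback(email, ml_classifier, llm_classifier):
    """Use the fast ML prediction unless its confidence falls below the threshold."""
    category, confidence = ml_classifier.predict(email)  # hypothetical API
    if confidence >= CONFIDENCE_THRESHOLD:
        return category
    # Low-confidence cases go to the slower but more accurate LLM
    return llm_classifier.classify(email)  # hypothetical API
```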
---
## Key Code Locations
| Function | File |
|----------|------|
| CLI entry | `src/cli.py` |
| ML classifier | `src/classification/ml_classifier.py` |
| LLM classifier | `src/classification/llm_classifier.py` |
| Feature extraction | `src/classification/feature_extractor.py` |
| Email parsing | `src/calibration/local_file_parser.py` |
| OpenAI-compat LLM | `src/llm/openai_compat.py` |
---
## Recent Changes (Nov 2025)
1. **cli.py**: Added `--force-ml` flag, enriched results.json with metadata
2. **openai_compat.py**: Removed API key requirement for local vLLM
3. **default_config.yaml**: Changed to openai provider on localhost:11433
4. **tools/**: Added brett_gmail_analyzer.py, brett_microsoft_analyzer.py, generate_html_report.py
5. **docs/**: Added PROJECT_ROADMAP_2025.md, CLASSIFICATION_METHODS_COMPARISON.md
---
## Troubleshooting
### "LLM endpoint not responding"
- Check that vLLM is running on localhost:11433
- Verify the model name in the config matches the model the server is actually serving (a quick check is sketched below)
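
A quick way to confirm both points from Python, assuming the server exposes the standard OpenAI-compatible `/v1/models` listing (vLLM does by default):

```python
import requests

# Ask the local vLLM server which models it is actually serving
resp = requests.get("http://localhost:11433/v1/models", timeout=5)
resp.raise_for_status()
print([model["id"] for model in resp.json()["data"]])
```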
### "Low accuracy (50-60%)"
- For <1000 emails, use agent analysis
- Dataset may differ from Enron training data
### "Too many LLM calls"
- Use `--no-llm-fallback` for pure ML
- Increase threshold in categories.yaml
---
## Development Notes
### Virtual Environment Required
```bash
source venv/bin/activate
# ALWAYS activate before Python commands
```
### Batched Feature Extraction (CRITICAL)
```python
# CORRECT - Batched (150x faster)
all_features = feature_extractor.extract_batch(emails, batch_size=512)

# WRONG - Sequential (extremely slow)
for email in emails:
    result = classifier.classify(email)  # Don't do this
```
### Model Paths
- `src/models/calibrated/` - Created during calibration
- `src/models/pretrained/` - Loaded by default
---
## What's Gitignored
- `credentials/` - OAuth tokens
- `results/`, `data/` - User data
- `archive/`, `docs/archive/` - Historical content
- `maildir/` - Enron test data (large)
- `enron_mail_20150507.tar.gz` - Source archive
- `venv/` - Python environment
- `*.log`, `logs/` - Log files
---
## Philosophy
1. **Triage, not management** - Sort into buckets for other tools
2. **Risk-based accuracy** - High for personal, acceptable errors for junk
3. **Speed matters** - 10k emails in <1 min
4. **Inbox character matters** - Business vs personal = different approaches
5. **Agent pre-scan adds value** - 10-15 min discovery improves everything
---
*Last Updated: 2025-11-28*
*See docs/PROJECT_ROADMAP_2025.md for full research findings*