Email Sorter - Development Guide

What This Tool Does

Email Sorter is a TRIAGE tool that sorts emails into buckets for downstream processing. It is NOT a complete email management solution - it's one part of a larger ecosystem.

Raw Inbox (10k+) --> Email Sorter --> Categorized Buckets --> Specialized Tools
                    (this tool)     (output)               (other tools)

Quick Start

cd /MASTERFOLDER/Tools/email-sorter
source venv/bin/activate

# Classify emails with ML + LLM fallback
python -m src.cli run --source local \
  --directory "/path/to/emails" \
  --output "/path/to/output" \
  --force-ml --llm-provider openai

# Generate HTML report from results
python tools/generate_html_report.py --input /path/to/results.json

Key Documentation

| Document | Purpose | Location |
|---|---|---|
| PROJECT_ROADMAP_2025.md | Master learnings, research findings, development roadmap | docs/ |
| CLASSIFICATION_METHODS_COMPARISON.md | ML vs LLM vs Agent comparison | docs/ |
| REPORT_FORMAT.md | HTML report documentation | docs/ |
| BATCH_LLM_QUICKSTART.md | Quick LLM batch processing guide | root |

Research Findings Summary

Dataset Size Routing

| Size | Best Method | Why |
|---|---|---|
| <500 | Agent-only | ML overhead exceeds benefit |
| 500-5000 | Agent pre-scan + ML | Discovery improves accuracy |
| >5000 | ML pipeline | Speed critical |
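
The routing rule above can be written as a small helper; the function below is illustrative only (the CLI does not currently expose it):

def pick_method(email_count: int) -> str:
    """Illustrative routing: choose a classification approach by dataset size."""
    if email_count < 500:
        return "agent-only"          # ML overhead exceeds the benefit
    if email_count <= 5000:
        return "agent-prescan+ml"    # a discovery pass improves accuracy
    return "ml-pipeline"             # speed is critical at this scale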

Research Results

| Dataset | Type | ML-Only | ML+LLM | Agent |
|---|---|---|---|---|
| brett-gmail (801) | Personal | 54.9% | 93.3% | 99.8% |
| brett-microsoft (596) | Business | - | - | 98.2% |

Key Insight: Inbox Character Matters

| Type | Pattern | Approach |
|---|---|---|
| Personal | Subscriptions, marketing (40-50% automated) | Sender domain first |
| Business | Client work, operations (60-70% professional) | Sender + Subject context |
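
A rough sketch of the difference, assuming plain sender/subject strings (illustrative only, not the analyzers' actual logic):

def primary_signal(inbox_type: str, sender: str, subject: str) -> str:
    """Pick the first signal to classify on, per inbox type (illustrative)."""
    domain = sender.split("@")[-1].lower()
    if inbox_type == "personal":
        # Personal: automated mail dominates, so the sender domain sorts most of it
        return domain
    # Business: the same sender can mean client work or operations, so add subject context
    return f"{domain} | {subject.lower()}"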

Project Structure

email-sorter/
├── CLAUDE.md                 # THIS FILE
├── README.md                 # General readme
├── BATCH_LLM_QUICKSTART.md   # LLM batch processing
│
├── src/                      # Source code
│   ├── cli.py               # Main entry point
│   ├── classification/      # ML/LLM classification
│   ├── calibration/         # Model training, email parsing
│   ├── email_providers/     # Gmail, Outlook, IMAP, Local
│   └── llm/                 # LLM providers
│
├── tools/                    # Utility scripts
│   ├── brett_gmail_analyzer.py      # Personal inbox template
│   ├── brett_microsoft_analyzer.py  # Business inbox template
│   ├── generate_html_report.py      # HTML report generator
│   └── batch_llm_classifier.py      # Batch LLM classification
│
├── config/                   # Configuration
│   ├── default_config.yaml  # LLM endpoints, thresholds
│   └── categories.yaml      # Category definitions
│
├── docs/                     # Current documentation
│   ├── PROJECT_ROADMAP_2025.md
│   ├── CLASSIFICATION_METHODS_COMPARISON.md
│   ├── REPORT_FORMAT.md
│   └── archive/             # Old docs (historical)
│
├── data/                     # Analysis outputs (gitignored)
│   ├── brett_gmail_analysis.json
│   └── brett_microsoft_analysis.json
│
├── credentials/              # OAuth/API creds (gitignored)
├── results/                  # Classification outputs (gitignored)
├── archive/                  # Old scripts (gitignored)
├── maildir/                  # Enron test data
└── venv/                     # Python environment

Common Operations

1. Classify Emails (ML Pipeline)

source venv/bin/activate

# With LLM fallback for low confidence
python -m src.cli run --source local \
  --directory "/path/to/emails" \
  --output "/path/to/output" \
  --force-ml --llm-provider openai

# Pure ML (fastest, no LLM)
python -m src.cli run --source local \
  --directory "/path/to/emails" \
  --output "/path/to/output" \
  --force-ml --no-llm-fallback

2. Generate HTML Report

python tools/generate_html_report.py --input /path/to/results.json
# Creates report.html in same directory

3. Manual Agent Analysis (Best Accuracy)

For <1000 emails, agent analysis gives 98-99% accuracy:

# Copy and customize analyzer template
cp tools/brett_gmail_analyzer.py tools/my_inbox_analyzer.py

# Edit the classify_email() function for your inbox patterns (see the sketch below)
# Update email_dir path
# Run
python tools/my_inbox_analyzer.py
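
A minimal sketch of the kind of rules that go into classify_email(); the template's actual signature and category names may differ, so treat this as illustrative:

def classify_email(sender: str, subject: str) -> str:
    # Hypothetical hand-written rules for a personal inbox
    domain = sender.split("@")[-1].lower()
    subject_l = subject.lower()
    if domain in {"news.example.com", "mailer.example.org"}:
        return "marketing"
    if "invoice" in subject_l or "receipt" in subject_l:
        return "finance"
    return "personal"  # default bucket for anything unmatched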

4. Different Email Sources

# Local .eml/.msg files
--source local --directory "/path/to/emails"

# Gmail (OAuth)
--source gmail --credentials credentials/gmail/account1.json

# Outlook (OAuth)
--source outlook --credentials credentials/outlook/account1.json

# Enron test data
--source enron --limit 10000

Output Locations

Analysis reports are stored OUTSIDE this project:

/home/bob/Documents/Email Manager/emails/
├── brett-gmail/           # Source emails (untouched)
├── brett-gm-md/          # ML-only classification output
│   ├── results.json
│   ├── report.html
│   └── BRETT_GMAIL_ANALYSIS_REPORT.md
├── brett-gm-llm/         # ML+LLM classification output
│   ├── results.json
│   └── report.html
└── brett-ms-sorter/      # Microsoft inbox analysis
    └── BRETT_MICROSOFT_ANALYSIS_REPORT.md

Project data outputs (gitignored):

/MASTERFOLDER/Tools/email-sorter/data/
├── brett_gmail_analysis.json
└── brett_microsoft_analysis.json

Configuration

LLM Endpoint (config/default_config.yaml)

llm:
  provider: "openai"
  openai:
    base_url: "http://localhost:11433/v1"  # vLLM endpoint
    api_key: "not-needed"
    classification_model: "qwen3-coder-30b"

Thresholds (config/categories.yaml)

Default confidence threshold: 0.55 (lowered from 0.75, which reduced LLM fallback calls by about 40%)
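
The effect of the threshold can be sketched as follows; this is an illustration of the gating logic, not the project's actual code:

def needs_llm_fallback(ml_confidence: float, threshold: float = 0.55) -> bool:
    """Route an email to the LLM only when ML confidence falls below the threshold."""
    return ml_confidence < threshold

# Raising the threshold routes more emails to the LLM (slower, usually more accurate);
# lowering it keeps more decisions ML-only (faster, cheaper).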


Key Code Locations

| Function | File |
|---|---|
| CLI entry | src/cli.py |
| ML classifier | src/classification/ml_classifier.py |
| LLM classifier | src/classification/llm_classifier.py |
| Feature extraction | src/classification/feature_extractor.py |
| Email parsing | src/calibration/local_file_parser.py |
| OpenAI-compat LLM | src/llm/openai_compat.py |

Recent Changes (Nov 2025)

  1. cli.py: Added --force-ml flag, enriched results.json with metadata
  2. openai_compat.py: Removed API key requirement for local vLLM
  3. default_config.yaml: Changed to openai provider on localhost:11433
  4. tools/: Added brett_gmail_analyzer.py, brett_microsoft_analyzer.py, generate_html_report.py
  5. docs/: Added PROJECT_ROADMAP_2025.md, CLASSIFICATION_METHODS_COMPARISON.md

Troubleshooting

"LLM endpoint not responding"

  • Check that vLLM is running on localhost:11433 (see the check below)
  • Verify the model name in the config matches the running model
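
A quick way to see whether the server answers at all (assuming vLLM's OpenAI-compatible /v1/models route and the port from default_config.yaml):

import urllib.request

# List the models served by the local endpoint; a connection error here means the
# server is not reachable, and the returned IDs should include the
# classification_model set in config/default_config.yaml.
with urllib.request.urlopen("http://localhost:11433/v1/models", timeout=5) as resp:
    print(resp.read().decode())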

"Low accuracy (50-60%)"

  • For <1000 emails, use agent analysis
  • Dataset may differ from Enron training data

"Too many LLM calls"

  • Use --no-llm-fallback for pure ML
  • Increase threshold in categories.yaml

Development Notes

Virtual Environment Required

source venv/bin/activate
# ALWAYS activate before Python commands

Batched Feature Extraction (CRITICAL)

# CORRECT - Batched (150x faster)
all_features = feature_extractor.extract_batch(emails, batch_size=512)

# WRONG - Per-email calls (feature extraction runs sequentially; extremely slow)
for email in emails:
    result = classifier.classify(email)  # Don't do this

Model Paths

  • src/models/calibrated/ - Created during calibration
  • src/models/pretrained/ - Loaded by default

What's Gitignored

  • credentials/ - OAuth tokens
  • results/, data/ - User data
  • archive/, docs/archive/ - Historical content
  • maildir/ - Enron test data (large)
  • enron_mail_20150507.tar.gz - Source archive
  • venv/ - Python environment
  • *.log, logs/ - Log files

Philosophy

  1. Triage, not management - Sort into buckets for other tools
  2. Risk-based accuracy - High for personal, acceptable errors for junk
  3. Speed matters - 10k emails in <1 min
  4. Inbox character matters - Business vs personal = different approaches
  5. Agent pre-scan adds value - 10-15 min discovery improves everything

Last Updated: 2025-11-28. See docs/PROJECT_ROADMAP_2025.md for the full research findings.