email-sorter/CLAUDE.md (added by FSSCoding, 2025-10-25)
Email Sorter - Claude Development Guide

This document provides essential context for Claude (or other AI assistants) working on this project.

Project Overview

Email Sorter is a hybrid ML/LLM email classification system designed to process large email backlogs (10k-100k+ emails) with high speed and accuracy.

Current MVP Status

PROVEN WORKING - 10,000 emails classified in ~24 seconds with 72.7% accuracy

Core Features:

  • LLM-driven category discovery (no hardcoded categories)
  • ML model training on discovered categories (LightGBM)
  • Fast pure-ML classification with --no-llm-fallback
  • Category verification for new mailboxes with --verify-categories
  • Batched embedding extraction (512 emails/batch)
  • Multiple email provider support (Gmail, Outlook, IMAP, Enron)

Architecture

Three-Tier Classification Pipeline

Email → Rules Check → ML Classifier → LLM Fallback (optional)
         ↓              ↓                ↓
      Definite     High Confidence   Low Confidence
      (5-10%)        (70-80%)          (10-20%)

Key Technologies

  • ML Model: LightGBM (1.8MB, 11 categories, 28 threads)
  • Embeddings: all-minilm:l6-v2 via Ollama (384-dim, universal)
  • LLM: qwen3:4b-instruct-2507-q8_0 via Ollama (calibration only)
  • Feature Extraction: Embeddings + TF-IDF + pattern detection
  • Thresholds: 0.55 (optimized from 0.75 to reduce LLM fallback)

Performance Metrics

Mode                Emails   Time   Accuracy   LLM Calls   Throughput
Pure ML             10,000   24s    72.7%      0           423/sec
With LLM fallback   10,000   5min   92.7%      2,100       33/sec

Project Structure

email-sorter/
├── src/
│   ├── cli.py                      # Main CLI interface
│   ├── classification/              # Classification pipeline
│   │   ├── adaptive_classifier.py  # Rules → ML → LLM orchestration
│   │   ├── ml_classifier.py        # LightGBM classifier
│   │   ├── llm_classifier.py       # LLM fallback
│   │   └── feature_extractor.py    # Batched embedding extraction
│   ├── calibration/                 # LLM-driven calibration
│   │   ├── workflow.py             # Calibration orchestration
│   │   ├── llm_analyzer.py         # Batch category discovery (20 emails/call)
│   │   ├── trainer.py              # ML model training
│   │   └── category_verifier.py    # Category verification
│   ├── email_providers/             # Email source connectors
│   │   ├── gmail.py                # Gmail API (OAuth 2.0)
│   │   ├── outlook.py              # Microsoft Graph API (OAuth 2.0)
│   │   ├── imap.py                 # IMAP protocol
│   │   └── enron.py                # Enron dataset (testing)
│   ├── llm/                         # LLM provider interfaces
│   │   ├── ollama.py               # Ollama provider
│   │   └── openai_compat.py        # OpenAI-compatible provider
│   └── models/                      # Trained models
│       ├── calibrated/              # User-calibrated models
│       │   └── classifier.pkl      # Current trained model (1.8MB)
│       └── pretrained/              # Default models
├── config/
│   ├── default_config.yaml         # System defaults
│   ├── categories.yaml             # Category definitions (thresholds: 0.55)
│   └── llm_models.yaml             # LLM configuration
├── credentials/                     # Email provider credentials (gitignored)
│   ├── gmail/                      # Gmail OAuth (3 accounts)
│   ├── outlook/                    # Outlook OAuth (3 accounts)
│   └── imap/                       # IMAP credentials (3 accounts)
├── docs/                            # Documentation
├── scripts/                         # Utility scripts
└── logs/                            # Log files (gitignored)

Critical Implementation Details

1. Batched Embedding Extraction (CRITICAL!)

ALWAYS use batched feature extraction:

# ✅ CORRECT - Batched (one embedding call per 512 emails)
all_features = feature_extractor.extract_batch(emails, batch_size=512)
for email, features in zip(emails, all_features):
    result = adaptive_classifier.classify_with_features(email, features)

# ❌ WRONG - Sequential (extremely slow)
for email in emails:
    result = adaptive_classifier.classify(email)  # Extracts features one-at-a-time

Why this matters:

  • Sequential: 10,000 emails × 15ms = 150 seconds just for embeddings
  • Batched: ~20 batches × 1s = ~20 seconds for embeddings
  • Net effect: roughly 7.5x faster embedding extraction, with ~500x fewer API calls (20 instead of 10,000)
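One way the batched path might be structured (a sketch; `chunked`, `embed_all`, and `embed_fn` are illustrative names, with `embed_fn` standing in for the Ollama embedding call):

```python
from typing import Callable, Iterable, List, Sequence

def chunked(items: Sequence, size: int) -> Iterable[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_all(texts: Sequence[str],
              embed_fn: Callable[[Sequence[str]], List[List[float]]],
              batch_size: int = 512) -> List[List[float]]:
    """One embedding call per batch of 512 instead of one per email."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

The amortization is the whole trick: per-call overhead is paid ~20 times instead of 10,000 times.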

2. Model Paths

The model exists in TWO locations:

  • src/models/calibrated/classifier.pkl - Created during calibration (authoritative)
  • src/models/pretrained/classifier.pkl - Loaded by default (copy of calibrated)

When calibration runs:

  1. Saves model to calibrated/classifier.pkl
  2. MLClassifier loads from pretrained/classifier.pkl by default
  3. The calibrated model must be copied to pretrained/ (or the load path updated) before the new model takes effect

Current status: Both paths have the same 1.8MB model (Oct 25 02:54)
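A minimal sync step could look like this (a sketch; `sync_model` is an illustrative helper, not an existing project function):

```python
import shutil
from pathlib import Path

def sync_model(calibrated: Path, pretrained: Path) -> None:
    """Copy the freshly calibrated model over the default load path
    so the classifier picks it up without a config change."""
    pretrained.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(calibrated, pretrained)  # preserves timestamps too

# Typical call for this project's layout:
# sync_model(Path("src/models/calibrated/classifier.pkl"),
#            Path("src/models/pretrained/classifier.pkl"))
```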

3. LLM-Driven Calibration

Categories are NOT hardcoded; the LLM discovers them during calibration:

# Calibration process:
1. Sample 300 emails (3% of 10k)
2. Batch process in groups of 20 emails
3. LLM discovers categories (not predefined)
4. LLM labels each email
5. Train LightGBM on discovered categories

Result: 11 categories discovered from Enron dataset:

  • Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
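The sampling and batching steps above might be sketched as follows (`sample_for_calibration` and `batched` are illustrative names, not the project's actual API):

```python
import random
from typing import Iterable, List, Sequence

def sample_for_calibration(emails: Sequence, fraction: float = 0.03,
                           minimum: int = 300, seed: int = 0) -> List:
    """Draw a small random sample (3% by default, at least `minimum`)."""
    k = min(len(emails), max(minimum, int(len(emails) * fraction)))
    return random.Random(seed).sample(list(emails), k)

def batched(items: Sequence, size: int = 20) -> Iterable[Sequence]:
    """Groups of 20 emails per LLM call keep each prompt small."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

For 10k emails this yields 300 samples in 15 LLM calls of 20 emails each.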

4. Threshold Optimization

Default threshold: 0.55 (reduced from 0.75)

Impact:

  • 0.75 threshold: 35% LLM fallback
  • 0.55 threshold: 21% LLM fallback
  • 40% reduction in LLM usage

All category thresholds in config/categories.yaml set to 0.55.
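In practice the threshold gates the ML → LLM handoff roughly like this (a sketch; `decide_route` is an illustrative name):

```python
from typing import Dict, Tuple

def decide_route(probabilities: Dict[str, float],
                 threshold: float = 0.55) -> Tuple[str, bool]:
    """Return (best_category, needs_llm_fallback).

    If the ML classifier's top probability clears the threshold we keep
    its answer; otherwise the email is routed to the LLM fallback.
    """
    category, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    return category, confidence < threshold
```

Lowering the threshold from 0.75 to 0.55 lets more ML answers clear the bar, which is where the 35% → 21% fallback reduction comes from.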

5. Email Provider Credentials

Multi-account support: 3 accounts per provider type

Credential files:

credentials/
├── gmail/
│   ├── account1.json  # Gmail OAuth credentials
│   ├── account2.json
│   └── account3.json
├── outlook/
│   ├── account1.json  # Outlook OAuth credentials
│   ├── account2.json
│   └── account3.json
└── imap/
    ├── account1.json  # IMAP username/password
    ├── account2.json
    └── account3.json

Security: All *.json files in credentials/ are gitignored (only .example files tracked).

Common Commands

Development

# Activate virtual environment
source venv/bin/activate

# Run classification (Enron dataset)
python -m src.cli run --source enron --limit 10000 --output results/

# Pure ML (no LLM fallback) - FAST
python -m src.cli run --source enron --limit 10000 --output results/ --no-llm-fallback

# With category verification
python -m src.cli run --source enron --limit 10000 --output results/ --verify-categories

# Gmail
python -m src.cli run --source gmail --credentials credentials/gmail/account1.json --limit 1000

# Outlook
python -m src.cli run --source outlook --credentials credentials/outlook/account1.json --limit 1000

Training

# Force recalibration (clears cached model)
rm -rf src/models/calibrated/ src/models/pretrained/
python -m src.cli run --source enron --limit 10000 --output results/

Code Patterns

Adding New Features

  1. Update CLI (src/cli.py):

    • Add click options
    • Pass to appropriate modules
  2. Update Classifier (src/classification/adaptive_classifier.py):

    • Add methods following existing pattern
    • Use classify_with_features() for batched processing
  3. Update Feature Extractor (src/classification/feature_extractor.py):

    • Always support batching (extract_batch())
    • Keep extract() for backward compatibility
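The batch-first pattern in item 3 can be sketched like so (the length-based feature is a placeholder; the real extractor combines embeddings, TF-IDF, and pattern features):

```python
from typing import List, Sequence

class FeatureExtractor:
    """Sketch of the batch-first pattern, not the project's real extractor."""

    def extract_batch(self, emails: Sequence[str],
                      batch_size: int = 512) -> List[List[float]]:
        features: List[List[float]] = []
        for start in range(0, len(emails), batch_size):
            batch = emails[start:start + batch_size]
            # Placeholder: one length-based feature per email.
            features.extend([[float(len(e))] for e in batch])
        return features

    def extract(self, email: str) -> List[float]:
        # Backward-compatible single-email path delegates to the batch API,
        # so there is exactly one feature-computation code path to maintain.
        return self.extract_batch([email])[0]
```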

Testing

# Test imports
python -c "from src.cli import cli; print('OK')"

# Test providers
python -c "from src.email_providers.gmail import GmailProvider; from src.email_providers.outlook import OutlookProvider; print('OK')"

# Test classification
python -m src.cli run --source enron --limit 100 --output test/

Performance Optimization

Current Bottlenecks

  1. Embedding generation - 20s for 10k emails (batched)

    • Optimized with batch_size=512
    • Could use local sentence-transformers for 5-10x speedup
  2. Email parsing - 0.5s for 10k emails (fast)

  3. ML inference - 0.7s for 10k emails (very fast)

Optimization Opportunities

  1. Local embeddings - Replace Ollama API with sentence-transformers

    • Current: 20 API calls, ~20 seconds
    • With local: Direct GPU, ~2-5 seconds
    • Trade-off: More dependencies, larger memory footprint
  2. Embedding cache - Pre-compute and cache to disk

    • One-time cost: 20 seconds
    • Subsequent runs: 2-3 seconds to load from disk
    • Perfect for development/testing
  3. Larger batches - Tested 512, 1024, 2048

    • 512: 23.6s (chosen for balance)
    • 1024: 22.1s (6.6% faster)
    • 2048: 21.9s (7.5% faster, diminishing returns)
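The embedding-cache idea (opportunity 2 above) can be sketched with a content-addressed pickle file; `cached_embeddings` and the cache directory name are illustrative:

```python
import hashlib
import pickle
from pathlib import Path
from typing import Callable, List, Sequence

def cached_embeddings(texts: Sequence[str],
                      embed_fn: Callable[[Sequence[str]], List[List[float]]],
                      cache_dir: Path = Path(".embed_cache")) -> List[List[float]]:
    """Return embeddings for `texts`, computing them only on a cache miss.

    The cache key is a hash of the exact text set, so any change to the
    input invalidates the cached vectors automatically.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256("\x00".join(texts).encode("utf-8")).hexdigest()
    path = cache_dir / f"{digest}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    vectors = embed_fn(texts)
    path.write_bytes(pickle.dumps(vectors))
    return vectors
```

On a second run over the same dataset, `embed_fn` is never called, matching the "one-time cost, then 2-3 seconds to load" pattern described above.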

Known Issues

1. Background Processes

There are stale background bash processes from previous sessions:

  • These can be safely ignored
  • Do NOT try to kill them (per user's CLAUDE.md instructions)

2. Model Path Confusion

  • Calibration saves to src/models/calibrated/
  • Default loads from src/models/pretrained/
  • Both currently have the same model (synced)

3. Category Cache

  • src/models/category_cache.json stores discovered categories
  • Can become polluted if different datasets used
  • Clear with rm src/models/category_cache.json if issues

Dependencies

Required

pip install click pyyaml lightgbm numpy scikit-learn ollama

Email Providers

# Gmail
pip install google-api-python-client google-auth-oauthlib google-auth-httplib2

# Outlook
pip install msal requests

# IMAP - no additional dependencies (Python stdlib)

Optional

# For faster local embeddings
pip install sentence-transformers

# For development
pip install pytest black mypy

Git Workflow

What's Gitignored

  • credentials/ (except .example files)
  • logs/
  • results/
  • src/models/calibrated/ (trained models)
  • *.log
  • debug_*.txt
  • Test directories

What's Tracked

  • All source code
  • Configuration files
  • Documentation
  • Example credential files
  • Pretrained model (if present)

Important Notes for AI Assistants

  1. NEVER create files unless necessary - Always prefer editing existing files

  2. ALWAYS use batching - Feature extraction MUST be batched (512 emails/batch)

  3. Read before writing - Use Read tool before any Edit operations

  4. Verify paths - Model paths can be confusing (calibrated vs pretrained)

  5. No emoji in commits - Per user's CLAUDE.md preferences

  6. Test before committing - Verify imports and CLI work

  7. Security - Never commit actual credentials, only .example files

  8. Performance matters - 10x performance differences are common, always batch

  9. LLM is optional - System works without LLM (pure ML mode with --no-llm-fallback)

  10. Categories are dynamic - They're discovered by LLM, not hardcoded

Recent Changes (Last Session)

  1. Fixed embedding bottleneck - Changed from sequential to batched feature extraction (10x speedup)
  2. Added Outlook provider - Full Microsoft Graph API integration
  3. Added credentials system - Support for 3 accounts per provider type
  4. Optimized thresholds - Reduced from 0.75 to 0.55 (40% less LLM usage)
  5. Added category verifier - Optional single LLM call to verify model fit
  6. Project reorganization - Clean docs/, scripts/, logs/ structure

Next Steps (Roadmap)

See docs/PROJECT_STATUS_AND_NEXT_STEPS.html for complete roadmap.

Immediate priorities:

  1. Test Gmail provider with real credentials
  2. Test Outlook provider with real credentials
  3. Implement email syncing (apply labels back to mailbox)
  4. Add incremental classification (process only new emails)
  5. Create web dashboard for results visualization

Remember: This is an MVP with proven performance. Don't over-engineer. Keep it fast and simple.