email-sorter/CLAUDE.md (added by FSSCoding, 2025-10-25)
Email Sorter - Claude Development Guide

This document provides essential context for Claude (or other AI assistants) working on this project.

Project Overview

Email Sorter is a hybrid ML/LLM email classification system designed to process large email backlogs (10k-100k+ emails) with high speed and accuracy.

Current MVP Status

PROVEN WORKING - 10,000 emails classified in ~24 seconds with 72.7% accuracy

Core Features:

  • LLM-driven category discovery (no hardcoded categories)
  • ML model training on discovered categories (LightGBM)
  • Fast pure-ML classification with --no-llm-fallback
  • Category verification for new mailboxes with --verify-categories
  • Batched embedding extraction (512 emails/batch)
  • Multiple email provider support (Gmail, Outlook, IMAP, Enron)

Architecture

Three-Tier Classification Pipeline

Email → Rules Check → ML Classifier → LLM Fallback (optional)
         ↓              ↓                ↓
      Definite     High Confidence   Low Confidence
      (5-10%)        (70-80%)          (10-20%)

Key Technologies

  • ML Model: LightGBM (1.8MB, 11 categories, 28 threads)
  • Embeddings: all-minilm:l6-v2 via Ollama (384-dim, universal)
  • LLM: qwen3:4b-instruct-2507-q8_0 via Ollama (calibration only)
  • Feature Extraction: Embeddings + TF-IDF + pattern detection
  • Thresholds: 0.55 (optimized from 0.75 to reduce LLM fallback)

Performance Metrics

Mode                Emails   Time   Accuracy   LLM Calls   Throughput
Pure ML             10,000   24s    72.7%      0           423/sec
With LLM fallback   10,000   5min   92.7%      2,100       33/sec

Project Structure

email-sorter/
├── src/
│   ├── cli.py                      # Main CLI interface
│   ├── classification/              # Classification pipeline
│   │   ├── adaptive_classifier.py  # Rules → ML → LLM orchestration
│   │   ├── ml_classifier.py        # LightGBM classifier
│   │   ├── llm_classifier.py       # LLM fallback
│   │   └── feature_extractor.py    # Batched embedding extraction
│   ├── calibration/                 # LLM-driven calibration
│   │   ├── workflow.py             # Calibration orchestration
│   │   ├── llm_analyzer.py         # Batch category discovery (20 emails/call)
│   │   ├── trainer.py              # ML model training
│   │   └── category_verifier.py    # Category verification
│   ├── email_providers/             # Email source connectors
│   │   ├── gmail.py                # Gmail API (OAuth 2.0)
│   │   ├── outlook.py              # Microsoft Graph API (OAuth 2.0)
│   │   ├── imap.py                 # IMAP protocol
│   │   └── enron.py                # Enron dataset (testing)
│   ├── llm/                         # LLM provider interfaces
│   │   ├── ollama.py               # Ollama provider
│   │   └── openai_compat.py        # OpenAI-compatible provider
│   └── models/                      # Trained models
│       ├── calibrated/              # User-calibrated models
│       │   └── classifier.pkl      # Current trained model (1.8MB)
│       └── pretrained/              # Default models
├── config/
│   ├── default_config.yaml         # System defaults
│   ├── categories.yaml             # Category definitions (thresholds: 0.55)
│   └── llm_models.yaml             # LLM configuration
├── credentials/                     # Email provider credentials (gitignored)
│   ├── gmail/                      # Gmail OAuth (3 accounts)
│   ├── outlook/                    # Outlook OAuth (3 accounts)
│   └── imap/                       # IMAP credentials (3 accounts)
├── docs/                            # Documentation
├── scripts/                         # Utility scripts
└── logs/                            # Log files (gitignored)

Critical Implementation Details

1. Batched Embedding Extraction (CRITICAL!)

ALWAYS use batched feature extraction:

# ✅ CORRECT - Batched (one embedding call per 512 emails)
all_features = feature_extractor.extract_batch(emails, batch_size=512)
for email, features in zip(emails, all_features):
    result = adaptive_classifier.classify_with_features(email, features)

# ❌ WRONG - Sequential (extremely slow)
for email in emails:
    result = adaptive_classifier.classify(email)  # Extracts features one-at-a-time

Why this matters:

  • Sequential: 10,000 emails × 15ms = 150 seconds just for embeddings
  • Batched: ~20 batches × 1s = ~20 seconds for embeddings
  • Net effect: roughly 7.5x faster embedding extraction, with ~500x fewer API calls (20 instead of 10,000)
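One way the batched path might be structured (a sketch; `chunked`, `embed_all`, and `embed_fn` are illustrative names, with `embed_fn` standing in for the Ollama embedding call):

```python
from typing import Callable, Iterable, List, Sequence

def chunked(items: Sequence, size: int) -> Iterable[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_all(texts: Sequence[str],
              embed_fn: Callable[[Sequence[str]], List[List[float]]],
              batch_size: int = 512) -> List[List[float]]:
    """One embedding call per batch of 512 instead of one per email."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

The amortization is the whole trick: per-call overhead is paid ~20 times instead of 10,000 times.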

2. Model Paths

The model exists in TWO locations:

  • src/models/calibrated/classifier.pkl - Created during calibration (authoritative)
  • src/models/pretrained/classifier.pkl - Loaded by default (copy of calibrated)

When calibration runs:

  1. Saves model to calibrated/classifier.pkl
  2. MLClassifier loads from pretrained/classifier.pkl by default
  3. The calibrated model must be copied to pretrained/ (or the load path updated) before the new model takes effect

Current status: Both paths have the same 1.8MB model (Oct 25 02:54)
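A minimal sync step could look like this (a sketch; `sync_model` is an illustrative helper, not an existing project function):

```python
import shutil
from pathlib import Path

def sync_model(calibrated: Path, pretrained: Path) -> None:
    """Copy the freshly calibrated model over the default load path
    so the classifier picks it up without a config change."""
    pretrained.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(calibrated, pretrained)  # preserves timestamps too

# Typical call for this project's layout:
# sync_model(Path("src/models/calibrated/classifier.pkl"),
#            Path("src/models/pretrained/classifier.pkl"))
```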

3. LLM-Driven Calibration

Categories are NOT hardcoded; the LLM discovers them during calibration:

# Calibration process:
1. Sample 300 emails (3% of 10k)
2. Batch process in groups of 20 emails
3. LLM discovers categories (not predefined)
4. LLM labels each email
5. Train LightGBM on discovered categories

Result: 11 categories discovered from Enron dataset:

  • Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
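The sampling and batching steps above might be sketched as follows (`sample_for_calibration` and `batched` are illustrative names, not the project's actual API):

```python
import random
from typing import Iterable, List, Sequence

def sample_for_calibration(emails: Sequence, fraction: float = 0.03,
                           minimum: int = 300, seed: int = 0) -> List:
    """Draw a small random sample (3% by default, at least `minimum`)."""
    k = min(len(emails), max(minimum, int(len(emails) * fraction)))
    return random.Random(seed).sample(list(emails), k)

def batched(items: Sequence, size: int = 20) -> Iterable[Sequence]:
    """Groups of 20 emails per LLM call keep each prompt small."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

For 10k emails this yields 300 samples in 15 LLM calls of 20 emails each.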

4. Threshold Optimization

Default threshold: 0.55 (reduced from 0.75)

Impact:

  • 0.75 threshold: 35% LLM fallback
  • 0.55 threshold: 21% LLM fallback
  • 40% reduction in LLM usage

All category thresholds in config/categories.yaml set to 0.55.
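In practice the threshold gates the ML → LLM handoff roughly like this (a sketch; `decide_route` is an illustrative name):

```python
from typing import Dict, Tuple

def decide_route(probabilities: Dict[str, float],
                 threshold: float = 0.55) -> Tuple[str, bool]:
    """Return (best_category, needs_llm_fallback).

    If the ML classifier's top probability clears the threshold we keep
    its answer; otherwise the email is routed to the LLM fallback.
    """
    category, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    return category, confidence < threshold
```

Lowering the threshold from 0.75 to 0.55 lets more ML answers clear the bar, which is where the 35% → 21% fallback reduction comes from.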

5. Email Provider Credentials

Multi-account support: 3 accounts per provider type

Credential files:

credentials/
├── gmail/
│   ├── account1.json  # Gmail OAuth credentials
│   ├── account2.json
│   └── account3.json
├── outlook/
│   ├── account1.json  # Outlook OAuth credentials
│   ├── account2.json
│   └── account3.json
└── imap/
    ├── account1.json  # IMAP username/password
    ├── account2.json
    └── account3.json

Security: All *.json files in credentials/ are gitignored (only .example files tracked).

Common Commands

Development

# Activate virtual environment
source venv/bin/activate

# Run classification (Enron dataset)
python -m src.cli run --source enron --limit 10000 --output results/

# Pure ML (no LLM fallback) - FAST
python -m src.cli run --source enron --limit 10000 --output results/ --no-llm-fallback

# With category verification
python -m src.cli run --source enron --limit 10000 --output results/ --verify-categories

# Gmail
python -m src.cli run --source gmail --credentials credentials/gmail/account1.json --limit 1000

# Outlook
python -m src.cli run --source outlook --credentials credentials/outlook/account1.json --limit 1000

Training

# Force recalibration (clears cached model)
rm -rf src/models/calibrated/ src/models/pretrained/
python -m src.cli run --source enron --limit 10000 --output results/

Code Patterns

Adding New Features

  1. Update CLI (src/cli.py):

    • Add click options
    • Pass to appropriate modules
  2. Update Classifier (src/classification/adaptive_classifier.py):

    • Add methods following existing pattern
    • Use classify_with_features() for batched processing
  3. Update Feature Extractor (src/classification/feature_extractor.py):

    • Always support batching (extract_batch())
    • Keep extract() for backward compatibility
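The batch-first pattern in item 3 can be sketched like so (the length-based feature is a placeholder; the real extractor combines embeddings, TF-IDF, and pattern features):

```python
from typing import List, Sequence

class FeatureExtractor:
    """Sketch of the batch-first pattern, not the project's real extractor."""

    def extract_batch(self, emails: Sequence[str],
                      batch_size: int = 512) -> List[List[float]]:
        features: List[List[float]] = []
        for start in range(0, len(emails), batch_size):
            batch = emails[start:start + batch_size]
            # Placeholder: one length-based feature per email.
            features.extend([[float(len(e))] for e in batch])
        return features

    def extract(self, email: str) -> List[float]:
        # Backward-compatible single-email path delegates to the batch API,
        # so there is exactly one feature-computation code path to maintain.
        return self.extract_batch([email])[0]
```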

Testing

# Test imports
python -c "from src.cli import cli; print('OK')"

# Test providers
python -c "from src.email_providers.gmail import GmailProvider; from src.email_providers.outlook import OutlookProvider; print('OK')"

# Test classification
python -m src.cli run --source enron --limit 100 --output test/

Performance Optimization

Current Bottlenecks

  1. Embedding generation - 20s for 10k emails (batched)

    • Optimized with batch_size=512
    • Could use local sentence-transformers for 5-10x speedup
  2. Email parsing - 0.5s for 10k emails (fast)

  3. ML inference - 0.7s for 10k emails (very fast)

Optimization Opportunities

  1. Local embeddings - Replace Ollama API with sentence-transformers

    • Current: 20 API calls, ~20 seconds
    • With local: Direct GPU, ~2-5 seconds
    • Trade-off: More dependencies, larger memory footprint
  2. Embedding cache - Pre-compute and cache to disk

    • One-time cost: 20 seconds
    • Subsequent runs: 2-3 seconds to load from disk
    • Perfect for development/testing
  3. Larger batches - Tested 512, 1024, 2048

    • 512: 23.6s (chosen for balance)
    • 1024: 22.1s (6.6% faster)
    • 2048: 21.9s (7.5% faster, diminishing returns)
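The embedding-cache idea (opportunity 2 above) can be sketched with a content-addressed pickle file; `cached_embeddings` and the cache directory name are illustrative:

```python
import hashlib
import pickle
from pathlib import Path
from typing import Callable, List, Sequence

def cached_embeddings(texts: Sequence[str],
                      embed_fn: Callable[[Sequence[str]], List[List[float]]],
                      cache_dir: Path = Path(".embed_cache")) -> List[List[float]]:
    """Return embeddings for `texts`, computing them only on a cache miss.

    The cache key is a hash of the exact text set, so any change to the
    input invalidates the cached vectors automatically.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256("\x00".join(texts).encode("utf-8")).hexdigest()
    path = cache_dir / f"{digest}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    vectors = embed_fn(texts)
    path.write_bytes(pickle.dumps(vectors))
    return vectors
```

On a second run over the same dataset, `embed_fn` is never called, matching the "one-time cost, then 2-3 seconds to load" pattern described above.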

Known Issues

1. Background Processes

There are stale background bash processes from previous sessions:

  • These can be safely ignored
  • Do NOT try to kill them (per user's CLAUDE.md instructions)

2. Model Path Confusion

  • Calibration saves to src/models/calibrated/
  • Default loads from src/models/pretrained/
  • Both currently have the same model (synced)

3. Category Cache

  • src/models/category_cache.json stores discovered categories
  • Can become polluted if different datasets used
  • Clear with rm src/models/category_cache.json if issues

Dependencies

Required

pip install click pyyaml lightgbm numpy scikit-learn ollama

Email Providers

# Gmail
pip install google-api-python-client google-auth-oauthlib google-auth-httplib2

# Outlook
pip install msal requests

# IMAP - no additional dependencies (Python stdlib)

Optional

# For faster local embeddings
pip install sentence-transformers

# For development
pip install pytest black mypy

Git Workflow

What's Gitignored

  • credentials/ (except .example files)
  • logs/
  • results/
  • src/models/calibrated/ (trained models)
  • *.log
  • debug_*.txt
  • Test directories

What's Tracked

  • All source code
  • Configuration files
  • Documentation
  • Example credential files
  • Pretrained model (if present)

Important Notes for AI Assistants

  1. NEVER create files unless necessary - Always prefer editing existing files

  2. ALWAYS use batching - Feature extraction MUST be batched (512 emails/batch)

  3. Read before writing - Use Read tool before any Edit operations

  4. Verify paths - Model paths can be confusing (calibrated vs pretrained)

  5. No emoji in commits - Per user's CLAUDE.md preferences

  6. Test before committing - Verify imports and CLI work

  7. Security - Never commit actual credentials, only .example files

  8. Performance matters - 10x performance differences are common, always batch

  9. LLM is optional - System works without LLM (pure ML mode with --no-llm-fallback)

  10. Categories are dynamic - They're discovered by LLM, not hardcoded

Recent Changes (Last Session)

  1. Fixed embedding bottleneck - Changed from sequential to batched feature extraction (10x speedup)
  2. Added Outlook provider - Full Microsoft Graph API integration
  3. Added credentials system - Support for 3 accounts per provider type
  4. Optimized thresholds - Reduced from 0.75 to 0.55 (40% less LLM usage)
  5. Added category verifier - Optional single LLM call to verify model fit
  6. Project reorganization - Clean docs/, scripts/, logs/ structure

Next Steps (Roadmap)

See docs/PROJECT_STATUS_AND_NEXT_STEPS.html for complete roadmap.

Immediate priorities:

  1. Test Gmail provider with real credentials
  2. Test Outlook provider with real credentials
  3. Implement email syncing (apply labels back to mailbox)
  4. Add incremental classification (process only new emails)
  5. Create web dashboard for results visualization

Remember: This is an MVP with proven performance. Don't over-engineer. Keep it fast and simple.