- Rewrote CLAUDE.md with comprehensive development guide - Archived 20 old docs to docs/archive/ - Added PROJECT_ROADMAP_2025.md with research learnings - Added CLASSIFICATION_METHODS_COMPARISON.md - Added SESSION_HANDOVER_20251128.md - Added tools for analysis (brett_gmail/microsoft analyzers) - Updated .gitignore for archive folders - Config changes for local vLLM endpoint
519 lines
16 KiB
Markdown
519 lines
16 KiB
Markdown
# Email Classification Methods: Comparative Analysis
|
|
|
|
## Executive Summary
|
|
|
|
This document compares three email classification approaches tested on an 801-email personal Gmail dataset:
|
|
|
|
| Method | Accuracy | Time | Best For |
|
|
|--------|----------|------|----------|
|
|
| ML-Only | 54.9% | 5 sec | 10k+ emails, speed critical |
|
|
| ML+LLM Fallback | 93.3% | 3.5 min | 1k-10k emails, balanced |
|
|
| Agent Analysis | 99.8% | 15-30 min | <1k emails, deep insights |
|
|
|
|
**Key Finding:** The ML pipeline is overkill for datasets under ~5,000 emails. A 10-15 minute agent pre-analysis phase could dramatically improve ML accuracy for larger datasets.
|
|
|
|
---
|
|
|
|
## Test Dataset Profile
|
|
|
|
| Characteristic | Value |
|
|
|----------------|-------|
|
|
| Total Emails | 801 |
|
|
| Date Range | 20 years (2005-2025) |
|
|
| Unique Senders | ~150 |
|
|
| Automated % | 48.8% |
|
|
| Personal % | 1.6% |
|
|
| Structure Level | MEDIUM-HIGH |
|
|
|
|
### Email Type Breakdown (Sanitized)
|
|
|
|
```
|
|
Automated Notifications 48.8% ████████████████████████
|
|
├─ Art marketplace alerts 16.2% ████████
|
|
├─ Shopping promotions 15.4% ███████
|
|
├─ Travel recommendations 13.4% ██████
|
|
└─ Streaming promotions 8.5% ████
|
|
|
|
Business/Professional 20.1% ██████████
|
|
├─ Cloud service reports 13.0% ██████
|
|
├─ Security alerts 7.1% ███
|
|
|
|
AI/Developer Services 12.8% ██████
|
|
├─ AI platform updates 6.4% ███
|
|
├─ Developer tool updates 6.4% ███
|
|
|
|
Personal/Other 18.3% █████████
|
|
├─ Entertainment 5.1% ██
|
|
├─ Productivity tools 3.7% █
|
|
├─ Direct correspondence 1.6% █
|
|
└─ Miscellaneous 7.9% ███
|
|
```
|
|
|
|
---
|
|
|
|
## Method 1: ML-Only Classification
|
|
|
|
### Configuration
|
|
```yaml
|
|
model: LightGBM (pretrained on Enron dataset)
|
|
embeddings: all-minilm:l6-v2 (384 dimensions)
|
|
threshold: 0.55 confidence
|
|
categories: 11 generic (Work, Updates, Financial, etc.)
|
|
```
|
|
|
|
### Results
|
|
|
|
| Metric | Value |
|
|
|--------|-------|
|
|
| Accuracy Estimate | 54.9% |
|
|
| High Confidence (>55%) | 477 (59.6%) |
|
|
| Low Confidence | 324 (40.4%) |
|
|
| Processing Time | ~5 seconds |
|
|
| LLM Calls | 0 |
|
|
|
|
### Category Distribution (ML-Only)
|
|
|
|
| Category | Count | % |
|
|
|----------|-------|---|
|
|
| Work | 243 | 30.3% |
|
|
| Technical | 198 | 24.7% |
|
|
| Updates | 156 | 19.5% |
|
|
| External | 89 | 11.1% |
|
|
| Operational | 45 | 5.6% |
|
|
| Financial | 38 | 4.7% |
|
|
| Other | 32 | 4.0% |
|
|
|
|
### Limitations Observed
|
|
|
|
1. **Domain Mismatch:** Trained on corporate Enron emails, applied to personal Gmail
|
|
2. **Generic Categories:** "Work" and "Technical" absorbed everything
|
|
3. **No Sender Intelligence:** Didn't leverage sender domain patterns
|
|
4. **High Uncertainty:** 40% needed LLM review but got none
|
|
|
|
### When ML-Only Works
|
|
|
|
- 10,000+ emails where speed matters
|
|
- Corporate/enterprise datasets similar to training data
|
|
- Pre-filtering before human review
|
|
- Cost-constrained environments (no LLM API)
|
|
|
|
---
|
|
|
|
## Method 2: ML + LLM Fallback
|
|
|
|
### Configuration
|
|
```yaml
|
|
ml_model: LightGBM (same as above)
|
|
llm_model: qwen3-coder-30b (vLLM on localhost:11433)
|
|
threshold: 0.55 confidence
|
|
fallback_trigger: confidence < threshold
|
|
```
|
|
|
|
### Results
|
|
|
|
| Metric | Value |
|
|
|--------|-------|
|
|
| Accuracy Estimate | 93.3% |
|
|
| ML Classified | 477 (59.6%) |
|
|
| LLM Classified | 324 (40.4%) |
|
|
| Processing Time | ~3.5 minutes |
|
|
| LLM Calls | 324 |
|
|
|
|
### Category Distribution (ML+LLM)
|
|
|
|
| Category | Count | % | Source |
|
|
|----------|-------|---|--------|
|
|
| Work | 243 | 30.3% | ML |
|
|
| Technical | 156 | 19.5% | ML |
|
|
| newsletters | 98 | 12.2% | LLM |
|
|
| junk | 87 | 10.9% | LLM |
|
|
| transactional | 76 | 9.5% | LLM |
|
|
| Updates | 62 | 7.7% | ML |
|
|
| auth | 45 | 5.6% | LLM |
|
|
| Other | 34 | 4.2% | Mixed |
|
|
|
|
### Improvements Over ML-Only
|
|
|
|
1. **New Categories:** LLM introduced "newsletters", "junk", "transactional", "auth"
|
|
2. **Better Separation:** Marketing vs. transactional distinguished
|
|
3. **Higher Confidence:** 93.3% vs 54.9% accuracy estimate
|
|
|
|
### Limitations Observed
|
|
|
|
1. **Category Inconsistency:** ML uses "Updates", LLM uses "newsletters"
|
|
2. **No Sender Context:** Still classifying email-by-email
|
|
3. **Generic LLM Prompt:** Doesn't know about user's specific interests
|
|
4. **Time Cost:** 324 sequential LLM calls at ~0.6s each
|
|
|
|
### When ML+LLM Works
|
|
|
|
- 1,000-10,000 emails
|
|
- Mixed automated/personal content
|
|
- When accuracy matters more than speed
|
|
- Local LLM available (cost-free fallback)
|
|
|
|
---
|
|
|
|
## Method 3: Agent Analysis (Manual)
|
|
|
|
### Approach
|
|
```
|
|
Phase 1: Initial Discovery (5 min)
|
|
- Sample filenames and subjects
|
|
- Identify sender domains
|
|
- Detect patterns
|
|
|
|
Phase 2: Pattern Extraction (10 min)
|
|
- Design domain-specific rules
|
|
- Test regex patterns
|
|
- Validate on subset
|
|
|
|
Phase 3: Deep Dive (5 min)
|
|
- Track order lifecycles
|
|
- Identify billing patterns
|
|
- Find edge cases
|
|
|
|
Phase 4: Report Generation (5 min)
|
|
- Synthesize findings
|
|
- Create actionable recommendations
|
|
```
|
|
|
|
### Results
|
|
|
|
| Metric | Value |
|
|
|--------|-------|
|
|
| Accuracy | 99.8% (799/801) |
|
|
| Categories | 15 custom |
|
|
| Processing Time | ~25 minutes |
|
|
| LLM Calls | ~20 (analysis only) |
|
|
|
|
### Category Distribution (Agent Analysis)
|
|
|
|
| Category | Count | % | Subcategories |
|
|
|----------|-------|---|---------------|
|
|
| Art & Collectibles | 130 | 16.2% | Marketplace alerts |
|
|
| Shopping | 123 | 15.4% | eBay, AliExpress, Automotive |
|
|
| Entertainment | 109 | 13.6% | Streaming, Gaming, Social |
|
|
| Travel & Tourism | 107 | 13.4% | Review sites, Bookings |
|
|
| Google Services | 104 | 13.0% | Business, Ads, Analytics |
|
|
| Security | 57 | 7.1% | Sign-in alerts, 2FA |
|
|
| AI Services | 51 | 6.4% | Claude, OpenAI, Lambda |
|
|
| Developer Tools | 51 | 6.4% | ngrok, Firebase, Docker |
|
|
| Productivity | 30 | 3.7% | Screen recording, Docs |
|
|
| Personal | 13 | 1.6% | Direct correspondence |
|
|
| Other | 26 | 3.2% | Childcare, Legal, etc. |
|
|
|
|
### Unique Insights (Not Found by ML)
|
|
|
|
1. **Specific Artist Tracking:** 95 alerts for specific artist "Dan Colen"
|
|
2. **Order Lifecycle:** Single order generated 7 notification emails
|
|
3. **Billing Patterns:** Monthly receipts from AI services on 15th
|
|
4. **Business Context:** User runs "Fox Software Solutions"
|
|
5. **Filtering Rules:** Ready-to-implement Gmail filters
|
|
|
|
### When Agent Analysis Works
|
|
|
|
- Under 1,000 emails
|
|
- Initial dataset understanding
|
|
- Creating filtering rules
|
|
- One-time deep analysis
|
|
- Training data preparation
|
|
|
|
---
|
|
|
|
## Comparative Analysis
|
|
|
|
### Accuracy vs Time Tradeoff
|
|
|
|
```
|
|
Accuracy
|
|
100% ─┬─────────────────────────●─── Agent (99.8%)
|
|
│ ●─────── ML+LLM (93.3%)
|
|
75% ─┤
|
|
│
|
|
50% ─┼────●───────────────────────── ML-Only (54.9%)
|
|
│
|
|
25% ─┤
|
|
│
|
|
0% ─┴────┬────────┬────────┬────────┬─── Time
|
|
5s 1m 5m 30m
|
|
```
|
|
|
|
### Cost Analysis (per 1000 emails)
|
|
|
|
| Method | Compute | LLM Calls | Est. Cost |
|
|
|--------|---------|-----------|-----------|
|
|
| ML-Only | 5 sec | 0 | $0.00 |
|
|
| ML+LLM | 4 min | ~400 | $0.02-0.40* |
|
|
| Agent | 30 min | ~30 | $0.01-0.10* |
|
|
|
|
*Depends on LLM provider; local = free, cloud = varies
|
|
|
|
### Category Quality
|
|
|
|
| Aspect | ML-Only | ML+LLM | Agent |
|
|
|--------|---------|--------|-------|
|
|
| Granularity | Low (11) | Medium (16) | High (15+subs) |
|
|
| Domain-Specific | No | Partial | Yes |
|
|
| Actionable | Limited | Moderate | High |
|
|
| Sender-Aware | No | No | Yes |
|
|
| Context-Aware | No | Limited | Yes |
|
|
|
|
---
|
|
|
|
## Enhancement Recommendations
|
|
|
|
### 1. Pre-Analysis Phase (10-15 min investment)
|
|
|
|
**Concept:** Run agent analysis BEFORE ML classification to:
|
|
- Discover sender domains and their purposes
|
|
- Identify category patterns specific to dataset
|
|
- Generate custom classification rules
|
|
- Create sender-to-category mappings
|
|
|
|
**Implementation:**
|
|
```python
|
|
class PreAnalysisAgent:
|
|
def analyze(self, emails: List[Email], sample_size=100):
|
|
# Phase 1: Sender domain clustering
|
|
domains = self.cluster_by_sender_domain(emails)
|
|
|
|
# Phase 2: Subject pattern extraction
|
|
patterns = self.extract_subject_patterns(emails)
|
|
|
|
# Phase 3: Generate custom categories
|
|
categories = self.generate_categories(domains, patterns)
|
|
|
|
# Phase 4: Create sender-category mapping
|
|
sender_map = self.map_senders_to_categories(domains, categories)
|
|
|
|
return {
|
|
'categories': categories,
|
|
'sender_map': sender_map,
|
|
'patterns': patterns
|
|
}
|
|
```
|
|
|
|
**Expected Impact:**
|
|
- Accuracy: 54.9% → 85-90% (ML-only with pre-analysis)
|
|
- Time: +10 min setup, same runtime
|
|
- Best for: 5,000+ email datasets
|
|
|
|
### 2. Sender-First Classification
|
|
|
|
**Concept:** Classify by sender domain BEFORE content analysis:
|
|
|
|
```python
|
|
SENDER_CATEGORIES = {
|
|
# High-volume automated
|
|
'mutualart.com': ('Notifications', 'Art Alerts'),
|
|
'tripadvisor.com': ('Notifications', 'Travel Marketing'),
|
|
'ebay.com': ('Shopping', 'Marketplace'),
|
|
'spotify.com': ('Entertainment', 'Streaming'),
|
|
|
|
# Security - never auto-filter
|
|
'accounts.google.com': ('Security', 'Account Alerts'),
|
|
|
|
# Business
|
|
'businessprofile-noreply@google.com': ('Business', 'Reports'),
|
|
}
|
|
|
|
def classify(email):
|
|
domain = extract_domain(email.sender)
|
|
if domain in SENDER_CATEGORIES:
|
|
return SENDER_CATEGORIES[domain] # 80% of emails
|
|
else:
|
|
return ml_classify(email) # Fallback for 20%
|
|
```
|
|
|
|
**Expected Impact:**
|
|
- Accuracy: 85-95% for known senders
|
|
- Speed: 10x faster (skip ML for known senders)
|
|
- Maintenance: Requires sender map updates
|
|
|
|
### 3. Post-Analysis Enhancement
|
|
|
|
**Concept:** Run agent analysis AFTER ML to:
|
|
- Validate classification quality
|
|
- Extract deeper insights
|
|
- Generate reports and recommendations
|
|
- Identify misclassifications
|
|
|
|
**Implementation:**
|
|
```python
|
|
class PostAnalysisAgent:
|
|
def analyze(self, emails: List[Email], classifications: List[Result]):
|
|
# Validate: Check for obvious errors
|
|
errors = self.detect_misclassifications(emails, classifications)
|
|
|
|
# Enrich: Add metadata not captured by ML
|
|
enriched = self.extract_metadata(emails)
|
|
|
|
# Insights: Generate actionable recommendations
|
|
insights = self.generate_insights(emails, classifications)
|
|
|
|
return {
|
|
'corrections': errors,
|
|
'enrichments': enriched,
|
|
'insights': insights
|
|
}
|
|
```
|
|
|
|
### 4. Dataset Size Routing
|
|
|
|
**Concept:** Automatically choose method based on volume:
|
|
|
|
```python
|
|
def choose_method(email_count: int, time_budget: str = 'normal'):
|
|
if email_count < 500:
|
|
return 'agent_only' # Full agent analysis
|
|
|
|
elif email_count < 2000:
|
|
return 'agent_then_ml' # Pre-analysis + ML
|
|
|
|
elif email_count < 10000:
|
|
return 'ml_with_llm' # ML + LLM fallback
|
|
|
|
else:
|
|
return 'ml_only' # Pure ML for speed
|
|
```
|
|
|
|
**Recommended Thresholds:**
|
|
|
|
| Volume | Recommended Method | Rationale |
|
|
|--------|-------------------|-----------|
|
|
| <500 | Agent Only | ML overhead not worth it |
|
|
| 500-2000 | Agent Pre-Analysis + ML | Investment pays off |
|
|
| 2000-10000 | ML + LLM Fallback | Balanced approach |
|
|
| >10000 | ML-Only | Speed critical |
|
|
|
|
### 5. Hybrid Category System
|
|
|
|
**Concept:** Merge ML categories with agent-discovered categories:
|
|
|
|
```python
|
|
# ML Generic Categories (trained)
|
|
ML_CATEGORIES = ['Work', 'Updates', 'Technical', 'Financial', ...]
|
|
|
|
# Agent-Discovered Categories (per-dataset)
|
|
AGENT_CATEGORIES = {
|
|
'Art Alerts': {'parent': 'Updates', 'sender': 'mutualart.com'},
|
|
'Travel Marketing': {'parent': 'Updates', 'sender': 'tripadvisor.com'},
|
|
'AI Services': {'parent': 'Technical', 'keywords': ['anthropic', 'openai']},
|
|
}
|
|
|
|
def classify_hybrid(email, ml_result):
|
|
# First: Check agent-specific rules
|
|
for cat, rules in AGENT_CATEGORIES.items():
|
|
if matches_rules(email, rules):
|
|
return (cat, ml_result.category) # Specific + generic
|
|
|
|
# Fallback: ML result
|
|
return (ml_result.category, None)
|
|
```
|
|
|
|
---
|
|
|
|
## Implementation Roadmap
|
|
|
|
### Phase 1: Quick Wins (1-2 hours)
|
|
|
|
1. **Add sender-domain classifier**
|
|
- Map top 20 senders to categories
|
|
- Use as fast-path before ML
|
|
- Expected: +20% accuracy
|
|
|
|
2. **Add dataset size routing**
|
|
- Check email count before processing
|
|
- Route small datasets to agent analysis
|
|
- Route large datasets to ML pipeline
|
|
|
|
### Phase 2: Pre-Analysis Agent (4-8 hours)
|
|
|
|
1. **Build sender clustering**
|
|
- Group emails by domain
|
|
- Calculate volume per domain
|
|
- Identify automated vs personal
|
|
|
|
2. **Build pattern extraction**
|
|
- Find subject templates
|
|
- Extract IDs and tracking numbers
|
|
- Identify lifecycle stages
|
|
|
|
3. **Generate sender map**
|
|
- Output: JSON mapping senders to categories
|
|
- Feed into ML pipeline as rules
|
|
|
|
### Phase 3: Post-Analysis Enhancement (4-8 hours)
|
|
|
|
1. **Build validation agent**
|
|
- Check low-confidence results
|
|
- Detect category conflicts
|
|
- Flag for review
|
|
|
|
2. **Build enrichment agent**
|
|
- Extract order IDs
|
|
- Track lifecycles
|
|
- Generate insights
|
|
|
|
3. **Integrate with HTML report**
|
|
- Add insights section
|
|
- Show lifecycle tracking
|
|
- Include recommendations
|
|
|
|
---
|
|
|
|
## Conclusion
|
|
|
|
### Key Takeaways
|
|
|
|
1. **ML pipeline is overkill for <5,000 emails** - Agent analysis provides better accuracy with similar time investment
|
|
|
|
2. **Sender domain is the strongest signal** - 80%+ emails can be classified by sender alone
|
|
|
|
3. **Pre-analysis investment pays off** - 10-15 min agent setup dramatically improves ML accuracy
|
|
|
|
4. **One-size-fits-all doesn't work** - Route by dataset size for optimal results
|
|
|
|
5. **Post-analysis adds unique value** - Lifecycle tracking and insights not possible with ML alone
|
|
|
|
### Recommended Default Pipeline
|
|
|
|
```
|
|
┌─────────────────────────────────────────────────────────────┐
|
|
│ EMAIL CLASSIFICATION │
|
|
└─────────────────────────────────────────────────────────────┘
|
|
│
|
|
▼
|
|
┌─────────────────┐
|
|
│ Count Emails │
|
|
└────────┬────────┘
|
|
│
|
|
┌──────────────────┼──────────────────┐
|
|
│ │ │
|
|
▼ ▼ ▼
|
|
<500 emails 500-5000 >5000
|
|
│ │ │
|
|
▼ ▼ ▼
|
|
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
|
|
│ Agent Only │ │ Pre-Analysis │ │ ML Pipeline │
|
|
│ (15-30 min) │ │ + ML + Post │ │ (fast) │
|
|
│ │ │ (15 min + ML)│ │ │
|
|
└──────────────┘ └──────────────┘ └──────────────┘
|
|
│ │ │
|
|
▼ ▼ ▼
|
|
┌──────────────────────────────────────────────────┐
|
|
│ UNIFIED OUTPUT │
|
|
│ - Categorized emails │
|
|
│ - Confidence scores │
|
|
│ - Insights & recommendations │
|
|
│ - Filtering rules │
|
|
└──────────────────────────────────────────────────┘
|
|
```
|
|
|
|
---
|
|
|
|
*Document Version: 1.0*
|
|
*Created: 2025-11-28*
|
|
*Based on: brett-gmail dataset analysis (801 emails)*
|