# EMAIL SORTER - PROJECT BLUEPRINT

**Hybrid ML/LLM Email Classification System**

Version: 2.0
Date: 2024-10-21
Status: Research Complete - Ready to Build

---

## EXECUTIVE SUMMARY

**What it does:**
Processes 80,000+ emails in ~17 minutes. A LightGBM model trained during calibration handles bulk classification (90%+ of emails), and an LLM (Ollama or any OpenAI-compatible API) handles startup calibration and the remaining edge cases (~5-10%).

**How it works:**
1. Fresh repo clone per job (complete isolation)
2. LLM analyzes sample to discover natural categories (calibration phase)
3. Train LightGBM on embeddings + patterns + structural features
4. ML sprints through high-confidence classifications
5. Hard rules catch obvious patterns (OTP, invoices, etc.)
6. LLM reviews only uncertain cases (batched efficiently)
7. System self-tunes thresholds based on LLM feedback
8. Export results and sync back to email provider
9. Delete repo (cleanup)

**Target use case:**
Self-employed people and business owners with 10k-100k+ neglected emails who need a privacy-focused, one-time cleanup without cloud uploads or subscriptions.

**Key innovation:**
Hybrid approach with structured embeddings, hard pattern rules, and dynamic threshold adjustment. The LLM is OPTIONAL - the system degrades gracefully if it is unavailable.

---

## COMPETITIVE ANALYSIS (2024 Research)

### Existing Solutions (ALL Cloud-Based)

| Tool | Price | Accuracy | Privacy | Notes |
|------|-------|----------|---------|-------|
| SaneBox | $7-15/mo | ~85% | ❌ Cloud | AI filtering, requires upload |
| Clean Email | $10-30/mo | ~80% | ❌ Cloud | Smart folders, subscription |
| Spark | Free/Paid | ~75% | ❌ Cloud | Smart inbox, cloud sync |
| EmailTree.ai | Enterprise | ~90% | ❌ Cloud | NLP, for businesses |
| Mailstrom | $30-50/yr | ~70% | ❌ Cloud | Bulk analysis |

### Our Competitive Advantages

- ✅ **100% LOCAL** - No data leaves the machine (with the default Ollama setup)
- ✅ **Privacy-first** - Perfect for business owners with sensitive data
- ✅ **One-time use** - No subscription, pay per job or DIY
- ✅ **Customizable** - Adapts to each inbox during calibration
- ✅ **Open source potential** - Distributable as Python wheel
- ✅ **Attachment analysis** - Competitors ignore this entirely
- ✅ **Offline capable** - Works without internet (after initial setup)

### Benchmark Performance (2024 Research)

**Enron Dataset (industry standard):**
- Traditional ML (SVM, Random Forest): 95-98%
- Deep Learning (DNN-BiLSTM): 98.69%
- Transformers (BERT, RoBERTa): ~99%
- LLMs (GPT-4): 99.7% (phishing detection)
- Ensemble methods: 98.8%

**Our Target:** 94-96% accuracy (competitive, privacy-focused, local)

---

## ARCHITECTURE

### Three-Phase Pipeline

```
┌─────────────────────────────────────────────────────────────┐
│ PHASE 1: CALIBRATION (3-5 minutes)                          │
├─────────────────────────────────────────────────────────────┤
│ 1. Sample 1500 emails (stratified sampling)                 │
│ 2. LLM analyzes patterns and discovers categories           │
│    Model: qwen3:4b (bigger, more accurate)                  │
│    Alternative: Compress to 500 emails + smarter batching   │
│ 3. Map discovered → universal categories                    │
│ 4. Generate training labels for embedding classifier        │
│ 5. Validate on 300 emails                                   │
│ 6. Set initial confidence thresholds                        │
│ 7. Train LightGBM on embeddings + patterns                  │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│ PHASE 2: BULK PROCESSING (10-12 minutes)                    │
├─────────────────────────────────────────────────────────────┤
│ For each email:                                             │
│   → Pattern detection (regex, <1ms)                         │
│   → Hard rule match? → INSTANT (10% of emails)              │
│   → Generate structured embedding (batched, 8 min total)    │
│   → LightGBM classify with confidence score                 │
│   → IF confidence >= threshold: ACCEPT (85%)                │
│   → IF confidence < threshold: QUEUE for LLM (5%)           │
│                                                             │
│ Every 1000 emails or queue full:                            │
│   → Process LLM batch (qwen3:1.7b, fast)                    │
│   → Analyze agreement rate                                  │
│   → Adjust thresholds dynamically                           │
│   → Learn sender rules                                      │
│   → Save checkpoint                                         │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│ PHASE 3: FINALIZATION (2-3 minutes)                         │
├─────────────────────────────────────────────────────────────┤
│ 1. Process remaining LLM queue                              │
│ 2. Export results (JSON/CSV)                                │
│ 3. Sync to email provider (Gmail labels, IMAP folders)      │
│ 4. Generate classification report                           │
│ 5. Cleanup (delete repo, temp files)                        │
└─────────────────────────────────────────────────────────────┘
```

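The "analyze agreement rate → adjust thresholds dynamically" step in Phase 2 is not specified further in this blueprint. One possible reading, as a minimal sketch: if the LLM confirms nearly every queued ML prediction, the threshold is lowered so fewer emails are queued; if it often disagrees, the threshold is raised. The function name, target band, step size, and bounds are all assumptions for illustration.

```python
def adjust_threshold(threshold, agreements, total,
                     target_low=0.85, target_high=0.95,
                     step=0.02, floor=0.50, ceiling=0.95):
    """Nudge the ML confidence threshold after one LLM review batch.

    agreements / total is how often the LLM confirmed the ML prediction
    for the emails queued in this batch. All defaults are illustrative.
    """
    if total == 0:
        return threshold  # nothing was reviewed, so nothing to learn
    agreement_rate = agreements / total
    if agreement_rate > target_high:
        # LLM almost always agrees: ML is too cautious, lower the bar
        threshold -= step
    elif agreement_rate < target_low:
        # LLM often disagrees: ML is overconfident near the bar, raise it
        threshold += step
    # Clamp so a run of odd batches cannot push the threshold to an extreme
    return round(min(ceiling, max(floor, threshold)), 4)
```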
---

## CORE COMPONENTS

### 1. Hybrid Feature Extraction (THE SECRET SAUCE)

Combines four feature types for maximum accuracy:

#### A. Sentence Embeddings (Semantic Understanding)
```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')  # 384 dimensions

# Structured embedding with parameterized headers
def build_embedding_text(email, patterns):
    return f"""[EMAIL_METADATA]
sender_type: {email.sender_domain_type}
time_category: {email.time_of_day}
has_attachments: {email.has_attachments}
attachment_types: {email.attachment_types}

[DETECTED_PATTERNS]
has_otp: {patterns['has_otp']}
has_invoice: {patterns['has_invoice']}
has_unsubscribe: {patterns['has_unsubscribe']}
is_automated: {patterns['is_automated']}
has_meeting: {patterns['has_meeting']}

[CONTENT]
subject: {email.subject}
body: {email.body_snippet[:300]}
"""

# `email` comes from the provider, `patterns` from the hard pattern rules (section B)
text = build_embedding_text(email, patterns)
embedding = embedder.encode(text)  # → 384-dim vector
```

**Why this works:**
- Model sees STRUCTURE, not just raw text
- Pattern hints guide semantic understanding
- Research shows 5-10% accuracy boost vs naive embedding
- Handles semantic variants: "meeting" = "call" = "zoom"

#### B. Hard Pattern Rules (Fast Deterministic)
```python
import re

# ~20 boolean/numerical features extracted via regex.
# Assumption: `text` is the subject and body joined together.
text = f"{email.subject}\n{email.body}"
text_lower = text.lower()

patterns = {
    # Authentication patterns
    # NOTE: matches any standalone 4-6 digit number (years, ZIP codes), not only OTPs
    'has_otp': bool(re.search(r'\b\d{4,6}\b', text)),
    'has_verification': 'verification' in text_lower,
    'has_reset_password': 'reset password' in text_lower,

    # Transactional patterns
    'has_invoice': bool(re.search(r'invoice\s*#?\d+', text, re.I)),
    'has_receipt': 'receipt' in text_lower,
    'has_price': bool(re.search(r'\$\d+', text)),
    'has_order_number': bool(re.search(r'order\s*#?\d+', text, re.I)),

    # Newsletter/marketing patterns
    'has_unsubscribe': 'unsubscribe' in text_lower,
    'has_view_in_browser': 'view in browser' in text_lower,

    # Meeting/calendar patterns
    'has_meeting': bool(re.search(r'(meeting|call|zoom|teams)', text, re.I)),
    'has_calendar': 'calendar' in text_lower,

    # Other patterns
    'has_tracking': bool(re.search(r'tracking\s*(number|#)', text, re.I)),
    'is_automated': email.sender_domain_type == 'noreply',
    'has_signature': bool(re.search(r'(regards|sincerely|best)', text, re.I)),
}
```

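The `has_otp` rule above fires on any standalone 4-6 digit number, which also covers years, ZIP codes, and order totals. A stricter variant, sketched here as an assumption rather than as part of the blueprint, only counts a code that follows an OTP-style keyword. The function name, keyword list, and 40-character window are illustrative.

```python
import re

# A 4-6 digit code within ~40 non-digit characters after an OTP-style keyword
_OTP_NEAR_KEYWORD = re.compile(
    r'(?:verification code|one[- ]time (?:password|code)|\botp\b'
    r'|security code|passcode)'
    r'\D{0,40}?\b\d{4,6}\b',
    re.IGNORECASE,
)

def has_otp(text: str) -> bool:
    """True only when a short numeric code appears right after an OTP keyword."""
    return _OTP_NEAR_KEYWORD.search(text) is not None
```

The trade-off is recall: a message that shows a bare code with no surrounding keyword would be missed, so this variant suits the hard-rule path, where false positives are costly.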
#### C. Structural Features (Metadata)
```python
import re

# ~20 numerical/categorical features
attachment_types = extract_attachment_types(email.attachments)
has_pdf = 'pdf' in attachment_types
has_invoice = patterns['has_invoice']  # from the pattern rules in section B

structural = {
    # Sender analysis
    'sender_domain': extract_domain(email.sender),
    'sender_domain_type': categorize_domain(email.sender),  # freemail/corporate/noreply
    'is_noreply': 'noreply' in email.sender.lower(),

    # Timing
    'time_of_day': categorize_hour(email.date.hour),  # night/morning/afternoon/evening
    'day_of_week': email.date.strftime('%A').lower(),

    # Content structure
    'subject_length': len(email.subject),
    'body_length': len(email.body),
    'link_count': len(re.findall(r'https?://', email.body)),
    'image_count': len(re.findall(r'<img', email.body)),

    # Attachments (COMPETITIVE ADVANTAGE)
    'has_attachments': email.has_attachments,
    'attachment_count': len(email.attachments),
    'attachment_types': attachment_types,
    'has_pdf': has_pdf,
    'has_invoice_pdf': has_invoice and has_pdf,  # KILLER FEATURE

    # Reply/forward
    'has_reply_prefix': bool(re.match(r'^(Re:|Fwd:)', email.subject, re.I)),
}
```

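The structural features call `extract_domain`, `categorize_domain`, and `categorize_hour` without defining them. A minimal sketch of what they might look like follows; the freemail list, the no-reply markers, and the hour boundaries are assumptions, and only the returned labels (freemail/corporate/noreply, night/morning/afternoon/evening) come from the comments above.

```python
# Illustrative lists; a real deployment would likely load these from config
FREEMAIL_DOMAINS = {'gmail.com', 'yahoo.com', 'outlook.com', 'hotmail.com', 'icloud.com'}
NOREPLY_MARKERS = ('noreply', 'no-reply', 'donotreply', 'do-not-reply')

def extract_domain(sender: str) -> str:
    """Return the lowercased domain of 'user@host' or 'Name <user@host>'."""
    address = sender.strip().rstrip('>').rsplit('<', 1)[-1]
    return address.rsplit('@', 1)[-1].lower() if '@' in address else ''

def categorize_domain(sender: str) -> str:
    """Classify the sender as 'noreply', 'freemail', or 'corporate'."""
    local_part = sender.lower().rsplit('@', 1)[0]
    if any(marker in local_part for marker in NOREPLY_MARKERS):
        return 'noreply'
    if extract_domain(sender) in FREEMAIL_DOMAINS:
        return 'freemail'
    return 'corporate'

def categorize_hour(hour: int) -> str:
    """Bucket an hour (0-23) into four assumed time-of-day bands."""
    if hour < 6:
        return 'night'
    if hour < 12:
        return 'morning'
    if hour < 18:
        return 'afternoon'
    return 'evening'
```

Keeping the label sets small and fixed matters later, because LightGBM treats these values as categorical features.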
#### D. Attachment Analysis (Differentiator)
```python
import re

def analyze_attachments(attachments):
    """Extract features from attachments - competitors don't do this!"""
    features = {
        'has_attachments': len(attachments) > 0,
        'attachment_count': len(attachments),
        'total_size': sum(a['size'] for a in attachments),
        'attachment_types': [],
        # Defaults keep the feature set fixed when no PDF is present
        'pdf_has_invoice': False,
        'pdf_has_account': False,
    }

    for attachment in attachments:
        mime_type = attachment.get('mime_type', '').lower()
        filename = attachment.get('filename', '').lower()

        # Type categorization
        if 'pdf' in mime_type or filename.endswith('.pdf'):
            features['attachment_types'].append('pdf')

            # Extract text from PDF if small enough (<5MB)
            if attachment['size'] < 5_000_000:
                text = extract_pdf_text(attachment)  # helper not shown here
                features['pdf_has_invoice'] |= bool(re.search(r'invoice|bill', text, re.I))
                features['pdf_has_account'] |= bool(re.search(r'account\s*#?\d+', text, re.I))

        elif 'word' in mime_type or filename.endswith(('.doc', '.docx')):
            features['attachment_types'].append('docx')

        # The .xlsx MIME type contains "spreadsheetml", not "excel"
        elif ('excel' in mime_type or 'spreadsheet' in mime_type
              or filename.endswith(('.xls', '.xlsx'))):
            features['attachment_types'].append('xlsx')

        elif 'image' in mime_type or filename.endswith(('.png', '.jpg', '.jpeg')):
            features['attachment_types'].append('image')

    return features
```

**Why this matters:**
- Business emails often have invoice PDFs, contract DOCXs
- Detecting "PDF with INVOICE text" → instant "transactional" classification
- Competitors ignore attachments entirely = our differentiator

#### Combined Feature Vector
```python
import numpy as np

# Total: ~434 dimensions (vs 10,000 with TF-IDF!)
final_features = np.concatenate([
    embedding,          # 384 dims (semantic understanding)
    pattern_values,     # 20 dims (hard rules)
    structural_values,  # 20 dims (metadata)
    attachment_values,  # 10 dims (NEW!)
])
```

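The concatenation above only works if every email produces its values in the same column order, at training time and at prediction time. A minimal sketch of one way to guarantee that is shown here, using plain lists in place of NumPy arrays to stay dependency-free; the key lists are an illustrative subset and the function name is hypothetical.

```python
# Fixed column order; illustrative subsets of the pattern/structural keys
PATTERN_KEYS = ['has_otp', 'has_invoice', 'has_unsubscribe']
STRUCTURAL_KEYS = ['subject_length', 'link_count']

def flatten_features(embedding, patterns, structural):
    """Concatenate features in a fixed key order; missing keys become 0.0."""
    vector = [float(x) for x in embedding]
    # Booleans become 0.0 / 1.0 so the whole vector is numeric
    vector += [float(patterns.get(key, False)) for key in PATTERN_KEYS]
    vector += [float(structural.get(key, 0)) for key in STRUCTURAL_KEYS]
    return vector
```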
---

### 2. LightGBM Classifier (Research-Backed Choice)

**Why LightGBM over XGBoost:**
- ✅ **Native categorical handling** (no encoding needed)
- ✅ **2-5x faster** on mixed feature types
- ✅ **4x speedup** with categorical + numerical features
- ✅ **Better memory efficiency**
- ✅ **Equivalent accuracy** to XGBoost
- ✅ **Perfect for embeddings** (dense numerical) + categoricals

```python
import lightgbm as lgb
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed column layout for the structural features (values from section C)
NUMERICAL_KEYS = ['subject_length', 'body_length', 'link_count',
                  'image_count', 'attachment_count']
CATEGORICAL_VALUES = {
    'sender_domain_type': ['freemail', 'corporate', 'noreply'],
    'time_of_day': ['night', 'morning', 'afternoon', 'evening'],
    'day_of_week': ['monday', 'tuesday', 'wednesday', 'thursday',
                    'friday', 'saturday', 'sunday'],
}


class HybridClassifier:
    def __init__(self, categories):
        self.categories = categories
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.model = None

    def extract_features(self, email):
        """Extract all feature types"""
        # extract_patterns / extract_structural wrap the dicts from sections B and C
        patterns = extract_patterns(email)
        structural = extract_structural(email)

        # Structured embedding with rich context
        text = build_embedding_text(email, patterns)
        embedding = self.embedder.encode(text)

        # Combine features
        return {
            'embedding': embedding,    # 384 numerical
            'patterns': patterns,      # 20 numerical/boolean
            'structural': structural,  # 20 numerical/categorical
        }

    def build_feature_vector(self, features):
        """Flatten one feature dict into a fixed-order numeric vector"""
        structural = features['structural']
        # LightGBM expects categoricals as non-negative integers; -1 = unseen value
        categorical = [
            values.index(structural[key]) if structural[key] in values else -1
            for key, values in CATEGORICAL_VALUES.items()
        ]
        return np.concatenate([
            features['embedding'],
            [float(v) for v in features['patterns'].values()],
            [float(structural[k]) for k in NUMERICAL_KEYS],
            categorical,
        ])

    def train(self, emails, labels):
        """Train on LLM-labeled data from calibration (labels are category names)"""
        # Build feature matrix
        X = np.array([
            self.build_feature_vector(self.extract_features(e))
            for e in emails
        ])

        # Categorical columns are the last len(CATEGORICAL_VALUES) columns of X
        n_categorical = len(CATEGORICAL_VALUES)
        categorical_idx = list(range(X.shape[1] - n_categorical, X.shape[1]))

        # Train LightGBM (the number of classes is inferred from the labels)
        self.model = lgb.LGBMClassifier(
            n_estimators=200,
            learning_rate=0.1,
            max_depth=8,
            num_leaves=31,
            objective='multiclass',
        )
        # categorical_feature is an argument of fit(), given as column indices
        self.model.fit(X, labels, categorical_feature=categorical_idx)

    def predict(self, email):
        """Predict with confidence"""
        X = self.build_feature_vector(self.extract_features(email))

        # Probability columns follow self.model.classes_, not self.categories
        probs = self.model.predict_proba(X.reshape(1, -1))[0]
        classes = self.model.classes_
        best = int(np.argmax(probs))

        return {
            'category': classes[best],
            'confidence': float(probs[best]),
            'probabilities': {c: float(p) for c, p in zip(classes, probs)},
        }
```

---

### 3. LLM Integration (Flexible & Optional)

**Model Strategy:**

| Phase | Model | Speed | Purpose |
|-------|-------|-------|---------|
| Calibration | **qwen3:4b** | Slower | Better category discovery, 1500 emails |
| Classification | **qwen3:1.7b** | Fast | Quick review, only ~5% of emails |
| Optional | **qwen3:30b** | Slowest | Maximum accuracy if needed |

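The classification model only sees the emails the ML stage is unsure about, and the pipeline hands them over in batches ("every 1000 emails or queue full"). A minimal sketch of that queueing rule follows. The class name, method names, and default sizes are assumptions; the blueprint's `queue_manager.py` may look quite different.

```python
class ReviewQueue:
    """Collects low-confidence emails and releases them in batches."""

    def __init__(self, max_size=50, flush_every=1000):
        self.max_size = max_size        # flush when this many emails are waiting
        self.flush_every = flush_every  # or after this many emails were processed
        self._pending = []
        self._seen_since_flush = 0

    def observe(self, email_id, needs_review):
        """Record one processed email; return a batch when a flush is due."""
        self._seen_since_flush += 1
        if needs_review:
            self._pending.append(email_id)
        queue_full = len(self._pending) >= self.max_size
        interval_reached = self._seen_since_flush >= self.flush_every
        if queue_full or interval_reached:
            return self.flush()
        return None

    def flush(self):
        """Return everything pending (possibly an empty list) and reset counters."""
        batch, self._pending = self._pending, []
        self._seen_since_flush = 0
        return batch
```

A caller would send each non-empty batch to the LLM in one request and call `flush()` once more in Phase 3 to drain whatever remains.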
**Configuration (Single Source of Truth):**
```yaml
# config/llm_models.yaml
llm:
  # Provider type: ollama or openai (any OpenAI-compatible endpoint)
  provider: "ollama"

  # Ollama settings
  ollama:
    base_url: "http://localhost:11434"
    calibration_model: "qwen3:4b"       # Bigger for better discovery
    classification_model: "qwen3:1.7b"  # Smaller for speed
    temperature: 0.1
    max_tokens: 500
    timeout: 30
    retry_attempts: 3

  # OpenAI-compatible API (future-proof)
  openai:
    base_url: "https://api.openai.com/v1"  # Or custom endpoint
    api_key: "${OPENAI_API_KEY}"
    calibration_model: "gpt-4o-mini"
    classification_model: "gpt-4o-mini"
    temperature: 0.1
    max_tokens: 500

  # Graceful degradation
  fallback:
    enabled: true
    # If LLM unavailable, emails go to "needs_review" folder
    # ML still works, just more conservative thresholds
```

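A YAML loader such as PyYAML returns `"${OPENAI_API_KEY}"` as a literal string; it does not substitute environment variables. If the config is meant to carry such references, something has to expand them after loading. A minimal sketch, operating on the already-parsed dictionary so it needs only the standard library (the function name and the fail-on-missing behaviour are assumptions):

```python
import os
import re

_ENV_REF = re.compile(r'\$\{([A-Za-z_][A-Za-z0-9_]*)\}')

def expand_env(value, env=None):
    """Recursively replace ${NAME} in strings; unset names raise KeyError."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        return _ENV_REF.sub(lambda match: env[match.group(1)], value)
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    return value  # numbers, booleans, None pass through unchanged
```

Failing loudly on an unset variable is a deliberate choice in this sketch: a silently empty API key would otherwise surface much later as an authentication error.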
**LLM Provider Abstraction:**
```python
import os
from abc import ABC, abstractmethod


class BaseLLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        pass

    @abstractmethod
    def test_connection(self) -> bool:
        pass


class OllamaProvider(BaseLLMProvider):
    def __init__(self, base_url: str, model: str):
        import ollama  # imported lazily: only needed with the "ollama" extra
        self.client = ollama.Client(host=base_url)
        self.model = model

    def complete(self, prompt: str, **kwargs) -> str:
        response = self.client.generate(
            model=self.model,
            prompt=prompt,
            options={
                'temperature': kwargs.get('temperature', 0.1),
                'num_predict': kwargs.get('max_tokens', 500)
            }
        )
        return response['response']

    def test_connection(self) -> bool:
        try:
            self.client.list()
            return True
        except Exception:
            return False


class OpenAIProvider(BaseLLMProvider):
    def __init__(self, base_url: str, api_key: str, model: str):
        from openai import OpenAI  # imported lazily: only needed with the "openai" extra
        self.client = OpenAI(base_url=base_url, api_key=api_key)
        self.model = model

    def complete(self, prompt: str, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=kwargs.get('temperature', 0.1),
            max_tokens=kwargs.get('max_tokens', 500)
        )
        return response.choices[0].message.content

    def test_connection(self) -> bool:
        try:
            self.client.models.list()
            return True
        except Exception:
            return False


def get_llm_provider(config, phase: str = 'classification') -> BaseLLMProvider:
    """Factory to create LLM provider based on config.

    phase selects 'calibration_model' or 'classification_model'.
    """
    provider_type = config['llm']['provider']

    if provider_type == 'ollama':
        settings = config['llm']['ollama']
        return OllamaProvider(
            base_url=settings['base_url'],
            model=settings[f'{phase}_model']
        )
    elif provider_type == 'openai':
        settings = config['llm']['openai']
        return OpenAIProvider(
            base_url=settings['base_url'],
            api_key=os.getenv('OPENAI_API_KEY'),
            model=settings[f'{phase}_model']
        )
    else:
        raise ValueError(f"Unknown provider: {provider_type}")
```

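The config defines `retry_attempts: 3`, but the providers above call the LLM once and let any error propagate. A minimal sketch of a retry wrapper with exponential backoff follows; the function name, the delay schedule, and the injectable `sleep` parameter (there to make the sketch testable) are assumptions. The core dependency list already includes `tenacity`, which offers the same behaviour declaratively; the hand-rolled loop here only keeps the example dependency-free.

```python
import time

def complete_with_retry(provider, prompt, attempts=3, base_delay=1.0,
                        sleep=time.sleep, **kwargs):
    """Call provider.complete(), retrying failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return provider.complete(prompt, **kwargs)
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            # Wait 1x, 2x, 4x ... base_delay before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

Because it only relies on `complete()`, the wrapper works with any `BaseLLMProvider` subclass.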
**Graceful Degradation (LLM Optional):**
```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class ClassificationResult:
    category: str
    confidence: float
    method: str
    needs_review: bool = False
    metadata: dict = field(default_factory=dict)


class AdaptiveClassifier:
    def __init__(self, ml_model, llm_classifier, config):
        self.ml_model = ml_model
        self.llm_classifier = llm_classifier
        self.config = config
        self.llm_available = self._test_llm_connection()

        if not self.llm_available:
            logger.warning("LLM unavailable - using conservative thresholds")
            self.default_threshold = 0.85  # Higher threshold without LLM
        else:
            self.default_threshold = 0.75

    def _test_llm_connection(self):
        """Check if LLM is available"""
        if not self.llm_classifier:
            return False
        try:
            return self.llm_classifier.test_connection()
        except Exception:
            return False

    def classify(self, email):
        """Classify with or without LLM"""
        # Check hard rules first (rule helpers not shown); skips ML inference entirely
        if self._has_hard_rule_match(email):
            return ClassificationResult(
                category=self._get_rule_category(email),
                confidence=0.99,
                method='rule'
            )

        # ML classification (HybridClassifier.predict extracts features itself)
        ml_result = self.ml_model.predict(email)

        # High confidence ML result
        if ml_result['confidence'] >= self.default_threshold:
            return ClassificationResult(
                category=ml_result['category'],
                confidence=ml_result['confidence'],
                method='ml'
            )

        # Low confidence - try LLM if available
        if self.llm_available:
            return ClassificationResult(
                category=ml_result['category'],
                confidence=ml_result['confidence'],
                method='ml',
                needs_review=True  # Queue for LLM
            )
        else:
            # No LLM - mark for manual review
            return ClassificationResult(
                category='needs_review',
                confidence=ml_result['confidence'],
                method='ml',
                needs_review=True,
                metadata={'ml_prediction': ml_result}
            )
```

---

### 4. Universal Categories (12 Total)

```python
categories = {
    'junk': {
        'description': 'Spam, unwanted marketing, phishing',
        'patterns': ['unsubscribe', 'click here', 'limited time'],
        'threshold': 0.85  # High confidence needed
    },
    'transactional': {
        'description': 'Receipts, invoices, confirmations, order tracking',
        'patterns': ['receipt', 'invoice', 'order', 'shipped', 'tracking'],
        'threshold': 0.80
    },
    'auth': {
        'description': 'OTPs, password resets, 2FA codes, security alerts',
        'patterns': ['verification code', 'otp', 'reset password', r'\d{4,6}'],
        'threshold': 0.90  # Very high - important emails
    },
    'newsletters': {
        'description': 'Subscribed newsletters, marketing emails',
        'patterns': ['newsletter', 'weekly digest', 'monthly update'],
        'threshold': 0.75
    },
    'social': {
        'description': 'Social media notifications, mentions, friend requests',
        'patterns': ['mentioned you', 'friend request', 'liked your'],
        'threshold': 0.75
    },
    'automated': {
        'description': 'System notifications, alerts, no-reply messages',
        'patterns': ['automated', 'system notification', 'do not reply'],
        'threshold': 0.80
    },
    'conversational': {
        'description': 'Human-to-human correspondence, replies, discussions',
        'patterns': ['hi', 'hello', 'thanks', 'regards'],
        'threshold': 0.65  # Lower - varied language
    },
    'work': {
        'description': 'Business correspondence, meetings, projects',
        'patterns': ['meeting', 'project', 'deadline', 'team'],
        'threshold': 0.70
    },
    'personal': {
        'description': 'Friends and family, personal matters',
        'patterns': ['love', 'family', 'dinner', 'weekend'],
        'threshold': 0.70
    },
    'finance': {
        'description': 'Bank statements, credit cards, investments, bills',
        'patterns': ['statement', 'balance', 'account', 'payment due'],
        'threshold': 0.85  # High - sensitive
    },
    'travel': {
        'description': 'Flight bookings, hotels, reservations, itineraries',
        'patterns': ['flight', 'booking', 'reservation', 'check-in'],
        'threshold': 0.80
    },
    'unknown': {
        'description': "Doesn't fit any category (requires review)",
        'patterns': [],
        'threshold': 0.50  # Catch-all
    }
}
```

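Each category above carries its own `threshold`, while the `AdaptiveClassifier` sketch compares against a single `default_threshold`. One way the two could be combined, shown as a minimal sketch: look up the threshold of the predicted category and fall back to a default for anything unlisted. The function name, the fallback value, and the three-entry table (a subset of the values above) are illustrative assumptions.

```python
# Subset of the per-category thresholds defined in the categories table
CATEGORY_THRESHOLDS = {'auth': 0.90, 'junk': 0.85, 'conversational': 0.65}
DEFAULT_THRESHOLD = 0.75  # assumed fallback for categories not listed

def accept_prediction(category, confidence, thresholds=CATEGORY_THRESHOLDS):
    """Accept an ML prediction only if it clears its category's own bar."""
    return confidence >= thresholds.get(category, DEFAULT_THRESHOLD)

# A 0.88-confidence 'auth' prediction is queued for review,
# while a 0.70-confidence 'conversational' prediction is accepted.
auth_ok = accept_prediction('auth', 0.88)
chat_ok = accept_prediction('conversational', 0.70)
```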
---

## MODULAR ARCHITECTURE

### Tiered Dependencies

```python
# setup.py
from setuptools import setup

setup(
    name="email-sorter",
    version="1.0.0",
    install_requires=[
        # CORE (always required)
        "numpy>=1.24.0",
        "pandas>=2.0.0",
        "scikit-learn>=1.3.0",
        "lightgbm>=4.0.0",
        "sentence-transformers>=2.2.0",
        "pydantic>=2.0.0",
        "pyyaml>=6.0",
        "click>=8.1.0",
        "rich>=13.0.0",
        "tqdm>=4.66.0",
        "tenacity>=8.2.0",
    ],
    extras_require={
        # Email providers (optional)
        "gmail": [
            "google-api-python-client>=2.100.0",
            "google-auth-httplib2>=0.1.1",
            "google-auth-oauthlib>=1.1.0",
        ],
        "microsoft": [
            "msal>=1.24.0",
        ],
        "imap": [
            "imapclient>=2.3.1",
        ],

        # LLM providers (optional)
        "ollama": [
            "ollama>=0.1.0",
        ],
        "openai": [
            "openai>=1.0.0",
        ],

        # Attachment processing (optional)
        "attachments": [
            "PyPDF2>=3.0.0",
            "python-docx>=0.8.11",
            "openpyxl>=3.0.10",
        ],

        # Development (optional)
        "dev": [
            "pytest>=7.4.0",
            "pytest-cov>=4.1.0",
            "pytest-mock>=3.11.0",
            "black>=23.0.0",
            "isort>=5.12.0",
        ],

        # All extras
        "all": [
            # Placeholder: must list every requirement from the extras above;
            # left empty, `pip install "email-sorter[all]"` adds nothing
        ]
    }
)
```

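The `all` extra is left as a placeholder in the `setup.py` above. Rather than copying every requirement by hand, it can be derived from the other extras. A minimal sketch, assuming the extras are first collected in a plain dictionary (the helper name is hypothetical, and the three-entry dictionary is a shortened stand-in for the full list):

```python
def with_all_extra(extras):
    """Return a copy of extras plus an 'all' key that unions every other extra."""
    combined = sorted({req for reqs in extras.values() for req in reqs})
    return {**extras, 'all': combined}

# Shortened stand-in for the extras_require mapping
extras = with_all_extra({
    'gmail': ['google-api-python-client>=2.100.0'],
    'ollama': ['ollama>=0.1.0'],
    'openai': ['openai>=1.0.0'],
})
```

The result would then be passed as `extras_require=extras`. Sorting the union keeps the generated metadata stable between builds.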
**Installation options:**
```bash
# Minimal (ML only, no LLM, no email providers)
pip install email-sorter

# With Gmail support (quotes stop shells like zsh from expanding the brackets)
pip install "email-sorter[gmail]"

# With Ollama LLM and Gmail support
pip install "email-sorter[ollama,gmail]"

# Everything
pip install "email-sorter[all]"
```

---

## TESTING STRATEGY

### Test Harness Structure

```
tests/
├── unit/
│   ├── test_feature_extraction.py
│   ├── test_pattern_matching.py
│   ├── test_embeddings.py
│   ├── test_lightgbm.py
│   └── test_attachment_analysis.py
├── integration/
│   ├── test_calibration.py
│   ├── test_ml_llm_pipeline.py
│   ├── test_gmail_provider.py
│   └── test_checkpoint_resume.py
├── e2e/
│   ├── test_full_pipeline_100.py
│   ├── test_full_pipeline_1000.py
│   └── test_full_pipeline_80k.py
├── fixtures/
│   ├── mock_emails.json
│   ├── mock_llm_responses.json
│   └── sample_inboxes/
└── conftest.py
```

### Unit Tests
```python
# tests/unit/test_feature_extraction.py
import pytest
from src.classification.feature_extractor import FeatureExtractor
from src.email_providers.base import Email


def test_pattern_extraction():
    email = Email(
        id='1',
        subject='Your verification code is 123456',
        sender='noreply@service.com',
        body='Your one-time password is 123456'
    )

    extractor = FeatureExtractor()
    patterns = extractor._extract_patterns(email)

    assert patterns['has_otp']
    assert patterns['has_verification']
    assert patterns['is_automated']


def test_structured_embedding():
    email = Email(
        id='2',
        subject='Invoice #12345',
        sender='billing@company.com',
        body='Please find attached your invoice'
    )

    extractor = FeatureExtractor()
    text = extractor.build_embedding_text(email)

    assert '[EMAIL_METADATA]' in text
    assert '[DETECTED_PATTERNS]' in text
    assert 'has_invoice: True' in text
```

### Integration Tests
```python
# tests/integration/test_ml_llm_pipeline.py
def test_calibration_then_classification(mock_llm_provider):  # fixture from conftest.py
    # 1. Load sample emails
    emails = load_sample_emails(count=100)

    # 2. Run calibration (with mock LLM)
    calibrator = CalibrationPhase(mock_llm_provider)
    config = calibrator.run(emails)

    # 3. Train classifier (assumes calibration also returns the category list)
    classifier = HybridClassifier(categories=config['categories'])
    classifier.train(emails, config['labels'])

    # 4. Classify new emails
    new_emails = load_sample_emails(count=20, exclude=emails)
    results = [classifier.predict(e) for e in new_emails]

    # 5. Assert accuracy against known labels
    ground_truth = load_ground_truth(new_emails)  # hypothetical fixture helper
    accuracy = calculate_accuracy(results, ground_truth)
    assert accuracy > 0.85
```

### E2E Tests
```python
# tests/e2e/test_full_pipeline_100.py
def test_full_pipeline_100_emails(tmp_path):
    """End-to-end test on 100 emails"""
    # Setup
    output_dir = tmp_path / "results"
    emails = load_test_inbox(count=100)

    # Run full pipeline
    result = run_email_sorter(
        emails=emails,
        output=output_dir,
        config="tests/fixtures/test_config.yaml"
    )

    # Assertions
    assert result['total_processed'] == 100
    assert result['accuracy_estimate'] > 0.90
    assert (output_dir / "results.json").exists()
    assert (output_dir / "report.txt").exists()
```

---

## PERFORMANCE EXPECTATIONS (Updated with Research)

### For 80,000 emails:

| Phase | Time | Details |
|-------|------|---------|
| **Calibration** | 3-5 min | 1500 emails, qwen3:4b, train LightGBM |
| Pattern detection | ~10 sec | Regex on all 80k emails |
| Embedding generation | ~8 min | Batched, CPU, all 80k emails |
| LightGBM classification | ~3 sec | Fast inference |
| Hard rules auto-classify | instant | 10% = 8,000 emails |
| LLM review (qwen3:1.7b) | ~4 min | 5% = 4,000 emails, batched |
| Export & sync | ~2 min | JSON/CSV + Gmail API |
| **TOTAL** | **~17 min** | |

### Accuracy Breakdown:

| Component | Coverage | Accuracy |
|-----------|----------|----------|
| Hard rules | 10% | 99% |
| LightGBM (high conf) | 85% | 92% |
| LLM review | 5% | 95% |
| **Overall** | **100%** | **94-96%** |

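As a sanity check on the table, overall accuracy can be computed as the coverage-weighted average of the component accuracies. Taking the rows at face value, this comes to roughly 92.9%, a little under the 94-96% target, so the target presumably assumes the LightGBM stage does better than 92% on the emails it accepts. The sketch below is plain arithmetic on the table's figures; the function names are illustrative.

```python
# (coverage, accuracy) per component, copied from the table above
components = {
    'hard_rules': (0.10, 0.99),
    'lightgbm': (0.85, 0.92),
    'llm_review': (0.05, 0.95),
}

def overall_accuracy(parts):
    """Coverage-weighted average accuracy across all components."""
    return sum(coverage * accuracy for coverage, accuracy in parts.values())

def required_ml_accuracy(parts, target):
    """LightGBM accuracy needed to hit `target`, holding the other rows fixed."""
    others = sum(c * a for name, (c, a) in parts.items() if name != 'lightgbm')
    return (target - others) / parts['lightgbm'][0]

overall = overall_accuracy(components)               # about 0.9285
needed = required_ml_accuracy(components, 0.94)      # about 0.9335
```

In other words, under these coverage figures the LightGBM stage would need about 93.4% accuracy for the system to reach the lower end of the target.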
### Memory Usage (80k emails):
- Email data: ~400MB
- Embeddings (cached): ~500MB
- LightGBM model: ~5MB
- MiniLM model: ~90MB
- Peak: ~1.2GB

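For orientation, the raw size of the embedding matrix can be estimated directly from its shape. Assuming 32-bit floats, 80,000 vectors of 384 dimensions take about 123 MB, so the ~500MB figure above presumably also covers the structured input text, cache metadata, or a wider float type. The helper name is illustrative.

```python
def embedding_bytes(n_emails, dims=384, bytes_per_value=4):
    """Raw size of an n_emails x dims matrix (4 bytes per value = float32)."""
    return n_emails * dims * bytes_per_value

raw_float32 = embedding_bytes(80_000)                     # 122,880,000 bytes
raw_float64 = embedding_bytes(80_000, bytes_per_value=8)  # 245,760,000 bytes
```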
---

## DISTRIBUTABLE WHEEL PACKAGING

### Package Structure
```
email-sorter/
├── setup.py
├── setup.cfg
├── pyproject.toml
├── MANIFEST.in
├── README.md
├── LICENSE
├── src/
│   └── email_sorter/
│       ├── __init__.py
│       ├── __main__.py
│       ├── cli.py
│       └── ... (all modules)
├── config/
│   ├── default_config.yaml
│   ├── categories.yaml
│   └── llm_models.yaml
└── models/
    └── pretrained/
        ├── minilm-l6-v2/  (bundled embedder)
        └── lightgbm.pkl   (optional pre-trained)
```

### Distribution Commands
```bash
# Build sdist and wheel into dist/ (needs `pip install build`;
# calling `python setup.py bdist_wheel` directly is deprecated)
python -m build

# Install locally
pip install dist/email_sorter-1.0.0-py3-none-any.whl

# Use as command
email-sorter --source gmail --credentials creds.json --output results/

# Or as module
python -m email_sorter --source gmail ...
```

### CLI Interface
```bash
email-sorter --help

# Basic usage
email-sorter \
  --source gmail \
  --credentials credentials.json \
  --output results/

# Advanced options
email-sorter \
  --source gmail \
  --credentials creds.json \
  --output results/ \
  --config custom_config.yaml \
  --llm-provider ollama \
  --llm-model qwen3:1.7b \
  --limit 1000 \
  --no-calibrate \
  --dry-run
```

---

## PROJECT STRUCTURE

```
email-sorter/
├── README.md
├── PROJECT_BLUEPRINT.md            # This file
├── BUILD_INSTRUCTIONS.md
├── RESEARCH_FINDINGS.md
├── setup.py
├── setup.cfg
├── pyproject.toml
├── requirements.txt
├── .gitignore
├── .env.example
├── config/
│   ├── default_config.yaml
│   ├── categories.yaml
│   ├── llm_models.yaml             # LLM config (single source)
│   └── features.yaml
├── src/
│   ├── __init__.py
│   ├── __main__.py
│   ├── cli.py                      # Click CLI
│   ├── calibration/
│   │   ├── __init__.py
│   │   ├── sampler.py              # Stratified sampling
│   │   ├── llm_analyzer.py         # LLM calibration
│   │   └── trainer.py              # Train LightGBM
│   ├── classification/
│   │   ├── __init__.py
│   │   ├── feature_extractor.py    # Hybrid features
│   │   ├── pattern_matcher.py      # Hard rules
│   │   ├── embedder.py             # Sentence embeddings
│   │   ├── lightgbm_classifier.py
│   │   ├── adaptive_classifier.py
│   │   └── llm_classifier.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── pretrained/
│   │   │   └── .gitkeep
│   │   └── model_loader.py
│   ├── email_providers/
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── gmail.py
│   │   ├── microsoft.py
│   │   └── imap.py
│   ├── llm/
│   │   ├── __init__.py
│   │   ├── base.py                 # Abstract provider
│   │   ├── ollama.py
│   │   └── openai.py
│   ├── processing/
│   │   ├── __init__.py
│   │   ├── bulk_processor.py
│   │   ├── attachment_handler.py
│   │   └── queue_manager.py
│   ├── adjustment/
│   │   ├── __init__.py
│   │   ├── threshold_adjuster.py
│   │   └── pattern_learner.py
│   ├── export/
│   │   ├── __init__.py
│   │   ├── results_exporter.py
│   │   ├── provider_sync.py
│   │   └── report_generator.py
│   └── utils/
│       ├── __init__.py
│       ├── config.py
│       ├── logging.py
│       └── cleanup.py
├── tests/
│   ├── unit/
│   ├── integration/
│   ├── e2e/
│   ├── fixtures/
│   └── conftest.py
├── prompts/
│   ├── calibration.txt
│   └── classification.txt
├── scripts/
│   ├── train_model.py
│   ├── verify_install.py
│   └── benchmark.py
├── data/
│   └── samples/
└── logs/
    └── .gitkeep
```

---

## SECURITY & PRIVACY

- ✅ **All processing is local by default** - No cloud uploads
- ✅ **LLM runs locally** - Via Ollama; choosing the optional OpenAI-compatible API sends the reviewed email content to that endpoint
- ✅ **Fresh clone per job** - Complete isolation
- ✅ **No persistent storage** - Email bodies never written to disk
- ✅ **Attachment content** - Processed in memory, discarded immediately
- ✅ **Auto cleanup** - Temp files deleted after processing
- ✅ **Credentials** - Used directly, never cached
- ✅ **GDPR-friendly** - No data retention or sharing

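The "auto cleanup" guarantee is easiest to keep when the working directory is tied to a context manager, so it is removed even if a job crashes midway. A minimal sketch using the standard library follows; the function name and the directory prefix are assumptions, and the blueprint's `cleanup.py` may take a different approach.

```python
import tempfile
from pathlib import Path

def run_isolated_job(job):
    """Run job(workdir) in a temp directory that is deleted afterwards.

    The directory and everything in it are removed when the block exits,
    whether the job returns normally or raises.
    """
    with tempfile.TemporaryDirectory(prefix='email-sorter-') as tmp:
        return job(Path(tmp))
```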
---

## SUCCESS CRITERIA

- ✅ Processes 80k emails in <20 minutes
- ✅ 94-96% classification accuracy (competitive with cloud tools)
- ✅ ≤5% of emails need LLM review
- ✅ Successfully syncs back to Gmail/IMAP
- ✅ No data leakage between jobs
- ✅ Works on Windows, Linux, macOS
- ✅ LLM is optional (graceful degradation)
- ✅ Distributable as Python wheel
- ✅ Attachment analysis working
- ✅ OpenAI-compatible API support

---

## WHAT'S NEXT

1. ✅ Research complete (benchmarks, competition, LightGBM vs XGBoost)
2. ⏭ Update BUILD_INSTRUCTIONS.md with new architecture
3. ⏭ Create RESEARCH_FINDINGS.md with search results
4. ⏭ Build core infrastructure (config, logging, data models)
5. ⏭ Implement feature extraction (embeddings + patterns + attachments)
6. ⏭ Create LightGBM classifier
7. ⏭ Implement LLM providers (Ollama + OpenAI-compatible)
8. ⏭ Build calibration system
9. ⏭ Create test harness
10. ⏭ Package as wheel
11. ⏭ Test on Marion's 80k emails

---

**END OF BLUEPRINT v2.0**

This is the complete, research-backed architecture ready to build.