EMAIL SORTER - PROJECT BLUEPRINT
Hybrid ML/LLM Email Classification System
Version: 2.0
Date: 2024-10-21
Status: Research Complete - Ready to Build
EXECUTIVE SUMMARY
What it does: Processes 80,000+ emails in ~17 minutes. An ML model (LightGBM on sentence embeddings) handles bulk classification (90%+ of emails), and an LLM (Ollama or any OpenAI-compatible API) handles startup calibration and edge cases (~5-10%).
How it works:
- Fresh repo clone per job (complete isolation)
- LLM analyzes sample to discover natural categories (calibration phase)
- Train LightGBM on embeddings + patterns + structural features
- ML sprints through high-confidence classifications
- Hard rules catch obvious patterns (OTP, invoices, etc.)
- LLM reviews only uncertain cases (batched efficiently)
- System self-tunes thresholds based on LLM feedback
- Export results and sync back to email provider
- Delete repo (cleanup)
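The calibration sample in the steps above has to represent the whole inbox, not just its loudest senders. A minimal sketch of proportional stratified sampling follows. Stratifying by sender domain, the `stratified_sample` name and the `sender_domain` key are assumptions for illustration, not part of the blueprint.

```python
import random
from collections import defaultdict

def stratified_sample(emails, key, sample_size, seed=0):
    """Pick about sample_size emails, proportionally from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for email in emails:
        strata[key(email)].append(email)

    total = len(emails)
    sample = []
    for group in strata.values():
        # Proportional share, with a floor of one email per stratum so
        # rare senders are still seen during calibration.
        share = max(1, round(sample_size * len(group) / total))
        sample.extend(rng.sample(group, min(share, len(group))))
    rng.shuffle(sample)
    # The floor can overshoot sample_size on inboxes with many tiny
    # strata; trimming here may then drop some of those rare senders.
    return sample[:sample_size]

# Toy inbox: 80 emails from a.com, 15 from b.com, 5 from c.com
domains = ["a.com"] * 80 + ["b.com"] * 15 + ["c.com"] * 5
emails = [{"id": i, "sender_domain": d} for i, d in enumerate(domains)]
sample = stratified_sample(emails, key=lambda e: e["sender_domain"], sample_size=20)
```

A fixed seed keeps the calibration sample reproducible between runs on the same inbox.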
Target use case: Self-employed and business owners with 10k-100k+ neglected emails who need privacy-focused, one-time cleanup without cloud uploads or subscriptions.
Key innovation: Hybrid approach with structured embeddings, hard pattern rules, and dynamic threshold adjustment. LLM is OPTIONAL - system degrades gracefully if unavailable.
COMPETITIVE ANALYSIS (2024 Research)
Existing Solutions (ALL Cloud-Based)
| Tool | Price | Accuracy | Privacy | Notes |
|---|---|---|---|---|
| SaneBox | $7-15/mo | ~85% | ❌ Cloud | AI filtering, requires upload |
| Clean Email | $10-30/mo | ~80% | ❌ Cloud | Smart folders, subscription |
| Spark | Free/Paid | ~75% | ❌ Cloud | Smart inbox, cloud sync |
| EmailTree.ai | Enterprise | ~90% | ❌ Cloud | NLP, for businesses |
| Mailstrom | $30-50/yr | ~70% | ❌ Cloud | Bulk analysis |
Our Competitive Advantages
- ✅ 100% LOCAL - No data leaves the machine
- ✅ Privacy-first - Perfect for business owners with sensitive data
- ✅ One-time use - No subscription, pay per job or DIY
- ✅ Customizable - Adapts to each inbox during calibration
- ✅ Open source potential - Distributable as Python wheel
- ✅ Attachment analysis - Competitors ignore this entirely
- ✅ Offline capable - Works without internet (after initial setup)
Benchmark Performance (2024 Research)
Enron Dataset (industry standard):
- Traditional ML (SVM, Random Forest): 95-98%
- Deep Learning (DNN-BiLSTM): 98.69%
- Transformers (BERT, RoBERTa): ~99%
- LLMs (GPT-4): 99.7% (phishing detection)
- Ensemble methods: 98.8%
Our Target: 94-96% accuracy (competitive, privacy-focused, local)
ARCHITECTURE
Three-Phase Pipeline
┌─────────────────────────────────────────────────────────────┐
│ PHASE 1: CALIBRATION (3-5 minutes) │
├─────────────────────────────────────────────────────────────┤
│ 1. Sample 1500 emails (stratified sampling) │
│ 2. LLM analyzes patterns and discovers categories │
│ Model: qwen3:4b (bigger, more accurate) │
│ Alternative: Compress to 500 emails + smarter batching │
│ 3. Map discovered → universal categories │
│ 4. Generate training labels for embedding classifier │
│ 5. Validate on 300 emails │
│ 6. Set initial confidence thresholds │
│ 7. Train LightGBM on embeddings + patterns │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ PHASE 2: BULK PROCESSING (10-12 minutes) │
├─────────────────────────────────────────────────────────────┤
│ For each email: │
│ → Pattern detection (regex, <1ms) │
│ → Hard rule match? → INSTANT (10% of emails) │
│ → Generate structured embedding (batched, 8 min total) │
│ → LightGBM classify with confidence score │
│ → IF confidence >= threshold: ACCEPT (85%) │
│ → IF confidence < threshold: QUEUE for LLM (5%) │
│ │
│ Every 1000 emails or queue full: │
│ → Process LLM batch (qwen3:1.7b, fast) │
│ → Analyze agreement rate │
│ → Adjust thresholds dynamically │
│ → Learn sender rules │
│ → Save checkpoint │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ PHASE 3: FINALIZATION (2-3 minutes) │
├─────────────────────────────────────────────────────────────┤
│ 1. Process remaining LLM queue │
│ 2. Export results (JSON/CSV) │
│ 3. Sync to email provider (Gmail labels, IMAP folders) │
│ 4. Generate classification report │
│ 5. Cleanup (delete repo, temp files) │
└─────────────────────────────────────────────────────────────┘
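The "adjust thresholds dynamically" step in Phase 2 is not specified further in this blueprint. One possible reading, sketched below, nudges the threshold after each LLM batch depending on how often the LLM agreed with the ML prediction. The function name, step size, bounds and the 90% agreement target are all assumptions.

```python
def adjust_threshold(threshold, agreements, step=0.02, low=0.50, high=0.95,
                     target_agreement=0.90):
    """Nudge a confidence threshold using one batch of LLM reviews.

    agreements: list of bools, True where the LLM agreed with the ML
    prediction for an email that had been queued as low-confidence.
    """
    if not agreements:
        return threshold
    rate = sum(agreements) / len(agreements)
    if rate >= target_agreement:
        # ML was right on most queued emails: the threshold is too strict.
        threshold -= step
    else:
        # The LLM keeps overruling ML: require more confidence.
        threshold += step
    # Clamp so one unusual batch cannot push the threshold out of range.
    return round(min(high, max(low, threshold)), 4)

lowered = adjust_threshold(0.75, [True] * 19 + [False])   # 95% agreement
raised = adjust_threshold(0.75, [True, False])            # 50% agreement
clamped = adjust_threshold(0.95, [False])                 # already at the cap
```

Small fixed steps keep the behaviour easy to reason about; a production version might weight the step by batch size.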
CORE COMPONENTS
1. Hybrid Feature Extraction (THE SECRET SAUCE)
Combines three feature types for maximum accuracy:
A. Sentence Embeddings (Semantic Understanding)
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer('all-MiniLM-L6-v2') # 384 dimensions
# Structured embedding with parameterized headers
def build_embedding_text(email, patterns):
    return f"""[EMAIL_METADATA]
sender_type: {email.sender_domain_type}
time_category: {email.time_of_day}
has_attachments: {email.has_attachments}
attachment_types: {email.attachment_types}
[DETECTED_PATTERNS]
has_otp: {patterns['has_otp']}
has_invoice: {patterns['has_invoice']}
has_unsubscribe: {patterns['has_unsubscribe']}
is_automated: {patterns['is_automated']}
has_meeting: {patterns['has_meeting']}
[CONTENT]
subject: {email.subject}
body: {email.body_snippet[:300]}
"""
text = build_embedding_text(email, patterns)
embedding = embedder.encode(text) # → 384-dim vector
Why this works:
- Model sees STRUCTURE, not just raw text
- Pattern hints guide semantic understanding
- Research shows 5-10% accuracy boost vs naive embedding
- Handles semantic variants: "meeting" = "call" = "zoom"
B. Hard Pattern Rules (Fast Deterministic)
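import re

# ~20 boolean/numerical features extracted via regex
patterns = {
# Authentication patterns
# Note: has_otp fires on any standalone 4-6 digit number (years, ZIP codes), so treat it as a weak hint
'has_otp': bool(re.search(r'\b\d{4,6}\b', text)),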
'has_verification': 'verification' in text.lower(),
'has_reset_password': 'reset password' in text.lower(),
# Transactional patterns
'has_invoice': bool(re.search(r'invoice\s*#?\d+', text, re.I)),
'has_receipt': 'receipt' in text.lower(),
'has_price': bool(re.search(r'\$\d+', text)),
'has_order_number': bool(re.search(r'order\s*#?\d+', text, re.I)),
# Newsletter/marketing patterns
'has_unsubscribe': 'unsubscribe' in text.lower(),
'has_view_in_browser': 'view in browser' in text.lower(),
# Meeting/calendar patterns
'has_meeting': bool(re.search(r'(meeting|call|zoom|teams)', text, re.I)),
'has_calendar': 'calendar' in text.lower(),
# Other patterns
'has_tracking': bool(re.search(r'tracking\s*(number|#)', text, re.I)),
'is_automated': email.sender_domain_type == 'noreply',
'has_signature': bool(re.search(r'(regards|sincerely|best)', text, re.I)),
}
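The `has_otp` pattern above matches any 4-6 digit number, including years and street numbers. A stricter variant, sketched here as one option, only fires when a code-like keyword appears shortly before the digits. The keyword list, the 40-character window and the 4-8 digit range are assumptions to tune against real mail.

```python
import re

# A keyword, then up to 40 non-digit characters, then a standalone 4-8 digit code.
OTP_RE = re.compile(
    r'(?:code|otp|one[- ]time|passcode|verification)\D{0,40}\b(\d{4,8})\b',
    re.IGNORECASE,
)

def has_otp(text):
    """True when a code-like keyword is followed closely by a short number."""
    return OTP_RE.search(text) is not None

otp_hit = has_otp("Your verification code is 123456")
otp_short = has_otp("OTP: 4821")
year_miss = has_otp("Copyright 2024 Acme Inc, Suite 4500")
```

This still produces false positives such as "zip code 90210", so it remains a feature for the classifier and not a verdict on its own.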
C. Structural Features (Metadata)
# ~20 numerical/categorical features
structural = {
# Sender analysis
'sender_domain': extract_domain(email.sender),
'sender_domain_type': categorize_domain(email.sender), # freemail/corporate/noreply
'is_noreply': 'noreply' in email.sender.lower(),
# Timing
'time_of_day': categorize_hour(email.date.hour), # night/morning/afternoon/evening
'day_of_week': email.date.strftime('%A').lower(),
# Content structure
'subject_length': len(email.subject),
'body_length': len(email.body),
'link_count': len(re.findall(r'https?://', email.body)),
'image_count': len(re.findall(r'<img', email.body)),
# Attachments (COMPETITIVE ADVANTAGE)
'has_attachments': email.has_attachments,
'attachment_count': len(email.attachments),
'attachment_types': extract_attachment_types(email.attachments),
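'has_pdf': 'pdf' in extract_attachment_types(email.attachments),
# Combines the regex flag from the pattern dict with the PDF check (assumes `patterns` is in scope)
'has_invoice_pdf': patterns['has_invoice'] and 'pdf' in extract_attachment_types(email.attachments), # KILLER FEATURE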
# Reply/forward
'has_reply_prefix': bool(re.match(r'^(Re:|Fwd:)', email.subject, re.I)),
}
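The structural features rely on helpers such as `extract_domain`, `categorize_domain` and `categorize_hour` that are not defined in this blueprint. The sketch below shows one way they could look. The freemail list, the no-reply prefixes and the hour boundaries are assumptions.

```python
from email.utils import parseaddr

FREEMAIL_DOMAINS = {'gmail.com', 'yahoo.com', 'outlook.com', 'hotmail.com'}
NOREPLY_PREFIXES = ('noreply', 'no-reply', 'donotreply', 'do-not-reply')

def extract_domain(sender):
    """'Jane <jane@Example.com>' -> 'example.com'."""
    address = parseaddr(sender)[1]
    return address.rpartition('@')[2].lower()

def categorize_domain(sender):
    """Return 'noreply', 'freemail' or 'corporate'."""
    address = parseaddr(sender)[1].lower()
    local_part = address.partition('@')[0]
    if local_part.startswith(NOREPLY_PREFIXES):
        return 'noreply'
    if extract_domain(sender) in FREEMAIL_DOMAINS:
        return 'freemail'
    return 'corporate'

def categorize_hour(hour):
    """Bucket an hour (0-23) into night/morning/afternoon/evening."""
    if hour < 6:
        return 'night'
    if hour < 12:
        return 'morning'
    if hour < 18:
        return 'afternoon'
    return 'evening'
```

`parseaddr` from the standard library handles both bare addresses and the `Name <address>` form.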
D. Attachment Analysis (Differentiator)
import re

def analyze_attachments(attachments):
    """Extract features from attachments - competitors don't do this!"""
    features = {
        'has_attachments': len(attachments) > 0,
        'attachment_count': len(attachments),
        'total_size': sum(a.get('size', 0) for a in attachments),
        'attachment_types': [],
        'pdf_has_invoice': False,
        'pdf_has_account': False,
    }
    for attachment in attachments:
        # Lower-case so 'REPORT.PDF' and 'Application/PDF' are recognised too
        mime_type = attachment.get('mime_type', '').lower()
        filename = attachment.get('filename', '').lower()
        # Type categorization
        if 'pdf' in mime_type or filename.endswith('.pdf'):
            features['attachment_types'].append('pdf')
            # Extract text from PDF if small enough (<5MB)
            if attachment.get('size', 0) < 5_000_000:
                text = extract_pdf_text(attachment)
                # OR across attachments so a later PDF cannot reset an earlier match
                features['pdf_has_invoice'] |= bool(re.search(r'invoice|bill', text, re.I))
                features['pdf_has_account'] |= bool(re.search(r'account\s*#?\d+', text, re.I))
        elif 'word' in mime_type or filename.endswith(('.doc', '.docx')):
            features['attachment_types'].append('docx')
        elif 'excel' in mime_type or 'spreadsheet' in mime_type or filename.endswith(('.xls', '.xlsx')):
            features['attachment_types'].append('xlsx')
        elif 'image' in mime_type or filename.endswith(('.png', '.jpg', '.jpeg')):
            features['attachment_types'].append('image')
    return features
Why this matters:
- Business emails often have invoice PDFs, contract DOCXs
- Detecting "PDF with INVOICE text" → instant "transactional" classification
- Competitors ignore attachments entirely = our differentiator
Combined Feature Vector
# Total: ~434 dimensions (vs 10,000 with TF-IDF!)
final_features = np.concatenate([
embedding, # 384 dims (semantic understanding)
pattern_values, # 20 dims (hard rules)
structural_values, # 20 dims (metadata)
attachment_values # 10 dims (NEW!)
])
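For the concatenation above to be usable by a tree model, every email must produce the same column layout. A minimal sketch, assuming a fixed key order and treating booleans as 0/1 floats; `PATTERN_KEYS` and `dict_to_vector` are illustrative names and only a three-key subset is shown.

```python
import numpy as np

# Fixed column order (subset for illustration); a missing key becomes 0.0
PATTERN_KEYS = ['has_otp', 'has_invoice', 'has_unsubscribe']

def dict_to_vector(values, keys):
    """Read keys in a fixed order so every email yields the same columns."""
    return np.array([float(values.get(k, 0)) for k in keys], dtype=np.float32)

embedding = np.zeros(384, dtype=np.float32)   # stand-in for embedder.encode(text)
patterns = {'has_invoice': True, 'has_otp': False}
pattern_values = dict_to_vector(patterns, PATTERN_KEYS)

final_features = np.concatenate([embedding, pattern_values])
```

Relying on dict iteration order instead would silently shift columns whenever one email lacks a key.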
2. LightGBM Classifier (Research-Backed Choice)
Why LightGBM over XGBoost:
- ✅ Native categorical handling (no encoding needed)
- ✅ 2-5x faster on mixed feature types
- ✅ 4x speedup with categorical + numerical features
- ✅ Better memory efficiency
- ✅ Equivalent accuracy to XGBoost
- ✅ Perfect for embeddings (dense numerical) + categoricals
import lightgbm as lgb
import numpy as np
from sentence_transformers import SentenceTransformer

class HybridClassifier:
    def __init__(self, categories):
        self.categories = categories
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.model = None

    def extract_features(self, email):
        """Extract all feature types"""
        patterns = extract_patterns(email)
        structural = extract_structural(email)
        # Structured embedding with rich context
        text = build_embedding_text(email, patterns)
        embedding = self.embedder.encode(text)
        # Combine features
        return {
            'embedding': embedding,      # 384 numerical
            'patterns': patterns,        # 20 numerical/boolean
            'structural': structural,    # 20 numerical/categorical
        }

    def build_feature_vector(self, features):
        """Flatten one feature dict into a single numeric row"""
        structural = features['structural']
        # Only numeric/boolean structural values go into the row. String
        # categoricals (sender_domain_type, time_of_day, day_of_week) would
        # need integer encoding before LightGBM can treat them as categorical.
        numerical_keys = sorted(
            k for k, v in structural.items() if isinstance(v, (bool, int, float))
        )
        return np.concatenate([
            features['embedding'],
            [float(v) for v in features['patterns'].values()],
            [float(structural[k]) for k in numerical_keys],
        ])

    def train(self, emails, labels):
        """Train on LLM-labeled data from calibration.

        labels: integer indices into self.categories
        """
        # Build feature matrix
        X = np.array([
            self.build_feature_vector(self.extract_features(e))
            for e in emails
        ])
        # Train LightGBM (the sklearn wrapper infers the class count from labels)
        self.model = lgb.LGBMClassifier(
            n_estimators=200,
            learning_rate=0.1,
            max_depth=8,
            num_leaves=31,
            objective='multiclass',
        )
        self.model.fit(X, labels)

    def predict(self, email):
        """Predict with confidence"""
        features = self.extract_features(email)
        X = self.build_feature_vector(features)
        # Get probabilities; columns follow self.model.classes_, which can
        # omit categories that never appeared in the training labels
        probs = self.model.predict_proba([X])[0]
        classes = self.model.classes_
        best = int(np.argmax(probs))
        return {
            'category': self.categories[classes[best]],
            'confidence': float(probs[best]),
            'probabilities': {
                self.categories[c]: float(p)
                for c, p in zip(classes, probs)
            },
        }
3. LLM Integration (Flexible & Optional)
Model Strategy:
| Phase | Model | Speed | Purpose |
|---|---|---|---|
| Calibration | qwen3:4b | Slower | Better category discovery, 1500 emails |
| Classification | qwen3:1.7b | Fast | Quick review, only ~5% of emails |
| Optional | qwen3:30b | Slowest | Maximum accuracy if needed |
Configuration (Single Source of Truth):
# config/llm_models.yaml
llm:
  # Provider type: ollama, openai, anthropic
  provider: "ollama"
  # Ollama settings
  ollama:
    base_url: "http://localhost:11434"
    calibration_model: "qwen3:4b"       # Bigger for better discovery
    classification_model: "qwen3:1.7b"  # Smaller for speed
    temperature: 0.1
    max_tokens: 500
    timeout: 30
    retry_attempts: 3
  # OpenAI-compatible API (future-proof)
  openai:
    base_url: "https://api.openai.com/v1"  # Or custom endpoint
    api_key: "${OPENAI_API_KEY}"
    calibration_model: "gpt-4o-mini"
    classification_model: "gpt-4o-mini"
    temperature: 0.1
    max_tokens: 500
  # Graceful degradation
  fallback:
    enabled: true
    # If LLM unavailable, emails go to "needs_review" folder
    # ML still works, just more conservative thresholds
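Note that PyYAML does not expand `${OPENAI_API_KEY}` on its own; `yaml.safe_load` returns the literal string. A minimal sketch of one way to resolve such placeholders after loading is shown below. The `expand_env` helper is an assumption, and the dict stands in for the parsed config so the example runs without a file.

```python
import os

def expand_env(node):
    """Recursively replace ${VAR} placeholders in every string of a parsed config."""
    if isinstance(node, dict):
        return {key: expand_env(value) for key, value in node.items()}
    if isinstance(node, list):
        return [expand_env(value) for value in node]
    if isinstance(node, str):
        # Unset variables are left as-is, which makes a missing key easy to spot
        return os.path.expandvars(node)
    return node

os.environ['OPENAI_API_KEY'] = 'sk-example'   # placeholder value for this demo only
config = {'llm': {'openai': {'api_key': '${OPENAI_API_KEY}', 'max_tokens': 500}}}
expanded = expand_env(config)
```

The factory further down reads the key with `os.getenv` directly, which sidesteps the issue for the API key but not for any other placeholder added later.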
LLM Provider Abstraction:
import os
from abc import ABC, abstractmethod

class BaseLLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        pass

    @abstractmethod
    def test_connection(self) -> bool:
        pass

class OllamaProvider(BaseLLMProvider):
    def __init__(self, base_url: str, model: str):
        import ollama  # imported lazily so the dependency stays optional
        self.client = ollama.Client(host=base_url)
        self.model = model

    def complete(self, prompt: str, **kwargs) -> str:
        response = self.client.generate(
            model=self.model,
            prompt=prompt,
            options={
                'temperature': kwargs.get('temperature', 0.1),
                'num_predict': kwargs.get('max_tokens', 500)
            }
        )
        return response['response']

    def test_connection(self) -> bool:
        try:
            self.client.list()
            return True
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            return False

class OpenAIProvider(BaseLLMProvider):
    def __init__(self, base_url: str, api_key: str, model: str):
        from openai import OpenAI  # imported lazily so the dependency stays optional
        self.client = OpenAI(base_url=base_url, api_key=api_key)
        self.model = model

    def complete(self, prompt: str, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=kwargs.get('temperature', 0.1),
            max_tokens=kwargs.get('max_tokens', 500)
        )
        return response.choices[0].message.content

    def test_connection(self) -> bool:
        try:
            self.client.models.list()
            return True
        except Exception:
            return False

def get_llm_provider(config) -> BaseLLMProvider:
    """Factory to create LLM provider based on config"""
    provider_type = config['llm']['provider']
    if provider_type == 'ollama':
        return OllamaProvider(
            base_url=config['llm']['ollama']['base_url'],
            model=config['llm']['ollama']['classification_model']
        )
    elif provider_type == 'openai':
        return OpenAIProvider(
            base_url=config['llm']['openai']['base_url'],
            api_key=os.getenv('OPENAI_API_KEY'),
            model=config['llm']['openai']['classification_model']
        )
    else:
        raise ValueError(f"Unknown provider: {provider_type}")
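The config defines `retry_attempts: 3`, but the providers above call the LLM once. A small standard-library sketch of retry with exponential backoff follows. The `complete_with_retry` name and the delay schedule are assumptions; the project lists `tenacity` as a dependency, which could serve the same purpose.

```python
import time

def complete_with_retry(complete, prompt, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call complete(prompt), retrying with exponential backoff on any exception."""
    for attempt in range(1, attempts + 1):
        try:
            return complete(prompt)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...

# Demo with a fake provider call that fails twice, then succeeds
calls = []
def flaky(prompt):
    calls.append(prompt)
    if len(calls) < 3:
        raise ConnectionError("LLM not reachable")
    return "work"

delays = []
result = complete_with_retry(flaky, "classify this", sleep=delays.append)
```

Passing `sleep` as a parameter lets tests record the delays instead of waiting for them.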
Graceful Degradation (LLM Optional):
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)

@dataclass
class ClassificationResult:
    category: str
    confidence: float
    method: str
    needs_review: bool = False
    metadata: dict = field(default_factory=dict)

class AdaptiveClassifier:
    def __init__(self, ml_model, llm_classifier, config):
        self.ml_model = ml_model
        self.llm_classifier = llm_classifier
        self.config = config
        self.llm_available = self._test_llm_connection()
        if not self.llm_available:
            logger.warning("LLM unavailable - using conservative thresholds")
            self.default_threshold = 0.85  # Higher threshold without LLM
        else:
            self.default_threshold = 0.75

    def _test_llm_connection(self):
        """Check if LLM is available"""
        if not self.llm_classifier:
            return False
        try:
            return self.llm_classifier.test_connection()
        except Exception:
            return False

    def classify(self, email, features):
        """Classify with or without LLM"""
        # Check hard rules first, so matched emails skip ML inference
        if self._has_hard_rule_match(email):
            return ClassificationResult(
                category=self._get_rule_category(email),
                confidence=0.99,
                method='rule'
            )
        # ML classification
        ml_result = self.ml_model.predict(features)
        # High confidence ML result
        if ml_result['confidence'] >= self.default_threshold:
            return ClassificationResult(
                category=ml_result['category'],
                confidence=ml_result['confidence'],
                method='ml'
            )
        # Low confidence - try LLM if available
        if self.llm_available:
            return ClassificationResult(
                category=ml_result['category'],
                confidence=ml_result['confidence'],
                method='ml',
                needs_review=True  # Queue for LLM
            )
        # No LLM - mark for manual review
        return ClassificationResult(
            category='needs_review',
            confidence=ml_result['confidence'],
            method='ml',
            needs_review=True,
            metadata={'ml_prediction': ml_result}
        )
4. Universal Categories (12 Total)
categories = {
'junk': {
'description': 'Spam, unwanted marketing, phishing',
'patterns': ['unsubscribe', 'click here', 'limited time'],
'threshold': 0.85 # High confidence needed
},
'transactional': {
'description': 'Receipts, invoices, confirmations, order tracking',
'patterns': ['receipt', 'invoice', 'order', 'shipped', 'tracking'],
'threshold': 0.80
},
'auth': {
'description': 'OTPs, password resets, 2FA codes, security alerts',
'patterns': ['verification code', 'otp', 'reset password', r'\d{4,6}'],
'threshold': 0.90 # Very high - important emails
},
'newsletters': {
'description': 'Subscribed newsletters, marketing emails',
'patterns': ['newsletter', 'weekly digest', 'monthly update'],
'threshold': 0.75
},
'social': {
'description': 'Social media notifications, mentions, friend requests',
'patterns': ['mentioned you', 'friend request', 'liked your'],
'threshold': 0.75
},
'automated': {
'description': 'System notifications, alerts, no-reply messages',
'patterns': ['automated', 'system notification', 'do not reply'],
'threshold': 0.80
},
'conversational': {
'description': 'Human-to-human correspondence, replies, discussions',
'patterns': ['hi', 'hello', 'thanks', 'regards'],
'threshold': 0.65 # Lower - varied language
},
'work': {
'description': 'Business correspondence, meetings, projects',
'patterns': ['meeting', 'project', 'deadline', 'team'],
'threshold': 0.70
},
'personal': {
'description': 'Friends and family, personal matters',
'patterns': ['love', 'family', 'dinner', 'weekend'],
'threshold': 0.70
},
'finance': {
'description': 'Bank statements, credit cards, investments, bills',
'patterns': ['statement', 'balance', 'account', 'payment due'],
'threshold': 0.85 # High - sensitive
},
'travel': {
'description': 'Flight bookings, hotels, reservations, itineraries',
'patterns': ['flight', 'booking', 'reservation', 'check-in'],
'threshold': 0.80
},
'unknown': {
'description': "Doesn't fit any category (requires review)",
'patterns': [],
'threshold': 0.50 # Catch-all
}
}
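The per-category thresholds above only matter if the accept/queue decision looks them up. A minimal sketch of that lookup, assuming a fallback default of 0.75 for categories without an entry; `accept_prediction` is an illustrative name and only three categories are copied here.

```python
# Subset of the thresholds defined in the categories dict
CATEGORY_THRESHOLDS = {'auth': 0.90, 'junk': 0.85, 'conversational': 0.65}
DEFAULT_THRESHOLD = 0.75

def accept_prediction(category, confidence,
                      thresholds=CATEGORY_THRESHOLDS, default=DEFAULT_THRESHOLD):
    """True when the ML confidence clears the bar for the predicted category."""
    return confidence >= thresholds.get(category, default)

auth_ok = accept_prediction('auth', 0.88)              # below the 0.90 bar
chat_ok = accept_prediction('conversational', 0.70)    # clears the 0.65 bar
other_ok = accept_prediction('uncategorized', 0.76)    # falls back to the default
```

With this in place, an `auth` prediction at 0.88 would be queued for review while a `conversational` one at 0.70 would be accepted.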
MODULAR ARCHITECTURE
Tiered Dependencies
# setup.py
from setuptools import setup

setup(
name="email-sorter",
version="1.0.0",
install_requires=[
# CORE (always required)
"numpy>=1.24.0",
"pandas>=2.0.0",
"scikit-learn>=1.3.0",
"lightgbm>=4.0.0",
"sentence-transformers>=2.2.0",
"pydantic>=2.0.0",
"pyyaml>=6.0",
"click>=8.1.0",
"rich>=13.0.0",
"tqdm>=4.66.0",
"tenacity>=8.2.0",
],
extras_require={
# Email providers (optional)
"gmail": [
"google-api-python-client>=2.100.0",
"google-auth-httplib2>=0.1.1",
"google-auth-oauthlib>=1.1.0",
],
"microsoft": [
"msal>=1.24.0",
],
"imap": [
"imapclient>=2.3.1",
],
# LLM providers (optional)
"ollama": [
"ollama>=0.1.0",
],
"openai": [
"openai>=1.0.0",
],
# Attachment processing (optional)
"attachments": [
"PyPDF2>=3.0.0",
"python-docx>=0.8.11",
"openpyxl>=3.0.10",
],
# Development (optional)
"dev": [
"pytest>=7.4.0",
"pytest-cov>=4.1.0",
"pytest-mock>=3.11.0",
"black>=23.0.0",
"isort>=5.12.0",
],
# All extras
"all": [
# List every package from the groups above here; an empty list makes [all] install nothing extra
]
}
)
Installation options:
# Minimal (ML only, no LLM, no email providers)
pip install email-sorter
# With Gmail support (quotes keep shells such as zsh from globbing the brackets)
pip install "email-sorter[gmail]"
# With Ollama LLM
pip install "email-sorter[ollama,gmail]"
# Everything
pip install "email-sorter[all]"
TESTING STRATEGY
Test Harness Structure
tests/
├── unit/
│ ├── test_feature_extraction.py
│ ├── test_pattern_matching.py
│ ├── test_embeddings.py
│ ├── test_lightgbm.py
│ └── test_attachment_analysis.py
├── integration/
│ ├── test_calibration.py
│ ├── test_ml_llm_pipeline.py
│ ├── test_gmail_provider.py
│ └── test_checkpoint_resume.py
├── e2e/
│ ├── test_full_pipeline_100.py
│ ├── test_full_pipeline_1000.py
│ └── test_full_pipeline_80k.py
├── fixtures/
│ ├── mock_emails.json
│ ├── mock_llm_responses.json
│ └── sample_inboxes/
└── conftest.py
Unit Tests
# tests/unit/test_feature_extraction.py
import pytest
from src.classification.feature_extractor import FeatureExtractor
from src.email_providers.base import Email

def test_pattern_extraction():
    email = Email(
        id='1',
        subject='Your verification code is 123456',
        sender='noreply@service.com',
        body='Your one-time password is 123456'
    )
    extractor = FeatureExtractor()
    patterns = extractor._extract_patterns(email)
    assert patterns['has_otp']
    assert patterns['has_verification']
    assert patterns['is_automated']

def test_structured_embedding():
    email = Email(
        id='2',
        subject='Invoice #12345',
        sender='billing@company.com',
        body='Please find attached your invoice'
    )
    extractor = FeatureExtractor()
    text = extractor.build_embedding_text(email)
    assert '[EMAIL_METADATA]' in text
    assert '[DETECTED_PATTERNS]' in text
    assert 'has_invoice: True' in text
Integration Tests
# tests/integration/test_ml_llm_pipeline.py
def test_calibration_then_classification():
    # 1. Load sample emails
    emails = load_sample_emails(count=100)
    # 2. Run calibration (with mock LLM)
    calibrator = CalibrationPhase(mock_llm_provider)
    config = calibrator.run(emails)
    # 3. Train classifier (HybridClassifier requires the category list;
    #    config['categories'] is an assumed key in the calibration output)
    classifier = HybridClassifier(config['categories'])
    classifier.train(emails, config['labels'])
    # 4. Classify new emails
    new_emails = load_sample_emails(count=20, exclude=emails)
    results = [classifier.predict(e) for e in new_emails]
    # 5. Assert accuracy against known labels for new_emails
    ground_truth = load_ground_truth(new_emails)  # hypothetical fixture helper
    accuracy = calculate_accuracy(results, ground_truth)
    assert accuracy > 0.85
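The integration test depends on a `mock_llm_provider` that is not defined in this blueprint. A minimal sketch of such a test double is shown below. It mirrors the `complete` and `test_connection` methods of `BaseLLMProvider`; the substring-matching behaviour and the `MockLLMProvider` name are assumptions.

```python
class MockLLMProvider:
    """Test double exposing the same two methods as BaseLLMProvider."""

    def __init__(self, responses, default='unknown'):
        self.responses = responses   # substring of the prompt -> canned reply
        self.default = default
        self.prompts = []            # every prompt seen, for later assertions

    def complete(self, prompt, **kwargs):
        self.prompts.append(prompt)
        for needle, reply in self.responses.items():
            if needle.lower() in prompt.lower():
                return reply
        return self.default

    def test_connection(self):
        return True

mock_llm_provider = MockLLMProvider({
    'invoice': 'transactional',
    'verification code': 'auth',
})
reply = mock_llm_provider.complete('Classify: Invoice #12345 attached')
fallback_reply = mock_llm_provider.complete('Classify: hello there')
```

Canned replies keep the integration test deterministic and fast, and the recorded prompts allow assertions on how often the LLM was consulted.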
E2E Tests
# tests/e2e/test_full_pipeline_100.py
def test_full_pipeline_100_emails(tmp_path):
    """End-to-end test on 100 emails"""
    # Setup
    output_dir = tmp_path / "results"
    emails = load_test_inbox(count=100)
    # Run full pipeline
    result = run_email_sorter(
        emails=emails,
        output=output_dir,
        config="tests/fixtures/test_config.yaml"
    )
    # Assertions
    assert result['total_processed'] == 100
    assert result['accuracy_estimate'] > 0.90
    assert (output_dir / "results.json").exists()
    assert (output_dir / "report.txt").exists()
PERFORMANCE EXPECTATIONS (Updated with Research)
For 80,000 emails:
| Phase | Time | Details |
|---|---|---|
| Calibration | 3-5 min | 1500 emails, qwen3:4b, train LightGBM |
| Pattern detection | ~10 sec | Regex on all 80k emails |
| Embedding generation | ~8 min | Batched, CPU, all 80k emails |
| LightGBM classification | ~3 sec | Fast inference |
| Hard rules auto-classify | instant | 10% = 8,000 emails |
| LLM review (qwen3:1.7b) | ~4 min | 5% = 4,000 emails, batched |
| Export & sync | ~2 min | JSON/CSV + Gmail API |
| TOTAL | ~17 min | |
Accuracy Breakdown:
| Component | Coverage | Accuracy |
|---|---|---|
| Hard rules | 10% | 99% |
| LightGBM (high conf) | 85% | 92% |
| LLM review | 5% | 95% |
| Overall | 100% | ~93% as the coverage-weighted average of the rows above (target: 94-96%) |
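The overall figure is a coverage-weighted average of the three components. The arithmetic below uses only the coverage and accuracy values from the table; with those inputs the average comes to about 92.9%, so reaching the 94-96% target would require a more accurate LightGBM tier. The `required_ml_accuracy` helper is illustrative.

```python
# (coverage, accuracy) per component, copied from the table above
components = {
    'hard_rules': (0.10, 0.99),
    'lightgbm_high_conf': (0.85, 0.92),
    'llm_review': (0.05, 0.95),
}

total_coverage = sum(cov for cov, _ in components.values())
overall = sum(cov * acc for cov, acc in components.values())

def required_ml_accuracy(target):
    """LightGBM accuracy needed to hit `target` overall, other tiers unchanged."""
    ml_coverage = components['lightgbm_high_conf'][0]
    others = sum(cov * acc for name, (cov, acc) in components.items()
                 if name != 'lightgbm_high_conf')
    return (target - others) / ml_coverage

needed_for_94 = required_ml_accuracy(0.94)   # about 0.9335
```

Because LightGBM covers 85% of emails, its accuracy dominates the result: each point gained there moves the overall figure by 0.85 points.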
Memory Usage (80k emails):
- Email data: ~400MB
- Embeddings (cached): ~500MB
- LightGBM model: ~5MB
- MiniLM model: ~90MB
- Peak: ~1.2GB
DISTRIBUTABLE WHEEL PACKAGING
Package Structure
email-sorter/
├── setup.py
├── setup.cfg
├── pyproject.toml
├── MANIFEST.in
├── README.md
├── LICENSE
├── src/
│ └── email_sorter/
│ ├── __init__.py
│ ├── __main__.py
│ ├── cli.py
│ └── ... (all modules)
├── config/
│ ├── default_config.yaml
│ ├── categories.yaml
│ └── llm_models.yaml
└── models/
└── pretrained/
├── minilm-l6-v2/ (bundled embedder)
└── lightgbm.pkl (optional pre-trained)
Distribution Commands
# Build sdist and wheel (needs `pip install build`; calling setup.py directly is deprecated)
python -m build
# Install locally
pip install dist/email_sorter-1.0.0-py3-none-any.whl
# Use as command
email-sorter --source gmail --credentials creds.json --output results/
# Or as module
python -m email_sorter --source gmail ...
CLI Interface
email-sorter --help
# Basic usage
email-sorter \
--source gmail \
--credentials credentials.json \
--output results/
# Advanced options
email-sorter \
--source gmail \
--credentials creds.json \
--output results/ \
--config custom_config.yaml \
--llm-provider ollama \
--llm-model qwen3:1.7b \
--limit 1000 \
--no-calibrate \
--dry-run
PROJECT STRUCTURE
email-sorter/
├── README.md
├── PROJECT_BLUEPRINT.md # This file
├── BUILD_INSTRUCTIONS.md
├── RESEARCH_FINDINGS.md
├── setup.py
├── setup.cfg
├── pyproject.toml
├── requirements.txt
├── .gitignore
├── .env.example
├── config/
│ ├── default_config.yaml
│ ├── categories.yaml
│ ├── llm_models.yaml # LLM config (single source)
│ └── features.yaml
├── src/
│ ├── __init__.py
│ ├── __main__.py
│ ├── cli.py # Click CLI
│ ├── calibration/
│ │ ├── __init__.py
│ │ ├── sampler.py # Stratified sampling
│ │ ├── llm_analyzer.py # LLM calibration
│ │ └── trainer.py # Train LightGBM
│ ├── classification/
│ │ ├── __init__.py
│ │ ├── feature_extractor.py # Hybrid features
│ │ ├── pattern_matcher.py # Hard rules
│ │ ├── embedder.py # Sentence embeddings
│ │ ├── lightgbm_classifier.py
│ │ ├── adaptive_classifier.py
│ │ └── llm_classifier.py
│ ├── models/
│ │ ├── __init__.py
│ │ ├── pretrained/
│ │ │ └── .gitkeep
│ │ └── model_loader.py
│ ├── email_providers/
│ │ ├── __init__.py
│ │ ├── base.py
│ │ ├── gmail.py
│ │ ├── microsoft.py
│ │ └── imap.py
│ ├── llm/
│ │ ├── __init__.py
│ │ ├── base.py # Abstract provider
│ │ ├── ollama.py
│ │ └── openai.py
│ ├── processing/
│ │ ├── __init__.py
│ │ ├── bulk_processor.py
│ │ ├── attachment_handler.py
│ │ └── queue_manager.py
│ ├── adjustment/
│ │ ├── __init__.py
│ │ ├── threshold_adjuster.py
│ │ └── pattern_learner.py
│ ├── export/
│ │ ├── __init__.py
│ │ ├── results_exporter.py
│ │ ├── provider_sync.py
│ │ └── report_generator.py
│ └── utils/
│ ├── __init__.py
│ ├── config.py
│ ├── logging.py
│ └── cleanup.py
├── tests/
│ ├── unit/
│ ├── integration/
│ ├── e2e/
│ ├── fixtures/
│ └── conftest.py
├── prompts/
│ ├── calibration.txt
│ └── classification.txt
├── scripts/
│ ├── train_model.py
│ ├── verify_install.py
│ └── benchmark.py
├── data/
│ └── samples/
└── logs/
└── .gitkeep
SECURITY & PRIVACY
- ✅ All processing is local - No cloud uploads
- ✅ LLM runs locally - Via Ollama (choosing the optional OpenAI-compatible API instead sends email content to that endpoint)
- ✅ Fresh clone per job - Complete isolation
- ✅ No persistent storage - Email bodies never written to disk
- ✅ Attachment content - Processed in memory, discarded immediately
- ✅ Auto cleanup - Temp files deleted after processing
- ✅ Credentials - Used directly, never cached
- ✅ GDPR-friendly - No data retention or sharing
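The isolation and auto-cleanup guarantees can be enforced with a per-job temporary workspace that is removed even when the job fails. A minimal sketch using only the standard library; the `job_workspace` name and the checkpoint file are assumptions.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def job_workspace(prefix='email-sorter-'):
    """Yield a fresh temp directory; delete it and its contents on exit."""
    with tempfile.TemporaryDirectory(prefix=prefix) as tmp:
        yield Path(tmp)

with job_workspace() as workdir:
    # Anything a job writes (checkpoints, temp exports) lives under workdir
    (workdir / 'checkpoint.json').write_text('{}')
    existed_during_job = workdir.exists()

exists_after_job = workdir.exists()
```

`TemporaryDirectory` also cleans up when the body raises, so a crashed job does not leave email data behind.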
SUCCESS CRITERIA
- ✅ Processes 80k emails in <20 minutes
- ✅ 94-96% classification accuracy (competitive with cloud tools)
- ✅ <5% emails need LLM review
- ✅ Successfully syncs back to Gmail/IMAP
- ✅ No data leakage between jobs
- ✅ Works on Windows, Linux, macOS
- ✅ LLM is optional (graceful degradation)
- ✅ Distributable as Python wheel
- ✅ Attachment analysis working
- ✅ OpenAI-compatible API support
WHAT'S NEXT
- ✅ Research complete (benchmarks, competition, LightGBM vs XGBoost)
- ⏭ Update BUILD_INSTRUCTIONS.md with new architecture
- ⏭ Create RESEARCH_FINDINGS.md with search results
- ⏭ Build core infrastructure (config, logging, data models)
- ⏭ Implement feature extraction (embeddings + patterns + attachments)
- ⏭ Create LightGBM classifier
- ⏭ Implement LLM providers (Ollama + OpenAI-compatible)
- ⏭ Build calibration system
- ⏭ Create test harness
- ⏭ Package as wheel
- ⏭ Test on Marion's 80k emails
END OF BLUEPRINT v2.0
This is the complete, research-backed architecture ready to build.