EMAIL SORTER - PROJECT BLUEPRINT

Hybrid ML/LLM Email Classification System

Version: 2.0 Date: 2024-10-21 Status: Research Complete - Ready to Build


EXECUTIVE SUMMARY

What it does: Processes 80,000+ emails in ~17 minutes using a pre-trained ML model for bulk classification (90%+) and LLM (Ollama/OpenAI-compatible) for edge cases and startup calibration (~5-10%).

How it works:

  1. Fresh repo clone per job (complete isolation)
  2. LLM analyzes sample to discover natural categories (calibration phase)
  3. Train LightGBM on embeddings + patterns + structural features
  4. ML sprints through high-confidence classifications
  5. Hard rules catch obvious patterns (OTP, invoices, etc.)
  6. LLM reviews only uncertain cases (batched efficiently)
  7. System self-tunes thresholds based on LLM feedback
  8. Export results and sync back to email provider
  9. Delete repo (cleanup)

Target use case: Self-employed and business owners with 10k-100k+ neglected emails who need privacy-focused, one-time cleanup without cloud uploads or subscriptions.

Key innovation: Hybrid approach with structured embeddings, hard pattern rules, and dynamic threshold adjustment. LLM is OPTIONAL - system degrades gracefully if unavailable.


COMPETITIVE ANALYSIS (2024 Research)

Existing Solutions (ALL Cloud-Based)

Tool          Price       Accuracy  Privacy  Notes
SaneBox       $7-15/mo    ~85%      Cloud    AI filtering, requires upload
Clean Email   $10-30/mo   ~80%      Cloud    Smart folders, subscription
Spark         Free/Paid   ~75%      Cloud    Smart inbox, cloud sync
EmailTree.ai  Enterprise  ~90%      Cloud    NLP, for businesses
Mailstrom     $30-50/yr   ~70%      Cloud    Bulk analysis

Our Competitive Advantages

  • 100% LOCAL - No data leaves the machine
  • Privacy-first - Perfect for business owners with sensitive data
  • One-time use - No subscription, pay per job or DIY
  • Customizable - Adapts to each inbox during calibration
  • Open source potential - Distributable as Python wheel
  • Attachment analysis - Competitors ignore this entirely
  • Offline capable - Works without internet (after initial setup)

Benchmark Performance (2024 Research)

Enron Dataset (industry standard):

  • Traditional ML (SVM, Random Forest): 95-98%
  • Deep Learning (DNN-BiLSTM): 98.69%
  • Transformers (BERT, RoBERTa): ~99%
  • LLMs (GPT-4): 99.7% (phishing detection)
  • Ensemble methods: 98.8%

Our Target: 94-96% accuracy (competitive, privacy-focused, local)


ARCHITECTURE

Three-Phase Pipeline

┌─────────────────────────────────────────────────────────────┐
│ PHASE 1: CALIBRATION (3-5 minutes)                          │
├─────────────────────────────────────────────────────────────┤
│ 1. Sample 1500 emails (stratified sampling)                │
│ 2. LLM analyzes patterns and discovers categories          │
│    Model: qwen3:4b (bigger, more accurate)                 │
│    Alternative: Compress to 500 emails + smarter batching  │
│ 3. Map discovered → universal categories                   │
│ 4. Generate training labels for embedding classifier       │
│ 5. Validate on 300 emails                                  │
│ 6. Set initial confidence thresholds                       │
│ 7. Train LightGBM on embeddings + patterns                 │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│ PHASE 2: BULK PROCESSING (10-12 minutes)                   │
├─────────────────────────────────────────────────────────────┤
│ For each email:                                             │
│   → Pattern detection (regex, <1ms)                        │
│   → Hard rule match? → INSTANT (10% of emails)             │
│   → Generate structured embedding (batched, 8 min total)   │
│   → LightGBM classify with confidence score                │
│   → IF confidence >= threshold: ACCEPT (85%)               │
│   → IF confidence < threshold: QUEUE for LLM (5%)         │
│                                                             │
│ Every 1000 emails or queue full:                            │
│   → Process LLM batch (qwen3:1.7b, fast)                   │
│   → Analyze agreement rate                                 │
│   → Adjust thresholds dynamically                          │
│   → Learn sender rules                                     │
│   → Save checkpoint                                        │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│ PHASE 3: FINALIZATION (2-3 minutes)                        │
├─────────────────────────────────────────────────────────────┤
│ 1. Process remaining LLM queue                             │
│ 2. Export results (JSON/CSV)                               │
│ 3. Sync to email provider (Gmail labels, IMAP folders)     │
│ 4. Generate classification report                          │
│ 5. Cleanup (delete repo, temp files)                       │
└─────────────────────────────────────────────────────────────┘
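
Phase 1, step 1 depends on stratified sampling so the 1500-email calibration set mirrors the whole inbox rather than just the most recent mail. A minimal sketch of what sampler.py might do, stratifying by sender domain and month (the choice of strata is an assumption, not a fixed design):

import random
from collections import defaultdict

def stratified_sample(emails, target=1500, seed=42):
    """Sample emails proportionally from (sender_domain, month) strata."""
    random.seed(seed)
    strata = defaultdict(list)
    for e in emails:
        domain = e.sender.split('@')[-1].lower()
        strata[(domain, e.date.strftime('%Y-%m'))].append(e)

    sample = []
    for bucket in strata.values():
        # Proportional allocation, with at least one email per non-empty stratum
        k = max(1, round(target * len(bucket) / len(emails)))
        sample.extend(random.sample(bucket, min(k, len(bucket))))

    random.shuffle(sample)
    return sample[:target]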

CORE COMPONENTS

1. Hybrid Feature Extraction (THE SECRET SAUCE)

Combines three feature types for maximum accuracy:

A. Sentence Embeddings (Semantic Understanding)

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')  # 384 dimensions

# Structured embedding with parameterized headers
def build_embedding_text(email, patterns):
    return f"""[EMAIL_METADATA]
sender_type: {email.sender_domain_type}
time_category: {email.time_of_day}
has_attachments: {email.has_attachments}
attachment_types: {email.attachment_types}

[DETECTED_PATTERNS]
has_otp: {patterns['has_otp']}
has_invoice: {patterns['has_invoice']}
has_unsubscribe: {patterns['has_unsubscribe']}
is_automated: {patterns['is_noreply']}
has_meeting: {patterns['has_meeting']}

[CONTENT]
subject: {email.subject}
body: {email.body_snippet[:300]}
"""

text = build_embedding_text(email, patterns)
embedding = embedder.encode(text)  # → 384-dim vector

Why this works:

  • Model sees STRUCTURE, not just raw text
  • Pattern hints guide semantic understanding
  • Research shows 5-10% accuracy boost vs naive embedding
  • Handles semantic variants: "meeting" = "call" = "zoom"

B. Hard Pattern Rules (Fast Deterministic)

# ~20 boolean/numerical features extracted via regex
patterns = {
    # Authentication patterns
    'has_otp': bool(re.search(r'\b\d{4,6}\b', text)),
    'has_verification': 'verification' in text.lower(),
    'has_reset_password': 'reset password' in text.lower(),

    # Transactional patterns
    'has_invoice': bool(re.search(r'invoice\s*#?\d+', text, re.I)),
    'has_receipt': 'receipt' in text.lower(),
    'has_price': bool(re.search(r'\$\d+', text)),
    'has_order_number': bool(re.search(r'order\s*#?\d+', text, re.I)),

    # Newsletter/marketing patterns
    'has_unsubscribe': 'unsubscribe' in text.lower(),
    'has_view_in_browser': 'view in browser' in text.lower(),

    # Meeting/calendar patterns
    'has_meeting': bool(re.search(r'(meeting|call|zoom|teams)', text, re.I)),
    'has_calendar': 'calendar' in text.lower(),

    # Other patterns
    'has_tracking': bool(re.search(r'tracking\s*(number|#)', text, re.I)),
    'is_automated': email.sender_domain_type == 'noreply',
    'has_signature': bool(re.search(r'(regards|sincerely|best)', text, re.I)),
}

C. Structural Features (Metadata)

# ~20 numerical/categorical features
attachment_types = extract_attachment_types(email.attachments)  # computed once, reused below
structural = {
    # Sender analysis
    'sender_domain': extract_domain(email.sender),
    'sender_domain_type': categorize_domain(email.sender),  # freemail/corporate/noreply
    'is_noreply': 'noreply' in email.sender.lower(),

    # Timing
    'time_of_day': categorize_hour(email.date.hour),  # night/morning/afternoon/evening
    'day_of_week': email.date.strftime('%A').lower(),

    # Content structure
    'subject_length': len(email.subject),
    'body_length': len(email.body),
    'link_count': len(re.findall(r'https?://', email.body)),
    'image_count': len(re.findall(r'<img', email.body)),

    # Attachments (COMPETITIVE ADVANTAGE)
    'has_attachments': email.has_attachments,
    'attachment_count': len(email.attachments),
    'attachment_types': attachment_types,
    'has_pdf': 'pdf' in attachment_types,
    'has_invoice_pdf': patterns['has_invoice'] and 'pdf' in attachment_types,  # KILLER FEATURE (uses pattern dict from B)

    # Reply/forward
    'has_reply_prefix': bool(re.match(r'^(Re:|Fwd:)', email.subject, re.I)),
}

D. Attachment Analysis (Differentiator)

def analyze_attachments(attachments):
    """Extract features from attachments - competitors don't do this!"""
    features = {
        'has_attachments': len(attachments) > 0,
        'attachment_count': len(attachments),
        'total_size': sum(a['size'] for a in attachments),
        'attachment_types': []
    }

    for attachment in attachments:
        mime_type = attachment.get('mime_type', '')
        filename = attachment.get('filename', '')

        # Type categorization
        if 'pdf' in mime_type or filename.endswith('.pdf'):
            features['attachment_types'].append('pdf')

            # Extract text from PDF if small enough (<5MB)
            if attachment['size'] < 5_000_000:
                text = extract_pdf_text(attachment)
                features['pdf_has_invoice'] = bool(re.search(r'invoice|bill', text, re.I))
                features['pdf_has_account'] = bool(re.search(r'account\s*#?\d+', text, re.I))

        elif 'word' in mime_type or filename.endswith(('.doc', '.docx')):
            features['attachment_types'].append('docx')

        elif 'excel' in mime_type or filename.endswith(('.xls', '.xlsx')):
            features['attachment_types'].append('xlsx')

        elif 'image' in mime_type or filename.endswith(('.png', '.jpg', '.jpeg')):
            features['attachment_types'].append('image')

    return features

Why this matters:

  • Business emails often have invoice PDFs, contract DOCXs
  • Detecting "PDF with INVOICE text" → instant "transactional" classification
  • Competitors ignore attachments entirely = our differentiator

Combined Feature Vector

# Total: ~434 dimensions (vs 10,000 with TF-IDF!)
final_features = np.concatenate([
    embedding,              # 384 dims (semantic understanding)
    pattern_values,         # 20 dims (hard rules)
    structural_values,      # 20 dims (metadata)
    attachment_values       # 10 dims (NEW!)
])

2. LightGBM Classifier (Research-Backed Choice)

Why LightGBM over XGBoost:

  • Native categorical handling (no encoding needed)
  • 2-5x faster on mixed feature types
  • 4x speedup with categorical + numerical features
  • Better memory efficiency
  • Equivalent accuracy to XGBoost
  • Perfect for embeddings (dense numerical) + categoricals

import lightgbm as lgb
import numpy as np
from sentence_transformers import SentenceTransformer

class HybridClassifier:
    def __init__(self, categories):
        self.categories = categories
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.model = None

    def extract_features(self, email):
        """Extract all feature types"""
        patterns = extract_patterns(email)
        structural = extract_structural(email)

        # Structured embedding with rich context
        text = build_embedding_text(email, patterns)
        embedding = self.embedder.encode(text)

        # Combine features
        features = {
            'embedding': embedding,  # 384 numerical
            'patterns': patterns,    # 20 numerical/boolean
            'structural': structural # 20 numerical/categorical
        }

        return features

    def train(self, emails, labels):
        """Train on LLM-labeled data from calibration"""
        # Extract features
        all_features = [self.extract_features(e) for e in emails]

        # Build feature matrix
        X = np.array([
            np.concatenate([
                f['embedding'],
                list(f['patterns'].values()),
                [f['structural'][k] for k in numerical_keys]  # numeric structural fields, defined elsewhere
            ])
            for f in all_features
        ])

        # Categorical features
        # NOTE: with a plain numpy matrix these must be integer-encoded and
        # referenced by column index; a pandas DataFrame with named columns
        # can use the names directly.
        categorical_features = ['sender_domain_type', 'time_of_day', 'day_of_week']

        # Train LightGBM
        self.model = lgb.LGBMClassifier(
            n_estimators=200,
            learning_rate=0.1,
            max_depth=8,
            num_leaves=31,
            objective='multiclass',
            num_class=len(self.categories)
        )

        self.model.fit(X, labels, categorical_feature=categorical_features)

    def predict(self, email):
        """Predict with confidence"""
        features = self.extract_features(email)
        X = build_feature_vector(features)

        # Get probabilities
        probs = self.model.predict_proba([X])[0]
        pred_class = np.argmax(probs)

        return {
            'category': self.categories[pred_class],
            'confidence': float(probs[pred_class]),
            'probabilities': {
                self.categories[i]: float(probs[i])
                for i in range(len(self.categories))
            }
        }

3. LLM Integration (Flexible & Optional)

Model Strategy:

Phase           Model       Speed    Purpose
Calibration     qwen3:4b    Slower   Better category discovery, 1500 emails
Classification  qwen3:1.7b  Fast     Quick review, only ~5% of emails
Optional        qwen3:30b   Slowest  Maximum accuracy if needed

Configuration (Single Source of Truth):

# config/llm_models.yaml
llm:
  # Provider type: ollama, openai, anthropic
  provider: "ollama"

  # Ollama settings
  ollama:
    base_url: "http://localhost:11434"
    calibration_model: "qwen3:4b"      # Bigger for better discovery
    classification_model: "qwen3:1.7b"  # Smaller for speed
    temperature: 0.1
    max_tokens: 500
    timeout: 30
    retry_attempts: 3

  # OpenAI-compatible API (future-proof)
  openai:
    base_url: "https://api.openai.com/v1"  # Or custom endpoint
    api_key: "${OPENAI_API_KEY}"
    calibration_model: "gpt-4o-mini"
    classification_model: "gpt-4o-mini"
    temperature: 0.1
    max_tokens: 500

  # Graceful degradation
  fallback:
    enabled: true
    # If LLM unavailable, emails go to "needs_review" folder
    # ML still works, just more conservative thresholds
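
The "${OPENAI_API_KEY}" placeholder implies environment-variable expansion, which plain yaml.safe_load does not do on its own. A minimal loader sketch, assuming expansion is done with a simple regex before parsing (the function name is illustrative):

import os
import re
import yaml

_ENV_PATTERN = re.compile(r'\$\{([^}]+)\}')

def load_llm_config(path="config/llm_models.yaml"):
    """Load the YAML config and expand ${VAR} references from the environment."""
    with open(path) as f:
        raw = f.read()
    expanded = _ENV_PATTERN.sub(lambda m: os.getenv(m.group(1), ''), raw)
    return yaml.safe_load(expanded)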

LLM Provider Abstraction:

import os
from abc import ABC, abstractmethod

class BaseLLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        pass

    @abstractmethod
    def test_connection(self) -> bool:
        pass

class OllamaProvider(BaseLLMProvider):
    def __init__(self, base_url: str, model: str):
        import ollama
        self.client = ollama.Client(host=base_url)
        self.model = model

    def complete(self, prompt: str, **kwargs) -> str:
        response = self.client.generate(
            model=self.model,
            prompt=prompt,
            options={
                'temperature': kwargs.get('temperature', 0.1),
                'num_predict': kwargs.get('max_tokens', 500)
            }
        )
        return response['response']

    def test_connection(self) -> bool:
        try:
            self.client.list()
            return True
        except Exception:
            return False

class OpenAIProvider(BaseLLMProvider):
    def __init__(self, base_url: str, api_key: str, model: str):
        from openai import OpenAI
        self.client = OpenAI(base_url=base_url, api_key=api_key)
        self.model = model

    def complete(self, prompt: str, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=kwargs.get('temperature', 0.1),
            max_tokens=kwargs.get('max_tokens', 500)
        )
        return response.choices[0].message.content

    def test_connection(self) -> bool:
        try:
            self.client.models.list()
            return True
        except Exception:
            return False

def get_llm_provider(config) -> BaseLLMProvider:
    """Factory to create LLM provider based on config"""
    provider_type = config['llm']['provider']

    if provider_type == 'ollama':
        return OllamaProvider(
            base_url=config['llm']['ollama']['base_url'],
            model=config['llm']['ollama']['classification_model']
        )
    elif provider_type == 'openai':
        return OpenAIProvider(
            base_url=config['llm']['openai']['base_url'],
            api_key=os.getenv('OPENAI_API_KEY'),
            model=config['llm']['openai']['classification_model']
        )
    else:
        raise ValueError(f"Unknown provider: {provider_type}")
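
In Phase 2 the classification-phase model never sees the whole inbox; it only reviews the queued low-confidence emails, in batches. A minimal sketch of that review loop on top of BaseLLMProvider (the prompt format and the JSON-reply convention are assumptions, not a fixed protocol):

import json

def review_uncertain_batch(provider: BaseLLMProvider, emails, categories, batch_size=20):
    """Send queued low-confidence emails to the LLM in batches and parse one label each."""
    results = {}
    for i in range(0, len(emails), batch_size):
        batch = emails[i:i + batch_size]
        lines = [
            f"{idx}. subject: {e.subject} | sender: {e.sender} | body: {e.body[:200]}"
            for idx, e in enumerate(batch)
        ]
        prompt = (
            "Classify each email into exactly one of these categories: "
            f"{', '.join(categories)}.\n"
            "Reply with JSON mapping the email number to its category.\n\n"
            + "\n".join(lines)
        )
        reply = provider.complete(prompt, temperature=0.1, max_tokens=500)
        try:
            labels = json.loads(reply)
        except json.JSONDecodeError:
            labels = {}  # unparseable reply: fall back to 'unknown' for this batch
        for idx, e in enumerate(batch):
            results[e.id] = labels.get(str(idx), 'unknown')
    return results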

Graceful Degradation (LLM Optional):

class AdaptiveClassifier:
    def __init__(self, ml_model, llm_classifier, config):
        self.ml_model = ml_model
        self.llm_classifier = llm_classifier
        self.llm_available = self._test_llm_connection()
        self.config = config

        if not self.llm_available:
            logger.warning("LLM unavailable - using conservative thresholds")
            self.default_threshold = 0.85  # Higher threshold without LLM
        else:
            self.default_threshold = 0.75

    def _test_llm_connection(self):
        """Check if LLM is available"""
        if not self.llm_classifier:
            return False
        try:
            return self.llm_classifier.test_connection()
        except Exception:
            return False

    def classify(self, email, features):
        """Classify with or without LLM"""
        # ML classification
        ml_result = self.ml_model.predict(features)

        # Check hard rules first
        if self._has_hard_rule_match(email):
            return ClassificationResult(
                category=self._get_rule_category(email),
                confidence=0.99,
                method='rule'
            )

        # High confidence ML result
        if ml_result['confidence'] >= self.default_threshold:
            return ClassificationResult(
                category=ml_result['category'],
                confidence=ml_result['confidence'],
                method='ml'
            )

        # Low confidence - try LLM if available
        if self.llm_available:
            return ClassificationResult(
                category=ml_result['category'],
                confidence=ml_result['confidence'],
                method='ml',
                needs_review=True  # Queue for LLM
            )
        else:
            # No LLM - mark for manual review
            return ClassificationResult(
                category='needs_review',
                confidence=ml_result['confidence'],
                method='ml',
                needs_review=True,
                metadata={'ml_prediction': ml_result}
            )
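
The dynamic threshold adjustment from Phase 2 ("analyze agreement rate, adjust thresholds") is not shown above. A minimal sketch of the idea behind threshold_adjuster.py, assuming each reviewed batch yields a per-category agreement rate between ML predictions and LLM verdicts:

class ThresholdAdjuster:
    """Nudge per-category confidence thresholds based on ML/LLM agreement."""

    def __init__(self, thresholds, step=0.02, min_t=0.60, max_t=0.95):
        self.thresholds = dict(thresholds)   # category -> float
        self.step = step
        self.min_t = min_t
        self.max_t = max_t

    def update(self, category, agreement_rate):
        """agreement_rate: fraction of reviewed emails where the LLM agreed with ML."""
        t = self.thresholds.get(category, 0.75)
        if agreement_rate > 0.95:
            t -= self.step   # ML is trustworthy here, accept more on its own
        elif agreement_rate < 0.80:
            t += self.step   # too many disagreements, be more conservative
        self.thresholds[category] = min(self.max_t, max(self.min_t, t))
        return self.thresholds[category]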

4. Universal Categories (12 Total)

categories = {
    'junk': {
        'description': 'Spam, unwanted marketing, phishing',
        'patterns': ['unsubscribe', 'click here', 'limited time'],
        'threshold': 0.85  # High confidence needed
    },
    'transactional': {
        'description': 'Receipts, invoices, confirmations, order tracking',
        'patterns': ['receipt', 'invoice', 'order', 'shipped', 'tracking'],
        'threshold': 0.80
    },
    'auth': {
        'description': 'OTPs, password resets, 2FA codes, security alerts',
        'patterns': ['verification code', 'otp', 'reset password', r'\d{4,6}'],
        'threshold': 0.90  # Very high - important emails
    },
    'newsletters': {
        'description': 'Subscribed newsletters, marketing emails',
        'patterns': ['newsletter', 'weekly digest', 'monthly update'],
        'threshold': 0.75
    },
    'social': {
        'description': 'Social media notifications, mentions, friend requests',
        'patterns': ['mentioned you', 'friend request', 'liked your'],
        'threshold': 0.75
    },
    'automated': {
        'description': 'System notifications, alerts, no-reply messages',
        'patterns': ['automated', 'system notification', 'do not reply'],
        'threshold': 0.80
    },
    'conversational': {
        'description': 'Human-to-human correspondence, replies, discussions',
        'patterns': ['hi', 'hello', 'thanks', 'regards'],
        'threshold': 0.65  # Lower - varied language
    },
    'work': {
        'description': 'Business correspondence, meetings, projects',
        'patterns': ['meeting', 'project', 'deadline', 'team'],
        'threshold': 0.70
    },
    'personal': {
        'description': 'Friends and family, personal matters',
        'patterns': ['love', 'family', 'dinner', 'weekend'],
        'threshold': 0.70
    },
    'finance': {
        'description': 'Bank statements, credit cards, investments, bills',
        'patterns': ['statement', 'balance', 'account', 'payment due'],
        'threshold': 0.85  # High - sensitive
    },
    'travel': {
        'description': 'Flight bookings, hotels, reservations, itineraries',
        'patterns': ['flight', 'booking', 'reservation', 'check-in'],
        'threshold': 0.80
    },
    'unknown': {
        'description': "Doesn't fit any category (requires review)",
        'patterns': [],
        'threshold': 0.50  # Catch-all
    }
}
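
Step 3 of calibration maps whatever categories the LLM discovers onto these 12 universal ones. The blueprint does not fix the mapping method; one reasonable option is embedding similarity between category descriptions. A sketch under that assumption (map_to_universal and the discovered name -> description dict are illustrative):

from sentence_transformers import SentenceTransformer, util

def map_to_universal(discovered, categories, min_similarity=0.45):
    """Map LLM-discovered category descriptions to the closest universal category."""
    embedder = SentenceTransformer('all-MiniLM-L6-v2')
    universal_names = list(categories.keys())
    universal_vecs = embedder.encode(
        [categories[name]['description'] for name in universal_names]
    )

    mapping = {}
    for name, description in discovered.items():
        scores = util.cos_sim(embedder.encode(description), universal_vecs)[0]
        best = int(scores.argmax())
        # Anything too dissimilar falls through to the 'unknown' bucket
        mapping[name] = universal_names[best] if float(scores[best]) >= min_similarity else 'unknown'
    return mapping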

MODULAR ARCHITECTURE

Tiered Dependencies

# setup.py
from setuptools import setup

setup(
    name="email-sorter",
    version="1.0.0",
    install_requires=[
        # CORE (always required)
        "numpy>=1.24.0",
        "pandas>=2.0.0",
        "scikit-learn>=1.3.0",
        "lightgbm>=4.0.0",
        "sentence-transformers>=2.2.0",
        "pydantic>=2.0.0",
        "pyyaml>=6.0",
        "click>=8.1.0",
        "rich>=13.0.0",
        "tqdm>=4.66.0",
        "tenacity>=8.2.0",
    ],
    extras_require={
        # Email providers (optional)
        "gmail": [
            "google-api-python-client>=2.100.0",
            "google-auth-httplib2>=0.1.1",
            "google-auth-oauthlib>=1.1.0",
        ],
        "microsoft": [
            "msal>=1.24.0",
        ],
        "imap": [
            "imapclient>=2.3.1",
        ],

        # LLM providers (optional)
        "ollama": [
            "ollama>=0.1.0",
        ],
        "openai": [
            "openai>=1.0.0",
        ],

        # Attachment processing (optional)
        "attachments": [
            "PyPDF2>=3.0.0",
            "python-docx>=0.8.11",
            "openpyxl>=3.0.10",
        ],

        # Development (optional)
        "dev": [
            "pytest>=7.4.0",
            "pytest-cov>=4.1.0",
            "pytest-mock>=3.11.0",
            "black>=23.0.0",
            "isort>=5.12.0",
        ],

        # All extras
        "all": [
            # Combines all above
        ]
    }
)

Installation options:

# Minimal (ML only, no LLM, no email providers)
pip install email-sorter

# With Gmail support
pip install email-sorter[gmail]

# With Ollama LLM
pip install email-sorter[ollama,gmail]

# Everything
pip install email-sorter[all]
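
Because every provider lives behind an extra, the runtime has to detect which optional packages are actually installed rather than assuming them; that detection is what makes the graceful degradation described earlier possible. A small sketch of the optional-import pattern (module layout and flag names are illustrative):

# src/llm/__init__.py (illustrative)
try:
    import ollama  # installed via email-sorter[ollama]
    HAS_OLLAMA = True
except ImportError:
    HAS_OLLAMA = False

try:
    from openai import OpenAI  # installed via email-sorter[openai]
    HAS_OPENAI = True
except ImportError:
    HAS_OPENAI = False

def available_llm_backends():
    """Report which optional LLM backends can actually be used on this install."""
    return [name for name, ok in (('ollama', HAS_OLLAMA), ('openai', HAS_OPENAI)) if ok]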

TESTING STRATEGY

Test Harness Structure

tests/
├── unit/
│   ├── test_feature_extraction.py
│   ├── test_pattern_matching.py
│   ├── test_embeddings.py
│   ├── test_lightgbm.py
│   └── test_attachment_analysis.py
├── integration/
│   ├── test_calibration.py
│   ├── test_ml_llm_pipeline.py
│   ├── test_gmail_provider.py
│   └── test_checkpoint_resume.py
├── e2e/
│   ├── test_full_pipeline_100.py
│   ├── test_full_pipeline_1000.py
│   └── test_full_pipeline_80k.py
├── fixtures/
│   ├── mock_emails.json
│   ├── mock_llm_responses.json
│   └── sample_inboxes/
└── conftest.py

Unit Tests

# tests/unit/test_feature_extraction.py
import pytest
from src.classification.feature_extractor import FeatureExtractor
from src.email_providers.base import Email

def test_pattern_extraction():
    email = Email(
        id='1',
        subject='Your verification code is 123456',
        sender='noreply@service.com',
        body='Your one-time password is 123456'
    )

    extractor = FeatureExtractor()
    patterns = extractor._extract_patterns(email)

    assert patterns['has_otp'] == True
    assert patterns['has_verification'] == True
    assert patterns['is_automated'] == True

def test_structured_embedding():
    email = Email(
        id='2',
        subject='Invoice #12345',
        sender='billing@company.com',
        body='Please find attached your invoice'
    )

    extractor = FeatureExtractor()
    text = extractor.build_embedding_text(email)

    assert '[EMAIL_METADATA]' in text
    assert '[DETECTED_PATTERNS]' in text
    assert 'has_invoice: True' in text

Integration Tests

# tests/integration/test_ml_llm_pipeline.py
def test_calibration_then_classification():
    # 1. Load sample emails
    emails = load_sample_emails(count=100)

    # 2. Run calibration (with mock LLM)
    calibrator = CalibrationPhase(mock_llm_provider)
    config = calibrator.run(emails)

    # 3. Train classifier
    classifier = HybridClassifier()
    classifier.train(emails, config['labels'])

    # 4. Classify new emails
    new_emails = load_sample_emails(count=20, exclude=emails)
    results = [classifier.predict(e) for e in new_emails]

    # 5. Assert accuracy
    accuracy = calculate_accuracy(results, ground_truth)
    assert accuracy > 0.85

E2E Tests

# tests/e2e/test_full_pipeline_100.py
def test_full_pipeline_100_emails(tmp_path):
    """End-to-end test on 100 emails"""
    # Setup
    output_dir = tmp_path / "results"
    emails = load_test_inbox(count=100)

    # Run full pipeline
    result = run_email_sorter(
        emails=emails,
        output=output_dir,
        config="tests/fixtures/test_config.yaml"
    )

    # Assertions
    assert result['total_processed'] == 100
    assert result['accuracy_estimate'] > 0.90
    assert (output_dir / "results.json").exists()
    assert (output_dir / "report.txt").exists()

PERFORMANCE EXPECTATIONS (Updated with Research)

For 80,000 emails:

Phase                     Time     Details
Calibration               3-5 min  1500 emails, qwen3:4b, train LightGBM
Pattern detection         ~10 sec  Regex on all 80k emails
Embedding generation      ~8 min   Batched, CPU, all 80k emails (sketch below)
LightGBM classification   ~3 sec   Fast inference
Hard rules auto-classify  instant  10% = 8,000 emails
LLM review (qwen3:1.7b)   ~4 min   5% = 4,000 emails, batched
Export & sync             ~2 min   JSON/CSV + Gmail API
TOTAL                     ~17 min
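
The ~8 minute embedding figure assumes batched encoding rather than one encode() call per email. A minimal sketch using sentence-transformers batching (the batch size and the patterns_by_id lookup are assumptions):

from sentence_transformers import SentenceTransformer

def embed_all(emails, patterns_by_id, batch_size=64):
    """Encode all emails in batches; far faster than per-email encode() calls."""
    embedder = SentenceTransformer('all-MiniLM-L6-v2')
    texts = [build_embedding_text(e, patterns_by_id[e.id]) for e in emails]
    return embedder.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
    )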

Accuracy Breakdown:

Component             Coverage  Accuracy
Hard rules            10%       99%
LightGBM (high conf)  85%       92%
LLM review            5%        95%
Overall               100%      94-96%

Memory Usage (80k emails):

  • Email data: ~400MB
  • Embeddings (cached): ~500MB
  • LightGBM model: ~5MB
  • MiniLM model: ~90MB
  • Peak: ~1.2GB

DISTRIBUTABLE WHEEL PACKAGING

Package Structure

email-sorter/
├── setup.py
├── setup.cfg
├── pyproject.toml
├── MANIFEST.in
├── README.md
├── LICENSE
├── src/
│   └── email_sorter/
│       ├── __init__.py
│       ├── __main__.py
│       ├── cli.py
│       └── ... (all modules)
├── config/
│   ├── default_config.yaml
│   ├── categories.yaml
│   └── llm_models.yaml
└── models/
    └── pretrained/
        ├── minilm-l6-v2/  (bundled embedder)
        └── lightgbm.pkl   (optional pre-trained)

Distribution Commands

# Build wheel
python setup.py sdist bdist_wheel

# Install locally
pip install dist/email_sorter-1.0.0-py3-none-any.whl

# Use as command
email-sorter --source gmail --credentials creds.json --output results/

# Or as module
python -m email_sorter --source gmail ...

CLI Interface

email-sorter --help

# Basic usage
email-sorter \
  --source gmail \
  --credentials credentials.json \
  --output results/

# Advanced options
email-sorter \
  --source gmail \
  --credentials creds.json \
  --output results/ \
  --config custom_config.yaml \
  --llm-provider ollama \
  --llm-model qwen3:1.7b \
  --limit 1000 \
  --no-calibrate \
  --dry-run

PROJECT STRUCTURE

email-sorter/
├── README.md
├── PROJECT_BLUEPRINT.md         # This file
├── BUILD_INSTRUCTIONS.md
├── RESEARCH_FINDINGS.md
├── setup.py
├── setup.cfg
├── pyproject.toml
├── requirements.txt
├── .gitignore
├── .env.example
├── config/
│   ├── default_config.yaml
│   ├── categories.yaml
│   ├── llm_models.yaml          # LLM config (single source)
│   └── features.yaml
├── src/
│   ├── __init__.py
│   ├── __main__.py
│   ├── cli.py                   # Click CLI
│   ├── calibration/
│   │   ├── __init__.py
│   │   ├── sampler.py           # Stratified sampling
│   │   ├── llm_analyzer.py      # LLM calibration
│   │   └── trainer.py           # Train LightGBM
│   ├── classification/
│   │   ├── __init__.py
│   │   ├── feature_extractor.py # Hybrid features
│   │   ├── pattern_matcher.py   # Hard rules
│   │   ├── embedder.py          # Sentence embeddings
│   │   ├── lightgbm_classifier.py
│   │   ├── adaptive_classifier.py
│   │   └── llm_classifier.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── pretrained/
│   │   │   └── .gitkeep
│   │   └── model_loader.py
│   ├── email_providers/
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── gmail.py
│   │   ├── microsoft.py
│   │   └── imap.py
│   ├── llm/
│   │   ├── __init__.py
│   │   ├── base.py              # Abstract provider
│   │   ├── ollama.py
│   │   └── openai.py
│   ├── processing/
│   │   ├── __init__.py
│   │   ├── bulk_processor.py
│   │   ├── attachment_handler.py
│   │   └── queue_manager.py
│   ├── adjustment/
│   │   ├── __init__.py
│   │   ├── threshold_adjuster.py
│   │   └── pattern_learner.py
│   ├── export/
│   │   ├── __init__.py
│   │   ├── results_exporter.py
│   │   ├── provider_sync.py
│   │   └── report_generator.py
│   └── utils/
│       ├── __init__.py
│       ├── config.py
│       ├── logging.py
│       └── cleanup.py
├── tests/
│   ├── unit/
│   ├── integration/
│   ├── e2e/
│   ├── fixtures/
│   └── conftest.py
├── prompts/
│   ├── calibration.txt
│   └── classification.txt
├── scripts/
│   ├── train_model.py
│   ├── verify_install.py
│   └── benchmark.py
├── data/
│   └── samples/
└── logs/
    └── .gitkeep

SECURITY & PRIVACY

  • All processing is local - No cloud uploads
  • LLM runs locally - Via Ollama (or optional OpenAI API)
  • Fresh clone per job - Complete isolation
  • No persistent storage - Email bodies never written to disk
  • Attachment content - Processed in memory, discarded immediately
  • Auto cleanup - Temp files deleted after processing
  • Credentials - Used directly, never cached
  • GDPR-friendly - No data retention or sharing


SUCCESS CRITERIA

  • Processes 80k emails in <20 minutes
  • 94-96% classification accuracy (competitive with cloud tools)
  • <5% of emails need LLM review
  • Successfully syncs back to Gmail/IMAP
  • No data leakage between jobs
  • Works on Windows, Linux, macOS
  • LLM is optional (graceful degradation)
  • Distributable as Python wheel
  • Attachment analysis working
  • OpenAI-compatible API support


WHAT'S NEXT

  1. ✓ Research complete (benchmarks, competition, LightGBM vs XGBoost)
  2. ⏭ Update BUILD_INSTRUCTIONS.md with new architecture
  3. ⏭ Create RESEARCH_FINDINGS.md with search results
  4. ⏭ Build core infrastructure (config, logging, data models)
  5. ⏭ Implement feature extraction (embeddings + patterns + attachments)
  6. ⏭ Create LightGBM classifier
  7. ⏭ Implement LLM providers (Ollama + OpenAI-compatible)
  8. ⏭ Build calibration system
  9. ⏭ Create test harness
  10. ⏭ Package as wheel
  11. ⏭ Test on Marion's 80k emails

END OF BLUEPRINT v2.0

This is the complete, research-backed architecture ready to build.