email-sorter/docs/PROJECT_ROADMAP_2025.md

Email Sorter: Project Roadmap & Learnings

Document Purpose

This document captures learnings from the November 2025 research session and defines the project scope, role within a larger email processing ecosystem, and development roadmap for 2025.


Project Scope Definition

What This Tool IS

Email Sorter is a TRIAGE tool. Its job is:

  1. Bulk classification - Sort emails into buckets quickly
  2. Risk-based routing - Flag high-stakes items for careful handling
  3. Downstream handoff - Prepare emails for specialized processing tools

What This Tool IS NOT

  • Not a spam filter (trust Gmail/Outlook for that)
  • Not a complete email management solution
  • Not trying to do everything
  • Not the final destination for any email

Role in Larger Ecosystem

┌─────────────────────────────────────────────────────────────────┐
│                    EMAIL PROCESSING ECOSYSTEM                    │
└─────────────────────────────────────────────────────────────────┘

     ┌──────────────┐
     │  RAW INBOX   │  (Gmail, Outlook, IMAP)
     │   10k+       │
     └──────┬───────┘
            │
            ▼
     ┌──────────────┐
     │ SPAM FILTER  │  ← Trust existing provider (Gmail/Outlook)
     │  (existing)  │
     └──────┬───────┘
            │
            ▼
┌───────────────────────────────────────┐
│         EMAIL SORTER (THIS TOOL)      │  ← TRIAGE/ROUTING
│  ┌─────────────┐  ┌────────────────┐  │
│  │ Agent Scan  │→ │ ML/LLM Classify│  │
│  │ (discovery) │  │ (bulk sort)    │  │
│  └─────────────┘  └────────────────┘  │
└───────────────────┬───────────────────┘
                    │
      ┌─────────────┼─────────────┬─────────────┐
      ▼             ▼             ▼             ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│  JUNK    │ │ ROUTINE  │ │ BUSINESS │ │ PERSONAL │
│  BUCKET  │ │  BUCKET  │ │  BUCKET  │ │  BUCKET  │
└────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
     │            │            │            │
     ▼            ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│  Batch   │ │  Batch   │ │ Knowledge│ │  Human   │
│ Cleanup  │ │ Summary  │ │  Graph   │ │  Review  │
│  (cheap) │ │  Tool    │ │  Builder │ │(careful) │
└──────────┘ └──────────┘ └──────────┘ └──────────┘

     OTHER TOOLS IN ECOSYSTEM (not this project)

Key Learnings from Research Sessions

Session 1: brett-gmail (801 emails, Personal Inbox)

| Method | Accuracy | Time |
|---|---|---|
| ML-Only | 54.9% | ~5 sec |
| ML+LLM | 93.3% | ~3.5 min |
| Manual Agent | 99.8% | ~25 min |

Session 2: brett-microsoft (596 emails, Business Inbox)

| Method | Accuracy | Time |
|---|---|---|
| Manual Agent | 98.2% | ~30 min |

Key Insight: Business inboxes require different classification approaches than personal inboxes.


1. ML Pipeline is Overkill for Small Datasets

| Dataset Size | Recommended Approach | Rationale |
|---|---|---|
| <500 | Agent-only analysis | ML overhead exceeds benefit |
| 500-2000 | Agent pre-scan + ML | Discovery improves ML accuracy |
| 2000-10000 | ML + LLM fallback | Balanced speed/accuracy |
| >10000 | ML-only (fast mode) | Speed critical at scale |

Evidence: 801-email dataset achieved 99.8% accuracy with 25-min agent analysis vs 54.9% with pure ML.
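The size-based routing above can be sketched as a simple dispatcher. Pipeline names here are illustrative placeholders; the thresholds come straight from the table:

```python
def choose_pipeline(email_count: int) -> str:
    """Route a mailbox to a pipeline based on dataset size.

    Thresholds mirror the table above; pipeline names are illustrative.
    """
    if email_count < 500:
        return "agent_only"
    if email_count <= 2000:
        return "agent_prescan_plus_ml"
    if email_count <= 10000:
        return "ml_with_llm_fallback"
    return "ml_fast"
```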

2. Agent Pre-Scan Adds Massive Value

A 10-15 minute agent discovery phase before bulk classification:

  • Identifies dominant sender domains
  • Discovers subject patterns
  • Suggests optimal categories for THIS dataset
  • Can generate sender-to-category mappings

This is NOT the same as the full manual analysis. It's a quick reconnaissance pass.

3. Categories Should Serve Downstream Processing

Don't optimize for human-readable labels. Optimize for routing decisions:

| Category Type | Downstream Handler | Accuracy Need |
|---|---|---|
| Junk/Marketing | Batch cleanup tool | LOW (errors OK) |
| Newsletters | Summary aggregator | MEDIUM |
| Transactional | Archive, searchable | MEDIUM |
| Business | Knowledge graph | HIGH |
| Personal | Human review | CRITICAL |
| Security | Never auto-filter | CRITICAL |
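The category-to-handler mapping above could be expressed as a small routing table. Handler and category names are hypothetical placeholders for the ecosystem tools in the diagram, and unknown categories fall through to human review as the safest default:

```python
# Illustrative mapping from category to downstream handler (names hypothetical).
DOWNSTREAM = {
    "junk": "batch_cleanup",
    "newsletter": "summary_aggregator",
    "transactional": "searchable_archive",
    "business": "knowledge_graph_builder",
    "personal": "human_review",
    "security": "human_review",  # never auto-filter
}

def route(category: str) -> str:
    """Return the downstream handler for a category; unknown -> safest handler."""
    return DOWNSTREAM.get(category, "human_review")
```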

4. Risk-Based Accuracy Requirements

Not all emails need the same classification confidence:

HIGH STAKES (must not miss):
├─ Personal correspondence (sentimental value)
├─ Security alerts (account safety)
├─ Job applications (life-changing)
└─ Financial/legal documents

LOW STAKES (errors tolerable):
├─ Marketing promotions
├─ Newsletter digests
├─ Automated notifications
└─ Social media alerts
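A minimal risk-tiering sketch based on the two groups above. The category slugs are illustrative, and anything unrecognized defaults to medium as a safety choice:

```python
# Illustrative category slugs mirroring the HIGH/LOW stakes lists above.
HIGH_STAKES = {"personal", "security", "job_application", "financial_legal"}
LOW_STAKES = {"marketing", "newsletter", "notification", "social"}

def risk_tier(category: str) -> str:
    """Map a category to a risk tier; unknown categories default to medium."""
    if category in HIGH_STAKES:
        return "high"
    if category in LOW_STAKES:
        return "low"
    return "medium"
```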

5. Spam Filtering is a Solved Problem

Don't reinvent spam filtering. Gmail and Outlook do it well. This tool should:

  • Assume spam is already filtered
  • Focus on categorizing legitimate mail
  • Trust the upstream provider

If spam does get through, a simple secondary filter could catch obvious cases, but this is low priority.

6. Sender Domain is the Strongest Signal

From the 801-email analysis:

  • Top 5 senders = 47.5% of all emails
  • Sender domain alone could classify 80%+ of automated emails
  • Subject patterns matter less than sender patterns

Implication: A sender-first classification approach could dramatically speed up processing.
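A sender-first pass might look like this sketch. The map entries are hypothetical examples of what a pre-scan would learn; anything not in the map falls through to the ML pipeline:

```python
import re

def sender_domain(from_header: str) -> str:
    """Extract the lowercased domain from a From: header value."""
    match = re.search(r"@([\w.-]+)", from_header)
    return match.group(1).lower() if match else ""

# Hypothetical sender map, as a pre-scan would generate it.
SENDER_MAP = {
    "mailer.netflix.com": "marketing",
    "github.com": "notification",
}

def classify_by_sender(from_header: str):
    """Return a category for a known sender, else None (fall through to ML)."""
    return SENDER_MAP.get(sender_domain(from_header))
```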

7. Inbox Character Matters (NEW - Session 2)

Critical Discovery: Before classifying emails, assess the inbox CHARACTER:

| Inbox Type | Characteristics | Classification Approach |
|---|---|---|
| Personal/Consumer | Subscription-heavy, marketing-dominant, ~40-50% automated | Sender domain first |
| Business/Professional | Client work, operations, developer tools (~60-70%) | Sender + subject context |
| Mixed | Both patterns present | Hybrid approach needed |

Evidence from brett-microsoft analysis:

  • 73.2% Business/Professional content
  • Only 8.2% Personal content
  • Required client relationship tracking
  • Support case ID extraction valuable

Implications for Agent Pre-Scan:

  1. First determine inbox character (business vs personal vs mixed)
  2. Select appropriate category templates
  3. Business inboxes need relationship context, not just sender domains
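The character check in step 1 could be a simple threshold rule over estimated category shares. The 0.6 cutoff is an assumption, loosely anchored to the 73.2% business figure above:

```python
def inbox_character(category_shares: dict) -> str:
    """Classify inbox character from category share estimates in [0, 1].

    The 0.6 cutoff is an assumed threshold, not a measured one.
    """
    business = category_shares.get("business", 0.0)
    personal = category_shares.get("personal", 0.0)
    if business >= 0.6:
        return "business"
    if personal >= 0.6:
        return "personal"
    return "mixed"
```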

8. Business Inboxes Need Special Handling (NEW - Session 2)

Business/professional inboxes require additional classification dimensions:

Client Relationship Tracking:

  • Same domain may have different contexts (internal vs external)
  • Client conversations span multiple senders
  • Subject threading matters more than in consumer inboxes

Support Case ID Extraction:

  • Business inboxes often have case/ticket IDs connecting emails
  • Microsoft: Case #, TrackingID#
  • Other vendors: Ticket numbers, reference IDs
  • ID extraction should be first-class feature
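ID extraction can be a small regex pass over subject lines. The patterns below are illustrative guesses at the `Case #` and `TrackingID#` formats mentioned above, not confirmed vendor formats:

```python
import re

# Illustrative patterns for the Case # / TrackingID# formats noted above.
CASE_ID_PATTERNS = [
    re.compile(r"\bCase\s*#?\s*(\d{6,})", re.IGNORECASE),
    re.compile(r"\bTrackingID#?\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE),
]

def extract_case_ids(subject: str) -> list:
    """Pull support case / tracking IDs from a subject line."""
    ids = []
    for pattern in CASE_ID_PATTERNS:
        ids.extend(pattern.findall(subject))
    return ids
```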

Accuracy Expectations:

  • Personal inboxes: 99%+ achievable with sender-first
  • Business inboxes: 95-98% achievable (more nuanced)
  • Accept lower accuracy ceiling, invest in risk-flagging

9. Multi-Inbox Analysis Reveals Patterns (NEW - Session 2)

Analyzing multiple inboxes from same user reveals:

  • Inbox segregation patterns - Gmail for personal, Outlook for business
  • Cross-inbox senders - Security alerts appear in both
  • Category overlap - Some categories universal, some inbox-specific

Implication: Future feature could merge analysis across inboxes to build complete user profile.


Technical Architecture (Refined)

Current State

Email Source → LocalFileParser → FeatureExtractor → ML Classifier → Output
                                      │
                                      └→ LLM Fallback (if low confidence)

Target State (2025)

Email Source
     │
     ▼
┌─────────────────────────────────────────────────────────────┐
│                    ROUTING LAYER                             │
│  Check dataset size → Route to appropriate pipeline          │
└─────────────────────────────────────────────────────────────┘
     │
     ├─── <500 emails ────→ Agent-Only Analysis
     │
     ├─── 500-5000 ───────→ Agent Pre-Scan + ML Pipeline
     │
     └─── >5000 ──────────→ ML Pipeline (optional LLM)

Each pipeline outputs:
  - Categorized emails (with confidence)
  - Risk flags (high-stakes items)
  - Routing recommendations
  - Insights report
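The common output record might be sketched as a dataclass. Field names are illustrative, matching the four outputs listed above:

```python
from dataclasses import dataclass, field

@dataclass
class ClassifiedEmail:
    """One row of the common pipeline output (field names are illustrative)."""
    message_id: str
    category: str
    confidence: float
    risk_flags: list = field(default_factory=list)  # high-stakes markers
    routing: str = ""  # downstream handler recommendation
```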

Agent Pre-Scan Module (NEW)

from typing import List  # Email and PreScanResult are project-defined types

class AgentPreScan:
    """
    Quick discovery phase before bulk classification.
    Time budget: 10-15 minutes.
    """

    def scan(self, emails: List[Email]) -> PreScanResult:
        # 1. Sender domain analysis (2 min)
        sender_stats = self.analyze_senders(emails)

        # 2. Subject pattern detection (3 min)
        patterns = self.detect_patterns(emails, sample_size=100)

        # 3. Category suggestions (5 min, uses LLM)
        categories = self.suggest_categories(sender_stats, patterns)

        # 4. Generate sender map (2 min)
        sender_map = self.create_sender_mapping(sender_stats, categories)

        return PreScanResult(
            sender_stats=sender_stats,
            patterns=patterns,
            suggested_categories=categories,
            sender_map=sender_map,
            estimated_distribution=self.estimate_distribution(emails, categories)
        )

Development Roadmap

Phase 0: Documentation Complete (NOW)

  • Research session findings documented
  • Classification methods comparison written
  • Project scope defined
  • This roadmap created

Phase 1: Quick Wins (Q1 2025, 4-8 hours)

  1. Dataset size routing

    • Auto-detect email count
    • Route small datasets to agent analysis
    • Route large datasets to ML pipeline
  2. Sender-first classification

    • Extract sender domain
    • Check against known sender map
    • Skip ML for known high-volume senders
  3. Risk flagging

    • Flag low-confidence results
    • Flag potential personal emails
    • Flag security-related emails

Phase 2: Agent Pre-Scan (Q1 2025, 8-16 hours)

  1. Sender analysis module

    • Cluster by domain
    • Calculate volume statistics
    • Identify automated vs personal
  2. Pattern detection module

    • Sample subject lines
    • Find templates and IDs
    • Detect lifecycle stages
  3. Category suggestion module

    • Use LLM to suggest categories
    • Based on sender/pattern analysis
    • Output category definitions
  4. Sender mapping module

    • Map senders to suggested categories
    • Output as JSON for pipeline use
    • Support manual overrides

Phase 3: Integration & Polish (Q2 2025)

  1. Unified CLI

    • Single command handles all dataset sizes
    • Progress reporting
    • Configurable verbosity
  2. Output standardization

    • Common format for all pipelines
    • Include routing recommendations
    • Include confidence and risk flags
  3. Ecosystem integration

    • Define handoff format for downstream tools
    • Document API for other tools to consume
    • Create example integrations

Phase 4: Scale Testing (Q2-Q3 2025)

  1. Test on real 10k+ mailboxes

    • Multiple users, different patterns
    • Measure accuracy vs speed
    • Refine thresholds
  2. Pattern library

    • Accumulate patterns from multiple mailboxes
    • Build reusable sender maps
    • Create category templates
  3. Feedback loop

    • Track classification accuracy
    • Learn from corrections
    • Improve over time

Configuration Philosophy

User-Facing Config (Keep Simple)

# config/user_config.yaml
mode: auto          # auto | agent | ml | hybrid
risk_threshold: high  # low | medium | high
output_format: json   # json | csv | html

Internal Config (Full Control)

# config/advanced_config.yaml
routing:
  small_threshold: 500
  medium_threshold: 5000

agent_prescan:
  enabled: true
  time_budget_minutes: 15
  sample_size: 100

ml_pipeline:
  confidence_threshold: 0.55
  llm_fallback: true
  batch_size: 512

risk_detection:
  personal_indicators: [gmail.com, hotmail.com, outlook.com]
  security_senders: [accounts.google.com, security@]
  high_stakes_keywords: [urgent, important, legal, contract]
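One way to keep the two layers honest is to deep-merge the simple user config over the full internal defaults. This sketch assumes both files have already been parsed into dicts (e.g. via PyYAML's `safe_load`):

```python
def merge_config(defaults: dict, overrides: dict) -> dict:
    """Deep-merge user overrides onto internal defaults (sketch).

    Nested dicts are merged recursively; scalars and lists are replaced.
    """
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged
```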

Success Metrics

For This Tool

| Metric | Target | Current |
|---|---|---|
| Classification accuracy (large datasets) | >85% | 54.9% (ML), 93.3% (ML+LLM) |
| Processing speed (10k emails) | <5 min | ~24 sec (ML-only) |
| High-stakes miss rate | <1% | Not measured |
| Setup time for new mailbox | <20 min | Variable |

For Ecosystem

| Metric | Target |
|---|---|
| End-to-end mailbox processing | <2 hours for 10k |
| User intervention needed | <10% of emails |
| Downstream tool compatibility | 100% |

Open Questions (To Resolve in 2025)

  1. Category standardization: Should categories be fixed across all users, or discovered per-mailbox?

  2. Sender map sharing: Can sender maps be shared across users? Privacy implications?

  3. Incremental processing: How to handle new emails added to already-processed mailboxes?

  4. Multi-account support: Same user, multiple email accounts?

  5. Feedback integration: How do corrections feed back into the system?


Files Created During Research

Session 1 (brett-gmail, Personal Inbox)

| File | Purpose |
|---|---|
| tools/brett_gmail_analyzer.py | Custom analyzer for personal inbox |
| tools/generate_html_report.py | HTML report generator |
| data/brett_gmail_analysis.json | Analysis data output |
| docs/CLASSIFICATION_METHODS_COMPARISON.md | Method comparison |
| docs/REPORT_FORMAT.md | HTML report documentation |
| docs/SESSION_HANDOVER_20251128.md | Session 1 handover |

Session 2 (brett-microsoft, Business Inbox)

| File | Purpose |
|---|---|
| tools/brett_microsoft_analyzer.py | Custom analyzer for business inbox |
| data/brett_microsoft_analysis.json | Analysis data output |
| /home/bob/.../brett-ms-sorter/BRETT_MICROSOFT_ANALYSIS_REPORT.md | Full analysis report |

Summary

Email Sorter is a triage tool, not a complete solution.

Its job is to quickly sort emails into buckets so that specialized downstream tools can handle each bucket appropriately. The key insight from these research sessions is that an agent pre-scan phase, even just 10-15 minutes, dramatically improves classification accuracy at any dataset size.

The ML pipeline is valuable for scale (10k+ emails) but overkill for smaller datasets. Risk-based accuracy means we can tolerate errors on junk but must be careful with personal correspondence.

2025 development should focus on:

  1. Smart routing based on dataset size
  2. Agent pre-scan for discovery
  3. Standardized output for ecosystem integration
  4. Scale testing on real large mailboxes

Document Version: 1.1
Created: 2025-11-28
Updated: 2025-11-28 (Session 2 learnings)
Sessions: brett-gmail (801 emails, personal), brett-microsoft (596 emails, business)