# Email Sorter: Project Roadmap & Learnings

## Document Purpose

This document captures learnings from the November 2025 research sessions and defines the project scope, the tool's role within a larger email processing ecosystem, and the development roadmap for 2025.

---

## Project Scope Definition

### What This Tool IS

**Email Sorter is a TRIAGE tool.** Its job is:

1. **Bulk classification** - Sort emails into buckets quickly
2. **Risk-based routing** - Flag high-stakes items for careful handling
3. **Downstream handoff** - Prepare emails for specialized processing tools

### What This Tool IS NOT

- Not a spam filter (trust Gmail/Outlook for that)
- Not a complete email management solution
- Not trying to do everything
- Not the final destination for any email

### Role in Larger Ecosystem

```
┌─────────────────────────────────────────────────────────────────┐
│                   EMAIL PROCESSING ECOSYSTEM                    │
└─────────────────────────────────────────────────────────────────┘

                  ┌──────────────┐
                  │  RAW INBOX   │  (Gmail, Outlook, IMAP)
                  │    10k+      │
                  └──────┬───────┘
                         │
                         ▼
                  ┌──────────────┐
                  │ SPAM FILTER  │  ← Trust existing provider (Gmail/Outlook)
                  │  (existing)  │
                  └──────┬───────┘
                         │
                         ▼
      ┌───────────────────────────────────────┐
      │      EMAIL SORTER (THIS TOOL)         │  ← TRIAGE/ROUTING
      │  ┌─────────────┐  ┌────────────────┐  │
      │  │ Agent Scan  │→ │ ML/LLM Classify│  │
      │  │ (discovery) │  │  (bulk sort)   │  │
      │  └─────────────┘  └────────────────┘  │
      └───────────────────┬───────────────────┘
                          │
            ┌─────────────┼─────────────┬─────────────┐
            ▼             ▼             ▼             ▼
      ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
      │   JUNK   │  │ ROUTINE  │  │ BUSINESS │  │ PERSONAL │
      │  BUCKET  │  │  BUCKET  │  │  BUCKET  │  │  BUCKET  │
      └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
           │             │             │             │
           ▼             ▼             ▼             ▼
      ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
      │  Batch   │  │  Batch   │  │ Knowledge│  │  Human   │
      │ Cleanup  │  │ Summary  │  │  Graph   │  │  Review  │
      │ (cheap)  │  │  Tool    │  │ Builder  │  │(careful) │
      └──────────┘  └──────────┘  └──────────┘  └──────────┘

            OTHER TOOLS IN ECOSYSTEM (not this project)
```

---

## Key Learnings from Research Sessions

### Session 1: brett-gmail (801 emails, Personal Inbox)

| Method | Accuracy | Time |
|--------|----------|------|
| ML-Only | 54.9% | ~5 sec |
| ML+LLM | 93.3% | ~3.5 min |
| Manual Agent | 99.8% | ~25 min |

### Session 2: brett-microsoft (596 emails, Business Inbox)

| Method | Accuracy | Time |
|--------|----------|------|
| Manual Agent | 98.2% | ~30 min |

**Key Insight:** Business inboxes require different classification approaches than personal inboxes.

---

### 1. ML Pipeline is Overkill for Small Datasets

| Dataset Size | Recommended Approach | Rationale |
|--------------|----------------------|-----------|
| <500 | Agent-only analysis | ML overhead exceeds benefit |
| 500-2000 | Agent pre-scan + ML | Discovery improves ML accuracy |
| 2000-10000 | ML + LLM fallback | Balanced speed/accuracy |
| >10000 | ML-only (fast mode) | Speed critical at scale |

**Evidence:** The 801-email dataset achieved 99.8% accuracy with a 25-minute agent analysis vs 54.9% with pure ML.
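
The thresholds above can be expressed as a simple routing function. This is an illustrative sketch only; the `PipelineMode` names are assumptions, not part of the current codebase, and the cutoffs mirror the table:

```python
from enum import Enum

class PipelineMode(Enum):
    AGENT_ONLY = "agent-only analysis"
    AGENT_PLUS_ML = "agent pre-scan + ML"
    ML_WITH_LLM = "ML + LLM fallback"
    ML_FAST = "ML-only (fast mode)"

def route_by_size(email_count: int) -> PipelineMode:
    """Pick a pipeline based on dataset size (thresholds from the table above)."""
    if email_count < 500:
        return PipelineMode.AGENT_ONLY
    if email_count <= 2000:
        return PipelineMode.AGENT_PLUS_ML
    if email_count <= 10000:
        return PipelineMode.ML_WITH_LLM
    return PipelineMode.ML_FAST
```

The table's boundary values (500, 2000, 10000) are ambiguous about which side they fall on; the sketch resolves that arbitrarily.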

### 2. Agent Pre-Scan Adds Massive Value

A 10-15 minute agent discovery phase before bulk classification:

- Identifies dominant sender domains
- Discovers subject patterns
- Suggests optimal categories for THIS dataset
- Can generate sender-to-category mappings

**This is NOT the same as the full manual analysis.** It's a quick reconnaissance pass.

### 3. Categories Should Serve Downstream Processing

Don't optimize for human-readable labels. Optimize for routing decisions:

| Category Type | Downstream Handler | Accuracy Need |
|---------------|--------------------|---------------|
| Junk/Marketing | Batch cleanup tool | LOW (errors OK) |
| Newsletters | Summary aggregator | MEDIUM |
| Transactional | Archive, searchable | MEDIUM |
| Business | Knowledge graph | HIGH |
| Personal | Human review | CRITICAL |
| Security | Never auto-filter | CRITICAL |

### 4. Risk-Based Accuracy Requirements

Not all emails need the same classification confidence:

```
HIGH STAKES (must not miss):
├─ Personal correspondence (sentimental value)
├─ Security alerts (account safety)
├─ Job applications (life-changing)
└─ Financial/legal documents

LOW STAKES (errors tolerable):
├─ Marketing promotions
├─ Newsletter digests
├─ Automated notifications
└─ Social media alerts
```

### 5. Spam Filtering is a Solved Problem

Don't reinvent spam filtering. Gmail and Outlook do it well. This tool should:

- Assume spam is already filtered
- Focus on categorizing legitimate mail
- Trust the upstream provider

If spam does get through, a simple secondary filter could catch obvious cases, but this is low priority.

### 6. Sender Domain is the Strongest Signal

From the 801-email analysis:

- Top 5 senders = 47.5% of all emails
- Sender domain alone could classify 80%+ of automated emails
- Subject patterns matter less than sender patterns

**Implication:** A sender-first classification approach could dramatically speed up processing.

### 7. Inbox Character Matters (NEW - Session 2)

**Critical Discovery:** Before classifying emails, assess the inbox CHARACTER:

| Inbox Type | Characteristics | Classification Approach |
|------------|-----------------|-------------------------|
| **Personal/Consumer** | Subscription-heavy, marketing-dominant, 40-50% automated | Sender domain first |
| **Business/Professional** | Client work, operations, developer tools (60-70%) | Sender + subject context |
| **Mixed** | Both patterns present | Hybrid approach needed |

**Evidence from brett-microsoft analysis:**

- 73.2% Business/Professional content
- Only 8.2% Personal content
- Required client relationship tracking
- Support case ID extraction valuable

**Implications for Agent Pre-Scan:**

1. First determine inbox character (business vs personal vs mixed)
2. Select appropriate category templates
3. Business inboxes need relationship context, not just sender domains
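
The character check could start as a crude ratio heuristic over the pre-scan sample. The field names and thresholds below are assumptions for illustration, not measured values:

```python
def assess_inbox_character(sample: list[dict]) -> str:
    """Classify an inbox as personal, business, or mixed from a sampled batch.

    Each dict is assumed to carry a boolean 'is_automated' flag produced by
    the pre-scan sender analysis.
    """
    if not sample:
        return "mixed"
    automated = sum(1 for e in sample if e.get("is_automated")) / len(sample)
    # Hypothetical thresholds: subscription-heavy personal inboxes skew automated.
    if automated >= 0.45:
        return "personal"
    if automated <= 0.25:
        return "business"
    return "mixed"
```

A real implementation would likely combine several signals (automated ratio, domain diversity, thread depth) rather than a single cutoff.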

### 8. Business Inboxes Need Special Handling (NEW - Session 2)

Business/professional inboxes require additional classification dimensions:

**Client Relationship Tracking:**

- Same domain may have different contexts (internal vs external)
- Client conversations span multiple senders
- Subject threading matters more than in consumer inboxes

**Support Case ID Extraction:**

- Business inboxes often carry case/ticket IDs connecting emails
- Microsoft: Case #, TrackingID#
- Other vendors: ticket numbers, reference IDs
- ID extraction should be a first-class feature
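
Case ID extraction can start as a small regex pass over subject lines. The patterns below cover the Microsoft-style IDs mentioned above plus a generic bracketed ticket form; they are illustrative and would need tuning per vendor:

```python
import re

# Illustrative patterns: Microsoft "Case #1234567" / "TrackingID#..."
# plus a generic "[TICKET-1234]" form. Real inboxes will need more.
CASE_ID_PATTERNS = [
    re.compile(r"Case\s*#\s*(\d+)", re.IGNORECASE),
    re.compile(r"TrackingID\s*#\s*(\w+)", re.IGNORECASE),
    re.compile(r"\[([A-Z]+-\d+)\]"),
]

def extract_case_ids(subject: str) -> list[str]:
    """Return all case/ticket IDs found in a subject line."""
    ids = []
    for pattern in CASE_ID_PATTERNS:
        ids.extend(pattern.findall(subject))
    return ids
```

Extracted IDs can then serve as thread keys, letting the classifier group a whole support conversation even when senders vary.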

**Accuracy Expectations:**

- Personal inboxes: 99%+ achievable with sender-first
- Business inboxes: 95-98% achievable (more nuanced)
- Accept the lower accuracy ceiling and invest in risk-flagging

### 9. Multi-Inbox Analysis Reveals Patterns (NEW - Session 2)

Analyzing multiple inboxes from the same user reveals:

- **Inbox segregation patterns** - Gmail for personal, Outlook for business
- **Cross-inbox senders** - Security alerts appear in both
- **Category overlap** - Some categories are universal, some inbox-specific

**Implication:** A future feature could merge analysis across inboxes to build a complete user profile.

---

## Technical Architecture (Refined)

### Current State

```
Email Source → LocalFileParser → FeatureExtractor → ML Classifier → Output
                                                          │
                                                          └→ LLM Fallback (if low confidence)
```

### Target State (2025)

```
Email Source
     │
     ▼
┌─────────────────────────────────────────────────────────────┐
│                       ROUTING LAYER                         │
│     Check dataset size → Route to appropriate pipeline      │
└─────────────────────────────────────────────────────────────┘
     │
     ├─── <500 emails ────→ Agent-Only Analysis
     │
     ├─── 500-5000 ───────→ Agent Pre-Scan + ML Pipeline
     │
     └─── >5000 ──────────→ ML Pipeline (optional LLM)

Each pipeline outputs:
- Categorized emails (with confidence)
- Risk flags (high-stakes items)
- Routing recommendations
- Insights report
```

### Agent Pre-Scan Module (NEW)

```python
from typing import List

# Design sketch: the Email and PreScanResult types, and the helper methods
# called below, are to be defined when this module is built.

class AgentPreScan:
    """
    Quick discovery phase before bulk classification.
    Time budget: 10-15 minutes.
    """

    def scan(self, emails: List[Email]) -> PreScanResult:
        # 1. Sender domain analysis (2 min)
        sender_stats = self.analyze_senders(emails)

        # 2. Subject pattern detection (3 min)
        patterns = self.detect_patterns(emails, sample_size=100)

        # 3. Category suggestions (5 min, uses LLM)
        categories = self.suggest_categories(sender_stats, patterns)

        # 4. Generate sender map (2 min)
        sender_map = self.create_sender_mapping(sender_stats, categories)

        return PreScanResult(
            sender_stats=sender_stats,
            patterns=patterns,
            suggested_categories=categories,
            sender_map=sender_map,
            estimated_distribution=self.estimate_distribution(emails, categories),
        )
```
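
Of the four steps, sender analysis is the most mechanical. A runnable sketch using only the standard library follows; the return-dict shape and function name are assumptions, not the module's final API:

```python
from collections import Counter
from email.utils import parseaddr

def analyze_senders(from_headers: list[str]) -> dict:
    """Compute per-domain volume statistics from raw From: headers."""
    domains = []
    for header in from_headers:
        _, address = parseaddr(header)
        if "@" in address:
            domains.append(address.rsplit("@", 1)[-1].lower())
    counts = Counter(domains)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {
        "total": total,
        "top_senders": counts.most_common(5),
        # Share of mail from the top 5 domains (47.5% in the brett-gmail data).
        "top_share": sum(n for _, n in counts.most_common(5)) / total,
    }
```
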

---

## Development Roadmap

### Phase 0: Documentation Complete (NOW)

- [x] Research session findings documented
- [x] Classification methods comparison written
- [x] Project scope defined
- [x] This roadmap created

### Phase 1: Quick Wins (Q1 2025, 4-8 hours)

1. **Dataset size routing**
   - Auto-detect email count
   - Route small datasets to agent analysis
   - Route large datasets to ML pipeline

2. **Sender-first classification**
   - Extract sender domain
   - Check against known sender map
   - Skip ML for known high-volume senders

3. **Risk flagging**
   - Flag low-confidence results
   - Flag potential personal emails
   - Flag security-related emails
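
The risk-flagging item could begin as a rules pass driven by the `risk_detection` settings in the internal config. Everything here (field names, flag strings, the hard-coded indicator lists) is an illustrative assumption:

```python
def flag_risks(email: dict, classification: dict) -> list[str]:
    """Return risk flags for one classified email.

    `email` is assumed to carry a 'sender' field; `classification` carries
    'category' and 'confidence'. The indicator lists mirror the
    risk_detection config sketch.
    """
    personal_indicators = ["gmail.com", "hotmail.com", "outlook.com"]
    security_senders = ["accounts.google.com", "security@"]
    flags = []
    sender = email.get("sender", "").lower()
    if classification.get("confidence", 0.0) < 0.55:
        flags.append("low-confidence")
    if any(ind in sender for ind in personal_indicators):
        flags.append("possibly-personal")
    if any(ind in sender for ind in security_senders):
        flags.append("security")
    return flags
```

In production the lists would come from config rather than being hard-coded, so each mailbox can tune its own indicators.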

### Phase 2: Agent Pre-Scan (Q1 2025, 8-16 hours)

1. **Sender analysis module**
   - Cluster by domain
   - Calculate volume statistics
   - Identify automated vs personal

2. **Pattern detection module**
   - Sample subject lines
   - Find templates and IDs
   - Detect lifecycle stages

3. **Category suggestion module**
   - Use LLM to suggest categories
   - Based on sender/pattern analysis
   - Output category definitions

4. **Sender mapping module**
   - Map senders to suggested categories
   - Output as JSON for pipeline use
   - Support manual overrides

### Phase 3: Integration & Polish (Q2 2025)

1. **Unified CLI**
   - Single command handles all dataset sizes
   - Progress reporting
   - Configurable verbosity

2. **Output standardization**
   - Common format for all pipelines
   - Include routing recommendations
   - Include confidence and risk flags

3. **Ecosystem integration**
   - Define handoff format for downstream tools
   - Document API for other tools to consume
   - Create example integrations

### Phase 4: Scale Testing (Q2-Q3 2025)

1. **Test on real 10k+ mailboxes**
   - Multiple users, different patterns
   - Measure accuracy vs speed
   - Refine thresholds

2. **Pattern library**
   - Accumulate patterns from multiple mailboxes
   - Build reusable sender maps
   - Create category templates

3. **Feedback loop**
   - Track classification accuracy
   - Learn from corrections
   - Improve over time

---

## Configuration Philosophy

### User-Facing Config (Keep Simple)

```yaml
# config/user_config.yaml
mode: auto            # auto | agent | ml | hybrid
risk_threshold: high  # low | medium | high
output_format: json   # json | csv | html
```

### Internal Config (Full Control)

```yaml
# config/advanced_config.yaml
routing:
  small_threshold: 500
  medium_threshold: 5000

agent_prescan:
  enabled: true
  time_budget_minutes: 15
  sample_size: 100

ml_pipeline:
  confidence_threshold: 0.55
  llm_fallback: true
  batch_size: 512

risk_detection:
  personal_indicators: [gmail.com, hotmail.com, outlook.com]
  security_senders: [accounts.google.com, security@]
  high_stakes_keywords: [urgent, important, legal, contract]
```

---

## Success Metrics

### For This Tool

| Metric | Target | Current |
|--------|--------|---------|
| Classification accuracy (large datasets) | >85% | 54.9% (ML), 93.3% (ML+LLM) |
| Processing speed (10k emails) | <5 min | ~24 sec (ML-only) |
| High-stakes miss rate | <1% | Not measured |
| Setup time for new mailbox | <20 min | Variable |

### For Ecosystem

| Metric | Target |
|--------|--------|
| End-to-end mailbox processing | <2 hours for 10k |
| User intervention needed | <10% of emails |
| Downstream tool compatibility | 100% |

---

## Open Questions (To Resolve in 2025)

1. **Category standardization**: Should categories be fixed across all users, or discovered per-mailbox?

2. **Sender map sharing**: Can sender maps be shared across users? What are the privacy implications?

3. **Incremental processing**: How should new emails added to already-processed mailboxes be handled?

4. **Multi-account support**: How should multiple email accounts belonging to the same user be handled?

5. **Feedback integration**: How do corrections feed back into the system?

---

## Files Created During Research

### Session 1 (brett-gmail, Personal Inbox)

| File | Purpose |
|------|---------|
| `tools/brett_gmail_analyzer.py` | Custom analyzer for personal inbox |
| `tools/generate_html_report.py` | HTML report generator |
| `data/brett_gmail_analysis.json` | Analysis data output |
| `docs/CLASSIFICATION_METHODS_COMPARISON.md` | Method comparison |
| `docs/REPORT_FORMAT.md` | HTML report documentation |
| `docs/SESSION_HANDOVER_20251128.md` | Session 1 handover |

### Session 2 (brett-microsoft, Business Inbox)

| File | Purpose |
|------|---------|
| `tools/brett_microsoft_analyzer.py` | Custom analyzer for business inbox |
| `data/brett_microsoft_analysis.json` | Analysis data output |
| `/home/bob/.../brett-ms-sorter/BRETT_MICROSOFT_ANALYSIS_REPORT.md` | Full analysis report |

---

## Summary

**Email Sorter is a triage tool, not a complete solution.**

Its job is to quickly sort emails into buckets so that specialized downstream tools can handle each bucket appropriately. The key insight from these research sessions is that an agent pre-scan phase, even just 10-15 minutes, dramatically improves classification accuracy at any dataset size.

The ML pipeline is valuable at scale (10k+ emails) but overkill for smaller datasets. Risk-based accuracy means we can tolerate errors on junk but must be careful with personal correspondence.

2025 development should focus on:

1. Smart routing based on dataset size
2. Agent pre-scan for discovery
3. Standardized output for ecosystem integration
4. Scale testing on real large mailboxes

---

*Document Version: 1.1*

*Created: 2025-11-28*

*Updated: 2025-11-28 (Session 2 learnings)*

*Sessions: brett-gmail (801 emails, personal), brett-microsoft (596 emails, business)*
|