email-sorter/docs/PROJECT_ROADMAP_2025.md

Email Sorter: Project Roadmap & Learnings

Document Purpose

This document captures learnings from the November 2025 research session and defines the project scope, role within a larger email processing ecosystem, and development roadmap for 2025.


Project Scope Definition

What This Tool IS

Email Sorter is a TRIAGE tool. Its job is:

  1. Bulk classification - Sort emails into buckets quickly
  2. Risk-based routing - Flag high-stakes items for careful handling
  3. Downstream handoff - Prepare emails for specialized processing tools

What This Tool IS NOT

  • Not a spam filter (trust Gmail/Outlook for that)
  • Not a complete email management solution
  • Not trying to do everything
  • Not the final destination for any email

Role in Larger Ecosystem

┌─────────────────────────────────────────────────────────────────┐
│                    EMAIL PROCESSING ECOSYSTEM                    │
└─────────────────────────────────────────────────────────────────┘

     ┌──────────────┐
     │  RAW INBOX   │  (Gmail, Outlook, IMAP)
     │   10k+       │
     └──────┬───────┘
            │
            ▼
     ┌──────────────┐
     │ SPAM FILTER  │  ← Trust existing provider (Gmail/Outlook)
     │  (existing)  │
     └──────┬───────┘
            │
            ▼
┌───────────────────────────────────────┐
│         EMAIL SORTER (THIS TOOL)      │  ← TRIAGE/ROUTING
│  ┌─────────────┐  ┌────────────────┐  │
│  │ Agent Scan  │→ │ ML/LLM Classify│  │
│  │ (discovery) │  │ (bulk sort)    │  │
│  └─────────────┘  └────────────────┘  │
└───────────────────┬───────────────────┘
                    │
      ┌─────────────┼─────────────┬─────────────┐
      ▼             ▼             ▼             ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│  JUNK    │ │ ROUTINE  │ │ BUSINESS │ │ PERSONAL │
│  BUCKET  │ │  BUCKET  │ │  BUCKET  │ │  BUCKET  │
└────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
     │            │            │            │
     ▼            ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│  Batch   │ │  Batch   │ │ Knowledge│ │  Human   │
│ Cleanup  │ │ Summary  │ │  Graph   │ │  Review  │
│  (cheap) │ │  Tool    │ │  Builder │ │(careful) │
└──────────┘ └──────────┘ └──────────┘ └──────────┘

     OTHER TOOLS IN ECOSYSTEM (not this project)

Key Learnings from Research Sessions

Session 1: brett-gmail (801 emails, Personal Inbox)

| Method | Accuracy | Time |
|---|---|---|
| ML-Only | 54.9% | ~5 sec |
| ML+LLM | 93.3% | ~3.5 min |
| Manual Agent | 99.8% | ~25 min |

Session 2: brett-microsoft (596 emails, Business Inbox)

| Method | Accuracy | Time |
|---|---|---|
| Manual Agent | 98.2% | ~30 min |

Key Insight: Business inboxes require different classification approaches than personal inboxes.


1. ML Pipeline is Overkill for Small Datasets

| Dataset Size | Recommended Approach | Rationale |
|---|---|---|
| <500 | Agent-only analysis | ML overhead exceeds benefit |
| 500-2000 | Agent pre-scan + ML | Discovery improves ML accuracy |
| 2000-10000 | ML + LLM fallback | Balanced speed/accuracy |
| >10000 | ML-only (fast mode) | Speed critical at scale |

Evidence: 801-email dataset achieved 99.8% accuracy with 25-min agent analysis vs 54.9% with pure ML.
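The size-based routing above can be sketched as a simple dispatcher. Pipeline names here are illustrative placeholders; the thresholds come straight from the table:

```python
def choose_pipeline(email_count: int) -> str:
    """Route a mailbox to a pipeline based on dataset size.

    Thresholds mirror the table above; pipeline names are illustrative.
    """
    if email_count < 500:
        return "agent_only"
    if email_count <= 2000:
        return "agent_prescan_plus_ml"
    if email_count <= 10000:
        return "ml_with_llm_fallback"
    return "ml_fast"
```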

2. Agent Pre-Scan Adds Massive Value

A 10-15 minute agent discovery phase before bulk classification:

  • Identifies dominant sender domains
  • Discovers subject patterns
  • Suggests optimal categories for THIS dataset
  • Can generate sender-to-category mappings

This is NOT the same as the full manual analysis. It's a quick reconnaissance pass.

3. Categories Should Serve Downstream Processing

Don't optimize for human-readable labels. Optimize for routing decisions:

| Category Type | Downstream Handler | Accuracy Need |
|---|---|---|
| Junk/Marketing | Batch cleanup tool | LOW (errors OK) |
| Newsletters | Summary aggregator | MEDIUM |
| Transactional | Archive, searchable | MEDIUM |
| Business | Knowledge graph | HIGH |
| Personal | Human review | CRITICAL |
| Security | Never auto-filter | CRITICAL |
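The category-to-handler mapping above could be expressed as a small routing table. Handler and category names are hypothetical placeholders for the ecosystem tools in the diagram, and unknown categories fall through to human review as the safest default:

```python
# Illustrative mapping from category to downstream handler (names hypothetical).
DOWNSTREAM = {
    "junk": "batch_cleanup",
    "newsletter": "summary_aggregator",
    "transactional": "searchable_archive",
    "business": "knowledge_graph_builder",
    "personal": "human_review",
    "security": "human_review",  # never auto-filter
}

def route(category: str) -> str:
    """Return the downstream handler for a category; unknown -> safest handler."""
    return DOWNSTREAM.get(category, "human_review")
```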

4. Risk-Based Accuracy Requirements

Not all emails need the same classification confidence:

HIGH STAKES (must not miss):
├─ Personal correspondence (sentimental value)
├─ Security alerts (account safety)
├─ Job applications (life-changing)
└─ Financial/legal documents

LOW STAKES (errors tolerable):
├─ Marketing promotions
├─ Newsletter digests
├─ Automated notifications
└─ Social media alerts
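A minimal risk-tiering sketch based on the two groups above. The category slugs are illustrative, and anything unrecognized defaults to medium as a safety choice:

```python
# Illustrative category slugs mirroring the HIGH/LOW stakes lists above.
HIGH_STAKES = {"personal", "security", "job_application", "financial_legal"}
LOW_STAKES = {"marketing", "newsletter", "notification", "social"}

def risk_tier(category: str) -> str:
    """Map a category to a risk tier; unknown categories default to medium."""
    if category in HIGH_STAKES:
        return "high"
    if category in LOW_STAKES:
        return "low"
    return "medium"
```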

5. Spam Filtering is a Solved Problem

Don't reinvent spam filtering. Gmail and Outlook do it well. This tool should:

  • Assume spam is already filtered
  • Focus on categorizing legitimate mail
  • Trust the upstream provider

If spam does get through, a simple secondary filter could catch obvious cases, but this is low priority.

6. Sender Domain is the Strongest Signal

From the 801-email analysis:

  • Top 5 senders = 47.5% of all emails
  • Sender domain alone could classify 80%+ of automated emails
  • Subject patterns matter less than sender patterns

Implication: A sender-first classification approach could dramatically speed up processing.
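A sender-first pass might look like this sketch. The map entries are hypothetical examples of what a pre-scan would learn; anything not in the map falls through to the ML pipeline:

```python
import re

def sender_domain(from_header: str) -> str:
    """Extract the lowercased domain from a From: header value."""
    match = re.search(r"@([\w.-]+)", from_header)
    return match.group(1).lower() if match else ""

# Hypothetical sender map, as a pre-scan would generate it.
SENDER_MAP = {
    "mailer.netflix.com": "marketing",
    "github.com": "notification",
}

def classify_by_sender(from_header: str):
    """Return a category for a known sender, else None (fall through to ML)."""
    return SENDER_MAP.get(sender_domain(from_header))
```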

7. Inbox Character Matters (NEW - Session 2)

Critical Discovery: Before classifying emails, assess the inbox CHARACTER:

| Inbox Type | Characteristics | Classification Approach |
|---|---|---|
| Personal/Consumer | Subscription-heavy, marketing-dominant, ~40-50% automated | Sender domain first |
| Business/Professional | Client work, operations, developer tools (~60-70%) | Sender + subject context |
| Mixed | Both patterns present | Hybrid approach needed |

Evidence from brett-microsoft analysis:

  • 73.2% Business/Professional content
  • Only 8.2% Personal content
  • Required client relationship tracking
  • Support case ID extraction valuable

Implications for Agent Pre-Scan:

  1. First determine inbox character (business vs personal vs mixed)
  2. Select appropriate category templates
  3. Business inboxes need relationship context, not just sender domains
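The character check in step 1 could be a simple threshold rule over estimated category shares. The 0.6 cutoff is an assumption, loosely anchored to the 73.2% business figure above:

```python
def inbox_character(category_shares: dict) -> str:
    """Classify inbox character from category share estimates in [0, 1].

    The 0.6 cutoff is an assumed threshold, not a measured one.
    """
    business = category_shares.get("business", 0.0)
    personal = category_shares.get("personal", 0.0)
    if business >= 0.6:
        return "business"
    if personal >= 0.6:
        return "personal"
    return "mixed"
```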

8. Business Inboxes Need Special Handling (NEW - Session 2)

Business/professional inboxes require additional classification dimensions:

Client Relationship Tracking:

  • Same domain may have different contexts (internal vs external)
  • Client conversations span multiple senders
  • Subject threading matters more than in consumer inboxes

Support Case ID Extraction:

  • Business inboxes often have case/ticket IDs connecting emails
  • Microsoft: Case #, TrackingID#
  • Other vendors: Ticket numbers, reference IDs
  • ID extraction should be first-class feature
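ID extraction can be a small regex pass over subject lines. The patterns below are illustrative guesses at the `Case #` and `TrackingID#` formats mentioned above, not confirmed vendor formats:

```python
import re

# Illustrative patterns for the Case # / TrackingID# formats noted above.
CASE_ID_PATTERNS = [
    re.compile(r"\bCase\s*#?\s*(\d{6,})", re.IGNORECASE),
    re.compile(r"\bTrackingID#?\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE),
]

def extract_case_ids(subject: str) -> list:
    """Pull support case / tracking IDs from a subject line."""
    ids = []
    for pattern in CASE_ID_PATTERNS:
        ids.extend(pattern.findall(subject))
    return ids
```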

Accuracy Expectations:

  • Personal inboxes: 99%+ achievable with sender-first
  • Business inboxes: 95-98% achievable (more nuanced)
  • Accept lower accuracy ceiling, invest in risk-flagging

9. Multi-Inbox Analysis Reveals Patterns (NEW - Session 2)

Analyzing multiple inboxes from same user reveals:

  • Inbox segregation patterns - Gmail for personal, Outlook for business
  • Cross-inbox senders - Security alerts appear in both
  • Category overlap - Some categories universal, some inbox-specific

Implication: Future feature could merge analysis across inboxes to build complete user profile.


Technical Architecture (Refined)

Current State

Email Source → LocalFileParser → FeatureExtractor → ML Classifier → Output
                                      │
                                      └→ LLM Fallback (if low confidence)

Target State (2025)

Email Source
     │
     ▼
┌─────────────────────────────────────────────────────────────┐
│                    ROUTING LAYER                             │
│  Check dataset size → Route to appropriate pipeline          │
└─────────────────────────────────────────────────────────────┘
     │
     ├─── <500 emails ────→ Agent-Only Analysis
     │
     ├─── 500-5000 ───────→ Agent Pre-Scan + ML Pipeline
     │
     └─── >5000 ──────────→ ML Pipeline (optional LLM)

Each pipeline outputs:
  - Categorized emails (with confidence)
  - Risk flags (high-stakes items)
  - Routing recommendations
  - Insights report
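The common output record might be sketched as a dataclass. Field names are illustrative, matching the four outputs listed above:

```python
from dataclasses import dataclass, field

@dataclass
class ClassifiedEmail:
    """One row of the common pipeline output (field names are illustrative)."""
    message_id: str
    category: str
    confidence: float
    risk_flags: list = field(default_factory=list)  # high-stakes markers
    routing: str = ""  # downstream handler recommendation
```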

Agent Pre-Scan Module (NEW)

from typing import List  # Email and PreScanResult are project-defined types

class AgentPreScan:
    """
    Quick discovery phase before bulk classification.
    Time budget: 10-15 minutes.
    """

    def scan(self, emails: List[Email]) -> PreScanResult:
        # 1. Sender domain analysis (2 min)
        sender_stats = self.analyze_senders(emails)

        # 2. Subject pattern detection (3 min)
        patterns = self.detect_patterns(emails, sample_size=100)

        # 3. Category suggestions (5 min, uses LLM)
        categories = self.suggest_categories(sender_stats, patterns)

        # 4. Generate sender map (2 min)
        sender_map = self.create_sender_mapping(sender_stats, categories)

        return PreScanResult(
            sender_stats=sender_stats,
            patterns=patterns,
            suggested_categories=categories,
            sender_map=sender_map,
            estimated_distribution=self.estimate_distribution(emails, categories)
        )

Development Roadmap

Phase 0: Documentation Complete (NOW)

  • Research session findings documented
  • Classification methods comparison written
  • Project scope defined
  • This roadmap created

Phase 1: Quick Wins (Q1 2025, 4-8 hours)

  1. Dataset size routing

    • Auto-detect email count
    • Route small datasets to agent analysis
    • Route large datasets to ML pipeline
  2. Sender-first classification

    • Extract sender domain
    • Check against known sender map
    • Skip ML for known high-volume senders
  3. Risk flagging

    • Flag low-confidence results
    • Flag potential personal emails
    • Flag security-related emails

Phase 2: Agent Pre-Scan (Q1 2025, 8-16 hours)

  1. Sender analysis module

    • Cluster by domain
    • Calculate volume statistics
    • Identify automated vs personal
  2. Pattern detection module

    • Sample subject lines
    • Find templates and IDs
    • Detect lifecycle stages
  3. Category suggestion module

    • Use LLM to suggest categories
    • Based on sender/pattern analysis
    • Output category definitions
  4. Sender mapping module

    • Map senders to suggested categories
    • Output as JSON for pipeline use
    • Support manual overrides

Phase 3: Integration & Polish (Q2 2025)

  1. Unified CLI

    • Single command handles all dataset sizes
    • Progress reporting
    • Configurable verbosity
  2. Output standardization

    • Common format for all pipelines
    • Include routing recommendations
    • Include confidence and risk flags
  3. Ecosystem integration

    • Define handoff format for downstream tools
    • Document API for other tools to consume
    • Create example integrations

Phase 4: Scale Testing (Q2-Q3 2025)

  1. Test on real 10k+ mailboxes

    • Multiple users, different patterns
    • Measure accuracy vs speed
    • Refine thresholds
  2. Pattern library

    • Accumulate patterns from multiple mailboxes
    • Build reusable sender maps
    • Create category templates
  3. Feedback loop

    • Track classification accuracy
    • Learn from corrections
    • Improve over time

Configuration Philosophy

User-Facing Config (Keep Simple)

# config/user_config.yaml
mode: auto          # auto | agent | ml | hybrid
risk_threshold: high  # low | medium | high
output_format: json   # json | csv | html

Internal Config (Full Control)

# config/advanced_config.yaml
routing:
  small_threshold: 500
  medium_threshold: 5000

agent_prescan:
  enabled: true
  time_budget_minutes: 15
  sample_size: 100

ml_pipeline:
  confidence_threshold: 0.55
  llm_fallback: true
  batch_size: 512

risk_detection:
  personal_indicators: [gmail.com, hotmail.com, outlook.com]
  security_senders: [accounts.google.com, security@]
  high_stakes_keywords: [urgent, important, legal, contract]
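One way to keep the two layers honest is to deep-merge the simple user config over the full internal defaults. This sketch assumes both files have already been parsed into dicts (e.g. via PyYAML's `safe_load`):

```python
def merge_config(defaults: dict, overrides: dict) -> dict:
    """Deep-merge user overrides onto internal defaults (sketch).

    Nested dicts are merged recursively; scalars and lists are replaced.
    """
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged
```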

Success Metrics

For This Tool

| Metric | Target | Current |
|---|---|---|
| Classification accuracy (large datasets) | >85% | 54.9% (ML), 93.3% (ML+LLM) |
| Processing speed (10k emails) | <5 min | ~24 sec (ML-only) |
| High-stakes miss rate | <1% | Not measured |
| Setup time for new mailbox | <20 min | Variable |

For Ecosystem

| Metric | Target |
|---|---|
| End-to-end mailbox processing | <2 hours for 10k |
| User intervention needed | <10% of emails |
| Downstream tool compatibility | 100% |

Open Questions (To Resolve in 2025)

  1. Category standardization: Should categories be fixed across all users, or discovered per-mailbox?

  2. Sender map sharing: Can sender maps be shared across users? Privacy implications?

  3. Incremental processing: How to handle new emails added to already-processed mailboxes?

  4. Multi-account support: Same user, multiple email accounts?

  5. Feedback integration: How do corrections feed back into the system?


Files Created During Research

Session 1 (brett-gmail, Personal Inbox)

| File | Purpose |
|---|---|
| tools/brett_gmail_analyzer.py | Custom analyzer for personal inbox |
| tools/generate_html_report.py | HTML report generator |
| data/brett_gmail_analysis.json | Analysis data output |
| docs/CLASSIFICATION_METHODS_COMPARISON.md | Method comparison |
| docs/REPORT_FORMAT.md | HTML report documentation |
| docs/SESSION_HANDOVER_20251128.md | Session 1 handover |

Session 2 (brett-microsoft, Business Inbox)

| File | Purpose |
|---|---|
| tools/brett_microsoft_analyzer.py | Custom analyzer for business inbox |
| data/brett_microsoft_analysis.json | Analysis data output |
| /home/bob/.../brett-ms-sorter/BRETT_MICROSOFT_ANALYSIS_REPORT.md | Full analysis report |

Summary

Email Sorter is a triage tool, not a complete solution.

Its job is to quickly sort emails into buckets so that specialized downstream tools can handle each bucket appropriately. The key insight from these research sessions is that an agent pre-scan phase, even just 10-15 minutes, dramatically improves classification accuracy at any dataset size.

The ML pipeline is valuable for scale (10k+ emails) but overkill for smaller datasets. Risk-based accuracy means we can tolerate errors on junk but must be careful with personal correspondence.

2025 development should focus on:

  1. Smart routing based on dataset size
  2. Agent pre-scan for discovery
  3. Standardized output for ecosystem integration
  4. Scale testing on real large mailboxes

Document Version: 1.1
Created: 2025-11-28
Updated: 2025-11-28 (Session 2 learnings)
Sessions: brett-gmail (801 emails, personal), brett-microsoft (596 emails, business)