# Email Sorter - Complete Workflow Diagram
## Full End-to-End Pipeline with LLM Calls
```mermaid
graph TB
    Start([📧 Start: Enron Maildir<br/>100,000 emails]) --> Parse[EnronParser<br/>Stratified Sampling]
    Parse --> CalibCheck{Need<br/>Calibration?}
    CalibCheck -->|Yes: No Model| CalibStart[🎯 CALIBRATION PHASE]
    CalibCheck -->|No: Model Exists| ClassifyStart[📊 CLASSIFICATION PHASE]

    %% CALIBRATION PHASE
    CalibStart --> Sample[Sample 100 Emails<br/>Stratified by user/folder]
    Sample --> Split[Split: 50 train / 50 validation]
    Split --> LLMBatch[📤 LLM CALLS 1-5<br/>Batch Discovery<br/>5 batches × 20 emails]
    LLMBatch -->|qwen3:8b-q4_K_M| Discover[Category Discovery<br/>~15 raw categories]
    Discover --> Consolidate[📤 LLM CALL 6<br/>Consolidation<br/>Merge similar categories]
    Consolidate -->|qwen3:8b-q4_K_M| CacheSnap[Category Cache Snap<br/>Semantic matching<br/>10 final categories]
    CacheSnap --> ExtractTrain[Extract Features<br/>50 training emails<br/>Batch embeddings]
    ExtractTrain --> Embed1[📤 EMBEDDING CALLS<br/>Ollama all-minilm:l6-v2<br/>384-dim vectors]
    Embed1 --> TrainModel[Train LightGBM<br/>200 boosting rounds<br/>22 total categories]
    TrainModel --> SaveModel[💾 Save Model<br/>classifier.pkl 1.1MB]
    SaveModel --> ClassifyStart

    %% CLASSIFICATION PHASE
    ClassifyStart --> LoadModel[Load Model<br/>classifier.pkl]
    LoadModel --> FetchAll[Fetch All Emails<br/>100,000 emails]
    FetchAll --> BatchProcess[Process in Batches<br/>5,000 emails per batch<br/>20 batches total]
    BatchProcess --> ExtractFeatures[Extract Features<br/>Batch size: 512<br/>Batched embeddings]
    ExtractFeatures --> Embed2[📤 EMBEDDING CALLS<br/>Ollama all-minilm:l6-v2<br/>~200 batched calls]
    Embed2 --> MLInference[LightGBM Inference<br/>Predict categories<br/>~2ms per email]
    MLInference --> Results[💾 Save Results<br/>results.json 19MB<br/>summary.json 1.5KB<br/>classifications.csv 8.6MB]
    Results --> ValidationStart[🔍 VALIDATION PHASE]

    %% VALIDATION PHASE
    ValidationStart --> SelectSamples[Select Samples<br/>50 low-conf + 25 random]
    SelectSamples --> LoadEmails[Load Full Email Content<br/>Subject + Body + Metadata]
    LoadEmails --> LLMEval[📤 LLM CALLS 7-81<br/>Individual Evaluation<br/>75 total assessments]
    LLMEval -->|qwen3:8b-q4_K_M<br/>&lt;no_think&gt;| EvalResults[Collect Verdicts<br/>YES/PARTIAL/NO<br/>+ Reasoning]
    EvalResults --> LLMSummary[📤 LLM CALL 82<br/>Final Summary<br/>Aggregate findings]
    LLMSummary -->|qwen3:8b-q4_K_M| FinalReport[📊 Final Report<br/>Accuracy metrics<br/>Category quality<br/>Recommendations]
    FinalReport --> End([✅ Complete<br/>100k classified<br/>+ validated])

    %% OPTIONAL FINE-TUNING LOOP
    FinalReport -.->|If corrections needed| FineTune[🔄 FINE-TUNING<br/>Collect LLM corrections<br/>Continue training]
    FineTune -.-> ClassifyStart

    style Start fill:#e1f5e1
    style End fill:#e1f5e1
    style LLMBatch fill:#fff4e6
    style Consolidate fill:#fff4e6
    style Embed1 fill:#e6f3ff
    style Embed2 fill:#e6f3ff
    style LLMEval fill:#fff4e6
    style LLMSummary fill:#fff4e6
    style SaveModel fill:#ffe6f0
    style Results fill:#ffe6f0
    style FinalReport fill:#ffe6f0
```
---
## Pipeline Stages Breakdown
### STAGE 1: CALIBRATION (1 minute)

- **Input:** 100 emails
- **LLM Calls:** 6
  - 5 batch discovery calls (20 emails each)
  - 1 consolidation call
- **Embedding Calls:** ~50 (one per training email)
- **Output:**
  - 10 discovered categories
  - Trained LightGBM model (1.1MB)
  - Category cache
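
A minimal sketch of this phase, assuming hypothetical helpers `discover_categories`, `consolidate_categories`, `embed_batch`, and `train_lightgbm` (none of these names come from the codebase; a training sketch appears under "LightGBM Training" below):

```python
import random

def calibrate(emails, batch_size=20):
    """Sample emails, discover categories via the LLM, train the ML classifier."""
    sample = random.sample(emails, 100)   # the real pipeline stratifies by user/folder
    train, validation = sample[:50], sample[50:]

    # LLM calls 1-5: one discovery call per 20-email batch (~15 raw categories)
    raw_labels = {}
    for i in range(0, len(sample), batch_size):
        raw_labels.update(discover_categories(sample[i:i + batch_size]))

    # LLM call 6: merge near-duplicate categories down to ~10, remapping labels
    labels = consolidate_categories(raw_labels)

    # Embedding calls: 384-dim vectors for the 50 training emails, then a 200-round fit
    vectors = embed_batch(train)
    return train_lightgbm(vectors, [labels[e["id"]] for e in train])
```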
### STAGE 2: CLASSIFICATION (3.4 minutes)

- **Input:** 100,000 emails
- **LLM Calls:** 0 (pure ML inference)
- **Embedding Calls:** ~200 batched calls (512 emails per batch)
- **Output:**
  - 100,000 classifications
  - Confidence scores
  - Results in JSON/CSV
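
The hot path makes no LLM calls at all. A sketch of the batch loop, assuming the pickled object is a LightGBM booster and that a hypothetical `embed_batch` returns one 384-dim row per email:

```python
import pickle

import numpy as np

def classify_all(emails, model_path="classifier.pkl", chunk=5_000, embed_size=512):
    """Classify emails in chunks of 5,000, embedding 512 at a time."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)

    results = []
    for start in range(0, len(emails), chunk):         # 20 chunks for 100k emails
        batch = emails[start:start + chunk]
        vectors = np.vstack([
            embed_batch(batch[i:i + embed_size])        # batched Ollama embeddings
            for i in range(0, len(batch), embed_size)
        ])
        probs = model.predict(vectors)                  # (n, 22) class probabilities
        for email, p in zip(batch, probs):
            results.append({
                "id": email["id"],
                "category": int(np.argmax(p)),
                "confidence": float(np.max(p)),
            })
    return results
```

With 512-email embedding batches, this run sustained 495 emails/second (see Performance Metrics below).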
### STAGE 3: VALIDATION (variable, ~5-10 minutes)

- **Input:** 75 sample emails (50 low-confidence + 25 random)
- **LLM Calls:** 76
  - 75 individual evaluation calls
  - 1 final summary call
- **Output:**
  - Quality assessment (YES/PARTIAL/NO)
  - Accuracy metrics
  - Recommendations
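
Sample selection itself needs no LLM; a sketch assuming each classification result carries a `confidence` field:

```python
import random

def select_validation_samples(results, n_low=50, n_random=25):
    """Pick the 50 lowest-confidence classifications plus 25 random others."""
    by_confidence = sorted(results, key=lambda r: r["confidence"])
    low_conf = by_confidence[:n_low]
    random_picks = random.sample(by_confidence[n_low:], n_random)
    return low_conf + random_picks
```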
---
## LLM Call Summary
| Call # | Purpose | Model | Input | Output | Time |
|--------|---------|-------|-------|--------|------|
| 1-5 | Batch Discovery | qwen3:8b | 20 emails each | Categories | ~5-6s each |
| 6 | Consolidation | qwen3:8b | 15 categories | 10 merged | ~3s |
| 7-81 | Evaluation | qwen3:8b | 1 email + category | Verdict | ~2s each |
| 82 | Summary | qwen3:8b | 75 evaluations | Final report | ~5s |
- **Total LLM Calls:** 82
- **Total LLM Time:** ~3-4 minutes
- **Embedding Calls:** ~250 (batched)
- **Embedding Time:** ~30 seconds (batched)
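
Every call in the table goes to a local Ollama server. A sketch of one evaluation call (the 7-81 pattern) against Ollama's standard `/api/generate` endpoint; the prompt wording is illustrative, not the pipeline's actual template:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def evaluate_classification(email_text: str, category: str) -> str:
    """Ask qwen3 whether a predicted category fits; expects YES/PARTIAL/NO."""
    prompt = (
        "<no_think>\n"  # the diagram notes this tag to skip the model's thinking phase
        f"Email:\n{email_text}\n\n"
        f"Predicted category: {category}\n"
        "Answer YES, PARTIAL, or NO, then give one sentence of reasoning."
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "qwen3:8b-q4_K_M", "prompt": prompt, "stream": False},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```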
---
## Performance Metrics
### Calibration Phase
- **Time:** 60 seconds
- **Samples:** 100 emails (50 for training)
- **Categories Discovered:** 10
- **Model Size:** 1.1MB
- **Accuracy on training:** 95%+
### Classification Phase
- **Time:** 202 seconds (3.4 minutes)
- **Emails:** 100,000
- **Speed:** 495 emails/second
- **Per Email:** 2ms total processing
- **Batch Size:** 512 (optimal)
- **GPU Utilization:** High (batched embeddings)
### Validation Phase
- **Time:** ~10 minutes (75 LLM calls)
- **Samples:** 75 emails
- **Per Sample:** ~8 seconds
- **Outcome:** model already accurate; 0 corrections needed
---
## Data Flow Details
### Email Processing Pipeline
```
Email File  →  Parse  →  Features  →  Embedding  →  Model  →  Category
  (text)       (dict)    (struct)     (384-dim)     (22-cat)   (label)
```
### Feature Extraction
```
Email Content
├─ Subject (text)
├─ Body (text)
├─ Sender (email address)
├─ Date (timestamp)
├─ Attachments (boolean + count)
└─ Patterns (regex matches)
        ↓
Structured Text
        ↓
Ollama Embedding (all-minilm:l6-v2)
        ↓
384-dimensional vector
```
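
A sketch of that embedding step using Ollama's `/api/embeddings` endpoint; the exact structured-text layout and the body truncation length are assumptions:

```python
import requests

def embed_email(email: dict) -> list[float]:
    """Flatten an email into structured text and fetch a 384-dim all-minilm vector."""
    text = (
        f"Subject: {email['subject']}\n"
        f"From: {email['sender']}\n"
        f"Body: {email['body'][:2000]}"   # truncation length is an assumption
    )
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "all-minilm:l6-v2", "prompt": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]       # list of 384 floats
```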
### LightGBM Training
```
Features (384-dim) + Labels (10 categories)
        ↓
Training: 200 boosting rounds
        ↓
Model: 22 categories total (10 discovered + 12 hardcoded)
        ↓
Output: classifier.pkl (1.1MB)
```
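
A sketch of that fit using the real `lightgbm` API; everything beyond `num_boost_round=200` and the 22 classes is an assumed default:

```python
import pickle

import lightgbm as lgb
import numpy as np

def train_classifier(vectors: np.ndarray, labels: np.ndarray, num_classes: int = 22):
    """Fit a 200-round multiclass LightGBM model and pickle it as classifier.pkl."""
    params = {
        "objective": "multiclass",
        "num_class": num_classes,     # 10 discovered + 12 hardcoded categories
        "metric": "multi_logloss",
        "verbosity": -1,
    }
    booster = lgb.train(params, lgb.Dataset(vectors, label=labels), num_boost_round=200)

    with open("classifier.pkl", "wb") as f:
        pickle.dump(booster, f)       # ~1.1MB in this run
    return booster
```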
---
## Category Distribution (100k Results)
```mermaid
pie title Category Distribution
"Work Communication" : 89807
"Financial" : 6534
"Forwarded" : 2457
"Technical Analysis" : 1129
"Other" : 73
```
---
## Confidence Distribution (100k Results)
```mermaid
pie title Confidence Levels
"High (≥0.7)" : 74777
"Medium (0.5-0.7)" : 17381
"Low (<0.5)" : 7842
```
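
The bands above follow from simple thresholds on the model's top-class probability; a sketch:

```python
from collections import Counter

def bucket_confidences(results):
    """Count classifications per confidence band (thresholds as charted above)."""
    def band(c):
        if c >= 0.7:
            return "High (>=0.7)"
        if c >= 0.5:
            return "Medium (0.5-0.7)"
        return "Low (<0.5)"
    return Counter(band(r["confidence"]) for r in results)
```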
---
## System Architecture
```mermaid
graph LR
    A[Email Source<br/>Gmail/IMAP/Enron] --> B[Email Provider]
    B --> C[Feature Extractor]
    C --> D[Ollama<br/>Embeddings]
    C --> E[Pattern Detector]
    D --> F[LightGBM<br/>Classifier]
    E --> F
    F --> G[Results<br/>JSON/CSV]
    F --> H[Sync Engine<br/>Labels/Keywords]

    I[LLM<br/>qwen3:8b] -.->|Calibration| J[Category Discovery]
    J -.-> F
    I -.->|Validation| K[Quality Check]
    K -.-> G

    style D fill:#e6f3ff
    style I fill:#fff4e6
    style F fill:#f0e6ff
    style G fill:#ffe6f0
```
---
## Next: Integrated End-to-End Script
Building a comprehensive validation script with:
1. 50 low-confidence samples
2. 25 random samples
3. Final LLM summary call
4. Complete pipeline orchestration
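
A possible skeleton for that script, reusing the sketches above; `summarize_verdicts` (the final LLM call) is hypothetical like the rest:

```python
def run_validation(results, emails_by_id):
    """Planned end-to-end validation: sample, evaluate each email, then summarize."""
    samples = select_validation_samples(results, n_low=50, n_random=25)

    verdicts = []
    for r in samples:
        email = emails_by_id[r["id"]]
        verdicts.append(evaluate_classification(email["text"], r["category"]))

    # LLM call 82: aggregate the 75 verdicts into the final report
    return summarize_verdicts(verdicts)
```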