# Email Sorter - Complete Workflow Diagram

## Full End-to-End Pipeline with LLM Calls

```mermaid
graph TB
    Start([📧 Start: Enron Maildir<br/>100,000 emails]) --> Parse[EnronParser<br/>Stratified Sampling]

    Parse --> CalibCheck{Need<br/>Calibration?}

    CalibCheck -->|Yes: No Model| CalibStart[🎯 CALIBRATION PHASE]
    CalibCheck -->|No: Model Exists| ClassifyStart[📊 CLASSIFICATION PHASE]

    %% CALIBRATION PHASE
    CalibStart --> Sample[Sample 100 Emails<br/>Stratified by user/folder]
    Sample --> Split[Split: 50 train / 50 validation]

    Split --> LLMBatch[📤 LLM CALLS 1-5<br/>Batch Discovery<br/>5 batches × 20 emails]

    LLMBatch -->|qwen3:8b-q4_K_M| Discover[Category Discovery<br/>~15 raw categories]

    Discover --> Consolidate[📤 LLM CALL 6<br/>Consolidation<br/>Merge similar categories]

    Consolidate -->|qwen3:8b-q4_K_M| CacheSnap[Category Cache Snap<br/>Semantic matching<br/>10 final categories]

    CacheSnap --> ExtractTrain[Extract Features<br/>50 training emails<br/>Batch embeddings]

    ExtractTrain --> Embed1[📤 EMBEDDING CALLS<br/>Ollama all-minilm:l6-v2<br/>384-dim vectors]

    Embed1 --> TrainModel[Train LightGBM<br/>200 boosting rounds<br/>22 total categories]

    TrainModel --> SaveModel[💾 Save Model<br/>classifier.pkl 1.1MB]

    SaveModel --> ClassifyStart

    %% CLASSIFICATION PHASE
    ClassifyStart --> LoadModel[Load Model<br/>classifier.pkl]
    LoadModel --> FetchAll[Fetch All Emails<br/>100,000 emails]

    FetchAll --> BatchProcess[Process in Batches<br/>5,000 emails per batch<br/>20 batches total]

    BatchProcess --> ExtractFeatures[Extract Features<br/>Batch size: 512<br/>Batched embeddings]

    ExtractFeatures --> Embed2[📤 EMBEDDING CALLS<br/>Ollama all-minilm:l6-v2<br/>~200 batched calls]

    Embed2 --> MLInference[LightGBM Inference<br/>Predict categories<br/>~2ms per email]

    MLInference --> Results[💾 Save Results<br/>results.json 19MB<br/>summary.json 1.5KB<br/>classifications.csv 8.6MB]

    Results --> ValidationStart[🔍 VALIDATION PHASE]

    %% VALIDATION PHASE
    ValidationStart --> SelectSamples[Select Samples<br/>50 low-conf + 25 random]

    SelectSamples --> LoadEmails[Load Full Email Content<br/>Subject + Body + Metadata]

    LoadEmails --> LLMEval[📤 LLM CALLS 7-81<br/>Individual Evaluation<br/>75 total assessments]

    LLMEval -->|qwen3:8b-q4_K_M<br/>&lt;no_think&gt;| EvalResults[Collect Verdicts<br/>YES/PARTIAL/NO<br/>+ Reasoning]

    EvalResults --> LLMSummary[📤 LLM CALL 82<br/>Final Summary<br/>Aggregate findings]

    LLMSummary -->|qwen3:8b-q4_K_M| FinalReport[📊 Final Report<br/>Accuracy metrics<br/>Category quality<br/>Recommendations]

    FinalReport --> End([✅ Complete<br/>100k classified<br/>+ validated])

    %% OPTIONAL FINE-TUNING LOOP
    FinalReport -.->|If corrections needed| FineTune[🔄 FINE-TUNING<br/>Collect LLM corrections<br/>Continue training]
    FineTune -.-> ClassifyStart

    style Start fill:#e1f5e1
    style End fill:#e1f5e1
    style LLMBatch fill:#fff4e6
    style Consolidate fill:#fff4e6
    style Embed1 fill:#e6f3ff
    style Embed2 fill:#e6f3ff
    style LLMEval fill:#fff4e6
    style LLMSummary fill:#fff4e6
    style SaveModel fill:#ffe6f0
    style Results fill:#ffe6f0
    style FinalReport fill:#ffe6f0
```
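
The stratified sampling step at the top of the diagram can be sketched as follows. The `(user, folder)` strata come from the Enron maildir layout; the field names and helper are illustrative, not the project's actual EnronParser API.

```python
# Minimal sketch of stratified sampling over the Enron maildir layout
# (maildir/<user>/<folder>/<message>). Email dicts with "user"/"folder"
# keys are an assumption for illustration.
import random
from collections import defaultdict

def stratified_sample(emails, n=100, seed=42):
    """Sample ~n emails proportionally across (user, folder) strata."""
    strata = defaultdict(list)
    for e in emails:
        strata[(e["user"], e["folder"])].append(e)
    rng = random.Random(seed)
    total = sum(len(v) for v in strata.values())
    sample = []
    for group in strata.values():
        k = max(1, round(n * len(group) / total))  # at least one per stratum
        sample.extend(rng.sample(group, min(k, len(group))))
    rng.shuffle(sample)
    return sample[:n]

# Split 50/50 into train and validation, as in the calibration phase:
# sample = stratified_sample(all_emails)
# train, validation = sample[:50], sample[50:]
```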

---

## Pipeline Stages Breakdown

### STAGE 1: CALIBRATION (1 minute)
**Input:** 100 emails

**LLM Calls:** 6
- 5 batch discovery calls (20 emails each)
- 1 consolidation call

**Embedding Calls:** ~50 (one per training email)

**Output:**
- 10 discovered categories
- Trained LightGBM model (1.1MB)
- Category cache
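
A hedged sketch of one of the five batch-discovery calls: only the Ollama `/api/chat` endpoint and the qwen3:8b-q4_K_M model tag come from this document; the prompt wording and response parsing are assumptions.

```python
# Sketch of one batch-discovery call (LLM CALLS 1-5). The prompt text
# and line-per-category response format are assumptions, not the
# project's actual prompts.
import requests

def discover_categories(batch):  # batch: list of {"subject":..., "body":...}
    digest = "\n\n".join(
        f"Subject: {e['subject']}\n{e['body'][:500]}" for e in batch
    )
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "qwen3:8b-q4_K_M",
            "messages": [{
                "role": "user",
                "content": "Propose category names for these emails, "
                           "one per line:\n\n" + digest,
            }],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    content = resp.json()["message"]["content"]
    # Keep non-empty lines, dropping any leading list dashes
    return [ln.strip("- ").strip() for ln in content.splitlines() if ln.strip()]
```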

### STAGE 2: CLASSIFICATION (3.4 minutes)
**Input:** 100,000 emails

**LLM Calls:** 0 (pure ML inference)

**Embedding Calls:** ~200 batched calls (512 emails per batch)

**Output:**
- 100,000 classifications
- Confidence scores
- Results in JSON/CSV
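
The LLM-free classification loop might look like the sketch below: one batched `/api/embed` call per 512 emails, then LightGBM inference. File paths, field names, and the result record shape are assumptions.

```python
# Sketch of the zero-LLM classification loop: batched 384-dim embeddings
# from Ollama, then LightGBM inference. Assumes classifier.pkl is a
# pickled 22-class LightGBM Booster, as described in this document.
import pickle
import numpy as np
import requests

with open("classifier.pkl", "rb") as f:
    model = pickle.load(f)

def embed_batch(texts):  # one batched call instead of len(texts) calls
    r = requests.post(
        "http://localhost:11434/api/embed",
        json={"model": "all-minilm:l6-v2", "input": texts},
        timeout=300,
    )
    r.raise_for_status()
    return np.array(r.json()["embeddings"])  # shape: (len(texts), 384)

def classify(emails, batch_size=512):
    results = []
    for i in range(0, len(emails), batch_size):
        chunk = emails[i : i + batch_size]
        X = embed_batch([e["subject"] + "\n" + e["body"] for e in chunk])
        proba = model.predict(X)  # (batch, 22) class probabilities
        for e, p in zip(chunk, proba):
            results.append({
                "id": e["id"],
                "category": int(np.argmax(p)),
                "confidence": float(np.max(p)),
            })
    return results
```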

### STAGE 3: VALIDATION (variable, ~5-10 minutes)
**Input:** 75 sample emails (50 low-conf + 25 random)

**LLM Calls:** 76
- 75 individual evaluation calls
- 1 final summary call

**Output:**
- Quality assessment (YES/PARTIAL/NO)
- Accuracy metrics
- Recommendations
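
Selecting the 75 validation samples is straightforward; a sketch, assuming the result records from the classification sketch above:

```python
# The 50 lowest-confidence classifications plus 25 random others,
# deduplicated by email id.
import random

def select_validation_samples(results, n_low=50, n_random=25, seed=7):
    by_conf = sorted(results, key=lambda r: r["confidence"])
    low = by_conf[:n_low]
    chosen = {r["id"] for r in low}
    rng = random.Random(seed)
    pool = [r for r in results if r["id"] not in chosen]
    return low + rng.sample(pool, min(n_random, len(pool)))
```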

---

## LLM Call Summary

| Call # | Purpose | Model | Input | Output | Time |
|--------|---------|-------|-------|--------|------|
| 1-5 | Batch Discovery | qwen3:8b | 20 emails each | Categories | ~5-6s each |
| 6 | Consolidation | qwen3:8b | 15 categories | 10 merged | ~3s |
| 7-81 | Evaluation | qwen3:8b | 1 email + category | Verdict | ~2s each |
| 82 | Summary | qwen3:8b | 75 evaluations | Final report | ~5s |

- **Total LLM Calls:** 82
- **Total LLM Time:** ~3-4 minutes
- **Embedding Calls:** ~250 (batched)
- **Embedding Time:** ~30 seconds (batched)

---

## Performance Metrics

### Calibration Phase
- **Time:** 60 seconds
- **Samples:** 100 emails (50 for training)
- **Categories Discovered:** 10
- **Model Size:** 1.1MB
- **Accuracy on training:** 95%+

### Classification Phase
- **Time:** 202 seconds (3.4 minutes)
- **Emails:** 100,000
- **Speed:** 495 emails/second
- **Per Email:** ~2ms total processing
- **Batch Size:** 512 (optimal)
- **GPU Utilization:** High (batched embeddings)

### Validation Phase
- **Time:** ~10 minutes (75 LLM calls)
- **Samples:** 75 emails
- **Per Sample:** ~8 seconds
- **Accuracy Found:** Model already accurate (0 corrections)

---

## Data Flow Details

### Email Processing Pipeline
```
Email File → Parse  → Features → Embedding → Model   → Category
  (text)     (dict)   (struct)   (384-dim)   (22-cat)   (label)
```

### Feature Extraction
```
Email Content
├─ Subject (text)
├─ Body (text)
├─ Sender (email address)
├─ Date (timestamp)
├─ Attachments (boolean + count)
└─ Patterns (regex matches)
        ↓
Structured Text
        ↓
Ollama Embedding (all-minilm:l6-v2)
        ↓
384-dimensional vector
```
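
A sketch of this flattening-plus-embedding step; the structured-text layout is an illustrative assumption, while the model tag and 384-dim output come from this document.

```python
# Flatten one email into structured text, then fetch a single 384-dim
# vector via Ollama's single-prompt embeddings endpoint.
import re
import requests

def to_structured_text(email):
    has_attach = bool(email.get("attachments"))
    urls = len(re.findall(r"https?://\S+", email.get("body", "")))
    return (
        f"subject: {email['subject']}\n"
        f"sender: {email['sender']}\n"
        f"attachments: {has_attach}\n"
        f"urls: {urls}\n"
        f"body: {email['body'][:1000]}"
    )

def embed(text):
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "all-minilm:l6-v2", "prompt": text},
        timeout=60,
    )
    r.raise_for_status()
    vec = r.json()["embedding"]
    assert len(vec) == 384  # all-MiniLM-L6-v2 dimension
    return vec
```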

### LightGBM Training
```
Features (384-dim) + Labels (10 categories)
        ↓
Training: 200 boosting rounds
        ↓
Model: 22 categories total (10 discovered + 12 hardcoded)
        ↓
Output: classifier.pkl (1.1MB)
```
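
The training step maps directly onto the standard LightGBM Python API; everything beyond the documented figures (384 features, 22 classes, 200 rounds, classifier.pkl) is an assumption.

```python
# Minimal LightGBM training sketch matching the figures above.
import pickle
import lightgbm as lgb
import numpy as np

NUM_CLASSES = 22  # 10 discovered + 12 hardcoded

def train(X: np.ndarray, y: np.ndarray) -> lgb.Booster:
    params = {
        "objective": "multiclass",
        "num_class": NUM_CLASSES,
        "metric": "multi_logloss",
        "verbosity": -1,
    }
    booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=200)
    with open("classifier.pkl", "wb") as f:
        pickle.dump(booster, f)
    return booster

# Smoke test on synthetic data of the documented shape:
# train(np.random.rand(50, 384), np.random.randint(0, NUM_CLASSES, 50))
```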

---

## Category Distribution (100k Results)

```mermaid
pie title Category Distribution
    "Work Communication" : 89807
    "Financial" : 6534
    "Forwarded" : 2457
    "Technical Analysis" : 1129
    "Other" : 73
```

---

## Confidence Distribution (100k Results)

```mermaid
pie title Confidence Levels
    "High (≥0.7)" : 74777
    "Medium (0.5-0.7)" : 17381
    "Low (<0.5)" : 7842
```
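
The buckets in this chart can be reproduced from the per-email confidence scores; the thresholds (0.7 and 0.5) are taken from the chart labels.

```python
# Bucket classification confidences into the three documented levels.
from collections import Counter

def bucket(confidence: float) -> str:
    if confidence >= 0.7:
        return "High"
    if confidence >= 0.5:
        return "Medium"
    return "Low"

# counts = Counter(bucket(r["confidence"]) for r in results)
```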

---

## System Architecture

```mermaid
graph LR
    A[Email Source<br/>Gmail/IMAP/Enron] --> B[Email Provider]
    B --> C[Feature Extractor]
    C --> D[Ollama<br/>Embeddings]
    C --> E[Pattern Detector]
    D --> F[LightGBM<br/>Classifier]
    E --> F
    F --> G[Results<br/>JSON/CSV]
    F --> H[Sync Engine<br/>Labels/Keywords]

    I[LLM<br/>qwen3:8b] -.->|Calibration| J[Category Discovery]
    J -.-> F
    I -.->|Validation| K[Quality Check]
    K -.-> G

    style D fill:#e6f3ff
    style I fill:#fff4e6
    style F fill:#f0e6ff
    style G fill:#ffe6f0
```

---

## Next: Integrated End-to-End Script

Building a comprehensive validation script with:
1. 50 low-confidence samples
2. 25 random samples
3. A final LLM summary call
4. Complete pipeline orchestration