<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Category Verification Feature</title>
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
margin: 20px;
background: #1e1e1e;
color: #d4d4d4;
}
h1, h2, h3 {
color: #4ec9b0;
}
.diagram {
background: white;
padding: 20px;
margin: 20px 0;
border-radius: 8px;
}
.code-section {
background: #252526;
padding: 15px;
margin: 10px 0;
border-left: 4px solid #4ec9b0;
font-family: 'Courier New', monospace;
white-space: pre-wrap;
}
code {
background: #1e1e1e;
padding: 2px 6px;
border-radius: 3px;
color: #ce9178;
}
.success {
background: #002a00;
border-left: 4px solid #4ec9b0;
padding: 15px;
margin: 10px 0;
}
</style>
</head>
<body>
<h1>--verify-categories Feature</h1>
<div class="success">
<h2>✅ IMPLEMENTED AND READY TO USE</h2>
<p><strong>Feature:</strong> A single LLM call that checks whether the trained model's categories fit a new mailbox</p>
<p><strong>Cost:</strong> +20 seconds, 1 LLM call</p>
<p><strong>Value:</strong> Confidence check before bulk ML classification</p>
</div>
<h2>Usage</h2>
<div class="code-section">
<strong>Basic usage (with verification):</strong>
python -m src.cli run \
--source enron \
--limit 10000 \
--output verified_test/ \
--no-llm-fallback \
--verify-categories
<strong>Custom verification sample size:</strong>
python -m src.cli run \
--source enron \
--limit 10000 \
--output verified_test/ \
--no-llm-fallback \
--verify-categories \
--verify-sample 30
<strong>Without verification (fastest):</strong>
python -m src.cli run \
--source enron \
--limit 10000 \
--output fast_test/ \
--no-llm-fallback
</div>
<h2>How It Works</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Run with --verify-categories]) --> LoadModel[Load trained model<br/>Categories: Updates, Work,<br/>Meetings, etc.]
LoadModel --> FetchEmails[Fetch all emails<br/>10,000 total]
FetchEmails --> CheckFlag{--verify-categories?}
CheckFlag -->|No| SkipVerify[Skip verification<br/>Proceed to classification]
CheckFlag -->|Yes| Sample[Sample random emails<br/>Default: 20 emails]
Sample --> BuildPrompt[Build verification prompt<br/>Show model categories<br/>Show sample emails]
BuildPrompt --> LLMCall[Single LLM call<br/>~20 seconds<br/>Task: Rate category fit]
LLMCall --> ParseResponse[Parse JSON response<br/>Extract verdict + confidence]
ParseResponse --> Verdict{Verdict?}
Verdict -->|GOOD_MATCH<br/>80%+ fit| LogGood[Log: Categories appropriate<br/>Confidence: 0.8-1.0]
Verdict -->|FAIR_MATCH<br/>60-80% fit| LogFair[Log: Categories acceptable<br/>Confidence: 0.6-0.8]
Verdict -->|POOR_MATCH<br/><60% fit| LogPoor[Log WARNING<br/>Show suggested categories<br/>Recommend calibration<br/>Confidence: 0.0-0.6]
LogGood --> Proceed[Proceed with ML classification]
LogFair --> Proceed
LogPoor --> Proceed
SkipVerify --> Proceed
Proceed --> ClassifyAll[Classify all 10,000 emails<br/>Pure ML, no LLM fallback<br/>~4 minutes]
ClassifyAll --> Done[Results saved]
style LLMCall fill:#ffd93d
style LogGood fill:#4ec9b0
style LogPoor fill:#ff6b6b
style ClassifyAll fill:#4ec9b0
</pre>
</div>
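<p>In code, the flow above amounts to one helper that samples the fetched mailbox, makes a single LLM call, and parses the JSON verdict. The sketch below is illustrative only: the <code>llm_client.complete()</code> method, the email objects, and the fallback behaviour are assumptions, not the project's actual internals (<code>build_verification_prompt</code> is sketched under "LLM Prompt Structure" below).</p>
<div class="code-section">
<pre>
import json
import random

def verify_model_categories(emails, categories, llm_client, sample_size=20):
    """Illustrative sketch of the verification step (names are assumptions)."""
    # 1. Sample a handful of emails from the freshly fetched mailbox.
    sample = random.sample(emails, min(sample_size, len(emails)))

    # 2. Build the verification prompt (see "LLM Prompt Structure" below).
    prompt = build_verification_prompt(categories, sample)

    # 3. Single LLM call (~20 seconds), expected to return a JSON string.
    raw = llm_client.complete(prompt)

    # 4. Parse verdict + confidence; fall back to a neutral result on bad JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"verdict": "FAIR_MATCH", "confidence": 0.5,
                "reasoning": "Could not parse LLM response"}
</pre>
</div>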
<h2>Example Outputs</h2>
<h3>Scenario 1: GOOD_MATCH (Enron → Enron)</h3>
<div class="code-section">
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: GOOD_MATCH (0.85)
Reasoning: The sample emails fit well into the trained categories. Most are work-related correspondence, meetings, and operational updates which align with the model.
Verification: GOOD_MATCH
Confidence: 85%
Model categories look appropriate for this mailbox
================================================================================
Starting classification...
</div>
<h3>Scenario 2: POOR_MATCH (Enron → Personal Gmail)</h3>
<div class="code-section">
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: POOR_MATCH (0.45)
Reasoning: Many sample emails are shopping confirmations, social media notifications, and personal correspondence which don't fit the business-focused categories well.
Verification: POOR_MATCH
Confidence: 45%
================================================================================
WARNING: Model categories may not fit this mailbox well
Suggested categories: ['Shopping', 'Social', 'Travel', 'Newsletters', 'Personal']
Consider running full calibration for better accuracy
Proceeding with existing model anyway...
================================================================================
Starting classification...
</div>
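<p>The two transcripts above differ only in how the parsed verdict is reported; in both cases classification proceeds. Below is a minimal sketch of that reporting step, using the confidence bands from the flowchart (0.8 and above, 0.6 to 0.8, below 0.6); the logger name and result keys are assumptions:</p>
<div class="code-section">
<pre>
import logging

logger = logging.getLogger("email_sorter.verify")  # assumed logger name

def report_verification(result):
    """Log the verdict and any suggestions; never abort the run."""
    verdict = result.get("verdict", "FAIR_MATCH")
    confidence = result.get("confidence", 0.5)
    logger.info("Verification: %s (confidence %.0f%%)", verdict, confidence * 100)

    if verdict == "POOR_MATCH":
        logger.warning("Model categories may not fit this mailbox well")
        logger.warning("Suggested categories: %s", result.get("suggested_categories", []))
        logger.warning("Consider running full calibration for better accuracy")
        logger.info("Proceeding with existing model anyway...")
    elif verdict == "GOOD_MATCH":
        logger.info("Model categories look appropriate for this mailbox")
</pre>
</div>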
<h2>LLM Prompt Structure</h2>
<div class="code-section">
You are evaluating whether pre-trained email categories fit a new mailbox.
TRAINED MODEL CATEGORIES (11 categories):
- Updates
- Work
- Meetings
- External
- Financial
- Test
- Administrative
- Operational
- Technical
- Urgent
- Requests
SAMPLE EMAILS FROM NEW MAILBOX (20 total, showing first 20):
1. From: phillip.allen@enron.com
Subject: Re: AEC Volumes at OPAL
Preview: Here are the volumes for today...
2. From: notifications@amazon.com
Subject: Your order has shipped
Preview: Your Amazon.com order #123-4567890...
[... 18 more emails ...]
TASK:
Evaluate if the trained categories are appropriate for this mailbox.
Consider:
1. Do the sample emails naturally fit into the trained categories?
2. Are there obvious email types that don't match any category?
3. Are the category names semantically appropriate?
4. Would a user find these categories helpful for THIS mailbox?
Respond with JSON:
{
"verdict": "GOOD_MATCH" | "FAIR_MATCH" | "POOR_MATCH",
"confidence": 0.0-1.0,
"reasoning": "brief explanation",
"fit_percentage": 0-100,
"suggested_categories": ["cat1", "cat2", ...],
"category_mapping": {"old_name": "better_name", ...}
}
</div>
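<p>A sketch of how a prompt like the one above could be assembled from the trained categories and the sampled emails. The <code>sender</code>, <code>subject</code>, and <code>body</code> attributes are assumptions about the project's email objects, and the task and JSON instructions are abbreviated:</p>
<div class="code-section">
<pre>
def build_verification_prompt(categories, sample):
    """Assemble the category-verification prompt (abbreviated sketch)."""
    lines = [
        "You are evaluating whether pre-trained email categories fit a new mailbox.",
        f"TRAINED MODEL CATEGORIES ({len(categories)} categories):",
    ]
    lines += [f"- {name}" for name in categories]

    lines.append(f"SAMPLE EMAILS FROM NEW MAILBOX ({len(sample)} total, showing first {len(sample)}):")
    for i, email in enumerate(sample, start=1):
        lines.append(f"{i}. From: {email.sender}")
        lines.append(f"   Subject: {email.subject}")
        lines.append(f"   Preview: {email.body[:80]}...")

    lines.append("TASK:")
    lines.append("Evaluate if the trained categories are appropriate for this mailbox.")
    lines.append('Respond with JSON: {"verdict": "GOOD_MATCH" | "FAIR_MATCH" | "POOR_MATCH", '
                 '"confidence": 0.0-1.0, "reasoning": "...", "fit_percentage": 0-100, '
                 '"suggested_categories": [...], "category_mapping": {...}}')
    return "\n".join(lines)
</pre>
</div>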
<h2>Configuration</h2>
<table style="width:100%; border-collapse: collapse; background: #252526; margin: 20px 0;">
<tr style="background: #37373d;">
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Flag</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Type</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Default</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Description</th>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;"><code>--verify-categories</code></td>
<td style="padding: 10px;">Flag</td>
<td style="padding: 10px;">False</td>
<td style="padding: 10px;">Enable category verification</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;"><code>--verify-sample</code></td>
<td style="padding: 10px;">Integer</td>
<td style="padding: 10px;">20</td>
<td style="padding: 10px;">Number of emails to sample</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;"><code>--no-llm-fallback</code></td>
<td style="padding: 10px;">Flag</td>
<td style="padding: 10px;">False</td>
<td style="padding: 10px;">Disable LLM fallback during classification</td>
</tr>
</table>
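<p>This document does not show how the flags are declared in <code>src/cli.py</code>; the snippet below is an equivalent <code>argparse</code> sketch of the three options, not the project's actual CLI code:</p>
<div class="code-section">
<pre>
import argparse

parser = argparse.ArgumentParser(prog="python -m src.cli run")
parser.add_argument("--verify-categories", action="store_true",
                    help="Enable category verification (one extra LLM call)")
parser.add_argument("--verify-sample", type=int, default=20,
                    help="Number of emails to sample for verification")
parser.add_argument("--no-llm-fallback", action="store_true",
                    help="Disable LLM fallback during classification")

# Example: parse the flags used in the custom-sample command above.
args = parser.parse_args(["--verify-categories", "--verify-sample", "30", "--no-llm-fallback"])
print(args.verify_categories, args.verify_sample, args.no_llm_fallback)  # True 30 True
</pre>
</div>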
<h2>When Verification Runs</h2>
<ul>
<li>✅ Only if <code>--verify-categories</code> flag is set</li>
<li>✅ Only if trained model exists (not mock)</li>
<li>✅ After emails are fetched, before calibration/classification</li>
<li>❌ Skipped if using mock model</li>
<li>❌ Skipped if no trained model exists (full calibration runs instead)</li>
</ul>
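<p>Taken together, the conditions above reduce to a single gate evaluated after the fetch step; the attribute names in this sketch (<code>args.verify_categories</code>, <code>model.is_mock</code>) are assumptions for illustration:</p>
<div class="code-section">
<pre>
def should_verify(args, model):
    """Verification runs only when explicitly requested and a real trained model is loaded."""
    return (
        args.verify_categories                      # --verify-categories was passed
        and model is not None                       # a trained model was loaded
        and not getattr(model, "is_mock", False)    # and it is not the mock model
    )
</pre>
</div>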
<h2>Timing Impact</h2>
<table style="width:100%; border-collapse: collapse; background: #252526; margin: 20px 0;">
<tr style="background: #37373d;">
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Configuration</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">Time (10k emails)</th>
<th style="padding: 12px; text-align: left; color: #4ec9b0;">LLM Calls</th>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">ML-only (no flags)</td>
<td style="padding: 10px;">~4 minutes</td>
<td style="padding: 10px;">0</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">ML-only + <code>--verify-categories</code></td>
<td style="padding: 10px;">~4.3 minutes</td>
<td style="padding: 10px;">1 (verification)</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">Full calibration (no model)</td>
<td style="padding: 10px;">~25 minutes</td>
<td style="padding: 10px;">~500</td>
</tr>
<tr style="border-bottom: 1px solid #3e3e42;">
<td style="padding: 10px;">ML + LLM fallback (21%)</td>
<td style="padding: 10px;">~2.5 hours</td>
<td style="padding: 10px;">~2100</td>
</tr>
</table>
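<p>The figures in the table boil down to two rough rates: ML classification at roughly 2,500 emails per minute and LLM calls at a few seconds each, plus about 20 seconds for the single verification call. The estimator below is a back-of-envelope sketch using constants approximated from the table, not measured values:</p>
<div class="code-section">
<pre>
ML_EMAILS_PER_MIN = 2500      # ~10,000 emails in ~4 minutes
SECONDS_PER_LLM_CALL = 4      # ~2,100 fallback calls over ~2.5 hours
VERIFICATION_SECONDS = 20     # single verification call

def estimate_minutes(n_emails, verify=False, llm_fallback_rate=0.0):
    """Rough wall-clock estimate in minutes for an ML-first run."""
    ml_minutes = n_emails / ML_EMAILS_PER_MIN
    verify_minutes = VERIFICATION_SECONDS / 60 if verify else 0
    llm_minutes = n_emails * llm_fallback_rate * SECONDS_PER_LLM_CALL / 60
    return ml_minutes + verify_minutes + llm_minutes

print(estimate_minutes(10_000))                          # ~4.0 min  (ML-only)
print(estimate_minutes(10_000, verify=True))             # ~4.3 min  (ML-only + verification)
print(estimate_minutes(10_000, llm_fallback_rate=0.21))  # ~144 min, in line with the ~2.5 h row
</pre>
</div>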
<h2>Decision Tree</h2>
<div class="diagram">
<pre class="mermaid">
flowchart TD
Start([Need to classify emails]) --> HaveModel{Trained model<br/>exists?}
HaveModel -->|No| MustCalibrate[Must run calibration<br/>~20 minutes<br/>~500 LLM calls]
HaveModel -->|Yes| SameDomain{Same domain as<br/>training data?}
SameDomain -->|Yes, confident| FastML[Pure ML<br/>4 minutes<br/>0 LLM calls]
SameDomain -->|Unsure| VerifyML[ML + Verification<br/>4.3 minutes<br/>1 LLM call]
SameDomain -->|No, different| Options{Accuracy needs?}
Options -->|High accuracy required| MustCalibrate
Options -->|Speed more important| VerifyML
Options -->|Experimental| FastML
MustCalibrate --> Done[Classification complete]
FastML --> Done
VerifyML --> Done
style FastML fill:#4ec9b0
style VerifyML fill:#ffd93d
style MustCalibrate fill:#ff6b6b
</pre>
</div>
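<p>Read as code, the decision tree maps directly onto flag choices. The helper below is for illustration only; the mode names are descriptive labels for this document, not CLI values:</p>
<div class="code-section">
<pre>
def pick_mode(has_trained_model, same_domain, accuracy_critical=False):
    """Recommend a run mode following the decision tree above.

    same_domain: True, False, or None (unsure).
    """
    if not has_trained_model:
        return "full-calibration"        # ~20-25 min, ~500 LLM calls
    if same_domain is True:
        return "pure-ml"                 # --no-llm-fallback only, 0 LLM calls
    if same_domain is None:
        return "ml-with-verification"    # add --verify-categories, 1 LLM call
    # Different domain: calibrate if accuracy matters, otherwise verify first.
    return "full-calibration" if accuracy_critical else "ml-with-verification"
</pre>
</div>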
<h2>Quick Start</h2>
<div class="code-section">
<strong>Test with verification on same domain (Enron → Enron):</strong>
python -m src.cli run \
--source enron \
--limit 1000 \
--output verify_test_same/ \
--no-llm-fallback \
--verify-categories
Expected: GOOD_MATCH (0.80-0.95)
Time: ~30 seconds
<strong>Test without verification for speed comparison:</strong>
python -m src.cli run \
--source enron \
--limit 1000 \
--output no_verify_test/ \
--no-llm-fallback
Expected: Same accuracy, 20 seconds faster
Time: ~10 seconds
</div>
<script>
mermaid.initialize({
startOnLoad: true,
theme: 'default',
flowchart: {
useMaxWidth: true,
htmlLabels: true,
curve: 'basis'
}
});
</script>
</body>
</html>