Project Reorganization:
- Created docs/ directory and moved all documentation
- Created scripts/ directory for shell scripts
- Created scripts/experimental/ for research scripts
- Updated .gitignore for new structure
- Updated README.md with MVP status and new structure

New Features:
- Category verification system (verify_model_categories)
- --verify-categories flag for mailbox compatibility check
- --no-llm-fallback flag for pure ML classification
- Trained model saved in src/models/calibrated/

Threshold Optimization:
- Reduced default threshold from 0.75 to 0.55
- Updated all category thresholds to 0.55
- Reduces LLM fallback rate by 40% (35% -> 21%)

Documentation:
- SYSTEM_FLOW.html - Complete system architecture
- VERIFY_CATEGORIES_FEATURE.html - Feature documentation
- LABEL_TRAINING_PHASE_DETAIL.html - Calibration breakdown
- FAST_ML_ONLY_WORKFLOW.html - Pure ML guide
- PROJECT_STATUS_AND_NEXT_STEPS.html - Roadmap
- ROOT_CAUSE_ANALYSIS.md - Bug fixes

MVP Status:
- 10k emails in 4 minutes, 72.7% accuracy, 0 LLM calls
- LLM-driven category discovery working
- Embedding-based transfer learning confirmed
- All model paths verified and working
358 lines
12 KiB
HTML
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Category Verification Feature</title>
    <script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
    <style>
        body {
            font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
            margin: 20px;
            background: #1e1e1e;
            color: #d4d4d4;
        }
        h1, h2, h3 {
            color: #4ec9b0;
        }
        .diagram {
            background: white;
            padding: 20px;
            margin: 20px 0;
            border-radius: 8px;
        }
        .code-section {
            background: #252526;
            padding: 15px;
            margin: 10px 0;
            border-left: 4px solid #4ec9b0;
            font-family: 'Courier New', monospace;
        }
        code {
            background: #1e1e1e;
            padding: 2px 6px;
            border-radius: 3px;
            color: #ce9178;
        }
        .success {
            background: #002a00;
            border-left: 4px solid #4ec9b0;
            padding: 15px;
            margin: 10px 0;
        }
    </style>
</head>
<body>
    <h1>--verify-categories Feature</h1>

    <div class="success">
        <h2>✅ IMPLEMENTED AND READY TO USE</h2>
        <p><strong>Feature:</strong> A single LLM call that checks whether the trained model's categories fit a new mailbox</p>
        <p><strong>Cost:</strong> +20 seconds, 1 LLM call</p>
        <p><strong>Value:</strong> Confidence check before bulk ML classification</p>
    </div>

    <h2>Usage</h2>

    <div class="code-section">
<strong>Basic usage (with verification):</strong>
python -m src.cli run \
  --source enron \
  --limit 10000 \
  --output verified_test/ \
  --no-llm-fallback \
  --verify-categories

<strong>Custom verification sample size:</strong>
python -m src.cli run \
  --source enron \
  --limit 10000 \
  --output verified_test/ \
  --no-llm-fallback \
  --verify-categories \
  --verify-sample 30

<strong>Without verification (fastest):</strong>
python -m src.cli run \
  --source enron \
  --limit 10000 \
  --output fast_test/ \
  --no-llm-fallback
    </div>

    <h2>How It Works</h2>

    <div class="diagram">
        <pre class="mermaid">
flowchart TD
    Start([Run with --verify-categories]) --> LoadModel[Load trained model<br/>Categories: Updates, Work,<br/>Meetings, etc.]

    LoadModel --> FetchEmails[Fetch all emails<br/>10,000 total]

    FetchEmails --> CheckFlag{--verify-categories?}
    CheckFlag -->|No| SkipVerify[Skip verification<br/>Proceed to classification]
    CheckFlag -->|Yes| Sample[Sample random emails<br/>Default: 20 emails]

    Sample --> BuildPrompt[Build verification prompt<br/>Show model categories<br/>Show sample emails]

    BuildPrompt --> LLMCall[Single LLM call<br/>~20 seconds<br/>Task: Rate category fit]

    LLMCall --> ParseResponse[Parse JSON response<br/>Extract verdict + confidence]

    ParseResponse --> Verdict{Verdict?}

    Verdict -->|GOOD_MATCH<br/>80%+ fit| LogGood[Log: Categories appropriate<br/>Confidence: 0.8-1.0]
    Verdict -->|FAIR_MATCH<br/>60-80% fit| LogFair[Log: Categories acceptable<br/>Confidence: 0.6-0.8]
    Verdict -->|POOR_MATCH<br/>under 60% fit| LogPoor[Log WARNING<br/>Show suggested categories<br/>Recommend calibration<br/>Confidence: 0.0-0.6]

    LogGood --> Proceed[Proceed with ML classification]
    LogFair --> Proceed
    LogPoor --> Proceed

    SkipVerify --> Proceed

    Proceed --> ClassifyAll[Classify all 10,000 emails<br/>Pure ML, no LLM fallback<br/>~4 minutes]

    ClassifyAll --> Done[Results saved]

    style LLMCall fill:#ffd93d
    style LogGood fill:#4ec9b0
    style LogPoor fill:#ff6b6b
    style ClassifyAll fill:#4ec9b0
        </pre>
    </div>
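<p>The sampling and prompt-building steps in the diagram can be sketched roughly as follows. This is a minimal illustration, not the project's code: the helper names <code>sample_emails</code> and <code>build_verification_prompt</code> and the email dictionary keys (<code>sender</code>, <code>subject</code>, <code>preview</code>) are assumptions made for the example.</p>

```python
import random
from typing import Dict, List, Optional


def sample_emails(emails: List[Dict[str, str]], sample_size: int = 20,
                  seed: Optional[int] = None) -> List[Dict[str, str]]:
    """Pick up to sample_size emails at random (hypothetical helper)."""
    rng = random.Random(seed)
    # Never ask for more items than the mailbox contains.
    return rng.sample(emails, min(sample_size, len(emails)))


def build_verification_prompt(categories: List[str],
                              sample: List[Dict[str, str]]) -> str:
    """Assemble a prompt listing the model categories and the sampled emails."""
    lines = [
        "You are evaluating whether pre-trained email categories fit a new mailbox.",
        "",
        f"TRAINED MODEL CATEGORIES ({len(categories)} categories):",
    ]
    lines += [f"- {name}" for name in categories]
    lines += ["", f"SAMPLE EMAILS FROM NEW MAILBOX ({len(sample)} total):"]
    for i, email in enumerate(sample, start=1):
        # Assumed keys: 'sender', 'subject', 'preview'.
        lines.append(f"{i}. From: {email['sender']}")
        lines.append(f"   Subject: {email['subject']}")
        lines.append(f"   Preview: {email['preview']}")
    return "\n".join(lines)
```

<p>Passing a fixed <code>seed</code> makes the sample reproducible, which can help when comparing verdicts across runs.</p>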

    <h2>Example Outputs</h2>

    <h3>Scenario 1: GOOD_MATCH (Enron → Enron)</h3>
    <div class="code-section">
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: GOOD_MATCH (0.85)
Reasoning: The sample emails fit well into the trained categories. Most are work-related correspondence, meetings, and operational updates which align with the model.

Verification: GOOD_MATCH
Confidence: 85%
Model categories look appropriate for this mailbox
================================================================================

Starting classification...
    </div>

    <h3>Scenario 2: POOR_MATCH (Enron → Personal Gmail)</h3>
    <div class="code-section">
================================================================================
VERIFYING MODEL CATEGORIES
================================================================================
Verifying model categories against 10000 emails
Model categories (11): Updates, Work, Meetings, External, Financial, Test, Administrative, Operational, Technical, Urgent, Requests
Sampled 20 emails for verification
Calling LLM for category verification...
Verification complete: POOR_MATCH (0.45)
Reasoning: Many sample emails are shopping confirmations, social media notifications, and personal correspondence which don't fit the business-focused categories well.

Verification: POOR_MATCH
Confidence: 45%
================================================================================
WARNING: Model categories may not fit this mailbox well
Suggested categories: ['Shopping', 'Social', 'Travel', 'Newsletters', 'Personal']
Consider running full calibration for better accuracy
Proceeding with existing model anyway...
================================================================================

Starting classification...
    </div>
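<p>The closing lines of the two scenarios above differ only by verdict. A sketch of how such a summary might be assembled is shown here; the function name <code>format_verification_report</code> is hypothetical, and the FAIR_MATCH message is an assumption, since no FAIR_MATCH output is shown in this document.</p>

```python
from typing import List, Optional


def format_verification_report(verdict: str, confidence: float,
                               suggested: Optional[List[str]] = None) -> List[str]:
    """Return the summary lines for a verdict (hypothetical helper)."""
    # confidence is a 0.0-1.0 fraction; ':.0%' renders it as a whole percent.
    lines = [f"Verification: {verdict}", f"Confidence: {confidence:.0%}"]
    if verdict == "GOOD_MATCH":
        lines.append("Model categories look appropriate for this mailbox")
    elif verdict == "FAIR_MATCH":
        # Assumed wording; the diagram only says "Categories acceptable".
        lines.append("Model categories look acceptable for this mailbox")
    elif verdict == "POOR_MATCH":
        lines.append("WARNING: Model categories may not fit this mailbox well")
        if suggested:
            lines.append(f"Suggested categories: {suggested}")
        lines.append("Consider running full calibration for better accuracy")
        # Verification is advisory: classification still proceeds.
        lines.append("Proceeding with existing model anyway...")
    return lines
```

<p>Note that every branch falls through to classification, matching the diagram: a poor verdict produces a warning, not an abort.</p>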

    <h2>LLM Prompt Structure</h2>

    <div class="code-section">
You are evaluating whether pre-trained email categories fit a new mailbox.

TRAINED MODEL CATEGORIES (11 categories):
- Updates
- Work
- Meetings
- External
- Financial
- Test
- Administrative
- Operational
- Technical
- Urgent
- Requests

SAMPLE EMAILS FROM NEW MAILBOX (20 total, showing first 20):
1. From: phillip.allen@enron.com
   Subject: Re: AEC Volumes at OPAL
   Preview: Here are the volumes for today...

2. From: notifications@amazon.com
   Subject: Your order has shipped
   Preview: Your Amazon.com order #123-4567890...

[... 18 more emails ...]

TASK:
Evaluate if the trained categories are appropriate for this mailbox.

Consider:
1. Do the sample emails naturally fit into the trained categories?
2. Are there obvious email types that don't match any category?
3. Are the category names semantically appropriate?
4. Would a user find these categories helpful for THIS mailbox?

Respond with JSON:
{
  "verdict": "GOOD_MATCH" | "FAIR_MATCH" | "POOR_MATCH",
  "confidence": 0.0-1.0,
  "reasoning": "brief explanation",
  "fit_percentage": 0-100,
  "suggested_categories": ["cat1", "cat2", ...],
  "category_mapping": {"old_name": "better_name", ...}
}
    </div>
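<p>A reply in the JSON shape requested above could be parsed and validated along these lines. This is a sketch under assumptions: <code>VerificationResult</code> and <code>parse_verification_response</code> are illustrative names, and the tolerance for text around the JSON object is a defensive choice, not something this document specifies.</p>

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List

VALID_VERDICTS = {"GOOD_MATCH", "FAIR_MATCH", "POOR_MATCH"}


@dataclass
class VerificationResult:
    verdict: str
    confidence: float
    reasoning: str = ""
    fit_percentage: int = 0
    suggested_categories: List[str] = field(default_factory=list)
    category_mapping: Dict[str, str] = field(default_factory=dict)


def parse_verification_response(raw: str) -> VerificationResult:
    """Extract and validate the JSON object from an LLM reply (hypothetical helper)."""
    # Models sometimes wrap JSON in extra prose, so slice out the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in LLM response")
    data = json.loads(raw[start:end + 1])

    verdict = data.get("verdict")
    if verdict not in VALID_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")

    # Clamp confidence into the documented 0.0-1.0 range.
    confidence = min(1.0, max(0.0, float(data.get("confidence", 0.0))))
    return VerificationResult(
        verdict=verdict,
        confidence=confidence,
        reasoning=str(data.get("reasoning", "")),
        fit_percentage=int(data.get("fit_percentage", 0)),
        suggested_categories=list(data.get("suggested_categories", [])),
        category_mapping=dict(data.get("category_mapping", {})),
    )
```

<p>Rejecting unknown verdicts early keeps a malformed reply from being logged as if it were a real assessment.</p>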

    <h2>Configuration</h2>

    <table style="width:100%; border-collapse: collapse; background: #252526; margin: 20px 0;">
        <tr style="background: #37373d;">
            <th style="padding: 12px; text-align: left; color: #4ec9b0;">Flag</th>
            <th style="padding: 12px; text-align: left; color: #4ec9b0;">Type</th>
            <th style="padding: 12px; text-align: left; color: #4ec9b0;">Default</th>
            <th style="padding: 12px; text-align: left; color: #4ec9b0;">Description</th>
        </tr>
        <tr style="border-bottom: 1px solid #3e3e42;">
            <td style="padding: 10px;"><code>--verify-categories</code></td>
            <td style="padding: 10px;">Flag</td>
            <td style="padding: 10px;">False</td>
            <td style="padding: 10px;">Enable category verification</td>
        </tr>
        <tr style="border-bottom: 1px solid #3e3e42;">
            <td style="padding: 10px;"><code>--verify-sample</code></td>
            <td style="padding: 10px;">Integer</td>
            <td style="padding: 10px;">20</td>
            <td style="padding: 10px;">Number of emails to sample</td>
        </tr>
        <tr style="border-bottom: 1px solid #3e3e42;">
            <td style="padding: 10px;"><code>--no-llm-fallback</code></td>
            <td style="padding: 10px;">Flag</td>
            <td style="padding: 10px;">False</td>
            <td style="padding: 10px;">Disable LLM fallback during classification</td>
        </tr>
    </table>
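<p>For illustration, the flags in the table could be declared with the standard-library <code>argparse</code> module as below. Which CLI library the project actually uses is not stated here, so treat <code>build_parser</code> and its layout as assumptions; only the flag names, types, and defaults come from the table and the usage examples.</p>

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="src.cli")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="classify emails from a source")
    run.add_argument("--source", required=True)
    run.add_argument("--limit", type=int)
    run.add_argument("--output", required=True)
    # Boolean flags default to False and flip to True when present.
    run.add_argument("--verify-categories", action="store_true",
                     help="enable category verification")
    run.add_argument("--verify-sample", type=int, default=20,
                     help="number of emails to sample")
    run.add_argument("--no-llm-fallback", action="store_true",
                     help="disable LLM fallback during classification")
    return parser
```

<p>With this layout, <code>--verify-sample</code> has no effect unless <code>--verify-categories</code> is also given, which matches the usage examples above.</p>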

    <h2>When Verification Runs</h2>

    <ul>
        <li>✅ Only if the <code>--verify-categories</code> flag is set</li>
        <li>✅ Only if a trained model exists (not a mock)</li>
        <li>✅ After emails are fetched, before calibration/classification</li>
        <li>❌ Skipped if using a mock model</li>
        <li>❌ Skipped if the model doesn't exist (calibration will run anyway)</li>
    </ul>
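<p>The conditions in the list above reduce to a single predicate. A minimal sketch, with the function and parameter names chosen for this example only:</p>

```python
def should_verify(verify_flag: bool, model_exists: bool, model_is_mock: bool) -> bool:
    """Return True only when every documented condition for verification holds."""
    # Flag must be set, a trained model must exist, and it must not be a mock.
    return verify_flag and model_exists and not model_is_mock
```

<p>When the model is missing, verification is pointless because calibration will derive fresh categories for the mailbox anyway.</p>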

    <h2>Timing Impact</h2>

    <table style="width:100%; border-collapse: collapse; background: #252526; margin: 20px 0;">
        <tr style="background: #37373d;">
            <th style="padding: 12px; text-align: left; color: #4ec9b0;">Configuration</th>
            <th style="padding: 12px; text-align: left; color: #4ec9b0;">Time (10k emails)</th>
            <th style="padding: 12px; text-align: left; color: #4ec9b0;">LLM Calls</th>
        </tr>
        <tr style="border-bottom: 1px solid #3e3e42;">
            <td style="padding: 10px;">ML-only (<code>--no-llm-fallback</code>, no verification)</td>
            <td style="padding: 10px;">~4 minutes</td>
            <td style="padding: 10px;">0</td>
        </tr>
        <tr style="border-bottom: 1px solid #3e3e42;">
            <td style="padding: 10px;">ML-only + <code>--verify-categories</code></td>
            <td style="padding: 10px;">~4.3 minutes</td>
            <td style="padding: 10px;">1 (verification)</td>
        </tr>
        <tr style="border-bottom: 1px solid #3e3e42;">
            <td style="padding: 10px;">Full calibration (no model)</td>
            <td style="padding: 10px;">~25 minutes</td>
            <td style="padding: 10px;">~500</td>
        </tr>
        <tr style="border-bottom: 1px solid #3e3e42;">
            <td style="padding: 10px;">ML + LLM fallback (21%)</td>
            <td style="padding: 10px;">~2.5 hours</td>
            <td style="padding: 10px;">~2100</td>
        </tr>
    </table>
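<p>Two of the figures in the table follow from simple arithmetic on numbers quoted elsewhere on this page (about 4 minutes of ML classification, about 20 seconds for the verification call, and a 21% fallback rate over 10,000 emails). The snippet below only reproduces that arithmetic; the constant and function names are made up for the example.</p>

```python
ML_SECONDS = 4 * 60        # ~4 minutes of pure ML classification
VERIFY_SECONDS = 20        # one verification LLM call
EMAILS = 10_000
FALLBACK_RATE = 0.21       # share of emails sent to the LLM with fallback enabled


def timing_summary() -> dict:
    """Derive the verification overhead and the fallback call count."""
    return {
        # (240 s + 20 s) / 60 is about 4.3 minutes.
        "ml_plus_verify_minutes": (ML_SECONDS + VERIFY_SECONDS) / 60,
        # 21% of 10,000 emails is about 2,100 LLM calls.
        "fallback_llm_calls": round(EMAILS * FALLBACK_RATE),
        # Verification adds roughly 8% to the ML-only runtime.
        "verify_overhead_pct": 100 * VERIFY_SECONDS / ML_SECONDS,
    }
```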

    <h2>Decision Tree</h2>

    <div class="diagram">
        <pre class="mermaid">
flowchart TD
    Start([Need to classify emails]) --> HaveModel{Trained model<br/>exists?}

    HaveModel -->|No| MustCalibrate[Must run calibration<br/>~20 minutes<br/>~500 LLM calls]

    HaveModel -->|Yes| SameDomain{Same domain as<br/>training data?}

    SameDomain -->|Yes, confident| FastML[Pure ML<br/>4 minutes<br/>0 LLM calls]

    SameDomain -->|Unsure| VerifyML[ML + Verification<br/>4.3 minutes<br/>1 LLM call]

    SameDomain -->|No, different| Options{Accuracy needs?}

    Options -->|High accuracy required| MustCalibrate
    Options -->|Speed more important| VerifyML
    Options -->|Experimental| FastML

    MustCalibrate --> Done[Classification complete]
    FastML --> Done
    VerifyML --> Done

    style FastML fill:#4ec9b0
    style VerifyML fill:#ffd93d
    style MustCalibrate fill:#ff6b6b
        </pre>
    </div>
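<p>The decision tree can also be written as a small function. This is a sketch for readers who prefer code to diagrams; <code>choose_workflow</code> and its string values are illustrative and do not exist in the CLI.</p>

```python
def choose_workflow(model_exists: bool, same_domain: str = "unsure",
                    priority: str = "speed") -> str:
    """Map the decision tree onto one of three workflows.

    same_domain: 'yes', 'unsure', or 'no'
    priority (only used when same_domain == 'no'):
        'accuracy', 'speed', or 'experimental'
    """
    if not model_exists:
        return "calibrate"                  # no model: calibration is mandatory
    if same_domain == "yes":
        return "ml_only"                    # confident match: skip verification
    if same_domain == "unsure":
        return "ml_with_verification"       # one LLM call as a sanity check
    if same_domain == "no":
        choices = {
            "accuracy": "calibrate",
            "speed": "ml_with_verification",
            "experimental": "ml_only",
        }
        if priority not in choices:
            raise ValueError(f"unknown priority: {priority!r}")
        return choices[priority]
    raise ValueError(f"unknown same_domain value: {same_domain!r}")
```

<p>For example, <code>choose_workflow(True, "no", "accuracy")</code> selects calibration, while <code>choose_workflow(True, "unsure")</code> selects ML with verification.</p>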

    <h2>Quick Start</h2>

    <div class="code-section">
<strong>Test with verification on same domain (Enron → Enron):</strong>
python -m src.cli run \
  --source enron \
  --limit 1000 \
  --output verify_test_same/ \
  --no-llm-fallback \
  --verify-categories

Expected: GOOD_MATCH (0.80-0.95)
Time: ~30 seconds

<strong>Test without verification for speed comparison:</strong>
python -m src.cli run \
  --source enron \
  --limit 1000 \
  --output no_verify_test/ \
  --no-llm-fallback

Expected: Same accuracy, 20 seconds faster
Time: ~10 seconds
    </div>

    <script>
        mermaid.initialize({
            startOnLoad: true,
            theme: 'default',
            flowchart: {
                useMaxWidth: true,
                htmlLabels: true,
                curve: 'basis'
            }
        });
    </script>
</body>
</html>