

# Email Classification Report Format
This document explains the HTML report generation system, its data sources, and how to customize it.
## Overview
The report generator creates a static HTML file from classification results. It requires an enriched `results.json` that includes email metadata (subject, sender, date, etc.), not just classification data.
## Files Involved
| File | Purpose |
|------|---------|
| `tools/generate_html_report.py` | Main report generator script |
| `src/cli.py` | Classification CLI - outputs enriched `results.json` |
| `src/export/exporter.py` | Legacy exporter (JSON/CSV) - not used for HTML |
## Data Flow
```
Email Source (.eml/.msg files)
        ↓
src/cli.py (classification)
        ↓
results.json (enriched with metadata)
        ↓
tools/generate_html_report.py
        ↓
report.html (static, self-contained)
```
## Usage
### Generate Report
```bash
python tools/generate_html_report.py \
    --input /path/to/results.json \
    --output /path/to/report.html
```
If `--output` is omitted, the script writes `report.html` to the same directory as the input file.
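The default-output rule above can be sketched with `pathlib`. This is a minimal illustration, not the script's actual code; the function name `resolve_output_path` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_output_path(input_path: str, output_path: Optional[str] = None) -> Path:
    """Return the explicit output path, or report.html next to the input file."""
    if output_path:
        return Path(output_path)
    # No --output given: place report.html beside results.json.
    return Path(input_path).parent / "report.html"
```

For example, an input of `/path/to/output/results.json` with no explicit output resolves to `/path/to/output/report.html`.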
### Full Workflow
```bash
# 1. Classify emails
python -m src.cli run \
    --source local \
    --directory "/path/to/emails" \
    --output "/path/to/output" \
    --no-llm-fallback

# 2. Generate report
python tools/generate_html_report.py \
    --input "/path/to/output/results.json"
```
## results.json Format
The report generator expects this structure:
```json
{
  "metadata": {
    "total_emails": 801,
    "accuracy_estimate": 0.55,
    "classification_stats": {
      "rule_matched": 9,
      "ml_classified": 468,
      "llm_classified": 0,
      "needs_review": 324
    },
    "generated_at": "2025-11-28T02:34:00.680196",
    "source": "local",
    "source_path": "/path/to/emails"
  },
  "classifications": [
    {
      "email_id": "unique_id.eml",
      "subject": "Email subject line",
      "sender": "sender@example.com",
      "sender_name": "Sender Name",
      "date": "2023-04-13T09:43:29+10:00",
      "has_attachments": false,
      "category": "Work",
      "confidence": 0.81,
      "method": "ml"
    }
  ]
}
```
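In the example above, the four `classification_stats` counters add up to `total_emails` (9 + 468 + 0 + 324 = 801). A quick sanity check along those lines might look like the sketch below. It assumes the counters always partition the full set of emails, which the example suggests but this document does not state; the function name `stats_match_total` is hypothetical.

```python
import json


def stats_match_total(results: dict) -> bool:
    """Check that the per-method counters sum to total_emails.

    Assumption: the counters are mutually exclusive and cover every email.
    """
    meta = results["metadata"]
    return sum(meta["classification_stats"].values()) == meta["total_emails"]


example = json.loads("""
{
  "metadata": {
    "total_emails": 801,
    "classification_stats": {
      "rule_matched": 9,
      "ml_classified": 468,
      "llm_classified": 0,
      "needs_review": 324
    }
  },
  "classifications": []
}
""")
print(stats_match_total(example))
```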
### Required Fields
| Field | Type | Description |
|-------|------|-------------|
| `email_id` | string | Unique identifier (usually filename) |
| `subject` | string | Email subject line |
| `sender` | string | Sender email address |
| `category` | string | Assigned category |
| `confidence` | float | Classification confidence (0-1) |
| `method` | string | Classification method: `ml`, `rule`, or `llm` |
### Optional Fields
| Field | Type | Description |
|-------|------|-------------|
| `sender_name` | string | Display name of sender |
| `date` | string | ISO 8601 date string |
| `has_attachments` | boolean | Whether email has attachments |
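Before generating a report, it can help to confirm that every record carries the required fields listed above. The following is a minimal sketch under that assumption; `find_missing_fields` is a hypothetical helper, not part of the generator.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("email_id", "subject", "sender", "category", "confidence", "method")


def find_missing_fields(results: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each email_id (or list position) to the required fields it lacks."""
    problems: Dict[str, List[str]] = {}
    for index, record in enumerate(results.get("classifications", [])):
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            # Fall back to the list position when email_id itself is absent.
            key = str(record.get("email_id", f"#{index}"))
            problems[key] = missing
    return problems
```

An empty result means every record has the required fields; optional fields are deliberately not checked.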
## Report Sections
### 1. Header
- Report title
- Generation timestamp
- Source info
- Total email count
### 2. Stats Grid
- Total emails
- Number of categories
- High confidence count (>=70%)
- Unique sender domains
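The four Stats Grid values can be derived directly from the `classifications` array. The sketch below is illustrative only: `compute_stats` is a hypothetical name, and taking the domain as the text after the last `@` (lowercased) is an assumption about how domains are counted.

```python
from typing import Any, Dict, List


def compute_stats(classifications: List[Dict[str, Any]], high_threshold: float = 0.70) -> Dict[str, int]:
    """Compute the four headline numbers shown in the Stats Grid."""
    categories = {record["category"] for record in classifications}
    high = sum(1 for record in classifications if record["confidence"] >= high_threshold)
    # Assumption: the domain is everything after the last "@", case-insensitive.
    domains = {
        record["sender"].rsplit("@", 1)[-1].lower()
        for record in classifications
        if "@" in record["sender"]
    }
    return {
        "total_emails": len(classifications),
        "categories": len(categories),
        "high_confidence": high,
        "sender_domains": len(domains),
    }
```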
### 3. Category Distribution
- Horizontal bar chart
- Count and percentage per category
- Sorted by count (descending)
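The count, percentage, and descending order described above map naturally onto `collections.Counter`. This is a sketch of one way to compute the chart data, with the hypothetical name `category_distribution`; the generator may do it differently.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def category_distribution(classifications: List[Dict[str, Any]]) -> List[Tuple[str, int, float]]:
    """Return (category, count, percentage) tuples, largest count first."""
    counts = Counter(record["category"] for record in classifications)
    total = sum(counts.values())
    # most_common() yields entries in descending count order.
    return [(name, count, 100.0 * count / total) for name, count in counts.most_common()]
```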
### 4. Classification Methods
- Breakdown of ML vs Rule vs LLM
- Shows which method handled what percentage
### 5. Confidence Distribution
- High (>=70%): Green
- Medium (>=50% and <70%): Yellow
- Low (<50%): Red
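The three bands can be expressed as a small function. In this sketch a confidence of exactly 0.70 counts as high and exactly 0.50 counts as medium, which follows from the `>=70%` and `<50%` bounds; the name `confidence_band` is hypothetical.

```python
def confidence_band(confidence: float) -> str:
    """Classify a 0-1 confidence score into the report's three bands."""
    if confidence >= 0.70:
        return "high"    # rendered green
    if confidence >= 0.50:
        return "medium"  # rendered yellow
    return "low"         # rendered red
```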
### 6. Top Senders
- Top 20 senders by email count
- Grid layout
### 7. Email Tables (Tabbed)
- "All" tab shows all emails
- Category tabs filter by category
- Search box filters by subject/sender
- Columns: Date, Subject, Sender, Category, Confidence, Method
- Sorted by date (newest first)
- Attachment indicator (📎)
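Because `date` is optional, a newest-first sort needs a rule for records without a usable date. The sketch below places them last; that placement, the UTC fallback for dates without an offset, and the name `sort_newest_first` are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Tuple


def sort_newest_first(classifications: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort records by ISO 8601 date, newest first; undated records go last."""

    def sort_key(record: Dict[str, Any]) -> Tuple[int, float]:
        try:
            parsed = datetime.fromisoformat(record["date"])
        except (KeyError, TypeError, ValueError):
            return (0, 0.0)  # missing or unparseable date
        if parsed.tzinfo is None:
            # Assumption: treat dates without an offset as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return (1, parsed.timestamp())

    return sorted(classifications, key=sort_key, reverse=True)
```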
## Customization
### Changing Colors
Edit the CSS variables in `generate_html_report.py`:
```css
:root {
    --bg-primary: #1a1a2e;      /* Main background */
    --bg-secondary: #16213e;    /* Card backgrounds */
    --bg-card: #0f3460;         /* Nested elements */
    --text-primary: #eee;       /* Main text */
    --text-secondary: #aaa;     /* Muted text */
    --accent: #e94560;          /* Accent color (red) */
    --accent-hover: #ff6b6b;    /* Accent hover */
    --success: #00d9a5;         /* Green (high confidence) */
    --warning: #ffc107;         /* Yellow (medium confidence) */
    --border: #2a2a4a;          /* Border color */
}
```
### Light Theme Example
```css
:root {
    --bg-primary: #f5f5f5;
    --bg-secondary: #ffffff;
    --bg-card: #e8e8e8;
    --text-primary: #333;
    --text-secondary: #666;
    --accent: #2563eb;
    --accent-hover: #3b82f6;
    --success: #10b981;
    --warning: #f59e0b;
    --border: #d1d5db;
}
```
### Adding New Sections
1. Add data extraction in `generate_html_report()` function
2. Add HTML section in the main template string
3. Style with existing CSS classes or add new ones
### Adding New Table Columns
1. Modify `generate_email_row()` function
2. Add `<th>` in table header
3. Add `<td>` in row template
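As an illustration of the steps above, here is a sketch of a row builder with an extra "Sender Name" cell. The real `generate_email_row()` signature and template are not shown in this document, so `render_email_row` below is a hypothetical stand-in, not the script's code. It escapes each value with `html.escape` so that subjects containing `<` or `&` do not break the markup.

```python
from html import escape
from typing import Any, Dict


def render_email_row(record: Dict[str, Any]) -> str:
    """Build one <tr> with a cell per column, HTML-escaping every value."""
    cells = [
        record.get("date", "N/A"),
        record.get("subject", ""),
        record.get("sender", ""),
        record.get("sender_name", ""),  # the newly added column
        record.get("category", ""),
        f"{record.get('confidence', 0.0):.0%}",
        record.get("method", ""),
    ]
    return "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in cells) + "</tr>"
```

The table header needs a matching `<th>` in the same position, otherwise the columns will be misaligned.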
## Performance Notes
- Report is fully static (no server required)
- JavaScript is minimal (tab switching, search filtering)
- Handles 1000+ emails without performance issues
- For 10k+ emails, consider pagination (not yet implemented)
## Future Enhancements (TODO)
- [ ] Pagination for large datasets
- [ ] Export to PDF option
- [ ] Configurable color themes via CLI
- [ ] Column sorting (click headers)
- [ ] Date range filter
- [ ] Sender domain grouping
- [ ] Category confidence heatmap
- [ ] Email body preview on hover
## Troubleshooting
### "KeyError: 'subject'"
`results.json` lacks email metadata. Re-run classification with the latest `src/cli.py`.
### Empty tables
Check that `results.json` contains a non-empty `classifications` array.
### Dates showing "N/A"
Date parsing failed. Check that the `date` values in `results.json` are ISO 8601 strings.
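To see which date strings would fall back to "N/A", a parsing check like the one below can be run against the `date` values. It is a sketch using `datetime.fromisoformat`, which may not be what the generator itself uses; the name `format_date` and the output format are assumptions. Note that `fromisoformat` accepts a trailing `Z` only from Python 3.11 onward.

```python
from datetime import datetime
from typing import Optional


def format_date(value: Optional[str], fallback: str = "N/A") -> str:
    """Format an ISO 8601 string for display, or return the fallback."""
    if not value:
        return fallback
    try:
        return datetime.fromisoformat(value).strftime("%Y-%m-%d %H:%M")
    except (TypeError, ValueError):
        # Not ISO 8601 (for example "13/04/2023"), so show the fallback.
        return fallback
```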
### Search not working
This usually indicates a JavaScript error. Check the browser console, and make sure the data contains no unescaped HTML characters.