email-sorter/docs/REPORT_FORMAT.md

# Email Classification Report Format

This document explains the HTML report generation system, its data sources, and how to customize it.

## Overview

The report generator creates a static HTML file from classification results. It requires an enriched `results.json` that includes email metadata (subject, sender, date, etc.), not just classification data.

## Files Involved

| File | Purpose |
|------|---------|
| `tools/generate_html_report.py` | Main report generator script |
| `src/cli.py` | Classification CLI - outputs enriched `results.json` |
| `src/export/exporter.py` | Legacy exporter (JSON/CSV) - not used for HTML |

## Data Flow

```
Email Source (.eml/.msg files)
        ↓
   src/cli.py (classification)
        ↓
   results.json (enriched with metadata)
        ↓
   tools/generate_html_report.py
        ↓
   report.html (static, self-contained)
```

## Usage

### Generate Report

```bash
python tools/generate_html_report.py \
  --input /path/to/results.json \
  --output /path/to/report.html
```

If `--output` is omitted, the script writes `report.html` to the same directory as the input file.
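
The default output location can be sketched as follows. This is a minimal illustration of the rule described above, not the script's actual code; `default_output_path` is a hypothetical helper name.

```python
from pathlib import Path


def default_output_path(input_path: str) -> Path:
    """Return report.html in the same directory as the input file.

    Hypothetical helper: the real script may derive the path differently.
    """
    return Path(input_path).with_name("report.html")


print(default_output_path("/path/to/output/results.json").as_posix())
```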

### Full Workflow

```bash
# 1. Classify emails
python -m src.cli run \
  --source local \
  --directory "/path/to/emails" \
  --output "/path/to/output" \
  --no-llm-fallback

# 2. Generate report
python tools/generate_html_report.py \
  --input "/path/to/output/results.json"
```

## results.json Format

The report generator expects this structure:

```json
{
  "metadata": {
    "total_emails": 801,
    "accuracy_estimate": 0.55,
    "classification_stats": {
      "rule_matched": 9,
      "ml_classified": 468,
      "llm_classified": 0,
      "needs_review": 324
    },
    "generated_at": "2025-11-28T02:34:00.680196",
    "source": "local",
    "source_path": "/path/to/emails"
  },
  "classifications": [
    {
      "email_id": "unique_id.eml",
      "subject": "Email subject line",
      "sender": "sender@example.com",
      "sender_name": "Sender Name",
      "date": "2023-04-13T09:43:29+10:00",
      "has_attachments": false,
      "category": "Work",
      "confidence": 0.81,
      "method": "ml"
    }
  ]
}
```

### Required Fields

| Field | Type | Description |
|-------|------|-------------|
| `email_id` | string | Unique identifier (usually filename) |
| `subject` | string | Email subject line |
| `sender` | string | Sender email address |
| `category` | string | Assigned category |
| `confidence` | float | Classification confidence (0-1) |
| `method` | string | Classification method: `ml`, `rule`, or `llm` |

### Optional Fields

| Field | Type | Description |
|-------|------|-------------|
| `sender_name` | string | Display name of sender |
| `date` | string | ISO 8601 date string |
| `has_attachments` | boolean | Whether email has attachments |
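
Before generating a report, it can help to check that every record carries the required fields listed above. The following is a minimal sketch; `find_missing_fields` and `REQUIRED_FIELDS` are hypothetical names and are not part of the project.

```python
import json

# Required per-record fields, taken from the table above.
REQUIRED_FIELDS = ("email_id", "subject", "sender", "category", "confidence", "method")


def find_missing_fields(results: dict) -> dict:
    """Map each record's index to the list of required fields it lacks."""
    problems = {}
    for index, record in enumerate(results.get("classifications", [])):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems[index] = missing
    return problems


def check_file(path: str) -> dict:
    """Load a results.json file and report records with missing fields."""
    with open(path, encoding="utf-8") as handle:
        return find_missing_fields(json.load(handle))
```

An empty result means every record has the required fields; optional fields such as `date` are deliberately not checked.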

## Report Sections

### 1. Header
- Report title
- Generation timestamp
- Source info
- Total email count

### 2. Stats Grid
- Total emails
- Number of categories
- High confidence count (>=70%)
- Unique sender domains
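
The four stats can be derived from the `classifications` array roughly as shown below. This is a sketch under assumptions: `stats_grid` is a hypothetical name, and taking the domain as the lowercased text after the last `@` is one plausible reading of "sender domain", not necessarily what the generator does.

```python
def stats_grid(classifications: list) -> dict:
    """Compute the four headline numbers shown in the stats grid."""
    # Assumption: the domain is whatever follows the last "@", case-insensitive.
    domains = {
        record["sender"].rsplit("@", 1)[-1].lower()
        for record in classifications
        if "@" in record["sender"]
    }
    return {
        "total": len(classifications),
        "categories": len({record["category"] for record in classifications}),
        "high_confidence": sum(1 for r in classifications if r["confidence"] >= 0.70),
        "sender_domains": len(domains),
    }
```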

### 3. Category Distribution
- Horizontal bar chart
- Count and percentage per category
- Sorted by count (descending)
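
The data behind the bar chart can be computed with `collections.Counter`. The sketch below assumes each record has a `category` key; `category_distribution` is a hypothetical name.

```python
from collections import Counter


def category_distribution(classifications: list) -> list:
    """Return (category, count, percentage) tuples, largest count first."""
    counts = Counter(record["category"] for record in classifications)
    total = sum(counts.values())
    # most_common() already sorts by count in descending order.
    return [(name, count, 100.0 * count / total) for name, count in counts.most_common()]
```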

### 4. Classification Methods
- Breakdown of ML vs Rule vs LLM
- Shows which method handled what percentage

### 5. Confidence Distribution
- High (>=70%): Green
- Medium (50% to <70%): Yellow
- Low (<50%): Red
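
The thresholds above map to three bands. A minimal sketch, assuming `confidence` is a float between 0 and 1 as in `results.json`; `confidence_band` is a hypothetical name.

```python
def confidence_band(confidence: float) -> str:
    """Classify a 0-1 confidence score into the report's three bands."""
    if confidence >= 0.70:
        return "high"    # green
    if confidence >= 0.50:
        return "medium"  # yellow
    return "low"         # red
```

Exactly 70% counts as high and exactly 50% counts as medium, which follows from the `>=70%` and `<50%` boundaries.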

### 6. Top Senders
- Top 20 senders by email count
- Grid layout
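
A top-senders list of this kind can be built with `Counter.most_common`. The sketch below groups by the raw `sender` address, which is an assumption; the generator may group by display name or normalize case. `top_senders` is a hypothetical name.

```python
from collections import Counter


def top_senders(classifications: list, limit: int = 20) -> list:
    """Return up to `limit` (sender, email_count) pairs, most frequent first."""
    counts = Counter(record["sender"] for record in classifications)
    return counts.most_common(limit)
```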

### 7. Email Tables (Tabbed)
- "All" tab shows all emails
- Category tabs filter by category
- Search box filters by subject/sender
- Columns: Date, Subject, Sender, Category, Confidence, Method
- Sorted by date (newest first)
- Attachment indicator (📎)
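
Because `date` is optional, newest-first sorting needs a rule for records without a usable date. The sketch below places them last; that choice, and the names `date_sort_key` and `sort_newest_first`, are assumptions rather than the generator's documented behavior.

```python
from datetime import datetime


def date_sort_key(record: dict) -> float:
    """Return a POSIX timestamp for sorting, or -inf when the date is unusable."""
    try:
        # Dates without a UTC offset are interpreted in the local time zone.
        return datetime.fromisoformat(record["date"]).timestamp()
    except (KeyError, TypeError, ValueError):
        return float("-inf")


def sort_newest_first(classifications: list) -> list:
    """Sort records by date, newest first; undated records go to the end."""
    return sorted(classifications, key=date_sort_key, reverse=True)
```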

## Customization

### Changing Colors

Edit the CSS variables in `generate_html_report.py`:

```css
:root {
    --bg-primary: #1a1a2e;      /* Main background */
    --bg-secondary: #16213e;    /* Card backgrounds */
    --bg-card: #0f3460;         /* Nested elements */
    --text-primary: #eee;       /* Main text */
    --text-secondary: #aaa;     /* Muted text */
    --accent: #e94560;          /* Accent color (red) */
    --accent-hover: #ff6b6b;    /* Accent hover */
    --success: #00d9a5;         /* Green (high confidence) */
    --warning: #ffc107;         /* Yellow (medium confidence) */
    --border: #2a2a4a;          /* Border color */
}
```

### Light Theme Example

```css
:root {
    --bg-primary: #f5f5f5;
    --bg-secondary: #ffffff;
    --bg-card: #e8e8e8;
    --text-primary: #333;
    --text-secondary: #666;
    --accent: #2563eb;
    --accent-hover: #3b82f6;
    --success: #10b981;
    --warning: #f59e0b;
    --border: #d1d5db;
}
```

### Adding New Sections

1. Add the data extraction to the `generate_html_report()` function
2. Add an HTML section to the main template string
3. Style it with existing CSS classes or add new ones

### Adding New Table Columns

1. Modify the `generate_email_row()` function
2. Add a `<th>` to the table header
3. Add a `<td>` to the row template
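
The steps above can be sketched as follows for a new "Sender Name" column. This is a hypothetical stand-in: the real `generate_email_row()` may take different arguments and build its markup differently, and `TABLE_HEADERS` is an invented name used only for illustration.

```python
from html import escape

# Hypothetical header list; "Sender Name" is the newly added column.
TABLE_HEADERS = ["Date", "Subject", "Sender", "Sender Name", "Category", "Confidence", "Method"]


def generate_email_row(email: dict) -> str:
    """Illustrative row builder, not the project's actual implementation."""
    cells = [
        email.get("date") or "N/A",
        email.get("subject", ""),
        email.get("sender", ""),
        email.get("sender_name") or "",  # new column; the field is optional
        email.get("category", ""),
        f"{email.get('confidence', 0.0):.0%}",
        email.get("method", ""),
    ]
    # Escape every value so characters such as "<" cannot break the markup.
    return "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in cells) + "</tr>"


def generate_header_row() -> str:
    """Build the matching <th> row so header and data columns stay aligned."""
    return "<tr>" + "".join(f"<th>{escape(name)}</th>" for name in TABLE_HEADERS) + "</tr>"
```

Keeping the header list and the cell list in the same order avoids the most common mistake when adding a column: a `<th>` without a matching `<td>`.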

## Performance Notes

- Report is fully static (no server required)
- JavaScript is minimal (tab switching, search filtering)
- Handles 1000+ emails without performance issues
- For 10k+ emails, consider pagination (not yet implemented)

## Future Enhancements (TODO)

- [ ] Pagination for large datasets
- [ ] Export to PDF option
- [ ] Configurable color themes via CLI
- [ ] Column sorting (click headers)
- [ ] Date range filter
- [ ] Sender domain grouping
- [ ] Category confidence heatmap
- [ ] Email body preview on hover

## Troubleshooting

### "KeyError: 'subject'"
`results.json` lacks email metadata. Re-run classification with the latest `src/cli.py`.

### Empty tables
Check that `results.json` has a `classifications` array containing data.

### Dates showing "N/A"
Date parsing failed. Check that the dates in `results.json` are ISO 8601 strings.
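
To find the offending records, the dates can be checked one by one. The sketch below uses `datetime.fromisoformat` as the test, which is an assumption about how the generator parses dates; note that on Python versions before 3.11 this function rejects some valid ISO 8601 forms, such as a trailing `Z`. `unparseable_dates` is a hypothetical name.

```python
from datetime import datetime


def unparseable_dates(classifications: list) -> list:
    """Return (email_id, date) pairs whose date string fails to parse."""
    bad = []
    for record in classifications:
        value = record.get("date")
        if value is None:
            continue  # date is optional, so a missing value is not an error
        try:
            datetime.fromisoformat(value)
        except (TypeError, ValueError):
            bad.append((record.get("email_id"), value))
    return bad
```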

### Search not working
A JavaScript error is the likely cause. Check the browser console, and ensure the data contains no HTML entities.