Email Classification Report Format

This document explains the HTML report generation system, its data sources, and how to customize it.

Overview

The report generator creates a static HTML file from classification results. It requires an enriched results.json that carries email metadata (subject, sender, date, etc.) alongside the classification data; classification data alone is not enough.

Files Involved

| File | Purpose |
|------|---------|
| tools/generate_html_report.py | Main report generator script |
| src/cli.py | Classification CLI - outputs enriched results.json |
| src/export/exporter.py | Legacy exporter (JSON/CSV) - not used for HTML |

Data Flow

Email Source (.eml/.msg files)
        ↓
   src/cli.py (classification)
        ↓
   results.json (enriched with metadata)
        ↓
   tools/generate_html_report.py
        ↓
   report.html (static, self-contained)

Usage

Generate Report

python tools/generate_html_report.py \
  --input /path/to/results.json \
  --output /path/to/report.html

If --output is omitted, the report is written to report.html in the same directory as the input file.
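
The default output location can be sketched as follows. This is a minimal illustration, not the script's actual code; the helper name `default_output_path` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def default_output_path(input_path: str, output_path: Optional[str] = None) -> Path:
    """Return the explicit output path, or report.html next to the input file.

    Hypothetical helper: mirrors the documented --output default only.
    """
    if output_path:
        return Path(output_path)
    return Path(input_path).parent / "report.html"
```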

Full Workflow

# 1. Classify emails
python -m src.cli run \
  --source local \
  --directory "/path/to/emails" \
  --output "/path/to/output" \
  --no-llm-fallback

# 2. Generate report
python tools/generate_html_report.py \
  --input "/path/to/output/results.json"

results.json Format

The report generator expects this structure:

{
  "metadata": {
    "total_emails": 801,
    "accuracy_estimate": 0.55,
    "classification_stats": {
      "rule_matched": 9,
      "ml_classified": 468,
      "llm_classified": 0,
      "needs_review": 324
    },
    "generated_at": "2025-11-28T02:34:00.680196",
    "source": "local",
    "source_path": "/path/to/emails"
  },
  "classifications": [
    {
      "email_id": "unique_id.eml",
      "subject": "Email subject line",
      "sender": "sender@example.com",
      "sender_name": "Sender Name",
      "date": "2023-04-13T09:43:29+10:00",
      "has_attachments": false,
      "category": "Work",
      "confidence": 0.81,
      "method": "ml"
    }
  ]
}

Required Fields

| Field | Type | Description |
|-------|------|-------------|
| email_id | string | Unique identifier (usually filename) |
| subject | string | Email subject line |
| sender | string | Sender email address |
| category | string | Assigned category |
| confidence | float | Classification confidence (0-1) |
| method | string | Classification method: ml, rule, or llm |

Optional Fields

| Field | Type | Description |
|-------|------|-------------|
| sender_name | string | Display name of sender |
| date | string | ISO 8601 date string |
| has_attachments | boolean | Whether email has attachments |
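
Before generating a report, it can help to check that every entry carries the required fields listed above. The following is a minimal sketch; the function name `find_missing_fields` and the `REQUIRED_FIELDS` constant are hypothetical and not part of the project.

```python
from typing import Any

# Required per-email fields, taken from the "Required Fields" table.
REQUIRED_FIELDS = ("email_id", "subject", "sender", "category", "confidence", "method")


def find_missing_fields(results: dict[str, Any]) -> list[tuple[int, list[str]]]:
    """Return (index, missing_field_names) for every incomplete entry."""
    problems = []
    for index, entry in enumerate(results.get("classifications", [])):
        missing = [name for name in REQUIRED_FIELDS if name not in entry]
        if missing:
            problems.append((index, missing))
    return problems
```

An empty return value means every entry has the required keys; it does not check the value types.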

Report Sections

1. Header

  • Report title
  • Generation timestamp
  • Source info
  • Total email count

2. Stats Grid

  • Total emails
  • Number of categories
  • High confidence count (>=70%)
  • Unique sender domains
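
The four stats can be derived from the classifications array roughly as shown here. This is a sketch under assumptions: the function name `stats_grid` is hypothetical, and a sender's domain is taken to be the lowercased text after the last "@", which may differ from what the generator does.

```python
def stats_grid(classifications, high_threshold=0.70):
    """Compute the four stats-grid values from a list of classification dicts."""
    categories = {c["category"] for c in classifications}
    # "High confidence" uses the documented >=70% cutoff.
    high = sum(1 for c in classifications if c["confidence"] >= high_threshold)
    # Assumption: domain = lowercased text after the last "@" in the sender address.
    domains = {
        c["sender"].rsplit("@", 1)[-1].lower()
        for c in classifications
        if "@" in c["sender"]
    }
    return {
        "total_emails": len(classifications),
        "categories": len(categories),
        "high_confidence": high,
        "sender_domains": len(domains),
    }
```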

3. Category Distribution

  • Horizontal bar chart
  • Count and percentage per category
  • Sorted by count (descending)
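
The count, percentage, and descending sort described above can be sketched with `collections.Counter`. The function name `category_distribution` is hypothetical; the generator's own implementation may differ.

```python
from collections import Counter


def category_distribution(classifications):
    """Return (category, count, percent) tuples, largest category first."""
    counts = Counter(c["category"] for c in classifications)
    total = sum(counts.values())
    # most_common() already sorts by count, descending.
    return [(cat, n, 100.0 * n / total) for cat, n in counts.most_common()]
```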

4. Classification Methods

  • Breakdown of ML vs Rule vs LLM
  • Shows which method handled what percentage

5. Confidence Distribution

  • High (>=70%): Green
  • Medium (50% to <70%): Yellow
  • Low (<50%): Red
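
The three bands map onto a simple threshold check. This is a sketch, assuming that exactly 0.70 counts as high (consistent with ">=70%") and exactly 0.50 counts as medium; the function name `confidence_bucket` is hypothetical.

```python
def confidence_bucket(confidence: float) -> str:
    """Map a 0-1 confidence score to the report's high/medium/low band."""
    if confidence >= 0.70:
        return "high"    # green
    if confidence >= 0.50:
        return "medium"  # yellow
    return "low"         # red
```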

6. Top Senders

  • Top 20 senders by email count
  • Grid layout
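
A top-senders list of this kind can be computed as follows. The sketch assumes senders are grouped by the exact sender address (not by display name or domain); the function name `top_senders` is hypothetical.

```python
from collections import Counter


def top_senders(classifications, limit=20):
    """Return up to `limit` (sender, email_count) pairs, most frequent first."""
    # Assumption: grouping key is the raw "sender" address string.
    return Counter(c["sender"] for c in classifications).most_common(limit)
```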

7. Email Tables (Tabbed)

  • "All" tab shows all emails
  • Category tabs filter by category
  • Search box filters by subject/sender
  • Columns: Date, Subject, Sender, Category, Confidence, Method
  • Sorted by date (newest first)
  • Attachment indicator (📎)
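
Because date is an optional field, a newest-first sort has to decide where undated emails go. The sketch below places them last and treats timestamps without a UTC offset as UTC; both choices are assumptions, and the function name `sort_newest_first` is hypothetical.

```python
from datetime import datetime, timezone


def sort_newest_first(classifications):
    """Sort entries by ISO 8601 date, newest first; undated entries go last."""

    def key(entry):
        try:
            parsed = datetime.fromisoformat(entry.get("date"))
        except (TypeError, ValueError):
            # Missing or unparseable date: rank below every dated entry.
            return (0, datetime.min.replace(tzinfo=timezone.utc))
        if parsed.tzinfo is None:
            # Assumption: naive timestamps are treated as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return (1, parsed)

    return sorted(classifications, key=key, reverse=True)
```

Note that `datetime.fromisoformat` only accepts a trailing "Z" from Python 3.11 onward; explicit offsets such as +10:00 work on earlier versions.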

Customization

Changing Colors

Edit the CSS variables in generate_html_report.py:

:root {
    --bg-primary: #1a1a2e;      /* Main background */
    --bg-secondary: #16213e;    /* Card backgrounds */
    --bg-card: #0f3460;         /* Nested elements */
    --text-primary: #eee;       /* Main text */
    --text-secondary: #aaa;     /* Muted text */
    --accent: #e94560;          /* Accent color (red) */
    --accent-hover: #ff6b6b;    /* Accent hover */
    --success: #00d9a5;         /* Green (high confidence) */
    --warning: #ffc107;         /* Yellow (medium confidence) */
    --border: #2a2a4a;          /* Border color */
}

Light Theme Example

:root {
    --bg-primary: #f5f5f5;
    --bg-secondary: #ffffff;
    --bg-card: #e8e8e8;
    --text-primary: #333;
    --text-secondary: #666;
    --accent: #2563eb;
    --accent-hover: #3b82f6;
    --success: #10b981;
    --warning: #f59e0b;
    --border: #d1d5db;
}

Adding New Sections

  1. Add data extraction in the generate_html_report() function
  2. Add HTML section in the main template string
  3. Style with existing CSS classes or add new ones

Adding New Table Columns

  1. Modify the generate_email_row() function
  2. Add <th> in table header
  3. Add <td> in row template

Performance Notes

  • Report is fully static (no server required)
  • JavaScript is minimal (tab switching, search filtering)
  • Handles 1000+ emails without performance issues
  • For 10k+ emails, consider pagination (not yet implemented)

Future Enhancements (TODO)

  • Pagination for large datasets
  • Export to PDF option
  • Configurable color themes via CLI
  • Column sorting (click headers)
  • Date range filter
  • Sender domain grouping
  • Category confidence heatmap
  • Email body preview on hover

Troubleshooting

"KeyError: 'subject'"

results.json lacks email metadata. Re-run classification with the latest src/cli.py.

Empty tables

Check that results.json contains a non-empty classifications array.

Dates showing "N/A"

Date parsing failed. Check that the date values in results.json are ISO 8601 strings.
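
To locate the offending entries, the date values can be checked with Python's standard ISO parser. This is a diagnostic sketch only: the function name `find_unparseable_dates` is hypothetical, and the generator's own parser may accept more or fewer formats than `datetime.fromisoformat`.

```python
from datetime import datetime


def find_unparseable_dates(classifications):
    """Return (email_id, raw_date) for entries whose date fails ISO parsing."""
    bad = []
    for entry in classifications:
        raw = entry.get("date")
        if raw is None:
            continue  # date is optional, so a missing value is not an error
        try:
            datetime.fromisoformat(raw)
        except (TypeError, ValueError):
            bad.append((entry.get("email_id"), raw))
    return bad
```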

Search not working

A JavaScript error is the likely cause. Check the browser console, and make sure the data contains no raw HTML markup or entities.
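
One common way to keep email text from breaking the page is to escape it before embedding it in the HTML. Whether generate_html_report.py already does this is not stated here, so treat the following as a sketch; the helper name `safe_cell` is hypothetical.

```python
import html


def safe_cell(value) -> str:
    """Escape a value so it can be placed inside an HTML table cell or attribute."""
    # html.escape replaces &, <, > and (with quote=True) both quote characters.
    return html.escape(str(value), quote=True)
```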