FSSCoding 8f25e30f52 Rewrite CLAUDE.md and clean project structure

- Rewrote CLAUDE.md with comprehensive development guide
- Archived 20 old docs to docs/archive/
- Added PROJECT_ROADMAP_2025.md with research learnings
- Added CLASSIFICATION_METHODS_COMPARISON.md
- Added SESSION_HANDOVER_20251128.md
- Added tools for analysis (brett_gmail/microsoft analyzers)
- Updated .gitignore for archive folders
- Config changes for local vLLM endpoint

2025-11-28 13:07:27 +11:00

5.7 KiB


Email Classification Report Format

This document explains the HTML report generation system, its data sources, and how to customize it.

Overview

The report generator creates a static HTML file from classification results. It requires an enriched results.json that includes email metadata (subject, sender, date, and so on), not just classification data.

Files Involved

File	Purpose
`tools/generate_html_report.py`	Main report generator script
`src/cli.py`	Classification CLI - outputs enriched `results.json`
`src/export/exporter.py`	Legacy exporter (JSON/CSV) - not used for HTML

Data Flow

Email Source (.eml/.msg files)
        ↓
   src/cli.py (classification)
        ↓
   results.json (enriched with metadata)
        ↓
   tools/generate_html_report.py
        ↓
   report.html (static, self-contained)

Usage

Generate Report

python tools/generate_html_report.py \
  --input /path/to/results.json \
  --output /path/to/report.html

If --output is omitted, the script writes report.html to the same directory as the input file.
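The default-output rule above can be sketched as follows. This is a minimal illustration only: `default_output_path` is a hypothetical helper name and may not match how tools/generate_html_report.py resolves the path internally.

```python
from pathlib import Path
from typing import Optional


def default_output_path(input_path: str, output: Optional[str] = None) -> Path:
    """Return the explicit output path, or report.html beside the input file.

    Hypothetical helper: mirrors the documented behaviour of --output.
    """
    if output:
        return Path(output)
    return Path(input_path).parent / "report.html"


# With no --output, the report lands next to results.json.
print(default_output_path("/path/to/output/results.json").as_posix())
```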

Full Workflow

# 1. Classify emails
python -m src.cli run \
  --source local \
  --directory "/path/to/emails" \
  --output "/path/to/output" \
  --no-llm-fallback

# 2. Generate report
python tools/generate_html_report.py \
  --input "/path/to/output/results.json"

results.json Format

The report generator expects this structure:

{
  "metadata": {
    "total_emails": 801,
    "accuracy_estimate": 0.55,
    "classification_stats": {
      "rule_matched": 9,
      "ml_classified": 468,
      "llm_classified": 0,
      "needs_review": 324
    },
    "generated_at": "2025-11-28T02:34:00.680196",
    "source": "local",
    "source_path": "/path/to/emails"
  },
  "classifications": [
    {
      "email_id": "unique_id.eml",
      "subject": "Email subject line",
      "sender": "sender@example.com",
      "sender_name": "Sender Name",
      "date": "2023-04-13T09:43:29+10:00",
      "has_attachments": false,
      "category": "Work",
      "confidence": 0.81,
      "method": "ml"
    }
  ]
}

Required Fields

Field	Type	Description
`email_id`	string	Unique identifier (usually filename)
`subject`	string	Email subject line
`sender`	string	Sender email address
`category`	string	Assigned category
`confidence`	float	Classification confidence (0-1)
`method`	string	Classification method: `ml`, `rule`, or `llm`

Optional Fields

Field	Type	Description
`sender_name`	string	Display name of sender
`date`	string	ISO 8601 date string
`has_attachments`	boolean	Whether email has attachments
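Before generating a report, it can help to check each entry against the field tables above. The sketch below is an assumption-laden example, not part of the project: `validate_record`, `REQUIRED_FIELDS`, and `VALID_METHODS` are hypothetical names, and the range and method checks simply restate the table descriptions.

```python
# Field names and allowed methods are taken from the tables above.
REQUIRED_FIELDS = ("email_id", "subject", "sender", "category", "confidence", "method")
VALID_METHODS = {"ml", "rule", "llm"}


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]

    confidence = record.get("confidence")
    if isinstance(confidence, (int, float)) and not 0 <= confidence <= 1:
        problems.append(f"confidence out of range: {confidence}")

    method = record.get("method")
    if method is not None and method not in VALID_METHODS:
        problems.append(f"unknown method: {method}")
    return problems


sample = {
    "email_id": "unique_id.eml",
    "subject": "Email subject line",
    "sender": "sender@example.com",
    "category": "Work",
    "confidence": 0.81,
    "method": "ml",
}
print(validate_record(sample))            # no problems for a complete record
print(validate_record({"email_id": "x"}))  # lists the missing required fields
```

Optional fields are deliberately not checked here, since the report is documented to work without them.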

Report Sections

1. Header

Report title
Generation timestamp
Source info
Total email count

2. Stats Grid

Total emails
Number of categories
High confidence count (>=70%)
Unique sender domains
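The four stats-grid values can be derived from the classifications array roughly as shown below. `compute_stats` is a hypothetical function name, and treating the sender domain as the lowercased text after the last "@" is an assumption about how domains are counted.

```python
def compute_stats(classifications: list) -> dict:
    """Compute the stats-grid values from a list of classification records."""
    categories = {c["category"] for c in classifications}
    # High confidence uses the documented >=70% threshold.
    high_confidence = sum(1 for c in classifications if c["confidence"] >= 0.70)
    # Assumption: domain = text after the last "@", compared case-insensitively.
    domains = {
        c["sender"].rsplit("@", 1)[-1].lower()
        for c in classifications
        if "@" in c["sender"]
    }
    return {
        "total_emails": len(classifications),
        "categories": len(categories),
        "high_confidence": high_confidence,
        "sender_domains": len(domains),
    }
```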

3. Category Distribution

Horizontal bar chart
Count and percentage per category
Sorted by count (descending)
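As an illustration of the data behind the bar chart, the following sketch counts emails per category and sorts them by count, descending. `category_distribution` is a hypothetical name; the real script may compute this differently.

```python
from collections import Counter


def category_distribution(classifications: list) -> list:
    """Return (category, count, percentage) tuples, largest category first."""
    counts = Counter(c["category"] for c in classifications)
    total = sum(counts.values())
    # most_common() already orders by count, descending.
    return [(category, n, 100.0 * n / total) for category, n in counts.most_common()]
```

An empty input yields an empty list, so no division by zero occurs.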

4. Classification Methods

Breakdown of ML vs Rule vs LLM
Shows which method handled what percentage

5. Confidence Distribution

High (>=70%): Green
Medium (>=50% and <70%): Yellow
Low (<50%): Red
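The thresholds above map to a simple bucketing function. This is a sketch with a hypothetical name (`confidence_bucket`); it assumes confidence is the 0-1 float from results.json and that exactly 70% counts as high and exactly 50% as medium.

```python
def confidence_bucket(confidence: float) -> tuple:
    """Map a 0-1 confidence score to a (label, color) pair."""
    if confidence >= 0.70:
        return ("high", "green")
    if confidence >= 0.50:
        return ("medium", "yellow")
    return ("low", "red")


print(confidence_bucket(0.81))  # ('high', 'green')
```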

6. Top Senders

Top 20 senders by email count
Grid layout
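A minimal sketch of the top-senders ranking, assuming senders are grouped by the exact `sender` string. `top_senders` is a hypothetical name.

```python
from collections import Counter


def top_senders(classifications: list, limit: int = 20) -> list:
    """Return up to `limit` (sender, email_count) pairs, busiest sender first."""
    counts = Counter(c["sender"] for c in classifications)
    return counts.most_common(limit)
```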

7. Email Tables (Tabbed)

"All" tab shows all emails
Category tabs filter by category
Search box filters by subject/sender
Columns: Date, Subject, Sender, Category, Confidence, Method
Sorted by date (newest first)
Attachment indicator (📎)
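Newest-first ordering needs a policy for records without a usable date, since `date` is optional. The sketch below places them last; that policy, the helper names, and the choice to treat timezone-naive dates as UTC are all assumptions rather than documented behaviour.

```python
from datetime import datetime, timezone


def _sort_key(record: dict) -> tuple:
    """Build a key so dated records outrank undated ones."""
    try:
        parsed = datetime.fromisoformat(record.get("date"))
    except (TypeError, ValueError):
        # Missing or unparseable date: sorts after every dated record.
        return (0, 0.0)
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return (1, parsed.timestamp())


def sort_newest_first(classifications: list) -> list:
    """Return records ordered by date, newest first, undated records last."""
    return sorted(classifications, key=_sort_key, reverse=True)
```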

Customization

Changing Colors

Edit the CSS variables in tools/generate_html_report.py:

:root {
    --bg-primary: #1a1a2e;      /* Main background */
    --bg-secondary: #16213e;    /* Card backgrounds */
    --bg-card: #0f3460;         /* Nested elements */
    --text-primary: #eee;       /* Main text */
    --text-secondary: #aaa;     /* Muted text */
    --accent: #e94560;          /* Accent color (red) */
    --accent-hover: #ff6b6b;    /* Accent hover */
    --success: #00d9a5;         /* Green (high confidence) */
    --warning: #ffc107;         /* Yellow (medium confidence) */
    --border: #2a2a4a;          /* Border color */
}

Light Theme Example

:root {
    --bg-primary: #f5f5f5;
    --bg-secondary: #ffffff;
    --bg-card: #e8e8e8;
    --text-primary: #333;
    --text-secondary: #666;
    --accent: #2563eb;
    --accent-hover: #3b82f6;
    --success: #10b981;
    --warning: #f59e0b;
    --border: #d1d5db;
}

Adding New Sections

Add the data extraction to the generate_html_report() function
Add the HTML section to the main template string
Style it with existing CSS classes, or add new ones

Adding New Table Columns

Modify the generate_email_row() function
Add a <th> to the table header
Add a matching <td> to the row template
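To show the shape of such a change, here is a stand-in for a row builder with one extra column (attachments) appended. It is not the actual generate_email_row() from tools/generate_html_report.py: the signature, cell order, and markup are assumptions. The example escapes every value with `html.escape` so that subjects containing markup cannot break the table.

```python
from html import escape


def generate_email_row(record: dict) -> str:
    """Build one table row; hypothetical stand-in for the real function."""
    cells = [
        record.get("date", "N/A"),
        record["subject"],
        record["sender"],
        record["category"],
        f"{record['confidence']:.0%}",
        record["method"],
        # New column: remember to add a matching <th> in the table header.
        "yes" if record.get("has_attachments") else "no",
    ]
    tds = "".join(f"<td>{escape(str(cell))}</td>" for cell in cells)
    return f"<tr>{tds}</tr>"
```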

Performance Notes

Report is fully static (no server required)
JavaScript is minimal (tab switching, search filtering)
Handles 1000+ emails without performance issues
For 10k+ emails, consider pagination (not yet implemented)

Future Enhancements (TODO)

Pagination for large datasets
Export to PDF option
Configurable color themes via CLI
Column sorting (click headers)
Date range filter
Sender domain grouping
Category confidence heatmap
Email body preview on hover

Troubleshooting

"KeyError: 'subject'"

results.json lacks email metadata. Re-run classification with the latest src/cli.py, which writes the enriched format.

Empty tables

Check that results.json contains a non-empty classifications array.

Dates showing "N/A"

Date parsing failed. Check that the date field in results.json is an ISO 8601 string.
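To find the offending entries, a quick diagnostic such as the one below can list records whose `date` does not parse. It assumes parsing with Python's `datetime.fromisoformat`, which may differ from what the report generator uses; `find_bad_dates` is a hypothetical name. Note that on Python versions before 3.11, `fromisoformat` rejects some valid ISO 8601 forms, including a trailing "Z".

```python
from datetime import datetime


def find_bad_dates(classifications: list) -> list:
    """Return email_ids whose date is present but not parseable as ISO 8601."""
    bad = []
    for record in classifications:
        raw = record.get("date")
        if raw is None:
            continue  # date is optional, so a missing value is not an error
        try:
            datetime.fromisoformat(raw)
        except (TypeError, ValueError):
            bad.append(record.get("email_id", "<unknown>"))
    return bad
```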

Search not working

This usually indicates a JavaScript error. Check the browser console, and ensure the data in results.json contains no HTML entities.
