Model Information

Current Status

  • Model Type: LightGBM Classifier (Production)
  • Location: src/models/pretrained/classifier.pkl
  • Categories: 12 (junk, transactional, auth, newsletters, social, automated, conversational, work, personal, finance, travel, unknown)
  • Feature Extraction: Hybrid (embeddings + patterns + structural features)
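
The "hybrid" feature extraction listed above combines three kinds of signal per email: a dense text embedding, pattern flags, and structural statistics. The sketch below is illustrative only; the real FeatureExtractor API, feature names, and rules are assumptions:

import numpy as np

def build_hybrid_features(embedding, subject, body, sender):
    """Illustrative only: concatenate embedding, pattern, and structural features."""
    # 1. Semantic signal: dense embedding of the email text
    semantic = np.asarray(embedding, dtype=np.float32)

    # 2. Pattern signal: simple keyword/sender heuristics as binary flags
    text = f"{subject} {body}".lower()
    patterns = np.array([
        "unsubscribe" in text,        # newsletter-ish
        "invoice" in text,            # transactional / finance
        "verification code" in text,  # auth
        "noreply" in sender.lower(),  # automated sender
    ], dtype=np.float32)

    # 3. Structural signal: coarse statistics about the message itself
    structural = np.array([
        len(subject),
        len(body),
        body.lower().count("http"),   # rough link count
    ], dtype=np.float32)

    return np.concatenate([semantic, patterns, structural])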

Usage

The ML classifier will automatically use the real model if it exists at:

src/models/pretrained/classifier.pkl
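
A quick way to confirm the file is in place before running anything:

from pathlib import Path

model_path = Path("src/models/pretrained/classifier.pkl")
if model_path.exists():
    print("Real model found")
else:
    print("No model file - the classifier falls back to its mock model")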

Programmatic Usage

from src.classification.ml_classifier import MLClassifier

# Will automatically load real model if available
classifier = MLClassifier()

# Check if using mock or real model
info = classifier.get_info()
print(f"Is mock: {info['is_mock']}")
print(f"Model type: {info['model_type']}")

# Make predictions (feature_vector is produced by the FeatureExtractor; see the sketch below)
result = classifier.predict(feature_vector)
print(f"Category: {result['category']}")
print(f"Confidence: {result['confidence']}")

Command Line Usage

# Test with mock pipeline
python -m src.cli run --source mock --output test_results/

# Test with real model (when available)
python -m src.cli run --source gmail --limit 100 --output results/

How to Get a Real Model

Option 1: Train on the Enron Dataset

from src.calibration.trainer import ModelTrainer
from src.calibration.enron_parser import EnronParser
from src.classification.feature_extractor import FeatureExtractor

# Parse Enron dataset
parser = EnronParser("enron_mail_20150507")
emails = parser.parse_emails(limit=5000)

# Pair each email with its category label; `categories` is assumed to hold
# one label per email, aligned with `emails`
extractor = FeatureExtractor()
labeled_data = list(zip(emails, categories))

# Train model
trainer = ModelTrainer(extractor, categories)
results = trainer.train(labeled_data)

# Save model
trainer.save_model("src/models/pretrained/classifier.pkl")
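
trainer.train() returns the training results (the exact keys depend on the ModelTrainer implementation). Once the pickle is written to src/models/pretrained/classifier.pkl, MLClassifier() loads it automatically on the next run, as described under Usage above.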

Option 2: Download Pre-trained Model

Use the provided script:

cd tools
python download_pretrained_model.py \
  --url https://example.com/model.pkl \
  --hash abc123def456
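
The --hash argument lets the script verify the download against an expected checksum. To check a downloaded file by hand, a minimal sketch (assuming a SHA-256 digest; the script's actual algorithm may differ):

import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "abc123def456"  # placeholder from the example above; use the real digest
actual = sha256_of("model.pkl")
print("OK" if actual == expected else f"Mismatch: {actual}")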

Option 3: Use Community Model

Check available pre-trained models at:

  • Email Sorter releases on GitHub
  • Hugging Face model hub (when available)
  • Community-trained models

Model Performance

Expected accuracy on real data:

  • Hard Rules: 94-96% (instant, ~10% of emails)
  • ML Model: 85-90% (fast, ~85% of emails)
  • LLM Review: 92-95% (slower, ~5% uncertain cases)
  • Overall: roughly 86-91% (stage accuracies weighted by each stage's share of emails; see the sketch below)
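
The overall figure is just the per-stage accuracies weighted by the share of emails each stage handles; a quick check of that arithmetic with the numbers above:

# (accuracy_low, accuracy_high), share of emails handled
stages = {
    "hard_rules": ((0.94, 0.96), 0.10),
    "ml_model":   ((0.85, 0.90), 0.85),
    "llm_review": ((0.92, 0.95), 0.05),
}

low = sum(acc_low * share for (acc_low, _), share in stages.values())
high = sum(acc_high * share for (_, acc_high), share in stages.values())
print(f"Overall: {low:.1%} - {high:.1%}")  # roughly 86% - 91%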

Retraining

To retrain the model:

python -m src.cli train \
  --source enron \
  --output models/new_model.pkl \
  --limit 10000
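
The retrained model is written to models/new_model.pkl; copy or rename it to src/models/pretrained/classifier.pkl so that MLClassifier picks it up automatically.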

Troubleshooting

Model Not Loading

  1. Check file exists: src/models/pretrained/classifier.pkl
  2. Try loading it directly:
    import pickle
    with open('src/models/pretrained/classifier.pkl', 'rb') as f:
        data = pickle.load(f)
    print(data.keys())
  3. Ensure the file is a valid pickle written by ModelTrainer.save_model (not truncated or corrupted)

Low Accuracy

  1. Model may be underfitted - train on more data
  2. Feature extraction may need tuning
  3. Categories may need adjustment
  4. Consider LLM review for uncertain cases

Slow Predictions

  1. Use an embedding cache for batch processing (see the sketch after this list)
  2. Implement parallel processing
  3. Consider quantization of the LightGBM model
  4. Profile feature extraction step
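
For item 1, a minimal embedding cache might look like the following sketch (embed_fn stands in for whatever embedding call the FeatureExtractor actually uses; it is an assumption, not a documented API):

import hashlib

class EmbeddingCache:
    """Cache embeddings keyed by a hash of the email text so batch runs skip recomputation."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # the real embedding function goes here
        self._cache = {}

    def get(self, text):
        key = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]

# Usage: cache = EmbeddingCache(embed_text); vector = cache.get(email_body)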