Model Information

Current Status

  • Model Type: LightGBM Classifier (Production)
  • Location: src/models/pretrained/classifier.pkl
  • Categories: 12 (junk, transactional, auth, newsletters, social, automated, conversational, work, personal, finance, travel, unknown)
  • Feature Extraction: Hybrid (embeddings + patterns + structural features)
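
The "hybrid" feature extraction listed above combines three kinds of signal per email: a dense text embedding, pattern flags, and structural statistics. The sketch below is illustrative only; the real FeatureExtractor API, feature names, and rules are assumptions:

import numpy as np

def build_hybrid_features(embedding, subject, body, sender):
    """Illustrative only: concatenate embedding, pattern, and structural features."""
    # 1. Semantic signal: dense embedding of the email text
    semantic = np.asarray(embedding, dtype=np.float32)

    # 2. Pattern signal: simple keyword/sender heuristics as binary flags
    text = f"{subject} {body}".lower()
    patterns = np.array([
        "unsubscribe" in text,        # newsletter-ish
        "invoice" in text,            # transactional / finance
        "verification code" in text,  # auth
        "noreply" in sender.lower(),  # automated sender
    ], dtype=np.float32)

    # 3. Structural signal: coarse statistics about the message itself
    structural = np.array([
        len(subject),
        len(body),
        body.lower().count("http"),   # rough link count
    ], dtype=np.float32)

    return np.concatenate([semantic, patterns, structural])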

Usage

The ML classifier will automatically use the real model if it exists at:

src/models/pretrained/classifier.pkl
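
A quick way to confirm the file is in place before running anything:

from pathlib import Path

model_path = Path("src/models/pretrained/classifier.pkl")
if model_path.exists():
    print("Real model found")
else:
    print("No model file - the classifier falls back to its mock model")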

Programmatic Usage

from src.classification.ml_classifier import MLClassifier

# Will automatically load real model if available
classifier = MLClassifier()

# Check if using mock or real model
info = classifier.get_info()
print(f"Is mock: {info['is_mock']}")
print(f"Model type: {info['model_type']}")

# Make predictions (feature_vector is produced by the FeatureExtractor; see the sketch below)
result = classifier.predict(feature_vector)
print(f"Category: {result['category']}")
print(f"Confidence: {result['confidence']}")

Command Line Usage

# Test with mock pipeline
python -m src.cli run --source mock --output test_results/

# Test with real model (when available)
python -m src.cli run --source gmail --limit 100 --output results/

How to Get a Real Model

Option 1: Train on the Enron Dataset

from src.calibration.trainer import ModelTrainer
from src.calibration.enron_parser import EnronParser
from src.classification.feature_extractor import FeatureExtractor

# Parse Enron dataset
parser = EnronParser("enron_mail_20150507")
emails = parser.parse_emails(limit=5000)

# Pair each email with its category label; `categories` is assumed to hold
# one label per email, aligned with `emails`
extractor = FeatureExtractor()
labeled_data = list(zip(emails, categories))

# Train model
trainer = ModelTrainer(extractor, categories)
results = trainer.train(labeled_data)

# Save model
trainer.save_model("src/models/pretrained/classifier.pkl")
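
trainer.train() returns the training results (the exact keys depend on the ModelTrainer implementation). Once the pickle is written to src/models/pretrained/classifier.pkl, MLClassifier() loads it automatically on the next run, as described under Usage above.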

Option 2: Download Pre-trained Model

Use the provided script:

cd tools
python download_pretrained_model.py \
  --url https://example.com/model.pkl \
  --hash abc123def456
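
The --hash argument lets the script verify the download against an expected checksum. To check a downloaded file by hand, a minimal sketch (assuming a SHA-256 digest; the script's actual algorithm may differ):

import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "abc123def456"  # placeholder from the example above; use the real digest
actual = sha256_of("model.pkl")
print("OK" if actual == expected else f"Mismatch: {actual}")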

Option 3: Use Community Model

Check available pre-trained models at:

  • Email Sorter releases on GitHub
  • Hugging Face model hub (when available)
  • Community-trained models

Model Performance

Expected accuracy on real data:

  • Hard Rules: 94-96% (instant, ~10% of emails)
  • ML Model: 85-90% (fast, ~85% of emails)
  • LLM Review: 92-95% (slower, ~5% uncertain cases)
  • Overall: roughly 86-91% (stage accuracies weighted by each stage's share of emails; see the sketch below)
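
The overall figure is just the per-stage accuracies weighted by the share of emails each stage handles; a quick check of that arithmetic with the numbers above:

# (accuracy_low, accuracy_high), share of emails handled
stages = {
    "hard_rules": ((0.94, 0.96), 0.10),
    "ml_model":   ((0.85, 0.90), 0.85),
    "llm_review": ((0.92, 0.95), 0.05),
}

low = sum(acc_low * share for (acc_low, _), share in stages.values())
high = sum(acc_high * share for (_, acc_high), share in stages.values())
print(f"Overall: {low:.1%} - {high:.1%}")  # roughly 86% - 91%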

Retraining

To retrain the model:

python -m src.cli train \
  --source enron \
  --output models/new_model.pkl \
  --limit 10000
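
The retrained model is written to models/new_model.pkl; copy or rename it to src/models/pretrained/classifier.pkl so that MLClassifier picks it up automatically.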

Troubleshooting

Model Not Loading

  1. Check file exists: src/models/pretrained/classifier.pkl
  2. Try loading it directly:
    import pickle
    with open('src/models/pretrained/classifier.pkl', 'rb') as f:
        data = pickle.load(f)
    print(data.keys())
  3. Ensure the file is a valid pickle written by ModelTrainer.save_model (not truncated or corrupted)

Low Accuracy

  1. Model may be underfitted - train on more data
  2. Feature extraction may need tuning
  3. Categories may need adjustment
  4. Consider LLM review for uncertain cases

Slow Predictions

  1. Use an embedding cache for batch processing (see the sketch after this list)
  2. Implement parallel processing
  3. Consider quantization of the LightGBM model
  4. Profile feature extraction step
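
For item 1, a minimal embedding cache might look like the following sketch (embed_fn stands in for whatever embedding call the FeatureExtractor actually uses; it is an assumption, not a documented API):

import hashlib

class EmbeddingCache:
    """Cache embeddings keyed by a hash of the email text so batch runs skip recomputation."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # the real embedding function goes here
        self._cache = {}

    def get(self, text):
        key = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]

# Usage: cache = EmbeddingCache(embed_text); vector = cache.get(email_body)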