email-sorter/docs/MODEL_INFO.md

# Model Information

## Current Status

- **Model Type**: LightGBM Classifier (Production)
- **Location**: `src/models/pretrained/classifier.pkl`
- **Categories**: 12 (junk, transactional, auth, newsletters, social, automated, conversational, work, personal, finance, travel, unknown)
- **Feature Extraction**: Hybrid (embeddings + patterns + structural features)
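
To make "hybrid" concrete, the sketch below concatenates three feature groups into one flat vector. It is an illustration only: the regex patterns, the structural features, and all function names are invented for this example and do not describe the project's actual `FeatureExtractor`. The embedding is passed in as a plain list so that no particular embedding library is assumed.

```python
import re
from typing import List, Sequence

# Hypothetical keyword patterns; the real extractor's patterns are not documented here.
PATTERNS = [
    re.compile(r"\bunsubscribe\b", re.IGNORECASE),
    re.compile(r"\b(verification|one-time) code\b", re.IGNORECASE),
    re.compile(r"\binvoice\b", re.IGNORECASE),
]


def pattern_features(text: str) -> List[float]:
    """Return 1.0 for each pattern that occurs in the text, else 0.0."""
    return [1.0 if p.search(text) else 0.0 for p in PATTERNS]


def structural_features(subject: str, body: str) -> List[float]:
    """Return simple size and layout signals (assumed, for illustration)."""
    return [float(len(subject)), float(len(body)), float(body.count("http"))]


def hybrid_vector(embedding: Sequence[float], subject: str, body: str) -> List[float]:
    """Concatenate embedding, pattern, and structural features."""
    text = f"{subject}\n{body}"
    return list(embedding) + pattern_features(text) + structural_features(subject, body)
```

Keeping the groups as separate functions makes it easier to profile or tune one group without touching the others.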

## Usage

The ML classifier automatically loads the real model if the file exists at the path below; otherwise it runs with a mock model:
```
src/models/pretrained/classifier.pkl
```
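
Before running the pipeline, it can help to confirm that the file is actually present. The following is a minimal standard-library sketch; the helper name `model_file_available` is hypothetical and not part of the project's API, and the check assumes the path is resolved relative to the current working directory (the repository root).

```python
from pathlib import Path

# Path taken from this document; resolved relative to the current working directory.
MODEL_PATH = Path("src/models/pretrained/classifier.pkl")


def model_file_available(path: Path = MODEL_PATH) -> bool:
    """Return True if a non-empty regular file exists at `path`."""
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    if model_file_available():
        print(f"Found model file at {MODEL_PATH}")
    else:
        print(f"No model file at {MODEL_PATH}")
```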

### Programmatic Usage

```python
from src.classification.ml_classifier import MLClassifier

# Will automatically load real model if available
classifier = MLClassifier()

# Check if using mock or real model
info = classifier.get_info()
print(f"Is mock: {info['is_mock']}")
print(f"Model type: {info['model_type']}")

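# Make predictions
# `feature_vector` is the feature representation of a single email,
# produced by the feature extraction step (not shown here)
result = classifier.predict(feature_vector)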
print(f"Category: {result['category']}")
print(f"Confidence: {result['confidence']}")
```

### Command Line Usage

```bash
# Test with mock pipeline
python -m src.cli run --source mock --output test_results/

# Test with real model (when available)
python -m src.cli run --source gmail --limit 100 --output results/
```

## How to Get a Real Model

### Option 1: Train Your Own (Recommended)
```python
from src.calibration.trainer import ModelTrainer
from src.calibration.enron_parser import EnronParser
from src.classification.feature_extractor import FeatureExtractor

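# Category names, as listed under "Current Status"
categories = [
    "junk", "transactional", "auth", "newsletters", "social", "automated",
    "conversational", "work", "personal", "finance", "travel", "unknown",
]

# Parse Enron dataset
parser = EnronParser("enron_mail_20150507")
emails = parser.parse_emails(limit=5000)

# One category label per parsed email, in the same order as `emails`.
# Assumption: you supply these yourself (e.g. from manual annotation);
# this document does not say where the labels come from.
labels = []  # fill in before training
assert len(labels) == len(emails), "need exactly one label per email"

# Pair each email with its label
extractor = FeatureExtractor()
labeled_data = list(zip(emails, labels))

# Train model
trainer = ModelTrainer(extractor, categories)
results = trainer.train(labeled_data)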

# Save model
trainer.save_model("src/models/pretrained/classifier.pkl")
```

### Option 2: Download Pre-trained Model

Use the provided script, replacing the placeholder URL and hash below with the real model URL and its expected hash:
```bash
cd tools
python download_pretrained_model.py \
  --url https://example.com/model.pkl \
  --hash abc123def456
```
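
The `--hash` argument suggests that the script checks the downloaded file against an expected digest. This document does not say which algorithm the script uses; the sketch below assumes SHA-256 purely for illustration, and `sha256_of_file` and `file_digest_matches` are hypothetical helpers, not functions from `download_pretrained_model.py`.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_digest_matches(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the expected one (case-insensitive)."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

Verifying the digest before loading matters here because the model is a pickle file, and unpickling a tampered file can execute arbitrary code.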

### Option 3: Use Community Model

Check available pre-trained models at:
- Email Sorter releases on GitHub
- Hugging Face model hub (when available)
- Community-trained models

## Model Performance

Expected accuracy on real data, by pipeline stage (relative speed and approximate share of emails handled in parentheses):
- **Hard Rules**: 94-96% (instant, ~10% of emails)
- **ML Model**: 85-90% (fast, ~85% of emails)
- **LLM Review**: 92-95% (slower, ~5% uncertain cases)
- **Overall**: 90-94% (weighted average)
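
As a sanity check on how the stage figures combine, a plain share-weighted average can be computed as below. This sketch assumes the three stages split the emails exactly as listed (10%, 85%, 5%) and that each stage's accuracy range applies uniformly to its share. Under those assumptions the result is roughly 86% to 91%, somewhat below the stated 90-94%, so the overall figure likely reflects effects a simple average does not capture (for example, the LLM stage taking the cases the ML model is least sure about).

```python
# (share of emails, low accuracy, high accuracy) per stage, from the list above
STAGES = {
    "hard_rules": (0.10, 0.94, 0.96),
    "ml_model": (0.85, 0.85, 0.90),
    "llm_review": (0.05, 0.92, 0.95),
}


def weighted_accuracy(stages):
    """Return the (low, high) share-weighted accuracy across all stages."""
    total_share = sum(share for share, _, _ in stages.values())
    low = sum(share * lo for share, lo, _ in stages.values()) / total_share
    high = sum(share * hi for share, _, hi in stages.values()) / total_share
    return low, high


low, high = weighted_accuracy(STAGES)
print(f"Weighted accuracy: {low:.1%} to {high:.1%}")
```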

## Retraining

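To retrain the model, run the command below. It writes to `models/new_model.pkl`, which is not the path the classifier loads automatically, so copy the file to `src/models/pretrained/classifier.pkl` once you are satisfied with it: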

```bash
python -m src.cli train \
  --source enron \
  --output models/new_model.pkl \
  --limit 10000
```

## Troubleshooting

### Model Not Loading
1. Check that the file exists: `src/models/pretrained/classifier.pkl`
2. Try loading it directly:
   ```python
   import pickle
   with open('src/models/pretrained/classifier.pkl', 'rb') as f:
       data = pickle.load(f)
   print(data.keys())
   ```
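3. Ensure the file is a valid pickle: it should load without errors in step 2 and, since that snippet calls `data.keys()`, deserialize to a dict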
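
The three steps can be folded into one diagnostic function. This is a sketch, not project code: `diagnose_model_file` is a hypothetical name, and the expectation that the pickle holds a dict is inferred only from the `data.keys()` call above. Only unpickle files from a source you trust, since `pickle.load` can execute arbitrary code.

```python
import pickle
from pathlib import Path


def diagnose_model_file(path: str = "src/models/pretrained/classifier.pkl") -> str:
    """Return a short, human-readable status for the model file."""
    p = Path(path)
    if not p.is_file():
        return f"missing: {p}"
    try:
        with p.open("rb") as f:
            data = pickle.load(f)
    except Exception as exc:
        # Unpickling can fail in many ways (truncated file, corrupt data,
        # missing module or class), so report whatever was raised.
        return f"unreadable: {type(exc).__name__}: {exc}"
    if not isinstance(data, dict):
        return f"unexpected type: {type(data).__name__}"
    return f"ok: keys={sorted(map(str, data.keys()))}"


if __name__ == "__main__":
    print(diagnose_model_file())
```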

### Low Accuracy
1. The model may be underfit; train on more data
2. Feature extraction may need tuning
3. Categories may need adjustment
4. Consider LLM review for uncertain cases
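
Item 4 can be sketched as a confidence gate in front of the LLM step. The `category` and `confidence` keys match the prediction result shown under Programmatic Usage. The 0.6 threshold, the assumption that confidence is a float between 0 and 1, and the `needs_llm_review` name are all illustrative and not taken from the project.

```python
REVIEW_THRESHOLD = 0.6  # assumed value; tune against labeled data


def needs_llm_review(result: dict, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Flag predictions that are low-confidence or fall in the 'unknown' category."""
    return result["confidence"] < threshold or result["category"] == "unknown"


predictions = [
    {"category": "work", "confidence": 0.93},
    {"category": "finance", "confidence": 0.41},
    {"category": "unknown", "confidence": 0.88},
]
to_review = [p for p in predictions if needs_llm_review(p)]
print(f"{len(to_review)} of {len(predictions)} predictions flagged for LLM review")
```

Raising the threshold sends more emails to the slower LLM stage, so it trades throughput for accuracy.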

### Slow Predictions
1. Use embedding cache for batch processing
2. Implement parallel processing
3. Consider quantization for the LightGBM model
4. Profile feature extraction step
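
The first item can be sketched as a small in-memory cache keyed by a hash of the text, so that duplicate emails in a batch are embedded only once. `embed_fn` stands in for whatever embedding call the feature extractor makes; it and the `EmbeddingCache` class are hypothetical and not part of the project.

```python
import hashlib
from typing import Callable, Dict, List, Sequence


class EmbeddingCache:
    """Memoize embeddings by the SHA-256 digest of the input text."""

    def __init__(self, embed_fn: Callable[[str], Sequence[float]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, Sequence[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> Sequence[float]:
        """Return the cached embedding, computing it on first use."""
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]

    def get_batch(self, texts: List[str]) -> List[Sequence[float]]:
        """Embed a batch, reusing cached results for repeated texts."""
        return [self.get(t) for t in texts]
```

The `hits` and `misses` counters give a quick read on whether the cache is paying off for a given mailbox before investing in the other options.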