# Model Information

## Current Status

- **Model Type**: LightGBM Classifier (Production)
- **Location**: `src/models/pretrained/classifier.pkl`
- **Categories**: 12 (junk, transactional, auth, newsletters, social, automated, conversational, work, personal, finance, travel, unknown)
- **Feature Extraction**: Hybrid (embeddings + patterns + structural features)

## Usage

The ML classifier automatically uses the real model if it exists at:

```
src/models/pretrained/classifier.pkl
```

### Programmatic Usage

```python
from src.classification.ml_classifier import MLClassifier

# Will automatically load real model if available
classifier = MLClassifier()

# Check if using mock or real model
info = classifier.get_info()
print(f"Is mock: {info['is_mock']}")
print(f"Model type: {info['model_type']}")

# Make predictions
result = classifier.predict(feature_vector)
print(f"Category: {result['category']}")
print(f"Confidence: {result['confidence']}")
```

### Command Line Usage

```bash
# Test with mock pipeline
python -m src.cli run --source mock --output test_results/

# Test with real model (when available)
python -m src.cli run --source gmail --limit 100 --output results/
```

## How to Get a Real Model

### Option 1: Train Your Own (Recommended)

```python
from src.calibration.trainer import ModelTrainer
from src.calibration.enron_parser import EnronParser
from src.classification.feature_extractor import FeatureExtractor

# Parse Enron dataset
parser = EnronParser("enron_mail_20150507")
emails = parser.parse_emails(limit=5000)

# Create the feature extractor and pair each email with its label
# NOTE: `categories` is not defined in this snippet; supply it before running.
extractor = FeatureExtractor()
labeled_data = [(email, category) for email, category in zip(emails, categories)]

# Train model
trainer = ModelTrainer(extractor, categories)
results = trainer.train(labeled_data)

# Save model
trainer.save_model("src/models/pretrained/classifier.pkl")
```

### Option 2: Download Pre-trained Model

Use the provided script:

```bash
cd tools
python download_pretrained_model.py \
  --url https://example.com/model.pkl \
  --hash abc123def456
```

### Option 3: Use Community Model

Check available pre-trained models at:

- Email Sorter releases on GitHub
- Hugging Face model hub (when available)
- Community-trained models

## Model Performance

Expected accuracy on real data:

- **Hard Rules**: 94-96% (instant, ~10% of emails)
- **ML Model**: 85-90% (fast, ~85% of emails)
- **LLM Review**: 92-95% (slower, ~5% uncertain cases)
- **Overall**: 90-94% (weighted average)

## Retraining

To retrain the model:

```bash
python -m src.cli train \
  --source enron \
  --output models/new_model.pkl \
  --limit 10000
```

## Troubleshooting

### Model Not Loading

1. Check that the file exists: `src/models/pretrained/classifier.pkl`
2. Try to load it directly:

   ```python
   import pickle

   with open('src/models/pretrained/classifier.pkl', 'rb') as f:
       data = pickle.load(f)

   # Assumes the pickle holds a dict-like object
   print(data.keys())
   ```

3. Ensure the pickle format is correct

### Low Accuracy

1. The model may be underfit; train on more data
2. Feature extraction may need tuning
3. Categories may need adjustment
4. Consider LLM review for uncertain cases

### Slow Predictions

1. Use an embedding cache for batch processing
2. Implement parallel processing
3. Consider quantization for the LightGBM model
4. Profile the feature extraction step
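
The first suggestion under Slow Predictions, an embedding cache for batch processing, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `EmbeddingCache` and `fake_embed` are hypothetical names, and `fake_embed` stands in for whatever embedding call the feature extractor actually makes.

```python
import hashlib
from typing import Callable, Dict, List, Sequence


class EmbeddingCache:
    """In-memory cache keyed by a hash of the input text (hypothetical helper)."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        # Hash the text so long email bodies are not stored as dictionary keys
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key in self._store:
            self.hits += 1
        else:
            # Compute the embedding only the first time this text is seen
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]

    def get_batch(self, texts: Sequence[str]) -> List[List[float]]:
        return [self.get(text) for text in texts]


def fake_embed(text: str) -> List[float]:
    # Stand-in for a real (expensive) embedding model; an assumption for this demo
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
vectors = cache.get_batch(["hello world", "invoice attached", "hello world"])
print(cache.hits, cache.misses)  # prints "1 2": the repeated text is served from cache
```

The cache only helps when the same text is embedded more than once, for example repeated newsletters or re-runs over the same mailbox. Persisting `_store` to disk between runs would be a natural extension but is left out of this sketch.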