Model Information
Current Status
- Model Type: LightGBM Classifier (Production)
- Location: `src/models/pretrained/classifier.pkl`
- Categories: 12 (junk, transactional, auth, newsletters, social, automated, conversational, work, personal, finance, travel, unknown)
- Feature Extraction: Hybrid (embeddings + patterns + structural features)
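Conceptually, the hybrid feature vector is just these three signal groups concatenated together. The sketch below illustrates the idea with toy stand-ins; the function names and individual features are illustrative assumptions, not the actual `FeatureExtractor` API.

```python
import numpy as np

# Illustrative sketch only: the real FeatureExtractor may differ.
def embedding_features(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for the learned text embedding used by the real extractor."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)

def pattern_features(email: dict) -> np.ndarray:
    """Hand-crafted pattern signals, e.g. unsubscribe links or verification codes."""
    body = email.get("body", "").lower()
    return np.array([
        float("unsubscribe" in body),
        float("verification code" in body),
        float("invoice" in body),
    ])

def structural_features(email: dict) -> np.ndarray:
    """Structural signals such as body length and recipient count."""
    return np.array([
        float(len(email.get("body", ""))),
        float(len(email.get("to", []))),
    ])

def build_feature_vector(email: dict) -> np.ndarray:
    """Concatenate the three signal groups into one hybrid feature vector."""
    text = email.get("subject", "") + " " + email.get("body", "")
    return np.concatenate([
        embedding_features(text),
        pattern_features(email),
        structural_features(email),
    ])
```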
Usage
The ML classifier will automatically use the real model if it exists at `src/models/pretrained/classifier.pkl`.
Programmatic Usage
```python
from src.classification.ml_classifier import MLClassifier

# Will automatically load real model if available
classifier = MLClassifier()

# Check if using mock or real model
info = classifier.get_info()
print(f"Is mock: {info['is_mock']}")
print(f"Model type: {info['model_type']}")

# Make predictions
result = classifier.predict(feature_vector)
print(f"Category: {result['category']}")
print(f"Confidence: {result['confidence']}")
```
Command Line Usage
```bash
# Test with mock pipeline
python -m src.cli run --source mock --output test_results/

# Test with real model (when available)
python -m src.cli run --source gmail --limit 100 --output results/
```
How to Get a Real Model
Option 1: Train Your Own (Recommended)
```python
from src.calibration.trainer import ModelTrainer
from src.calibration.enron_parser import EnronParser
from src.classification.feature_extractor import FeatureExtractor

# Parse Enron dataset
parser = EnronParser("enron_mail_20150507")
emails = parser.parse_emails(limit=5000)

# Supply one category label per parsed email
# (e.g. from LLM-driven category discovery or manual labeling)
categories = [...]  # must be the same length as `emails`

# Build labeled training data
extractor = FeatureExtractor()
labeled_data = [(email, category) for email, category in zip(emails, categories)]

# Train model
trainer = ModelTrainer(extractor, categories)
results = trainer.train(labeled_data)

# Save model
trainer.save_model("src/models/pretrained/classifier.pkl")
```
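After saving, a quick sanity check with the `MLClassifier` API shown earlier confirms the real model is being picked up instead of the mock:

```python
from src.classification.ml_classifier import MLClassifier

classifier = MLClassifier()
info = classifier.get_info()
assert not info["is_mock"], "Classifier is still using the mock model"
print(f"Loaded model type: {info['model_type']}")
```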
Option 2: Download Pre-trained Model
Use the provided script:
```bash
cd tools
python download_pretrained_model.py \
    --url https://example.com/model.pkl \
    --hash abc123def456
```
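If you obtain a model file some other way, you can verify its integrity manually. A minimal sketch, assuming the published hash is a SHA-256 digest (the algorithm used by the download script is not specified here):

```python
import hashlib

def file_sha256(path: str) -> str:
    """Compute the SHA-256 digest of a file in streaming fashion."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "abc123def456"  # placeholder value from the example above
actual = file_sha256("src/models/pretrained/classifier.pkl")
print("Hash OK" if actual == expected else f"Hash mismatch: {actual}")
```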
Option 3: Use Community Model
Check available pre-trained models at:
- Email Sorter releases on GitHub
- Hugging Face model hub (when available)
- Community-trained models
Model Performance
Expected accuracy on real data:
- Hard Rules: 94-96% (instant, ~10% of emails)
- ML Model: 85-90% (fast, ~85% of emails)
- LLM Review: 92-95% (slower, ~5% uncertain cases)
- Overall: 90-94% (weighted average)
Retraining
To retrain the model:
```bash
python -m src.cli train \
    --source enron \
    --output models/new_model.pkl \
    --limit 10000
```
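The classifier only loads from the path listed under Current Status, so to make a freshly trained model the default, copy it into place (assuming the paths used throughout this document):

```bash
cp models/new_model.pkl src/models/pretrained/classifier.pkl
```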
Troubleshooting
Model Not Loading
- Check the file exists: `src/models/pretrained/classifier.pkl`
- Try to load it directly:

  ```python
  import pickle

  with open('src/models/pretrained/classifier.pkl', 'rb') as f:
      data = pickle.load(f)
  print(data.keys())
  ```

- Ensure the pickle format is correct
Low Accuracy
- Model may be underfitted - train on more data
- Feature extraction may need tuning
- Categories may need adjustment
- Consider LLM review for uncertain cases
Slow Predictions
- Use embedding cache for batch processing
- Implement parallel processing
- Consider quantization for LightGBM model
- Profile feature extraction step
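For the first point, a simple approach is to memoize embeddings keyed by a hash of the email text so batch runs never re-embed the same message. A minimal sketch; `embed_text` is a stand-in for whatever embedding backend the pipeline actually uses:

```python
import hashlib

def embed_text(text: str) -> list[float]:
    """Stand-in for the real embedding backend used during feature extraction."""
    return [float(len(text))]

_embedding_cache: dict[str, list[float]] = {}

def cached_embedding(text: str) -> list[float]:
    """Memoize embeddings by content hash so batch runs never re-embed the same message."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_text(text)
    return _embedding_cache[key]
```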