email-sorter/docs/MODEL_INFO.md
FSSCoding 53174a34eb Organize project structure and add MVP features
Project Reorganization:
- Created docs/ directory and moved all documentation
- Created scripts/ directory for shell scripts
- Created scripts/experimental/ for research scripts
- Updated .gitignore for new structure
- Updated README.md with MVP status and new structure

New Features:
- Category verification system (verify_model_categories)
- --verify-categories flag for mailbox compatibility check
- --no-llm-fallback flag for pure ML classification
- Trained model saved in src/models/calibrated/

Threshold Optimization:
- Reduced default threshold from 0.75 to 0.55
- Updated all category thresholds to 0.55
- Reduces LLM fallback rate by 40% (35% -> 21%)

Documentation:
- SYSTEM_FLOW.html - Complete system architecture
- VERIFY_CATEGORIES_FEATURE.html - Feature documentation
- LABEL_TRAINING_PHASE_DETAIL.html - Calibration breakdown
- FAST_ML_ONLY_WORKFLOW.html - Pure ML guide
- PROJECT_STATUS_AND_NEXT_STEPS.html - Roadmap
- ROOT_CAUSE_ANALYSIS.md - Bug fixes

MVP Status:
- 10k emails in 4 minutes, 72.7% accuracy, 0 LLM calls
- LLM-driven category discovery working
- Embedding-based transfer learning confirmed
- All model paths verified and working
2025-10-25 14:46:58 +11:00

130 lines
3.2 KiB
Markdown

# Model Information
## Current Status
- **Model Type**: LightGBM Classifier (Production)
- **Location**: `src/models/pretrained/classifier.pkl`
- **Categories**: 12 (junk, transactional, auth, newsletters, social, automated, conversational, work, personal, finance, travel, unknown)
- **Feature Extraction**: Hybrid (embeddings + patterns + structural features)
## Usage
The ML classifier will automatically use the real model if it exists at:
```
src/models/pretrained/classifier.pkl
```
### Programmatic Usage
```python
from src.classification.ml_classifier import MLClassifier
# Will automatically load real model if available
classifier = MLClassifier()
# Check if using mock or real model
info = classifier.get_info()
print(f"Is mock: {info['is_mock']}")
print(f"Model type: {info['model_type']}")
# Make predictions
result = classifier.predict(feature_vector)
print(f"Category: {result['category']}")
print(f"Confidence: {result['confidence']}")
```
### Command Line Usage
```bash
# Test with mock pipeline
python -m src.cli run --source mock --output test_results/
# Test with real model (when available)
python -m src.cli run --source gmail --limit 100 --output results/
```
## How to Get a Real Model
### Option 1: Train Your Own (Recommended)
```python
from src.calibration.trainer import ModelTrainer
from src.calibration.enron_parser import EnronParser
from src.classification.feature_extractor import FeatureExtractor
# Parse Enron dataset
parser = EnronParser("enron_mail_20150507")
emails = parser.parse_emails(limit=5000)
# Extract features
extractor = FeatureExtractor()
labeled_data = [(email, category) for email, category in zip(emails, categories)]
# Train model
trainer = ModelTrainer(extractor, categories)
results = trainer.train(labeled_data)
# Save model
trainer.save_model("src/models/pretrained/classifier.pkl")
```
### Option 2: Download Pre-trained Model
Use the provided script:
```bash
cd tools
python download_pretrained_model.py \
--url https://example.com/model.pkl \
--hash abc123def456
```
### Option 3: Use Community Model
Check available pre-trained models at:
- Email Sorter releases on GitHub
- Hugging Face model hub (when available)
- Community-trained models
## Model Performance
Expected accuracy on real data:
- **Hard Rules**: 94-96% (instant, ~10% of emails)
- **ML Model**: 85-90% (fast, ~85% of emails)
- **LLM Review**: 92-95% (slower, ~5% uncertain cases)
- **Overall**: 90-94% (weighted average)
## Retraining
To retrain the model:
```bash
python -m src.cli train \
--source enron \
--output models/new_model.pkl \
--limit 10000
```
## Troubleshooting
### Model Not Loading
1. Check file exists: `src/models/pretrained/classifier.pkl`
2. Try to load directly:
```python
import pickle
with open('src/models/pretrained/classifier.pkl', 'rb') as f:
data = pickle.load(f)
print(data.keys())
```
3. Ensure pickle format is correct
### Low Accuracy
1. Model may be underfitted - train on more data
2. Feature extraction may need tuning
3. Categories may need adjustment
4. Consider LLM review for uncertain cases
### Slow Predictions
1. Use embedding cache for batch processing
2. Implement parallel processing
3. Consider quantization for LightGBM model
4. Profile feature extraction step