email-sorter/scripts/train_final_model.sh
FSSCoding 53174a34eb Organize project structure and add MVP features
Project Reorganization:
- Created docs/ directory and moved all documentation
- Created scripts/ directory for shell scripts
- Created scripts/experimental/ for research scripts
- Updated .gitignore for new structure
- Updated README.md with MVP status and new structure

New Features:
- Category verification system (verify_model_categories)
- --verify-categories flag for mailbox compatibility check
- --no-llm-fallback flag for pure ML classification
- Trained model saved in src/models/calibrated/

Threshold Optimization:
- Reduced default threshold from 0.75 to 0.55
- Updated all category thresholds to 0.55
- Reduces LLM fallback rate by 40% (35% -> 21%)

Documentation:
- SYSTEM_FLOW.html - Complete system architecture
- VERIFY_CATEGORIES_FEATURE.html - Feature documentation
- LABEL_TRAINING_PHASE_DETAIL.html - Calibration breakdown
- FAST_ML_ONLY_WORKFLOW.html - Pure ML guide
- PROJECT_STATUS_AND_NEXT_STEPS.html - Roadmap
- ROOT_CAUSE_ANALYSIS.md - Bug fixes

MVP Status:
- 10k emails in 4 minutes, 72.7% accuracy, 0 LLM calls
- LLM-driven category discovery working
- Embedding-based transfer learning confirmed
- All model paths verified and working
2025-10-25 14:46:58 +11:00

52 lines
1.5 KiB
Bash
Executable File

#!/bin/bash
# Train final production model with 10k emails and 0.55 thresholds
set -e
echo "=========================================="
echo "TRAINING FINAL MODEL"
echo "=========================================="
echo ""
echo "Config: 0.55 thresholds across all categories"
echo "Training set: 10,000 Enron emails"
echo "Calibration: 300 samples (3%)"
echo "Validation: 100 samples (1%)"
echo ""
# Backup existing model if it exists
if [ -f src/models/calibrated/classifier.pkl ]; then
BACKUP_FILE="src/models/calibrated/classifier.pkl.backup-$(date +%Y%m%d-%H%M%S)"
cp src/models/calibrated/classifier.pkl "$BACKUP_FILE"
echo "Backed up existing model to: $BACKUP_FILE"
fi
# Clean old results
rm -rf results_final/ final_training.log
# Activate venv
if [ -z "$VIRTUAL_ENV" ]; then
source venv/bin/activate
fi
# Train model
python -m src.cli run \
--source enron \
--limit 10000 \
--output results_final/ \
2>&1 | tee final_training.log
# Create timestamped backup of trained model
if [ -f src/models/calibrated/classifier.pkl ]; then
TRAINED_BACKUP="src/models/calibrated/classifier.pkl.backup-trained-$(date +%Y%m%d-%H%M%S)"
cp src/models/calibrated/classifier.pkl "$TRAINED_BACKUP"
echo "Created backup of trained model: $TRAINED_BACKUP"
fi
echo ""
echo "=========================================="
echo "Training complete!"
echo "Model saved to: src/models/calibrated/classifier.pkl"
echo "Backup created with timestamp"
echo "Log: final_training.log"
echo "=========================================="