Email Sorter - Supplementary Tools

This directory contains optional standalone tools that complement the main ML classification pipeline without interfering with it.

Tools

batch_llm_classifier.py

Purpose: Ask custom questions across batches of emails using a vLLM server

Prerequisite: vLLM server must be running at configured endpoint

When to use this:

  • One-off batch analysis with custom questions
  • Exploratory queries ("find all emails mentioning budget cuts")
  • Custom classification criteria not covered by the trained ML model
  • Quick ad-hoc analysis without retraining

When to use RAG instead:

  • Searching across large email corpus (10k+ emails)
  • Finding specific topics/keywords with semantic search
  • Building knowledge base from email content
  • Multi-step reasoning across many documents

When to use main ML pipeline:

  • Regular ongoing classification of incoming emails
  • High-volume processing (100k+ emails)
  • Consistent categories that don't change
  • Maximum speed (pure ML with no LLM calls)

batch_llm_classifier.py Usage

Check vLLM Server Status

python tools/batch_llm_classifier.py check

Expected output:

✓ vLLM server is running and ready
✓ Max concurrent requests: 4
✓ Estimated throughput: ~4.4 emails/sec
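
Under the hood, the check presumably queries the server's /v1/models endpoint (the same endpoint the Troubleshooting section hits with curl). A minimal sketch of such a check, assuming the VLLM_CONFIG dict shown in the Configuration section (the function name is illustrative, not the tool's actual API):

import httpx

def check_server(config):
    # Query the OpenAI-compatible /models endpoint to confirm the server
    # is reachable and the expected model is loaded.
    resp = httpx.get(
        f"{config['base_url']}/models",
        headers={'Authorization': f"Bearer {config['api_key']}"},
        timeout=10.0,
    )
    resp.raise_for_status()
    loaded = [m['id'] for m in resp.json()['data']]
    return config['model'] in loaded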

Ask Custom Question

python tools/batch_llm_classifier.py ask \
  --source enron \
  --limit 100 \
  --question "Does this email contain any financial numbers or budget information?" \
  --output financial_emails.txt

Parameters:

  • --source: Email provider (gmail, enron)
  • --credentials: Path to credentials (for Gmail)
  • --limit: Number of emails to process
  • --question: Custom question to ask about each email
  • --output: Output file for results
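
For Gmail, a run might look like this (the credentials path and question are illustrative):

python tools/batch_llm_classifier.py ask \
  --source gmail \
  --credentials credentials.json \
  --limit 50 \
  --question "Is this email a newsletter or promotional mailing?" \
  --output newsletters.txt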

Example Questions

Finding specific content:

--question "Is this email about a meeting or calendar event? Answer yes/no and provide date if found."

Sentiment analysis:

--question "What is the tone of this email? Professional/Casual/Urgent/Friendly?"

Categorization with custom criteria:

--question "Should this email be archived or kept for reference? Explain why."

Data extraction:

--question "Extract all names, dates, and dollar amounts mentioned in this email."

Configuration

vLLM server settings are in batch_llm_classifier.py:

VLLM_CONFIG = {
    'base_url': 'https://rtx3090.bobai.com.au/v1',
    'api_key': 'rtx3090_foxadmin_10_8034ecb47841f45ba1d5f3f5d875c092',
    'model': 'qwen3-coder-30b',
    'batch_size': 4,  # Tested optimal - 100% success rate
    'temperature': 0.1,
    'max_tokens': 500
}

Note: batch_size: 4 is the tested optimal setting. The tool uses batch pooling: it sends 4 requests concurrently, waits for all of them to complete, then sends the next 4. Higher values cause 503 errors (see Performance Benchmarks below).
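
The pooling loop amounts to something like the sketch below (asyncio + httpx; function names are illustrative, not the tool's actual API):

import asyncio
import httpx

async def ask_one(client, config, prompt):
    # One chat-completion request against the OpenAI-compatible vLLM API.
    resp = await client.post(
        f"{config['base_url']}/chat/completions",
        headers={'Authorization': f"Bearer {config['api_key']}"},
        json={
            'model': config['model'],
            'temperature': config['temperature'],
            'max_tokens': config['max_tokens'],
            'messages': [{'role': 'user', 'content': prompt}],
        },
        timeout=60.0,
    )
    resp.raise_for_status()
    return resp.json()['choices'][0]['message']['content']

async def classify_all(config, prompts):
    results = []
    async with httpx.AsyncClient() as client:
        # Batch pooling: send batch_size requests concurrently, wait for
        # all of them to complete, then send the next pool.
        for i in range(0, len(prompts), config['batch_size']):
            pool = prompts[i:i + config['batch_size']]
            results += await asyncio.gather(
                *(ask_one(client, config, p) for p in pool)
            )
    return results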


Performance Benchmarks

Tested on rtx3090.bobai.com.au with qwen3-coder-30b:

Emails   Batch Size    Time   Throughput   Success Rate
500      4 (pooled)    108s   4.65/sec     100%
500      8 (pooled)    62s    8.10/sec     60%
500      20 (pooled)   23s    21.8/sec     23%

Conclusion: batch_size=4 with batch pooling is optimal (100% reliability, ~4.7 req/sec).


Architecture Notes

Prompt Caching Optimization

Prompts are structured with static content first, variable content last:

STATIC (cached):
  - System instructions
  - Question
  - Output format guidelines

VARIABLE (not cached):
  - Email subject
  - Email sender
  - Email body

This allows vLLM to cache the static portion across all emails in the batch.
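
A sketch of that layout, with illustrative field and function names:

SYSTEM_INSTRUCTIONS = (
    'You are an email analyst. Answer the question about the email below, '
    'following the output format exactly.'
)

def build_prompt(question, email):
    # Static prefix: identical for every email in the batch, so vLLM can
    # serve it from the prompt cache.
    static = f"{SYSTEM_INSTRUCTIONS}\n\nQuestion: {question}\n\n"
    # Variable suffix: changes per email, so it goes last.
    variable = (
        f"Subject: {email['subject']}\n"
        f"From: {email['sender']}\n"
        f"Body:\n{email['body']}"
    )
    return static + variable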

Separation from Main Pipeline

This tool is completely independent from the main classification pipeline:

  • Main pipeline (src/cli.py run):
    • Uses calibrated LightGBM model
    • Fast pure ML classification
    • Optional LLM fallback for low-confidence cases
    • Processes 10k emails in ~24s (pure ML) or ~5min (with LLM fallback)

  • Batch LLM tool (tools/batch_llm_classifier.py):
    • Uses vLLM server exclusively
    • Custom questions per run
    • ~4.4 emails/sec throughput
    • For ad-hoc analysis, not production classification

No Interference Guarantee

The batch LLM tool:

  • ✓ Does NOT modify any files in src/
  • ✓ Does NOT touch trained models in src/models/
  • ✓ Does NOT affect config files
  • ✓ Does NOT interfere with existing workflows
  • ✓ Uses a separate vLLM endpoint (not Ollama)

Comparison: Batch LLM vs RAG

Feature        Batch LLM (this tool)       RAG (rag-search)
Speed          4.4 emails/sec              Instant (pre-indexed)
Flexibility    Custom questions            Semantic search queries
Best for       50-500 email batches        10k+ email corpus
Prerequisite   vLLM server running         RAG collection indexed
Use case       "Does this mention X?"      "Find all emails about X"
Reasoning      Per-email LLM analysis      Similarity + ranking

Rule of thumb:

  • < 500 emails + custom question = Use Batch LLM
  • 1000+ emails + topic search = Use RAG
  • Regular classification = Use main ML pipeline

Prerequisites

  1. vLLM server must be running

  2. Python dependencies

    pip install httpx click
    
  3. Email provider setup

    • Enron: No setup needed (uses local maildir)
    • Gmail: Requires credentials file

Troubleshooting

"vLLM server not available"

Check server status:

curl https://rtx3090.bobai.com.au/v1/models \
  -H "Authorization: Bearer rtx3090_foxadmin_10_8034ecb47841f45ba1d5f3f5d875c092"

Verify model is loaded:

python tools/batch_llm_classifier.py check

High error rate (503 errors)

Reduce the batch size (i.e. concurrent requests) in VLLM_CONFIG:

'batch_size': 2,  # Lower if getting 503s

Slow processing

  • Check vLLM server isn't overloaded
  • Verify network latency to rtx3090.bobai.com.au
  • Consider using main ML pipeline for large batches

Future Enhancements

Potential additions (not implemented):

  • Support for custom prompt templates
  • JSON output mode for structured extraction
  • Progress bar for large batches
  • Retry logic for transient failures
  • Multi-server load balancing
  • Streaming responses for real-time feedback

Remember: This tool is supplementary. For production email classification, use the main ML pipeline (src/cli.py run).