Improve LLM prompts with proper context and purpose
Both discovery and consolidation prompts now explain:
- What the system does (train ML classifier for auto-sorting)
- What makes good categories (broad, timeless, learnable)
- Why this matters (user needs, ML training requirements)
- How to think about the task (user-focused, functional)

Discovery prompt changes:
- Explains goal of identifying natural categories for ML training
- Lists guidelines for good categories (broad, user-focused, learnable)
- Provides concrete examples of functional categories
- Emphasizes PURPOSE over topic

Consolidation prompt changes:
- Explains full system context (LightGBM, auto-labeling, user search)
- Defines what makes categories effective for ML and users
- Provides user-centric thinking framework
- Emphasizes reusability and timelessness

Prompts now give the brilliant 8b model proper context to deliver excellent category decisions instead of lazy generic categorization.
parent 88ef570fed
commit 183b12c9b4
@@ -105,16 +105,36 @@ class CalibrationAnalyzer:
 # Use first email ID as example
 example_id = batch[0].id if batch else "maildir_example__sent_1"

-prompt = f"""<no_think>Categorize these emails. You MUST copy the exact ID string for each email.
+prompt = f"""<no_think>You are analyzing emails to discover natural categories for an automatic classification system.

-EMAILS:
+GOAL: Identify broad, reusable categories that will help train a machine learning model to sort thousands of emails automatically.
+
+GUIDELINES FOR GOOD CATEGORIES:
+- BROAD & TIMELESS: "Financial" not "Q3 Budget Review"
+- USER-FOCUSED: Think "what would help someone find this email later?"
+- LEARNABLE: ML model needs consistent patterns (sender domains, keywords, structure)
+- FUNCTIONAL: Each category serves a distinct purpose
+- 3-10 categories ideal: Too many = noise, too few = useless
+
+EMAILS TO ANALYZE:
 {email_summary}

-CRITICAL: Copy the EXACT ID from each email above. For example, if email #1 has ID "{example_id}", you must write exactly "{example_id}" in the labels array, not "email1" or anything else.
+TASK:
+1. Identify natural groupings based on PURPOSE, not just topic
+2. Create SHORT (1-3 word) category names
+3. Assign each email to exactly one category
+4. CRITICAL: Copy EXACT email IDs - if email #1 shows ID "{example_id}", use exactly "{example_id}" in labels
+
+EXAMPLES OF GOOD CATEGORIES:
+- "Work Communication" (daily business emails)
+- "Financial" (invoices, budgets, reports)
+- "Urgent" (time-sensitive requests)
+- "Technical" (system alerts, dev discussions)
+- "Administrative" (HR, policies, announcements)

 Return JSON:
 {{
-"categories": {{"category_name": "description", ...}},
+"categories": {{"category_name": "what user need this serves", ...}},
 "labels": [["{example_id}", "category"], ...]
 }}

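The discovery prompt asks the model to copy email IDs exactly and to return a `categories` object plus a `labels` array of `[id, category]` pairs. A minimal sketch of how a caller might check that contract is shown here. The function name `parse_discovery_response` and the `batch_ids` parameter are hypothetical and do not appear in this commit, and the sketch assumes the model's reply is a bare JSON object with no surrounding text.

```python
import json


def parse_discovery_response(raw: str, batch_ids: list[str]) -> dict[str, str]:
    """Return {email_id: category} for labels that honor the prompt's contract.

    Hypothetical helper: not part of the commit shown in this diff.
    """
    data = json.loads(raw)
    categories = data.get("categories", {})
    known_ids = set(batch_ids)

    assignments: dict[str, str] = {}
    for entry in data.get("labels", []):
        # Each label is expected to be a two-item [email_id, category] pair.
        if not isinstance(entry, (list, tuple)) or len(entry) != 2:
            continue
        email_id, category = entry
        if not isinstance(email_id, str) or not isinstance(category, str):
            continue
        # Drop IDs the model invented (e.g. "email1") and categories
        # that were never declared in the "categories" object.
        if email_id in known_ids and category in categories:
            assignments[email_id] = category
    return assignments
```

Emails missing from the returned mapping were either skipped by the model or labeled with an ID that was not copied exactly, so the caller can decide whether to retry them.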
@@ -257,28 +277,51 @@ JSON:
 rules_text = "\n".join(rules)

 # Build prompt
-prompt = f"""<no_think>Consolidate email categories by merging duplicates and overlaps.
+prompt = f"""<no_think>You are helping build an email classification system that will automatically sort thousands of emails.

+TASK: Consolidate the discovered categories below into a lean, effective set for training a machine learning classifier.
+
+WHY THIS MATTERS:
+These categories will be used to:
+1. Train a LightGBM classifier on email features (embeddings, patterns, structure)
+2. Automatically label thousands of emails without human intervention
+3. Help users quickly find emails by category (like Gmail labels)
+
+WHAT MAKES GOOD CATEGORIES:
+- BROAD & REUSABLE: "Meetings" not "Q3 Planning Meeting" - applies to many emails
+- FUNCTIONALLY DISTINCT: Each category serves a different user need
+- BALANCED: Avoid 1 huge category + many tiny ones
+- LEARNABLE: ML model needs clear patterns to distinguish categories
+- TIMELESS: "Financial Reports" not "2023 Budget Review"
+- ACTION-ORIENTED: Users ask "show me all X" - what is X?
+
 DISCOVERED CATEGORIES (sorted by email count):
 {category_list}

-{context_section}CONSOLIDATION RULES:
+{context_section}CONSOLIDATION STRATEGY:
 {rules_text}

+THINK LIKE A USER: If you had to sort 10,000 emails, what categories would help you find things fast?
+- "Work Communication" catches daily business emails
+- "Urgent" flags time-sensitive items
+- "Financial" groups all money-related emails
+- "Technical" vs "Administrative" serves different workflows
+
 OUTPUT FORMAT - Return JSON with consolidated categories and mapping:
 {{
 "consolidated": {{
-"FinalCategoryName": "Clear, generic description of what emails fit here"
+"FinalCategoryName": "Clear description of what user need this serves"
 }},
 "mappings": {{
 "OldCategoryName": "FinalCategoryName"
 }}
 }}

-IMPORTANT:
-- consolidated dict should have {target_categories} or fewer entries
-- mappings dict must map EVERY old category name to a final category
-- Final category names should be present in both consolidated and mappings
+CRITICAL REQUIREMENTS:
+- Maximum {target_categories} final categories (strict limit)
+- Map EVERY old category to exactly one final category
+- Final category names must be SHORT (1-3 words), GENERIC, and REUSABLE
+- Think: "Would this category still make sense in 5 years?"

 JSON:
 """
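The CRITICAL REQUIREMENTS in the consolidation prompt are constraints a caller could verify after parsing the reply, since a small model may not always honor them. The sketch below is one possible check, under the assumption that the reply has already been decoded into a dict with `consolidated` and `mappings` keys. The name `check_consolidation` and its parameters are hypothetical and are not taken from this commit.

```python
def check_consolidation(
    result: dict,
    old_categories: list[str],
    target_categories: int,
) -> list[str]:
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper: not part of the commit shown in this diff.
    """
    consolidated = result.get("consolidated", {})
    mappings = result.get("mappings", {})
    problems: list[str] = []

    # "Maximum {target_categories} final categories (strict limit)"
    if len(consolidated) > target_categories:
        problems.append(
            f"{len(consolidated)} final categories exceed the limit of {target_categories}"
        )

    # "Map EVERY old category to exactly one final category"
    for name in old_categories:
        if name not in mappings:
            problems.append(f"old category {name!r} is not mapped")

    # Every mapping target must be one of the declared final categories.
    for old, final in mappings.items():
        if final not in consolidated:
            problems.append(f"{old!r} maps to unknown final category {final!r}")

    # "Final category names must be SHORT (1-3 words)"
    for name in consolidated:
        if len(name.split()) > 3:
            problems.append(f"final category {name!r} is longer than 3 words")

    return problems
```

A non-empty result could be logged, or used to decide whether to re-prompt the model or fall back to the unconsolidated categories.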