Text-Polish Implementation Plan
Based on the Blueprint Gap Analysis and web research. Generated: 2025-10-25
Executive Summary
Current Status:
- ✅ Core MVP works: hotkey → clipboard → model → clipboard
- ❌ Performance below targets: 82s model load (target: 2s), 63ms inference (target: 10ms)
- ❌ AU spelling not implemented (Phase 1 requirement)
- ❌ Config features are stubs
Priority Order:
- CRITICAL: Model optimization (ONNX + quantization)
- CRITICAL: AU spelling implementation
- HIGH: Config features (AGGRESSION, CUSTOM_DICTIONARY, MIN_LENGTH)
- MEDIUM: Service testing and deployment
1. Model Optimization (CRITICAL)
Research Findings
Source: /tmp/model-optimization-research/
Article: "Blazing Fast Inference with Quantized ONNX Models" by Tarun Gudipati
Performance Gains:
- 5x faster inference (0.5s → 0.1s in article example)
- 2.2x less memory (11MB → 4.9MB in article example)
- Expected results for text-polish:
  - Load time: 82s → ~16s (target: <2s, still needs work)
  - Inference: 63ms → ~12ms (target: <10ms, close!)
  - First inference: 284ms → ~57ms
Implementation Steps
Step 1: Install the optimum library
```bash
cd /MASTERFOLDER/Tools/text-polish
source venv/bin/activate
pip install "optimum[onnxruntime]"   # quoted so the brackets survive shell globbing
```
Step 2: Export the model to ONNX
```bash
optimum-cli export onnx \
  --model willwade/t5-small-spoken-typo \
  --optimize O3 \
  --task text2text-generation \
  t5_onnx
```
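A quick, optional sanity check that the export is loadable (assuming the output directory t5_onnx from above):
```bash
python -c "from optimum.onnxruntime import ORTModelForSeq2SeqLM; \
ORTModelForSeq2SeqLM.from_pretrained('t5_onnx')"
```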
Step 3: Quantize the model
```bash
optimum-cli onnxruntime quantize \
  --onnx_model t5_onnx \
  --output t5_onnx_quantized
```
Note: depending on the optimum version, this command also expects a target instruction-set flag (e.g. --avx512 or --arm64); check `optimum-cli onnxruntime quantize --help`.
Step 4: Update model_loader.py
Replace the PyTorch loading path with ONNX:
```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline

def load_model(model_path="t5_onnx_quantized"):
    """Load the quantized ONNX model behind a text2text pipeline."""
    tokenizer = AutoTokenizer.from_pretrained("willwade/t5-small-spoken-typo")
    model = ORTModelForSeq2SeqLM.from_pretrained(model_path)
    pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
    return pipe, tokenizer
```
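A minimal smoke test of the new loader; the sample text is arbitrary and the output shape follows the transformers pipeline convention:
```python
pipe, tokenizer = load_model()
out = pipe("teh color was realy nice", max_length=64)
print(out[0]["generated_text"])
```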
Step 5: Re-run the performance test
```bash
python test_performance.py
```
Expected Results:
- Load time: ~16s (improvement but still high, may need caching strategies)
- Inference: ~12ms average (close to 10ms target!)
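One caching avenue for the remaining load-time gap: save the tokenizer next to the ONNX files once, then load everything with local_files_only=True so startup never waits on a Hugging Face Hub round-trip. A sketch, assuming the Step 3 output directory:
```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline

# One-off: store the tokenizer alongside the quantized model
# AutoTokenizer.from_pretrained("willwade/t5-small-spoken-typo").save_pretrained("t5_onnx_quantized")

def load_model(model_path="t5_onnx_quantized"):
    # local_files_only skips every Hub lookup at daemon start
    tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
    model = ORTModelForSeq2SeqLM.from_pretrained(model_path, local_files_only=True)
    pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
    return pipe, tokenizer
```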
2. Australian Spelling Implementation (CRITICAL)
Research Findings
Source: /tmp/au-spelling-research/
Articles:
- "Spelling Differences Between American and Australian English" (getproofed.com.au)
- "4 Reasons Australian English is Unique" (unitedlanguagegroup.com)
AU Spelling Rules
Pattern 1: -our vs -or
"-or" → "-our"
Examples: color→colour, favor→favour, behavior→behaviour, neighbor→neighbour
Exception: "Labor Party" keeps -or
Pattern 2: -tre vs -ter
"-ter" → "-tre" (French origin words)
Examples: center→centre, theater→theatre, meter→metre
Pattern 3: -ise vs -ize
"-ize" → "-ise" (most common in AU)
Examples: authorize→authorise, plagiarize→plagiarise, organize→organise
Note: Both are acceptable, but -ise is standard
Pattern 4: -c vs -s (practice/practise)
Noun: "practice" (with c)
Verb: "practise" (with s)
US uses "practice" for both
Pattern 5: -oe/-ae vs -e
Mixed usage in AU (more relaxed than UK)
manoeuvre (AU/UK) vs maneuver (US)
encyclopedia (AU/US) vs encyclopaedia (UK)
Pattern 6: Double consonants
A final "l" doubles before "-ed"/"-ing" (this applies to "l" specifically, not to every consonant)
Examples: traveled→travelled, modeling→modelling
Exception: "program" is preferred over "programme" in AU
Pattern 7: Unique words
aluminum → aluminium
tire → tyre
Implementation
Create new file: src/au_spelling.py
"""Australian English spelling conversion module"""
import re
# Pattern-based replacements
AU_SPELLING_PATTERNS = [
# -or → -our (but not -ior, -oor)
(r'\b(\w+)or\b', r'\1our', ['color', 'favor', 'honor', 'labor', 'neighbor', 'behavior']),
# -ter → -tre (French words)
(r'\b(cen|thea|me)ter\b', r'\1tre'),
# -ize → -ise
(r'\b(\w+)ize\b', r'\1ise'),
# Double consonants for -ed/-ing
(r'\b(\w+[aeiou])([lnrt])ed\b', r'\1\2\2ed'),
(r'\b(\w+[aeiou])([lnrt])ing\b', r'\1\2\2ing'),
]
# Direct word replacements
AU_SPELLING_WORDS = {
# Unique words
'aluminum': 'aluminium',
'tire': 'tyre',
'tires': 'tyres',
'gray': 'grey',
# Exception: Labor Party keeps US spelling
# (handled by whitelist)
}
# Words that should NOT be converted
AU_SPELLING_WHITELIST = [
'labor party', # Political party name
'program', # Computer program (AU uses US spelling)
'inquiry', # AU prefers "inquiry" over "enquiry"
]
def convert_to_au_spelling(text: str, custom_whitelist: list = None) -> str:
"""
Convert American English text to Australian English spelling.
Args:
text: Input text in American English
custom_whitelist: Additional words/phrases to protect from conversion
Returns:
Text converted to Australian English spelling
"""
if not text:
return text
# Combine whitelists
whitelist = AU_SPELLING_WHITELIST.copy()
if custom_whitelist:
whitelist.extend(custom_whitelist)
# Check whitelist (case-insensitive)
text_lower = text.lower()
for protected in whitelist:
if protected.lower() in text_lower:
return text # Don't convert if whitelisted phrase present
result = text
# Apply direct word replacements
for us_word, au_word in AU_SPELLING_WORDS.items():
result = re.sub(r'\b' + us_word + r'\b', au_word, result, flags=re.IGNORECASE)
# Apply pattern-based replacements
for pattern in AU_SPELLING_PATTERNS:
if len(pattern) == 3:
# Pattern with word list
regex, replacement, word_list = pattern
for word in word_list:
result = re.sub(word + r'\b', word.replace('or', 'our'), result, flags=re.IGNORECASE)
else:
# Simple pattern
regex, replacement = pattern
result = re.sub(regex, replacement, result, flags=re.IGNORECASE)
return result
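A quick interactive check of the converter as written above:
```python
from au_spelling import convert_to_au_spelling

print(convert_to_au_spelling("The color of the theater"))
# -> The colour of the theatre
print(convert_to_au_spelling("Labor Party policy on color"))
# -> Labor Party policy on colour  ("Labor Party" itself stays protected)
```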
Update main.py:
```python
import pyperclip

from au_spelling import convert_to_au_spelling
from config import AU_SPELLING


def on_hotkey():
    text = pyperclip.paste()
    result = polish(model, tokenizer, text)

    # Apply AU spelling if enabled
    if AU_SPELLING:
        result = convert_to_au_spelling(result)

    pyperclip.copy(result)
```
3. Config Features Implementation (HIGH)
AGGRESSION Levels
Implementation in main.py:
```python
import logging

import pyperclip

from au_spelling import convert_to_au_spelling
from config import (AGGRESSION, AU_SPELLING, CUSTOM_DICTIONARY,
                    LOGGING, MIN_LENGTH)


def on_hotkey():
    text = pyperclip.paste()

    # Skip processing if the text is too short
    if len(text) < MIN_LENGTH:
        logging.info(f"Text too short ({len(text)} < {MIN_LENGTH}), skipping")
        return

    # Check the custom dictionary for protected words
    if CUSTOM_DICTIONARY:
        has_protected = any(word.lower() in text.lower() for word in CUSTOM_DICTIONARY)
        if has_protected and AGGRESSION == "minimal":
            logging.info("Protected word detected in minimal mode, reducing corrections")
            # Could adjust max_length or temperature here

    result = polish(model, tokenizer, text)

    # Apply AU spelling, shielding dictionary words in the gentler modes
    if AU_SPELLING:
        whitelist = CUSTOM_DICTIONARY if AGGRESSION in ["minimal", "custom"] else []
        result = convert_to_au_spelling(result, whitelist)

    pyperclip.copy(result)

    # Log a diff of the changes if enabled
    if LOGGING and text != result:
        diff = log_diff(text, result)  # log_diff: the tool's diff-logging helper
        logging.info(f"Changes:\n{diff}")
```
CUSTOM_DICTIONARY
Already implemented above - words in CUSTOM_DICTIONARY are:
- Protected from AU spelling conversion
- Used to adjust correction aggression
MIN_LENGTH
Already implemented above - text shorter than MIN_LENGTH skips processing.
4. Service Testing (MEDIUM)
Current service file: service/clipboard-polisher.service
- ✅ User set to bob
- ✅ Uses venv python path
- ⚠️ Not tested
Testing steps:
```bash
# Copy service file
sudo cp service/clipboard-polisher.service /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Start service
sudo systemctl start clipboard-polisher

# Check status
sudo systemctl status clipboard-polisher

# View logs
journalctl -u clipboard-polisher -f

# Enable on boot (optional)
sudo systemctl enable clipboard-polisher
```
Note: Hotkey functionality requires X11/Wayland access. Service may need DISPLAY environment variable.
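If the hotkey listener cannot reach the display, a drop-in override (created via `sudo systemctl edit clipboard-polisher`) is one way to pass the session environment through; the values below are assumptions that depend on bob's actual login session:
```ini
[Service]
# Hypothetical values; match them to the real session
Environment=DISPLAY=:0
Environment=XAUTHORITY=/home/bob/.Xauthority
```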
5. Testing Plan
Test 1: Performance (re-run after ONNX)
```bash
python test_performance.py
```
Target: <20ms average inference, <20s load time
Test 2: AU Spelling
python -c "
from src.au_spelling import convert_to_au_spelling
tests = [
('I cant beleive its color', 'I cant beleive its colour'),
('The theater center', 'The theatre centre'),
('Authorize the program', 'Authorise the program'),
]
for input_text, expected in tests:
result = convert_to_au_spelling(input_text)
assert result == expected, f'Failed: {result} != {expected}'
print('All AU spelling tests passed!')
"
Test 3: Integration
Create test_integration.py:
```python
#!/usr/bin/env python3
import sys
sys.path.insert(0, '/MASTERFOLDER/Tools/text-polish/src')

from model_loader import load_model, polish
from au_spelling import convert_to_au_spelling

model, tokenizer = load_model()

test_cases = [
    "teh color was realy nice",        # should become "the colour was really nice"
    "I need to organize the theater",  # should become "I need to organise the theatre"
]

for test in test_cases:
    result = polish(model, tokenizer, test)
    result_au = convert_to_au_spelling(result)
    print(f"Input:  {test}")
    print(f"Polish: {result}")
    print(f"AU:     {result_au}")
    print()
```
6. Priority Task List
Week 1: Performance
- Install optimum library
- Export and quantize model
- Update model_loader.py
- Run performance tests
- Document results
Week 2: AU Spelling
- Create au_spelling.py with all patterns
- Write unit tests for each pattern
- Integrate into main.py
- Test with real examples
- Update documentation
Week 3: Config Features
- Implement AGGRESSION logic
- Implement MIN_LENGTH check
- Integrate CUSTOM_DICTIONARY
- Add logging for all changes
- Test all combinations
Week 4: Deployment
- Test systemd service
- Fix any environment issues
- Test hotkey functionality
- Add monitoring/logging
- Documentation
7. Success Metrics
Performance:
- Model load < 20s (intermediate target, final target 2s)
- Average inference < 20ms (intermediate, final 10ms)
- Memory < 300MB
Functionality:
- AU spelling conversions working (all 7 patterns)
- AGGRESSION levels functional
- CUSTOM_DICTIONARY protects words
- MIN_LENGTH filter works
- Logging shows diffs
Deployment:
- Service starts successfully
- Hotkey works in service mode
- 24/7 uptime capable
- Error handling robust
Research Sources
- ONNX Optimization:
  - Article: "Blazing Fast Inference with Quantized ONNX Models"
  - Author: Tarun Gudipati
  - URL: https://codezen.medium.com/blazing-fast-inference-with-quantized-onnx-models-518f23777741
  - Key: 5x speed, 2.2x memory reduction
- AU Spelling:
  - Article 1: "Spelling Differences Between American and Australian English" (getproofed.com.au)
  - Article 2: "4 Reasons Australian English is Unique" (unitedlanguagegroup.com)
  - Key: 7 main spelling patterns identified
- Custom Dictionaries:
  - Article: "Autocorrect Feature using NLP in Python" (analyticsvidhya.com)
  - Key: whitelist implementation patterns