# Text-Polish Implementation Plan

**Based on Blueprint Gap Analysis and Web Research**
**Generated:** 2025-10-25

---

## Executive Summary

**Current Status:**
- ✅ Core MVP works: hotkey → clipboard → model → clipboard
- ❌ Performance below targets: 82s load (vs 2s target), 63ms inference (vs 10ms target)
- ❌ AU spelling not implemented (Phase 1 requirement)
- ❌ Config features are stubs

**Priority Order:**
1. **CRITICAL**: Model optimization (ONNX + quantization)
2. **CRITICAL**: AU spelling implementation
3. **HIGH**: Config features (AGGRESSION, CUSTOM_DICTIONARY, MIN_LENGTH)
4. **MEDIUM**: Service testing and deployment

---

## 1. Model Optimization (CRITICAL)

### Research Findings

**Source:** `/tmp/model-optimization-research/`
**Article:** "Blazing Fast Inference with Quantized ONNX Models" by Tarun Gudipati

**Performance Gains:**
- **5x faster inference** (0.5s → 0.1s in the article's example)
- **2.2x less memory** (11MB → 4.9MB in the article's example)
- Expected results for text-polish, scaling our current numbers by the article's ratios:
  - Load time: 82s → ~16s (target: <2s, still needs work)
  - Inference: 63ms → ~12ms (target: <10ms, close!)
  - First inference: 284ms → ~57ms

### Implementation Steps

**Step 1: Install the optimum library**

```bash
cd /MASTERFOLDER/Tools/text-polish
source venv/bin/activate
pip install optimum[onnxruntime]
```

**Step 2: Export the model to ONNX**

```bash
optimum-cli export onnx \
  --model willwade/t5-small-spoken-typo \
  --optimize O3 \
  --task text2text-generation \
  t5_onnx
```

**Step 3: Quantize the model**

```bash
optimum-cli onnxruntime quantize \
  --onnx_model t5_onnx \
  --output t5_onnx_quantized
```

If the command complains about a missing quantization config, recent optimum versions also require a target-architecture flag such as `--avx512` (check `optimum-cli onnxruntime quantize --help`).

**Step 4: Update model_loader.py**

Replace the PyTorch loading with ONNX:

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline

def load_model(model_path="t5_onnx_quantized"):
    # Tokenizer still comes from the original checkpoint; weights come from ONNX
    tokenizer = AutoTokenizer.from_pretrained("willwade/t5-small-spoken-typo")
    model = ORTModelForSeq2SeqLM.from_pretrained(model_path)
    pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
    return pipe, tokenizer
```

**Step 5: Re-run the performance test**

```bash
python test_performance.py
```

**Expected Results:**
- Load time: ~16s (an improvement, but still high; may need caching strategies)
- Inference: ~12ms average (close to the 10ms target!)

---
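If the CLI flags above differ in the installed optimum version, the export and quantization can also be driven from Python. The following is a minimal sketch assuming a recent `optimum[onnxruntime]`; the three ONNX file names are the standard names optimum gives a seq2seq export, but verify them against the actual contents of `t5_onnx/`:

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the PyTorch checkpoint to ONNX (equivalent to Step 2)
model = ORTModelForSeq2SeqLM.from_pretrained(
    "willwade/t5-small-spoken-typo", export=True
)
model.save_pretrained("t5_onnx")

# Dynamic int8 quantization for AVX2 CPUs; adjust to the target machine
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)

# A seq2seq export contains several ONNX graphs; quantize each one
for onnx_file in ("encoder_model.onnx", "decoder_model.onnx",
                  "decoder_with_past_model.onnx"):
    quantizer = ORTQuantizer.from_pretrained("t5_onnx", file_name=onnx_file)
    quantizer.quantize(save_dir="t5_onnx_quantized",
                       quantization_config=qconfig)
```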
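To double-check the numbers that `test_performance.py` reports, here is a minimal timing sketch against the Step 4 loader (the import path assumes the repo layout used in Test 3 of the Testing Plan below):

```python
import statistics
import sys
import time

sys.path.insert(0, '/MASTERFOLDER/Tools/text-polish/src')
from model_loader import load_model  # Step 4 version returning (pipe, tokenizer)

start = time.perf_counter()
pipe, tokenizer = load_model()
print(f"Load time: {time.perf_counter() - start:.1f}s")

# The first call includes ONNX session warm-up, so time it separately
start = time.perf_counter()
pipe("teh color was realy nice")
print(f"First inference: {(time.perf_counter() - start) * 1000:.0f}ms")

# Steady-state average over repeated calls
times = []
for _ in range(20):
    start = time.perf_counter()
    pipe("teh color was realy nice")
    times.append((time.perf_counter() - start) * 1000)
print(f"Average inference: {statistics.mean(times):.1f}ms")
```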
## 2. Australian Spelling Implementation (CRITICAL)

### Research Findings

**Source:** `/tmp/au-spelling-research/`
**Articles:**
- "Spelling Differences Between American and Australian English" (getproofed.com.au)
- "4 Reasons Australian English is Unique" (unitedlanguagegroup.com)

### AU Spelling Rules

**Pattern 1: -our vs -or**

```
"-or" → "-our"
Examples: color→colour, favor→favour, behavior→behaviour, neighbor→neighbour
Exception: "Labor Party" keeps -or
```

**Pattern 2: -tre vs -ter**

```
"-ter" → "-tre" (French-origin words)
Examples: center→centre, theater→theatre, meter→metre
```

**Pattern 3: -ise vs -ize**

```
"-ize" → "-ise" (most common in AU)
Examples: authorize→authorise, plagiarize→plagiarise, organize→organise
Note: both are acceptable, but -ise is standard
```

**Pattern 4: -c vs -s (practice/practise)**

```
Noun: "practice" (with c)
Verb: "practise" (with s)
US uses "practice" for both
```

**Pattern 5: -oe/-ae vs -e**

```
Mixed usage in AU (more relaxed than UK)
manoeuvre (AU/UK) vs maneuver (US)
encyclopedia (AU/US) vs encyclopaedia (UK)
```

**Pattern 6: Double consonants**

```
"-ed"/"-ing" → double the final consonant
Examples: traveled→travelled, modeling→modelling
Exception: "program" preferred over "programme"
```

**Pattern 7: Unique words**

```
aluminum → aluminium
tire → tyre
```

### Implementation

**Create new file:** `src/au_spelling.py`

```python
"""Australian English spelling conversion module."""

import re

# Pattern 1: -or -> -our stems. Word-list based so that words like
# "for", "actor", or "Colorado" are never touched; common suffixes
# (colours, coloured, behavioural, favourite) are carried by the regex.
OUR_WORDS = [
    'color', 'favor', 'honor', 'labor', 'neighbor', 'behavior',
    'flavor', 'harbor', 'humor', 'rumor', 'odor', 'vapor',
]

# Pattern 2: -ter -> -tre for French-origin words. This is a heuristic:
# device senses such as "parking meter" also keep -er in AU.
TRE_PATTERN = re.compile(r'\b(cen|thea|me|kilome|millime|li)ter(s?)\b',
                         re.IGNORECASE)

# Pattern 3: -ize -> -ise, including inflections (organizing, organization).
IZE_PATTERN = re.compile(r'\b(\w*)iz(es?|ed|ing|ations?|ers?)\b', re.IGNORECASE)
# Words where "ize" belongs to the root and must not change.
IZE_EXCEPTIONS = {'size', 'seize', 'prize', 'maize', 'capsize', 'resize'}

# Pattern 6: verbs whose final "l" doubles before -ed/-ing/-er in AU.
DOUBLE_L_STEMS = ['travel', 'model', 'cancel', 'label', 'fuel', 'marvel', 'signal']

# Pattern 7: direct word replacements. Note "tire" (verb) and "Gray"
# (surname) are homographs this table cannot distinguish.
AU_SPELLING_WORDS = {
    'aluminum': 'aluminium',
    'tire': 'tyre',
    'tires': 'tyres',
    'gray': 'grey',
}

# Patterns 4 and 5 need word-sense context and are not handled here.

# Phrases that should NOT be converted
AU_SPELLING_WHITELIST = [
    'Labor Party',  # Political party name keeps US spelling
    'program',      # Computer program (AU uses US spelling)
    'inquiry',      # AU prefers "inquiry" over "enquiry"
]


def _match_case(replacement: str, original: str) -> str:
    """Copy the original word's capitalisation onto the replacement."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement.capitalize()
    return replacement


def convert_to_au_spelling(text: str, custom_whitelist: list = None) -> str:
    """
    Convert American English text to Australian English spelling.

    Args:
        text: Input text in American English
        custom_whitelist: Additional words/phrases to protect from conversion

    Returns:
        Text converted to Australian English spelling
    """
    if not text:
        return text

    whitelist = AU_SPELLING_WHITELIST + (custom_whitelist or [])

    # Mask whitelisted phrases so the rest of the text is still converted
    placeholders = {}

    def _protect(match):
        key = f'\x00{len(placeholders)}\x00'
        placeholders[key] = match.group(0)
        return key

    for phrase in whitelist:
        text = re.sub(re.escape(phrase), _protect, text, flags=re.IGNORECASE)

    # Direct word replacements
    for us_word, au_word in AU_SPELLING_WORDS.items():
        text = re.sub(rf'\b{us_word}\b',
                      lambda m, au=au_word: _match_case(au, m.group(0)),
                      text, flags=re.IGNORECASE)

    # -or -> -our
    for word in OUR_WORDS:
        au = word[:-2] + 'our'
        text = re.sub(rf'\b({word})(s|ed|ing|ful|al|ally|ite|ites)?\b',
                      lambda m, au=au: _match_case(au, m.group(1)) + (m.group(2) or ''),
                      text, flags=re.IGNORECASE)

    # -ter -> -tre
    text = TRE_PATTERN.sub(
        lambda m: m.group(1) + ('TRE' if m.group(0).isupper() else 'tre') + m.group(2),
        text)

    # -ize -> -ise (skip root words like "size" and "prize")
    def _ise(match):
        if (match.group(1) + 'ize').lower() in IZE_EXCEPTIONS:
            return match.group(0)
        s = 'IS' if match.group(0).isupper() else 'is'
        return match.group(1) + s + match.group(2)

    text = IZE_PATTERN.sub(_ise, text)

    # Double the final "l" (traveled -> travelled, modeling -> modelling)
    for stem in DOUBLE_L_STEMS:
        text = re.sub(rf'\b({stem})(ed|ing|ers?)\b', r'\1l\2',
                      text, flags=re.IGNORECASE)

    # Restore whitelisted phrases
    for key, original in placeholders.items():
        text = text.replace(key, original)

    return text
```
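Pattern 4 (practice/practise) depends on part of speech, which the suffix rules in the module above cannot see. Below is a minimal context heuristic as a sketch; the helper name `normalise_practice` and its context rules are assumptions, not part of the module yet:

```python
import re

def normalise_practice(text: str) -> str:
    """Heuristic AU practice/practise split: verb uses take 's',
    noun uses after a determiner take 'c'. Bare ambiguous uses
    ('they practice daily') are left alone."""
    # Verb contexts: "to practice" -> "to practise"; inflections take 's'
    text = re.sub(r'\bto practice\b', 'to practise', text, flags=re.IGNORECASE)
    text = re.sub(r'\bpractic(ed|ing)\b', r'practis\1', text, flags=re.IGNORECASE)
    # Noun contexts after a determiner keep the 'c' spelling
    text = re.sub(r'\b(a|the|my|our|their|his|her) practise\b',
                  r'\1 practice', text, flags=re.IGNORECASE)
    return text
```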
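Ahead of the Week 2 task "write unit tests for each pattern", here is a pytest sketch against the module above (the file name `tests/test_au_spelling.py` and the cases are illustrative):

```python
import pytest
from au_spelling import convert_to_au_spelling

@pytest.mark.parametrize("us, au", [
    ("color", "colour"),            # Pattern 1: -or -> -our
    ("theater", "theatre"),         # Pattern 2: -ter -> -tre
    ("organize", "organise"),       # Pattern 3: -ize -> -ise
    ("traveled", "travelled"),      # Pattern 6: double consonant
    ("aluminum", "aluminium"),      # Pattern 7: unique words
    ("size", "size"),               # -ize root word must not change
    ("Labor Party", "Labor Party"), # whitelist protection
])
def test_au_spelling(us, au):
    assert convert_to_au_spelling(us) == au
```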
**Update main.py:**

```python
import pyperclip

from config import AU_SPELLING
from au_spelling import convert_to_au_spelling

def on_hotkey():
    # model and tokenizer are loaded once at module level
    text = pyperclip.paste()
    result = polish(model, tokenizer, text)

    # Apply AU spelling if enabled
    if AU_SPELLING:
        result = convert_to_au_spelling(result)

    pyperclip.copy(result)
```

---

## 3. Config Features Implementation (HIGH)

### AGGRESSION Levels

**Implementation in main.py:**

```python
def on_hotkey():
    text = pyperclip.paste()

    # Skip processing if text is too short
    if len(text) < MIN_LENGTH:
        logging.info(f"Text too short ({len(text)} < {MIN_LENGTH}), skipping")
        return

    # Check custom dictionary for protected words
    if CUSTOM_DICTIONARY:
        has_protected = any(word.lower() in text.lower() for word in CUSTOM_DICTIONARY)
        if has_protected and AGGRESSION == "minimal":
            logging.info("Protected word detected in minimal mode, reducing corrections")
            # Could adjust max_length or temperature here

    result = polish(model, tokenizer, text)

    # Apply AU spelling
    if AU_SPELLING:
        whitelist = CUSTOM_DICTIONARY if AGGRESSION in ["minimal", "custom"] else []
        result = convert_to_au_spelling(result, whitelist)

    pyperclip.copy(result)

    # Log diff if enabled
    if LOGGING and text != result:
        diff = log_diff(text, result)
        logging.info(f"Changes:\n{diff}")
```

### CUSTOM_DICTIONARY

Already implemented above - words in CUSTOM_DICTIONARY are:
1. Protected from AU spelling conversion
2. Used to adjust correction aggression

### MIN_LENGTH

Already implemented above - text shorter than MIN_LENGTH skips processing.

---

## 4. Service Testing (MEDIUM)

**Current service file:** `service/clipboard-polisher.service`
- ✅ User set to `bob`
- ✅ Uses venv python path
- ⚠️ Not tested

**Testing steps:**

```bash
# Copy service file
sudo cp service/clipboard-polisher.service /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Start service
sudo systemctl start clipboard-polisher

# Check status
sudo systemctl status clipboard-polisher

# View logs
journalctl -u clipboard-polisher -f

# Enable on boot (optional)
sudo systemctl enable clipboard-polisher
```

**Note:** Hotkey functionality requires X11/Wayland access, so the service may need the `DISPLAY` environment variable; see the sketch below.
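A sketch of what the unit file could look like with the display environment added. The `DISPLAY`/`XAUTHORITY` values and the `ExecStart` path are assumptions for a single-user X11 session (Wayland needs different variables); check the real values with `echo $DISPLAY` in a desktop terminal:

```ini
[Unit]
Description=Text-polish clipboard service
After=graphical.target

[Service]
User=bob
WorkingDirectory=/MASTERFOLDER/Tools/text-polish
ExecStart=/MASTERFOLDER/Tools/text-polish/venv/bin/python src/main.py
# Hotkey and clipboard access need the user's display session.
# Assumed values; verify against the actual desktop environment.
Environment=DISPLAY=:0
Environment=XAUTHORITY=/home/bob/.Xauthority
Restart=on-failure

[Install]
WantedBy=graphical.target
```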
---

## 5. Testing Plan

### Test 1: Performance (re-run after ONNX)

```bash
python test_performance.py
```

**Target:** <20ms average inference, <20s load time

### Test 2: AU Spelling

```bash
python -c "
from src.au_spelling import convert_to_au_spelling

tests = [
    ('I cant beleive its color', 'I cant beleive its colour'),
    ('The theater center', 'The theatre centre'),
    ('Authorize the program', 'Authorise the program'),
]

for input_text, expected in tests:
    result = convert_to_au_spelling(input_text)
    assert result == expected, f'Failed: {result} != {expected}'

print('All AU spelling tests passed!')
"
```

### Test 3: Integration

Create `test_integration.py`:

```python
#!/usr/bin/env python3
import sys
sys.path.insert(0, '/MASTERFOLDER/Tools/text-polish/src')

from model_loader import load_model, polish
from au_spelling import convert_to_au_spelling

model, tokenizer = load_model()

test_cases = [
    "teh color was realy nice",        # Should become "the colour was really nice"
    "I need to organize the theater",  # Should become "I need to organise the theatre"
]

for test in test_cases:
    result = polish(model, tokenizer, test)
    result_au = convert_to_au_spelling(result)
    print(f"Input:  {test}")
    print(f"Polish: {result}")
    print(f"AU:     {result_au}")
    print()
```

---

## 6. Priority Task List

### Week 1: Performance
1. Install the optimum library
2. Export and quantize the model
3. Update model_loader.py
4. Run performance tests
5. Document results

### Week 2: AU Spelling
1. Create au_spelling.py with all patterns
2. Write unit tests for each pattern
3. Integrate into main.py
4. Test with real examples
5. Update documentation

### Week 3: Config Features
1. Implement AGGRESSION logic
2. Implement MIN_LENGTH check
3. Integrate CUSTOM_DICTIONARY
4. Add logging for all changes
5. Test all combinations

### Week 4: Deployment
1. Test the systemd service
2. Fix any environment issues
3. Test hotkey functionality
4. Add monitoring/logging
5. Documentation

---

## 7. Success Metrics

**Performance:**
- [ ] Model load < 20s (intermediate target; final target 2s)
- [ ] Average inference < 20ms (intermediate; final 10ms)
- [ ] Memory < 300MB

**Functionality:**
- [ ] AU spelling conversions working (all 7 patterns)
- [ ] AGGRESSION levels functional
- [ ] CUSTOM_DICTIONARY protects words
- [ ] MIN_LENGTH filter works
- [ ] Logging shows diffs

**Deployment:**
- [ ] Service starts successfully
- [ ] Hotkey works in service mode
- [ ] 24/7 uptime capable
- [ ] Error handling robust

---

## Research Sources

1. **ONNX Optimization:**
   - Article: "Blazing Fast Inference with Quantized ONNX Models"
   - Author: Tarun Gudipati
   - URL: https://codezen.medium.com/blazing-fast-inference-with-quantized-onnx-models-518f23777741
   - Key: 5x speed, 2.2x memory reduction

2. **AU Spelling:**
   - Article 1: "Spelling Differences Between American and Australian English" (getproofed.com.au)
   - Article 2: "4 Reasons Australian English is Unique" (unitedlanguagegroup.com)
   - Key: 7 main spelling patterns identified

3. **Custom Dictionaries:**
   - Article: "Autocorrect Feature using NLP in Python" (analyticsvidhya.com)
   - Key: Whitelist implementation patterns