# Text-Polish Implementation Plan

**Based on Blueprint Gap Analysis and Web Research**

**Generated:** 2025-10-25

---

## Executive Summary

**Current Status:**

- ✅ Core MVP works: hotkey → clipboard → model → clipboard
- ❌ Performance below targets: 82s load (vs 2s), 63ms inference (vs 10ms)
- ❌ AU spelling not implemented (Phase 1 requirement)
- ❌ Config features are stubs

**Priority Order:**

1. **CRITICAL**: Model optimization (ONNX + quantization)
2. **CRITICAL**: AU spelling implementation
3. **HIGH**: Config features (AGGRESSION, CUSTOM_DICTIONARY, MIN_LENGTH)
4. **MEDIUM**: Service testing and deployment

---
## 1. Model Optimization (CRITICAL)

### Research Findings

**Source:** `/tmp/model-optimization-research/`
**Article:** "Blazing Fast Inference with Quantized ONNX Models" by Tarun Gudipati

**Performance Gains:**

- **5x faster inference** (0.5s → 0.1s in the article's example)
- **2.2x less memory** (11MB → 4.9MB in the article's example)
- Expected results for text-polish:
  - Load time: 82s → ~16s (target: <2s, still needs work)
  - Inference: 63ms → ~12ms (target: <10ms, close!)
  - First inference: 284ms → ~57ms

### Implementation Steps

**Step 1: Install the optimum library**

```bash
cd /MASTERFOLDER/Tools/text-polish
source venv/bin/activate
pip install "optimum[onnxruntime]"  # quoted so the brackets survive shell globbing
```

**Step 2: Export the model to ONNX**

```bash
optimum-cli export onnx \
  --model willwade/t5-small-spoken-typo \
  --optimize O3 \
  --task text2text-generation \
  t5_onnx
```

**Step 3: Quantize the model**

```bash
# Note: the quantize command may also require a target-architecture
# flag (e.g. --avx2) depending on the optimum version
optimum-cli onnxruntime quantize \
  --onnx_model t5_onnx \
  --output t5_onnx_quantized
```

**Step 4: Update model_loader.py**

Replace the PyTorch loading with ONNX:

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline


def load_model(model_path="t5_onnx_quantized"):
    tokenizer = AutoTokenizer.from_pretrained("willwade/t5-small-spoken-typo")
    model = ORTModelForSeq2SeqLM.from_pretrained(model_path)
    pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
    return pipe, tokenizer
```

**Step 5: Re-run the performance test**

```bash
python test_performance.py
```

**Expected Results:**

- Load time: ~16s (an improvement, but still high; may need caching strategies)
- Inference: ~12ms average (close to the 10ms target!)

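For a repeatable before/after comparison, `test_performance.py` could be built around a small timing helper like the following (a sketch; `benchmark` and its call sites are illustrative, not the project's actual script):

```python
import time


def benchmark(fn, runs=20):
    """Return (first_call_ms, avg_ms_of_subsequent_runs) for fn()."""
    # Time the first call separately: it includes warm-up cost
    start = time.perf_counter()
    fn()
    first_ms = (time.perf_counter() - start) * 1000

    # Average the steady-state calls
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return first_ms, sum(timings) / len(timings)
```

Timing the pipeline call (e.g. `lambda: pipe("fix: teh text")`) this way separates the slow first inference from the steady-state average that the 10ms target refers to.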
---

## 2. Australian Spelling Implementation (CRITICAL)

### Research Findings

**Source:** `/tmp/au-spelling-research/`
**Articles:**

- "Spelling Differences Between American and Australian English" (getproofed.com.au)
- "4 Reasons Australian English is Unique" (unitedlanguagegroup.com)

### AU Spelling Rules

**Pattern 1: -our vs -or**

```text
"-or" → "-our"
Examples: color→colour, favor→favour, behavior→behaviour, neighbor→neighbour
Exception: "Labor Party" keeps -or
```

**Pattern 2: -tre vs -ter**

```text
"-ter" → "-tre" (French-origin words)
Examples: center→centre, theater→theatre, meter→metre
```

**Pattern 3: -ise vs -ize**

```text
"-ize" → "-ise" (most common in AU)
Examples: authorize→authorise, plagiarize→plagiarise, organize→organise
Note: Both are acceptable, but -ise is standard
```

**Pattern 4: -c vs -s (practice/practise)**

```text
Noun: "practice" (with c)
Verb: "practise" (with s)
US uses "practice" for both
```
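Pattern 4 is the one rule that pure string substitution cannot decide, because it depends on part of speech. If it is implemented at all, a crude context heuristic might look like this (hypothetical; `fix_practice` is a name introduced here, and a real implementation would want actual POS tagging, e.g. via spaCy):

```python
import re

# Naive heuristic: "practice" directly after "to" or a pronoun subject is
# usually the verb ("to practise", "they practise"); otherwise keep the noun.
VERB_CONTEXT = re.compile(r'\b(to|I|you|we|they)\s+practice\b', re.IGNORECASE)


def fix_practice(text: str) -> str:
    # Replace the trailing "ce" of the matched phrase with "se"
    return VERB_CONTEXT.sub(lambda m: m.group(0)[:-2] + 'se', text)
```

This misses many cases (e.g. "she practices daily"), which is why a POS tagger is the safer route if Pattern 4 matters.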

**Pattern 5: -oe/-ae vs -e**

```text
Mixed usage in AU (more relaxed than UK)
manoeuvre (AU/UK) vs maneuver (US)
encyclopedia (AU/US) vs encyclopaedia (UK)
```

**Pattern 6: Double consonants**

```text
"-ed"/"-ing" → double consonant
Examples: traveled→travelled, modeling→modelling
Exception: "program" preferred over "programme"
```

**Pattern 7: Unique words**

```text
aluminum → aluminium
tire → tyre
```
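One caveat before implementing: applied as bare regexes, the generic patterns over-match. For example, a naive Pattern 3 rewrite also hits words where "ize" is not a suffix:

```python
import re

# Applying Pattern 3 as a bare regex — note the three false positives
naive = re.sub(r'\b(\w+)ize\b', r'\1ise', "Resize the prize to full size")
print(naive)  # Resise the prise to full sise
```

So any implementation of the -ize rule needs an exception set (size, prize, seize, ...) and the -or/-our and doubling rules are safest as explicit word lists.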
### Implementation

**Create new file:** `src/au_spelling.py`

```python
"""Australian English spelling conversion module"""
import re

# Pattern 1: -or → -our. Word-list driven to avoid false positives
# such as "doctor", "mirror", "error".
OUR_WORDS = ['color', 'favor', 'honor', 'labor', 'neighbor', 'behavior']

# Pattern 2: -ter → -tre (French-origin words)
TRE_PATTERN = (r'\b(cen|thea|me)ter\b', r'\1tre')

# Pattern 3: -ize → -ise, except words where "ize" is not a suffix
IZE_EXCEPTIONS = {'size', 'prize', 'seize', 'capsize', 'resize'}

# Pattern 6: final consonant doubles before -ed/-ing. Word-list driven:
# the real rule depends on syllable stress, which a regex cannot see
# (e.g. "sailed" must not become "sailled").
DOUBLE_FINAL_WORDS = ['travel', 'model', 'cancel', 'label', 'fuel', 'signal']

# Pattern 7: direct word replacements (unique words)
AU_SPELLING_WORDS = {
    'aluminum': 'aluminium',
    'tire': 'tyre',
    'tires': 'tyres',
    'gray': 'grey',
}

# Words/phrases that should NOT be converted
AU_SPELLING_WHITELIST = [
    'labor party',  # Political party name keeps -or
    'program',      # Computer program (AU uses US spelling)
    'inquiry',      # AU prefers "inquiry" over "enquiry"
]


def _match_case(replacement: str, original: str) -> str:
    """Copy the casing of `original` onto `replacement`."""
    if original.isupper():
        return replacement.upper()
    if original[:1].isupper():
        return replacement.capitalize()
    return replacement


def convert_to_au_spelling(text: str, custom_whitelist: list = None) -> str:
    """
    Convert American English text to Australian English spelling.

    Whitelisted phrases are protected in place; the rest of the text
    is still converted.

    Args:
        text: Input text in American English
        custom_whitelist: Additional words/phrases to protect from conversion

    Returns:
        Text converted to Australian English spelling
    """
    if not text:
        return text

    # Combine whitelists
    whitelist = AU_SPELLING_WHITELIST.copy()
    if custom_whitelist:
        whitelist.extend(custom_whitelist)

    # Mask whitelisted phrases (case-insensitive) so the conversions
    # below cannot touch them
    masked = {}
    result = text
    for i, protected in enumerate(whitelist):
        token = f'\x00{i}\x00'

        def _mask(m, t=token):
            masked.setdefault(t, m.group(0))
            return t

        result = re.sub(re.escape(protected), _mask, result, flags=re.IGNORECASE)

    # Pattern 7: direct word replacements (case-preserving)
    for us_word, au_word in AU_SPELLING_WORDS.items():
        result = re.sub(r'\b' + us_word + r'\b',
                        lambda m, au=au_word: _match_case(au, m.group(0)),
                        result, flags=re.IGNORECASE)

    # Pattern 1: -or → -our
    for word in OUR_WORDS:
        result = re.sub(r'\b' + word + r'\b',
                        lambda m, au=word[:-2] + 'our': _match_case(au, m.group(0)),
                        result, flags=re.IGNORECASE)

    # Pattern 2: -ter → -tre
    result = re.sub(TRE_PATTERN[0], TRE_PATTERN[1], result, flags=re.IGNORECASE)

    # Pattern 3: -ize → -ise (skipping exception words)
    def _ise(m):
        word = m.group(0)
        if word.lower() in IZE_EXCEPTIONS:
            return word
        return word[:-3] + _match_case('ise', word[-3:])

    result = re.sub(r'\b\w+ize\b', _ise, result, flags=re.IGNORECASE)

    # Pattern 6: double the final consonant before -ed/-ing
    for stem in DOUBLE_FINAL_WORDS:
        result = re.sub(r'\b' + stem + r'(ed|ing)\b',
                        lambda m, s=stem: _match_case(s + s[-1] + m.group(1), m.group(0)),
                        result, flags=re.IGNORECASE)

    # Restore whitelisted phrases
    for token, original in masked.items():
        result = result.replace(token, original)

    return result
```

**Update main.py:**

```python
from config import AU_SPELLING
from au_spelling import convert_to_au_spelling


def on_hotkey():
    text = pyperclip.paste()
    result = polish(model, tokenizer, text)

    # Apply AU spelling if enabled
    if AU_SPELLING:
        result = convert_to_au_spelling(result)

    pyperclip.copy(result)
```

---

## 3. Config Features Implementation (HIGH)

### AGGRESSION Levels

**Implementation in main.py:**

```python
def on_hotkey():
    text = pyperclip.paste()

    # Skip processing if text is too short
    if len(text) < MIN_LENGTH:
        logging.info(f"Text too short ({len(text)} < {MIN_LENGTH}), skipping")
        return

    # Check custom dictionary for protected words
    if CUSTOM_DICTIONARY:
        has_protected = any(word.lower() in text.lower() for word in CUSTOM_DICTIONARY)
        if has_protected and AGGRESSION == "minimal":
            logging.info("Protected word detected in minimal mode, reducing corrections")
            # Could adjust max_length or temperature here

    result = polish(model, tokenizer, text)

    # Apply AU spelling
    if AU_SPELLING:
        whitelist = CUSTOM_DICTIONARY if AGGRESSION in ["minimal", "custom"] else []
        result = convert_to_au_spelling(result, whitelist)

    pyperclip.copy(result)

    # Log diff if enabled
    if LOGGING and text != result:
        diff = log_diff(text, result)
        logging.info(f"Changes:\n{diff}")
```
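`log_diff` is referenced above but not defined anywhere in this plan; a minimal version using only the standard library could be (a sketch, assuming a unified diff is an acceptable log format):

```python
import difflib


def log_diff(original: str, corrected: str) -> str:
    """Return a unified diff between the original and corrected text."""
    diff = difflib.unified_diff(
        original.splitlines(),
        corrected.splitlines(),
        fromfile="original",
        tofile="corrected",
        lineterm="",
    )
    return "\n".join(diff)
```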

### CUSTOM_DICTIONARY

Already implemented above - words in CUSTOM_DICTIONARY are:

1. Protected from AU spelling conversion
2. Used to adjust correction aggression

### MIN_LENGTH

Already implemented above - text shorter than MIN_LENGTH skips processing.
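For reference, the settings used throughout this section could live together in a single `config.py`. The values below are illustrative defaults, not decisions made in this plan:

```python
# config.py — settings referenced by main.py (illustrative defaults)

AU_SPELLING = True        # Convert output to Australian spelling
AGGRESSION = "normal"     # "minimal" | "normal" | "custom"
MIN_LENGTH = 10           # Skip clipboard text shorter than this
CUSTOM_DICTIONARY = [     # Words protected from correction/conversion
    "Gitea",
]
LOGGING = True            # Log a diff of every change
```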

---

## 4. Service Testing (MEDIUM)

**Current service file:** `service/clipboard-polisher.service`

- ✅ User set to `bob`
- ✅ Uses venv python path
- ⚠️ Not tested

**Testing steps:**

```bash
# Copy service file
sudo cp service/clipboard-polisher.service /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Start service
sudo systemctl start clipboard-polisher

# Check status
sudo systemctl status clipboard-polisher

# View logs
journalctl -u clipboard-polisher -f

# Enable on boot (optional)
sudo systemctl enable clipboard-polisher
```

**Note:** Hotkey functionality requires X11/Wayland access. The service may need the `DISPLAY` environment variable set.
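If the hotkey listener cannot reach the display under systemd, a drop-in override is one way to supply the environment (a sketch; the exact `DISPLAY` and `XAUTHORITY` values depend on the login session):

```ini
# Drop-in: /etc/systemd/system/clipboard-polisher.service.d/override.conf
[Service]
Environment=DISPLAY=:0
Environment=XAUTHORITY=/home/bob/.Xauthority
```

After adding the drop-in, run `sudo systemctl daemon-reload` and restart the service to pick it up.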

---

## 5. Testing Plan

### Test 1: Performance (Re-run after ONNX)

```bash
python test_performance.py
```

**Target:** <20ms average inference, <20s load time
### Test 2: AU Spelling

```bash
python -c "
from src.au_spelling import convert_to_au_spelling
tests = [
    ('I cant beleive its color', 'I cant beleive its colour'),
    ('The theater center', 'The theatre centre'),
    ('Authorize the program', 'Authorise the program'),
]
for input_text, expected in tests:
    result = convert_to_au_spelling(input_text)
    assert result == expected, f'Failed: {result} != {expected}'
print('All AU spelling tests passed!')
"
```

### Test 3: Integration

Create `test_integration.py`:

```python
#!/usr/bin/env python3
import sys
sys.path.insert(0, '/MASTERFOLDER/Tools/text-polish/src')

from model_loader import load_model, polish
from au_spelling import convert_to_au_spelling

model, tokenizer = load_model()

test_cases = [
    "teh color was realy nice",  # Should become "the colour was really nice"
    "I need to organize the theater",  # Should become "I need to organise the theatre"
]

for test in test_cases:
    result = polish(model, tokenizer, test)
    result_au = convert_to_au_spelling(result)
    print(f"Input: {test}")
    print(f"Polish: {result}")
    print(f"AU: {result_au}")
    print()
```

---

## 6. Priority Task List

### Week 1: Performance

1. Install optimum library
2. Export and quantize model
3. Update model_loader.py
4. Run performance tests
5. Document results

### Week 2: AU Spelling

1. Create au_spelling.py with all patterns
2. Write unit tests for each pattern
3. Integrate into main.py
4. Test with real examples
5. Update documentation

### Week 3: Config Features

1. Implement AGGRESSION logic
2. Implement MIN_LENGTH check
3. Integrate CUSTOM_DICTIONARY
4. Add logging for all changes
5. Test all combinations

### Week 4: Deployment

1. Test systemd service
2. Fix any environment issues
3. Test hotkey functionality
4. Add monitoring/logging
5. Documentation

---
## 7. Success Metrics

**Performance:**

- [ ] Model load < 20s (intermediate target; final target 2s)
- [ ] Average inference < 20ms (intermediate; final 10ms)
- [ ] Memory < 300MB

**Functionality:**

- [ ] AU spelling conversions working (all 7 patterns)
- [ ] AGGRESSION levels functional
- [ ] CUSTOM_DICTIONARY protects words
- [ ] MIN_LENGTH filter works
- [ ] Logging shows diffs

**Deployment:**

- [ ] Service starts successfully
- [ ] Hotkey works in service mode
- [ ] 24/7 uptime capable
- [ ] Robust error handling

---
## Research Sources

1. **ONNX Optimization:**
   - Article: "Blazing Fast Inference with Quantized ONNX Models"
   - Author: Tarun Gudipati
   - URL: https://codezen.medium.com/blazing-fast-inference-with-quantized-onnx-models-518f23777741
   - Key: 5x speed, 2.2x memory reduction

2. **AU Spelling:**
   - Article 1: "Spelling Differences Between American and Australian English"
   - Source: getproofed.com.au
   - Article 2: "4 Reasons Australian English is Unique"
   - Source: unitedlanguagegroup.com
   - Key: 7 main spelling patterns identified

3. **Custom Dictionaries:**
   - Article: "Autocorrect Feature using NLP in Python"
   - Source: analyticsvidhya.com
   - Key: Whitelist implementation patterns