fss-polish/IMPLEMENTATION_PLAN.md
FSSCoding 9316bc50f1 Initial commit: FSS-Polish v1.0.0
Complete implementation of Fast Spelling and Style Polish tool with:
- Australian English spelling conversion (7 patterns + case preservation)
- CLI support with text input or clipboard mode
- Daemon mode with configurable hotkey
- MIN_LENGTH, AGGRESSION, and CUSTOM_DICTIONARY config options
- Comprehensive diff logging
- 12 passing tests (100% test coverage for AU spelling)
- Wheel package built and ready for deployment
- Agent-friendly CLI with stdin/stdout support

Features:
- Text correction using t5-small-spoken-typo model
- Australian/American spelling conversion
- Configurable correction aggression levels
- Custom dictionary whitelist support
- Background daemon with hotkey trigger
- CLI tool for direct text polishing
- Preserves clipboard history (adds new item vs replace)

Ready for deployment to /opt and Gitea repository.
2025-10-25 23:59:34 +11:00


# Text-Polish Implementation Plan
**Based on Blueprint Gap Analysis and Web Research**
**Generated:** 2025-10-25
---
## Executive Summary
**Current Status:**
- ✅ Core MVP works: hotkey → clipboard → model → clipboard
- ❌ Performance below targets: 82s load (vs 2s), 63ms inference (vs 10ms)
- ❌ AU spelling not implemented (Phase 1 requirement)
- ❌ Config features are stubs
**Priority Order:**
1. **CRITICAL**: Model optimization (ONNX + quantization)
2. **CRITICAL**: AU spelling implementation
3. **HIGH**: Config features (AGGRESSION, CUSTOM_DICTIONARY, MIN_LENGTH)
4. **MEDIUM**: Service testing and deployment
---
## 1. Model Optimization (CRITICAL)
### Research Findings
**Source:** `/tmp/model-optimization-research/`
**Article:** "Blazing Fast Inference with Quantized ONNX Models" by Tarun Gudipati
**Performance Gains:**
- **5x faster inference** (0.5s → 0.1s in article example)
- **2.2x less memory** (11MB → 4.9MB in article example)
- Expected results for text-polish:
  - Load time: 82s → ~16s (target: <2s, still needs work)
  - Inference: 63ms → ~12ms (target: <10ms, close!)
  - First inference: 284ms → ~57ms
### Implementation Steps
**Step 1: Install optimum library**
```bash
cd /MASTERFOLDER/Tools/text-polish
source venv/bin/activate
pip install optimum[onnxruntime]
```
**Step 2: Export model to ONNX**
```bash
optimum-cli export onnx \
  --model willwade/t5-small-spoken-typo \
  --optimize O3 \
  --task text2text-generation \
  t5_onnx
```
**Step 3: Quantize the model**
```bash
optimum-cli onnxruntime quantize \
  --onnx_model t5_onnx \
  --output t5_onnx_quantized
```
Depending on the installed `optimum` version, this command may also require a target instruction-set flag (for example `--avx2` or `--avx512`); check `optimum-cli onnxruntime quantize --help`.
**Step 4: Update model_loader.py**
Replace the PyTorch loading code with ONNX:
```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline


def load_model(model_path="t5_onnx_quantized"):
    tokenizer = AutoTokenizer.from_pretrained("willwade/t5-small-spoken-typo")
    model = ORTModelForSeq2SeqLM.from_pretrained(model_path)
    pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
    return pipe, tokenizer
```
**Step 5: Re-run performance test**
```bash
python test_performance.py
```
**Expected Results:**
- Load time: ~16s (an improvement, but still far from the 2s target; may need caching strategies)
- Inference: ~12ms average (close to the 10ms target)
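The contents of `test_performance.py` are not reproduced in this plan; a generic stdlib harness along these lines (function name and inputs are illustrative) can time any polish callable before and after the ONNX switch:

```python
import statistics
import time


def benchmark(fn, inputs, warmup=2):
    """Time fn over inputs and report latency stats in milliseconds."""
    for text in inputs[:warmup]:
        fn(text)  # warm-up calls: the first inference is much slower
    timings_ms = []
    for text in inputs:
        start = time.perf_counter()
        fn(text)
        timings_ms.append((time.perf_counter() - start) * 1000)
    return {
        'mean_ms': statistics.mean(timings_ms),
        'median_ms': statistics.median(timings_ms),
        'max_ms': max(timings_ms),
    }


# Stand-in callable while exercising the harness itself;
# swap in the real pipeline once the ONNX model is loaded.
stats = benchmark(lambda s: s.upper(), ['teh color was realy nice'] * 20)
print(sorted(stats))
```

Reporting the max as well as the mean makes cold-start regressions visible, which matters here because the 284ms first-inference number is an explicit target.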
---
## 2. Australian Spelling Implementation (CRITICAL)
### Research Findings
**Source:** `/tmp/au-spelling-research/`
**Articles:**
- "Spelling Differences Between American and Australian English" (getproofed.com.au)
- "4 Reasons Australian English is Unique" (unitedlanguagegroup.com)
### AU Spelling Rules
**Pattern 1: -our vs -or**
```
"-or" → "-our"
Examples: color → colour, favor → favour, behavior → behaviour, neighbor → neighbour
Exception: "Labor Party" keeps -or
```
**Pattern 2: -tre vs -ter**
```
"-ter" → "-tre" (French-origin words)
Examples: center → centre, theater → theatre, meter → metre
```
**Pattern 3: -ise vs -ize**
```
"-ize" → "-ise" (most common in AU)
Examples: authorize → authorise, plagiarize → plagiarise, organize → organise
Note: Both are acceptable, but -ise is standard
```
**Pattern 4: -c vs -s (practice/practise)**
```
Noun: "practice" (with c)
Verb: "practise" (with s)
US uses "practice" for both
```
**Pattern 5: -oe/-ae vs -e**
```
Mixed usage in AU (more relaxed than UK)
manoeuvre (AU/UK) vs maneuver (US)
encyclopedia (AU/US) vs encyclopaedia (UK)
```
**Pattern 6: Double consonants**
```
"-ed"/"-ing" → double consonant
Examples: traveled → travelled, modeling → modelling
Exception: "program" preferred over "programme"
```
**Pattern 7: Unique words**
```
aluminum → aluminium
tire → tyre
```
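Note that Pattern 3's blanket rule overreaches without an exception list; a quick check with a regex mirroring the rule (illustrative only) shows why words like "size" and "prize" must be excluded:

```python
import re


def naive_ise(text: str) -> str:
    """Blanket -ize → -ise conversion, with no exception list."""
    return re.sub(r'\b(\w+)ize\b', r'\1ise', text, flags=re.IGNORECASE)


print(naive_ise('organize the prize size'))
# → 'organise the prise sise' ("prize" and "size" are wrongly converted)
```

A small exception set (e.g. size, prize, seize, capsize) should therefore be added before this rule ships.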
### Implementation
**Create new file:** `src/au_spelling.py`
```python
"""Australian English spelling conversion module"""
import re

# Pattern-based replacements
AU_SPELLING_PATTERNS = [
    # -or → -our, restricted to a word list so words like "for",
    # "motor" and "doctor" are left alone
    (r'\b(\w+)or\b', r'\1our',
     ['color', 'favor', 'honor', 'labor', 'neighbor', 'behavior']),
    # -ter → -tre (French-origin words)
    (r'\b(cen|thea|me)ter\b', r'\1tre'),
    # -ize → -ise (known limitation: also matches "size", "prize";
    # an exception list is needed before shipping)
    (r'\b(\w+)ize\b', r'\1ise'),
    # Double consonants for -ed/-ing
    (r'\b(\w+[aeiou])([lnrt])ed\b', r'\1\2\2ed'),
    (r'\b(\w+[aeiou])([lnrt])ing\b', r'\1\2\2ing'),
]

# Direct word replacements (unique words)
AU_SPELLING_WORDS = {
    'aluminum': 'aluminium',
    'tire': 'tyre',
    'tires': 'tyres',
    'gray': 'grey',
}

# Words/phrases that should NOT be converted
AU_SPELLING_WHITELIST = [
    'labor party',  # Political party name keeps -or
    'program',      # Computer program (AU uses US spelling)
    'inquiry',      # AU prefers "inquiry" over "enquiry"
]


def convert_to_au_spelling(text: str, custom_whitelist: list = None) -> str:
    """
    Convert American English text to Australian English spelling.

    Args:
        text: Input text in American English
        custom_whitelist: Additional words/phrases to protect from conversion

    Returns:
        Text converted to Australian English spelling
    """
    if not text:
        return text

    # Combine whitelists
    whitelist = AU_SPELLING_WHITELIST.copy()
    if custom_whitelist:
        whitelist.extend(custom_whitelist)

    # Mask whitelisted phrases with placeholder tokens so the rest of
    # the text is still converted (case-insensitive match; the first
    # occurrence's casing is restored afterwards)
    result = text
    masked = {}
    for i, protected in enumerate(whitelist):
        pattern = re.compile(re.escape(protected), re.IGNORECASE)
        match = pattern.search(result)
        if match:
            token = f'\x00{i}\x00'
            masked[token] = match.group(0)
            result = pattern.sub(token, result)

    # Apply direct word replacements
    for us_word, au_word in AU_SPELLING_WORDS.items():
        result = re.sub(r'\b' + us_word + r'\b', au_word, result,
                        flags=re.IGNORECASE)

    # Apply pattern-based replacements
    for pattern in AU_SPELLING_PATTERNS:
        if len(pattern) == 3:
            # Pattern restricted to an explicit word list
            _, _, word_list = pattern
            for word in word_list:
                result = re.sub(r'\b' + word + r'\b',
                                word.replace('or', 'our'),
                                result, flags=re.IGNORECASE)
        else:
            # Simple pattern
            regex, replacement = pattern
            result = re.sub(regex, replacement, result, flags=re.IGNORECASE)

    # Restore whitelisted phrases
    for token, original in masked.items():
        result = result.replace(token, original)

    return result
```
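The `re.sub` calls above replace with lowercase text ("Color" would become "colour"), but the commit notes list case preservation as a requirement. A small case-preserving helper (hypothetical, not yet in the codebase) could wrap each substitution:

```python
import re


def preserve_case(source: str, replacement: str) -> str:
    """Copy the casing of `source` onto `replacement`."""
    if source.isupper():
        return replacement.upper()
    if source[:1].isupper():
        return replacement.capitalize()
    return replacement


def replace_word(text: str, us_word: str, au_word: str) -> str:
    # Case-insensitive match, case-preserving substitution
    return re.sub(r'\b' + re.escape(us_word) + r'\b',
                  lambda m: preserve_case(m.group(0), au_word),
                  text, flags=re.IGNORECASE)


print(replace_word('Tire and TIRE and tire', 'tire', 'tyre'))
# → 'Tyre and TYRE and tyre'
```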
**Update main.py:**
```python
import pyperclip

from config import AU_SPELLING
from au_spelling import convert_to_au_spelling

# model, tokenizer and polish() are assumed to be in scope from startup


def on_hotkey():
    text = pyperclip.paste()
    result = polish(model, tokenizer, text)

    # Apply AU spelling if enabled
    if AU_SPELLING:
        result = convert_to_au_spelling(result)

    pyperclip.copy(result)
```
---
## 3. Config Features Implementation (HIGH)
### AGGRESSION Levels
**Implementation in main.py:**
```python
def on_hotkey():
    text = pyperclip.paste()

    # Skip processing if text is too short
    if len(text) < MIN_LENGTH:
        logging.info(f"Text too short ({len(text)} < {MIN_LENGTH}), skipping")
        return

    # Check custom dictionary for protected words
    if CUSTOM_DICTIONARY:
        has_protected = any(word.lower() in text.lower() for word in CUSTOM_DICTIONARY)
        if has_protected and AGGRESSION == "minimal":
            logging.info("Protected word detected in minimal mode, reducing corrections")
            # Could adjust max_length or temperature here

    result = polish(model, tokenizer, text)

    # Apply AU spelling
    if AU_SPELLING:
        whitelist = CUSTOM_DICTIONARY if AGGRESSION in ["minimal", "custom"] else []
        result = convert_to_au_spelling(result, whitelist)

    pyperclip.copy(result)

    # Log diff if enabled
    if LOGGING and text != result:
        diff = log_diff(text, result)
        logging.info(f"Changes:\n{diff}")
```
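`log_diff` is called above but not defined anywhere in this plan; a minimal stdlib sketch (name and signature assumed from the call site) could be:

```python
import difflib


def log_diff(before: str, after: str) -> str:
    """Return a unified diff of the original and polished text."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile='original', tofile='polished', lineterm='')
    return '\n'.join(diff)


print(log_diff('the color was nice', 'the colour was nice'))
```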
### CUSTOM_DICTIONARY
Already implemented above. Words in `CUSTOM_DICTIONARY` are:
1. Protected from AU spelling conversion
2. Used to adjust correction aggression
### MIN_LENGTH
Already implemented above. Text shorter than `MIN_LENGTH` skips processing.
---
## 4. Service Testing (MEDIUM)
**Current service file:** `service/clipboard-polisher.service`
- User set to `bob`
- Uses venv python path
- Not tested
**Testing steps:**
```bash
# Copy service file
sudo cp service/clipboard-polisher.service /etc/systemd/system/
# Reload systemd
sudo systemctl daemon-reload
# Start service
sudo systemctl start clipboard-polisher
# Check status
sudo systemctl status clipboard-polisher
# View logs
journalctl -u clipboard-polisher -f
# Enable on boot (optional)
sudo systemctl enable clipboard-polisher
```
**Note:** Hotkey functionality requires X11/Wayland access. Service may need `DISPLAY` environment variable.
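The unit file itself is not reproduced in this plan; the `DISPLAY` fix would look something like the stanza below (user and paths assumed from this plan, values are environment-specific):

```ini
[Service]
User=bob
Environment=DISPLAY=:0
Environment=XAUTHORITY=/home/bob/.Xauthority
WorkingDirectory=/MASTERFOLDER/Tools/text-polish
ExecStart=/MASTERFOLDER/Tools/text-polish/venv/bin/python src/main.py
Restart=on-failure
```

On Wayland, global key grabs are generally restricted, so hotkey capture may need a compositor-level shortcut that invokes the CLI instead.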
---
## 5. Testing Plan
### Test 1: Performance (Re-run after ONNX)
```bash
python test_performance.py
```
**Target:** <20ms average inference, <20s load time
### Test 2: AU Spelling
```bash
python -c "
from src.au_spelling import convert_to_au_spelling

tests = [
    ('I cant beleive its color', 'I cant beleive its colour'),
    ('The theater center', 'The theatre centre'),
    ('Authorize the program', 'Authorise the program'),
]
for input_text, expected in tests:
    result = convert_to_au_spelling(input_text)
    assert result == expected, f'Failed: {result} != {expected}'
print('All AU spelling tests passed!')
"
```
### Test 3: Integration
Create `test_integration.py`:
```python
#!/usr/bin/env python3
import sys
sys.path.insert(0, '/MASTERFOLDER/Tools/text-polish/src')

from model_loader import load_model, polish
from au_spelling import convert_to_au_spelling

model, tokenizer = load_model()

test_cases = [
    "teh color was realy nice",        # Should become "the colour was really nice"
    "I need to organize the theater",  # Should become "I need to organise the theatre"
]

for test in test_cases:
    result = polish(model, tokenizer, test)
    result_au = convert_to_au_spelling(result)
    print(f"Input:  {test}")
    print(f"Polish: {result}")
    print(f"AU:     {result_au}")
    print()
```
---
## 6. Priority Task List
### Week 1: Performance
1. Install optimum library
2. Export and quantize model
3. Update model_loader.py
4. Run performance tests
5. Document results
### Week 2: AU Spelling
1. Create au_spelling.py with all patterns
2. Write unit tests for each pattern
3. Integrate into main.py
4. Test with real examples
5. Update documentation
### Week 3: Config Features
1. Implement AGGRESSION logic
2. Implement MIN_LENGTH check
3. Integrate CUSTOM_DICTIONARY
4. Add logging for all changes
5. Test all combinations
### Week 4: Deployment
1. Test systemd service
2. Fix any environment issues
3. Test hotkey functionality
4. Add monitoring/logging
5. Documentation
---
## 7. Success Metrics
**Performance:**
- [ ] Model load < 20s (intermediate target, final target 2s)
- [ ] Average inference < 20ms (intermediate, final 10ms)
- [ ] Memory < 300MB
**Functionality:**
- [ ] AU spelling conversions working (all 7 patterns)
- [ ] AGGRESSION levels functional
- [ ] CUSTOM_DICTIONARY protects words
- [ ] MIN_LENGTH filter works
- [ ] Logging shows diffs
**Deployment:**
- [ ] Service starts successfully
- [ ] Hotkey works in service mode
- [ ] 24/7 uptime capable
- [ ] Error handling robust
---
## Research Sources
1. **ONNX Optimization:**
- Article: "Blazing Fast Inference with Quantized ONNX Models"
- Author: Tarun Gudipati
- URL: https://codezen.medium.com/blazing-fast-inference-with-quantized-onnx-models-518f23777741
- Key: 5x speed, 2.2x memory reduction
2. **AU Spelling:**
- Article 1: "Spelling Differences Between American and Australian English"
- Source: getproofed.com.au
- Article 2: "4 Reasons Australian English is Unique"
- Source: unitedlanguagegroup.com
- Key: 7 main spelling patterns identified
3. **Custom Dictionaries:**
- Article: "Autocorrect Feature using NLP in Python"
- Source: analyticsvidhya.com
- Key: Whitelist implementation patterns