# Text-Polish Implementation Plan

**Based on Blueprint Gap Analysis and Web Research**

**Generated:** 2025-10-25

---

## Executive Summary

**Current Status:**

- ✅ Core MVP works: hotkey → clipboard → model → clipboard
- ❌ Performance below targets: 82s load (vs 2s), 63ms inference (vs 10ms)
- ❌ AU spelling not implemented (Phase 1 requirement)
- ❌ Config features are stubs

**Priority Order:**

1. **CRITICAL**: Model optimization (ONNX + quantization)
2. **CRITICAL**: AU spelling implementation
3. **HIGH**: Config features (AGGRESSION, CUSTOM_DICTIONARY, MIN_LENGTH)
4. **MEDIUM**: Service testing and deployment

---
## 1. Model Optimization (CRITICAL)

### Research Findings

**Source:** `/tmp/model-optimization-research/`
**Article:** "Blazing Fast Inference with Quantized ONNX Models" by Tarun Gudipati

**Performance Gains:**

- **5x faster inference** (0.5s → 0.1s in the article's example)
- **2.2x less memory** (11MB → 4.9MB in the article's example)
- Expected results for text-polish:
  - Load time: 82s → ~16s (target: <2s, still needs work)
  - Inference: 63ms → ~12ms (target: <10ms, close!)
  - First inference: 284ms → ~57ms

### Implementation Steps

**Step 1: Install the optimum library**

```bash
cd /MASTERFOLDER/Tools/text-polish
source venv/bin/activate
pip install "optimum[onnxruntime]"  # quoted so the brackets survive shell globbing
```

**Step 2: Export the model to ONNX**

```bash
optimum-cli export onnx \
  --model willwade/t5-small-spoken-typo \
  --optimize O3 \
  --task text2text-generation \
  t5_onnx
```

**Step 3: Quantize the model**

```bash
# Note: the quantize command may also require a target-architecture
# flag (e.g. --avx2) depending on the optimum version
optimum-cli onnxruntime quantize \
  --onnx_model t5_onnx \
  --output t5_onnx_quantized
```

**Step 4: Update model_loader.py**

Replace the PyTorch loading with ONNX:

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline


def load_model(model_path="t5_onnx_quantized"):
    tokenizer = AutoTokenizer.from_pretrained("willwade/t5-small-spoken-typo")
    model = ORTModelForSeq2SeqLM.from_pretrained(model_path)
    pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
    return pipe, tokenizer
```

**Step 5: Re-run the performance test**

```bash
python test_performance.py
```

**Expected Results:**

- Load time: ~16s (an improvement, but still high; may need caching strategies)
- Inference: ~12ms average (close to the 10ms target!)

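For a repeatable before/after comparison, `test_performance.py` could be built around a small timing helper like the following (a sketch; `benchmark` and its call sites are illustrative, not the project's actual script):

```python
import time


def benchmark(fn, runs=20):
    """Return (first_call_ms, avg_ms_of_subsequent_runs) for fn()."""
    # Time the first call separately: it includes warm-up cost
    start = time.perf_counter()
    fn()
    first_ms = (time.perf_counter() - start) * 1000

    # Average the steady-state calls
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return first_ms, sum(timings) / len(timings)
```

Timing the pipeline call (e.g. `lambda: pipe("fix: teh text")`) this way separates the slow first inference from the steady-state average that the 10ms target refers to.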
---

## 2. Australian Spelling Implementation (CRITICAL)

### Research Findings

**Source:** `/tmp/au-spelling-research/`
**Articles:**

- "Spelling Differences Between American and Australian English" (getproofed.com.au)
- "4 Reasons Australian English is Unique" (unitedlanguagegroup.com)

### AU Spelling Rules

**Pattern 1: -our vs -or**

```text
"-or" → "-our"
Examples: color→colour, favor→favour, behavior→behaviour, neighbor→neighbour
Exception: "Labor Party" keeps -or
```

**Pattern 2: -tre vs -ter**

```text
"-ter" → "-tre" (French-origin words)
Examples: center→centre, theater→theatre, meter→metre
```

**Pattern 3: -ise vs -ize**

```text
"-ize" → "-ise" (most common in AU)
Examples: authorize→authorise, plagiarize→plagiarise, organize→organise
Note: Both are acceptable, but -ise is standard
```

**Pattern 4: -c vs -s (practice/practise)**

```text
Noun: "practice" (with c)
Verb: "practise" (with s)
US uses "practice" for both
```
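Pattern 4 is the one rule that pure string substitution cannot decide, because it depends on part of speech. If it is implemented at all, a crude context heuristic might look like this (hypothetical; `fix_practice` is a name introduced here, and a real implementation would want actual POS tagging, e.g. via spaCy):

```python
import re

# Naive heuristic: "practice" directly after "to" or a pronoun subject is
# usually the verb ("to practise", "they practise"); otherwise keep the noun.
VERB_CONTEXT = re.compile(r'\b(to|I|you|we|they)\s+practice\b', re.IGNORECASE)


def fix_practice(text: str) -> str:
    # Replace the trailing "ce" of the matched phrase with "se"
    return VERB_CONTEXT.sub(lambda m: m.group(0)[:-2] + 'se', text)
```

This misses many cases (e.g. "she practices daily"), which is why a POS tagger is the safer route if Pattern 4 matters.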

**Pattern 5: -oe/-ae vs -e**

```text
Mixed usage in AU (more relaxed than UK)
manoeuvre (AU/UK) vs maneuver (US)
encyclopedia (AU/US) vs encyclopaedia (UK)
```

**Pattern 6: Double consonants**

```text
"-ed"/"-ing" → double consonant
Examples: traveled→travelled, modeling→modelling
Exception: "program" preferred over "programme"
```

**Pattern 7: Unique words**

```text
aluminum → aluminium
tire → tyre
```
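One caveat before implementing: applied as bare regexes, the generic patterns over-match. For example, a naive Pattern 3 rewrite also hits words where "ize" is not a suffix:

```python
import re

# Applying Pattern 3 as a bare regex — note the three false positives
naive = re.sub(r'\b(\w+)ize\b', r'\1ise', "Resize the prize to full size")
print(naive)  # Resise the prise to full sise
```

So any implementation of the -ize rule needs an exception set (size, prize, seize, ...) and the -or/-our and doubling rules are safest as explicit word lists.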
### Implementation

**Create new file:** `src/au_spelling.py`

```python
"""Australian English spelling conversion module"""
import re

# Pattern 1: -or → -our. Word-list driven to avoid false positives
# such as "doctor", "mirror", "error".
OUR_WORDS = ['color', 'favor', 'honor', 'labor', 'neighbor', 'behavior']

# Pattern 2: -ter → -tre (French-origin words)
TRE_PATTERN = (r'\b(cen|thea|me)ter\b', r'\1tre')

# Pattern 3: -ize → -ise, except words where "ize" is not a suffix
IZE_EXCEPTIONS = {'size', 'prize', 'seize', 'capsize', 'resize'}

# Pattern 6: final consonant doubles before -ed/-ing. Word-list driven:
# the real rule depends on syllable stress, which a regex cannot see
# (e.g. "sailed" must not become "sailled").
DOUBLE_FINAL_WORDS = ['travel', 'model', 'cancel', 'label', 'fuel', 'signal']

# Pattern 7: direct word replacements (unique words)
AU_SPELLING_WORDS = {
    'aluminum': 'aluminium',
    'tire': 'tyre',
    'tires': 'tyres',
    'gray': 'grey',
}

# Words/phrases that should NOT be converted
AU_SPELLING_WHITELIST = [
    'labor party',  # Political party name keeps -or
    'program',      # Computer program (AU uses US spelling)
    'inquiry',      # AU prefers "inquiry" over "enquiry"
]


def _match_case(replacement: str, original: str) -> str:
    """Copy the casing of `original` onto `replacement`."""
    if original.isupper():
        return replacement.upper()
    if original[:1].isupper():
        return replacement.capitalize()
    return replacement


def convert_to_au_spelling(text: str, custom_whitelist: list = None) -> str:
    """
    Convert American English text to Australian English spelling.

    Whitelisted phrases are protected in place; the rest of the text
    is still converted.

    Args:
        text: Input text in American English
        custom_whitelist: Additional words/phrases to protect from conversion

    Returns:
        Text converted to Australian English spelling
    """
    if not text:
        return text

    # Combine whitelists
    whitelist = AU_SPELLING_WHITELIST.copy()
    if custom_whitelist:
        whitelist.extend(custom_whitelist)

    # Mask whitelisted phrases (case-insensitive) so the conversions
    # below cannot touch them
    masked = {}
    result = text
    for i, protected in enumerate(whitelist):
        token = f'\x00{i}\x00'

        def _mask(m, t=token):
            masked.setdefault(t, m.group(0))
            return t

        result = re.sub(re.escape(protected), _mask, result, flags=re.IGNORECASE)

    # Pattern 7: direct word replacements (case-preserving)
    for us_word, au_word in AU_SPELLING_WORDS.items():
        result = re.sub(r'\b' + us_word + r'\b',
                        lambda m, au=au_word: _match_case(au, m.group(0)),
                        result, flags=re.IGNORECASE)

    # Pattern 1: -or → -our
    for word in OUR_WORDS:
        result = re.sub(r'\b' + word + r'\b',
                        lambda m, au=word[:-2] + 'our': _match_case(au, m.group(0)),
                        result, flags=re.IGNORECASE)

    # Pattern 2: -ter → -tre
    result = re.sub(TRE_PATTERN[0], TRE_PATTERN[1], result, flags=re.IGNORECASE)

    # Pattern 3: -ize → -ise (skipping exception words)
    def _ise(m):
        word = m.group(0)
        if word.lower() in IZE_EXCEPTIONS:
            return word
        return word[:-3] + _match_case('ise', word[-3:])

    result = re.sub(r'\b\w+ize\b', _ise, result, flags=re.IGNORECASE)

    # Pattern 6: double the final consonant before -ed/-ing
    for stem in DOUBLE_FINAL_WORDS:
        result = re.sub(r'\b' + stem + r'(ed|ing)\b',
                        lambda m, s=stem: _match_case(s + s[-1] + m.group(1), m.group(0)),
                        result, flags=re.IGNORECASE)

    # Restore whitelisted phrases
    for token, original in masked.items():
        result = result.replace(token, original)

    return result
```

**Update main.py:**

```python
from config import AU_SPELLING
from au_spelling import convert_to_au_spelling


def on_hotkey():
    text = pyperclip.paste()
    result = polish(model, tokenizer, text)

    # Apply AU spelling if enabled
    if AU_SPELLING:
        result = convert_to_au_spelling(result)

    pyperclip.copy(result)
```

---

## 3. Config Features Implementation (HIGH)

### AGGRESSION Levels

**Implementation in main.py:**

```python
def on_hotkey():
    text = pyperclip.paste()

    # Skip processing if text is too short
    if len(text) < MIN_LENGTH:
        logging.info(f"Text too short ({len(text)} < {MIN_LENGTH}), skipping")
        return

    # Check custom dictionary for protected words
    if CUSTOM_DICTIONARY:
        has_protected = any(word.lower() in text.lower() for word in CUSTOM_DICTIONARY)
        if has_protected and AGGRESSION == "minimal":
            logging.info("Protected word detected in minimal mode, reducing corrections")
            # Could adjust max_length or temperature here

    result = polish(model, tokenizer, text)

    # Apply AU spelling
    if AU_SPELLING:
        whitelist = CUSTOM_DICTIONARY if AGGRESSION in ["minimal", "custom"] else []
        result = convert_to_au_spelling(result, whitelist)

    pyperclip.copy(result)

    # Log diff if enabled
    if LOGGING and text != result:
        diff = log_diff(text, result)
        logging.info(f"Changes:\n{diff}")
```
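`log_diff` is referenced above but not defined anywhere in this plan; a minimal version using only the standard library could be (a sketch, assuming a unified diff is an acceptable log format):

```python
import difflib


def log_diff(original: str, corrected: str) -> str:
    """Return a unified diff between the original and corrected text."""
    diff = difflib.unified_diff(
        original.splitlines(),
        corrected.splitlines(),
        fromfile="original",
        tofile="corrected",
        lineterm="",
    )
    return "\n".join(diff)
```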

### CUSTOM_DICTIONARY

Already implemented above - words in CUSTOM_DICTIONARY are:

1. Protected from AU spelling conversion
2. Used to adjust correction aggression

### MIN_LENGTH

Already implemented above - text shorter than MIN_LENGTH skips processing.
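For reference, the settings used throughout this section could live together in a single `config.py`. The values below are illustrative defaults, not decisions made in this plan:

```python
# config.py — settings referenced by main.py (illustrative defaults)

AU_SPELLING = True        # Convert output to Australian spelling
AGGRESSION = "normal"     # "minimal" | "normal" | "custom"
MIN_LENGTH = 10           # Skip clipboard text shorter than this
CUSTOM_DICTIONARY = [     # Words protected from correction/conversion
    "Gitea",
]
LOGGING = True            # Log a diff of every change
```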

---

## 4. Service Testing (MEDIUM)

**Current service file:** `service/clipboard-polisher.service`

- ✅ User set to `bob`
- ✅ Uses venv python path
- ⚠️ Not tested

**Testing steps:**

```bash
# Copy service file
sudo cp service/clipboard-polisher.service /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Start service
sudo systemctl start clipboard-polisher

# Check status
sudo systemctl status clipboard-polisher

# View logs
journalctl -u clipboard-polisher -f

# Enable on boot (optional)
sudo systemctl enable clipboard-polisher
```

**Note:** Hotkey functionality requires X11/Wayland access. The service may need the `DISPLAY` environment variable set.
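If the hotkey listener cannot reach the display under systemd, a drop-in override is one way to supply the environment (a sketch; the exact `DISPLAY` and `XAUTHORITY` values depend on the login session):

```ini
# Drop-in: /etc/systemd/system/clipboard-polisher.service.d/override.conf
[Service]
Environment=DISPLAY=:0
Environment=XAUTHORITY=/home/bob/.Xauthority
```

After adding the drop-in, run `sudo systemctl daemon-reload` and restart the service to pick it up.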

---

## 5. Testing Plan

### Test 1: Performance (Re-run after ONNX)

```bash
python test_performance.py
```

**Target:** <20ms average inference, <20s load time
### Test 2: AU Spelling

```bash
python -c "
from src.au_spelling import convert_to_au_spelling
tests = [
    ('I cant beleive its color', 'I cant beleive its colour'),
    ('The theater center', 'The theatre centre'),
    ('Authorize the program', 'Authorise the program'),
]
for input_text, expected in tests:
    result = convert_to_au_spelling(input_text)
    assert result == expected, f'Failed: {result} != {expected}'
print('All AU spelling tests passed!')
"
```

### Test 3: Integration

Create `test_integration.py`:

```python
#!/usr/bin/env python3
import sys
sys.path.insert(0, '/MASTERFOLDER/Tools/text-polish/src')

from model_loader import load_model, polish
from au_spelling import convert_to_au_spelling

model, tokenizer = load_model()

test_cases = [
    "teh color was realy nice",  # Should become "the colour was really nice"
    "I need to organize the theater",  # Should become "I need to organise the theatre"
]

for test in test_cases:
    result = polish(model, tokenizer, test)
    result_au = convert_to_au_spelling(result)
    print(f"Input: {test}")
    print(f"Polish: {result}")
    print(f"AU: {result_au}")
    print()
```

---

## 6. Priority Task List

### Week 1: Performance

1. Install optimum library
2. Export and quantize model
3. Update model_loader.py
4. Run performance tests
5. Document results

### Week 2: AU Spelling

1. Create au_spelling.py with all patterns
2. Write unit tests for each pattern
3. Integrate into main.py
4. Test with real examples
5. Update documentation

### Week 3: Config Features

1. Implement AGGRESSION logic
2. Implement MIN_LENGTH check
3. Integrate CUSTOM_DICTIONARY
4. Add logging for all changes
5. Test all combinations

### Week 4: Deployment

1. Test systemd service
2. Fix any environment issues
3. Test hotkey functionality
4. Add monitoring/logging
5. Documentation

---
## 7. Success Metrics

**Performance:**

- [ ] Model load < 20s (intermediate target; final target 2s)
- [ ] Average inference < 20ms (intermediate; final 10ms)
- [ ] Memory < 300MB

**Functionality:**

- [ ] AU spelling conversions working (all 7 patterns)
- [ ] AGGRESSION levels functional
- [ ] CUSTOM_DICTIONARY protects words
- [ ] MIN_LENGTH filter works
- [ ] Logging shows diffs

**Deployment:**

- [ ] Service starts successfully
- [ ] Hotkey works in service mode
- [ ] 24/7 uptime capable
- [ ] Robust error handling

---
## Research Sources

1. **ONNX Optimization:**
   - Article: "Blazing Fast Inference with Quantized ONNX Models"
   - Author: Tarun Gudipati
   - URL: https://codezen.medium.com/blazing-fast-inference-with-quantized-onnx-models-518f23777741
   - Key: 5x speed, 2.2x memory reduction

2. **AU Spelling:**
   - Article 1: "Spelling Differences Between American and Australian English"
   - Source: getproofed.com.au
   - Article 2: "4 Reasons Australian English is Unique"
   - Source: unitedlanguagegroup.com
   - Key: 7 main spelling patterns identified

3. **Custom Dictionaries:**
   - Article: "Autocorrect Feature using NLP in Python"
   - Source: analyticsvidhya.com
   - Key: Whitelist implementation patterns