Text-Polish Implementation Plan

Based on the Blueprint Gap Analysis and web research
Generated: 2025-10-25


Executive Summary

Current Status:

  • Core MVP works: hotkey → clipboard → model → clipboard
  • Performance below targets: 82s load (vs 2s), 63ms inference (vs 10ms)
  • AU spelling not implemented (Phase 1 requirement)
  • Config features are stubs

Priority Order:

  1. CRITICAL: Model optimization (ONNX + quantization)
  2. CRITICAL: AU spelling implementation
  3. HIGH: Config features (AGGRESSION, CUSTOM_DICTIONARY, MIN_LENGTH)
  4. MEDIUM: Service testing and deployment

1. Model Optimization (CRITICAL)

Research Findings

Source: /tmp/model-optimization-research/
Article: "Blazing Fast Inference with Quantized ONNX Models" by Tarun Gudipati

Performance Gains:

  • 5x faster inference (0.5s → 0.1s in article example)
  • 2.2x less memory (11MB → 4.9MB in article example)
  • Expected results for text-polish:
    • Load time: 82s → ~16s (target: <2s, still needs work)
    • Inference: 63ms → ~12ms (target: <10ms, close!)
    • First inference: 284ms → ~57ms

Implementation Steps

Step 1: Install optimum library

cd /MASTERFOLDER/Tools/text-polish
source venv/bin/activate
pip install optimum[onnxruntime]

Step 2: Export model to ONNX

optimum-cli export onnx \
  --model willwade/t5-small-spoken-typo \
  --optimize O3 \
  --task text2text-generation \
  t5_onnx

Step 3: Quantize the model

optimum-cli onnxruntime quantize \
  --onnx_model t5_onnx \
  --output t5_onnx_quantized

Step 4: Update model_loader.py
Replace the PyTorch loading code with ONNX:

from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline

def load_model(model_path="t5_onnx_quantized"):
    # Tokenizer still comes from the original checkpoint; only the model
    # weights are swapped for the quantized ONNX export
    tokenizer = AutoTokenizer.from_pretrained("willwade/t5-small-spoken-typo")
    model = ORTModelForSeq2SeqLM.from_pretrained(model_path)
    pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
    return pipe, tokenizer

Step 5: Re-run performance test

python test_performance.py

Expected Results:

  • Load time: ~16s (improvement but still high, may need caching strategies)
  • Inference: ~12ms average (close to 10ms target!)
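
Even at ~16s, the load is far too slow to pay per correction, so the daemon should load the pipeline exactly once and keep it resident. A minimal sketch of that caching strategy (the loader body here is a stand-in for the Step 4 load_model call):

```python
import functools

@functools.lru_cache(maxsize=1)
def get_pipeline():
    """Load the pipeline on first call; later calls return the cached object."""
    # Stand-in for the expensive load in Step 4; in the daemon this would be
    # load_model("t5_onnx_quantized")
    return {"pipe": object()}
```

The hotkey handler then calls get_pipeline() instead of load_model(), so the ~16s cost is paid once at daemon startup rather than on every correction.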

2. Australian Spelling Implementation (CRITICAL)

Research Findings

Source: /tmp/au-spelling-research/
Articles:

  • "Spelling Differences Between American and Australian English" (getproofed.com.au)
  • "4 Reasons Australian English is Unique" (unitedlanguagegroup.com)

AU Spelling Rules

Pattern 1: -our vs -or

"-or"  "-our"
Examples: colorcolour, favorfavour, behaviorbehaviour, neighborneighbour
Exception: "Labor Party" keeps -or

Pattern 2: -tre vs -ter

"-ter"  "-tre" (French origin words)
Examples: centercentre, theatertheatre, metermetre

Pattern 3: -ise vs -ize

"-ize"  "-ise" (most common in AU)
Examples: authorizeauthorise, plagiarizeplagiarise, organizeorganise
Note: Both are acceptable, but -ise is standard

Pattern 4: -c vs -s (practice/practise)

Noun: "practice" (with c)
Verb: "practise" (with s)
US uses "practice" for both
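
This noun/verb split needs grammatical context that plain find-and-replace cannot see, so it is best handled separately from the regex patterns. A rough heuristic sketch (the helper name and the trigger-word list are assumptions, not a real part-of-speech tagger):

```python
import re

def au_practice(text: str) -> str:
    """Heuristic: treat "practice" as a verb (→ "practise") when it follows
    "to" or a pronoun subject; leave the noun form alone."""
    return re.sub(r'\b(to|I|we|they|you)(\s+)practice\b', r'\1\2practise', text)
```

This will misfire on edge cases ("the key to practice is..."), so it should only be enabled once Week 2's unit tests cover it.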

Pattern 5: -oe/-ae vs -e

Mixed usage in AU (more relaxed than UK)
manoeuvre (AU/UK) vs maneuver (US)
encyclopedia (AU/US) vs encyclopaedia (UK)

Pattern 6: Double consonants

"-ed"/"-ing"  double consonant
Examples: traveledtravelled, modelingmodelling
Exception: "program" preferred over "programme"

Pattern 7: Unique words

aluminum → aluminium
tire → tyre

Implementation

Create new file: src/au_spelling.py

"""Australian English spelling conversion module"""
import re

# Pattern-based replacements
AU_SPELLING_PATTERNS = [
    # -or → -our (but not -ior, -oor)
    (r'\b(\w+)or\b', r'\1our', ['color', 'favor', 'honor', 'labor', 'neighbor', 'behavior']),

    # -ter → -tre (French words)
    (r'\b(cen|thea|me)ter\b', r'\1tre'),

    # -ize → -ise
    (r'\b(\w+)ize\b', r'\1ise'),

    # Double consonants for -ed/-ing
    (r'\b(\w+[aeiou])([lnrt])ed\b', r'\1\2\2ed'),
    (r'\b(\w+[aeiou])([lnrt])ing\b', r'\1\2\2ing'),
]

# Direct word replacements
AU_SPELLING_WORDS = {
    # Unique words
    'aluminum': 'aluminium',
    'tire': 'tyre',
    'tires': 'tyres',
    'gray': 'grey',

    # Exception: Labor Party keeps US spelling
    # (handled by whitelist)
}

# Words that should NOT be converted
AU_SPELLING_WHITELIST = [
    'labor party',  # Political party name
    'program',      # Computer program (AU uses US spelling)
    'inquiry',      # AU prefers "inquiry" over "enquiry"
]

def convert_to_au_spelling(text: str, custom_whitelist: list = None) -> str:
    """
    Convert American English text to Australian English spelling.

    Args:
        text: Input text in American English
        custom_whitelist: Additional words/phrases to protect from conversion

    Returns:
        Text converted to Australian English spelling
    """
    if not text:
        return text

    # Combine whitelists
    whitelist = AU_SPELLING_WHITELIST.copy()
    if custom_whitelist:
        whitelist.extend(custom_whitelist)

    # Check whitelist (case-insensitive)
    text_lower = text.lower()
    for protected in whitelist:
        if protected.lower() in text_lower:
            return text  # Don't convert if whitelisted phrase present

    result = text

    # Apply direct word replacements
    for us_word, au_word in AU_SPELLING_WORDS.items():
        result = re.sub(r'\b' + us_word + r'\b', au_word, result, flags=re.IGNORECASE)

    # Apply pattern-based replacements
    for pattern in AU_SPELLING_PATTERNS:
        if len(pattern) == 3:
            # Pattern with word list
            regex, replacement, word_list = pattern
            for word in word_list:
                result = re.sub(word + r'\b', word.replace('or', 'our'), result, flags=re.IGNORECASE)
        else:
            # Simple pattern
            regex, replacement = pattern
            result = re.sub(regex, replacement, result, flags=re.IGNORECASE)

    return result

Update main.py:

from config import AU_SPELLING
from au_spelling import convert_to_au_spelling

def on_hotkey():
    text = pyperclip.paste()
    result = polish(model, tokenizer, text)

    # Apply AU spelling if enabled
    if AU_SPELLING:
        result = convert_to_au_spelling(result)

    pyperclip.copy(result)

3. Config Features Implementation (HIGH)

AGGRESSION Levels

Implementation in main.py:

def on_hotkey():
    text = pyperclip.paste()

    # Skip processing if text is too short
    if len(text) < MIN_LENGTH:
        logging.info(f"Text too short ({len(text)} < {MIN_LENGTH}), skipping")
        return

    # Check custom dictionary for protected words
    if CUSTOM_DICTIONARY:
        has_protected = any(word.lower() in text.lower() for word in CUSTOM_DICTIONARY)
        if has_protected and AGGRESSION == "minimal":
            logging.info("Protected word detected in minimal mode, reducing corrections")
            # Could adjust max_length or temperature here

    result = polish(model, tokenizer, text)

    # Apply AU spelling
    if AU_SPELLING:
        whitelist = CUSTOM_DICTIONARY if AGGRESSION in ["minimal", "custom"] else []
        result = convert_to_au_spelling(result, whitelist)

    pyperclip.copy(result)

    # Log diff if enabled
    if LOGGING and text != result:
        diff = log_diff(text, result)
        logging.info(f"Changes:\n{diff}")
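
The "could adjust max_length or temperature" comment above can be made concrete by mapping each AGGRESSION level to generation settings and forwarding them to polish(). The level names and values here are assumptions for illustration, not measured defaults:

```python
# Hypothetical mapping from AGGRESSION level to generation settings;
# fewer beams and a smaller token budget make the model less eager to rewrite
AGGRESSION_PARAMS = {
    "minimal":    {"num_beams": 1, "max_new_tokens": 64},
    "standard":   {"num_beams": 4, "max_new_tokens": 128},
    "aggressive": {"num_beams": 8, "max_new_tokens": 256},
}

def generation_kwargs(level: str) -> dict:
    """Resolve an AGGRESSION level, falling back to "standard"."""
    return AGGRESSION_PARAMS.get(level, AGGRESSION_PARAMS["standard"])
```

polish() would then pass these through as pipeline keyword arguments, e.g. polish(model, tokenizer, text, **generation_kwargs(AGGRESSION)).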

CUSTOM_DICTIONARY

Already implemented above - words in CUSTOM_DICTIONARY are:

  1. Protected from AU spelling conversion
  2. Used to adjust correction aggression

MIN_LENGTH

Already implemented above - text shorter than MIN_LENGTH skips processing.
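
The logging branch in main.py calls log_diff, which is never defined in this plan. A minimal sketch using difflib (the function name matches the call site; the unified-diff format is an assumption):

```python
import difflib

def log_diff(before: str, after: str) -> str:
    """Return a unified diff between the original and polished text."""
    return "\n".join(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="original", tofile="polished", lineterm=""))
```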


4. Service Testing (MEDIUM)

Current service file: service/clipboard-polisher.service

  • User set to bob
  • Uses venv python path
  • ⚠️ Not tested

Testing steps:

# Copy service file
sudo cp service/clipboard-polisher.service /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Start service
sudo systemctl start clipboard-polisher

# Check status
sudo systemctl status clipboard-polisher

# View logs
journalctl -u clipboard-polisher -f

# Enable on boot (optional)
sudo systemctl enable clipboard-polisher

Note: Hotkey functionality requires X11/Wayland access. Service may need DISPLAY environment variable.
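
If the hotkey listener fails under systemd, one common fix is to pass the display environment into the unit explicitly. A sketch of the relevant lines for service/clipboard-polisher.service (the display number, XAUTHORITY path, and script path are assumptions to verify against the real file):

```ini
[Service]
User=bob
Environment=DISPLAY=:0
# X11 global hotkeys usually also need the user's X authority file
Environment=XAUTHORITY=/home/bob/.Xauthority
ExecStart=/MASTERFOLDER/Tools/text-polish/venv/bin/python /MASTERFOLDER/Tools/text-polish/src/main.py
```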


5. Testing Plan

Test 1: Performance (Re-run after ONNX)

python test_performance.py

Target: <20ms average inference, <20s load time

Test 2: AU Spelling

python -c "
from src.au_spelling import convert_to_au_spelling
tests = [
    ('I cant beleive its color', 'I cant beleive its colour'),
    ('The theater center', 'The theatre centre'),
    ('Authorize the program', 'Authorise the program'),
]
for input_text, expected in tests:
    result = convert_to_au_spelling(input_text)
    assert result == expected, f'Failed: {result} != {expected}'
print('All AU spelling tests passed!')
"

Test 3: Integration

Create test_integration.py:

#!/usr/bin/env python3
import sys
sys.path.insert(0, '/MASTERFOLDER/Tools/text-polish/src')

from model_loader import load_model, polish
from au_spelling import convert_to_au_spelling

model, tokenizer = load_model()

test_cases = [
    "teh color was realy nice",  # Should become "the colour was really nice"
    "I need to organize the theater",  # Should become "I need to organise the theatre"
]

for test in test_cases:
    result = polish(model, tokenizer, test)
    result_au = convert_to_au_spelling(result)
    print(f"Input:  {test}")
    print(f"Polish: {result}")
    print(f"AU:     {result_au}")
    print()

6. Priority Task List

Week 1: Performance

  1. Install optimum library
  2. Export and quantize model
  3. Update model_loader.py
  4. Run performance tests
  5. Document results

Week 2: AU Spelling

  1. Create au_spelling.py with all patterns
  2. Write unit tests for each pattern
  3. Integrate into main.py
  4. Test with real examples
  5. Update documentation

Week 3: Config Features

  1. Implement AGGRESSION logic
  2. Implement MIN_LENGTH check
  3. Integrate CUSTOM_DICTIONARY
  4. Add logging for all changes
  5. Test all combinations

Week 4: Deployment

  1. Test systemd service
  2. Fix any environment issues
  3. Test hotkey functionality
  4. Add monitoring/logging
  5. Documentation

7. Success Metrics

Performance:

  • Model load < 20s (intermediate target, final target 2s)
  • Average inference < 20ms (intermediate, final 10ms)
  • Memory < 300MB

Functionality:

  • AU spelling conversions working (all 7 patterns)
  • AGGRESSION levels functional
  • CUSTOM_DICTIONARY protects words
  • MIN_LENGTH filter works
  • Logging shows diffs

Deployment:

  • Service starts successfully
  • Hotkey works in service mode
  • 24/7 uptime capable
  • Error handling robust

Research Sources

  1. ONNX Optimization:

    • Article: "Blazing Fast Inference with Quantized ONNX Models" by Tarun Gudipati
    • Key: ONNX export plus quantization gave ~5x faster inference and ~2.2x less memory
  2. AU Spelling:

    • Article 1: "Spelling Differences Between American and Australian English"
    • Source: getproofed.com.au
    • Article 2: "4 Reasons Australian English is Unique"
    • Source: unitedlanguagegroup.com
    • Key: 7 main spelling patterns identified
  3. Custom Dictionaries:

    • Article: "Autocorrect Feature using NLP in Python"
    • Source: analyticsvidhya.com
    • Key: Whitelist implementation patterns