Text-Polish Implementation Plan
Based on the Blueprint Gap Analysis and web research. Generated: 2025-10-25
Executive Summary
Current Status:
- ✅ Core MVP works: hotkey → clipboard → model → clipboard
- ❌ Performance below targets: 82s model load (target: 2s), 63ms inference (target: 10ms)
- ❌ AU spelling not implemented (Phase 1 requirement)
- ❌ Config features are stubs
Priority Order:
- CRITICAL: Model optimization (ONNX + quantization)
- CRITICAL: AU spelling implementation
- HIGH: Config features (AGGRESSION, CUSTOM_DICTIONARY, MIN_LENGTH)
- MEDIUM: Service testing and deployment
1. Model Optimization (CRITICAL)
Research Findings
Source: /tmp/model-optimization-research/
Article: "Blazing Fast Inference with Quantized ONNX Models" by Tarun Gudipati
Performance Gains:
- 5x faster inference (0.5s → 0.1s in article example)
- 2.2x less memory (11MB → 4.9MB in article example)
- Expected results for text-polish:
  - Load time: 82s → ~16s (target: <2s, still needs work)
  - Inference: 63ms → ~12ms (target: <10ms, close!)
  - First inference: 284ms → ~57ms
Implementation Steps
Step 1: Install the optimum library
```bash
cd /MASTERFOLDER/Tools/text-polish
source venv/bin/activate
pip install "optimum[onnxruntime]"   # quoted so the brackets survive shell globbing
```
Step 2: Export the model to ONNX
```bash
optimum-cli export onnx \
  --model willwade/t5-small-spoken-typo \
  --optimize O3 \
  --task text2text-generation \
  t5_onnx
```
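A quick, optional sanity check that the export is loadable (assuming the output directory t5_onnx from above):
```bash
python -c "from optimum.onnxruntime import ORTModelForSeq2SeqLM; \
ORTModelForSeq2SeqLM.from_pretrained('t5_onnx')"
```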
Step 3: Quantize the model
```bash
optimum-cli onnxruntime quantize \
  --onnx_model t5_onnx \
  --output t5_onnx_quantized
```
Note: depending on the optimum version, this command also expects a target instruction-set flag (e.g. --avx512 or --arm64); check `optimum-cli onnxruntime quantize --help`.
Step 4: Update model_loader.py
Replace the PyTorch loading path with ONNX:
```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline

def load_model(model_path="t5_onnx_quantized"):
    """Load the quantized ONNX model behind a text2text pipeline."""
    tokenizer = AutoTokenizer.from_pretrained("willwade/t5-small-spoken-typo")
    model = ORTModelForSeq2SeqLM.from_pretrained(model_path)
    pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
    return pipe, tokenizer
```
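A minimal smoke test of the new loader; the sample text is arbitrary and the output shape follows the transformers pipeline convention:
```python
pipe, tokenizer = load_model()
out = pipe("teh color was realy nice", max_length=64)
print(out[0]["generated_text"])
```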
Step 5: Re-run the performance test
```bash
python test_performance.py
```
Expected Results:
- Load time: ~16s (improvement but still high, may need caching strategies)
- Inference: ~12ms average (close to 10ms target!)
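One caching avenue for the remaining load-time gap: save the tokenizer next to the ONNX files once, then load everything with local_files_only=True so startup never waits on a Hugging Face Hub round-trip. A sketch, assuming the Step 3 output directory:
```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline

# One-off: store the tokenizer alongside the quantized model
# AutoTokenizer.from_pretrained("willwade/t5-small-spoken-typo").save_pretrained("t5_onnx_quantized")

def load_model(model_path="t5_onnx_quantized"):
    # local_files_only skips every Hub lookup at daemon start
    tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
    model = ORTModelForSeq2SeqLM.from_pretrained(model_path, local_files_only=True)
    pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
    return pipe, tokenizer
```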
2. Australian Spelling Implementation (CRITICAL)
Research Findings
Source: /tmp/au-spelling-research/
Articles:
- "Spelling Differences Between American and Australian English" (getproofed.com.au)
- "4 Reasons Australian English is Unique" (unitedlanguagegroup.com)
AU Spelling Rules
Pattern 1: -our vs -or
"-or" → "-our"
Examples: color→colour, favor→favour, behavior→behaviour, neighbor→neighbour
Exception: "Labor Party" keeps -or
Pattern 2: -tre vs -ter
"-ter" → "-tre" (French origin words)
Examples: center→centre, theater→theatre, meter→metre
Pattern 3: -ise vs -ize
"-ize" → "-ise" (most common in AU)
Examples: authorize→authorise, plagiarize→plagiarise, organize→organise
Note: Both are acceptable, but -ise is standard
Pattern 4: -c vs -s (practice/practise)
Noun: "practice" (with c)
Verb: "practise" (with s)
US uses "practice" for both
Pattern 5: -oe/-ae vs -e
Mixed usage in AU (more relaxed than UK)
manoeuvre (AU/UK) vs maneuver (US)
encyclopedia (AU/US) vs encyclopaedia (UK)
Pattern 6: Double consonants
A final "l" doubles before "-ed"/"-ing" (this applies to "l" specifically, not to every consonant)
Examples: traveled→travelled, modeling→modelling
Exception: "program" is preferred over "programme" in AU
Pattern 7: Unique words
aluminum → aluminium
tire → tyre
Implementation
Create new file: src/au_spelling.py
"""Australian English spelling conversion module"""
import re
# Pattern-based replacements
AU_SPELLING_PATTERNS = [
# -or → -our (but not -ior, -oor)
(r'\b(\w+)or\b', r'\1our', ['color', 'favor', 'honor', 'labor', 'neighbor', 'behavior']),
# -ter → -tre (French words)
(r'\b(cen|thea|me)ter\b', r'\1tre'),
# -ize → -ise
(r'\b(\w+)ize\b', r'\1ise'),
# Double consonants for -ed/-ing
(r'\b(\w+[aeiou])([lnrt])ed\b', r'\1\2\2ed'),
(r'\b(\w+[aeiou])([lnrt])ing\b', r'\1\2\2ing'),
]
# Direct word replacements
AU_SPELLING_WORDS = {
# Unique words
'aluminum': 'aluminium',
'tire': 'tyre',
'tires': 'tyres',
'gray': 'grey',
# Exception: Labor Party keeps US spelling
# (handled by whitelist)
}
# Words that should NOT be converted
AU_SPELLING_WHITELIST = [
'labor party', # Political party name
'program', # Computer program (AU uses US spelling)
'inquiry', # AU prefers "inquiry" over "enquiry"
]
def convert_to_au_spelling(text: str, custom_whitelist: list = None) -> str:
"""
Convert American English text to Australian English spelling.
Args:
text: Input text in American English
custom_whitelist: Additional words/phrases to protect from conversion
Returns:
Text converted to Australian English spelling
"""
if not text:
return text
# Combine whitelists
whitelist = AU_SPELLING_WHITELIST.copy()
if custom_whitelist:
whitelist.extend(custom_whitelist)
# Check whitelist (case-insensitive)
text_lower = text.lower()
for protected in whitelist:
if protected.lower() in text_lower:
return text # Don't convert if whitelisted phrase present
result = text
# Apply direct word replacements
for us_word, au_word in AU_SPELLING_WORDS.items():
result = re.sub(r'\b' + us_word + r'\b', au_word, result, flags=re.IGNORECASE)
# Apply pattern-based replacements
for pattern in AU_SPELLING_PATTERNS:
if len(pattern) == 3:
# Pattern with word list
regex, replacement, word_list = pattern
for word in word_list:
result = re.sub(word + r'\b', word.replace('or', 'our'), result, flags=re.IGNORECASE)
else:
# Simple pattern
regex, replacement = pattern
result = re.sub(regex, replacement, result, flags=re.IGNORECASE)
return result
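A quick interactive check of the converter as written above:
```python
from au_spelling import convert_to_au_spelling

print(convert_to_au_spelling("The color of the theater"))
# -> The colour of the theatre
print(convert_to_au_spelling("Labor Party policy on color"))
# -> Labor Party policy on colour  ("Labor Party" itself stays protected)
```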
Update main.py:
```python
import pyperclip

from au_spelling import convert_to_au_spelling
from config import AU_SPELLING


def on_hotkey():
    text = pyperclip.paste()
    result = polish(model, tokenizer, text)

    # Apply AU spelling if enabled
    if AU_SPELLING:
        result = convert_to_au_spelling(result)

    pyperclip.copy(result)
```
3. Config Features Implementation (HIGH)
AGGRESSION Levels
Implementation in main.py:
```python
import logging

import pyperclip

from au_spelling import convert_to_au_spelling
from config import (AGGRESSION, AU_SPELLING, CUSTOM_DICTIONARY,
                    LOGGING, MIN_LENGTH)


def on_hotkey():
    text = pyperclip.paste()

    # Skip processing if the text is too short
    if len(text) < MIN_LENGTH:
        logging.info(f"Text too short ({len(text)} < {MIN_LENGTH}), skipping")
        return

    # Check the custom dictionary for protected words
    if CUSTOM_DICTIONARY:
        has_protected = any(word.lower() in text.lower() for word in CUSTOM_DICTIONARY)
        if has_protected and AGGRESSION == "minimal":
            logging.info("Protected word detected in minimal mode, reducing corrections")
            # Could adjust max_length or temperature here

    result = polish(model, tokenizer, text)

    # Apply AU spelling, shielding dictionary words in the gentler modes
    if AU_SPELLING:
        whitelist = CUSTOM_DICTIONARY if AGGRESSION in ["minimal", "custom"] else []
        result = convert_to_au_spelling(result, whitelist)

    pyperclip.copy(result)

    # Log a diff of the changes if enabled
    if LOGGING and text != result:
        diff = log_diff(text, result)  # log_diff: the tool's diff-logging helper
        logging.info(f"Changes:\n{diff}")
```
CUSTOM_DICTIONARY
Already implemented above - words in CUSTOM_DICTIONARY are:
- Protected from AU spelling conversion
- Used to adjust correction aggression
MIN_LENGTH
Already implemented above - text shorter than MIN_LENGTH skips processing.
4. Service Testing (MEDIUM)
Current service file: service/clipboard-polisher.service
- ✅ User set to bob
- ✅ Uses venv python path
- ⚠️ Not tested
Testing steps:
```bash
# Copy service file
sudo cp service/clipboard-polisher.service /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Start service
sudo systemctl start clipboard-polisher

# Check status
sudo systemctl status clipboard-polisher

# View logs
journalctl -u clipboard-polisher -f

# Enable on boot (optional)
sudo systemctl enable clipboard-polisher
```
Note: Hotkey functionality requires X11/Wayland access. Service may need DISPLAY environment variable.
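If the hotkey listener cannot reach the display, a drop-in override (created via `sudo systemctl edit clipboard-polisher`) is one way to pass the session environment through; the values below are assumptions that depend on bob's actual login session:
```ini
[Service]
# Hypothetical values; match them to the real session
Environment=DISPLAY=:0
Environment=XAUTHORITY=/home/bob/.Xauthority
```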
5. Testing Plan
Test 1: Performance (re-run after ONNX)
```bash
python test_performance.py
```
Target: <20ms average inference, <20s load time
Test 2: AU Spelling
python -c "
from src.au_spelling import convert_to_au_spelling
tests = [
('I cant beleive its color', 'I cant beleive its colour'),
('The theater center', 'The theatre centre'),
('Authorize the program', 'Authorise the program'),
]
for input_text, expected in tests:
result = convert_to_au_spelling(input_text)
assert result == expected, f'Failed: {result} != {expected}'
print('All AU spelling tests passed!')
"
Test 3: Integration
Create test_integration.py:
```python
#!/usr/bin/env python3
import sys
sys.path.insert(0, '/MASTERFOLDER/Tools/text-polish/src')

from model_loader import load_model, polish
from au_spelling import convert_to_au_spelling

model, tokenizer = load_model()

test_cases = [
    "teh color was realy nice",        # should become "the colour was really nice"
    "I need to organize the theater",  # should become "I need to organise the theatre"
]

for test in test_cases:
    result = polish(model, tokenizer, test)
    result_au = convert_to_au_spelling(result)
    print(f"Input:  {test}")
    print(f"Polish: {result}")
    print(f"AU:     {result_au}")
    print()
```
6. Priority Task List
Week 1: Performance
- Install optimum library
- Export and quantize model
- Update model_loader.py
- Run performance tests
- Document results
Week 2: AU Spelling
- Create au_spelling.py with all patterns
- Write unit tests for each pattern
- Integrate into main.py
- Test with real examples
- Update documentation
Week 3: Config Features
- Implement AGGRESSION logic
- Implement MIN_LENGTH check
- Integrate CUSTOM_DICTIONARY
- Add logging for all changes
- Test all combinations
Week 4: Deployment
- Test systemd service
- Fix any environment issues
- Test hotkey functionality
- Add monitoring/logging
- Documentation
7. Success Metrics
Performance:
- Model load < 20s (intermediate target, final target 2s)
- Average inference < 20ms (intermediate, final 10ms)
- Memory < 300MB
Functionality:
- AU spelling conversions working (all 7 patterns)
- AGGRESSION levels functional
- CUSTOM_DICTIONARY protects words
- MIN_LENGTH filter works
- Logging shows diffs
Deployment:
- Service starts successfully
- Hotkey works in service mode
- 24/7 uptime capable
- Error handling robust
Research Sources
- ONNX Optimization:
  - Article: "Blazing Fast Inference with Quantized ONNX Models"
  - Author: Tarun Gudipati
  - URL: https://codezen.medium.com/blazing-fast-inference-with-quantized-onnx-models-518f23777741
  - Key: 5x speed, 2.2x memory reduction
- AU Spelling:
  - Article 1: "Spelling Differences Between American and Australian English" (getproofed.com.au)
  - Article 2: "4 Reasons Australian English is Unique" (unitedlanguagegroup.com)
  - Key: 7 main spelling patterns identified
- Custom Dictionaries:
  - Article: "Autocorrect Feature using NLP in Python" (analyticsvidhya.com)
  - Key: whitelist implementation patterns