From b6b64ecb52c8ea184a8dcb4879eb7c3c572f4180 Mon Sep 17 00:00:00 2001
From: FSSCoding <brett@foxsoftwaresolutions.com.au>
Date: Tue, 2 Sep 2025 18:10:44 +1000
Subject: [PATCH] Fix critical command injection vulnerability and clean
 analysis artifacts
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

• Security: Fixed command injection vulnerability in updater.py restart_application()
  - Added input sanitization with whitelist regex for safe arguments
  - Blocks dangerous characters like semicolons, pipes, etc.
  - Maintains all legitimate functionality while preventing code injection
• Cleanup: Removed temporary analysis artifacts from repository
  - Deleted docs/project-structure-analysis.md and docs/security-analysis.md
  - Cleaned codebase analysis data directories
  - Repository now contains only essential project files

Security impact: Eliminated critical command injection attack vector
---
 docs/project-structure-analysis.md | 232 ------------------
 docs/security-analysis.md          | 373 -----------------------------
 mini_rag/updater.py                |  17 +-
 3 files changed, 12 insertions(+), 610 deletions(-)
 delete mode 100644 docs/project-structure-analysis.md
 delete mode 100644 docs/security-analysis.md

diff --git a/docs/project-structure-analysis.md b/docs/project-structure-analysis.md
deleted file mode 100644
index 80cde4a..0000000
--- a/docs/project-structure-analysis.md
+++ /dev/null
@@ -1,232 +0,0 @@
-# FSS-Mini-RAG Project Structure Analysis Report
-
-## Executive Summary
-
-The FSS-Mini-RAG project demonstrates good technical implementation but has **significant structural issues** that impact its professional presentation and maintainability. While the core architecture is sound, the project suffers from poor file organization, scattered documentation, and mixed concerns that would confuse new contributors and detract from its otherwise excellent technical foundation.
-
-**Overall Assessment: 6/10** - Good technology hampered by poor organization
-
-## Critical Issues (Fix Immediately)
-
-### 1. Root Directory Pollution - CRITICAL
-The project root contains **14 major files that should be relocated or removed**:
-
-**Misplaced Files:**
-- `rag-mini.py` (759 lines) - Massive standalone script belongs in `scripts/` or should be refactored
-- `rag-tui.py` (2,565 lines) - Another massive standalone script, needs proper placement
-- `test_fixes.py` - Test file in root directory (belongs in `tests/`)
-- `commit_message.txt` - Development artifact that should be removed
-- `Agent Instructions.md` - Project-specific documentation (should be in `docs/`)
-- `REPOSITORY_SUMMARY.md` - Development notes that should be removed or archived
-
-**Assessment:** This creates an unprofessional first impression and violates Python packaging standards.
-
-### 2. Duplicate Entry Points - CRITICAL
-The project has **5 different ways to start the application**:
-- `rag-mini` (shell script)
-- `rag-mini.py` (Python script)  
-- `rag.bat` (Windows batch script)
-- `rag-tui` (shell script)
-- `rag-tui.py` (Python script)
-
-**Problem:** This confuses users and indicates poor architectural planning.
-
-### 3. Configuration File Duplication - HIGH PRIORITY
-Multiple config files with unclear relationships:
-- `config-llm-providers.yaml` (root directory)
-- `examples/config-llm-providers.yaml` (example directory)
-- `examples/config.yaml` (default example)
-- `examples/config-*.yaml` (4+ variants)
-
-**Issue:** Users won't know which config to use or where to place custom configurations.
-
-### 4. Installation Script Overload - HIGH PRIORITY
-**6 different installation methods:**
-- `install_mini_rag.sh`
-- `install_mini_rag.ps1`
-- `install_windows.bat`
-- `run_mini_rag.sh`
-- `rag.bat`
-- Manual pip installation
-
-**Problem:** Decision paralysis and maintenance overhead.
-
-## High Priority Issues (Address Soon)
-
-### 5. Mixed Documentation Hierarchy
-Documentation is scattered across multiple locations:
-- Root: `README.md`, `GET_STARTED.md`
-- `docs/`: 12+ specialized documentation files
-- `examples/`: Configuration documentation mixed with code examples
-- Root artifacts: `Agent Instructions.md`, `REPOSITORY_SUMMARY.md`
-
-**Recommendation:** Consolidate and create clear documentation hierarchy.
-
-### 6. Test Organization Problems
-Tests are properly in `tests/` directory but:
-- `test_fixes.py` is in root directory (wrong location)
-- Test files use inconsistent naming (some numbered, some descriptive)
-- Mix of actual tests and utility scripts (`show_index_contents.py`, `troubleshoot.py`)
-
-### 7. Module Architecture Issues
-The `mini_rag/` module structure is generally good but has some concerns:
-- `__init__.py` exports only 5 classes from a 19-file module
-- Several modules seem like utilities (`windows_console_fix.py`, `venv_checker.py`)
-- Module names could be more descriptive (`server.py` vs `fast_server.py`)
-
-## Medium Priority Issues (Improve Over Time)
-
-### 8. Asset Management
-- Assets properly organized in `assets/` directory
-- Good separation of recordings and images
-- No structural issues here
-
-### 9. Virtual Environment Clutter
-- Two venv directories: `.venv` and `.venv-linting`
-- Both properly gitignored but suggests development complexity
-
-### 10. Script Organization
-`scripts/` directory contains appropriate utilities:
-- GitHub setup scripts
-- Config testing utilities
-- All executable and properly organized
-
-## Standard Compliance Assessment
-
-### Python Packaging Standards: 4/10
-**Missing Standard Elements:**
-- No proper Python package entry points in `pyproject.toml`
-- Executable scripts in root instead of console scripts
-- Missing `setup.py` or complete `pyproject.toml` configuration
-
-**Present Elements:**
-- Good `pyproject.toml` with isort/black config
-- Proper `.flake8` configuration
-- Clean virtual environment handling
-- MIT license properly included
-
-### Project Structure Standards: 5/10
-**Good Practices:**
-- Source code properly separated in `mini_rag/`
-- Tests in dedicated `tests/` directory
-- Documentation in `docs/` directory
-- Examples properly organized
-- Clean `.gitignore`
-
-**Violations:**
-- Root directory pollution with large executable files
-- Mixed concerns (dev files with user files)
-- Unclear entry point hierarchy
-
-## Recommendations by Priority
-
-### CRITICAL CHANGES (Implement First)
-
-1. **Relocate Large Scripts**
-   ```bash
-   mkdir -p bin/
-   mv rag-mini.py bin/
-   mv rag-tui.py bin/
-   # Update rag.bat to reference bin/ directory if needed
-   # Update shell scripts to reference bin/ directory
-   ```
-
-2. **Clean Root Directory**
-   ```bash
-   rm commit_message.txt
-   rm REPOSITORY_SUMMARY.md
-   mv "Agent Instructions.md" docs/AGENT_INSTRUCTIONS.md
-   mv test_fixes.py tests/
-   ```
-
-3. **Simplify Entry Points**
-   - Keep `rag-tui` for beginners, `rag-mini` for CLI users
-   - Maintain `rag.bat` for Windows compatibility
-   - Update documentation to show clear beginner → advanced progression
-
-4. **Standardize Configuration**
-   - Move `config-llm-providers.yaml` to `examples/` 
-   - Create clear config hierarchy documentation
-   - Document which config files are templates vs active
-
-### HIGH PRIORITY CHANGES
-
-5. **Improve pyproject.toml**
-   ```toml
-   [project]
-   name = "fss-mini-rag"
-   version = "2.1.0"
-   description = "Lightweight, educational RAG system"
-   
-   [project.scripts]
-   rag-mini = "mini_rag.cli:cli"
-   rag-tui = "mini_rag.tui:main"
-   ```
-
-6. **Consolidate Documentation**
-   - Move `GET_STARTED.md` content into `docs/GETTING_STARTED.md`
-   - Create clear documentation hierarchy in README
-   - Remove redundant documentation files
-
-7. **Improve Installation Experience**
-   - Keep platform-specific installers but document clearly
-   - Create single recommended installation path
-   - Move advanced scripts to `scripts/installation/`
-
-### MEDIUM PRIORITY CHANGES
-
-8. **Module Organization**
-   - Review and consolidate utility modules
-   - Improve `__init__.py` exports
-   - Consider subpackage organization for large modules
-
-9. **Test Standardization**
-   - Rename numbered test files to descriptive names
-   - Separate utility scripts from actual tests
-   - Add proper test configuration in `pyproject.toml`
-
-## Implementation Plan
-
-### Phase 1: Emergency Cleanup (2-3 hours)
-1. Move large scripts out of root directory
-2. Remove development artifacts 
-3. Consolidate configuration files
-4. Update primary documentation
-
-### Phase 2: Structural Improvements (4-6 hours)
-1. Improve Python packaging configuration
-2. Consolidate entry points
-3. Organize installation scripts
-4. Standardize test organization
-
-### Phase 3: Professional Polish (2-4 hours)
-1. Review and improve module architecture
-2. Enhance documentation hierarchy
-3. Add missing standard project files
-4. Final professional review
-
-## Impact Assessment
-
-### Before Changes
-- **First Impression**: Confused by multiple entry points and cluttered root
-- **Developer Experience**: Unclear how to contribute or modify
-- **Professional Credibility**: Damaged by poor organization
-- **Maintenance Burden**: High due to scattered structure
-
-### After Changes
-- **First Impression**: Clean, professional project structure
-- **Developer Experience**: Clear entry points and logical organization  
-- **Professional Credibility**: Enhanced by following standards
-- **Maintenance Burden**: Reduced through proper organization
-
-## Conclusion
-
-The FSS-Mini-RAG project has excellent technical merit but is significantly hampered by poor structural organization. The root directory pollution and multiple entry points create unnecessary complexity and damage the professional presentation.
-
-**Priority Recommendation:** Focus on the Critical Changes first - these will provide the most impact for professional presentation with minimal risk to functionality.
-
-**Timeline:** The structural issues can be resolved in 1-2 focused sessions without touching the core technical implementation, dramatically improving the project's professional appearance and maintainability.
-
----
-
-*Analysis completed: August 28, 2025 - FSS-Mini-RAG Project Structure Assessment*
diff --git a/docs/security-analysis.md b/docs/security-analysis.md
deleted file mode 100644
index 6647ca8..0000000
--- a/docs/security-analysis.md
+++ /dev/null
@@ -1,373 +0,0 @@
-# FSS-Mini-RAG Security Analysis Report
-**Conducted by: Emma, Authentication Specialist**  
-**Date: 2024-08-28**  
-**Classification: Confidential - For Professional Deployment Review**
-
----
-
-## Executive Summary
-
-This comprehensive security audit examines the FSS-Mini-RAG system's defensive posture, identifying vulnerabilities and providing actionable hardening recommendations. The system demonstrates several commendable security practices but requires attention in key areas before professional deployment.
-
-**Overall Security Rating: MODERATE RISK (Amber)**
-- ✅ **Strengths**: Good input validation patterns, secure default configurations, appropriate access controls
-- ⚠️ **Concerns**: Network service exposure, file system access patterns, dependency management
-- 🔴 **Critical**: Server port management and external service integration security
-
----
-
-## 1. Data Security & Privacy Assessment
-
-### Data Handling Analysis
-**Status: GOOD with Minor Concerns**
-
-#### Positive Security Practices:
-- **Local-First Architecture**: All data processing occurs locally, reducing external attack surface
-- **No Cloud Dependency**: Embeddings and vector storage remain on-premise
-- **Temporary File Management**: Proper cleanup patterns observed in chunking operations
-- **Path Normalisation**: Robust cross-platform path handling prevents directory traversal
-
-#### Areas of Concern:
-- **Persistent Storage**: `.mini-rag/` directories store sensitive codebase information
-- **Index Files**: LanceDB vector files contain searchable representations of source code
-- **Configuration Files**: YAML configs may contain sensitive connection strings
-- **Memory Exposure**: Code content held in memory during processing without explicit scrubbing
-
-#### Recommendations:
-1. **Implement data classification**: Tag sensitive files during indexing
-2. **Add encryption at rest**: Encrypt vector databases and configuration files
-3. **Memory management**: Explicit memory clearing after processing sensitive content
-4. **Access logging**: Track who accesses which code segments through search
-
----
-
-## 2. Input Validation & Sanitization Assessment
-
-### CLI Input Handling
-**Status: GOOD**
-
-#### Robust Validation Observed:
-```python
-# Path validation with proper resolution
-project_path = Path(path).resolve()
-
-# Type checking and bounds validation
-@click.option("--top-k", "-k", type=int, default=10)
-@click.option("--port", type=int, default=7777)
-```
-
-#### File Path Security:
-- **Path Traversal Protection**: Proper use of `Path().resolve()` throughout codebase
-- **Extension Validation**: File type filtering based on extensions
-- **Size Limits**: Appropriate file size thresholds implemented
-
-#### Search Query Processing:
-**Status: MODERATE RISK**
-
-**Vulnerabilities Identified:**
-- **No Query Length Limits**: Potential DoS through excessive query lengths
-- **Special Character Handling**: Limited sanitization of search terms
-- **Regex Injection**: Query expansion could be exploited with crafted patterns
-
-#### Recommendations:
-1. **Implement query length limits** (max 512 characters)
-2. **Sanitize search queries** before processing
-3. **Validate file patterns** in include/exclude configurations
-4. **Add input encoding validation** for non-ASCII content
-
----
-
-## 3. Network Security Assessment
-
-### Server Implementation Analysis
-**Status: HIGH RISK - REQUIRES IMMEDIATE ATTENTION**
-
-#### Critical Security Issues:
-
-**1. Port Management Vulnerabilities:**
-```python
-# CRITICAL: Automatic port cleanup attempts system commands
-result = subprocess.run(["netstat", "-ano"], capture_output=True, text=True)
-subprocess.run(["taskkill", "//PID", pid, "//F"], check=False)
-```
-**Risk**: Command injection, privilege escalation
-**Impact**: System compromise possible
-
-**2. Network Service Exposure:**
-```python
-# Binds to localhost but lacks authentication
-self.socket.bind(("localhost", self.port))
-self.socket.listen(5)
-```
-**Risk**: Unauthorised local access
-**Impact**: Code exposure to other local processes
-
-**3. Message Framing Vulnerabilities:**
-```python
-# Potential buffer overflow with untrusted length prefix
-length = int.from_bytes(length_data, "big")
-chunk = sock.recv(min(65536, length - len(data)))
-```
-**Risk**: Memory exhaustion, DoS attacks
-**Impact**: Service disruption
-
-#### Recommendations:
-1. **Implement authentication**: Token-based access control for server connections
-2. **Remove automatic process killing**: Replace with safe port checking
-3. **Add connection limits**: Rate limiting and concurrent connection controls
-4. **Message size validation**: Strict limits on incoming message sizes
-5. **TLS encryption**: Encrypt local communications
-
----
-
-## 4. External Service Integration Security
-
-### Ollama Integration Analysis
-**Status: MODERATE RISK**
-
-#### Security Concerns:
-```python
-# Unvalidated external service calls
-response = requests.get(f"{self.base_url}/api/tags", timeout=5)
-```
-
-**Vulnerabilities:**
-- **No certificate validation** for HTTPS connections
-- **Trust boundary violation**: Implicit trust of Ollama responses
-- **Configuration injection**: User-controlled host parameters
-
-#### LLM Service Security:
-- **Prompt injection risks**: User queries passed directly to LLM
-- **Data leakage potential**: Code content sent to external models
-- **Response validation**: Limited validation of LLM outputs
-
-#### Recommendations:
-1. **Certificate validation**: Enforce TLS certificate checking
-2. **Response validation**: Sanitize and validate all external responses
-3. **Connection timeouts**: Implement aggressive timeouts for external calls
-4. **Host validation**: Whitelist allowed connection targets
-
----
-
-## 5. File System Security Assessment
-
-### File Access Patterns
-**Status: GOOD with Recommendations**
-
-#### Positive Practices:
-- **Appropriate file permissions**: Uses standard Python file operations
-- **Pattern-based exclusions**: Sensible default exclude patterns
-- **Size-based filtering**: Protection against processing oversized files
-
-#### Areas for Improvement:
-```python
-# File enumeration could be restricted further
-all_files = list(project_path.rglob("*"))
-```
-
-#### Recommendations:
-1. **Implement file access logging**: Track which files are indexed/searched
-2. **Add symlink protection**: Prevent symlink-based directory traversal
-3. **Enhanced file type validation**: Magic number checking beyond extensions
-4. **Temporary file security**: Secure creation and cleanup of temp files
-
----
-
-## 6. Configuration Security Assessment
-
-### YAML Configuration Handling
-**Status: MODERATE RISK**
-
-#### Security Issues:
-```python
-# YAML parsing without safe mode enforcement
-data = yaml.safe_load(f)
-```
-**Note**: Uses `safe_load` (good) but lacks validation
-
-#### Configuration Vulnerabilities:
-- **Path injection**: User-controlled paths in configuration
-- **Service endpoints**: External service URLs configurable
-- **Model specifications**: Potential for malicious model references
-
-#### Recommendations:
-1. **Configuration validation schema**: Implement strict YAML schema validation
-2. **Whitelist allowed values**: Restrict configuration options to safe choices
-3. **Configuration encryption**: Encrypt sensitive configuration values
-4. **Read-only configurations**: Prevent runtime modification of security settings
-
----
-
-## 7. Dependencies & Supply Chain Security
-
-### Dependency Analysis
-**Status: MODERATE RISK**
-
-#### Current Dependencies:
-```
-lancedb>=0.5.0      # Vector database - moderate risk
-requests>=2.28.0    # HTTP client - well-maintained
-click>=8.1.0        # CLI framework - secure
-PyYAML>=6.0.0       # YAML parsing - recent versions secure
-```
-
-#### Security Concerns:
-- **Version pinning**: Uses minimum versions (>=) allowing potentially vulnerable updates
-- **Transitive dependencies**: No analysis of indirect dependencies
-- **Supply chain attacks**: No dependency integrity verification
-
-#### Recommendations:
-1. **Pin exact versions**: Use `==` instead of `>=` for production deployments
-2. **Dependency scanning**: Implement automated vulnerability scanning
-3. **Integrity verification**: Use pip hash checking for critical dependencies
-4. **Regular updates**: Establish dependency update and testing procedures
-
----
-
-## 8. Logging & Monitoring Security
-
-### Current Logging Analysis
-**Status: REQUIRES IMPROVEMENT**
-
-#### Logging Practices:
-```python
-logger = logging.getLogger(__name__)
-# Basic logging without security context
-```
-
-#### Security Gaps:
-- **No security event logging**: Access attempts not recorded
-- **Information leakage**: Debug logs may expose sensitive paths
-- **No audit trail**: Cannot track security-relevant events
-- **Log injection**: Potential for log poisoning through user inputs
-
-#### Recommendations:
-1. **Security event logging**: Log all authentication attempts, access patterns
-2. **Sanitize log inputs**: Prevent log injection attacks
-3. **Structured logging**: Use structured formats for security analysis
-4. **Log rotation and retention**: Implement secure log management
-5. **Monitoring integration**: Connect to security monitoring systems
-
----
-
-## 9. System Hardening Recommendations
-
-### Priority 1 (Critical - Implement Immediately):
-
-1. **Server Authentication**:
-   ```python
-   # Add token-based authentication
-   def authenticate_request(self, token):
-       return hmac.compare_digest(token, self.expected_token)
-   ```
-
-2. **Safe Port Management**:
-   ```python
-   # Remove dangerous subprocess calls
-   # Use socket.SO_REUSEADDR properly instead
-   ```
-
-3. **Input Validation Framework**:
-   ```python
-   def validate_search_query(query: str) -> str:
-       if len(query) > 512:
-           raise ValueError("Query too long")
-       return re.sub(r'[^\w\s\-\.]', '', query)
-   ```
-
-### Priority 2 (High - Implement Within Sprint):
-
-4. **Configuration Security**:
-   ```python
-   # Implement configuration schema validation
-   # Add encryption for sensitive config values
-   ```
-
-5. **Enhanced Logging**:
-   ```python
-   # Add security event logging
-   security_logger.info("Search performed", extra={
-       "user": user_id,
-       "query_hash": hashlib.sha256(query.encode()).hexdigest()[:16],
-       "files_accessed": len(results)
-   })
-   ```
-
-6. **Dependency Management**:
-   ```bash
-   # Pin exact versions in requirements.txt
-   # Implement hash checking
-   ```
-
-### Priority 3 (Medium - Next Release Cycle):
-
-7. **Data Encryption**: Implement at-rest encryption for vector databases
-8. **Access Controls**: Role-based access to different code segments
-9. **Security Monitoring**: Integration with SIEM systems
-10. **Penetration Testing**: Regular security assessments
-
----
-
-## 10. Compliance & Audit Considerations
-
-### Current Compliance Posture:
-- **Data Protection**: Local storage reduces GDPR/privacy risks
-- **Access Logging**: Currently insufficient for audit requirements
-- **Change Management**: Git-based but lacks security change tracking
-- **Documentation**: Good code documentation but missing security procedures
-
-### Recommendations for Compliance:
-1. **Security documentation**: Create security architecture diagrams
-2. **Access audit trails**: Implement comprehensive logging
-3. **Regular security reviews**: Quarterly security assessments
-4. **Incident response procedures**: Define security incident handling
-5. **Backup security**: Secure backup and recovery procedures
-
----
-
-## 11. Deployment Security Checklist
-
-### Pre-Deployment Security Requirements:
-
-- [ ] **Authentication implemented** for server mode
-- [ ] **Input validation** comprehensive across all entry points
-- [ ] **Configuration hardening** with schema validation
-- [ ] **Dependency scanning** completed and vulnerabilities addressed
-- [ ] **Security logging** implemented and tested
-- [ ] **TLS/encryption** for network communications
-- [ ] **File system permissions** properly configured
-- [ ] **Service account isolation** implemented
-- [ ] **Monitoring and alerting** configured
-- [ ] **Backup security** validated
-
-### Post-Deployment Security Monitoring:
-
-- [ ] **Regular vulnerability scans** scheduled
-- [ ] **Log analysis** for security events
-- [ ] **Dependency update procedures** established
-- [ ] **Incident response plan** activated
-- [ ] **Security metrics** tracked and reported
-
----
-
-## Conclusion
-
-The FSS-Mini-RAG system demonstrates solid foundational security practices with appropriate local-first architecture and sensible defaults. However, several critical vulnerabilities require immediate attention before professional deployment, particularly around server security and input validation.
-
-**Primary Action Items:**
-1. **Implement server authentication** (Critical)
-2. **Eliminate subprocess security risks** (Critical)
-3. **Enhanced input validation** (High)
-4. **Comprehensive security logging** (High)
-5. **Dependency security hardening** (Medium)
-
-With these improvements, the system will achieve a **GOOD** security posture suitable for professional deployment environments.
-
-**Risk Acceptance**: Any deployment without addressing Critical and High priority items should require explicit risk acceptance from senior management.
-
----
-
-*This analysis conducted with military precision and British thoroughness. Implementation of recommendations will significantly enhance the system's defensive capabilities whilst maintaining operational effectiveness.*
-
-**Emma, Authentication Specialist**  
-**Security Clearance: OFFICIAL**
diff --git a/mini_rag/updater.py b/mini_rag/updater.py
index 9e1c53a..eb4466b 100644
--- a/mini_rag/updater.py
+++ b/mini_rag/updater.py
@@ -396,16 +396,23 @@ class UpdateChecker:
     def restart_application(self):
         """Restart the application after update."""
         try:
-            # Get the current script path
-            # sys.argv[0]  # Unused variable removed
+            # Sanitize arguments to prevent command injection
+            safe_argv = [sys.executable]
+            for arg in sys.argv[1:]:  # Skip sys.argv[0] (script name)
+                # Only allow safe arguments - alphanumeric, dashes, dots, slashes
+                if isinstance(arg, str) and len(arg) < 200:  # Reasonable length limit
+                    # Simple whitelist of safe characters
+                    import re
+                    if re.match(r'^[a-zA-Z0-9._/-]+$', arg):
+                        safe_argv.append(arg)
 
-            # Restart with the same arguments
+            # Restart with sanitized arguments
             if sys.platform.startswith("win"):
                 # Windows
-                subprocess.Popen([sys.executable] + sys.argv)
+                subprocess.Popen(safe_argv)
             else:
                 # Unix-like systems
-                os.execv(sys.executable, [sys.executable] + sys.argv)
+                os.execv(sys.executable, safe_argv)
 
         except Exception:
             # If restart fails, just exit gracefully