🎯 Complete transformation from 5.9GB bloated system to 70MB optimized solution

✨ Key Features:
- Hybrid embedding system (Ollama + ML fallback + hash backup)
- Intelligent chunking with language-aware parsing
- Semantic + BM25 hybrid search with rich context
- Zero-config portable design with graceful degradation
- Beautiful TUI for beginners + powerful CLI for experts
- Comprehensive documentation with 8+ Mermaid diagrams
- Professional animated demo (183KB optimized GIF)

🏗️ Architecture Highlights:
- LanceDB vector storage with streaming indexing
- Smart file tracking (size/mtime) to avoid expensive rehashing
- Progressive chunking: Markdown headers → Python functions → fixed-size
- Quality filtering: 200+ chars, 20+ words, 30% alphanumeric content
- Concurrent batch processing with error recovery

📦 Package Contents:
- Core engine: claude_rag/ (11 modules, 2,847 lines)
- Entry points: rag-mini (unified), rag-tui (beginner interface)
- Documentation: README + 6 guides with visual diagrams
- Assets: 3D icon, optimized demo GIF, recording tools
- Tests: 8 comprehensive integration and validation tests
- Examples: Usage patterns, config templates, dependency analysis

🎥 Demo System:
- Scripted demonstration showing 12 files → 58 chunks indexing
- Semantic search with multi-line result previews
- Complete workflow from TUI startup to CLI mastery
- Professional recording pipeline with asciinema + GIF conversion

🛡️ Security & Quality:
- Complete .gitignore with personal data protection
- Dependency optimization (removed python-dotenv)
- Code quality validation and educational test suite
- Agent-reviewed architecture and documentation

Ready for production use - copy folder, run ./rag-mini, start searching!
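The quality-filtering thresholds named in the commit message (200+ chars, 20+ words, 30% alphanumeric content) could be checked roughly as in the sketch below. This is not the project's actual code: the function name `passes_quality_filter` and the constant names are hypothetical, and it assumes the thresholds are inclusive and that the alphanumeric ratio is computed over every character of the stripped chunk, whitespace included.

```python
# Thresholds taken from the commit message; names are hypothetical.
MIN_CHARS = 200
MIN_WORDS = 20
MIN_ALNUM_RATIO = 0.30


def passes_quality_filter(text: str) -> bool:
    """Return True if a chunk meets all three quality thresholds."""
    stripped = text.strip()
    # Too short in characters.
    if len(stripped) < MIN_CHARS:
        return False
    # Too few whitespace-separated words.
    if len(stripped.split()) < MIN_WORDS:
        return False
    # Assumption: ratio is alphanumeric characters over all characters.
    alnum_count = sum(ch.isalnum() for ch in stripped)
    return alnum_count / len(stripped) >= MIN_ALNUM_RATIO
```

Under these assumptions a chunk of 50 short words passes, while a long run of punctuation fails on the alphanumeric ratio even though it clears the length check.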
43 lines
1.3 KiB
YAML
# FSS-Mini-RAG Configuration
# Edit this file to customize indexing and search behavior
# See docs/GETTING_STARTED.md for detailed explanations

# Text chunking settings
chunking:
  max_size: 2000  # Maximum characters per chunk
  min_size: 150  # Minimum characters per chunk
  strategy: semantic  # 'semantic' (language-aware) or 'fixed'

# Large file streaming settings
streaming:
  enabled: true
  threshold_bytes: 1048576  # Files larger than this use streaming (1MB)

# File processing settings
files:
  min_file_size: 50  # Skip files smaller than this
  exclude_patterns:
    - "node_modules/**"
    - ".git/**"
    - "__pycache__/**"
    - "*.pyc"
    - ".venv/**"
    - "venv/**"
    - "build/**"
    - "dist/**"
  include_patterns:
    - "**/*"  # Include all files by default

# Embedding generation settings
embedding:
  preferred_method: ollama  # 'ollama', 'ml', 'hash', or 'auto'
  ollama_model: nomic-embed-text
  ollama_host: localhost:11434
  ml_model: sentence-transformers/all-MiniLM-L6-v2
  batch_size: 32  # Embeddings processed per batch

# Search behavior settings
search:
  default_limit: 10  # Default number of results
  enable_bm25: true  # Enable keyword matching boost
  similarity_threshold: 0.1  # Minimum similarity score
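
To make the `files`, `streaming`, and `search` settings concrete, here is a minimal Python sketch of how such values might be applied. It is an illustration only, not FSS-Mini-RAG's implementation: the helpers `plan_file` and `filter_results` are hypothetical, the values are hard-coded from the configuration above rather than loaded from the file, exclude patterns are assumed to be matched against the project-relative POSIX path with `fnmatch` semantics (where `*` also crosses `/`), and the similarity threshold is assumed to be inclusive.

```python
from fnmatch import fnmatchcase

# Values copied from the configuration above.
EXCLUDE_PATTERNS = [
    "node_modules/**", ".git/**", "__pycache__/**", "*.pyc",
    ".venv/**", "venv/**", "build/**", "dist/**",
]
MIN_FILE_SIZE = 50
STREAMING_ENABLED = True
STREAMING_THRESHOLD_BYTES = 1_048_576
DEFAULT_LIMIT = 10
SIMILARITY_THRESHOLD = 0.1


def plan_file(rel_path: str, size_bytes: int) -> str:
    """Return 'skip', 'stream', or 'load' for one file (hypothetical helper)."""
    # Assumption: patterns apply to the path relative to the project root.
    if any(fnmatchcase(rel_path, pattern) for pattern in EXCLUDE_PATTERNS):
        return "skip"
    # Files smaller than min_file_size are skipped.
    if size_bytes < MIN_FILE_SIZE:
        return "skip"
    # Files strictly larger than the threshold are streamed.
    if STREAMING_ENABLED and size_bytes > STREAMING_THRESHOLD_BYTES:
        return "stream"
    return "load"


def filter_results(scored, limit=DEFAULT_LIMIT, threshold=SIMILARITY_THRESHOLD):
    """Drop low-scoring hits, sort best first, and cap the result count."""
    # Assumption: a score equal to the threshold is kept.
    kept = [(doc, score) for doc, score in scored if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:limit]
```

For example, `plan_file("node_modules/react/index.js", 5000)` is skipped by the first exclude pattern, while a 2,000,000-byte log file outside the excluded directories would be streamed. The sketch deliberately ignores `include_patterns`, since the default `"**/*"` is intended to include everything.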