Fss-Rag-Mini/tests/show_index_contents.py
BobAi 4166d0a362 Initial release: FSS-Mini-RAG - Lightweight semantic code search system
🎯 Complete transformation from 5.9GB bloated system to 70MB optimized solution

Key Features:
- Hybrid embedding system (Ollama + ML fallback + hash backup)
- Intelligent chunking with language-aware parsing
- Semantic + BM25 hybrid search with rich context
- Zero-config portable design with graceful degradation
- Beautiful TUI for beginners + powerful CLI for experts
- Comprehensive documentation with 8+ Mermaid diagrams
- Professional animated demo (183KB optimized GIF)

🏗️ Architecture Highlights:
- LanceDB vector storage with streaming indexing
- Smart file tracking (size/mtime) to avoid expensive rehashing
- Progressive chunking: Markdown headers → Python functions → fixed-size
- Quality filtering: 200+ chars, 20+ words, 30% alphanumeric content
- Concurrent batch processing with error recovery

📦 Package Contents:
- Core engine: claude_rag/ (11 modules, 2,847 lines)
- Entry points: rag-mini (unified), rag-tui (beginner interface)
- Documentation: README + 6 guides with visual diagrams
- Assets: 3D icon, optimized demo GIF, recording tools
- Tests: 8 comprehensive integration and validation tests
- Examples: Usage patterns, config templates, dependency analysis

🎥 Demo System:
- Scripted demonstration showing 12 files → 58 chunks indexing
- Semantic search with multi-line result previews
- Complete workflow from TUI startup to CLI mastery
- Professional recording pipeline with asciinema + GIF conversion

🛡️ Security & Quality:
- Complete .gitignore with personal data protection
- Dependency optimization (removed python-dotenv)
- Code quality validation and educational test suite
- Agent-reviewed architecture and documentation

Ready for production use - copy folder, run ./rag-mini, start searching!
2025-08-12 16:38:28 +10:00

47 lines
1.2 KiB
Python

#!/usr/bin/env python3
"""
Show what files are actually indexed in the RAG system.
"""
import sys
import os
from pathlib import Path
# On Windows, switch stdout to UTF-8 so non-ASCII output prints cleanly.
if sys.platform == 'win32':
    os.environ['PYTHONUTF8'] = '1'
    sys.stdout.reconfigure(encoding='utf-8')
sys.path.insert(0, str(Path(__file__).parent))
from claude_rag.vector_store import VectorStore
from collections import Counter
project_path = Path.cwd()
store = VectorStore(project_path)
store._connect()
# Get all indexed files
files = []
chunks_by_file = Counter()
chunk_types = Counter()
# Each table row is one chunk; tally chunks per file and per chunk type.
for row in store.table.to_pandas().itertuples():
    files.append(row.file_path)
    chunks_by_file[row.file_path] += 1
    chunk_types[row.chunk_type] += 1
unique_files = sorted(set(files))
print(f"\n Indexed Files Summary")
print(f"Total files: {len(unique_files)}")
print(f"Total chunks: {len(files)}")
print(f"\nChunk types: {dict(chunk_types)}")
print(f"\n Files with most chunks:")
for file, count in chunks_by_file.most_common(10):
    print(f" {count:3d} chunks: {file}")
print(f"\n Text-to-speech files:")
tts_files = [f for f in unique_files if 'text-to-speech' in f or 'speak' in f.lower()]
for f in tts_files:
    print(f" - {f} ({chunks_by_file[f]} chunks)")
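
The counting logic above can also be pulled into a small function that works on plain row dictionaries, which makes it possible to check the summary without a live index. This is only a sketch: `summarize_chunks` is a hypothetical helper that is not part of `claude_rag`, and it assumes each row carries `file_path` and `chunk_type` keys, mirroring the two columns the script reads from the table.

```python
from collections import Counter
from typing import Iterable, Mapping


def summarize_chunks(rows: Iterable[Mapping[str, str]], top_n: int = 10) -> dict:
    """Aggregate chunk counts per file and per chunk type.

    Hypothetical helper (not part of claude_rag). Each row is assumed to
    provide 'file_path' and 'chunk_type' keys, one row per indexed chunk.
    """
    chunks_by_file = Counter()
    chunk_types = Counter()
    for row in rows:
        chunks_by_file[row["file_path"]] += 1
        chunk_types[row["chunk_type"]] += 1
    return {
        "total_files": len(chunks_by_file),
        "total_chunks": sum(chunks_by_file.values()),
        "chunk_types": dict(chunk_types),
        "top_files": chunks_by_file.most_common(top_n),
        "files": sorted(chunks_by_file),
    }


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = [
        {"file_path": "a.py", "chunk_type": "function"},
        {"file_path": "a.py", "chunk_type": "class"},
        {"file_path": "README.md", "chunk_type": "section"},
    ]
    summary = summarize_chunks(sample)
    print(f"Total files: {summary['total_files']}")
    print(f"Total chunks: {summary['total_chunks']}")
```

With a real index, the same function could be fed `store.table.to_pandas().to_dict("records")`, assuming the table exposes those two columns as the script above implies.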