# @bobai/frontmatter

BOBAI Markdown Standard v1.1 frontmatter generator for FSS parsers. Provides consistent, standardized frontmatter generation across all parser types.

## Installation

```bash
# From local path (recommended for FSS parsers)
npm install ../packages/bobai-frontmatter

# Or link globally
cd /MASTERFOLDER/Tools/parsers/packages/bobai-frontmatter
npm link
# Then in your parser:
npm link @bobai/frontmatter
```

## Quick Start

```typescript
import {
  FrontmatterGenerator,
  getEnrichmentPrompt,
  PARSER_PROFILES,
  LLMEnrichment
} from '@bobai/frontmatter';

// Markdown body produced by your parser (placeholder value)
const content = '# My Document\n\nExtracted text goes here.';

// Generate markdown with frontmatter
const markdown = FrontmatterGenerator.generateMarkdown(
  {
    generator: 'fss-parse-pdf',
    version: '1.2.0',
    title: 'My Document',
    sourcePath: '/path/to/file.pdf',
    profile: PARSER_PROFILES['fss-parse-pdf'] // 'technical'
  },
  {
    word_count: 1234,
    page_count: 8,
    has_tables: true,
    has_images: false
  },
  content,    // Markdown content string
  undefined,  // LLMEnrichment or undefined
  'balanced'  // OutputMode
);
```

## API Reference

### FrontmatterGenerator

#### `generate(options, deterministic?, enrichment?, mode?)`

Generate frontmatter YAML block only.

```typescript
// Signature notation: parameter types are shown inline for reference,
// so this block is not runnable as written.
const frontmatter = FrontmatterGenerator.generate(
  options: FrontmatterOptions,
  deterministic?: DeterministicFields,
  enrichment?: LLMEnrichment,
  mode?: OutputMode  // 'none' | 'balanced' | 'complete'
): string;
```

#### `generateMarkdown(options, deterministic, content, enrichment?, mode?)`

Generate complete markdown with frontmatter prepended.

```typescript
// Signature notation: parameter types are shown inline for reference,
// so this block is not runnable as written.
const markdown = FrontmatterGenerator.generateMarkdown(
  options: FrontmatterOptions,
  deterministic: DeterministicFields,
  content: string,
  enrichment?: LLMEnrichment,
  mode?: OutputMode
): string;
```

### Types

#### FrontmatterOptions

```typescript
interface FrontmatterOptions {
  generator: string;              // e.g., 'fss-parse-pdf'
  version: string;                // e.g., '1.2.0'
  title: string;                  // Document title
  sourcePath?: string | null;     // Original file path
  profile?: ProfileType;          // Document profile
  extractionConfidence?: number;  // 0.0-1.0
  contentQuality?: number;        // 0.0-2.0
}
```
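
The two numeric options have documented ranges (0.0-1.0 and 0.0-2.0). Whether the generator validates them isn't stated here, so a parser may want to normalize its own scores first. A minimal sketch, assuming a hypothetical `clampScore` helper that is not part of `@bobai/frontmatter`; the fallbacks mirror the documented `DEFAULTS` values:

```typescript
// Hypothetical helper (not exported by @bobai/frontmatter):
// clamp a score into [min, max]; use the fallback when it is missing or NaN.
function clampScore(
  value: number | undefined,
  min: number,
  max: number,
  fallback: number
): number {
  if (value === undefined || Number.isNaN(value)) return fallback;
  return Math.min(max, Math.max(min, value));
}

// An out-of-range confidence is pulled back to the 0.0-1.0 range.
const extractionConfidence = clampScore(1.3, 0, 1, 1.0);
// A missing quality score falls back to the documented default of 1.5.
const contentQuality = clampScore(undefined, 0, 2, 1.5);

console.log(extractionConfidence, contentQuality);
```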

#### DeterministicFields

Parser-extracted metadata. Any fields can be included:

```typescript
interface DeterministicFields {
  word_count?: number;
  page_count?: number;
  character_count?: number;
  [key: string]: any;  // Parser-specific fields
}
```
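
Parsers usually fill these fields from optional source metadata, so some values can end up `undefined` or `null`. How the generator treats such values isn't documented here. The sketch below drops them up front; `compactFields` is a hypothetical helper, not part of the package, and the interface is redeclared locally so the snippet runs on its own:

```typescript
// Local copy of the interface above (with `unknown` instead of `any`).
interface DeterministicFields {
  word_count?: number;
  page_count?: number;
  character_count?: number;
  [key: string]: unknown;
}

// Hypothetical helper (not exported by @bobai/frontmatter):
// returns a copy without keys whose value is undefined or null.
function compactFields(fields: DeterministicFields): DeterministicFields {
  const result: DeterministicFields = {};
  for (const key of Object.keys(fields)) {
    const value = fields[key];
    if (value !== undefined && value !== null) {
      result[key] = value;
    }
  }
  return result;
}

const compacted = compactFields({
  word_count: 1234,
  page_count: undefined, // dropped
  author: null,          // dropped
  has_tables: false      // kept: false is a real value
});

console.log(Object.keys(compacted)); // logs the keys word_count and has_tables
```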

#### LLMEnrichment

AI-generated metadata fields:

```typescript
interface LLMEnrichment {
  summary?: string;
  tags?: string[];
  category?: string;
  audience?: 'all' | 'beginner' | 'intermediate' | 'expert';
  doc_purpose?: 'reference' | 'tutorial' | 'troubleshooting' | 'conceptual' | 'guide' | 'specification';
  complexity?: number;  // 1-5
  actionable?: boolean;
  key_technologies?: string[];
}
```

## Output Modes

### `none`
Returns an empty string (no frontmatter), so the output is the content only.

### `balanced` (default)
Includes:
- Core required fields (profile, created, generator, version, title, etc.)
- Key deterministic fields from the `BALANCED_FIELDS` list
- LLM enrichment fields (or placeholders)

Best for RAG indexing and search.

### `complete`
Includes all fields from the deterministic object, plus the core and enrichment fields.
Use for archival or when full metadata is needed.
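
The way each mode treats deterministic fields can be sketched as a simple filter. This is an illustration of the behavior described above, not the package's actual implementation: `selectDeterministic` and `BALANCED_SUBSET` are hypothetical names, and the subset is a short excerpt rather than the full `BALANCED_FIELDS` list.

```typescript
type OutputMode = 'none' | 'balanced' | 'complete';

// Illustrative excerpt only; the real BALANCED_FIELDS list is much longer.
const BALANCED_SUBSET: readonly string[] = [
  'word_count', 'page_count', 'has_tables', 'has_images', 'author'
];

// Hypothetical sketch: choose which deterministic fields survive in each mode.
function selectDeterministic(
  fields: Record<string, unknown>,
  mode: OutputMode
): Record<string, unknown> {
  if (mode === 'none') return {};                // no frontmatter at all
  if (mode === 'complete') return { ...fields }; // keep everything
  const selected: Record<string, unknown> = {};
  for (const key of Object.keys(fields)) {
    if (BALANCED_SUBSET.indexOf(key) !== -1) selected[key] = fields[key];
  }
  return selected;
}

const parsed = { word_count: 5000, page_count: 25, encrypted: false };

// 'encrypted' is not in the balanced subset, so only two keys remain.
console.log(Object.keys(selectDeterministic(parsed, 'balanced')));
// All three keys are kept in complete mode.
console.log(Object.keys(selectDeterministic(parsed, 'complete')));
```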

## Parser Profiles

Default profiles for each parser type:

```typescript
import { PARSER_PROFILES } from '@bobai/frontmatter';

PARSER_PROFILES['fss-parse-pdf']           // 'technical'
PARSER_PROFILES['fss-parse-word']          // 'technical'
PARSER_PROFILES['fss-parse-excel']         // 'data'
PARSER_PROFILES['fss-parse-image']         // 'data'
PARSER_PROFILES['fss-parse-audio']         // 'meeting'
PARSER_PROFILES['fss-parse-video']         // 'meeting'
PARSER_PROFILES['fss-parse-email']         // 'data'
PARSER_PROFILES['fss-parse-presentation']  // 'technical'
PARSER_PROFILES['fss-parse-data']          // 'data'
PARSER_PROFILES['fss-parse-diagram']       // 'schema'
```
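
A generator name that is missing from the mapping yields `undefined` on a plain lookup. The sketch below shows one way to guard against that, assuming a hypothetical `profileFor` helper and a local excerpt of the mapping; falling back to `'data'` mirrors the documented `DEFAULTS.profile`, but whether the package itself falls back this way is an assumption.

```typescript
// Local excerpt of the mapping above, redeclared so the sketch is self-contained.
const PROFILES = new Map<string, string>([
  ['fss-parse-pdf', 'technical'],
  ['fss-parse-excel', 'data'],
  ['fss-parse-audio', 'meeting'],
  ['fss-parse-diagram', 'schema']
]);

// Hypothetical helper (not exported by @bobai/frontmatter):
// look up the profile for a generator, using a fallback for unknown names.
function profileFor(generator: string, fallback: string = 'data'): string {
  return PROFILES.get(generator) ?? fallback;
}

console.log(profileFor('fss-parse-pdf'));     // known parser: 'technical'
console.log(profileFor('fss-parse-unknown')); // unknown parser: falls back to 'data'
```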

## Balanced Fields by Parser Type

The BALANCED_FIELDS list includes 70+ fields covering all parser types:

### Universal
`word_count`, `page_count`, `character_count`, `author`, `subject`, `creator`, `created`, `modified`, `file_size`, `format`

### PDF/Word Structure
`has_tables`, `has_images`, `table_count`, `image_count`, `section_count`, `has_toc`, `has_forms`, `has_tracked_changes`, `paragraph_count`, `heading_count`

### Excel/Data
`sheet_count`, `row_count`, `column_count`, `record_count`, `format_detected`

### Image
`width`, `height`, `channels`, `has_alpha`, `color_space`, `ocr_confidence`, `has_exif`

### Audio
`duration`, `duration_seconds`, `bitrate`, `sample_rate`, `codec`, `has_transcript`, `speaker_count`, `language`

### Video
`fps`, `aspect_ratio`, `resolution`, `video_codec`, `audio_codec`

### Presentation
`slide_count`, `total_slides`, `chart_count`, `has_speaker_notes`, `has_animations`

### Email
`from`, `to`, `cc`, `sender`, `recipients`, `date`, `message_id`, `has_attachments`, `attachment_count`, `importance`, `thread_id`

### Diagram
`diagram_count`, `diagram_type`, `valid_diagrams`, `invalid_diagrams`, `node_count`, `edge_count`

## LLM Enrichment

### Getting the Prompt

```typescript
import {
  FrontmatterGenerator,
  getEnrichmentPrompt,
  getSamplePromptForDocType,
  LLMEnrichment
} from '@bobai/frontmatter';

// Get prompt for LLM enrichment
const prompt = getEnrichmentPrompt(content, 'pdf');

// Send to your LLM (`llm` stands in for whatever client you use)
const response = await llm.generate(prompt);
const enrichment: LLMEnrichment = JSON.parse(response);

// Use in frontmatter generation
// (`options`, `deterministic`, and `content` as in Quick Start)
const markdown = FrontmatterGenerator.generateMarkdown(
  options,
  deterministic,
  content,
  enrichment,
  'balanced'
);
```

### Prompt Output Format

The prompt asks the LLM to return JSON matching the `LLMEnrichment` interface:

```json
{
  "summary": "2-3 sentence description",
  "tags": ["specific", "search", "terms"],
  "category": "technical",
  "audience": "intermediate",
  "doc_purpose": "reference",
  "complexity": 3,
  "actionable": false,
  "key_technologies": ["TypeScript", "Node.js"]
}
```
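
LLM replies do not always follow the requested shape, so it can be worth checking the parsed JSON before passing it on. A minimal sketch, assuming a hypothetical `parseEnrichment` helper that is not part of `@bobai/frontmatter`; it covers only a subset of the fields and uses locally declared types:

```typescript
const AUDIENCES = ['all', 'beginner', 'intermediate', 'expert'] as const;
type Audience = typeof AUDIENCES[number];

// Local subset of LLMEnrichment, redeclared so the sketch is self-contained.
interface EnrichmentSubset {
  summary?: string;
  tags?: string[];
  audience?: Audience;
  complexity?: number;
}

// Hypothetical helper: parse an LLM reply and keep only well-formed fields.
// Returns undefined when the reply is not a JSON object.
function parseEnrichment(raw: string): EnrichmentSubset | undefined {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return undefined;
  }
  if (typeof data !== 'object' || data === null || Array.isArray(data)) {
    return undefined;
  }
  const obj = data as Record<string, unknown>;
  const out: EnrichmentSubset = {};

  const summary = obj['summary'];
  if (typeof summary === 'string') out.summary = summary;

  const tags = obj['tags'];
  if (Array.isArray(tags)) {
    out.tags = tags.filter((t): t is string => typeof t === 'string');
  }

  // Accept only the documented audience values.
  const audience = AUDIENCES.find((a) => a === obj['audience']);
  if (audience !== undefined) out.audience = audience;

  // Accept only complexity values in the documented 1-5 range.
  const complexity = obj['complexity'];
  if (typeof complexity === 'number' && complexity >= 1 && complexity <= 5) {
    out.complexity = complexity;
  }
  return out;
}

const checked = parseEnrichment(
  '{"summary":"Short summary","tags":["api",42],"audience":"guru","complexity":9}'
);
console.log(checked); // summary and the string tag survive; audience and complexity are dropped
```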

## Parser Integration Example

```typescript
// In your parser (e.g., pdf-ts/src/pdf-parser.ts)
import {
  FrontmatterGenerator,
  PARSER_PROFILES,
  FrontmatterOptions,
  DeterministicFields
} from '@bobai/frontmatter';
// Importing JSON requires "resolveJsonModule": true in tsconfig.json
import { version } from '../package.json';

// ParsedMetadata and calculateQuality are defined by the parser itself (not shown)
export function generateOutput(
  content: string,
  metadata: ParsedMetadata,
  sourcePath: string,
  mode: 'none' | 'balanced' | 'complete' = 'balanced'
): string {
  const options: FrontmatterOptions = {
    generator: 'fss-parse-pdf',
    version,
    title: metadata.title || 'Untitled',
    sourcePath,
    profile: PARSER_PROFILES['fss-parse-pdf'],
    extractionConfidence: metadata.confidence,
    contentQuality: calculateQuality(metadata)
  };

  const deterministic: DeterministicFields = {
    word_count: metadata.wordCount,
    page_count: metadata.pageCount,
    character_count: metadata.characterCount,
    has_tables: metadata.hasTables,
    has_images: metadata.hasImages,
    table_count: metadata.tableCount,
    image_count: metadata.imageCount,
    author: metadata.author,
    created: metadata.creationDate,
    modified: metadata.modificationDate,
    encrypted: metadata.isEncrypted
  };

  return FrontmatterGenerator.generateMarkdown(
    options,
    deterministic,
    content,
    undefined,  // No LLM enrichment
    mode
  );
}
```

## Constants & Defaults

```typescript
import {
  DEFAULTS,
  AUDIENCE_VALUES,
  DOC_PURPOSE_VALUES,
  PROFILE_VALUES,
  BALANCED_FIELDS
} from '@bobai/frontmatter';

// Default values
DEFAULTS.profile               // 'data'
DEFAULTS.audience              // 'all'
DEFAULTS.extractionConfidence  // 1.0
DEFAULTS.contentQuality        // 1.5
DEFAULTS.complexity            // 3

// Valid values for validation
AUDIENCE_VALUES     // ['all', 'beginner', 'intermediate', 'expert']
DOC_PURPOSE_VALUES  // ['reference', 'tutorial', ...]
PROFILE_VALUES      // ['scraped', 'research', 'technical', ...]
```

## Testing

```bash
npm test               # Run all tests
npm run test:watch     # Watch mode
npm run test:coverage  # Coverage report
```

## Building

```bash
npm run build  # Compile TypeScript to dist/
npm run clean  # Remove dist/
```

## Output Example

```yaml
---
profile: 'technical'
created: '2024-01-15T10:30:00.000Z'
generator: 'fss-parse-pdf'
version: '1.2.0'
title: 'API Documentation'
extraction_confidence: 1
content_quality: 1.5
source_file: '/docs/api.pdf'
word_count: 5000
page_count: 25
has_tables: true
has_images: true
author: 'Development Team'
summary: ''
tags: []
category: ''
---

# API Documentation

Content starts here...
```
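
Downstream tools that consume this output often need the YAML block and the body separately. The sketch below shows one way to split them, assuming the frontmatter is delimited by `---` lines as in the example; `splitFrontmatter` is a hypothetical helper, not part of `@bobai/frontmatter`, and it leaves the YAML unparsed.

```typescript
// Hypothetical helper: separate a leading '---' delimited block from the body.
// Returns an empty frontmatter string when no such block is present
// (for example, output generated with mode 'none').
function splitFrontmatter(markdown: string): { frontmatter: string; body: string } {
  const match = /^---\r?\n([\s\S]*?)\r?\n---(?:\r?\n|$)/.exec(markdown);
  if (match === null) {
    return { frontmatter: '', body: markdown };
  }
  return {
    frontmatter: match[1] ?? '',
    body: markdown.slice(match[0].length)
  };
}

const sample = [
  '---',
  "profile: 'technical'",
  "title: 'API Documentation'",
  '---',
  '',
  '# API Documentation',
  ''
].join('\n');

const parts = splitFrontmatter(sample);
console.log(parts.frontmatter); // the two YAML lines, without the '---' delimiters
console.log(parts.body.trim()); // the markdown body
```

A YAML parser of your choice can then be applied to `parts.frontmatter`.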

## License

MIT