This commit resolves several critical issues that prevented the plugin from working correctly with Qdrant and adds essential metadata to indexed chunks.

**Settings & Configuration:**
- Fix settings initialization using deep merge instead of shallow Object.assign
  - Prevents nested settings from being lost during load
  - Ensures all default values are properly preserved
- Add orchestrator reinitialization when settings are saved
  - Ensures QdrantClient and embedding providers use updated settings
  - Fixes issue where plugin used localhost instead of saved HTTPS URL

**UUID Generation:**
- Fix generateDeterministicUUID() creating invalid UUIDs
  - Was generating 35-character UUIDs instead of proper 36-character format
  - Now correctly creates valid UUID v4 format: xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx
  - Properly generates segment 5 (12 hex chars) from combined hash data
  - Fixes segment 4 to start with 8/9/a/b per UUID spec
  - Resolves Qdrant API rejections: "value X is not a valid point ID"

**Chunk Metadata:**
- Add chunk_text field to ChunkMetadata type
  - Stores the actual text content of each chunk in Qdrant payload
  - Essential for displaying search results and content preview
- Add model name to chunk metadata
  - Populates model field with embedding provider name (e.g., "nomic-embed-text")
  - Enables tracking which model generated each embedding
  - Supports future multi-model collections

**Debug Logging:**
- Add logging for settings loading and URL tracking
- Add logging for QdrantClient initialization
- Add logging for orchestrator creation with settings

**Documentation:**
- Add CLAUDE.md with comprehensive architecture documentation
  - Build commands and development workflow
  - Core components and data processing pipeline
  - Important implementation details and debugging guide

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

187 lines
8.0 KiB
Markdown
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is an Obsidian plugin that provides semantic search over vault contents by indexing documents into a Qdrant vector database using Ollama (local) or OpenAI (cloud) embeddings. It supports markdown, text files, PDFs, and images with OCR through the Text Extractor plugin.

## Build & Development Commands

```bash
# Install dependencies
npm install

# Development mode (watch mode with hot reload)
npm run dev

# Production build
npm run build

# Type checking (run before committing)
npm run build
```

### Development Workflow

1. The plugin uses esbuild for bundling (configured in `esbuild.config.mjs`)
2. The entry point is `main.ts`; esbuild bundles it and all `src/**/*.ts` files into `main.js`
3. To test the plugin, symlink or copy `main.js`, `manifest.json`, and `styles.css` to your vault's `.obsidian/plugins/obsidian-qdrant/` folder
4. TypeScript strict mode is enabled, so all types must be properly defined

## Architecture

### Core Components (Event-Driven Pipeline)

The plugin follows an event-driven architecture with these key components:

1. **IndexingOrchestrator** (`src/indexing/orchestrator.ts`): Central coordinator
   - Initializes all subsystems and manages their lifecycle
   - Coordinates the indexing pipeline from file changes to vector storage
   - Key initialization sequence: load manifest → get embedding dimension → initialize Qdrant collection → start file watching

2. **FileWatcher** (`src/indexing/fileWatcher.ts`): Monitors vault changes
   - Listens to Obsidian vault events (create, modify, delete)
   - Filters files based on settings (include/exclude patterns, ignored folders, max file size)
   - Adds qualifying files to IndexingQueue

3. **IndexingQueue** (`src/indexing/indexQueue.ts`): Async job processor
   - Processes files through the pipeline: Extract → Chunk → Embed → Upsert
   - Manages concurrency and batch processing
   - Tracks progress and error handling

4. **FileManifest** (`src/indexing/manifest.ts`): Index state tracker
   - Stores metadata about indexed files (mtime, size, hash, chunk count)
   - Determines which files need re-indexing (changed since last index)
   - Identifies orphaned files (in index but deleted from vault)
   - Persisted to vault's `.obsidian/plugins/obsidian-qdrant/` folder

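The manifest's change detection and orphan detection can be sketched as two small pure functions. This is an illustrative sketch only: the `ManifestEntry` and `VaultFile` shapes and the function names are assumptions, not the plugin's actual `FileManifest` API, and the comparison uses only mtime and size.

```typescript
// Hypothetical manifest entry; the real FileManifestEntry may differ.
interface ManifestEntry {
  mtime: number;
  size: number;
  hash: string;
  chunkCount: number;
}

// Hypothetical minimal description of a file currently in the vault.
interface VaultFile {
  path: string;
  mtime: number;
  size: number;
}

// A file needs (re-)indexing when it is unknown to the manifest
// or its mtime or size differs from the recorded values.
function needsReindex(file: VaultFile, manifest: Map<string, ManifestEntry>): boolean {
  const entry = manifest.get(file.path);
  if (!entry) return true;
  return entry.mtime !== file.mtime || entry.size !== file.size;
}

// Orphans are paths recorded in the manifest that no longer exist in the vault.
function findOrphans(files: VaultFile[], manifest: Map<string, ManifestEntry>): string[] {
  const present = new Set(files.map((f) => f.path));
  return Array.from(manifest.keys()).filter((path) => !present.has(path));
}
```

Comparing mtime and size first is cheap; a content hash could be added as a tie-breaker when timestamps are unreliable.
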
### Data Processing Pipeline

**Extract** → **Chunk** → **Embed** → **Store**

1. **Extractors** (`src/extractors/`): Extract content from different file types
   - `MarkdownExtractor`: Parses frontmatter, headings, links, tags
   - `TextExtractor`: Handles plain text and code files
   - `TextExtractorPlugin`: Integrates with Text Extractor plugin for PDFs/images
   - Each returns `ExtractedContent` with text and rich metadata

2. **Chunker** (`src/chunking/chunker.ts`): Splits content into semantic chunks
   - `HybridChunker`: Splits on markdown headings first, then by token count
   - Uses a simple word-based tokenizer (a GPT-style tokenizer would require a large dependency)
   - Maintains overlap between chunks for context continuity
   - Each chunk includes metadata: path, title, tags, heading hierarchy, position

3. **Embedding Providers** (`src/embeddings/`): Generate vector embeddings
   - Factory pattern: `createEmbeddingProvider()` returns appropriate provider
   - `OllamaEmbeddingProvider`: Uses local Ollama server (default: nomic-embed-text)
   - `OpenAIEmbeddingProvider`: Uses OpenAI API
   - Batching and concurrency control for efficiency

4. **Qdrant Client** (`src/qdrant/client.ts`): Vector storage operations
   - Wraps Qdrant REST API using Obsidian's `requestUrl()`
   - Handles collection management (create, ensure, delete)
   - Point operations (upsert, search, delete, recommend)
   - Uses cosine distance for similarity

5. **CollectionManager** (`src/qdrant/collection.ts`): Collection lifecycle
   - Creates collections with naming: `{vaultName}_{modelName}` (sanitized)
   - Ensures correct vector dimensions match embedding provider
   - Configures sparse vectors for hybrid search (future feature)

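The hybrid chunking strategy described above (split on headings, then window by word count with overlap) can be sketched as follows. This is a minimal illustration under stated assumptions: the `Chunk` shape, the function names, and the parameters are hypothetical and not the plugin's `HybridChunker` API, and the sketch does not special-case `#` lines inside fenced code.

```typescript
// Hypothetical chunk shape; the real chunk metadata is richer.
interface Chunk {
  heading: string; // nearest markdown heading, "" if none
  text: string;
}

// Word-based stand-in for a tokenizer: one whitespace-separated word = one token.
function tokenize(text: string): string[] {
  return text.split(/\s+/).filter((w) => w.length > 0);
}

// Split markdown into sections at heading lines, then window each section into
// chunks of at most maxTokens words, repeating `overlap` words between chunks.
function hybridChunk(markdown: string, maxTokens: number, overlap: number): Chunk[] {
  if (overlap >= maxTokens) throw new Error("overlap must be smaller than maxTokens");
  const sections: { heading: string; lines: string[] }[] = [{ heading: "", lines: [] }];
  for (const line of markdown.split("\n")) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      sections.push({ heading: match[1].trim(), lines: [] });
    } else {
      sections[sections.length - 1].lines.push(line);
    }
  }
  const chunks: Chunk[] = [];
  const step = maxTokens - overlap;
  for (const section of sections) {
    const tokens = tokenize(section.lines.join(" "));
    for (let start = 0; start < tokens.length; start += step) {
      chunks.push({
        heading: section.heading,
        text: tokens.slice(start, start + maxTokens).join(" "),
      });
      // Stop once this window reached the end of the section.
      if (start + maxTokens >= tokens.length) break;
    }
  }
  return chunks;
}
```

With `maxTokens = 3` and `overlap = 1`, a five-word section yields two chunks that share their boundary word, which is the context continuity the overlap is meant to provide.
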
### Search Flow

1. User opens SearchModal (`src/search/searchModal.ts`)
2. Query text is embedded using the same provider as indexing
3. Query vector is searched against Qdrant collection
4. Results rendered by ResultRenderer (`src/search/resultRenderer.ts`)
5. User can click result to open file at specific chunk position

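Steps 3 and 4 can be sketched as a request builder plus a result mapper. The endpoint and body follow Qdrant's REST points search API; everything else is an assumption for illustration: the function names, the `HttpRequest` shape (modeled loosely on what `requestUrl()` accepts), and the payload fields `path` and `chunk_text`.

```typescript
// Request description that could be handed to an HTTP helper such as requestUrl().
interface HttpRequest {
  url: string;
  method: "POST";
  contentType: string;
  body: string;
}

// Build a Qdrant vector search request for an already-embedded query.
function buildSearchRequest(
  baseUrl: string,
  collection: string,
  queryVector: number[],
  limit: number
): HttpRequest {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${root}/collections/${encodeURIComponent(collection)}/points/search`,
    method: "POST",
    contentType: "application/json",
    body: JSON.stringify({ vector: queryVector, limit, with_payload: true }),
  };
}

// Hypothetical subset of a scored point returned by the search.
interface ScoredPoint {
  id: string;
  score: number;
  payload?: { path?: string; chunk_text?: string };
}

// Hypothetical shape consumed by a result renderer.
interface SearchHit {
  path: string;
  score: number;
  preview: string;
}

// Sort by descending score and keep a short text preview for display.
function toSearchHits(points: ScoredPoint[]): SearchHit[] {
  return points
    .slice()
    .sort((a, b) => b.score - a.score)
    .map((p) => ({
      path: p.payload?.path ?? "",
      score: p.score,
      preview: (p.payload?.chunk_text ?? "").slice(0, 80),
    }));
}
```

The query must be embedded with the same model that produced the stored vectors; otherwise the similarity scores are not meaningful.
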
### Settings & Configuration

- Settings stored in vault's `.obsidian/plugins/obsidian-qdrant/data.json`
- Defaults in `src/settings.ts` (especially important: DEFAULT_SETTINGS)
- **Critical**: The orchestrator must use `this.settings` loaded from disk, not DEFAULT_SETTINGS
- Settings tab (`src/ui/settingsTab.ts`) provides UI for all configuration
- Validation function ensures settings are complete before initialization

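When saved settings are layered over defaults, nested objects need a recursive merge; a shallow `Object.assign` replaces a whole nested object and drops any default keys missing from the saved copy. The sketch below illustrates the idea with a hypothetical `deepMerge` helper and made-up setting names; it is not the plugin's actual implementation.

```typescript
type PlainObject = { [key: string]: unknown };

function isPlainObject(value: unknown): value is PlainObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Recursively merge `saved` over `defaults` without mutating either argument.
function deepMerge<T extends PlainObject>(defaults: T, saved: PlainObject): T {
  const result: PlainObject = { ...defaults };
  for (const key of Object.keys(saved)) {
    const base = result[key];
    const override = saved[key];
    if (isPlainObject(base) && isPlainObject(override)) {
      result[key] = deepMerge(base, override); // merge nested objects key by key
    } else if (override !== undefined) {
      result[key] = override; // primitives and arrays are replaced wholesale
    }
  }
  return result as T;
}

// Hypothetical settings shapes, for illustration only.
const exampleDefaults = {
  qdrant: { url: "http://localhost:6333", apiKey: "" },
  chunking: { maxTokens: 512 },
};
const exampleSaved = { qdrant: { url: "https://qdrant.example.com" } };

// The saved URL wins, while qdrant.apiKey and chunking keep their defaults.
const exampleMerged = deepMerge(exampleDefaults, exampleSaved);
```
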
### Type System

All types defined in `src/types.ts`:
- `PluginSettings`: Complete configuration structure
- `ChunkMetadata`: Rich metadata stored with each vector point
- `SearchResult`, `SearchOptions`: Search interface
- `IndexingProgress`: Real-time progress tracking
- `FileManifestEntry`: Index state per file
- Interface contracts: `EmbeddingProviderInterface`, `ExtractorInterface`, etc.

## Important Implementation Details

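### Qdrant Point IDs
- Qdrant accepts only unsigned integers or UUIDs as point IDs; this plugin uses UUIDs
- Generated in the chunker by `generateDeterministicUUID()` from hash data, not by `crypto.randomUUID()`, whose random output would change on every run
- Each chunk therefore gets a unique ID that persists across re-indexing
- IDs must be well-formed 36-character UUIDs; Qdrant rejects malformed ones with "value X is not a valid point ID"
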
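An ID that persists across re-indexing has to be derived from stable inputs rather than from randomness. The sketch below derives a UUID-shaped string from a file path and chunk index. It is only an illustration: the function names, the choice of key, and the seeded FNV-1a hash are assumptions and not the plugin's actual algorithm, and FNV-1a is not collision-resistant, so a cryptographic hash would be a safer choice in practice.

```typescript
// FNV-1a 32-bit hash of a string, with a seed mixed into the offset basis.
function fnv1a(input: string, seed: number): number {
  let hash = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Derive a stable UUID-formatted ID (8-4-4-4-12 hex digits) from path + chunk index.
function deterministicUuid(path: string, chunkIndex: number): string {
  const key = `${path}#${chunkIndex}`;
  // Four seeded 32-bit hashes give 128 bits = 32 hex characters.
  const chars = [1, 2, 3, 4]
    .map((seed) => fnv1a(key, seed).toString(16).padStart(8, "0"))
    .join("")
    .split("");
  chars[12] = "4"; // version nibble: first digit of segment 3
  chars[16] = "89ab"[parseInt(chars[16], 16) & 0x3]; // variant nibble: first digit of segment 4
  const hex = chars.join("");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20, 32),
  ].join("-");
}
```

The segment lengths sum to 32 hex digits plus 4 hyphens, so the result is always 36 characters long.
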
### Settings Initialization Timing
- Settings are loaded async in `main.ts:onload()`
- Orchestrator created after settings load with `new IndexingOrchestrator(this.app, this.settings)`
- **Bug to watch for**: Orchestrator components must reference the passed settings, not DEFAULT_SETTINGS

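The ordering constraint can be sketched with stand-in classes: load settings first, build the orchestrator from the loaded object, and rebuild it whenever settings are saved. All names here (`PluginSketch`, `OrchestratorSketch`, `SketchSettings`) are hypothetical, the settings are flat for brevity, and no Obsidian APIs are used.

```typescript
// Hypothetical flat settings; the real PluginSettings is nested.
interface SketchSettings {
  qdrantUrl: string;
}

const SKETCH_DEFAULTS: SketchSettings = { qdrantUrl: "http://localhost:6333" };

// Stand-in orchestrator that keeps the settings object it was given.
class OrchestratorSketch {
  settings: SketchSettings;
  constructor(settings: SketchSettings) {
    this.settings = settings;
  }
}

class PluginSketch {
  settings: SketchSettings = SKETCH_DEFAULTS;
  orchestrator: OrchestratorSketch | null = null;
  stored: Partial<SketchSettings>;

  constructor(stored: Partial<SketchSettings>) {
    this.stored = stored; // simulates the contents of data.json
  }

  // Simulates an asynchronous read of persisted settings.
  async loadData(): Promise<Partial<SketchSettings>> {
    return this.stored;
  }

  async onload(): Promise<void> {
    // 1. Load settings before anything that depends on them.
    this.settings = { ...SKETCH_DEFAULTS, ...(await this.loadData()) };
    // 2. Only then create the orchestrator, passing the loaded settings.
    this.orchestrator = new OrchestratorSketch(this.settings);
  }

  async saveSettings(next: SketchSettings): Promise<void> {
    this.settings = next;
    this.stored = next;
    // 3. Rebuild so downstream clients pick up the new values.
    this.orchestrator = new OrchestratorSketch(this.settings);
  }
}
```
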
### File Change Handling
- Create/Modify: Add to queue with 'update' action
- Delete: Add to queue with 'delete' action (removes points from Qdrant)
- Rename: Treated as delete old + create new

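The mapping from vault events to queue actions can be written as one pure function. The `VaultEvent` and `QueueJob` types and the function name are assumptions for illustration; only the 'update' and 'delete' action names come from the list above.

```typescript
// Hypothetical event shape; rename events also carry the previous path.
type VaultEvent =
  | { type: "create" | "modify" | "delete"; path: string }
  | { type: "rename"; path: string; oldPath: string };

interface QueueJob {
  action: "update" | "delete";
  path: string;
}

// Translate one vault event into the queue jobs it implies.
function toQueueJobs(event: VaultEvent): QueueJob[] {
  switch (event.type) {
    case "create":
    case "modify":
      return [{ action: "update", path: event.path }];
    case "delete":
      return [{ action: "delete", path: event.path }];
    case "rename":
      // A rename is handled as "delete old path" followed by "index new path".
      return [
        { action: "delete", path: event.oldPath },
        { action: "update", path: event.path },
      ];
  }
}
```
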
### Error Handling
- Connection failures are caught and surfaced via status bar
- File extraction errors are logged but don't stop the queue
- Progress callback and error callback allow UI updates

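The "log and continue" behavior can be sketched as a loop that catches per-file failures and reports them through callbacks. The sketch is synchronous for brevity, whereas the real queue is asynchronous, and `processQueue`, `QueueCallbacks`, and the `extract` parameter are hypothetical names.

```typescript
interface QueueCallbacks {
  onProgress: (done: number, total: number) => void;
  onError: (path: string, error: Error) => void;
}

// Process every path; a failure in one file is reported but does not stop the rest.
function processQueue(
  paths: string[],
  extract: (path: string) => string,
  callbacks: QueueCallbacks
): string[] {
  const indexed: string[] = [];
  paths.forEach((path, i) => {
    try {
      extract(path);
      indexed.push(path);
    } catch (err) {
      callbacks.onError(path, err instanceof Error ? err : new Error(String(err)));
    }
    // Progress advances for failed files too, so the UI reaches 100%.
    callbacks.onProgress(i + 1, paths.length);
  });
  return indexed;
}
```
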
### Text Extractor Plugin Integration
- Optional dependency detected at runtime
- Falls back gracefully if not installed
- Uses Obsidian plugin API to access text-extractor methods

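Runtime detection with graceful fallback can be sketched as a lookup that returns `null` when the plugin is absent. The lookup path `app.plugins.plugins["text-extractor"].api` and the methods `extractText` and `canFileBeExtracted` reflect how the Text Extractor plugin is commonly accessed, but treat them as assumptions to verify against that plugin's documentation; `AppLike` is a structural stand-in for Obsidian's `App`, and the helper names are hypothetical.

```typescript
// Assumed subset of the API exposed by the Text Extractor plugin.
interface TextExtractorApi {
  extractText(file: unknown): Promise<string>;
  canFileBeExtracted(filePath: string): boolean;
}

// Structural stand-in for the parts of the Obsidian app object used here.
interface AppLike {
  plugins?: {
    plugins?: { [id: string]: { api?: TextExtractorApi } | undefined };
  };
}

// Return the Text Extractor API if the plugin is installed and enabled, else null.
function getTextExtractor(app: AppLike): TextExtractorApi | null {
  return app.plugins?.plugins?.["text-extractor"]?.api ?? null;
}

// Fall back gracefully: without the plugin, no PDF or image is extractable.
function canExtractViaPlugin(app: AppLike, filePath: string): boolean {
  const api = getTextExtractor(app);
  return api !== null && api.canFileBeExtracted(filePath);
}
```
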
## Common Development Patterns

### Adding a New Extractor
1. Implement `ExtractorInterface` in `src/extractors/`
2. Add to `ExtractorManager.initializeExtractors()` priority list
3. Update `getExtractorStatus()` to report the new extractor

### Adding a New Embedding Provider
1. Extend `BaseEmbeddingProvider` in `src/embeddings/`
2. Implement `embed()`, `getDimension()`, `getName()`, `testConnection()`
3. Add to `createEmbeddingProvider()` factory switch
4. Add provider enum value to `src/types.ts:EmbeddingProvider`
5. Add settings interface and update `PluginSettings`
6. Add UI controls in `src/ui/settingsTab.ts`

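Steps 1 to 3 can be sketched with a toy provider. The four method names come from the list above, but their signatures are assumptions; the sketch uses an interface as a stand-in for `BaseEmbeddingProvider`, and `FakeEmbeddingProvider` and `createProviderSketch` are hypothetical names that produce deterministic vectors rather than real embeddings.

```typescript
// Assumed provider contract; the real base class may use different signatures.
interface EmbeddingProviderSketch {
  embed(texts: string[]): Promise<number[][]>;
  getDimension(): number;
  getName(): string;
  testConnection(): Promise<boolean>;
}

// Toy provider: folds character codes into a fixed-size vector.
class FakeEmbeddingProvider implements EmbeddingProviderSketch {
  dimension: number;

  constructor(dimension: number) {
    this.dimension = dimension;
  }

  async embed(texts: string[]): Promise<number[][]> {
    return texts.map((text) => {
      const vector = new Array<number>(this.dimension).fill(0);
      for (let i = 0; i < text.length; i++) {
        vector[i % this.dimension] += text.charCodeAt(i);
      }
      return vector;
    });
  }

  getDimension(): number {
    return this.dimension;
  }

  getName(): string {
    return "fake";
  }

  async testConnection(): Promise<boolean> {
    return true; // nothing to connect to
  }
}

// Factory switch: each new provider gets its own case.
function createProviderSketch(kind: string): EmbeddingProviderSketch {
  switch (kind) {
    case "fake":
      return new FakeEmbeddingProvider(8);
    default:
      throw new Error(`Unknown embedding provider: ${kind}`);
  }
}
```

Because the collection's vector size is derived from the provider, `getDimension()` must agree with the length of the vectors that `embed()` returns.
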
### Debugging Indexing Issues
1. Check console logs: orchestrator logs initialization steps
2. Verify settings: `this.settings` should match what's in data.json
3. Check manifest: FileManifest tracks what's been indexed
4. Test connections: Use settings tab "Test Connection" buttons
5. Monitor status bar: Shows progress and errors

## Testing & Validation

There are no automated tests. The manual testing workflow is:

1. Build the plugin: `npm run build`
2. Copy to test vault's plugin folder
3. Configure settings (Qdrant URL, embedding provider)
4. Test connection buttons in settings
5. Index a single file via command palette
6. Verify in Qdrant dashboard that points were created
7. Run semantic search and verify results
8. Test file modifications and deletions

## Known Issues & Limitations

- Graph visualization not yet implemented (UI placeholder exists)
- Hybrid search (sparse + dense) configured but not exposed in search UI
- Simple word-based tokenizer doesn't match exact GPT token counts
- No rate limiting on API calls (relies on provider batch/concurrency settings)
- File manifest doesn't handle vault renames (would require full reindex)