Nicholai 68cec8090b Fix critical bugs: settings loading, UUID generation, and chunk metadata

This commit resolves several critical issues that prevented the plugin from
working correctly with Qdrant and adds essential metadata to indexed chunks.

**Settings & Configuration:**
- Fix settings initialization using deep merge instead of shallow Object.assign
  - Prevents nested settings from being lost during load
  - Ensures all default values are properly preserved
- Add orchestrator reinitialization when settings are saved
  - Ensures QdrantClient and embedding providers use updated settings
  - Fixes issue where plugin used localhost instead of saved HTTPS URL

**UUID Generation:**
- Fix generateDeterministicUUID() creating invalid UUIDs
  - Was generating 35-character UUIDs instead of proper 36-character format
  - Now correctly creates valid UUID v4 format: xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx
  - Properly generates segment 5 (12 hex chars) from combined hash data
  - Fixes segment 4 to start with 8/9/a/b per UUID spec
  - Resolves Qdrant API rejections: "value X is not a valid point ID"

**Chunk Metadata:**
- Add chunk_text field to ChunkMetadata type
  - Stores the actual text content of each chunk in Qdrant payload
  - Essential for displaying search results and content preview
- Add model name to chunk metadata
  - Populates model field with embedding provider name (e.g., "nomic-embed-text")
  - Enables tracking which model generated each embedding
  - Supports future multi-model collections

**Debug Logging:**
- Add logging for settings loading and URL tracking
- Add logging for QdrantClient initialization
- Add logging for orchestrator creation with settings

**Documentation:**
- Add CLAUDE.md with comprehensive architecture documentation
  - Build commands and development workflow
  - Core components and data processing pipeline
  - Important implementation details and debugging guide

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-10-23 11:29:48 -06:00

8.0 KiB


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is an Obsidian plugin that provides semantic search over vault contents by indexing documents into a Qdrant vector database using Ollama (local) or OpenAI (cloud) embeddings. It supports markdown and text files directly, and PDFs and images (with OCR) through the Text Extractor plugin.

Build & Development Commands

# Install dependencies
npm install

# Development mode (watch mode with hot reload)
npm run dev

# Production build
npm run build

# Type checking (same command as the production build; run before committing)
npm run build

Development Workflow

The plugin uses esbuild for bundling (configured in esbuild.config.mjs)
Entry point is main.ts; esbuild bundles it together with the src/**/*.ts files it imports into main.js
To test the plugin, symlink or copy main.js, manifest.json, and styles.css to your vault's .obsidian/plugins/obsidian-qdrant/ folder
TypeScript strict mode is enabled - all types must be properly defined

Architecture

Core Components (Event-Driven Pipeline)

The plugin follows an event-driven architecture with these key components:

IndexingOrchestrator (src/indexing/orchestrator.ts): Central coordinator
- Initializes all subsystems and manages their lifecycle
- Coordinates the indexing pipeline from file changes to vector storage
- Key initialization sequence: load manifest → get embedding dimension → initialize Qdrant collection → start file watching
FileWatcher (src/indexing/fileWatcher.ts): Monitors vault changes
- Listens to Obsidian vault events (create, modify, delete)
- Filters files based on settings (include/exclude patterns, ignored folders, max file size)
- Adds qualifying files to IndexingQueue
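
The filtering step can be pictured as a pure predicate over a file's path and size. This is a minimal sketch, not the plugin's code: the settings fields (`includeExtensions`, `ignoredFolders`, `maxFileSizeBytes`) are hypothetical names, and the real include/exclude patterns may well be globs rather than extension lists.

```typescript
// Hypothetical settings shape; the plugin's real field names may differ.
interface FilterSettings {
  includeExtensions: string[]; // e.g. ["md", "txt", "pdf"]
  ignoredFolders: string[]; // vault-relative folder paths
  maxFileSizeBytes: number;
}

interface FileInfo {
  path: string; // vault-relative, forward slashes
  size: number; // bytes
}

function shouldIndex(file: FileInfo, settings: FilterSettings): boolean {
  const dot = file.path.lastIndexOf(".");
  const ext = dot >= 0 ? file.path.slice(dot + 1).toLowerCase() : "";
  if (settings.includeExtensions.indexOf(ext) === -1) return false;
  if (file.size > settings.maxFileSizeBytes) return false;
  // Reject files that sit inside any ignored folder.
  return !settings.ignoredFolders.some((folder) =>
    file.path.startsWith(folder.replace(/\/+$/, "") + "/"),
  );
}
```

Keeping the check free of Obsidian types makes it easy to exercise outside the app.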
IndexingQueue (src/indexing/indexQueue.ts): Async job processor
- Processes files through the pipeline: Extract → Chunk → Embed → Upsert
- Manages concurrency and batch processing
- Tracks progress and error handling
FileManifest (src/indexing/manifest.ts): Index state tracker
- Stores metadata about indexed files (mtime, size, hash, chunk count)
- Determines which files need re-indexing (changed since last index)
- Identifies orphaned files (in index but deleted from vault)
- Persisted to vault's .obsidian/plugins/obsidian-qdrant/ folder
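
The two manifest decisions above (re-index or not, orphaned or not) can be sketched with a plain `Map`. The entry shape follows the fields listed for FileManifest, but the type and function names here are assumptions, and the real comparison order may differ.

```typescript
// Hypothetical entry shape based on the fields listed above.
interface ManifestEntrySketch {
  mtime: number;
  size: number;
  hash: string;
  chunkCount: number;
}

interface FileState {
  mtime: number;
  size: number;
  hash: string;
}

type ManifestSketch = Map<string, ManifestEntrySketch>;

function needsReindex(manifest: ManifestSketch, path: string, current: FileState): boolean {
  const entry = manifest.get(path);
  if (!entry) return true; // never indexed
  // Cheap checks first; only compare hashes when mtime or size changed.
  if (entry.mtime === current.mtime && entry.size === current.size) return false;
  return entry.hash !== current.hash;
}

function findOrphans(manifest: ManifestSketch, vaultPaths: Set<string>): string[] {
  const orphans: string[] = [];
  manifest.forEach((_entry, path) => {
    if (!vaultPaths.has(path)) orphans.push(path); // indexed but no longer in the vault
  });
  return orphans;
}
```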

Data Processing Pipeline

Extract → Chunk → Embed → Store

Extractors (src/extractors/): Extract content from different file types
- MarkdownExtractor: Parses frontmatter, headings, links, tags
- TextExtractor: Handles plain text and code files
- TextExtractorPlugin: Integrates with Text Extractor plugin for PDFs/images
- Each returns ExtractedContent with text and rich metadata
Chunker (src/chunking/chunker.ts): Split content into semantic chunks
- HybridChunker: Splits on markdown headings first, then by token count
- Uses simple word-based tokenizer (GPT-style would require large dependency)
- Maintains overlap between chunks for context continuity
- Each chunk includes metadata: path, title, tags, heading hierarchy, position
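
As a rough illustration of the heading-first, then word-window strategy, the sketch below splits on markdown headings and cuts each section into overlapping windows of whitespace-separated words. It is not the plugin's HybridChunker: the function names, the `Chunk` shape, and the treatment of one word as one token are all assumptions.

```typescript
interface Chunk {
  heading: string; // nearest markdown heading, "" if none
  text: string;
}

// Cut words into windows of `maxTokens` words that overlap by `overlap` words.
function windowWords(words: string[], maxTokens: number, overlap: number): string[][] {
  if (overlap >= maxTokens) throw new Error("overlap must be smaller than maxTokens");
  const windows: string[][] = [];
  const step = maxTokens - overlap;
  for (let start = 0; start < words.length; start += step) {
    windows.push(words.slice(start, start + maxTokens));
    if (start + maxTokens >= words.length) break; // last window reached the end
  }
  return windows;
}

function hybridChunk(markdown: string, maxTokens: number, overlap: number): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "";
  let body: string[] = [];
  // Emit the windows for the section collected so far, then reset it.
  const flush = (): void => {
    const words = body.join(" ").split(/\s+/).filter((w) => w.length > 0);
    for (const win of windowWords(words, maxTokens, overlap)) {
      chunks.push({ heading, text: win.join(" ") });
    }
    body = [];
  };
  for (const line of markdown.split("\n")) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush(); // close the previous section under its own heading
      heading = (match[1] ?? "").trim();
    } else {
      body.push(line);
    }
  }
  flush();
  return chunks;
}
```

For example, `hybridChunk("# A\none two three four five", 3, 1)` yields two chunks under heading `A` that share the word `three`.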
Embedding Providers (src/embeddings/): Generate vector embeddings
- Factory pattern: createEmbeddingProvider() returns appropriate provider
- OllamaEmbeddingProvider: Uses local Ollama server (default: nomic-embed-text)
- OpenAIEmbeddingProvider: Uses OpenAI API
- Batching and concurrency control for efficiency
Qdrant Client (src/qdrant/client.ts): Vector storage operations
- Wraps Qdrant REST API using Obsidian's requestUrl()
- Handles collection management (create, ensure, delete)
- Point operations (upsert, search, delete, recommend)
- Uses cosine distance for similarity
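
To make the REST operations concrete, the sketch below builds the requests for creating a cosine-distance collection, upserting points, and searching. The paths and body fields follow the Qdrant REST API; the helper names and the `HttpRequest` shape are hypothetical, and the plugin would pass something equivalent to Obsidian's requestUrl() rather than use these helpers.

```typescript
// Hypothetical request descriptor; no network call is made in this sketch.
interface HttpRequest {
  url: string;
  method: "PUT" | "POST";
  body: string; // JSON-encoded
}

interface Point {
  id: string; // UUID
  vector: number[];
  payload: Record<string, unknown>;
}

function collectionUrl(baseUrl: string, name: string): string {
  return `${baseUrl}/collections/${encodeURIComponent(name)}`;
}

function createCollectionRequest(baseUrl: string, name: string, dimension: number): HttpRequest {
  return {
    url: collectionUrl(baseUrl, name),
    method: "PUT",
    body: JSON.stringify({ vectors: { size: dimension, distance: "Cosine" } }),
  };
}

function upsertRequest(baseUrl: string, name: string, points: Point[]): HttpRequest {
  return {
    url: `${collectionUrl(baseUrl, name)}/points?wait=true`,
    method: "PUT",
    body: JSON.stringify({ points }),
  };
}

function searchRequest(baseUrl: string, name: string, vector: number[], limit: number): HttpRequest {
  return {
    url: `${collectionUrl(baseUrl, name)}/points/search`,
    method: "POST",
    body: JSON.stringify({ vector, limit, with_payload: true }),
  };
}
```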
CollectionManager (src/qdrant/collection.ts): Collection lifecycle
- Creates collections with naming: {vaultName}_{modelName} (sanitized)
- Ensures correct vector dimensions match embedding provider
- Configures sparse vectors for hybrid search (future feature)
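
The `{vaultName}_{modelName}` naming scheme could be implemented roughly as follows. The exact sanitization rule is not stated here, so the one below (lowercase, collapse anything outside `a-z0-9_-` into underscores, trim underscores) is an assumption chosen only for illustration.

```typescript
// Assumed sanitization rule; the plugin's actual rule may differ.
function sanitizeNamePart(part: string): string {
  return part
    .toLowerCase()
    .replace(/[^a-z0-9_-]+/g, "_") // collapse runs of unsupported characters
    .replace(/^_+|_+$/g, ""); // trim leading and trailing underscores
}

function collectionName(vaultName: string, modelName: string): string {
  return `${sanitizeNamePart(vaultName)}_${sanitizeNamePart(modelName)}`;
}
```

Because the model name is part of the collection name, switching embedding models points the plugin at a different collection instead of mixing vectors of different dimensions.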

Search Flow

User opens SearchModal (src/search/searchModal.ts)
Query text is embedded using the same provider as indexing
Query vector is searched against Qdrant collection
Results rendered by ResultRenderer (src/search/resultRenderer.ts)
User can click result to open file at specific chunk position

Settings & Configuration

Settings stored in vault's .obsidian/plugins/obsidian-qdrant/data.json
Defaults in src/settings.ts (especially important: DEFAULT_SETTINGS)
Critical: The orchestrator must use this.settings loaded from disk, not DEFAULT_SETTINGS
Settings tab (src/ui/settingsTab.ts) provides UI for all configuration
Validation function ensures settings are complete before initialization
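
The commit message attributes lost nested settings to a shallow `Object.assign` merge. A minimal sketch of the deep-merge alternative is below; `deepMerge` and the settings fields in the usage note are hypothetical, not the plugin's actual helper or schema.

```typescript
type PlainObject = { [key: string]: unknown };

function isPlainObject(value: unknown): value is PlainObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Recursively overlay `saved` on `defaults` without mutating either argument.
function deepMerge<T extends PlainObject>(defaults: T, saved: PlainObject): T {
  const result: PlainObject = { ...defaults };
  for (const key of Object.keys(saved)) {
    const base = result[key];
    const override = saved[key];
    result[key] =
      isPlainObject(base) && isPlainObject(override)
        ? deepMerge(base, override) // merge nested objects key by key
        : override; // primitives and arrays are replaced wholesale
  }
  return result as T;
}
```

With defaults of `{ qdrant: { url, apiKey } }` and saved data of `{ qdrant: { url } }`, a shallow merge replaces the whole `qdrant` object and drops `apiKey`, while `deepMerge` keeps the default `apiKey` and takes the saved `url`.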

Type System

All types defined in src/types.ts:

PluginSettings: Complete configuration structure
ChunkMetadata: Rich metadata stored with each vector point
SearchResult, SearchOptions: Search interface
IndexingProgress: Real-time progress tracking
FileManifestEntry: Index state per file
Interface contracts: EmbeddingProviderInterface, ExtractorInterface, etc.

Important Implementation Details

Qdrant Point IDs

Must be UUIDs (not sequential integers) for Qdrant compatibility
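Generated in the chunker by generateDeterministicUUID() from hash data, not by crypto.randomUUID(), whose random output would change on every run
Each chunk therefore gets a unique ID that persists across re-indexing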
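
The commit message describes point IDs in the format `xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx`, with `y` being one of 8/9/a/b, derived from hash data. The sketch below shows one way to produce such a 36-character ID deterministically. The FNV-1a hash, the seeds, and the function names are assumptions made for illustration; the plugin's generateDeterministicUUID() may hash differently.

```typescript
// 32-bit FNV-1a hash, returned as exactly 8 lowercase hex characters.
function fnv1aHex(input: string, seed: number): string {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("00000000" + h.toString(16)).slice(-8);
}

// Map a stable key (for example "path#chunkIndex") to a UUID-shaped string.
function deterministicUuid(key: string): string {
  // Four seeded hashes give 32 hex characters (128 bits).
  const hex = [0, 1, 2, 3].map((seed) => fnv1aHex(key, seed)).join("");
  // Force the variant nibble into 8..b, as the UUID layout requires.
  const variant = ((parseInt(hex.charAt(16), 16) & 0x3) | 0x8).toString(16);
  return [
    hex.slice(0, 8), // segment 1: 8 chars
    hex.slice(8, 12), // segment 2: 4 chars
    "4" + hex.slice(13, 16), // segment 3: version nibble + 3 chars
    variant + hex.slice(17, 20), // segment 4: variant nibble + 3 chars
    hex.slice(20, 32), // segment 5: 12 chars
  ].join("-");
}
```

The same key always maps to the same ID, so re-indexing a chunk overwrites its existing point instead of creating a duplicate.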

Settings Initialization Timing

Settings are loaded async in main.ts:onload()
Orchestrator created after settings load with new IndexingOrchestrator(this.app, this.settings)
Bug to watch for: Orchestrator components must reference the passed settings, not DEFAULT_SETTINGS
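
The timing rule can be sketched without any Obsidian types. Everything below is a simplified stand-in: `PluginSketch`, `OrchestratorSketch`, and the single `qdrantUrl` field are hypothetical, loading is shown synchronously although the real onload() is async, and the shallow spread is adequate only because this sketch has one flat field.

```typescript
interface SketchSettings {
  qdrantUrl: string;
}

const SKETCH_DEFAULTS: SketchSettings = { qdrantUrl: "http://localhost:6333" };

class OrchestratorSketch {
  private readonly settings: SketchSettings;

  constructor(settings: SketchSettings) {
    // Keep the settings that were passed in; never fall back to the defaults here.
    this.settings = settings;
  }

  getQdrantUrl(): string {
    return this.settings.qdrantUrl;
  }
}

class PluginSketch {
  settings: SketchSettings = { ...SKETCH_DEFAULTS };
  orchestrator: OrchestratorSketch | null = null;

  // `stored` stands in for whatever was read from data.json.
  load(stored: Partial<SketchSettings>): void {
    this.settings = { ...SKETCH_DEFAULTS, ...stored };
    // Create the orchestrator only after the settings are final.
    this.orchestrator = new OrchestratorSketch(this.settings);
  }

  save(updated: SketchSettings): void {
    this.settings = updated;
    // Rebuild so that clients pick up the new values.
    this.orchestrator = new OrchestratorSketch(this.settings);
  }
}
```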

File Change Handling

Create/Modify: Add to queue with 'update' action
Delete: Add to queue with 'delete' action (removes points from Qdrant)
Rename: Treated as delete old + create new
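
The mapping from vault events to queue jobs is small enough to express as a single function. This is a sketch under assumed types: `VaultEvent` and `QueueJob` are hypothetical shapes, and only the 'update' and 'delete' action names come from the rules above.

```typescript
type QueueAction = "update" | "delete";

interface QueueJob {
  path: string;
  action: QueueAction;
}

// Hypothetical event shape; a rename carries both the old and the new path.
type VaultEvent =
  | { type: "create" | "modify" | "delete"; path: string }
  | { type: "rename"; oldPath: string; path: string };

function toJobs(event: VaultEvent): QueueJob[] {
  switch (event.type) {
    case "create":
    case "modify":
      return [{ path: event.path, action: "update" }];
    case "delete":
      return [{ path: event.path, action: "delete" }];
    case "rename":
      // Delete the points stored under the old path, then index the new path.
      return [
        { path: event.oldPath, action: "delete" },
        { path: event.path, action: "update" },
      ];
  }
}
```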

Error Handling

Connection failures are caught and surfaced via status bar
File extraction errors are logged but don't stop the queue
Progress callback and error callback allow UI updates

Text Extractor Plugin Integration

Optional dependency detected at runtime
Falls back gracefully if not installed
Uses Obsidian plugin API to access text-extractor methods

Common Development Patterns

Adding a New Extractor

Implement ExtractorInterface in src/extractors/
Add to ExtractorManager.initializeExtractors() priority list
Update getExtractorStatus() to report the new extractor

Adding a New Embedding Provider

Extend BaseEmbeddingProvider in src/embeddings/
Implement embed(), getDimension(), getName(), testConnection()
Add to createEmbeddingProvider() factory switch
Add provider enum value to src/types.ts:EmbeddingProvider
Add settings interface and update PluginSettings
Add UI controls in src/ui/settingsTab.ts
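
The first three steps might look like the sketch below. The method signatures are guesses (for instance, getDimension() may be async in the real code), `ToyEmbeddingProvider` is a made-up provider that needs no server, and the factory is a stand-in for createEmbeddingProvider() rather than its real implementation.

```typescript
// Assumed shape of the base class; the real BaseEmbeddingProvider may differ.
abstract class BaseEmbeddingProviderSketch {
  abstract embed(texts: string[]): Promise<number[][]>;
  abstract getDimension(): number;
  abstract getName(): string;
  abstract testConnection(): Promise<boolean>;
}

// Made-up provider: folds character codes into a fixed-size vector.
class ToyEmbeddingProvider extends BaseEmbeddingProviderSketch {
  getName(): string {
    return "toy";
  }

  getDimension(): number {
    return 4;
  }

  async embed(texts: string[]): Promise<number[][]> {
    const dim = this.getDimension();
    return texts.map((text) => {
      const vector: number[] = new Array<number>(dim).fill(0);
      for (let i = 0; i < text.length; i++) {
        vector[i % dim] = (vector[i % dim] ?? 0) + text.charCodeAt(i);
      }
      return vector;
    });
  }

  async testConnection(): Promise<boolean> {
    return true; // nothing to connect to
  }
}

type ProviderKind = "ollama" | "openai" | "toy";

// Stand-in for the factory switch; only the toy provider is wired up here.
function createProviderSketch(kind: ProviderKind): BaseEmbeddingProviderSketch {
  switch (kind) {
    case "toy":
      return new ToyEmbeddingProvider();
    default:
      throw new Error(`Provider "${kind}" is not part of this sketch`);
  }
}
```

The remaining steps (enum value, settings interface, settings tab controls) depend on the plugin's own types and are not sketched.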

Debugging Indexing Issues

Check console logs: orchestrator logs initialization steps
Verify settings: this.settings should match what's in data.json
Check manifest: FileManifest tracks what's been indexed
Test connections: Use settings tab "Test Connection" buttons
Monitor status bar: Shows progress and errors

Testing & Validation

There are no automated tests, so use this manual testing workflow:

Build the plugin: npm run build
Copy to test vault's plugin folder
Configure settings (Qdrant URL, embedding provider)
Test connection buttons in settings
Index a single file via command palette
Verify in Qdrant dashboard that points were created
Run semantic search and verify results
Test file modifications and deletions

Known Issues & Limitations

Graph visualization not yet implemented (UI placeholder exists)
Hybrid search (sparse + dense) configured but not exposed in search UI
Simple word-based tokenizer doesn't match exact GPT token counts
No rate limiting on API calls (relies on provider batch/concurrency settings)
File manifest doesn't handle vault renames (would require full reindex)
