obsidian-qdrant/CLAUDE.md
Nicholai 68cec8090b Fix critical bugs: settings loading, UUID generation, and chunk metadata
This commit resolves several critical issues that prevented the plugin from
working correctly with Qdrant and adds essential metadata to indexed chunks.

**Settings & Configuration:**
- Fix settings initialization using deep merge instead of shallow Object.assign
  - Prevents nested settings from being lost during load
  - Ensures all default values are properly preserved
- Add orchestrator reinitialization when settings are saved
  - Ensures QdrantClient and embedding providers use updated settings
  - Fixes issue where plugin used localhost instead of saved HTTPS URL

**UUID Generation:**
- Fix generateDeterministicUUID() creating invalid UUIDs
  - Was generating 35-character UUIDs instead of proper 36-character format
  - Now correctly creates valid UUID v4 format: xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx
  - Properly generates segment 5 (12 hex chars) from combined hash data
  - Fixes segment 4 to start with 8/9/a/b per UUID spec
  - Resolves Qdrant API rejections: "value X is not a valid point ID"

**Chunk Metadata:**
- Add chunk_text field to ChunkMetadata type
  - Stores the actual text content of each chunk in Qdrant payload
  - Essential for displaying search results and content preview
- Add model name to chunk metadata
  - Populates model field with embedding provider name (e.g., "nomic-embed-text")
  - Enables tracking which model generated each embedding
  - Supports future multi-model collections

**Debug Logging:**
- Add logging for settings loading and URL tracking
- Add logging for QdrantClient initialization
- Add logging for orchestrator creation with settings

**Documentation:**
- Add CLAUDE.md with comprehensive architecture documentation
  - Build commands and development workflow
  - Core components and data processing pipeline
  - Important implementation details and debugging guide

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-23 11:29:48 -06:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is an Obsidian plugin that provides semantic search over vault contents by indexing documents into a Qdrant vector database using Ollama (local) or OpenAI (cloud) embeddings. It supports markdown files, text files, PDFs, and images; PDFs and images are handled via OCR through the Text Extractor plugin.

Build & Development Commands

# Install dependencies
npm install

# Development mode (watch mode with hot reload)
npm run dev

# Production build
npm run build

# Type checking (run before committing)
npm run build

Development Workflow

  1. The plugin uses esbuild for bundling (configured in esbuild.config.mjs)
  2. The entry point is main.ts; esbuild bundles it together with the src/**/*.ts modules it imports into main.js
  3. To test the plugin, symlink or copy main.js, manifest.json, and styles.css to your vault's .obsidian/plugins/obsidian-qdrant/ folder
  4. TypeScript strict mode is enabled - all types must be properly defined

Architecture

Core Components (Event-Driven Pipeline)

The plugin follows an event-driven architecture with these key components:

  1. IndexingOrchestrator (src/indexing/orchestrator.ts): Central coordinator

    • Initializes all subsystems and manages their lifecycle
    • Coordinates the indexing pipeline from file changes to vector storage
    • Key initialization sequence: load manifest → get embedding dimension → initialize Qdrant collection → start file watching
  2. FileWatcher (src/indexing/fileWatcher.ts): Monitors vault changes

    • Listens to Obsidian vault events (create, modify, delete)
    • Filters files based on settings (include/exclude patterns, ignored folders, max file size)
    • Adds qualifying files to IndexingQueue
  3. IndexingQueue (src/indexing/indexQueue.ts): Async job processor

    • Processes files through the pipeline: Extract → Chunk → Embed → Upsert
    • Manages concurrency and batch processing
    • Tracks progress and error handling
  4. FileManifest (src/indexing/manifest.ts): Index state tracker

    • Stores metadata about indexed files (mtime, size, hash, chunk count)
    • Determines which files need re-indexing (changed since last index)
    • Identifies orphaned files (in index but deleted from vault)
    • Persisted to vault's .obsidian/plugins/obsidian-qdrant/ folder
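
The re-indexing and orphan checks described for FileManifest can be sketched roughly as follows. The ManifestEntry shape and the needsReindex and findOrphans helpers are hypothetical names chosen for illustration; the actual logic lives in src/indexing/manifest.ts and may compare different fields (for example the content hash).

```typescript
// Hypothetical entry shape, based on the fields listed above.
interface ManifestEntry {
  mtime: number;
  size: number;
  hash: string;
  chunkCount: number;
}

interface FileStat {
  mtime: number;
  size: number;
}

// A file needs re-indexing if it was never indexed or its mtime/size changed.
function needsReindex(
  manifest: Map<string, ManifestEntry>,
  path: string,
  stat: FileStat,
): boolean {
  const entry = manifest.get(path);
  if (entry === undefined) return true;
  return entry.mtime !== stat.mtime || entry.size !== stat.size;
}

// Orphans are paths still in the manifest but no longer present in the vault.
function findOrphans(
  manifest: Map<string, ManifestEntry>,
  vaultPaths: Set<string>,
): string[] {
  return Array.from(manifest.keys()).filter((path) => !vaultPaths.has(path));
}
```

Comparing mtime and size first is cheap; a hash comparison could be added as a second step when those two fields are inconclusive.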

Data Processing Pipeline

Extract → Chunk → Embed → Store

  1. Extractors (src/extractors/): Extract content from different file types

    • MarkdownExtractor: Parses frontmatter, headings, links, tags
    • TextExtractor: Handles plain text and code files
    • TextExtractorPlugin: Integrates with Text Extractor plugin for PDFs/images
    • Each returns ExtractedContent with text and rich metadata
  2. Chunker (src/chunking/chunker.ts): Split content into semantic chunks

    • HybridChunker: Splits on markdown headings first, then by token count
    • Uses a simple word-based tokenizer (a GPT-style tokenizer would require a large dependency)
    • Maintains overlap between chunks for context continuity
    • Each chunk includes metadata: path, title, tags, heading hierarchy, position
  3. Embedding Providers (src/embeddings/): Generate vector embeddings

    • Factory pattern: createEmbeddingProvider() returns appropriate provider
    • OllamaEmbeddingProvider: Uses local Ollama server (default: nomic-embed-text)
    • OpenAIEmbeddingProvider: Uses OpenAI API
    • Batching and concurrency control for efficiency
  4. Qdrant Client (src/qdrant/client.ts): Vector storage operations

    • Wraps Qdrant REST API using Obsidian's requestUrl()
    • Handles collection management (create, ensure, delete)
    • Point operations (upsert, search, delete, recommend)
    • Uses cosine distance for similarity
  5. CollectionManager (src/qdrant/collection.ts): Collection lifecycle

    • Creates collections with naming: {vaultName}_{modelName} (sanitized)
    • Ensures correct vector dimensions match embedding provider
    • Configures sparse vectors for hybrid search (future feature)
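
The heading-then-token-count strategy described for HybridChunker can be illustrated with a minimal sketch. Everything here (the Chunk shape, splitByTokens, hybridChunk, and the whitespace tokenizer) is an assumption made for illustration, not the code in src/chunking/chunker.ts.

```typescript
// Hypothetical, reduced chunk shape; the real chunks carry richer metadata.
interface Chunk {
  heading: string;
  text: string;
}

// Split words into windows of `maxTokens`, repeating `overlap` words between neighbours.
function splitByTokens(text: string, maxTokens: number, overlap: number): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = Math.max(1, maxTokens - overlap);
  const windows: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    windows.push(words.slice(start, start + maxTokens).join(" "));
    if (start + maxTokens >= words.length) break; // last window reached the end
  }
  return windows;
}

// Split on markdown headings first, then apply the token window inside each section.
function hybridChunk(markdown: string, maxTokens: number, overlap: number): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "";
  let body: string[] = [];
  const flush = (): void => {
    for (const text of splitByTokens(body.join("\n"), maxTokens, overlap)) {
      chunks.push({ heading, text });
    }
    body = [];
  };
  for (const line of markdown.split("\n")) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush(); // close the previous section under its own heading
      heading = match[1] ?? "";
    } else {
      body.push(line);
    }
  }
  flush();
  return chunks;
}
```

With maxTokens = 3 and overlap = 1, a five-word section yields two chunks that share one word, which is the context continuity the overlap is meant to provide.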

Search Flow

  1. User opens SearchModal (src/search/searchModal.ts)
  2. Query text is embedded using the same provider as indexing
  3. Query vector is searched against Qdrant collection
  4. Results rendered by ResultRenderer (src/search/resultRenderer.ts)
  5. User can click result to open file at specific chunk position
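
Step 3 above amounts to one request against Qdrant's REST search endpoint. The sketch below only builds that request; buildSearchRequest and the QdrantSearchRequest shape are hypothetical, and the plugin itself would send the request through Obsidian's requestUrl() from src/qdrant/client.ts, possibly with different parameters.

```typescript
// Hypothetical request description; not a type from the plugin or from Qdrant.
interface QdrantSearchRequest {
  url: string;
  method: "POST";
  body: string;
}

// Build a vector search request for POST /collections/{name}/points/search.
function buildSearchRequest(
  baseUrl: string,
  collection: string,
  queryVector: number[],
  limit: number,
): QdrantSearchRequest {
  const trimmed = baseUrl.replace(/\/+$/, ""); // avoid a double slash in the URL
  return {
    url: `${trimmed}/collections/${encodeURIComponent(collection)}/points/search`,
    method: "POST",
    // with_payload asks Qdrant to return the stored chunk metadata with each hit.
    body: JSON.stringify({ vector: queryVector, limit, with_payload: true }),
  };
}
```

The query vector must come from the same embedding model as the indexed points, otherwise the dimensions or the vector space will not match the collection.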

Settings & Configuration

  • Settings stored in vault's .obsidian/plugins/obsidian-qdrant/data.json
  • Defaults in src/settings.ts (especially important: DEFAULT_SETTINGS)
  • Critical: The orchestrator must use this.settings loaded from disk, not DEFAULT_SETTINGS
  • Settings tab (src/ui/settingsTab.ts) provides UI for all configuration
  • Validation function ensures settings are complete before initialization
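
Because the settings are nested, merging saved values into the defaults needs a deep merge; a shallow Object.assign would replace a whole nested object and lose any defaults inside it. The following is a minimal sketch of that idea. The deepMerge helper and the example setting names (qdrant.url, qdrant.apiKey, chunking.maxTokens) are illustrative assumptions, not the plugin's actual settings structure.

```typescript
type PlainObject = Record<string, unknown>;

function isPlainObject(value: unknown): value is PlainObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Recursively overlay `saved` on top of `defaults`, keeping defaults for missing keys.
function deepMerge(defaults: PlainObject, saved: PlainObject): PlainObject {
  const result: PlainObject = { ...defaults };
  for (const key of Object.keys(saved)) {
    const base = result[key];
    const override = saved[key];
    result[key] =
      isPlainObject(base) && isPlainObject(override)
        ? deepMerge(base, override)
        : override;
  }
  return result;
}

// Hypothetical defaults and a partial data.json, for illustration only.
const exampleDefaults: PlainObject = {
  qdrant: { url: "http://localhost:6333", apiKey: "" },
  chunking: { maxTokens: 512 },
};
const exampleSaved: PlainObject = {
  qdrant: { url: "https://qdrant.example.com" },
};

// The saved URL wins, while apiKey and chunking.maxTokens keep their defaults.
const exampleMerged = deepMerge(exampleDefaults, exampleSaved);
```

With Object.assign({}, exampleDefaults, exampleSaved) the qdrant object would contain only url, which is the kind of loss the deep merge avoids.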

Type System

All types defined in src/types.ts:

  • PluginSettings: Complete configuration structure
  • ChunkMetadata: Rich metadata stored with each vector point
  • SearchResult, SearchOptions: Search interface
  • IndexingProgress: Real-time progress tracking
  • FileManifestEntry: Index state per file
  • Interface contracts: EmbeddingProviderInterface, ExtractorInterface, etc.

Important Implementation Details

Qdrant Point IDs

  • Must be UUIDs (not sequential integers) for Qdrant compatibility
  • Generated deterministically by generateDeterministicUUID() from hash data (see the commit notes above), not by crypto.randomUUID()
  • Each chunk gets a unique ID that persists across re-indexing
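
For an ID to persist across re-indexing it has to be derived from stable inputs rather than generated at random. The sketch below shows one way to produce a deterministic, UUID-shaped string of the form xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx. It is not the plugin's generateDeterministicUUID(): the FNV-1a hash, the path-plus-index key, and the function names are assumptions chosen to keep the example dependency-free.

```typescript
// FNV-1a 32-bit hash (non-cryptographic); the seed yields several distinct words per key.
function fnv1a(input: string, seed: number): number {
  let hash = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Format a 32-bit unsigned integer as exactly 8 lowercase hex characters.
function toHex8(value: number): string {
  return ("00000000" + value.toString(16)).slice(-8);
}

// Derive a stable 36-character UUID-shaped ID from a file path and chunk index.
function deterministicUuidSketch(path: string, chunkIndex: number): string {
  const key = `${path}#${chunkIndex}`;
  // 4 x 32 bits = 128 bits = 32 hex characters.
  const hex = [0, 1, 2, 3].map((seed) => toHex8(fnv1a(key, seed))).join("");
  // The variant nibble must be one of 8, 9, a, b.
  const variant = "89ab".charAt(parseInt(hex.charAt(16), 16) & 0x3);
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    "4" + hex.slice(13, 16), // version nibble fixed to 4
    variant + hex.slice(17, 20),
    hex.slice(20, 32), // final segment: 12 hex characters
  ].join("-");
}
```

The same path and chunk index always map to the same ID, so re-indexing a file overwrites its existing points instead of creating duplicates. A production version would likely use a stronger hash to reduce collision risk.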

Settings Initialization Timing

  • Settings are loaded async in main.ts:onload()
  • Orchestrator created after settings load with new IndexingOrchestrator(this.app, this.settings)
  • Bug to watch for: Orchestrator components must reference the passed settings, not DEFAULT_SETTINGS

File Change Handling

  • Create/Modify: Add to queue with 'update' action
  • Delete: Add to queue with 'delete' action (removes points from Qdrant)
  • Rename: Treated as delete old + create new
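
The mapping from vault events to queue jobs can be sketched as a single function. The VaultEvent and QueueJob types and the toJobs helper are hypothetical; the real FileWatcher works with Obsidian's own event payloads.

```typescript
type QueueAction = "update" | "delete";

interface QueueJob {
  path: string;
  action: QueueAction;
}

// Hypothetical, simplified event shape for illustration.
type VaultEvent =
  | { type: "create" | "modify" | "delete"; path: string }
  | { type: "rename"; path: string; oldPath: string };

// Translate one vault event into the queue jobs it implies.
function toJobs(event: VaultEvent): QueueJob[] {
  switch (event.type) {
    case "create":
    case "modify":
      return [{ path: event.path, action: "update" }];
    case "delete":
      return [{ path: event.path, action: "delete" }];
    case "rename":
      // A rename removes the points stored under the old path, then indexes the new path.
      return [
        { path: event.oldPath, action: "delete" },
        { path: event.path, action: "update" },
      ];
  }
}
```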

Error Handling

  • Connection failures are caught and surfaced via status bar
  • File extraction errors are logged but don't stop the queue
  • Progress callback and error callback allow UI updates
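
The "log and continue" behaviour can be sketched as a loop that isolates each file in its own try/catch. This version is synchronous for brevity, whereas the real queue is asynchronous; processAll and the QueueCallbacks shape are hypothetical names.

```typescript
// Hypothetical callback pair mirroring the progress and error callbacks described above.
interface QueueCallbacks {
  onProgress: (done: number, total: number) => void;
  onError: (path: string, error: Error) => void;
}

// Index every path; a failure on one file is reported but does not stop the rest.
function processAll(
  paths: string[],
  indexFile: (path: string) => void,
  callbacks: QueueCallbacks,
): number {
  let succeeded = 0;
  paths.forEach((path, i) => {
    try {
      indexFile(path);
      succeeded++;
    } catch (err) {
      // Normalise non-Error throwables so the UI always receives an Error.
      callbacks.onError(path, err instanceof Error ? err : new Error(String(err)));
    }
    callbacks.onProgress(i + 1, paths.length);
  });
  return succeeded;
}
```

Reporting progress after both success and failure keeps the status bar count moving even when some files cannot be extracted.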

Text Extractor Plugin Integration

  • Optional dependency detected at runtime
  • Falls back gracefully if not installed
  • Uses Obsidian plugin API to access text-extractor methods
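
Runtime detection of the optional dependency might look like the sketch below. The AppLike and TextExtractorApiLike types are minimal stand-ins defined here, and the plugins.plugins["text-extractor"].api lookup and the canFileBeExtracted method reflect an assumption about how Text Extractor exposes its API; verify both against that plugin's documentation.

```typescript
// Minimal structural types; the real objects come from Obsidian and Text Extractor.
interface TextExtractorApiLike {
  canFileBeExtracted(filePath: string): boolean;
}

interface AppLike {
  plugins?: {
    plugins?: Record<string, { api?: TextExtractorApiLike } | undefined>;
  };
}

// Returns the Text Extractor API if the plugin is installed and enabled, else undefined.
function getTextExtractorApi(app: AppLike): TextExtractorApiLike | undefined {
  return app.plugins?.plugins?.["text-extractor"]?.api;
}

// Graceful fallback: without the plugin, PDFs and images are simply not extractable.
function canExtract(app: AppLike, filePath: string): boolean {
  const api = getTextExtractorApi(app);
  return api !== undefined && api.canFileBeExtracted(filePath);
}
```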

Common Development Patterns

Adding a New Extractor

  1. Implement ExtractorInterface in src/extractors/
  2. Add to ExtractorManager.initializeExtractors() priority list
  3. Update getExtractorStatus() to report the new extractor
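
As a rough illustration of steps 1 and 2, a new extractor and a priority-based lookup could look like this. The interface members shown (canHandle, extract), the CsvExtractor class, and pickExtractor are assumptions; the real ExtractorInterface and ExtractedContent are defined in src/types.ts and may have different signatures.

```typescript
// Hypothetical shapes standing in for the types in src/types.ts.
interface ExtractedContent {
  text: string;
  metadata: Record<string, unknown>;
}

interface ExtractorInterface {
  canHandle(path: string): boolean;
  extract(path: string, raw: string): ExtractedContent;
}

// Example extractor for CSV files: keeps non-empty rows and records the row count.
class CsvExtractor implements ExtractorInterface {
  canHandle(path: string): boolean {
    return path.toLowerCase().slice(-4) === ".csv";
  }

  extract(path: string, raw: string): ExtractedContent {
    const rows = raw.split(/\r?\n/).filter((line) => line.trim().length > 0);
    return { text: rows.join("\n"), metadata: { path, rowCount: rows.length } };
  }
}

// The first extractor in the priority list that accepts the file wins.
function pickExtractor(
  extractors: ExtractorInterface[],
  path: string,
): ExtractorInterface | undefined {
  for (const extractor of extractors) {
    if (extractor.canHandle(path)) return extractor;
  }
  return undefined;
}
```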

Adding a New Embedding Provider

  1. Extend BaseEmbeddingProvider in src/embeddings/
  2. Implement embed(), getDimension(), getName(), testConnection()
  3. Add to createEmbeddingProvider() factory switch
  4. Add provider enum value to src/types.ts:EmbeddingProvider
  5. Add settings interface and update PluginSettings
  6. Add UI controls in src/ui/settingsTab.ts
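
Steps 1 to 3 might translate into code along these lines. The abstract class below is a stand-in declared only so the example is self-contained; its method signatures, the FakeEmbeddingProvider class, the ProviderKind type, and createEmbeddingProviderSketch are all assumptions rather than the definitions in src/embeddings/.

```typescript
// Stand-in for the real base class; the actual signatures may differ.
abstract class BaseEmbeddingProvider {
  abstract embed(texts: string[]): Promise<number[][]>;
  abstract getDimension(): number;
  abstract getName(): string;
  abstract testConnection(): Promise<boolean>;
}

// Toy provider returning fixed-size pseudo-vectors; handy for wiring tests offline.
class FakeEmbeddingProvider extends BaseEmbeddingProvider {
  private readonly dimension: number;

  constructor(dimension: number) {
    super();
    this.dimension = dimension;
  }

  async embed(texts: string[]): Promise<number[][]> {
    return texts.map((text) => {
      const vector: number[] = [];
      for (let i = 0; i < this.dimension; i++) {
        vector.push(((text.length + i) % 7) / 7);
      }
      return vector;
    });
  }

  getDimension(): number {
    return this.dimension;
  }

  getName(): string {
    return "fake-embed";
  }

  async testConnection(): Promise<boolean> {
    return true; // nothing to connect to
  }
}

// Hypothetical provider key; the real enum lives in src/types.ts.
type ProviderKind = "fake";

// Factory switch: each new provider adds one case here.
function createEmbeddingProviderSketch(kind: ProviderKind): BaseEmbeddingProvider {
  switch (kind) {
    case "fake":
      return new FakeEmbeddingProvider(8);
  }
}
```

getDimension() matters beyond the provider itself, since the collection's vector size has to match it.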

Debugging Indexing Issues

  1. Check console logs: orchestrator logs initialization steps
  2. Verify settings: this.settings should match what's in data.json
  3. Check manifest: FileManifest tracks what's been indexed
  4. Test connections: Use settings tab "Test Connection" buttons
  5. Monitor status bar: Shows progress and errors

Testing & Validation

There are no automated tests yet; use this manual testing workflow instead:

  1. Build the plugin: npm run build
  2. Copy to test vault's plugin folder
  3. Configure settings (Qdrant URL, embedding provider)
  4. Test connection buttons in settings
  5. Index a single file via command palette
  6. Verify in Qdrant dashboard that points were created
  7. Run semantic search and verify results
  8. Test file modifications and deletions

Known Issues & Limitations

  • Graph visualization not yet implemented (UI placeholder exists)
  • Hybrid search (sparse + dense) configured but not exposed in search UI
  • Simple word-based tokenizer doesn't match exact GPT token counts
  • No rate limiting on API calls (relies on provider batch/concurrency settings)
  • File manifest doesn't handle vault renames (would require full reindex)