# Nextcloud + Elasticsearch Discovery File Explorer
A discovery-first file explorer built with Next.js (App Router, TypeScript, Tailwind, shadcn UI) that connects to Nextcloud via WebDAV and indexes metadata/content into Elasticsearch (BM25 baseline, optional semantic hybrid search). Includes upload/download, folder tree browsing, global search, and Sentry instrumentation.
## TL;DR (Quick Start)
Prereqs:
- Node 18+ and npm
- Docker (for Elasticsearch/Kibana/Tika services)
Setup:
1) Configure environment
   - Copy `.env.example` to `.env.local` and fill in your values (Nextcloud creds + Elasticsearch endpoint).
   - A pre-populated `.env.local` is included for your provided Nextcloud host (update if needed).
2) Provision Elasticsearch + Tika (Docker sample below)
3) Create Elasticsearch index
   - `npm run create:index`
4) Ingest Nextcloud into ES (BM25 baseline)
   - `npx tsx -r dotenv/config -r tsconfig-paths/register scripts/ingest-nextcloud.ts`
5) Run the app
   - `npm run dev` then open the reported URL (e.g., http://localhost:3000)
## Why this exists
Nextcloud is the system of record for files; this app provides a discovery UX (fast search, filters, previews later) by indexing normalized documents into Elasticsearch. Apache Tika can extract plain text for rich BM25 search, and optional OpenAI-compatible embeddings enable semantic hybrid search.
## Architecture
- UI: Next.js App Router (TypeScript), Tailwind CSS, shadcn UI
- Data
  - Nextcloud WebDAV for browse/upload/download
  - Elasticsearch for indexing and search (BM25 baseline)
  - Apache Tika server for content extraction (optional but recommended)
  - OpenAI-compatible embeddings (optional) for dense vectors/hybrid queries
- Observability: Sentry (client/server/edge) with logs and spans
## Features
- Browse Nextcloud directories via a lazy-loading sidebar tree
- View files with size/type and actions in a table
- Upload files into the current folder
- Download/proxy via Next.js route
- Global search box (BM25 baseline) with optional “Semantic” toggle for hybrid search (when embeddings are configured)
- Sentry logging/tracing sprinkled around WebDAV and Elasticsearch calls
## Project Layout (highlights)
- src/lib
  - env.ts: zod-validated env loader with flags
  - paths.ts: normalization/helpers for DAV paths
  - elasticsearch.ts: ES client, ensureIndex, bm25Search, knnSearch, hybridSearch
  - webdav.ts: Nextcloud WebDAV wrapper (list/create/upload/download/stat) with spans
  - embeddings.ts: OpenAI-compatible embeddings client
- src/app/api
  - folders/list, folders/create
  - files/list, files/upload, files/download
  - search/query
- scripts
  - create-index.ts: creates ES index + alias
  - ingest-nextcloud.ts: crawl Nextcloud → optional Tika → index into ES
- docs/elasticsearch/mappings.json: canonical baseline mapping
- Sentry init
  - instrumentation-client.ts, sentry.server.config.ts, sentry.edge.config.ts
## Environment Variables
Populate `.env.local` (not committed). See `.env.example` for full list.
Required:
- NEXTCLOUD_BASE_URL: e.g. https://your-nextcloud.example.com
- NEXTCLOUD_USERNAME: WebDAV user
- NEXTCLOUD_APP_PASSWORD: App password generated in Nextcloud (not login password)
- NEXTCLOUD_ROOT_PATH: e.g. `/remote.php/dav/files/admin`
- ELASTICSEARCH_URL: e.g. http://localhost:9200
- ELASTICSEARCH_INDEX: default `files`
- ELASTICSEARCH_ALIAS: default `files_current`
Optional:
- TIKA_BASE_URL: e.g. http://localhost:9998 (if using Apache Tika for extraction)
- SENTRY_DSN: if provided, Sentry is enabled
- OPENAI_API_BASE, OPENAI_API_KEY, OPENAI_EMBEDDING_MODEL, EMBEDDING_DIM: Enable embeddings + semantic hybrid
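To make the required/optional split concrete, here is a minimal sketch of an env loader. It is not the project's `src/lib/env.ts` (which is zod-based); it uses plain TypeScript so it runs without dependencies. The `AppEnv` shape, the flag names, and the trailing-slash normalization are assumptions made for illustration.

```typescript
// Hypothetical shape for illustration; the real loader is src/lib/env.ts (zod-based).
type EnvSource = Record<string, string | undefined>;

type AppEnv = {
  nextcloudBaseUrl: string;
  nextcloudUsername: string;
  nextcloudAppPassword: string;
  nextcloudRootPath: string;
  elasticsearchUrl: string;
  elasticsearchIndex: string;
  elasticsearchAlias: string;
  tikaBaseUrl: string | undefined;
  flags: { tika: boolean; sentry: boolean; embeddings: boolean };
};

// Assumed normalization: drop trailing slashes so URLs and paths join cleanly.
const stripTrailingSlashes = (value: string): string => value.replace(/\/+$/, "");

// Returns the trimmed value, or undefined when the variable is unset or blank.
function optional(source: EnvSource, key: string): string | undefined {
  const value = (source[key] ?? "").trim();
  return value === "" ? undefined : value;
}

// Like optional(), but fails fast with the name of the missing variable.
function required(source: EnvSource, key: string): string {
  const value = optional(source, key);
  if (value === undefined) {
    throw new Error(`Missing required environment variable: ${key}`);
  }
  return value;
}

function loadEnv(source: EnvSource): AppEnv {
  const tikaBaseUrl = optional(source, "TIKA_BASE_URL");
  const embeddingKeys = ["OPENAI_API_BASE", "OPENAI_API_KEY", "OPENAI_EMBEDDING_MODEL", "EMBEDDING_DIM"];
  return {
    nextcloudBaseUrl: stripTrailingSlashes(required(source, "NEXTCLOUD_BASE_URL")),
    nextcloudUsername: required(source, "NEXTCLOUD_USERNAME"),
    nextcloudAppPassword: required(source, "NEXTCLOUD_APP_PASSWORD"),
    nextcloudRootPath: stripTrailingSlashes(required(source, "NEXTCLOUD_ROOT_PATH")),
    elasticsearchUrl: stripTrailingSlashes(required(source, "ELASTICSEARCH_URL")),
    // Defaults documented in the list above.
    elasticsearchIndex: optional(source, "ELASTICSEARCH_INDEX") ?? "files",
    elasticsearchAlias: optional(source, "ELASTICSEARCH_ALIAS") ?? "files_current",
    tikaBaseUrl: tikaBaseUrl === undefined ? undefined : stripTrailingSlashes(tikaBaseUrl),
    flags: {
      tika: tikaBaseUrl !== undefined,
      sentry: optional(source, "SENTRY_DSN") !== undefined,
      // Embeddings count as enabled only when all four variables are present.
      embeddings: embeddingKeys.every((key) => optional(source, key) !== undefined),
    },
  };
}
```

In a Node process this would be called once at startup as `loadEnv(process.env)`, so a missing variable fails early with a clear message instead of surfacing later as a WebDAV or Elasticsearch error.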
## Dependencies and Local Services (Docker)
Below is a sample docker-compose for local development. Adjust versions for your environment. For TrueNAS Scale, translate this into an appropriate app configuration.
```yaml
version: "3.9"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.2
    container_name: es
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
    ports:
      - "9200:9200"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9200/_cluster/health"]
      interval: 10s
      timeout: 5s
      retries: 10
  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.2
    container_name: kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      elasticsearch:
        condition: service_healthy
  tika:
    image: apache/tika:latest-full
    container_name: tika
    ports:
      - "9998:9998"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9998"]
      interval: 10s
      timeout: 5s
      retries: 10
```
Notes:
- Security is relaxed in this dev config (xpack security disabled). Harden for production.
- Set `TIKA_BASE_URL=http://localhost:9998` to enable content extraction.
## Install & Run
Install dependencies:
```bash
npm install
```
Run development server:
```bash
npm run dev
# Next.js will print the URL, for example http://localhost:3000
```
Build and start:
```bash
npm run build
npm start
```
## Initialize Elasticsearch
Create the index and alias (files/files_current):
```bash
npm run create:index
# internally: tsx -r dotenv/config -r tsconfig-paths/register scripts/create-index.ts
```
To recreate:
```bash
npm run create:index -- --recreate
```
## Ingest Nextcloud → Elasticsearch
Crawl Nextcloud via WebDAV, optionally extract content with Tika, index into ES:
```bash
npx tsx -r dotenv/config -r tsconfig-paths/register scripts/ingest-nextcloud.ts
```
Restrict to a subtree:
```bash
npx tsx -r dotenv/config -r tsconfig-paths/register scripts/ingest-nextcloud.ts -- --root=/remote.php/dav/files/admin/SomeFolder
```
After ingestion, try searching in the UI (BM25).
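As an illustration of the final "index into ES" step, the sketch below serializes crawled files into the newline-delimited body that Elasticsearch's `_bulk` REST endpoint expects. The actual script may well use the official client instead; the `FileDoc` fields and the choice of the DAV path as `_id` are assumptions, not the project's mapping.

```typescript
// Hypothetical document shape; the canonical fields live in docs/elasticsearch/mappings.json.
type FileDoc = {
  path: string; // full DAV path, used here as the document _id (an assumption)
  name: string;
  size: number;
  mimeType: string;
  content?: string; // plain text from Tika, present only when extraction is enabled
};

// Builds an NDJSON bulk body: one action line followed by one source line per document.
function toBulkNdjson(index: string, docs: FileDoc[]): string {
  const lines: string[] = [];
  for (const doc of docs) {
    lines.push(JSON.stringify({ index: { _index: index, _id: doc.path } }));
    lines.push(JSON.stringify(doc));
  }
  // The bulk API requires the body to end with a newline.
  return lines.length === 0 ? "" : lines.join("\n") + "\n";
}
```

Keying documents by path would make re-running the ingest overwrite existing entries rather than duplicate them. Elasticsearch limits `_id` to 512 bytes, so very deep paths would need a hashed id instead.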
## Optional: Semantic Hybrid Search
To enable, configure embeddings:
- OPENAI_API_BASE (e.g., https://api.openai.com/v1 or your compatible endpoint)
- OPENAI_API_KEY
- OPENAI_EMBEDDING_MODEL (e.g., text-embedding-3-large)
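- EMBEDDING_DIM: must equal the vector length your model returns (e.g., 3072 for text-embedding-3-large at its default size, 1536 for text-embedding-3-small)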
Current status:
- The API runs a hybrid search when a request sets `semantic: true`; the UI exposes this as a “Semantic” toggle.
- A backfill script to compute vectors for existing docs is planned but not yet included.
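For orientation, here is a sketch of what a hybrid request body can look like in Elasticsearch 8.x, where a BM25 `query` and an approximate `knn` clause sit in the same search request and their boosted scores are summed. This is not the project's `hybridSearch` implementation; the field names (`name`, `content`, `embedding`), the default page size, and the weighting scheme are assumptions.

```typescript
// Field names are assumptions; check docs/elasticsearch/mappings.json for the real ones.
type KnnClause = {
  field: string;
  query_vector: number[];
  k: number;
  num_candidates: number;
  boost: number;
};

type SearchBody = {
  from: number;
  size: number;
  query: { multi_match: { query: string; fields: string[]; boost: number } };
  knn?: KnnClause;
};

type HybridOptions = {
  q: string;
  page?: number;
  perPage?: number;
  queryVector?: number[]; // embedding of q; omit for a BM25-only search
  embeddingDim?: number; // EMBEDDING_DIM, used to catch model/mapping mismatches
  vectorWeight?: number; // 0..1 share of the combined score given to kNN
};

function buildSearchBody(options: HybridOptions): SearchBody {
  const page = Math.max(1, options.page ?? 1);
  const size = Math.max(1, options.perPage ?? 20);
  const vector = options.queryVector;
  // Clamp the weight so both boosts stay non-negative; without a vector BM25 gets full weight.
  const weight = vector ? Math.min(1, Math.max(0, options.vectorWeight ?? 0.5)) : 0;

  const body: SearchBody = {
    from: (page - 1) * size,
    size,
    query: {
      multi_match: { query: options.q, fields: ["name^2", "content"], boost: 1 - weight },
    },
  };
  if (!vector) return body;

  if (options.embeddingDim !== undefined && vector.length !== options.embeddingDim) {
    throw new Error(`Query vector has ${vector.length} dims, expected ${options.embeddingDim}`);
  }
  // k must cover every hit up to the requested page; num_candidates must be at least k.
  const k = page * size;
  body.knn = {
    field: "embedding",
    query_vector: vector,
    k,
    num_candidates: Math.min(10000, Math.max(100, k * 10)),
    boost: weight,
  };
  return body;
}
```

The dimension check is the cheap guard here: a mismatch between `EMBEDDING_DIM` and the model's output otherwise only shows up as an Elasticsearch error at query time.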
## Sentry
- Files:
  - instrumentation-client.ts (client init)
  - sentry.server.config.ts (node runtime)
  - sentry.edge.config.ts (edge runtime)
- Enable by setting SENTRY_DSN in `.env.local`.
- Logging enabled (`consoleLoggingIntegration`), capturing console.log/warn/error.
- Spans instrument WebDAV and ES operations with meaningful op/name and attributes.
## API Endpoints
- GET `/api/folders/list?path=/abs/path` → { folders[], files[] }
- POST `/api/folders/create` body: `{ path: "/abs/path" }` → `{ ok: true }`
- GET `/api/files/list?path=/abs/path&page=1&perPage=50` → `{ total, page, perPage, items[] }`
- POST `/api/files/upload` (multipart) form-data: `file`, `destPath`
- GET `/api/files/download?path=/abs/path` → stream download
- POST `/api/search/query` body: `{ q, filters?, sort?, page?, perPage?, semantic? }`
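A small sketch of how a client might build requests for two of these endpoints. The helper names are hypothetical, and because the inner structure of `filters` and `sort` is not documented above, both are left loosely typed.

```typescript
// Request fields come from the endpoint list; `filters` and `sort` shapes are unknown here.
type SearchRequest = {
  q: string;
  filters?: Record<string, unknown>;
  sort?: unknown;
  page?: number;
  perPage?: number;
  semantic?: boolean;
};

type JsonPost = {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
};

// Builds the GET URL for a paged file listing; the path is encoded so spaces and "&" survive.
function filesListUrl(path: string, page = 1, perPage = 50): string {
  return `/api/files/list?path=${encodeURIComponent(path)}&page=${page}&perPage=${perPage}`;
}

// Builds the URL and fetch options for a search; pass `semantic: true` for hybrid search.
function searchRequest(request: SearchRequest): JsonPost {
  return {
    url: "/api/search/query",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(request),
    },
  };
}

// Usage (in a browser or any runtime with fetch):
//   const { url, init } = searchRequest({ q: "invoice", semantic: true });
//   const response = await fetch(url, init);
```

Keeping request construction separate from the `fetch` call makes the encoding rules easy to unit-test without a running server.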
## UI Usage
- Navigate folders via the left sidebar. Root defaults to `NEXTCLOUD_ROOT_PATH`.
- Breadcrumbs show the current path; click to navigate.
- Global search queries ES (BM25). Toggle “Semantic” to blend vector similarity (when enabled).
- Use “Upload” to send a file to the current folder.
- Click a file name or download action to retrieve it.
## Known Caveats / TODO
- When starting from `NEXTCLOUD_ROOT_PATH`, breadcrumb segments may include technical path prefixes (e.g., `/remote.php`, `/dav`) that aren't browsable independently. This can be adjusted to start breadcrumbs from the user root only.
- Embeddings backfill script & “Find similar” API/UI are planned.
- Tests (unit/integration/E2E) and TrueNAS-specific compose notes can be added next.
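The breadcrumb caveat above could be addressed along these lines: compute crumbs relative to `NEXTCLOUD_ROOT_PATH` so the technical prefix never appears as a clickable segment. This is a sketch, not existing project code; `buildBreadcrumbs`, the `Crumb` type, and the "Home" label are hypothetical.

```typescript
type Crumb = { label: string; path: string };

const trimTrailingSlashes = (p: string): string => p.replace(/\/+$/, "");

// Returns crumbs starting at the user root; segments above the root are never emitted.
function buildBreadcrumbs(rootPath: string, currentPath: string): Crumb[] {
  const root = trimTrailingSlashes(rootPath);
  const current = trimTrailingSlashes(currentPath);
  const crumbs: Crumb[] = [{ label: "Home", path: root === "" ? "/" : root }];

  // Paths outside the root (including look-alike siblings such as ".../administrator"
  // next to ".../admin") fall back to the root crumb alone.
  if (current !== root && !current.startsWith(root + "/")) return crumbs;

  const segments = current
    .slice(root.length)
    .split("/")
    .filter((segment) => segment.length > 0);

  let path = root;
  for (const segment of segments) {
    path = `${path}/${segment}`;
    crumbs.push({ label: segment, path });
  }
  return crumbs;
}
```

Each crumb keeps the full DAV path, so clicking it can still call the existing list endpoints unchanged; only the displayed labels are relative to the root.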
## Development Notes
- Code style: TypeScript, ESLint (Next config), Tailwind v4.
- shadcn components added (Slate), most primitives included.
- ES Types via `@elastic/elasticsearch` estypes.
- Stream interop handled for Node → Web streams in downloads and uploads.
- Zod-based env loader normalizes URLs and validates required vars.
## Troubleshooting
- Search 500s:
  - Ensure Elasticsearch is running and reachable at `ELASTICSEARCH_URL`.
  - Run `npm run create:index` to create the index and alias.
  - If using Tika for content, ensure `TIKA_BASE_URL` is set and service is healthy (optional).
- WebDAV failures:
  - Verify `NEXTCLOUD_BASE_URL`, `NEXTCLOUD_USERNAME`, `NEXTCLOUD_APP_PASSWORD`, and `NEXTCLOUD_ROOT_PATH`.
  - Confirm the user has permission for the target path.
- Sentry issues:
  - Ensure `SENTRY_DSN` is set; check networking/outbound restrictions.
## Scripts
- `npm run create:index`: Create/recreate Elasticsearch index and alias
- `scripts/ingest-nextcloud.ts`: Crawl Nextcloud → Tika (optional) → Elasticsearch
## Security
- Keep `.env.local` out of source control (already in .gitignore).
- Use Nextcloud App Passwords, not login passwords.
- Harden Elasticsearch/Kibana/Tika for production (auth, TLS, resource limits).
---
Built per implementation_plan.md with a deterministic order of rollout and Sentry instrumentation aligned to organizational rules.