# Nextcloud + Elasticsearch Discovery File Explorer

A discovery-first file explorer built with Next.js (App Router, TypeScript, Tailwind, shadcn UI) that connects to Nextcloud via WebDAV and indexes metadata/content into Elasticsearch (BM25 baseline, optional semantic hybrid search). Includes upload/download, folder tree browsing, global search, and Sentry instrumentation.

## TL;DR (Quick Start)

Prereqs:
- Node 18+ and npm
- Docker (for Elasticsearch/Kibana/Tika services)

Setup:
1) Configure environment
   - Run `npm install` to install dependencies.
   - Copy `.env.example` to `.env.local` and fill in your values (Nextcloud credentials + Elasticsearch endpoint).
   - If a pre-populated `.env.local` already exists for your Nextcloud host, review it and update as needed.

2) Provision Elasticsearch + Tika (Docker sample below)
3) Create the Elasticsearch index
   - `npm run create:index`
4) Ingest Nextcloud into ES (BM25 baseline)
   - `npx tsx -r dotenv/config -r tsconfig-paths/register scripts/ingest-nextcloud.ts`
5) Run the app
   - `npm run dev`, then open the reported URL (e.g., http://localhost:3000)

## Why this exists

Nextcloud is the system of record for files; this app provides a discovery UX (fast search, filters, and, later, previews) by indexing normalized documents into Elasticsearch. Apache Tika can extract plain text for richer BM25 search, and optional OpenAI-compatible embeddings enable semantic hybrid search.

## Architecture

- UI: Next.js App Router (TypeScript), Tailwind CSS, shadcn UI
- Data
  - Nextcloud WebDAV for browse/upload/download
  - Elasticsearch for indexing and search (BM25 baseline)
  - Apache Tika server for content extraction (optional but recommended)
  - OpenAI-compatible embeddings (optional) for dense vectors/hybrid queries
- Observability: Sentry (client/server/edge) with logs and spans

## Features

- Browse Nextcloud directories via a lazy-loading sidebar tree
- View files in a table with size, type, and actions
- Upload files into the current folder
- Download files through a proxying Next.js route
- Global search box (BM25 baseline) with an optional “Semantic” toggle for hybrid search (when embeddings are configured)
- Sentry logging and tracing around WebDAV and Elasticsearch calls

## Project Layout (highlights)

- src/lib
  - env.ts: zod-validated env loader with flags
  - paths.ts: normalization/helpers for DAV paths
  - elasticsearch.ts: ES client, ensureIndex, bm25Search, knnSearch, hybridSearch
  - webdav.ts: Nextcloud WebDAV wrapper (list/create/upload/download/stat) with spans
  - embeddings.ts: OpenAI-compatible embeddings client
- src/app/api
  - folders/list, folders/create
  - files/list, files/upload, files/download
  - search/query
- scripts
  - create-index.ts: creates ES index + alias
  - ingest-nextcloud.ts: crawl Nextcloud → optional Tika → index into ES
- docs/elasticsearch/mappings.json: canonical baseline mapping
- Sentry init
  - instrumentation-client.ts, sentry.server.config.ts, sentry.edge.config.ts

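The layout above mentions `paths.ts` as a set of normalization helpers for DAV paths, but its contents are not shown here. The following is only a minimal sketch of what such helpers might look like; `normalizeDavPath` and `toDavUrl` are hypothetical names, not the repo's actual exports.

```typescript
// Hypothetical helpers for illustration; the real paths.ts may differ.

/** Collapse duplicate slashes, force a single leading slash, drop the trailing slash. */
function normalizeDavPath(input: string): string {
  const segments = input.split("/").filter((s) => s.length > 0 && s !== ".");
  // Reject parent-directory segments rather than trying to resolve them.
  if (segments.indexOf("..") !== -1) {
    throw new Error("Path must not contain '..' segments");
  }
  return "/" + segments.join("/");
}

/** Join a base URL and a DAV path, percent-encoding each path segment. */
function toDavUrl(baseUrl: string, davPath: string): string {
  const base = baseUrl.replace(/\/+$/, "");
  const encoded = normalizeDavPath(davPath)
    .split("/")
    .map((segment) => encodeURIComponent(segment))
    .join("/");
  return base + encoded;
}
```

Encoding per segment (rather than the whole path at once) keeps the `/` separators intact while still escaping spaces, `&`, and similar characters in file names.
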
## Environment Variables

Populate `.env.local` (not committed). See `.env.example` for the full list.

Required:
- NEXTCLOUD_BASE_URL: e.g. https://your-nextcloud.example.com
- NEXTCLOUD_USERNAME: WebDAV user
- NEXTCLOUD_APP_PASSWORD: App password generated in Nextcloud (not the login password)
- NEXTCLOUD_ROOT_PATH: e.g. `/remote.php/dav/files/admin`

- ELASTICSEARCH_URL: e.g. http://localhost:9200
- ELASTICSEARCH_INDEX: default `files`
- ELASTICSEARCH_ALIAS: default `files_current`

Optional:
- TIKA_BASE_URL: e.g. http://localhost:9998 (if using Apache Tika for extraction)
- SENTRY_DSN: if provided, Sentry is enabled
- OPENAI_API_BASE, OPENAI_API_KEY, OPENAI_EMBEDDING_MODEL, EMBEDDING_DIM: enable embeddings and semantic hybrid search

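To make the required/optional split above concrete, here is a dependency-free sketch of an env loader. The repo's `env.ts` is described as zod-based; this version deliberately avoids zod so it runs standalone, and `loadConfig`, `AppConfig`, and the derived flag names are assumptions made for illustration.

```typescript
// Illustrative only: the real env.ts uses zod and may expose different names.
type Env = Record<string, string | undefined>;

interface AppConfig {
  nextcloudBaseUrl: string;
  nextcloudUsername: string;
  nextcloudAppPassword: string;
  nextcloudRootPath: string;
  elasticsearchUrl: string;
  elasticsearchIndex: string;
  elasticsearchAlias: string;
  tikaBaseUrl: string | undefined;
  sentryEnabled: boolean;
  semanticEnabled: boolean;
}

function required(env: Env, key: string): string {
  const value = env[key]?.trim();
  if (!value) throw new Error(`Missing required environment variable: ${key}`);
  return value;
}

const stripTrailingSlash = (url: string): string => url.replace(/\/+$/, "");

function loadConfig(env: Env): AppConfig {
  const embeddingKeys = ["OPENAI_API_BASE", "OPENAI_API_KEY", "OPENAI_EMBEDDING_MODEL", "EMBEDDING_DIM"];
  const tika = env["TIKA_BASE_URL"]?.trim();
  return {
    nextcloudBaseUrl: stripTrailingSlash(required(env, "NEXTCLOUD_BASE_URL")),
    nextcloudUsername: required(env, "NEXTCLOUD_USERNAME"),
    nextcloudAppPassword: required(env, "NEXTCLOUD_APP_PASSWORD"),
    nextcloudRootPath: required(env, "NEXTCLOUD_ROOT_PATH"),
    elasticsearchUrl: stripTrailingSlash(required(env, "ELASTICSEARCH_URL")),
    // Index and alias fall back to the documented defaults.
    elasticsearchIndex: env["ELASTICSEARCH_INDEX"]?.trim() || "files",
    elasticsearchAlias: env["ELASTICSEARCH_ALIAS"]?.trim() || "files_current",
    tikaBaseUrl: tika ? stripTrailingSlash(tika) : undefined,
    sentryEnabled: Boolean(env["SENTRY_DSN"]?.trim()),
    // Semantic search is only switched on when every embedding variable is set.
    semanticEnabled: embeddingKeys.every((key) => Boolean(env[key]?.trim())),
  };
}
```

In a Node or Next.js server context this would typically be called once as `loadConfig(process.env)`, so that a missing variable fails at startup rather than on the first request.
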
## Dependencies and Local Services (Docker)

Below is a sample docker-compose for local development. Adjust versions for your environment. For TrueNAS Scale, translate this into an appropriate app configuration.

```yaml
version: "3.9"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.2
    container_name: es
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
    ports:
      - "9200:9200"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9200/_cluster/health"]
      interval: 10s
      timeout: 5s
      retries: 10

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.2
    container_name: kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      elasticsearch:
        condition: service_healthy

  tika:
    image: apache/tika:latest-full
    container_name: tika
    ports:
      - "9998:9998"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9998"]
      interval: 10s
      timeout: 5s
      retries: 10
```

Notes:
- Security is relaxed in this dev config (xpack security disabled). Harden for production.
- Set `TIKA_BASE_URL=http://localhost:9998` to enable content extraction.

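For orientation, this is roughly how a client could ask the Tika container above for plain text. Tika Server accepts a `PUT` to `/tika` and returns extracted text when `Accept: text/plain` is sent; the helper name `buildTikaTextRequest` and its return shape are hypothetical and not taken from the ingest script.

```typescript
// Sketch only: builds the request description; the ingest script may call Tika differently.
interface TikaRequest {
  url: string;
  method: "PUT";
  headers: Record<string, string>;
}

function buildTikaTextRequest(tikaBaseUrl: string, contentType?: string): TikaRequest {
  // Accept: text/plain asks Tika for extracted plain text instead of XHTML.
  const headers: Record<string, string> = { Accept: "text/plain" };
  // Passing the file's MIME type is optional; Tika can detect it on its own.
  if (contentType) headers["Content-Type"] = contentType;
  return {
    url: tikaBaseUrl.replace(/\/+$/, "") + "/tika",
    method: "PUT",
    headers,
  };
}
```

The result can be handed to `fetch` along with the file bytes as the request body, for example `fetch(req.url, { method: req.method, headers: req.headers, body: fileBytes })`, where `fileBytes` is whatever the WebDAV download produced.
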
## Install & Run

Install dependencies:

```bash
npm install
```

Run development server:

```bash
npm run dev
# Next.js will print the URL, for example http://localhost:3000
```

Build and start:

```bash
npm run build
npm start
```

## Initialize Elasticsearch

Create the index and alias (`files` and `files_current` by default):

```bash
npm run create:index
# internally: tsx -r dotenv/config -r tsconfig-paths/register scripts/create-index.ts
```

To recreate:
```bash
npm run create:index -- --recreate
```

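As a rough picture of what "index + alias" means on the Elasticsearch side, the sketch below builds a create-index request body with an alias and a small mapping. The field names (`path`, `name`, `content`, `embedding`, and so on) are assumptions for illustration; the canonical mapping is the one in `docs/elasticsearch/mappings.json`, which is not reproduced here.

```typescript
// Hypothetical mapping for illustration; see docs/elasticsearch/mappings.json for the real one.
type Json = Record<string, unknown>;

interface IndexBody {
  aliases: Record<string, Json>;
  mappings: { properties: Record<string, Json> };
}

function buildIndexBody(alias: string, embeddingDim?: number): IndexBody {
  const properties: Record<string, Json> = {
    path: { type: "keyword" },
    name: { type: "text", fields: { raw: { type: "keyword" } } },
    mime: { type: "keyword" },
    size: { type: "long" },
    modified: { type: "date" },
    content: { type: "text" }, // BM25-scored text, e.g. extracted by Tika
  };
  if (embeddingDim !== undefined) {
    // Only add the vector field when embeddings are configured.
    properties["embedding"] = {
      type: "dense_vector",
      dims: embeddingDim,
      index: true,
      similarity: "cosine",
    };
  }
  // The alias is attached at creation time so readers can query the alias name.
  return { aliases: { [alias]: {} }, mappings: { properties } };
}
```

Querying through the alias rather than the concrete index is what makes a later recreate or reindex possible without changing the application's search code.
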
## Ingest Nextcloud → Elasticsearch

Crawl Nextcloud via WebDAV, optionally extract content with Tika, index into ES:

```bash
npx tsx -r dotenv/config -r tsconfig-paths/register scripts/ingest-nextcloud.ts
```

Restrict to a subtree:
```bash
npx tsx -r dotenv/config -r tsconfig-paths/register scripts/ingest-nextcloud.ts -- --root=/remote.php/dav/files/admin/SomeFolder
```

After ingestion, try searching in the UI (BM25).

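The last step of the crawl, indexing into ES, is usually done in batches. The sketch below shows how crawled files could be serialized into the newline-delimited body that the Elasticsearch `_bulk` API expects. The `FileDoc` shape and the choice of the file path as `_id` are assumptions; the actual ingest script may normalize documents differently.

```typescript
// Illustrative document shape; the ingest script's real fields may differ.
interface FileDoc {
  path: string;
  name: string;
  size: number;
  mime: string;
  modified: string; // ISO 8601 timestamp
  content?: string; // present only when Tika extraction is enabled
}

/** Serialize docs into a _bulk body: one action line plus one source line per document. */
function toBulkNdjson(index: string, docs: FileDoc[]): string {
  const lines: string[] = [];
  for (const doc of docs) {
    // Using the path as _id makes re-ingestion overwrite rather than duplicate.
    // Elasticsearch limits _id to 512 bytes, so very long paths would need hashing.
    lines.push(JSON.stringify({ index: { _index: index, _id: doc.path } }));
    lines.push(JSON.stringify(doc));
  }
  // The bulk API requires the body to end with a newline.
  return lines.length > 0 ? lines.join("\n") + "\n" : "";
}
```

The resulting string would be sent with `Content-Type: application/x-ndjson`, or the same action/source pairs can be passed to a client library's bulk helper.
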
## Optional: Semantic Hybrid Search

To enable, configure embeddings:
- OPENAI_API_BASE (e.g., https://api.openai.com/v1 or your compatible endpoint)
- OPENAI_API_KEY
- OPENAI_EMBEDDING_MODEL (e.g., text-embedding-3-large)
- EMBEDDING_DIM: must match the vector length the model returns (e.g., 1536 for text-embedding-3-small; text-embedding-3-large returns 3072 by default)

Current status:
- The API runs hybrid search when a request sets `semantic: true`; the UI exposes this as a “Semantic” toggle.
- A backfill script to compute vectors for existing docs is planned but not yet included.

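One way Elasticsearch 8.x can blend BM25 and vector similarity is to send a `query` clause and a top-level `knn` clause in the same search request, in which case the scores of the two are combined. The sketch below builds such a body; it is not necessarily how the repo's `hybridSearch` works, and the field names `name`, `content`, and `embedding` are assumed.

```typescript
// Sketch under assumed field names; hybridSearch in the repo may be implemented differently.
type SearchBody = Record<string, unknown>;

function buildSearchBody(q: string, size = 10, queryVector?: number[]): SearchBody {
  const clampedSize = Math.min(100, Math.max(1, size));
  const body: SearchBody = {
    size: clampedSize,
    // BM25 baseline: match the query text against file name and extracted content.
    query: { multi_match: { query: q, fields: ["name^2", "content"] } },
  };
  if (queryVector && queryVector.length > 0) {
    // Hybrid: add approximate kNN over the embedding field next to the BM25 query.
    body["knn"] = {
      field: "embedding",
      query_vector: queryVector,
      k: clampedSize,
      num_candidates: Math.max(100, clampedSize * 10),
    };
  }
  return body;
}
```

When no vector is supplied the function degrades to the plain BM25 request, which mirrors the "Semantic" toggle being off or embeddings not being configured.
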
## Sentry

- Files:
  - instrumentation-client.ts (client init)
  - sentry.server.config.ts (node runtime)
  - sentry.edge.config.ts (edge runtime)
- Enable by setting SENTRY_DSN in `.env.local`.
- Logging is enabled via `consoleLoggingIntegration`, which captures console.log/warn/error.
- Spans instrument WebDAV and ES operations with meaningful op/name and attributes.

## API Endpoints

- GET `/api/folders/list?path=/abs/path` → `{ folders[], files[] }`
- POST `/api/folders/create` body: `{ path: "/abs/path" }` → `{ ok: true }`
- GET `/api/files/list?path=/abs/path&page=1&perPage=50` → `{ total, page, perPage, items[] }`
- POST `/api/files/upload` (multipart) form-data: `file`, `destPath`
- GET `/api/files/download?path=/abs/path` → stream download
- POST `/api/search/query` → body: `{ q, filters?, sort?, page?, perPage?, semantic? }`

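A small client-side sketch may help show how these endpoints are called. The helper names are hypothetical, and because the shapes of `filters` and `sort` are not specified in this README, they are typed loosely here as an assumption.

```typescript
// Hypothetical client helpers; only the routes and top-level field names come from the list above.
interface SearchRequest {
  q: string;
  filters?: Record<string, unknown>; // shape not documented here
  sort?: unknown; // shape not documented here
  page?: number;
  perPage?: number;
  semantic?: boolean;
}

/** Build the files/list URL, escaping the path so slashes and spaces survive the query string. */
function filesListUrl(path: string, page = 1, perPage = 50): string {
  return `/api/files/list?path=${encodeURIComponent(path)}&page=${page}&perPage=${perPage}`;
}

/** Build the fetch options for POST /api/search/query. */
function searchRequestInit(req: SearchRequest): {
  method: "POST";
  headers: Record<string, string>;
  body: string;
} {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  };
}
```

A hybrid search call would then look like `fetch("/api/search/query", searchRequestInit({ q: "invoice", semantic: true }))`.
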
## UI Usage

- Navigate folders via the left sidebar. Root defaults to `NEXTCLOUD_ROOT_PATH`.
- Breadcrumbs show the current path; click to navigate.
- Global search queries ES (BM25). Toggle “Semantic” to blend vector similarity (when enabled).
- Use “Upload” to send a file to the current folder.
- Click a file name or download action to retrieve it.

## Known Caveats / TODO

- When starting from `NEXTCLOUD_ROOT_PATH`, breadcrumb segments may include technical path prefixes (e.g., `/remote.php`, `/dav`) that aren’t browsable independently. This can be adjusted to start breadcrumbs from the user root only.
- Embeddings backfill script & “Find similar” API/UI are planned.
- Tests (unit/integration/E2E) and TrueNAS-specific compose notes can be added next.

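The breadcrumb caveat above could be addressed by computing crumbs relative to the configured root instead of the full DAV path. The following is one possible sketch, not the app's current behavior; `buildBreadcrumbs`, the `Crumb` type, and the "Home" label are all hypothetical.

```typescript
// One possible fix, sketched for illustration; not the app's current breadcrumb logic.
interface Crumb {
  label: string;
  path: string;
}

function buildBreadcrumbs(rootPath: string, currentPath: string): Crumb[] {
  const clean = (p: string): string => "/" + p.split("/").filter((s) => s.length > 0).join("/");
  const root = clean(rootPath);
  const current = clean(currentPath);

  // The root itself is shown as a single crumb, hiding /remote.php, /dav, etc.
  const crumbs: Crumb[] = [{ label: "Home", path: root }];

  const prefix = root === "/" ? "/" : root + "/";
  if (current !== root && !current.startsWith(prefix)) {
    return crumbs; // current path is outside the root: show the root only
  }

  // Walk the segments below the root, accumulating a full path for each crumb.
  const rest = current.slice(root.length).split("/").filter((s) => s.length > 0);
  let acc = root;
  for (const segment of rest) {
    acc = acc === "/" ? "/" + segment : acc + "/" + segment;
    crumbs.push({ label: segment, path: acc });
  }
  return crumbs;
}
```

Each crumb still carries the full DAV path, so clicking it can reuse the existing `path` query parameter while the UI only displays the user-visible part.
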
## Development Notes

- Code style: TypeScript, ESLint (Next config), Tailwind v4.
- shadcn components are added (Slate); most primitives are included.
- ES types come from the `estypes` export of `@elastic/elasticsearch`.
- Stream interop between Node and Web streams is handled in downloads and uploads.
- Zod-based env loader normalizes URLs and validates required vars.

## Troubleshooting

- Search 500s:
  - Ensure Elasticsearch is running and reachable at `ELASTICSEARCH_URL`.
  - Run `npm run create:index` to create the index and alias.
  - If using Tika for content, ensure `TIKA_BASE_URL` is set and service is healthy (optional).
- WebDAV failures:
  - Verify `NEXTCLOUD_BASE_URL`, `NEXTCLOUD_USERNAME`, `NEXTCLOUD_APP_PASSWORD`, and `NEXTCLOUD_ROOT_PATH`.
  - Confirm the user has permission for the target path.
- Sentry issues:
  - Ensure `SENTRY_DSN` is set; check networking/outbound restrictions.

## Scripts

- `npm run create:index`: Create/recreate Elasticsearch index and alias
- `scripts/ingest-nextcloud.ts`: Crawl Nextcloud → Tika (optional) → Elasticsearch

## Security

- Keep `.env.local` out of source control (already in .gitignore).
- Use Nextcloud App Passwords, not login passwords.
- Harden Elasticsearch/Kibana/Tika for production (auth, TLS, resource limits).

---

Built per implementation_plan.md with a deterministic order of rollout and Sentry instrumentation aligned to organizational rules.