-Jan is a ChatGPT-alternative that runs 100% offline on your device. Our goal is to make it easy for a layperson to download and run LLMs and use AI with **full control** and **privacy**.
-
-**⚠️ Jan is in active development.**
+Jan is an AI assistant that can run 100% offline on your device. Download and run LLMs with
+**full control** and **privacy**.
## Installation
-Because clicking a button is still the easiest way to get started:
+The easiest way to get started is to download the version for your operating system:
Download from [jan.ai](https://jan.ai/) or [GitHub Releases](https://github.com/menloresearch/jan/releases).
-## Demo
-
-
## Features
@@ -149,13 +137,12 @@ For detailed compatibility, check our [installation guides](https://jan.ai/docs/
## Troubleshooting
-When things go sideways (they will):
+If things go sideways:
1. Check our [troubleshooting docs](https://jan.ai/docs/troubleshooting)
2. Copy your error logs and system specs
3. Ask for help in our [Discord](https://discord.gg/FTk2MvZwJH) `#🆘|jan-help` channel
-We keep logs for 24 hours, so don't procrastinate on reporting issues.
## Contributing
@@ -175,15 +162,6 @@ Contributions welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for the full spiel
- **Jobs**: hr@jan.ai
- **General Discussion**: [Discord](https://discord.gg/FTk2MvZwJH)
-## Trust & Safety
-
-**Friendly reminder**: We're not trying to scam you.
-
-- We won't ask for personal information
-- Jan is completely free (no premium version exists)
-- We don't have a cryptocurrency or ICO
-- We're bootstrapped and not seeking your investment (yet)
-
## License
Apache 2.0 - Because sharing is caring.
diff --git a/docs/src/pages/docs/_assets/hf_hub.png b/docs/src/pages/docs/_assets/hf_hub.png
new file mode 100644
index 000000000..ad059c49a
Binary files /dev/null and b/docs/src/pages/docs/_assets/hf_hub.png differ
diff --git a/docs/src/pages/docs/_assets/hf_jan_nano.png b/docs/src/pages/docs/_assets/hf_jan_nano.png
new file mode 100644
index 000000000..147a5c70e
Binary files /dev/null and b/docs/src/pages/docs/_assets/hf_jan_nano.png differ
diff --git a/docs/src/pages/docs/_assets/hf_jan_nano_2.png b/docs/src/pages/docs/_assets/hf_jan_nano_2.png
new file mode 100644
index 000000000..10c410240
Binary files /dev/null and b/docs/src/pages/docs/_assets/hf_jan_nano_2.png differ
diff --git a/docs/src/pages/docs/_assets/hf_jan_nano_3.png b/docs/src/pages/docs/_assets/hf_jan_nano_3.png
new file mode 100644
index 000000000..dac240d29
Binary files /dev/null and b/docs/src/pages/docs/_assets/hf_jan_nano_3.png differ
diff --git a/docs/src/pages/docs/_assets/hf_jan_nano_4.png b/docs/src/pages/docs/_assets/hf_jan_nano_4.png
new file mode 100644
index 000000000..552f07b06
Binary files /dev/null and b/docs/src/pages/docs/_assets/hf_jan_nano_4.png differ
diff --git a/docs/src/pages/docs/_assets/hf_jan_nano_5.png b/docs/src/pages/docs/_assets/hf_jan_nano_5.png
new file mode 100644
index 000000000..b322f0f93
Binary files /dev/null and b/docs/src/pages/docs/_assets/hf_jan_nano_5.png differ
diff --git a/docs/src/pages/docs/_assets/hf_jan_nano_6.png b/docs/src/pages/docs/_assets/hf_jan_nano_6.png
new file mode 100644
index 000000000..c8be2b707
Binary files /dev/null and b/docs/src/pages/docs/_assets/hf_jan_nano_6.png differ
diff --git a/docs/src/pages/docs/_assets/hf_jan_nano_7.png b/docs/src/pages/docs/_assets/hf_jan_nano_7.png
new file mode 100644
index 000000000..2a8ba8438
Binary files /dev/null and b/docs/src/pages/docs/_assets/hf_jan_nano_7.png differ
diff --git a/docs/src/pages/docs/_assets/hf_jan_nano_8.png b/docs/src/pages/docs/_assets/hf_jan_nano_8.png
new file mode 100644
index 000000000..4e1885a8e
Binary files /dev/null and b/docs/src/pages/docs/_assets/hf_jan_nano_8.png differ
diff --git a/docs/src/pages/docs/_assets/hf_jan_nano_9.png b/docs/src/pages/docs/_assets/hf_jan_nano_9.png
new file mode 100644
index 000000000..09575c541
Binary files /dev/null and b/docs/src/pages/docs/_assets/hf_jan_nano_9.png differ
diff --git a/docs/src/pages/docs/_assets/hf_jan_setup.png b/docs/src/pages/docs/_assets/hf_jan_setup.png
new file mode 100644
index 000000000..2d917539b
Binary files /dev/null and b/docs/src/pages/docs/_assets/hf_jan_setup.png differ
diff --git a/docs/src/pages/docs/_assets/hf_providers.png b/docs/src/pages/docs/_assets/hf_providers.png
new file mode 100644
index 000000000..1f8e4daf7
Binary files /dev/null and b/docs/src/pages/docs/_assets/hf_providers.png differ
diff --git a/docs/src/pages/docs/remote-models/_meta.json b/docs/src/pages/docs/remote-models/_meta.json
index 39660be88..9ef524352 100644
--- a/docs/src/pages/docs/remote-models/_meta.json
+++ b/docs/src/pages/docs/remote-models/_meta.json
@@ -26,5 +26,9 @@
"openrouter": {
"title": "OpenRouter",
"href": "/docs/remote-models/openrouter"
+ },
+ "huggingface": {
+ "title": "Hugging Face",
+ "href": "/docs/remote-models/huggingface"
}
}
diff --git a/docs/src/pages/docs/remote-models/huggingface.mdx b/docs/src/pages/docs/remote-models/huggingface.mdx
new file mode 100644
index 000000000..07f2103d2
--- /dev/null
+++ b/docs/src/pages/docs/remote-models/huggingface.mdx
@@ -0,0 +1,152 @@
+---
+title: Hugging Face
+description: Learn how to integrate Hugging Face models with Jan using the Router or Inference Endpoints.
+keywords:
+ [
+ Hugging Face,
+ Jan,
+ Jan AI,
+ Hugging Face Router,
+ Hugging Face Inference Endpoints,
+ Hugging Face API,
+ Hugging Face Integration,
+ Hugging Face API Integration
+ ]
+---
+
+import { Callout, Steps } from 'nextra/components'
+import { Settings, Plus } from 'lucide-react'
+
+# Hugging Face
+
+Jan supports Hugging Face models through two methods: the new **HF Router** (recommended) and **Inference Endpoints**. Both methods require a Hugging Face token and **billing to be set up**.
+
+
+
+## Option 1: HF Router (Recommended)
+
+The HF Router provides access to models from multiple providers (Replicate, Together AI, SambaNova, Fireworks, Cohere, and more) through a single endpoint.
+
+
+
+### Step 1: Get Your HF Token
+
+Visit [Hugging Face Settings > Access Tokens](https://huggingface.co/settings/tokens) and create a token. Make sure you have billing set up on your account.
+
+### Step 2: Configure Jan
+
+1. Go to **Settings** > **Model Providers** > **HuggingFace**
+2. Enter your HF token
+3. Use this URL: `https://router.huggingface.co/v1`
+
+
+
+You can find out more about the HF Router [here](https://huggingface.co/docs/inference-providers/index).
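+
+If you want to confirm the token and URL before opening Jan, you can call the router directly; it exposes an OpenAI-compatible chat completions API. A minimal sketch (assuming `HF_TOKEN` is exported in your shell and using an example model ID):
+
+```typescript
+// Minimal sanity check for your token and the router URL (Node 18+).
+const res = await fetch('https://router.huggingface.co/v1/chat/completions', {
+  method: 'POST',
+  headers: {
+    Authorization: `Bearer ${process.env.HF_TOKEN}`,
+    'Content-Type': 'application/json',
+  },
+  body: JSON.stringify({
+    model: 'openai/gpt-oss-20b', // example only; swap in any model ID the router offers you
+    messages: [{ role: 'user', content: 'Say hello from the HF Router.' }],
+  }),
+})
+
+const data = await res.json()
+console.log(data.choices?.[0]?.message?.content ?? data)
+```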
+
+### Step 3: Start Using Models
+
+Jan comes with three HF Router models pre-configured. Select one and start chatting immediately.
+
+
+
+
+The HF Router automatically routes your requests to the best available provider for each model, giving you access to a wide variety of models without managing individual endpoints.
+
+
+## Option 2: HF Inference Endpoints
+
+For more control over specific models and deployment configurations, you can use Hugging Face Inference Endpoints.
+
+
+
+### Step 1: Navigate to the Hugging Face Model Hub
+
+Visit the [Hugging Face Model Hub](https://huggingface.co/models) (make sure you are logged in) and pick the model you want to use.
+
+
+
+### Step 2: Configure HF Inference Endpoint and Deploy
+
+Once you have selected the model you want to use, click the **Deploy** button and choose a deployment method. For this guide, select **HF Inference Endpoints**.
+
+
+
+
+This takes you to the deployment setup page. For this example, leave the default settings under the GPU tab and click **Create Endpoint**.
+
+
+
+
+Once your endpoint is ready, test that it works on the **Test your endpoint** tab.
+
+
+
+
+If you get a response, you can click on **Copy** to copy the endpoint URL and API key.
+
+
+ You will need to be logged in to Hugging Face Inference Endpoints and have a credit card on file to deploy a model.
+
+
+### Step 3: Configure Jan
+
+If you do not have an API key, you can create one under **Settings** > **Access Tokens** [here](https://huggingface.co/settings/tokens). Once created, copy the token and add it to Jan alongside your endpoint URL at **Settings** > **Model Providers** > **HuggingFace**.
+
+**3.1 HF Token**
+
+
+
+**3.2 HF Endpoint URL**
+
+
+
+**3.3 Jan Settings**
+
+
+
+Make sure to add `/v1/` to the end of your endpoint URL; Jan talks to the endpoint through the OpenAI-compatible API, which expects that path.
+
+
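+To double-check the endpoint URL and token outside Jan, you can send an OpenAI-style request yourself. A minimal sketch, with a placeholder endpoint URL:
+
+```typescript
+// Placeholder URL: replace with the URL copied from your endpoint page, keeping the /v1 suffix.
+const ENDPOINT_URL = 'https://your-endpoint-name.endpoints.huggingface.cloud/v1'
+
+const res = await fetch(`${ENDPOINT_URL}/chat/completions`, {
+  method: 'POST',
+  headers: {
+    Authorization: `Bearer ${process.env.HF_TOKEN}`, // your HF token
+    'Content-Type': 'application/json',
+  },
+  body: JSON.stringify({
+    // Dedicated endpoints generally accept a placeholder model name such as "tgi"
+    model: 'tgi',
+    messages: [{ role: 'user', content: 'ping' }],
+  }),
+})
+
+console.log(res.status, await res.text())
+```
+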
+**3.4 Add Model Details**
+
+
+### Step 4: Start Using the Model
+
+Now you can start using the model in any chat.
+
+
+
+If you want to learn how to use Jan Nano with MCP, check out [the guide here](../jan-models/jan-nano-32).
+
+
+
+
+## Available Hugging Face Models
+
+**Option 1 (HF Router):** Access to models from multiple providers as shown in the providers image above.
+
+**Option 2 (Inference Endpoints):** You can follow the steps above with a large number of models on Hugging Face and bring them into Jan. Browse other models in the [Hugging Face Model Hub](https://huggingface.co/models).
+
+## Troubleshooting
+
+Common issues and solutions:
+
+**1. Started a chat but the model is not responding**
+- Verify your API key (HF token) is correct and not expired (a quick check is sketched below)
+- Ensure you have billing set up on your HF account
+- For Inference Endpoints: make sure the endpoint is still running; endpoints go idle after a period of inactivity so you are not charged while idle, and they take a moment to start back up
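+
+You can sanity-check the token itself against Hugging Face's `whoami-v2` API (a minimal sketch, assuming `HF_TOKEN` is exported in your shell):
+
+```typescript
+// 200 with your account info means the token is valid; 401 means it is wrong or expired.
+const res = await fetch('https://huggingface.co/api/whoami-v2', {
+  headers: { Authorization: `Bearer ${process.env.HF_TOKEN}` },
+})
+console.log(res.status, await res.json())
+```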
+
+
+
+**2. Connection Problems**
+- Check your internet connection
+- Verify Hugging Face's system status
+- Look for error messages in [Jan's logs](/docs/troubleshooting#how-to-get-error-logs)
+
+**3. Model Unavailable**
+- Confirm your API key has access to the model
+- Check if you're using the correct model ID
+- Verify your Hugging Face account has the necessary permissions
+
+Need more help? Join our [Discord community](https://discord.gg/FTk2MvZwJH) or check the
+[Hugging Face Inference Endpoints documentation](https://huggingface.co/docs/inference-endpoints/index).
diff --git a/docs/src/pages/post/_assets/gpt-oss locally.jpeg b/docs/src/pages/post/_assets/gpt-oss locally.jpeg
new file mode 100644
index 000000000..7d0e59717
Binary files /dev/null and b/docs/src/pages/post/_assets/gpt-oss locally.jpeg differ
diff --git a/docs/src/pages/post/_assets/jan gpt-oss.jpeg b/docs/src/pages/post/_assets/jan gpt-oss.jpeg
new file mode 100644
index 000000000..2e9e6b9d6
Binary files /dev/null and b/docs/src/pages/post/_assets/jan gpt-oss.jpeg differ
diff --git a/docs/src/pages/post/_assets/jan hub gpt-oss locally.jpeg b/docs/src/pages/post/_assets/jan hub gpt-oss locally.jpeg
new file mode 100644
index 000000000..04b0f5ca6
Binary files /dev/null and b/docs/src/pages/post/_assets/jan hub gpt-oss locally.jpeg differ
diff --git a/docs/src/pages/post/_assets/run gpt-oss locally in jan.jpeg b/docs/src/pages/post/_assets/run gpt-oss locally in jan.jpeg
new file mode 100644
index 000000000..68b6d725c
Binary files /dev/null and b/docs/src/pages/post/_assets/run gpt-oss locally in jan.jpeg differ
diff --git a/docs/src/pages/post/run-gpt-oss-locally.mdx b/docs/src/pages/post/run-gpt-oss-locally.mdx
new file mode 100644
index 000000000..5f71e8b45
--- /dev/null
+++ b/docs/src/pages/post/run-gpt-oss-locally.mdx
@@ -0,0 +1,211 @@
+---
+title: "Run OpenAI's gpt-oss locally in 5 mins (Beginner Guide)"
+description: "Complete 5-minute beginner guide to running OpenAI's gpt-oss locally. Step-by-step setup with Jan AI for private, offline AI conversations."
+tags: OpenAI, gpt-oss, local AI, Jan, privacy, Apache-2.0, llama.cpp, Ollama, LM Studio
+categories: guides
+date: 2025-08-06
+ogImage: assets/gpt-oss%20locally.jpeg
+twitter:
+ card: summary_large_image
+ site: "@jandotai"
+ title: "Run OpenAI's gpt-oss Locally in 5 Minutes (Beginner Guide)"
+ description: "Complete 5-minute beginner guide to running OpenAI's gpt-oss locally with Jan AI for private, offline conversations."
+ image: assets/gpt-oss%20locally.jpeg
+---
+import { Callout } from 'nextra/components'
+import CTABlog from '@/components/Blog/CTA'
+
+# Run OpenAI's gpt-oss Locally in 5 mins
+
+
+
+OpenAI launched [gpt-oss](https://openai.com/index/introducing-gpt-oss/), marking their return to open-source AI after GPT-2. This model is designed to run locally on consumer hardware. This guide shows you how to install and run gpt-oss on your computer for private, offline AI conversations.
+
+## What is gpt-oss?
+
+gpt-oss is OpenAI's open-source large language model, released under the Apache-2.0 license. Unlike ChatGPT, gpt-oss:
+
+- Runs completely offline - No internet required after setup
+- 100% private - Your conversations never leave your device
+- Unlimited usage - No token limits or rate limiting
+- Free forever - No subscription fees
+- Commercial use allowed - Apache-2.0 license permits business use
+
+Running AI models locally means everything happens on your own hardware, giving you complete control over your data and conversations.
+
+## gpt-oss System Requirements
+
+| Component | Minimum | Recommended |
+|-----------|---------|-------------|
+| **RAM** | 16 GB | 32 GB+ |
+| **Storage** | 11+ GB free | 25 GB+ free |
+| **CPU** | 4 cores | 8+ cores |
+| **GPU** | Optional | Modern GPU with 6GB+ VRAM recommended |
+| **OS** | Windows 10+, macOS 11+, Linux | Latest versions |
+
+**Installation apps available:**
+- **Jan** (Recommended - easiest setup)
+- **llama.cpp** (Command line)
+- **Ollama** (Docker-based)
+- **LM Studio** (GUI alternative)
+
+## How to install gpt-oss locally with Jan (5 mins)
+
+### Step 1: Download Jan
+
+First download Jan to run gpt-oss locally: [Download Jan AI](https://jan.ai/)
+
+
+Jan is the simplest way to run AI models locally. It automatically handles CPU/GPU optimization, provides a clean chat interface, and requires zero technical knowledge.
+
+
+### Step 2: Install gpt-oss Model (2-3 minutes)
+
+
+
+1. Open Jan Hub → search "gpt-oss" (it appears at the top)
+2. Click Download and wait for completion (~11GB download)
+3. Installation is automatic - Jan handles everything
+
+### Step 3: Start using gpt-oss offline (30 seconds)
+
+
+
+1. Go to New Chat → select gpt-oss-20b from model picker
+2. Start chatting - Jan automatically optimizes for your hardware
+3. You're done! Your AI conversations now stay completely private
+
+Success: Your gpt-oss setup is complete. No internet required for chatting, unlimited usage, zero subscription fees.
+
+## Jan with gpt-oss vs ChatGPT vs other Local AI Models
+
+| Feature | gpt-oss (Local) | ChatGPT Plus | Claude Pro | Other Local Models |
+|---------|----------------|--------------|------------|-------------------|
+| Cost | Free forever | $20/month | $20/month | Free |
+| Privacy | 100% private | Data sent to OpenAI | Data sent to Anthropic | 100% private |
+| Internet | Offline after setup | Requires internet | Requires internet | Offline |
+| Usage limits | Unlimited | Rate limited | Rate limited | Unlimited |
+| Performance | Good (hardware dependent) | Excellent | Excellent | Varies |
+| Setup difficulty | Easy with Jan | None | None | Varies |
+
+## Alternative Installation Methods
+
+### Option 1: Jan (Recommended)
+
+- Best for: Complete beginners, users wanting GUI interface
+- Setup time: 5 minutes
+- Difficulty: Very Easy
+
+Already covered above - [Download Jan](https://jan.ai/)
+
+### Option 2: llama.cpp (Command Line)
+
+- Best for: Developers, terminal users, custom integrations
+- Setup time: 10-15 minutes
+- Difficulty: Intermediate
+
+```bash
+# macOS (Homebrew); on Windows, grab the prebuilt llama.cpp release instead
+brew install llama.cpp
+
+# Download the GGUF weights (URL from this guide)
+curl -L -o gpt-oss-20b.gguf https://huggingface.co/openai/gpt-oss-20b-gguf/resolve/main/gpt-oss-20b.gguf
+
+# Start an interactive chat
+llama-cli -m gpt-oss-20b.gguf
+
+# Add GPU acceleration (adjust -ngl value based on your GPU VRAM)
+llama-cli -m gpt-oss-20b.gguf -ngl 20
+```
+
+### Option 3: Ollama (Docker-Based)
+
+- Best for: Docker users, server deployments
+- Setup time: 5-10 minutes
+- Difficulty: Intermediate
+
+```bash
+# Install from https://ollama.com
+ollama run gpt-oss:20b
+```
+
+### Option 4: LM Studio (GUI Alternative)
+
+- Best for: Users wanting GUI but not Jan
+- Setup time: 10 minutes
+- Difficulty: Easy
+
+1. Download LM Studio from official website
+2. Go to Models → search "gpt-oss-20b (GGUF)"
+3. Download the model (wait for completion)
+4. Go to Chat tab → select the model and start chatting
+
+## gpt-oss Performance & Troubleshooting
+
+### Expected Performance Benchmarks
+
+| Hardware Setup | First Response | Subsequent Responses | Tokens/Second |
+|---------------|---------------|---------------------|---------------|
+| **16GB RAM + CPU only** | 30-45 seconds | 3-6 seconds | 3-8 tokens/sec |
+| **32GB RAM + RTX 3060** | 15-25 seconds | 1-3 seconds | 15-25 tokens/sec |
+| **32GB RAM + RTX 4080+** | 8-15 seconds | 1-2 seconds | 25-45 tokens/sec |
+
+### Common Issues & Solutions
+
+Performance optimization tips:
+- First response is slow: Normal - kernels compile once, then speed up dramatically
+- Out of VRAM error: Reduce context length in settings or switch to CPU mode
+- Out of memory: Close memory-heavy apps (Chrome, games, video editors)
+- Slow responses: Check if other apps are using GPU/CPU heavily
+
+Quick fixes:
+1. Restart Jan if responses become slow
+2. Lower context window from 4096 to 2048 tokens
+3. Enable CPU mode if GPU issues persist
+4. Free up RAM by closing unused applications
+
+## Frequently Asked Questions (FAQ)
+
+### Is gpt-oss completely free?
+Yes! gpt-oss is 100% free under Apache-2.0 license. No subscription fees, no token limits, no hidden costs.
+
+### How much internet data does gpt-oss use?
+Only for the initial 11GB download. After installation, gpt-oss works completely offline with zero internet usage.
+
+### Can I use gpt-oss for commercial projects?
+Absolutely! The Apache-2.0 license permits commercial use, modification, and distribution.
+
+### Is gpt-oss better than ChatGPT?
+gpt-oss offers different advantages: complete privacy, unlimited usage, offline capability, and no costs. ChatGPT may have better performance but requires internet and subscriptions.
+
+### What happens to my conversations with gpt-oss?
+Your conversations stay 100% on your device. Nothing is sent to OpenAI, Jan, or any external servers.
+
+### Can I run gpt-oss on a Mac with 8GB RAM?
+No, gpt-oss requires minimum 16GB RAM. Consider upgrading your RAM or using cloud-based alternatives.
+
+### How do I update gpt-oss to newer versions?
+Jan automatically notifies you of updates. Simply click update in Jan Hub when new versions are available.
+
+## Why Choose gpt-oss Over ChatGPT Plus?
+
+gpt-oss advantages:
+- $0/month vs $20/month for ChatGPT Plus
+- 100% private - no data leaves your device
+- Unlimited usage - no rate limits or restrictions
+- Works offline - no internet required after setup
+- Commercial use allowed - build businesses with it
+
+When to choose ChatGPT Plus instead:
+- You need the absolute best performance
+- You don't want to manage local installation
+- You have less than 16GB RAM
+
+## Get started with gpt-oss today
+
+
+
+Ready to try gpt-oss?
+- Download Jan: [https://jan.ai/](https://jan.ai/)
+- View source code: [https://github.com/menloresearch/jan](https://github.com/menloresearch/jan)
+- Need help? Check our [local AI guide](/post/run-ai-models-locally) for beginners
+
+
\ No newline at end of file
diff --git a/extensions/llamacpp-extension/settings.json b/extensions/llamacpp-extension/settings.json
index 363822f9a..46c4995ff 100644
--- a/extensions/llamacpp-extension/settings.json
+++ b/extensions/llamacpp-extension/settings.json
@@ -25,18 +25,6 @@
"controllerType": "checkbox",
"controllerProps": { "value": true }
},
- {
- "key": "chat_template",
- "title": "Custom Jinja Chat template",
- "description": "Custom Jinja chat_template to be used for the model",
- "controllerType": "input",
- "controllerProps": {
- "value": "",
- "placeholder": "e.g., {% for message in messages %}...{% endfor %} (default is read from GGUF)",
- "type": "text",
- "textAlign": "right"
- }
- },
{
"key": "threads",
"title": "Threads",
diff --git a/extensions/llamacpp-extension/src/backend.ts b/extensions/llamacpp-extension/src/backend.ts
index e8068f63b..3bf6a2675 100644
--- a/extensions/llamacpp-extension/src/backend.ts
+++ b/extensions/llamacpp-extension/src/backend.ts
@@ -50,14 +50,18 @@ export async function listSupportedBackends(): Promise<
if (features.avx2) supportedBackends.push('linux-avx2-x64')
if (features.avx512) supportedBackends.push('linux-avx512-x64')
if (features.cuda11) {
- if (features.avx512) supportedBackends.push('linux-avx512-cuda-cu11.7-x64')
- else if (features.avx2) supportedBackends.push('linux-avx2-cuda-cu11.7-x64')
+ if (features.avx512)
+ supportedBackends.push('linux-avx512-cuda-cu11.7-x64')
+ else if (features.avx2)
+ supportedBackends.push('linux-avx2-cuda-cu11.7-x64')
else if (features.avx) supportedBackends.push('linux-avx-cuda-cu11.7-x64')
else supportedBackends.push('linux-noavx-cuda-cu11.7-x64')
}
if (features.cuda12) {
- if (features.avx512) supportedBackends.push('linux-avx512-cuda-cu12.0-x64')
- else if (features.avx2) supportedBackends.push('linux-avx2-cuda-cu12.0-x64')
+ if (features.avx512)
+ supportedBackends.push('linux-avx512-cuda-cu12.0-x64')
+ else if (features.avx2)
+ supportedBackends.push('linux-avx2-cuda-cu12.0-x64')
else if (features.avx) supportedBackends.push('linux-avx-cuda-cu12.0-x64')
else supportedBackends.push('linux-noavx-cuda-cu12.0-x64')
}
@@ -256,10 +260,16 @@ async function _getSupportedFeatures() {
if (compareVersions(driverVersion, minCuda12DriverVersion) >= 0)
features.cuda12 = true
}
-
- if (gpuInfo.vulkan_info?.api_version) features.vulkan = true
+ // Vulkan support check - only discrete GPUs with 6GB+ VRAM
+ if (
+ gpuInfo.vulkan_info?.api_version &&
+ gpuInfo.vulkan_info?.device_type === 'DISCRETE_GPU' &&
+ gpuInfo.total_memory >= 6 * 1024
+ ) {
+ // 6GB (total_memory is in MB)
+ features.vulkan = true
+ }
}
-
return features
}
diff --git a/extensions/llamacpp-extension/src/index.ts b/extensions/llamacpp-extension/src/index.ts
index 140b08418..92ceaad60 100644
--- a/extensions/llamacpp-extension/src/index.ts
+++ b/extensions/llamacpp-extension/src/index.ts
@@ -39,6 +39,7 @@ type LlamacppConfig = {
auto_unload: boolean
chat_template: string
n_gpu_layers: number
+ override_tensor_buffer_t: string
ctx_size: number
threads: number
threads_batch: number
@@ -144,7 +145,6 @@ export default class llamacpp_extension extends AIEngine {
readonly providerId: string = 'llamacpp'
private config: LlamacppConfig
- private activeSessions: Map = new Map()
private providerPath!: string
private apiSecret: string = 'JustAskNow'
private pendingDownloads: Map> = new Map()
@@ -770,16 +770,6 @@ export default class llamacpp_extension extends AIEngine {
override async onUnload(): Promise {
// Terminate all active sessions
- for (const [_, sInfo] of this.activeSessions) {
- try {
- await this.unload(sInfo.model_id)
- } catch (error) {
- logger.error(`Failed to unload model ${sInfo.model_id}:`, error)
- }
- }
-
- // Clear the sessions map
- this.activeSessions.clear()
}
onSettingUpdate(key: string, value: T): void {
@@ -1103,67 +1093,13 @@ export default class llamacpp_extension extends AIEngine {
* Function to find a random port
*/
private async getRandomPort(): Promise<number> {
- const MAX_ATTEMPTS = 20000
- let attempts = 0
-
- while (attempts < MAX_ATTEMPTS) {
- const port = Math.floor(Math.random() * 1000) + 3000
-
- const isAlreadyUsed = Array.from(this.activeSessions.values()).some(
- (info) => info.port === port
- )
-
- if (!isAlreadyUsed) {
- const isAvailable = await invoke('is_port_available', { port })
- if (isAvailable) return port
- }
-
- attempts++
+ try {
+ const port = await invoke('get_random_port')
+ return port
+ } catch {
+ logger.error('Unable to find a suitable port')
+ throw new Error('Unable to find a suitable port for model')
}
-
- throw new Error('Failed to find an available port for the model to load')
- }
-
- private async sleep(ms: number): Promise {
- return new Promise((resolve) => setTimeout(resolve, ms))
- }
-
- private async waitForModelLoad(
- sInfo: SessionInfo,
- timeoutMs = 240_000
- ): Promise {
- await this.sleep(500) // Wait before first check
- const start = Date.now()
- while (Date.now() - start < timeoutMs) {
- try {
- const res = await fetch(`http://localhost:${sInfo.port}/health`)
-
- if (res.status === 503) {
- const body = await res.json()
- const msg = body?.error?.message ?? 'Model loading'
- logger.info(`waiting for model load... (${msg})`)
- } else if (res.ok) {
- const body = await res.json()
- if (body.status === 'ok') {
- return
- } else {
- logger.warn('Unexpected OK response from /health:', body)
- }
- } else {
- logger.warn(`Unexpected status ${res.status} from /health`)
- }
- } catch (e) {
- await this.unload(sInfo.model_id)
- throw new Error(`Model appears to have crashed: ${e}`)
- }
-
- await this.sleep(800) // Retry interval
- }
-
- await this.unload(sInfo.model_id)
- throw new Error(
- `Timed out loading model after ${timeoutMs}... killing llamacpp`
- )
}
override async load(
@@ -1171,7 +1107,7 @@ export default class llamacpp_extension extends AIEngine {
overrideSettings?: Partial,
isEmbedding: boolean = false
): Promise {
- const sInfo = this.findSessionByModel(modelId)
+ const sInfo = await this.findSessionByModel(modelId)
if (sInfo) {
throw new Error('Model already loaded!!')
}
@@ -1262,6 +1198,14 @@ export default class llamacpp_extension extends AIEngine {
args.push('--jinja')
args.push('--reasoning-format', 'none')
args.push('-m', modelPath)
+ // For overriding tensor buffer type, useful where
+ // massive MOE models can be made faster by keeping attention on the GPU
+ // and offloading the expert FFNs to the CPU.
+ // This is an expert-level setting and should only be used by people
+ // who know what they are doing.
+ // Takes a regex with matching tensor name as input
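+ // e.g. ".ffn_.*_exps.=CPU" maps the expert FFN tensors to CPU buffers
+ // (llama.cpp --override-tensor takes <tensor name pattern>=<buffer type>)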
+ if (cfg.override_tensor_buffer_t)
+ args.push('--override-tensor', cfg.override_tensor_buffer_t)
args.push('-a', modelId)
args.push('--port', String(port))
if (modelConfig.mmproj_path) {
@@ -1333,26 +1277,20 @@ export default class llamacpp_extension extends AIEngine {
libraryPath,
args,
})
-
- // Store the session info for later use
- this.activeSessions.set(sInfo.pid, sInfo)
- await this.waitForModelLoad(sInfo)
-
return sInfo
} catch (error) {
- logger.error('Error loading llama-server:\n', error)
- throw new Error(`Failed to load llama-server: ${error}`)
+ logger.error('Error in load command:\n', error)
+ throw new Error(`Failed to load model:\n${error}`)
}
}
override async unload(modelId: string): Promise {
- const sInfo: SessionInfo = this.findSessionByModel(modelId)
+ const sInfo: SessionInfo = await this.findSessionByModel(modelId)
if (!sInfo) {
throw new Error(`No active session found for model: ${modelId}`)
}
const pid = sInfo.pid
try {
- this.activeSessions.delete(pid)
// Pass the PID as the session_id
const result = await invoke('unload_llama_model', {
@@ -1364,13 +1302,11 @@ export default class llamacpp_extension extends AIEngine {
logger.info(`Successfully unloaded model with PID ${pid}`)
} else {
logger.warn(`Failed to unload model: ${result.error}`)
- this.activeSessions.set(sInfo.pid, sInfo)
}
return result
} catch (error) {
logger.error('Error in unload command:', error)
- this.activeSessions.set(sInfo.pid, sInfo)
return {
success: false,
error: `Failed to unload model: ${error}`,
@@ -1493,17 +1429,21 @@ export default class llamacpp_extension extends AIEngine {
}
}
- private findSessionByModel(modelId: string): SessionInfo | undefined {
- return Array.from(this.activeSessions.values()).find(
- (session) => session.model_id === modelId
- )
+ private async findSessionByModel(modelId: string): Promise<SessionInfo | undefined> {
+ try {
+ let sInfo = await invoke('find_session_by_model', {modelId})
+ return sInfo
+ } catch (e) {
+ logger.error(e)
+ throw new Error(String(e))
+ }
}
override async chat(
opts: chatCompletionRequest,
abortController?: AbortController
): Promise> {
- const sessionInfo = this.findSessionByModel(opts.model)
+ const sessionInfo = await this.findSessionByModel(opts.model)
if (!sessionInfo) {
throw new Error(`No active session found for model: ${opts.model}`)
}
@@ -1519,7 +1459,6 @@ export default class llamacpp_extension extends AIEngine {
throw new Error('Model appears to have crashed! Please reload!')
}
} else {
- this.activeSessions.delete(sessionInfo.pid)
throw new Error('Model have crashed! Please reload!')
}
const baseUrl = `http://localhost:${sessionInfo.port}/v1`
@@ -1568,11 +1507,13 @@ export default class llamacpp_extension extends AIEngine {
}
override async getLoadedModels(): Promise<string[]> {
- let lmodels: string[] = []
- for (const [_, sInfo] of this.activeSessions) {
- lmodels.push(sInfo.model_id)
- }
- return lmodels
+ try {
+ let models: string[] = await invoke('get_loaded_models')
+ return models
+ } catch (e) {
+ logger.error(e)
+ throw new Error(e)
+ }
}
async getDevices(): Promise {
@@ -1602,7 +1543,7 @@ export default class llamacpp_extension extends AIEngine {
}
async embed(text: string[]): Promise {
- let sInfo = this.findSessionByModel('sentence-transformer-mini')
+ let sInfo = await this.findSessionByModel('sentence-transformer-mini')
if (!sInfo) {
const downloadedModelList = await this.list()
if (
diff --git a/src-tauri/Cargo.toml b/src-tauri/Cargo.toml
index 0f334d178..ca1a54bba 100644
--- a/src-tauri/Cargo.toml
+++ b/src-tauri/Cargo.toml
@@ -63,8 +63,12 @@ nix = "=0.30.1"
[target.'cfg(windows)'.dependencies]
libc = "0.2.172"
+windows-sys = { version = "0.60.2", features = ["Win32_Storage_FileSystem"] }
[target.'cfg(not(any(target_os = "android", target_os = "ios")))'.dependencies]
tauri-plugin-updater = "2"
once_cell = "1.18"
tauri-plugin-single-instance = { version = "2.0.0", features = ["deep-link"] }
+
+[target.'cfg(windows)'.dev-dependencies]
+tempfile = "3.20.0"
diff --git a/src-tauri/src/core/utils/extensions/inference_llamacpp_extension/server.rs b/src-tauri/src/core/utils/extensions/inference_llamacpp_extension/server.rs
index ffa6cfe92..b95e17010 100644
--- a/src-tauri/src/core/utils/extensions/inference_llamacpp_extension/server.rs
+++ b/src-tauri/src/core/utils/extensions/inference_llamacpp_extension/server.rs
@@ -1,7 +1,9 @@
use base64::{engine::general_purpose, Engine as _};
use hmac::{Hmac, Mac};
+use rand::{rngs::StdRng, Rng, SeedableRng};
use serde::{Deserialize, Serialize};
use sha2::Sha256;
+use std::collections::HashSet;
use std::path::PathBuf;
use std::process::Stdio;
use std::time::Duration;
@@ -67,13 +69,39 @@ pub struct DeviceInfo {
pub free: i32,
}
+#[cfg(windows)]
+use std::os::windows::ffi::OsStrExt;
+
+#[cfg(windows)]
+use std::ffi::OsStr;
+
+#[cfg(windows)]
+use windows_sys::Win32::Storage::FileSystem::GetShortPathNameW;
+
+#[cfg(windows)]
+pub fn get_short_path<P: AsRef<OsStr>>(path: P) -> Option<String> {
+ let wide: Vec<u16> = OsStr::new(path.as_ref())
+ .encode_wide()
+ .chain(Some(0))
+ .collect();
+
+ let mut buffer = vec![0u16; 260];
+ let len = unsafe { GetShortPathNameW(wide.as_ptr(), buffer.as_mut_ptr(), buffer.len() as u32) };
+
+ if len > 0 {
+ Some(String::from_utf16_lossy(&buffer[..len as usize]))
+ } else {
+ None
+ }
+}
+
// --- Load Command ---
#[tauri::command]
pub async fn load_llama_model(
state: State<'_, AppState>,
backend_path: &str,
library_path: Option<&str>,
- args: Vec<String>,
+ mut args: Vec<String>,
) -> ServerResult {
let mut process_map = state.llama_server_process.lock().await;
@@ -105,13 +133,38 @@ pub async fn load_llama_model(
8080
}
};
-
- let model_path = args
+ // FOR MODEL PATH; TODO: DO SIMILARLY FOR MMPROJ PATH
+ let model_path_index = args
.iter()
.position(|arg| arg == "-m")
- .and_then(|i| args.get(i + 1))
- .cloned()
- .unwrap_or_default();
+ .ok_or(ServerError::LlamacppError("Missing `-m` flag".into()))?;
+
+ let model_path = args
+ .get(model_path_index + 1)
+ .ok_or(ServerError::LlamacppError("Missing path after `-m`".into()))?
+ .clone();
+
+ let model_path_pb = PathBuf::from(model_path);
+ if !model_path_pb.exists() {
+ return Err(ServerError::LlamacppError(format!(
+ "Invalid or inaccessible model path: {}",
+ model_path_pb.display().to_string(),
+ )));
+ }
+ #[cfg(windows)]
+ {
+ // use short path on Windows
+ if let Some(short) = get_short_path(&model_path_pb) {
+ args[model_path_index + 1] = short;
+ } else {
+ args[model_path_index + 1] = model_path_pb.display().to_string();
+ }
+ }
+ #[cfg(not(windows))]
+ {
+ args[model_path_index + 1] = model_path_pb.display().to_string();
+ }
+ // -----------------------------------------------------------------
let api_key = args
.iter()
@@ -181,7 +234,6 @@ pub async fn load_llama_model(
// Create channels for communication between tasks
let (ready_tx, mut ready_rx) = mpsc::channel::(1);
- let (error_tx, mut error_rx) = mpsc::channel::(1);
// Spawn task to monitor stdout for readiness
let _stdout_task = tokio::spawn(async move {
@@ -228,20 +280,10 @@ pub async fn load_llama_model(
// Check for critical error indicators that should stop the process
let line_lower = line.to_string().to_lowercase();
- if line_lower.contains("error loading model")
- || line_lower.contains("unknown model architecture")
- || line_lower.contains("fatal")
- || line_lower.contains("cuda error")
- || line_lower.contains("out of memory")
- || line_lower.contains("error")
- || line_lower.contains("failed")
- {
- let _ = error_tx.send(line.to_string()).await;
- }
// Check for readiness indicator - llama-server outputs this when ready
- else if line.contains("server is listening on")
- || line.contains("starting the main loop")
- || line.contains("server listening on")
+ if line_lower.contains("server is listening on")
+ || line_lower.contains("starting the main loop")
+ || line_lower.contains("server listening on")
{
log::info!("Server appears to be ready based on stderr: '{}'", line);
let _ = ready_tx.send(true).await;
@@ -279,26 +321,6 @@ pub async fn load_llama_model(
log::info!("Server is ready to accept requests!");
break;
}
- // Error occurred
- Some(error_msg) = error_rx.recv() => {
- log::error!("Server encountered an error: {}", error_msg);
-
- // Give process a moment to exit naturally
- tokio::time::sleep(Duration::from_millis(100)).await;
-
- // Check if process already exited
- if let Some(status) = child.try_wait()? {
- log::info!("Process exited with code {:?}", status);
- return Err(ServerError::LlamacppError(error_msg));
- } else {
- log::info!("Process still running, killing it...");
- let _ = child.kill().await;
- }
-
- // Get full stderr output
- let stderr_output = stderr_task.await.unwrap_or_default();
- return Err(ServerError::LlamacppError(format!("Error: {}\n\nFull stderr:\n{}", error_msg, stderr_output)));
- }
// Check for process exit more frequently
_ = tokio::time::sleep(Duration::from_millis(50)) => {
// Check if process exited
@@ -332,7 +354,7 @@ pub async fn load_llama_model(
pid: pid.clone(),
port: port,
model_id: model_id,
- model_path: model_path,
+ model_path: model_path_pb.display().to_string(),
api_key: api_key,
};
@@ -704,16 +726,88 @@ pub async fn is_process_running(pid: i32, state: State<'_, AppState>) -> Result<
}
// check port availability
-#[tauri::command]
-pub fn is_port_available(port: u16) -> bool {
+fn is_port_available(port: u16) -> bool {
std::net::TcpListener::bind(("127.0.0.1", port)).is_ok()
}
+#[tauri::command]
+pub async fn get_random_port(state: State<'_, AppState>) -> Result {
+ const MAX_ATTEMPTS: u32 = 20000;
+ let mut attempts = 0;
+ let mut rng = StdRng::from_entropy();
+
+ // Get all active ports from sessions
+ let map = state.llama_server_process.lock().await;
+
+ let used_ports: HashSet<u16> = map
+ .values()
+ .filter_map(|session| {
+ // Convert valid ports to u16 (filter out placeholder ports like -1)
+ if session.info.port > 0 && session.info.port <= u16::MAX as i32 {
+ Some(session.info.port as u16)
+ } else {
+ None
+ }
+ })
+ .collect();
+
+ drop(map); // unlock early
+
+ while attempts < MAX_ATTEMPTS {
+ let port = rng.gen_range(3000..4000);
+
+ if used_ports.contains(&port) {
+ attempts += 1;
+ continue;
+ }
+
+ if is_port_available(port) {
+ return Ok(port);
+ }
+
+ attempts += 1;
+ }
+
+ Err("Failed to find an available port for the model to load".into())
+}
+
+// find session
+#[tauri::command]
+pub async fn find_session_by_model(
+ model_id: String,
+ state: State<'_, AppState>,
+) -> Result