* feat: Smart model management
* **New UI option** – `memory_util` added to `settings.json` with a dropdown (high / medium / low) to let users control how aggressively the engine uses system memory.
* **Configuration updates** – `LlamacppConfig` now includes `memory_util`; the extension class stores it in a new `memoryMode` property and handles updates through `updateConfig`.
* **System memory handling**
* Introduced `SystemMemory` interface and `getTotalSystemMemory()` to report combined VRAM + RAM.
* Added helper methods `getKVCachePerToken`, `getLayerSize`, and a new `ModelPlan` type.
* **Smart model‑load planner** – `planModelLoad()` computes:
* Number of GPU layers that can fit in usable VRAM.
* Maximum context length based on KV‑cache size and the selected memory utilization mode (high/medium/low).
* Whether KV‑cache must be off‑loaded to CPU and the overall loading mode (GPU, Hybrid, CPU, Unsupported).
* Detailed logging of the planning decision.
* **Improved support check** – `isModelSupported()` now:
* Uses the combined VRAM/RAM totals from `getTotalSystemMemory()`.
* Applies an 80% usable‑memory heuristic.
* Returns **GREEN** only when both weights and KV‑cache fit in VRAM, **YELLOW** when they fit only in total memory or require CPU off‑load, and **RED** when the model cannot fit at all.
* **Cleanup** – Removed unused `GgufMetadata` import; updated imports and type definitions accordingly.
* **Documentation/comments** – Added explanatory JSDoc comments for the new methods and clarified the return semantics of `isModelSupported`.
* chore: migrate no_kv_offload from llamacpp setting to model setting
* chore: add UI auto optimize model setting
* feat: improve model loading planner with mmproj support and smarter memory budgeting
* Extend `ModelPlan` with optional `noOffloadMmproj` flag to indicate when a multimodal projector can stay in VRAM.
* Add `mmprojPath` parameter to `planModelLoad` and calculate its size, attempting to keep it on GPU when possible.
* Refactor system memory detection:
* Use `used_memory` (actual free RAM) instead of total RAM for budgeting.
* Introduced `usableRAM` placeholder for future use.
* Rewrite KV‑cache size calculation:
* Properly handle GQA models via `attention.head_count_kv`.
* Compute bytes per token as `nHeadKV * headDim * 2 * 2 * nLayer`.
* Replace the old 70 % VRAM heuristic with a more flexible budget:
* Reserve a fixed VRAM amount and apply an overhead factor.
* Derive usable system RAM from total memory minus VRAM.
* Implement a robust allocation algorithm:
* Prioritize placing the mmproj in VRAM.
* Search for the best balance of GPU layers and context length.
* Fallback strategies for hybrid and pure‑CPU modes with detailed safety checks.
* Add extensive validation of model size, KV‑cache size, layer size, and memory mode.
* Improve logging throughout the planning process for easier debugging.
* Adjust final plan return shape to include the new `noOffloadMmproj` field.
* remove unused variable
---------
Co-authored-by: Faisal Amir <urmauur@gmail.com>
* call jan api
* fix lint
* ci: add jan server web
* chore: add Dockerfile
* clean up ui ux and support for reasoning fields, make app spa
* add logo
* chore: update tag for preview image
* chore: update k8s service name
* chore: update image tag and image name
* fixed test
---------
Co-authored-by: Minh141120 <minh.itptit@gmail.com>
Co-authored-by: Nguyen Ngoc Minh <91668012+Minh141120@users.noreply.github.com>
* feat: Add support for overriding tensor buffer type
This commit introduces a new configuration option, `override_tensor_buffer_t`, which allows users to specify a regex for matching tensor names to override their buffer type. This is an advanced setting primarily useful for optimizing the performance of large models, particularly Mixture of Experts (MoE) models.
By overriding the tensor buffer type, users can keep critical parts of the model, like the attention layers, on the GPU while offloading other parts, such as the expert feed-forward networks, to the CPU. This can lead to significant speed improvements for massive models.
Additionally, this change refines the error message to be more specific when a model fails to load. The previous message "Failed to load llama-server" has been updated to "Failed to load model" to be more accurate.
* chore: update FE to suppoer override-tensor
---------
Co-authored-by: Faisal Amir <urmauur@gmail.com>
* fix: generate a response button should appear when an incomplete tool call message is present
* fix: wording
* fix: do not send duplicate messages on regenerating
* fix: tests
* feat: improve local provider connectivity with CORS bypass
- Add @tauri-apps/plugin-http dependency
- Implement dual fetch strategy for local vs remote providers
- Auto-detect local providers (localhost, Ollama:11434, LM Studio:1234)
- Make API key optional for local providers
- Add comprehensive test coverage for provider fetching
refactor: simplify fetchModelsFromProvider by removing preflight check logic
* feat: extend config options to include custom fetch function for CORS handling
* feat: conditionally use Tauri's fetch for openai-compatible providers to handle CORS
* 🐛fix: showing release notes for beta and prod
* ♻️refactor: make an utils env
* ♻️refactor: hide MCP for production
* ♻️refactor: simplify the boolean expression fetch release note
* chore: hide some model capabilities
* chore: update model setting description
* chore: disable vision and embedding and add tooltip
---------
Co-authored-by: Louis <louis@jan.ai>