Nitro (C++ Inference Engine)
Nitro is the inference engine that powers Jan. It is written in C++ and optimized for edge deployment.
⚡ Explore Nitro's codebase: GitHub
Dependencies and Acknowledgements:
- llama.cpp: Nitro wraps llama.cpp, which runs Llama models in C++
- drogon: Nitro uses Drogon, a fast C++17/20 HTTP application framework
- (Coming soon) TensorRT-LLM support for CUDA acceleration
Features
On top of llama.cpp, Nitro provides:
- OpenAI compatibility
- HTTP interface with no bindings needed
- Runs as a separate process, not interfering with main app processes
- Multi-threaded server supporting concurrent users
- 1-click install
- No hardware dependencies
- Ships as a small binary (~3 MB compressed on average)
- Runs on Windows, macOS, and Linux
- Compatible with arm64, x86, and NVIDIA GPUs
HTTP Interface
Nitro offers a straightforward HTTP interface that is compatible with multiple standard APIs, including the OpenAI format.
curl --location 'http://localhost:3928/inferences/llamacpp/chat_completion' \
--header 'Content-Type: application/json' \
--header 'Accept: text/event-stream' \
--header 'Access-Control-Allow-Origin: *' \
--data '{
"messages": [
{"content": "Hello there 👋", "role": "assistant"},
{"content": "Can you write a long story", "role": "user"}
],
"stream": true,
"model": "gpt-3.5-turbo",
"max_tokens": 2000
}'
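The example above streams tokens back as server-sent events. For a single JSON response instead, the same endpoint can be called with "stream": false; the sketch below assumes the reply follows the usual OpenAI chat-completion schema (text under choices[0].message.content) and uses the jq tool, which is not part of Nitro, to extract it:
# Non-streaming request; print only the assistant's reply.
curl -s 'http://localhost:3928/inferences/llamacpp/chat_completion' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [{"content": "Can you write a long story", "role": "user"}],
    "stream": false,
    "model": "gpt-3.5-turbo",
    "max_tokens": 2000
  }' | jq -r '.choices[0].message.content'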
Using Nitro
Step 1: Obtain Nitro:
Access Nitro binaries from the release page.
🔗 Download Nitro
Step 2: Source a Model:
For those interested in the llama.cpp integration, obtain a GGUF model from TheBloke's Hugging Face repository.
🔗 Download Model
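As an illustration only (the repository and file name below are examples, not requirements), a quantized GGUF model can be fetched straight from Hugging Face:
# Example: 4-bit quantized Llama 2 7B Chat (a ~4 GB download).
# Substitute whichever GGUF model you prefer.
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf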
Step 3: Initialize Nitro:
Start the Nitro server, then load your model via the API, as shown below.
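Launching is a single command. As a minimal sketch, assuming you run the binary from the directory where you unpacked the release and that it listens on port 3928 as in this guide's examples:
# Start the Nitro server (leave it running in this terminal).
./nitro
With the server up, load the model using the following call: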
curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
"ctx_len": 2048,
"ngl": 100,
"embedding": true
}'
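Here, llama_model_path points at the GGUF file from Step 2, ctx_len sets the context window in tokens, ngl is the number of model layers to offload to the GPU (a llama.cpp convention; use 0 for CPU-only inference), and embedding turns on embedding support. Once the model is loaded, the chat completion call from the HTTP Interface section above will start returning tokens. To free the model, a matching unload call can be issued; the path below is an assumption that mirrors loadmodel:
# Unload the current model (endpoint path assumed).
curl -X POST 'http://localhost:3928/inferences/llamacpp/unloadmodel'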
