Add internal guides to run llama.cpp and trt-llm but don't know where to put them
This commit is contained in:
parent
5288609dfc
commit
6ec8159b90
@@ -1,11 +1,8 @@

---
title: Concepts
---

## Concepts

- Jan Platform: A desktop app and cloud-native SaaS that runs on Linux, Windows, macOS, or even a server, and ships with extensibility, a toolbox, and state-of-the-art yet optimized models for next-generation apps.
- Jan App: A next-generation app built on the Jan Platform as `portable intelligence` that can run anywhere.
- Models:
  - Large Language Models
  - Stable Diffusion models
  - Other models

@@ -1,115 +1,122 @@

---
title: Internal Guidelines
---

# Internal Guidelines

## Connecting to Rigs

### Pritunl Setup

1. **Install Pritunl**: [Download here](https://client.pritunl.com/#install)
2. **Import the .ovpn file** into Pritunl to connect to a rig
3. **VSCode**: Install the "Remote-SSH" extension for the connection (see the sketch below)
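
A minimal sketch of the Remote-SSH side, assuming the rig is reachable over the Pritunl VPN. The host alias, IP address, user, and key path below are placeholders, not real rig values; swap in the details for your rig.

```bash
# Hypothetical example: append a host entry that the Remote-SSH extension can pick up.
# "rig-4090", 10.0.0.42, "jan", and the key path are placeholders.
cat >> ~/.ssh/config <<'EOF'
Host rig-4090
    HostName 10.0.0.42
    User jan
    IdentityFile ~/.ssh/id_ed25519
EOF

# Verify the VPN tunnel and the SSH login before opening VSCode (Remote-SSH > Connect to Host).
ssh rig-4090 'hostname && nvidia-smi -L'
```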

### Llama.cpp Setup

1. **Clone Repo**: `git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp`
2. **Build** with CMake (you can tune the CUDA parameters below to find what works best on your rig):
    ```bash
    mkdir build && cd build
    cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_F16=ON -DLLAMA_CUDA_MMV_Y=8
    cmake --build . --config Release
    ```
3. **Download Model** (back in the llama.cpp root; this fetches the Llama-2 7B Q8 GGUF):
    ```bash
    cd ../models && wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q8_0.gguf
    ```
4. **Run** (see the timing note after this list for capturing tokens/second):
    ```bash
    cd ../build/bin/
    # The model lives in llama.cpp/models, two directories up from build/bin
    ./main -m ../../models/llama-2-7b.Q8_0.gguf -p "Writing a thesis proposal can be done in 10 simple steps:\nStep 1:" -n 2048 -e -ngl 100 -t 48
    ```
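
When comparing rigs it helps to capture the timing summary that `main` prints once generation finishes. On the llama.cpp builds we have used this summary comes out as `llama_print_timings` lines (including eval tokens per second); treat the pattern below as an assumption and adjust it if the output format has changed.

```bash
# Re-run the same prompt and keep only the timing summary; eval tokens/second is
# the number we care about when comparing rigs.
./main -m ../../models/llama-2-7b.Q8_0.gguf \
  -p "Writing a thesis proposal can be done in 10 simple steps:\nStep 1:" \
  -n 2048 -e -ngl 100 -t 48 2>&1 | grep "llama_print_timings"
```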

The main llama.cpp CLI arguments are listed below:

| Short Option | Long Option | Param Value | Description |
|--------------|-----------------------|-------------|-------------|
| `-h` | `--help` | | Show this help message and exit |
| `-i` | `--interactive` | | Run in interactive mode |
| | `--interactive-first` | | Run in interactive mode and wait for input right away |
| | `-ins`, `--instruct` | | Run in instruction mode (use with Alpaca models) |
| `-r` | `--reverse-prompt` | `PROMPT` | Run in interactive mode and poll user input upon seeing `PROMPT` |
| | `--color` | | Colorise output to distinguish prompt and user input from generations |
| `-s` | `--seed` | `SEED` | Seed for the random number generator |
| `-t` | `--threads` | `N` | Number of threads to use during computation |
| `-p` | `--prompt` | `PROMPT` | Prompt to start generation with |
| | `--random-prompt` | | Start with a randomized prompt |
| | `--in-prefix` | `STRING` | String to prefix user inputs with |
| `-f` | `--file` | `FNAME` | Prompt file to start generation |
| `-n` | `--n_predict` | `N` | Number of tokens to predict |
| | `--top_k` | `N` | Top-k sampling |
| | `--top_p` | `N` | Top-p sampling |
| | `--repeat_last_n` | `N` | Last N tokens to consider for the repeat penalty |
| | `--repeat_penalty` | `N` | Penalty for repeated sequences of tokens |
| `-c` | `--ctx_size` | `N` | Size of the prompt context |
| | `--ignore-eos` | | Ignore the end-of-stream token and continue generating |
| | `--memory_f32` | | Use `f32` instead of `f16` for the memory key+value cache |
| | `--temp` | `N` | Temperature |
| | `--n_parts` | `N` | Number of model parts |
| `-b` | `--batch_size` | `N` | Batch size for prompt processing |
| | `--perplexity` | | Compute perplexity over the prompt |
| | `--keep` | | Number of tokens to keep from the initial prompt |
| | `--mlock` | | Force the system to keep the model in RAM |
| | `--mtest` | | Determine the maximum memory usage |
| | `--verbose-prompt` | | Print the prompt before generation |
| `-m` | `--model` | `FNAME` | Model path |
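
As a quick illustration of how these flags combine, here is an interactive session on the same GGUF. The specific values (reverse prompt, context size, sampling settings) are only an assumed starting point, not tuned recommendations.

```bash
# Hypothetical interactive run combining flags from the table above: wait for user
# input first, colorise the output, and hand control back after each "User:" turn.
./main -m ../../models/llama-2-7b.Q8_0.gguf \
  --interactive-first --color \
  -r "User:" --in-prefix " " \
  -c 4096 --temp 0.7 --top_k 40 --top_p 0.9 --repeat_penalty 1.1 \
  -n 512 -ngl 100 -t 48
```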

### TensorRT-LLM Setup

#### Docker and TensorRT-LLM Build

> Note: Run these commands with admin (sudo) permissions to make sure everything works.

1. **Docker Image**: the following command creates a Docker image for development. The image is tagged locally as `tensorrt_llm/devel:latest`; confirm it with `docker images`.
    ```bash
    sudo make -C docker build
    docker images
    ```
2. **Run Container**: use the following command to run the container from that image.
    ```bash
    sudo make -C docker run
    ```
3. **Build**: once inside the container, TensorRT-LLM can be built from source and installed (see the smoke test below).
    ```bash
    # Build the TensorRT-LLM code.
    python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt

    # Deploy TensorRT-LLM in your environment.
    pip install ./build/tensorrt_llm*.whl
    ```

> Note: Compilation can be restricted to specific CUDA architectures to reduce build time; `build_wheel.py` accepts a semicolon-separated list of architectures (e.g. Ada for the 4090). The list of supported architectures can be found in the `CMakeLists.txt` file.

```bash
# Build TensorRT-LLM for Ada (4090)
python3 ./scripts/build_wheel.py --cuda_architectures "89-real;90-real"
```
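
To sanity-check the install before moving on, one option is a quick import test inside the container. This assumes the wheel exposes `tensorrt_llm.__version__` (it did on the versions we tried), so treat it as a sketch rather than an official check.

```bash
# Assumed smoke test: a successful import and a printed version string mean the
# wheel you just built is the one on the Python path in this container.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```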
#### Running TensorRT-LLM

1. **Requirements:**
    ```bash
    pip install -r examples/bloom/requirements.txt && git lfs install
    ```
2. **Download Weights** (Llama 2 7B):
    ```bash
    cd examples/llama && rm -rf ./llama/7B && mkdir -p ./llama/7B && git clone https://huggingface.co/NousResearch/Llama-2-7b-hf ./llama/7B
    ```
3. **Build Engine** (single GPU, Llama 7B):
    ```bash
    python build.py --model_dir ./llama/7B/ --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --output_dir ./llama/7B/trt_engines/weight_only/1-gpu/
    ```
4. **Run Inference** (use the custom `run.py` to check tokens/second; see the throughput sketch after this list):
    ```bash
    python3 run.py --max_output_len=2048 --tokenizer_dir ./llama/7B/ --engine_dir=./llama/7B/trt_engines/weight_only/1-gpu/ --input_text "Writing a thesis proposal can be done in 10 simple steps:\nStep 1:"
    ```
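
For a rough tokens/second number when comparing rigs, a simple sketch is to wrap the same invocation in `time` and divide the requested output length by the wall-clock seconds. This is only an approximation (it also counts engine and tokenizer load time), not the instrumentation inside the custom `run.py` itself.

```bash
# Rough throughput estimate: tokens requested / wall-clock seconds reported by `time`.
# Overestimates per-token latency because it includes engine load time.
time python3 run.py --max_output_len=2048 \
    --tokenizer_dir ./llama/7B/ \
    --engine_dir=./llama/7B/trt_engines/weight_only/1-gpu/ \
    --input_text "Writing a thesis proposal can be done in 10 simple steps:\nStep 1:"
```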

The TensorRT-LLM CLI arguments are documented in `run.py`.