update the GPU installation
parent bec7b11cdc
commit dd1812807a

docs/docs/guides/internal.md (new file, 115 lines)

@@ -0,0 +1,115 @@
## Connect to rigs

1. Download Pritunl: https://client.pritunl.com/#install
2. Import the `.ovpn` file.
3. Use VS Code to connect. Hint: you need the "Remote - SSH" extension installed; a sample SSH config entry is sketched below.
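
Adding a host entry to `~/.ssh/config` makes the rig show up in the Remote - SSH host picker. This is only a sketch: the alias, IP address, user, and key path below are hypothetical placeholders for your actual rig details.

```bash
# Append a hypothetical rig entry to the SSH config used by Remote - SSH
cat >> ~/.ssh/config << 'EOF'
Host gpu-rig-01
    HostName 10.0.0.10        # rig address on the Pritunl VPN (placeholder)
    User ubuntu               # your user on the rig (placeholder)
    IdentityFile ~/.ssh/id_rsa
EOF
```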

## Llama.cpp

Get llama.cpp:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```

Build with CMake, enabling CUDA, for faster inference:

```bash
mkdir build
cd build
# You can play with these parameters to find what works best on your hardware
cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_F16=ON -DLLAMA_CUDA_MMV_Y=8
cmake --build . --config Release
```
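
As a quick sanity check (a minimal sketch; binary names can differ between llama.cpp versions, and it assumes `nvcc` is on your PATH), confirm the CUDA toolkit is visible and the binaries were produced in `build/bin/`:

```bash
# CUDA compiler picked up by the cuBLAS build
nvcc --version
# The compiled llama.cpp binaries land here
ls -lh bin/
```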

Download a model:

```bash
# Back to the llama.cpp root
cd ..
cd models
# This fetches the Llama-2 7B Q8_0 GGUF
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q8_0.gguf
```
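
To make sure the download completed, check the file size; the Q8_0 quantization of Llama-2 7B is roughly 7 GB, so a much smaller file usually means an interrupted download (a rough check, not a substitute for the checksum listed on the Hugging Face page):

```bash
ls -lh llama-2-7b.Q8_0.gguf
```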

Run inference:

```bash
# Back to the llama.cpp root, then into the build output
cd ..
cd build/bin/
./main -m ../../models/llama-2-7b.Q8_0.gguf -p "Writing a thesis proposal can be done in 10 simple steps:\nStep 1:" -n 2048 -e -ngl 100 -t 48
```
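
The same build also ships an HTTP server example. A minimal sketch, assuming the `server` binary sits next to `main` and follows the late-2023 llama.cpp flag names:

```bash
# Serve the model over HTTP on port 8080, offloading all layers to the GPU
./server -m ../../models/llama-2-7b.Q8_0.gguf -ngl 100 --host 0.0.0.0 --port 8080
```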

## TensorRT-LLM

The following command creates a Docker image for development:

```bash
sudo make -C docker build
```

Check the available Docker images:

```bash
docker images
```

The image will be tagged locally with `tensorrt_llm/devel:latest`. To run the container, use the following command:

```bash
sudo make -C docker run
```
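
Once inside the container, it is worth confirming that the GPUs are actually visible before building anything:

```bash
nvidia-smi
```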

Build TensorRT-LLM

Once in the container, TensorRT-LLM can be built from source using:

```bash
# Build the TensorRT-LLM code.
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt

# Deploy TensorRT-LLM in your environment.
pip install ./build/tensorrt_llm*.whl
```

It is possible to restrict the compilation of TensorRT-LLM to specific CUDA architectures. For that purpose, `build_wheel.py` accepts a semicolon-separated list of CUDA architectures, as shown in the following example:

```bash
# Build TensorRT-LLM for Ada (e.g. RTX 4090, sm_89) and Hopper (sm_90)
python3 ./scripts/build_wheel.py --cuda_architectures "89-real;90-real"
```

The list of supported architectures can be found in the `CMakeLists.txt` file.

Run TensorRT-LLM

```bash
pip install -r examples/bloom/requirements.txt
git lfs install
```

Download the Llama weights:

```bash
cd examples/llama
rm -rf ./llama/7B
mkdir -p ./llama/7B && git clone https://huggingface.co/NousResearch/Llama-2-7b-hf ./llama/7B
```
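
The clone relies on Git LFS for the weight shards, so check that real multi-gigabyte files, not LFS pointer stubs, ended up in the directory:

```bash
ls -lh ./llama/7B/
```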

Build the engine on a single GPU for Llama 7B:

```bash
python build.py --model_dir ./llama/7B/ \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --use_weight_only \
    --output_dir ./llama/7B/trt_engines/weight_only/1-gpu/
```
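
If the build succeeds, the output directory holds the serialized engine plus its `config.json` (the exact engine file name depends on the TensorRT-LLM version):

```bash
ls -lh ./llama/7B/trt_engines/weight_only/1-gpu/
```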

Run inference. Use the custom `run.py` to check the tokens per second:

```bash
python3 run.py --max_output_len=2048 \
    --tokenizer_dir ./llama/7B/ \
    --engine_dir=./llama/7B/trt_engines/weight_only/1-gpu/ \
    --input_text "Writing a thesis proposal can be done in 10 simple steps:\nStep 1:"
```
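
If your `run.py` does not print throughput itself, a crude estimate is to time the whole run and divide the generated token count by the wall-clock time; this is only a rough lower bound on tokens per second, since it also includes engine and tokenizer load time:

```bash
time python3 run.py --max_output_len=2048 \
    --tokenizer_dir ./llama/7B/ \
    --engine_dir=./llama/7B/trt_engines/weight_only/1-gpu/
```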

@@ -11,6 +11,19 @@ To begin using 👋Jan.ai on your Windows computer, follow these steps:

![windows-install](/img/windows-install.png)

> Note: For faster inference, you should enable your NVIDIA GPU. Make sure the CUDA toolkit is installed; you can get it from your Linux distro's package manager or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads). For example:

```bash
sudo apt install nvidia-cuda-toolkit
```

> Check the installation with:

```bash
nvidia-smi
```
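
> The toolkit's compiler can be checked as well:

```bash
nvcc --version
```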

> For AMD GPUs, install ROCm instead. You can get it from your Linux distro's package manager or by following the [ROCm Quick Start (Linux)](https://rocm.docs.amd.com/en/latest/deploy/linux/quick_start.html) guide.
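
> After installing ROCm, you can verify that the GPU is detected with the `rocminfo` utility that ships with it:

```bash
rocminfo
```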
## Step 2: Download your first model
Now, let's get your first model:

@@ -23,6 +23,16 @@ When you run the Jan Installer, Windows Defender may display a warning. Here's w

![windows-install](/img/jan-installer-popup.png)

> Note: For faster inference, you should enable your NVIDIA GPU. Make sure the CUDA toolkit is installed; you can download it from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads), or see the [CUDA Installation Guide](https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#verify-you-have-a-cuda-capable-gpu) for details.

> Check the installation with:

```bash
nvidia-smi
```

> For AMD GPUs, you should use [WSL 2](https://learn.microsoft.com/en-us/windows/wsl/install), then install ROCm inside it by following the [ROCm Quick Start (Linux)](https://rocm.docs.amd.com/en/latest/deploy/linux/quick_start.html) guide.
## Step 3: Download your first model
Now, let's get your first model: