| title | description | keywords |
|---|---|---|
| Engineering | Jan is a ChatGPT-alternative that runs on your own computer, with a local API server. | |
# Connecting to Rigs

## Pritunl Setup
- Install Pritunl: Download here
- Import .ovpn file
- VSCode: Install the "Remote-SSH" extension to connect to the rig over SSH (a quick connectivity check is sketched below)
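
Once the VPN is up, it helps to confirm the rig is reachable over SSH before pointing Remote-SSH at it. A minimal check, assuming a placeholder rig address and user (replace both with your own values):

```bash
# Hypothetical rig address and user - substitute your own values
RIG_HOST=10.0.0.42
RIG_USER=jan

# Confirm the VPN route is up and SSH answers, then check the GPUs on the rig
ping -c 3 "$RIG_HOST"
ssh "$RIG_USER@$RIG_HOST" "hostname && nvidia-smi --query-gpu=name,memory.total --format=csv"
```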
## Llama.cpp Setup
- Clone Repo:
```bash
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
```
- Build:
```bash
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_F16=ON -DLLAMA_CUDA_MMV_Y=8
cmake --build . --config Release
```
- Download Model:
```bash
cd ../models && wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q8_0.gguf
```
- Run:
```bash
cd ../build/bin/
./main -m ../../models/llama-2-7b.Q8_0.gguf -p "Writing a thesis proposal can be done in 10 simple steps:\nStep 1:" -n 2048 -e -ngl 100 -t 48
```
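
If the rig's GPU cannot hold all layers in VRAM, the `-ngl` value can be lowered so that only part of the model is offloaded and the rest stays on the CPU. A sketch, where the layer count of 16 is an assumption to tune for your hardware:

```bash
# Offload only 16 layers to the GPU (adjust to fit available VRAM);
# the remaining layers run on the CPU threads set with -t
./main -m ../../models/llama-2-7b.Q8_0.gguf \
  -p "Writing a thesis proposal can be done in 10 simple steps:\nStep 1:" \
  -n 2048 -e -ngl 16 -t 48
```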
The llama.cpp CLI arguments are described in the table below:
| Short Option | Long Option | Param Value | Description |
|---|---|---|---|
| `-h` | `--help` | | Show this help message and exit |
| `-i` | `--interactive` | | Run in interactive mode |
| | `--interactive-first` | | Run in interactive mode and wait for input right away |
| `-ins` | `--instruct` | | Run in instruction mode (use with Alpaca models) |
| `-r` | `--reverse-prompt` | `PROMPT` | Run in interactive mode and poll user input upon seeing `PROMPT` |
| | `--color` | | Colorise output to distinguish prompt and user input from generations |
| `-s` | `--seed` | `SEED` | Seed for random number generator |
| `-t` | `--threads` | `N` | Number of threads to use during computation |
| `-p` | `--prompt` | `PROMPT` | Prompt to start generation with |
| | `--random-prompt` | | Start with a randomized prompt |
| | `--in-prefix` | `STRING` | String to prefix user inputs with |
| `-f` | `--file` | `FNAME` | Prompt file to start generation |
| `-n` | `--n_predict` | `N` | Number of tokens to predict |
| | `--top_k` | `N` | Top-k sampling |
| | `--top_p` | `N` | Top-p sampling |
| | `--repeat_last_n` | `N` | Last n tokens to consider for the repeat penalty |
| | `--repeat_penalty` | `N` | Penalize repeated sequences of tokens |
| `-c` | `--ctx_size` | `N` | Size of the prompt context |
| | `--ignore-eos` | | Ignore end-of-stream token and continue generating |
| | `--memory_f32` | | Use f32 instead of f16 for memory key+value |
| | `--temp` | `N` | Temperature |
| | `--n_parts` | `N` | Number of model parts |
| `-b` | `--batch_size` | `N` | Batch size for prompt processing |
| | `--perplexity` | | Compute perplexity over the prompt |
| | `--keep` | | Number of tokens to keep from the initial prompt |
| | `--mlock` | | Force system to keep model in RAM |
| | `--mtest` | | Determine the maximum memory usage |
| | `--verbose-prompt` | | Print prompt before generation |
| `-m` | `--model` | `FNAME` | Model path |
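
As an illustration of how the flags above combine, here is a sketch of an interactive, chat-style invocation; the sampling values and reverse prompt are assumptions, not recommendations:

```bash
# Interactive session: generation pauses and returns control whenever "User:" appears
./main -m ../../models/llama-2-7b.Q8_0.gguf \
  -i -r "User:" --color \
  -c 2048 --temp 0.7 --top_k 40 --top_p 0.9 --repeat_penalty 1.1 \
  -n 256 -t 48 -ngl 100
```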
## TensorRT-LLM Setup
### Docker and TensorRT-LLM build
Note: Run these commands with admin (sudo) permissions to make sure everything works correctly.
- Docker Image:
```bash
sudo make -C docker build
```
- Run Container:
```bash
sudo make -C docker run
```
Once inside the container, TensorRT-LLM can be built from source as follows:
- Build:
```bash
# To build the TensorRT-LLM code.
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt

# Deploy TensorRT-LLM in your environment.
pip install ./build/tensorrt_llm*.whl
```
Note: You can specify the GPU architecture (e.g. Ada for the RTX 4090) to reduce compilation time. The list of supported architectures can be found in the `CMakeLists.txt` file.
```bash
python3 ./scripts/build_wheel.py --cuda_architectures "89-real;90-real"
```
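
After the wheel is installed, a quick sanity check can confirm the package imports and that the GPU is visible inside the container. A minimal sketch, assuming the container's default Python environment:

```bash
# Confirm the freshly built wheel is importable and print its version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# Confirm the GPU is visible from inside the container
nvidia-smi --query-gpu=name,memory.total --format=csv
```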
### Running TensorRT-LLM
- Requirements:
```bash
pip install -r examples/bloom/requirements.txt && git lfs install
```
- Download Weights:
```bash
cd examples/llama && rm -rf ./llama/7B && mkdir -p ./llama/7B && git clone https://huggingface.co/NousResearch/Llama-2-7b-hf ./llama/7B
```
- Build Engine:
```bash
python build.py --model_dir ./llama/7B/ --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --output_dir ./llama/7B/trt_engines/weight_only/1-gpu/
```
- Run Inference:
```bash
python3 run.py --max_output_len=2048 --tokenizer_dir ./llama/7B/ --engine_dir=./llama/7B/trt_engines/weight_only/1-gpu/ --input_text "Writing a thesis proposal can be done in 10 simple steps:\nStep 1:"
```
The TensorRT-LLM CLI arguments are documented in `run.py`.
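
To list those arguments from the command line (assuming the same `examples/llama` directory used above):

```bash
# Print the full argument list for the TensorRT-LLM example runner
python3 run.py --help
```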