---
title: Engineering
description: Jan is a ChatGPT-alternative that runs on your own computer, with a local API server.
keywords:
  [
    Jan AI,
    Jan,
    ChatGPT alternative,
    local AI,
    private AI,
    conversational AI,
    no-subscription fee,
    large language model,
  ]
---

## Connecting to Rigs

### Pritunl Setup

1. Install Pritunl: download and install the Pritunl client.
2. Import the `.ovpn` file provided for the rig.
3. VSCode: install the "Remote-SSH" extension to connect to the rig over SSH.
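
Once the VPN is up, Remote-SSH connects to the rig over its VPN address. Below is a minimal sketch of the SSH host entry it expects; the alias, IP, user, and key path are placeholders, not the actual rig details:

```bash
# Hypothetical host entry; replace the alias, IP, user, and key with your rig's details.
cat >> ~/.ssh/config <<'EOF'
# Jan GPU rig, reachable once the Pritunl VPN is connected
Host jan-rig-1
    HostName 10.0.0.42
    User ubuntu
    IdentityFile ~/.ssh/id_ed25519
EOF

# Quick connectivity check before opening the folder in VSCode Remote-SSH
ssh jan-rig-1 nvidia-smi --query-gpu=name --format=csv,noheader
```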

### Llama.cpp Setup

1. Clone the repo:

   ```bash
   git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
   ```

2. Build with CUDA support:

   ```bash
   mkdir build && cd build
   cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_F16=ON -DLLAMA_CUDA_MMV_Y=8
   cmake --build . --config Release
   ```

3. Download a model:

   ```bash
   cd ../models && wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q8_0.gguf
   ```

4. Run:

   ```bash
   cd ../build/bin/
   ./main -m ../../models/llama-2-7b.Q8_0.gguf -p "Writing a thesis proposal can be done in 10 simple steps:\nStep 1:" -n 2048 -e -ngl 100 -t 48
   ```
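
To confirm that layers are actually being offloaded to the GPU, it helps to watch GPU memory while the command above runs; a minimal sketch, assuming `nvidia-smi` is available on the rig:

```bash
# In a second terminal on the rig: GPU memory usage should jump while ./main is generating.
watch -n 1 nvidia-smi
```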

The most commonly used llama.cpp CLI arguments are listed below:

| Short Option | Long Option | Param Value | Description |
|:---|:---|:---|:---|
| `-h` | `--help` | | Show this help message and exit |
| `-i` | `--interactive` | | Run in interactive mode |
| | `--interactive-first` | | Run in interactive mode and wait for input right away |
| `-ins` | `--instruct` | | Run in instruction mode (use with Alpaca models) |
| `-r` | `--reverse-prompt` | `PROMPT` | Run in interactive mode and poll user input upon seeing `PROMPT` |
| | `--color` | | Colorise output to distinguish prompt and user input from generations |
| `-s` | `--seed` | `SEED` | Seed for the random number generator |
| `-t` | `--threads` | `N` | Number of threads to use during computation |
| `-p` | `--prompt` | `PROMPT` | Prompt to start generation with |
| | `--random-prompt` | | Start with a randomized prompt |
| | `--in-prefix` | `STRING` | String to prefix user inputs with |
| `-f` | `--file` | `FNAME` | Prompt file to start generation with |
| `-n` | `--n_predict` | `N` | Number of tokens to predict |
| | `--top_k` | `N` | Top-k sampling |
| | `--top_p` | `N` | Top-p sampling |
| | `--repeat_last_n` | `N` | Last n tokens to consider for the repeat penalty |
| | `--repeat_penalty` | `N` | Penalty applied to repeated sequences of tokens |
| `-c` | `--ctx_size` | `N` | Size of the prompt context |
| | `--ignore-eos` | | Ignore the end-of-stream token and continue generating |
| | `--memory_f32` | | Use f32 instead of f16 for memory key+value |
| | `--temp` | `N` | Temperature |
| | `--n_parts` | `N` | Number of model parts |
| `-b` | `--batch_size` | `N` | Batch size for prompt processing |
| | `--perplexity` | | Compute perplexity over the prompt |
| | `--keep` | | Number of tokens to keep from the initial prompt |
| | `--mlock` | | Force the system to keep the model in RAM |
| | `--mtest` | | Determine the maximum memory usage |
| | `--verbose-prompt` | | Print the prompt before generation |
| `-m` | `--model` | `FNAME` | Model path |
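
As a usage sketch, several of the flags above can be combined for an interactive, chat-style session. The prompt, reverse prompt, and sampling values below are illustrative only, and flag spellings can vary between llama.cpp versions, so check `./main --help` against your build:

```bash
# Interactive session: generation pauses and returns control whenever "User:" appears.
./main -m ../../models/llama-2-7b.Q8_0.gguf \
  --interactive-first \
  -r "User:" \
  --color \
  -c 4096 --temp 0.7 \
  -ngl 100 -t 48
```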

### TensorRT-LLM Setup

#### Docker and TensorRT-LLM build

Note: Run these commands with admin (sudo) permissions to make sure everything works correctly. They assume you are inside a checkout of the TensorRT-LLM repository.

1. Build the Docker image:

   ```bash
   sudo make -C docker build
   ```

2. Run the container:

   ```bash
   sudo make -C docker run
   ```
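
Before building, it is worth verifying that the container can see the rig's GPUs; a minimal check, assuming the NVIDIA Container Toolkit is set up on the host:

```bash
# Run inside the container started by `make -C docker run`; it should list every GPU on the rig.
nvidia-smi
```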

Once inside the container, TensorRT-LLM can be built from source as follows:

1. Build:

   ```bash
   # Build the TensorRT-LLM code.
   python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt
   # Deploy TensorRT-LLM in your environment.
   pip install ./build/tensorrt_llm*.whl
   ```

Note: You can specify the GPU architecture (e.g. Ada for an RTX 4090) to reduce compilation time. The list of supported architectures can be found in the `CMakeLists.txt` file.

```bash
python3 ./scripts/build_wheel.py --cuda_architectures "89-real;90-real"
```
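
A quick way to confirm the wheel installed cleanly is to import it from Python; a minimal check, assuming the build above succeeded (recent releases expose `__version__`):

```bash
# Should print the installed TensorRT-LLM version without import errors.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```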

#### Running TensorRT-LLM

1. Install the requirements:

   ```bash
   pip install -r examples/llama/requirements.txt && git lfs install
   ```
2. Download the weights:

   ```bash
   cd examples/llama && rm -rf ./llama/7B && mkdir -p ./llama/7B && git clone https://huggingface.co/NousResearch/Llama-2-7b-hf ./llama/7B
   ```

3. Build the engine:

   ```bash
   python build.py --model_dir ./llama/7B/ --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --output_dir ./llama/7B/trt_engines/weight_only/1-gpu/
   ```

4. Run inference:

   ```bash
   python3 run.py --max_output_len=2048 --tokenizer_dir ./llama/7B/ --engine_dir=./llama/7B/trt_engines/weight_only/1-gpu/ --input_text "Writing a thesis proposal can be done in 10 simple steps:\nStep 1:"
   ```
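
Once the engine is built, the same invocation can be scripted over several prompts; a small sketch using only the flags shown above, with placeholder prompts:

```bash
# Hypothetical batch of prompts run back to back against the freshly built engine.
for prompt in "Explain KV caching in one paragraph:" "List three uses of a local LLM:"; do
  python3 run.py --max_output_len=512 \
    --tokenizer_dir ./llama/7B/ \
    --engine_dir=./llama/7B/trt_engines/weight_only/1-gpu/ \
    --input_text "$prompt"
done
```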

The full set of TensorRT-LLM CLI arguments can be found in `run.py`.