docs: improve title

hieu-jan 2024-03-02 00:07:24 +09:00
parent c966f7f070
commit f43dbc5ef6


@ -29,7 +29,7 @@ In short, (1) extending a general foundation model like [Mistral](https://huggin
Problems still arise with catastrophic forgetting in general tasks, commonly observed during specialized domain fine-tuning. In our case, this is likely exacerbated by our lack of access to Mistral's original training dataset and by the various compression techniques used in our approach to keep the model small.
## Selecting a strong foundation model
## Selecting a Strong Foundation Model
[Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) outshines both [Meta's Llama-2 7B](https://huggingface.co/meta-llama/Llama-2-7b) and [Google's Gemma 7B](https://huggingface.co/google/gemma-7b) in key benchmarks, making it our choice for a base model. Starting with a strong foundation like Mistral allowed us to achieve greater accuracy in our specialized adaptations.
@ -39,7 +39,7 @@ Problems still arise with catastrophic forgetting in general tasks, commonly obs
*Note: we are not sponsored by the Mistral team, though many folks in their community do like to run Mistral locally using our desktop client, [Jan](https://jan.ai/).*
## Cost-effectively improving the base model
## Cost-Effectively Improving the Base Model
Mistral alone has known weaknesses in math, a capability we needed for our highly technical use case. Thus, we tested all model variants built on top of Mistral, from foundation models to finetunes to model merges, in order to find a stronger base model to receive our own finetuning.
@ -57,7 +57,7 @@ We ended up with [Stealth 7B v1.1](https://huggingface.co/jan-hq/stealth-v1.1),
This particular combination yielded the best tradeoff across mathematical & technical reasoning while retaining the most pre-merge performance on general tasks.
## **DPO finetuning**
## DPO Finetuning
Merging different LLMs can lead to a mixed answering style because each model was originally trained on different types of data.
@ -65,7 +65,7 @@ Thus, we applied Direct Preference Optimization ([DPO](https://arxiv.org/abs/230
This approach resulted in a final model, [Stealth 7B v1.2](https://huggingface.co/jan-hq/stealth-v1.2), with minimal loss and realigned to our technical preferences.
## **Using our own technical documentation**
## Using Our Technical Documentation
With the base model ready, we started on our specific use case.
@ -77,7 +77,7 @@ Specifically, we trained it on Nitro [docs](https://nitro.jan.ai/docs). For cont
It made an interesting corpus because it was rife with post-2023 technical jargon, edge cases, and poor informational layout.
## Generating a training dataset for GPT-4 and training
## Generating a Training Dataset for GPT-4
The first step was to transform Nitro's unstructured format into a synthetic Q&A dataset designed for [instruction tuning](https://arxiv.org/pdf/2109.01652.pdf).
@ -85,9 +85,9 @@ The text was split into chunks of 300-token segments with 30-token overlaps. Thi
The chunks were then given to GPT-4 with 8k context length to generate 3800 Q&A pairs. The [training dataset](https://huggingface.co/datasets/jan-hq/nitro_binarized_v2) is available on HuggingFace.
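The chunking step described above (300-token segments with 30-token overlaps) can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used: the function name `chunk_tokens` is hypothetical, and whitespace splitting stands in for a real tokenizer, which the post does not specify.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence, size: int = 300, overlap: int = 30) -> List[list]:
    """Split a token sequence into windows of `size` tokens.

    Consecutive windows share `overlap` tokens, so each new window
    starts `size - overlap` tokens after the previous one.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        # Stop once the current window already reaches the end of the input.
        if start + size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    # Assumption: whitespace "tokens" as a stand-in for a real tokenizer.
    text = "word " * 600
    pieces = chunk_tokens(text.split())
    print(len(pieces), [len(p) for p in pieces])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated tokens sent to the Q&A generator.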
## **Training**
## Training
The training was done with supervised finetuning (SFT) from the [Hugging Face's alignment handbook](https://github.com/huggingface/alignment-handbook), per [Huggingface's Zephyr Beta](https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta) guidelines.
The training was done with supervised finetuning (SFT) from the [Hugging Face's alignment handbook](https://github.com/huggingface/alignment-handbook) based on the [Huggingface's Zephyr Beta](https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta) guidelines.
We used consumer-grade, dual Nvidia RTX 4090s for the training. The end-to-end training took 18 minutes. We found optimal hyperparameters in LoRA for this specific task to be `r = 256` and `alpha = 512`.
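To give a feel for what `r = 256` and `alpha = 512` imply, here is a small back-of-the-envelope sketch. It assumes the standard LoRA formulation, in which the low-rank update is scaled by `alpha / r`; the helper names and the 4096 x 4096 layer shape are illustrative assumptions, not details taken from the training recipe.

```python
def lora_scaling(r: int, alpha: int) -> float:
    """Scaling factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    A has shape (r, d_in) and B has shape (d_out, r),
    so the total is r * (d_in + d_out).
    """
    return r * (d_in + d_out)


if __name__ == "__main__":
    r, alpha = 256, 512
    print("scaling:", lora_scaling(r, alpha))
    # Assumption: a square 4096 x 4096 projection, used only for illustration.
    print("params per adapted layer:", lora_trainable_params(4096, 4096, r))
```

With these values the update is scaled by a factor of 2, and a rank of 256 is large compared with the single- or double-digit ranks often used for LoRA, which trades extra trainable parameters for more adapter capacity.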
@ -97,7 +97,7 @@ This final model is publicly available at https://huggingface.co/jan-hq/nitro-v1
*Figure 3. Using the new finetuned model in [Jan](https://jan.ai/)*
## Improving results with RAG
## Improving Results With Rag
As an additional step, we also added [Retrieval Augmented Generation (RAG)](https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/) as an experiment parameter.