diff --git a/docs/blog/rag-is-not-enough.md b/docs/blog/rag-is-not-enough.md
index df221d2a2..7165a0a15 100644
--- a/docs/blog/rag-is-not-enough.md
+++ b/docs/blog/rag-is-not-enough.md
@@ -77,4 +77,12 @@ So, we directed our efforts toward training a model to answer user questions bas
 
 Specifically, we trained it on Nitro [docs](https://nitro.jan.ai/docs). For context, Nitro is the default inference engine for Jan. It’s a serious server implementation of LlamaCPP, written in C++, with multimodal, queues, and other production-level server capabilities.
 
-It made an interesting corpus because it was rife with post-2023 technical jargon, edge cases, and poor informational layout.
\ No newline at end of file
+It made an interesting corpus because it was rife with post-2023 technical jargon, edge cases, and poor informational layout.
+
+## Generating a training dataset with GPT-4 and training
+
+The first step was to transform Nitro’s unstructured documentation into a synthetic Q&A dataset designed for [instruction tuning](https://arxiv.org/pdf/2109.01652.pdf).
+
+The text was split into 300-token chunks with 30-token overlaps, sized to fit GPT-4’s 8k context window. This helped avoid the [lost-in-the-middle](https://arxiv.org/abs/2307.03172) problem, where an LLM can’t use its context efficiently to answer a given question.
+
+The chunks were then given to **GPT-4** to generate 3,800 Q&A pairs. You can find the [open-sourced dataset](https://huggingface.co/datasets/jan-hq/nitro_binarized_v2) on HuggingFace.
\ No newline at end of file
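
To make the chunking and generation step added above concrete, here is a minimal sketch of how 300-token chunks with 30-token overlaps could be produced and passed to GPT-4 for Q&A generation. The tooling (`tiktoken`, the `openai` Python client), the prompt wording, the pairs-per-chunk count, and the `nitro_docs.md` path are illustrative assumptions; the post does not specify the actual pipeline.

```python
# Sketch: split the docs into ~300-token chunks with 30-token overlap,
# then ask GPT-4 to turn each chunk into instruction-tuning Q&A pairs.
# Library and prompt choices here are assumptions, not Jan's actual pipeline.
from pathlib import Path

import tiktoken
from openai import OpenAI

CHUNK_TOKENS = 300   # chunk size described in the post
OVERLAP_TOKENS = 30  # overlap between consecutive chunks

enc = tiktoken.encoding_for_model("gpt-4")
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chunk_text(text: str) -> list[str]:
    """Slide a 300-token window over the text, stepping 270 tokens at a time."""
    tokens = enc.encode(text)
    step = CHUNK_TOKENS - OVERLAP_TOKENS
    return [
        enc.decode(tokens[i : i + CHUNK_TOKENS])
        for i in range(0, len(tokens), step)
    ]


def generate_qa_pairs(chunk: str) -> str:
    """Ask GPT-4 for Q&A pairs grounded only in the given documentation chunk."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "You write question/answer pairs for instruction tuning. "
                    "Use only the provided documentation excerpt."
                ),
            },
            {
                "role": "user",
                "content": f"Documentation excerpt:\n{chunk}\n\n"
                           "Write 3 Q&A pairs as JSON.",  # pairs-per-chunk is illustrative
            },
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    docs = Path("nitro_docs.md").read_text()  # hypothetical path to the scraped docs
    for chunk in chunk_text(docs):
        print(generate_qa_pairs(chunk))
```

The overlap means each window repeats the last 30 tokens of the previous one, so answers that straddle a chunk boundary are less likely to be cut off, while each chunk stays well inside GPT-4's 8k context window.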