--- title: "The Invisible Moat around Open-Source LLM" description: "Uncover the pivotal role of data ownership in training the next iteration of LLM." tags: OpenAI has a moat, Catastrophic forgetting, ChatGPT date: 2024-03-25 unlisted: true categories: research --- import CTABlog from '@/components/Blog/CTA' # The Invisible Moat around Open-Source LLM In the crowded AI landscape, OpenAI's ChatGPT stands out, not just for its capabilities but for its unique access to the pre-trained dataset. This post explores the vital role of data in maintaining a competitive edge, focusing on OpenAI's strategic advantage through data ownership. ## Data: The Secret Weapon OpenAI, with ChatGPT, has carved a distinct advantage. By harnessing user interactions, it gains invaluable insights into diverse use cases, enabling precise model refinements. The cornerstone of this advantage lies in the "pre-trained dataset." This treasure trove of data empowers OpenAI to cater to specific needs, ensuring sustained improvement and differentiation. ## The rise of the opensource ``` - How they/Mistral/Llama make money? -> around having pretrained data -> finetuning First para: Rise of Open Source LLMs like Mistral, Llama2, Llama3 People think they don't have a moat = everything is open source Second para: We actually think these guys have an "invisible moat" Pre-training data is not released, and makes a huge difference in fine-tuning efficacy ``` ### Why pretrained data is important? > *Owning the pre-trained dataset is crucial as it represents the original distribution.* Access to the pre-trained dataset acts as a master key to address the critical issue of ["Catastrophic forgetting"](https://en.wikipedia.org/wiki/Catastrophic_interference) in Language Learning Models (LLMs). This phenomenon describes how LLMs lose hold of prior knowledge upon learning new information. Access to the foundational dataset allows for effective fine-tuning, balancing the introduction of new data with the retention of existing knowledge. ![Catastrophic forgetting](./_assets/catastrophic-demo.png) **Figure 1.** Demonstrates the catastrophic forgetting issue: without mixing datasets, AI overfits on new tasks, impairing normal communication. ### Illustrating Catastrophic Forgetting ``` What is fine-tuning Process of Finetuning (pretrain, instruct, finetune) Fine-tuning datasets Risk of catastrophic forgetting "Why is Pre-trained data important?" What is pre-training dataset How does fine-tuning with pre-training dataset differ from when you don't have it How does it avoid catastrophic forgetting ``` Catastrophic forgetting can be visualized as a ball in a multidimensional landscape, where moving towards new knowledge risks losing grasp on the old. Pre-trained data acts as a map, guiding fine-tuning in a way that incorporates new information while safeguarding existing knowledge. ![Gradient decent](./_assets/gradient-decent.gif) **Figure 2.** [Gradient decent demonstration](https://en.wikipedia.org/wiki/Gradient_descent) ### Smoothing Distribution Shifts As described above, with the mixture of the pre-trained dataset ensures smoother distribution shifts when introducing new information, as it embodies a comprehensive spectrum of prior knowledge. This continuity in knowledge transition helps in maintaining the robustness of the model against sudden changes, akin to providing a more gradual learning curve where the new information is incrementally integrated with the existing knowledge base. 
This concept is supported by [EleutherAI's research](https://arxiv.org/abs/2403.08763), which highlights the importance of how tasks are sequenced in the learning process and suggests that introducing dissimilar tasks early on can expand the network's capacity for new information.

**Table 1.** Final results for English-only 405M parameter models trained with different replay amounts. Models with more replay strike a better balance between learning and forgetting (measured as average loss). Notably, mixing in just 1% of the pre-training dataset significantly lowers the average loss while shifting the model's knowledge from English (the Pile) to German.

![Replay method](./_assets/replay.png)

*Note:* **Replay** is a method that mixes the pre-trained model's original training data into the new task's dataset.

### Acting as a Noise Mask

The pre-trained data can also serve as a form of "noise masking", similar to techniques used in training [early computer vision models](https://arxiv.org/abs/1911.04252). This approach introduces a level of ["noise"](https://arxiv.org/abs/2310.05914) during training, which helps prevent the model from overfitting to the new dataset. By retaining a mix of original and new data, the model is exposed to a broader range of scenarios, enhancing its generalization capabilities and robustness across tasks.

## Solutions

### The overwhelming approach

Overcoming these challenges requires a balanced approach. One partial remedy is to inundate the model with extensive, curated data, allowing for comprehensive fine-tuning. While effective, this approach demands significant computational resources, a thorough filtering process for low-quality inputs, and an extraordinarily high cost associated with gathering millions of high-quality responses. In the open-source community, two notable examples, [OpenChat](https://huggingface.co/openchat/openchat-3.5-0106) and [Hermes-Pro](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B), fine-tune Mistral as a base model on large datasets collected from top-rated GPT-4 and human responses, demonstrating a distribution shift that enhances model performance.

![Openchat results](./_assets/openchat-bench-0106.png)

**Figure 3.** After fine-tuning on a large number of data samples, the model's performance improves, outperforming ChatGPT and Grok-1 on some benchmarks.

### Fully open-source models

The other way around the moat is full openness. The Allen Institute for AI (AI2), for example, released the OLMo models together with Dolma, the dataset they were pre-trained on, so anyone fine-tuning OLMo can replay the original pre-training distribution.

## Conclusion

The ownership and strategic use of pre-trained data serve as an invisible moat. It not only enables the tackling of complex challenges like catastrophic forgetting but also provides a baseline for continuous, targeted improvements. Decentralized alternatives do exist, whether overwhelming the model with curated data or releasing both the pre-training data and the weights, but their cost remains high.

## References

- [Catastrophic forgetting](https://arxiv.org/abs/2308.08747)
- [Simple and Scalable Strategies to Continually Pre-train Large Language Models](https://arxiv.org/abs/2403.08763)
- [Gradient descent](https://en.wikipedia.org/wiki/Gradient_descent)
- [NEFTune](https://arxiv.org/abs/2310.05914)
- [Self-training with Noisy Student improves ImageNet classification](https://arxiv.org/abs/1911.04252)
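*Appendix.* For readers who want to experiment with the noise idea referenced above ([NEFTune](https://arxiv.org/abs/2310.05914)), below is a minimal sketch of noisy-embedding fine-tuning. It assumes a PyTorch model that exposes its token-embedding module (as Hugging Face models do via `get_input_embeddings()`); the hook-based wiring and the default `neftune_alpha` are illustrative choices, not the paper's reference implementation.

```python
import torch


def add_neftune(embedding_module, neftune_alpha=5.0):
    """Register a forward hook that adds uniform noise to token embeddings.

    Following the NEFTune paper, noise is sampled uniformly from [-1, 1]
    and scaled by alpha / sqrt(seq_len * hidden_dim).
    """
    def hook(module, inputs, output):
        if module.training:  # perturb only during training, never at inference
            seq_len, hidden_dim = output.shape[-2], output.shape[-1]
            scale = neftune_alpha / (seq_len * hidden_dim) ** 0.5
            return output + torch.empty_like(output).uniform_(-1, 1) * scale
        return output

    return embedding_module.register_forward_hook(hook)


# Hypothetical usage with a Hugging Face-style model:
# handle = add_neftune(model.get_input_embeddings(), neftune_alpha=5.0)
# ...fine-tune as usual...
# handle.remove()  # restore clean embeddings before evaluation
```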