From 74f499ea5ae34d4f3613803e28a0c0c602dfe39e Mon Sep 17 00:00:00 2001
From: hahuyhoang411
Date: Fri, 1 Mar 2024 13:20:02 +0700
Subject: [PATCH] add: dpo finetuning section

---
 docs/blog/rag-is-not-enough.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/docs/blog/rag-is-not-enough.md b/docs/blog/rag-is-not-enough.md
index 7a19bab4a..0b8ae8492 100644
--- a/docs/blog/rag-is-not-enough.md
+++ b/docs/blog/rag-is-not-enough.md
@@ -57,4 +57,12 @@ We ended up with [Stealth](https://huggingface.co/jan-hq/stealth-v1.3), a [SLERP
 - [WizardCoder](https://huggingface.co/WizardLM/WizardCoder-Python-7B-V1.0) for its coding capabilities
 - Our own [Trinity](https://huggingface.co/jan-hq/trinity-v1.2) model for its versatility across general tasks
 
-This particular combination yielded the best tradeoff across mathematical & technical reasoning while retaining the most pre-merge performance on general tasks.
\ No newline at end of file
+This particular combination yielded the best tradeoff across mathematical & technical reasoning while retaining the most pre-merge performance on general tasks.
+
+## **DPO finetuning**
+
+Merging different LLMs can produce an inconsistent answering style, because each constituent model was originally trained on different types of data.
+
+Thus, we applied Direct Preference Optimization ([DPO](https://arxiv.org/abs/2305.18290)) using [Intel's Orca DPO pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs), a dataset chosen for its generally helpful answering style and its concentration of math and coding examples.
+
+This approach allowed us to realign the final model with our technical preferences while incurring minimal loss in capability.
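+
+For reference, a minimal sketch of this DPO step is shown below. It assumes the Hugging Face [TRL](https://github.com/huggingface/trl) library's `DPOTrainer`; the base checkpoint and hyperparameters are illustrative rather than the exact settings we used.
+
+```python
+from datasets import load_dataset
+from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
+from trl import DPOTrainer
+
+# Start from the merged model produced in the previous step.
+model = AutoModelForCausalLM.from_pretrained("jan-hq/stealth-v1.3")
+tokenizer = AutoTokenizer.from_pretrained("jan-hq/stealth-v1.3")
+tokenizer.pad_token = tokenizer.eos_token
+
+# Each row of Intel's Orca DPO pairs has a system message, a question,
+# a preferred ("chosen") answer and a rejected answer.
+dataset = load_dataset("Intel/orca_dpo_pairs", split="train")
+dataset = dataset.map(
+    lambda row: {
+        # DPOTrainer expects plain-text "prompt", "chosen" and "rejected" columns.
+        "prompt": (row["system"] + "\n" + row["question"]).strip(),
+        "chosen": row["chosen"],
+        "rejected": row["rejected"],
+    },
+    remove_columns=dataset.column_names,
+)
+
+trainer = DPOTrainer(
+    model,
+    ref_model=None,  # TRL keeps a frozen copy of the model as the DPO reference
+    beta=0.1,        # strength of the KL penalty towards the reference model
+    args=TrainingArguments(
+        output_dir="stealth-dpo",
+        per_device_train_batch_size=2,
+        gradient_accumulation_steps=8,
+        learning_rate=5e-6,
+        num_train_epochs=1,
+        bf16=True,
+        logging_steps=10,
+    ),
+    train_dataset=dataset,
+    tokenizer=tokenizer,
+    max_length=1024,
+    max_prompt_length=512,
+)
+trainer.train()
+```
\ No newline at end of file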