Fine-Tuning Embedding for RAG with Synthetic Data

This repo shows you how to fine-tune an embedding model to improve RAG performance even if you don’t have labelled data (i.e. positive pairs of query/relevant documents).

We walk through the process step by step: generating a synthetic dataset with an LLM, fine-tuning an open-source embedding model, and finally evaluating the fine-tuned model.
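To make the first step concrete, here is a minimal sketch of how a synthetic dataset of query/document pairs could be assembled. It assumes the documents have already been split into text chunks. The `generate_questions` callable, the `fake_llm` stand-in, and the `corpus` / `queries` / `relevant_docs` layout are illustrative assumptions, not necessarily what the repo uses. In practice `generate_questions` would wrap a real LLM call that is prompted to write questions answerable from the given chunk.

```python
from typing import Callable, Dict, List


def build_synthetic_dataset(
    chunks: List[str],
    generate_questions: Callable[[str, int], List[str]],
    questions_per_chunk: int = 2,
) -> Dict[str, dict]:
    """Pair each text chunk with generated questions that it answers."""
    corpus = {f"doc-{i}": text for i, text in enumerate(chunks)}
    queries: Dict[str, str] = {}
    relevant_docs: Dict[str, List[str]] = {}
    for doc_id, text in corpus.items():
        for question in generate_questions(text, questions_per_chunk):
            query_id = f"q-{len(queries)}"
            queries[query_id] = question
            # The chunk the question was generated from is its positive document.
            relevant_docs[query_id] = [doc_id]
    return {"corpus": corpus, "queries": queries, "relevant_docs": relevant_docs}


def fake_llm(text: str, n: int) -> List[str]:
    # Hypothetical stand-in for an LLM call, used only to keep the sketch runnable.
    return [f"Question {i + 1} about: {text[:40]}" for i in range(n)]


dataset = build_synthetic_dataset(
    ["Revenue grew 12% year over year.", "Operating costs fell in Q3."],
    fake_llm,
)
print(len(dataset["queries"]), "synthetic queries generated")
```

Each generated question is treated as a positive match for the chunk it came from, which is what removes the need for hand-labelled pairs.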

We experiment with a small-scale dataset of financial PDF documents and show that fine-tuning the embedding model can substantially improve retrieval performance.
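One simple way to measure retrieval performance is the hit rate: the fraction of queries whose relevant document appears among the top-k retrieved results. The sketch below assumes this metric and a dataset of `queries`, `corpus`, and `relevant_docs` dictionaries; both are illustrative choices rather than the repo's confirmed evaluation setup. The `toy_embed` function is a hypothetical keyword-count stand-in. To compare models, you would pass the base model's and the fine-tuned model's encoding functions in its place and compare the two scores.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hit_rate(
    queries: Dict[str, str],
    corpus: Dict[str, str],
    relevant_docs: Dict[str, List[str]],
    embed: Callable[[str], Sequence[float]],
    top_k: int = 5,
) -> float:
    """Fraction of queries with at least one relevant document in the top-k."""
    if not queries:
        return 0.0
    doc_vecs = {doc_id: embed(text) for doc_id, text in corpus.items()}
    hits = 0
    for query_id, query in queries.items():
        query_vec = embed(query)
        ranked = sorted(
            doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True
        )
        if any(doc_id in ranked[:top_k] for doc_id in relevant_docs[query_id]):
            hits += 1
    return hits / len(queries)


VOCAB = ["revenue", "cost"]


def toy_embed(text: str) -> List[float]:
    # Hypothetical stand-in for a real embedding model: keyword counts.
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]


corpus = {
    "doc-0": "Revenue grew 12% year over year.",
    "doc-1": "Operating costs fell in Q3.",
}
queries = {
    "q-0": "How much did revenue grow?",
    "q-1": "What happened to operating costs?",
}
relevant_docs = {"q-0": ["doc-0"], "q-1": ["doc-1"]}

score = hit_rate(queries, corpus, relevant_docs, toy_embed, top_k=1)
print(f"hit rate@1: {score:.2f}")
```

Running the same function with two different `embed` callables on a held-out set of synthetic queries gives a direct before/after comparison.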

https://github.com/run-llama/finetune-embedding
