RAG Techniques Notebook

This repository showcases various advanced techniques for Retrieval-Augmented Generation (RAG) systems. RAG systems combine information retrieval with generative models to provide accurate and contextually rich responses.

https://github.com/NirDiamant/RAG_Techniques

https://github.com/NirDiamant/RAG_Techniques/tree/main/all_rag_techniques

Perplexica – An AI-powered search engine 

Perplexica is an open-source, AI-powered search engine that digs deep into the internet to find answers. Inspired by Perplexity AI, it not only searches the web but also understands your questions. It uses advanced machine learning techniques such as similarity search and embeddings to refine results and provides clear answers with cited sources.

Using SearxNG to stay current and fully open source, Perplexica ensures you always get the most up-to-date information without compromising your privacy.

https://github.com/ItzCrazyKns/Perplexica

RAG Foundry

RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

RAG Foundry is a library designed to improve LLMs' ability to use external information by fine-tuning models on specially created RAG-augmented datasets. Given a RAG technique, the library helps create the training data, makes it easy to train models with parameter-efficient fine-tuning (PEFT), and helps users measure the improved performance using various RAG-specific metrics. The library is modular, and workflows are customizable via configuration files.

https://github.com/IntelLabs/RAGFoundry

DeepEval: evaluating the performance of an LLM

In deepeval, a metric serves as a standard of measurement for evaluating the performance of an LLM output against a specific criterion of interest. Essentially, the metric acts as the ruler, while a test case represents the thing you're trying to measure. deepeval offers a range of default metrics for you to get started with quickly, such as:

  • G-Eval
  • Summarization
  • Faithfulness
  • Answer Relevancy
  • Contextual Relevancy
  • Contextual Precision
  • Contextual Recall
  • Ragas
  • Hallucination
  • Toxicity
  • Bias
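
For example, a single-turn metric such as Answer Relevancy can be applied to a test case in a few lines. This is a minimal sketch: the threshold and the example strings are placeholders, and running the metric assumes an evaluation model is configured as described in the docs.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# The test case is the "thing being measured": the input, the LLM's output,
# and (for RAG metrics) the retrieval context.
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You can return them within 30 days for a full refund.",
    retrieval_context=["All customers may return unworn shoes within 30 days."],
)

# The metric is the "ruler": here, how relevant the answer is to the input.
metric = AnswerRelevancyMetric(threshold=0.7)  # placeholder threshold
evaluate([test_case], [metric])
```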

deepeval also offers conversational metrics, which are metrics used to evaluate conversations instead of individual, granular LLM interactions. These include:

  • Conversation Completeness
  • Conversation Relevancy
  • Knowledge Retention

https://docs.confident-ai.com/docs/metrics-introduction

BANKER++ Embedding for RAG

Fine-tuning an embedding model is a powerful technique for optimizing retrieval augmented generation (RAG) systems in finance. By training a smaller open-source embedding model like BAAI/bge-small-en on a domain-specific dataset, the model learns more meaningful vector representations that capture the nuances and semantics of financial language. This leads to significantly improved retrieval performance compared to using generic pre-trained embeddings.

https://huggingface.co/baconnier/Finance_embedding_large_en-V1.5

Fine-tuned financial embedding models, such as Banker++ RAG, demonstrate superior accuracy on tasks like semantic search, text similarity, and clustering. They enable RAG systems to better understand complex financial jargon and retrieve the most relevant information given a query.

Integrating these specialized embeddings is straightforward using libraries like LlamaIndex or Sentence-Transformers.
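
For instance, with Sentence-Transformers the model linked above can be loaded by its Hugging Face ID and used for semantic search over financial text. A minimal sketch follows; the model ID comes from the link above, and loading it directly as a SentenceTransformer (plus the toy documents and query) is an assumption for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("baconnier/Finance_embedding_large_en-V1.5")

docs = [
    "The issuer's credit spread widened after the downgrade.",
    "Quarterly revenue grew 12% driven by subscription sales.",
]
query = "How did the downgrade affect credit risk pricing?"

# Normalized embeddings so cosine similarity is a simple dot product.
doc_emb = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)
q_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(q_emb, doc_emb)[0]  # cosine similarity per document
best = int(scores.argmax())
print(docs[best], float(scores[best]))
```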

As the financial industry increasingly adopts AI, fine-tuned embedding models will play a crucial role in powering domain-specific NLP applications. From analyzing market sentiment to personalizing investment recommendations, these optimized embeddings unlock new possibilities for harnessing unstructured financial data. By combining the power of open-source models with the domain expertise embedded in financial corpora, fine-tuning paves the way for more intelligent and impactful RAG systems in finance.

BANKER++

Banker++ is trained to act like a Senior Banker.

I’m excited to share a resource that could potentially be a valuable addition to your financial toolkit. It’s a Financial LLM Model tailored for corporate entities and financial institutions.

In today’s complex financial landscape, having access to reliable analytical tools is crucial. The model is available at the following links:

https://huggingface.co/spaces/baconnier/Finance

https://huggingface.co/baconnier/Finance_dolphin-2.9.1-yi-1.5-9b

It offers a framework for assessing various financial scenarios with a level of precision and insight that could benefit your decision-making processes.

However, it’s important to note that this content is strictly for educational purposes and should not be construed as financial advice. Please exercise caution when applying any information provided.

While I’m not suggesting it’s a one-size-fits-all solution or a replacement for professional financial advice, it may offer valuable insights into areas such as risk management, investment strategies, and portfolio optimization.

To illustrate, let’s consider a common question in finance:

Question: What is CDS and how does it compare to a swap?

Answer:

  • CDS: Credit Default Swap (CDS) is a financial derivative contract between two parties (buyer and seller) for insurance against default or credit risk associated with a bond or loan. The protection buyer pays a premium to the protection seller in exchange for the right to receive payment if a credit event occurs. Typically, the protection seller is a financial institution, while the protection buyer can be an investor or a bank.
  • Swap: A swap is an agreement between two parties to exchange cash flows, typically involving interest payments or principal at a future date. Common types of swaps include interest rate swaps, currency swaps, and commodity swaps.

CDS differs from swaps as it focuses specifically on credit risk protection, providing insurance against default events. In contrast, swaps involve the exchange of cash flows, often related to interest rates, currencies, or commodities, without directly addressing credit risk.

Feel free to explore the model and see if it aligns with your organization’s needs and objectives. Remember to approach its use with caution and consider consulting with financial experts when making significant decisions.

As we navigate the complexities of the financial world together, let’s remain humble in our pursuit of knowledge and improvement.

AutoGluon: Time Series Forecasting

AutoGluon can forecast the future values of multiple time series given the historical data and other related covariates. A single call to AutoGluon TimeSeriesPredictor’s fit() method trains multiple models to generate accurate probabilistic forecasts, and does not require you to manually deal with cumbersome issues like model selection and hyperparameter tuning.
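
A minimal sketch of that workflow is below; the CSV path is a placeholder, and the data is assumed to be in long format with item_id, timestamp, and target columns.

```python
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

# Load long-format data (item_id, timestamp, target); the path is a placeholder.
train_data = TimeSeriesDataFrame.from_path("train.csv")

# A single fit() call trains and ensembles multiple forecasting models.
predictor = TimeSeriesPredictor(
    prediction_length=48,   # forecast horizon, in time steps
    target="target",
    eval_metric="MASE",
)
predictor.fit(train_data, presets="medium_quality", time_limit=600)

# Probabilistic forecasts (mean plus quantiles) for each series.
predictions = predictor.predict(train_data)
print(predictions.head())
```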

https://auto.gluon.ai/stable/tutorials/timeseries/index.html

https://raw.githubusercontent.com/Innixma/autogluon-doc-utils/main/docs/cheatsheets/stable/autogluon-cheat-sheet.jpeg

Efficient Training Techniques for Transformers on a Single GPU

When training Transformer models on a single GPU, it’s important to optimize for both speed and memory efficiency to make the most of limited resources. Here are some key parameters and techniques to consider:

Mixed Precision Training

  • Use fp16 or bf16 mixed precision to reduce memory usage while maintaining most of the fp32 precision. Set fp16=True or bf16=True in TrainingArguments.
  • torch.bfloat16 or torch.float16 can reduce memory usage by 2-4x with minimal impact on model quality, allowing training with larger batch sizes.
  • bfloat16 is more numerically stable than float16 and recommended if supported by your GPU (e.g., Nvidia A100).
  • Mixed precision speeds up math-heavy operations like linear layers and convolutions.

Optimizer Choice

  • AdamW is the most common optimizer but has high memory usage.
  • Alternatives selectable via the optim argument include adamw_hf, adamw_torch, adamw_apex_fused, adamw_anyprecision, and adafactor; adafactor in particular uses less memory.
  • Fused optimizers like adamw_apex_fused from Apex or adamw_anyprecision fuse multiple operations for faster speed.
  • Fused Adam combines the Adam update’s elementwise operations into a single kernel, reducing memory access.

Gradient Accumulation

  • Accumulate gradients over multiple smaller batches before doing an optimizer step to emulate training with larger batch sizes. Set gradient_accumulation_steps in TrainingArguments.
  • Allows training with large effective batch sizes that don’t fit in GPU memory.
  • In multi-GPU setups it can also increase throughput by reducing communication overhead, but very large effective batch sizes can slow convergence.
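
To illustrate the mechanics, here is a minimal manual-loop sketch (a Hugging Face-style model, a dataloader, and an optimizer are assumed to already exist); the Trainer does the equivalent internally when gradient_accumulation_steps is set.

```python
accumulation_steps = 8  # effective batch size = per-step batch size * 8

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(dataloader):
    loss = model(inputs, labels=labels).loss
    (loss / accumulation_steps).backward()   # scale so gradients average correctly
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # one update per 8 micro-batches
        optimizer.zero_grad()
```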

Gradient Checkpointing

  • Trade off compute for memory by recomputing activations in the backward pass instead of storing them.
  • Speeds up training by allowing larger batch sizes but slows down each iteration.
  • Enable with gradient_checkpointing=True in TrainingArguments.

Efficient Data Loading

  • Use DataLoader(pin_memory=True) to speed up data transfer from CPU to GPU memory.
  • Set DataLoader(num_workers=N) to preload data with multiple worker processes.
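
A minimal sketch (the dataset object is assumed to already exist):

```python
from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset,            # assumed to be an existing torch Dataset
    batch_size=32,
    shuffle=True,
    pin_memory=True,    # page-locked host memory -> faster CPU-to-GPU copies
    num_workers=4,      # preload batches in background worker processes
)
```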

Offload to CPU

  • Offload the optimizer state and model parameters to CPU when not in use to free up GPU memory, e.g. via DeepSpeed ZeRO-Offload (offload_optimizer and offload_param in the DeepSpeed config).
  • Enables larger models and batch sizes than would otherwise fit, at the cost of slower steps due to CPU-GPU transfers.

TF32 on Ampere GPUs

  • Enable TF32 mode on Ampere or newer GPUs to automatically use the TF32 data type, which has the range of fp32 but precision of fp16. Set tf32=True in TrainingArguments.
  • Speeds up linear layers, convolutions, and matmuls with minimal accuracy impact.
  • Set torch.backends.cuda.matmul.allow_tf32 = True.

Flash Attention 2

  • Use Flash Attention 2 kernels integrated in the Transformers library to speed up attention computation and reduce memory usage.
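
A minimal sketch of enabling it at load time (the model ID is a placeholder; the flash-attn package must be installed and the GPU must support it):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",              # placeholder model ID
    torch_dtype=torch.bfloat16,               # FA2 requires fp16 or bf16
    attn_implementation="flash_attention_2",  # use Flash Attention 2 kernels
)
```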

The Trainer API in 🤗 Transformers supports most of these techniques via the TrainingArguments class. Experiment with combinations of these approaches to find the optimal tradeoff between speed, memory efficiency, and model quality for your specific model and hardware setup.
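
As a reference point, a TrainingArguments configuration combining several of the options above might look like the following. This is a hedged sketch: the values are illustrative placeholders rather than recommendations, and not every combination makes sense on every GPU.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size = 4 * 8 = 32
    gradient_checkpointing=True,     # recompute activations to save memory
    bf16=True,                       # or fp16=True on GPUs without bf16 support
    tf32=True,                       # Ampere or newer GPUs only
    optim="adamw_torch_fused",       # fused AdamW; or "adafactor" to save memory
    dataloader_pin_memory=True,
    dataloader_num_workers=4,
    learning_rate=2e-5,
    num_train_epochs=3,
)
```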

Source: https://huggingface.co/docs/transformers/en/perf_train_gpu_one

TinyTimeMixers

TinyTimeMixers (TTMs) are compact pre-trained models for multivariate time-series forecasting, open-sourced by IBM Research. With fewer than 1 million parameters, TTM introduces the notion of the first-ever “tiny” pre-trained models for time-series forecasting.

https://huggingface.co/ibm/TTM

TTM outperforms several popular benchmarks that demand billions of parameters in zero-shot and few-shot forecasting. TTMs are lightweight forecasters, pre-trained on publicly available time-series data with various augmentations. TTM provides state-of-the-art zero-shot forecasts and can easily be fine-tuned for multivariate forecasting with just 5% of the training data while remaining competitive.

Distilling Knowledge from Large LLMs: Fine-tuning Mistral with LoRA

As large language models (LLMs) continue to advance, there is a growing need to distill their knowledge into smaller, more efficient models suitable for real-world applications. One promising approach is knowledge distillation via fine-tuning using techniques like LoRA (Low-Rank Adaptation). In this article, we’ll dive into best practices for fine-tuning the 7B parameter Mistral model with LoRA.

The LoRA Advantage

Traditional fine-tuning updates all the weights of a pre-trained LLM, which can be computationally expensive and data-hungry, especially for large models. LoRA circumvents this by injecting trainable rank decomposition matrices into the LLM layers, enabling efficient adaptation to new tasks without modifying the original model weights.

Compared to full fine-tuning, LoRA requires significantly less compute and data, making it well-suited for fine-tuning models like Mistral. It has been shown to match or even exceed the performance of full fine-tuning on various tasks while using orders of magnitude fewer trainable parameters.

Selecting the Optimal LoRA Rank

The LoRA rank (r) determines the number of trainable parameters and directly impacts the model’s capacity to capture task-specific knowledge. A higher rank allows the model to better approximate the ideal fine-tuned weights, potentially improving performance. However, it also increases memory requirements and the risk of overfitting.

For Mistral, common ranks are r=64 or r=128, though some practitioners have experimented with higher values like r=256, which makes roughly 8% of the model’s parameters trainable. The optimal rank depends on the complexity of the task and the dataset size – simple tasks may work well with a lower r, while more complex ones may benefit from a higher r.

Dataset Size and Quality

While LoRA is data-efficient compared to full fine-tuning, having sufficient high-quality training data is still crucial for achieving good performance. For a 7B model like Mistral, researchers recommend at least 50,000 examples for reasonable results, with 100,000+ examples often yielding better performance.

However, even smaller datasets of 1,000 – 10,000 carefully curated examples can be effective with LoRA, whereas full fine-tuning typically needs far more data. Data quality and relevance to the target task matter more than sheer quantity – high-quality, curated datasets can outperform larger, noisier ones.

Using too little data (e.g. less than 1,000 examples) may lead to overfitting or poor performance. For very large datasets (>1M examples), full fine-tuning may be more effective than LoRA, depending on available compute resources.

Putting it All Together

So, what are the best practices for fine-tuning Mistral with LoRA? Based on current research, a good starting point could be:

  • LoRA rank (r) = 128
  • 10,000 – 100,000 high-quality, task-relevant examples
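
A minimal PEFT sketch of this starting point is below; the target modules, lora_alpha, and dropout values are illustrative assumptions rather than prescriptions from the text.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=128,                         # LoRA rank, per the starting point above
    lora_alpha=256,                # scaling factor (illustrative choice)
    lora_dropout=0.05,             # illustrative choice
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows how small the trainable fraction is
```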

During training, it’s essential to monitor performance on a held-out validation set to select the best checkpoint and avoid overfitting. Additionally, increasing the LoRA alpha (lora_alpha) can help counteract a lower rank but may introduce instability.

Distillation Approaches

Beyond LoRA, researchers have explored various distillation approaches for transferring knowledge from large LLMs to smaller models:

  1. Reverse KL Divergence: Replacing the standard forward KL divergence loss with reverse KL can prevent the student model from overestimating low-probability regions of the teacher LLM’s distribution, making it more suitable for generative tasks (a minimal loss sketch follows this list).
  2. Multi-Task Learning with Rationales: Training the student on two tasks – label prediction and rationale generation, where rationales are intermediate reasoning steps extracted from the LLM teacher. This creates an explicit connection between inputs and outputs.
  3. Data Augmentation: Leveraging data augmentation to generate context-rich, skill-specific training data from the LLM teacher. This helps the student model approximate the teacher’s contextual abilities and ethical alignment.
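
For the first approach, here is a minimal sketch of a reverse KL distillation loss computed from student and teacher logits; the temperature and the function interface are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def reverse_kl_loss(student_logits, teacher_logits, temperature=1.0):
    """Reverse KL, i.e. KL(student || teacher), averaged over positions."""
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_probs = s_log_probs.exp()
    # sum_x p_student(x) * (log p_student(x) - log p_teacher(x))
    return (s_probs * (s_log_probs - t_log_probs)).sum(dim=-1).mean()
```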

The Future of LLM Distillation

As LLMs continue to grow in size and capability, techniques like LoRA and knowledge distillation will become increasingly important for making these models accessible and deployable across a wide range of applications.

By following best practices, leveraging the latest research, and adhering to legal and ethical considerations when working with LLM outputs, practitioners can effectively distill the knowledge from large models like Mistral into smaller, more efficient models tailored to their specific needs.

The possibilities for LLM distillation are vast, paving the way for a future where the power of large language models is available to everyone, regardless of computational resources.