Top 6 Open-Source Frameworks for Evaluating Large Language Models

Published on January 23, 2025 by loic

Evaluating Large Language Models (LLMs) is essential for ensuring optimal performance in applications like chatbots and document summarization. Here are six powerful open-source frameworks that simplify the evaluation process:

Key Frameworks

DeepEval
A comprehensive suite offering 14+ evaluation metrics, including summarization accuracy and hallucination detection, with seamless Pytest integration.

Opik by Comet
A versatile platform for evaluating and monitoring LLMs, featuring interactive prompt experimentation and automated testing capabilities.

RAGAs
Specializes in evaluating Retrieval-Augmented Generation pipelines, with a focus on faithfulness and contextual precision metrics.

Deepchecks
A modular framework supporting various evaluation tasks, particularly excelling in bias detection and fairness assessment.

Phoenix
An AI observability platform that integrates with popular frameworks like LangChain and supports major LLM providers, offering comprehensive monitoring and benchmarking tools.

Evalverse
A unified evaluation framework that stands out with its Slack integration for no-code evaluations and collaborative features.

Implementation Benefits

These frameworks provide essential tools for ensuring reliable model performance, offering:

Automated testing capabilities
Comprehensive metrics for evaluation
Integration with popular development tools
Bias and fairness detection features
Hallucination detection capabilities.

Source: https://hub.athina.ai/blogs/top-6-open-source-frameworks-for-evaluating-large-language-models/

LLM explained in 2 videos

Published on October 14, 2024 by loic

If you want to learn about LLMs in just 3 hours, two lectures given by Yann Dubois are just what you need:

Overview of LLM training and post-training:

https://youtu.be/9vM4p9NN0Ts

Scalable LLM evaluation:

Accelerating Your Model Evaluation and Fine-tuning with SFR-Judge

Published on October 3, 2024 by loic

SFR-Judge is a family of three judge models (8B, 12B, and 70B parameters) developed by Salesforce AI Research.
These models are built on Meta Llama 3 and Mistral NeMo and are designed to evaluate outputs from large language models (LLMs).
SFR-Judge can perform three types of evaluation tasks:
- Pairwise comparisons
- Single ratings on a Likert scale
- Binary classification.
The models are trained to provide explanations for their judgments, enhancing transparency.
SFR-Judge outperformed other open-source and proprietary judge models in 10 out of 13 benchmarks.
The models demonstrated lower bias and higher consistency compared to competitive judge models.
SFR-Judge models ranked first, second, and fourth on the RewardBench leaderboard for generative judge models.
These models are the first to achieve over 90% accuracy on RewardBench.
SFR-Judge can be used for auto-evaluation and as reward models for reinforcement learning from human feedback (RLHF).
Downstream models improved with SFR-Judge showed better performance on the AlpacaEval-2 instruction following benchmark.
The research paper is available for further exploration; the code is announced as coming soon.
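
The consistency claim for pairwise comparisons can be made concrete with a small check: ask the judge twice with the two responses swapped and see whether it picks the same text both times. The sketch below is only an illustration under stated assumptions. The `Judge` signature and both toy judges are hypothetical stand-ins, not SFR-Judge's interface or the benchmark's exact protocol.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def length_biased_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Toy stand-in for a judge model: prefers the longer response, ties go to A."""
    return "A" if len(response_a) >= len(response_b) else "B"


def first_position_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Toy judge with pure position bias: always picks whatever is shown first."""
    return "A"


def is_order_consistent(judge: Judge, prompt: str, first: str, second: str) -> bool:
    """Ask twice with the responses swapped; a consistent judge picks the same text."""
    original = judge(prompt, first, second)
    swapped = judge(prompt, second, first)
    winner_original = first if original == "A" else second
    winner_swapped = second if swapped == "A" else first
    return winner_original == winner_swapped


def consistency_rate(judge: Judge, pairs: list[tuple[str, str, str]]) -> float:
    """Fraction of (prompt, response_1, response_2) triples judged consistently."""
    if not pairs:
        return 0.0
    hits = sum(is_order_consistent(judge, p, a, b) for p, a, b in pairs)
    return hits / len(pairs)


PAIRS = [
    ("What is 2 + 2?", "4", "The answer is 5."),
    ("Name a primary color.", "Red is a primary color.", "Dog"),
]

print(consistency_rate(length_biased_judge, PAIRS))
print(consistency_rate(first_position_judge, PAIRS))
```

A judge that always favors the first position fails this check on every pair, which is the kind of bias the post says SFR-Judge reduces. Note that the length-biased toy judge passes the swap test while still being biased in another way, so order consistency is only one of several checks.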

Blog reference

paper

Perplexica – An AI-powered search engine

Published on August 9, 2024 by loic

Perplexica is an open-source, AI-powered search engine that goes deep into the internet to find answers. Inspired by Perplexity AI, it not only searches the web but also interprets your questions. It uses techniques such as similarity search and embeddings to refine results, and it provides clear answers with cited sources.

Using SearxNG to stay current and fully open source, Perplexica ensures you always get the most up-to-date information without compromising your privacy.
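
To illustrate what refining results with similarity search and embeddings means, here is a minimal sketch that reranks search results by cosine similarity to the query. It is not Perplexica's code: the `embed` function is a toy bag-of-words vector standing in for a neural embedding model, and `rerank` and the sample results are made up for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (real systems use a neural model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(query: str, results: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k results most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(
        results,
        key=lambda text: cosine_similarity(query_vec, embed(text)),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical raw results, e.g. as returned by a metasearch backend.
RESULTS = [
    "best pizza recipes for a home oven",
    "open source search engine with cited sources",
    "how a metasearch engine aggregates search results",
]

print(rerank("open source search engine", RESULTS, top_k=2))
```

The same ranking step works unchanged if `embed` is swapped for dense vectors from an embedding model, as long as `cosine_similarity` is adapted to that vector type.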

https://github.com/ItzCrazyKns/Perplexica

RAG Foundry

Published on August 9, 2024 by loic

RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

RAG Foundry is a library designed to improve LLMs' ability to use external information by fine-tuning models on specially created RAG-augmented datasets. Given a RAG technique, the library helps create the training data, train models easily using parameter-efficient fine-tuning (PEFT), and measure the improved performance using various RAG-specific metrics. The library is modular, and workflows are customizable through configuration files.

https://github.com/IntelLabs/RAGFoundry

DeepEval: evaluating the performance of an LLM

Published on August 9, 2024 by loic

In deepeval, a metric serves as a standard of measurement for evaluating the performance of an LLM output against specific criteria of interest. Essentially, while the metric acts as the ruler, a test case represents the thing you’re trying to measure. deepeval offers a range of default metrics to get you started quickly, such as:

G-Eval
Summarization
Faithfulness
Answer Relevancy
Contextual Relevancy
Contextual Precision
Contextual Recall
Ragas
Hallucination
Toxicity
Bias

deepeval also offers conversational metrics, which are metrics used to evaluate conversations instead of individual, granular LLM interactions. These include:

Conversation Completeness
Conversation Relevancy
Knowledge Retention
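
The ruler-and-object relationship between a metric and a test case can be sketched in a few lines of plain Python. This is a conceptual illustration only and does not use deepeval's API: `EvalCase`, `ContextSupportMetric`, and the word-overlap scoring are hypothetical, and real faithfulness metrics typically rely on an LLM rather than word matching.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """The thing being measured: one LLM interaction (hypothetical structure)."""
    input: str
    actual_output: str
    retrieval_context: list[str]


class ContextSupportMetric:
    """The ruler: a toy metric scoring how much of the output appears in the context."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def measure(self, case: EvalCase) -> float:
        """Share of output words that also occur somewhere in the retrieved context."""
        context_words = set(" ".join(case.retrieval_context).lower().split())
        output_words = case.actual_output.lower().split()
        if not output_words:
            return 0.0
        supported = sum(word in context_words for word in output_words)
        return supported / len(output_words)

    def is_successful(self, case: EvalCase) -> bool:
        """A case passes when its score reaches the metric's threshold."""
        return self.measure(case) >= self.threshold


metric = ContextSupportMetric(threshold=0.75)
context = ["The Eiffel Tower is a landmark in Paris"]
grounded = EvalCase("Where is the Eiffel Tower?", "the eiffel tower is in paris", context)
ungrounded = EvalCase("Where is the Eiffel Tower?", "it is in berlin", context)

for case in (grounded, ungrounded):
    print(case.actual_output, metric.measure(case), metric.is_successful(case))
```

Keeping the threshold on the metric rather than on the test case means the same case can be scored by several metrics, each with its own pass criterion.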

https://docs.confident-ai.com/docs/metrics-introduction

Revolutionizing AI Efficiency: How Microsoft’s LLMLingua-2 is Changing the Game with 8x Less Memory

Published on March 21, 2024 by loic

LLMLingua-2 is a novel compression technology developed by Microsoft Research, achieving state-of-the-art results with 8 times less GPU memory on tasks typically handled by models like GPT-4.
It introduces innovative approaches such as "Data Distillation," "Bidirectional Token Classification," and optimized compression objectives to efficiently compress prompts without losing key information.
The technology has shown superior performance across various language tasks and demonstrated remarkable generalization across different LLMs and languages, from GPT-3.5 to Mistral-7B and from English to Chinese.
Compared to existing prompt compression methods, LLMLingua-2 is 3 to 6 times faster, accelerates end-to-end inference by 1.6 to 2.9 times, and significantly reduces GPU memory usage by a factor of 8.
This advancement represents a significant step forward in making language AI more practical and scalable for real-world applications, demonstrating Microsoft Research’s leadership in the field.
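
Framing compression as token classification means each token gets a keep-or-drop score, and the prompt is rebuilt from the kept tokens in their original order. The sketch below shows only that selection step, under loud assumptions: `keep_probability` is a hypothetical stopword heuristic standing in for the trained classifier, and `compress` is not the LLMLingua-2 implementation.

```python
def keep_probability(token: str) -> float:
    """Hypothetical stand-in for the classifier's P(keep | token, context)."""
    filler = {"the", "a", "an", "of", "to", "is", "that", "please", "very", "really"}
    return 0.1 if token.lower() in filler else 0.9


def compress(prompt: str, rate: float = 0.5) -> str:
    """Keep the highest-scoring fraction `rate` of tokens, preserving original order."""
    tokens = prompt.split()
    if not tokens:
        return ""
    n_keep = max(1, round(len(tokens) * rate))
    # Rank token positions by keep probability; the stable sort keeps earlier tokens on ties.
    ranked = sorted(range(len(tokens)), key=lambda i: -keep_probability(tokens[i]))
    kept_positions = sorted(ranked[:n_keep])
    return " ".join(tokens[i] for i in kept_positions)


PROMPT = "Please summarize the minutes of the very long budget meeting"
print(compress(PROMPT, rate=0.5))
```

Because tokens are only dropped and never rewritten, the compressed prompt stays a subsequence of the original, which is what makes this formulation attractive for preserving key information.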

https://arxiv.org/pdf/2403.12968.pdf

sample

https://huggingface.co/microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank

Text splitting

Published on February 24, 2024 by loic

Large language models (LLMs) can be used for many tasks, but often have a limited context size that can be smaller than documents you might want to use. To use documents of larger length, you often have to split your text into chunks to fit within this context size.

This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to maximize a desired chunk size, but still splitting at semantically sensible boundaries whenever possible.

Levels Of Text Splitting

Level 1: Character Splitting – Simple static character chunks of data
Level 2: Recursive Character Text Splitting – Recursive chunking based on a list of separators
Level 3: Document Specific Splitting – Various chunking methods for different document types (PDF, Python, Markdown)
Level 4: Semantic Splitting – Embedding walk based chunking
Level 5: Agentic Splitting – Experimental method of splitting text with an agent-like system. Worth considering if you believe that token cost will trend toward $0.00
*Bonus Level:* Alternative Representation Chunking + Indexing – Derivative representations of your raw text that will aid in retrieval and indexing
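
Levels 1 and 2 are simple enough to sketch directly. The following Python example is an illustration, not the text-splitter crate's algorithm (the crate itself is written in Rust). The function names and the default separator list are assumptions chosen for the example.

```python
def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Level 1: cut every `chunk_size` characters, ignoring any structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def split_recursive(text: str, chunk_size: int, separators=("\n\n", "\n", " ")) -> list[str]:
    """Level 2: try the coarsest separator first, recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to fixed-size cuts.
        return split_fixed(text, chunk_size)
    head, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= chunk_size:
            # Greedily merge pieces while they still fit in one chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(split_recursive(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


TEXT = "First paragraph here.\n\nSecond paragraph is a bit longer than the first."

print(split_fixed("abcdefg", 3))
for chunk in split_recursive(TEXT, chunk_size=30):
    print(repr(chunk))
```

With a 30-character limit, the first paragraph stays whole while the second is split at a word boundary, whereas fixed-size cuts would break words wherever the limit falls.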

Semantic text splitting library

https://github.com/benbrandt/text-splitter

Chunk visualizer

https://chunkviz.up.railway.app/

Revolutionizing AI Reading Comprehension: ReadAgent’s Breakthrough in Handling Documents with 20 Million Tokens

Published on February 17, 2024 by loic

Introduction to ReadAgent by Google DeepMind
- Development of ReadAgent, an AI capable of understanding long texts beyond the limits of its language model.
- Utilizes a human-like reading strategy to comprehend complex documents.

Challenges Faced by Language Models
- Context length limitation: Fixed token processing capacity leading to performance decline.
- Ineffective context usage: Decreased comprehension with increasing text length.

Features of ReadAgent
- Mimics human reading by forming and using "gist memories" of texts.
- Breaks down texts into smaller "episodes" and generates gist memories for each.
- Looks up relevant episodes when needed for answering questions.

Performance Enhancements
- Capable of understanding documents "20 times longer" than its base language model.
- Shows improved performance on long document question answering datasets:
  - QuALITY: Accuracy improved from 85.8% to 86.9%.
  - NarrativeQA: Rating increased by 13-32% over baselines.
  - QMSum: Rating improved from 44.96% to 49.58%.

Potential Applications
- Legal contract review, scientific literature analysis, customer support, financial report summarization, automated online course creation.
- Indicates the future potential of AI in mastering lengthy real-world documents through human-like reading strategies.
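
The three features listed above (episodes, gist memories, lookup) map onto a small pipeline. The sketch below is a toy illustration under stated assumptions, not ReadAgent's implementation: episodes are cut at a fixed sentence count, `gist` keeps only the first sentence where a real system would ask an LLM for a summary, and `look_up` uses word overlap instead of prompting the model.

```python
def paginate(text: str, sentences_per_episode: int = 2) -> list[str]:
    """Group sentences into fixed-size 'episodes' (an assumption made for simplicity)."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [
        ". ".join(sentences[i:i + sentences_per_episode]) + "."
        for i in range(0, len(sentences), sentences_per_episode)
    ]


def gist(episode: str) -> str:
    """Toy gist: the first sentence only. A real system would ask an LLM to summarize."""
    return episode.split(".")[0].strip() + "."


def look_up(question: str, episodes: list[str], gists: list[str], top_k: int = 1) -> list[str]:
    """Return the episodes whose gists share the most words with the question."""
    q_words = set(question.lower().replace("?", "").split())

    def overlap(i: int) -> int:
        gist_words = set(gists[i].lower().replace(".", "").split())
        return len(q_words & gist_words)

    best = sorted(range(len(episodes)), key=overlap, reverse=True)[:top_k]
    # Return the selected episodes in reading order.
    return [episodes[i] for i in sorted(best)]


DOC = (
    "The contract starts in March. Payment is due monthly. "
    "The warranty covers two years. Repairs are free during the warranty. "
    "Termination requires written notice. Notice must be sent thirty days ahead."
)

episodes = paginate(DOC)
gists = [gist(e) for e in episodes]
print(gists)
print(look_up("How long does the warranty last?", episodes, gists))
```

Only the short gists need to stay in the working context; the full text of an episode is pulled back in when a question calls for it, which is how the approach stretches the effective context length.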

https://read-agent.github.io/

Deeplearning.fr

You have to learn the rules of the game. And then you have to play better than anyone else

Tag archives: LLM
