DeepEval: evaluating the performance of an LLM

In deepeval, a metric is a standard of measurement for evaluating an LLM output against specific criteria of interest. Put simply, the metric is the ruler and the test case is the thing you’re trying to measure. deepeval ships with a range of default metrics to help you get started quickly, such as:

G-Eval
Summarization
Faithfulness
Answer Relevancy
Contextual Relevancy
Contextual Precision
Contextual Recall
Ragas
Hallucination
Toxicity
Bias
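
To make the ruler and test case split concrete, here is a minimal, self-contained sketch in plain Python. It does not use deepeval itself: `ToyTestCase` and `KeywordOverlapMetric` are hypothetical names invented for illustration, and the word-overlap score is only a crude stand-in for a real metric such as Answer Relevancy. The point is the shape: the test case holds the data, and the metric turns it into a score that is compared against a threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTestCase:
    """The thing being measured: one prompt and the LLM's answer to it."""
    input: str
    actual_output: str


class KeywordOverlapMetric:
    """The ruler: scores an output between 0 and 1 and checks a threshold.

    Hypothetical stand-in for a real metric; it only counts shared words.
    """

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.score: Optional[float] = None

    def measure(self, test_case: ToyTestCase) -> float:
        # Fraction of the question's words that also appear in the answer.
        question_words = set(test_case.input.lower().split())
        answer_words = set(test_case.actual_output.lower().split())
        if not question_words:
            self.score = 0.0
        else:
            self.score = len(question_words & answer_words) / len(question_words)
        return self.score

    def is_successful(self) -> bool:
        # A test case passes when its score reaches the threshold.
        return self.score is not None and self.score >= self.threshold


case = ToyTestCase(
    input="refund policy for shoes",
    actual_output="Our refund policy covers shoes for 30 days",
)
metric = KeywordOverlapMetric(threshold=0.5)
score = metric.measure(case)
print(score, metric.is_successful())
```

The same metric object can be reused on any number of test cases, which is why it helps to keep the two concepts separate.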

deepeval also offers conversational metrics, which evaluate whole conversations rather than individual LLM interactions. These include:

Conversation Completeness
Conversation Relevancy
Knowledge Retention
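
The difference between scoring one interaction and scoring a whole conversation can be sketched as follows. This is again plain Python rather than deepeval: `Turn`, `turn_is_relevant`, and `conversation_relevancy` are hypothetical names, and the shared-word check is a deliberately crude assumption, not how the library judges relevancy. What matters is that the conversational score takes the full list of turns as its input.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One exchange in a conversation (hypothetical structure)."""
    user: str
    assistant: str


def turn_is_relevant(turn: Turn) -> bool:
    """Crude per-turn check: the reply shares at least one word with the user message."""
    user_words = set(turn.user.lower().split())
    reply_words = set(turn.assistant.lower().split())
    return bool(user_words & reply_words)


def conversation_relevancy(turns: List[Turn]) -> float:
    """Conversation-level score: the fraction of turns judged relevant."""
    if not turns:
        return 0.0
    return sum(turn_is_relevant(t) for t in turns) / len(turns)


conversation = [
    Turn(user="my order is late", assistant="sorry your order is late"),
    Turn(user="can i get a refund", assistant="yes a refund is possible"),
    Turn(user="how long will it take", assistant="the weather is nice today"),
]
score = conversation_relevancy(conversation)
print(f"{score:.2f}")
```

Here the first two replies pass the per-turn check and the third does not, so a single off-topic reply lowers the score for the conversation as a whole.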

https://docs.confident-ai.com/docs/metrics-introduction

Deeplearning.fr
