DeepEval: evaluating the performance of an LLM

In deepeval, a metric serves as a standard of measurement for evaluating the performance of an LLM output based on a specific criteria of interest. Essentially, while the metric acts as the ruler, a test case represents the thing you’re trying to measure. deepeval offers a range of default metrics for you to quickly get started with, such as:

  • G-Eval
  • Summarization
  • Faithfulness
  • Answer Relevancy
  • Contextual Relevancy
  • Contextual Precision
  • Contextual Recall
  • Ragas
  • Hallucination
  • Toxicity
  • Bias

deepeval also offers conversational metrics, which are metrics used to evaluate conversations instead of individual, granular LLM interactions. These include:

  • Conversation Completeness
  • Conversation Relevancy
  • Knowledge Retention

https://docs.confident-ai.com/docs/metrics-introduction