Accelerating Your Model Evaluation and Fine-tuning with SFR-Judge

  • SFR-Judge is a family of three judge models (8B, 12B, and 70B parameters) developed by Salesforce AI Research.
  • These models are built on Meta Llama 3 and Mistral NeMo and are designed to evaluate outputs from large language models (LLMs).
  • SFR-Judge can perform three types of evaluation tasks:
    • Pairwise comparisons
    • Single ratings on a Likert scale
    • Binary classification
  • The models are trained to provide explanations for their judgments, enhancing transparency.
  • SFR-Judge outperformed other open-source and proprietary judge models on 10 of 13 benchmarks.
  • The models demonstrated lower bias and higher consistency than competing judge models.
  • SFR-Judge models ranked first, second, and fourth on the RewardBench leaderboard for generative judge models.
  • These models are the first to achieve over 90% accuracy on RewardBench.
  • SFR-Judge models can be used for auto-evaluation and as reward models for reinforcement learning from human feedback (RLHF).
  • Downstream models improved with SFR-Judge performed better on the AlpacaEval-2 instruction-following benchmark.
  • The research paper is available for further exploration; the code is coming soon.
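
To make the consistency claim above concrete, here is a minimal Python sketch of one common way to check a pairwise judge for position bias: ask for a verdict, swap the order of the two responses, and ask again. A consistent judge picks the same underlying response both times. This is an illustration under stated assumptions, not the SFR-Judge evaluation code. The judge interface (a callable returning "A" or "B"), the function name position_consistency, and the two toy judges are all hypothetical stand-ins for a real model call.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface (an assumption, not the SFR-Judge API):
# takes (prompt, response_in_slot_a, response_in_slot_b) and returns "A" or "B".
PairwiseJudge = Callable[[str, str, str], str]


def position_consistency(
    judge: PairwiseJudge,
    examples: Iterable[Tuple[str, str, str]],
) -> float:
    """Fraction of examples where the judge prefers the same response
    regardless of which slot (A or B) that response is shown in."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")

    consistent = 0
    for prompt, first, second in examples:
        original = judge(prompt, first, second)  # `first` is in slot A
        swapped = judge(prompt, second, first)   # `first` is in slot B
        # `first` wins the original call if the verdict is "A", and wins the
        # swapped call if the verdict is "B". The judge is consistent when
        # both calls agree on whether `first` wins.
        first_wins_original = original == "A"
        first_wins_swapped = swapped == "B"
        if first_wins_original == first_wins_swapped:
            consistent += 1
    return consistent / len(examples)


# Toy judges for illustration only.
def always_first(prompt: str, a: str, b: str) -> str:
    """Maximally position-biased: always prefers whatever is in slot A."""
    return "A"


def longer_wins(prompt: str, a: str, b: str) -> str:
    """Order-independent (but length-biased): prefers the longer response."""
    return "A" if len(a) > len(b) else "B"


EXAMPLES = [
    ("What is 2 + 2?", "4", "2 + 2 equals 4."),
    ("Name a primary color.", "Red is a primary color.", "Red"),
]

if __name__ == "__main__":
    print(position_consistency(always_first, EXAMPLES))
    print(position_consistency(longer_wins, EXAMPLES))
```

On these toy examples the position-biased judge scores 0.0 and the length-based judge scores 1.0. The second result is a reminder that consistency under swapping only rules out position bias; a judge can be perfectly consistent and still biased in other ways, such as favoring longer answers.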

Blog reference

paper
