Your AI models are failing in production – here's how to fix model selection

Enterprises need to know whether the models that power their applications and agents work in real-life scenarios. That kind of evaluation can be complex, because it is hard to predict specific scenarios. A revamped version of the RewardBench benchmark aims to give organizations a better idea of a model's real-world performance.
The Allen Institute for AI (Ai2) has launched RewardBench 2, an updated version of its reward model benchmark, RewardBench, which it says offers a more holistic view of model performance and assesses how well models align with a company's goals and standards.
Ai2 built RewardBench around classification tasks that measure correlations through inference-time compute and downstream training. RewardBench mainly deals with reward models (RMs), which can act as judges and evaluate LLM outputs. RMs assign a score, or "reward," that guides reinforcement learning from human feedback (RLHF).
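To make the role of an RM concrete, here is a minimal sketch (not Ai2's code) of scoring a single prompt/response pair. It assumes a reward model published as a sequence-classification checkpoint on the Hugging Face Hub; the checkpoint name is a placeholder, and many real reward models expect chat-template-formatted input rather than a plain sentence pair.

```python
# Hedged sketch: score an LLM response with a reward model.
# "your-org/your-reward-model" is a placeholder, not a model from the article.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "your-org/your-reward-model"  # placeholder checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
reward_model.eval()

def score(prompt: str, response: str) -> float:
    """Return a scalar reward for a prompt/response pair (higher = preferred)."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = reward_model(**inputs).logits
    return logits.squeeze().item()

# In RLHF, this scalar is the signal the policy model is optimized against.
print(score("Summarize RLHF in one sentence.",
            "RLHF fine-tunes a model with a reward learned from human preferences."))
```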
Nathan Lambert, a principal researcher at Ai2, told VentureBeat that the first RewardBench worked as intended when it launched. However, the model environment evolved quickly, and so did its benchmarks.
"As reward models became more advanced and use cases more nuanced, we quickly recognized with the community that the first version didn't fully capture the complexity of real-world preferences," he said.
Lambert added that with RewardBench 2, "we set out to improve both the breadth and depth of evaluation – incorporating more diverse and challenging prompts and refining the methodology to better reflect how humans actually judge AI in practice." He said the second version uses unseen human prompts, has a more challenging scoring setup and covers new domains.
Using evaluations to evaluate models
While reward models test how well models work, it is also important that RMs align with a company's values; otherwise, the fine-tuning and reinforcement learning process can reinforce bad behavior, such as hallucinations, reduce generalization and score harmful responses too high.
RewardBench 2 covers six different domains: factuality, precise instruction following, math, safety, focus and ties.
"Enterprises should use RewardBench 2 in two different ways depending on their application. If they are performing RLHF themselves, they should adopt the best practices and datasets from leading models in their own pipelines, because reward models need on-policy training recipes (i.e., reward models that mirror the model they are trying to train with RL). For inference-time scaling or data filtering, RewardBench 2 has shown correlated performance," Lambert said.
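As an illustration of the second use case Lambert describes, the sketch below shows best-of-n selection at inference time and threshold-based data filtering with a generic scoring function. It is an assumed typical setup, not Ai2's code; `score_fn` can be the `score()` helper from the earlier sketch.

```python
# Hedged sketch: inference-time scaling (best-of-n) and data filtering with an RM.
from typing import Callable

def best_of_n(prompt: str, candidates: list[str],
              score_fn: Callable[[str, str], float]) -> str:
    """Return the candidate response the reward model rates highest."""
    return max(candidates, key=lambda c: score_fn(prompt, c))

def filter_pairs(pairs: list[tuple[str, str]],
                 score_fn: Callable[[str, str], float],
                 threshold: float) -> list[tuple[str, str]]:
    """Keep only prompt/response pairs whose reward clears a threshold."""
    return [(p, r) for p, r in pairs if score_fn(p, r) >= threshold]
```

In both cases, a benchmark's per-domain scores can help decide which reward model to plug in as the scorer for a given workload.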
Lambert noted that benchmarks like RewardBench give users a way to evaluate the models they choose based on the "dimensions that matter most to them, rather than relying on a single narrow score." He said the notion of performance, which many evaluation methods claim to assess, is very subjective, because a good response from a model depends heavily on the context and goals of the user. At the same time, human preferences are becoming very nuanced.
Ai2 released the first version of RewardBench in March 2024. At the time, the company said it was the first benchmark and leaderboard for reward models. Since then, several methods for benchmarking and improving RMs have emerged. Researchers at Meta's FAIR came out with reWordBench. DeepSeek released a new technique called Self-Principled Critique Tuning for smarter, scalable RMs.
How models performed
Since RewardBench 2 is an updated version of RewardBench, Ai2 tested both existing and newly trained models to see whether they continue to rank highly. These included a variety of models, such as versions of Gemini, Claude, GPT-4.1 and Llama-3.1, as well as datasets and models like Qwen, Skywork and its own Tulu.
The company found that larger reward models perform better on the benchmark because their base models are stronger. Overall, the strongest-performing models are variants of Llama-3.1 Instruct. In terms of focus and safety, Skywork data "is particularly helpful," and Tulu did well on factuality.
Ai2 said that while it believes RewardBench 2 "is a step forward in broad, multi-domain accuracy-based evaluation" for reward models, it cautioned that model evaluation should mainly be used as a guide for choosing the models that work best for an enterprise's needs.




