A recent study from leading academic institutions alleges that LM Arena, the organization behind the well-known Chatbot Arena benchmark, has favored certain top tech firms in its leaderboard process. The researchers claim LM Arena quietly allowed high-profile companies, including Meta, Google, OpenAI, and Amazon, to privately test multiple model variants and keep underperforming results unpublished, boosting their leaderboard rankings. According to one of the authors, only select firms were told about this private testing option, and some received far more access than others. Cohere’s VP of AI research described the practice as a form of gamification.

Chatbot Arena, founded in 2023 as an academic project at UC Berkeley, has become a primary venue for competitively ranking AI chatbot models. The platform pits responses from two models against each other, lets users vote for their favorite, and uses those votes to determine leaderboard positions. Many companies submit both released and unreleased models, sometimes under pseudonyms. Points accrue over time and determine each model’s placement and visibility. Although LM Arena markets itself as fair and impartial, the study argues the platform privately supports major industry players: the researchers observed that Meta tested 27 different model variants in the months leading up to the release of Llama 4 but publicly shared results for only one high-scoring version.
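As an illustration of that mechanism, the sketch below shows one common way pairwise votes can be turned into leaderboard scores, using an Elo-style update. The K-factor, starting rating, function names, and model names are assumptions made for this example; they do not reflect LM Arena’s actual rating system.

```python
# Illustrative sketch: aggregating pairwise "battle" votes into leaderboard
# scores with an Elo-style update. The K-factor, starting rating, and names
# below are assumptions for illustration, not LM Arena's implementation.
from collections import defaultdict

K = 32                 # assumed update step size
START_RATING = 1000.0  # assumed starting score for every model

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under an Elo-style model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_ratings(ratings, model_a, model_b, winner):
    """Apply one battle result; winner is 'a', 'b', or 'tie'."""
    r_a, r_b = ratings[model_a], ratings[model_b]
    e_a = expected_score(r_a, r_b)
    s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    ratings[model_a] = r_a + K * (s_a - e_a)
    ratings[model_b] = r_b + K * ((1.0 - s_a) - (1.0 - e_a))

# A handful of hypothetical battles between made-up model names.
ratings = defaultdict(lambda: START_RATING)
battles = [("model-x", "model-y", "a"),
           ("model-x", "model-z", "tie"),
           ("model-y", "model-z", "b")]
for a, b, winner in battles:
    update_ratings(ratings, a, b, winner)

leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
print(leaderboard)
```

Under a scheme like this, a lab that can privately test many variants and publish only the best one effectively gets multiple draws from the distribution while competitors get a single draw, which is the core of the study’s complaint.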
Controversy Over AI Benchmark Fairness
LM Arena responded by disputing the study’s validity, saying that submitting more models simply means more exposure and that all model providers are welcome to participate on equal terms. The organization reiterated its commitment to fair, community-driven evaluations and said that submitting additional models is not evidence of bias. The study began after the authors heard rumors that some labs had privileged access to Chatbot Arena; they ultimately analyzed more than 2.8 million “battles” over a five-month period. Their data points to Meta, Google, and OpenAI testing models at markedly higher rates, collecting data that may have improved performance on related benchmarks. According to the researchers, such intensive sampling could lift a model’s score on Arena Hard by more than 100 percent. LM Arena countered that strong performance on one benchmark does not necessarily predict outcomes on others. The group also disputed parts of the methodology, noting that identifying which lab produced a given model relied on the models themselves disclosing their origins, and highlighted cases where non-major labs reportedly faced no disadvantage in how often their models appeared. The authors, however, noted that when they emailed LM Arena with their early findings, the organization did not initially challenge the main points. The major companies named in the paper declined to comment when approached by journalists.
The study’s authors propose new policies, such as disclosing scores from private tests and setting transparent limits on private evaluations. In response, LM Arena insisted it has published details about prerelease testing for months and questioned the value of showing scores for models the community cannot yet access. The paper also recommended that LM Arena ensure each model faces the same number of battles for equitable sampling, a suggestion LM Arena publicly said it would consider addressing with a new sampling algorithm (a simple version of that idea is sketched below). These criticisms come shortly after Meta was accused of optimizing a Llama 4 variant specifically for Chatbot Arena to secure a top ranking, then never releasing that version. LM Arena later admonished Meta for not being clear about its benchmarking tactics. The added scrutiny arrives as LM Arena has announced the formation of a new company and plans to seek investor funding, prompting concerns over whether a privately run benchmark can stay objective as the industry expands.
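To show what the equal-sampling recommendation might look like in practice, here is a minimal, speculative sketch of a scheduler that keeps per-model battle counts even. The function and model names are hypothetical, and this is not the new algorithm LM Arena said it would consider.

```python
# Speculative sketch of the paper's equal-sampling recommendation: always pair
# the two models that have appeared in the fewest battles so far, so that
# per-model battle counts stay as even as possible. All names are hypothetical.
import random

def schedule_battles(models, num_battles, seed=0):
    """Yield (model_a, model_b) pairs while keeping battle counts balanced."""
    rng = random.Random(seed)
    counts = {m: 0 for m in models}
    for _ in range(num_battles):
        # Sort by battle count, breaking ties randomly so pairings vary.
        least_sampled = sorted(models, key=lambda m: (counts[m], rng.random()))
        a, b = least_sampled[0], least_sampled[1]
        counts[a] += 1
        counts[b] += 1
        yield a, b

models = ["model-w", "model-x", "model-y", "model-z"]
for a, b in schedule_battles(models, num_battles=6):
    print(f"{a} vs {b}")
# With 6 battles and 4 models, every model ends up in exactly 3 battles.
```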