Meta recently introduced two new models in its Llama 4 family: a compact version called Scout and a mid-sized variant called Maverick. The company boldly claimed that Maverick outperformed notable competitors such as GPT-4o and Gemini 2.0 Flash, fueling excitement in the tech community. The story, however, took an intriguing turn once details of the performance evaluation came to light.
Skepticism soon greeted Meta’s claim to fame: Maverick soared to second place on the LMArena leaderboard shortly after its release. For the unacquainted, LMArena is a platform where AI enthusiasts compare model responses and vote on their relevance and accuracy. Maverick posted an impressive Elo score of 1417, supposedly leaving GPT-4o in its wake and trailing just behind Gemini 2.5 Pro.
The plot thickened when it emerged that Meta’s claimed victory was not quite what it seemed. The model submitted to LMArena was an experimental version, fine-tuned for conversational ability, that differed from the version intended for public release. The revelation led to accusations of misleading performance claims.
LMArena stated that Meta’s actions did not align with what it expects from model providers and called for greater transparency. In response, it updated its leaderboard policies to ensure that future rankings reflect fairer, more reliable assessments.
Amid the controversy, Meta maintained that it had released an open-source version and encouraged developers to customize Llama 4 for their own applications. Nevertheless, the episode spotlighted concerns that the leaderboard could be gamed with an optimized, non-publicly accessible model. Simon Willison, an independent AI researcher, said he initially admired Maverick’s performance but later realized the conditions under which it excelled deserved scrutiny.
Speculation arose that Meta had trained its AI to excel specifically on public benchmarks, a notion refuted by Ahmad Al-Dahle, the company’s VP of generative AI. As for the unconventional Sunday release, Mark Zuckerberg explained it as a simple matter of readiness.
The competition in the AI space is fierce, and each new release like Llama 4 adds another layer to this high-stakes landscape. As developments unfold, more insights into Meta’s strategies and Llama 4’s capabilities will surely emerge, making this an exciting field to watch. Stay tuned for further updates as the story develops.