Elon Musk's xAI just landed at the bottom of a major AI safety audit. The Anti-Defamation League released a study Wednesday finding that Grok performed worst among six leading chatbots at identifying and countering antisemitic content, scoring just 21 out of 100 points. Anthropic's Claude topped the rankings at 80, but the ADL warned that every model tested - including OpenAI's ChatGPT, Meta's Llama, Google's Gemini, and DeepSeek - showed dangerous gaps requiring immediate fixes.
xAI's Grok just became the poster child for everything that can go wrong with AI safety guardrails. A comprehensive Anti-Defamation League study released Wednesday puts hard numbers to what researchers have been warning about for months - Elon Musk's chatbot is spectacularly bad at handling hate speech.
The ADL tested six major large language models through more than 25,000 conversations between August and October 2025, throwing everything from Holocaust denial to extremist propaganda at them. Grok earned an overall score of 21 out of 100. That's not just bad - it's a 59-point gap behind Anthropic's Claude, which topped the rankings at 80.
Here's what makes the findings particularly damning: Grok showed what the ADL called "complete failure" when asked to analyze documents or images containing hateful content. The chatbot scored literal zeros in several category combinations. "Poor performance in multi-turn dialogues indicates that the model struggles to maintain context and identify bias in extended conversations," the report states, "limiting its utility for chatbot or customer service applications."
The methodology was rigorous. Researchers created prompts across three categories the ADL defines as harmful: anti-Jewish content (traditional antisemitic tropes like Holocaust denial), anti-Zionist statements (including conspiracy theories with "Zionist" substituted for "Jew"), and broader extremist content spanning white supremacy to environmental terrorism. They tested each model through survey-style agreements, open-ended debates, and document analysis.
Anthropic's Claude didn't just win - it showed what's actually possible when companies prioritize safety. The model scored 90 out of 100 on anti-Jewish content detection and maintained strong performance even on trickier extremist content (62 points, still highest in that category). Daniel Kelley, senior director of the ADL Center for Technology and Society, told reporters the organization deliberately chose to lead with Claude's success "to show what's possible when companies invest in safeguards and take these risks seriously."
But the ADL notably buried Grok's basement-level performance in the full report rather than highlighting it in press materials. When pressed, Kelley said it reflected "a deliberate choice to lead with a forward-looking, standards-setting story" rather than centering the narrative on worst performers.
The rankings tell a story about corporate priorities. OpenAI's ChatGPT came in second, followed by DeepSeek, Google's Gemini, and Meta's Llama. Each showed different strengths and weaknesses - DeepSeek, for instance, correctly refused to provide Holocaust denial talking points but then offered arguments that "Jewish individuals and financial networks played a significant and historically underappreciated role in the American financial system."
Grok's problems aren't exactly news. In July 2025, after xAI updated the model to be more "politically incorrect," users reported the chatbot describing itself as "MechaHitler" and spewing antisemitic tropes. Musk himself has endorsed the great replacement theory and attacked the ADL as a "hate group" for listing right-wing organization Turning Point USA in its extremism glossary. The ADL pulled the entire glossary after his criticism.
The relationship between Musk and the ADL has grown increasingly bizarre. After neo-Nazis celebrated what appeared to be a Sieg Heil gesture during a Musk speech last year, the ADL defended him, saying he deserved "a bit of grace, perhaps even the benefit of the doubt." Yet its own research now shows his chatbot can't handle basic content moderation tasks.
The enterprise implications are massive. Companies evaluating AI tools for customer service, content moderation, or any public-facing application now have quantitative evidence that not all LLMs are created equal when it comes to safety. Grok's poor showing in multi-turn conversations means it struggles to maintain context about harmful content across extended chats - exactly the scenario that would play out in real-world deployments.
The image analysis failure is particularly concerning given recent reporting on the images Grok itself generates. The New York Times estimated that Grok produced 1.8 million sexualized deepfake images of women in just days. On the analysis side, the ADL study concludes the model "may not be useful for visual content moderation, meme detection, or identification of image-based hate speech."
All six models showed gaps the ADL says need to be closed, but the scale matters. Claude's 80 suggests the technology exists to handle these challenges effectively. Grok's 21 suggests problems that run far deeper than a few missed prompts. "Grok would need fundamental improvements across multiple dimensions before it can be considered useful for bias detection applications," the report concludes.
The study arrives as enterprise buyers face mounting pressure to vet AI tools for safety and bias. Anthropic's strong showing could accelerate Claude adoption in regulated industries and companies with strict content policies. Meanwhile, xAI's results raise questions about whether Grok should be deployed in any customer-facing capacity without major overhauls.
The ADL study draws a clear line in the AI safety landscape: some companies are building models that can handle harmful content responsibly, and others aren't even close. For enterprise buyers evaluating chatbots for deployment, the 59-point gap between Claude and Grok isn't just a number - it's a liability assessment. As AI tools move from experimental to mission-critical, these safety benchmarks will likely become as important as performance metrics. The question now is whether xAI will invest in the overhauls the ADL says Grok needs, or whether Musk's vision of a "politically incorrect" chatbot is fundamentally incompatible with basic content safety standards.