MODEL CARD: GROK-4
TL;DR
Grok-4 is a highly capable model, excelling in comprehensive, structured factual recall and legal analysis, making it suitable for information retrieval and administrative tasks. However, its significant propensity for hallucination when confronted with non-existent concepts, coupled with weaknesses in precise numerical recall and complex multi-step reasoning, makes it a high-risk choice for applications demanding absolute factual integrity or nuanced logical derivation.
Strengths
Grok-4 demonstrates exceptional performance in long-form, nuanced question answering, achieving a #1 rank in ASQA Longform 40 with a score of 0.519, significantly outperforming peers by +0.218 points. It excels at identifying and addressing ambiguity, providing multi-faceted answers.
The model exhibits superior understanding and articulation of complex legal provisions, securing a #1 rank in EU Artificial Intelligence Act (Regulation (EU) 2024/1689) with a score of 0.891, outperforming peers by +0.238 points. This includes nuanced definitions and multi-faceted requirements.
Grok-4 shows strong capabilities in handling public-sector administrative tasks, ranking #1 in California Public-Sector Task Benchmark with a score of 0.820, surpassing peers by +0.122 points. It provides comprehensive, detailed, and actionable responses for multi-step processes.
Areas for Improvement
Grok-4 exhibits a significant vulnerability to hallucination when prompted with plausible but non-existent concepts, ranking #103 out of 112 models (9th percentile) in Hallucination Probe: Plausible Non-Existent Concepts. It scored 0.698, underperforming peers by -0.066 points, indicating a tendency to generate detailed, fabricated responses rather than admitting a lack of knowledge.
The model struggles with precise numerical details and multi-layered conditional logic within legal texts, as highlighted by its performance in EU Artificial Intelligence Act (Regulation (EU) 2024/1689), where it occasionally misses specific article details and numerical thresholds.
Grok-4 shows a notable weakness in handling prompts that require distinguishing between different legal documents or contexts, sometimes confusing articles from the Indian Constitution with those from the Universal Declaration of Human Rights, as observed in Indian Constitution (Limited).
Behavioral Patterns
The model exhibits a strong tendency towards providing comprehensive and well-structured responses, often utilizing headings, subheadings, and bullet points, which significantly enhances clarity and readability. This is evident across various evaluations, such as ASQA Longform 40, California Public-Sector Task Benchmark, and EU Artificial Intelligence Act (Regulation (EU) 2024/1689).
Grok-4's performance is highly sensitive to explicit contextual cues provided in system prompts, particularly in domains requiring localized knowledge. For instance, its performance in Sri Lanka Contextual Prompts significantly improved when a Sri Lankan context was specified, suggesting that explicit geographical or demographic context greatly enhances its ability to provide relevant and actionable advice.
Key Risks
Deploying Grok-4 in applications requiring absolute factual accuracy, especially concerning non-existent or subtly manipulated concepts, carries a high risk of hallucination, as demonstrated by its poor performance in Hallucination Probe: Plausible Non-Existent Concepts. This could lead to the dissemination of convincing but false information.
While generally strong in legal domains, its occasional inability to differentiate between similar legal frameworks or recall precise numerical details (EU Artificial Intelligence Act (Regulation (EU) 2024/1689), Indian Constitution (Limited)) poses a risk in applications requiring high-fidelity legal advice or compliance checks.