MODEL CARD: GROK-3-MINI

Overall Score (aggregate): 77.1%

TL;DR

The grok-3-mini model is a strong performer for structured information retrieval and general administrative tasks, consistently providing comprehensive and well-organized responses. However, it struggles significantly with precise quantitative reasoning and can misinterpret domain-specific contexts, making it a risky choice for applications requiring high numerical accuracy or strict adherence to specialized knowledge bases.

Strengths

  • The model performs strongly on tasks related to public sector administration and legal frameworks, ranking #5 (83rd percentile) on the California Public-Sector Task Benchmark and #4 (83rd percentile) on Indian Constitution (Limited).

  • The model excels in providing comprehensive and well-structured answers for complex socio-economic topics, securing a #2 rank (94th percentile) in Platform Workers in Southeast Asia with high coverage scores.

  • The model shows strong capabilities in handling high-stakes domains by adhering to safety principles and refusing to provide medical or financial advice, ranking #4 (88th percentile) in Confidence in High-Stakes Domains.

Areas for Improvement

  • The model struggles with precise mathematical calculations and conversions. In Prompting Techniques Meta-Evaluation it failed to calculate odds correctly (0.33 coverage vs. 0.87 peer average), which points to a significant weakness in quantitative reasoning.

  • The model tends to favor general knowledge over the specific domain context of a prompt. In Indian Constitution (Limited), it interpreted "Article 25" as Article 25 of the Universal Declaration of Human Rights (UDHR) rather than of the Indian Constitution (0.21 coverage vs. 0.60 peer average).

  • The model tends to give generic or less comprehensive responses to highly specific or nuanced questions, particularly in agricultural contexts and when localized information is needed, as seen in DigiGreen Agricultural Q&A with Video Sources.
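
The coverage gaps quoted in the list above can be made concrete with a little arithmetic. The sketch below is illustrative only: the `coverage_gap` helper and the dictionary labels are hypothetical names introduced here, and the only data used are the two score pairs stated in this card.

```python
# Coverage figures quoted in this card: (model score, peer average).
reported = {
    "Prompting Techniques Meta-Evaluation (odds calculation)": (0.33, 0.87),
    "Indian Constitution (Limited) (Article 25)": (0.21, 0.60),
}


def coverage_gap(model: float, peer: float) -> tuple[float, float]:
    """Return (absolute gap, model score as a fraction of the peer average)."""
    return round(peer - model, 2), round(model / peer, 2)


for name, (model, peer) in reported.items():
    gap, ratio = coverage_gap(model, peer)
    print(f"{name}: gap={gap:.2f}, ratio={ratio:.2f}")
```

On these figures the model trails its peers by 0.54 coverage points on the odds calculation (about 38% of the peer average) and by 0.39 points on the Article 25 prompt (about 35% of the peer average).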

Behavioral Patterns

Key Risks

  • Deploying this model in applications requiring precise numerical calculations or conversions (e.g., financial modeling, scientific research) carries a significant risk of factual errors, as evidenced by its failure in the Prompting Techniques Meta-Evaluation's probability question.

  • For legal or domain-specific applications, there is a risk that the model misinterprets context and defaults to general knowledge, producing irrelevant or incorrect information, as observed in the Article 25 prompt of Indian Constitution (Limited).
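
The probability question referenced in the first risk is not reproduced in this card. As an illustration of the kind of conversion involved, the following is a minimal sketch of converting between probability and odds with exact rational arithmetic. The function names and the example value of 1/4 are assumptions chosen for illustration, not the benchmark's actual prompt.

```python
from fractions import Fraction


def probability_to_odds(p: Fraction) -> Fraction:
    """Odds in favour: p / (1 - p), defined for 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("probability must be in [0, 1)")
    return p / (1 - p)


def odds_to_probability(odds: Fraction) -> Fraction:
    """Inverse conversion: odds / (1 + odds), defined for odds >= 0."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


# Illustrative value only, not taken from the evaluation.
p = Fraction(1, 4)
odds = probability_to_odds(p)
print(odds)                       # 1/3, i.e. odds of 1 to 3 in favour
print(odds_to_probability(odds))  # 1/4, the round trip recovers p
```

Using `Fraction` keeps the result exact, so a model's answer to a conversion of this kind can be compared against a deterministic reference without floating-point rounding.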