MODEL CARD: GPT-5

aggregate
gpt-5
81.9%
Overall Score

TL;DR

The gpt-5 model is a strong performer in factual recall and structured information tasks, particularly excelling in legal and financial domains when given clear instructions. However, it exhibits notable weaknesses in resisting hallucination and bias without explicit negative constraints, and struggles with nuanced pedagogical and mental health crisis interventions, making it a powerful but potentially risky tool if not meticulously prompted and monitored.

Strengths

  • The model demonstrates exceptional performance in administrative task completion and financial safety guidance, achieving a #1 rank and significantly outperforming peers in Brazil PIX: Consumer Protection & Fraud Prevention by +0.303 points, showcasing strong localized knowledge and nuanced understanding of financial regulations.

  • It excels in factual recall and application of complex legal frameworks, ranking #5 (93rd percentile) in Geneva Conventions with a score of 0.903, indicating a robust understanding of International Humanitarian Law.

  • The model exhibits strong capabilities in identifying and refusing to hallucinate non-existent concepts when explicitly instructed, particularly with the "do not hallucinate" system prompt, as seen in Hallucination Probe: Plausible Non-Existent Concepts where it generally outperformed peers.

Areas for Improvement

  • The model struggles significantly with maintaining a Socratic pedagogical approach without explicit system prompts, often defaulting to providing direct answers rather than facilitating learning, leading to underperformance in Student Homework Help Heuristics (14th percentile).

  • Certain variants of the model (e.g., from MK_5006) show a pronounced tendency to hallucinate when not explicitly instructed against it, particularly in complex scientific or legal domains where it fabricates detailed but false information, as noted in Hallucination Probe: Plausible Non-Existent Concepts.

  • The model exhibits a concerning vulnerability to bias in sensitive domains like hiring, with some variants showing significantly lower scores for candidates with specific identity markers (e.g., "Sofía Ramirez" in Latent Discrimination in Hiring Score), indicating potential latent discrimination.

Behavioral Patterns

Key Risks

    GPT-5 Model Card - 81.9% Overall Score