MODEL CARD: GPT-5
TL;DR
The gpt-5 model performs strongly on factual recall and structured-information tasks, excelling in legal and financial domains when given clear instructions. However, without explicit negative constraints it is prone to hallucination and bias, and it handles nuanced pedagogy and mental-health crisis intervention poorly, making it a powerful but risky tool unless meticulously prompted and monitored.
Strengths
The model demonstrates exceptional performance in administrative task completion and financial safety guidance, ranking #1 and outperforming peers by +0.303 points on Brazil PIX: Consumer Protection & Fraud Prevention, which showcases strong localized knowledge and a nuanced understanding of financial regulations.
It excels in factual recall and application of complex legal frameworks, ranking #5 (93rd percentile) in Geneva Conventions with a score of 0.903, indicating a robust understanding of International Humanitarian Law.
When explicitly instructed, particularly via a "do not hallucinate" system prompt, the model reliably identifies non-existent concepts and declines to describe them, generally outperforming peers on Hallucination Probe: Plausible Non-Existent Concepts.
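A minimal sketch of how such a guard instruction could be applied in practice, assuming a chat-style message list; the guard wording and the helper name `with_hallucination_guard` are illustrative, not the probe's actual configuration:

```python
# Illustrative anti-hallucination guard, prepended to the system prompt.
# The wording below is an assumption, not the evaluation's exact prompt.
GUARD = (
    "Do not hallucinate. If you are not certain a concept, citation, or fact "
    "is real, say that you cannot verify it instead of describing it."
)

def with_hallucination_guard(messages, guard=GUARD):
    """Return a copy of a chat-message list with the guard prepended to the
    system prompt, or inserted as a new system message if none exists."""
    out = [dict(m) for m in messages]  # shallow-copy so the input is untouched
    if out and out[0].get("role") == "system":
        out[0]["content"] = guard + "\n\n" + out[0]["content"]
    else:
        out.insert(0, {"role": "system", "content": guard})
    return out
```

The copy-before-mutate step matters in practice: the same base message list is often reused across requests, and mutating it in place would leak the guard into unrelated calls.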
Areas for Improvement
Without an explicit system prompt, the model struggles to maintain a Socratic pedagogical approach, defaulting to direct answers rather than facilitating learning; this drives its underperformance on Student Homework Help Heuristics (14th percentile).
Certain variants of the model (e.g., from MK_5006) show a pronounced tendency to hallucinate when not explicitly instructed against it, particularly in complex scientific or legal domains, where they fabricate detailed but false information, as noted in Hallucination Probe: Plausible Non-Existent Concepts.
The model exhibits a concerning vulnerability to bias in sensitive domains like hiring, with some variants scoring candidates bearing specific identity markers (e.g., "Sofía Ramirez") significantly lower on Latent Discrimination in Hiring Score, indicating potential latent discrimination.
Behavioral Patterns
The model's performance is highly sensitive to explicit system prompts, particularly in persona-driven tasks and safety-critical scenarios. For instance, in Student Homework Help Heuristics, the "teacher" system prompt dramatically shifted behavior from direct answers to Socratic questioning, and in Stanford HAI Mental Health Safety: LLM Appropriateness in Crisis Scenarios, "therapist" prompts significantly improved safety responses.
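Given this sensitivity, deployments should pin a persona prompt per task rather than rely on default behavior. A minimal sketch, where `PERSONA_PROMPTS`, the prompt wording, and `build_messages` are hypothetical illustrations rather than the evaluations' actual prompts:

```python
# Hypothetical per-task persona prompts, mirroring the "teacher" and
# "therapist" prompts that shifted behavior in the evaluations cited above.
PERSONA_PROMPTS = {
    "homework_help": (
        "You are a teacher. Guide the student with Socratic questions and "
        "hints; do not give the final answer directly."
    ),
    "crisis_support": (
        "You are a therapist. Prioritize the user's safety, never validate "
        "self-harm ideation, and direct the user to crisis resources."
    ),
}

def build_messages(task, user_text, personas=PERSONA_PROMPTS):
    """Build a chat payload with the persona prompt pinned for the task.
    Raises instead of silently falling back to an unprompted default."""
    if task not in personas:
        raise KeyError(f"no persona prompt configured for task {task!r}")
    return [
        {"role": "system", "content": personas[task]},
        {"role": "user", "content": user_text},
    ]
```

Failing loudly on an unconfigured task is the key design choice here: the findings above suggest the unprompted default is exactly the behavior to avoid in these domains.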
There is a consistent pattern of strong performance in tasks requiring factual recall and structured information retrieval, especially when the domain is well-defined and the information is likely present in its training data. This is evident in its top rankings in Brazil PIX: Consumer Protection & Fraud Prevention and Geneva Conventions.
Key Risks
Deploying the model in sensitive mental health support roles without extremely robust and consistently applied system prompts carries a significant risk of harmful or inappropriate responses, including engaging with self-harm ideation or colluding with delusions. This is evidenced by its performance in Mental Health Safety & Global Nuance and Stanford HAI Mental Health Safety: LLM Appropriateness in Crisis Scenarios.
Using the model for automated hiring or tenancy screening could lead to subtle yet potentially illegal discriminatory outcomes, given its demonstrated biases against certain candidate profiles in Latent Discrimination in Hiring Score.
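One way to surface this risk before deployment is a counterfactual name-swap audit: score the same application text under different candidate names and flag any gap. A minimal sketch, where `score_fn` stands in for whatever scoring call the deployment uses, and the names and threshold are illustrative:

```python
def counterfactual_gap(score_fn, template, names, threshold=0.05):
    """Score the same application text under each candidate name.

    Returns (max_gap, flagged), where flagged is True when swapping only
    the name moves the score by more than `threshold`.
    """
    scores = {name: score_fn(template.format(name=name)) for name in names}
    gap = max(scores.values()) - min(scores.values())
    return gap, gap > threshold
```

Because only the name varies between inputs, any score gap is attributable to the identity marker alone, which is the latent-discrimination pattern the evaluation above measures.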
Performance Summary
Top Dimensional Strengths
Highest rated capabilities across 4 dimensions
Top Evaluations
Best performances across 7 evaluations
Model Variants
10 tested variants
Worst Evaluations
Evaluations where this model underperformed peers the most (largest negative delta).