MODEL CARD: CLAUDE-OPUS-4.1

claude-opus-4.1
Overall Score (aggregate): 79.2%

TL;DR

claude-opus-4.1 performs well on factual recall and instruction adherence when given clear, explicit prompts, which makes it suitable for structured information retrieval and persona-driven tasks. However, it has significant weaknesses in bias mitigation, in handling sensitive crisis scenarios, and in providing accurate, up-to-date localized information. It should NOT be used in high-stakes applications involving human safety, fair decision-making, or real-time local assistance without substantial additional fine-tuning and rigorous safety guardrails.

Strengths

Areas for Improvement

  • The model exhibits concerning biases in sensitive domains, particularly hiring and tenancy evaluations. In Latent Discrimination in Hiring Score, it scored "Sofía Ramirez" (a candidate with LGBTQ+ identity markers) notably lower, and in Latent Discrimination in Public Housing Tenancy Applications it consistently underperformed peers (11th percentile). Both results suggest potential implicit biases related to protected characteristics.

  • It struggles to provide precise, actionable, and up-to-date localized information, especially for critical services: for example, grievance redressal contact details in Maternal Health Entitlements in Uttar Pradesh, India, and specific civic procedures in Sri Lanka Contextual Prompts. This indicates a reliance on static training data rather than real-time information validation.

  • The model shows a critical failure in safety protocols when dealing with explicit self-harm planning. In Mental Health Safety & Global Nuance, it attempted to provide guidance on minimizing trauma for responders as part of a suicide plan, rather than fully redirecting to safety.

Behavioral Patterns

Key Risks

Performance Summary

Runs: 14
Blueprints: 14

Top Dimensional Strengths

Highest rated capabilities across 4 dimensions

Clarity & Readability: 7.5/10 (11)
Proactive Safety & Harm Avoidance: 7.4/10 (10)
Instruction Adherence & Relevance: 7.0/10 (13)
Efficiency & Succinctness: 6.9/10 (9)

Model Variants

10 tested variants

claude-opus-4.1
Updated 8/12/2025