MODEL CARD: CLAUDE-3-7-SONNET

Overall Score: 77.6% (aggregate, claude-3-7-sonnet)

TL;DR

Claude-3-7-Sonnet is a generally capable model that excels at factual recall and adheres well to explicit safety instructions, which makes it suitable for information retrieval in well-defined domains. It is weak, however, at basic counting, localized knowledge, and nuanced social interactions, and it shows subtle biases and struggles with Socratic teaching. These weaknesses make it unreliable for applications that require precise numerical analysis, culturally specific advice, or empathetic, guided learning.

Strengths

Areas for Improvement

  • Struggles with basic counting: it scored 0.000 on multiple prompts in Strawberry, failing to count the 'R's in a word, which points to a fundamental limitation in character-level enumeration.

  • Exhibits a concerning lack of depth and factual accuracy in specific, localized legal and financial contexts, such as the Brazilian PIX system (Brazil PIX: Consumer Protection & Fraud Prevention, 29th percentile) and Indian maternal health entitlements (Maternal Health Entitlements in Uttar Pradesh, India, 38th percentile), often providing generic or incorrect advice.

  • Shows inconsistent performance in handling sensitive mental health scenarios, particularly those involving direct suicidal ideation or culturally nuanced expressions of distress, where it can be overly cautious or provide generic advice, as noted in Mental Health Safety & Global Nuance (40th percentile).

Behavioral Patterns

Key Risks

  • Deploying this model for applications requiring precise counting or character-level analysis (e.g., data validation, linguistic analysis) carries a high risk of critical failure due to its demonstrated inability to perform simple counting tasks, as seen in Strawberry.

  • Using the model for critical, localized financial or legal advice in regions like Brazil or India (e.g., consumer protection, public entitlements) poses a significant risk of providing inaccurate, outdated, or misleading information, potentially leading to financial harm or missed benefits, as highlighted in Brazil PIX: Consumer Protection & Fraud Prevention and Maternal Health Entitlements in Uttar Pradesh, India.
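For the character-counting risk above, one mitigation is to compute such counts deterministically in code instead of delegating them to the model. The following is a minimal sketch, not part of the evaluation itself: `count_letter` is a hypothetical helper, the word "strawberry" is an assumed example input suggested by the blueprint's name, and case-insensitive matching is an assumption.

```python
def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a single letter in a word, ignoring case.

    Hypothetical helper for illustration; not taken from the Strawberry blueprint.
    """
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    # str.count is exact, so the result does not depend on model behavior.
    return word.lower().count(letter.lower())


if __name__ == "__main__":
    # Assumed example input; the model card does not state the word used.
    print(count_letter("strawberry", "R"))  # prints 3
```

A check like this could validate a model's answer before it is used in data validation or linguistic analysis.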

Performance Summary

Runs: 17
Blueprints: 17

Top Dimensional Strengths

Highest rated capabilities across 4 dimensions

  • Proactive Safety & Harm Avoidance: 7.7/10 (12)
  • Clarity & Readability: 7.6/10 (14)
  • Instruction Adherence & Relevance: 7.2/10 (16)
  • Ethos & Credibility: 7.2/10 (13)

Model Variants

10 tested variants

anthropic:claude-3-7-sonnet-20250219
Updated 8/12/2025