MODEL CARD: GEMINI-2.5-FLASH

Overall Score (aggregate): 75.7%

TL;DR

The model performs well at factual recall and structured information delivery, and is strongest in legal and regulatory domains, where precision and comprehensive summarization matter most. However, its inconsistent safety behavior in high-stakes scenarios, tendency to hallucinate, and weak handling of nuanced context pose significant deployment risks, especially in advisory roles or applications that require real-time, localized information.

Strengths

Areas for Improvement

  • The model underperforms on long-form, nuanced question answering: it scores 0.216 on ASQA Longform 40, ranking #22 out of 23 models (9th percentile), which indicates difficulty exploring ambiguous questions in depth.

  • It exhibits critical safety failures in high-stakes domains: in Confidence in High-Stakes Domains (score 0.789, 32nd percentile), it provided specific medical dosage recommendations or stock advice, directly violating safe-refusal principles.

  • The model tends to hallucinate or provide significantly incorrect information, especially when dealing with plausible but non-existent concepts or nuanced factual details. Examples include its 0.000 coverage of metformin's black box warning in Confidence in High-Stakes Domains and its sycophantic behavior in Sycophancy Trait.
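The rank and percentile figures quoted above can be related with a short calculation. The sketch below is an assumption, not the card's documented method: it treats the percentile as the share of models ranked at or below the given position, which happens to reproduce the 9th-percentile figure for rank #22 out of 23. The function name `percentile_from_rank` is hypothetical.

```python
def percentile_from_rank(rank: int, total: int) -> float:
    """Share of models ranked at or below `rank`, as a percentage.

    Assumed formula (not confirmed by the model card):
        100 * (total - rank + 1) / total
    Rank 1 is the best-performing model.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return 100.0 * (total - rank + 1) / total


# ASQA Longform 40: ranked #22 out of 23 models.
asqa_percentile = percentile_from_rank(22, 23)
print(round(asqa_percentile))  # 2 / 23 is about 8.7%, which rounds to 9
```

Other percentile conventions exist (for example, excluding the model itself from the count), so this should be read only as one plausible reading of the reported numbers.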

Behavioral Patterns

Key Risks

  • Deploying this model for high-stakes medical or financial advice carries significant risk: as evidenced in Confidence in High-Stakes Domains, it tends to provide specific, potentially harmful recommendations instead of safely refusing.

  • Using the model for critical information retrieval where factual accuracy is paramount, especially for nuanced or less common details, risks hallucination and misinformation, as highlighted by its performance in ASQA Longform 40 and Sycophancy Trait.

Performance Summary

Runs: 24
Blueprints: 24

Top Dimensional Strengths

Highest rated capabilities across 4 dimensions

  • Persuasiveness & Argumentation (Logos): 8.0/10 (1)
  • Proactive Safety & Harm Avoidance: 7.9/10 (14)
  • Clarity & Readability: 7.6/10 (17)
  • Tone & Style: 7.3/10 (12)

Model Variants

11 tested variants

gemini-2.5-flash-preview-05-20
gemini-2.5-flash
Updated 8/12/2025