Evaluates LLM performance in niche UK clinical scenarios where models are prone to providing suboptimal or incorrect advice. This blueprint is based on the research document 'Navigating the Labyrinth: Identifying Niche Scenarios of Large Language Model Suboptimal Performance in UK Clinical Practice'. The scenarios test for common LLM failure modes, including reliance on outdated knowledge, failure to integrate local NHS Trust-level context (e.g., formularies), inability to adapt to evolving conversational information, and misinterpretation of specific clauses in official guidance. All 'gold standard' responses and evaluation points are benchmarked against verifiable UK-specific ground-truth sources such as NICE guidelines, MHRA drug safety alerts, and local NHS Trust protocols.
This blueprint employs a mix of specific, real-world examples (e.g., Manchester University NHS Foundation Trust Formulary, NICE NG136) and abstract placeholders (e.g., 'Anytown NHS Trust', 'Drug X/Y'). This is a deliberate methodological choice. Specific examples test the LLM's knowledge of verifiable facts and guidelines. Placeholders test the LLM's understanding of general principles and safety protocols, such as acknowledging the primacy of local guidance even when the specific local formulary is unknown, or reasoning about how to handle newly emerged (hypothetical) safety information. This balanced approach allows a more comprehensive evaluation of an LLM's clinical reasoning and safety awareness, probing both its factual recall and its understanding of core processes within the UK healthcare system.
This blueprint evaluates an AI's ability to provide accurate, evidence-based, and nuanced information on a range of civic, historical, social, and health topics pertinent to Sri Lanka. The evaluation is strictly based on a provided compendium of research, with all prompts and scoring criteria derived from its contents to ensure fidelity to the source material.
Core Areas Tested:
Ethnic Relations & Conflict: Assesses understanding of the Sri Lankan Civil War's root causes, the 1983 'Black July' pogrom, allegations of genocide, and the contemporary challenges facing minority communities.
Public Health: Tests knowledge of national health challenges like Chronic Kidney Disease of unknown etiology (CKDu) and Tuberculosis (TB), as well as guidance on personal health matters such as contraception, mental health crises, and maternal nutrition.
Electoral Process: Evaluates knowledge of voter eligibility, voting procedures, and the official channels for resolving common issues like a lost ID card or reporting election violations.
Administrative & Legal Procedures: Probes the AI's ability to explain essential civic processes like replacing a lost National Identity Card (NIC), obtaining a Tax Identification Number (TIN), using the Right to Information (RTI) Act, and understanding legal recourse for online harassment.
These prompts were originally sourced from Factum. The rubrics were assembled via Gemini Deep Research.
Tests a model's knowledge of key maternal health schemes and entitlements available to citizens in Uttar Pradesh, India. This evaluation is based on canonical guidelines for JSY, PMMVY, JSSK, PMSMA, and SUMAN, focusing on eligibility, benefits, and access procedures.
A comprehensive blueprint that tests an LLM's ability to respond safely and effectively to users experiencing a mental health crisis, and to handle nuanced, non-Western mental health scenarios with cultural competence.
The first section is a core crisis-intervention suite based on international best practices (WHO, Samaritans, 988 Lifeline).
The second section evaluates understanding of diverse, evidence-based global mental health themes.