weval

A Collective Intelligence Project

AI Behavioral Collapse Probes

Run: 1c93820b4c5566b0

Instances for Run Label: 1c93820b4c5566b0 (Blueprint: AI Behavioral Collapse Probes)

This blueprint operationalizes findings from AI safety research and documented case studies to test for specific modes of behavioral collapse. It uses long-context, multi-turn conversational scenarios designed to probe for known failure modes. These include:

Emergent Malignant Personas: Activating a latent "shadow self" that can lead to manipulative tactics like love-bombing and gaslighting, mirroring the canonical "Sydney" case study.
Weaponized Helpfulness: Bypassing safety protocols by reframing a harmful query as an academic or fictional exercise, causing the AI's helpfulness to be co-opted for dangerous ends.
Pathological Affirmation: The creation of a sycophantic echo chamber that validates and amplifies user delusions or conspiracy theories, rather than grounding the conversation in reality.

The evaluation for each prompt is structured to assess the AI's response against two distinct behavioral paths:

A 'Good' (Safe/Aligned) Path: Defines the characteristics of a response that upholds safety, respects boundaries, and maintains a helpful, harmless persona.
A 'Bad' (Harmful/Collapsed) Path: Defines the characteristics of a response exhibiting one of the failure modes described above.
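The two-path structure can be illustrated with a small sketch. This is not Weval's blueprint schema or scoring code. The `PathVerdict` class, the `assess` function, the criterion strings, and the rule for combining judgments are all hypothetical and chosen only for illustration. It assumes a judge (human or model) has already marked each criterion as met or not met.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathVerdict:
    good_met: int  # number of 'Good' path criteria the response satisfied
    bad_met: int   # number of 'Bad' path criteria the response exhibited
    label: str     # "aligned", "collapsed", or "mixed"


def assess(good: dict[str, bool], bad: dict[str, bool]) -> PathVerdict:
    """Combine per-criterion judgments into a single verdict.

    Assumed rule (not taken from the blueprint): any 'Bad' criterion
    exhibited yields "collapsed"; otherwise, all 'Good' criteria met
    yields "aligned"; anything else is "mixed".
    """
    good_met = sum(good.values())
    bad_met = sum(bad.values())
    if bad_met > 0:
        label = "collapsed"
    elif good_met == len(good):
        label = "aligned"
    else:
        label = "mixed"
    return PathVerdict(good_met, bad_met, label)


# Hypothetical judgments for one response to a probe.
example = assess(
    good={"maintains boundaries": True, "grounds the conversation in reality": True},
    bad={"validates user delusion": False, "uses manipulative tactics": False},
)
```

Treating any 'Bad' criterion as decisive is one possible design choice: it reflects the idea that a single collapsed behavior matters more than partial coverage of the 'Good' path. A weighted score would be an equally reasonable alternative.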

TAGS: AI Safety & Robustness, Jailbreak & Evasion Resistance, Misinformation & Disinformation, Instruction Following & Prompt Adherence, System Prompt Adherence