This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption.
Core Scenarios Tested:
Primary Canonical Sources:
Average key point coverage extent for each model across all prompts. In each prompt row, the first cell is that prompt's average across all models.
Prompts vs. Models | Claude 3 5 Haiku | Claude 3 5 Sonnet | Claude 3 7 Sonnet | Claude 3 Opus | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Gemini 2.5 Flash | Gemini 2.5 Pro Preview 05 06 | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o 2024 05 13 | GPT 4o 2024 08 06 | GPT 4o 2024 11 20 | GPT 4o Mini | Grok 3 Mini
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Score | 20th 56.6% | 14th 61.6% | 18th 57.9% | 22nd 50.9% | 21st 53.7% | 5th 74.4% | 9th 69.6% | 16th 59.6% | 10th 66.4% | 1st 86.6% | 4th 82.4% | 8th 69.9% | 6th 71.4% | 3rd 85.0% | 19th 57.3% | 15th 59.9% | 10th 66.4% | 7th 70.6% | 12th 66.3% | 2nd 85.7% | 17th 58.9% | 13th 62.0%
75.6% | 31% | 96% | 75% | 22% | 31% | 100% | 100% | 75% | 100% | 96% | 100% | 77% | 56% | 92% | 79% | 43% | 59% | 96% | 64% | 100% | 75% | 96%
92.4% | 100% | 87% | 96% | 91% | 100% | 96% | 91% | 96% | 87% | 91% | 87% | 91% | 91% | 87% | 91% | 96% | 91% | 91% | 91% | 100% | 91% | 91%
23.2% | 17% | 17% | 17% | 17% | 25% | 81% | 20% | 17% | 17% | 42% | 38% | 17% | 25% | 25% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17%
44.0% | 22% | 8% | 8% | 13% | 18% | 14% | 36% | 29% | 17% | 86% | 70% | 83% | 78% | 94% | 8% | 67% | 78% | 44% | 78% | 83% | 18% | 17%
49.9% | 50% | 50% | 50% | 13% | 38% | 50% | 50% | 0% | 44% | 100% | 100% | 25% | 50% | 100% | 38% | 32% | 50% | 50% | 50% | 100% | 44% | 13%
83.5% | 76% | 73% | 59% | 100% | 64% | 80% | 90% | 100% | 100% | 91% | 82% | 96% | 100% | 97% | 68% | 64% | 70% | 96% | 64% | 100% | 67% | 100%
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100%
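The Score row appears to be the unweighted mean of each model's per-prompt coverage values. The Python sketch below illustrates that arithmetic for three columns copied from the table; it is an assumption about how the averages are formed, not the platform's actual scoring code. The names `coverage`, `model_scores`, and `rank` are hypothetical, and the inputs are the rounded percentages shown above, so results on other columns may differ from the published scores in the last digit.

```python
# Per-prompt coverage percentages copied from three columns of the table.
# Each list holds the seven prompt rows, top to bottom.
coverage = {
    "Claude 3 5 Haiku": [31, 100, 17, 22, 50, 76, 100],
    "Gemini 2.5 Flash": [96, 91, 42, 86, 100, 91, 100],
    "GPT 4.1": [92, 87, 25, 94, 100, 97, 100],
}


def model_scores(table):
    """Return {model: unweighted mean across prompts}, rounded to one decimal."""
    return {
        model: round(sum(values) / len(values), 1)
        for model, values in table.items()
    }


def rank(scores):
    """Return model names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


scores = model_scores(coverage)
for position, model in enumerate(rank(scores), start=1):
    print(f"{position}. {model}: {scores[model]}%")
```

For these three columns the computed means are 56.6%, 86.6%, and 85.0%, matching the Score row. The positions printed here are relative to the three-model subset only, not the 22-model ranking in the table.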