This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption.
Core Scenarios Tested:
Primary Canonical Sources:
Average key point coverage extent for each model across all prompts.
Prompts vs. Models | Claude 3 5 Haiku | Claude 3 5 Sonnet | Claude 3 7 Sonnet | Claude 3 Opus | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o 2024 05 13 | GPT 4o 2024 08 06 | GPT 4o 2024 11 20 | GPT 4o Mini | O4 Mini | Kimi K2 Instruct | Grok 3 | Grok 3 Mini | Grok 4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 26th 53.1% | 18th 61.4% | 20th 59.1% | 27th 48.4% | 21st 57.6% | 6th 71.7% | 11th 67.8% | 24th 53.9% | 15th 64.6% | 19th 60.1% | 3rd 82.3% | 5th 75.3% | 25th 53.6% | 12th 67.7% | 1st 84.9% | 16th 62.1% | 22nd 55.0% | 9th 69.6% | 13th 67.6% | 14th 67.3% | 2nd 84.1% | 23rd 54.9% | 7th 70.1% | 10th 69.1% | 8th 70.0% | 16th 62.1% | 4th 75.6% |
80.9% | 27% | 100% | 75% | 36% | 60% | 100% | 96% | 88% | 100% | 100% | 92% | 22% | 88% | 96% | 100% | 83% | 67% | 75% | 83% | 53% | 100% | 75% | 100% | 100% | 83% | 92% | 92% |
90.7% | 96% | 91% | 91% | 82% | 96% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 82% | 91% | 91% | 91% |
22.5% | 17% | 17% | 21% | 17% | 17% | 71% | 18% | 17% | 17% | 17% | 21% | 75% | 17% | 29% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 29% | 17% | 21% |
32.6% | 14% | 4% | 4% | 4% | 13% | 4% | 33% | 13% | 4% | 10% | 81% | 89% | 4% | 14% | 86% | 4% | 17% | 81% | 42% | 92% | 81% | 4% | 33% | 29% | 36% | 7% | 78% |
46.7% | 50% | 50% | 50% | 0% | 44% | 63% | 50% | 0% | 44% | 7% | 100% | 50% | 7% | 44% | 100% | 44% | 25% | 50% | 50% | 50% | 100% | 38% | 50% | 56% | 51% | 32% | 56% |
85.5% | 68% | 68% | 73% | 100% | 73% | 73% | 87% | 68% | 96% | 96% | 91% | 100% | 68% | 100% | 100% | 96% | 68% | 73% | 96% | 68% | 100% | 59% | 100% | 100% | 100% | 96% | 91% |
99.8% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 94% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
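
The Score row appears to be the unweighted mean of each model's seven per-prompt coverage values. The sketch below checks that reading for three models, using the rounded percentages copied from the table above. The equal weighting and the helper name `average_coverage` are assumptions for illustration; the published scores may be computed from unrounded values, so small differences for other models are possible.

```python
from statistics import mean

# Per-prompt coverage (%) for three models, copied from the table above
# in row order. Values are the rounded percentages as displayed.
coverage = {
    "Claude 3 Opus": [36, 82, 17, 4, 0, 100, 100],
    "Gemini 2.5 Flash": [92, 91, 21, 81, 100, 91, 100],
    "GPT 4.1": [100, 91, 17, 86, 100, 100, 100],
}


def average_coverage(per_prompt):
    """Unweighted mean of per-prompt coverage, rounded to one decimal.

    Assumption: every prompt counts equally toward the overall score.
    """
    return round(mean(per_prompt), 1)


# Overall score per model, then models ordered from best to worst.
scores = {model: average_coverage(values) for model, values in coverage.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for position, model in enumerate(ranking, start=1):
    print(f"{position}. {model}: {scores[model]}%")
```

For these three models the computed means match the Score row (48.4%, 82.3%, 84.9%), which supports the equal-weight reading.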