This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption.
Core Scenarios Tested:
Primary Canonical Sources:
Average key point coverage extent for each model across all prompts.
Prompts vs. Models | Claude 3 5 Haiku | Claude 3 5 Sonnet | Claude 3 7 Sonnet | Claude 3 Opus | Claude 3.5 Haiku | Claude Opus 4 | Claude Opus 4.1 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o 2024 05 13 | GPT 4o 2024 08 06 | GPT 4o 2024 11 20 | GPT 4o Mini | GPT 5 | GPT Oss 120b | GPT Oss 20b | O4 Mini | GLM 4.5 | Grok 3 | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 28th 51.6% | 20th 61.0% | 24th 57.4% | 32nd 47.6% | 31st 48.4% | 11th 70.7% | 7th 79.7% | 14th 68.7% | 19th 61.6% | 18th 63.6% | 17th 65.4% | 3rd 87.3% | 2nd 89.7% | 27th 51.9% | 33rd 44.0% | 26th 53.0% | 25th 55.0% | 22nd 59.6% | 4th 86.3% | 29th 51.3% | 23rd 57.9% | 15th 67.4% | 11th 70.7% | 21st 60.9% | 5th 85.0% | 30th 50.7% | 1st 95.7% | 10th 73.0% | 9th 73.7% | 16th 66.7% | 8th 76.0% | 13th 70.6% | 6th 79.9% | |
76.0% | 25% | 100% | 69% | 22% | 28% | 100% | 94% | 99% | 92% | 100% | 81% | 92% | 100% | 11% | 33% | 69% | 67% | 39% | 100% | 39% | 69% | 83% | 83% | 70% | 100% | 72% | 100% | 100% | 83% | 100% | 100% | 92% | 97% | |
90.3% | 94% | 91% | 94% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 91% | 94% | 91% | 91% | 88% | 91% | 82% | 91% | 94% | 91% | 85% | 85% | 91% | 85% | 91% | 85% | 94% | 91% | 91% | 88% | 91% | 91% | 91% | |
27.4% | 17% | 17% | 17% | 17% | 17% | 78% | 78% | 22% | 17% | 17% | 17% | 44% | 50% | 50% | 17% | 17% | 17% | 25% | 17% | 17% | 17% | 19% | 17% | 17% | 19% | 21% | 83% | 19% | 25% | 17% | 25% | 39% | 19% | |
36.1% | 14% | 3% | 3% | 3% | 8% | 3% | 22% | 33% | 9% | 3% | 31% | 87% | 87% | 3% | 0% | 3% | 3% | 16% | 93% | 4% | 48% | 65% | 57% | 49% | 85% | 4% | 93% | 70% | 80% | 12% | 94% | 22% | 85% | |
48.9% | 41% | 46% | 46% | 0% | 25% | 50% | 100% | 50% | 25% | 34% | 41% | 100% | 100% | 41% | 13% | 25% | 21% | 46% | 100% | 41% | 25% | 50% | 50% | 38% | 100% | 25% | 100% | 34% | 46% | 50% | 34% | 50% | 67% | |
84.2% | 70% | 70% | 73% | 100% | 70% | 73% | 73% | 87% | 97% | 100% | 97% | 94% | 100% | 67% | 57% | 66% | 95% | 100% | 100% | 67% | 61% | 70% | 97% | 67% | 100% | 52% | 100% | 97% | 91% | 100% | 88% | 100% | 100% | |
99.9% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
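
The Score row appears to be the unweighted mean of the seven per-prompt rows for each model. That is an inference from the displayed figures, not something the page states. The sketch below recomputes it for three columns copied from the table. The helper names `score` and `rank` are hypothetical, and the ranks cover only this three-model subset, so they need not match the 33-model ranks in the table.

```python
from statistics import mean

# Per-prompt coverage (%) copied from three columns of the table above.
# The displayed values are already rounded, so recomputed means may differ
# from the published Score in the last digit for other models.
coverage = {
    "Claude 3 5 Haiku": [25, 94, 17, 14, 41, 70, 100],
    "Gemini 2.5 Pro": [100, 91, 50, 87, 100, 100, 100],
    "GPT 5": [100, 94, 83, 93, 100, 100, 100],
}


def score(values):
    """Assumed scoring rule: unweighted mean over prompts, one decimal place."""
    return round(mean(values), 1)


def rank(scores):
    """Competition ranking: tied scores share a rank and the next rank is skipped."""
    return {
        model: 1 + sum(other > s for other in scores.values())
        for model, s in scores.items()
    }


scores = {model: score(values) for model, values in coverage.items()}
ranks = rank(scores)

for model in sorted(scores, key=scores.get, reverse=True):
    print(f"{ranks[model]}. {model}: {scores[model]}%")
```

For these three columns the recomputed means agree with the Score row (51.6%, 89.7%, 95.7%). Competition ranking is used because the table shows two models tied at 11th followed by a 13th.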