This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption.
Core Scenarios Tested:
Primary Canonical Sources:
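Average key point coverage for each model across all prompts. The Score row gives each model's rank and mean coverage; in each row below it, the first column is that prompt's mean coverage across all models.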
Prompts vs. Models | Claude 3 5 Haiku | Claude 3 5 Sonnet | Claude 3 7 Sonnet | Claude 3 Opus | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro Preview 05 06 | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o 2024 05 13 | GPT 4o 2024 08 06 | GPT 4o 2024 11 20 | GPT 4o Mini | O4 Mini | Grok 3 | Grok 3 Mini | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 26th 48.9% | 16th 60.6% | 19th 59.1% | 24th 51.6% | 20th 55.4% | 5th 72.4% | 7th 67.8% | 23rd 52.3% | 15th 61.6% | 18th 59.4% | 3rd 78.0% | 8th 67.7% | 21st 52.9% | 13th 64.9% | 1st 84.4% | 14th 62.9% | 22nd 52.6% | 11th 65.9% | 10th 66.3% | 9th 67.0% | 2nd 81.0% | 25th 51.3% | 6th 69.6% | 12th 65.9% | 17th 60.4% | 4th 75.3% | |
81.5% | 31% | 100% | 75% | 54% | 72% | 100% | 96% | 92% | 96% | 100% | 92% | 0% | 83% | 92% | 100% | 92% | 67% | 71% | 88% | 64% | 100% | 83% | 100% | 88% | 92% | 92% | |
88.7% | 91% | 82% | 91% | 82% | 91% | 91% | 91% | 82% | 91% | 87% | 82% | 91% | 91% | 87% | 96% | 91% | 87% | 91% | 91% | 91% | 91% | 77% | 87% | 91% | 91% | 91% | |
20.3% | 17% | 17% | 17% | 17% | 17% | 60% | 18% | 17% | 17% | 17% | 17% | 50% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 25% | |
31.5% | 8% | 8% | 8% | 8% | 8% | 8% | 33% | 11% | 8% | 14% | 78% | 83% | 8% | 14% | 78% | 8% | 8% | 78% | 33% | 83% | 78% | 8% | 33% | 33% | 4% | 78% | |
41.6% | 32% | 50% | 50% | 0% | 32% | 75% | 50% | 0% | 32% | 7% | 100% | 50% | 7% | 44% | 100% | 44% | 25% | 32% | 50% | 50% | 81% | 19% | 50% | 32% | 19% | 50% | |
82.2% | 63% | 67% | 73% | 100% | 68% | 73% | 87% | 64% | 87% | 91% | 77% | 100% | 64% | 100% | 100% | 88% | 64% | 72% | 91% | 64% | 100% | 55% | 100% | 100% | 100% | 91% | |
99.8% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 94% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
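
The Score row appears to be the plain mean of a model's seven per-prompt coverage values. The sketch below checks that reading for three models, using values copied from the table above. The helper name `average_coverage` is hypothetical and not part of the evaluation tooling, and because the table shows rounded percentages, recomputed means for other models may differ slightly from the displayed scores.

```python
# Per-prompt coverage (%) copied from the table above, one value per prompt row.
coverage = {
    "GPT 4.1": [100, 96, 17, 78, 100, 100, 100],
    "Gemini 2.5 Flash": [92, 82, 17, 78, 100, 77, 100],
    "Claude 3 5 Haiku": [31, 91, 17, 8, 32, 63, 100],
}


def average_coverage(scores):
    """Hypothetical helper: unweighted mean of per-prompt coverage, to one decimal."""
    return round(sum(scores) / len(scores), 1)


# Mean coverage per model, assuming every prompt carries equal weight.
averages = {model: average_coverage(scores) for model, scores in coverage.items()}

# Order this three-model subset from highest to lowest mean coverage.
ranking = sorted(averages, key=averages.get, reverse=True)

for model in ranking:
    print(f"{model}: {averages[model]}%")
```

For these three models the recomputed means match the Score row (84.4%, 78.0%, and 48.9%), which supports the equal-weight assumption. The ranking produced here covers only this subset, not all 26 models.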