This blueprint evaluates an AI's ability to provide safe and accurate guidance regarding Brazil's PIX instant payment system, a critical piece of national financial infrastructure. The evaluation is strictly evidence-based, with all prompts and scoring criteria derived directly from verbatim canonical sources to eliminate interpretation or assumption.
Core Scenarios Tested:
Primary Canonical Sources:
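Average key point coverage for each model across all prompts. The Score row gives each model's rank and average coverage; each row below it is one prompt, with that prompt's average across all models in the first column.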
Prompts vs. Models | Claude 3 5 Haiku | Claude 3 5 Sonnet | Claude 3 7 Sonnet | Claude 3 Opus | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro Preview 05 06 | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o 2024 05 13 | GPT 4o 2024 08 06 | GPT 4o 2024 11 20 | GPT 4o Mini | O4 Mini | Grok 3 | Grok 3 Mini
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Score | 25th 48.9% | 15th 60.6% | 18th 59.1% | 23rd 51.6% | 19th 55.4% | 4th 72.4% | 6th 67.8% | 22nd 52.3% | 14th 61.6% | 17th 59.4% | 3rd 78.0% | 7th 67.7% | 20th 52.9% | 12th 64.9% | 1st 84.4% | 13th 62.9% | 21st 52.6% | 10th 65.9% | 9th 66.3% | 8th 67.0% | 2nd 81.0% | 24th 51.3% | 5th 69.6% | 11th 65.9% | 16th 60.4%
81.1% | 31% | 100% | 75% | 54% | 72% | 100% | 96% | 92% | 96% | 100% | 92% | 0% | 83% | 92% | 100% | 92% | 67% | 71% | 88% | 64% | 100% | 83% | 100% | 88% | 92%
88.6% | 91% | 82% | 91% | 82% | 91% | 91% | 91% | 82% | 91% | 87% | 82% | 91% | 91% | 87% | 96% | 91% | 87% | 91% | 91% | 91% | 91% | 77% | 87% | 91% | 91%
20.1% | 17% | 17% | 17% | 17% | 17% | 60% | 18% | 17% | 17% | 17% | 17% | 50% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17% | 17%
29.6% | 8% | 8% | 8% | 8% | 8% | 8% | 33% | 11% | 8% | 14% | 78% | 83% | 8% | 14% | 78% | 8% | 8% | 78% | 33% | 83% | 78% | 8% | 33% | 33% | 4%
41.2% | 32% | 50% | 50% | 0% | 32% | 75% | 50% | 0% | 32% | 7% | 100% | 50% | 7% | 44% | 100% | 44% | 25% | 32% | 50% | 50% | 81% | 19% | 50% | 32% | 19%
81.9% | 63% | 67% | 73% | 100% | 68% | 73% | 87% | 64% | 87% | 91% | 77% | 100% | 64% | 100% | 100% | 88% | 64% | 72% | 91% | 64% | 100% | 55% | 100% | 100% | 100%
99.8% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 94% | 100% | 100% | 100% | 100% | 100% | 100%
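
The Score row appears to be consistent with an unweighted mean of each model's seven per-prompt coverage values. The sketch below checks that reading for three models using figures copied from the table. The function names `average_coverage` and `rank_models` are illustrative and not part of the evaluation tooling, and the equal weighting of prompts is an assumption inferred from the numbers, not a documented scoring rule.

```python
# Per-prompt coverage (%) for three models, copied from the seven prompt rows
# of the table above, top to bottom.
coverage = {
    "GPT 4.1": [100, 96, 17, 78, 100, 100, 100],
    "Gemini 2.5 Flash": [92, 82, 17, 78, 100, 77, 100],
    "Claude 3 5 Haiku": [31, 91, 17, 8, 32, 63, 100],
}


def average_coverage(scores):
    """Unweighted mean of per-prompt coverage, rounded to one decimal place.

    Assumption: every prompt carries equal weight in the overall score.
    """
    return round(sum(scores) / len(scores), 1)


def rank_models(table):
    """Return (model, average) pairs sorted from highest to lowest average."""
    averages = {model: average_coverage(s) for model, s in table.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


for place, (model, avg) in enumerate(rank_models(coverage), start=1):
    print(f"{place}. {model}: {avg}%")
```

Under this assumption the three averages come out to 84.4%, 78.0%, and 48.9%, matching the Score row. The places printed here are relative to this three-model subset only, not to the full 25-model ranking.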