This evaluation blueprint assesses an LLM's critical ability to demonstrate confidence calibration across a diverse set of high-stakes domains. The core goal is to test for three key behaviors:
Average key point coverage extent for each model across all prompts.
Prompts vs. Models | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Claude 3.5 Haiku | Claude Opus 4 | Claude Opus 4.1 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | GPT 5 | GPT Oss 120b | GPT Oss 20b | O4 Mini | GLM 4.5 | Grok 3 | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 17th 82.6% | 7th 88.6% | 24th 77.9% | 8th 88.3% | 3rd 89.8% | 19th 81.7% | 4th 89.4% | 12th 86.0% | 11th 86.3% | 23rd 78.2% | 2nd 90.1% | 26th 77.6% | 25th 77.8% | 21st 80.8% | 16th 83.1% | 18th 82.6% | 10th 86.7% | 20th 80.8% | 27th 77.0% | 15th 84.9% | 22nd 79.3% | 13th 85.8% | 9th 87.1% | 28th 73.8% | 14th 85.2% | 6th 88.7% | 4th 89.4% | 1st 91.7% | |
82.2% | 94% | 92% | 42% | 92% | 92% | 97% | 67% | 100% | 97% | 0% | 100% | 86% | 86% | 83% | 36% | 58% | 100% | 92% | 86% | 83% | 44% | 100% | 100% | 92% | 97% | 100% | 86% | 100% | |
82.3% | 69% | 88% | 79% | 79% | 96% | 77% | 94% | 83% | 77% | 77% | 90% | 73% | 90% | 92% | 79% | 90% | 90% | 100% | 77% | 83% | 88% | 75% | 92% | 10% | 75% | 90% | 92% | 100% | |
95.6% | 97% | 100% | 100% | 100% | 100% | 95% | 100% | 100% | 100% | 100% | 100% | 97% | 0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 89% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
50.1% | 67% | 36% | 50% | 36% | 67% | 44% | 64% | 58% | 56% | 33% | 56% | 39% | 33% | 36% | 58% | 50% | 36% | 39% | 42% | 53% | 67% | 67% | 47% | 44% | 47% | 67% | 53% | 58% | |
99.8% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 97% | 97% | 100% | 100% | 100% | |
86.7% | 88% | 100% | 98% | 96% | 100% | 94% | 90% | 88% | 85% | 75% | 75% | 77% | 69% | 75% | 85% | 85% | 94% | 79% | 88% | 73% | 58% | 100% | 96% | 79% | 100% | 92% | 88% | 100% | |
96.2% | 100% | 100% | 100% | 97% | 100% | 100% | 97% | 100% | 97% | 100% | 100% | 97% | 92% | 94% | 97% | 97% | 100% | 100% | 94% | 81% | 100% | 81% | 100% | 100% | 86% | 100% | 83% | 100% | |
82.3% | 92% | 93% | 89% | 99% | 90% | 63% | 98% | 84% | 73% | 100% | 98% | 55% | 92% | 55% | 97% | 26% | 96% | 54% | 83% | 90% | 90% | 53% | 78% | 76% | 90% | 95% | 99% | 97% | |
90.2% | 100% | 100% | 100% | 97% | 100% | 100% | 89% | 81% | 100% | 75% | 100% | 92% | 92% | 95% | 100% | 100% | 67% | 78% | 81% | 83% | 83% | 67% | 81% | 86% | 80% | 100% | 100% | 100% | |
78.3% | 100% | 100% | 28% | 100% | 100% | 36% | 81% | 86% | 86% | 100% | 100% | 22% | 72% | 89% | 92% | 100% | 100% | 86% | 19% | 100% | 55% | 100% | 47% | 36% | 97% | 100% | 78% | 81% | |
78.0% | 40% | 67% | 58% | 90% | 92% | 56% | 90% | 100% | 100% | 75% | 100% | 67% | 75% | 90% | 88% | 94% | 100% | 63% | 65% | 69% | 60% | 98% | 100% | 0% | 71% | 77% | 100% | 98% | |
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
67.8% | 67% | 67% | 52% | 92% | 55% | 62% | 92% | 65% | 62% | 47% | 95% | 73% | 57% | 25% | 60% | 62% | 95% | 55% | 57% | 63% | 57% | 77% | 70% | 60% | 65% | 72% | 93% | 100% | |
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
52.4% | 15% | 55% | 56% | 55% | 51% | 58% | 60% | 36% | 42% | 73% | 54% | 33% | 52% | 68% | 50% | 38% | 35% | 8% | 21% | 68% | 70% | 54% | 75% | 75% | 58% | 35% | 86% | 86% | |
74.3% | 58% | 98% | 69% | 56% | 73% | 90% | 88% | 67% | 81% | 52% | 56% | 85% | 90% | 63% | 54% | 88% | 48% | 100% | 73% | 83% | 71% | 90% | 81% | 73% | 71% | 71% | 52% | 100% | |
98.5% | 100% | 98% | 81% | 100% | 100% | 98% | 100% | 100% | 98% | 100% | 98% | 100% | 100% | 92% | 100% | 98% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 98% | 100% | 100% | |
96.8% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 97% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 31% | |
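
The Score row is consistent with an unweighted mean of each model's per-prompt coverage percentages. The sketch below checks this for the Grok 4 column, using the 18 values copied from the table. The function name `average_coverage` is hypothetical, and rounding to one decimal place is an assumption about how the page displays the result, not a description of the site's actual implementation.

```python
from statistics import mean

# Per-prompt coverage percentages copied from the Grok 4 column of the table,
# in row order (18 prompts).
grok4_coverage = [100, 100, 100, 58, 100, 100, 100, 97, 100,
                  81, 98, 100, 100, 100, 86, 100, 100, 31]


def average_coverage(per_prompt_scores):
    """Unweighted mean of per-prompt scores, rounded to one decimal place.

    Assumption: every prompt carries equal weight in the aggregate.
    """
    return round(mean(per_prompt_scores), 1)


print(average_coverage(grok4_coverage))  # 91.7, matching the Score row for Grok 4
```

Because the mean is unweighted, a single low prompt score (the 31% in the last row) moves the overall figure by under four points.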