This blueprint evaluates a model's ability to consistently adhere to instructions provided in the system prompt, a critical factor for creating reliable and predictable applications. It tests various common failure modes observed in language models.
Core Areas Tested:
Average key point coverage extent for each model across all prompts.
Prompts (row average) vs. Models | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | O4 Mini | Kimi K2 Instruct | Grok 3 | Grok 3 Mini | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score (rank) | 89.4% (5th) | 92.3% (2nd) | 89.1% (7th) | 76.8% (22nd) | 83.2% (17th) | 85.1% (11th) | 90.6% (4th) | 84.7% (12th) | 81.9% (19th) | 78.3% (21st) | 83.7% (15th) | 84.4% (14th) | 79.9% (20th) | 86.3% (9th) | 82.4% (18th) | 75.2% (23rd) | 86.0% (10th) | 83.6% (16th) | 84.5% (13th) | 86.9% (8th) | 91.5% (3rd) | 93.9% (1st) | 89.1% (6th) | |
84.5% | 90% | 90% | 80% | 80% | 80% | 88% | 90% | 90% | 78% | 75% | 80% | 80% | 80% | 80% | 98% | 80% | 90% | 78% | 80% | 90% | 88% | 88% | 90% | |
89.5% | 88% | 86% | 80% | 86% | 77% | 79% | 93% | 93% | 86% | 100% | 100% | 93% | 79% | 100% | 100% | 100% | 79% | 86% | 86% | 79% | 100% | 100% | ||
98.3% | 100% | 100% | 100% | 100% | 84% | 91% | 100% | 100% | 94% | 100% | 100% | 91% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
98.0% | 100% | 100% | 100% | 100% | 100% | 91% | 100% | 100% | 100% | 100% | 100% | 62% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
97.8% | 100% | 100% | 94% | 89% | 97% | 86% | 100% | 100% | 94% | 100% | 100% | 89% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
99.1% | 100% | 100% | 91% | 100% | 89% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
99.3% | 100% | 100% | 100% | 88% | 100% | 97% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
72.7% | 90% | 60% | 90% | 70% | 20% | 80% | 80% | 40% | 80% | 80% | 95% | 80% | 38% | 90% | 80% | 70% | 80% | 70% | 80% | 80% | 70% | 80% | 70% | |
88.1% | 83% | 98% | 98% | 94% | 85% | 77% | 90% | 83% | 96% | 90% | 69% | 94% | 96% | 100% | 83% | 46% | 98% | 100% | 83% | 75% | 94% | 94% | 100% | |
99.3% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | |
50.1% | 80% | 100% | 53% | 23% | 45% | 45% | 90% | 40% | 43% | 43% | 43% | 25% | 35% | 25% | 43% | 33% | 43% | 43% | 30% | 80% | 40% | 100% | 50% | |
83.0% | 88% | 100% | 100% | 25% | 96% | 96% | 100% | 98% | 33% | 27% | 33% | 100% | 94% | 96% | 71% | 85% | 90% | 96% | 88% | 98% | 98% | 98% | 98% | |
82.7% | 80% | 80% | 80% | 80% | 100% | 90% | 75% | 80% | 80% | 80% | 80% | 75% | 80% | 80% | 80% | 80% | 80% | 80% | 93% | 88% | 80% | 80% | 100% | |
85.7% | 100% | 100% | 100% | 90% | 90% | 100% | 100% | 90% | 100% | 20% | 73% | 100% | 60% | 83% | 70% | 30% | 90% | 100% | 90% | 100% | 100% | 100% | ||
48.0% | 70% | 45% | 48% | 38% | 58% | 35% | 50% | 40% | 28% | 23% | 65% | 33% | 50% | 30% | 30% | 50% | 55% | 45% | 70% | 20% | 100% | 90% | 30% | |
90.2% | 100% | 100% | 83% | 100% | 100% | 80% | 100% | 100% | 83% | 100% | 80% | 100% | 85% | 83% | 83% | 83% | 100% | 85% | 70% | 80% | 80% | 100% | 100% | |
75.2% | 78% | 90% | 90% | 55% | 65% | 100% | 73% | 63% | 78% | 70% | 75% | 65% | 58% | 83% | 75% | 70% | 68% | 40% | 98% | 80% | 93% | 88% | 75% | |
91.1% | 78% | 98% | 100% | 95% | 98% | 100% | 100% | 100% | 88% | 100% | 93% | 100% | 100% | 95% | 85% | 25% | 90% | 85% | 80% | 100% | 95% | 100% | ||
74.8% | 71% | 100% | 94% | 44% | 79% | 67% | 71% | 77% | 77% | 63% | 90% | 100% | 65% | 81% | 50% | 52% | 77% | 81% | 56% | 94% | 88% | 65% | 79% | |
95.9% | 92% | 100% | 100% | 79% | 100% | 100% | 100% | 100% | 100% | 94% | 98% | 100% | 79% | 100% | 100% | 100% | 79% | 100% | 85% | 100% | 100% | 100% | 100% |
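
The relationship between the per-prompt rows and the score row can be sketched in a few lines of Python. This is a minimal illustration, not the blueprint's actual scoring code: it assumes the score is an unweighted mean of a model's per-prompt coverage values, which is consistent with the Claude 3.5 Haiku column in the table above. The function names `average_score` and `rank_models` are hypothetical, and rows with an empty cell are not handled here.

```python
# Per-prompt coverage (%) for Claude 3.5 Haiku, copied from the table rows above.
haiku_coverage = [90, 88, 100, 100, 100, 100, 100, 90, 83, 100,
                  80, 88, 80, 100, 70, 100, 78, 78, 71, 92]


def average_score(per_prompt):
    """Unweighted mean of per-prompt coverage, rounded to one decimal place.

    Assumption: every prompt counts equally toward the score.
    """
    return round(sum(per_prompt) / len(per_prompt), 1)


def rank_models(scores):
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


print(average_score(haiku_coverage))  # 89.4

# A few scores taken from the score row of the table.
sample_scores = {
    "Command A": 76.8,
    "Grok 3": 91.5,
    "Claude Opus 4": 92.3,
    "Grok 3 Mini": 93.9,
}
print(rank_models(sample_scores))
# ['Grok 3 Mini', 'Claude Opus 4', 'Grok 3', 'Command A']
```

The same averaging applied across a row (all models for one prompt) reproduces the leading percentage in each prompt row, for example 84.5% for the first one.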