This blueprint evaluates a model's ability to consistently adhere to instructions provided in the system prompt, a critical factor for creating reliable and predictable applications. It tests various common failure modes observed in language models.
Core Areas Tested:
Average key-point coverage for each model across all prompts.
Prompts vs. Models | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Llama 3 70b Instruct | Llama 4 Maverick | Meta Llama 3.1 405b Instruct Turbo | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | O4 Mini | Kimi K2 Instruct | Grok 3 | Grok 3 Mini | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 9th 82.0% | 1st 89.1% | 8th 82.1% | 20th 71.6% | 13th 80.4% | 7th 82.2% | 3rd 86.2% | 15th 78.3% | 14th 79.3% | 22nd 69.3% | 19th 75.1% | 10th 81.5% | 21st 71.1% | 11th 81.0% | 17th 76.3% | 23rd 68.0% | 16th 77.4% | 18th 75.6% | 12th 80.7% | 5th 84.6% | 6th 83.3% | 2nd 87.1% | 4th 84.7% | |
84.5% | 90% | 90% | 80% | 80% | 80% | 88% | 90% | 90% | 78% | 75% | 80% | 80% | 80% | 80% | 98% | 80% | 90% | 78% | 80% | 90% | 88% | 88% | 90% | |
89.5% | 88% | 86% | 80% | 86% | 77% | 79% | 93% | 93% | 86% | 100% | 100% | 93% | 79% | 100% | 100% | 100% | 79% | 86% | 86% | 79% | 100% | 100% | ||
76.7% | 63% | 94% | 63% | 63% | 97% | 63% | 75% | 75% | 94% | 63% | 63% | 81% | 63% | 81% | 88% | 81% | 63% | 75% | 94% | 100% | 63% | 75% | 88% | |
75.0% | 63% | 75% | 63% | 63% | 94% | 94% | 94% | 81% | 88% | 63% | 63% | 85% | 63% | 88% | 88% | 63% | 63% | 63% | 75% | 75% | 63% | 69% | 88% | |
74.7% | 75% | 88% | 63% | 94% | 81% | 99% | 94% | 69% | 94% | 63% | 63% | 63% | 63% | 69% | 63% | 81% | 63% | 63% | 75% | 100% | 63% | 63% | 69% | |
72.5% | 69% | 88% | 75% | 81% | 63% | 68% | 63% | 69% | 81% | 63% | 63% | 75% | 63% | 75% | 63% | 63% | 63% | 63% | 100% | 100% | 63% | 69% | 88% | |
72.7% | 90% | 60% | 90% | 70% | 20% | 80% | 80% | 40% | 80% | 80% | 95% | 80% | 38% | 90% | 80% | 70% | 80% | 70% | 80% | 80% | 70% | 80% | 70% | |
88.1% | 83% | 98% | 98% | 94% | 85% | 77% | 90% | 83% | 96% | 90% | 69% | 94% | 96% | 100% | 83% | 46% | 98% | 100% | 83% | 75% | 94% | 94% | 100% | |
99.3% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | |
50.1% | 80% | 100% | 53% | 23% | 45% | 45% | 90% | 40% | 43% | 43% | 43% | 25% | 35% | 25% | 43% | 33% | 43% | 43% | 30% | 80% | 40% | 100% | 50% | |
83.0% | 88% | 100% | 100% | 25% | 96% | 96% | 100% | 98% | 33% | 27% | 33% | 100% | 94% | 96% | 71% | 85% | 90% | 96% | 88% | 98% | 98% | 98% | 98% | |
82.7% | 80% | 80% | 80% | 80% | 100% | 90% | 75% | 80% | 80% | 80% | 80% | 75% | 80% | 80% | 80% | 80% | 80% | 80% | 93% | 88% | 80% | 80% | 100% | |
85.7% | 100% | 100% | 100% | 90% | 90% | 100% | 100% | 90% | 100% | 20% | 73% | 100% | 60% | 83% | 70% | 30% | 90% | 100% | 90% | 100% | 100% | 100% | ||
48.0% | 70% | 45% | 48% | 38% | 58% | 35% | 50% | 40% | 28% | 23% | 65% | 33% | 50% | 30% | 30% | 50% | 55% | 45% | 70% | 20% | 100% | 90% | 30% | |
90.2% | 100% | 100% | 83% | 100% | 100% | 80% | 100% | 100% | 83% | 100% | 80% | 100% | 85% | 83% | 83% | 83% | 100% | 85% | 70% | 80% | 80% | 100% | 100% | |
75.2% | 78% | 90% | 90% | 55% | 65% | 100% | 73% | 63% | 78% | 70% | 75% | 65% | 58% | 83% | 75% | 70% | 68% | 40% | 98% | 80% | 93% | 88% | 75% | |
91.1% | 78% | 98% | 100% | 95% | 98% | 100% | 100% | 100% | 88% | 100% | 93% | 100% | 100% | 95% | 85% | 25% | 90% | 85% | 80% | 100% | 95% | 100% | ||
74.8% | 71% | 100% | 94% | 44% | 79% | 67% | 71% | 77% | 77% | 63% | 90% | 100% | 65% | 81% | 50% | 52% | 77% | 81% | 56% | 94% | 88% | 65% | 79% | |
95.9% | 92% | 100% | 100% | 79% | 100% | 100% | 100% | 100% | 100% | 94% | 98% | 100% | 79% | 100% | 100% | 100% | 79% | 100% | 85% | 100% | 100% | 100% | 100% | |
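
The leading percentage in each prompt row appears to be the unweighted mean of that row's per-model scores, and the ordinals in the "Score" row appear to order models from highest to lowest overall score. The sketch below checks both readings against values copied from the table. The function names `row_average` and `rank_models` are hypothetical, and the plain-mean assumption is inferred from the numbers rather than stated by the page.

```python
# Per-model coverage (%) for the first prompt row of the table, in column order.
first_row = [90, 90, 80, 80, 80, 88, 90, 90, 78, 75, 80, 80,
             80, 80, 98, 80, 90, 78, 80, 90, 88, 88, 90]


def row_average(scores):
    """Unweighted mean of the per-model scores, rounded to one decimal.

    Assumption: the page uses a plain mean; this is inferred, not documented.
    """
    return round(sum(scores) / len(scores), 1)


# Overall scores (%) for three models, copied from the "Score" row.
overall = {
    "Gemini 2.5 Flash": 86.2,
    "Claude Opus 4": 89.1,
    "Grok 3 Mini": 87.1,
}


def rank_models(scores_by_model):
    """Model names ordered from highest to lowest overall score."""
    return sorted(scores_by_model, key=scores_by_model.get, reverse=True)


print(row_average(first_row))  # 84.5, matching the first cell of that row
print(rank_models(overall))    # Claude Opus 4, Grok 3 Mini, Gemini 2.5 Flash
```

Three prompt rows list only 22 values, so any average recomputed for those rows depends on which model's cell is missing.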