Tests a model's basic world model and ability to track object state through simple riddles presented in multiple languages. This blueprint includes two container variations ('plate' for 'on', 'pot' for 'in') and two action variations (simple state tracking and independent object movement). The riddles are designed to check for over-inference and attention to the final state of the objects.
Average key-point coverage for each model across all prompts. The first cell of each prompt row is that prompt's average coverage across the models.
Prompts vs. Models | Claude 3 5 Haiku | Claude 3 5 Sonnet | Claude 3 7 Sonnet | Claude 3 Opus | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | O4 Mini | Kimi K2 Instruct | Grok 3 | Grok 3 Mini | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 16th 88.9% | 11th 95.4% | 8th 95.9% | 7th 96.2% | 12th 94.6% | 3rd 99.5% | 2nd 99.8% | 22nd 86.3% | 21st 87.1% | 6th 97.1% | 15th 93.2% | 20th 87.1% | 5th 97.1% | 19th 87.7% | 8th 95.9% | 14th 94.0% | 24th 74.3% | 1st 99.9% | 12th 94.6% | 18th 88.7% | 17th 88.7% | 10th 95.6% | 4th 99.4% | 23rd 80.0% | |
80.9% | 17% | 100% | 100% | 100% | 33% | 100% | 100% | 100% | 46% | 100% | 100% | 67% | 29% | 100% | 100% | 83% | 33% | 100% | 100% | 100% | 100% | 100% | 100% | 33% | |
87.8% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 4% | 100% | 4% | |
92.8% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 67% | 100% | 100% | | |
86.8% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 67% | 67% | 33% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 100% | 50% | |
95.3% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 67% | 100% | 100% | 100% | 79% | 50% | 100% | 100% | 100% | 100% | 100% | 100% | | |
93.6% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 29% | 100% | 100% | 17% | |
93.9% | 100% | 100% | 83% | 83% | 100% | 100% | 100% | 83% | 79% | 100% | 100% | 100% | 100% | 96% | 100% | 96% | 33% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
96.7% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 79% | 79% | 100% | 100% | 100% | 79% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
97.2% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 83% | 87% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
95.3% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 87% | 83% | 100% | 100% | 100% | 67% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | |
92.3% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 67% | 83% | 100% | 100% | 100% | 100% | 83% | 83% | 100% | 4% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
94.6% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 79% | 67% | 87% | 100% | 100% | 100% | 79% | 96% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 79% | |
96.5% | 100% | 100% | 96% | 100% | 100% | 100% | 92% | 100% | 67% | 96% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | |
95.7% | 100% | 100% | 100% | 100% | 100% | 87% | 100% | 96% | 67% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 71% | 100% | 100% | 96% | 100% | 83% | 100% | 100% | |
97.5% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 92% | 83% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | |
87.7% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 33% | 100% | 100% | 4% | 67% | 100% | 100% | 21% | |
86.8% | 50% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 33% | 100% | 100% | 100% | 100% | 92% | 83% | 100% | 33% | 100% | 100% | 92% | 100% | 0% | |
54.0% | 8% | 0% | 0% | 0% | 100% | 100% | 100% | 0% | 67% | 100% | 67% | 0% | 100% | 96% | 96% | 0% | 33% | 100% | 100% | 25% | 0% | 79% | 100% | 25% | |
93.7% | 79% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 88% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 54% | 100% | 100% | 100% | 100% | | |
94.2% | 100% | 67% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 67% | 67% | 100% | 100% | 100% | |
94.5% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 67% | 67% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | |
95.7% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 67% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
91.7% | 67% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 67% | 33% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
95.0% | 58% | 100% | 100% | 100% | 58% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | |
94.1% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 79% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 17% | 100% | 100% | 100% | 100% | 100% | 79% | 100% | |
94.8% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 67% | 79% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 79% | |
97.6% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 92% | 83% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
97.9% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 83% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
90.6% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 71% | 100% | 100% | 100% | 100% | 67% | 67% | 100% | 4% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | |
97.9% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 83% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
97.9% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 83% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
97.5% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 83% | 83% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
96.2% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 79% | 100% | 100% | 100% | 100% | 67% | 79% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | |
94.5% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 33% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | |
94.6% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 21% | 100% | 100% | 100% | 100% | 100% | |
83.5% | 21% | 100% | 100% | 100% | 50% | 100% | 100% | 0% | 100% | 100% | 71% | 67% | 100% | 100% | 100% | 100% | 67% | 96% | 100% | 100% | 33% | 100% | 100% | 100% |
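
The Score row appears to be consistent with a plain mean of each model's per-prompt coverage, with ranks assigned on the unrounded means. The following is a minimal sketch of that reading, not the platform's actual implementation. The function name `rank_models` and the tie-handling rule (tied models share a rank, and the next rank is skipped) are assumptions inferred from the table. The three value lists are transcribed from the GPT 4o, Claude Sonnet 4, and Claude Opus 4 columns, without preserving prompt order.

```python
from statistics import mean

# Per-prompt coverage percentages for three columns of the table.
# Prompt order is not preserved; a mean only depends on the values themselves.
coverage = {
    "GPT 4o": [100] * 35 + [96],
    "Claude Sonnet 4": [100] * 35 + [92],
    "Claude Opus 4": [100] * 34 + [96, 87],
}


def rank_models(coverage):
    """Return (rank, model, mean) tuples, best first.

    Assumed rule: models with equal means share a rank and the
    following rank is skipped (for example 8th, 8th, 10th).
    """
    means = {model: mean(values) for model, values in coverage.items()}
    ordered = sorted(means.items(), key=lambda item: item[1], reverse=True)
    ranked = []
    for index, (model, score) in enumerate(ordered):
        if index > 0 and score == ordered[index - 1][1]:
            rank = ranked[-1][0]  # tie: reuse the previous rank
        else:
            rank = index + 1
        ranked.append((rank, model, score))
    return ranked


for rank, model, score in rank_models(coverage):
    print(f"{rank} {model}: {score:.1f}%")
```

For these three columns the sketch reproduces the order and rounded scores shown in the Score row. An empty cell in the table would have to be left out of a model's list before averaging.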