Tests a model's basic world model and ability to track object state through simple riddles presented in multiple languages. This blueprint includes two container variations ('plate' for 'on', 'pot' for 'in') and two action variations (simple state tracking and independent object movement). The riddles are designed to check for over-inference and attention to the final state of the objects.
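Average key point coverage extent for each model across all prompts. The Score row gives each model's rank and its mean coverage over all prompts; in every other row, the first cell is that prompt's mean coverage across the models, and a blank cell means no score was recorded for that model on that prompt.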
Prompts vs. Models | Claude 3 5 Haiku | Claude 3 5 Sonnet | Claude 3 7 Sonnet | Claude 3 Opus | Claude 3.5 Haiku | Claude Opus 4 | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | O4 Mini | Kimi K2 Instruct | Grok 3 | Grok 3 Mini | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 16th 93.6% | 7th 98.2% | 5th 98.7% | 6th 98.6% | 11th 95.9% | 2nd 99.5% | 1st 99.8% | 18th 91.5% | 22nd 87.4% | 14th 94.8% | 12th 94.9% | 21st 87.9% | 8th 97.1% | 19th 88.2% | 9th 96.2% | 13th 94.8% | 24th 77.1% | 3rd 99.4% | 15th 93.8% | 20th 88.0% | 17th 91.9% | 9th 96.2% | 3rd 99.4% | 23rd 79.8% | |
80.9% | 17% | 100% | 100% | 100% | 33% | 100% | 100% | 100% | 46% | 100% | 100% | 67% | 29% | 100% | 100% | 83% | 33% | 100% | 100% | 100% | 100% | 100% | 100% | 33% | |
87.8% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 4% | 100% | 4% | |
92.8% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 67% | 100% | 100% | ||
86.8% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 67% | 67% | 33% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 100% | 50% | |
95.3% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 67% | 100% | 100% | 100% | 79% | 50% | 100% | 100% | 100% | 100% | 100% | 100% | ||
93.6% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 29% | 100% | 100% | 17% | |
93.9% | 100% | 100% | 83% | 83% | 100% | 100% | 100% | 83% | 79% | 100% | 100% | 100% | 100% | 96% | 100% | 96% | 33% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
96.7% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 79% | 79% | 100% | 100% | 100% | 79% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
97.2% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 83% | 87% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
95.3% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 87% | 83% | 100% | 100% | 100% | 67% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | |
92.3% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 67% | 83% | 100% | 100% | 100% | 100% | 83% | 83% | 100% | 4% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
94.6% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 79% | 67% | 87% | 100% | 100% | 100% | 79% | 96% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 79% | |
96.5% | 100% | 100% | 96% | 100% | 100% | 100% | 92% | 100% | 67% | 96% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | |
95.7% | 100% | 100% | 100% | 100% | 100% | 87% | 100% | 96% | 67% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 71% | 100% | 100% | 96% | 100% | 83% | 100% | 100% | |
96.5% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 83% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | |
87.7% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 33% | 100% | 100% | 4% | 67% | 100% | 100% | 21% | |
86.8% | 50% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 33% | 100% | 100% | 100% | 100% | 92% | 83% | 100% | 33% | 100% | 100% | 92% | 100% | 0% | |
86.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 13% | 100% | 100% | 100% | 33% | 100% | 100% | 100% | 0% | 100% | 100% | 100% | 17% | |
93.7% | 79% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 88% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 54% | 100% | 100% | 100% | 100% | ||
94.2% | 100% | 67% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 67% | 67% | 100% | 100% | 100% | |
94.5% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 67% | 67% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | |
95.7% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 67% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
91.7% | 67% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 67% | 33% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
95.0% | 58% | 100% | 100% | 100% | 58% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | |
94.1% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 79% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 17% | 100% | 100% | 100% | 100% | 100% | 79% | 100% | |
94.8% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 67% | 79% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 79% | |
97.6% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 92% | 83% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
97.9% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 83% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
90.6% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 71% | 100% | 100% | 100% | 100% | 67% | 67% | 100% | 4% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | |
97.9% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 83% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
97.9% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 83% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
97.5% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 83% | 83% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
97.7% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 79% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | |
94.5% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | 100% | 33% | 100% | 100% | 100% | 67% | 100% | 100% | 100% | |
94.6% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% | 21% | 100% | 100% | 100% | 100% | 100% | |
89.7% | 100% | 100% | 100% | 87% | 100% | 100% | 100% | 87% | 100% | 0% | 100% | 83% | 100% | 100% | 100% | 96% | 100% | 79% | 71% | 100% | 50% | 100% | 100% | 100% |
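
The page does not say how the summary figures are aggregated. The displayed values are consistent with a plain unweighted mean that skips blank cells, and the sketch below checks that reading against two slices copied from the table: the first prompt row, and the Grok 4 column with its three blank cells. The function name `mean_coverage` is hypothetical, and the inputs are the rounded percentages shown above, not the underlying raw scores.

```python
from typing import Optional, Sequence


def mean_coverage(cells: Sequence[Optional[float]]) -> Optional[float]:
    """Unweighted mean of coverage percentages, skipping blank (None) cells.

    Assumption: blanks are excluded from the denominator rather than
    counted as 0%. Returns None when every cell is blank.
    """
    present = [c for c in cells if c is not None]
    if not present:
        return None
    return sum(present) / len(present)


# First prompt row of the table (24 models, left to right).
first_prompt_row = [
    17, 100, 100, 100, 33, 100, 100, 100, 46, 100, 100, 67,
    29, 100, 100, 83, 33, 100, 100, 100, 100, 100, 100, 33,
]

# Grok 4 column (36 prompts, top to bottom); None marks a blank cell.
grok_4_column = [
    33, 4, None, 50, None, 17, 100, 100, 100, 83, 100, 79,
    100, 100, 83, 21, 0, 17, None, 100, 100, 100, 100, 100,
    100, 79, 100, 100, 83, 100, 100, 100, 83, 100, 100, 100,
]

row_average = round(mean_coverage(first_prompt_row), 1)   # 80.9
column_score = round(mean_coverage(grok_4_column), 1)     # 79.8

print(f"first prompt average: {row_average}%")
print(f"Grok 4 score: {column_score}%")
```

Both results match the table (80.9% for the first prompt, 79.8% for Grok 4). Counting the blank cells as 0% instead would give Grok 4 about 73.1%, which does not match the published score.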