This blueprint evaluates each model's ability to accurately answer questions about the UK Freedom of Information Act 2000.
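Average key point coverage for each model across all prompts. The Score row gives each model's average over all prompts together with its rank; each prompt row begins with that prompt's average over all models.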
Prompts vs. Models | Claude 3.5 Haiku | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Gemini 2.5 Flash | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | Grok 3 Mini |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 13th 65.8% | 1st 90.1% | 11th 73.6% | 3rd 89.5% | 7th 82.5% | 8th 79.8% | 4th 86.0% | 1st 90.1% | 5th 84.1% | 12th 70.6% | 9th 75.8% | 10th 74.6% | 6th 83.8% |
89.4% | 100% | 100% | 65% | 96% | 96% | 65% | 100% | 100% | 100% | 70% | 100% | 70% | 100% |
60.3% | 75% | 91% | 50% | 56% | 56% | 50% | 75% | 53% | 53% | 53% | 56% | 53% | 63% |
65.1% | 39% | 85% | 44% | 96% | 70% | 70% | 100% | 70% | 40% | 44% | 74% | 40% | 74% |
80.6% | 74% | 81% | 70% | 100% | 74% | 81% | 63% | 100% | 100% | 70% | 81% | 84% | 70% |
85.2% | 50% | 93% | 80% | 90% | 100% | 93% | 80% | 98% | 100% | 60% | 80% | 85% | 98% |
71.8% | 25% | 85% | 80% | 78% | 83% | 79% | 77% | 100% | 84% | 80% | 21% | 76% | 65% |
93.6% | 70% | 96% | 100% | 100% | 81% | 100% | 93% | 100% | 96% | 96% | 96% | 89% | 100% |
97.9% | 93% | 90% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 92% | 98% | 100% | 100% |
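
The page does not say how the Score row and the per-prompt averages are derived. The sketch below assumes the simplest reading: a plain mean over the displayed (rounded) cell values, with ties sharing a rank ("1st, 1st, 3rd"). Under that assumption it reproduces the figures in the table. The names `MODELS`, `COVERAGE`, `column_means`, `row_means`, and `competition_ranks` are illustrative and do not come from the evaluation tool.

```python
# Assumption: scores are unweighted means of the per-prompt coverage values.
MODELS = [
    "Claude 3.5 Haiku", "Claude Sonnet 4", "Command A", "Deepseek Chat V3",
    "Gemini 2.5 Flash", "Mistral Large 2411", "Mistral Medium 3", "GPT 4.1",
    "GPT 4.1 Mini", "GPT 4.1 Nano", "GPT 4o", "GPT 4o Mini", "Grok 3 Mini",
]

# Coverage in percent; one row per prompt, one column per model (table order).
COVERAGE = [
    [100, 100, 65, 96, 96, 65, 100, 100, 100, 70, 100, 70, 100],
    [75, 91, 50, 56, 56, 50, 75, 53, 53, 53, 56, 53, 63],
    [39, 85, 44, 96, 70, 70, 100, 70, 40, 44, 74, 40, 74],
    [74, 81, 70, 100, 74, 81, 63, 100, 100, 70, 81, 84, 70],
    [50, 93, 80, 90, 100, 93, 80, 98, 100, 60, 80, 85, 98],
    [25, 85, 80, 78, 83, 79, 77, 100, 84, 80, 21, 76, 65],
    [70, 96, 100, 100, 81, 100, 93, 100, 96, 96, 96, 89, 100],
    [93, 90, 100, 100, 100, 100, 100, 100, 100, 92, 98, 100, 100],
]


def column_means(rows):
    """Mean of each column: one score per model, averaged over prompts."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def row_means(rows):
    """Mean of each row: one average per prompt, taken over models."""
    return [sum(row) / len(row) for row in rows]


def competition_ranks(scores):
    """Rank = 1 + number of strictly higher scores, so ties share a rank."""
    return [1 + sum(other > s for other in scores) for s in scores]


model_scores = column_means(COVERAGE)
prompt_averages = row_means(COVERAGE)
ranks = competition_ranks(model_scores)

if __name__ == "__main__":
    for name, score, rank in sorted(
        zip(MODELS, model_scores, ranks), key=lambda item: item[2]
    ):
        print(f"{rank:>2}  {score:5.1f}%  {name}")
```

Because the inputs are already rounded to whole percentages, the recomputed means can differ from the tool's internal values in later decimal places, even though they agree with the table at one decimal.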