This blueprint evaluates the models on the Universal Declaration of Human Rights (UDHR) dataset.
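Average key point coverage extent for each model across all prompts. The Score row gives each model's rank and average; in each prompt row, the first column is that prompt's average across all models.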
Prompts vs. Models | Claude 3.5 Haiku | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Gemini 2.5 Flash | Gemini 2.5 Pro Preview 05 06 | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | Grok 3 | Grok 3 Mini | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 10th 94.4% | 10th 94.4% | 5th 97.1% | 7th 96.1% | 2nd 99.3% | 12th 94.3% | 6th 96.7% | 8th 96.0% | 4th 98.4% | 13th 92.9% | 15th 88.2% | 9th 94.8% | 14th 88.4% | 3rd 98.9% | 1st 99.3% | |
98.5% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 78% | 100% | 100% | 100% | 100% | |
97.7% | 100% | 81% | 100% | 98% | 100% | 100% | 100% | 98% | 100% | 100% | 100% | 100% | 88% | 100% | 100% | |
98.7% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 98% | 100% | 100% | 90% | 100% | 96% | 100% | 100% | |
99.5% | 95% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 98% | 100% | 100% | 100% | |
91.9% | 84% | 92% | 100% | 89% | 100% | 100% | 98% | 100% | 100% | 72% | 86% | 88% | 72% | 98% | 100% | |
96.7% | 98% | 95% | 100% | 100% | 100% | 100% | 100% | 100% | 98% | 98% | 88% | 93% | 80% | 100% | 100% | |
95.0% | 91% | 100% | 100% | 99% | 100% | 100% | 100% | 91% | 100% | 90% | 78% | 100% | 76% | 100% | 100% | |
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
100.0% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
94.5% | 100% | 100% | 100% | 88% | 100% | 97% | 88% | 88% | 100% | 88% | 100% | 81% | 88% | 100% | 100% | |
78.9% | 77% | 77% | 79% | 88% | 91% | 68% | 74% | 82% | 83% | 70% | 55% | 80% | 73% | 92% | 94% | |
91.9% | 88% | 92% | 86% | 91% | 100% | 66% | 100% | 95% | 100% | 97% | 83% | 98% | 88% | 97% | 98% | |
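
The averages in the table can be checked with a short sketch. It assumes the Score row is the plain mean of a model's twelve per-prompt values, which is consistent with the figures shown but is not stated on the page. The names `coverage` and `average_coverage` are illustrative, and the inputs are the rounded percentages displayed above, so results may differ slightly from averages computed on unrounded data.

```python
# Per-prompt coverage percentages copied from the table columns
# (displayed values are already rounded to whole percents).
coverage = {
    "Claude 3.5 Haiku": [100, 100, 100, 95, 84, 98, 91, 100, 100, 100, 77, 88],
    "GPT 4.1 Nano": [78, 100, 90, 100, 86, 88, 78, 100, 100, 100, 55, 83],
}


def average_coverage(scores):
    """Return the mean of the per-prompt scores, rounded to one decimal.

    Assumption: the dashboard uses an unweighted mean across prompts.
    """
    return round(sum(scores) / len(scores), 1)


model_scores = {model: average_coverage(s) for model, s in coverage.items()}

for model, score in model_scores.items():
    print(f"{model}: {score}%")
# prints:
# Claude 3.5 Haiku: 94.4%
# GPT 4.1 Nano: 88.2%
```

Both results match the Score row (94.4% and 88.2%). The same function applied across a row instead of a column gives the prompt average in the first column.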