Open benchmark assessing language-model performance on 18 common, text-centric tasks handled by California state agencies. Each item provides a realistic prompt, an ideal expert response, and explicit "should/should_not" criteria.
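Average key-point coverage for each model across all prompts. The Score row gives each model's rank and its mean coverage over the 18 prompts; in every row after it, the first cell is that prompt's mean coverage across the 13 models.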
Prompts vs. Models | Claude 3.5 Haiku | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Gemini 2.5 Flash | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | Grok 3 Mini
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Score | 11th 62.2% | 6th 72.3% | 7th 69.6% | 2nd 76.1% | 5th 73.1% | 10th 65.2% | 1st 76.2% | 3rd 75.7% | 9th 68.0% | 13th 60.6% | 8th 68.3% | 12th 61.9% | 4th 75.4%
71.0% | 60% | 75% | 80% | 80% | 70% | 80% | 80% | 70% | 70% | 53% | 70% | 55% | 80%
93.5% | 100% | 80% | 100% | 95% | 80% | 80% | 100% | 100% | 100% | 80% | 100% | 100% | 100%
80.7% | 78% | 85% | 60% | 90% | 85% | 90% | 75% | 80% | 83% | 62% | 93% | 88% | 80%
39.8% | 30% | 40% | 50% | 43% | 53% | 43% | 53% | 40% | 40% | 40% | 15% | 20% | 50%
69.8% | 68% | 67% | 67% | 65% | 55% | 70% | 65% | 72% | 75% | 75% | 80% | 73% | 75%
69.5% | 60% | 50% | 68% | 68% | 80% | 68% | 68% | 80% | 68% | 68% | 78% | 70% | 78%
61.8% | 65% | 62% | 47% | 52% | 62% | 60% | 65% | 78% | 62% | 60% | 68% | 60% | 63%
44.2% | 45% | 52% | 42% | 45% | 40% | 50% | 42% | 48% | 40% | 58% | 35% | 30% | 47%
68.5% | 55% | 80% | 78% | 80% | 78% | 60% | 83% | 60% | 60% | 58% | 60% | 60% | 78%
84.0% | 63% | 87% | 91% | 98% | 91% | 86% | 84% | 91% | 79% | 75% | 75% | 86% | 86%
77.2% | 68% | 78% | 80% | 78% | 78% | 75% | 85% | 83% | 85% | 68% | 80% | 60% | 85%
80.3% | 84% | 100% | 88% | 100% | 100% | 50% | 66% | 97% | 75% | 56% | 78% | 78% | 72%
61.2% | 63% | 83% | 63% | 55% | 65% | 45% | 65% | 60% | 60% | 60% | 60% | 48% | 68%
62.4% | 50% | 60% | 60% | 75% | 75% | 60% | 73% | 63% | 48% | 55% | 70% | 55% | 67%
56.5% | 20% | 70% | 40% | 100% | 28% | 45% | 100% | 73% | 43% | 20% | 65% | 30% | 100%
64.4% | 55% | 80% | 65% | 63% | 88% | 48% | 80% | 80% | 55% | 45% | 55% | 50% | 73%
76.0% | 75% | 73% | 73% | 85% | 87% | 63% | 88% | 88% | 83% | 60% | 73% | 63% | 77%
91.8% | 80% | 80% | 100% | 97% | 100% | 100% | 100% | 100% | 98% | 98% | 75% | 88% | 78% |
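
The averages in the table can be checked by hand. The sketch below is not the benchmark's own scoring code; it assumes the reported figures are plain arithmetic means of the displayed (already rounded) percentages, and the helper name `average_pct` is hypothetical. It recomputes the first prompt row's average and the Claude 3.5 Haiku column's overall score from the values shown above.

```python
# Scores for the first prompt row, one per model, as displayed in the table.
first_prompt_scores = [60, 75, 80, 80, 70, 80, 80, 70, 70, 53, 70, 55, 80]

# Claude 3.5 Haiku's scores down its column, one per prompt (18 prompts).
haiku_scores = [60, 100, 78, 30, 68, 60, 65, 45, 55, 63, 68, 84, 63, 50, 20, 55, 75, 80]


def average_pct(scores):
    """Arithmetic mean of percentage scores, rounded to one decimal place.

    Assumption: the table reports simple unweighted means.
    """
    return round(sum(scores) / len(scores), 1)


print(average_pct(first_prompt_scores))  # row average for the first prompt
print(average_pct(haiku_scores))         # overall score for Claude 3.5 Haiku
```

Because the per-cell values are themselves rounded, a recomputed mean for other rows or columns may differ from the published figure in the last decimal place.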