Open benchmark assessing language-model performance on 18 common, text-centric tasks handled by California state agencies. Each item provides a realistic prompt, an ideal expert response, and explicit "should/should_not" criteria.
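Average key-point coverage for each model across all prompts. The Score row gives each model's overall average and rank; in each row below it, the first cell is that prompt's average across all models.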
Prompts vs. Models | Claude 3.5 Haiku | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro Preview 05 06 | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | O4 Mini | Grok 3 | Grok 3 Mini |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 57.5% (17th) | 68.9% (9th) | 68.5% (10th) | 73.9% (6th) | 72.7% (7th) | 71.1% (8th) | 65.2% (13th) | 63.4% (14th) | 75.6% (4th) | 74.2% (5th) | 68.3% (11th) | 62.4% (15th) | 65.3% (12th) | 58.7% (16th) | 76.2% (2nd) | 76.2% (3rd) | 76.4% (1st) |
71.4% | 65% | 55% | 80% | 63% | 70% | 60% | 80% | 80% | 80% | 80% | 70% | 85% | 75% | 40% | 80% | 80% | 70% | |
94.3% | 100% | 80% | 100% | 80% | 100% | 80% | 100% | 80% | 100% | 100% | 100% | 83% | 100% | 100% | 100% | 100% | 100% | |
74.4% | 63% | 80% | 60% | 85% | 80% | 78% | 52% | 65% | 95% | 70% | 85% | 70% | 80% | 85% | 73% | 60% | 83% | |
49.9% | 40% | 50% | 50% | 60% | 50% | 63% | 50% | 20% | 40% | 50% | 50% | 40% | 40% | 20% | 100% | 60% | 65% | |
52.5% | 50% | 50% | 50% | 63% | 40% | 45% | 60% | 75% | 45% | 43% | 50% | 50% | 50% | 50% | 48% | 48% | 75% | |
81.7% | 70% | 60% | 85% | 85% | 65% | 85% | 83% | 83% | 85% | 85% | 85% | 83% | 85% | 80% | 85% | 85% | 100% | |
63.2% | 60% | 65% | 52% | 55% | 75% | 60% | 60% | 55% | 75% | 75% | 78% | 60% | 60% | 60% | 60% | 65% | 60% | |
52.0% | 57% | 55% | 40% | 42% | 60% | 50% | 88% | 60% | 65% | 42% | 55% | 55% | 25% | 25% | 40% | 65% | 60% | |
63.9% | 40% | 85% | 75% | 75% | 88% | 75% | 73% | 60% | 43% | 65% | 75% | 48% | 60% | 60% | 20% | 75% | 70% | |
82.5% | 61% | 84% | 73% | 86% | 89% | 86% | 89% | 71% | 91% | 86% | 75% | 88% | 71% | 86% | 93% | 86% | 88% | |
71.9% | 65% | 70% | 78% | 68% | 85% | 78% | 45% | 60% | 75% | 68% | 80% | 68% | 73% | 65% | 95% | 75% | 75% | |
87.2% | 88% | 81% | 84% | 100% | 97% | 100% | 100% | 97% | 94% | 100% | 63% | 50% | 75% | 78% | 100% | 91% | 84% | |
60.4% | 60% | 80% | 60% | 65% | 63% | 63% | 40% | 50% | 73% | 60% | 60% | 60% | 60% | 53% | 60% | 60% | 60% | |
61.0% | 45% | 60% | 80% | 60% | 60% | 78% | 45% | 68% | 68% | 68% | 43% | 48% | 73% | 55% | 68% | 78% | 40% | |
58.8% | 25% | 60% | 53% | 93% | 100% | 33% | 60% | 25% | 73% | 80% | 40% | 20% | 65% | 20% | 80% | 73% | 100% | |
61.9% | 58% | 80% | 60% | 60% | 50% | 80% | 45% | 45% | 80% | 80% | 45% | 40% | 45% | 45% | 80% | 80% | 80% | |
67.9% | 58% | 65% | 55% | 90% | 37% | 65% | 43% | 48% | 78% | 83% | 75% | 80% | 63% | 55% | 90% | 90% | 80% | |
88.4% | 30% | 80% | 98% | 100% | 100% | 100% | 60% | 100% | 100% | 100% | 100% | 95% | 75% | 80% | 100% | 100% | 85% |
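
As a rough sanity check on how the Score row relates to the per-prompt rows, the sketch below recomputes one model's average from its column and re-derives the ranking from the Score row. It assumes the score is an unweighted mean of the 18 per-prompt values, which matches the displayed figure for Claude 3.5 Haiku but is not stated on the page. The names `mean_coverage`, `haiku`, and `summary` are illustrative only. The displayed values are rounded, so the page may break ties between equal scores using unrounded data.

```python
# Per-prompt coverage (%) for Claude 3.5 Haiku, copied from the first model column.
haiku = [65, 100, 63, 40, 50, 70, 60, 57, 40, 61, 65, 88, 60, 45, 25, 58, 58, 30]


def mean_coverage(scores):
    """Unweighted mean of per-prompt coverage values (an assumed aggregation)."""
    return sum(scores) / len(scores)


# Score row as displayed, in the table's column order.
summary = {
    "Claude 3.5 Haiku": 57.5, "Claude Sonnet 4": 68.9, "Command A": 68.5,
    "Deepseek Chat V3": 73.9, "Deepseek R1": 72.7, "Gemini 2.5 Flash": 71.1,
    "Gemini 2.5 Pro Preview 05 06": 65.2, "Mistral Large 2411": 63.4,
    "Mistral Medium 3": 75.6, "GPT 4.1": 74.2, "GPT 4.1 Mini": 68.3,
    "GPT 4.1 Nano": 62.4, "GPT 4o": 65.3, "GPT 4o Mini": 58.7,
    "O4 Mini": 76.2, "Grok 3": 76.2, "Grok 3 Mini": 76.4,
}

# Python's sort is stable, so models with equal rounded scores keep column order.
ranked = sorted(summary.items(), key=lambda kv: kv[1], reverse=True)

print(f"Claude 3.5 Haiku mean: {mean_coverage(haiku):.1f}%")
for rank, (model, score) in enumerate(ranked, start=1):
    print(f"{rank:>2}. {model}: {score}%")
```

With these inputs the recomputed mean for Claude 3.5 Haiku is 57.5%, and the sorted order reproduces the ranks shown in the Score row.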