Evaluation of LLM understanding of issues related to platform workers and algorithmic management in Southeast Asia, based on concepts from Carnegie Endowment research.
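Average key point coverage for each model across all prompts. The Score row gives each model's rank and mean coverage; in each following row, the first column is that prompt's mean coverage across all models.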
Prompts vs. Models | Claude 3.5 Haiku | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Deepseek R1 | Gemini 2.5 Flash | Gemini 2.5 Pro Preview 05 06 | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | O4 Mini | Grok 3 | Grok 3 Mini | Grok 4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Score | 17th 68.6% | 12th 88.1% | 9th 90.3% | 7th 91.0% | 10th 89.4% | 1st 94.9% | 18th 67.8% | 13th 84.8% | 8th 90.7% | 11th 88.9% | 3rd 94.1% | 15th 81.7% | 16th 79.5% | 14th 82.0% | 6th 92.7% | 5th 93.2% | 2nd 94.8% | 4th 93.8% | |
75.5% | 63% | 75% | 91% | 72% | 75% | 81% | 69% | 75% | 75% | 63% | 100% | 66% | 50% | 72% | 69% | 75% | 100% | 88% | |
94.7% | 65% | 98% | 95% | 100% | 100% | 100% | 93% | 93% | 100% | 95% | 100% | 95% | 85% | 90% | 100% | 98% | 98% | 100% | |
92.5% | 85% | 96% | 100% | 100% | 96% | 100% | 38% | 100% | 100% | 100% | 100% | 83% | 96% | 71% | 100% | 100% | 100% | 100% | |
83.8% | 75% | 82% | 89% | 79% | 100% | 98% | 36% | 73% | 80% | 93% | 100% | 66% | 64% | 84% | 93% | 98% | 98% | 100% | |
81.8% | 40% | 85% | 88% | 88% | 90% | 98% | 38% | 65% | 95% | 95% | 80% | 85% | 68% | 80% | 95% | 95% | 93% | 95% | |
93.1% | 85% | 90% | 100% | 93% | 93% | 100% | 83% | 88% | 100% | 95% | 93% | 93% | 75% | 88% | 100% | 100% | 100% | 100% | |
96.8% | 94% | 97% | 100% | 100% | 94% | 100% | 100% | 94% | 94% | 97% | 97% | 100% | 97% | 81% | 97% | 100% | 100% | 100% | |
83.1% | 54% | 83% | 85% | 92% | 81% | 98% | 59% | 88% | 96% | 75% | 94% | 63% | 73% | 96% | 85% | 88% | 100% | 85% | |
88.3% | 18% | 95% | 95% | 100% | 95% | 95% | 98% | 93% | 88% | 95% | 98% | 68% | 100% | 65% | 100% | 98% | 88% | 100% | |
95.8% | 95% | 95% | 100% | 95% | 100% | 100% | 75% | 88% | 98% | 100% | 98% | 90% | 93% | 98% | 100% | 100% | 100% | 100% | |
74.6% | 71% | 81% | 71% | 83% | 79% | 79% | 42% | 77% | 77% | 79% | 79% | 81% | 75% | 71% | 79% | 71% | 73% | 75% | |
84.2% | 78% | 80% | 70% | 90% | 70% | 90% | 83% | 83% | 85% | 80% | 90% | 90% | 78% | 88% | 95% | 95% | 88% | 83% | |
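
The Score row is consistent with an unweighted mean of each model's twelve per-prompt values. The sketch below checks that for three models, using the rounded percentages shown in the table. The names `PER_PROMPT`, `mean_coverage`, and `rank_models` are illustrative and not part of the evaluation tool, and the ranking covers only this three-model subset.

```python
from statistics import mean

# Per-prompt coverage percentages copied from the table above (rounded values).
# Only three of the eighteen models are included, for illustration.
PER_PROMPT = {
    "Claude 3.5 Haiku": [63, 65, 85, 75, 40, 85, 94, 54, 18, 95, 71, 78],
    "Gemini 2.5 Flash": [81, 100, 100, 98, 98, 100, 100, 98, 95, 100, 79, 90],
    "Grok 3 Mini": [100, 98, 100, 98, 93, 100, 100, 100, 88, 100, 73, 88],
}


def mean_coverage(scores):
    """Unweighted mean across prompts, rounded to one decimal place."""
    return round(mean(scores), 1)


def rank_models(per_prompt):
    """Return (model, mean) pairs sorted from highest to lowest mean."""
    means = {model: mean_coverage(s) for model, s in per_prompt.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for model, score in rank_models(PER_PROMPT):
        print(f"{model}: {score}%")
```

Because the inputs are already rounded to whole percentages, a mean recomputed this way could differ from a displayed score in the last decimal place for other models.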