Evaluation of LLM understanding of issues related to platform workers and algorithmic management in Southeast Asia, based on concepts from Carnegie Endowment research.
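Average key-point coverage for each model across all prompts. The Score row gives each model's overall average and rank. Each remaining row corresponds to one prompt (the prompt texts are not shown), and its first cell is that prompt's average across all models.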
Prompt avg. (all models) | Claude 3.5 Haiku | Claude Sonnet 4 | Command A | Deepseek Chat V3 | Gemini 2.5 Flash | Gemini 2.5 Pro Preview 05 06 | Mistral Large 2411 | Mistral Medium 3 | GPT 4.1 | GPT 4.1 Mini | GPT 4.1 Nano | GPT 4o | GPT 4o Mini | Grok 3 | Grok 3 Mini
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Score (rank, average) | 14th 74.8% | 9th 91.3% | 8th 93.0% | 5th 94.1% | 4th 95.3% | 15th 74.0% | 10th 91.0% | 7th 93.3% | 6th 93.8% | 1st 96.9% | 13th 85.3% | 11th 88.1% | 12th 86.8% | 3rd 95.4% | 2nd 96.2%
82.5% | 69% | 78% | 91% | 78% | 78% | 78% | 81% | 78% | 84% | 97% | 81% | 75% | 88% | 81% | 100%
94.7% | 73% | 93% | 95% | 100% | 100% | 93% | 100% | 100% | 98% | 100% | 93% | 90% | 90% | 95% | 100%
92.1% | 88% | 100% | 100% | 100% | 100% | 38% | 100% | 100% | 100% | 100% | 83% | 98% | 75% | 100% | 100%
86.7% | 82% | 80% | 97% | 88% | 98% | 48% | 84% | 96% | 97% | 100% | 66% | 79% | 88% | 100% | 98%
84.7% | 53% | 90% | 85% | 93% | 100% | 50% | 78% | 95% | 95% | 90% | 85% | 88% | 80% | 98% | 90%
96.2% | 88% | 95% | 98% | 100% | 100% | 95% | 93% | 100% | 98% | 100% | 93% | 88% | 95% | 100% | 100%
99.0% | 100% | 97% | 100% | 100% | 100% | 97% | 100% | 97% | 100% | 100% | 100% | 100% | 94% | 100% | 100%
87.1% | 65% | 85% | 92% | 96% | 98% | 60% | 92% | 96% | 83% | 100% | 71% | 77% | 94% | 98% | 100%
91.1% | 30% | 98% | 100% | 100% | 98% | 98% | 98% | 93% | 98% | 100% | 80% | 100% | 75% | 100% | 98%
97.9% | 95% | 98% | 100% | 98% | 100% | 88% | 95% | 100% | 100% | 98% | 98% | 98% | 100% | 100% | 100%
77.1% | 69% | 81% | 75% | 81% | 79% | 50% | 81% | 81% | 83% | 83% | 83% | 81% | 75% | 75% | 79%
90.4% | 85% | 100% | 83% | 95% | 93% | 93% | 90% | 83% | 90% | 95% | 90% | 83% | 88% | 98% | 90%
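The summary figures in the table can be checked against the per-prompt cells. The sketch below assumes each summary value is an unweighted arithmetic mean (column-wise for the Score row, row-wise for the first column). It works from the displayed, already-rounded percentages, so recomputed averages may differ from the dashboard's figures in the last decimal place. The names `COVERAGE`, `model_averages`, `prompt_averages` and `ranks` are illustrative and not part of the original dashboard.

```python
# Hypothetical sketch: recompute the table's summary values from its displayed,
# already-rounded per-prompt percentages.
# Assumption: every summary value is an unweighted arithmetic mean.

MODELS = [
    "Claude 3.5 Haiku", "Claude Sonnet 4", "Command A", "Deepseek Chat V3",
    "Gemini 2.5 Flash", "Gemini 2.5 Pro Preview 05 06", "Mistral Large 2411",
    "Mistral Medium 3", "GPT 4.1", "GPT 4.1 Mini", "GPT 4.1 Nano", "GPT 4o",
    "GPT 4o Mini", "Grok 3", "Grok 3 Mini",
]

# One row per prompt (prompt texts are not shown), one column per model, in %.
COVERAGE = [
    [69, 78, 91, 78, 78, 78, 81, 78, 84, 97, 81, 75, 88, 81, 100],
    [73, 93, 95, 100, 100, 93, 100, 100, 98, 100, 93, 90, 90, 95, 100],
    [88, 100, 100, 100, 100, 38, 100, 100, 100, 100, 83, 98, 75, 100, 100],
    [82, 80, 97, 88, 98, 48, 84, 96, 97, 100, 66, 79, 88, 100, 98],
    [53, 90, 85, 93, 100, 50, 78, 95, 95, 90, 85, 88, 80, 98, 90],
    [88, 95, 98, 100, 100, 95, 93, 100, 98, 100, 93, 88, 95, 100, 100],
    [100, 97, 100, 100, 100, 97, 100, 97, 100, 100, 100, 100, 94, 100, 100],
    [65, 85, 92, 96, 98, 60, 92, 96, 83, 100, 71, 77, 94, 98, 100],
    [30, 98, 100, 100, 98, 98, 98, 93, 98, 100, 80, 100, 75, 100, 98],
    [95, 98, 100, 98, 100, 88, 95, 100, 100, 98, 98, 98, 100, 100, 100],
    [69, 81, 75, 81, 79, 50, 81, 81, 83, 83, 83, 81, 75, 75, 79],
    [85, 100, 83, 95, 93, 93, 90, 83, 90, 95, 90, 83, 88, 98, 90],
]


def model_averages(rows):
    """Mean coverage per model: average each column across all prompts."""
    return [sum(col) / len(col) for col in zip(*rows)]


def prompt_averages(rows):
    """Mean coverage per prompt: average each row across all models."""
    return [sum(row) / len(row) for row in rows]


def ranks(names, scores):
    """Map each name to its 1-based rank, highest score first."""
    ordered = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return {name: position for position, (name, _) in enumerate(ordered, start=1)}


if __name__ == "__main__":
    averages = model_averages(COVERAGE)
    model_ranks = ranks(MODELS, averages)
    # Print models in rank order with two decimals to expose rounding effects.
    for name, avg in sorted(zip(MODELS, averages), key=lambda p: model_ranks[p[0]]):
        print(f"{model_ranks[name]:>2}. {name}: {avg:.2f}%")
```

Printing two decimals makes the rounding visible: for example, Claude 3.5 Haiku comes out at 74.75%, which the table shows as 74.8%. Under this assumption the recomputed ordering matches the ranks in the Score row.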