LLM Rankings

From AI Wiki
Revision as of 22:57, 12 January 2025 by Alpha5
See also: LLM Comparisons and LLM Benchmarks Timeline

Ranking of large language models (LLMs) across six common benchmarks. The Average column is reported only for models with scores on all six.

| Model | Average | MMLU (General) | GPQA (Reasoning) | HumanEval (Coding) | MATH | BFCL (Tool Use) | MGSM (Multilingual) |
|---|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 84.5% | 88.3% | 65% | 93.7% | 78.3% | 90.2% | 91.6% |
| GPT-4o | 80.5% | 88.7% | 53.6% | 90.2% | 76.6% | 83.59% | 90.5% |
| Llama 3.1 405B | 80.4% | 88.6% | 51.1% | 89% | 73.8% | 88.5% | 91.6% |
| GPT-4 Turbo | 78.1% | 86.5% | 48% | 87.1% | 72.6% | 86% | 88.5% |
| Claude 3 Opus | 76.7% | 85.7% | 50.4% | 84.9% | 60.1% | 88.4% | 90.7% |
| GPT-4 | 75.5% | 86.4% | 41.4% | 86.6% | 64.5% | 88.3% | 85.9% |
| Llama 3.1 70B | 75.5% | 86% | 46.7% | 80.5% | 68% | 84.8% | 86.9% |
| Llama 3.3 70B | 74.5% | 86% | 48% | 88.4% | 77% | 77.5% | 91.1% |
| Gemini 1.5 Pro | 74.1% | 85.9% | 46.2% | 71.9% | 67.7% | 84.35% | 88.7% |
| Claude 3.5 Haiku | 68.3% | 65% | 41.6% | 88.1% | 69.4% | 60% | 85.6% |
| Gemini 1.5 Flash | 66.7% | 78.9% | 39.5% | 71.5% | 54.9% | 79.88% | 75.5% |
| Claude 3 Haiku | 62.9% | 75.2% | 35.7% | 75.9% | 38.9% | 74.65% | 71.7% |
| Llama 3.1 8B | 62.6% | 73% | 32.8% | 72.6% | 51.9% | 76.1% | 68.9% |
| GPT-3.5 Turbo | 59.2% | 69.8% | 30.8% | 68% | 34.1% | 64.41% | 56.3% |
| Gemini 2.0 Flash | – | 76.4% | 62.1% | – | 89.7% | – | – |
| AWS Nova Micro | – | 77.6% | 40% | 81.1% | 69.3% | 56.2% | – |
| AWS Nova Lite | – | 80.5% | 42% | 85.4% | 73.3% | 66.6% | – |
| AWS Nova Pro | – | 85.9% | 46.9% | 89% | 76.6% | 68.4% | – |
| GPT-4o mini | – | 82% | 40.2% | 87.2% | 70.2% | – | 87% |
| Gemini Ultra | – | 83.7% | 35.7% | – | 53.2% | – | 79% |
| OpenAI o1 | – | 91.8% | 75.7% | 92.4% | 96.4% | – | 89.3% |

Cells marked "–" have no score listed on the leaderboard; models missing any benchmark have no Average.
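The Average column appears to be the unweighted mean of the six benchmark scores; the leaderboard does not state its formula, so this is an assumption. A quick sketch to check it against a row from the table:

```python
# Reproduce the Average column as the unweighted mean of the six benchmark
# percentages (an assumption; the leaderboard does not publish its formula).
def average(scores):
    """Round the mean of benchmark percentages to one decimal place."""
    return round(sum(scores) / len(scores), 1)

# Claude 3.5 Sonnet row: MMLU, GPQA, HumanEval, MATH, BFCL, MGSM
claude_35_sonnet = [88.3, 65.0, 93.7, 78.3, 90.2, 91.6]
print(average(claude_35_sonnet))  # 84.5, matching the table

# GPT-4o row as a second check
gpt_4o = [88.7, 53.6, 90.2, 76.6, 83.59, 90.5]
print(average(gpt_4o))  # 80.5, matching the table
```

Both rows reproduce the listed averages, which supports the unweighted-mean reading.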

[1]

References

  1. Vellum AI LLM Leaderboard. https://www.vellum.ai/llm-leaderboard/