LLM Rankings
{{see also|LLM Benchmarks Timeline|LLM Comparisons}}
Ranking of [[large language models]] ([[LLMs]]).
{| class="wikitable sortable"
! Model
! Average
! MMLU (General)
! GPQA (Reasoning)
! HumanEval (Coding)
! Math
! BFCL (Tool Use)
! MGSM (Multilingual)
|-
| '''[[Claude 3.5 Sonnet]]'''
| 84.5%
| 88.3%
| 65%
| 93.7%
| 78.3%
| 90.2%
| 91.6%
|-
| '''[[GPT-4o]]'''
| 80.5%
| 88.7%
| 53.6%
| 90.2%
| 76.6%
| 83.59%
| 90.5%
|-
| '''[[Llama 3.1 405b]]'''
| 80.4%
| 88.6%
| 51.1%
| 89%
| 73.8%
| 88.5%
| 91.6%
|-
| '''[[GPT-Turbo]]'''
| 78.1%
| 86.5%
| 48%
| 87.1%
| 72.6%
| 86%
| 88.5%
|-
| '''[[Claude 3 Opus]]'''
| 76.7%
| 85.7%
| 50.4%
| 84.9%
| 60.1%
| 88.4%
| 90.7%
|-
| '''[[GPT-4]]'''
| 75.5%
| 86.4%
| 41.4%
| 86.6%
| 64.5%
| 88.3%
| 85.9%
|-
| '''[[Llama 3.1 70b]]'''
| 75.5%
| 86%
| 46.7%
| 80.5%
| 68%
| 84.8%
| 86.9%
|-
| '''[[Llama 3.3 70b]]'''
| 74.5%
| 86%
| 48%
| 88.4%
| 77%
| 77.5%
| 91.1%
|-
| '''[[Gemini 1.5 Pro]]'''
| 74.1%
| 85.9%
| 46.2%
| 71.9%
| 67.7%
| 84.35%
| 88.7%
|-
| '''[[Claude 3.5 Haiku]]'''
| 68.3%
| 65%
| 41.6%
| 88.1%
| 69.4%
| 60%
| 85.6%
|-
| '''[[Gemini 1.5 Flash]]'''
| 66.7%
| 78.9%
| 39.5%
| 71.5%
| 54.9%
| 79.88%
| 75.5%
|-
| '''[[Claude 3 Haiku]]'''
| 62.9%
| 75.2%
| 35.7%
| 75.9%
| 38.9%
| 74.65%
| 71.7%
|-
| '''[[Llama 3.1 8b]]'''
| 62.6%
| 73%
| 32.8%
| 72.6%
| 51.9%
| 76.1%
| 68.9%
|-
| '''[[GPT-3.5 Turbo]]'''
| 59.2%
| 69.8%
| 30.8%
| 68%
| 34.1%
| 64.41%
| 56.3%
|-
| '''[[Gemini 2.0 Flash]]'''
| –
| 76.4%
| 62.1%
| –
| 89.7%
| –
| –
|-
| '''[[AWS Nova Micro]]'''
| –
| 77.6%
| 40%
| 81.1%
| 69.3%
| 56.2%
| –
|-
| '''[[AWS Nova Lite]]'''
| –
| 80.5%
| 42%
| 85.4%
| 73.3%
| 66.6%
| –
|-
| '''[[AWS Nova Pro]]'''
| –
| 85.9%
| 46.9%
| 89%
| 76.6%
| 68.4%
| –
|-
| '''[[GPT-4o mini]]'''
| –
| 82%
| 40.2%
| 87.2%
| 70.2%
| –
| 87%
|-
| '''[[Gemini Ultra]]'''
| –
| 83.7%
| 35.7%
| –
| 53.2%
| –
| 79%
|-
| '''[[OpenAI o1]]'''
| –
| 91.8%
| 75.7%
| 92.4%
| 96.4%
| –
| 89.3%
|}
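The Average column appears to be the unweighted arithmetic mean of the six benchmark scores, rounded to one decimal place; rows with any missing score show no average. A minimal sketch of that assumed convention, checked against two rows of the table:

```python
# Benchmark scores in column order: MMLU, GPQA, HumanEval, Math, BFCL, MGSM.
# Assumption: "Average" is the plain mean of these six values.
scores = {
    "Claude 3.5 Sonnet": [88.3, 65.0, 93.7, 78.3, 90.2, 91.6],
    "GPT-4o": [88.7, 53.6, 90.2, 76.6, 83.59, 90.5],
}

for model, vals in scores.items():
    avg = sum(vals) / len(vals)
    print(f"{model}: {avg:.1f}%")
# → Claude 3.5 Sonnet: 84.5%
# → GPT-4o: 80.5%
```

Both values match the table's Average column, which supports the unweighted-mean reading.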
<ref name="1">Vellum AI LLM Leaderboard: https://www.vellum.ai/llm-leaderboard/</ref>

==References==
<references />

[[Category:Important]]
[[Category:Rankings]]
[[Category:Aggregate pages]]
Latest revision as of 21:02, 13 January 2025