LLM Rankings
{{see also|LLM Benchmarks Timeline|LLM Comparisons}}

Ranking of [[large language models]] ([[LLMs]]).
{| class="wikitable sortable"
! Model
! Average
! MMLU (General)
! GPQA (Reasoning)
! HumanEval (Coding)
! Math
! BFCL (Tool Use)
! MGSM (Multilingual)
|-
| '''[[Claude 3.5 Sonnet]]'''
| 84.5%
| 88.3%
| 65%
| 93.7%
| 78.3%
| 90.2%
| 91.6%
|-
| '''[[GPT-4o]]'''
| 80.5%
| 88.7%
| 53.6%
| 90.2%
| 76.6%
| 83.59%
| 90.5%
|-
| '''[[Llama 3.1 405b]]'''
| 80.4%
| 88.6%
| 51.1%
| 89%
| 73.8%
| 88.5%
| 91.6%
|-
| '''[[GPT-Turbo]]'''
| 78.1%
| 86.5%
| 48%
| 87.1%
| 72.6%
| 86%
| 88.5%
|-
| '''[[Claude 3 Opus]]'''
| 76.7%
| 85.7%
| 50.4%
| 84.9%
| 60.1%
| 88.4%
| 90.7%
|-
| '''[[GPT-4]]'''
| 75.5%
| 86.4%
| 41.4%
| 86.6%
| 64.5%
| 88.3%
| 85.9%
|-
| '''[[Llama 3.1 70b]]'''
| 75.5%
| 86%
| 46.7%
| 80.5%
| 68%
| 84.8%
| 86.9%
|-
| '''[[Llama 3.3 70b]]'''
| 74.5%
| 86%
| 48%
| 88.4%
| 77%
| 77.5%
| 91.1%
|-
| '''[[Gemini 1.5 Pro]]'''
| 74.1%
| 85.9%
| 46.2%
| 71.9%
| 67.7%
| 84.35%
| 88.7%
|-
| '''[[Claude 3.5 Haiku]]'''
| 68.3%
| 65%
| 41.6%
| 88.1%
| 69.4%
| 60%
| 85.6%
|-
| '''[[Gemini 1.5 Flash]]'''
| 66.7%
| 78.9%
| 39.5%
| 71.5%
| 54.9%
| 79.88%
| 75.5%
|-
| '''[[Claude 3 Haiku]]'''
| 62.9%
| 75.2%
| 35.7%
| 75.9%
| 38.9%
| 74.65%
| 71.7%
|-
| '''[[Llama 3.1 8b]]'''
| 62.6%
| 73%
| 32.8%
| 72.6%
| 51.9%
| 76.1%
| 68.9%
|-
| '''[[GPT-3.5 Turbo]]'''
| 59.2%
| 69.8%
| 30.8%
| 68%
| 34.1%
| 64.41%
| 56.3%
|-
| '''[[Gemini 2.0 Flash]]'''
| –
| 76.4%
| 62.1%
| –
| 89.7%
| –
| –
|-
| '''[[AWS Nova Micro]]'''
| –
| 77.6%
| 40%
| 81.1%
| 69.3%
| 56.2%
| –
|-
| '''[[AWS Nova Lite]]'''
| –
| 80.5%
| 42%
| 85.4%
| 73.3%
| 66.6%
| –
|-
| '''[[AWS Nova Pro]]'''
| –
| 85.9%
| 46.9%
| 89%
| 76.6%
| 68.4%
| –
|-
| '''[[GPT-4o mini]]'''
| –
| 82%
| 40.2%
| 87.2%
| 70.2%
| –
| 87%
|-
| '''[[Gemini Ultra]]'''
| –
| 83.7%
| 35.7%
| –
| 53.2%
| –
| 79%
|-
| '''[[OpenAI o1]]'''
| –
| 91.8%
| 75.7%
| 92.4%
| 96.4%
| –
| 89.3%
|}
<ref name="1">Vellum AI LLM Leaderboard https://www.vellum.ai/llm-leaderboard/</ref>
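The ''Average'' column matches the unweighted arithmetic mean of the six benchmark columns for rows where all six scores are reported (models with missing scores show "–"). A minimal Python sketch of that calculation, assuming this convention; the function name is illustrative:

```python
def benchmark_average(scores):
    """Unweighted mean of benchmark percentages, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)

# Claude 3.5 Sonnet: MMLU, GPQA, HumanEval, Math, BFCL, MGSM (from the table above)
print(benchmark_average([88.3, 65.0, 93.7, 78.3, 90.2, 91.6]))  # 84.5, matching its Average
```

The same check reproduces the other complete rows, e.g. GPT-4o's 80.5% and Llama 3.1 405b's 80.4%.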
==References==
<references />
[[Category:Important]] [[Category:Rankings]] [[Category:Aggregate pages]]
Latest revision as of 21:02, 13 January 2025