LLM Rankings
{{see also|LLM Benchmarks Timeline|LLM Comparisons}}
Ranking of [[large language models]] ([[LLMs]]).


{| class="wikitable sortable"
! Model
! Average
! MMLU (General)
! GPQA (Reasoning)
! HumanEval (Coding)
! Math
! BFCL (Tool Use)
! MGSM (Multilingual)
|-
| '''[[Claude 3.5 Sonnet]]'''
| 84.5%
| 88.3%
| 65%
| 93.7%
| 78.3%
| 90.2%
| 91.6%
|-
| '''[[GPT-4o]]'''
| 80.5%
| 88.7%
| 53.6%
| 90.2%
| 76.6%
| 83.59%
| 90.5%
|-
| '''[[Llama 3.1 405b]]'''
| 80.4%
| 88.6%
| 51.1%
| 89%
| 73.8%
| 88.5%
| 91.6%
|-
| '''[[GPT-4 Turbo]]'''
| 78.1%
| 86.5%
| 48%
| 87.1%
| 72.6%
| 86%
| 88.5%
|-
| '''[[Claude 3 Opus]]'''
| 76.7%
| 85.7%
| 50.4%
| 84.9%
| 60.1%
| 88.4%
| 90.7%
|-
| '''[[GPT-4]]'''
| 75.5%
| 86.4%
| 41.4%
| 86.6%
| 64.5%
| 88.3%
| 85.9%
|-
| '''[[Llama 3.1 70b]]'''
| 75.5%
| 86%
| 46.7%
| 80.5%
| 68%
| 84.8%
| 86.9%
|-
| '''[[Llama 3.3 70b]]'''
| 74.5%
| 86%
| 48%
| 88.4%
| 77%
| 77.5%
| 91.1%
|-
| '''[[Gemini 1.5 Pro]]'''
| 74.1%
| 85.9%
| 46.2%
| 71.9%
| 67.7%
| 84.35%
| 88.7%
|-
| '''[[Claude 3.5 Haiku]]'''
| 68.3%
| 65%
| 41.6%
| 88.1%
| 69.4%
| 60%
| 85.6%
|-
| '''[[Gemini 1.5 Flash]]'''
| 66.7%
| 78.9%
| 39.5%
| 71.5%
| 54.9%
| 79.88%
| 75.5%
|-
| '''[[Claude 3 Haiku]]'''
| 62.9%
| 75.2%
| 35.7%
| 75.9%
| 38.9%
| 74.65%
| 71.7%
|-
| '''[[Llama 3.1 8b]]'''
| 62.6%
| 73%
| 32.8%
| 72.6%
| 51.9%
| 76.1%
| 68.9%
|-
| '''[[GPT-3.5 Turbo]]'''
| 59.2%
| 69.8%
| 30.8%
| 68%
| 34.1%
| 64.41%
| 56.3%
|-
| '''[[Gemini 2.0 Flash]]'''
| –
| 76.4%
| 62.1%
| –
| 89.7%
| –
| –
|-
| '''[[AWS Nova Micro]]'''
| –
| 77.6%
| 40%
| 81.1%
| 69.3%
| 56.2%
| –
|-
| '''[[AWS Nova Lite]]'''
| –
| 80.5%
| 42%
| 85.4%
| 73.3%
| 66.6%
| –
|-
| '''[[AWS Nova Pro]]'''
| –
| 85.9%
| 46.9%
| 89%
| 76.6%
| 68.4%
| –
|-
| '''[[GPT-4o mini]]'''
| –
| 82%
| 40.2%
| 87.2%
| 70.2%
| –
| 87%
|-
| '''[[Gemini Ultra]]'''
| –
| 83.7%
| 35.7%
| –
| 53.2%
| –
| 79%
|-
| '''[[OpenAI o1]]'''
| –
| 91.8%
| 75.7%
| 92.4%
| 96.4%
| –
| 89.3%
|}
<ref name="vellum">Vellum AI LLM Leaderboard https://www.vellum.ai/llm-leaderboard/</ref>
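
For most rows, the ''Average'' column equals the plain arithmetic mean of the six benchmark columns (MMLU, GPQA, HumanEval, Math, BFCL, MGSM), but a few rows deviate, so the source leaderboard may aggregate a different or weighted score set. A minimal sketch of that check, using values copied from the table above (the layout is illustrative, not part of the leaderboard):

<syntaxhighlight lang="python">
# Check whether the "Average" column is the equal-weight mean of the six
# benchmark percentages (MMLU, GPQA, HumanEval, Math, BFCL, MGSM).
rows = {
    # model: (listed average, [six benchmark scores from the table above])
    "Claude 3.5 Sonnet": (84.5, [88.3, 65.0, 93.7, 78.3, 90.2, 91.6]),
    "GPT-4o":            (80.5, [88.7, 53.6, 90.2, 76.6, 83.59, 90.5]),
    "GPT-3.5 Turbo":     (59.2, [69.8, 30.8, 68.0, 34.1, 64.41, 56.3]),
}

for model, (listed, scores) in rows.items():
    mean = round(sum(scores) / len(scores), 1)
    flag = "matches" if mean == listed else "differs"
    print(f"{model}: listed {listed}, mean {mean} ({flag})")

# Output:
# Claude 3.5 Sonnet: listed 84.5, mean 84.5 (matches)
# GPT-4o: listed 80.5, mean 80.5 (matches)
# GPT-3.5 Turbo: listed 59.2, mean 53.9 (differs)
</syntaxhighlight>

The ''Average'' is therefore best read as the leaderboard's own aggregate rather than a value always derivable from the other columns.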


==References==
<references />




[[Category:Important]] [[Category:Rankings]] [[Category:Aggregate pages]]
