LLM Rankings: Difference between revisions

From AI Wiki
Changes in this revision:

  1. In all 21 table rows, the model name in the first column is now a bold internal link. For example, the cell "| Claude 3.5 Sonnet" becomes "| '''[[Claude 3.5 Sonnet]]'''".
  2. In the Llama 3.3 70b row, the HumanEval score is rewritten from 88.4% to 88.40%.

Revision as of 22:51, 12 January 2025

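Ranking of large language models (LLMs) by benchmark score. Models with a reported average are listed first, in descending order of that average; models without one follow.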

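{|
! Model !! Average !! MMLU (General) !! GPQA (Reasoning) !! HumanEval (Coding) !! Math !! BFCL (Tool Use) !! MGSM (Multilingual)
|-
| '''[[Claude 3.5 Sonnet]]''' || 84.5% || 88.3% || 65% || 93.7% || 78.3% || 90.2% || 91.6%
|-
| '''[[GPT-4o]]''' || 80.5% || 88.7% || 53.6% || 90.2% || 76.6% || 83.59% || 90.5%
|-
| '''[[Llama 3.1 405b]]''' || 80.4% || 88.6% || 51.1% || 89% || 73.8% || 88.5% || 91.6%
|-
| '''[[GPT-Turbo]]''' || 78.1% || 86.5% || 48% || 87.1% || 72.6% || 86% || 88.5%
|-
| '''[[Claude 3 Opus]]''' || 76.7% || 85.7% || 50.4% || 84.9% || 60.1% || 88.4% || 90.7%
|-
| '''[[GPT-4]]''' || 75.5% || 86.4% || 41.4% || 86.6% || 64.5% || 88.3% || 85.9%
|-
| '''[[Llama 3.1 70b]]''' || 75.5% || 86% || 46.7% || 80.5% || 68% || 84.8% || 86.9%
|-
| '''[[Llama 3.3 70b]]''' || 74.5% || 86% || 48% || 88.40% || 77% || 77.5% || 91.1%
|-
| '''[[Gemini 1.5 Pro]]''' || 74.1% || 85.9% || 46.2% || 71.9% || 67.7% || 84.35% || 88.7%
|-
| '''[[Claude 3.5 Haiku]]''' || 68.3% || 65% || 41.6% || 88.1% || 69.4% || 60% || 85.6%
|-
| '''[[Gemini 1.5 Flash]]''' || 66.7% || 78.9% || 39.5% || 71.5% || 54.9% || 79.88% || 75.5%
|-
| '''[[Claude 3 Haiku]]''' || 62.9% || 75.2% || 35.7% || 75.9% || 38.9% || 74.65% || 71.7%
|-
| '''[[Llama 3.1 8b]]''' || 62.6% || 73% || 32.8% || 72.6% || 51.9% || 76.1% || 68.9%
|-
| '''[[GPT-3.5 Turbo]]''' || 59.2% || 69.8% || 30.8% || 68% || 34.1% || 64.41% || 56.3%
|-
| '''[[Gemini 2.0 Flash]]''' || – || 76.4% || 62.1% || – || 89.7% || – || –
|-
| '''[[AWS Nova Micro]]''' || – || 77.6% || 40% || 81.1% || 69.3% || 56.2% || –
|-
| '''[[AWS Nova Lite]]''' || – || 80.5% || 42% || 85.4% || 73.3% || 66.6% || –
|-
| '''[[AWS Nova Pro]]''' || – || 85.9% || 46.9% || 89% || 76.6% || 68.4% || –
|-
| '''[[GPT-4o mini]]''' || – || 82% || 40.2% || 87.2% || 70.2% || – || 87%
|-
| '''[[Gemini Ultra]]''' || – || 83.7% || 35.7% || – || 53.2% || – || 79%
|-
| '''[[OpenAI o1]]''' || – || 91.8% || 75.7% || 92.4% || 96.4% || – || 89.3%
|}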
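
The page does not say how the Average column is derived. For most rows that report one (Claude 3.5 Sonnet, for instance), it agrees with the unweighted mean of the six benchmark scores, rounded to one decimal place, though not for every row. The sketch below works under that assumption; the function name `average_score` and the dictionary layout are illustrative, not part of the wiki or the cited leaderboard.

```python
from statistics import mean

# Benchmark columns of the table, in order.
BENCHMARKS = ("MMLU", "GPQA", "HumanEval", "Math", "BFCL", "MGSM")


def average_score(scores):
    """Unweighted mean of the six benchmark scores, rounded to one decimal.

    Returns None when any benchmark is missing, mirroring the rows that
    have no reported average. This is an assumed formula, not a documented one.
    """
    values = [scores.get(name) for name in BENCHMARKS]
    if any(value is None for value in values):
        return None
    return round(mean(values), 1)


# Scores copied from the Claude 3.5 Sonnet row (percent).
claude_35_sonnet = {
    "MMLU": 88.3, "GPQA": 65.0, "HumanEval": 93.7,
    "Math": 78.3, "BFCL": 90.2, "MGSM": 91.6,
}

# Scores copied from the AWS Nova Micro row; no MGSM score is reported.
nova_micro = {
    "MMLU": 77.6, "GPQA": 40.0, "HumanEval": 81.1,
    "Math": 69.3, "BFCL": 56.2,
}

print(average_score(claude_35_sonnet))  # 84.5, matching the table
print(average_score(nova_micro))        # None, since a score is missing
```

Returning None for incomplete rows, rather than averaging over whichever scores exist, keeps models with fewer reported benchmarks from being ranked against models that were scored on all six.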

[1]

References

  1. Vellum AI LLM Leaderboard https://www.vellum.ai/llm-leaderboard/