LLM Comparisons: Difference between revisions

From AI Wiki
No edit summary
 
(4 intermediate revisions by the same user not shown)
Line 1: Line 1:
Compare different [[large language models]] ([[LLM]]s): [[#Concise Comparison|Concise comparison]], [[#Detailed Comparison|detailed comparison]], [[#Terms|terminology definitions]].
{{see also|LLM Benchmarks Timeline|LLM Rankings}}
__TOC__
Compare different [[large language models]] ([[LLM]]s).
==Concised Comparison==
 
{| class="wikitable sortable"
{| class="wikitable sortable"
! Model
! Model
Line 199: Line 199:
| '''[[Jamba Instruct]]''' || [[AI21 Labs]] || 256k ||  || $0.55 || 77.4 || 0.52
| '''[[Jamba Instruct]]''' || [[AI21 Labs]] || 256k ||  || $0.55 || 77.4 || 0.52
|}
|}
 
<ref name="”1”">LLM Leaderboard - Comparison of GPT-4o, Llama 3, Mistral, Gemini and over 30 models https://artificialanalysis.ai/leaderboards/models</ref>
==Detailed Comparison==
{| class="wikitable sortable"
! Model
! Creator
! License
! Context Window
! Quality Index<br>(Normalized avg)
! Chatbot Arena
! MMLU
! GPQA
! MATH-500
! HumanEval
! Blended<br>(USD/1M Tokens)
! Input Price<br>(USD/1M Tokens)
! Output Price<br>(USD/1M Tokens)
! Median<br>(Tokens/s)
! P5<br>(Tokens/s)
! P25<br>(Tokens/s)
! P75<br>(Tokens/s)
! P95<br>(Tokens/s)
! Median<br>(First Chunk (s))
! P5<br>(First Chunk (s))
! P25<br>(First Chunk (s))
! P75<br>(First Chunk (s))
! P95<br>(First Chunk (s))
! Further Analysis
|-
| '''[[o1-preview]]''' || OpenAI || Proprietary || 128k || 86 || 1334 || 0.91 || 0.67 || 0.92 || 0.96 || $27.56 || $15.75 || $63.00 || 143.8 || 68.9 || 121.6 || 164.6 || 179.6 || 21.28 || 13.40 || 17.04 || 27.80 || 46.49 || –
|-
| '''[[o1-mini]]''' || OpenAI || Proprietary || 128k || 84 || 1308 || 0.85 || 0.58 || 0.95 || 0.97 || $5.25 || $3.00 || $12.00 || 213.6 || 84.0 || 154.8 || 238.0 || 299.4 || 11.75 || 2.44 || 9.40 || 14.43 || 24.03 || –
|-
| '''[[GPT-4o (Aug '24)]]''' || OpenAI || Proprietary || 128k || 78 || 1337 || 0.89 || 0.51 || 0.80 || 0.93 || – || – || $4.38 || $2.50 || $10.00 || 85.6 || 40.3 || 61.5 || 109.3 || 143.6 || 0.66 || 0.33 || 0.43 || 0.91 || 1.92
|-
| '''[[GPT-4o (May '24)]]''' || OpenAI || Proprietary || 128k || 78 || 1285 || 0.87 || 0.51 || 0.79 || 0.93 || – || – || $7.50 || $5.00 || $15.00 || 106.8 || 53.2 || 82.2 || 126.8 || 142.5 || 0.65 || 0.32 || 0.43 || 0.73 || 1.22
|-
| '''[[GPT-4o mini]]''' || OpenAI || Proprietary || 128k || 73 || 1273 || 0.82 || 0.44 || 0.79 || 0.88 || – || – || $0.26 || $0.15 || $0.60 || 121.8 || 50.7 || 74.1 || 179.4 || 206.5 || 0.65 || 0.30 || 0.39 || 0.77 || 0.92
|-
| '''[[Claude 3.5 Sonnet (Oct)]]''' || Anthropic || Proprietary || 200k || 80 || 1282 || 0.89 || 0.58 || 0.76 || 0.96 || – || – || $6.00 || $3.00 || $15.00 || 71.8 || 37.6 || 44.8 || 78.0 || 89.6 || 0.98 || 0.68 || 0.78 || 1.36 || 2.23
|-
| '''[[Claude 3.5 Sonnet (June)]]''' || Anthropic || Proprietary || 200k || 76 || 1268 || 0.88 || 0.56 || 0.71 || 0.90 || – || – || $6.00 || $3.00 || $15.00 || 61.4 || 41.6 || 49.9 || 78.9 || 91.0 || 0.87 || 0.68 || 0.75 || 1.06 || 1.45
|-
| '''[[Claude 3.5 Haiku]]''' || Anthropic || Proprietary || 200k || 68 || – || 0.81 || 0.37 || 0.67 || 0.87 || – || – || $1.60 || $0.80 || $4.00 || 65.1 || 51.1 || 58.6 || 75.4 || 105.1 || 0.71 || 0.54 || 0.64 || 0.93 || 1.20
|-
| '''[[Llama 3.3 70B]]''' || Meta || Open || 128k || 74 || – || 0.86 || 0.49 || 0.76 || 0.86 || – || – || $0.67 || $0.59 || $0.73 || 67.2 || 23.6 || 31.2 || 275.7 || 2046.5 || 0.51 || 0.23 || 0.36 || 0.72 || 1.48
|-
| '''[[Llama 3.2 3B]]''' || Meta || Open || 128k || 49 || 1103 || 0.64 || 0.21 || 0.50 || 0.60 || – || – || $0.06 || $0.06 || $0.06 || 202.2 || 42.4 || 144.0 || 543.6 || 1623.1 || 0.38 || 0.15 || 0.26 || 0.49 || 0.93
|-
| '''[[Gemini 1.5 Flash (May)]]''' || Google || Proprietary || 1m || – || 1227 || 0.79 || 0.39 || 0.55 || – || – || – || $0.13 || $0.07 || $0.30 || 310.0 || 276.8 || 297.5 || 325.0 || 350.4 || 0.30 || 0.23 || 0.27 || 0.33 || 0.39
|-
| '''[[Nova Micro]]''' || Amazon || Proprietary || 130k || 66 || – || 0.76 || 0.38 || 0.69 || 0.80 || – || – || $0.06 || $0.04 || $0.14 || 195.8 || 170.9 || 186.0 || 208.3 || 219.5 || 0.33 || 0.30 || 0.32 || 0.35 || 0.39
|-
| '''[[DeepSeek-Coder-V2]]''' || DeepSeek || Open || 128k || 71 || 1178 || 0.80 || 0.42 || 0.74 || 0.87 || – || – || $0.17 || $0.14 || $0.28 || 64.4 || 51.8 || 57.3 || 71.4 || 81.1 || 1.12 || 0.84 || 0.99 || 1.27 || 1.71
|-
| '''[[Phi-4]]''' || Microsoft Azure || Open || 16k || 77 || – || 0.85 || 0.57 || 0.81 || 0.87 || – || – || $0.09 || $0.07 || $0.14 || 85.1 || 76.2 || 82.0 || 85.4 || 85.6 || 0.21 || 0.16 || 0.18 || 0.23 || 0.25
|-
| '''[[Reka Flash]]''' || Reka AI || Proprietary || 128k || 59 || – || 0.73 || 0.34 || 0.53 || 0.74 || – || – || $0.35 || $0.20 || $0.80 || – || – || – || – || – || – || – || – || – || –
|-
| '''[[OpenChat 3.5]]''' || OpenChat || Open || 8k || 44 || 1076 || 0.56 || 0.22 || 0.31 || 0.68 || – || – || $0.06 || $0.06 || $0.06 || 73.3 || 66.3 || 69.3 || 76.3 || 80.3 || 0.30 || 0.24 || 0.27 || 0.32 || 0.37
|-
| '''[[Jamba Instruct]]''' || AI21 Labs || Proprietary || 256k || – || – || 0.58 || 0.25 || – || – || – || – || $0.55 || $0.50 || $0.70 || 77.1 || 70.4 || 74.3 || 169.6 || 193.7 || 0.52 || 0.29 || 0.45 || 0.54 || 0.58
|}


==Terms==
==Terms==
Line 271: Line 210:
*'''Input Price''': Price per token included in the request/message sent to the API, represented as USD per million Tokens.
*'''Input Price''': Price per token included in the request/message sent to the API, represented as USD per million Tokens.
*'''Time period''': Metrics are 'live' and are based on the past 14 days of measurements, measurements are taken 8 times a day for single requests and 2 times per day for parallel requests.
*'''Time period''': Metrics are 'live' and are based on the past 14 days of measurements, measurements are taken 8 times a day for single requests and 2 times per day for parallel requests.
<ref name="”1”">LLM Leaderboard - Comparison of GPT-4o, Llama 3, Mistral, Gemini and over 30 models https://artificialanalysis.ai/leaderboards/models</ref>


==References==
==References==
<references />
[[Category:Important]] [[Category:Rankings]] [[Category:Aggregate pages]]

Latest revision as of 21:03, 13 January 2025

See also: LLM Benchmarks Timeline and LLM Rankings

Compare different large language models (LLMs).

Model Creator Context Window Quality Index
(Normalized avg)
Blended
(USD/1M Tokens)
Median
(Tokens/s)
Median
(First Chunk (s))
o1-preview OpenAI 128k 86 $27.56 143.7 21.33
o1-mini OpenAI 128k 84 $5.25 213.2 11.27
GPT-4o (Aug '24) OpenAI 128k 78 $4.38 83.5 0.67
GPT-4o (May '24) OpenAI 128k 78 $7.50 106.3 0.65
GPT-4o mini OpenAI 128k 73 $0.26 113.8 0.64
GPT-4o (Nov '24) OpenAI 128k 73 $4.38 116.4 0.39
GPT-4o mini Realtime (Dec '24) OpenAI 128k $0.00
GPT-4o Realtime (Dec '24) OpenAI 128k $0.00
Llama 3.3 70B Meta 128k 74 $0.69 71.8 0.49
Llama 3.1 405B Meta 128k 74 $3.50 30.2 0.71
Llama 3.1 70B Meta 128k 68 $0.72 72.8 0.44
Llama 3.2 90B (Vision) Meta 128k 68 $0.81 48.9 0.33
Llama 3.2 11B (Vision) Meta 128k 54 $0.18 131.2 0.28
Llama 3.1 8B Meta 128k 54 $0.10 184.9 0.33
Llama 3.2 3B Meta 128k 49 $0.06 201.4 0.38
Llama 3.2 1B Meta 128k 26 $0.04 468.6 0.37
Gemini 2.0 Flash (exp) Google 1m 82 $0.00 169.0 0.48
Gemini 1.5 Pro (Sep) Google 2m 80 $2.19 60.8 0.74
Gemini 1.5 Flash (Sep) Google 1m 72 $0.13 188.4 0.25
Gemma 2 27B Google 8k 61 $0.26 59.4 0.48
Gemma 2 9B Google 8k 55 $0.12 168.9 0.36
Gemini 1.5 Flash (May) Google 1m $0.13 310.6 0.29
Gemini Experimental (Nov) Google 2m $0.00 53.9 1.12
Gemini 1.5 Pro (May) Google 2m $2.19 66.9 0.49
Gemini 1.5 Flash-8B Google 1m $0.07 279.7 0.38
Claude 3.5 Sonnet (Oct) Anthropic 200k 80 $6.00 72.0 0.99
Claude 3.5 Sonnet (June) Anthropic 200k 76 $6.00 61.5 0.87
Claude 3 Opus Anthropic 200k 70 $30.00 25.9 2.00
Claude 3.5 Haiku Anthropic 200k 68 $1.60 65.1 0.71
Claude 3 Haiku Anthropic 200k 55 $0.50 121.6 0.72
Pixtral Large Mistral 128k 74 $3.00 36.5 0.40
Mistral Large 2 (Jul '24) Mistral 128k 74 $3.00 31.1 0.50
Mistral Large 2 (Nov '24) Mistral 128k 74 $3.00 37.4 0.52
Mistral Small (Sep '24) Mistral 33k 61 $0.30 61.5 0.32
Mixtral 8x22B Mistral 65k 61 $1.20 85.1 0.57
Pixtral 12B Mistral 128k 56 $0.13 70.3 0.37
Ministral 8B Mistral 128k 56 $0.10 136.1 0.30
Mistral NeMo Mistral 128k 54 $0.09 122.5 0.48
Ministral 3B Mistral 128k 53 $0.04 168.5 0.29
Mixtral 8x7B Mistral 33k 41 $0.50 110.6 0.36
Codestral-Mamba Mistral 256k 33 $0.25 95.8 0.44
Command-R+ Cohere 128k 55 $5.19 50.7 0.47
Command-R+ (Apr '24) Cohere 128k 45 $6.00 49.3 0.51
Command-R (Mar '24) Cohere 128k 36 $0.75 108.1 0.36
Aya Expanse 8B Cohere 8k $0.75 165.5 0.16
Command-R Cohere 128k $0.51 111.8 0.32
Aya Expanse 32B Cohere 128k $0.75 120.4 0.18
Sonar 3.1 Small Perplexity 127k $0.20 203.8 0.34
Sonar 3.1 Large Perplexity 127k $1.00 57.7 0.31
Grok Beta xAI 128k 72 $7.50 66.7 0.42
Nova Pro Amazon 300k 75 $1.40 91.0 0.38
Nova Lite Amazon 300k 70 $0.10 148.0 0.33
Nova Micro Amazon 130k 66 $0.06 195.5 0.33
Phi-4 Microsoft Azure 16k 77 $0.09 85.0 0.22
Phi-3 Mini Microsoft Azure 4k $0.00
Phi-3 Medium 14B Microsoft Azure 128k $0.30 50.4 0.43
Solar Mini Upstage 4k 47 $0.15 89.3 1.13
DBRX Databricks 33k 46 $1.16 78.3 0.42
Llama 3.1 Nemotron 70B NVIDIA 128k 72 $0.27 48.3 0.57
Reka Flash Reka AI 128k 59 $0.35
Reka Core Reka AI 128k 58 $2.00
Reka Flash (Feb '24) Reka AI 128k 46 $0.35
Reka Edge Reka AI 128k 31 $0.10
Jamba 1.5 Large AI21 Labs 256k 64 $3.50 51.0 0.71
Jamba 1.5 Mini AI21 Labs 256k $0.25 83.7 0.48
DeepSeek V3 DeepSeek 128k 80 $0.90 20.9 0.94
DeepSeek-V2.5 (Dec '24) DeepSeek 128k 72 $0.17 61.8 1.15
DeepSeek-Coder-V2 DeepSeek 128k 71 $0.17 62.0 1.11
DeepSeek-V2.5 DeepSeek 128k $1.09 7.6 0.77
DeepSeek-V2 DeepSeek 128k $0.17
Arctic Snowflake 4k 51 $0.00
Qwen2.5 72B Alibaba 131k 77 $0.40 67.6 0.53
Qwen2.5 Coder 32B Alibaba 131k 72 $0.80 84.0 0.38
Qwen2 72B Alibaba 131k 72 $0.63 46.5 0.30
QwQ 32B-Preview Alibaba 33k 46 $0.26 67.3 0.40
Yi-Large 01.AI 32k 61 $3.00 68.1 0.47
GPT-4 Turbo OpenAI 128k 75 $15.00 43.3 1.20
GPT-4 OpenAI 8k $37.50 28.4 0.75
Llama 3 70B Meta 8k 48 $0.89 48.9 0.38
Llama 3 8B Meta 8k 45 $0.15 117.3 0.34
Llama 2 Chat 70B Meta 4k $1.85
Llama 2 Chat 13B Meta 4k $0.00
Llama 2 Chat 7B Meta 4k $0.33 123.7 0.37
Gemini 1.0 Pro Google 33k $0.75 102.9 1.27
Claude 3 Sonnet Anthropic 200k 57 $6.00 68.2 0.76
Claude 2.1 Anthropic 200k $12.00 14.1 1.24
Claude 2.0 Anthropic 100k $12.00 29.9 0.81
Mistral Small (Feb '24) Mistral 33k 59 $1.50 53.5 0.37
Mistral Large (Feb '24) Mistral 33k 56 $6.00 38.9 0.44
Mistral 7B Mistral 8k 28 $0.16 112.5 0.26
Mistral Medium Mistral 33k $4.09 44.7 0.37
Codestral Mistral 33k $0.30 84.9 0.28
OpenChat 3.5 OpenChat 8k 44 $0.06 73.5 0.30
Jamba Instruct AI21 Labs 256k $0.55 77.4 0.52

[1]

Terms

  • Artificial Analysis Quality Index: Average result across our evaluations covering different dimensions of model intelligence. Currently includes MMLU, GPQA, Math & HumanEval. OpenAI o1 model figures are preliminary and are based on figures stated by OpenAI. See methodology for more details.
  • Context window: Maximum number of combined input & output tokens. Output tokens commonly have a significantly lower limit (varied by model).
  • Output Speed: Tokens per second received while the model is generating tokens (ie. after first chunk has been received from the API for models which support streaming).
  • Latency: Time to first token of tokens received, in seconds, after API request sent. For models which do not support streaming, this represents time to receive the completion.
  • Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).
  • Output Price: Price per token generated by the model (received from the API), represented as USD per million Tokens.
  • Input Price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.
  • Time period: Metrics are 'live' and are based on the past 14 days of measurements, measurements are taken 8 times a day for single requests and 2 times per day for parallel requests.

[1]

References

  1. 1.0 1.1 LLM Leaderboard - Comparison of GPT-4o, Llama 3, Mistral, Gemini and over 30 models https://artificialanalysis.ai/leaderboards/models