LLM Comparisons: Difference between revisions

Latest revision as of 21:03, 13 January 2025

See also: LLM Benchmarks Timeline and LLM Rankings

Compare different large language models (LLMs).

Model	Creator	Context Window	Quality Index (Normalized avg)	Blended (USD/1M Tokens)	Median (Tokens/s)	Median (First Chunk (s))
o1-preview	OpenAI	128k	86	$27.56	143.7	21.33
o1-mini	OpenAI	128k	84	$5.25	213.2	11.27
GPT-4o (Aug '24)	OpenAI	128k	78	$4.38	83.5	0.67
GPT-4o (May '24)	OpenAI	128k	78	$7.50	106.3	0.65
GPT-4o mini	OpenAI	128k	73	$0.26	113.8	0.64
GPT-4o (Nov '24)	OpenAI	128k	73	$4.38	116.4	0.39
GPT-4o mini Realtime (Dec '24)	OpenAI	128k		$0.00
GPT-4o Realtime (Dec '24)	OpenAI	128k		$0.00
Llama 3.3 70B	Meta	128k	74	$0.69	71.8	0.49
Llama 3.1 405B	Meta	128k	74	$3.50	30.2	0.71
Llama 3.1 70B	Meta	128k	68	$0.72	72.8	0.44
Llama 3.2 90B (Vision)	Meta	128k	68	$0.81	48.9	0.33
Llama 3.2 11B (Vision)	Meta	128k	54	$0.18	131.2	0.28
Llama 3.1 8B	Meta	128k	54	$0.10	184.9	0.33
Llama 3.2 3B	Meta	128k	49	$0.06	201.4	0.38
Llama 3.2 1B	Meta	128k	26	$0.04	468.6	0.37
Gemini 2.0 Flash (exp)	Google	1m	82	$0.00	169.0	0.48
Gemini 1.5 Pro (Sep)	Google	2m	80	$2.19	60.8	0.74
Gemini 1.5 Flash (Sep)	Google	1m	72	$0.13	188.4	0.25
Gemma 2 27B	Google	8k	61	$0.26	59.4	0.48
Gemma 2 9B	Google	8k	55	$0.12	168.9	0.36
Gemini 1.5 Flash (May)	Google	1m		$0.13	310.6	0.29
Gemini Experimental (Nov)	Google	2m		$0.00	53.9	1.12
Gemini 1.5 Pro (May)	Google	2m		$2.19	66.9	0.49
Gemini 1.5 Flash-8B	Google	1m		$0.07	279.7	0.38
Claude 3.5 Sonnet (Oct)	Anthropic	200k	80	$6.00	72.0	0.99
Claude 3.5 Sonnet (June)	Anthropic	200k	76	$6.00	61.5	0.87
Claude 3 Opus	Anthropic	200k	70	$30.00	25.9	2.00
Claude 3.5 Haiku	Anthropic	200k	68	$1.60	65.1	0.71
Claude 3 Haiku	Anthropic	200k	55	$0.50	121.6	0.72
Pixtral Large	Mistral	128k	74	$3.00	36.5	0.40
Mistral Large 2 (Jul '24)	Mistral	128k	74	$3.00	31.1	0.50
Mistral Large 2 (Nov '24)	Mistral	128k	74	$3.00	37.4	0.52
Mistral Small (Sep '24)	Mistral	33k	61	$0.30	61.5	0.32
Mixtral 8x22B	Mistral	65k	61	$1.20	85.1	0.57
Pixtral 12B	Mistral	128k	56	$0.13	70.3	0.37
Ministral 8B	Mistral	128k	56	$0.10	136.1	0.30
Mistral NeMo	Mistral	128k	54	$0.09	122.5	0.48
Ministral 3B	Mistral	128k	53	$0.04	168.5	0.29
Mixtral 8x7B	Mistral	33k	41	$0.50	110.6	0.36
Codestral-Mamba	Mistral	256k	33	$0.25	95.8	0.44
Command-R+	Cohere	128k	55	$5.19	50.7	0.47
Command-R+ (Apr '24)	Cohere	128k	45	$6.00	49.3	0.51
Command-R (Mar '24)	Cohere	128k	36	$0.75	108.1	0.36
Aya Expanse 8B	Cohere	8k		$0.75	165.5	0.16
Command-R	Cohere	128k		$0.51	111.8	0.32
Aya Expanse 32B	Cohere	128k		$0.75	120.4	0.18
Sonar 3.1 Small	Perplexity	127k		$0.20	203.8	0.34
Sonar 3.1 Large	Perplexity	127k		$1.00	57.7	0.31
Grok Beta	xAI	128k	72	$7.50	66.7	0.42
Nova Pro	Amazon	300k	75	$1.40	91.0	0.38
Nova Lite	Amazon	300k	70	$0.10	148.0	0.33
Nova Micro	Amazon	130k	66	$0.06	195.5	0.33
Phi-4	Microsoft Azure	16k	77	$0.09	85.0	0.22
Phi-3 Mini	Microsoft Azure	4k		$0.00
Phi-3 Medium 14B	Microsoft Azure	128k		$0.30	50.4	0.43
Solar Mini	Upstage	4k	47	$0.15	89.3	1.13
DBRX	Databricks	33k	46	$1.16	78.3	0.42
Llama 3.1 Nemotron 70B	NVIDIA	128k	72	$0.27	48.3	0.57
Reka Flash	Reka AI	128k	59	$0.35
Reka Core	Reka AI	128k	58	$2.00
Reka Flash (Feb '24)	Reka AI	128k	46	$0.35
Reka Edge	Reka AI	128k	31	$0.10
Jamba 1.5 Large	AI21 Labs	256k	64	$3.50	51.0	0.71
Jamba 1.5 Mini	AI21 Labs	256k		$0.25	83.7	0.48
DeepSeek V3	DeepSeek	128k	80	$0.90	20.9	0.94
DeepSeek-V2.5 (Dec '24)	DeepSeek	128k	72	$0.17	61.8	1.15
DeepSeek-Coder-V2	DeepSeek	128k	71	$0.17	62.0	1.11
DeepSeek-V2.5	DeepSeek	128k		$1.09	7.6	0.77
DeepSeek-V2	DeepSeek	128k		$0.17
Arctic	Snowflake	4k	51	$0.00
Qwen2.5 72B	Alibaba	131k	77	$0.40	67.6	0.53
Qwen2.5 Coder 32B	Alibaba	131k	72	$0.80	84.0	0.38
Qwen2 72B	Alibaba	131k	72	$0.63	46.5	0.30
QwQ 32B-Preview	Alibaba	33k	46	$0.26	67.3	0.40
Yi-Large	01.AI	32k	61	$3.00	68.1	0.47
GPT-4 Turbo	OpenAI	128k	75	$15.00	43.3	1.20
GPT-4	OpenAI	8k		$37.50	28.4	0.75
Llama 3 70B	Meta	8k	48	$0.89	48.9	0.38
Llama 3 8B	Meta	8k	45	$0.15	117.3	0.34
Llama 2 Chat 70B	Meta	4k		$1.85
Llama 2 Chat 13B	Meta	4k		$0.00
Llama 2 Chat 7B	Meta	4k		$0.33	123.7	0.37
Gemini 1.0 Pro	Google	33k		$0.75	102.9	1.27
Claude 3 Sonnet	Anthropic	200k	57	$6.00	68.2	0.76
Claude 2.1	Anthropic	200k		$12.00	14.1	1.24
Claude 2.0	Anthropic	100k		$12.00	29.9	0.81
Mistral Small (Feb '24)	Mistral	33k	59	$1.50	53.5	0.37
Mistral Large (Feb '24)	Mistral	33k	56	$6.00	38.9	0.44
Mistral 7B	Mistral	8k	28	$0.16	112.5	0.26
Mistral Medium	Mistral	33k		$4.09	44.7	0.37
Codestral	Mistral	33k		$0.30	84.9	0.28
OpenChat 3.5	OpenChat	8k	44	$0.06	73.5	0.30
Jamba Instruct	AI21 Labs	256k		$0.55	77.4	0.52

^[1]

Terms

Artificial Analysis Quality Index: Average result across our evaluations covering different dimensions of model intelligence. Currently includes MMLU, GPQA, Math & HumanEval. OpenAI o1 model figures are preliminary and are based on figures stated by OpenAI. See methodology for more details.
Context window: Maximum number of combined input & output tokens. Output tokens commonly have a significantly lower limit (varied by model).
Output Speed: Tokens per second received while the model is generating tokens (ie. after first chunk has been received from the API for models which support streaming).
Latency: Time to first token of tokens received, in seconds, after API request sent. For models which do not support streaming, this represents time to receive the completion.
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).
Output Price: Price per token generated by the model (received from the API), represented as USD per million Tokens.
Input Price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.
Time period: Metrics are 'live' and are based on the past 14 days of measurements, measurements are taken 8 times a day for single requests and 2 times per day for parallel requests.

^[1]

References

↑ ^1.0 ^1.1 LLM Leaderboard - Comparison of GPT-4o, Llama 3, Mistral, Gemini and over 30 models https://artificialanalysis.ai/leaderboards/models

[”1”-1] 1.0 ^1.1 LLM Leaderboard - Comparison of GPT-4o, Llama 3, Mistral, Gemini and over 30 models https://artificialanalysis.ai/leaderboards/models

[1]

@@ Line 1: / Line 1: @@
-Compare different [[large language models]] ([[LLM]]s): [[#Concise Comparison|Concise comparison]], [[#Detailed Comparison|detailed comparison]], [[#Terms|terminology definitions]].
+{{see also|LLM Benchmarks Timeline|LLM Rankings}}
-__TOC__
+Compare different [[large language models]] ([[LLM]]s).
-==Concised Comparison==
 {| class="wikitable sortable"
 ! Model
@@ Line 199: / Line 199: @@
 | '''[[Jamba Instruct]]''' || [[AI21 Labs]] || 256k ||  || $0.55 || 77.4 || 0.52
 |}
+<ref name="”1”">LLM Leaderboard - Comparison of GPT-4o, Llama 3, Mistral, Gemini and over 30 models https://artificialanalysis.ai/leaderboards/models</ref>
-==Detailed Comparison==
-{| class="wikitable sortable"
-! Model
-! Creator
-! License
-! Context Window
-! Quality Index<br>(Normalized avg)
-! Chatbot Arena
-! MMLU
-! GPQA
-! MATH-500
-! HumanEval
-! Blended<br>(USD/1M Tokens)
-! Input Price<br>(USD/1M Tokens)
-! Output Price<br>(USD/1M Tokens)
-! Median<br>(Tokens/s)
-! P5<br>(Tokens/s)
-! P25<br>(Tokens/s)
-! P75<br>(Tokens/s)
-! P95<br>(Tokens/s)
-! Median<br>(First Chunk (s))
-! P5<br>(First Chunk (s))
-! P25<br>(First Chunk (s))
-! P75<br>(First Chunk (s))
-! P95<br>(First Chunk (s))
-! Further Analysis
-|-
-| '''[[o1-preview]]''' || OpenAI || Proprietary || 128k || 86 || 1334 || 0.91 || 0.67 || 0.92 || 0.96 || $27.56 || $15.75 || $63.00 || 143.8 || 68.9 || 121.6 || 164.6 || 179.6 || 21.28 || 13.40 || 17.04 || 27.80 || 46.49 || –
-|-
-| '''[[o1-mini]]''' || OpenAI || Proprietary || 128k || 84 || 1308 || 0.85 || 0.58 || 0.95 || 0.97 || $5.25 || $3.00 || $12.00 || 213.6 || 84.0 || 154.8 || 238.0 || 299.4 || 11.75 || 2.44 || 9.40 || 14.43 || 24.03 || –
-|-
-| '''[[GPT-4o (Aug '24)]]''' || OpenAI || Proprietary || 128k || 78 || 1337 || 0.89 || 0.51 || 0.80 || 0.93 || – || – || $4.38 || $2.50 || $10.00 || 85.6 || 40.3 || 61.5 || 109.3 || 143.6 || 0.66 || 0.33 || 0.43 || 0.91 || 1.92
-|-
-| '''[[GPT-4o (May '24)]]''' || OpenAI || Proprietary || 128k || 78 || 1285 || 0.87 || 0.51 || 0.79 || 0.93 || – || – || $7.50 || $5.00 || $15.00 || 106.8 || 53.2 || 82.2 || 126.8 || 142.5 || 0.65 || 0.32 || 0.43 || 0.73 || 1.22
-|-
-| '''[[GPT-4o mini]]''' || OpenAI || Proprietary || 128k || 73 || 1273 || 0.82 || 0.44 || 0.79 || 0.88 || – || – || $0.26 || $0.15 || $0.60 || 121.8 || 50.7 || 74.1 || 179.4 || 206.5 || 0.65 || 0.30 || 0.39 || 0.77 || 0.92
-|-
-| '''[[Claude 3.5 Sonnet (Oct)]]''' || Anthropic || Proprietary || 200k || 80 || 1282 || 0.89 || 0.58 || 0.76 || 0.96 || – || – || $6.00 || $3.00 || $15.00 || 71.8 || 37.6 || 44.8 || 78.0 || 89.6 || 0.98 || 0.68 || 0.78 || 1.36 || 2.23
-|-
-| '''[[Claude 3.5 Sonnet (June)]]''' || Anthropic || Proprietary || 200k || 76 || 1268 || 0.88 || 0.56 || 0.71 || 0.90 || – || – || $6.00 || $3.00 || $15.00 || 61.4 || 41.6 || 49.9 || 78.9 || 91.0 || 0.87 || 0.68 || 0.75 || 1.06 || 1.45
-|-
-| '''[[Claude 3.5 Haiku]]''' || Anthropic || Proprietary || 200k || 68 || – || 0.81 || 0.37 || 0.67 || 0.87 || – || – || $1.60 || $0.80 || $4.00 || 65.1 || 51.1 || 58.6 || 75.4 || 105.1 || 0.71 || 0.54 || 0.64 || 0.93 || 1.20
-|-
-| '''[[Llama 3.3 70B]]''' || Meta || Open || 128k || 74 || – || 0.86 || 0.49 || 0.76 || 0.86 || – || – || $0.67 || $0.59 || $0.73 || 67.2 || 23.6 || 31.2 || 275.7 || 2046.5 || 0.51 || 0.23 || 0.36 || 0.72 || 1.48
-|-
-| '''[[Llama 3.2 3B]]''' || Meta || Open || 128k || 49 || 1103 || 0.64 || 0.21 || 0.50 || 0.60 || – || – || $0.06 || $0.06 || $0.06 || 202.2 || 42.4 || 144.0 || 543.6 || 1623.1 || 0.38 || 0.15 || 0.26 || 0.49 || 0.93
-|-
-| '''[[Gemini 1.5 Flash (May)]]''' || Google || Proprietary || 1m || – || 1227 || 0.79 || 0.39 || 0.55 || – || – || – || $0.13 || $0.07 || $0.30 || 310.0 || 276.8 || 297.5 || 325.0 || 350.4 || 0.30 || 0.23 || 0.27 || 0.33 || 0.39
-|-
-| '''[[Nova Micro]]''' || Amazon || Proprietary || 130k || 66 || – || 0.76 || 0.38 || 0.69 || 0.80 || – || – || $0.06 || $0.04 || $0.14 || 195.8 || 170.9 || 186.0 || 208.3 || 219.5 || 0.33 || 0.30 || 0.32 || 0.35 || 0.39
-|-
-| '''[[DeepSeek-Coder-V2]]''' || DeepSeek || Open || 128k || 71 || 1178 || 0.80 || 0.42 || 0.74 || 0.87 || – || – || $0.17 || $0.14 || $0.28 || 64.4 || 51.8 || 57.3 || 71.4 || 81.1 || 1.12 || 0.84 || 0.99 || 1.27 || 1.71
-|-
-| '''[[Phi-4]]''' || Microsoft Azure || Open || 16k || 77 || – || 0.85 || 0.57 || 0.81 || 0.87 || – || – || $0.09 || $0.07 || $0.14 || 85.1 || 76.2 || 82.0 || 85.4 || 85.6 || 0.21 || 0.16 || 0.18 || 0.23 || 0.25
-|-
-| '''[[Reka Flash]]''' || Reka AI || Proprietary || 128k || 59 || – || 0.73 || 0.34 || 0.53 || 0.74 || – || – || $0.35 || $0.20 || $0.80 || – || – || – || – || – || – || – || – || – || –
-|-
-| '''[[OpenChat 3.5]]''' || OpenChat || Open || 8k || 44 || 1076 || 0.56 || 0.22 || 0.31 || 0.68 || – || – || $0.06 || $0.06 || $0.06 || 73.3 || 66.3 || 69.3 || 76.3 || 80.3 || 0.30 || 0.24 || 0.27 || 0.32 || 0.37
-|-
-| '''[[Jamba Instruct]]''' || AI21 Labs || Proprietary || 256k || – || – || 0.58 || 0.25 || – || – || – || – || $0.55 || $0.50 || $0.70 || 77.1 || 70.4 || 74.3 || 169.6 || 193.7 || 0.52 || 0.29 || 0.45 || 0.54 || 0.58
-|}
 ==Terms==
@@ Line 271: / Line 210: @@
 *'''Input Price''': Price per token included in the request/message sent to the API, represented as USD per million Tokens.
 *'''Time period''': Metrics are 'live' and are based on the past 14 days of measurements, measurements are taken 8 times a day for single requests and 2 times per day for parallel requests.
+<ref name="”1”">LLM Leaderboard - Comparison of GPT-4o, Llama 3, Mistral, Gemini and over 30 models https://artificialanalysis.ai/leaderboards/models</ref>
 ==References==
+<references />
+[[Category:Important]] [[Category:Rankings]] [[Category:Aggregate pages]]