GPT-4: Difference between revisions

Revision as of 11:44, 20 March 2023

Exams

Exam	GPT-4 Points	GPT-4 Percentile	GPT-4 (no vision) Points	GPT-4 (no vision) Percentile	GPT-3.5 Points	GPT-3.5 Percentile
Uniform Bar Exam (MBE+MEE+MPT)	298 / 400	~90th	298 / 400	~90th	213 / 400	~10th
LSAT	163	~88th	161	~83rd	149	~40th
SAT Evidence-Based Reading & Writing	710 / 800	~93rd	710 / 800	~93rd	670 / 800	~87th
SAT Math	700 / 800	~89th	690 / 800	~89th	590 / 800	~70th
Graduate Record Examination (GRE) Quantitative	163 / 170	~80th	157 / 170	~62nd	147 / 170	~25th
Graduate Record Examination (GRE) Verbal	169 / 170	~99th	165 / 170	~96th	154 / 170	~63rd
Graduate Record Examination (GRE) Writing	4 / 6	~54th	4 / 6	~54th	4 / 6	~54th
USABO Semifinal Exam 2020	87 / 150	99th–100th	87 / 150	99th–100th	43 / 150	31st–33rd
USNCO Local Section Exam 2022	36 / 60		38 / 60		24 / 60
Medical Knowledge Self-Assessment Program	75%		75%		53%
Codeforces Rating	392	below 5th	392	below 5th	260	below 5th
AP Art History	5	86th–100th	5	86th–100th	5	86th–100th
AP Biology	5	85th–100th	5	85th–100th	4	62nd–85th
AP Calculus BC	4	43rd–59th	4	43rd–59th	1	0th–7th

Benchmarks

Benchmark	GPT-4	Evaluated few-shot	GPT-3.5	Evaluated few-shot	LM SOTA	Best external LM evaluated few-shot	SOTA	Best external model (includes benchmark-specific training)
MMLU	86.4%	5-shot	70.0%	5-shot	70.7%	5-shot U-PaLM	75.2%	5-shot Flan-PaLM
HellaSwag	95.3%	10-shot	85.5%	10-shot	84.2%	LLaMA (validation set)	85.6%	ALUM
AI2 Reasoning Challenge (ARC)	96.3%	25-shot	85.2%	25-shot	85.2%	8-shot PaLM	86.5%	ST-MOE
WinoGrande	87.5%	5-shot	81.6%	5-shot	85.1%	5-shot PaLM	85.1%	5-shot PaLM
HumanEval	67.0%	0-shot	48.1%	0-shot	26.2%	0-shot PaLM	65.8%	CodeT + GPT-3.5
DROP (F1 score)	80.9	3-shot	64.1	3-shot	70.8	1-shot PaLM	88.4	QDGAT

@@ Line 121: / Line 121: @@
 | 0th–7th
 |-
+|}
+==Benchmarks==
+{| class="wikitable"
+! Benchmark
+! GPT-4
+! Evaluated few-shot
+! GPT-3.5
+! Evaluated few-shot
+! LM SOTA
+! Best external LM evaluated few-shot
+! SOTA
+! Best external model (includes benchmark-specific training)
+|-
+| MMLU
+| 86.4%
+| 5-shot
+| 70.0%
+| 5-shot
+| 70.7%
+| 5-shot U-PaLM
+| 75.2%
+| 5-shot Flan-PaLM
+|-
+| HellaSwag
+| 95.3%
+| 10-shot
+| 85.5%
+| 10-shot
+| 84.2%
+| LLaMA (validation set)
+| 85.6%
+| ALUM
+|-
+| AI2 Reasoning Challenge (ARC)
+| 96.3%
+| 25-shot
+| 85.2%
+| 25-shot
+| 85.2%
+| 8-shot PaLM
+| 86.5%
+| ST-MOE
+|-
+| WinoGrande
+| 87.5%
+| 5-shot
+| 81.6%
+| 5-shot
+| 85.1%
+| 5-shot PaLM
+| 85.1%
+| 5-shot PaLM
+|-
+| HumanEval
+| 67.0%
+| 0-shot
+| 48.1%
+| 0-shot
+| 26.2%
+| 0-shot PaLM
+| 65.8%
+| CodeT + GPT-3.5
+|-
+| DROP (F1 score)
+| 80.9
+| 3-shot
+| 64.1
+| 3-shot
+| 70.8
+| 1-shot PaLM
+| 88.4
+| QDGAT
 |}