GPT-4: Difference between revisions

Latest revision as of 18:56, 17 June 2023

See also: GPT-4 Plugins

Exams

Exam	GPT-4 Points	GPT-4 Percentile	GPT-4 (no vision) Points	GPT-4 (no vision) Percentile	GPT-3.5 Points	GPT-3.5 Percentile
Uniform Bar Exam (MBE+MEE+MPT)	298 / 400	~90th	298 / 400	~90th	213 / 400	~10th
LSAT	163	~88th	161	~83rd	149	~40th
SAT Evidence-Based Reading & Writing	710 / 800	~93rd	710 / 800	~93rd	670 / 800	~87th
SAT Math	700 / 800	~89th	690 / 800	~89th	590 / 800	~70th
Graduate Record Examination (GRE) Quantitative	163 / 170	~80th	157 / 170	~62nd	147 / 170	~25th
Graduate Record Examination (GRE) Verbal	169 / 170	~99th	165 / 170	~96th	154 / 170	~63rd
Graduate Record Examination (GRE) Writing	4 / 6	~54th	4 / 6	~54th	4 / 6	~54th
USABO Semifinal Exam 2020	87 / 150	99th–100th	87 / 150	99th–100th	43 / 150	31st–33rd
USNCO Local Section Exam 2022	36 / 60		38 / 60		24 / 60
Medical Knowledge Self-Assessment Program	75%		75%		53%
Codeforces Rating	392	below 5th	392	below 5th	260	below 5th
AP Art History	5	86th–100th	5	86th–100th	5	86th–100th
AP Biology	5	85th–100th	5	85th–100th	4	62nd–85th
AP Calculus BC	4	43rd–59th	4	43rd–59th	1	0th–7th

Benchmark	GPT-4	Evaluated few-shot	GPT-3.5	Evaluated few-shot	LM SOTA	Best external LM evaluated few-shot	SOTA	Best external model (includes benchmark-specific training)
MMLU	86.4%	5-shot	70.0%	5-shot	70.7%	5-shot U-PaLM	75.2%	5-shot Flan-PaLM
HellaSwag	95.3%	10-shot	85.5%	10-shot	84.2%	LLaMA (validation set)	85.6%	ALUM
AI2 Reasoning Challenge (ARC)	96.3%	25-shot	85.2%	25-shot	84.2%	8-shot PaLM	85.6%	ST-MOE
WinoGrande	87.5%	5-shot	81.6%	5-shot	84.2%	5-shot PaLM	85.6%	5-shot PaLM
HumanEval	67.0%	0-shot	48.1%	0-shot	26.2%	0-shot PaLM	65.8%	CodeT + GPT-3.5
DROP (f1 score)	80.9	3-shot	64.1	3-shot	70.8	1-shot PaLM	88.4

Benchmarks (Visual)

Benchmark	GPT-4	Evaluated few-shot	Few-shot SOTA	Few-shot SOTA model	SOTA	Best external model (includes benchmark-specific training)
VQAv2	77.2%	0-shot	67.6%	Flamingo 32-shot	84.3%	PaLI-17B
TextVQA	78.0%	0-shot	37.9%	Flamingo 32-shot	71.8%	PaLI-17B
ChartQA	78.5%	-	-	-	58.6%	Pix2Struct Large
AI2 Diagram (AI2D)	78.2%	0-shot	-	-	42.1%	Pix2Struct Large
DocVQA	88.4%	0-shot (pixel-only)	-	-	88.4%	ERNIE-Layout 2.0
Infographic VQA	75.1%	0-shot (pixel-only)	-	-	61.2%	Applica.ai TILT
TVQA	87.3%	0-shot	-	-	86.5%	MERLOT Reserve Large
LSMDC	45.7%	0-shot	31.0%	MERLOT Reserve 0-shot	52.9%	MERLOT

@@ Line 1: / Line 1: @@
+{{see also|GPT-4 Plugins}}
 ==Exams==
 {| class="wikitable"
@@ Line 193: / Line 194: @@
 | 1-shot PaLM
 | 88.4
+|}
+==Benchmarks (Visual)==
+{| class="wikitable"
+! Benchmark
+! GPT-4
+! Evaluated few-shot
+! Few-shot SOTA
+! SOTA
+! Best external model (includes benchmark-specific training)
+|-
+| VQAv2
+| 77.2%
+| 0-shot
+| 67.6%
+| Flamingo 32-shot
+| 84.3%
+| PaLI-17B
+|-
+| TextVQA
+| 78.0%
+| 0-shot
+| 37.9%
+| Flamingo 32-shot
+| 71.8%
+| PaLI-17B
+|-
+| ChartQA
+| 78.5%
+| -
+| 58.6%
+| Pix2Struct Large
+| -
+|-
+| AI2 Diagram (AI2D)
+| 78.2%
+| 0-shot
+| -
+| 42.1%
+| Pix2Struct Large
+| -
+|-
+| DocVQA
+| 88.4%
+| 0-shot (pixel-only)
+| -
+| 88.4%
+| ERNIE-Layout 2.0
+| -
+|-
+| Infographic VQA
+| 75.1%
+| 0-shot (pixel-only)
+| -
+| 61.2%
+| Applica.ai TILT
+| -
+|-
+| TVQA
+| 87.3%
+| 0-shot
+| -
+| 86.5%
+| MERLOT Reserve Large
+| -
+|-
+| LSMDC
+| 45.7%
+| 0-shot
+| 31.0%
+| MERLOT Reserve 0-shot
+| 52.9%
+| MERLOT
 |}