GPT-4: Difference between revisions

From AI Wiki
No edit summary
No edit summary
 
(One intermediate revision by one other user not shown)
Line 1: Line 1:
{{see also|GPT-4 Plugins}}
==Exams==
==Exams==
{| class="wikitable"
{| class="wikitable"
Line 193: Line 194:
| 1-shot PaLM
| 1-shot PaLM
| 88.4
| 88.4
|}
==Benmarks (Visual)==
{| class="wikitable"
! Benchmark
! GPT-4
! Evaluated few-shot
! Few-shot SOTA
! SOTA
! Best external model (includes benchmark-specific training)
|-
| VQAv2
| 77.2%
| 0-shot
| 67.6%
| Flamingo 32-shot
| 84.3%
| PaLI-17B
|-
| TextVQA
| 78.0%
| 0-shot
| 37.9%
| Flamingo 32-shot
| 71.8%
| PaLI-17B
|-
| ChartQA
| 78.5%A
| -
| 58.6%
| Pix2Struct Large
| -
|-
| AI2 Diagram (AI2D)
| 78.2%
| 0-shot
| -
| 42.1%
| Pix2Struct Large
| -
|-
| DocVQA
| 88.4%
| 0-shot (pixel-only)
| -
| 88.4%
| ERNIE-Layout 2.0
| -
|-
| Infographic VQA
| 75.1%
| 0-shot (pixel-only)
| -
| 61.2%
| Applica.ai TILT
| -
|-
| TVQA
| 87.3%
| 0-shot
| -
| 86.5%
| MERLOT Reserve Large
| -
|-
| LSMDC
| 45.7%
| 0-shot
| 31.0%
| MERLOT Reserve 0-shot
| 52.9%
| MERLOT
|}
|}

Latest revision as of 18:56, 17 June 2023

See also: GPT-4 Plugins

Exams

Exam GPT-4 Points GPT-4 Percentile GPT-4 (no vision) Points GPT-4 (no vision) Percentile GPT-3.5 Points GPT-3.5 Percentile
Uniform Bar Exam (MBE+MEE+MPT)1 298 / 400 ~90th 298 / 400 ~90th 213 / 400 ~10th
LSAT 163 ~88th 161 ~83rd 149 ~40th
SAT Evidence-Based Reading & Writing 710 / 800 ~93rd 710 / 800 ~93rd 670 / 800 ~87th
SAT Math 700 / 800 ~89th 690 / 800 ~89th 590 / 800 ~70th
Graduate Record Examination (GRE) Quantitative 163 / 170 ~80th 157 / 170 ~62nd 147 / 170 ~25th
Graduate Record Examination (GRE) Verbal 169 / 170 ~99th 165 / 170 ~96th 154 / 170 ~63rd
Graduate Record Examination (GRE) Writing 4 / 6 ~54th 4 / 6 ~54th 4 / 6 ~54th
USABO Semifinal Exam 2020 87 / 150 99th–100th 87 / 150 99th–100th 43 / 150 31st–33rd
USNCO Local Section Exam 2022 36 / 60 38 / 60 24 / 60
Medical Knowledge Self-Assessment Program 75% 75% 53%
Codeforces Rating 392 below 5th 392 below 5th 260 below 5th
AP Art History 5 86th–100th 5 86th–100th 5 86th–100th
AP Biology 5 85th–100th 5 85th–100th 4 62nd–85th
AP Calculus BC 4 43rd–59th 4 43rd–59th 1 0th–7th

Benchmarks

Benchmark GPT-4 Evaluated few-shot GPT-3.5 Evaluated few-shot LM SOTA Best external LM evaluated few-shot SOTA Best external model (includes benchmark-specific training)
MMLU 86.4% 5-shot 70.0% 5-shot 70.7% 5-shot U-PaLM 75.2% 5-shot Flan-PaLM
HellaSwag 95.3% 10-shot 85.5% 10-shot 84.2% LLAMA (validation set) 85.6% ALUM
AI2 Reasoning Challenge (ARC) 96.3% 25-shot 85.2% 25-shot 84.2% 8-shot PaLM 85.6% ST-MOE
WinoGrande 87.5% 5-shot 81.6% 5-shot 84.2% 5-shot PALM 85.6% 5-shot PALM
HumanEval 67.0% 0-shot 48.1% 0-shot 26.2% 0-shot PaLM 65.8% CodeT + GPT-3.5
DROP (f1 score) 80.9 3-shot 64.1 3-shot 70.8 1-shot PaLM 88.4

Benmarks (Visual)

Benchmark GPT-4 Evaluated few-shot Few-shot SOTA SOTA Best external model (includes benchmark-specific training)
VQAv2 77.2% 0-shot 67.6% Flamingo 32-shot 84.3% PaLI-17B
TextVQA 78.0% 0-shot 37.9% Flamingo 32-shot 71.8% PaLI-17B
ChartQA 78.5%A - 58.6% Pix2Struct Large -
AI2 Diagram (AI2D) 78.2% 0-shot - 42.1% Pix2Struct Large -
DocVQA 88.4% 0-shot (pixel-only) - 88.4% ERNIE-Layout 2.0 -
Infographic VQA 75.1% 0-shot (pixel-only) - 61.2% Applica.ai TILT -
TVQA 87.3% 0-shot - 86.5% MERLOT Reserve Large -
LSMDC 45.7% 0-shot 31.0% MERLOT Reserve 0-shot 52.9% MERLOT