@@ Line 1: / Line 1: @@
+{{see also|GPT-4 Plugins}}
 ==Exams==
 {| class="wikitable"
 ! Exam
-! GPT-4 Estimated Percentile
+! GPT-4 Points
-! GPT-4 (no vision) Estimated Percentile
+! GPT-4 Percentile
-! GPT-3.5 Estimated Percentile
+! GPT-4 (no vision) Points
+! GPT-4 (no vision) Percentile
+! GPT-3.5 Points
+! GPT-3.5 Percentile
 |-
 | Uniform Bar Exam (MBE+MEE+MPT)
-| 298 / 400 (~90th)
+| 298 / 400
-| 298 / 400 (~90th)
+| ~90th
-| 213 / 400 (~10th)
+| 298 / 400
+| ~90th
+| 213 / 400
+| ~10th
 |-
 | LSAT
-| 163 (~88th)
+| 163
-| 161 (~83rd)
+| ~88th
-| 149 (~40th)
+| 161
+| ~83rd
+| 149
+| ~40th
 |-
 | SAT Evidence-Based Reading & Writing
-| 710 / 800 (~93rd)
+| 710 / 800
-| 710 / 800 (~93rd)
+| ~93rd
-| 670 / 800 (~87th)
+| 710 / 800
+| ~93rd
+| 670 / 800
+| ~87th
 |-
 | SAT Math
-| 700 / 800 (~89th)
+| 700 / 800
-| 690 / 800 (~89th)
+| ~89th
-| 590 / 800 (~70th)
+| 690 / 800
+| ~89th
+| 590 / 800
+| ~70th
 |-
 | Graduate Record Examination (GRE) Quantitative
-| 163 / 170 (~80th)
+| 163 / 170
-| 157 / 170 (~62nd)
+| ~80th
-| 147 / 170 (~25th)
+| 157 / 170
+| ~62nd
+| 147 / 170
+| ~25th
 |-
 | Graduate Record Examination (GRE) Verbal
-| 169 / 170 (~99th)
+| 169 / 170
-| 165 / 170 (~96th)
+| ~99th
-| 154 / 170 (~63rd)
+| 165 / 170
+| ~96th
+| 154 / 170
+| ~63rd
 |-
 | Graduate Record Examination (GRE) Writing
-| 4 / 6 (~54th)
+| 4 / 6
-| 4 / 6 (~54th)
+| ~54th
-| 4 / 6 (~54th)
+| 4 / 6
+| ~54th
+| 4 / 6
+| ~54th
 |-
 | USABO Semifinal Exam 2020
-| 87 / 150 (99th–100th)
+| 87 / 150
-| 87 / 150 (99th–100th)
+| 99th–100th
-| 43 / 150 (31st–33rd)
+| 87 / 150
+| 99th–100th
+| 43 / 150
+| 31st–33rd
 |-
 | USNCO Local Section Exam 2022
 | 36 / 60
+|
 | 38 / 60
+|
 | 24 / 60
+|
 |-
 | Medical Knowledge Self-Assessment Program
 | 75%
+|
 | 75%
+|
 | 53%
+|
 |-
 | Codeforces Rating
-| 392 (below 5th)
+| 392
-| 392 (below 5th)
+| below 5th
-| 260 (below 5th)
+| 392
+| below 5th
+| 260
+| below 5th
 |-
 | AP Art History
-| 5 (86th–100th)
+| 5
-| 5 (86th–100th)
+| 86th–100th
-| 5 (86th–100th)
+| 5
+| 86th–100th
+| 5
+| 86th–100th
 |-
 | AP Biology
-| 5 (85th–100th)
+| 5
-| 5 (85th–100th)
+| 85th–100th
-| 4 (62nd–85th)
+| 5
+| 85th–100th
+| 4
+| 62nd–85th
 |-
 | AP Calculus BC
-| 4 (43rd–59th)
+| 4
-| 4 (43rd–59th)
+| 43rd–59th
-| 1 (0th–7th)
+| 4
+| 43rd–59th
+| 1
+| 0th–7th
 |-
-| AP Chemistry
+|}
-| 4 (71st–88th)
-| 4 (71st–88th)
+==Benchmarks==
-| 2 (22nd–46th)
+{| class="wikitable"
+! Benchmark
+! GPT-4
+! Evaluated few-shot
+! GPT-3.5
+! Evaluated few-shot
+! LM SOTA
+! Best external LM evaluated few-shot
+! SOTA
+! Best external model (includes benchmark-specific training)
+|-
+| MMLU
+| 86.4%
+| 5-shot
+| 70.0%
+| 5-shot
+| 70.7%
+| 5-shot U-PaLM
+| 75.2%
+| 5-shot Flan-PaLM
+|-
+| HellaSwag
+| 95.3%
+| 10-shot
+| 85.5%
+| 10-shot
+| 84.2%
+| LLaMA (validation set)
+| 85.6%
+| ALUM
+|-
+| AI2 Reasoning Challenge (ARC)
+| 96.3%
+| 25-shot
+| 85.2%
+| 25-shot
+| 85.2%
+| 8-shot PaLM
+| 86.5%
+| ST-MOE
+|-
+| WinoGrande
+| 87.5%
+| 5-shot
+| 81.6%
+| 5-shot
+| 85.1%
+| 5-shot PaLM
+| 85.1%
+| 5-shot PaLM
+|-
+| HumanEval
+| 67.0%
+| 0-shot
+| 48.1%
+| 0-shot
+| 26.2%
+| 0-shot PaLM
+| 65.8%
+| CodeT + GPT-3.5
+|-
+| DROP (F1 score)
+| 80.9
+| 3-shot
+| 64.1
+| 3-shot
+| 70.8
+| 1-shot PaLM
+| 88.4
+| QDGAT
+|}
+==Benchmarks (Visual)==
+{| class="wikitable"
+! Benchmark
+! GPT-4
+! Evaluated few-shot
+! Few-shot SOTA
+! Best external model evaluated few-shot
+! SOTA
+! Best external model (includes benchmark-specific training)
+|-
+| VQAv2
+| 77.2%
+| 0-shot
+| 67.6%
+| Flamingo 32-shot
+| 84.3%
+| PaLI-17B
+|-
+| TextVQA
+| 78.0%
+| 0-shot
+| 37.9%
+| Flamingo 32-shot
+| 71.8%
+| PaLI-17B
+|-
+| ChartQA
+| 78.5%
+| -
+| -
+| -
+| 58.6%
+| Pix2Struct Large
+|-
+| AI2 Diagram (AI2D)
+| 78.2%
+| 0-shot
+| -
+| -
+| 42.1%
+| Pix2Struct Large
+|-
+| DocVQA
+| 88.4%
+| 0-shot (pixel-only)
+| -
+| -
+| 88.4%
+| ERNIE-Layout 2.0
 |-
-| AP English Language and Composition
+| Infographic VQA
-| 2 (14th–44th)
+| 75.1%
-| 2 (14th–44th)
+| 0-shot (pixel-only)
-| 2 (14th–44th)
+| -
+| -
+| 61.2%
+| Applica.ai TILT
 |-
-| AP English Literature and Composition
+| TVQA
-| 2 (8th–22nd)
+| 87.3%
-| 2 (8th–22nd)
+| 0-shot
-| 2 (8th–22nd)
+| -
+| -
+| 86.5%
+| MERLOT Reserve Large
 |-
+| LSMDC
+| 45.7%
+| 0-shot
+| 31.0%
+| MERLOT Reserve 0-shot
+| 52.9%
+| MERLOT
 |}

GPT-4: Difference between revisions

Revision as of 18:56, 17 June 2023