GPT-4: Difference between revisions

1,377 bytes added ,  17 June 2023
no edit summary
(Created page with "==Exams== {| class="wikitable" ! Exam ! GPT-4 Estimated Percentile ! GPT-4 (no vision) Estimated Percentile ! GPT-3.5 Estimated Percentile |- | Uniform Bar Exam (MBE+MEE+MPT)1 | 298 / 400 (~90th) | 298 / 400 (~90th) | 213 / 400 (~10th) |- | LSAT | 163 (~88th) | 161 (~83rd) | 149 (~40th) |- | SAT Evidence-Based Reading & Writing | 710 / 800 (~93rd) | 710 / 800 (~93rd) | 670 / 800 (~87th) |- | SAT Math | 700 / 800 (~89th) | 690 / 800 (~89th)...")
 
No edit summary
 
(3 intermediate revisions by one other user not shown)
Line 1: Line 1:
{{see also|GPT-4 Plugins}}
==Exams==
==Exams==
{| class="wikitable"
{| class="wikitable"
! Exam
! Exam
! GPT-4 Estimated Percentile
! GPT-4 Points
! GPT-4 (no vision) Estimated Percentile
! GPT-4 Percentile
! GPT-3.5 Estimated Percentile
! GPT-4 (no vision) Points
! GPT-4 (no vision) Percentile
! GPT-3.5 Points
! GPT-3.5 Percentile
|-
|-
| Uniform Bar Exam (MBE+MEE+MPT)1
| Uniform Bar Exam (MBE+MEE+MPT)1
| 298 / 400 (~90th)
| 298 / 400
| 298 / 400 (~90th)
| ~90th
| 213 / 400 (~10th)
| 298 / 400
| ~90th
| 213 / 400
| ~10th
|-
|-
| LSAT
| LSAT
| 163 (~88th)
| 163
| 161 (~83rd)
| ~88th
| 149 (~40th)
| 161
| ~83rd
| 149
| ~40th
|-
|-
| SAT Evidence-Based Reading & Writing
| SAT Evidence-Based Reading & Writing
| 710 / 800 (~93rd)
| 710 / 800
| 710 / 800 (~93rd)
| ~93rd
| 670 / 800 (~87th)
| 710 / 800
| ~93rd
| 670 / 800
| ~87th
|-
|-
| SAT Math
| SAT Math
| 700 / 800 (~89th)
| 700 / 800
| 690 / 800 (~89th)
| ~89th
| 590 / 800 (~70th)
| 690 / 800
| ~89th
| 590 / 800
| ~70th
|-
|-
| Graduate Record Examination (GRE) Quantitative
| Graduate Record Examination (GRE) Quantitative
| 163 / 170 (~80th)
| 163 / 170
| 157 / 170 (~62nd)
| ~80th
| 147 / 170 (~25th)
| 157 / 170
| ~62nd
| 147 / 170
| ~25th
|-
|-
| Graduate Record Examination (GRE) Verbal
| Graduate Record Examination (GRE) Verbal
| 169 / 170 (~99th)
| 169 / 170
| 165 / 170 (~96th)
| ~99th
| 154 / 170 (~63rd)
| 165 / 170
| ~96th
| 154 / 170
| ~63rd
|-
|-
| Graduate Record Examination (GRE) Writing
| Graduate Record Examination (GRE) Writing
| 4 / 6 (~54th)
| 4 / 6
| 4 / 6 (~54th)
| ~54th
| 4 / 6 (~54th)
| 4 / 6
| ~54th
| 4 / 6
| ~54th
|-
|-
| USABO Semifinal Exam 2020
| USABO Semifinal Exam 2020
| 87 / 150 (99th–100th)
| 87 / 150
| 87 / 150 (99th–100th)
| 99th–100th
| 43 / 150 (31st–33rd)
| 87 / 150
| 99th–100th
| 43 / 150
| 31st–33rd
|-
|-
| USNCO Local Section Exam 2022
| USNCO Local Section Exam 2022
| 36 / 60
| 36 / 60
|
| 38 / 60
| 38 / 60
|
| 24 / 60
| 24 / 60
|
|-
|-
| Medical Knowledge Self-Assessment Program
| Medical Knowledge Self-Assessment Program
| 75%
| 75%
|
| 75%
| 75%
|
| 53%
| 53%
|
|-
|-
| Codeforces Rating
| Codeforces Rating
| 392 (below 5th)
| 392
| 392 (below 5th)
| below 5th
| 260 (below 5th)
| 392
| below 5th
| 260
| below 5th
|-
|-
| AP Art History
| AP Art History
| 5 (86th–100th)
| 5
| 5 (86th–100th)
| 86th–100th
| 5 (86th–100th)
| 5
| 86th–100th
| 5
| 86th–100th
|-
|-
| AP Biology
| AP Biology
| 5 (85th–100th)
| 5
| 5 (85th–100th)
| 85th–100th
| 4 (62nd–85th)
| 5
| 85th–100th
| 4
| 62nd–85th
|-
|-
| AP Calculus BC
| AP Calculus BC
| 4 (43rd–59th)
| 4
| 4 (43rd–59th)
| 43rd–59th
| 1 (0th–7th)
| 4
| 43rd–59th
| 1
| 0th–7th
|-
|-
| AP Chemistry
|}
| 4 (71st–88th)
 
| 4 (71st–88th)
==Benchmarks==
| 2 (22nd–46th)
{| class="wikitable"
! Benchmark
! GPT-4
! Evaluated few-shot
! GPT-3.5
! Evaluated few-shot
! LM SOTA
! Best external LM evaluated few-shot
! SOTA
! Best external model (includes benchmark-specific training)
|-
| MMLU
| 86.4%
| 5-shot
| 70.0%
| 5-shot
| 70.7%
| 5-shot U-PaLM
| 75.2%
| 5-shot Flan-PaLM
|-
| HellaSwag
| 95.3%
| 10-shot
| 85.5%
| 10-shot
| 84.2%
| LLAMA (validation set)
| 85.6%
| ALUM
|-
| AI2 Reasoning Challenge (ARC)
| 96.3%
| 25-shot
| 85.2%
| 25-shot
| 84.2%
| 8-shot PaLM
| 85.6%
| ST-MOE
|-
| WinoGrande
| 87.5%
| 5-shot
| 81.6%
| 5-shot
| 84.2%
| 5-shot PALM
| 85.6%
| 5-shot PALM
|-
| HumanEval
| 67.0%
| 0-shot
| 48.1%
| 0-shot
| 26.2%
| 0-shot PaLM
| 65.8%
| CodeT + GPT-3.5
|-
| DROP (f1 score)
| 80.9
| 3-shot
| 64.1
| 3-shot
| 70.8
| 1-shot PaLM
| 88.4
|}
 
==Benmarks (Visual)==
{| class="wikitable"
! Benchmark
! GPT-4
! Evaluated few-shot
! Few-shot SOTA
! SOTA
! Best external model (includes benchmark-specific training)
|-
| VQAv2
| 77.2%
| 0-shot
| 67.6%
| Flamingo 32-shot
| 84.3%
| PaLI-17B
|-
| TextVQA
| 78.0%
| 0-shot
| 37.9%
| Flamingo 32-shot
| 71.8%
| PaLI-17B
|-
| ChartQA
| 78.5%A
| -
| 58.6%
| Pix2Struct Large
| -
|-
| AI2 Diagram (AI2D)
| 78.2%
| 0-shot
| -
| 42.1%
| Pix2Struct Large
| -
|-
| DocVQA
| 88.4%
| 0-shot (pixel-only)
| -
| 88.4%
| ERNIE-Layout 2.0
| -
|-
|-
| AP English Language and Composition
| Infographic VQA
| 2 (14th–44th)
| 75.1%
| 2 (14th–44th)
| 0-shot (pixel-only)
| 2 (14th–44th)
| -
| 61.2%
| Applica.ai TILT
| -
|-
|-
| AP English Literature and Composition
| TVQA
| 2 (8th–22nd)
| 87.3%
| 2 (8th–22nd)
| 0-shot
| 2 (8th–22nd)
| -
| 86.5%
| MERLOT Reserve Large
| -
|-
|-
| LSMDC
| 45.7%
| 0-shot
| 31.0%
| MERLOT Reserve 0-shot
| 52.9%
| MERLOT
|}
|}
1,151

edits