GPT-4: Difference between revisions
Revision as of 11:44, 20 March 2023
Exams
Exam | GPT-4 Points | GPT-4 Percentile | GPT-4 (no vision) Points | GPT-4 (no vision) Percentile | GPT-3.5 Points | GPT-3.5 Percentile |
---|---|---|---|---|---|---|
Uniform Bar Exam (MBE+MEE+MPT)1 | 298 / 400 | ~90th | 298 / 400 | ~90th | 213 / 400 | ~10th |
LSAT | 163 | ~88th | 161 | ~83rd | 149 | ~40th |
SAT Evidence-Based Reading & Writing | 710 / 800 | ~93rd | 710 / 800 | ~93rd | 670 / 800 | ~87th |
SAT Math | 700 / 800 | ~89th | 690 / 800 | ~89th | 590 / 800 | ~70th |
Graduate Record Examination (GRE) Quantitative | 163 / 170 | ~80th | 157 / 170 | ~62nd | 147 / 170 | ~25th |
Graduate Record Examination (GRE) Verbal | 169 / 170 | ~99th | 165 / 170 | ~96th | 154 / 170 | ~63rd |
Graduate Record Examination (GRE) Writing | 4 / 6 | ~54th | 4 / 6 | ~54th | 4 / 6 | ~54th |
USABO Semifinal Exam 2020 | 87 / 150 | 99th–100th | 87 / 150 | 99th–100th | 43 / 150 | 31st–33rd |
USNCO Local Section Exam 2022 | 36 / 60 | | 38 / 60 | | 24 / 60 | |
Medical Knowledge Self-Assessment Program | 75% | | 75% | | 53% | |
Codeforces Rating | 392 | below 5th | 392 | below 5th | 260 | below 5th |
AP Art History | 5 | 86th–100th | 5 | 86th–100th | 5 | 86th–100th |
AP Biology | 5 | 85th–100th | 5 | 85th–100th | 4 | 62nd–85th |
AP Calculus BC | 4 | 43rd–59th | 4 | 43rd–59th | 1 | 0th–7th |
Benchmarks
Benchmark | GPT-4 | GPT-4 evaluation setting | GPT-3.5 | GPT-3.5 evaluation setting | LM SOTA (best external LM evaluated few-shot) | LM SOTA model and setting | SOTA (best external model, includes benchmark-specific training) | SOTA model and setting |
---|---|---|---|---|---|---|---|---|
MMLU | 86.4% | 5-shot | 70.0% | 5-shot | 70.7% | 5-shot U-PaLM | 75.2% | 5-shot Flan-PaLM |
HellaSwag | 95.3% | 10-shot | 85.5% | 10-shot | 84.2% | LLaMA (validation set) | 85.6% | ALUM |
AI2 Reasoning Challenge (ARC) | 96.3% | 25-shot | 85.2% | 25-shot | 85.2% | 8-shot PaLM | 86.5% | ST-MOE |
WinoGrande | 87.5% | 5-shot | 81.6% | 5-shot | 85.1% | 5-shot PaLM | 85.1% | 5-shot PaLM |
HumanEval | 67.0% | 0-shot | 48.1% | 0-shot | 26.2% | 0-shot PaLM | 65.8% | CodeT + GPT-3.5 |
DROP (F1 score) | 80.9 | 3-shot | 64.1 | 3-shot | 70.8 | 1-shot PaLM | 88.4 | |
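
The gap between the GPT-4 and GPT-3.5 columns can be read off with simple subtraction. The following is a minimal Python sketch of that arithmetic, not part of the original revision: the scores are copied from the table above, while the names `SCORES` and `point_gains` are hypothetical and introduced only for illustration. Note that the evaluation settings differ between benchmarks, so the gains are not directly comparable across rows.

```python
# Scores copied from the GPT-4 and GPT-3.5 columns of the Benchmarks table.
# DROP is an F1 score; the other rows are accuracy percentages.
# SCORES and point_gains are illustrative names, not from the source page.
SCORES = {
    "MMLU": (86.4, 70.0),
    "HellaSwag": (95.3, 85.5),
    "AI2 Reasoning Challenge (ARC)": (96.3, 85.2),
    "WinoGrande": (87.5, 81.6),
    "HumanEval": (67.0, 48.1),
    "DROP": (80.9, 64.1),
}


def point_gains(scores):
    """Return {benchmark: GPT-4 score minus GPT-3.5 score}, rounded to 0.1."""
    return {
        name: round(gpt4 - gpt35, 1)
        for name, (gpt4, gpt35) in scores.items()
    }


if __name__ == "__main__":
    # List benchmarks from the largest gain to the smallest.
    ranked = sorted(point_gains(SCORES).items(), key=lambda kv: -kv[1])
    for name, gain in ranked:
        print(f"{name}: +{gain:.1f} points")
```

On these figures the largest absolute gain is on HumanEval and the smallest on WinoGrande.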