GPT-4
- See also: GPT-4 Plugins
Exams
Exam | GPT-4 Points | GPT-4 Percentile | GPT-4 (no vision) Points | GPT-4 (no vision) Percentile | GPT-3.5 Points | GPT-3.5 Percentile |
---|---|---|---|---|---|---|
Uniform Bar Exam (MBE+MEE+MPT) | 298 / 400 | ~90th | 298 / 400 | ~90th | 213 / 400 | ~10th |
LSAT | 163 | ~88th | 161 | ~83rd | 149 | ~40th |
SAT Evidence-Based Reading & Writing | 710 / 800 | ~93rd | 710 / 800 | ~93rd | 670 / 800 | ~87th |
SAT Math | 700 / 800 | ~89th | 690 / 800 | ~89th | 590 / 800 | ~70th |
Graduate Record Examination (GRE) Quantitative | 163 / 170 | ~80th | 157 / 170 | ~62nd | 147 / 170 | ~25th |
Graduate Record Examination (GRE) Verbal | 169 / 170 | ~99th | 165 / 170 | ~96th | 154 / 170 | ~63rd |
Graduate Record Examination (GRE) Writing | 4 / 6 | ~54th | 4 / 6 | ~54th | 4 / 6 | ~54th |
USABO Semifinal Exam 2020 | 87 / 150 | 99th–100th | 87 / 150 | 99th–100th | 43 / 150 | 31st–33rd |
USNCO Local Section Exam 2022 | 36 / 60 | - | 38 / 60 | - | 24 / 60 | - |
Medical Knowledge Self-Assessment Program | 75% | - | 75% | - | 53% | - |
Codeforces Rating | 392 | below 5th | 392 | below 5th | 260 | below 5th |
AP Art History | 5 | 86th–100th | 5 | 86th–100th | 5 | 86th–100th |
AP Biology | 5 | 85th–100th | 5 | 85th–100th | 4 | 62nd–85th |
AP Calculus BC | 4 | 43rd–59th | 4 | 43rd–59th | 1 | 0th–7th |
Benchmarks
Benchmark | GPT-4 | Evaluated few-shot | GPT-3.5 | Evaluated few-shot | LM SOTA | Best external LM evaluated few-shot | SOTA | Best external model (includes benchmark-specific training) |
---|---|---|---|---|---|---|---|---|
MMLU | 86.4% | 5-shot | 70.0% | 5-shot | 70.7% | 5-shot U-PaLM | 75.2% | 5-shot Flan-PaLM |
HellaSwag | 95.3% | 10-shot | 85.5% | 10-shot | 84.2% | LLaMA (validation set) | 85.6% | ALUM |
AI2 Reasoning Challenge (ARC) | 96.3% | 25-shot | 85.2% | 25-shot | 85.2% | 8-shot PaLM | 86.5% | ST-MOE |
WinoGrande | 87.5% | 5-shot | 81.6% | 5-shot | 85.1% | 5-shot PaLM | 85.1% | 5-shot PaLM |
HumanEval | 67.0% | 0-shot | 48.1% | 0-shot | 26.2% | 0-shot PaLM | 65.8% | CodeT + GPT-3.5 |
DROP (F1 score) | 80.9 | 3-shot | 64.1 | 3-shot | 70.8 | 1-shot PaLM | 88.4 | QDGAT |
Benchmarks (Visual)
Benchmark | GPT-4 | Evaluated few-shot | Few-shot SOTA | Best external few-shot model | SOTA | Best external model (includes benchmark-specific training) |
---|---|---|---|---|---|---|
VQAv2 | 77.2% | 0-shot | 67.6% | Flamingo 32-shot | 84.3% | PaLI-17B |
TextVQA | 78.0% | 0-shot | 37.9% | Flamingo 32-shot | 71.8% | PaLI-17B |
ChartQA | 78.5% | - | - | - | 58.6% | Pix2Struct Large |
AI2 Diagram (AI2D) | 78.2% | 0-shot | - | - | 42.1% | Pix2Struct Large |
DocVQA | 88.4% | 0-shot (pixel-only) | - | - | 88.4% | ERNIE-Layout 2.0 |
Infographic VQA | 75.1% | 0-shot (pixel-only) | - | - | 61.2% | Applica.ai TILT |
TVQA | 87.3% | 0-shot | - | - | 86.5% | MERLOT Reserve Large |
LSMDC | 45.7% | 0-shot | 31.0% | MERLOT Reserve 0-shot | 52.9% | MERLOT |