# GPT-3

> Source: https://aiwiki.ai/wiki/gpt-3
> Updated: 2026-07-28
> Categories: Large Language Models, Natural Language Processing, OpenAI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

GPT-3 (Generative Pre-trained Transformer 3) is a family of decoder-only, autoregressive [large language models](/wiki/large_language_model) developed by [OpenAI](/wiki/openai). OpenAI first described the family in a May 2020 preprint; a peer-reviewed version appeared at NeurIPS 2020. The largest model contained 175 billion trainable parameters. The central experiment was not simply a larger text generator: it tested whether one pretrained model could perform many tasks from an instruction, one example, or several examples placed in its input, without updating its weights for each task.[1][2]

GPT-3 became an important early demonstration of [in-context learning](/wiki/in_context_learning) at scale. Its results also require careful qualification. Performance varied widely across tasks and prompt formats, several benchmarks were affected by possible training-data contamination, the full training corpus and model weights were not released, and the base model was not specifically trained to follow user intent or to provide truthful answers.[1][2] GPT-3 is therefore best understood as the original 2020 research family and its associated base-model API era, not as a collective name for later systems such as Codex, InstructGPT, GPT-3.5, ChatGPT, or GPT-4.

## Scope and terminology

The paper used "GPT-3" in two related senses. It trained eight [language model](/wiki/language_model) sizes from 125 million to 175 billion parameters and called the 175-billion-parameter member "GPT-3"; the smaller members were used to measure scaling behavior. OpenAI later used GPT-3 as a product-family name for base models offered through its API. Public API aliases such as `ada`, `babbage`, `curie`, and `davinci` should not be assigned exact research-model parameter counts unless an authoritative source makes that mapping. The paper did not provide such an alias-to-size table.[2][8]

The model performed autoregressive next-token prediction. "Zero-shot," "one-shot," and "few-shot" in the paper describe how many demonstrations appeared in the input at evaluation time. They do not mean that GPT-3 was trained from no data, and the paper did not update model weights during these evaluations. Its few-shot condition generally used 10 to 100 demonstrations, subject to the input-length limit.[2]

## Research lineage and release

GPT-3 continued the [GPT](/wiki/gpt) line of generative pretraining. The original 2018 GPT report combined Transformer language-model pretraining with supervised task-specific fine-tuning.[4] [GPT-2](/wiki/gpt-2), described in 2019, increased the largest model to 1.5 billion parameters and studied task behavior elicited through text without task-specific parameter updates.[5] GPT-3 increased the largest dense model by more than two orders of magnitude relative to GPT-2 and systematically compared zero-, one-, and few-shot evaluation across a broad set of tasks.[1][2]

The work was also an experiment in [scaling laws](/wiki/scaling_laws). A preceding OpenAI study reported approximate power-law relationships between language-model loss and model size, dataset size, and training compute.[6] GPT-3 trained a sequence of model sizes to test whether decreasing validation loss with scale would translate into better task performance and more effective use of demonstrations. The paper found smooth improvements in aggregate accuracy and cross-entropy loss, but some individual exact-match tasks changed much more sharply between the 13-billion- and 175-billion-parameter models.[2]

The arXiv version was submitted on 28 May 2020. The conference record lists 31 authors, led by Tom B. Brown, and identifies the paper as a NeurIPS 2020 publication.[1][2]

## Architecture and objective

GPT-3 is a unidirectional Transformer [decoder](/wiki/decoder). Like GPT-2, it used causal [self-attention](/wiki/self_attention), pre-normalization, a modified initialization scheme, and reversible [tokenization](/wiki/tokenization). Unlike a standard dense-attention stack, its layers alternated dense attention with locally banded [sparse attention](/wiki/sparse_attention), following ideas from the Sparse Transformer. The original Transformer established the attention-based architecture, while the Sparse Transformer described attention factorizations intended to reduce the cost of long-sequence modeling.[2][3][5][7]
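The difference between dense causal attention and a locally banded pattern can be illustrated with boolean masks. The sketch below is a toy illustration only: the band width, the choice of which layers are dense, and the function names are assumptions, since this article does not give the exact sparse pattern GPT-3 used.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def banded_causal_mask(n, band):
    """Causal mask limited to the `band` most recent positions.

    `band` is a hypothetical width chosen for illustration.
    """
    return [[(j <= i) and (i - j < band) for j in range(n)] for i in range(n)]


def layer_masks(n_layers, n, band):
    """Alternate dense and banded masks by layer index.

    Treating even layers as dense is an assumption, not a reported detail.
    """
    return [
        causal_mask(n) if layer % 2 == 0 else banded_causal_mask(n, band)
        for layer in range(n_layers)
    ]


masks = layer_masks(n_layers=4, n=6, band=2)
dense_pairs = sum(sum(row) for row in masks[0])
banded_pairs = sum(sum(row) for row in masks[1])
print(dense_pairs, banded_pairs)
```

The banded layer attends to fewer position pairs than the dense layer, which is the source of the cost reduction that sparse-attention designs aim for.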

For a sequence of tokens, the training objective can be written as the negative log-likelihood of each token conditioned on earlier tokens:[2]

$$
\mathcal{L}(\theta)=-\sum_{t=1}^{T}\log P_{\theta}\!\left(x_t\mid x_1,\ldots,x_{t-1}\right).
$$

This objective trains the model to predict text continuations. It does not by itself specify that an answer should be factual, helpful, harmless, or faithful to a user's intention. GPT-3 used a [byte-pair encoding](/wiki/byte_pair_encoding) vocabulary and a 2,048-token [context window](/wiki/context_window). The reported feed-forward width was four times the model width.[2][5]
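As a minimal sketch of the objective above, the loss for one sequence is the sum of negative log conditional probabilities. The probabilities below are made-up values standing in for a model's outputs, and `sequence_nll` is a hypothetical helper name.

```python
import math


def sequence_nll(token_probs):
    """Return the sum of -log P(x_t | x_1..x_{t-1}) over one sequence.

    `token_probs[t]` is the probability the model assigned to the token
    that actually occurred at position t, given the earlier tokens.
    """
    return -sum(math.log(p) for p in token_probs)


# Illustrative conditional probabilities, not real model outputs.
probs = [0.5, 0.25, 0.125]
loss = sequence_nll(probs)
print(round(loss, 4))
```

Because the terms are logarithms, the loss equals the negative log of the product of the conditional probabilities, which is the probability the model assigns to the whole sequence.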

The paper reported the following architecture and optimization settings. Width and batch size are reproduced as printed; batch size is measured in tokens. Every model was trained for 300 billion tokens.[2]

| Model | Parameters | Layers | Model width | Attention heads | Batch tokens | Peak learning rate |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| GPT-3 Small | 125 million | 12 | 768 | 12 | 0.5 million | 6.0 × 10^-4 |
| GPT-3 Medium | 350 million | 24 | 1,024 | 16 | 0.5 million | 3.0 × 10^-4 |
| GPT-3 Large | 760 million | 24 | 1,536 | 16 | 0.5 million | 2.5 × 10^-4 |
| GPT-3 XL | 1.3 billion | 24 | 2,048 | 24 | 1 million | 2.0 × 10^-4 |
| GPT-3 2.7B | 2.7 billion | 32 | 2,560 | 32 | 1 million | 1.6 × 10^-4 |
| GPT-3 6.7B | 6.7 billion | 32 | 4,096 | 32 | 2 million | 1.2 × 10^-4 |
| GPT-3 13B | 13.0 billion | 40 | 5,140 | 40 | 2 million | 1.0 × 10^-4 |
| GPT-3 175B | 175.0 billion | 96 | 12,288 | 96 | 3.2 million | 0.6 × 10^-4 |
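The layer counts and widths in the table can be checked against a rule of thumb from the scaling-law literature: a Transformer whose feed-forward width is four times its model width has roughly 12 × layers × width² non-embedding parameters. The sketch below applies that approximation. It ignores embeddings, biases, and normalization parameters, so it is an estimate rather than the paper's own count.

```python
# (layers, model width) as printed in the table above.
CONFIGS = {
    "GPT-3 Small": (12, 768),
    "GPT-3 Medium": (24, 1024),
    "GPT-3 Large": (24, 1536),
    "GPT-3 XL": (24, 2048),
    "GPT-3 2.7B": (32, 2560),
    "GPT-3 6.7B": (32, 4096),
    "GPT-3 13B": (40, 5140),
    "GPT-3 175B": (96, 12288),
}


def approx_non_embedding_params(n_layers, d_model):
    """Approximate parameter count: 4*d^2 for attention plus 8*d^2 for the MLP, per layer."""
    return 12 * n_layers * d_model ** 2


estimates = {
    name: approx_non_embedding_params(layers, width)
    for name, (layers, width) in CONFIGS.items()
}

for name, value in estimates.items():
    print(f"{name}: about {value / 1e9:.2f} billion non-embedding parameters")
```

The estimate lands near the nominal size for the larger models and falls further below it for the smallest ones, where the omitted embedding matrix is a larger share of the total.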

The full paper's compute appendix estimated 3.14 × 10^23 floating-point operations, or about 3,640 petaflop/s-days, for the 175-billion-parameter training run. This was an analytical estimate based on active parameters and training tokens, not a published hardware-meter reading. The authors reported training the models on NVIDIA V100 [graphics processing units](/wiki/gpu), on part of a high-bandwidth cluster provided by [Microsoft](/wiki/microsoft).[2]
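An estimate of this kind can be approximated with a short calculation. The sketch below assumes the common rule of thumb of about 6 floating-point operations per parameter per training token, and it uses the rounded figure of 175 billion parameters, so it lands near the reported numbers rather than exactly on them.

```python
SECONDS_PER_DAY = 86_400


def training_flops(params, tokens, flops_per_param_token=6):
    """Analytical estimate; the factor of 6 is an assumed rule of thumb."""
    return flops_per_param_token * params * tokens


def petaflop_s_days(flops):
    """Convert total operations to days of sustained 10^15 operations per second."""
    return flops / (1e15 * SECONDS_PER_DAY)


flops = training_flops(params=175e9, tokens=300e9)
pf_days = petaflop_s_days(flops)
print(f"{flops:.3e} FLOPs, about {pf_days:,.0f} petaflop/s-days")
```

The result depends only on parameter count and token count, which is why it is an analytical estimate and not a measurement of hardware utilization.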

## Training data and processing

The training mixture contained filtered [Common Crawl](/wiki/common_crawl), an expanded WebText dataset called WebText2, two internet-based book corpora, and English-language Wikipedia. OpenAI filtered Common Crawl using similarity to higher-quality reference corpora, applied fuzzy document-level deduplication within and across datasets, and sampled curated corpora more heavily than their raw sizes would imply.[2]

The paper reported this mixture for [pre-training](/wiki/pre-training):[2]

| Dataset | Available tokens | Sampling weight | Effective epochs over a 300B-token run |
| --- | ---: | ---: | ---: |
| Filtered Common Crawl | 410 billion | 60% | 0.44 |
| WebText2 | 19 billion | 22% | 2.9 |
| Books1 | 12 billion | 8% | 1.9 |
| Books2 | 55 billion | 8% | 0.43 |
| English Wikipedia | 3 billion | 3% | 3.4 |

The displayed weights sum to 101% because the paper rounded them. It reported that 41 monthly Common Crawl shards from 2016 through 2019 occupied 45 terabytes of compressed plain text before filtering and 570 gigabytes afterward, corresponding to roughly 400 billion encoded tokens. The labels "Books1" and "Books2" do not identify the underlying collections in enough detail to reconstruct them independently.[2]
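The effect of the sampling weights in the mixture table can be seen by comparing each dataset's weight with its share of the available tokens. The sketch below uses only the table's printed values; normalizing the weights so they sum to 1 is an assumption made to handle the rounding, and `oversampling_ratios` is a hypothetical helper name.

```python
# (available tokens, printed sampling weight) from the mixture table.
MIXTURE = {
    "Filtered Common Crawl": (410e9, 0.60),
    "WebText2": (19e9, 0.22),
    "Books1": (12e9, 0.08),
    "Books2": (55e9, 0.08),
    "English Wikipedia": (3e9, 0.03),
}


def oversampling_ratios(mixture):
    """Ratio of normalized sampling weight to raw token share for each dataset."""
    total_tokens = sum(tokens for tokens, _ in mixture.values())
    total_weight = sum(weight for _, weight in mixture.values())
    return {
        name: (weight / total_weight) / (tokens / total_tokens)
        for name, (tokens, weight) in mixture.items()
    }


ratios = oversampling_ratios(MIXTURE)
for name, ratio in ratios.items():
    print(f"{name}: sampled at {ratio:.2f}x its raw token share")
```

A ratio above 1 means the dataset was drawn more often than its size alone would imply. Under these numbers Common Crawl falls below 1 while WebText2 and Wikipedia are well above it.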

All models used model parallelism both within matrix multiplications and across layers. Microsoft stated that GPT-3 was trained on an Azure-hosted AI supercomputer, while the paper limits its hardware description to V100 GPUs on part of a Microsoft-provided cluster. Those statements do not establish that every processor in the announced [Microsoft Azure](/wiki/azure) system was allocated to the GPT-3 run.[2][10]

## Evaluation protocol

GPT-3's headline evaluation was prompt-based rather than task-specific [fine-tuning](/wiki/fine_tuning). In the few-shot condition, the model received an optional natural-language task description, followed by several input-output demonstrations and a new input. The one-shot condition supplied one demonstration. The zero-shot condition supplied an instruction but no demonstration. The demonstrations shared the same 2,048-token input as the item being evaluated, and no gradient update occurred.[2]
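The three conditions differ only in how the input text is assembled. The sketch below builds such prompts under assumed conventions: the `=>` separator, the newline layout, and the `build_prompt` helper are illustrative choices, not a specification of the formats used for every task.

```python
def build_prompt(query, demonstrations=(), task_description=None):
    """Assemble a zero-, one-, or few-shot prompt as plain text.

    The number of (input, output) pairs in `demonstrations` decides the
    condition: 0 is zero-shot, 1 is one-shot, more is few-shot.
    """
    lines = []
    if task_description:
        lines.append(task_description)
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model continues from here
    return "\n".join(lines)


zero_shot = build_prompt("cheese", task_description="Translate English to French:")
few_shot = build_prompt(
    "cheese",
    demonstrations=[("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    task_description="Translate English to French:",
)
print(few_shot)
```

In every case the model's weights stay fixed; the demonstrations influence the output only by being part of the conditioning text.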

The evaluation was not one universal prompt applied to every task. The researchers chose task formats, answer labels, and, for some tasks, the number of demonstrations using development data. Multiple-choice tasks were scored by completion likelihood, with length normalization in most cases and additional unconditional normalization on ARC, OpenBookQA, and RACE. Free-form tasks used the metric conventional for the dataset. Private test servers could not always host the model, so most results for benchmarks with private test sets were reported on development sets; only selected results were submitted to test servers.[2]
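Likelihood-based scoring of multiple-choice answers can be sketched as follows. The per-token log-probabilities are made-up values standing in for model outputs, the function names are hypothetical, and the unconditional variant is shown in a simplified form that compares each completion's log-probability with and without the question context.

```python
def summed_logprob(token_logprobs):
    """Raw log-likelihood of a completion; tends to favor shorter answers."""
    return sum(token_logprobs)


def length_normalized_logprob(token_logprobs):
    """Average log-probability per token."""
    return sum(token_logprobs) / len(token_logprobs)


def unconditionally_normalized_logprob(conditional, unconditional):
    """log P(completion | context) minus log P(completion | neutral context)."""
    return sum(conditional) - sum(unconditional)


def best_choice(scores):
    """Index of the highest-scoring candidate."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Two candidate completions: a four-token answer and a two-token answer.
candidates = [[-1.0, -1.0, -1.0, -1.0], [-1.5, -1.5]]
raw_pick = best_choice([summed_logprob(c) for c in candidates])
normalized_pick = best_choice([length_normalized_logprob(c) for c in candidates])
print(raw_pick, normalized_pick)
```

In this toy case the raw sum prefers the shorter completion while the per-token average prefers the longer one, which shows why the normalization choice is part of the evaluation protocol and not a neutral detail.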

The benchmark collection spanned language modeling, question answering, translation, commonsense reasoning, reading comprehension, natural-language inference, and synthetic tasks. [LAMBADA](/wiki/lambada) tests prediction of a passage's final word using broad discourse context.[14] [TriviaQA](/wiki/triviaqa) contains question-answer pairs associated with distantly gathered evidence.[15] [HellaSwag](/wiki/hellaswag) is an adversarially filtered sentence-completion benchmark.[16] [PIQA](/wiki/piqa) tests physical commonsense through paired candidate solutions.[17] [SuperGLUE](/wiki/superglue) aggregates several language-understanding tasks.[18]

Selected results for the 175-billion-parameter model are shown below. They are historical measurements under the paper's specific prompts, splits, and scoring rules, not current leaderboard claims.[2]

| Evaluation | Zero-shot | One-shot | Few-shot | Important condition |
| --- | ---: | ---: | ---: | --- |
| LAMBADA accuracy | 76.2% | 72.5% | 86.4% | Few-shot used a fill-in-the-blank format; contamination was later examined |
| TriviaQA accuracy | 64.3% | 68.0% | 71.2% | Closed-book setting; few-shot result used the wiki-split test server |
| HellaSwag accuracy | 78.9% | 78.1% | 79.3% | Development-set result |
| PIQA accuracy | 80.5% | 80.5% | 82.8% | Few-shot test-server result; marked with a contamination asterisk in the paper |
| SuperGLUE average | Not reported | Not reported | 71.8 | Test set with 32 demonstrations per task and no weight updates |

Results varied substantially. GPT-3 did well on LAMBADA and closed-book TriviaQA under the reported conditions, but remained below contemporaneous fine-tuned systems on many tasks. On SuperGLUE, its 71.8 few-shot average exceeded the paper's 69.0 fine-tuned BERT-Large baseline but remained far below the listed 89.0 fine-tuned state of the art. Its few-shot WiC accuracy was 49.4%, approximately chance, and the paper reported persistent weakness on several tasks requiring comparison between two text passages.[2]

## In-context learning and prompt sensitivity

The paper's [few-shot learning](/wiki/few-shot_learning) curves showed that larger models often made more effective use of demonstrations. The term "learning" here is behavioral: a [prompt](/wiki/prompt) changes the conditional distribution of outputs, while the model's parameters remain fixed. The experiments did not establish a single mechanism explaining whether GPT-3 inferred a new rule, recognized a familiar pattern, or exploited correlations in its pretraining data.[2]

Subsequent controlled studies narrowed the interpretation. Zhao and colleagues found that GPT-3 few-shot classification could vary from near chance to near state of the art with changes in format, selected examples, or label order; their contextual calibration procedure improved accuracy by as much as 30 absolute percentage points across studied prompt choices.[19] Lu and colleagues separately found strong sensitivity to the order of demonstrations and reported a 13% relative average improvement from an entropy-based ordering method across eleven text-classification tasks.[20] These findings made [prompt engineering](/wiki/prompt_engineering) an important part of reproducible evaluation rather than a neutral presentation detail.
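The idea behind contextual calibration can be sketched in a few lines. This is a simplified illustration, not the published procedure: it assumes the model's label probabilities for a content-free input (for example a placeholder such as "N/A") are available, and it rescales by division and renormalization. All numbers are invented.

```python
def calibrate(label_probs, content_free_probs):
    """Divide each label probability by the model's bias estimate, then renormalize.

    `content_free_probs` are the label probabilities the model assigns when
    the input carries no task content, used here as an estimate of prompt bias.
    """
    scaled = [p / c for p, c in zip(label_probs, content_free_probs)]
    total = sum(scaled)
    return [s / total for s in scaled]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


content_free = [0.7, 0.3]   # prompt alone already leans toward label 0
raw = [0.6, 0.4]            # prediction for a real input under the same prompt
calibrated = calibrate(raw, content_free)
print(argmax(raw), argmax(calibrated))
```

Here the raw prediction follows the prompt's built-in preference for label 0, while the calibrated prediction switches to label 1 because the real input moved probability toward it relative to the content-free baseline.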

Min and colleagues tested twelve models, including GPT-3, and found that replacing correct labels in demonstrations with random labels often had little effect on the classification and multiple-choice tasks they studied. Their experiments instead implicated the demonstrated label space, input distribution, and formatting as important contributors.[21] The result does not show that correct demonstrations never matter; it shows that accuracy gains in those settings could not be attributed solely to learning the demonstrated input-label mapping.

Later work also revisited scale. A 2022 compute-optimal study argued that many large models had been trained on too few tokens for their parameter count under a fixed compute budget. Its 70-billion-parameter Chinchilla model, trained on substantially more data, outperformed GPT-3 on the evaluated task suite while using the same reported training-compute budget as the larger Gopher model.[23] This does not retroactively change GPT-3's architecture or results, but it weakens the inference that increasing parameter count was the uniquely efficient route to better performance.

Some later authors classified sharp task-level changes in the GPT-3 family as [emergent abilities](/wiki/emergent_abilities).[24] That interpretation remains measurement-dependent. Schaeffer and colleagues reanalyzed InstructGPT and GPT-3 family outputs and showed that discontinuous metrics such as exact-match accuracy can create apparent thresholds, while continuous metrics can produce smoother scaling curves.[25] GPT-3 therefore supplies evidence of strong scale-dependent behavior, but a sharp jump in one metric is not by itself proof of a new internal capability appearing discontinuously.
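The metric effect can be reproduced with a toy calculation. The sketch below assumes that each token of an answer is correct independently with the same probability, which is a simplification for illustration, and the accuracy values are invented rather than taken from any model.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token of the answer is correct.

    Assumes independent, identically likely token errors (a toy assumption).
    """
    return per_token_accuracy ** answer_length


# A smoothly improving per-token accuracy, as if across increasing model sizes.
per_token = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
exact = [exact_match_rate(p, answer_length=10) for p in per_token]

for p, em in zip(per_token, exact):
    print(f"per-token {p:.2f} -> exact match {em:.3f}")
```

The per-token accuracy rises in even steps, but the exact-match rate stays near zero for most of the range and then climbs quickly, which resembles a threshold even though the underlying quantity changed smoothly.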

## Contamination and reproducibility

The authors attempted to remove overlaps between the pretraining corpus and benchmark development or test sets, but a filtering bug left some detected overlaps in the training data. Retraining was judged too expensive. Their post-hoc analysis flagged possible overlap using 13-gram matching, compared full and "clean" subsets, and emphasized that the method could yield false positives or distribution shifts.[2]
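Overlap detection by n-gram matching can be sketched as below. This is a simplified illustration: whitespace tokenization, lowercasing, and the helper names are assumptions, and examples shorter than `n` tokens are simply left unflagged here.

```python
def ngrams(text, n=13):
    """Set of n-token windows from whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_flagged(example, training_ngrams, n=13):
    """True when the example shares at least one n-gram with the training text."""
    return not ngrams(example, n).isdisjoint(training_ngrams)


def clean_subset(examples, training_text, n=13):
    """Benchmark examples with no detected n-gram overlap."""
    training_ngrams = ngrams(training_text, n)
    return [ex for ex in examples if not is_flagged(ex, training_ngrams, n)]


# Tiny demonstration with n=3 so the strings stay short.
training_text = "the quick brown fox jumps over the lazy dog"
examples = ["a quick brown fox appears", "slow green turtles walk home"]
clean = clean_subset(examples, training_text, n=3)
print(clean)
```

A match of this kind indicates shared text, not necessarily that an answer was memorized, which is one reason such flags can be false positives.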

The analysis found concrete concerns. It flagged 29% of PIQA examples and observed a three-percentage-point absolute drop on the clean subset. It found 132 Winograd schemas in the training corpus in a different format and a 2.6-point drop on the clean subset. LAMBADA showed substantial genuine overlap but less than a half-point difference between the reported full and clean scores. Four Wikipedia-derived language-modeling benchmarks and the Children's Book Test were omitted because the authors could not extract reliable clean subsets. These findings are why the PIQA and Winograd results carried contamination marks in the paper.[2]

[Data contamination](/wiki/data_contamination) was not the only reproducibility constraint. OpenAI published the paper, an official model card, and generated samples, but did not release the GPT-3 weights or a document-level training manifest. Access was instead provided through a hosted API. The model card also warned that deployed-system behavior depends on configuration, context, users, and interpretation, so paper benchmark scores do not directly predict application performance.[8][32]

## Limitations, bias, and risks

The paper documented repeated phrases, loss of long-range coherence, contradictions, and non sequiturs in generated text. It also described the model as ungrounded in physical or multimodal experience, predominantly English-focused, difficult to interpret, and weak on some bidirectional or passage-comparison tasks. Its bias probes found stereotyped associations involving gender, race, and religion, but the authors characterized that analysis as preliminary rather than exhaustive.[1][2][32]

The official model card states that GPT-3 can generate false statements confidently. A later [TruthfulQA](/wiki/truthfulqa) study tested GPT-3 and other models on 817 questions constructed around common misconceptions. Under human evaluation, the best tested GPT-3 configuration was truthful on 58% of questions, compared with 94% for the human baseline; larger base models were generally less truthful in that study.[22][32] Fluency or benchmark accuracy should therefore not be treated as a factuality guarantee, and confident false text is one form of language-model [hallucination](/wiki/hallucination).

The original paper discussed potential misuse for spam, phishing, impersonation, fraudulent writing, and misinformation, along with the energy required for large-scale training.[1][2] Broader scholarship has identified additional risks from large language models, including discrimination and exclusion, information hazards, malicious use, human-computer interaction harms, automation and access effects, and environmental costs.[29] Bender and colleagues also argued that scale, undocumented training data, and fluent output can obscure environmental, social, and epistemic costs; their paper is a critical analysis of the large-language-model research direction, not an experimental audit limited to GPT-3.[30]

OpenAI's 2022 deployment retrospective reported that it had detected and stopped hundreds of actors attempting to misuse GPT-3, including uses beyond the influence operations emphasized in its initial risk planning. The same retrospective acknowledged that early GPT-3 work did not filter toxic pretraining material as aggressively as later production-oriented efforts and that academic benchmarks often failed to represent deployment risks.[31] These are first-party observations and do not provide a complete independent measurement of prevalence or social impact.

## API access and lifecycle

OpenAI launched its API in private beta on 11 June 2020 with a general-purpose "text in, text out" interface. The announcement said that the service ran models from the GPT-3 family and explained that OpenAI chose controlled API access instead of open-sourcing the underlying models, in part to permit use review and revocation.[8]

On 22 September 2020, OpenAI licensed GPT-3 technology to Microsoft for Microsoft's products and services while stating that the agreement did not change continued access through OpenAI's API.[9] Microsoft described the license as exclusive and separately confirmed that GPT-3 had been trained on its Azure AI supercomputer.[10] On 18 November 2021, OpenAI removed the GPT-3 API waitlist for developers in supported countries; this broadened hosted access but did not release model weights.[11]

The original base GPT-3 API aliases `ada`, `babbage`, `curie`, and `davinci` were retired on 4 January 2024. OpenAI mapped the first two to `babbage-002` and the latter two to `davinci-002`; instruction-tuned completion models such as `text-davinci-003` moved to `gpt-3.5-turbo-instruct`.[12][13] Those replacement names should not be described as the original 2020 research checkpoints.
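The alias mapping described above can be written as a small lookup table. The sketch below records only the replacements named in this section; the dictionary and the `replacement_for` helper are hypothetical conveniences, not part of any OpenAI library.

```python
# Retired identifier -> replacement identifier, as described in this section.
REPLACEMENTS = {
    "ada": "babbage-002",
    "babbage": "babbage-002",
    "curie": "davinci-002",
    "davinci": "davinci-002",
    "text-davinci-003": "gpt-3.5-turbo-instruct",
}


def replacement_for(model_name):
    """Return the listed replacement, or None when this table has no entry."""
    return REPLACEMENTS.get(model_name)


print(replacement_for("curie"))
```

The lookup describes hosted identifiers only; a replacement name does not imply that the underlying checkpoint is the same model.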

As of 28 July 2026, OpenAI's deprecation record schedules `babbage-002`, `davinci-002`, and `gpt-3.5-turbo-instruct` to shut down on 28 September 2026. It schedules fine-tuned `babbage-002` and `davinci-002` models to shut down on 23 October 2026. These are service-lifecycle facts about hosted model identifiers, not a claim that the GPT-3 paper or historical model family ceases to exist.[12]

## Later models and historical significance

GPT-3 supplied pretrained starting points and deployment experience for several later projects, but those systems used additional data or training objectives. [OpenAI Codex](/wiki/openai_codex) was a GPT model fine-tuned on publicly available code and evaluated for program synthesis; it was not merely a renamed GPT-3 base checkpoint.[27] [InstructGPT](/wiki/instructgpt) applied supervised demonstrations and [reinforcement learning from human feedback](/wiki/rlhf) to GPT-3 models. In the reported human-preference evaluation, a 1.3-billion-parameter InstructGPT model was preferred to the 175-billion-parameter base GPT-3 despite its smaller size.[26]

[ChatGPT](/wiki/chatgpt), introduced in November 2022, was described by OpenAI as a sibling of InstructGPT and trained for dialogue. [GPT-3.5](/wiki/gpt-3.5) is a later model family associated with that period, not one of the eight models in the 2020 GPT-3 paper.[28] [GPT-4](/wiki/gpt-4) is also a separate successor family. Keeping these scopes distinct prevents later instruction following, dialogue behavior, code specialization, or multimodal capabilities from being attributed to the original GPT-3 base model.

GPT-3's lasting research importance lies in the combination of scale, a systematic zero-/one-/few-shot protocol, and a gated general-purpose deployment. It helped establish large pretrained models as reusable [foundation models](/wiki/foundation_models), while subsequent work showed that prompt construction, training-token allocation, contamination, factuality, alignment, and evaluation metrics are all essential to interpreting what scale achieved.[19][21][22][23][25]

## References

[1] Brown, Tom B., et al. "Language Models are Few-Shot Learners." Advances in Neural Information Processing Systems 33, 2020. https://proceedings.neurips.cc/paper_files/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html

[2] Brown, Tom B., et al. "Language Models are Few-Shot Learners." arXiv:2005.14165, version 4, 2020. https://arxiv.org/abs/2005.14165

[3] Vaswani, Ashish, et al. "Attention is All You Need." Advances in Neural Information Processing Systems 30, 2017. https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

[4] Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. "Improving Language Understanding by Generative Pre-Training." OpenAI technical report, 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

[5] Radford, Alec, et al. "Language Models are Unsupervised Multitask Learners." OpenAI technical report, 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

[6] Kaplan, Jared, et al. "Scaling Laws for Neural Language Models." arXiv:2001.08361, 2020. https://arxiv.org/abs/2001.08361

[7] Child, Rewon, Scott Gray, Alec Radford, and Ilya Sutskever. "Generating Long Sequences with Sparse Transformers." arXiv:1904.10509, 2019. https://arxiv.org/abs/1904.10509

[8] OpenAI. "OpenAI API." 11 June 2020, updated 18 September 2020. https://openai.com/index/openai-api/

[9] OpenAI. "OpenAI licenses GPT-3 technology to Microsoft." 22 September 2020. https://openai.com/index/openai-licenses-gpt-3-technology-to-microsoft/

[10] Scott, Kevin. "Microsoft teams up with OpenAI to exclusively license GPT-3 language model." Official Microsoft Blog, 22 September 2020. https://blogs.microsoft.com/blog/2020/09/22/microsoft-teams-up-with-openai-to-exclusively-license-gpt-3-language-model/

[11] OpenAI. "OpenAI's API now available with no waitlist." 18 November 2021. https://openai.com/index/api-no-waitlist/

[12] OpenAI. "Deprecations." OpenAI API documentation, accessed 28 July 2026. https://developers.openai.com/api/docs/deprecations

[13] OpenAI. "GPT-4 API general availability and deprecation of older models in the Completions API." 6 July 2023, updated 24 April 2024. https://openai.com/index/gpt-4-api-general-availability/

[14] Paperno, Denis, et al. "The LAMBADA dataset: Word prediction requiring a broad discourse context." Proceedings of ACL 2016, pages 1525-1534. https://aclanthology.org/P16-1144/

[15] Joshi, Mandar, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." Proceedings of ACL 2017, pages 1601-1611. https://aclanthology.org/P17-1147/

[16] Zellers, Rowan, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. "HellaSwag: Can a Machine Really Finish Your Sentence?" Proceedings of ACL 2019, pages 4791-4800. https://aclanthology.org/P19-1472/

[17] Bisk, Yonatan, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. "PIQA: Reasoning about Physical Commonsense in Natural Language." Proceedings of AAAI 34(05), 2020, pages 7432-7439. https://ojs.aaai.org/index.php/AAAI/article/view/6239

[18] Wang, Alex, et al. "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." Advances in Neural Information Processing Systems 32, 2019. https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html

[19] Zhao, Zihao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. "Calibrate Before Use: Improving Few-Shot Performance of Language Models." Proceedings of ICML 2021, pages 12697-12706. https://proceedings.mlr.press/v139/zhao21c.html

[20] Lu, Yao, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. "Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity." Proceedings of ACL 2022, pages 8086-8098. https://aclanthology.org/2022.acl-long.556/

[21] Min, Sewon, et al. "Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?" Proceedings of EMNLP 2022, pages 11048-11064. https://aclanthology.org/2022.emnlp-main.759/

[22] Lin, Stephanie, Jacob Hilton, and Owain Evans. "TruthfulQA: Measuring How Models Mimic Human Falsehoods." Proceedings of ACL 2022, pages 3214-3252. https://aclanthology.org/2022.acl-long.229/

[23] Hoffmann, Jordan, et al. "An empirical analysis of compute-optimal large language model training." Advances in Neural Information Processing Systems 35, 2022. https://proceedings.neurips.cc/paper_files/paper/2022/hash/c1e2faff6f588870935f114ebe04a3e5-Abstract.html

[24] Wei, Jason, et al. "Emergent Abilities of Large Language Models." Transactions on Machine Learning Research, 2022. https://research.google/pubs/emergent-abilities-of-large-language-models/

[25] Schaeffer, Rylan, Brando Miranda, and Sanmi Koyejo. "Are Emergent Abilities of Large Language Models a Mirage?" Advances in Neural Information Processing Systems 36, 2023. https://proceedings.neurips.cc/paper_files/paper/2023/hash/adc98a266f45005c403b8311ca7e8bd7-Abstract-Conference.html

[26] Ouyang, Long, et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35, 2022. https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract.html

[27] Chen, Mark, et al. "Evaluating Large Language Models Trained on Code." arXiv:2107.03374, 2021. https://arxiv.org/abs/2107.03374

[28] OpenAI. "Introducing ChatGPT." 30 November 2022. https://openai.com/index/chatgpt/

[29] Weidinger, Laura, et al. "Taxonomy of Risks posed by Language Models." Proceedings of FAccT 2022, pages 214-229. https://doi.org/10.1145/3531146.3533088

[30] Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" Proceedings of FAccT 2021, pages 610-623. https://doi.org/10.1145/3442188.3445922

[31] OpenAI. "Lessons learned on language model safety and misuse." 3 March 2022. https://openai.com/index/language-model-safety-and-misuse/

[32] OpenAI. "GPT-3 Model Card." Last updated September 2020. https://github.com/openai/gpt-3/blob/master/model-card.md