GPT-3 (Generative Pre-trained Transformer 3) is a large language model developed by OpenAI. It is the third iteration of the GPT series, a family of autoregressive models that use deep learning to generate text from natural language inputs (prompts). In addition to producing human-like text, it can generate code and other kinds of data. The system has been used to generate poetry, write role-playing adventures, and create basic apps. [1] [2] GPT-3 was introduced in the paper "Language Models are Few-Shot Learners" by Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, and 26 additional co-authors at OpenAI. The paper was submitted to arXiv in May 2020 and later presented at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020). [3] OpenAI released access to GPT-3 through a limited beta API on June 11, 2020, and subsequently made it available as a commercial product to fund its research. [2] [4] [5] On November 18, 2021, OpenAI stated that it had removed the waitlist, making the API generally available to all developers. [5]
With 175 billion parameters, GPT-3 was the largest dense neural network at the time of its release, representing a more than 100-fold increase over its predecessor, GPT-2, which had 1.5 billion parameters. [3] In the area of artificial intelligence, GPT-3 is a neural network, a mathematical system inspired by the working of brain neurons. This network learns its skills by finding patterns in enormous amounts of digital data. It is the same type of technology that allows for facial recognition or the identification of voice commands on smartphones. [4]
According to Floridi and Chiriatti (2020), "the language model is trained on an unlabeled dataset that is made up of texts, such as Wikipedia and many other sites, primarily in English, but also in other languages. These statistical models need to be trained with large amounts of data to produce relevant results." [1] When the GPT-3 system receives a prompt, it produces a text completion in natural language, and developers can tweak the program by showing it a few examples that demonstrate their intended goal. [6]
GPT-3 is a powerful model with applications such as automatic chat, search matching, code generation, article creation, and summarization, but it still has problems with security and controllability, including the generation of false content and biased information. For now, its value lies mainly in assistive tasks such as summarizing reports or creating draft content that can then be manually reviewed and edited. [9]
The development of GPT-3 was motivated by a growing body of evidence that scaling up language models could yield substantial improvements in performance across a wide range of natural language processing tasks. In January 2020, a team of researchers at OpenAI, including Jared Kaplan and Sam McCandlish, published "Scaling Laws for Neural Language Models," which demonstrated that the cross-entropy loss of a language model follows a power-law relationship with model size, dataset size, and the amount of compute used for training. [18] These scaling laws spanned more than seven orders of magnitude and suggested that larger models would continue to improve in a predictable fashion, provided they were given sufficient data and compute.
The scaling laws paper also found that other architectural details, such as network width or depth, had minimal effects within a wide range, and that larger models were significantly more sample-efficient. This meant that the most compute-efficient approach to training involved building very large models and training them on a relatively modest amount of data, stopping well before convergence. [18] These findings provided the theoretical foundation and practical justification for the massive scale of GPT-3 and directly shaped the decisions around its architecture and training regimen.
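As a rough illustration of the scaling-law form, the sketch below evaluates the model-size power law from Kaplan et al. (2020), L(N) ≈ (N_c / N)^α_N; the constants α_N ≈ 0.076 and N_c ≈ 8.8 × 10^13 non-embedding parameters are approximate values taken from that paper and should be treated as assumptions rather than figures from the GPT-3 paper itself.

```python
# Sketch of the model-size power law from Kaplan et al. (2020):
#     L(N) ~ (N_c / N) ** alpha_N
# alpha_N and N_c are approximate constants from that paper (assumptions here).
ALPHA_N = 0.076
N_C = 8.8e13  # non-embedding parameters

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss (nats per token) for a model with
    n_params non-embedding parameters, ignoring data and compute limits."""
    return (N_C / n_params) ** ALPHA_N

for n in (1.25e8, 1.5e9, 1.75e11):  # roughly GPT-3 Small, GPT-2, GPT-3 175B
    print(f"{n:.2e} parameters -> predicted loss {predicted_loss(n):.2f}")
```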
Prior to GPT-3, the progression from GPT (117 million parameters in 2018) to GPT-2 (1.5 billion parameters in 2019) had already shown that increasing scale led to qualitative improvements in text generation and task performance. [1] [2] GPT-3 pushed this approach to its logical extreme, scaling up by over two orders of magnitude from GPT-2.
GPT-3 uses the same general architecture as GPT-2, based on the Transformer decoder. The model employs modified initialization, pre-normalization (placing layer normalization before the attention and feed-forward blocks rather than after), and reversible tokenization using byte pair encoding (BPE) with a vocabulary of 50,257 tokens. [3]
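For readers who want to see the byte pair encoding in action, the short sketch below uses tiktoken, OpenAI's open-source tokenizer library; `r50k_base` is the roughly 50,000-token encoding tiktoken associates with the original GPT-3 models, and the sentence being tokenized is purely illustrative.

```python
# Illustrative byte pair encoding with OpenAI's open-source tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")   # encoding associated with GPT-3
tokens = enc.encode("GPT-3 uses byte pair encoding.")
print(tokens)                              # integer token ids
print([enc.decode([t]) for t in tokens])   # the text fragment behind each id
print(enc.n_vocab)                         # vocabulary size (50257)
```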
One key architectural change from GPT-2 is the use of alternating dense and locally banded sparse attention patterns in the Transformer layers, an approach borrowed from the Sparse Transformer. [3] This modification reduces the computational and memory cost of self-attention over the model's 2,048-token context.
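A minimal PyTorch-style sketch of a pre-normalization decoder block of the kind described above is given below; it uses ordinary dense causal attention and omits the alternating sparse attention pattern and other production details, so it should be read as an illustration rather than GPT-3's actual implementation.

```python
# Minimal pre-normalization Transformer decoder block (illustrative sketch;
# GPT-3 additionally alternates dense and locally banded sparse attention).
import torch
import torch.nn as nn

class PreNormDecoderBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)   # layer norm placed *before* attention
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)   # layer norm placed *before* the FFN
        self.ffn = nn.Sequential(          # feed-forward hidden size = 4 * d_model
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                   # residual connection around attention
        x = x + self.ffn(self.ln2(x))      # residual connection around the FFN
        return x
```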
The full-size GPT-3 model (GPT-3 175B) has the following specifications: [3]

- Parameters: 175 billion
- Layers: 96
- Model dimension (d_model): 12,288
- Attention heads: 96, each of dimension (d_head) 128
- Context window: 2,048 tokens
- Training batch size: 3.2 million tokens
- Peak learning rate: 0.6 x 10^-4
Each of the 96 layers contains a multi-head self-attention sublayer and a position-wise feed-forward network (FFN) sublayer. The feed-forward hidden dimension is four times the model dimension (4 x 12,288 = 49,152). In terms of parameter distribution, the word embedding layer accounts for less than 1% of total parameters in the full-size model, while the remaining parameters are split roughly 2:1 between the feed-forward layers and the attention heads. [3]
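That split can be checked with back-of-the-envelope arithmetic: per layer, the attention projections contribute roughly 4·d_model² parameters and the feed-forward network roughly 8·d_model², plus the token embedding matrix. The sketch below reproduces the ~175 billion total under those simplifying assumptions (biases, layer-norm parameters, and position embeddings are ignored).

```python
# Back-of-the-envelope parameter count for GPT-3 175B.
n_layers, d_model, vocab = 96, 12_288, 50_257

attn_per_layer = 4 * d_model ** 2             # Q, K, V and output projections
ffn_per_layer = 2 * d_model * (4 * d_model)   # two linear maps, hidden = 4*d_model
embedding = vocab * d_model                   # token embedding matrix (<1% of total)

total = n_layers * (attn_per_layer + ffn_per_layer) + embedding
print(f"attention   : {n_layers * attn_per_layer / 1e9:6.1f} B")
print(f"feed-forward: {n_layers * ffn_per_layer / 1e9:6.1f} B")  # ~2:1 vs attention
print(f"embedding   : {embedding / 1e9:6.1f} B")
print(f"total       : {total / 1e9:6.1f} B")  # roughly 175 B
```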
The GPT-3 paper evaluated eight different model sizes to study the effect of scale on performance. The table below summarizes the architectural hyperparameters for each variant, as reported in Table 2.1 of the paper. [3]
| Model Name | Parameters | Layers | d_model | Attention Heads | d_head | Batch Size (tokens) | Learning Rate |
|---|---|---|---|---|---|---|---|
| GPT-3 Small | 125M | 12 | 768 | 12 | 64 | 0.5M | 6.0 x 10^-4 |
| GPT-3 Medium | 350M | 24 | 1,024 | 16 | 64 | 0.5M | 3.0 x 10^-4 |
| GPT-3 Large | 760M | 24 | 1,536 | 16 | 96 | 0.5M | 2.5 x 10^-4 |
| GPT-3 XL | 1.3B | 24 | 2,048 | 24 | 128 | 1M | 2.0 x 10^-4 |
| GPT-3 2.7B | 2.7B | 32 | 2,560 | 32 | 80 | 1M | 1.6 x 10^-4 |
| GPT-3 6.7B | 6.7B | 32 | 4,096 | 32 | 128 | 2M | 1.2 x 10^-4 |
| GPT-3 13B | 13.0B | 40 | 5,140 | 40 | 128 | 2M | 1.0 x 10^-4 |
| GPT-3 175B | 175.0B | 96 | 12,288 | 96 | 128 | 3.2M | 0.6 x 10^-4 |
All eight models were trained for a total of 300 billion tokens. The learning rate was warmed up over the first 375 million tokens and then decayed using a cosine schedule to 10% of its original value. [3]
When OpenAI launched the GPT-3 API, the model variants were given names inspired by notable scientists and mathematicians. Research by EleutherAI matched these API models to the paper's model sizes based on benchmark performance comparisons. [19]
| API Model Name | Approximate Paper Equivalent | Estimated Parameters |
|---|---|---|
| Ada | GPT-3 Small or Medium | ~350M |
| Babbage | GPT-3 XL | ~1.3B |
| Curie | GPT-3 6.7B | ~6.7B |
| Davinci | GPT-3 175B | ~175B |
Davinci was the largest and most capable model, able to perform any task the smaller models could handle, though at higher cost and lower speed. Ada was the fastest and least expensive, suited for simpler tasks such as parsing, classification, and address correction. Curie balanced power and speed, handling more complex tasks like sentiment analysis and summarization. Babbage was suitable for straightforward tasks like search ranking and simple classification. [19]
GPT-3 was trained on a mixture of five datasets. The training corpus contained approximately 499 billion byte-pair-encoded tokens in total, though the model saw roughly 300 billion tokens during training because the dataset was weighted to favor higher-quality sources. Some datasets were seen more than once during training while others were not fully traversed. [3]
| Dataset | Tokens (Billions) | Weight in Training Mix | Epochs (approx.) |
|---|---|---|---|
| Common Crawl (filtered) | 410 | 60% | 0.44 |
| WebText2 | 19 | 22% | 2.9 |
| Books1 | 12 | 8% | 1.9 |
| Books2 | 55 | 8% | 0.43 |
| Wikipedia (English) | 3 | 3% | 3.4 |
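The epochs column follows, approximately, from the sampling weights and the 300 billion training tokens: epochs ≈ weight × 300B ÷ dataset size. The sketch below recomputes the column; most rows match the table above, though WebText2 comes out somewhat higher than the reported 2.9, so the relation should be read as an approximation of how the mixture was constructed rather than an exact reconstruction.

```python
# Effective epochs per dataset: sampling weight * 300B training tokens / size.
TRAIN_TOKENS_B = 300  # billions of tokens seen during training
datasets = {          # name: (size in billions of tokens, sampling weight)
    "Common Crawl": (410, 0.60),
    "WebText2":     (19, 0.22),
    "Books1":       (12, 0.08),
    "Books2":       (55, 0.08),
    "Wikipedia":    (3, 0.03),
}
for name, (size_b, weight) in datasets.items():
    print(f"{name:13s} ~{weight * TRAIN_TOKENS_B / size_b:.2f} epochs")
```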
The Common Crawl portion underwent significant filtering. OpenAI trained a binary classifier to distinguish high-quality documents (using WebText as a positive proxy) from low-quality ones, and then used fuzzy deduplication to remove near-duplicate documents. The filtering process reduced the original 45 TB of Common Crawl text down to approximately 570 GB of high-quality content. [3]
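OpenAI has not released its filtering code; the sketch below is only a schematic analogue of the classifier step, using scikit-learn with toy documents standing in for curated and raw text, and it omits the fuzzy deduplication stage entirely.

```python
# Schematic analogue of the Common Crawl quality filter (not OpenAI's code):
# train a simple classifier to separate curated text from raw crawled text,
# then keep only crawled documents the classifier scores as high quality.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

positive_docs = ["A carefully edited reference article with complete sentences.",
                 "Long-form journalism that discusses a topic in depth."]
negative_docs = ["click here buy now best price !!!",
                 "lorem lorem lorem lorem lorem"]

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = vectorizer.transform(positive_docs + negative_docs)
y = [1] * len(positive_docs) + [0] * len(negative_docs)
quality_clf = LogisticRegression(max_iter=1000).fit(X, y)

def keep(doc: str, threshold: float = 0.5) -> bool:
    """Keep a crawled document if the classifier rates it as high quality."""
    prob = quality_clf.predict_proba(vectorizer.transform([doc]))[0, 1]
    return prob > threshold

print(keep("An explanatory article about the history of astronomy."))
```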
WebText2 is an expanded version of the WebText dataset used to train GPT-2. It consists of the text content of web pages linked from Reddit posts that received at least three upvotes, serving as a rough heuristic for content quality. [3]
GPT-3 was trained on a high-bandwidth cluster of NVIDIA V100 GPUs provided by Microsoft as part of a dedicated cloud computing partnership. OpenAI used the cuDNN-accelerated PyTorch deep learning framework for training. [3] The Microsoft Azure-based supercomputer used for training reportedly contained 285,000 CPU cores and 10,000 NVIDIA V100 GPUs. [20]
The estimated computational cost of training GPT-3 175B was approximately 3.14 x 10^23 floating-point operations (FLOPs). Based on public cloud pricing for V100 GPU instances, independent researchers estimated the single training run cost at roughly $4.6 million, though the actual cost to OpenAI may have differed given its partnership with Microsoft. [20] Some earlier estimates placed the cost at approximately $12 million. [1]
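The reported total is consistent with the common rule of thumb that training a dense Transformer costs about 6 FLOPs per parameter per training token; the quick check below is an approximation, not a figure from the paper.

```python
# Rough check of training compute: ~6 FLOPs per parameter per training token.
n_params = 175e9   # parameters
n_tokens = 300e9   # training tokens
total_flops = 6 * n_params * n_tokens
print(f"{total_flops:.2e} FLOPs")   # ~3.15e23, close to the reported 3.14e23
```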
The training used a cosine learning rate schedule with warmup over the first 375 million tokens, decaying the learning rate to 10% of its peak value by the end of training. The largest model (175B) used a batch size of 3.2 million tokens and a peak learning rate of 0.6 x 10^-4. [3]
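A minimal sketch of that schedule (linear warmup over the first 375 million tokens, then cosine decay to 10% of the peak by the end of training) is shown below; the exact warmup shape and other details are assumptions for illustration.

```python
# Illustrative GPT-3-style learning rate schedule: linear warmup over 375M
# tokens, then cosine decay to 10% of the peak by 300B tokens.
import math

PEAK_LR = 0.6e-4          # peak learning rate used for GPT-3 175B
WARMUP_TOKENS = 375e6
TOTAL_TOKENS = 300e9
MIN_FRACTION = 0.1        # decay floor: 10% of the peak value

def learning_rate(tokens_seen: float) -> float:
    if tokens_seen < WARMUP_TOKENS:
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    progress = min(1.0, (tokens_seen - WARMUP_TOKENS) / (TOTAL_TOKENS - WARMUP_TOKENS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_FRACTION + (1.0 - MIN_FRACTION) * cosine)

for t in (1e8, 375e6, 150e9, 300e9):
    print(f"{t:.1e} tokens -> lr {learning_rate(t):.2e}")
```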
One of the most significant contributions of the GPT-3 paper was the formalization and systematic study of in-context learning. Rather than updating model weights through fine-tuning on task-specific data, in-context learning allows the model to perform new tasks simply by conditioning on a natural language description or a few demonstrative examples provided in the input prompt. [3]
The paper defined and evaluated three settings for in-context learning:

- Zero-shot: the model receives only a natural language description of the task, with no examples.
- One-shot: a single demonstration of the task is included in the prompt.
- Few-shot: several demonstrations (typically 10 to 100, as many as fit in the 2,048-token context window) are included in the prompt.

In all three settings, no gradient updates or fine-tuning are performed. [3]
The authors showed that GPT-3's few-shot performance improved sharply with model scale. While smaller models showed little difference between zero-shot and few-shot settings, the 175B parameter model often exhibited large performance gains when given just a handful of examples. This gap between zero-shot and few-shot performance widened consistently as model size increased, suggesting that in-context learning is an emergent capability that arises at sufficient scale. [3]
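In practice, a few-shot prompt is nothing more than task demonstrations written into the input text, which the model then continues. The snippet below shows the English-to-French translation format used as a running example in the paper; the final line is left incomplete for the model to fill in.

```python
# A few-shot prompt: demonstrations are concatenated into the input text and
# the model completes the pattern (translation example in the style of the paper).
few_shot_prompt = """Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe en peluche
cheese =>"""
print(few_shot_prompt)
```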
Some AI researchers consider GPT-3 a step toward machines that can comprehend the complexities of human language. The system has been used in several experiments to generate tweets and poetry, summarize emails, answer trivia questions, translate languages, and write computer programs, all with little input from the user. [4] Indeed, OpenAI has suggested that the model is capable of zero-shot or few-shot learning, meaning that it can perform new tasks without additional training cycles; being shown a few examples in the prompt is sufficient for the model to learn. [2] [5]
Few-shot learning is one of the main open problems in AI research. The ability to learn complex tasks from only a few examples is a human capability that still eludes machines; learning with no examples at all is called zero-shot learning. While GPT-3 does not solve few-shot learning, it is an important development because it demonstrates that scaling up, as happened from GPT-2 to GPT-3, can substantially improve few-shot performance. [2]
GPT-3 was evaluated across a wide range of NLP benchmarks, demonstrating strong few-shot performance that in many cases approached or matched fine-tuned state-of-the-art models. The table below summarizes key results for the full GPT-3 175B model. [3]
| Benchmark | Task Type | Zero-Shot | One-Shot | Few-Shot | Prior SOTA (Fine-tuned) |
|---|---|---|---|---|---|
| LAMBADA | Language Modeling (Accuracy) | 76.2% | 72.5% | 86.4% | 68.0% (GPT-2) |
| TriviaQA | Closed-Book QA | 64.3% | 68.0% | 71.2% | 50.1% (T5-11B) |
| PIQA | Physical Commonsense | 81.0% | 80.5% | 82.8% | 79.4% |
| WinoGrande | Coreference Resolution | 70.2% | 73.2% | 77.7% | 84.6% (fine-tuned) |
| ARC (Challenge) | Science QA | 51.4% | 53.2% | 51.5% | 73.0% (UnifiedQA) |
| ARC (Easy) | Science QA | 68.8% | 71.2% | 70.1% | 92.0% |
| HellaSwag | Sentence Completion | 78.9% | 78.1% | 79.3% | 85.6% |
| StoryCloze | Story Completion | 83.2% | 84.7% | 87.7% | 91.8% |
On the SuperGLUE benchmark, GPT-3's performance was mixed across individual tasks. It achieved near state-of-the-art results on COPA and ReCoRD in the few-shot setting. Performance on BoolQ, MultiRC, and RTE was reasonable, roughly matching fine-tuned BERT-Large. However, GPT-3 struggled with tasks that required comparing two sentences or snippets, such as WiC (Word-in-Context) and tasks involving paraphrasing or textual entailment. Overall, GPT-3 required fewer than eight total examples per task to outperform fine-tuned BERT-Large on the aggregate SuperGLUE score. [3]
Notably, on the LAMBADA benchmark for next-word prediction, GPT-3 few-shot achieved 86.4% accuracy, an 18-point improvement over the previous state of the art. On TriviaQA in the closed-book few-shot setting, GPT-3 achieved 71.2%, surpassing the fine-tuned T5-11B model. [3]
GPT-3 also demonstrated competence in machine translation, achieving competitive BLEU scores on WMT translation benchmarks between English, French, German, and Romanian, despite translation never being an explicit training objective. [3]
GPT-3 demonstrated an interesting quality: its general-purpose approach. In machine learning, a model is usually trained for a specific task and remains limited to it. OpenAI's language model, however, can perform different tasks without fine-tuning (that is, without additional training). [2] This capability came from scaling up the model and the training data, which produced emergent abilities that can be invoked through prompts "without expensive supervised data or repeated re-training." [13]
From its original purpose of generating text from an initial input, GPT-3 has proven able to translate between languages, perform reading comprehension tasks, and answer SAT-style analogy questions with some degree of accuracy. [2] [4] Developers also showed that the language model could create JavaScript code from a prompt, something its developers did not anticipate; this appears to be a consequence of the training data, which included samples of code. [2] [4] Dario Amodei, OpenAI's vice-president for research, said that GPT-3 "has this emergent quality. It has some ability to recognize the pattern that you gave it and complete the story, give another example." [4]
On June 11, 2020, OpenAI announced the GPT-3 API, offering developers a general-purpose "text in, text out" interface that could complete almost any English language task. [5] [21] Rather than releasing model weights publicly (as had been done eventually with GPT-2), OpenAI chose a commercial API model. This decision was driven by safety considerations and the desire to control potential misuse, but it also established a new business model for deploying large language models.
Initially, access was tightly restricted through a waitlist. OpenAI reviewed applications and gradually expanded access over the following months. By March 2021, the API was being used in over 300 different applications by tens of thousands of developers, producing an average of 4.5 billion words per day. [6]
On November 18, 2021, OpenAI announced that the waitlist had been removed, making the API generally available to all developers. [5] OpenAI also launched the GPT-3 Playground, a web-based interface allowing users to interact with the model directly without writing code. Users could try the Playground for free, though sign-up with an email address and phone number was required. [10] [11] [12]
The GPT-3 API marked a turning point in how AI capabilities were distributed. Previously, only organizations with significant compute resources and machine learning expertise could build and deploy large language models. The API abstracted away the infrastructure complexity, allowing individual developers, startups, and businesses to integrate advanced language capabilities into their products through simple HTTP requests. This model of "AI as a service" influenced the strategies of competitors and set the stage for the broader commercialization of large language models. [21]
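As an illustration of what those "simple HTTP requests" looked like, the sketch below calls the legacy completions endpoint with Python's requests library; the model name, prompt, and parameter values are example choices, and the endpoint shown is the original GPT-3-era completions API rather than OpenAI's later chat interface.

```python
# Minimal call to the legacy completions endpoint (illustrative; the model name
# and parameters are example choices, and an OPENAI_API_KEY must be set).
import os
import requests

response = requests.post(
    "https://api.openai.com/v1/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "davinci-002",
        "prompt": "Write a one-sentence summary of what GPT-3 is:",
        "max_tokens": 60,
        "temperature": 0.7,
    },
    timeout=30,
)
print(response.json()["choices"][0]["text"])
```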
The OpenAI language model increased interest in large language model (LLM) research, with companies like Google (PaLM), Meta (OPT-175B), NVIDIA (Megatron-Turing NLG), and DeepMind (Chinchilla) investing in their own LLMs. Companies and startups also began to build GPT-3 into their products. Companies like Hugging Face, Cohere, and Humanloop now provide LLM APIs, competing with OpenAI. GPT-3 also fueled the development of open-source projects aimed at making LLMs available to a wider audience, such as BigScience's BLOOM and EleutherAI's GPT-J. [5]
In 2017, research labs at Google and OpenAI started working on neural networks that learned from massive amounts of text; these included Wikipedia articles and unpublished books. GPT-3 is the result of several years of work by OpenAI, which is backed by funding from Microsoft. [4] Its previous version, GPT-2, was announced in February 2019 and had 1.5 billion parameters trained on 40 GB of text (around 10 billion tokens). [7]
When released in June 2020, GPT-3 was the largest neural network, and it captured the attention of media, researchers, and AI businesses. OpenAI decided to launch it as a private API, filtering who could use the system through a waitlist. [5]
The development of GPT-3 would not have been possible without innovations such as using "neural networks to generate word vector representations. Instead of using the words themselves in a machine learning algorithm, the idea is to first represent the words as mathematical vectors." [2] Another relevant innovation was the use of recurrent neural networks to "read" sentences. The introduction of the Transformer architecture was a further milestone, enabling the deep neural networks (DNNs) now available to researchers in natural language processing and permitting the development of much larger language models. [2]
GPT-3 continued to improve after launch. According to OpenAI, new features were added to the API over time, such as the ability to fine-tune models on custom data, an embeddings endpoint, and edit and insert capabilities.
GPT-3 served as one of the first large-scale empirical validations of the scaling laws described by Kaplan et al. (2020). The paper showed that across the eight model sizes tested (from 125M to 175B parameters), validation loss decreased smoothly as a function of model size, following the predicted power-law relationship. [3] [18]
The results confirmed several key predictions from the scaling laws research:

- Validation loss decreased smoothly with model size, following the predicted power law across the full range of scales tested.
- Larger models were more sample-efficient, reaching a given level of loss after seeing fewer training tokens.
- Gains from scale showed no sign of saturating at 175 billion parameters, with downstream performance, including few-shot performance, continuing to improve as models grew.
These findings had a profound influence on the AI research community, providing empirical evidence that the path to more capable language models lay primarily in scaling up parameters, data, and compute. The success of GPT-3 motivated a wave of large-scale model development across the industry, including Google's PaLM (540B parameters), DeepMind's Chinchilla (70B parameters, trained on more data), and Meta's LLaMA. [18]
In July 2021, OpenAI published "Evaluating Large Language Models Trained on Code" (Chen et al., 2021), introducing Codex, a GPT-3 descendant fine-tuned on publicly available code from GitHub. [22] Codex powered GitHub Copilot, the AI pair programming tool developed by GitHub in collaboration with OpenAI.
On the HumanEval benchmark, which measures the ability to synthesize correct Python functions from docstrings, Codex solved 28.8% of problems, while the base GPT-3 model solved 0%. With repeated sampling (100 samples per problem), Codex's success rate rose to 70.2%. [22]
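The Codex paper evaluates pass@k with an unbiased estimator rather than by literally drawing k samples: for a problem where n samples were generated and c of them pass the unit tests, pass@k = 1 − C(n−c, k)/C(n, k). A direct sketch of that formula is below (the paper computes it in a numerically stabler form, but the result is the same).

```python
# Unbiased pass@k estimator from the Codex paper (Chen et al., 2021):
# pass@k = 1 - C(n - c, k) / C(n, k), for n samples with c correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 100 samples per problem, 30 of which pass the unit tests.
print(pass_at_k(n=100, c=30, k=1))    # ~0.30
print(pass_at_k(n=100, c=30, k=10))   # close to 1.0
```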
Codex represented an important milestone in demonstrating that large language models pre-trained on text could be effectively adapted for specialized domains through additional fine-tuning. The recent version of GPT-3's Codex program, trained specifically for code generation, could also identify bugs, fix mistakes in its code, and explain the function of that code. [15]
In March 2022, OpenAI published "Training Language Models to Follow Instructions with Human Feedback" (Ouyang et al., 2022), which introduced InstructGPT, a family of models derived from GPT-3 through a process of alignment with human preferences. [23] The InstructGPT paper addressed a fundamental limitation of GPT-3: while the base model was trained to predict the next token, it was not explicitly trained to follow user instructions or produce helpful, truthful, and harmless outputs.
The InstructGPT training process involved three stages:

1. Supervised fine-tuning (SFT): human labelers wrote demonstrations of the desired behavior for a set of prompts, and GPT-3 was fine-tuned on these demonstrations.
2. Reward model training: labelers ranked several model outputs for the same prompt, and a reward model was trained to predict which outputs humans preferred (see the sketch after this list).
3. Reinforcement learning: the supervised model was further optimized against the reward model using proximal policy optimization (PPO).
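The heart of the second stage is a pairwise comparison loss: the reward model should assign a higher score to the response labelers preferred. The PyTorch sketch below shows that loss in isolation, assuming the reward model outputs one scalar per response; the reward model itself, batching over multiple comparisons per prompt, and the PPO stage are omitted.

```python
# Pairwise reward-model loss used in RLHF (sketch): push the reward of the
# human-preferred response above the reward of the rejected response.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy batch of three comparisons with scalar rewards:
loss = reward_model_loss(torch.tensor([1.2, 0.3, 0.8]),
                         torch.tensor([0.4, 0.5, -0.1]))
print(loss.item())
```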
A striking finding was that the 1.3B parameter InstructGPT model was preferred by human evaluators over the 175B parameter base GPT-3 model, despite having over 100 times fewer parameters. InstructGPT also showed improvements in truthfulness and reductions in toxic output generation, with minimal performance regressions on public NLP benchmarks. [23]
InstructGPT laid the groundwork for the techniques used in ChatGPT and became one of the most influential papers in AI alignment research.
Building on the InstructGPT approach, OpenAI developed the GPT-3.5 series, a family of models that bridged the gap between GPT-3 and GPT-4. The GPT-3.5 series includes several models that evolved through different training techniques:

- code-davinci-002, a base model oriented toward code-completion tasks;
- text-davinci-002, an instruction-following model built on code-davinci-002 and trained with supervised fine-tuning on human demonstrations;
- text-davinci-003, which added reinforcement learning from human feedback (RLHF);
- gpt-3.5-turbo, a faster and cheaper model optimized for chat.
On November 30, 2022, OpenAI launched ChatGPT, a conversational interface built on a model from the GPT-3.5 series that had been fine-tuned using RLHF for dialogue. ChatGPT became one of the fastest-growing consumer technology products in history, reaching an estimated 100 million monthly active users within two months of launch. [24] The success of ChatGPT transformed public awareness of large language models and triggered a global wave of investment and competition in generative AI.
Use of the GPT-3 API is charged per token, where a token is a fragment of a word (roughly three-quarters of an English word). New accounts received an $18 free credit valid during the first three months of use. In September 2022, OpenAI reduced the prices for the GPT-3 API service. Current pricing plans can be checked on OpenAI's website. [5]
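Because billing is per token, cost estimates reduce to simple arithmetic; the price in the sketch below is a placeholder, since actual prices vary by model and have changed repeatedly.

```python
# Rough cost estimate for per-token billing. The price is a placeholder;
# real prices depend on the model and have changed over time.
PRICE_PER_1K_TOKENS = 0.02        # example price in USD (placeholder)
WORDS = 10_000
TOKENS_PER_WORD = 4 / 3           # rule of thumb: a token is ~3/4 of a word

tokens = WORDS * TOKENS_PER_WORD
cost = tokens / 1000 * PRICE_PER_1K_TOKENS
print(f"~{tokens:.0f} tokens -> ${cost:.2f}")
```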
Universal language models can be useful for a broad range of applications, such as automatically summarizing news articles or powering chatbots for online conversation. GPT-3 promises to take those capabilities further, speeding up the development of smartphone apps and making chatbot conversation feel closer to that of a human being. [4] Code generation is one use case that has already proven effective: given a small program function or a description, GPT-3 can produce code in different languages. GitHub Copilot is a code-generation tool built on Codex, a version of GPT-3 fine-tuned for programming, and the recent version of the Codex program can even identify bugs, fix mistakes in its code, and explain the function of that code. [15]
Other use cases open new opportunities for businesses and professionals that rely on content creation. One is automated content generation: articles, blog posts, and social media posts can be drafted by the language model, saving time for those who need to create content frequently. Another is improved content quality, since the model can learn from a large amount of data and identify patterns that might otherwise go unnoticed, potentially yielding more accurate and more informative content. [15]
Used together with another AI model that generates images or video, GPT-3 can increase content variety, diversifying the output of businesses and making it appealing to a wider range of people. The system can also create personalized content tailored to the preferences of individual users. [15]
This generative model has been used more commonly for marketing applications. Jasper is a "marketing-focused version of GPT-3 that can create blogs, social media posts, web copy, sales emails, ads, and other types of customer-facing content." [15]
There are over 300 apps that use GPT-3 in areas such as productivity, education, creativity, and games. OpenAI's blog gives examples such as Viable, Fable Studio, and Algolia. [6] Other applications include OtherSideAI's Quick Response, an email-generation program that composes a message in the user's writing style from an outline of key points; Kriya.ai, which offers personalized introduction requests; Dover.ai, which expands a short job description into a longer posting; and Copy.ai, which writes ad copy based on a product description. [7]
Dale (2021) and Brown et al. (2020) have noted several limitations of OpenAI's language model in text synthesis and NLP tasks. [7] [14] Dale (2021) points out that "its outputs may lack semantic coherence, resulting in text that is gibberish and increasingly nonsensical as the output grows longer." The system may also absorb biases present in its training data, and its outputs "may correspond to assertions that are not consonant with the truth." [5] Brown et al. (2020) likewise note a loss of coherence over sufficiently long passages, with contradictions and non-sequitur sentences. [14]
Si et al. (2022) highlight studies that investigate GPT-3's capabilities on specific tasks such as mathematical reasoning, multi-hop reasoning, and code generation. The authors note that while language models have improved, they are still not reliable, GPT-3 included. [13]
Zhang and Li (2021) wrote that "GPT-3 is still at the stage of perceptual intelligence, and a significant amount of development is required before it reaches general intelligence and cognitive intelligence." They note that the model has not met expectations in "natural language reasoning, filling in the blanks, long text generation, and reading comprehension tasks." [9]
GPT-3's release in 2020 marked a watershed moment in artificial intelligence research and the broader technology industry. Its impact can be observed across several dimensions:
Paradigm shift in NLP. Before GPT-3, the dominant approach in NLP involved pre-training a model and then fine-tuning it on task-specific labeled datasets. GPT-3 demonstrated that a sufficiently large model could perform tasks competitively through in-context learning alone, without any gradient updates. This challenged the conventional wisdom about how NLP systems should be built and deployed. [3]
Commercial LLM ecosystem. The API-based deployment model pioneered by GPT-3 created an entirely new market for language model services. Competitors such as Cohere, AI21 Labs, and Anthropic (founded in 2021 by former OpenAI researchers, including Dario Amodei) launched their own commercial LLM APIs. Cloud providers including Google, Amazon, and Microsoft began integrating LLM capabilities into their platforms. [5]
Open-source response. GPT-3's closed-source nature motivated a vigorous open-source movement. EleutherAI developed GPT-Neo (2.7B), GPT-J (6B), and GPT-NeoX (20B) as open alternatives. BigScience, a collaboration of over 1,000 researchers, created BLOOM (176B parameters) as an open-access multilingual model. These efforts aimed to provide the research community with models comparable in scale to GPT-3 without API access restrictions. [5]
Scaling race. GPT-3 validated the hypothesis that scaling up produces better models, triggering a race among major AI labs to build ever-larger systems. Within two years, Google released PaLM (540B parameters), DeepMind published research on Chinchilla (which showed that GPT-3 was likely undertrained relative to its size), and numerous other organizations invested heavily in large-scale model development. [18]
Public awareness. While GPT-3's API was primarily used by developers, viral demonstrations of its capabilities on social media brought widespread public attention to the potential of large language models. This heightened awareness set the stage for the explosive public reception of ChatGPT in late 2022. [4]
There have been discussions about AI tools, whether for text, image, or video generation, and their impact both on creative work and on the professionals in the fields these tools touch. For now, the user still provides substantial input through prompting, editing, and iteration, but future technological developments could reduce the need for those activities. [15]
According to Floridi and Chiriatti (2020), text generation models like GPT-3 signal "the arrival of a new age in which we can now mass produce good and cheap semantic artefacts. Translations, summaries, minutes, comments, webpages, catalogues, newspaper articles, guides, manuals, forms to fill, reports, recipes... soon an AI service may write, or at least draft, the necessary texts, which today still require human effort." The authors envision a future in which writers will have less work and "people whose jobs still consist in writing will be supported, increasingly, by tools such as GPT-3." These professionals will need to develop skills in prompting for the software to deliver the best results, and to collect and combine the results obtained. [1]
Finally, Floridi and Chiriatti (2020) also envision a world where "given the business models of many online companies, clickbait of all kinds will be boosted by tools like GPT-3, which can produce excellent prose cheaply, quickly, purposefully, and in ways that can be automatically targeted to the reader. GPT-3 will be another weapon in the competition for users' attention. Furthermore, the wide availability of tools like GPT-3 will support the development of 'no-code platforms,' which will enable marketers to create applications to automate repetitive tasks, starting from data commands in natural language (written or spoken)." [1]
Others see the development of technologies like GPT-3 not as a threat to their careers but as a tool that makes their jobs easier, as in the case of app development. GPT-3 is a tool that researchers and entrepreneurs are exploring in order to build new technologies, products, and companies. [4]
Like OpenAI's image-generation counterpart, DALL-E, GPT-3 raises concerns, and safety guidelines are in place to prevent the creation of content involving hate, harassment, violence, self-harm, adult or political content, spam, deception, and malware. [5]
Language models could also increase the production of misinformation or fraudulent academic essays. According to Brown et al. (2020), "the misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to generate several paragraphs of synthetic content that people find difficult to distinguish from human-written text [...] represents a concerning milestone in this regard." Dehouche (2021) raised concerns about content generated by GPT-3, noting "the possible use of this technology to facilitate scientific misconduct" and calling "for an urgent revision of publishing standards." He adds: "I believe that the advent of this powerful NLP technology calls for an urgent update of our concepts of plagiarism." [16]
The GPT-3 paper itself included a detailed analysis of biases present in the model's outputs, examining gender, race, and religious biases. The authors found that GPT-3 exhibited stereotypical associations in its text generation, reflecting biases present in its training data. They noted that addressing these biases would require both technical interventions and careful deployment practices. [3]