Improving Language Understanding by Generative Pre-Training (GPT)

Introduction

In June 2018, OpenAI introduced GPT-1, a language model that combined unsupervised pre-training with the transformer architecture and made significant progress in natural language understanding. The team first pre-trained the model on unlabeled text and then fine-tuned it for specific tasks, finding that pre-training let it perform well on a variety of NLP tasks with only minimal task-specific changes. GPT-1 was a decoder-only transformer with masked self-attention and about 117 million parameters, pre-trained on the BooksCorpus dataset. It paved the way for later models with more parameters and larger datasets. One noteworthy feature of GPT-1 was that pre-training alone gave it some zero-shot ability on tasks such as question answering and sentiment analysis. Zero-shot here means the model attempts a task without having been trained on any labeled examples of that task. GPT-1 was a crucial building block in the development of language models with general language-based capabilities.

Related Work

In natural language processing, our research falls under the semi-supervised learning paradigm, which has attracted significant interest and has been applied to tasks such as sequence labeling and text classification. Early approaches computed word-level or phrase-level statistics from unlabeled data and used them as features in a supervised model. Later, researchers showed that word embeddings trained on unlabeled corpora improve performance on a variety of tasks, though these embeddings mainly transfer word-level information. Our work aims to capture higher-level semantics, and it relates to recent approaches that use phrase-level or sentence-level embeddings to encode text into vector representations suitable for various target tasks.

Unsupervised pre-training is a special case of semi-supervised learning in which the goal is to find a good initialization point rather than to modify the supervised learning objective. Pre-training acts as a regularization scheme, enabling better generalization in deep neural networks. Recent research has used the method to help train deep neural networks on tasks such as image classification, speech recognition, entity disambiguation, and machine translation. In our work, we pre-trained a neural network using a language modeling objective and then fine-tuned it on a target task with supervision. Our choice of transformer networks allowed us to capture longer-range linguistic structure, as our experiments demonstrated. Our model was effective on a range of tasks, including natural language inference, paraphrase detection, and story completion, and it required only minimal changes to the model architecture during transfer.
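
The language modeling objective mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical model that has already produced one list of vocabulary scores (logits) per position, and it averages the negative log-likelihood of each next token. The function names, the toy vocabulary of three token ids, and the choice to average rather than sum are assumptions for illustration, not details taken from the text.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def lm_nll(step_logits, tokens):
    """Average negative log-likelihood of tokens[1:] given their prefixes.

    step_logits[i] holds the (hypothetical) model's scores over the
    vocabulary for the token that follows tokens[:i + 1].
    """
    targets = tokens[1:]
    if len(step_logits) != len(targets):
        raise ValueError("need one logit vector per predicted token")
    total = 0.0
    for logits, target in zip(step_logits, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)

# Toy example: three token ids, and a model that scores every token equally.
tokens = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = lm_nll(uniform_logits, tokens)
```

With uniform scores over three tokens the loss equals log(3), the value a model with no knowledge of the text would reach. Pre-training lowers this quantity on the unlabeled corpus.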

Auxiliary training objectives are an alternative form of semi-supervised learning in which auxiliary unsupervised objectives are added during training to improve performance on target tasks. Our experiments also use an auxiliary objective, but we show that unsupervised pre-training already learns several linguistic aspects relevant to target tasks. Our approach also differs from methods that use hidden representations from a pre-trained language or machine translation model as auxiliary features while training a supervised model on the target task. Those methods introduce a substantial number of new parameters for each separate target task.
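
One way to picture an auxiliary objective during fine-tuning is as a weighted sum of the supervised task loss and the language modeling loss computed on the same input. The sketch below assumes this form. The function names, the weight `lam` with its default of 0.5, and the averaging of the language modeling terms are illustrative assumptions and are not specified in the text above.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def fine_tune_loss(class_logits, label, step_logits, tokens, lam=0.5):
    """Supervised loss plus a weighted auxiliary language modeling loss.

    class_logits: hypothetical classifier scores for the whole input.
    step_logits:  hypothetical next-token scores, one list per position.
    lam:          weight of the auxiliary term (an assumed default).
    """
    task_loss = cross_entropy(class_logits, label)
    lm_terms = [cross_entropy(l, t) for l, t in zip(step_logits, tokens[1:])]
    lm_loss = sum(lm_terms) / len(lm_terms)
    return task_loss + lam * lm_loss

# Toy example: two classes, three token ids, uniform scores everywhere.
total = fine_tune_loss(
    class_logits=[0.0, 0.0],
    label=1,
    step_logits=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    tokens=[0, 2, 1],
)
```

Setting `lam` to 0 recovers plain supervised fine-tuning, which makes it easy to compare training with and without the auxiliary term.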