[[File:GPT Training process.png|thumb|Figure 2: General overview of the training process using reinforcement learning from human feedback. Source: OpenAI.]]


The ChatGPT model was trained using [[Reinforcement Learning from Human Feedback]] ([[RLHF]]), following the same methods as InstructGPT (figure 2) with only small differences in the data collection setup. An initial model was trained with supervised fine-tuning, in which human AI trainers had "conversations in which they played both sides—the user and an AI assistant." <ref name="”1”" />
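
ChatGPT itself is not open source, so the supervised fine-tuning step can only be illustrated in outline. The sketch below is a stand-in rather than OpenAI's code: it assumes the open-source Hugging Face <code>transformers</code> library, the small GPT-2 model as a substitute for the actual base model, and a hypothetical trainer-written dialogue, and it simply applies the standard next-token prediction loss to the demonstration text.

<syntaxhighlight lang="python">
# Illustrative sketch of the supervised fine-tuning (SFT) step, not OpenAI's code.
# Assumptions: Hugging Face "transformers", GPT-2 as a stand-in base model,
# and a made-up trainer-written dialogue as the demonstration data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A demonstration in which the trainer plays both sides of the conversation.
dialogues = [
    "User: What is photosynthesis?\n"
    "Assistant: Photosynthesis is the process by which plants convert light "
    "energy into chemical energy stored as sugars."
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for text in dialogues:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # Standard causal language-modelling objective: predict each next token
    # of the trainer-written demonstration.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
</syntaxhighlight>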


The reinforcement learning step uses a reward model, which was built by collecting comparison data consisting of two or more model responses ranked by quality. According to the official blog of OpenAI, "to collect this data, we took conversations that AI trainers had with the chatbot. We randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them. Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process." <ref name="”1”" />
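
The reward-modelling step can be sketched in the same hedged way. In the example below, a scalar "reward head" is attached to a stand-in GPT-2 backbone and trained with a pairwise ranking loss of the kind used for InstructGPT (the negative log-sigmoid of the score difference between the preferred and the rejected completion), so that the completion the trainer ranked higher receives the higher score; the prompt, completions and hyperparameters are purely illustrative.

<syntaxhighlight lang="python">
# Illustrative sketch of reward-model training on ranked comparisons, not OpenAI's code.
# Assumptions: Hugging Face "transformers", GPT-2 as a stand-in backbone, and a
# single made-up comparison in which the first completion was ranked higher.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
backbone = AutoModel.from_pretrained("gpt2")
reward_head = nn.Linear(backbone.config.hidden_size, 1)  # maps hidden state to a scalar reward

def reward(text: str) -> torch.Tensor:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    hidden = backbone(**batch).last_hidden_state      # shape (1, seq_len, hidden_size)
    return reward_head(hidden[:, -1, :]).squeeze()    # score read from the final token

# One comparison collected from trainers: "chosen" was ranked above "rejected".
prompt = "User: Explain gravity.\nAssistant:"
chosen = prompt + " Gravity is the mutual attraction between objects with mass."
rejected = prompt + " Gravity is when things are heavy."

params = list(backbone.parameters()) + list(reward_head.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-5)

# Pairwise ranking loss: push the preferred completion's reward above the other's.
loss = -torch.nn.functional.logsigmoid(reward(chosen) - reward(rejected))
loss.backward()
optimizer.step()
</syntaxhighlight>

The subsequent Proximal Policy Optimization step, in which the dialogue model is fine-tuned to maximise this learned reward, is typically run with a dedicated reinforcement-learning library such as the open-source <code>trl</code> package and is omitted from the sketch.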