The ChatGPT model was trained using [[Reinforcement Learning from Human Feedback]] ([[RLHF]]), following the same methods as InstructGPT (figure 2), with only small differences in the data collection setup. An initial model was trained with supervised fine-tuning, in which human AI trainers provided "conversations in which they played both sides—the user and an AI assistant." <ref name="”1”" />
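The sketch below illustrates what such a supervised fine-tuning step can look like in practice, using the Hugging Face <code>transformers</code> library. The base model, example conversations and hyperparameters are placeholders chosen for illustration; OpenAI's actual data, base model and pipeline are not public in this form.

<syntaxhighlight lang="python">
# Illustrative supervised fine-tuning (SFT) on trainer-written conversations.
# Model name, dataset and hyperparameters are hypothetical stand-ins, not OpenAI's setup.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # stand-in base model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Each example is a conversation written by a human trainer playing both sides.
conversations = [
    "User: What is RLHF?\nAssistant: Reinforcement learning from human feedback is...",
    "User: Summarise this article.\nAssistant: The article argues that...",
]

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, padding="max_length",
                    max_length=512)
    out["labels"] = out["input_ids"].copy()    # causal LM objective: predict the next token
    return out

dataset = Dataset.from_dict({"text": conversations}).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-demo", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
)
trainer.train()
</syntaxhighlight>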
The reinforcement learning stage uses a reward model for [[training|AI training]], built by collecting comparison data consisting of two or more model responses ranked by quality. According to the official OpenAI blog, "to collect this data, we took conversations that AI trainers had with the chatbot. We randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them. Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process." <ref name="”1”" />
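A minimal sketch of the reward-model objective implied by this comparison data is shown below: the model assigns a scalar score to each completion and is trained to score the human-preferred completion above the rejected one, in the spirit of InstructGPT's pairwise ranking loss. The model name and example texts are illustrative assumptions, not OpenAI's actual setup.

<syntaxhighlight lang="python">
# Illustrative reward-model objective for ranked comparison data.
# Pairwise loss: push the score of the preferred completion above the rejected one.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
# A single regression head: the scalar output is the reward for a (prompt, reply) pair.
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

def reward(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    return reward_model(**batch).logits.squeeze(-1)        # shape: (batch,)

# One comparison: trainers ranked completion A above completion B for the same prompt.
prompt = "User: Explain PPO briefly.\nAssistant: "
chosen = prompt + "PPO is a policy-gradient method that clips each update..."
rejected = prompt + "PPO is a kind of database."

r_chosen, r_rejected = reward([chosen, rejected])
loss = -F.logsigmoid(r_chosen - r_rejected)                # Bradley-Terry style ranking loss
loss.backward()                                            # one gradient step (optimiser omitted)
</syntaxhighlight>

The dialogue model would then be optimised against this learned reward with Proximal Policy Optimization, for example via a PPO implementation such as the one in the <code>trl</code> library; the details of OpenAI's own PPO setup have not been published.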
GPT-3.5 was trained on a mix of text and code published before Q4 2021. <ref name="”6”" /> Both ChatGPT (a fine-tuned version of a model in the GPT-3.5 series) and GPT-3.5 were trained on an Azure AI supercomputing infrastructure. <ref name="”1”" /> The model on which the chatbot is based, text-davinci-003, can handle more complex instructions, produces higher-quality output, and is better overall at long-form writing (around 65% longer outputs than text-davinci-002). It also has fewer limitations (e.g. a reduction in "hallucinations") than previous versions and scores higher on human preference ratings. <ref name="”6”" />
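For reference, text-davinci-003 was exposed through OpenAI's Completions endpoint; the snippet below shows a call using the legacy (pre-1.0) OpenAI Python client. The API key and prompt are placeholders, and the model has since been retired, so this is purely illustrative.

<syntaxhighlight lang="python">
# Querying text-davinci-003 through the legacy OpenAI Python client (pre-1.0 API).
# The key and prompt are placeholders; the model is no longer available.
import openai

openai.api_key = "sk-..."   # placeholder key

response = openai.Completion.create(
    model="text-davinci-003",
    prompt="Write a short paragraph about reinforcement learning from human feedback.",
    max_tokens=200,
    temperature=0.7,
)
print(response["choices"][0]["text"])
</syntaxhighlight>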