GPT-4V (Vision)
Last reviewed
Jun 3, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,479 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 3, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,479 words
Add missing citations, update stale details, or suggest a clearer explanation.
GPT-4V, also written GPT-4V(ision) and read as "GPT-4 with vision," is the image-understanding capability that OpenAI added to its GPT-4 large language model. GPT-4V allows a user to supply one or more images alongside text and have the model analyze, describe, and reason about their visual content. OpenAI began rolling the feature into ChatGPT for paying subscribers in late September 2023 and exposed it to software developers through the API at its first DevDay conference in November 2023. GPT-4V was not a separate model so much as the visual-input mode of GPT-4, and OpenAI documented its behavior and safety work in a dedicated "GPT-4V(ision) System Card" published on September 25, 2023.[1][2]
GPT-4 was announced on March 14, 2023, and OpenAI described it from the outset as a multimodal model that could accept both image and text inputs while producing text outputs.[3][4] However, the image-input capability was not made available to the general public at launch; the initial release accepted only text prompts, and the visual feature was tested privately. OpenAI stated that training of the vision system was completed in 2022 and that early access began in March 2023 with a small number of users so the company could study real-world use and safety implications.[1] One prominent early partner was Be My Eyes, an accessibility service connecting blind and low-vision people with assistance; its "Virtual Volunteer" tool used GPT-4's image recognition to describe photographs and surroundings.[3][4]
The "V" in GPT-4V denotes "vision." Rather than being a distinct neural network, GPT-4V represented the same GPT-4 model operating on image inputs in addition to text. OpenAI framed its public documentation around preparing this capability for broad deployment, building on the safety analysis it had already performed for the text-only version of GPT-4.[1]
GPT-4V applies the language-reasoning abilities of GPT-4 to a wide range of images, including photographs, screenshots, and documents that combine text and pictures.[2] Reported and demonstrated capabilities include:
Independent evaluations found the capability impressive but uneven. Reviewers reported that GPT-4V handled clear digital documents well but made errors on low-contrast or angled text, sometimes misread serial numbers, and occasionally transcribed words incorrectly (for example rendering "Eggless" as "Eggs").[2] It also struggled with precise spatial tasks: requests for bounding-box coordinates did not reliably match object positions, and the model misread the structure of grids in puzzles such as crosswords and sudoku.[2] A research note published when the API launched observed that the system "makes basic mistakes that a human wouldn't," including confusing colors of adjacent objects and misreading which value in a graph was larger.[5]
The GPT-4V(ision) System Card describes the evaluations, red-teaming, and mitigations OpenAI put in place before broad release.[1] Because adding vision created new risks beyond those of a text-only model, OpenAI applied a set of refusals and safeguards intended to prevent harmful or privacy-invasive uses. Documented mitigations and limitations include:
| Area | Behavior or mitigation |
|---|---|
| Identifying real individuals | The model is designed to refuse requests to identify specific real people from photographs, and to limit inferences about a person from an image. |
| CAPTCHAs | The model is designed to decline solving CAPTCHAs, which are intended to distinguish humans from automated systems. |
| Medical and high-stakes advice | OpenAI cautions against relying on the model for medical diagnosis or other high-stakes interpretation; performance on specialized medical imagery was not validated for clinical use. |
| Sensitive personal traits | Mitigations target requests to infer attributes such as identity or other sensitive characteristics from images. |
| Hate symbols and unsafe content | The system attempts to avoid endorsing hateful imagery, though the card acknowledges residual failure cases, noting the model could still generate text praising certain lesser-known hate groups in response to their symbols. |
| Spatial and structured reasoning | Bounding-box localization and grid-based puzzles were unreliable, so the model was not suited to precise spatial tasks. |
| Text transcription errors | OCR could fail on low-contrast, rotated, or stylized text, and the model sometimes paraphrased or omitted information. |
OpenAI emphasized that these mitigations reduced but did not eliminate risks, and that the system card reflected the state of the model at release rather than a guarantee of behavior.[1]
OpenAI introduced GPT-4V to the public on September 25, 2023, in a blog post titled "ChatGPT can now see, hear, and speak," which paired image input with new voice-conversation features.[6] Image understanding in ChatGPT was powered by multimodal GPT-3.5 and GPT-4, and users could upload pictures and highlight specific areas for the model to focus on.[6] OpenAI said the voice and image features would roll out to ChatGPT Plus and Enterprise users over the following two weeks; voice was offered on the iOS and Android apps as an opt-in setting, while image input was made available across platforms.[6] On the same day, OpenAI published the GPT-4V(ision) System Card describing the model's preparation for deployment.[1]
API access followed at OpenAI's first DevDay developer conference on November 6, 2023, where the company announced a model identifier named gpt-4-vision-preview that let developers send images to GPT-4.[2][5] Vision was launched as a preview alongside the newly announced GPT-4 Turbo line, with OpenAI signaling that image support would later be folded into the main GPT-4 Turbo model.[5]
| Date | Milestone |
|---|---|
| March 14, 2023 | GPT-4 announced as a multimodal (image and text input) model; image input not yet public. |
| March 2023 | Early access to vision begins for a small group of users; Be My Eyes integrates GPT-4 image recognition. |
| September 25, 2023 | GPT-4V(ision) System Card published; "ChatGPT can now see, hear, and speak" announces image and voice input. |
| Late September to October 2023 | Image input rolls out to ChatGPT Plus and Enterprise users over roughly two weeks. |
| November 6, 2023 | gpt-4-vision-preview made available to developers at OpenAI DevDay. |
| April 9, 2024 | gpt-4-turbo-2024-04-09 brings vision into the general-availability GPT-4 Turbo model. |
| May 13, 2024 | GPT-4o announced as a natively multimodal successor. |
| June 6, 2024 | OpenAI notifies developers of the deprecation of gpt-4-vision-preview. |
| December 6, 2024 | gpt-4-vision-preview (and gpt-4-1106-vision-preview) shut down; gpt-4o recommended as the replacement. |
GPT-4V's standalone preview status was relatively short-lived. On April 9, 2024, OpenAI released gpt-4-turbo-2024-04-09, a GPT-4 Turbo checkpoint that incorporated vision as a core feature so that image and text inputs could be handled by a single general-availability model with support for function calling and JSON mode.[7] This made the dedicated gpt-4-vision-preview endpoint redundant for many developers, and OpenAI subsequently deprecated it: the company notified affected developers on June 6, 2024, and shut the model down on December 6, 2024, recommending GPT-4o as the equivalent replacement.[8]
GPT-4o ("o" for "omni"), announced on May 13, 2024, represented the next step in OpenAI's approach to multimodality.[9] Whereas GPT-4V and the earlier ChatGPT voice mode combined separate components (for example using Whisper to transcribe audio before a text model processed it), GPT-4o processes text, audio, image, and video inputs through a single neural network and can generate text, audio, and image outputs.[9] OpenAI reported that GPT-4o matched GPT-4 Turbo on English text and code, was faster and cheaper in the API, and was made available to all ChatGPT users, including those on the free plan.[9] GPT-4V should therefore be understood as the transitional, bolt-on vision capability of the GPT-4 generation, distinct from the natively multimodal design that GPT-4o introduced.