# GPT-4V (Vision)

> Source: https://aiwiki.ai/wiki/gpt_4_vision
> Updated: 2026-06-24
> Categories: Large Language Models, Multimodal AI, OpenAI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**GPT-4V**, also written **GPT-4V(ision)** and read as "GPT-4 with vision," is the image-understanding capability that [OpenAI](/wiki/openai) added to its [GPT-4](/wiki/gpt-4) large language model, letting a user supply one or more images alongside text and have the model analyze, describe, and reason about their visual content. In OpenAI's own words, "GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user."[1] OpenAI began rolling the feature into [ChatGPT](/wiki/chatgpt) for paying subscribers on September 25, 2023, and exposed it to software developers through the API at its first DevDay conference on November 6, 2023.[2][6] GPT-4V was not a separate model so much as the visual-input mode of GPT-4, documented in a dedicated "GPT-4V(ision) System Card" published on September 25, 2023, and it served as the transitional, bolt-on vision layer of the GPT-4 generation that preceded the natively [multimodal](/wiki/multimodal_ai) [GPT-4o](/wiki/gpt_4o).[1][2]

## Overview

GPT-4 was announced on March 14, 2023, and OpenAI described it from the outset as a multimodal model that could accept both image and text inputs while producing text outputs.[3][4] However, the image-input capability was not made available to the general public at launch; the initial release accepted only text prompts, and the visual feature was tested privately. OpenAI stated that training of the vision system was completed in 2022 and that early access began in 2023 with a small number of users so the company could study real-world use and safety implications.[1] One prominent early partner was Be My Eyes, an accessibility service connecting blind and low-vision people with assistance; its "Virtual Volunteer" tool (later renamed Be My AI) was announced alongside the GPT-4 launch on March 14, 2023, and used GPT-4's image recognition to describe photographs and surroundings for a community of roughly 253 million blind or low-vision people worldwide.[3][4][10]

The "V" in GPT-4V denotes "vision." Rather than being a distinct neural network, GPT-4V represented the same GPT-4 model operating on image inputs in addition to text. OpenAI framed its public documentation around preparing this capability for broad deployment, building on the safety analysis it had already performed for the text-only version of GPT-4.[1]

## What can GPT-4V do?

GPT-4V applies the language-reasoning abilities of GPT-4 to a wide range of images, including photographs, screenshots, and documents that combine text and pictures.[2] Reported and demonstrated capabilities include:

- **Visual question answering:** answering open-ended questions about the contents of an image, such as identifying objects, scenes, or activities.
- **Image description and analysis:** producing detailed captions and interpreting relationships between elements in a picture.
- **Reading text in images (OCR):** transcribing printed or handwritten text, including from screenshots and scanned documents.
- **Interpreting charts, diagrams, and tables:** extracting information from data visualizations and structured graphics.
- **Multimodal reasoning:** combining the visual input with a text prompt to perform tasks such as explaining a meme, suggesting recipes from a photo of ingredients, or walking through a problem shown in an image.

Independent evaluations found the capability impressive but uneven. Reviewers reported that GPT-4V handled clear digital documents well but made errors on low-contrast or angled text, sometimes misread serial numbers, and occasionally transcribed words incorrectly (for example rendering "Eggless" as "Eggs").[5] It also struggled with precise spatial tasks: requests for bounding-box coordinates did not reliably match object positions, and the model misread the structure of grids in puzzles such as crosswords and sudoku.[5] A research note published when the API launched observed that the system "makes basic mistakes that a human wouldn't," including confusing colors of adjacent objects and misreading which value in a graph was larger.[5]

## What are GPT-4V's safety limits?

The GPT-4V(ision) System Card describes the evaluations, red-teaming, and mitigations OpenAI put in place before broad release, drawing in part on fractions of alpha production traffic from July and August 2023 to study how people used the system for person identification, medical advice, and CAPTCHA solving.[1] Because adding vision created new risks beyond those of a text-only model, OpenAI applied a set of refusals and safeguards intended to prevent harmful or privacy-invasive uses. Documented mitigations and limitations include:

| Area | Behavior or mitigation |
| --- | --- |
| Identifying real individuals | The model is designed to refuse requests to identify specific real people from photographs, and to limit inferences about a person from an image. |
| CAPTCHAs | The model is designed to decline solving CAPTCHAs, which are intended to distinguish humans from automated systems. |
| Medical and high-stakes advice | OpenAI cautions against relying on the model for medical diagnosis or other high-stakes interpretation; performance on specialized medical imagery was not validated for clinical use. |
| Sensitive personal traits | Mitigations target requests to infer attributes such as identity or other sensitive characteristics from images. |
| Hate symbols and unsafe content | The system attempts to avoid endorsing hateful imagery, though the card acknowledges residual failure cases, noting the model could still generate text praising certain lesser-known hate groups in response to their symbols. |
| Spatial and structured reasoning | Bounding-box localization and grid-based puzzles were unreliable, so the model was not suited to precise spatial tasks. |
| Text transcription errors | OCR could fail on low-contrast, rotated, or stylized text, and the model sometimes paraphrased or omitted information. |

OpenAI emphasized that these mitigations reduced but did not eliminate risks, and that the system card reflected the state of the model at release rather than a guarantee of behavior.[1]

## When was GPT-4V released?

OpenAI introduced GPT-4V to the public on September 25, 2023, in a blog post titled "ChatGPT can now see, hear, and speak," which paired image input with new voice-conversation features.[6] Image understanding in ChatGPT was powered by multimodal GPT-3.5 and GPT-4, and users could upload pictures and highlight specific areas for the model to focus on with a drawing tool in the mobile app.[6] OpenAI said the voice and image features would roll out to ChatGPT Plus and Enterprise users over the following two weeks; voice was offered on the iOS and Android apps as an opt-in setting and used the company's [Whisper](/wiki/whisper) speech-recognition system to transcribe spoken words, while image input was made available across platforms.[6] On the same day, OpenAI published the GPT-4V(ision) System Card describing the model's preparation for deployment.[1]

API access followed at OpenAI's first DevDay developer conference on November 6, 2023, where the company announced a model identifier named `gpt-4-vision-preview` that let developers send images to GPT-4.[2][5] Vision was launched as a preview alongside the newly announced [GPT-4 Turbo](/wiki/gpt_4_turbo) line, with OpenAI signaling that image support would later be folded into the main GPT-4 Turbo model.[5]

| Date | Milestone |
| --- | --- |
| March 14, 2023 | GPT-4 announced as a multimodal (image and text input) model; image input not yet public. |
| March 14, 2023 | Be My Eyes announces its GPT-4-powered Virtual Volunteer (later Be My AI) as a launch partner. |
| September 25, 2023 | GPT-4V(ision) System Card published; "ChatGPT can now see, hear, and speak" announces image and voice input. |
| Late September to October 2023 | Image input rolls out to ChatGPT Plus and Enterprise users over roughly two weeks. |
| November 6, 2023 | `gpt-4-vision-preview` made available to developers at OpenAI DevDay. |
| April 9, 2024 | `gpt-4-turbo-2024-04-09` brings vision into the general-availability GPT-4 Turbo model. |
| May 13, 2024 | GPT-4o announced as a natively multimodal successor. |
| June 6, 2024 | OpenAI notifies developers of the deprecation of `gpt-4-vision-preview`. |
| December 6, 2024 | `gpt-4-vision-preview` (and `gpt-4-1106-vision-preview`) shut down; `gpt-4o` recommended as the replacement. |

## How does GPT-4V differ from GPT-4 Turbo and GPT-4o?

GPT-4V's standalone preview status was relatively short-lived. On April 9, 2024, OpenAI released `gpt-4-turbo-2024-04-09`, a GPT-4 Turbo checkpoint that incorporated vision as a core feature so that image and text inputs could be handled by a single general-availability model with support for function calling and JSON mode.[7] This made the dedicated `gpt-4-vision-preview` endpoint redundant for many developers, and OpenAI subsequently deprecated it: the company notified affected developers on June 6, 2024, and shut the model down on December 6, 2024, stating that "`gpt-4o` is the new equivalent to using `gpt-4-vision-preview`."[8]

GPT-4o ("o" for "omni"), announced on May 13, 2024, represented the next step in OpenAI's approach to multimodality.[9] Whereas GPT-4V and the earlier ChatGPT voice mode combined separate components (for example using Whisper to transcribe audio before a text model processed it), GPT-4o was trained "end-to-end across text, vision, and audio," so that all inputs and outputs are processed by the same neural network, and it can generate text, audio, and image outputs.[9] OpenAI reported that GPT-4o responded to audio inputs in as little as 232 milliseconds (averaging 320 milliseconds, similar to human conversational latency), matched GPT-4 Turbo on English text and code, was 50% cheaper in the API, and was made available to all ChatGPT users, including those on the free plan.[9] GPT-4V should therefore be understood as the transitional, bolt-on vision capability of the GPT-4 generation, distinct from the natively multimodal design that GPT-4o introduced.

## References

1. OpenAI. "GPT-4V(ision) system card." September 25, 2023. https://openai.com/index/gpt-4v-system-card/
2. Roboflow Blog. "GPT-4 with Vision: Complete Guide and Evaluation." https://blog.roboflow.com/gpt-4-vision/
3. TechCrunch. "OpenAI releases GPT-4, AI that it claims is state-of-the-art." March 14, 2023. https://techcrunch.com/2023/03/14/openai-releases-gpt-4-ai-that-it-claims-is-state-of-the-art/
4. Wikipedia. "GPT-4." https://en.wikipedia.org/wiki/GPT-4
5. TechCrunch. "As OpenAI's multimodal API launches broadly, research shows it's still flawed." November 6, 2023. https://techcrunch.com/2023/11/06/openai-gpt-4-with-vision-release-research-flaws/
6. OpenAI. "ChatGPT can now see, hear, and speak." September 25, 2023. https://openai.com/index/chatgpt-can-now-see-hear-and-speak/
7. Microsoft Tech Community. "Announcing the General Availability of GPT-4 Turbo with Vision on Azure OpenAI Service." May 2, 2024. https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/announcing-the-general-availability-of-gpt-4-turbo-with-vision-on-azure-openai-s/4127916
8. OpenAI. "Deprecations." https://developers.openai.com/api/docs/deprecations
9. OpenAI. "Hello GPT-4o." May 13, 2024. https://openai.com/index/hello-gpt-4o/
10. Be My Eyes. "Introducing Be My AI (formerly Virtual Volunteer) for People who are Blind or Have Low Vision, Powered by OpenAI's GPT-4." https://www.bemyeyes.com/news/introducing-be-my-ai-formerly-virtual-volunteer-for-people-who-are-blind-or-have-low-vision-powered-by-openais-gpt-4/

