GPT-4V (Vision)

Large Language Models Multimodal AI OpenAI

8 min read

Updated Jun 24, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 24, 2026

Fact-checked

In review queue

Sources

10 citations

Revision

v2 · 1,650 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

GPT-4V, also written GPT-4V(ision) and read as "GPT-4 with vision," is the image-understanding capability that OpenAI added to its GPT-4 large language model, letting a user supply one or more images alongside text and have the model analyze, describe, and reason about their visual content. In OpenAI's own words, "GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user."^[1] OpenAI began rolling the feature into ChatGPT for paying subscribers on September 25, 2023, and exposed it to software developers through the API at its first DevDay conference on November 6, 2023.^[2]^[6] GPT-4V was not a separate model so much as the visual-input mode of GPT-4, documented in a dedicated "GPT-4V(ision) System Card" published on September 25, 2023, and it served as the transitional, bolt-on vision layer of the GPT-4 generation that preceded the natively multimodal GPT-4o.^[1]^[2]

Overview

GPT-4 was announced on March 14, 2023, and OpenAI described it from the outset as a multimodal model that could accept both image and text inputs while producing text outputs.^[3]^[4] However, the image-input capability was not made available to the general public at launch; the initial release accepted only text prompts, and the visual feature was tested privately. OpenAI stated that training of the vision system was completed in 2022 and that early access began in 2023 with a small number of users so the company could study real-world use and safety implications.^[1] One prominent early partner was Be My Eyes, an accessibility service connecting blind and low-vision people with assistance; its "Virtual Volunteer" tool (later renamed Be My AI) was announced alongside the GPT-4 launch on March 14, 2023, and used GPT-4's image recognition to describe photographs and surroundings for a community of roughly 253 million blind or low-vision people worldwide.^[3]^[4]^[10]

The "V" in GPT-4V denotes "vision." Rather than being a distinct neural network, GPT-4V represented the same GPT-4 model operating on image inputs in addition to text. OpenAI framed its public documentation around preparing this capability for broad deployment, building on the safety analysis it had already performed for the text-only version of GPT-4.^[1]

What can GPT-4V do?

GPT-4V applies the language-reasoning abilities of GPT-4 to a wide range of images, including photographs, screenshots, and documents that combine text and pictures.^[2] Reported and demonstrated capabilities include:

Visual question answering: answering open-ended questions about the contents of an image, such as identifying objects, scenes, or activities.
Image description and analysis: producing detailed captions and interpreting relationships between elements in a picture.
Reading text in images (OCR): transcribing printed or handwritten text, including from screenshots and scanned documents.
Interpreting charts, diagrams, and tables: extracting information from data visualizations and structured graphics.
Multimodal reasoning: combining the visual input with a text prompt to perform tasks such as explaining a meme, suggesting recipes from a photo of ingredients, or walking through a problem shown in an image.

Independent evaluations found the capability impressive but uneven. Reviewers reported that GPT-4V handled clear digital documents well but made errors on low-contrast or angled text, sometimes misread serial numbers, and occasionally transcribed words incorrectly (for example rendering "Eggless" as "Eggs").^[5] It also struggled with precise spatial tasks: requests for bounding-box coordinates did not reliably match object positions, and the model misread the structure of grids in puzzles such as crosswords and sudoku.^[5] A research note published when the API launched observed that the system "makes basic mistakes that a human wouldn't," including confusing colors of adjacent objects and misreading which value in a graph was larger.^[5]

What are GPT-4V's safety limits?

The GPT-4V(ision) System Card describes the evaluations, red-teaming, and mitigations OpenAI put in place before broad release, drawing in part on fractions of alpha production traffic from July and August 2023 to study how people used the system for person identification, medical advice, and CAPTCHA solving.^[1] Because adding vision created new risks beyond those of a text-only model, OpenAI applied a set of refusals and safeguards intended to prevent harmful or privacy-invasive uses. Documented mitigations and limitations include:

Area	Behavior or mitigation
Identifying real individuals	The model is designed to refuse requests to identify specific real people from photographs, and to limit inferences about a person from an image.
CAPTCHAs	The model is designed to decline solving CAPTCHAs, which are intended to distinguish humans from automated systems.
Medical and high-stakes advice	OpenAI cautions against relying on the model for medical diagnosis or other high-stakes interpretation; performance on specialized medical imagery was not validated for clinical use.
Sensitive personal traits	Mitigations target requests to infer attributes such as identity or other sensitive characteristics from images.
Hate symbols and unsafe content	The system attempts to avoid endorsing hateful imagery, though the card acknowledges residual failure cases, noting the model could still generate text praising certain lesser-known hate groups in response to their symbols.
Spatial and structured reasoning	Bounding-box localization and grid-based puzzles were unreliable, so the model was not suited to precise spatial tasks.
Text transcription errors	OCR could fail on low-contrast, rotated, or stylized text, and the model sometimes paraphrased or omitted information.

OpenAI emphasized that these mitigations reduced but did not eliminate risks, and that the system card reflected the state of the model at release rather than a guarantee of behavior.^[1]

When was GPT-4V released?

OpenAI introduced GPT-4V to the public on September 25, 2023, in a blog post titled "ChatGPT can now see, hear, and speak," which paired image input with new voice-conversation features.^[6] Image understanding in ChatGPT was powered by multimodal GPT-3.5 and GPT-4, and users could upload pictures and highlight specific areas for the model to focus on with a drawing tool in the mobile app.^[6] OpenAI said the voice and image features would roll out to ChatGPT Plus and Enterprise users over the following two weeks; voice was offered on the iOS and Android apps as an opt-in setting and used the company's Whisper speech-recognition system to transcribe spoken words, while image input was made available across platforms.^[6] On the same day, OpenAI published the GPT-4V(ision) System Card describing the model's preparation for deployment.^[1]

API access followed at OpenAI's first DevDay developer conference on November 6, 2023, where the company announced a model identifier named gpt-4-vision-preview that let developers send images to GPT-4.^[2]^[5] Vision was launched as a preview alongside the newly announced GPT-4 Turbo line, with OpenAI signaling that image support would later be folded into the main GPT-4 Turbo model.^[5]

Date	Milestone
March 14, 2023	GPT-4 announced as a multimodal (image and text input) model; image input not yet public.
March 14, 2023	Be My Eyes announces its GPT-4-powered Virtual Volunteer (later Be My AI) as a launch partner.
September 25, 2023	GPT-4V(ision) System Card published; "ChatGPT can now see, hear, and speak" announces image and voice input.
Late September to October 2023	Image input rolls out to ChatGPT Plus and Enterprise users over roughly two weeks.
November 6, 2023	`gpt-4-vision-preview` made available to developers at OpenAI DevDay.
April 9, 2024	`gpt-4-turbo-2024-04-09` brings vision into the general-availability GPT-4 Turbo model.
May 13, 2024	GPT-4o announced as a natively multimodal successor.
June 6, 2024	OpenAI notifies developers of the deprecation of `gpt-4-vision-preview`.
December 6, 2024	`gpt-4-vision-preview` (and `gpt-4-1106-vision-preview`) shut down; `gpt-4o` recommended as the replacement.

How does GPT-4V differ from GPT-4 Turbo and GPT-4o?

GPT-4V's standalone preview status was relatively short-lived. On April 9, 2024, OpenAI released gpt-4-turbo-2024-04-09, a GPT-4 Turbo checkpoint that incorporated vision as a core feature so that image and text inputs could be handled by a single general-availability model with support for function calling and JSON mode.^[7] This made the dedicated gpt-4-vision-preview endpoint redundant for many developers, and OpenAI subsequently deprecated it: the company notified affected developers on June 6, 2024, and shut the model down on December 6, 2024, stating that "gpt-4o is the new equivalent to using gpt-4-vision-preview."^[8]

GPT-4o ("o" for "omni"), announced on May 13, 2024, represented the next step in OpenAI's approach to multimodality.^[9] Whereas GPT-4V and the earlier ChatGPT voice mode combined separate components (for example using Whisper to transcribe audio before a text model processed it), GPT-4o was trained "end-to-end across text, vision, and audio," so that all inputs and outputs are processed by the same neural network, and it can generate text, audio, and image outputs.^[9] OpenAI reported that GPT-4o responded to audio inputs in as little as 232 milliseconds (averaging 320 milliseconds, similar to human conversational latency), matched GPT-4 Turbo on English text and code, was 50% cheaper in the API, and was made available to all ChatGPT users, including those on the free plan.^[9] GPT-4V should therefore be understood as the transitional, bolt-on vision capability of the GPT-4 generation, distinct from the natively multimodal design that GPT-4o introduced.

References

OpenAI. "GPT-4V(ision) system card." September 25, 2023. https://openai.com/index/gpt-4v-system-card/ ↩
Roboflow Blog. "GPT-4 with Vision: Complete Guide and Evaluation." https://blog.roboflow.com/gpt-4-vision/ ↩
TechCrunch. "OpenAI releases GPT-4, AI that it claims is state-of-the-art." March 14, 2023. https://techcrunch.com/2023/03/14/openai-releases-gpt-4-ai-that-it-claims-is-state-of-the-art/ ↩
Wikipedia. "GPT-4." https://en.wikipedia.org/wiki/GPT-4 ↩
TechCrunch. "As OpenAI's multimodal API launches broadly, research shows it's still flawed." November 6, 2023. https://techcrunch.com/2023/11/06/openai-gpt-4-with-vision-release-research-flaws/ ↩
OpenAI. "ChatGPT can now see, hear, and speak." September 25, 2023. https://openai.com/index/chatgpt-can-now-see-hear-and-speak/ ↩
Microsoft Tech Community. "Announcing the General Availability of GPT-4 Turbo with Vision on Azure OpenAI Service." May 2, 2024. https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/announcing-the-general-availability-of-gpt-4-turbo-with-vision-on-azure-openai-s/4127916 ↩
OpenAI. "Deprecations." https://developers.openai.com/api/docs/deprecations ↩
OpenAI. "Hello GPT-4o." May 13, 2024. https://openai.com/index/hello-gpt-4o/ ↩
Be My Eyes. "Introducing Be My AI (formerly Virtual Volunteer) for People who are Blind or Have Low Vision, Powered by OpenAI's GPT-4." https://www.bemyeyes.com/news/introducing-be-my-ai-formerly-virtual-volunteer-for-people-who-are-blind-or-have-low-vision-powered-by-openais-gpt-4/ ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

1 revision by 1 contributors · full history

Suggest edit

What links here

BLINK GPT-4 LLM Benchmarks Timeline MM-Vet Mind2Web OlympiadBench Video-MME

Overview

What can GPT-4V do?

What are GPT-4V's safety limits?

When was GPT-4V released?

How does GPT-4V differ from GPT-4 Turbo and GPT-4o?

References

Improve this article

Related Articles

Sora 2

GPT Image 1

GPT-4o mini

Claude Sonnet 4.5

Vision language model

Reka AI

What links here

Related Articles

Sora 2

GPT Image 1

GPT-4o mini

Claude Sonnet 4.5

Vision language model

Reka AI

What links here