Multimodal AI

93 articles

Showing 1-60 of 93 articles

Baidu ERNIE

Baidu ERNIE (Enhanced Representation through Knowledge Integration) is the family of large language and multimodal foundation models built by the Chinese...

Chinese AI, Large Language Models

CLIP (Contrastive Language-Image Pre-training)

CLIP (Contrastive Language-Image Pre-training) is a multimodal neural network developed by OpenAI that learns visual concepts from natural language by training...

Computer Vision, Deep Learning

CLIP Score

CLIP Score (also written CLIPScore or CLIP-S) is a reference-free automatic evaluation metric that measures how well a text caption matches an image, computed...

AI Benchmarks, Computer Vision

CM3leon

CM3leon (pronounced "chameleon") is a multimodal generative model from Meta AI, introduced in July 2023, that handles both text-to-image and image-to-text...

Image Generation, Meta AI

Chameleon (Meta AI)

Chameleon is a family of early-fusion, token-based mixed-modal foundation models from Meta AI's Fundamental AI Research (FAIR) group that represents both...

AI Models, Meta AI

Claude Sonnet 4.5

Claude Sonnet 4.5 is a multimodal large language model (LLM) developed by Anthropic and released on September 29, 2025,...

AI Code Generation, AI Tools & Products

CogAgent

CogAgent is an open visual language model built to act as a graphical user interface (GUI) agent: given a screenshot and a natural-language goal, it predicts...

AI Agents, Chinese AI

CogVLM

CogVLM is an open vision language model developed by Zhipu AI and the Knowledge Engineering Group (KEG) at Tsinghua University. It was introduced in the paper...

Chinese AI, Open Source AI

Computer-use agent

A computer-use agent (CUA) is a category of AI agent that performs tasks by directly operating a general-purpose computer's...

AI Agents, Artificial Intelligence

DeepSeek Janus

DeepSeek Janus is a family of open-weight unified multimodal models from Chinese AI lab DeepSeek that perform both image understanding and text-to-image...

AI Models, Chinese AI

DeepSeek-OCR

DeepSeek-OCR is an open-source optical character recognition (OCR) and document-understanding system released by DeepSeek on 20 October 2025 that pioneers a...

Chinese AI, Computer Vision

DeepSeek-VL

DeepSeek-VL is the first open-source vision-language model series from DeepSeek, the Chinese AI company. It was released on 11 March 2024 in 1.3B and 7B sizes,...

AI Models, Chinese AI

DeepSeek-VL2

DeepSeek-VL2 is an open-weights family of Mixture-of-Experts (MoE) vision-language models released by the Chinese AI laboratory DeepSeek on December 13,...

Chinese AI, Mixture of Experts

Document Question Answering Models

Document question answering models (DocQA, sometimes called DocVQA for document visual question answering) are machine learning systems that take a document...

AI Models

Doubao Seed 1.6

Doubao Seed 1.6 is a family of general-purpose foundation models developed by the ByteDance Seed research team and released through Volcano Engine on 11 June...

AI Models, Chinese AI

ERNIE 4.5

ERNIE 4.5 is a family of large language models released by the Chinese technology company Baidu, open-sourced on June 30, 2025 under the Apache 2.0 license...

Chinese AI, Large Language Models

ERNIE 5.0

ERNIE 5.0 is a natively omni-modal foundation model from Baidu, unveiled at the company's annual Baidu World 2025 conference in Beijing on 13 November 2025 as...

Chinese AI, Large Language Models

ERQA

ERQA (Embodied Reasoning Question Answering) is a benchmark spanning embodied reasoning, visual question answering, and robotics, in a multiple-choice VQA format (4...

AI Benchmarks, Embodied AI

EgoSchema

EgoSchema is a diagnostic benchmark for evaluating very long-form video language understanding, introduced by Karttikeya Mangalam, Raiymbek Akshulakov, and...

AI Benchmarks, Computer Vision

Feature Extraction Models

Feature extraction models are machine learning systems that transform raw inputs such as text, images, or audio into dense numerical vectors known as...

AI Models

Flamingo (visual language model)

Flamingo is a family of visual language models (VLMs) built by DeepMind and introduced in April 2022 that brought few-shot, in-context learning to multimodal...

AI Models, Google DeepMind

Fox (benchmark)

Fox is an evaluation suite for fine-grained, multi-page document understanding by large vision-language models. It was released in May 2024 alongside a model...

AI Benchmarks, Computer Vision

GPT Image 1

GPT Image 1 (API identifier gpt-image-1) is a natively multimodal image generation model developed by OpenAI, integrated into ChatGPT on March 25, 2025, and...

AI Models, Generative AI

GPT-4V (Vision)

GPT-4V, also written GPT-4V(ision) and read as "GPT-4 with vision," is the image-understanding capability that OpenAI added to its GPT-4 large language model,...

Large Language Models, OpenAI

GPT-4o mini

GPT-4o mini is a small, low-cost multimodal large language model developed by OpenAI and released on July 18, 2024, as the company's most cost-efficient model...

OpenAI, Small Language Models

Gemini (app)

The Gemini app is Google's consumer AI assistant, a chat application available on the web at gemini.google.com and as native Android and iOS apps, powered by...

AI Tools & Products, Conversational AI

Gemini 1.0

Gemini 1.0 is the first generation of Gemini, the family of natively multimodal AI models that Google DeepMind announced on 6 December 2023 [1][2]. It shipped...

Google DeepMind, Large Language Models

Gemini 1.5 Flash

Gemini 1.5 Flash is a lightweight, low-latency multimodal large language model from Google DeepMind, released at Google I/O on May 14, 2024, as the fast and...

Google DeepMind, Large Language Models

Gemini 1.5 Pro

Gemini 1.5 Pro is a multimodal large language model developed by Google DeepMind and announced on February 15, 2024, as the flagship model of the Gemini 1.5...

Google DeepMind, Large Language Models

Gemini 2.0 Flash

Gemini 2.0 Flash is a fast, low-cost multimodal large language model built by Google DeepMind as the flagship workhorse of the Gemini 2.0 generation, designed...

Google DeepMind, Large Language Models

Gemini 2.0 Flash-Lite

Gemini 2.0 Flash-Lite is a large language model developed by Google DeepMind and released as the most cost-efficient member of the Gemini 2.0 model family. It...

Google DeepMind, Large Language Models

Gemini 2.5 Flash

Gemini 2.5 Flash is a fast, cost-optimized multimodal large language model developed by Google DeepMind and the mid-tier member of the Gemini 2.5 family. It...

AI Models, Google DeepMind

Gemini 3

Gemini 3 is the third major generation of the Gemini family of multimodal models from Google DeepMind, launched on November 18, 2025 with Gemini 3 Pro as the...

AI Models, Google

Gemini 3 Flash

Gemini 3 Flash is a multimodal large language model released by Google on December 17, 2025 as the fast, lower-cost sibling to Gemini 3 Pro in the Gemini 3...

AI Models, Google

Gemini 3 Pro

Gemini 3 Pro is the flagship preview model in Google DeepMind's Gemini 3 family of multimodal models, launched on November 18, 2025 as what Google called "our...

AI Models, Google

Gemini 3.1 Pro

Gemini 3.1 Pro is a large language model developed by Google DeepMind and released on 19 February 2026 as a point-release upgrade to Gemini 3 Pro [1][2]. Built...

Google DeepMind, Large Language Models

Gemini 3.5 Flash

Gemini 3.5 Flash is a fast frontier large language model developed by Google DeepMind, announced at Google I/O 2026 on May 19, 2026 and made generally...

Google DeepMind, Large Language Models

Gemini Ultra

Gemini Ultra (branded Ultra 1.0) was the largest and most capable model in the Gemini 1.0 family, the first generation of natively multimodal large language...

Google DeepMind, Large Language Models

Gemma 3

Gemma 3 is a family of open-weight large language models developed by Google DeepMind and released on March 12, 2025.[2] It is the third generation of the...

AI Models, Google

HY-World 2.0

HY-World 2.0 (also written HunyuanWorld 2.0 or Hunyuan World Model 2.0) is an open multimodal 3D world model from Tencent's Hunyuan team, with a technical...

Chinese AI, World Models

Image-to-Text Models

See also: Multimodal Models and Tasks. Image-to-text models are machine learning systems that take an image as input and produce natural language text as...

Machine Learning

ImageBind

ImageBind is a multimodal model from Meta AI (its Fundamental AI Research lab) that learns a single joint embedding space across six different modalities:...

AI Models, Meta AI

InternVL

InternVL is a family of open-source multimodal large language models developed by the OpenGVLab research group at the Shanghai Artificial Intelligence...

Chinese AI, Open Source AI

LINGO-2 (Wayve)

LINGO-2 is a closed-loop vision-language-action model for autonomous driving developed by the British self-driving company Wayve. Announced on 17 April 2024,...

AI Models, Autonomous Vehicles

LLaVA (Large Language and Vision Assistant)

LLaVA (Large Language and Vision Assistant) is an open-source family of multimodal large language models that connects a frozen vision encoder to a pre-trained...

Open Source AI

LayoutLM

LayoutLM is a family of pre-trained multimodal models developed by Microsoft Research for document AI, the task of automatically reading and understanding...

Large Language Models

Llama 3.2

Llama 3.2 is a model family from Meta, released at Meta Connect 2024. Architecture: auto-regressive transformer; vision models add a cross-attention image adapter on frozen Llama 3.1...

AI Models, Large Language Models

Llama 3.2 Vision

Llama 3.2 Vision is the set of multimodal (image-plus-text) models in Meta's Llama 3.2 family, released on September 25, 2024 at the Meta Connect 2024...

Large Language Models, Meta AI

Llama 4 Scout and Maverick

Llama 4 Scout and Llama 4 Maverick are open-weight, natively multimodal large language models developed by Meta and released on April 5, 2025.[1] They are...

AI Models, Large Language Models

MM-BrowseComp

MM-BrowseComp is a benchmark for evaluating multimodal web-browsing AI agents, introduced in August 2025 by researchers from ByteDance, Nanjing University,...

AI Agents, AI Benchmarks

MMMU

MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark) is a multimodal AI benchmark of 11,550 college-level questions that pairs text...

AI Benchmarks, Machine Learning

MMMU-Pro

MMMU-Pro is a rigorous benchmark for evaluating multimodal AI systems on college-level, expert questions that genuinely require seeing an image, built as a...

AI Benchmarks, Computer Vision

MMStar

MMStar (Multi-modal Star) is a vision-language model evaluation benchmark consisting of 1,500 multimodal samples that were filtered from six pre-existing...

AI Benchmarks

MathVista

MathVista is a benchmark for evaluating the mathematical reasoning capabilities of foundation models in visual contexts.[1] It was introduced by Pan Lu, Hritik...

AI Benchmarks

MetaCLIP

MetaCLIP (Metadata-Curated Language-Image Pre-training) is a data curation recipe and a family of vision-language models from Meta AI, introduced in the 2023...

Data & Datasets, Meta AI

MiniCPM-V

MiniCPM-V is a family of open-weights multimodal large language models developed by the OpenBMB lab at Tsinghua University's Natural Language Processing group...

Chinese AI, Open Source AI

Molmo

Molmo is a family of open-weight, open-data vision-language models (VLMs) released by the Allen Institute for AI (Ai2) on 25 September 2024.[^1][^2] The family...

AI Models, Open Source AI

Muse Spark

Muse Spark is a proprietary multimodal reasoning model developed by Meta Superintelligence Labs (MSL), the artificial intelligence division Meta reorganized in...

Meta AI, Reasoning Models

NVIDIA Cosmos Reason

NVIDIA Cosmos Reason is an open, customizable, 7-billion-parameter reasoning vision-language model (VLM) for physical AI and robotics developed by Nvidia. It...

Embodied AI, NVIDIA

NVLM

NVLM (short for NVIDIA Vision Language Model), released as NVLM 1.0, is a family of open multimodal large language models developed by Nvidia. Introduced in...

Large Language Models, NVIDIA