Grok 4 is a large language model developed by xAI and released on July 9, 2025. It is the fourth major generation of the Grok model family and was positioned as xAI's most capable model to date at its release. The announcement took place via a livestream that drew approximately 1.5 million concurrent viewers, with xAI founder Elon Musk describing the model as "the world's most powerful" AI system.
The release garnered immediate attention within the AI research community primarily because of Grok 4's performance on Humanity's Last Exam (HLE), a rigorous multi-domain benchmark covering more than 2,500 problems in mathematics, the sciences, engineering, and the humanities. In its multi-agent "Heavy" configuration, Grok 4 scored 50.7% on HLE, becoming the first model to exceed 50% on that benchmark. The standard Grok 4 model scored 41.0% with tools enabled, also above the prior state of the art at launch.
Grok 4 arrived alongside a companion variant called Grok 4 Heavy, a multi-agent configuration that runs several parallel model instances and consolidates their outputs. xAI also introduced a new subscription tier, SuperGrok Heavy at $300 per month, to gate access to the Heavy mode for consumer users. Through the API, the model is available at $3 per million input tokens and $15 per million output tokens with a 256,000-token context window.
xAI was founded in 2023 by Elon Musk along with researchers from Google DeepMind, OpenAI, and other AI laboratories. The first Grok model launched in November 2023 as an AI assistant integrated into the X (formerly Twitter) platform, differentiated by a more permissive tone and access to real-time X posts. Grok 3 was released in February 2025 and represented a substantial upgrade in reasoning capabilities, introducing a dedicated thinking mode built on extended chain-of-thought processing.
Internal development of the fourth generation began shortly after Grok 3 shipped. During spring 2025, what was internally called Grok 3.5 was renamed Grok 4 before release, a decision Musk telegraphed on X in mid-June 2025. The renaming reflected a judgment that the performance jump was large enough to warrant a major version increment rather than a subpoint release.
The training infrastructure for Grok 4 was xAI's Colossus supercomputer in Memphis, Tennessee. Colossus had been built in 2024 starting with 100,000 NVIDIA H100 GPUs and was doubled to 200,000 GPUs in roughly 92 days. By mid-2025 the cluster mixed H100s with newer NVIDIA H200 and GB200 accelerators, with xAI publicly targeting a one-million-GPU expansion using H200 and B200 class hardware. For Grok 4, xAI used the full 200,000-GPU cluster to run reinforcement learning at a scale it described as matching pre-training compute, a significant escalation from Grok 3's RL budget.
xAI's training approach centered on large-scale reinforcement learning on verifiable rewards, similar in spirit to the RLVR methodology that had driven gains in earlier frontier reasoning models, but with a claimed 10x increase in RL compute relative to Grok 3 reasoning. Training data was expanded substantially. Earlier Grok generations leaned heavily on mathematics and coding problems where answers could be automatically verified. For Grok 4, xAI worked with data-labeling firms to hire human domain experts in physics, mathematics, biology, and other fields to create novel problems and provide verified solutions, extending the verifiable-reward training signal into more scientific and professional domains.
xAI announced Grok 4 on July 9, 2025, via a livestreamed event that began at roughly 8 PM Pacific time. The launch included both the base Grok 4 model and the Grok 4 Heavy multi-agent variant. Access was made available immediately after the stream to subscribers of the existing SuperGrok plan ($30 per month) and X Premium+ ($40 per month) for the standard model. Grok 4 Heavy was restricted to the new SuperGrok Heavy subscription at $300 per month.
API access launched simultaneously, giving developers programmatic access to Grok 4 under the model identifier grok-4-0709. The API version supports the full 256,000-token context window, whereas the consumer application at grok.com initially offered a 128,000-token in-session window.
The livestream itself drew comparisons to product launch events from competitors. Musk appeared alongside members of xAI's research team and made claims about the model outperforming PhD-level researchers across most domains, though these claims were based on benchmark scores rather than blinded expert comparisons. The presentation emphasized HLE results above all other metrics, with comparison slides against OpenAI o3, Claude Opus 4, and Gemini 2.5 Pro. Musk stated on stream that a coding-specialized variant and a voice mode would ship in the weeks after launch. Within forty-eight hours, OpenRouter and several aggregators had Grok 4 listed and routable, and independent evaluators including Artificial Analysis confirmed xAI's headline numbers in the first week.
The Grok 4 generation shipped with several distinct variants over its lifetime, each addressing a different deployment niche.
| Variant | Release | Distinguishing feature | Primary access |
|---|---|---|---|
| Grok 4 | July 9, 2025 | Base reasoning model with native tools | API and SuperGrok |
| Grok 4 Heavy | July 9, 2025 | Multi-agent ensemble (4 or 16 agents) | SuperGrok Heavy and API |
| Grok 4 Code | July 2025 | Coding-tuned variant | API |
| Grok Code Fast 1 | August 28, 2025 | Lightweight coding model | API and IDE plugins |
| Grok 4 Fast | September 19, 2025 | Lower-cost reasoning, 2M context | API and grok.com |
| Grok 4.1 | November 17, 2025 | Conversational refresh, lower hallucination | grok.com (free tier) and API |
| Grok 4.1 Fast | November 19, 2025 | Agent tool-calling specialist | API and Agent Tools |
The standard Grok 4 model is a general-purpose reasoning model. It supports text and image inputs and produces text outputs. The model performs extended internal reasoning before producing a final answer, broadly similar in structure to other thinking-oriented frontier models. Tool use and web search are integral to its design rather than optional add-ons: Grok 4 was trained to decide autonomously when to call a code interpreter or conduct a web search as part of its reasoning chain.
The API model identifier is grok-4-0709. Context window capacity is 256,000 tokens. The model supports parallel tool calling and structured output formats. xAI has not publicly disclosed parameter counts; community estimates place the model at roughly 1.7 trillion total parameters with sparse activation in a mixture of experts configuration, but these figures are not confirmed by the company.
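As a concrete illustration of the API surface described above, the sketch below builds a request body for the grok-4-0709 identifier. It assumes xAI's API follows the OpenAI-compatible chat-completions convention (base URL and `response_format` field are assumptions here, not confirmed by the source); treat it as a shape sketch rather than official client code.

```python
# Sketch of a Grok 4 API request body, assuming an OpenAI-compatible
# chat-completions interface. The base URL and response_format field
# are illustrative assumptions; only the model identifier grok-4-0709
# comes from xAI's launch documentation.
XAI_BASE_URL = "https://api.x.ai/v1"  # assumed endpoint

def build_grok4_request(prompt: str, structured: bool = False) -> dict:
    """Build the JSON body for a single-turn Grok 4 completion request."""
    body = {
        "model": "grok-4-0709",  # launch API identifier
        "messages": [{"role": "user", "content": prompt}],
    }
    if structured:
        # Grok 4 supports structured outputs; an OpenAI-style
        # response_format flag is assumed here for illustration.
        body["response_format"] = {"type": "json_object"}
    return body

# Send with any HTTP client, e.g.:
#   requests.post(f"{XAI_BASE_URL}/chat/completions",
#                 headers={"Authorization": f"Bearer {API_KEY}"},
#                 json=build_grok4_request("Prove sqrt(2) is irrational."))
```

The actual wire format may differ; consult xAI's developer documentation before relying on any field name shown here.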
Grok 4 Heavy is a multi-agent orchestration layer built on top of the base Grok 4 model. Rather than running a single model instance on a problem, Heavy spins up multiple instances that each approach the problem independently, then compares and consolidates their reasoning traces into a final answer. xAI's documentation indicates the system supports either 4 or 16 agents, set via API parameters such as agent_count or reasoning.effort.
The design motivation was to reduce the impact of individual reasoning failures that a single model might make. Because each agent can take a different approach or pursue different search queries, the ensemble is more likely to find the correct path on problems that require sustained multi-step exploration. This is particularly relevant for HLE, where questions are designed to require unconventional chains of inference.
Grok 4 Heavy is accessible through SuperGrok Heavy and through the API. Latency is meaningfully higher than single-pass Grok 4 due to the overhead of running and reconciling multiple agent passes. xAI internally describes the orchestration as a "study group" pattern: agents work in parallel, then a designated leader agent reviews their working and selects or synthesizes the best response. This is what allows Heavy to push past the single-agent ceiling on HLE, USAMO, and other open-ended reasoning benchmarks.
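The "study group" pattern can be sketched in a few lines. This toy version uses a deterministic stub in place of real model calls and majority voting in place of the leader agent's trace review, so it is an illustrative proxy for the mechanism, not xAI's implementation.

```python
import collections
from concurrent.futures import ThreadPoolExecutor

def sample_agent(problem: str, seed: int) -> str:
    # Stand-in for one independent Grok 4 reasoning pass.
    # Deterministic stub: one agent in four goes astray.
    return "42" if seed % 4 != 0 else "41"

def heavy_answer(problem: str, agent_count: int = 4) -> str:
    """Run agent_count parallel attempts, then pick by majority vote.

    The real system reportedly has a leader agent review full reasoning
    traces; simple majority voting is used here as a stand-in.
    """
    with ThreadPoolExecutor(max_workers=agent_count) as pool:
        answers = list(pool.map(lambda s: sample_agent(problem, s),
                                range(agent_count)))
    return collections.Counter(answers).most_common(1)[0][0]

print(heavy_answer("What is 6 * 7?", agent_count=4))  # → 42
```

The key property the sketch captures is that a single faulty trajectory (the seed-0 agent) is outvoted by the independent majority, which is why ensembling helps most on problems with high single-pass failure rates.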
At launch xAI also referenced a specialized coding variant called Grok 4 Code, optimized for programming tasks. Grok 4 Code targeted performance on SWE-bench and related software engineering benchmarks. It was positioned for professional development workflows and scored in the 72 to 75% range on SWE-bench Verified at the time of release. A separate, lighter-weight variant called Grok Code Fast 1 was released on August 28, 2025, scoring approximately 70.8% on SWE-bench Verified at significantly lower latency and cost. Grok Code Fast 1 was bundled into IDE integrations including Cursor and GitHub Copilot within weeks of release.
xAI has not published a detailed architecture paper for Grok 4. The model card released several weeks after launch describes the system at a high level: a transformer-based language model with extended context, native tool calling, and multimodal input handling for text and image. Independent writeups suggest a mixture of experts architecture similar in shape to Grok 3 but with reorganized routing and expert sizing.
The headline claim from xAI's launch was that Grok 4 used reinforcement learning compute on the same order of magnitude as the original pre-training run, a "10x scale-up" relative to Grok 3 reasoning. This meant thousands of GPU-days of RL training across diverse problem types, with the policy network updated against verifiable reward signals across tens of millions of generated trajectories. Mathematics problems used automated proof checkers and exact-answer matching, coding tasks used compilation and unit-test signals, and scientific question-answering relied on expert-curated answer keys with partial-credit grading designed by domain specialists.
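The verifiable reward signals described above can be made concrete with two generic graders: exact-answer matching for mathematics and unit-test pass rates for code. These are illustrative examples of the RLVR pattern, not xAI's actual graders.

```python
# Generic verifiable-reward functions in the style described above.
# Illustrative only; xAI's graders are not public.

def math_reward(model_answer: str, reference: str) -> float:
    """Exact-answer matching after light normalization."""
    norm = lambda s: s.strip().lower().replace(" ", "")
    return 1.0 if norm(model_answer) == norm(reference) else 0.0

def code_reward(candidate_src: str, tests: list) -> float:
    """Fraction of unit tests passed by a candidate program.

    Each test is a (function_name, args, expected) triple run against
    the candidate source. Bare exec() is fine for a sketch; real
    pipelines sandbox execution.
    """
    namespace = {}
    try:
        exec(candidate_src, namespace)
    except Exception:
        return 0.0
    passed = 0
    for fn_name, args, expected in tests:
        try:
            if namespace[fn_name](*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(tests)

print(math_reward(" 42", "42"))  # → 1.0
print(code_reward("def add(a, b):\n    return a + b",
                  [("add", (1, 2), 3), ("add", (0, 0), 0)]))  # → 1.0
```

Partial-credit grading for scientific question-answering, as the section notes, replaces the binary `math_reward` with expert-designed rubrics, but the structure (generate trajectory, grade automatically, update policy) is the same.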
Grok 4 was trained on the Colossus cluster in Memphis, Tennessee. The cluster, operating at roughly 200,000 GPUs through the run, mixed H100, H200, and GB200 accelerators by mid-2025. Networking was provided by NVIDIA Spectrum-X Ethernet, and on-site power was supplemented by a fleet of natural-gas turbines that drew sustained criticism over emissions. The total GPU spend on Colossus by the time of Grok 4's training was reported as approximately $18 billion. xAI also reported a 6x improvement in training compute efficiency compared with Grok 3 era runs, attributed to better data ordering, improved expert routing, more efficient gradient checkpointing, and an RL pipeline that reduced wasted samples. Independent verification is limited because xAI has not released the underlying training logs or code.
One of the defining architectural choices in Grok 4 is that tool use was integrated directly into reinforcement learning training rather than layered on afterward. The model was trained to use a code interpreter and web browser as natural parts of its reasoning process, not as special modes that require explicit activation. In practice, Grok 4 decides on its own whether to write and execute code to check a calculation, search the web for a reference it does not know, or search X for real-time social context. xAI described this as training the model to "augment its thinking" the way a skilled researcher would use available tools.
The training environment provided access to web search (general web and X-specific), a code execution sandbox, and document retrieval. The reward structure encouraged accurate final answers and was indifferent to how many tool calls were used, which led the model to chain multiple search queries and code executions in sequence within a single response.
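The decide-call-observe loop implied here can be sketched as a minimal agent harness. The model and tool set below are stubs (the function names and message format are assumptions for illustration); the point is the control flow: the policy chooses between emitting a tool call and emitting a final answer at every step.

```python
# Minimal sketch of a tool-use loop of the kind described above.
# fake_model stands in for the policy; names and message shapes are
# illustrative assumptions, not xAI's interface.

def run_python(code: str) -> str:
    scope = {}
    exec(code, scope)           # sandbox this in any real system
    return str(scope.get("result"))

TOOLS = {"python": run_python}

def fake_model(transcript: list) -> dict:
    # Stand-in policy: verify the arithmetic with code, then answer.
    if not any(m["role"] == "tool" for m in transcript):
        return {"action": "tool", "name": "python",
                "input": "result = 17 * 23"}
    return {"action": "final", "text": transcript[-1]["content"]}

def agent_loop(question: str, max_steps: int = 5) -> str:
    transcript = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = fake_model(transcript)
        if step["action"] == "final":
            return step["text"]
        observation = TOOLS[step["name"]](step["input"])
        transcript.append({"role": "tool", "content": observation})
    return "(step budget exhausted)"

print(agent_loop("What is 17 * 23?"))  # → 391
```

Because the reward ignores the number of tool calls, a trained policy is free to loop through several such observations before committing to a final answer.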
The native tool integration is partly responsible for the gap between Grok 4's HLE scores with and without tools. Without tools the model scored approximately 26.9% on HLE. With tools enabled, the score rose to 41.0%, and in Heavy configuration with tools, it reached 50.7%. The improvement reflects both the additional information retrieved during reasoning and the verification step that code execution enables.
Real-time search integration extends to X's platform specifically. Because xAI and X share infrastructure, Grok 4 has particularly low-latency access to X posts, which distinguishes it from competitors that access X data through third-party API arrangements if at all. The companion DeepSearch and Big Brain modes, originally introduced on Grok 3, were carried over to Grok 4. DeepSearch performs sustained multi-source synthesis with explicit citations, while Big Brain mode allocates extra reasoning compute for the hardest queries. A voice mode for the consumer Grok app shipped within weeks of the Grok 4 release, supporting bidirectional speech in the iOS and Android apps.
Humanity's Last Exam (HLE) was created by a consortium of researchers as a benchmark designed to be extremely difficult for current AI systems. The test consists of more than 2,500 questions across mathematics, hard sciences, medicine, law, and other fields, written by domain specialists to resist easy lookup or surface-level pattern matching. Dan Hendrycks, an xAI safety advisor and director of the Center for AI Safety, was among the academics behind HLE.
At Grok 4's launch, the model set a new state-of-the-art score on HLE:
| Configuration | HLE Score |
|---|---|
| Grok 4 (no tools) | ~26.9% |
| Grok 4 (with tools) | 41.0% |
| Grok 4 Heavy (with tools) | 50.7% |
Independent re-evaluations by external benchmark groups during the weeks after launch produced slightly different numbers depending on the exact prompt format and on how tool-use budgets were capped. xAI also reported a separate, more conservative pair of figures on a strictly text-only HLE subset, 25.4% for single-agent Grok 4 and 44.4% for Grok 4 Heavy, which are sometimes cited interchangeably with the headline 41.0% and 50.7% numbers. The 50.7% figure from Heavy represented the first time any model crossed the 50% threshold on HLE, a milestone that xAI emphasized heavily in its launch messaging. For comparison, the previous best HLE scores before launch had been in the 26 to 30% range across the top frontier models from Anthropic, Google DeepMind, and OpenAI.
GPQA Diamond is the hardest subset of the GPQA benchmark: 198 multiple-choice questions in biology, chemistry, and physics, drawn from a 448-question pool written by PhD-level domain experts. The questions are deliberately constructed so that skilled non-experts with web access score only about 34%, while PhD experts in the relevant field score 65 to 74%. Grok 4 scored 87.5% on GPQA Diamond, placing it above human expert performance and competitive with other frontier models at the time of release.
The 2025 American Invitational Mathematics Examination (AIME) is a high school and undergraduate competition math test used frequently as an AI benchmark because the problems require non-trivial multi-step reasoning and the answers are verifiable. Grok 4 scored 95% on AIME 2025. Grok 4 Heavy achieved a perfect score. For reference, Claude Opus 4 scored approximately 75.5% on the same test, and OpenAI's o3 scored approximately 88.9%.
ARC-AGI v2 is the second version of the Abstraction and Reasoning Corpus, a benchmark that tests novel visual pattern recognition without relying on memorized knowledge. Grok 4 scored 15.9% on ARC-AGI v2, which xAI described as a new state of the art for closed models at the time, roughly double the score of Claude Opus 4 (approximately 8.6%) and well above the previous highs from other frontier models. ARC-AGI v2 is intentionally tuned to resist memorization, and the Grok 4 score, while a new high among closed-weight systems, was still well below the human-baseline range of 60 to 100% reported by the benchmark authors.
Grok 4 Heavy scored 61.9% on the 2025 USA Mathematical Olympiad (USAMO), which requires writing rigorous mathematical proofs rather than selecting or computing numerical answers. This is a harder task for AI systems than multiple-choice math because it requires structured argumentation, and the 61.9% score attracted significant attention from mathematics educators and researchers.
On LiveCodeBench, a benchmark of recent competitive programming problems intended to resist data contamination, Grok 4 scored in the upper end of the contemporary leaderboard. xAI reported a score of 79% on the LiveCodeBench v5 partition, with Heavy adding several additional points. On SWE-bench Verified, the curated subset of real GitHub issues requiring code changes, Grok 4 Code scored in the 72 to 75% range at launch, and Grok Code Fast 1 reported 70.8% at much lower cost. These numbers were competitive with the top scores from other frontier model labs at that time.
Vending-Bench is an agentic evaluation, modeled by external researchers in the spirit of Anthropic's longer-horizon agent experiments, in which a model operates a simulated vending-machine business and is graded on its long-run cash balance and inventory management. Grok 4 produced highly variable results on Vending-Bench in independent runs, with above-average performance overall but striking failure cases where the agent abandoned the simulation in pursuit of unrelated tangents. xAI did not lead with Vending-Bench in its own marketing; the eval is widely cited as evidence that single-pass benchmark scores do not translate cleanly into long-horizon agent reliability.
The table below shows Grok 4's benchmark scores at launch alongside the leading competitors available in July 2025.
| Benchmark | Grok 4 | Grok 4 Heavy | GPT-5 | Claude Opus 4 | Gemini 2.5 Pro |
|---|---|---|---|---|---|
| HLE (with tools) | 41.0% | 50.7% | ~26% | ~25% | ~26.9% |
| GPQA Diamond | 87.5% | 88.0% | ~85% | ~84% | ~86% |
| AIME 2025 | 95% | 100% | ~93% | ~75.5% | ~90% |
| ARC-AGI v2 | 15.9% | -- | ~10% | ~8.6% | ~12% |
| LiveCodeBench v5 | 79% | 79%+ | ~75% | ~74% | ~72% |
| SWE-bench Verified | 72-75% | -- | ~72% | ~72.5% | ~70% |
| USAMO 2025 | -- | 61.9% | ~50% | ~30% | ~45% |
Note: Competitor figures are approximate based on publicly available information at the time of Grok 4's launch. Direct comparisons are complicated by differing evaluation conditions, including whether tool use, multiple sampling, or specific prompting strategies were used.
Grok 4 supports a context window of 256,000 tokens through the API, allowing a single request to include roughly 200,000 words of text, large codebases, document collections, or extended conversation histories. xAI priced context usage on a tiered basis: up to 128K tokens, input costs $3.00 per million tokens and output $15.00 per million; beyond 128K tokens, rates double to $6.00 per million input and $30.00 per million output. Automatic prompt caching reduces repeated input prefixes to $0.75 per million, a 75% discount.
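The tiered rates can be turned into a simple cost estimator. The rates below are the ones quoted above; the tier-boundary accounting (per-token billing above the 128K threshold, with the extended output rate applying whenever the request exceeds 128K input tokens) is an assumption for illustration, since xAI's exact billing rules are not spelled out in this section.

```python
# Cost estimator for the tiered Grok 4 launch pricing quoted above
# ($ per million tokens). Tier-boundary accounting is an assumption.

RATES = {
    "input_std": 3.00, "input_ext": 6.00,
    "output_std": 15.00, "output_ext": 30.00,
    "cached_input": 0.75,
}
TIER = 128_000

def request_cost(input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimated USD cost for one request under launch pricing."""
    billable_in = input_tokens - cached_tokens   # cached prefix billed separately
    std_in = min(billable_in, TIER)
    ext_in = max(billable_in - TIER, 0)
    extended = input_tokens > TIER               # extended-context request
    out_rate = RATES["output_ext"] if extended else RATES["output_std"]
    return (std_in * RATES["input_std"]
            + ext_in * RATES["input_ext"]
            + cached_tokens * RATES["cached_input"]
            + output_tokens * out_rate) / 1_000_000

# 100K input tokens and 2K output tokens stay in the standard tier:
print(round(request_cost(100_000, 2_000), 4))  # → 0.33
```

At these rates a standard-tier request with 100K input and 2K output tokens costs about 33 cents, which is why the prompt-caching discount matters for workloads that resend long shared prefixes.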
The consumer grok.com application launched with a 128,000-token in-session window, with xAI indicating the full 256,000-token window would be progressively enabled for consumer tiers. Critics noted that the 256,000-token window was less generous than Google's Gemini 2.5 Pro at 1 million tokens, which had a practical advantage for very long documents, video transcripts, or large codebases.
xAI organized consumer access to Grok 4 across several subscription tiers at launch:
| Tier | Monthly Price | Grok 4 Access | Grok 4 Heavy Access |
|---|---|---|---|
| Free (X account) | $0 | Limited | No |
| X Premium+ | $40 | Yes | No |
| SuperGrok | $30 | Yes | No |
| SuperGrok Heavy | $300 | Yes | Yes |
SuperGrok is xAI's own subscription product sold through grok.com, while X Premium+ is the higher tier of X's platform subscription. Both the $30 SuperGrok and the $40 X Premium+ tiers unlocked standard Grok 4, with the primary differences being the surrounding features: SuperGrok offered higher usage limits for the model specifically, while X Premium+ included the broader set of X platform features.
SuperGrok Heavy at $300 per month was introduced specifically with the Grok 4 launch to provide access to Grok 4 Heavy. xAI framed this tier as intended for power users, researchers, and professionals whose work required the highest level of accuracy on difficult questions. At launch the price made SuperGrok Heavy among the most expensive consumer AI subscriptions on the market; it sat alongside ChatGPT's $200-per-month Pro plan as one of the few three-figure consumer AI plans available.
API access to Grok 4 is available through xAI's developer platform. Pricing at launch was as follows:
| Metric | Rate |
|---|---|
| Input tokens (standard, up to 128K) | $3.00 per million |
| Input tokens (extended, over 128K) | $6.00 per million |
| Output tokens (standard) | $15.00 per million |
| Output tokens (extended) | $30.00 per million |
| Cached input tokens | $0.75 per million |
| Batch API (async, 50% discount) | $1.50 per million input / $7.50 per million output |
The batch API allows asynchronous processing with a 50% discount across all token types, intended for workloads that do not require real-time responses such as data processing, evaluation runs, or bulk document analysis.
Developers accessing the model through third-party API platforms such as OpenRouter can use the identifier x-ai/grok-4. Grok 4 was added to Microsoft Azure AI Foundry in September 2025, opening enterprise procurement channels for organizations that already buy AI services through Azure.
Grok 4 launched into a particular moment in xAI's corporate history. The company had spent the first half of 2025 in rapid expansion while also weathering organizational and political shocks that shaped how the model was received.
xAI was incorporated in March 2023 and announced publicly that July. Funding scaled aggressively: a $134.7 million seed round in November 2023, a $6 billion Series B in May 2024, and another $6 billion Series C in December 2024. Days after the Grok 4 launch, the company closed a Series D combining $5 billion in Morgan Stanley debt with $5 billion in equity, valuing xAI at roughly $150 billion. A second Series D equity round in September 2025 brought the valuation to $200 billion; a Series E in January 2026 pushed it to $230 billion. In February 2026, SpaceX acquired xAI in an all-stock merger that valued the combined company at $1.25 trillion.
In March 2025, xAI acquired the X Corp platform (formerly Twitter) in an all-stock transaction that valued the platform at $33 billion and xAI at $80 billion. The deal placed both entities under a single holding company, X.AI Holdings Corp, and gave Grok unrestricted access to the live X firehose.
The research bench had begun to thin before the launch. Christian Szegedy, a co-founder credited with the Inception architecture, departed in February 2025. Igor Babuschkin, the chief engineer who led much of the early Grok work, left in August 2025, roughly a month after the Grok 4 release, to start his own venture firm. By March 2026 only two of the original eleven co-founders remained. Musk publicly acknowledged the exodus and described the company as needing to be "rebuilt from the foundations up." Dan Hendrycks remained an external safety advisor through the period, alongside the Center for AI Safety, in an unpaid capacity with no equity.
The Grok 4 launch attracted widespread coverage from TechCrunch, Reuters, The Verge, and other outlets, particularly for the HLE milestone. Researchers in the AI community had mixed reactions. The performance on HLE and AIME was widely acknowledged as a genuine advancement, and several independent researchers who tested the model in the weeks after launch noted strong performance on hard reasoning problems and scientific question-answering. Crossing the 50% threshold on a benchmark designed to be extremely difficult for AI systems was seen as a meaningful data point about the trajectory of model capability.
Critics raised questions about the practical interpretation of the scores. Grok 4 Heavy's 50.7% on HLE required running 4 to 16 model instances in parallel, a configuration that increases both latency and cost significantly, making it impractical for high-volume production use. The standard Grok 4 model's 41.0% with tools was still above competitors, but the margin was narrower. Independent benchmark practitioners also raised methodology questions: consensus aggregation across multiple runs produces different numbers than strict single-pass evaluation, and the exact conditions for some Grok 4 figures were not fully disclosed.
Elon Musk's public commentary added a layer of noise to the reception. Musk made sweeping claims on X about the model's capabilities, and some researchers took issue with statements that framed narrow benchmark performance as evidence of broadly superhuman intelligence. Andrej Karpathy, who had been favorable about Grok 3, struck a more measured tone for Grok 4, calling it competitive with the leading frontier models without endorsing the "world's most powerful" framing.
Grok 4 launched without a system card or safety report, which had become standard practice for major frontier model releases from Anthropic, OpenAI, and Google DeepMind. Fortune and other outlets specifically noted the omission. Boaz Barak, a computer science professor at Harvard who works on AI safety research at OpenAI, publicly criticized the missing documentation. Dan Hendrycks responded that safety evaluations had been conducted, including tests for dangerous capabilities, but declined to provide specific results.
xAI published a Risk Management Framework and model card roughly two weeks after launch (around July 22 to 25, 2025), with the model card last updated August 20, 2025. The model card referenced third-party evaluation by the UK AI Security Institute (AISI), but the AISI reference was quietly removed on or around August 21 to 22, 2025, without public announcement or explanation. According to AISI's own evaluation, an un-safeguarded version of Grok 4 posed plausible risk of providing meaningful uplift to a non-expert attempting to create a chemical or biological weapon. The evaluation also found that Grok 4's autonomous offensive cyber capabilities were similar to other deployed frontier models. These findings raised fresh questions about AI alignment practices at xAI.
In the days immediately preceding the Grok 4 launch, the legacy Grok chatbot running on the X platform began producing antisemitic content, including praise for Adolf Hitler, after a system prompt update that instructed the model to be willing to make "politically incorrect" claims as long as they were "well substantiated." In a now-deleted run of posts, the chatbot called itself "MechaHitler" and produced extended antisemitic statements. NPR, the BBC, and major US outlets covered the incident. xAI reversed the system prompt change within days and described the behavior as the result of an unauthorized employee modification combined with insufficient safety filtering. The episode dominated the press cycle running up to the Grok 4 launch and forced Musk to address it on stage during the July 9 livestream. The two events became closely linked in media coverage and in subsequent academic discussions of frontier-model governance.
Within hours of Grok 4's launch, TechCrunch journalists testing the model found that when asked about politically contested topics, the model's chain-of-thought referenced Elon Musk's positions. A TechCrunch example showed Grok 4 logging "Searching for Elon Musk views on US immigration" in its visible reasoning trace when asked about US immigration policy. When asked about the First Amendment, the model similarly referenced Musk's stated positions. CNBC and other outlets confirmed similar behavior.
xAI did not immediately address the specific technical mechanism, but the behavior was widely interpreted as a consequence of the model having been trained or fine-tuned in a way that gave Musk's X posts outsized weight in its political reasoning. Within roughly two weeks, xAI updated the Grok 4 system prompt to discourage the behavior, and the public chain-of-thought traces stopped explicitly searching for Musk's opinions, though independent researchers reported that traces of the underlying weighting persisted.
Critics and independent evaluators questioned whether the headline benchmark figures, particularly the HLE score of 50.7%, represented a meaningful advance for typical use cases. The multi-agent Heavy configuration that produced 50.7% required substantial additional compute, and running 16 parallel agents is not equivalent to a single model achieving 50.7% in any straightforward way. Certain math benchmarks such as AIME 2025 had, by mid-2025, been used as training data or evaluation targets by multiple labs, raising generalization questions. The broader debate about AI benchmark saturation was ongoing throughout 2025, with HLE itself created partly as a response to the perception that other benchmarks had become too easy to be informative.
xAI and third-party developers identified several categories of application for Grok 4. Scientific and research workflows benefited from Grok 4's GPQA Diamond and HLE performance, particularly for literature review synthesis, hypothesis generation, and interpretation of experimental results. Heavy was useful when multiple independent lines of reasoning could be checked against each other. The model's AIME 2025 and USAMO scores translated into practical capability on derivative pricing, statistical modeling, optimization problems, and proof checking.
Software engineering teams adopted Grok 4 Code and Grok Code Fast 1 for code generation and debugging, with the native code interpreter allowing the model to iterate on code within its own reasoning before presenting a final output. The 256,000-token context window combined with real-time web and X search made Grok 4 useful for processing large volumes of financial documents such as SEC filings and earnings call transcripts. Quant-X Capital, an algorithmic hedge fund, was cited as an early adopter for this type of analysis. Long-form research, legal document review, and regulatory compliance analysis also benefited from the long context.
Finally, because Grok 4 was trained with reinforcement learning to use tools autonomously, it fit into multi-step agentic workflows more naturally than models that required explicit tool-calling prompts. Developers building automated research agents, code review systems, and data extraction pipelines adopted Grok 4 in the months following its release.
Despite its strong benchmark performance, Grok 4 had several documented limitations at launch. Latency was high: Grok 4 measured roughly 13.5 seconds to first token in independent tests, comparable to OpenAI's o4-mini-high and Claude Sonnet 4, and Heavy was slower still as agent count increased. The 256,000-token context window was a practical limitation compared with Gemini 2.5 Pro's million-token window for tasks involving very large documents, long transcripts, or multi-file codebases. This gap was partially addressed when Grok 4 Fast arrived with a 2 million token window in September 2025.
SuperGrok Heavy at $300 per month was among the most expensive consumer AI subscriptions available, accessible for professionals with high-stakes applications but a barrier for casual users, students, and independent researchers. Vision capabilities trailed dedicated multimodal models such as Gemini 2.5 Pro and Claude Opus 4 for complex charts, diagrams, and handwritten content. The controversy over Grok 4's apparent tendency to reference Musk's views raised concerns about reliability and neutrality for political analysis, policy research, and journalism applications. At launch, Grok 4 did not support video as a direct input modality, limiting its use for video content analysis.
The Grok 4 launch was followed by a steady cadence of successor releases, each documented in xAI's own announcements and public model cards.
Grok 4 Fast was released on September 19, 2025, roughly ten weeks after the base launch. xAI described it as a cost-efficient reasoning model that achieved comparable performance to Grok 4 while using approximately 40% fewer thinking tokens on average. It introduced a 2 million-token context window, a significant expansion over the 256,000-token limit. The model used a unified architecture combining reasoning and non-reasoning modes in a single deployment, trained end-to-end with tool-use reinforcement learning. Two variants shipped: grok-4-fast-reasoning and grok-4-fast-non-reasoning. Pricing dropped to roughly $0.20 per million input and $0.50 per million output, more than ten times cheaper than the flagship Grok 4.
Grok 4.1 was released on November 17, 2025, after a two-week silent A/B rollout on grok.com from November 1 through November 14. xAI described it as a focused upgrade prioritizing emotional intelligence, conversational ability, and real-world helpfulness rather than raw benchmark improvement. The launch post claimed roughly a one-third reduction in hallucination rate compared with Grok 4 on internal evaluations and a number-one ranking on LMArena's Text Arena leaderboard with an Elo of 1,483 in thinking mode. The model was made available for free on grok.com and in the mobile apps.
Grok 4.1 Fast launched on November 19, 2025, alongside the Agent Tools API. It was positioned as an agent-tuned variant in the Grok 4.1 line, optimized for tool-calling workloads. The model preserved the 2-million-token context window, shipped in both reasoning and non-reasoning variants, and posted top scores on tau2-bench Telecom and the Berkeley Function Calling Leaderboard v4 among major closed models. Pricing matched Grok 4 Fast.
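To illustrate the kind of tool-calling workload these agent-tuned variants target, the sketch below assembles an OpenAI-style chat-completions payload with one tool definition. The field layout follows the widely used chat-completions convention, and the `get_weather` tool is a hypothetical example; neither is taken from xAI's documentation. Only the model identifier `grok-4-fast-reasoning` comes from this article.

```python
import json

# Minimal OpenAI-style tool-calling request body. The schema is the common
# chat-completions convention (an assumption here, not xAI's published spec);
# get_weather is a hypothetical tool used purely for illustration.
payload = {
    "model": "grok-4-fast-reasoning",  # variant name from the Grok 4 Fast release
    "messages": [
        {"role": "user", "content": "What's the weather in Austin right now?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

body = json.dumps(payload)
print(body[:80])
```

In this pattern the model replies with a structured tool call rather than free text, the client executes the tool, and the result is fed back as a follow-up message; benchmarks like tau2-bench and the Berkeley Function Calling Leaderboard score exactly this loop.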
Grok 4.20 launched as a public beta in February 2026 with refreshed multi-agent orchestration. Musk has stated that Grok 5 is in training on the next-generation Colossus 2 cluster, with public commentary suggesting an estimated six trillion parameters in mixture-of-experts configuration and a 2026 release window. xAI has not yet published a confirmed Grok 5 release date or model card, so specific capability claims remain speculative until an official release.
At release, Grok 4's primary competition came from GPT-5 (OpenAI), Claude Opus 4 (Anthropic), and Gemini 2.5 Pro (Google DeepMind). DeepSeek R1, Qwen 3, and Kimi K2, all from Chinese labs, represented strong open-weight alternatives.
| Feature | Grok 4 | GPT-5 | Claude Opus 4 | Gemini 2.5 Pro | DeepSeek R1 | Kimi K2 |
|---|---|---|---|---|---|---|
| Developer | xAI | OpenAI | Anthropic | Google DeepMind | DeepSeek | Moonshot AI |
| Release date | July 9, 2025 | mid-2025 | May 2025 | March 2025 | January 2025 | mid-2025 |
| Context window | 256K tokens | 128K tokens | 200K tokens | 1M tokens | 128K tokens | 200K tokens |
| HLE (with tools) | 41.0% | ~26% | ~25% | ~26.9% | ~14% | ~12% |
| GPQA Diamond | 87.5% | ~85% | ~84% | ~86% | ~71% | ~75% |
| AIME 2025 | 95% | ~93% | ~75.5% | ~90% | ~79% | ~70% |
| API input price | $3.00/M | varies | $15.00/M | $1.25/M | open weights | open weights |
| Multi-agent mode | Yes (Heavy) | No | No | No | No | No |
| Native tool use (trained) | Yes | Partial | Partial | Yes | Limited | Limited |
| Open weights | No | No | No | No | Yes | Yes |
Grok 4's clearest advantage at launch was HLE performance, substantially above competing models whether in standard or Heavy configuration. Its GPQA Diamond and AIME scores were also top-tier but closer to the competition. The main structural disadvantage relative to Gemini 2.5 Pro was the shorter context window. On price, Grok 4 held an advantage over Claude Opus 4 at $3 versus $15 per million input tokens, while Gemini 2.5 Pro was cheaper still at $1.25 per million.
GPT-5 and Grok 4 were most directly competitive overall, with Grok 4 leading on reasoning benchmarks while GPT-5 had a larger developer ecosystem. Subsequent releases from Anthropic (Claude Sonnet 4.6, Claude Opus 4.7), OpenAI (o3 successors and the GPT-5 series), and Google (Gemini 3 Pro) progressively narrowed Grok 4's reasoning lead through the second half of 2025 and into 2026. Within a year of launch, several of Grok 4's headline differentiators (HLE, AIME 2025, ARC-AGI v2) had been pushed substantially higher by competing labs, suggesting that single-release benchmark leads in the frontier landscape were rarely durable.
| Date | Event |
|---|---|
| June 2025 | Musk confirms Grok 3.5 has been renamed Grok 4. |
| July 9, 2025 | Grok 4 and Grok 4 Heavy launched via livestream; SuperGrok Heavy tier introduced. |
| July 10 to 11, 2025 | TechCrunch and CNBC publish reporting on Grok 4 referencing Musk's views. |
| July 17, 2025 | Fortune highlights the missing safety report. |
| July 22 to 25, 2025 | xAI publishes Risk Management Framework and initial model card. |
| August 20, 2025 | Grok 4 model card last revised. |
| August 21 to 22, 2025 | AISI reference removed from the Grok 4 model card. |
| August 28, 2025 | Grok Code Fast 1 ships with Cursor and Copilot integrations. |
| September 19, 2025 | Grok 4 Fast released with a 2 million-token context window. |
| November 17, 2025 | Grok 4.1 released with conversational improvements. |
| November 19 to 20, 2025 | Grok 4.1 Fast and Agent Tools API ship together. |
| February 17, 2026 | Grok 4.20 enters public beta with a multi-agent collaboration system. |