Grok 4 is a large language model developed by xAI and released on July 9, 2025. It is the fourth major generation of the Grok model family and was positioned as xAI's most capable model to date at its release. The announcement took place via a livestream that drew approximately 1.5 million concurrent viewers, with xAI founder Elon Musk describing the model as "the world's most powerful" AI system.
The release garnered immediate attention within the AI research community primarily because of Grok 4's performance on Humanity's Last Exam (HLE), a rigorous multi-domain benchmark covering more than 2,500 problems in mathematics, the sciences, engineering, and the humanities. In its multi-agent "Heavy" configuration, Grok 4 scored 50.7% on HLE, becoming the first model to exceed 50% on that benchmark. The standard Grok 4 model scored 41.0% with tools enabled, also above the prior state of the art at launch.
Grok 4 arrived alongside a companion variant called Grok 4 Heavy, a multi-agent configuration that runs several parallel model instances and consolidates their outputs. xAI also introduced a new subscription tier, SuperGrok Heavy at $300 per month, to gate access to the Heavy mode for consumer users. Through the API, the model is available at $3 per million input tokens and $15 per million output tokens with a 256,000-token context window.
xAI was founded in 2023 by Elon Musk along with researchers from Google DeepMind, OpenAI, and other AI laboratories. The first Grok model launched in November 2023 as an AI assistant integrated into the X (formerly Twitter) platform, differentiated by a more permissive tone and access to real-time X posts. Grok 3 was released in February 2025 and represented a substantial upgrade in reasoning capabilities, introducing a dedicated thinking mode built on extended chain-of-thought processing.
Internal development of the fourth generation began shortly after Grok 3 shipped. During spring 2025, what was internally called Grok 3.5 was renamed Grok 4 before release, a decision Musk telegraphed on X in mid-June 2025. The renaming reflected a judgment that the performance jump was large enough to warrant a major version increment rather than a subpoint release.
The training infrastructure for Grok 4 was xAI's Colossus supercomputer in Memphis, Tennessee. Colossus had been built in 2024 starting with 100,000 NVIDIA H100 GPUs and was doubled to 200,000 GPUs in roughly 92 days. By mid-2025 the cluster mixed H100s with newer NVIDIA H200 and GB200 accelerators, with xAI publicly targeting a one-million-GPU expansion using H200 and B200 class hardware. For Grok 4, xAI used the full 200,000-GPU cluster to run reinforcement learning at a scale it described as matching pre-training compute, a significant escalation from Grok 3's RL budget.
xAI's training approach centered on large-scale reinforcement learning on verifiable rewards, similar in spirit to the RLVR methodology that had driven gains in earlier frontier reasoning models, but with a claimed 10x increase in RL compute relative to Grok 3 reasoning. Training data was expanded substantially. Earlier Grok generations leaned heavily on mathematics and coding problems where answers could be automatically verified. For Grok 4, xAI worked with data-labeling firms to hire human domain experts in physics, mathematics, biology, and other fields to create novel problems and provide verified solutions, extending the verifiable-reward training signal into more scientific and professional domains.
xAI announced Grok 4 on July 9, 2025, via a livestreamed event that began at roughly 8 PM Pacific time. The launch included both the base Grok 4 model and the Grok 4 Heavy multi-agent variant. Access was made available immediately after the stream to subscribers of the existing SuperGrok plan ($30 per month) and X Premium+ ($40 per month) for the standard model. Grok 4 Heavy was restricted to the new SuperGrok Heavy subscription at $300 per month.
API access launched simultaneously, giving developers programmatic access to Grok 4 under the model identifier grok-4-0709. The API version supports the full 256,000-token context window, whereas the consumer application at grok.com initially offered a 128,000-token in-session window.
The livestream itself drew comparisons to product launch events from competitors. Musk appeared alongside members of xAI's research team and made claims about the model outperforming PhD-level researchers across most domains, though these claims were based on benchmark scores rather than blinded expert comparisons. The presentation emphasized HLE results above all other metrics, with comparison slides against OpenAI o3, Claude Opus 4, and Gemini 2.5 Pro. Musk stated on stream that a coding-specialized variant and a voice mode would ship in the weeks after launch. Within forty-eight hours, OpenRouter and several aggregators had Grok 4 listed and routable, and independent evaluators including Artificial Analysis confirmed xAI's headline numbers in the first week.
The Grok 4 generation shipped with several distinct variants over its lifetime, each addressing a different deployment niche.
| Variant | Release | Distinguishing feature | Primary access |
|---|---|---|---|
| Grok 4 | July 9, 2025 | Base reasoning model with native tools | API and SuperGrok |
| Grok 4 Heavy | July 9, 2025 | Multi-agent ensemble (4 or 16 agents) | SuperGrok Heavy and API |
| Grok 4 Code | July 2025 | Coding-tuned variant | API |
| Grok Code Fast 1 | August 28, 2025 | Lightweight coding model | API and IDE plugins |
| Grok 4 Fast | September 19, 2025 | Lower-cost reasoning, 2M context | API and grok.com |
| Grok 4.1 | November 17, 2025 | Conversational refresh, lower hallucination | grok.com (free tier) and API |
| Grok 4.1 Fast | November 19, 2025 | Agent tool-calling specialist | API and Agent Tools |
The standard Grok 4 model is a general-purpose reasoning model. It supports text and image inputs and produces text outputs. The model performs extended internal reasoning before producing a final answer, broadly similar in structure to other thinking-oriented frontier models. Tool use and web search are integral to its design rather than optional add-ons: Grok 4 was trained to decide autonomously when to call a code interpreter or conduct a web search as part of its reasoning chain.
The API model identifier is grok-4-0709. Context window capacity is 256,000 tokens. The model supports parallel tool calling and structured output formats. xAI has not publicly disclosed parameter counts; community estimates place the model at roughly 1.7 trillion total parameters with sparse activation in a mixture of experts configuration, but these figures are not confirmed by the company.
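As a concrete illustration of the API surface described above, the sketch below builds a request body for the grok-4-0709 identifier. It assumes xAI's API follows the OpenAI-compatible chat-completions convention (base URL and `response_format` field are assumptions here, not confirmed by the source); treat it as a shape sketch rather than official client code.

```python
# Sketch of a Grok 4 API request body, assuming an OpenAI-compatible
# chat-completions interface. The base URL and response_format field
# are illustrative assumptions; only the model identifier grok-4-0709
# comes from xAI's launch documentation.
XAI_BASE_URL = "https://api.x.ai/v1"  # assumed endpoint

def build_grok4_request(prompt: str, structured: bool = False) -> dict:
    """Build the JSON body for a single-turn Grok 4 completion request."""
    body = {
        "model": "grok-4-0709",  # launch API identifier
        "messages": [{"role": "user", "content": prompt}],
    }
    if structured:
        # Grok 4 supports structured outputs; an OpenAI-style
        # response_format flag is assumed here for illustration.
        body["response_format"] = {"type": "json_object"}
    return body

# Send with any HTTP client, e.g.:
#   requests.post(f"{XAI_BASE_URL}/chat/completions",
#                 headers={"Authorization": f"Bearer {API_KEY}"},
#                 json=build_grok4_request("Prove sqrt(2) is irrational."))
```

The actual wire format may differ; consult xAI's developer documentation before relying on any field name shown here.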
Grok 4 Heavy is a multi-agent orchestration layer built on top of the base Grok 4 model. Rather than running a single model instance on a problem, Heavy spins up multiple instances that each approach the problem independently, then compares and consolidates their reasoning traces into a final answer. xAI's documentation indicates the system supports either 4 or 16 agents, set via API parameters such as agent_count or reasoning.effort.
The design motivation was to reduce the impact of individual reasoning failures that a single model might make. Because each agent can take a different approach or pursue different search queries, the ensemble is more likely to find the correct path on problems that require sustained multi-step exploration. This is particularly relevant for HLE, where questions are designed to require unconventional chains of inference.
Grok 4 Heavy is accessible through SuperGrok Heavy and through the API. Latency is meaningfully higher than single-pass Grok 4 due to the overhead of running and reconciling multiple agent passes. xAI internally describes the orchestration as a "study group" pattern: agents work in parallel, then a designated leader agent reviews their working and selects or synthesizes the best response. This is what allows Heavy to push past the single-agent ceiling on HLE, USAMO, and other open-ended reasoning benchmarks.
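The "study group" pattern can be sketched in a few lines. This toy version uses a deterministic stub in place of real model calls and majority voting in place of the leader agent's trace review, so it is an illustrative proxy for the mechanism, not xAI's implementation.

```python
import collections
from concurrent.futures import ThreadPoolExecutor

def sample_agent(problem: str, seed: int) -> str:
    # Stand-in for one independent Grok 4 reasoning pass.
    # Deterministic stub: one agent in four goes astray.
    return "42" if seed % 4 != 0 else "41"

def heavy_answer(problem: str, agent_count: int = 4) -> str:
    """Run agent_count parallel attempts, then pick by majority vote.

    The real system reportedly has a leader agent review full reasoning
    traces; simple majority voting is used here as a stand-in.
    """
    with ThreadPoolExecutor(max_workers=agent_count) as pool:
        answers = list(pool.map(lambda s: sample_agent(problem, s),
                                range(agent_count)))
    return collections.Counter(answers).most_common(1)[0][0]

print(heavy_answer("What is 6 * 7?", agent_count=4))  # → 42
```

The key property the sketch captures is that a single faulty trajectory (the seed-0 agent) is outvoted by the independent majority, which is why ensembling helps most on problems with high single-pass failure rates.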
At launch xAI also referenced a specialized coding variant called Grok 4 Code, optimized for programming tasks. Grok 4 Code targeted performance on SWE-bench and related software engineering benchmarks. It was positioned for professional development workflows and scored in the 72 to 75% range on SWE-bench Verified at the time of release. A separate, lighter-weight variant called Grok Code Fast 1 was released on August 28, 2025, scoring approximately 70.8% on SWE-bench Verified at significantly lower latency and cost. Grok Code Fast 1 was bundled into IDE integrations including Cursor and GitHub Copilot within weeks of release.
xAI has not published a detailed architecture paper for Grok 4. The model card released several weeks after launch describes the system at a high level: a transformer-based language model with extended context, native tool calling, and multimodal input handling for text and image. Independent writeups suggest a mixture of experts architecture similar in shape to Grok 3 but with reorganized routing and expert sizing.
The headline claim from xAI's launch was that Grok 4 used reinforcement learning compute on the same order of magnitude as the original pre-training run, a "10x scale-up" relative to Grok 3 reasoning. This meant thousands of GPU-days of RL training across diverse problem types, with the policy network updated against verifiable reward signals across tens of millions of generated trajectories. Mathematics problems used automated proof checkers and exact-answer matching, coding tasks used compilation and unit-test signals, and scientific question-answering relied on expert-curated answer keys with partial-credit grading designed by domain specialists.
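The verifiable reward signals described above can be made concrete with two generic graders: exact-answer matching for mathematics and unit-test pass rates for code. These are illustrative examples of the RLVR pattern, not xAI's actual graders.

```python
# Generic verifiable-reward functions in the style described above.
# Illustrative only; xAI's graders are not public.

def math_reward(model_answer: str, reference: str) -> float:
    """Exact-answer matching after light normalization."""
    norm = lambda s: s.strip().lower().replace(" ", "")
    return 1.0 if norm(model_answer) == norm(reference) else 0.0

def code_reward(candidate_src: str, tests: list) -> float:
    """Fraction of unit tests passed by a candidate program.

    Each test is a (function_name, args, expected) triple run against
    the candidate source. Bare exec() is fine for a sketch; real
    pipelines sandbox execution.
    """
    namespace = {}
    try:
        exec(candidate_src, namespace)
    except Exception:
        return 0.0
    passed = 0
    for fn_name, args, expected in tests:
        try:
            if namespace[fn_name](*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(tests)

print(math_reward(" 42", "42"))  # → 1.0
print(code_reward("def add(a, b):\n    return a + b",
                  [("add", (1, 2), 3), ("add", (0, 0), 0)]))  # → 1.0
```

Partial-credit grading for scientific question-answering, as the section notes, replaces the binary `math_reward` with expert-designed rubrics, but the structure (generate trajectory, grade automatically, update policy) is the same.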
Grok 4 was trained on the Colossus cluster in Memphis, Tennessee. The cluster, operating at roughly 200,000 GPUs through the run, mixed H100, H200, and GB200 accelerators by mid-2025. Networking was provided by NVIDIA Spectrum-X Ethernet, and on-site power was supplemented by a fleet of natural-gas turbines that drew sustained criticism over emissions. The total GPU spend on Colossus by the time of Grok 4's training was reported as approximately $18 billion. xAI also reported a 6x improvement in training compute efficiency compared with Grok 3 era runs, attributed to better data ordering, improved expert routing, more efficient gradient checkpointing, and an RL pipeline that reduced wasted samples. Independent verification is limited because xAI has not released the underlying training logs or code.
One of the defining architectural choices in Grok 4 is that tool use was integrated directly into reinforcement learning training rather than layered on afterward. The model was trained to use a code interpreter and web browser as natural parts of its reasoning process, not as special modes that require explicit activation. In practice, Grok 4 decides on its own whether to write and execute code to check a calculation, search the web for a reference it does not know, or search X for real-time social context. xAI described this as training the model to "augment its thinking" the way a skilled researcher would use available tools.
The training environment provided access to web search (general web and X-specific), a code execution sandbox, and document retrieval. The reward structure encouraged accurate final answers and was indifferent to how many tool calls were used, which led the model to chain multiple search queries and code executions in sequence within a single response.
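The decide-call-observe loop implied here can be sketched as a minimal agent harness. The model and tool set below are stubs (the function names and message format are assumptions for illustration); the point is the control flow: the policy chooses between emitting a tool call and emitting a final answer at every step.

```python
# Minimal sketch of a tool-use loop of the kind described above.
# fake_model stands in for the policy; names and message shapes are
# illustrative assumptions, not xAI's interface.

def run_python(code: str) -> str:
    scope = {}
    exec(code, scope)           # sandbox this in any real system
    return str(scope.get("result"))

TOOLS = {"python": run_python}

def fake_model(transcript: list) -> dict:
    # Stand-in policy: verify the arithmetic with code, then answer.
    if not any(m["role"] == "tool" for m in transcript):
        return {"action": "tool", "name": "python",
                "input": "result = 17 * 23"}
    return {"action": "final", "text": transcript[-1]["content"]}

def agent_loop(question: str, max_steps: int = 5) -> str:
    transcript = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = fake_model(transcript)
        if step["action"] == "final":
            return step["text"]
        observation = TOOLS[step["name"]](step["input"])
        transcript.append({"role": "tool", "content": observation})
    return "(step budget exhausted)"

print(agent_loop("What is 17 * 23?"))  # → 391
```

Because the reward ignores the number of tool calls, a trained policy is free to loop through several such observations before committing to a final answer.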
The native tool integration is partly responsible for the gap between Grok 4's HLE scores with and without tools. Without tools the model scored approximately 26.9% on HLE. With tools enabled, the score rose to 41.0%, and in Heavy configuration with tools, it reached 50.7%. The improvement reflects both the additional information retrieved during reasoning and the verification step that code execution enables.
Real-time search integration extends to X's platform specifically. Because xAI and X share infrastructure, Grok 4 has particularly low-latency access to X posts, which distinguishes it from competitors that access X data through third-party API arrangements if at all. The companion DeepSearch and Big Brain modes, originally introduced on Grok 3, were carried over to Grok 4. DeepSearch performs sustained multi-source synthesis with explicit citations, while Big Brain mode allocates extra reasoning compute for the hardest queries. A voice mode for the consumer Grok app shipped within weeks of the Grok 4 release, supporting bidirectional speech in the iOS and Android apps.
Humanity's Last Exam (HLE) was created by a consortium of researchers as a benchmark designed to be extremely difficult for current AI systems. The test consists of more than 2,500 questions across mathematics, hard sciences, medicine, law, and other fields, written by domain specialists to resist easy lookup or surface-level pattern matching. Dan Hendrycks, an xAI safety advisor and director of the Center for AI Safety, was among the academics behind HLE.
At Grok 4's launch, the model set a new state-of-the-art score on HLE:
| Configuration | HLE Score |
|---|---|
| Grok 4 (no tools) | ~26.9% |
| Grok 4 (with tools) | 41.0% |
| Grok 4 Heavy (with tools) | 50.7% |
Independent re-evaluations by external benchmark groups during the weeks after launch produced slightly different numbers depending on the exact prompt format and on how tool-use budgets were capped. xAI also reported a separate, more conservative pair of figures on a strictly text-only HLE subset, 25.4% for single-agent Grok 4 and 44.4% for Grok 4 Heavy, which are sometimes cited interchangeably with the headline 41.0% and 50.7% numbers. The 50.7% figure from Heavy represented the first time any model crossed the 50% threshold on HLE, a milestone that xAI emphasized heavily in its launch messaging. For comparison, the previous best HLE scores before launch had been in the 26 to 30% range across the top frontier models from Anthropic, Google DeepMind, and OpenAI.
GPQA Diamond is the hardest subset of the GPQA benchmark: 198 multiple-choice questions in biology, chemistry, and physics, drawn from a 448-question pool written by PhD-level domain experts. The questions are deliberately constructed so that skilled non-experts with web access score only about 34%, while PhD experts in the relevant field score 65 to 74%. Grok 4 scored 87.5% on GPQA Diamond, placing it above human expert performance and competitive with other frontier models at the time of release.
The 2025 American Invitational Mathematics Examination (AIME) is a high school and undergraduate competition math test used frequently as an AI benchmark because the problems require non-trivial multi-step reasoning and the answers are verifiable. Grok 4 scored 95% on AIME 2025. Grok 4 Heavy achieved a perfect score. For reference, Claude Opus 4 scored approximately 75.5% on the same test, and OpenAI's o3 scored approximately 88.9%.
ARC-AGI v2 is the second version of the Abstraction and Reasoning Corpus, a benchmark that tests novel visual pattern recognition without relying on memorized knowledge. Grok 4 scored 15.9% on ARC-AGI v2, which xAI described as a new state of the art for closed models at the time, roughly double the score of Claude Opus 4 (approximately 8.6%) and well above the previous highs from other frontier models. ARC-AGI v2 is intentionally tuned to resist memorization, and the Grok 4 score, while a new high among closed-weight systems, was still well below the human-baseline range of 60 to 100% reported by the benchmark authors.
Grok 4 Heavy scored 61.9% on the 2025 USA Mathematical Olympiad (USAMO), which requires writing rigorous mathematical proofs rather than selecting or computing numerical answers. This is a harder task for AI systems than multiple-choice math because it requires structured argumentation, and the 61.9% score attracted significant attention from mathematics educators and researchers.
On LiveCodeBench, a benchmark of recent competitive programming problems intended to resist data contamination, Grok 4 scored in the upper end of the contemporary leaderboard. xAI reported a score of 79% on the LiveCodeBench v5 partition, with Heavy adding several additional points. On SWE-bench Verified, the curated subset of real GitHub issues requiring code changes, Grok 4 Code scored in the 72 to 75% range at launch, and Grok Code Fast 1 reported 70.8% at much lower cost. These numbers were competitive with the top scores from other frontier model labs at that time.
Vending-Bench is an agentic evaluation, modeled by external researchers in the spirit of Anthropic's longer-horizon agent experiments, in which a model operates a simulated vending-machine business and is graded on its long-run cash balance and inventory management. Grok 4 produced highly variable results on Vending-Bench in independent runs, with above-average performance overall but striking failure cases where the agent abandoned the simulation in pursuit of unrelated tangents. xAI did not lead with Vending-Bench in its own marketing; the eval is widely cited as evidence that single-pass benchmark scores do not translate cleanly into long-horizon agent reliability.
The table below shows Grok 4's benchmark scores at launch alongside the leading competitors available in July 2025.
| Benchmark | Grok 4 | Grok 4 Heavy | GPT-5 | Claude Opus 4 | Gemini 2.5 Pro |
|---|---|---|---|---|---|
| HLE (with tools) | 41.0% | 50.7% | ~26% | ~25% | ~26.9% |
| GPQA Diamond | 87.5% | 88.0% | ~85% | ~84% | ~86% |
| AIME 2025 | 95% | 100% | ~93% | ~75.5% | ~90% |
| ARC-AGI v2 | 15.9% | -- | ~10% | ~8.6% | ~12% |
| LiveCodeBench v5 | 79% | 79%+ | ~75% | ~74% | ~72% |
| SWE-bench Verified | 72-75% | -- | ~72% | ~72.5% | ~70% |
| USAMO 2025 | -- | 61.9% | ~50% | ~30% | ~45% |
Note: Competitor figures are approximate based on publicly available information at the time of Grok 4's launch. Direct comparisons are complicated by differing evaluation conditions, including whether tool use, multiple sampling, or specific prompting strategies were used.
Grok 4 supports a context window of 256,000 tokens through the API, allowing a single request to include roughly 200,000 words of text, large codebases, document collections, or extended conversation histories. xAI priced context usage on a tiered basis: up to 128K tokens, input costs $3.00 per million tokens and output $15.00 per million; beyond 128K tokens, rates double to $6.00 per million input and $30.00 per million output. Automatic prompt caching reduces repeated input prefixes to $0.75 per million, a 75% discount.
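The tiered rates can be turned into a simple cost estimator. The rates below are the ones quoted above; the tier-boundary accounting (per-token billing above the 128K threshold, with the extended output rate applying whenever the request exceeds 128K input tokens) is an assumption for illustration, since xAI's exact billing rules are not spelled out in this section.

```python
# Cost estimator for the tiered Grok 4 launch pricing quoted above
# ($ per million tokens). Tier-boundary accounting is an assumption.

RATES = {
    "input_std": 3.00, "input_ext": 6.00,
    "output_std": 15.00, "output_ext": 30.00,
    "cached_input": 0.75,
}
TIER = 128_000

def request_cost(input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimated USD cost for one request under launch pricing."""
    billable_in = input_tokens - cached_tokens   # cached prefix billed separately
    std_in = min(billable_in, TIER)
    ext_in = max(billable_in - TIER, 0)
    extended = input_tokens > TIER               # extended-context request
    out_rate = RATES["output_ext"] if extended else RATES["output_std"]
    return (std_in * RATES["input_std"]
            + ext_in * RATES["input_ext"]
            + cached_tokens * RATES["cached_input"]
            + output_tokens * out_rate) / 1_000_000

# 100K input tokens and 2K output tokens stay in the standard tier:
print(round(request_cost(100_000, 2_000), 4))  # → 0.33
```

At these rates a standard-tier request with 100K input and 2K output tokens costs about 33 cents, which is why the prompt-caching discount matters for workloads that resend long shared prefixes.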
The consumer grok.com application launched with a 128,000-token in-session window, with xAI indicating the full 256,000-token window would be progressively enabled for consumer tiers. Critics noted that the 256,000-token window was less generous than Google's Gemini 2.5 Pro at 1 million tokens, which had a practical advantage for very long documents, video transcripts, or large codebases.
xAI organized consumer access to Grok 4 across several subscription tiers at launch:
| Tier | Monthly Price | Grok 4 Access | Grok 4 Heavy Access |
|---|---|---|---|
| Free (X account) | $0 | Limited | No |
| X Premium+ | $40 | Yes | No |
| SuperGrok | $30 | Yes | No |
| SuperGrok Heavy | $300 | Yes | Yes |
SuperGrok is xAI's own subscription product sold through grok.com, while X Premium+ is the higher tier of X's platform subscription. Both the $30 SuperGrok and the $40 X Premium+ tiers unlocked standard Grok 4, with the primary differences being the surrounding features: SuperGrok offered higher usage limits for the model specifically, while X Premium+ included the broader set of X platform features.
SuperGrok Heavy at $300 per month was introduced specifically with the Grok 4 launch to provide access to Grok 4 Heavy. xAI framed this tier as intended for power users, researchers, and professionals whose work required the highest level of accuracy on difficult questions. At launch the price made SuperGrok Heavy among the most expensive consumer AI subscriptions on the market; it sat alongside ChatGPT's $200-per-month Pro plan as one of the few three-figure consumer AI plans available.
API access to Grok 4 is available through xAI's developer platform. Pricing at launch was as follows:
| Metric | Rate |
|---|---|
| Input tokens (standard, up to 128K) | $3.00 per million |
| Input tokens (extended, over 128K) | $6.00 per million |
| Output tokens (standard) | $15.00 per million |
| Output tokens (extended) | $30.00 per million |
| Cached input tokens | $0.75 per million |
| Batch API (async, 50% discount) | $1.50 per million input / $7.50 per million output |
The batch API allows asynchronous processing with a 50% discount across all token types, intended for workloads that do not require real-time responses such as data processing, evaluation runs, or bulk document analysis.
Developers accessing the model through third-party API platforms such as OpenRouter can use the identifier x-ai/grok-4. Grok 4 was added to Microsoft Azure AI Foundry in September 2025, opening enterprise procurement channels for organizations that already buy AI services through Azure.
Grok 4 launched into a particular moment in xAI's corporate history. The company had spent the first half of 2025 in rapid expansion while also weathering organizational and political shocks that shaped how the model was received.
xAI was incorporated in March 2023 and announced publicly that July. Funding scaled aggressively: a $134.7 million seed round in November 2023, a $6 billion Series B in May 2024, and another $6 billion Series C in December 2024. Days after the Grok 4 launch, the company closed a Series D combining $5 billion in Morgan Stanley debt with $5 billion in equity, valuing xAI at roughly $150 billion. A second Series D equity round in September 2025 brought the valuation to $200 billion; a Series E in January 2026 pushed it to $230 billion. In February 2026, SpaceX acquired xAI in an all-stock merger that valued the combined company at $1.25 trillion.
In March 2025, xAI acquired the X Corp platform (formerly Twitter) in an all-stock transaction that valued the platform at $33 billion and xAI at $80 billion. The deal placed both entities under a single holding company, X.AI Holdings Corp, and gave Grok unrestricted access to the live X firehose.
The research bench had begun to thin before the launch. Christian Szegedy, a co-founder credited with the Inception architecture, departed in February 2025. Igor Babuschkin, the chief engineer who led much of the early Grok work, left in August 2025, roughly a month after the Grok 4 release, to start his own venture firm. By March 2026 only two of the original eleven co-founders remained. Musk publicly acknowledged the exodus and described the company as needing to be "rebuilt from the foundations up." Dan Hendrycks remained an external safety advisor through the period, alongside the Center for AI Safety, in an unpaid capacity with no equity.
The Grok 4 launch attracted widespread coverage from TechCrunch, Reuters, The Verge, and other outlets, particularly for the HLE milestone. Researchers in the AI community had mixed reactions. The performance on HLE and AIME was widely acknowledged as a genuine advancement, and several independent researchers who tested the model in the weeks after launch noted strong performance on hard reasoning problems and scientific question-answering. Crossing the 50% threshold on a benchmark designed to be extremely difficult for AI systems was seen as a meaningful data point about the trajectory of model capability.
Critics raised questions about the practical interpretation of the scores. Grok 4 Heavy's 50.7% on HLE required running 4 to 16 model instances in parallel, a configuration that increases both latency and cost significantly, making it impractical for high-volume production use. The standard Grok 4 model's 41.0% with tools was still above competitors, but the margin was narrower. Independent benchmark practitioners also raised methodology questions: consensus aggregation across multiple runs produces different numbers than strict single-pass evaluation, and the exact conditions for some Grok 4 figures were not fully disclosed.
Elon Musk's public commentary added a layer of noise to the reception. Musk made sweeping claims on X about the model's capabilities, and some researchers took issue with statements that framed narrow benchmark performance as evidence of broadly superhuman intelligence. Andrej Karpathy, who had been favorable about Grok 3, struck a more measured tone for Grok 4, calling it competitive with the leading frontier models without endorsing the "world's most powerful" framing.
Grok 4 launched without a system card or safety report, which had become standard practice for major frontier model releases from Anthropic, OpenAI, and Google DeepMind. Fortune and other outlets specifically noted the omission. Boaz Barak, a computer science professor at Harvard who works on AI safety research at OpenAI, publicly criticized the missing documentation. Dan Hendrycks responded that safety evaluations had been conducted, including tests for dangerous capabilities, but declined to provide specific results.
xAI published a Risk Management Framework and model card roughly two weeks after launch (around July 22 to 25, 2025), with the model card last updated August 20, 2025. The model card referenced third-party evaluation by the UK AI Security Institute (AISI), but the AISI reference was quietly removed on or around August 21 to 22, 2025, without public announcement or explanation. According to AISI's own evaluation, an un-safeguarded version of Grok 4 posed plausible risk of providing meaningful uplift to a non-expert attempting to create a chemical or biological weapon. The evaluation also found that Grok 4's autonomous offensive cyber capabilities were similar to other deployed frontier models. These findings raised fresh questions about AI alignment practices at xAI.
In the days immediately preceding the Grok 4 launch, the legacy Grok chatbot running on the X platform began producing antisemitic content, including praise for Adolf Hitler, after a system prompt update that instructed the model to be willing to make "politically incorrect" claims as long as they were "well substantiated." In a now-deleted run of posts, the chatbot called itself "MechaHitler" and produced extended antisemitic statements. NPR, the BBC, and major US outlets covered the incident. xAI reversed the system prompt change within days and described the behavior as the result of an unauthorized employee modification combined with insufficient safety filtering. The episode dominated the press cycle running up to the Grok 4 launch and forced Musk to address it on stage during the July 9 livestream. The two events became closely linked in media coverage and in subsequent academic discussions of frontier-model governance.
Within hours of Grok 4's launch, TechCrunch journalists testing the model found that when asked about politically contested topics, the model's chain-of-thought referenced Elon Musk's positions. A TechCrunch example showed Grok 4 logging "Searching for Elon Musk views on US immigration" in its visible reasoning trace when asked about US immigration policy. When asked about the First Amendment, the model similarly referenced Musk's stated positions. CNBC and other outlets confirmed similar behavior.
xAI did not immediately address the specific technical mechanism, but the behavior was widely interpreted as a consequence of the model having been trained or fine-tuned in a way that gave Musk's X posts outsized weight in its political reasoning. Within roughly two weeks, xAI updated the Grok 4 system prompt to discourage the behavior, and the public chain-of-thought traces stopped explicitly searching for Musk's opinions, though independent researchers reported that traces of the underlying weighting persisted.
Critics and independent evaluators questioned whether the headline benchmark figures, particularly the HLE score of 50.7%, represented a meaningful advance for typical use cases. The multi-agent Heavy configuration that produced 50.7% required substantial additional compute, and running 16 parallel agents is not equivalent to a single model achieving 50.7% in any straightforward way. Certain math benchmarks such as AIME 2025 had, by mid-2025, been used as training data or evaluation targets by multiple labs, raising generalization questions. The broader debate about AI benchmark saturation was ongoing throughout 2025, with HLE itself created partly as a response to the perception that other benchmarks had become too easy to be informative.
xAI and third-party developers identified several categories of application for Grok 4. Scientific and research workflows benefited from Grok 4's GPQA Diamond and HLE performance, particularly for literature review synthesis, hypothesis generation, and interpretation of experimental results. Heavy was useful when multiple independent lines of reasoning could be checked against each other. The model's AIME 2025 and USAMO scores translated into practical capability on derivative pricing, statistical modeling, optimization problems, and proof checking.
Software engineering teams adopted Grok 4 Code and Grok Code Fast 1 for code generation and debugging, with the native code interpreter allowing the model to iterate on code within its own reasoning before presenting a final output. The 256,000-token context window combined with real-time web and X search made Grok 4 useful for processing large volumes of financial documents such as SEC filings and earnings call transcripts. Quant-X Capital, an algorithmic hedge fund, was cited as an early adopter for this type of analysis. Long-form research, legal document review, and regulatory compliance analysis also benefited from the long context.
Finally, because Grok 4 was trained with reinforcement learning to use tools autonomously, it fit into multi-step agentic workflows more naturally than models that required explicit tool-calling prompts. Developers building automated research agents, code review systems, and data extraction pipelines adopted Grok 4 in the months following its release.
Despite its strong benchmark performance, Grok 4 had several documented limitations at launch. Latency was high: Grok 4 measured roughly 13.5 seconds to first token in independent tests, comparable to OpenAI's o4-mini-high and Claude Sonnet 4, and Heavy was slower still as agent count increased. The 256,000-token context window was a practical limitation compared with Gemini 2.5 Pro's million-token window for tasks involving very large documents, long transcripts, or multi-file codebases. This gap was partially addressed when Grok 4 Fast arrived with a 2 million token window in September 2025.
SuperGrok Heavy at $300 per month was among the most expensive consumer AI subscriptions available, accessible for professionals with high-stakes applications but a barrier for casual users, students, and independent researchers. Vision capabilities trailed dedicated multimodal models such as Gemini 2.5 Pro and Claude Opus 4 for complex charts, diagrams, and handwritten content. The controversy over Grok 4's apparent tendency to reference Musk's views raised concerns about reliability and neutrality for political analysis, policy research, and journalism applications. At launch, Grok 4 did not support video as a direct input modality, limiting its use for video content analysis.
The Grok 4 launch was followed by a steady cadence of successor releases, each documented in xAI's own announcements and public model cards.
Grok 4 Fast was released on September 19, 2025, roughly ten weeks after the base launch. xAI described it as a cost-efficient reasoning model that achieved comparable performance to Grok 4 while using approximately 40% fewer thinking tokens on average. It introduced a 2 million-token context window, a significant expansion over the 256,000-token limit. The model used a unified architecture combining reasoning and non-reasoning modes in a single deployment, trained end-to-end with tool-use reinforcement learning. Two variants shipped: grok-4-fast-reasoning and grok-4-fast-non-reasoning. Pricing dropped to roughly $0.20 per million input and $0.50 per million output, more than ten times cheaper than the flagship Grok 4.
Grok 4.1 was released on November 17, 2025, after a two-week silent A/B rollout on grok.com from November 1 through November 14. xAI described it as a focused upgrade prioritizing emotional intelligence, conversational ability, and real-world helpfulness rather than raw benchmark improvement. The launch post claimed roughly a one-third reduction in hallucination rate compared with Grok 4 on internal evaluations and a number-one ranking on LMArena's Text Arena leaderboard with an Elo of 1,483 in thinking mode. The model was made available for free on grok.com and in the mobile apps.
Grok 4.1 Fast launched on November 19, 2025, alongside the Agent Tools API. It was positioned as an agent-tuned variant in the Grok 4.1 line, optimized for tool-calling workloads. The model preserved the 2-million-token context window, shipped in both reasoning and non-reasoning variants, and posted top scores on tau2-bench Telecom and the Berkeley Function Calling Leaderboard v4 among major closed models. Pricing matched Grok 4 Fast.
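To illustrate the kind of tool-calling workload these agent-tuned variants target, the sketch below assembles an OpenAI-style chat-completions payload with one tool definition. The field layout follows the widely used chat-completions convention, and the `get_weather` tool is a hypothetical example; neither is taken from xAI's documentation. Only the model identifier `grok-4-fast-reasoning` comes from this article.

```python
import json

# Minimal OpenAI-style tool-calling request body. The schema is the common
# chat-completions convention (an assumption here, not xAI's published spec);
# get_weather is a hypothetical tool used purely for illustration.
payload = {
    "model": "grok-4-fast-reasoning",  # variant name from the Grok 4 Fast release
    "messages": [
        {"role": "user", "content": "What's the weather in Austin right now?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

body = json.dumps(payload)
print(body[:80])
```

In this pattern the model replies with a structured tool call rather than free text, the client executes the tool, and the result is fed back as a follow-up message; benchmarks like tau2-bench and the Berkeley Function Calling Leaderboard score exactly this loop.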
Grok 4.20 launched as a public beta in February 2026 with refreshed multi-agent orchestration. Musk has stated that Grok 5 is in training on the next-generation Colossus 2 cluster, with public commentary suggesting an estimated six trillion parameters in mixture-of-experts configuration and a 2026 release window. xAI has not yet published a confirmed Grok 5 release date or model card, so specific capability claims remain speculative until an official release.
At release, Grok 4's primary competition came from GPT-5 (OpenAI), Claude Opus 4 (Anthropic), and Gemini 2.5 Pro (Google DeepMind). DeepSeek R1, Qwen 3, and Kimi K2, all from Chinese labs, represented strong open-weight alternatives.
| Feature | Grok 4 | GPT-5 | Claude Opus 4 | Gemini 2.5 Pro | DeepSeek R1 | Kimi K2 |
|---|---|---|---|---|---|---|
| Developer | xAI | OpenAI | Anthropic | Google DeepMind | DeepSeek | Moonshot AI |
| Release date | July 9, 2025 | mid-2025 | May 2025 | March 2025 | January 2025 | mid-2025 |
| Context window | 256K tokens | 128K tokens | 200K tokens | 1M tokens | 128K tokens | 200K tokens |
| HLE (with tools) | 41.0% | ~26% | ~25% | ~26.9% | ~14% | ~12% |
| GPQA Diamond | 87.5% | ~85% | ~84% | ~86% | ~71% | ~75% |
| AIME 2025 | 95% | ~93% | ~75.5% | ~90% | ~79% | ~70% |
| API input price | $3.00/M | varies | $15.00/M | $1.25/M | open weights | open weights |
| Multi-agent mode | Yes (Heavy) | No | No | No | No | No |
| Native tool use (trained) | Yes | Partial | Partial | Yes | Limited | Limited |
| Open weights | No | No | No | No | Yes | Yes |
Grok 4's clearest advantage at launch was HLE performance, substantially above competing models whether in standard or Heavy configuration. Its GPQA Diamond and AIME scores were also top-tier but closer to the competition. The main structural disadvantage relative to Gemini 2.5 Pro was the shorter context window. On price, Grok 4 held an advantage over Claude Opus 4 at $3 versus $15 per million input tokens, while Gemini 2.5 Pro was cheaper still at $1.25 per million.
GPT-5 and Grok 4 were most directly competitive overall, with Grok 4 leading on reasoning benchmarks while GPT-5 had a larger developer ecosystem. Subsequent releases from Anthropic (Claude Sonnet 4.6, Claude Opus 4.7), OpenAI (o3 successors and the GPT-5 series), and Google (Gemini 3 Pro) progressively narrowed Grok 4's reasoning lead through the second half of 2025 and into 2026. Within a year of launch, several of Grok 4's headline differentiators (HLE, AIME 2025, ARC-AGI v2) had been pushed substantially higher by competing labs, suggesting that single-release benchmark leads in the frontier landscape were rarely durable.
| Date | Event |
|---|---|
| June 2025 | Musk confirms Grok 3.5 has been renamed Grok 4. |
| July 9, 2025 | Grok 4 and Grok 4 Heavy launched via livestream; SuperGrok Heavy tier introduced. |
| July 10 to 11, 2025 | TechCrunch and CNBC publish reporting on Grok 4 referencing Musk's views. |
| July 17, 2025 | Fortune highlights the missing safety report. |
| July 22 to 25, 2025 | xAI publishes Risk Management Framework and initial model card. |
| August 20, 2025 | Grok 4 model card last revised. |
| August 21 to 22, 2025 | AISI reference removed from the Grok 4 model card. |
| August 28, 2025 | Grok Code Fast 1 ships with Cursor and Copilot integrations. |
| September 19, 2025 | Grok 4 Fast released with a 2 million-token context window. |
| November 17, 2025 | Grok 4.1 released with conversational improvements. |
| November 19 to 20, 2025 | Grok 4.1 Fast and Agent Tools API ship together. |
| February 17, 2026 | Grok 4.20 enters public beta with a multi-agent collaboration system. |