Artificial Analysis is an independent benchmarking and analytics platform that evaluates artificial intelligence models and API providers across metrics including intelligence, speed, price, and latency. Founded in 2023 by Micah Hill-Smith and George Cameron, the platform has grown into one of the most widely referenced sources for comparing large language models (LLMs), image generation models, video generation models, text-to-speech systems, speech-to-text systems, and embedding models. Artificial Analysis serves both individual developers and enterprise customers, offering free public leaderboards alongside paid enterprise subscriptions. The platform is backed by investors including Nat Friedman and Daniel Gross (through AI Grant) as well as Andrew Ng.
Micah Hill-Smith and George Cameron first met while interning at Google. Both went on to pursue separate careers: Hill-Smith worked as a Business Analyst at McKinsey & Company before leaving to build a legal AI startup, while Cameron worked as a Senior Strategy Consultant at Altman Solon, focusing on technology and data centers.
In 2023, while building applications with AI models, both founders noticed a significant gap in the market. There was no reliable, independent resource for benchmarking LLMs across quality, speed, and price. AI labs frequently reported their own benchmark results using inconsistent methodologies. A notable example that motivated the founders was Google's reporting of Gemini 1.0 Ultra performance on MMLU, where Google used 32-shot chain-of-thought prompts to make the model appear superior to GPT-4, a methodology that differed from how other labs reported their scores.
Hill-Smith and Cameron built Artificial Analysis as a side project in 2023, initially expecting it to require only occasional updates. However, the rapid pace of new model releases and strong developer interest quickly outpaced their expectations. The platform launched publicly in January 2024 and gained significant traction after being featured on the Latent Space podcast. Developers embraced the tool so enthusiastically that both founders decided to pause their other projects and commit to Artificial Analysis full-time.
The pair joined AI Grant's fourth batch, an accelerator program run by Nat Friedman (former CEO of GitHub) and Daniel Gross, and relocated to San Francisco. AI Grant provided a $250,000 investment and access to a network of prominent AI investors and entrepreneurs. Andrew Ng also backed the company. As of late 2025, the team had grown to approximately 20 employees.
Artificial Analysis positions itself as an independent third party in the AI evaluation space. The platform runs its own evaluations rather than relying on self-reported numbers from AI labs. This independence is central to its value proposition: developers and enterprises need trustworthy, standardized data to make informed decisions about which models and providers to use.
The platform addresses several practical questions that developers face when building AI applications: Which model offers the best quality for a given task? Which API provider delivers the fastest response times? How much does it cost to run inference at scale? How do different providers compare when handling concurrent requests?
The LLM Leaderboard is the flagship product of Artificial Analysis, comparing over 100 AI models on intelligence, price, and performance (output speed and latency). The leaderboard covers proprietary models from companies like OpenAI, Anthropic, and Google, as well as open-source and open-weight models.
The leaderboard includes models from all major AI labs and many smaller ones. Proprietary models tracked include the GPT series from OpenAI, Claude models from Anthropic, Gemini models from Google, and others. Open-weight models include the Llama series from Meta, Mistral models, Qwen models from Alibaba, DeepSeek models, and many more. As the AI landscape has expanded, the number of frontier labs tracked on the platform has grown from four or five at launch to more than a dozen.
Artificial Analysis measures several performance indicators for each model and provider combination:
| Metric | Description | Unit |
|---|---|---|
| Time to First Token (TTFT) | The time between sending a request and receiving the first token of the response | Seconds |
| Output Speed | The average number of tokens received per second after the first token arrives | Tokens per second (t/s) |
| Total Response Time (100 tokens) | Synthetically calculated based on TTFT and output speed for a 100-token response | Seconds |
| End-to-End Response Time | Complete time from request to final token, including input processing and reasoning | Seconds |
| Context Window | Maximum number of tokens the model can process in a single request (input plus output) | Tokens |
For reasoning models that perform internal deliberation before generating a response, the platform also measures Time to First Answer Token, which excludes the thinking phase to provide a more meaningful latency figure.
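The synthetic 100-token total response time in the table above can be sketched as a function of TTFT and output speed. This is an illustrative formulation, not the platform's published formula; in particular, whether the first token is counted in the streamed portion is an assumption.

```python
def total_response_time(ttft_s: float, output_speed_tps: float,
                        n_tokens: int = 100) -> float:
    """Synthetic total response time: time to first token, plus time to
    stream the remaining tokens at the measured output speed.

    Assumes the first token is delivered at TTFT and the remaining
    n_tokens - 1 tokens arrive at output_speed_tps (an assumption)."""
    return ttft_s + (n_tokens - 1) / output_speed_tps

# e.g. a model with 0.5 s TTFT and 80 tokens/s output speed
t = total_response_time(0.5, 80.0)
```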
Pricing data tracks both input and output token costs separately. To simplify comparisons, the platform calculates a "blended price" that assumes a 3:1 ratio of input to output tokens. This ratio reflects typical usage patterns in many applications. Prices are reported per million tokens.
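The blended-price calculation described above reduces to a weighted average. A minimal sketch (the function name and signature are illustrative):

```python
def blended_price(input_price: float, output_price: float,
                  input_ratio: float = 3.0, output_ratio: float = 1.0) -> float:
    """Blended price per million tokens, assuming a 3:1 ratio of
    input to output tokens (the platform's stated default)."""
    total = input_ratio + output_ratio
    return (input_price * input_ratio + output_price * output_ratio) / total

# e.g. $3.00/M input and $15.00/M output:
# (3.00 * 3 + 15.00 * 1) / 4 = $6.00/M blended
price = blended_price(3.0, 15.0)
```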
Different models use different tokenizers, which means the same text can produce different token counts across models. To ensure fair comparisons, Artificial Analysis uses OpenAI tokens as a standard unit of measurement through the tiktoken package (o200k_base tokenizer). The platform also tracks native token counts separately, since providers typically charge based on their own tokenization.
Performance metrics are collected through real API calls to production endpoints. The platform tests every API endpoint eight times per day for single-request benchmarks and twice per day for parallel-request benchmarks. Reported figures represent the median measurement over the prior 14 days, with P5, P25, P75, and P95 percentile values also available.
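The median and percentile summaries described above can be reproduced with Python's standard library. This is an illustrative sketch of the statistics, not the platform's actual measurement pipeline; the sample latencies are made up.

```python
import statistics

def summarize(measurements: list[float]) -> dict[str, float]:
    """Median plus P5/P25/P75/P95 over a window of measurements."""
    # n=20 yields 19 cut points: P5, P10, ..., P95
    q = statistics.quantiles(measurements, n=20)
    return {
        "median": statistics.median(measurements),
        "p5": q[0], "p25": q[4], "p75": q[14], "p95": q[18],
    }

# Hypothetical TTFT samples (seconds) from a 14-day window
latencies = [0.42, 0.45, 0.47, 0.51, 0.53, 0.55, 0.58, 0.60, 0.64, 0.71]
stats = summarize(latencies)
```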
The leaderboard supports six different workload configurations, varying across two dimensions:
| Dimension | Options |
|---|---|
| Prompt Length | ~100 tokens, ~1,000 tokens, ~10,000 tokens |
| Concurrency | 1 query, 10 parallel queries |
This allows developers to evaluate provider performance under conditions that match their specific use case, whether that involves short conversational exchanges or long document processing with high concurrency.
To prevent providers from giving preferential treatment to known benchmarking accounts, Artificial Analysis registers evaluation accounts without using their own domain. This "mystery shopper" approach ensures that the measured performance reflects what an ordinary customer would experience, rather than a specially optimized environment.
In addition to comparing models, Artificial Analysis maintains a separate Providers Leaderboard that compares over 500 API endpoints across different hosting services. This leaderboard helps developers understand how the same model performs differently depending on where it is hosted.
The platform benchmarks a wide range of API providers:
| Provider Category | Examples |
|---|---|
| First-party APIs | OpenAI, Anthropic, Google |
| Cloud platforms | Microsoft Azure, Amazon Bedrock, Google Cloud |
| Inference specialists | Groq, Fireworks, Together.ai, Cerebras, SambaNova |
| Other providers | DeepInfra, Nebius, Baseten, Databricks, Snowflake, Parasail, Cloudflare, Hyperbolic, FriendliAI, SiliconFlow, Eigen AI, Novita |
For each provider, the platform reports throughput, latency, pricing, and availability data, enabling developers to choose between first-party APIs (which may offer the latest model versions first) and third-party providers (which may offer better price-performance ratios).
The Artificial Analysis Intelligence Index (AAII) provides a composite measure of model quality, synthesizing multiple evaluation benchmarks into a single score on a 0-100 scale. The index has been updated several times since its introduction, with version 4.0.2 released in January 2026.
The Intelligence Index uses a four-category weighted framework, with each category contributing 25% to the overall score:
| Category | Weight | Evaluations Included |
|---|---|---|
| Agents | 25% | GDPval-AA (220 real-world knowledge work tasks), τ²-Bench Telecom (114 agent-user simulation tasks) |
| Coding | 25% | Terminal-Bench Hard (44 terminal-based agentic tasks), SciCode (338 scientific computing subproblems in Python) |
| General | 25% | AA-LCR (100 long-context reasoning questions at ~100k tokens each), AA-Omniscience (6,000 knowledge/hallucination questions), IFBench (294 instruction-following questions) |
| Scientific Reasoning | 25% | HLE/Humanity's Last Exam (2,158 frontier academic questions), GPQA Diamond (198 graduate-level science questions), CritPt (70 physics reasoning challenges) |
All models are tested under identical conditions. The methodology emphasizes standardization, avoidance of bias, zero-shot instruction prompting, and full transparency. Scoring predominantly uses pass@1 metrics, where models must succeed on their first attempt. Multiple runs are aggregated into single pass@1 scores, with estimated 95% confidence intervals of less than plus or minus 1%.
For reasoning models that generate internal "thinking" tokens before producing a visible answer, the platform assumes 2,000 reasoning tokens when the actual count is not available from the provider. This estimate is derived from testing across 60 prompts covering diverse topics including math, coding, and science.
AA-Omniscience is a proprietary benchmark developed by Artificial Analysis to measure factual recall and knowledge calibration. It consists of 6,000 questions derived from authoritative academic and industry sources, spanning 42 economically relevant topics across six domains:
| Domain | Example Topics |
|---|---|
| Business | Finance, management, economics |
| Health | Medical knowledge, clinical reasoning |
| Law | Legal reasoning, regulatory knowledge |
| Software Engineering | Programming languages, system design |
| Humanities and Social Sciences | History, philosophy, social studies |
| Science, Engineering, and Mathematics | Physics, chemistry, biology, mathematics |
The AA-Omniscience Index uses a scoring system bounded between -100 and +100. A correct answer earns +1, an incorrect answer costs -1, and abstaining from answering earns 0. This design rewards models that recognize their own limitations and refuse to answer rather than hallucinate. A score of 0 means the model answers correctly as often as it answers incorrectly.
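The scoring rule above can be sketched in a few lines. Normalizing the net score by the total question count to land in the stated -100 to +100 bounds is an assumption consistent with the described range, not a confirmed detail of the benchmark.

```python
def omniscience_index(correct: int, incorrect: int, abstained: int) -> float:
    """AA-Omniscience-style score: +1 per correct answer, -1 per incorrect
    answer, 0 per abstention, scaled to a -100..+100 range.

    Dividing by the total question count is an assumption made to match
    the stated bounds."""
    total = correct + incorrect + abstained
    return 100.0 * (correct - incorrect) / total

# A model with equal numbers of correct and incorrect answers scores 0,
# regardless of how often it abstains.
score = omniscience_index(correct=2400, incorrect=2400, abstained=1200)
```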
A key finding from AA-Omniscience is that all but three of the evaluated models are more likely to hallucinate than provide a correct answer when faced with difficult questions. The benchmark has been published as a research paper on arXiv and the dataset is publicly available on Hugging Face.
The Artificial Analysis Openness Index is a standardized, independently assessed measure of how "open" AI models are. The index evaluates models across two dimensions: availability and transparency.
| Dimension | Maximum Score | What It Measures |
|---|---|---|
| Availability | 6 points | API access, open weights for self-hosting, permissive licensing |
| Transparency | 12 points | Disclosure of methodology, training data, and approach documentation |
Each component is scored on a 0-3 qualitative scale based on best-fitting openness archetypes, with data elements averaged between pre-training and post-training phases. All component scores are summed (up to a maximum of 18 raw points) and normalized to a 0-100 scale. The index recognizes that "openness" in AI encompasses more than just the ability to download model weights; it also includes licensing terms, data transparency, and methodological disclosure.
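The normalization step can be expressed directly (function and parameter names here are illustrative):

```python
def openness_index(availability_pts: float, transparency_pts: float) -> float:
    """Sum the availability (max 6) and transparency (max 12) points
    and normalize the 0-18 raw total to a 0-100 scale."""
    raw = availability_pts + transparency_pts
    return 100.0 * raw / 18.0

# e.g. a model scoring 5/6 on availability and 6/12 on transparency
score = openness_index(5, 6)
```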
Artificial Analysis operates a Text-to-Image Arena where users compare pairs of images generated from the same prompt without knowing which model produced each image. Models are ranked using an Elo rating system derived from these blind comparisons. Higher Elo scores indicate that a model is preferred more often by human evaluators.
The image generation leaderboard also tracks generation speed and pricing across providers. In addition to text-to-image generation, the platform maintains an image editing leaderboard that evaluates models on their ability to modify existing images based on text instructions.
The Elo scores are informed by tens of thousands of human image preferences. The methodology applies a linear regression model similar to how LMSYS calculates Elo scores for Chatbot Arena. The Image Arena Leaderboard is also published as a Hugging Face Space for broader accessibility.
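As a simplified illustration of how pairwise votes translate into ratings, here is the classic online Elo update. Note that, as described above, the platform fits ratings with a regression over all votes (as LMSYS does) rather than updating sequentially; this sketch only conveys the underlying model, and the K-factor is arbitrary.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool,
               k: float = 4.0) -> tuple[float, float]:
    """Process one blind-comparison vote: move both ratings toward the
    observed outcome, weighted by how surprising it was."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta

# Two equally rated models: the winner gains exactly what the loser drops
r_winner, r_loser = elo_update(1000.0, 1000.0, a_won=True)
```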
The video generation section includes both text-to-video and image-to-video leaderboards. Models are ranked through the same blind comparison methodology used for image generation: users compare two videos generated from the same prompt without knowing which model created each video, and an Elo rating is computed from the accumulated votes.
The platform tracks video generation models both with and without audio output, maintaining separate rankings for each category. The leaderboard also highlights open-weight models separately, allowing developers to evaluate self-hostable alternatives to proprietary services.
The Text-to-Speech (TTS) Arena uses a similar blind comparison methodology. Users listen to pairs of speech samples generated from the same text and select which sounds more natural. The resulting Elo ratings rank models across naturalness and quality.
The TTS leaderboard supports filtering by use case (knowledge sharing, assistants, entertainment, customer service) and accent preference (US, UK). It tracks both proprietary and open-weight models, with pricing data included for cost comparisons.
For speech recognition, Artificial Analysis developed its own accuracy metric called AA-WER (Artificial Analysis Word Error Rate). The platform evaluates dozens of speech-to-text models and ranks them by transcription accuracy. The leaderboard distinguishes between proprietary and open-weight models, providing developers with options for both cloud-hosted and self-hosted deployments.
Beyond software models and API providers, Artificial Analysis provides benchmarking of AI accelerator hardware for inference workloads. The platform measures how performance scales with concurrent load across different GPU systems, including NVIDIA H100, H200, and B200 configurations, as well as AMD MI300X and Google TPU v6e (Trillium) chips. Each system is benchmarked under three configurations:
| Configuration | Description |
|---|---|
| Max Throughput | Optimized for the highest sustained request volume |
| Minimum Latency | Tuned to deliver the fastest possible response times |
| Optimal | Balances throughput and latency for general-purpose use |
The cost per million tokens is calculated by combining system output throughput with the average cloud price per GPU per hour. Hardware benchmarks are conducted periodically, at least once per quarter, with full specifications published alongside results.
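The cost calculation described above can be sketched as follows; the example prices and throughput figures are hypothetical.

```python
def cost_per_million_tokens(throughput_tps: float,
                            gpu_price_per_hour: float,
                            n_gpus: int = 1) -> float:
    """Combine system output throughput with cloud GPU pricing:
    dollars per hour divided by tokens per hour, scaled to one
    million tokens."""
    tokens_per_hour = throughput_tps * 3600
    hourly_cost = gpu_price_per_hour * n_gpus
    return hourly_cost / tokens_per_hour * 1_000_000

# e.g. a hypothetical 8-GPU system at $2.50/GPU-hr sustaining
# 10,000 tokens/s: $20/hr over 36M tokens/hr ~= $0.56 per million tokens
cost = cost_per_million_tokens(10_000.0, 2.50, n_gpus=8)
```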
In December 2025, Artificial Analysis released Stirrup, an open-source lightweight framework for building AI agents. The framework was developed as part of the team's work on evaluating agentic capabilities through benchmarks like GDPval-AA.
Stirrup differs from many existing agent frameworks by letting models drive their own workflow rather than imposing rigid step-by-step processes. The framework provides models with essential tools including code execution environments, web search, web browsing, and bash command execution in a sandboxed environment. The design philosophy draws from analysis of leading agents such as Claude Code and Codex. Stirrup is available on GitHub under the Artificial Analysis organization and includes features like context management and MCP (Model Context Protocol) support.
Artificial Analysis publishes several of its leaderboards as Hugging Face Spaces, making them accessible to the broader AI research community. The LLM Performance Leaderboard was brought to Hugging Face in May 2024 as a collaborative effort. The Text-to-Image Arena Leaderboard and Text-to-Video Arena Leaderboard are also available as Hugging Face Spaces.
This integration gives researchers and developers access to the same data through the Hugging Face ecosystem, where it can be referenced alongside model cards, datasets, and other evaluation tools.
Artificial Analysis provides a free public API that gives developers programmatic access to its benchmark data. The API covers model intelligence evaluations, speed benchmarks, pricing data, and Elo ratings across different model categories. The free tier is rate-limited to 1,000 requests per day.
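A minimal sketch of calling such an API with Python's standard library. The endpoint path and header name below are assumptions made for illustration; consult the official API documentation for the actual values, and note the 1,000 requests/day limit on the free tier.

```python
import json
import urllib.request

API_KEY = "your-api-key"
# Assumed endpoint path -- check the official API docs
url = "https://artificialanalysis.ai/api/v2/data/llms/models"
# Assumed authentication header name
req = urllib.request.Request(url, headers={"x-api-key": API_KEY})

def fetch_models(request: urllib.request.Request) -> list:
    """Fetch benchmark data and return the payload's data list
    (response shape is an assumption)."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["data"]
```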
The website and all public leaderboards are freely accessible without an account. Enterprise customers who require more detailed analysis, custom evaluations, or standardized reports subscribe to the Artificial Analysis Insights Platform.
Artificial Analysis generates revenue through two primary streams. The first is enterprise subscriptions, which provide standardized reports on model deployment decisions covering topics like serverless versus managed infrastructure versus leasing chips. The second is private benchmarking, where AI companies commission custom evaluations of their models. The founders have emphasized that no company pays to appear on the public website, maintaining the platform's independence.
The Stanford AI Index Report 2025 cited Artificial Analysis benchmarks as key reference data for understanding the AI model landscape. Academic papers on LLM pricing and performance have also referenced the platform's data. As of November 2025, the site recorded 27.56% month-over-month traffic growth, reflecting increasing adoption among developers and decision-makers.
Groq, the inference chip company, has publicly highlighted its performance on Artificial Analysis leaderboards, demonstrating the platform's influence on how AI companies market their products. Multiple cloud providers and inference startups reference their Artificial Analysis rankings in marketing materials and press releases.
Artificial Analysis occupies a distinct niche in the AI evaluation ecosystem. While platforms like LMSYS Chatbot Arena focus on human preference rankings through pairwise comparisons, and the Open LLM Leaderboard on Hugging Face focuses on academic benchmarks for open models, Artificial Analysis combines quality evaluation with real-world performance and pricing data.
| Platform | Primary Focus | Methodology |
|---|---|---|
| Artificial Analysis | Quality, speed, price across models and providers | Independent API testing, composite intelligence index, human preference arenas |
| LMSYS Chatbot Arena | Human preference rankings for chat models | Crowdsourced blind pairwise comparisons |
| Open LLM Leaderboard | Academic benchmark scores for open models | Standardized academic evaluations |
| MTEB | Text embedding model quality | Standardized embedding task evaluations |
The platform's breadth across multiple modalities (text, image, video, speech) and its focus on practical metrics like pricing and provider-level performance differentiate it from purely academic benchmarking efforts.