A model card is a standardized documentation framework for machine learning models that describes a model's intended use, performance characteristics, training data, evaluation metrics, ethical considerations, and limitations. Model cards function as a form of transparency documentation, analogous to nutrition labels for food products, datasheets for electronic components, or Material Safety Data Sheets in chemistry. The concept was introduced by Margaret Mitchell and colleagues at Google in their 2019 paper "Model Cards for Model Reporting," and has since become a widely adopted standard in both industry and open-source AI development, particularly through Hugging Face's implementation on its Model Hub [1].
Model cards are part of a broader ecosystem of AI documentation practices that includes datasheets for datasets (Gebru et al., 2021), system cards (used by OpenAI for models like GPT-4), and regulatory documentation requirements under frameworks like the EU AI Act. Together, these documentation standards aim to improve accountability, reproducibility, and informed decision-making across the AI development lifecycle. They sit at the intersection of responsible AI, AI governance, and AI safety practice.
The foundational paper, "Model Cards for Model Reporting," was published in January 2019 at the ACM Conference on Fairness, Accountability, and Transparency (FAT* 2019). The paper was authored by Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru, all of whom were affiliated with Google at the time. The first preprint was posted to arXiv (1810.03993) in October 2018 [1].
Mitchell co-founded and co-led Google's Ethical AI team alongside Gebru. The model cards proposal grew out of that team's broader work on fairness, accountability, and bias in deployed machine learning. After Gebru's controversial departure from Google in December 2020, Mitchell was also fired in February 2021 following an internal investigation. She joined Hugging Face later that year as Chief Ethics Scientist, where she continued the model cards work and helped shape the platform's documentation practices [2][7].
The paper identified a gap in how machine learning models were documented and shared. While software engineering had established practices for documentation (API references, changelogs, README files), machine learning models were frequently released with minimal information about their intended use, performance across different populations, training data composition, or known limitations. This lack of transparency made it difficult for downstream users to evaluate whether a model was appropriate for their specific application, and it obscured potential harms to affected communities [1].
Mitchell et al. proposed model cards as a short, structured document accompanying any trained machine learning model. The paper drew inspiration from existing practices in other fields: datasheets in the electronics industry, nutritional labels in the food industry, and Material Safety Data Sheets in chemistry. The authors argued that a standardized documentation format would facilitate better communication between model developers and model users, encourage the evaluation of models across different demographic groups, and create accountability for known limitations [1].
The paper's most cited definition reads: "Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains." The emphasis on intersectional evaluation, drawing on critical race theory and earlier fairness work, was a deliberate departure from aggregate accuracy reporting [1].
The paper demonstrated the concept with example model cards for two Google models: a smile detector trained on the CelebA dataset and a toxicity classifier built on the Perspective API. The smile-detection card revealed higher false discovery rates for older men. The toxicity card surfaced bias against terms related to sexual orientation. In both cases the disaggregated reporting exposed disparities that aggregate accuracy numbers had hidden, demonstrating exactly the failure mode the authors were trying to address [1].
The original Mitchell et al. framework specifies nine sections that a model card should include. These sections have been refined and expanded by subsequent adopters, but the core structure remains consistent across most modern templates.
| Section | Description | Example content |
|---|---|---|
| Model details | Basic information about the model | Name, version, type (e.g., classification, generation), developer, release date, license, citation, contact |
| Intended use | What the model was designed for | Primary use cases, intended users, out-of-scope uses |
| Factors | Variables that affect model performance | Demographic factors (age, gender, ethnicity, Fitzpatrick skin type), environmental factors (lighting, noise), instrumentation factors (camera type, microphone quality) |
| Metrics | How performance is measured | Accuracy, F1 score, BLEU, perplexity, fairness metrics, decision thresholds, variation approaches, disaggregated by relevant factors |
| Evaluation data | Datasets used for testing | Dataset name, size, composition, motivation for choice, any known biases |
| Training data | Datasets used for training | Sources, size, collection methodology, preprocessing steps, any known limitations |
| Quantitative analyses | Detailed performance results | Unitary results per factor, intersectional results, confidence intervals, error analysis |
| Ethical considerations | Known risks and societal impact | Potential for harm, sensitive use cases, populations at risk, mitigation strategies considered during development |
| Caveats and recommendations | Known limitations and usage guidance | Failure modes, conditions under which the model should not be used, recommended downstream evaluation |
The emphasis on disaggregated evaluation, where performance metrics are broken down across demographic or contextual subgroups rather than reported only as aggregate numbers, is one of the most important contributions of the model card framework. A face recognition model might achieve 99% accuracy overall but only 85% accuracy on darker-skinned faces; an aggregate metric would hide this disparity, while a model card would make it explicit. This approach was directly informed by Joy Buolamwini and Timnit Gebru's earlier Gender Shades research, which found commercial face analysis systems performed dramatically worse on darker-skinned women than on lighter-skinned men [1].
The paper also proposed two specific reporting modes: unitary results, which show performance for each single factor, and intersectional results, which show performance at the intersection of factors (for example, age × skin type). The intersectional view is more demanding, since every cell of the cross-tabulation needs enough evaluation examples to be meaningful, but it typically reveals worse outcomes than either factor alone would suggest, which is precisely why Mitchell et al. argued it was needed.
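The difference is easy to make concrete. The sketch below uses entirely made-up predictions and factor labels to compute the same accuracy metric three ways: aggregate, unitary, and intersectional.

```python
import pandas as pd

# Toy evaluation results; every label and value here is hypothetical.
df = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 0],
    "age":    ["<40", "<40", "40+", "40+", "<40", "40+", "40+", "<40"],
    "skin":   ["I-III", "IV-VI", "I-III", "IV-VI",
               "I-III", "IV-VI", "I-III", "IV-VI"],  # Fitzpatrick bands
})
df["correct"] = (df["y_true"] == df["y_pred"]).astype(float)

print(df["correct"].mean())                           # aggregate: 0.75
print(df.groupby("age")["correct"].mean())            # unitary: per factor
print(df.groupby(["age", "skin"])["correct"].mean())  # intersectional
```

In this toy data the aggregate accuracy is 0.75 and the worst single-factor group scores 0.5, but the intersectional view exposes one age × skin-type cell at 0.0, exactly the kind of disparity that aggregate reporting hides.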
Hugging Face, the open-source AI platform and model hosting service, has become the primary venue where model cards are created and consumed in practice. The Hugging Face Hub crossed one million hosted models in late 2024 and surpassed two million by early 2026, with each model repository containing a README.md file that serves as the model card [3][4].
Hugging Face's adoption of model cards has been the single most significant factor in their widespread use. When a researcher or organization uploads a model to the Hub, the platform prompts them to fill out a model card using a standardized template. This template draws on the original Mitchell et al. framework but has been expanded to include additional metadata fields relevant to the Hugging Face ecosystem, such as pipeline tags (text-classification, image-generation, etc.), language codes, datasets used, and library compatibility [3].
In 2022, Hugging Face launched the Model Card Guidebook, authored primarily by Ezi Ozoani, Marissa Gerchick, and Margaret Mitchell. The guidebook bundled four resources: an Annotated Model Card Template explaining how to fill out each section, a user study on model card usage at the Hub, a landscape analysis of the broader documentation ecosystem, and a Model Card Creator Tool that allows users to generate model cards through a graphical interface without writing markdown directly. The release also added prompt text inside the template to encourage thorough completion of the Bias, Risks and Limitations section [3].
The quality of model cards on Hugging Face varies significantly. Models released by major organizations (Google, Meta, Microsoft, Anthropic) typically have detailed, well-structured model cards. Community-contributed models, which make up the majority of Hub content, range from thorough documentation to completely empty README files. A 2024 analysis published in Nature Machine Intelligence by Liang et al. systematically examined 32,111 AI model cards on the Hub. The study found that the Training section was the most consistently filled-out, while the Environmental Impact, Limitations, and Evaluation sections had the lowest completion rates. The same paper showed that improving model card completeness correlated with higher download rates, suggesting documentation pays off in adoption as well as accountability [5].
Hugging Face model cards include a YAML metadata block at the top of the README.md file, delimited by --- lines. This block enables structured search, filtering, and integration with the platform's API. Key metadata fields include:
| Field | Purpose |
|---|---|
| language | ISO 639-1 codes for languages the model supports or was trained on |
| license | License identifier (apache-2.0, mit, llama3, etc.) |
| tags | Free-form tags for categorization and discovery |
| datasets | Hub datasets used for training or evaluation |
| metrics | Performance metrics and their reported values |
| pipeline_tag | The task type (text-generation, image-classification, automatic-speech-recognition, etc.) |
| library_name | The compatible ML library (transformers, diffusers, sentence-transformers); required for transformers repos created after August 2024 |
| base_model | The model this one was fine-tuned, adapted, or quantized from |
| thumbnail | Image used in social media previews |
| model-index | Structured evaluation results with task, dataset, metric, and source fields |
The model-index field deserves special mention because it lets the Hub surface evaluation numbers in a standardized way. A repo can declare, for example, that the model achieved 87.3 on MMLU using the lm-eval-harness, and that result will appear on aggregated leaderboards rather than buried in prose.
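A minimal sketch of generating such a block programmatically, using the ModelCard, ModelCardData, and EvalResult classes from the huggingface_hub library; the repository name, metric value, and section stubs are hypothetical:

```python
from huggingface_hub import EvalResult, ModelCard, ModelCardData

# Metadata fields mirror the table above; all concrete values are made up.
card_data = ModelCardData(
    language="en",
    license="apache-2.0",
    library_name="transformers",
    pipeline_tag="text-classification",
    datasets=["imdb"],
    base_model="distilbert/distilbert-base-uncased",
    model_name="my-sentiment-model",  # required once eval_results is set
    eval_results=[
        EvalResult(  # serialized into the model-index field
            task_type="text-classification",
            dataset_type="imdb",
            dataset_name="IMDb",
            metric_type="accuracy",
            metric_value=0.91,
        )
    ],
)

readme = f"""---
{card_data.to_yaml()}
---

# my-sentiment-model

## Intended use
...

## Bias, Risks and Limitations
...
"""

ModelCard(readme).save("README.md")
# ModelCard.load("username/my-sentiment-model") round-trips an existing card.
```

Because the YAML is generated from typed fields rather than written by hand, the model-index entry should validate and surface in the Hub's metrics views.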
Model cards were designed to document models, but a parallel documentation framework exists for the datasets used to train and evaluate those models. "Datasheets for Datasets," proposed by Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford, was first published as a preprint in March 2018 and later appeared in Communications of the ACM in December 2021 [6].
The datasheets framework takes its name from the electronics industry, where every component ships with a datasheet specifying its operating characteristics. Gebru et al. argued that every dataset used in machine learning should be accompanied by a similar document covering motivation, composition, the data collection process, any preprocessing or labeling steps, recommended uses, distribution, maintenance plans, and legal or ethical considerations. The full questionnaire contains 57 questions across seven categories, designed to be answered before a dataset is shared widely [6].
The key questions a dataset datasheet answers include:
| Category | Example questions |
|---|---|
| Motivation | Why was the dataset created? Who created it? Who funded it? |
| Composition | What do the instances represent? How many are there? Is there missing data? Does it contain confidential information? |
| Collection process | How was the data collected? Who was involved? Over what timeframe? Were individuals notified? Did they consent? |
| Preprocessing | Was any data cleaned, filtered, or labeled? What tools or procedures were used? |
| Uses | What tasks has the dataset been used for? Are there tasks it should not be used for? |
| Distribution | How is the dataset distributed? Under what license? Are there access restrictions? |
| Maintenance | Who maintains the dataset? How can errors be reported? Will it be updated? |
Datasheets for datasets and model cards are complementary: a model card references the training and evaluation data, while the dataset's datasheet provides the detailed provenance information for that data. Together, they create a documentation chain from raw data to deployed model. Hugging Face later integrated dataset cards (a YAML-fronted README on every dataset repo) that mirror this structure for the Hub [6].
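The same huggingface_hub tooling covers the data side. A minimal sketch of a dataset card, with a hypothetical dataset and section stubs echoing the datasheet categories above:

```python
from huggingface_hub import DatasetCard, DatasetCardData

# Hypothetical dataset; YAML fields and section stubs are illustrative only.
data = DatasetCardData(
    language="en",
    license="cc-by-4.0",
    task_categories=["text-classification"],
    size_categories=["10K<n<100K"],
)

card = DatasetCard(f"""---
{data.to_yaml()}
---

# product-reviews (hypothetical)

## Motivation
Why the dataset was created, who created it, who funded it.

## Collection process
How and by whom the data was gathered; notification and consent.

## Uses and distribution
Appropriate and inappropriate tasks; license and access terms.
""")
card.save("README.md")  # becomes the dataset repository's card on the Hub
```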
| Dimension | Model card | Datasheet for datasets |
|---|---|---|
| What it documents | A trained machine learning model | A dataset used for training or evaluation |
| Originating paper | Mitchell et al., 2019 (FAT*) | Gebru et al., 2018 (preprint), 2021 (CACM) |
| Core sections | 9 (Model details to Caveats) | 7 categories, 57 questions |
| Disaggregation focus | Performance across demographic groups | Composition and collection across population subgroups |
| Typical author | Model developer | Dataset curator or maintainer |
| Common venue | Hugging Face README, internal tooling | Dataset README, paper appendix |
System cards represent an evolution of the model card concept, designed for AI systems that are more complex than a single model. The term was popularized by OpenAI, which has published system cards for several of its major releases, including GPT-4 (March 2023), GPT-4V (the vision-enabled variant, September 2023), GPT-4o (August 2024), the o1 reasoning models (December 2024), and subsequent frontier releases [8].
A system card differs from a model card in several respects. While a model card documents a single trained model in isolation, a system card documents the entire system that is deployed to users, including the base model, fine-tuning procedures, safety mitigations (such as RLHF training, content filters, and rate limits), deployment configuration, and the results of red-teaming and safety evaluations. System cards also typically include a discussion of the system's capabilities and the risks those capabilities pose in deployment [8].
OpenAI's GPT-4 System Card, published March 23, 2023 alongside the GPT-4 Technical Report, set the template that later system cards have followed. OpenAI assembled an external red team of 41 researchers across cybersecurity, biosecurity, persuasion, fairness, alignment, and other domains. The card documented specific capabilities that raised safety concerns, including the model's ability to provide step-by-step instructions for synthesizing dangerous chemicals (mitigated through refusal training), its potential to identify private individuals when augmented with outside data, and its performance on standardized exams (passing the Uniform Bar Exam in the 90th percentile). The card also described the limitations of the safety measures in place, including known jailbreaking techniques [8].
The GPT-4o System Card (August 2024) extended this to multimodal systems, adding voice safety evaluations covering unauthorized voice generation, speaker identification, disallowed audio content, and ungrounded inferences. The o1 System Card (December 2024) introduced new evaluations specific to reasoning models, including chain-of-thought monitoring and tests of whether the model would deceive evaluators when its goals conflicted with its training [8].
Anthropic publishes detailed model cards and system cards for its Claude family. The Claude 3 model card (March 2024) documented constitutional AI training, covered fourteen Trust & Safety policy areas across six languages, and included evaluations on areas such as elections integrity, child safety, cyber attacks, hate and discrimination, and violent extremism. The model card has been extended through addenda for Claude 3.5 Sonnet (June 2024), the upgraded Claude 3.5 Sonnet and Claude 3.5 Haiku (October 2024), Claude 3.7 Sonnet, Claude Opus 4 and Sonnet 4 (May 2025), and Claude Opus 4.5 (November 2025). Each release ties to Anthropic's Responsible Scaling Policy, which mandates specific evaluations in CBRN (Chemical, Biological, Radiological, Nuclear), cybersecurity, and autonomous capability domains before deployment [9][10].
Google DeepMind has published technical reports that serve a comparable function for Gemini 1.0, 1.5, 2.0, and later releases, while Meta has published model cards for its Llama series. The DeepSeek-V3 Technical Report (December 2024) and DeepSeek R1 Technical Report (January 2025) similarly act as system cards for those models, covering architecture (Mixture-of-Experts with 671B total / 37B active parameters for V3), training data scale (14.8 trillion tokens), training compute (2.788M H800 GPU hours), and downstream evaluations [11]. The common thread is that as AI systems become more complex and capable, documentation needs to cover not just the model itself but the full stack of engineering, safety, and deployment decisions around it.
| Aspect | Model card | System card |
|---|---|---|
| Scope | Single trained model | Entire deployed system including model, tooling, mitigations |
| Typical length | 1 to 10 pages | 30 to 200+ pages for frontier LLMs |
| Safety content | Known limitations | Red-team results, dangerous-capability evals, mitigation efficacy |
| Audience | ML practitioners, downstream developers | Developers, regulators, policymakers, civil society |
| Update cadence | Often static after release | Often updated with new evaluations or mitigations |
| Examples | Llama 3 model card, Mistral 7B model card | GPT-4 System Card, Claude Opus 4.5 System Card, Operator System Card |
A growing ecosystem of open-source tools makes model card creation easier. Each addresses a slightly different audience, from individual researchers to enterprise governance teams.
| Tool | Maintainer | Purpose |
|---|---|---|
| Model Card Toolkit (MCT) | TensorFlow / Google | Python library that auto-populates JSON schema from ML Metadata, integrates with TensorFlow Extended (TFX), renders to HTML |
| Hugging Face Model Card Creator | Hugging Face | Web UI for filling in the standardized template without writing markdown |
| huggingface_hub library | Hugging Face | Python ModelCard and ModelCardData classes for programmatic creation, validation, and pushing of cards |
| AI Factsheets | IBM (watsonx.governance) | Lifecycle tracking from training to production, integrated with model inventory; based on Arnold et al. "FactSheets" research |
| VerifyML | Cylynx | Open-source auditing tool that pairs with model cards to verify reported performance |
| Vertex AI Model Cards | Google Cloud | Managed model card generation tied to Vertex AI pipelines |
The Model Card Toolkit (MCT), open-sourced in July 2020 at github.com/tensorflow/model-card-toolkit, was Google's first public attempt to make model card production routine for engineers. It provides a JSON schema, a Python API, and a default HTML template, and can pull model lineage out of ML Metadata so that fields like training dataset and evaluation metrics populate automatically. IBM's AI FactSheets, building on "FactSheets: Increasing Trust in AI Services through Supplier's Declarations of Conformity" (Arnold et al., IBM Journal of Research and Development), takes a more enterprise-governance approach, treating each model as an asset in a tracked inventory.
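A sketch of the MCT flow, assuming the model-card-toolkit package and filling in hypothetical details for the smile-detector example from the original paper:

```python
import model_card_toolkit as mctlib

# Scaffold a card, fill a few of the Mitchell et al. sections, render HTML.
mct = mctlib.ModelCardToolkit(output_dir="model_card_output")
card = mct.scaffold_assets()

card.model_details.name = "smile-detector"  # hypothetical card content
card.model_details.overview = "Binary smile classifier trained on CelebA."
card.considerations.limitations = [
    mctlib.Limitation(description="Higher false discovery rate for older men.")
]
card.quantitative_analysis.performance_metrics = [
    mctlib.PerformanceMetric(type="accuracy", value="0.91", slice="overall"),
    mctlib.PerformanceMetric(type="accuracy", value="0.84", slice="age: 55+"),
]

mct.update_model_card(card)
html = mct.export_format()  # renders the default HTML template
```

When MLMD lineage is available, scaffold_assets can pre-populate fields like training data and metrics instead of leaving them to be filled by hand.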
Documentation requirements for AI systems have moved from voluntary best practice to legal mandate in several jurisdictions, particularly between 2023 and 2026. The regulations rarely use the term "model card" explicitly, but the information they require maps closely to the Mitchell et al. framework.
The EU AI Act, which entered into force on August 1, 2024, includes specific documentation requirements for AI systems, particularly those classified as high-risk. Providers of high-risk AI systems must maintain comprehensive technical documentation covering the system's intended purpose, design, development methodology, training and testing data, performance metrics, risk management measures, and post-market monitoring plans [12].
For general-purpose AI (GPAI) models, including LLMs, the AI Act requires providers to publish a sufficiently detailed summary of the training data content, using a template provided by the EU's AI Office. They must also draw up technical documentation covering the training and testing process and evaluation results, share documentation with downstream providers integrating the model, maintain that documentation per model version for ten years, and comply with EU copyright law. GPAI obligations entered into application on August 2, 2025, with full AI Office enforcement powers from August 2, 2026; providers of models placed on the market before August 2, 2025 have until August 2, 2027 to comply [12][13].
Providers of GPAI models with systemic risk (those exceeding 10^25 floating-point operations in training, a threshold designed to capture frontier models like GPT-4 and Claude Opus) face additional obligations: model evaluations including adversarial testing, serious incident reporting, cybersecurity protections, and documentation of energy consumption. Free and open-source models receive partial exemptions from the technical documentation and downstream support obligations but still must publish a training data summary and respect copyright [12].
The AI Act's documentation requirements effectively mandate something resembling model cards for all AI systems sold or deployed in the EU. While the regulation does not specifically require the Mitchell et al. format, the overlap between regulatory requirements and model card best practices has led many organizations to use model cards as a starting point for compliance.
The US NIST AI RMF (AI 100-1), published in January 2023, organizes responsible AI practice around four core functions: GOVERN, MAP, MEASURE, and MANAGE. Documentation runs through all four. The MAP function calls for documenting context, risks, and intended uses; MEASURE calls for documenting evaluation methods and results; GOVERN calls for organizational policies that make documentation systematic; MANAGE calls for documenting risk responses and monitoring. The framework explicitly treats documentation as enabling transparency, accountability, and human review, even though it does not prescribe a specific template [14].
NIST has also published a Generative AI Profile (NIST AI 600-1, July 2024), which adapts the RMF specifically to generative models, with documentation expectations tailored to risks like data leakage, hallucination, and content provenance.
In October 2023 the Biden administration issued Executive Order 14110, "Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence," which required developers of dual-use foundation models above specified compute thresholds to share safety test results with the federal government and contemplated reporting requirements that mirrored model card content. The order also created the US AI Safety Institute (USAISI) within NIST and tasked it with developing red-teaming guidelines [15].
On January 20, 2025, President Trump rescinded EO 14110 via Executive Order 14148. Three days later, Trump signed "Removing Barriers to American Leadership in Artificial Intelligence," which directs federal agencies to draft an AI Action Plan within 180 days. As of April 2026, federal documentation requirements for foundation models in the United States are in flux, though state-level requirements (notably in Colorado) continue to apply [15][16].
Colorado's Senate Bill 24-205, signed May 17, 2024 and known as the Colorado Anti-Discrimination in AI Law (ADAI), is the first state-level high-risk AI statute in the United States. It distinguishes between developers (who build or substantially modify high-risk AI systems) and deployers (who put them into use). Developers must give deployers a statement describing reasonably foreseeable uses, known harmful or inappropriate uses, the type of data used to train the system, and known or foreseeable limitations. They must also disclose any known risk of algorithmic discrimination to the Colorado Attorney General and to all known deployers within 90 days of discovery. Deployers must run impact assessments, maintain organized records (impact assessments, testing results, data source descriptions, version histories, internal review approvals), and notify consumers when a high-risk system is used to make a consequential decision about them. The provisions take effect June 30, 2026 [17].
| Framework | Jurisdiction | In force | Documentation core |
|---|---|---|---|
| EU AI Act | European Union | Aug 1, 2024 (GPAI obligations Aug 2, 2025) | Technical documentation, training data summary, downstream provider info |
| NIST AI RMF 1.0 | United States (voluntary) | Jan 26, 2023 | GOVERN/MAP/MEASURE/MANAGE functions, transparency woven through |
| NIST AI 600-1 (Gen AI Profile) | United States (voluntary) | Jul 26, 2024 | Generative-AI-specific risks and documentation |
| EO 14110 (Biden) | United States | Oct 30, 2023 (rescinded Jan 20, 2025) | Safety test reporting for dual-use models above compute thresholds |
| EO 14179 (Trump) | United States | Jan 23, 2025 | Action plan; rolls back prior reporting requirements |
| Colorado SB 24-205 (ADAI) | Colorado, US | Effective Jun 30, 2026 | Developer/deployer documentation, impact assessments, consumer notice |
| Singapore Model AI Governance Framework | Singapore (voluntary) | 2019 (rev. 2020, 2024) | Documentation of design, data, and decisions |
| Canada Algorithmic Impact Assessment | Canada (federal use) | 2020 | Mandatory questionnaire for federal AI systems |
The following table lists notable model cards and system cards published by major AI organizations, illustrating the range of documentation practices in use.
| Organization | Model | Document type | Year | Notable features |
|---|---|---|---|---|
| Google | Smile detector, Toxicity (Perspective) | Model card | 2019 | Original examples from the Mitchell et al. paper; revealed disparate performance for older men and for sexual-orientation terms |
| Google | Gemma | Model card | 2024 | Open-weights companion to Gemini; standardized template across Gemma 1 and 2 |
| Google DeepMind | Gemini 1.0, 1.5, 2.0 | Technical report | 2023, 2024 | De facto system card; long-context evaluations and red-team results |
| OpenAI | GPT-4 | System card | March 2023 | Red-team results across CBRN, persuasion, cybersecurity; bar exam at 90th percentile |
| OpenAI | GPT-4V (vision) | System card | September 2023 | Image-input safety, person identification, biometric inference |
| OpenAI | GPT-4o | System card | August 2024 | Multimodal (text/image/audio); voice safety, speaker ID, ungrounded inference |
| OpenAI | o1 | System card | December 2024 | First detailed evaluation of chain-of-thought reasoning safety |
| OpenAI | Operator | System card | 2025 | Agentic browser use; documentation of containment and oversight |
| Anthropic | Claude 3 (Opus, Sonnet, Haiku) | Model card | March 2024 | Constitutional AI training, 14 policy areas in 6 languages |
| Anthropic | Claude 3.5 Sonnet (and addendum) | Model card | June 2024, Oct 2024 | Document analysis, visual understanding, coding evaluations |
| Anthropic | Claude 3.7 Sonnet | System card | 2025 | First Anthropic system card under updated RSP |
| Anthropic | Claude Opus 4 / Sonnet 4 | System card | May 2025 | Hybrid-reasoning models; CBRN and autonomy evaluations |
| Anthropic | Claude Opus 4.5 | System card | November 2025 | RSP v3-aligned; Frontier Safety Roadmap and Risk Reports |
| Meta | Llama 2 | Model card | July 2023 | Detailed benchmarks and Responsible Use Guide; published on Hugging Face |
| Meta | Llama 3 (8B, 70B) | Model card | April 2024 | Tokenizer with a 128K-token vocabulary, 8,192-token sequences; CyberSecEval and CBRNE testing |
| Meta | Llama 3.1 / 3.2 / 3.3 | Model card | 2024 | Multilingual coverage, 405B variant, multimodal extensions |
| Mistral AI | Mistral 7B, Mixtral 8x7B | Model card | 2023 | Concise template focused on benchmarks and Apache 2.0 license |
| Stability AI | Stable Diffusion | Model card | 2022 | Documented training data sources, biases in image generation, misuse risks |
| DeepSeek | DeepSeek-V3 | Technical report | December 2024 | 671B MoE with 37B active; 14.8T training tokens; 2.788M H800 GPU hours |
| DeepSeek | DeepSeek-R1 | Technical report | January 2025 | Reinforcement-learning-only reasoning model with cold-start data |
The variation between these documents reflects both organizational priorities and the absence of a binding standard. OpenAI and Anthropic have converged on long, multi-section system cards (often 50 to 200+ pages for frontier models, with the combined Claude Opus 4.6 and GPT-5.3 system cards reportedly totaling 244 pages). Meta and Mistral favor compact model cards. Open-source community releases sit anywhere on the spectrum.
For large language models, the standard model card template has expanded well beyond the original Mitchell et al. nine sections. A typical 2024 to 2026 LLM model card includes:
| Section | Typical content |
|---|---|
| Architecture | Model family, decoder-only or MoE, parameter count, active parameters, attention type (e.g., grouped-query, multi-head latent), context length |
| Training data | Composition (web, code, books, math, multilingual), token count, cutoff date, deduplication and filtering |
| Training compute | FLOPs, hardware (e.g., H100, H800, TPU v5p), GPU-hours, training duration |
| Energy and carbon | kWh consumed, location-based and market-based CO2e, sometimes water usage |
| Capabilities benchmarks | MMLU, GPQA, HumanEval, MATH, BIG-Bench Hard, MMMU, AIME, SWE-bench, GDPval |
| Long-context evaluations | Needle-in-a-haystack, RULER, multi-document QA |
| Safety evaluations | Red-team results, refusal rates, jailbreak resistance, sycophancy and deception checks |
| Bias and fairness | StereoSet, BBQ, CrowS-Pairs, multilingual fairness checks |
| Alignment evaluations | Constitutional AI compliance, instruction-following, RLHF win rates |
| Dangerous-capability evals | CBRN uplift, cybersecurity (CTF, vulnerability discovery), autonomous replication |
| Suggested uses | Recommended applications and audiences |
| Out-of-scope uses | Applications the developer disclaims (e.g., medical diagnosis, legal advice) |
| License and access | Apache 2.0, MIT, Llama Community License, custom RAIL license, API-only access |
Not every model card hits every section. Open-weights releases are more likely to publish carbon and compute numbers; closed-weights frontier releases tend to invest more in safety and dangerous-capability sections.
Despite their wide adoption, model cards have been subject to several substantive critiques. Many of these limitations are acknowledged in the original Mitchell et al. paper, but they have not been fully resolved by subsequent practice.
Voluntary and inconsistent. Outside of regulated environments, model card creation is voluntary. Many models, particularly those released by smaller organizations or individual researchers, ship with incomplete or nonexistent documentation. The Liang et al. (2024) systematic analysis of 32,111 Hub model cards found that even the average card had multiple sections missing or stub-quality, with environmental, limitations, and evaluation sections least likely to be filled in [5].
Static documents. Model cards are typically created at the time of model release and rarely updated thereafter. As models are fine-tuned, deployed in new contexts, or discovered to have previously unknown failure modes, the original model card becomes outdated. Some organizations have adopted versioned model cards (Anthropic publishes addenda for each Claude release; Meta updates Llama cards across 3.1, 3.2, 3.3), but this practice is not widespread. Reward Reports (Gilbert et al., 2023) were proposed specifically to address the dynamic-system blind spot, focusing on reinforcement learning systems whose objectives evolve with deployment [18].
Self-reported. Model cards are written by the organizations that develop and release the models. There is an inherent tension between transparency and self-interest: organizations may understate risks, overstate performance, or omit information about known problems. Independent auditing of model card claims remains rare. Inioluwa Deborah Raji and colleagues have argued that internal algorithmic auditing should be standard practice and that external audit access should be a precondition for high-risk deployments [19].
Documentation theater. Critics have warned that producing a model card can substitute for actually doing the underlying safety, fairness, or accountability work. A polished card with the right section headers can give the impression of due diligence while the model itself was never tested in the relevant ways. This concern parallels longer-standing critiques of "ethics washing" in tech.
Readability and audience. Model cards are often written in technical language that is accessible to ML practitioners but opaque to policymakers, journalists, affected communities, and other stakeholders who may need the information most. The original Mitchell et al. paper acknowledged this tension and called for plain-language summaries; in practice, most cards still read like research-paper appendices.
Scope limitations. A model card documents a model; it does not document the deployment context, the downstream applications, or the lived experiences of people affected by the model's outputs. A toxicity classifier's model card might document that the model performs well on a standard benchmark, without capturing how its deployment in a content moderation system affects free expression for specific communities. System cards partially address this gap, but deployment-level documentation remains underdeveloped.
Benchmark gaming. As model cards have become competitive marketing artifacts, the benchmarks they report have become subject to contamination, cherry-picking, and tuning-for-the-test. The cards themselves are not at fault, but they have helped concentrate attention on a narrow set of metrics that may not generalize.
Model cards exist within a broader ecosystem of documentation and governance tools for machine learning. Hugging Face maintains a Landscape of ML Documentation Tools resource that catalogs these complementary frameworks; each addresses a different layer of the AI lifecycle.
| Tool / Framework | Purpose | Originating work |
|---|---|---|
| Model cards | Document individual ML models | Mitchell et al., FAT* 2019 |
| Datasheets for datasets | Document training and evaluation datasets | Gebru et al., CACM 2021 |
| System cards | Document deployed AI systems including safety measures | OpenAI (GPT-4 System Card, 2023) |
| AI FactSheets | Comprehensive AI governance documentation | Arnold et al., IBM 2019 |
| Data Statements | Document characteristics of NLP datasets | Bender and Friedman, TACL 2018 (v3 2023) |
| Reward Reports | Document objectives and reward functions of reinforcement learning systems | Gilbert et al., AIES 2023 |
| Dataset Nutrition Labels | Standardized nutrition-label-style data summary | Holland et al., 2018 |
| Responsible AI Licenses (RAIL) | License terms with use restrictions for AI models | RAIL Initiative |
| Algorithmic Impact Assessments | Pre-deployment risk assessment of AI systems | Government of Canada, 2020 |
These tools serve overlapping but distinct purposes. No single document can capture all relevant information about an AI system, from raw training data to real-world impact. The trend in AI documentation is toward complementary, layered documentation that covers different aspects of the AI lifecycle, with model cards anchoring the model layer while datasheets cover data, system cards cover deployment, and reward reports cover dynamic behavior.
As of April 2026, model cards are firmly established as a norm in the AI industry, though their adoption and quality remain uneven. The EU AI Act's GPAI obligations have been in force since August 2025, and full enforcement powers begin August 2026, which is already pulling the median quality of frontier-model documentation upward. Providers selling into the European market increasingly publish documentation that satisfies EU technical-documentation expectations regardless of where the company itself sits.
Hugging Face continues to lead in model card tooling and infrastructure. The platform now hosts over two million public models, with automated card generation features that pre-populate certain fields based on metadata, training logs, and benchmark results. Research groups have explored using language models themselves to draft model card content, an idea that raises awkward questions about accuracy and accountability when the documenter is the same kind of system being documented.
Frontier-lab system cards have grown dramatically in scope. Where the GPT-4 System Card in 2023 ran roughly 60 pages, more recent releases such as Claude Opus 4.5 and the GPT-5.3 system cards run well over 100 pages each, with detailed sections on dangerous-capability evaluations (CBRN uplift studies, cyber CTF performance, autonomous replication tests), agentic-task evaluations like SWE-bench and OpenAI's GDPval gold dataset of 220 tasks across 44 occupations, and a growing focus on evaluation reproducibility. Anthropic's Responsible Scaling Policy, OpenAI's Preparedness Framework, and Google DeepMind's Frontier Safety Framework all tie release decisions to specific evaluations that get reported back in the system card.
The broader trajectory points toward AI documentation becoming not just a best practice but a legal and commercial requirement. As AI systems are deployed in healthcare, finance, criminal justice, and defense, the demand for thorough, accurate, and independently verifiable documentation will only grow. Model cards, for all their limitations, established the vocabulary and expectations that more rigorous successors are now being built on.
[1] Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D., and Gebru, T. "Model Cards for Model Reporting." Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* 2019), 220-229. https://arxiv.org/abs/1810.03993 and https://dl.acm.org/doi/10.1145/3287560.3287596
[2] Mitchell, M. Personal site and biography. https://www.m-mitchell.com/ ; "Margaret Mitchell (scientist)." Wikipedia. https://en.wikipedia.org/wiki/Margaret_Mitchell_(scientist)
[3] Hugging Face. "Model Cards." https://huggingface.co/docs/hub/en/model-cards ; Hugging Face. "Model Card Guidebook." https://huggingface.co/docs/hub/en/model-card-guidebook ; Hugging Face. "Annotated Model Card Template." https://huggingface.co/docs/hub/en/model-card-annotated
[4] Hugging Face. "huggingface_hub v1.0: Five Years of Building the Foundation of Open Machine Learning." Blog post, 2025. https://huggingface.co/blog/huggingface-hub-v1
[5] Liang, W. et al. "Systematic analysis of 32,111 AI model cards characterizes documentation practice in AI." Nature Machine Intelligence, 2024. https://www.nature.com/articles/s42256-024-00857-z
[6] Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Daumé III, H., and Crawford, K. "Datasheets for Datasets." Communications of the ACM, Vol. 64, No. 12, December 2021, 86-92. https://arxiv.org/abs/1803.09010 and https://dl.acm.org/doi/10.1145/3458723
[7] Lunden, I. "Google fires top AI ethics researcher Margaret Mitchell." TechCrunch, February 19, 2021. https://techcrunch.com/2021/02/19/google-fires-top-ai-ethics-researcher-margaret-mitchell/
[8] OpenAI. "GPT-4 System Card." March 23, 2023. https://cdn.openai.com/papers/gpt-4-system-card.pdf ; OpenAI. "GPT-4o System Card." August 2024. https://openai.com/index/gpt-4o-system-card/ ; OpenAI. "Operator System Card." 2025. https://openai.com/index/operator-system-card/
[9] Anthropic. "The Claude 3 Model Family: Opus, Sonnet, Haiku." Model card, March 2024. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf ; Anthropic. "Model system cards." https://www.anthropic.com/system-cards
[10] Anthropic. "Anthropic's Responsible Scaling Policy." Versions 1.0 (Sept 2023), 2.0 (Oct 2024), 3.0 (2025). https://www.anthropic.com/responsible-scaling-policy
[11] DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, December 2024. https://arxiv.org/abs/2412.19437 ; DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." January 2025. https://huggingface.co/deepseek-ai/DeepSeek-R1
[12] European Commission. "AI Act." https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai ; European Commission. "The General-Purpose AI Code of Practice." https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai
[13] Future of Life Institute. "High-level summary of the AI Act." https://artificialintelligenceact.eu/high-level-summary/
[14] National Institute of Standards and Technology. "Artificial Intelligence Risk Management Framework (AI RMF 1.0)." NIST AI 100-1, January 2023. https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf ; NIST. "AI Risk Management Framework: Generative AI Profile." NIST AI 600-1, July 2024. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
[15] White House. "Executive Order 14110: Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence." October 30, 2023. https://en.wikipedia.org/wiki/Executive_Order_14110
[16] White House. "Executive Order 14148: Initial Rescissions of Harmful Executive Orders and Actions." January 20, 2025; "Removing Barriers to American Leadership in Artificial Intelligence." January 23, 2025. https://www.aila.org/library/executive-order-on-removing-barriers-to-american-leadership-in-artificial-intelligence
[17] Colorado General Assembly. "SB24-205 Consumer Protections for Artificial Intelligence." Signed May 17, 2024, effective June 30, 2026. https://leg.colorado.gov/bills/sb24-205
[18] Gilbert, T.K., Lambert, N., Dean, S., Zick, T., Snoswell, A., and Mehta, S. "Reward Reports for Reinforcement Learning." Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society. https://arxiv.org/abs/2204.10817
[19] Raji, I.D., Smart, A., White, R.N., Mitchell, M., Gebru, T., Hutchinson, B., Smith-Loud, J., Theron, D., and Barnes, P. "Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing." FAT* 2020. https://dl.acm.org/doi/10.1145/3351095.3372873
[20] Bender, E.M. and Friedman, B. "Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science." Transactions of the Association for Computational Linguistics, 2018. https://aclanthology.org/Q18-1041/
[21] Arnold, M., Bellamy, R.K.E., Hind, M., et al. "FactSheets: Increasing Trust in AI Services through Supplier's Declarations of Conformity." IBM Journal of Research and Development. https://research.ibm.com/blog/aifactsheets
[22] TensorFlow / Google. "Model Card Toolkit." https://github.com/tensorflow/model-card-toolkit ; "Introducing the Model Card Toolkit for Easier Model Transparency Reporting." Google Research blog, July 2020. https://research.google/blog/introducing-the-model-card-toolkit-for-easier-model-transparency-reporting/
[23] Meta. "Llama 3 Model Card." April 2024. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md ; "Llama 3.1 Model Card." https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md