A model card is a standardized documentation framework for machine learning models that describes a model's intended use, performance characteristics, training data, evaluation metrics, ethical considerations, and limitations. Model cards function as a form of transparency documentation, analogous to nutrition labels for food products, datasheets for electronic components, or Material Safety Data Sheets in chemistry. The concept was introduced by Margaret Mitchell and colleagues at Google in their 2019 paper "Model Cards for Model Reporting," and has since become a widely adopted standard in both industry and open-source AI development, particularly through Hugging Face's implementation on its Model Hub [1].
Model cards are part of a broader ecosystem of AI documentation practices that includes datasheets for datasets (Gebru et al., 2021), system cards (used by OpenAI for models like GPT-4), and regulatory documentation requirements under frameworks like the EU AI Act. Together, these documentation standards aim to improve accountability, reproducibility, and informed decision-making across the AI development lifecycle. They sit at the intersection of responsible AI, AI governance, and AI safety practice.
The foundational paper, "Model Cards for Model Reporting," was published in January 2019 at the ACM Conference on Fairness, Accountability, and Transparency (FAT* 2019). The paper was authored by Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru, all of whom were affiliated with Google at the time. The first preprint was posted to arXiv (1810.03993) in October 2018 [1].
Mitchell co-founded and co-led Google's Ethical AI team alongside Gebru. The model cards proposal grew out of that team's broader work on fairness, accountability, and bias in deployed machine learning. After Gebru's controversial departure from Google in December 2020, Mitchell was also fired in February 2021 following an internal investigation. She joined Hugging Face later that year as Chief Ethics Scientist, where she continued the model cards work and helped shape the platform's documentation practices [2][7].
The paper identified a gap in how machine learning models were documented and shared. While software engineering had established practices for documentation (API references, changelogs, README files), machine learning models were frequently released with minimal information about their intended use, performance across different populations, training data composition, or known limitations. This lack of transparency made it difficult for downstream users to evaluate whether a model was appropriate for their specific application, and it obscured potential harms to affected communities [1].
Mitchell et al. proposed model cards as a short, structured document accompanying any trained machine learning model. The paper drew inspiration from existing practices in other fields: datasheets in the electronics industry, nutritional labels in the food industry, and Material Safety Data Sheets in chemistry. The authors argued that a standardized documentation format would facilitate better communication between model developers and model users, encourage the evaluation of models across different demographic groups, and create accountability for known limitations [1].
The paper's most cited definition reads: "Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains." The emphasis on intersectional evaluation, drawing on critical race theory and earlier fairness work, was a deliberate departure from aggregate accuracy reporting [1].
The paper demonstrated the concept with example model cards for two Google models: a smile detector trained on the CelebA dataset and a toxicity classifier built on the Perspective API. The smile-detection card revealed higher false discovery rates for older men. The toxicity card surfaced bias against terms related to sexual orientation. In both cases the disaggregated reporting exposed disparities that aggregate accuracy numbers had hidden, demonstrating exactly the failure mode the authors were trying to address [1].
The original Mitchell et al. framework specifies nine sections that a model card should include. These sections have been refined and expanded by subsequent adopters, but the core structure remains consistent across most modern templates.
| Section | Description | Example content |
|---|---|---|
| Model details | Basic information about the model | Name, version, type (e.g., classification, generation), developer, release date, license, citation, contact |
| Intended use | What the model was designed for | Primary use cases, intended users, out-of-scope uses |
| Factors | Variables that affect model performance | Demographic factors (age, gender, ethnicity, Fitzpatrick skin type), environmental factors (lighting, noise), instrumentation factors (camera type, microphone quality) |
| Metrics | How performance is measured | Accuracy, F1 score, BLEU, perplexity, fairness metrics, decision thresholds, variation approaches, disaggregated by relevant factors |
| Evaluation data | Datasets used for testing | Dataset name, size, composition, motivation for choice, any known biases |
| Training data | Datasets used for training | Sources, size, collection methodology, preprocessing steps, any known limitations |
| Quantitative analyses | Detailed performance results | Unitary results per factor, intersectional results, confidence intervals, error analysis |
| Ethical considerations | Known risks and societal impact | Potential for harm, sensitive use cases, populations at risk, mitigation strategies considered during development |
| Caveats and recommendations | Known limitations and usage guidance | Failure modes, conditions under which the model should not be used, recommended downstream evaluation |
The emphasis on disaggregated evaluation, where performance metrics are broken down across demographic or contextual subgroups rather than reported only as aggregate numbers, is one of the most important contributions of the model card framework. A face recognition model might achieve 99% accuracy overall but only 85% accuracy on darker-skinned faces; an aggregate metric would hide this disparity, while a model card would make it explicit. This approach was directly informed by Joy Buolamwini and Timnit Gebru's earlier Gender Shades research, which found commercial face analysis systems performed dramatically worse on darker-skinned women than on lighter-skinned men [1].
The paper also proposed two specific reporting modes: unitary results, which show performance for each single factor, and intersectional results, which show performance at the intersection of factors (for example, age × skin type). The intersectional view is more demanding, since every cell of the cross-tabulation needs enough evaluation examples to be meaningful, but it typically reveals worse outcomes than either factor alone would suggest, which is precisely why Mitchell et al. argued it was needed.
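The difference is easy to make concrete. The sketch below uses entirely made-up predictions and factor labels to compute the same accuracy metric three ways: aggregate, unitary, and intersectional.

```python
import pandas as pd

# Toy evaluation results; every label and value here is hypothetical.
df = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 0],
    "age":    ["<40", "<40", "40+", "40+", "<40", "40+", "40+", "<40"],
    "skin":   ["I-III", "IV-VI", "I-III", "IV-VI",
               "I-III", "IV-VI", "I-III", "IV-VI"],  # Fitzpatrick bands
})
df["correct"] = (df["y_true"] == df["y_pred"]).astype(float)

print(df["correct"].mean())                           # aggregate: 0.75
print(df.groupby("age")["correct"].mean())            # unitary: per factor
print(df.groupby(["age", "skin"])["correct"].mean())  # intersectional
```

In this toy data the aggregate accuracy is 0.75 and the worst single-factor group scores 0.5, but the intersectional view exposes one age × skin-type cell at 0.0, exactly the kind of disparity that aggregate reporting hides.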
Hugging Face, the open-source AI platform and model hosting service, has become the primary venue where model cards are created and consumed in practice. The Hugging Face Hub crossed one million hosted models in late 2024 and surpassed two million by early 2026, with each model repository containing a README.md file that serves as the model card [3][4].
Hugging Face's adoption of model cards has been the single most significant factor in their widespread use. When a researcher or organization uploads a model to the Hub, the platform prompts them to fill out a model card using a standardized template. This template draws on the original Mitchell et al. framework but has been expanded to include additional metadata fields relevant to the Hugging Face ecosystem, such as pipeline tags (text-classification, image-generation, etc.), language codes, datasets used, and library compatibility [3].
In 2022, Hugging Face launched the Model Card Guidebook, authored primarily by Ezi Ozoani, Marissa Gerchick, and Margaret Mitchell. The guidebook bundled four resources: an Annotated Model Card Template explaining how to fill out each section, a user study on model card usage at the Hub, a landscape analysis of the broader documentation ecosystem, and a Model Card Creator Tool that allows users to generate model cards through a graphical interface without writing markdown directly. The release also added prompt text inside the template to encourage thorough completion of the Bias, Risks and Limitations section [3].
The quality of model cards on Hugging Face varies significantly. Models released by major organizations (Google, Meta, Microsoft, Anthropic) typically have detailed, well-structured model cards. Community-contributed models, which make up the majority of Hub content, range from thorough documentation to completely empty README files. A 2024 analysis published in Nature Machine Intelligence by Liang et al. systematically examined 32,111 AI model cards on the Hub. The study found that the Training section was the most consistently filled-out, while the Environmental Impact, Limitations, and Evaluation sections had the lowest completion rates. The same paper showed that improving model card completeness correlated with higher download rates, suggesting documentation pays off in adoption as well as accountability [5].
Hugging Face model cards include a YAML metadata block at the top of the README.md file, delimited by --- lines. This block enables structured search, filtering, and integration with the platform's API. Key metadata fields include:
| Field | Purpose |
|---|---|
| language | ISO 639-1 codes for languages the model supports or was trained on |
| license | License identifier (apache-2.0, mit, llama3, etc.) |
| tags | Free-form tags for categorization and discovery |
| datasets | Hub datasets used for training or evaluation |
| metrics | Performance metrics and their reported values |
| pipeline_tag | The task type (text-generation, image-classification, automatic-speech-recognition, etc.) |
| library_name | The compatible ML library (transformers, diffusers, sentence-transformers); required for transformers repos created after August 2024 |
| base_model | The model this one was fine-tuned, adapted, or quantized from |
| thumbnail | Image used in social media previews |
| model-index | Structured evaluation results with task, dataset, metric, and source fields |
The model-index field deserves special mention because it lets the Hub surface evaluation numbers in a standardized way. A repo can declare, for example, that the model achieved 87.3 on MMLU using the lm-eval-harness, and that result will appear on aggregated leaderboards rather than buried in prose.
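A minimal sketch of generating such a block programmatically, using the ModelCard, ModelCardData, and EvalResult classes from the huggingface_hub library; the repository name, metric value, and section stubs are hypothetical:

```python
from huggingface_hub import EvalResult, ModelCard, ModelCardData

# Metadata fields mirror the table above; all concrete values are made up.
card_data = ModelCardData(
    language="en",
    license="apache-2.0",
    library_name="transformers",
    pipeline_tag="text-classification",
    datasets=["imdb"],
    base_model="distilbert/distilbert-base-uncased",
    model_name="my-sentiment-model",  # required once eval_results is set
    eval_results=[
        EvalResult(  # serialized into the model-index field
            task_type="text-classification",
            dataset_type="imdb",
            dataset_name="IMDb",
            metric_type="accuracy",
            metric_value=0.91,
        )
    ],
)

readme = f"""---
{card_data.to_yaml()}
---

# my-sentiment-model

## Intended use
...

## Bias, Risks and Limitations
...
"""

ModelCard(readme).save("README.md")
# ModelCard.load("username/my-sentiment-model") round-trips an existing card.
```

Because the YAML is generated from typed fields rather than written by hand, the model-index entry should validate and surface in the Hub's metrics views.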
Model cards were designed to document models, but a parallel documentation framework exists for the datasets used to train and evaluate those models. "Datasheets for Datasets," proposed by Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford, was first published as a preprint in March 2018 and later appeared in Communications of the ACM in December 2021 [6].
The datasheets framework takes its name from the electronics industry, where every component ships with a datasheet specifying its operating characteristics. Gebru et al. argued that every dataset used in machine learning should be accompanied by a similar document covering motivation, composition, the data collection process, any preprocessing or labeling steps, recommended uses, distribution, maintenance plans, and legal or ethical considerations. The full questionnaire contains 57 questions across seven categories, designed to be answered before a dataset is shared widely [6].
The key questions a dataset datasheet answers include:
| Category | Example questions |
|---|---|
| Motivation | Why was the dataset created? Who created it? Who funded it? |
| Composition | What do the instances represent? How many are there? Is there missing data? Does it contain confidential information? |
| Collection process | How was the data collected? Who was involved? Over what timeframe? Were individuals notified? Did they consent? |
| Preprocessing | Was any data cleaned, filtered, or labeled? What tools or procedures were used? |
| Uses | What tasks has the dataset been used for? Are there tasks it should not be used for? |
| Distribution | How is the dataset distributed? Under what license? Are there access restrictions? |
| Maintenance | Who maintains the dataset? How can errors be reported? Will it be updated? |
Datasheets for datasets and model cards are complementary: a model card references the training and evaluation data, while the dataset's datasheet provides the detailed provenance information for that data. Together, they create a documentation chain from raw data to deployed model. Hugging Face later integrated dataset cards (a YAML-fronted README on every dataset repo) that mirror this structure for the Hub [6].
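The same huggingface_hub tooling covers the data side. A minimal sketch of a dataset card, with a hypothetical dataset and section stubs echoing the datasheet categories above:

```python
from huggingface_hub import DatasetCard, DatasetCardData

# Hypothetical dataset; YAML fields and section stubs are illustrative only.
data = DatasetCardData(
    language="en",
    license="cc-by-4.0",
    task_categories=["text-classification"],
    size_categories=["10K<n<100K"],
)

card = DatasetCard(f"""---
{data.to_yaml()}
---

# product-reviews (hypothetical)

## Motivation
Why the dataset was created, who created it, who funded it.

## Collection process
How and by whom the data was gathered; notification and consent.

## Uses and distribution
Appropriate and inappropriate tasks; license and access terms.
""")
card.save("README.md")  # becomes the dataset repository's card on the Hub
```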
| Dimension | Model card | Datasheet for datasets |
|---|---|---|
| What it documents | A trained machine learning model | A dataset used for training or evaluation |
| Originating paper | Mitchell et al., 2019 (FAT*) | Gebru et al., 2018 (preprint), 2021 (CACM) |
| Core sections | 9 (Model details to Caveats) | 7 categories, 57 questions |
| Disaggregation focus | Performance across demographic groups | Composition and collection across population subgroups |
| Typical author | Model developer | Dataset curator or maintainer |
| Common venue | Hugging Face README, internal tooling | Dataset README, paper appendix |
System cards represent an evolution of the model card concept, designed for AI systems that are more complex than a single model. The term was popularized by OpenAI, which has published system cards for several of its major releases, including GPT-4 (March 2023), GPT-4V (the vision-enabled variant, September 2023), GPT-4o (August 2024), the o1 reasoning models (December 2024), and subsequent frontier releases [8].
A system card differs from a model card in several respects. While a model card documents a single trained model in isolation, a system card documents the entire system that is deployed to users, including the base model, fine-tuning procedures, safety mitigations (such as RLHF training, content filters, and rate limits), deployment configuration, and the results of red-teaming and safety evaluations. System cards also typically include a discussion of the system's capabilities and the risks those capabilities pose in deployment [8].
OpenAI's GPT-4 System Card, published March 23, 2023 alongside the GPT-4 Technical Report, set the template that later system cards have followed. OpenAI assembled an external red team of 41 researchers across cybersecurity, biosecurity, persuasion, fairness, alignment, and other domains. The card documented specific capabilities that raised safety concerns, including the model's ability to provide step-by-step instructions for synthesizing dangerous chemicals (mitigated through refusal training), its potential to identify private individuals when augmented with outside data, and its performance on standardized exams (passing the Uniform Bar Exam in the 90th percentile). The card also described the limitations of the safety measures in place, including known jailbreaking techniques [8].
The GPT-4o System Card (August 2024) extended this to multimodal systems, adding voice safety evaluations covering unauthorized voice generation, speaker identification, disallowed audio content, and ungrounded inferences. The o1 System Card (December 2024) introduced new evaluations specific to reasoning models, including chain-of-thought monitoring and tests of whether the model would deceive evaluators when its goals conflicted with its training [8].
Anthropic publishes detailed model cards and system cards for its Claude family. The Claude 3 model card (March 2024) documented constitutional AI training, covered fourteen Trust & Safety policy areas across six languages, and included evaluations on areas such as elections integrity, child safety, cyber attacks, hate and discrimination, and violent extremism. The model card has been extended through addenda for Claude 3.5 Sonnet (June 2024), the upgraded Claude 3.5 Sonnet and Claude 3.5 Haiku (October 2024), Claude 3.7 Sonnet, Claude Opus 4 and Sonnet 4 (May 2025), and Claude Opus 4.5 (November 2025). Each release ties to Anthropic's Responsible Scaling Policy, which mandates specific evaluations in CBRN (Chemical, Biological, Radiological, Nuclear), cybersecurity, and autonomous capability domains before deployment [9][10].
Google DeepMind has published technical reports that serve a comparable function for Gemini 1.0, 1.5, 2.0, and later releases, while Meta has published model cards for its Llama series. The DeepSeek-V3 Technical Report (December 2024) and DeepSeek R1 Technical Report (January 2025) similarly act as system cards for those models, covering architecture (Mixture-of-Experts with 671B total / 37B active parameters for V3), training data scale (14.8 trillion tokens), training compute (2.788M H800 GPU hours), and downstream evaluations [11]. The common thread is that as AI systems become more complex and capable, documentation needs to cover not just the model itself but the full stack of engineering, safety, and deployment decisions around it.
| Aspect | Model card | System card |
|---|---|---|
| Scope | Single trained model | Entire deployed system including model, tooling, mitigations |
| Typical length | 1 to 10 pages | 30 to 200+ pages for frontier LLMs |
| Safety content | Known limitations | Red-team results, dangerous-capability evals, mitigation efficacy |
| Audience | ML practitioners, downstream developers | Developers, regulators, policymakers, civil society |
| Update cadence | Often static after release | Often updated with new evaluations or mitigations |
| Examples | Llama 3 model card, Mistral 7B model card | GPT-4 System Card, Claude Opus 4.5 System Card, Operator System Card |
A growing ecosystem of open-source tools makes model card creation easier. Each addresses a slightly different audience, from individual researchers to enterprise governance teams.
| Tool | Maintainer | Purpose |
|---|---|---|
| Model Card Toolkit (MCT) | TensorFlow / Google | Python library that auto-populates JSON schema from ML Metadata, integrates with TensorFlow Extended (TFX), renders to HTML |
| Hugging Face Model Card Creator | Hugging Face | Web UI for filling in the standardized template without writing markdown |
| huggingface_hub library | Hugging Face | Python ModelCard and ModelCardData classes for programmatic creation, validation, and pushing of cards |
| AI Factsheets | IBM (watsonx.governance) | Lifecycle tracking from training to production, integrated with model inventory; based on Arnold et al. "FactSheets" research |
| VerifyML | Cylynx | Open-source auditing tool that pairs with model cards to verify reported performance |
| Vertex AI Model Cards | Google Cloud | Managed model card generation tied to Vertex AI pipelines |
The Model Card Toolkit (MCT), open-sourced in July 2020 at github.com/tensorflow/model-card-toolkit, was Google's first public attempt to make model card production routine for engineers. It provides a JSON schema, a Python API, and a default HTML template, and can pull model lineage out of ML Metadata so that fields like training dataset and evaluation metrics populate automatically. IBM's AI FactSheets, building on "FactSheets: Increasing Trust in AI Services through Supplier's Declarations of Conformity" (Arnold et al., IBM Journal of Research and Development), takes a more enterprise-governance approach, treating each model as an asset in a tracked inventory.
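A sketch of the MCT flow, assuming the model-card-toolkit package and filling in hypothetical details for the smile-detector example from the original paper:

```python
import model_card_toolkit as mctlib

# Scaffold a card, fill a few of the Mitchell et al. sections, render HTML.
mct = mctlib.ModelCardToolkit(output_dir="model_card_output")
card = mct.scaffold_assets()

card.model_details.name = "smile-detector"  # hypothetical card content
card.model_details.overview = "Binary smile classifier trained on CelebA."
card.considerations.limitations = [
    mctlib.Limitation(description="Higher false discovery rate for older men.")
]
card.quantitative_analysis.performance_metrics = [
    mctlib.PerformanceMetric(type="accuracy", value="0.91", slice="overall"),
    mctlib.PerformanceMetric(type="accuracy", value="0.84", slice="age: 55+"),
]

mct.update_model_card(card)
html = mct.export_format()  # renders the default HTML template
```

When MLMD lineage is available, scaffold_assets can pre-populate fields like training data and metrics instead of leaving them to be filled by hand.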
Documentation requirements for AI systems have moved from voluntary best practice to legal mandate in several jurisdictions, particularly between 2023 and 2026. The regulations rarely use the term "model card" explicitly, but the information they require maps closely to the Mitchell et al. framework.
The EU AI Act, which entered into force on August 1, 2024, includes specific documentation requirements for AI systems, particularly those classified as high-risk. Providers of high-risk AI systems must maintain comprehensive technical documentation covering the system's intended purpose, design, development methodology, training and testing data, performance metrics, risk management measures, and post-market monitoring plans [12].
For general-purpose AI (GPAI) models, including LLMs, the AI Act requires providers to publish a sufficiently detailed summary of the training data content, using a template provided by the EU's AI Office. They must also draw up technical documentation covering the training and testing process and evaluation results, share documentation with downstream providers integrating the model, maintain that documentation per model version for ten years, and comply with EU copyright law. GPAI obligations entered into application on August 2, 2025, with full AI Office enforcement powers from August 2, 2026; providers of models placed on the market before August 2, 2025 have until August 2, 2027 to comply [12][13].
Providers of GPAI models with systemic risk (those exceeding 10^25 floating-point operations in training, a threshold designed to capture frontier models like GPT-4 and Claude Opus) face additional obligations: model evaluations including adversarial testing, serious incident reporting, cybersecurity protections, and documentation of energy consumption. Free and open-source models receive partial exemptions from the technical documentation and downstream support obligations but still must publish a training data summary and respect copyright [12].
The AI Act's documentation requirements effectively mandate something resembling model cards for all AI systems sold or deployed in the EU. While the regulation does not specifically require the Mitchell et al. format, the overlap between regulatory requirements and model card best practices has led many organizations to use model cards as a starting point for compliance.
The US NIST AI RMF (AI 100-1), published in January 2023, organizes responsible AI practice around four core functions: GOVERN, MAP, MEASURE, and MANAGE. Documentation runs through all four. The MAP function calls for documenting context, risks, and intended uses; MEASURE calls for documenting evaluation methods and results; GOVERN calls for organizational policies that make documentation systematic; MANAGE calls for documenting risk responses and monitoring. The framework explicitly treats documentation as enabling transparency, accountability, and human review, even though it does not prescribe a specific template [14].
NIST has also published a Generative AI Profile (NIST AI 600-1, July 2024), which adapts the RMF specifically to generative models, with documentation expectations tailored to risks like data leakage, hallucination, and content provenance.
In October 2023 the Biden administration issued Executive Order 14110, "Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence," which required developers of dual-use foundation models above specified compute thresholds to share safety test results with the federal government and contemplated reporting requirements that mirrored model card content. The order also created the US AI Safety Institute (USAISI) within NIST and tasked it with developing red-teaming guidelines [15].
On January 20, 2025, President Trump rescinded EO 14110 via Executive Order 14148. Three days later, Trump signed "Removing Barriers to American Leadership in Artificial Intelligence," which directs federal agencies to draft an AI Action Plan within 180 days. As of April 2026, federal documentation requirements for foundation models in the United States are in flux, though state-level requirements (notably in Colorado) continue to apply [15][16].
Colorado's Senate Bill 24-205, signed May 17, 2024 and known as the Colorado Anti-Discrimination in AI Law (ADAI), is the first state-level high-risk AI statute in the United States. It distinguishes between developers (who build or substantially modify high-risk AI systems) and deployers (who put them into use). Developers must give deployers a statement describing reasonably foreseeable uses, known harmful or inappropriate uses, the type of data used to train the system, and known or foreseeable limitations. They must also disclose any known risk of algorithmic discrimination to the Colorado Attorney General and to all known deployers within 90 days of discovery. Deployers must run impact assessments, maintain organized records (impact assessments, testing results, data source descriptions, version histories, internal review approvals), and notify consumers when a high-risk system is used to make a consequential decision about them. The provisions take effect June 30, 2026 [17].
| Framework | Jurisdiction | In force | Documentation core |
|---|---|---|---|
| EU AI Act | European Union | Aug 1, 2024 (GPAI obligations Aug 2, 2025) | Technical documentation, training data summary, downstream provider info |
| NIST AI RMF 1.0 | United States (voluntary) | Jan 26, 2023 | GOVERN/MAP/MEASURE/MANAGE functions, transparency woven through |
| NIST AI 600-1 (Gen AI Profile) | United States (voluntary) | Jul 26, 2024 | Generative-AI-specific risks and documentation |
| EO 14110 (Biden) | United States | Oct 30, 2023 (rescinded Jan 20, 2025) | Safety test reporting for dual-use models above compute thresholds |
| EO 14179 (Trump) | United States | Jan 23, 2025 | Action plan; rolls back prior reporting requirements |
| Colorado SB 24-205 (ADAI) | Colorado, US | Effective Jun 30, 2026 | Developer/deployer documentation, impact assessments, consumer notice |
| Singapore Model AI Governance Framework | Singapore (voluntary) | 2019 (rev. 2020, 2024) | Documentation of design, data, and decisions |
| Canada Algorithmic Impact Assessment | Canada (federal use) | 2020 | Mandatory questionnaire for federal AI systems |
The following table lists notable model cards and system cards published by major AI organizations, illustrating the range of documentation practices in use.
| Organization | Model | Document type | Year | Notable features |
|---|---|---|---|---|
| Google | Smile detector, Toxicity (Perspective) | Model card | 2019 | Original examples from the Mitchell et al. paper; revealed disparate performance for older men and for sexual-orientation terms |
| Google | Gemma | Model card | 2024 | Open-weights companion to Gemini; standardized template across Gemma 1 and 2 |
| Google DeepMind | Gemini 1.0, 1.5, 2.0 | Technical report | 2023, 2024 | De facto system card; long-context evaluations and red-team results |
| OpenAI | GPT-4 | System card | March 2023 | Red-team results across CBRN, persuasion, cybersecurity; bar exam at 90th percentile |
| OpenAI | GPT-4V (vision) | System card | September 2023 | Image-input safety, person identification, biometric inference |
| OpenAI | GPT-4o | System card | August 2024 | Multimodal (text/image/audio); voice safety, speaker ID, ungrounded inference |
| OpenAI | o1 | System card | December 2024 | First detailed evaluation of chain-of-thought reasoning safety |
| OpenAI | Operator | System card | 2025 | Agentic browser use; documentation of containment and oversight |
| Anthropic | Claude 3 (Opus, Sonnet, Haiku) | Model card | March 2024 | Constitutional AI training, 14 policy areas in 6 languages |
| Anthropic | Claude 3.5 Sonnet (and addendum) | Model card | June 2024, Oct 2024 | Document analysis, visual understanding, coding evaluations |
| Anthropic | Claude 3.7 Sonnet | System card | 2025 | First Anthropic system card under updated RSP |
| Anthropic | Claude Opus 4 / Sonnet 4 | System card | May 2025 | Hybrid-reasoning models; CBRN and autonomy evaluations |
| Anthropic | Claude Opus 4.5 | System card | November 2025 | RSP v3-aligned; Frontier Safety Roadmap and Risk Reports |
| Meta | Llama 2 | Model card | July 2023 | Detailed benchmarks and Responsible Use Guide; published on Hugging Face |
| Meta | Llama 3 (8B, 70B) | Model card | April 2024 | Tokenizer with a 128K-token vocabulary, 8,192-token sequences; CyberSecEval and CBRNE testing |
| Meta | Llama 3.1 / 3.2 / 3.3 | Model card | 2024 | Multilingual coverage, 405B variant, multimodal extensions |
| Mistral AI | Mistral 7B, Mixtral 8x7B | Model card | 2023 | Concise template focused on benchmarks and Apache 2.0 license |
| Stability AI | Stable Diffusion | Model card | 2022 | Documented training data sources, biases in image generation, misuse risks |
| DeepSeek | DeepSeek-V3 | Technical report | December 2024 | 671B MoE with 37B active; 14.8T training tokens; 2.788M H800 GPU hours |
| DeepSeek | DeepSeek-R1 | Technical report | January 2025 | Reinforcement-learning-only reasoning model with cold-start data |
The variation between these documents reflects both organizational priorities and the absence of a binding standard. OpenAI and Anthropic have converged on long, multi-section system cards (often 50 to 200+ pages for frontier models, with the combined Claude Opus 4.6 and GPT-5.3 system cards reportedly totaling 244 pages). Meta and Mistral favor compact model cards. Open-source community releases sit anywhere on the spectrum.
For large language models, the standard model card template has expanded well beyond the original Mitchell et al. nine sections. A typical 2024 to 2026 LLM model card includes:
| Section | Typical content |
|---|---|
| Architecture | Model family, decoder-only or MoE, parameter count, active parameters, attention type (e.g., grouped-query, multi-head latent), context length |
| Training data | Composition (web, code, books, math, multilingual), token count, cutoff date, deduplication and filtering |
| Training compute | FLOPs, hardware (e.g., H100, H800, TPU v5p), GPU-hours, training duration |
| Energy and carbon | kWh consumed, location-based and market-based CO2e, sometimes water usage |
| Capabilities benchmarks | MMLU, GPQA, HumanEval, MATH, BIG-Bench Hard, MMMU, AIME, SWE-bench, GDPval |
| Long-context evaluations | Needle-in-a-haystack, RULER, multi-document QA |
| Safety evaluations | Red-team results, refusal rates, jailbreak resistance, sycophancy and deception checks |
| Bias and fairness | StereoSet, BBQ, CrowS-Pairs, multilingual fairness checks |
| Alignment evaluations | Constitutional AI compliance, instruction-following, RLHF win rates |
| Dangerous-capability evals | CBRN uplift, cybersecurity (CTF, vulnerability discovery), autonomous replication |
| Suggested uses | Recommended applications and audiences |
| Out-of-scope uses | Applications the developer disclaims (e.g., medical diagnosis, legal advice) |
| License and access | Apache 2.0, MIT, Llama Community License, custom RAIL license, API-only access |
Not every model card hits every section. Open-weights releases are more likely to publish carbon and compute numbers; closed-weights frontier releases tend to invest more in safety and dangerous-capability sections.
Despite their wide adoption, model cards have been subject to several substantive critiques. Many of these limitations are acknowledged in the original Mitchell et al. paper, but they have not been fully resolved by subsequent practice.
Voluntary and inconsistent. Outside of regulated environments, model card creation is voluntary. Many models, particularly those released by smaller organizations or individual researchers, ship with incomplete or nonexistent documentation. The Liang et al. (2024) systematic analysis of 32,111 Hub model cards found that even the average card had multiple sections missing or stub-quality, with environmental, limitations, and evaluation sections least likely to be filled in [5].
Static documents. Model cards are typically created at the time of model release and rarely updated thereafter. As models are fine-tuned, deployed in new contexts, or discovered to have previously unknown failure modes, the original model card becomes outdated. Some organizations have adopted versioned model cards (Anthropic publishes addenda for each Claude release; Meta updates Llama cards across 3.1, 3.2, 3.3), but this practice is not widespread. Reward Reports (Gilbert et al., 2023) were proposed specifically to address the dynamic-system blind spot, focusing on reinforcement learning systems whose objectives evolve with deployment [18].
Self-reported. Model cards are written by the organizations that develop and release the models. There is an inherent tension between transparency and self-interest: organizations may understate risks, overstate performance, or omit information about known problems. Independent auditing of model card claims remains rare. Inioluwa Deborah Raji and colleagues have argued that internal algorithmic auditing should be standard practice and that external audit access should be a precondition for high-risk deployments [19].
Documentation theater. Critics have warned that producing a model card can substitute for actually doing the underlying safety, fairness, or accountability work. A polished card with the right section headers can give the impression of due diligence while the model itself was never tested in the relevant ways. This concern parallels longer-standing critiques of "ethics washing" in tech.
Readability and audience. Model cards are often written in technical language that is accessible to ML practitioners but opaque to policymakers, journalists, affected communities, and other stakeholders who may need the information most. The original Mitchell et al. paper acknowledged this tension and called for plain-language summaries; in practice, most cards still read like research-paper appendices.
Scope limitations. A model card documents a model; it does not document the deployment context, the downstream applications, or the lived experiences of people affected by the model's outputs. A toxicity classifier's model card might document that the model performs well on a standard benchmark, without capturing how its deployment in a content moderation system affects free expression for specific communities. System cards partially address this gap, but deployment-level documentation remains underdeveloped.
Benchmark gaming. As model cards have become competitive marketing artifacts, the benchmarks they report have become subject to contamination, cherry-picking, and tuning-for-the-test. The cards themselves are not at fault, but they have helped concentrate attention on a narrow set of metrics that may not generalize.
Model cards exist within a broader ecosystem of documentation and governance tools for machine learning. Hugging Face maintains a Landscape of ML Documentation Tools resource that catalogs these complementary frameworks; each addresses a different layer of the AI lifecycle.
| Tool / Framework | Purpose | Originating work |
|---|---|---|
| Model cards | Document individual ML models | Mitchell et al., FAT* 2019 |
| Datasheets for datasets | Document training and evaluation datasets | Gebru et al., CACM 2021 |
| System cards | Document deployed AI systems including safety measures | OpenAI (GPT-4 System Card, 2023) |
| AI FactSheets | Comprehensive AI governance documentation | Arnold et al., IBM 2019 |
| Data Statements | Document characteristics of NLP datasets | Bender and Friedman, TACL 2018 (v3 2023) |
| Reward Reports | Document objectives and reward functions of reinforcement learning systems | Gilbert et al., AIES 2023 |
| Dataset Nutrition Labels | Standardized nutrition-label-style data summary | Holland et al., 2018 |
| Responsible AI Licenses (RAIL) | License terms with use restrictions for AI models | RAIL Initiative |
| Algorithmic Impact Assessments | Pre-deployment risk assessment of AI systems | Government of Canada, 2020 |
These tools serve overlapping but distinct purposes. No single document can capture all relevant information about an AI system, from raw training data to real-world impact. The trend in AI documentation is toward complementary, layered documentation that covers different aspects of the AI lifecycle, with model cards anchoring the model layer while datasheets cover data, system cards cover deployment, and reward reports cover dynamic behavior.
As of April 2026, model cards are firmly established as a norm in the AI industry, though their adoption and quality remain uneven. The EU AI Act's GPAI obligations have been in force since August 2025, and full enforcement powers begin August 2026, which is already pulling the median quality of frontier-model documentation upward. Providers selling into the European market increasingly publish documentation that satisfies EU technical-documentation expectations regardless of where the company itself sits.
Hugging Face continues to lead in model card tooling and infrastructure. The platform now hosts over two million public models, with automated card generation features that pre-populate certain fields based on metadata, training logs, and benchmark results. Research groups have explored using language models themselves to draft model card content, an idea that raises awkward questions about accuracy and accountability when the documenter is the same kind of system being documented.
Frontier-lab system cards have grown dramatically in scope. Where the GPT-4 System Card in 2023 ran roughly 60 pages, more recent releases such as Claude Opus 4.5 and the GPT-5.3 system cards run well over 100 pages each, with detailed sections on dangerous-capability evaluations (CBRN uplift studies, cyber CTF performance, autonomous replication tests), agentic-task evaluations like SWE-bench and OpenAI's GDPval gold dataset of 220 tasks across 44 occupations, and a growing focus on evaluation reproducibility. Anthropic's Responsible Scaling Policy, OpenAI's Preparedness Framework, and Google DeepMind's Frontier Safety Framework all tie release decisions to specific evaluations that get reported back in the system card.
The broader trajectory points toward AI documentation becoming not just a best practice but a legal and commercial requirement. As AI systems are deployed in healthcare, finance, criminal justice, and defense, the demand for thorough, accurate, and independently verifiable documentation will only grow. Model cards, for all their limitations, established the vocabulary and expectations that more rigorous successors are now being built on.
[1] Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D., and Gebru, T. "Model Cards for Model Reporting." Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* 2019), 220-229. https://arxiv.org/abs/1810.03993 and https://dl.acm.org/doi/10.1145/3287560.3287596
[2] Mitchell, M. Personal site and biography. https://www.m-mitchell.com/ ; "Margaret Mitchell (scientist)." Wikipedia. https://en.wikipedia.org/wiki/Margaret_Mitchell_(scientist)
[3] Hugging Face. "Model Cards." https://huggingface.co/docs/hub/en/model-cards ; Hugging Face. "Model Card Guidebook." https://huggingface.co/docs/hub/en/model-card-guidebook ; Hugging Face. "Annotated Model Card Template." https://huggingface.co/docs/hub/en/model-card-annotated
[4] Hugging Face. "huggingface_hub v1.0: Five Years of Building the Foundation of Open Machine Learning." Blog post, 2025. https://huggingface.co/blog/huggingface-hub-v1
[5] Liang, W. et al. "Systematic analysis of 32,111 AI model cards characterizes documentation practice in AI." Nature Machine Intelligence, 2024. https://www.nature.com/articles/s42256-024-00857-z
[6] Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Daumé III, H., and Crawford, K. "Datasheets for Datasets." Communications of the ACM, Vol. 64, No. 12, December 2021, 86-92. https://arxiv.org/abs/1803.09010 and https://dl.acm.org/doi/10.1145/3458723
[7] Lunden, I. "Google fires top AI ethics researcher Margaret Mitchell." TechCrunch, February 19, 2021. https://techcrunch.com/2021/02/19/google-fires-top-ai-ethics-researcher-margaret-mitchell/
[8] OpenAI. "GPT-4 System Card." March 23, 2023. https://cdn.openai.com/papers/gpt-4-system-card.pdf ; OpenAI. "GPT-4o System Card." August 2024. https://openai.com/index/gpt-4o-system-card/ ; OpenAI. "Operator System Card." 2025. https://openai.com/index/operator-system-card/
[9] Anthropic. "The Claude 3 Model Family: Opus, Sonnet, Haiku." Model card, March 2024. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf ; Anthropic. "Model system cards." https://www.anthropic.com/system-cards
[10] Anthropic. "Anthropic's Responsible Scaling Policy." Versions 1.0 (Sept 2023), 2.0 (Oct 2024), 3.0 (2025). https://www.anthropic.com/responsible-scaling-policy
[11] DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, December 2024. https://arxiv.org/abs/2412.19437 ; DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." January 2025. https://huggingface.co/deepseek-ai/DeepSeek-R1
[12] European Commission. "AI Act." https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai ; European Commission. "The General-Purpose AI Code of Practice." https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai
[13] Future of Life Institute. "High-level summary of the AI Act." https://artificialintelligenceact.eu/high-level-summary/
[14] National Institute of Standards and Technology. "Artificial Intelligence Risk Management Framework (AI RMF 1.0)." NIST AI 100-1, January 2023. https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf ; NIST. "AI Risk Management Framework: Generative AI Profile." NIST AI 600-1, July 2024. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
[15] White House. "Executive Order 14110: Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence." October 30, 2023. https://en.wikipedia.org/wiki/Executive_Order_14110
[16] White House. "Executive Order 14148: Initial Rescissions of Harmful Executive Orders and Actions." January 20, 2025; "Removing Barriers to American Leadership in Artificial Intelligence." January 23, 2025. https://www.aila.org/library/executive-order-on-removing-barriers-to-american-leadership-in-artificial-intelligence
[17] Colorado General Assembly. "SB24-205 Consumer Protections for Artificial Intelligence." Signed May 17, 2024, effective June 30, 2026. https://leg.colorado.gov/bills/sb24-205
[18] Gilbert, T.K., Lambert, N., Dean, S., Zick, T., Snoswell, A., and Mehta, S. "Reward Reports for Reinforcement Learning." Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society. https://arxiv.org/abs/2204.10817
[19] Raji, I.D., Smart, A., White, R.N., Mitchell, M., Gebru, T., Hutchinson, B., Smith-Loud, J., Theron, D., and Barnes, P. "Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing." FAT* 2020. https://dl.acm.org/doi/10.1145/3351095.3372873
[20] Bender, E.M. and Friedman, B. "Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science." Transactions of the Association for Computational Linguistics, 2018. https://aclanthology.org/Q18-1041/
[21] Arnold, M., Bellamy, R.K.E., Hind, M., et al. "FactSheets: Increasing Trust in AI Services through Supplier's Declarations of Conformity." IBM Journal of Research and Development. https://research.ibm.com/blog/aifactsheets
[22] TensorFlow / Google. "Model Card Toolkit." https://github.com/tensorflow/model-card-toolkit ; "Introducing the Model Card Toolkit for Easier Model Transparency Reporting." Google Research blog, July 2020. https://research.google/blog/introducing-the-model-card-toolkit-for-easier-model-transparency-reporting/
[23] Meta. "Llama 3 Model Card." April 2024. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md ; "Llama 3.1 Model Card." https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md