Kaggle is an online community and platform for data scientists and machine learning practitioners. Founded in April 2010 by Anthony Goldbloom and Ben Hamner, it began as a host for predictive modeling competitions where companies post data and contestants compete to produce the best models. Over fifteen years it grew into an ecosystem that includes public datasets, cloud-based notebooks, discussion forums, free courses, and a model hub. Google acquired Kaggle in March 2017, and it now operates as part of Google Cloud while keeping its open community character. By April 2025 the platform had crossed 23 million registered accounts, making it the largest community of working and aspiring data scientists in the world.
Kaggle has played an outsized role in shaping modern machine learning culture. The site popularized public competition leaderboards, helped launch the careers of thousands of data scientists, and served as the proving ground for influential libraries such as XGBoost, LightGBM, and CatBoost. Its tiered ranking system, with Novice, Contributor, Expert, Master, and Grandmaster levels, became a widely recognized credential in industry hiring. In recent years Kaggle has hosted some of the highest-profile open challenges in AI, including the ARC-AGI Prize, the Vesuvius Challenge for reading carbonized Roman scrolls, and generative AI competitions tied to Google's foundation models.
| Field | Value |
|---|---|
| Type | Subsidiary of Google LLC |
| Industry | Data science, machine learning, education |
| Founded | April 2010 |
| Founders | Anthony Goldbloom, Ben Hamner |
| Headquarters | San Francisco, California, United States |
| Original location | Melbourne, Australia |
| Parent | Google (Google Cloud / AI) |
| Acquired | March 8, 2017 |
| CEO | D. Sculley (since June 2022) |
| Users | 23+ million (April 2025) |
| Website | kaggle.com |
Kaggle was founded in Melbourne, Australia, in April 2010. Anthony Goldbloom, an Australian economist who had worked at the Reserve Bank of Australia and the Treasury, conceived the platform after writing about predictive modeling for The Economist. He noticed that organizations sitting on huge amounts of data rarely had the in-house expertise to extract value from it, while talented analysts often lacked access to interesting problems. The idea was a marketplace where anyone with statistical or machine learning skills could compete on company-supplied datasets, with the best entries rising on a public leaderboard.
Goldbloom was joined within months by Ben Hamner, a Duke University engineer who became co-founder and chief technology officer. Nicholas Gruen served as founding chair, and in November 2010 Jeremy Howard, previously the top-ranked competitor on the platform, joined as President and Chief Scientist. The company moved its headquarters from Melbourne to San Francisco in 2011 to be closer to Silicon Valley investors. PayPal co-founder Max Levchin chaired the board after Kaggle raised a Series A round of about $11 million from Index Ventures and Khosla Ventures, with later participation pushing total venture funding to roughly $12.75 million.
Kaggle's first competitions in 2010 and 2011 were small-scale challenges involving HIV research, chess ratings, and tourism forecasting. The platform got its first major dose of attention in April 2011 with the launch of the Heritage Health Prize, a $3 million purse sponsored by the Heritage Provider Network in California. Contestants were asked to predict, from anonymized claims data, how many days each patient would spend in the hospital over the following year. The competition ran for two years, attracted thousands of teams, and produced a steady cascade of methodological innovations. None of the final entries cleared the demanding accuracy threshold, so the grand prize went unclaimed; the leading team, POWERDOT, took an interim award of $500,000 in June 2013. The contest is still cited as the turning point that gave Kaggle credibility with serious enterprise sponsors. From 2011 onward the platform hosted challenges from Allstate, Merck, Facebook, Microsoft (Kinect gesture recognition), GE, and Manchester City football club, and by 2013 was running several dozen competitions a year while becoming a recognized recruiting funnel for quantitative hedge funds and large tech companies.
In May 2014 the ATLAS collaboration at CERN, in partnership with Paris-Saclay Centre for Data Science and Google, launched the Higgs Boson Machine Learning Challenge on Kaggle. The contest asked participants to separate signal events involving the Higgs boson from background noise in simulated proton-proton collisions. It became one of the largest physics-meets-machine-learning collaborations of the decade, drawing 1,785 teams. Gabor Melis of Hungary won first place with an ensemble of deep learning neural networks trained with minimal feature engineering. Tianqi Chen and Tong He, competing as team Crowwork, took the special High Energy Physics meets Machine Learning Award for the elegance of their solution. Their submission was built on a then-new gradient boosting library called XGBoost, which Chen had developed as a Ph.D. project at the University of Washington.
The Higgs competition is widely credited as the moment XGBoost broke through. In the years that followed it dominated the Kaggle leaderboards on tabular data problems. Surveys of winning solutions in the late 2010s consistently showed XGBoost in roughly half of all Kaggle wins, often ahead of deep learning approaches on structured data. The pattern repeated when Microsoft's LightGBM appeared in 2016 and Yandex's CatBoost in 2017, both of which were tested, hardened, and refined inside Kaggle competitions before becoming standard industry tools.
On March 8, 2017, Fei-Fei Li, then chief scientist of AI and machine learning at Google Cloud, announced from the stage of the Google Cloud Next conference in San Francisco that Google was acquiring Kaggle. The financial terms were never disclosed publicly. TechCrunch and Bloomberg reported that the deal value was modest by Silicon Valley standards but that Google considered the acquisition strategically important: the company gained access to a community of more than 800,000 data scientists at the time, deep visibility into what models and tools they were using, and a recruiting pipeline for its AI teams. Kaggle remained a distinct brand and continued to operate competitions for sponsors that competed with Google, including financial firms and other cloud providers.
In the weeks before the announcement Google and Kaggle had jointly run a $100,000 competition on YouTube-8M video classification, which served as a public preview of what the integration would look like. After the acquisition Kaggle gained free Tensor Processing Unit access for its notebooks, tighter integration with Google Cloud Storage and BigQuery, and engineering support that improved the reliability of its leaderboards and API.
In 2015 Kaggle launched a feature called Scripts, soon renamed Kernels, that let users execute Python or R code against competition datasets directly in the browser. Kernels were among the first widely used cloud-hosted Jupyter-style environments. In June 2018 the feature was rebranded as Kaggle Notebooks, with expanded GPU and TPU support, longer execution times, and integration with the Kaggle Datasets catalog.
In June 2022, after twelve years at the helm, Anthony Goldbloom and Ben Hamner stepped down as CEO and CTO. Both founders left to start a new company in the generative AI space, co-founding Sumble, a startup using large language models for analytics. D. Sculley, formerly director of engineering at Google Brain and a long-time researcher in machine learning systems, took over as chief executive officer. Sculley is best known in the research community as a co-author of the widely cited paper Hidden Technical Debt in Machine Learning Systems. Under his leadership Kaggle has leaned harder into generative AI tooling, model hosting, and partnerships with academic and scientific organizations.
In February 2023 Kaggle launched Kaggle Models, a hub for pretrained models that mirrors what Hugging Face does for the open-source community but with deeper integration into Google's ecosystem. The catalog includes Google's Gemma family, Meta's Llama models, and many community-contributed checkpoints. Through 2024 and 2025 Kaggle hosted a wave of generative AI competitions, including the multi-edition Google AI Studio competitions and the Google Gemini API competition series. In April 2025 Kaggle and the Wikimedia Foundation announced a partnership to host structured Wikipedia datasets directly on Kaggle. Throughout 2025 and into 2026, the platform remained the venue of choice for open AGI-style benchmarks.
Kaggle comprises several tightly linked products that together cover the data science workflow:
| Product | Launched | Purpose |
|---|---|---|
| Competitions | 2010 | Public and private predictive modeling contests with leaderboards and prizes |
| Datasets | 2016 | Public catalog of community and organization-shared datasets, searchable and versioned |
| Kernels (now Notebooks) | 2015 (renamed 2018) | Free cloud-hosted Jupyter notebooks with CPU, GPU, and TPU support |
| Discussions | 2010 | Forums attached to every competition, dataset, and notebook for collaboration and Q&A |
| Learn | 2018 | Free interactive micro-courses on Python, machine learning, deep learning, and SQL |
| Models | February 2023 | Hub for pretrained model weights including Gemma, Llama, and community uploads |
| Kaggle API | 2017 | Command-line tool for downloading datasets, submitting predictions, and managing notebooks |
| Kaggle Days | 2018 | Global series of in-person events, conferences, and meetups |
Competitions are the foundation of Kaggle. Sponsors provide a dataset, define a target metric, set a timeframe, and post a prize. Contestants submit predictions, which are scored on a hidden test set, and a leaderboard updates in real time. Scores on the public test split are visible during the contest, but final standings are determined on a separate private split that contestants only see after the deadline. This split design has become standard practice in machine learning evaluation pipelines well beyond Kaggle.
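The mechanics of the split can be sketched in a few lines of Python. This is a toy illustration only: the split fraction, the metric, and the row identifiers here are made up, and real competitions vary in all three.

```python
import random

def make_leaderboard_splits(test_ids, public_frac=0.3, seed=0):
    """Partition hidden test rows into a public split (scored live during
    the contest) and a private split (revealed only after the deadline)."""
    rng = random.Random(seed)
    ids = list(test_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * public_frac)
    return set(ids[:cut]), set(ids[cut:])

def accuracy(predictions, labels, ids):
    """Score a submission on one split only."""
    hits = sum(predictions[i] == labels[i] for i in ids)
    return hits / len(ids)

# Toy example: 10 hidden test rows with binary labels.
labels = {i: i % 2 for i in range(10)}
submission = {i: 1 for i in range(10)}  # a naive all-ones submission
public, private = make_leaderboard_splits(labels, public_frac=0.3)

public_score = accuracy(submission, labels, public)    # visible during contest
private_score = accuracy(submission, labels, private)  # decides final standings
```

The design discourages overfitting to the leaderboard: a contestant who tunes against the public score has no guarantee the gain carries over to the private split that actually determines the final ranking.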
Kaggle hosts several flavors of competition: Featured (large sponsor-backed contests with significant prize purses), Research (academic, often non-cash prizes such as paper co-authorship), Getting Started (evergreen tutorials like Titanic and Ames House Prices), Playground (short low-stakes practice contests), and Code competitions (introduced in 2017, requiring contestants to submit running notebooks instead of static prediction files, which caps inference cost).
The Datasets product, launched in 2016, lets anyone upload a dataset and share it with the community. As of the mid-2020s the catalog held hundreds of thousands of datasets, from canonical benchmarks like MNIST and CIFAR to scraped social media corpora, government statistics, and sports data. Each dataset has versioning, a discussion thread, and integrated notebook examples.
Kaggle Notebooks provide a free, browser-based environment with a recent Python and R stack, common scientific libraries preinstalled, and access to CPU, GPU (NVIDIA T4 and P100 class hardware), and TPU resources. Each user gets a weekly quota of accelerator hours. Notebooks can be made public, forked, and voted on, and the most-upvoted ones earn medals. Many of the top-ranked Kaggle Notebooks have become reference implementations within the broader machine learning community.
Every dataset, notebook, and competition has its own discussion forum, and there is a global area that functions like a Stack Overflow for data science. Discussions are how teams form, how solutions are publicly shared after competitions close, and how the community debates leaderboard tactics, ethics issues, and platform policies.
Kaggle Learn, launched in 2018, is a set of short, free, interactive courses pairing brief reading material with notebook-based exercises. Topics include Python basics, machine learning, deep learning, computer vision, natural language processing, time series, feature engineering, data visualization, SQL, geospatial analysis, and game AI. Each course awards a certificate on completion.
Kaggle Models, launched in February 2023, is a hub for pretrained model weights designed to be discoverable and easy to load inside Kaggle Notebooks. The catalog includes Google's Gemma open-weight models, Meta's Llama family, several Stable Diffusion variants, classic computer vision backbones, and a growing roster of community uploads.
The table below lists some of the highest-profile Kaggle competitions across the platform's history.
| Year | Competition | Sponsor | Prize | Winner / Notable result |
|---|---|---|---|---|
| 2011-2013 | Heritage Health Prize | Heritage Provider Network | $3 million (unclaimed) | Team POWERDOT took $500,000 interim prize |
| 2014 | Higgs Boson Machine Learning Challenge | CERN ATLAS, Paris-Saclay, Google | $13,000 | Gabor Melis (1st); Tianqi Chen and Tong He introduced XGBoost |
| 2015 | Otto Group Product Classification | Otto Group | $10,000 | Stacking became standard practice |
| 2015 | Diabetic Retinopathy Detection | California Healthcare Foundation | $100,000 | Deep learning for medical imaging at scale |
| 2016 | Mercedes-Benz Greener Manufacturing | Mercedes-Benz | $25,000 | Popular benchmark for stacking |
| 2016 | Two Sigma Financial Modeling | Two Sigma | $100,000 | First large code competition |
| 2017 | Zillow Prize | Zillow | $1.2 million | Among the largest cash purses in Kaggle history |
| 2018 | Home Credit Default Risk | Home Credit Group | $70,000 | 7,000+ teams, gradient boosting again dominant |
| 2019 | Santander Customer Transaction Prediction | Banco Santander | $65,000 | Feature engineering on anonymized features |
| 2023 | Vesuvius Challenge - Ink Detection | Scroll Prize | $1 million+ across phases | First legible Greek text from Herculaneum scrolls |
| 2024 | ARC Prize 2024 (ARC-AGI) | Mike Knoop, Francois Chollet | $1.1 million pool | The ARChitects (Franzen, Disselhoff) won using Test Time Training |
| 2024 | Vesuvius Challenge - Surface Detection | Scroll Prize | $100,000 | Reignited progress on virtual unwrapping |
| 2025-2026 | Google Gemini API; ARC Prize 2026 | Google; ARC Prize Foundation | varies | Generative AI evaluation; ARC-AGI-3 benchmark |
The Ames House Prices challenge (House Prices: Advanced Regression Techniques) deserves special mention. It is not a prize competition but rather a Getting Started tutorial running continuously since 2016 using a dataset of 2,930 home sales in Ames, Iowa, originally compiled by statistician Dean De Cock. The contest has trained a generation of beginners in feature engineering, regression, and gradient boosting, and along with the Titanic competition is the most common entry point for newcomers.
Kaggle launched the year after Netflix awarded the famous Netflix Prize for collaborative filtering. While the Netflix Prize was not itself a Kaggle competition, the model of a long-running open contest with a public leaderboard was inherited directly. Many of the top finishers in the Netflix Prize, including BellKor's Pragmatic Chaos team members, went on to compete on Kaggle, and the platform absorbed both the prize-money culture and the heavy emphasis on ensembling that the Netflix contest had popularized.
The 2014 Higgs Boson Machine Learning Challenge was a turning point for both physics and machine learning. The CERN team integrated several of the techniques developed during the contest into the actual ATLAS analysis pipeline, and it provided what may be the first large public demonstration of XGBoost outperforming bespoke physics features.
The ARC-AGI benchmark was created by Francois Chollet, author of Keras, in 2019 as a test of fluid intelligence in AI systems that resists memorization. In 2024 the ARC Prize Foundation launched a $1.1 million competition pool on Kaggle to encourage open-source progress, with winners required to publish their code. The ARChitects (German researchers Daniel Franzen and Jan Disselhoff) won by combining test time training with a fine-tuned language model, scoring 53.5 percent on the private evaluation. MindsAI scored higher (55.5 percent) but did not open-source their solution and were ineligible for the cash prize. Independently, researcher Ryan Greenblatt used a GPT-4o-driven program search to reach 42 percent on the public ARC-AGI-Pub leaderboard. In late December 2024, OpenAI publicly demonstrated its forthcoming o3 model on ARC-AGI-1 and reported scores as high as 87.5 percent at very high inference cost, sparking discussion about whether the benchmark was approaching saturation. ARC Prize editions continued on Kaggle in 2025 and 2026 with harder versions of the benchmark (ARC-AGI-2 and ARC-AGI-3).
The Vesuvius Challenge, launched in 2023 by tech investors Nat Friedman and Daniel Gross along with computer scientist Brent Seales, uses Kaggle to host its computer vision sub-competitions. Contestants recover legible text from 3D X-ray scans of papyrus scrolls carbonized by the eruption of Mount Vesuvius in 79 CE. In 2024 a small team of student researchers won the grand prize for reading the first continuous Greek passages from one of the scrolls. The Surface Detection sub-competition on Kaggle in 2024 carried a $100,000 purse.
Kaggle uses a five-tier progression system across four categories (Competitions, Datasets, Notebooks, Discussions). Each contribution can earn a Bronze, Silver, or Gold medal, and tier promotions require specific combinations of medals.
| Tier | General description | Approximate criteria (Competitions track) |
|---|---|---|
| Novice | Default tier on registration | None |
| Contributor | First level of engagement | Complete profile, run a notebook, cast a vote, post in discussion, submit to a competition |
| Expert | Demonstrated skill | At least 2 bronze medals (Competitions); category-specific equivalents apply for Notebooks, Datasets, Discussions |
| Master | Strong track record | At least 1 gold and 2 silver medals (Competitions) |
| Grandmaster | Top of the platform | At least 5 gold medals including 1 solo gold (Competitions) |
Medals in competitions are awarded by relative rank (roughly top 10 percent for Bronze, top 5 percent for Silver, plus a fixed cap for Gold). Notebooks, Datasets, and Discussions earn medals based on community upvotes. Tiers are earned independently in each category, so a person can be a Notebooks Grandmaster while still being a Competitions Expert. As of April 2025 Kaggle reported about 612 Grandmasters and 2,973 Masters across more than 23 million accounts, making the Grandmaster cohort roughly 0.003 percent of the user base. The progression system has become a recognizable hiring signal in industry, with many senior data scientist roles, particularly at quantitative finance firms and large tech companies, listing Kaggle Master or Grandmaster status as a desirable credential.
Kaggle developed a distinctive culture early on. Solutions to public competitions are typically published in detail on the discussion forums after the contest closes, including the architecture, hyperparameters, training data tricks, and ensemble structure that the winners used. This open-publishing norm meant techniques that worked in one competition diffused rapidly into others and into the broader machine learning community. Stacking, blending, target encoding, pseudo-labeling, snapshot ensembles, test-time augmentation, and several variants of cross-validation strategy were either invented on or popularized through Kaggle.
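As one example, target encoding, one of the techniques the forums popularized, replaces a categorical value with a smoothed estimate of the target mean for that category. The sketch below is a minimal version with additive smoothing; in competition practice the encoding is computed out-of-fold to avoid leaking the target into the training features.

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=10.0):
    """Smoothed target encoding: map each category to a blend of its
    per-category target mean and the global mean, weighted by how often
    the category occurs. Rare categories shrink toward the global mean."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    encoding = {}
    for c in counts:
        n = counts[c]
        cat_mean = sums[c] / n
        encoding[c] = (n * cat_mean + smoothing * global_mean) / (n + smoothing)
    return encoding, global_mean

# Toy data: category "a" is mostly positive, "b" all negative,
# "c" appears once (so its encoding stays close to the global mean).
cats = ["a", "a", "a", "b", "b", "c"]
ys   = [1,   1,   0,   0,   0,   1]
enc, gmean = target_encode(cats, ys, smoothing=1.0)
```

The smoothing term is what makes the trick survive high-cardinality features: a category seen once contributes almost nothing beyond the global prior, while a frequent category converges to its own mean.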
The community is geographically global. The Kaggle Days event series, founded in 2018 in collaboration with LogicAI, has hosted in-person conferences and meetups in cities including Warsaw, Paris, San Francisco, Tokyo, Beijing, Bangalore, Cairo, Dubai, and Brussels. The flagship Kaggle Days World Championship has been held annually since 2018. Since 2017 the platform has also run an annual Machine Learning and Data Science Survey of its users, the results of which are themselves published as a public dataset. The surveys have documented the rise of Python at the expense of R, the steady growth of deep learning frameworks (TensorFlow, then PyTorch), and the rapid adoption of large language model tooling from 2023 onward.
Kaggle's influence on the wider field is hard to overstate. It created a culture in which competing methods are evaluated head-to-head on identical data with held-out test sets, forcing practitioners to be honest about generalization. The public-private leaderboard split is now a basic concept taught in introductory machine learning courses.
The platform served as the practical R&D environment in which several of the most widely used machine learning libraries were tested and refined. XGBoost, LightGBM, and CatBoost all gained traction primarily through Kaggle wins. Many of the standard tricks of modern competitions, including stacking, target encoding, and clever cross-validation strategies, were invented or hardened on the platform.
Kaggle also democratized access to real machine learning problems. Before it existed, a graduate student or hobbyist had little way to see what production-scale tabular or computer vision problems actually looked like. After Kaggle, anyone with a browser could download a corporate dataset, train a model with free cloud GPUs, and see how their solution stacked up against thousands of others. Kaggle Learn courses and the dataset and notebook ecosystem have since been used in countless classroom and self-study programs, and many universities incorporate Kaggle competitions directly into their machine learning syllabi.
In the most recent era, Kaggle has been the staging ground for some of the most ambitious open AI evaluation efforts, including the ARC-AGI Prize. The platform's combination of trustworthy leaderboard infrastructure, large international community, and integration with Google Cloud has made it a default venue when an organization wants to run an open challenge with credibility and reach.
Kaggle has faced several recurring criticisms. The focus on a single optimization metric per competition has been called out for encouraging narrow problem framing that does not reflect real-world deployment. Winning solutions are often very large ensembles that would be impractical to put into production, although the introduction of code competitions in 2017 and inference-time limits in recent contests have partly addressed this. Academic and industry observers have argued that the heavy emphasis on small percentage improvements on benchmark datasets can crowd out more meaningful work on data quality, problem definition, and deployment.
A more concrete controversy emerged in late 2025, when researchers and journalists flagged that several widely shared face recognition datasets hosted on Kaggle had been collected without informed consent. Kaggle subsequently retracted approximately 40 datasets connected to those concerns. In April 2026 a follow-up review surfaced additional datasets, particularly some used in clinical machine learning research, that lacked clear provenance documentation. The Kaggle and Wikimedia Foundation initiative to host high-quality, well-licensed data, announced in April 2025, is partly a response to these provenance concerns.
Kaggle in 2026 is a recognizably different platform than the small Melbourne startup of 2010, but the basic premise is unchanged: sponsors post a problem and a leaderboard determines the winner. Roughly 23 million accounts now participate, and the platform continues to host the highest-profile open AI challenges in the world.