Netflix Prize
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,132 words
The Netflix Prize was an open machine learning competition run by Netflix between October 2006 and September 2009. The company offered a grand prize of US$1,000,000 to the first team that could improve the root mean squared error (RMSE) of its in-house movie recommender system, Cinematch, by at least 10 percent on a held-out test set of customer ratings. The contest released a public dataset of more than 100 million ratings, attracted tens of thousands of teams from over 180 countries, and produced a body of collaborative filtering research that shaped industrial recommender systems for years afterward. The grand prize was awarded on September 21, 2009 to the team BellKor's Pragmatic Chaos for an RMSE of 0.8567 on the qualifying test set, a 10.06 percent improvement over Cinematch.
The Netflix Prize also became a defining cautionary tale about data privacy. In 2008 Arvind Narayanan and Vitaly Shmatikov of the University of Texas at Austin showed that the users behind the supposedly anonymous ratings could be re-identified using only a small amount of public auxiliary information from the Internet Movie Database (IMDb). A class-action privacy lawsuit (Doe v. Netflix) and a Federal Trade Commission inquiry followed in 2009. In March 2010 Netflix responded by cancelling a planned sequel competition, the "Netflix Prize 2".
The Netflix Prize was announced on October 2, 2006 at a launch event in New York. The rules, posted at netflixprize.com, were simple. Netflix released a training dataset of customer ratings and asked competitors to predict ratings on a withheld qualifying set. Predictions were scored by RMSE, and any team that beat Cinematch by 10 percent on the private test portion of the qualifying set would win one million dollars. Two interim Progress Prizes of US$50,000 each were available at the end of 2007 and 2008 for the team in the lead, provided they had improved by at least 1 percent over the previous year's best.
A crucial design choice made the leaderboard hard to game: Netflix never told entrants which of their submitted predictions counted toward the public quiz score and which counted toward the private test score, so repeated probing of the leaderboard could not be used to overfit the score that decided the prize. There was no cap on the total number of submissions, but each team could submit only once per day.
Netflix had several motivations. Cinematch, deployed since 2000, was a competent neighborhood-based collaborative filter that powered the company's DVD-by-mail recommendations. Internal improvements had stalled at small RMSE gains. James Bennett and Stan Lanning, who designed the contest, hoped that exposing the problem to outside researchers and offering a large enough prize would draw in academic and hobbyist talent that Netflix could never have hired outright. The format borrowed from earlier challenges such as the KDD Cup, but the size of the dataset and the size of the prize were both unprecedented.
The published training set contained 100,480,507 ratings of 17,770 movies given by 480,189 anonymous customers between October 1998 and December 2005. Ratings were integers from 1 to 5 stars, and each record included a date. To enable in-sample tuning, Netflix released a probe set of 1,408,395 ratings drawn from the training data with statistical properties matching the held-out evaluation set, so teams could estimate their leaderboard score locally.
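The dimensions quoted above already imply how sparse the rating matrix was. The short calculation below is only a sanity check on those published counts; the constant names are illustrative.

```python
# Counts quoted in the text for the published training set.
N_USERS = 480_189
N_MOVIES = 17_770
N_RATINGS = 100_480_507

# Every user-movie pair is one cell of the full rating matrix.
cells = N_USERS * N_MOVIES
density = N_RATINGS / cells
ratings_per_user = N_RATINGS / N_USERS

print(f"possible user-movie cells: {cells:,}")
print(f"observed fraction:         {density:.2%}")
print(f"missing fraction:          {1 - density:.2%}")
print(f"mean ratings per user:     {ratings_per_user:.0f}")
```

Only a little over 1 percent of the cells are filled, so any method that needs a dense matrix has to deal with the missing entries somehow.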
The qualifying set, used for scoring, contained 2,817,131 user-movie pairs without ratings. Netflix split this internally into a quiz set of 1,408,342 ratings (used for the public leaderboard) and a test set of 1,408,789 ratings (used for the private score that decided the prize). Only Netflix's judges knew which prediction belonged to which subset, and quiz scores were rounded to four decimal places to limit information leakage.
| Subset | Ratings | Purpose |
|---|---|---|
| Training set | 100,480,507 | Model fitting |
| Probe set | 1,408,395 | In-sample validation, drawn from training data |
| Quiz set | 1,408,342 | Public leaderboard score |
| Test set | 1,408,789 | Private score that decided the grand prize |
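The quiz/test arrangement can be sketched as a scoring routine that holds a secret partition of the qualifying set. This is a toy illustration on synthetic ratings, not Netflix's scoring code; `score_submission`, `quiz_ids`, and the constant-prediction entrant are all hypothetical.

```python
import math
import random

def rmse(pairs):
    """RMSE over (predicted, actual) pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

def score_submission(predictions, truth, quiz_ids):
    """Return (public quiz RMSE rounded to 4 places, private test RMSE).

    `quiz_ids` is the organiser's secret: entrants submit a prediction for
    every id and never learn which ids feed which score.
    """
    quiz = [(predictions[i], truth[i]) for i in truth if i in quiz_ids]
    test = [(predictions[i], truth[i]) for i in truth if i not in quiz_ids]
    return round(rmse(quiz), 4), rmse(test)

rng = random.Random(0)
# Synthetic qualifying set: id -> hidden true rating (invented data).
truth = {i: rng.randint(1, 5) for i in range(10_000)}
# Secret, even split of the ids into quiz and test halves.
quiz_ids = set(rng.sample(sorted(truth), k=len(truth) // 2))
# A hypothetical entrant who predicts 3.6 stars for everything.
predictions = {i: 3.6 for i in truth}

public_score, private_score = score_submission(predictions, truth, quiz_ids)
print("public quiz RMSE: ", public_score)
print("private test RMSE:", round(private_score, 4))
```

Because both halves are drawn from the same pool, the public score tracks the private one closely, yet tuning against the public number alone gives no direct access to the ratings that decide the prize.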
The data were heavily imbalanced. A handful of users had rated more than ten thousand of the 17,770 movies, while most users had rated far fewer (the average works out to about 209 ratings per user); popular films had hundreds of thousands of ratings, while obscure films had a handful. The most-rated film in the dataset was Miss Congeniality (2000).
The metric was root mean squared error between predicted ratings and actual ratings on the test set. Lower is better. Cinematch's RMSE on the quiz set at the start of the contest was 0.9514 (Netflix later reported the test-set baseline at 0.9525, which is the figure used to compute the 10 percent target). A trivial predictor that always returned the global mean rating scored about 1.0540.
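As a minimal sketch of the metric, the function below computes RMSE and compares a constant "global mean" predictor with a better-informed one. The ratings and the `model` predictions are invented for illustration and have nothing to do with the contest data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))

# Toy held-out ratings (invented for illustration).
actual = [5, 3, 4, 1, 2]

# Trivial baseline: predict the same constant for every rating.
# Here the constant is simply the mean of the toy ratings.
global_mean = sum(actual) / len(actual)
baseline = [global_mean] * len(actual)

# A hypothetical model whose guesses sit closer to the truth.
model = [4.5, 3.2, 3.8, 1.5, 2.4]

print(f"baseline RMSE: {rmse(baseline, actual):.4f}")
print(f"model RMSE:    {rmse(model, actual):.4f}")
```

Because errors are squared before averaging, a few badly wrong predictions cost more than many slightly wrong ones.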
The grand prize threshold was a 10 percent improvement on the test set, which works out to RMSE 0.8572. The winning submission scored 0.8567, an improvement of 10.06 percent. To put that gap in context, the difference between Cinematch and the trivial mean predictor (0.9525 vs 1.0540) was about 10 percent, and three years of intense competition were required to close roughly the same distance again.
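The percentages above follow directly from the quoted RMSE values. The snippet below just redoes that arithmetic; the constant and function names are illustrative.

```python
CINEMATCH_TEST_RMSE = 0.9525   # test-set baseline quoted in the text
MEAN_PREDICTOR_RMSE = 1.0540   # trivial global-mean predictor quoted in the text
WINNING_TEST_RMSE = 0.8567     # winning submission quoted in the text

def relative_improvement(baseline_rmse, new_rmse):
    """Fractional RMSE reduction relative to a baseline."""
    return (baseline_rmse - new_rmse) / baseline_rmse

# A 10 percent improvement means an RMSE at most 90 percent of the baseline.
target = CINEMATCH_TEST_RMSE * (1 - 0.10)
winner_gain = relative_improvement(CINEMATCH_TEST_RMSE, WINNING_TEST_RMSE)
cinematch_gain = relative_improvement(MEAN_PREDICTOR_RMSE, CINEMATCH_TEST_RMSE)

print(f"10% target RMSE:      {target:.5f}")
print(f"winner vs Cinematch:  {winner_gain:.2%}")
print(f"Cinematch vs mean:    {cinematch_gain:.2%}")
```

The unrounded target is 0.85725, which the text quotes as 0.8572, and Cinematch's own gain over the mean predictor comes out a little under 10 percent.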
| Date | Event |
|---|---|
| October 2, 2006 | Netflix Prize launched in New York |
| October 8, 2006 | First team beats Cinematch on the quiz leaderboard |
| November 13, 2007 | First Progress Prize ($50,000) awarded to KorBell (later renamed BellKor) at RMSE 0.8712, an 8.43 percent improvement |
| 2008 | Second Progress Prize awarded to BellKor in BigChaos at RMSE 0.8616 |
| June 26, 2009 | BellKor's Pragmatic Chaos crosses the 10 percent threshold (Quiz RMSE 0.8558) and triggers the 30-day "last call" period |
| July 25-26, 2009 | The Ensemble, a 30-plus-team coalition, ties BellKor's Pragmatic Chaos at the same Quiz RMSE; tie broken by submission timestamp (BellKor's Pragmatic Chaos submitted about 20 minutes earlier) |
| August 2009 | Netflix announces a sequel competition, "Netflix Prize 2" |
| September 18, 2009 | Netflix announces BellKor's Pragmatic Chaos as the winner with Test RMSE 0.8567 |
| September 21, 2009 | Grand Prize awarded at a ceremony in New York |
| December 17, 2009 | Doe v. Netflix class-action lawsuit filed |
| March 12, 2010 | Netflix cancels Netflix Prize 2, citing the FTC inquiry and the Doe lawsuit; settles with the FTC and plaintiffs shortly afterward |
The Netflix Prize is widely credited with popularising several recommender-system techniques that had existed in research but were not yet standard industrial practice. The most important was the latent-factor approach based on matrix factorization.
The core idea, developed in the contest by Simon Funk (Brandyn Webb) in late 2006 and refined into industrial form by Yehuda Koren, Robert Bell, and Chris Volinsky, was to model the user-item rating matrix R as the product of two low-rank matrices: a user-factor matrix P and an item-factor matrix Q. Each user and each movie is represented by a vector of, say, 50 to 200 latent dimensions, and the predicted rating is the dot product of the relevant user and item vectors plus per-user and per-item biases and a global mean. The factors are learned by stochastic gradient descent on the observed ratings with L2 regularisation.
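The model described above can be sketched in a few lines of plain Python. This is a minimal illustration of biased matrix factorization trained by stochastic gradient descent with L2 regularisation, not a reconstruction of any team's code; the function names, hyperparameters, and the twelve toy ratings are all assumptions chosen for the example.

```python
import random

def train_mf(ratings, n_users, n_items, n_factors=2, lr=0.02, reg=0.02,
             n_epochs=300, seed=0):
    """Fit biased matrix factorization by SGD on observed (user, item, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean
    bu = [0.0] * n_users                                # per-user biases
    bi = [0.0] * n_items                                # per-item biases
    # Small random initial factors: P holds user vectors, Q holds item vectors.
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    order = list(ratings)
    for _ in range(n_epochs):
        rng.shuffle(order)
        for u, i, r in order:   # only observed entries are ever visited
            pred = mu + bu[u] + bi[i] + sum(p * q for p, q in zip(P[u], Q[i]))
            err = r - pred
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qif - reg * puf)
                Q[i][f] += lr * (err * puf - reg * qif)
    return mu, bu, bi, P, Q

def predict(model, u, i):
    """Global mean + user bias + item bias + dot product of the factor vectors."""
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + sum(p * q for p, q in zip(P[u], Q[i]))

# Toy data: users 0-1 prefer items 0-1, users 2-3 prefer items 2-3.
ratings = [(0, 0, 5), (0, 1, 4), (0, 2, 1), (1, 0, 4), (1, 1, 5), (1, 3, 2),
           (2, 2, 5), (2, 3, 4), (2, 0, 1), (3, 2, 4), (3, 3, 5), (3, 1, 1)]
model = train_mf(ratings, n_users=4, n_items=4)
for u, i, r in ratings[:3]:
    print(f"user {u}, item {i}: actual {r}, predicted {predict(model, u, i):.2f}")
```

The inner loop touches only the ratings that exist, which is what lets the approach scale to a matrix that is almost entirely empty.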
This is closely related to a truncated singular value decomposition, but classical SVD requires a fully populated matrix and the Netflix matrix was about 99 percent missing. Funk's contribution was to fit the factorization only on the observed entries, sidestepping the missing-data problem. Koren, Bell, and Volinsky's 2009 IEEE Computer article "Matrix Factorization Techniques for Recommender Systems" became the canonical reference for the approach and is one of the most cited papers in the recommender-system literature.
At KDD 2008, Yehuda Koren introduced two extensions in the paper "Factorization Meets the Neighborhood". Asymmetric SVD replaces the per-user factor vector with a sum over the items the user has rated, which means new users do not require model retraining. SVD++ goes further by adding a second item-factor vector that captures implicit feedback, the bare fact that a user chose to rate an item, regardless of the score. Both ideas exploit information that pure rating prediction ignores, and both contributed materially to the final solution.
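A sketch of the SVD++ prediction rule, following the form given in Koren's paper, shows where the implicit-feedback vectors enter: the user's explicit factor vector is shifted by a normalised sum of extra item vectors, one for each item the user has rated. The function and argument names, and all the numbers, are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def svdpp_predict(mu, b_u, b_i, p_u, q_i, y, rated_items):
    """SVD++-style prediction for one user-item pair.

    y maps item id -> implicit-feedback factor vector. rated_items is the set
    of items the user has rated; the scores themselves are not used here,
    only the fact that a rating exists.
    """
    k = len(q_i)
    implicit = [0.0] * k
    if rated_items:
        norm = 1.0 / math.sqrt(len(rated_items))   # |N(u)|^(-1/2) scaling
        for j in rated_items:
            for f in range(k):
                implicit[f] += norm * y[j][f]
    user_vec = [p + z for p, z in zip(p_u, implicit)]
    return mu + b_u + b_i + dot(q_i, user_vec)

# Hypothetical parameters for one user and one target item.
mu, b_u, b_i = 3.6, 0.1, -0.2
p_u = [0.5, -0.1]
q_i = [1.0, 0.5]
y = {10: [0.2, 0.0], 11: [0.0, 0.4]}

without_implicit = svdpp_predict(mu, b_u, b_i, p_u, q_i, y, set())
with_implicit = svdpp_predict(mu, b_u, b_i, p_u, q_i, y, {10, 11})
print(f"explicit factors only: {without_implicit:.3f}")
print(f"with implicit signal:  {with_implicit:.3f}")
```

Dropping `p_u` and relying on the item-derived sum alone gives the flavour of the asymmetric variant, in which a user is represented entirely through the items they have rated.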
Classical collaborative filtering computed similarity between users or items using fixed formulas (Pearson correlation, cosine similarity) and predicted ratings as similarity-weighted averages of neighbors' ratings. Bell and Koren's 2007 paper "Improved Neighborhood-based Collaborative Filtering" replaced the heuristic similarities with interpolation weights learned by least squares, producing a substantially better neighborhood model. The neighborhood family was important because it captured local effects (a user who likes one obscure film tends to like a closely related one) that latent-factor models smooth out.
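The classical scheme described at the start of this paragraph can be sketched as follows. This is the heuristic baseline (cosine similarity plus a similarity-weighted average), not Bell and Koren's learned interpolation weights; the data layout, function names, and toy ratings are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two items over the users who rated both.

    a and b each map user id -> rating for one item.
    """
    common = a.keys() & b.keys()
    if not common:
        return 0.0
    num = sum(a[u] * b[u] for u in common)
    den = (math.sqrt(sum(a[u] ** 2 for u in common))
           * math.sqrt(sum(b[u] ** 2 for u in common)))
    return num / den if den else 0.0

def predict_item_knn(user, target, item_ratings, k=2):
    """Similarity-weighted average of the user's ratings on the k most similar items."""
    neighbours = []
    for item, ratings in item_ratings.items():
        if item != target and user in ratings:
            sim = cosine(item_ratings[target], ratings)
            if sim > 0:
                neighbours.append((sim, ratings[user]))
    neighbours.sort(reverse=True)   # most similar items first
    top = neighbours[:k]
    if not top:
        return None                 # no usable neighbours for this user
    return sum(s * r for s, r in top) / sum(s for s, _ in top)

# Toy data: item -> {user: rating}. User u4 has not rated item A.
item_ratings = {
    "A": {"u1": 5, "u2": 4, "u3": 1},
    "B": {"u1": 4, "u2": 5, "u3": 2, "u4": 5},
    "C": {"u1": 1, "u2": 2, "u3": 5, "u4": 1},
}
print(f"k=2 prediction for u4 on A: {predict_item_knn('u4', 'A', item_ratings):.2f}")
```

The fixed similarity formula is the part the learned-weights model replaces: instead of trusting cosine or Pearson scores, the weights on each neighbour are fitted to minimise squared error.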
In 2007 Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton showed at ICML that Restricted Boltzmann Machines, a class of two-layer stochastic neural networks, could be applied to the rating-prediction problem and slightly outperformed carefully tuned SVD models. Their RBM-based predictors were combined with SVD models in many of the top entries and ended up as a key ingredient of the BellKor's Pragmatic Chaos blend.
Koren's 2009 KDD paper "Collaborative Filtering with Temporal Dynamics" introduced time-aware variants of the SVD++ and neighborhood models. The data spanned more than seven years, and during that period the rating scale itself drifted: average ratings crept upward in early 2004 (probably because Netflix changed the wording on its rating widget) and individual users' baselines drifted with mood, age, or context. The temporal models captured these effects with time-dependent biases and time-dependent factor vectors and were responsible for some of the largest single-model gains during the final year.
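The time-dependent biases can be illustrated with the baseline part of the model alone. The sketch below follows the general shape used in Koren's paper (a per-user linear drift around the user's mean rating date, and per-item biases that vary by time bin), but the function names, the parameter values, and the bin layout are assumptions for illustration.

```python
def user_time_deviation(day, mean_day, beta=0.4):
    """Signed, dampened distance of a rating date from the user's mean rating date."""
    delta = day - mean_day
    sign = (delta > 0) - (delta < 0)
    return sign * abs(delta) ** beta

def time_aware_baseline(mu, b_u, alpha_u, b_i, b_i_bins, day, mean_day_u, bin_size):
    """Global mean + drifting user bias + item bias that varies by time bin."""
    bin_index = min(day // bin_size, len(b_i_bins) - 1)
    user_part = b_u + alpha_u * user_time_deviation(day, mean_day_u)
    item_part = b_i + b_i_bins[bin_index]
    return mu + user_part + item_part

# Hypothetical parameters: days are counted from the start of the data.
estimate = time_aware_baseline(
    mu=3.6, b_u=0.1, alpha_u=0.01,
    b_i=0.2, b_i_bins=[0.0, 0.05, -0.05],
    day=132, mean_day_u=100, bin_size=100,
)
print(f"time-aware baseline estimate: {estimate:.2f}")
```

With these numbers the user is rating 32 days after their mean date, so the drift term adds a small positive offset on top of the static biases.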
No single model ever crossed the 10 percent line. The winning submission was a blend of 107 different predictors combined first by linear regression and later by gradient-boosted decision trees. The lesson, repeated by every top team, was that ensembling diverse, individually mediocre models almost always beat any single careful model. This finding shaped the conventional wisdom for ensemble methods in machine-learning competitions throughout the next decade.
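The simplest form of blending, a linear combination fitted by least squares, can be sketched as below. The two "models" are synthetic (the true rating plus independent noise), the helper names are illustrative, and for brevity the blend is scored on the same ratings it was fitted on, whereas a real entry would fit the weights on held-out data such as the probe set.

```python
import math
import random

def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

def solve(A, b):
    """Solve a small dense system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_linear_blend(predictions, targets):
    """Least-squares blend weights via the normal equations (X^T X) w = X^T y.

    predictions[m][n] is model m's prediction for rating n.
    """
    m = len(predictions)
    gram = [[sum(a * b for a, b in zip(predictions[i], predictions[j]))
             for j in range(m)] for i in range(m)]
    rhs = [sum(a * y for a, y in zip(predictions[i], targets)) for i in range(m)]
    return solve(gram, rhs)

def blend(predictions, weights):
    n = len(predictions[0])
    return [sum(w * p[k] for w, p in zip(weights, predictions)) for k in range(n)]

rng = random.Random(0)
truth = [rng.randint(1, 5) for _ in range(500)]
# Two hypothetical predictors whose errors are independent of each other.
model_a = [t + rng.gauss(0, 1.0) for t in truth]
model_b = [t + rng.gauss(0, 1.0) for t in truth]

weights = fit_linear_blend([model_a, model_b], truth)
blended = blend([model_a, model_b], weights)
print(f"model A RMSE: {rmse(model_a, truth):.4f}")
print(f"model B RMSE: {rmse(model_b, truth):.4f}")
print(f"blend RMSE:   {rmse(blended, truth):.4f}")
```

The blend helps here because the two predictors make different mistakes; combining near-identical models would gain almost nothing, which is why diversity mattered more than individual accuracy.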
| Technique | Originator | Year | Role in the winning solution |
|---|---|---|---|
| Funk SVD (matrix factorization on observed entries) | Simon Funk | 2006 | Core latent-factor predictor |
| Improved neighborhood model | Bell, Koren | 2007 | Local effects, complement to factor models |
| RBM for collaborative filtering | Salakhutdinov, Mnih, Hinton | 2007 | Diverse predictor for the blend |
| Asymmetric SVD, SVD++ | Koren | 2008 | Implicit feedback, no retraining for new users |
| Temporal dynamics (timeSVD++, time-aware neighborhood) | Koren | 2009 | Modeled rating-scale drift over the seven-year span |
| Gradient-boosted decision tree blending | BellKor's Pragmatic Chaos | 2009 | Combined 107 predictors into the final submission |
Three research groups dominated the final years and eventually merged into the winning entry.
| Team | Members | Affiliation |
|---|---|---|
| BellKor (originally KorBell) | Yehuda Koren, Robert Bell, Chris Volinsky | AT&T Labs (later Yahoo Research for Koren) |
| BigChaos | Andreas Töscher, Michael Jahrer | Commendo Research and Consulting, Austria |
| Pragmatic Theory | Martin Piotte, Martin Chabbert | Independent engineers, Quebec |
| BellKor's Pragmatic Chaos | The seven names above | Joint team that won the Grand Prize |
| The Ensemble | A coalition of more than 30 individual teams | Open consortium that tied at the deadline |
BellKor won the 2007 Progress Prize as a three-person AT&T team. For the 2008 Progress Prize they joined forces with BigChaos as "BellKor in BigChaos" and reached RMSE 0.8616. By spring 2009 the gains from individual models had largely been exhausted, so in the final months several leading teams combined. BellKor in BigChaos absorbed Pragmatic Theory to form BellKor's Pragmatic Chaos. A separate consortium, The Ensemble, formed by aggregating dozens of mid-leaderboard teams whose predictions were diverse enough to blend usefully.
The finish was extraordinarily close. On June 26, 2009 BellKor's Pragmatic Chaos posted a quiz RMSE of 0.8558 (about 10.05 percent improvement), which under the rules opened a 30-day "last call" window. On July 26, the last possible day, The Ensemble matched the same quiz RMSE down to four decimal places, but their submission arrived about 20 minutes after BellKor's Pragmatic Chaos's final entry. The tie-breaking rule was the submission timestamp. Netflix scored both teams on the private test set in the weeks that followed. The two teams ended up effectively tied on the test set as well, with BellKor's Pragmatic Chaos at 0.8567 narrowly ahead of The Ensemble. BellKor's Pragmatic Chaos was declared the winner.
Netflix released the training data with user IDs replaced by random integers and with no demographic information attached, and described the data as anonymous. In 2008 Arvind Narayanan and Vitaly Shmatikov of the University of Texas at Austin published "Robust De-anonymization of Large Sparse Datasets" at the IEEE Symposium on Security and Privacy. They showed that the high dimensionality and sparsity of the rating vectors made each user's rating history nearly unique, so a small amount of side information was enough to identify them. Using public IMDb ratings as auxiliary data, they were able to re-identify Netflix users and, in doing so, to reveal information that the users had not chosen to make public.
The paper became a landmark in privacy research because it generalised: the same attack could in principle work on any high-dimensional sparse behavioural dataset, not just movie ratings. It is frequently cited as evidence that simply removing names and addresses does not anonymise rich behavioural data, and it influenced subsequent work on differential privacy.
In August 2009 Netflix announced a sequel competition that would have released additional, richer data. Four anonymous Netflix subscribers filed Doe v. Netflix in December 2009 alleging violations of the Video Privacy Protection Act, and the Federal Trade Commission opened an inquiry into how the new release would affect customer privacy. On March 12, 2010 Netflix announced that it was cancelling Netflix Prize 2 and reached a settlement with the plaintiffs and the FTC shortly afterward. The original Netflix Prize dataset remained available for several years afterward and was widely used in academic research, but Netflix never released a successor.
A recurring observation from Netflix engineers, most clearly stated in a 2012 blog post by Xavier Amatriain and Justin Basilico on the Netflix Tech Blog, is that the full BellKor's Pragmatic Chaos solution was never put into production. Two algorithms developed during the contest, both from the 2007 Progress Prize era, were folded into Netflix's recommendation pipeline and stayed there for years, but the final 107-predictor blend was judged too engineering-heavy for the marginal accuracy gain it offered.
More importantly, Netflix's product changed underneath the prize. By 2009 streaming had begun to overtake DVD-by-mail, and the recommendation problem moved from "predict what star rating this user would give this movie if they watched it" to "choose what to put on the home page so the user starts watching now". The latter is a top-N ranking problem, often without any explicit ratings at all, and has different evaluation metrics, different inputs (clicks, dwell time, completion rates), and different optimisation targets than RMSE on five-star ratings. The Netflix Prize had pushed the rating-prediction problem about as far as it could be pushed at the time, which is partly why Netflix moved on.
For recommender-system research, the Netflix Prize moved matrix factorization from a niche technique to the default starting point. The 2009 IEEE Computer paper by Koren, Bell, and Volinsky has been cited tens of thousands of times and is taught in nearly every graduate-level course on recommender systems. SVD++, time-aware models, and the recipe of "learn many models, then blend" all became standard practice in industry.
For competitive data science, the prize set the template that Kaggle and similar platforms later inherited: a public dataset, a held-out leaderboard, a private test set to discourage overfitting, and a clear single-number evaluation metric. The dual-leaderboard idea, with quiz scores public and test scores hidden, became a staple of subsequent competitions. The contest also demonstrated that small, distributed teams of strangers could outperform large internal research groups on focused problems, an observation that informed the rise of crowdsourced ML throughout the 2010s.
For data privacy, the Narayanan and Shmatikov result and the subsequent legal action helped move the conversation away from informal anonymisation and toward formal privacy guarantees, particularly differential privacy. They also made companies considerably more cautious about releasing real user data for academic challenges. The contrast is visible in later contests: most large-scale industry-sponsored competitions since 2010 use synthetic data, heavily aggregated data, or data that has been pre-screened by privacy reviewers.
For Netflix itself, the prize is remembered as a marketing and research success more than a product success. The company spent about a million dollars in cash and a great deal of engineering time and got hundreds of papers, dozens of new techniques, and an enduring association with cutting-edge machine learning. The fact that the winning algorithm was never fully deployed has become a standard teaching example of the gap between offline benchmark improvements and real product impact.