COMPAS (recidivism risk assessment)
Last reviewed
May 2, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,975 words
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a proprietary actuarial risk-assessment instrument used by United States courts and corrections agencies to estimate the likelihood that a criminal defendant will reoffend, fail to appear at a court date, or commit a violent crime. The tool is sold by Equivant, the present-day brand of the company originally known as Northpointe, Inc., which built the first version in the late 1990s. COMPAS scores defendants on a one-to-ten scale across multiple risk dimensions, and these scores are used in pretrial release decisions, probation supervision, parole eligibility, and, in some jurisdictions, in sentencing.
What sets COMPAS apart from the dozens of other risk instruments in use across the country is what happened to it in 2016. On May 23 of that year, ProPublica published Machine Bias, an investigation by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner that analyzed more than 7,000 COMPAS scores from Broward County, Florida, and concluded that the algorithm made systematic mistakes by race. Black defendants who did not go on to reoffend were nearly twice as likely to be tagged high-risk as comparable white defendants. White defendants who did go on to reoffend were nearly twice as likely to be tagged low-risk. Northpointe pushed back hard, arguing that the system was calibrated correctly within each racial group. Both sides turned out to be telling the truth about different things, which is the heart of why COMPAS became the founding case study of an entire research field.
In the years that followed, the COMPAS story drove the emergence of algorithmic fairness as a recognizable academic discipline, prompted formal impossibility theorems showing why the ProPublica and Northpointe positions could not be reconciled, anchored a wave of popular books on algorithmic bias, survived a Wisconsin Supreme Court challenge in Loomis v. Wisconsin, and helped seed the founding of the ACM Conference on Fairness, Accountability, and Transparency (FAccT) in 2018. Almost every introduction to fairness in machine learning mentions COMPAS within the first few pages.
Northpointe, Inc. was founded in 1989 by criminologists Tim Brennan and Dave Wells, with William Dieterich joining as a senior researcher. The company built decision-support software for correctional agencies, drawing on Brennan's academic work on offender classification at the University of Colorado. The first version of the COMPAS instrument appeared in 1998, and the suite expanded over the next decade to include separate scales for community supervision, reentry, women, and youth.
Northpointe was acquired by Constellation Software, a Toronto-based holding company that owns dozens of vertical-market software firms, in 2011. The acquisition placed Northpointe inside a portfolio that already included CourtView Justice Solutions and Constellation Justice Systems. On January 9, 2017, the three sister companies were merged under a single brand called Equivant. The rebrand was done partly for marketing reasons (so that the same sales team did not visit a county under three different names) and partly to put some distance between the corporate identity and the Northpointe name, which by that point had become inseparable from the ProPublica controversy.
Equivant continues to license the COMPAS suite to state corrections departments, county jails, probation offices, and pretrial services agencies. The company does not publish its customer list, and academic researchers who want to validate the instrument typically have to obtain scores through public-records requests directed at the agencies that use it, which is the route ProPublica took.
The core COMPAS questionnaire used in most deployments contains roughly 137 items. The items combine information drawn from the criminal record (prior arrests, prior convictions, prior incarceration, age at first offense) with answers given by the defendant during an intake interview (employment history, family background, neighborhood characteristics, attitudes toward authority, peer associations, perceived opportunities for crime).
Race is not directly asked. This is the point Equivant returns to whenever the instrument is described as racially biased. Critics counter that several items act as proxies for race in the United States given how segregated neighborhoods, schooling, and policing patterns are, so omitting the variable does not insulate the score.
The questionnaire feeds into a set of separate scales. The three most cited in the academic literature are:
| Scale | What it predicts | Time horizon |
|---|---|---|
| General Recidivism Risk | Any new arrest for a misdemeanor or felony | Two years from release or assessment |
| Violent Recidivism Risk | New arrest for a violent offense | Two years from release or assessment |
| Pretrial Release Risk / Failure to Appear | Skipping court or new arrest while on pretrial release | Pretrial period |
Each scale produces a decile score from 1 to 10, with 1 to 4 conventionally treated as low risk, 5 to 7 as medium risk, and 8 to 10 as high risk. The thresholds are not strict cutoffs in the underlying mathematics; they are presentation conventions used in the printed reports that judges and probation officers see. The same questionnaire also produces a longer set of "criminogenic needs" scales (substance abuse, family criminality, antisocial peers, and others) that are intended for case planning rather than for risk gating.
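The banding convention described above is simple enough to write down directly. The following is a minimal sketch of that mapping only; the function name `risk_band` is a hypothetical helper for illustration and is not part of any Equivant software.

```python
def risk_band(decile: int) -> str:
    """Map a COMPAS-style decile score (1-10) to the conventional report label."""
    if not 1 <= decile <= 10:
        raise ValueError(f"decile must be between 1 and 10, got {decile}")
    if decile <= 4:
        return "low"      # deciles 1-4
    if decile <= 7:
        return "medium"   # deciles 5-7
    return "high"         # deciles 8-10


# Print the label attached to each possible decile.
for d in range(1, 11):
    print(d, risk_band(d))
```

Because the bands are a presentation layer, moving a cutoff changes the label a judge sees without changing anything about the underlying score.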
COMPAS is one of several major risk-assessment products used in American criminal justice. Others include the Public Safety Assessment (PSA) developed by the Laura and John Arnold Foundation, the Level of Service / Case Management Inventory (LS/CMI) made by Multi-Health Systems, the Static-99 used for sex-offender risk, and an assortment of state-built tools. Adoption of COMPAS specifically is uneven and changes year to year, but in the early 2010s it was used at some level in Wisconsin, New York, California, Florida, Michigan, and several other states. After 2016 some jurisdictions replaced it with the PSA or with home-grown alternatives, and a few moved away from algorithmic risk tools altogether at the pretrial stage.
The most consequential use is at sentencing, where the score appears in the presentence investigation report (PSI) that the judge reads before imposing a term. This is also where the legal challenges have concentrated. In pretrial decisions the judge is making a fast bail call, and in probation the score affects supervision intensity rather than incarceration directly, but at sentencing a high score can plausibly contribute to a longer prison term. Wisconsin, where the Loomis case originated, used the instrument in this way.
The Broward County data was the right test bed for several reasons. Florida has a robust public-records statute, the county sheriff's office had been using COMPAS for several years and was willing to release scores under the statute, and the county is large and demographically mixed, so there was enough data on both Black and white defendants to do a meaningful subgroup analysis. ProPublica obtained 18,610 scores covering people booked into the Broward County jail between 2013 and 2014, and after merging with state and federal criminal-history records and excluding cases that lacked the required follow-up, the analysis settled on a usable cohort of 7,214 defendants.
Recidivism was defined operationally as a new arrest for a misdemeanor or felony within two years of the date of the COMPAS assessment. The reporters then compared the predicted risk score to whether the person was actually re-arrested in that window.
The headline numbers from the ProPublica analysis are summarized in the table below. The two rows that drove the controversy are those for false positives (high-risk score but no re-arrest) and false negatives (low-risk score but a subsequent re-arrest).
| Metric | Black defendants | White defendants |
|---|---|---|
| Labeled high-risk but did not reoffend (false positive rate) | 45% | 23% |
| Labeled low-risk but did reoffend (false negative rate) | 28% | 48% |
| Overall predictive accuracy (correct re-arrest classification) | ~61-65% | ~61-65% |
| AUC (area under ROC curve) | ~0.68-0.70 | ~0.68-0.70 |
The pattern is striking when you set it next to the overall accuracy numbers. The model was about equally accurate at distinguishing future re-arrests from non-re-arrests in both groups, which is what the AUC numbers say. The errors it made, though, were sharply different. When the model was wrong about a Black defendant, it was usually wrong in the direction of overestimating risk. When it was wrong about a white defendant, it was usually wrong in the direction of underestimating risk. This is what ProPublica meant by machine bias.
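To make the distinction between overall accuracy and error direction concrete, here is a small sketch that computes the three quantities from a confusion matrix. The counts are hypothetical round numbers chosen only to reproduce the rounded rates in the table; they are not the actual Broward counts, and the `Confusion` class is an illustrative helper.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    true_pos: int    # flagged high-risk, re-arrested
    false_pos: int   # flagged high-risk, not re-arrested
    true_neg: int    # flagged low-risk, not re-arrested
    false_neg: int   # flagged low-risk, re-arrested

    def false_positive_rate(self) -> float:
        # Share of people who were NOT re-arrested but were flagged high-risk.
        return self.false_pos / (self.false_pos + self.true_neg)

    def false_negative_rate(self) -> float:
        # Share of people who WERE re-arrested but were flagged low-risk.
        return self.false_neg / (self.false_neg + self.true_pos)

    def accuracy(self) -> float:
        total = self.true_pos + self.false_pos + self.true_neg + self.false_neg
        return (self.true_pos + self.true_neg) / total


# Hypothetical counts (assumption): 1,000 re-arrested and 1,000 not, per group.
group_a = Confusion(true_pos=720, false_pos=450, true_neg=550, false_neg=280)
group_b = Confusion(true_pos=520, false_pos=230, true_neg=770, false_neg=480)

for name, cm in (("group_a", group_a), ("group_b", group_b)):
    print(name,
          f"FPR={cm.false_positive_rate():.2f}",
          f"FNR={cm.false_negative_rate():.2f}",
          f"accuracy={cm.accuracy():.3f}")
```

In this toy setup the two groups have nearly identical accuracy while their false positive and false negative rates point in opposite directions, which is the shape of the disparity ProPublica reported.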
ProPublica also published a methodology piece, How We Analyzed the COMPAS Recidivism Algorithm, which documented every step of the data cleaning, the regression specifications, and the cohort definitions. This was unusual at the time for an investigative news story, and it made the analysis reproducible. Researchers have since rerun versions of it on the same dataset, often with small variations, and the qualitative findings hold up.
Within weeks of the ProPublica story, Northpointe released a 37-page rebuttal authored by William Dieterich, Christina Mendoza, and Tim Brennan titled COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity. The argument was technical, and a lot of the public confusion that followed comes from the fact that the technical argument was correct on its own terms.
Northpointe's claim was that COMPAS satisfied a fairness criterion called predictive parity, sometimes also called calibration. Predictive parity means that within any given risk score, the actual rate of re-arrest is approximately the same across racial groups. A defendant who scores a 7 has roughly the same chance of reoffending whether they are Black or white. From the perspective of someone using the score to make a decision, the score means the same thing in both groups. By this measure, the company argued, COMPAS was not biased.
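A check of this kind can be sketched as a grouped tabulation: for every combination of group and score, compute the observed re-arrest rate and compare across groups. The records below are made-up toy data, and `rearrest_rate_by_score` is a hypothetical helper, not Northpointe's actual procedure.

```python
from collections import defaultdict


def rearrest_rate_by_score(records):
    """Return {(group, score): observed re-arrest rate} for (group, score, rearrested) rows."""
    counts = defaultdict(lambda: [0, 0])  # (group, score) -> [re-arrested, total]
    for group, score, rearrested in records:
        cell = counts[(group, score)]
        cell[0] += int(rearrested)
        cell[1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}


# Toy records (assumption): the groups differ in size but not in rate per score.
records = (
    [("A", 7, True)] * 6 + [("A", 7, False)] * 4 +
    [("B", 7, True)] * 3 + [("B", 7, False)] * 2 +
    [("A", 3, True)] * 1 + [("A", 3, False)] * 3 +
    [("B", 3, True)] * 2 + [("B", 3, False)] * 6
)

rates = rearrest_rate_by_score(records)
for key in sorted(rates):
    print(key, round(rates[key], 2))
```

If the rates for the same score line up across groups, as they do in this toy data, the score satisfies predictive parity in the sense described above, whatever its error rates look like.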
ProPublica was measuring something else: balance for the false positive and false negative rates across groups. The company and the reporters were both right. They were just measuring different things. The deeper question, then, was whether the two definitions could be reconciled. The answer, it turned out, was no.
In the months after the ProPublica piece, two papers appeared that formalized the trade-off and proved it was unavoidable.
Alexandra Chouldechova, then at Carnegie Mellon, published Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments in the journal Big Data in 2017 (a preprint had circulated in late 2016). The paper showed that whenever the base rate of recidivism differs between two groups (which it does in the COMPAS data, where the observed re-arrest rate among Black defendants in Broward was higher than among white defendants), an instrument that satisfies predictive parity cannot also have equal false positive and false negative rates across the groups. The math is fairly straightforward once you write down the confusion matrix and the constraints, but the implication was sweeping: the ProPublica criterion and the Northpointe criterion are mutually incompatible, not because anyone designed the system poorly but because of an arithmetic identity.
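The arithmetic identity can be sketched from the confusion matrix as follows. The symbols are introduced here for illustration: $p$ is a group's base rate of re-arrest, $\mathrm{PPV}$ is the share of high-risk labels that are followed by re-arrest, and $\mathrm{FPR}$ and $\mathrm{FNR}$ are the false positive and false negative rates.

```latex
% True positives and false positives as fractions of the whole group:
%   TP = p (1 - FNR),   FP = (1 - p) FPR
\[
\mathrm{PPV}
  = \frac{p\,(1-\mathrm{FNR})}{p\,(1-\mathrm{FNR}) + (1-p)\,\mathrm{FPR}}
\quad\Longrightarrow\quad
\mathrm{FPR}
  = \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr)
\]
```

If two groups share the same $\mathrm{PPV}$ and the same $\mathrm{FNR}$ but have different base rates $p$, the right-hand side differs between them, so their false positive rates cannot be equal.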
Almost simultaneously, Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan posted Inherent Trade-Offs in the Fair Determination of Risk Scores (arXiv:1609.05807, later in ITCS 2017). Their result was more general. They identified three desirable fairness properties (calibration within groups, balance for the positive class, and balance for the negative class) and proved that no risk scoring system can satisfy all three simultaneously, except in the trivial cases where the base rates are identical or where the predictor is perfect.
Taken together, the Chouldechova and Kleinberg results reframed the COMPAS debate. The question stopped being "is the algorithm biased" and started being "which kind of bias do we tolerate, given that we cannot avoid all of them at once." That is now the standard frame in algorithmic fairness, and many subsequent papers extend, sharpen, or relax the impossibility result under different assumptions.
| Criterion | Plain language statement | Who emphasizes it |
|---|---|---|
| Predictive parity / calibration | Within a given score, recidivism rates are equal across groups. | Northpointe / Equivant |
| Balance for the positive class | Among people who actually reoffend, average score is equal across groups. | Kleinberg, Mullainathan, Raghavan (one of three) |
| Balance for the negative class | Among people who do not reoffend, average score is equal across groups. | Kleinberg, Mullainathan, Raghavan (one of three) |
| Equal false positive rate | Among non-recidivists, rate of being flagged high-risk is equal across groups. | ProPublica |
| Equal false negative rate | Among recidivists, rate of being missed (low-risk) is equal across groups. | ProPublica |
| Equalized odds | Equal true positive rate and equal false positive rate across groups. | Hardt, Price, Srebro (2016) |
| Demographic parity | Same proportion flagged high-risk in each group, regardless of base rate. | Some civil-rights frameworks |
When base rates differ across groups, satisfying any one of these often forces violations of the others. This is the core lesson the field absorbed from the COMPAS dispute.
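A short numeric sketch shows how this plays out. It uses the confusion-matrix identity FPR = p/(1-p) × (1-PPV)/PPV × (1-FNR), where p is the base rate; the base rates, PPV, and FNR below are hypothetical values picked for illustration, not estimates from the Broward data.

```python
def implied_fpr(base_rate: float, ppv: float, fnr: float) -> float:
    """False positive rate forced by a given base rate, PPV, and FNR."""
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * (1 - fnr)


def ppv_from(base_rate: float, fpr: float, fnr: float) -> float:
    """Recompute PPV from the confusion-matrix fractions, as a consistency check."""
    true_pos = base_rate * (1 - fnr)
    false_pos = (1 - base_rate) * fpr
    return true_pos / (true_pos + false_pos)


# Hypothetical setup: same PPV and same FNR in both groups, different base rates.
PPV, FNR = 0.6, 0.3
fpr_high_base = implied_fpr(base_rate=0.5, ppv=PPV, fnr=FNR)
fpr_low_base = implied_fpr(base_rate=0.4, ppv=PPV, fnr=FNR)

print(f"FPR at base rate 0.5: {fpr_high_base:.3f}")
print(f"FPR at base rate 0.4: {fpr_low_base:.3f}")
```

Holding predictive parity and the false negative rate fixed, the group with the higher base rate ends up with the higher false positive rate. Equalizing the false positive rates would require giving up one of the other two equalities.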
A separate strand of criticism asked a different question: forget bias for a moment, is COMPAS even good at predicting recidivism in the first place? In January 2018 Julia Dressel, then an undergraduate at Dartmouth, and Hany Farid, her advisor, published The Accuracy, Fairness, and Limits of Predicting Recidivism in Science Advances. They ran two experiments on the ProPublica Broward data.
In the first, they showed short descriptions of defendants (sex, age, prior criminal history, current charge) to 400 workers on Amazon Mechanical Turk and asked them to predict whether each person would be re-arrested within two years. The Turk workers, who had no criminal-justice expertise and were paid a few dollars, achieved a pooled accuracy of about 67 percent. COMPAS, drawing on its 137 features, achieved about 65 percent on the same defendants. The difference was not statistically significant.
In the second, the authors fit a simple logistic regression using only two features (the defendant's age and total number of prior convictions). It matched COMPAS performance.
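To give a sense of how small such a model is, here is a sketch of a two-feature logistic regression fitted by plain gradient descent. It runs on synthetic data generated inside the script, not on the Broward records, and the coefficients used to generate that data are arbitrary assumptions. It is not Dressel and Farid's code.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def make_synthetic(n: int, seed: int = 0):
    """Synthetic rows ((age_scaled, priors_scaled), label); NOT real defendants."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(18, 60)
        priors = rng.randint(0, 10)
        x = ((age - 35) / 10.0, (priors - 3) / 3.0)  # rough rescaling
        # Assumed generating process: younger age and more priors raise risk.
        p = sigmoid(-0.5 * x[0] + 0.9 * x[1])
        rows.append((x, 1 if rng.random() < p else 0))
    return rows


def fit_logistic(rows, lr: float = 0.5, epochs: int = 300):
    """Batch gradient descent on mean log-loss; returns (bias, w_age, w_priors)."""
    b = w_age = w_pri = 0.0
    n = len(rows)
    for _ in range(epochs):
        gb = ga = gp = 0.0
        for (x_age, x_pri), y in rows:
            err = sigmoid(b + w_age * x_age + w_pri * x_pri) - y
            gb += err
            ga += err * x_age
            gp += err * x_pri
        b -= lr * gb / n
        w_age -= lr * ga / n
        w_pri -= lr * gp / n
    return b, w_age, w_pri


rows = make_synthetic(1000)
b, w_age, w_pri = fit_logistic(rows)
accuracy = sum(
    (sigmoid(b + w_age * xa + w_pri * xp) >= 0.5) == (y == 1)
    for (xa, xp), y in rows
) / len(rows)
print(f"w_age={w_age:.2f}  w_priors={w_pri:.2f}  accuracy={accuracy:.2f}")
```

The whole model is three numbers: an intercept, a weight on age, and a weight on prior convictions.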
The paper did not claim that human judgment is good at recidivism prediction. It claimed that COMPAS was not adding much over either humans or trivial models. That undercut a central justification for using a proprietary 137-item questionnaire at all. Equivant disputed the methodology, and a group of correctional researchers led by Anthony Flores published a rejoinder. The Dressel and Farid result, however, has been replicated in spirit by several follow-up studies, and the broader conclusion (that the predictive ceiling on individual-level recidivism is fairly low and is reached by quite simple models) is now widely accepted.
Eric Loomis was charged in 2013 with five offenses related to a drive-by shooting in La Crosse, Wisconsin. He pleaded guilty to two of the lesser counts: attempting to flee a traffic officer and operating a motor vehicle without the owner's consent. The presentence investigation report prepared by the Department of Corrections included a COMPAS assessment that scored him high on risk of pretrial recidivism, general recidivism, and violent recidivism. The judge, Scott Horne, cited the COMPAS report at sentencing and imposed six years in prison.
Loomis appealed, arguing that his due-process rights were violated because (1) the proprietary nature of COMPAS prevented him from examining how the score was produced, (2) the score was based in part on group statistics rather than individualized inquiry, and (3) the instrument used gender as an input.
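The Wisconsin Supreme Court ruled in State v. Loomis, 881 N.W.2d 749 (Wis. 2016) on July 13, 2016, that the use of COMPAS in sentencing did not violate due process, but only under specific conditions. The court held that a COMPAS score may be considered at sentencing but may not be the determinative factor in deciding whether or for how long a defendant is incarcerated, and that any presentence report containing a COMPAS assessment must carry written cautions about the instrument's limitations, including its proprietary nature and its reliance on group data.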
The United States Supreme Court denied certiorari in June 2017, leaving the Wisconsin ruling in place. Loomis is now the canonical American legal authority on the use of proprietary risk instruments in sentencing. It permits the practice with caveats, but the caveats are unusually elaborate for a sentencing-procedure case, and the concurring opinions have been heavily cited in subsequent academic commentary.
The COMPAS controversy did not invent the academic study of fairness in machine learning. There was earlier work by Cynthia Dwork on differential privacy and fairness, by Toon Calders and Sicco Verwer on classification under fairness constraints, and a small FAT/ML workshop series running since 2014. What COMPAS did was give the field a single, vivid, real-world case that everyone could refer to. Many fairness papers from 2017 onward include a benchmark experiment on the ProPublica Broward dataset, and it became one of the standard testbeds in libraries like AIF360.
Four effects on the field stand out:
First, the impossibility theorems by Chouldechova and Kleinberg et al. became foundational. They are now taught in the first lecture of most algorithmic fairness courses. The framing they introduced (multiple incompatible fairness criteria, choose your trade-off explicitly) has become the default frame.
Second, Moritz Hardt, Eric Price, and Nati Srebro's 2016 paper Equality of Opportunity in Supervised Learning introduced the equalized-odds criterion, which is intended to capture the kind of disparity ProPublica documented while remaining mathematically tractable. It is now one of the three or four most-cited fairness criteria.
Third, the conversation went mainstream. Cathy O'Neil's Weapons of Math Destruction came out in September 2016, less than four months after the ProPublica piece, with recidivism risk scoring among its case studies. Virginia Eubanks's Automating Inequality (2018) and Safiya Noble's Algorithms of Oppression (2018) extended the broader argument across welfare administration, child-protective services, and search engines. Julia Angwin went on to co-found The Markup, an investigative outlet focused on algorithmic accountability.
Fourth, the inaugural ACM Conference on Fairness, Accountability, and Transparency (then FAT*, now FAccT) was held in February 2018 in New York City. The conference grew out of the FAT/ML workshop series and was explicitly framed around the kind of problem that COMPAS exemplified. It now draws thousands of attendees and has become the central venue for fairness research.
The years after 2018 produced more academic work, more legal and policy responses, and slow shifts in adoption.
Replication studies have largely confirmed the core ProPublica findings on the Broward data, while complicating some of the secondary claims. Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq published Algorithmic Decision Making and the Cost of Fairness in 2017, formally analyzing the trade-off and arguing that some of ProPublica's framing conflated separate concepts. Studies on data from other jurisdictions (Kentucky, New York City, the federal pretrial system) found similar patterns of differential error rates by race, though the magnitudes vary.
On the policy side, several jurisdictions changed or scaled back their use of risk tools at the pretrial stage. New Jersey adopted the Public Safety Assessment in 2017 as part of a broader bail-reform package. Some California counties have used the PSA instead of COMPAS. New York State's 2019 bail reform reduced the use of pretrial detention generally, which made the choice of instrument less consequential at the front end. Other jurisdictions, including parts of Wisconsin, continue to use COMPAS.
Equivant has continued to refine and market the COMPAS suite, and the company commissioned validation studies on its updated scales. Critics, including the Partnership on AI in its 2019 Report on Algorithmic Risk Assessment Tools in the U.S. Criminal Justice System, have argued that the validation studies are not enough on their own and that the deeper questions about base rates, ground-truth labels (re-arrest is not the same as re-offense), and the social meaning of risk scores remain unresolved.
The Partnership on AI report also pointed out something that gets lost in the technical debates. The labeled outcome in every COMPAS validation, ProPublica analysis, and academic replication is re-arrest, not actual recidivism. Whether someone gets arrested again depends on policing patterns, which are themselves not race-neutral in many jurisdictions. So even a perfectly calibrated model trained on re-arrest data inherits the disparities in policing. This problem ("label bias") has become one of the active research directions in the field that COMPAS helped create.
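The mechanism can be illustrated with a line of arithmetic. The sketch below assumes a deliberately simplified model in which a re-arrest label appears only when a person both reoffends and is arrested for it (no arrests without an offense); every number is hypothetical.

```python
def observed_rearrest_rate(offense_rate: float, arrest_prob_given_offense: float) -> float:
    """Share of a group labeled 're-arrested' under the simplified model above."""
    return offense_rate * arrest_prob_given_offense


# Hypothetical: both groups reoffend at the same underlying rate,
# but one group is policed more heavily, so offenses lead to arrest more often.
heavily_policed = observed_rearrest_rate(offense_rate=0.4, arrest_prob_given_offense=0.8)
lightly_policed = observed_rearrest_rate(offense_rate=0.4, arrest_prob_given_offense=0.5)

print(f"observed re-arrest rate, heavily policed group: {heavily_policed:.2f}")
print(f"observed re-arrest rate, lightly policed group: {lightly_policed:.2f}")
```

A model trained and validated on these labels would see two different base rates even though the underlying behavior is identical, and it could be well calibrated to re-arrest while still overstating one group's risk of reoffending.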
It would be a mistake to read the COMPAS story as a story about one bad algorithm. By most technical measures, COMPAS is unremarkable. It is a logistic-regression-style scoring tool with a moderately rich feature set and predictive performance comparable to other instruments and to untrained humans. The reason it became the case study it did is that ProPublica chose well, the data was available, the company defended the product publicly, and the resulting argument exposed a structural feature of risk prediction that nobody had fully articulated before: when you predict an outcome that has different base rates across groups, you are forced to choose which kind of fairness you want, because you cannot have all of them.
That lesson is now embedded in the way machine-learning practitioners think about deploying models in any high-stakes setting, including hiring, lending, healthcare, and child-welfare screening. Most of those deployments do not get a ProPublica-style audit. COMPAS got one, and the field is still working through what the audit revealed.