R (programming language)
Last reviewed
May 2, 2026
Sources
32 citations
Review status
Source-backed
Revision
v1 · 4,693 words
See also: Python, Data science, Machine learning, Statistics
R is a free, open-source programming language and software environment for statistical computing and graphics. It originated as an open-source dialect of the S language developed at Bell Laboratories in the 1970s by John Chambers and colleagues. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand starting in 1993, with the first public binaries posted to the StatLib server in August of that year. The language is distributed under the GNU General Public License and is maintained by the R Core Team and the R Foundation for Statistical Computing, formed in 2003.
The defining feature of R is not the language itself, which is small and a bit eccentric, but the package ecosystem around it. The Comprehensive R Archive Network (CRAN), founded in 1997, hosts more than 23,000 contributed packages covering most of applied statistics; a separate repository, Bioconductor, serves genomics and bioinformatics. The tidyverse, a set of packages built around tidy-data principles by Hadley Wickham and collaborators at Posit (formerly RStudio), has become the de facto modern dialect for data manipulation and visualization. R sits next to Python as one of the two dominant languages in data science. Python has clearly won the deep-learning and production-ML race, but R remains the first choice in academic statistics, biostatistics, clinical trials, official statistics, and much social-science work.
For machine learning work, R has bindings to most of the same numerical engines Python users rely on. The xgboost, lightgbm, and glmnet packages share C++ cores with their Python siblings; caret and tidymodels give a unified interface to dozens of model families; keras, tensorflow, and torch provide deep learning bindings against TensorFlow and PyTorch; and reticulate lets R and Python share data and call each other in the same session.
Imagine a programmable calculator. You can ask it things like "draw me a graph of how ice-cream sales change with temperature" or "is the difference between these two groups bigger than I would expect by chance?" and it gives you back a chart or a number. R is that calculator, with tens of thousands of free add-ons for almost any kind of math, especially the kind statisticians use.
R is a sibling of Python. Python is a general-purpose language that is also good at math; R is a stats language that can also handle general-purpose work if you push it. Biologists measuring gene expression, economists forecasting inflation, and grad students fitting mixed-effects models on Tuesday usually reach for R. People training huge neural networks or shipping web services usually reach for Python. The two worlds talk to each other, and most working analysts use both.
R's lineage starts with the S language, developed at Bell Labs by John Chambers, Rick Becker, Allan Wilks, and others starting around 1976. S was an interactive layer over existing Fortran statistical libraries, a way to do exploratory data analysis without recompiling a Fortran program every time you wanted a histogram. Chambers's stated aim was "to turn ideas into software, quickly and faithfully," and that line still describes the ambition of R better than any tagline R itself ever had.
S evolved through the 1980s; the third version ("new S," 1988) introduced the formula notation (y ~ x1 + x2) that survives in R today, and the fourth version (1998) brought the S4 object system. S was sold commercially as S-PLUS by StatSci (later Insightful, then TIBCO). Chambers received the 1998 ACM Software System Award and later joined the R Core Team, which is one of those rare cases where the creator of a language follows it into the open-source clone.
Ross Ihaka and Robert Gentleman were junior staff in the Statistics Department at the University of Auckland in the early 1990s, both teaching introductory statistics, both unhappy with the available tools. They wanted something free and small enough to run on the Macintosh computers in the teaching lab. They began writing an interpreter inspired by S but with a Scheme-like internal model (lexical scoping, first-class functions). The first binary was posted to the StatLib server in August 1993 with an announcement on the s-news mailing list. Ihaka and Gentleman published the design rationale in 1996 in the Journal of Computational and Graphical Statistics under the title "R: A Language for Data Analysis and Graphics."
R became a GNU project in December 1997, the same year CRAN was set up by Kurt Hornik and Friedrich Leisch at TU Wien. Version 1.0 was released on February 29, 2000, a leap day chosen so the developers would not have to commit to a yearly anniversary. Notable later versions include R 2.0.0 (October 2004), R 3.0.0 (April 2013, which broke binary backward compatibility for compiled packages), and R 4.0.0 (April 2020). Minor versions ship roughly once a year. The current stable release is R 4.6.0 "Because it was There," released April 24, 2026; R release names are running jokes from Peanuts strips.
The R Core Team, formed informally in 1997, holds commit rights to the language. As of the mid-2020s the team has roughly twenty members, including Ihaka, Gentleman, John Chambers, Brian Ripley, Kurt Hornik, Peter Dalgaard, Martin Maechler, Luke Tierney, Duncan Murdoch, and Tomas Kalibera. The R Foundation for Statistical Computing, registered as a non-profit in Vienna in April 2003, holds the copyrights and runs finances. The R Consortium, a Linux Foundation project launched in 2015 with corporate members including Microsoft, Google, RStudio/Posit, and IBM, funds infrastructure and outreach but does not control the language.
R is a multi-paradigm language: procedural, functional, object-oriented, with strong vector semantics and a heavy emphasis on interactive use. Most of the surface idioms come from S, the internals borrow from Scheme, and a handful of design choices come from APL.
The atomic unit in R is the vector. There are no scalars in the strict sense; a single number is a length-1 vector, and arithmetic is vectorized by default:
```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)
x + y  # returns 11 22 33 44
```
When vectors of different lengths interact, R recycles the shorter one to match. Most numerical work is done with vectors and matrices, and tight for loops are generally slower than vectorized expressions.
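Recycling and the loop-versus-vector contrast can be seen in a few lines of base R. This is a minimal sketch; the variable names are illustrative only.

```r
# Recycling: the length-2 vector is repeated to match the length-4 vector.
x <- c(1, 2, 3, 4)
recycled <- x * c(10, 100)      # 10 200 30 400

# A single number is a length-1 vector, so it recycles the same way.
shifted <- x + 1                # 2 3 4 5

# The same sum of squares as an explicit loop and as a vectorized expression.
loop_total <- 0
for (value in x) {
  loop_total <- loop_total + value^2
}
vector_total <- sum(x^2)        # both give 30
```

If the longer length is not a multiple of the shorter one, R still recycles but emits a warning, which is usually a sign of a bug rather than an intended shortcut.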
The core abstraction for tabular data is the data frame, a named list of equal-length vectors introduced in S in the late 1980s. It is what makes R feel like a statistics environment rather than a numeric one. The tidyverse tibble and Matt Dowle's data.table are alternative implementations with different performance and syntax trade-offs. Modern R also supports Arrow-backed data frames through the arrow package, which gives zero-copy interop with Python, pandas, and Apache Spark.
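The "named list of equal-length vectors" description can be checked directly in base R. The sketch below uses an invented three-row table; the column names and values are assumptions made for illustration.

```r
# A data frame is a list of equal-length vectors with column names.
df <- data.frame(
  city  = c("Auckland", "Vienna", "Ames"),
  temp  = c(18.5, 11.2, 9.8),
  coast = c(TRUE, FALSE, FALSE)
)

is.list(df)                       # TRUE: underneath, it is a list
n_cols <- length(df)              # 3 columns (list elements)
n_rows <- nrow(df)                # 3 rows
col_types <- sapply(df, class)    # each column keeps its own type

# A column is an ordinary vector, so vectorized comparisons work on it.
# Rows are filtered with a logical vector, columns selected by name.
warm <- df[df$temp > 10, c("city", "temp")]
```

Because each column is a plain vector, everything that works on vectors (arithmetic, comparison, recycling) carries over to tabular data without a separate API.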
Functions in R are first-class objects. Closures capture their enclosing environment (lexical scoping, inherited from Scheme), ... collects extra arguments to forward, and lazy evaluation defers evaluating an argument until it is first used. Because arguments arrive as unevaluated promises, a function can inspect the expression it was given rather than only its value, which underlies several of R's most distinctive idioms. The formula notation is a related case: lm(y ~ x, data = df) works because the ~ operator stores its operands unevaluated, and lm later looks up y and x in df.
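The four ideas in this paragraph (closures, the dots argument, lazy evaluation, and unevaluated formulas) can each be shown in a few lines. This is a sketch in base R; the function names make_counter, scaled_mean, and ignore_second are hypothetical and exist only for the example.

```r
# Closure: make_counter() returns a function that remembers `count`
# in its enclosing environment (lexical scoping).
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # update the captured variable
    count
  }
}
counter <- make_counter()
first  <- counter()       # 1
second <- counter()       # 2

# `...` forwards extra arguments untouched, here na.rm = TRUE to mean().
scaled_mean <- function(x, scale = 1, ...) {
  mean(x, ...) * scale
}
m <- scaled_mean(c(1, NA, 3), scale = 10, na.rm = TRUE)   # 20

# Lazy evaluation: `unused` is never evaluated, so stop() never runs.
ignore_second <- function(used, unused) {
  used
}
safe <- ignore_second(42, stop("this argument is never evaluated"))

# A formula stores its expression unevaluated; y and x need not exist yet.
f <- y ~ x
vars <- all.vars(f)       # "y" "x"
```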
R has not one but several object systems, layered on top of each other for historical reasons:
| System | Year | Style | Used in |
|---|---|---|---|
| S3 | 1988 (S), inherited by R | Generic-function dispatch on a class attribute, very lightweight | Most base R, lm, glm, print, summary |
| S4 | 1998 (S), 2001 (R) | Formal multiple-dispatch generics, slot-based | Bioconductor, lme4, sp |
| Reference classes (R5) | 2010 | Mutable Java-style classes, methods bound to objects | Some packages, mostly superseded |
| R6 | 2014 | Lightweight reference classes by Winston Chang | Shiny, plumber, modern object-oriented packages |
| S7 | 2023 | Joint successor to S3 and S4, designed by R Core and Posit | Experimental, slowly being adopted |
Most everyday R code uses S3, which is so minimal it almost feels like a convention rather than a system. S4 shows up in Bioconductor, where multiple dispatch is useful for biological data structures. R6 is the typical choice for stateful objects in modern packages.
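The difference in weight between the systems is easiest to see side by side. The sketch below uses only base R and the bundled methods package, so it covers S3, S4, and reference classes but not R6 or S7, which live in add-on packages. The class names (temperature, Interval, Account) and the generics are hypothetical.

```r
library(methods)  # S4 and reference classes; attached by default in most sessions

# S3: a class is just an attribute, and a method is a function
# named generic.class.
temp <- structure(list(value = 21.5, unit = "C"), class = "temperature")

describe <- function(x, ...) UseMethod("describe")            # generic
describe.temperature <- function(x, ...) {                    # method
  paste0(x$value, " degrees ", x$unit)
}
describe.default <- function(x, ...) "no description available"

s3_result  <- describe(temp)    # dispatches to describe.temperature
s3_default <- describe(1:3)     # falls back to describe.default

# S4: classes and generics are declared formally, with typed slots.
setClass("Interval", slots = c(start = "numeric", end = "numeric"))
setGeneric("intervalWidth", function(obj) standardGeneric("intervalWidth"))
setMethod("intervalWidth", "Interval", function(obj) obj@end - obj@start)

s4_result <- intervalWidth(new("Interval", start = 2, end = 10))   # 8

# Reference class: mutable state, with methods bound to the object.
Account <- setRefClass("Account",
  fields  = list(balance = "numeric"),
  methods = list(
    deposit = function(amount) {
      balance <<- balance + amount   # modifies the object in place
      invisible(.self)
    }
  )
)
acct <- Account$new(balance = 100)
acct$deposit(50)
final_balance <- acct$balance        # 150
```

The S3 part needs no declarations at all, which is why it reads like a naming convention; the reference-class part is the only one where calling a method changes the object without reassignment.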
The magrittr package by Stefan Milton Bache (2014) introduced the %>% pipe, modelled on F#'s forward pipe. It quickly became the standard in tidyverse code: data %>% filter(x > 0) %>% mutate(y = x^2) %>% summarize(mean(y)) reads left to right rather than inside out. R Core added a native pipe |> in version 4.1.0 (May 2021), with slightly different semantics. Most new code uses |>, but %>% is still everywhere because old code still works.
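The inside-out versus left-to-right contrast does not depend on the tidyverse. Here is a minimal sketch with the native pipe and base functions only, assuming R 4.1.0 or later; the input vector is made up.

```r
# Requires R >= 4.1.0 for the native pipe |> and the \(x) lambda shorthand.
values <- c(4, -1, 9, -3, 16)

# Nested calls read inside out.
nested <- round(mean(sqrt(values[values > 0])), 2)

# The same computation as a left-to-right pipeline.
piped <- values[values > 0] |>
  sqrt() |>
  mean() |>
  round(2)

# An anonymous function can be a pipeline step when wrapped in parentheses
# and called explicitly.
doubled <- values |> (\(v) v * 2)()
```

One of the semantic differences shows up in the last line: the right-hand side of the native pipe must be a function call, whereas magrittr's pipe also accepts a bare function name and a dot placeholder.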
CRAN is R's distinguishing institution, mirrored on around 90 sites worldwide. Every package goes through automated checks (the R CMD check machinery) on multiple operating systems before being accepted, and packages that break their dependencies are nagged or pulled. As of early 2026, CRAN hosts more than 23,000 contributed packages.
The quality bar is uneven by design. CRAN does not vet science or statistical correctness; it vets that the package builds, passes its own tests, and does not break other packages. The long tail includes both careful, peer-reviewed work and weekend hobby projects. The tidyverse, Bioconductor, and ROpenSci sit on top as quality filters.
| Family | Maintained by | Focus |
|---|---|---|
| base R | R Core | Language plus core stats: lm, glm, aov, t.test, etc. |
| Recommended packages | R Core | Ships with R: MASS, Matrix, survival, nlme, lattice |
| tidyverse | Posit (Wickham et al.) | Data wrangling: dplyr, tidyr, ggplot2, purrr, readr, tibble, stringr, forcats |
| data.table | Matt Dowle, Arun Srinivasan | Fast in-memory data tables with their own DSL |
| Bioconductor | Bioconductor Core | Genomics, microarrays, sequencing |
| tidymodels | Posit (Max Kuhn et al.) | Modern ML wrappers (parsnip, recipes, rsample, yardstick) |
| mlr3 | Bernd Bischl's group | Object-oriented ML framework |
| ROpenSci | ROpenSci collective | Peer-reviewed scientific packages |
| spatial | r-spatial team | sf, terra, stars for geospatial |
Bioconductor is a parallel ecosystem to CRAN focused on bioinformatics, started in 2001 by Robert Gentleman and colleagues at the Dana-Farber Cancer Institute, with the core team later based at the Fred Hutchinson Cancer Center. It uses S4 heavily, has its own twice-yearly release cycle aligned with R minor versions, and hosts more than 2,000 packages for genomics, proteomics, flow cytometry, and single-cell analysis. Papers in Nature Biotechnology that say "analysis was performed in R using DESeq2 / limma / edgeR / Seurat" are usually citing Bioconductor (Seurat is on CRAN, but the methodological neighborhood is the same).
The tidyverse is the most influential thing to happen to R in the last fifteen years. The name was coined by Hadley Wickham around 2016, and the meta-package tidyverse was published to CRAN on September 15, 2016. The idea is that data should be in tidy form (one observation per row, one variable per column, one value per cell, per Wickham's 2014 paper) and that a small grammar of verbs should be enough to manipulate it.
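The tidy-data rule is easier to see on a concrete table. The sketch below stays in base R so it runs without extra packages; in tidyverse code the reshaping step would typically be done with tidyr's pivot_longer() instead. The countries and counts are invented.

```r
# Wide layout: one column per year, so the "year" variable is hidden
# in the column headers.
wide <- data.frame(
  country = c("A", "B"),
  y2020   = c(10, 20),
  y2021   = c(15, 25)
)

# Tidy layout: one row per country-year observation, one column per variable.
tidy <- data.frame(
  country = rep(wide$country, times = 2),
  year    = rep(c(2020, 2021), each = nrow(wide)),
  cases   = c(wide$y2020, wide$y2021)
)

# With tidy data, a grouped summary is a single call.
mean_by_year <- aggregate(cases ~ year, data = tidy, FUN = mean)
```

Once year is an ordinary column, it can be filtered, grouped, or mapped to a plot axis like any other variable, which is the practical payoff of the tidy layout.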
The core packages are:
| Package | Purpose | First release |
|---|---|---|
| ggplot2 | Layered grammar of graphics | 2007 |
| dplyr | Data manipulation verbs (filter, mutate, summarize, group_by, arrange, select) | 2014 |
| tidyr | Reshaping (wide-to-long, long-to-wide, pivoting, unnesting) | 2014 |
| readr | Fast text-file reading with type guessing | 2015 |
| purrr | Functional iteration (map, walk, pmap) | 2015 |
| tibble | Modern data-frame with cleaner printing and stricter rules | 2014 |
| stringr | String manipulation wrappers around stringi | 2009 |
| forcats | Factor (categorical) manipulation | 2016 |
Around the core sit broom (turn model objects into tidy tibbles), lubridate (dates and times), rvest (web scraping), httr2 (HTTP), dbplyr (translate dplyr to SQL), and many others. The book that taught a generation of working data scientists tidyverse-style R is Wickham and Grolemund's R for Data Science, free online; the second edition (2023) is co-authored with Mine Çetinkaya-Rundel.
Wickham wrote ggplot2 as part of his 2008 PhD dissertation at Iowa State, building on Leland Wilkinson's 1999 book The Grammar of Graphics. A plot is built up from layers (data, geometric mark, statistical transform, scale, coordinate system, facet) rather than chosen from a menu of plot types. A faceted scatter plot with a per-group smoother is a few lines:
```r
library(ggplot2)
ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "loess") +
  facet_wrap(~ year)
```
The aesthetic is so distinctive that ggplot2's gray-grid panels became almost a marker of "a chart from a data scientist" in 2010s journalism and academic papers. Extension packages include patchwork for composition, gganimate for animation, ggrepel for label placement, and ggdist for visualizing uncertainty.
RStudio, the integrated development environment, was first released in February 2011 by RStudio Inc., a company founded by JJ Allaire (creator of ColdFusion and Windows Live Writer). The IDE was an immediate hit. R had previously been used through plain editors plus a console, and RStudio bundled a code editor, console, plot pane, environment browser, package manager, and Sweave/knitr integration into one window. The free edition is open-source under the AGPL; commercial editions add team features.
In July 2022 RStudio Inc. renamed itself to Posit PBC (a registered Public Benefit Corporation), to signal that the company was no longer R-only and was investing seriously in Python. Posit now ships Posit Workbench (multi-language IDE), Posit Connect (publishing platform), and Posit Cloud (hosted environments), and it maintains the tidyverse, the keras and tensorflow R bindings, the torch R port, reticulate, Quarto, Shiny, and many other open-source projects.
Reproducible reporting in R has gone through several generations. Friedrich Leisch's Sweave (2002) wove R code into LaTeX. Yihui Xie's knitr (2011) generalized it to Markdown, HTML, and other formats. R Markdown, built on knitr and Pandoc by JJ Allaire, Yihui Xie, and the RStudio team starting in 2014, made literate programming the default for analyses, papers, books, and slides. R for Data Science and Forecasting: Principles and Practice are both written in R Markdown; the bookdown extension by Xie made book-length projects practical.
Quarto, released by Posit in 2022, is the next-generation rewrite. It is language-agnostic by design (R, Python, Julia, Observable JavaScript) and runs as a separate command-line tool rather than an R package. The same .qmd file can render to HTML, PDF, Word, websites, books, slides, or dashboards. Quarto is one of the clearest artifacts of Posit's broader-than-R turn; scientific Python users have adopted it alongside Jupyter.
Shiny, released in 2012 by Joe Cheng at RStudio, is R's web application framework. A Shiny app is an R script that defines a UI (HTML widgets) and a server function (reactive R code), with the framework keeping them in sync over a websocket. An analyst can build an interactive dashboard or modeling tool in R alone, no JavaScript required. Shiny is used heavily in pharma, finance, government, and academia. A companion Shiny for Python launched in 2022.
R's classical statistics coverage is essentially exhaustive. Linear and generalized linear models (lm, glm in base), mixed-effects models (lme4 by Doug Bates, Martin Maechler, Ben Bolker), survival analysis (survival by Terry Therneau), generalized additive models (mgcv by Simon Wood), Bayesian inference (rstan, brms, cmdstanr, rjags), and thousands more in econometrics, psychometrics, and ecology. If a paper proposes a statistical method, there is a good chance one of the authors put a working R package on CRAN.
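The base-R end of that list needs no installation at all. As a small sketch, the following fits a linear model and a logistic regression with the formula interface on mtcars, a dataset that ships with R; the choice of predictors is arbitrary and only for illustration.

```r
# Linear model: fuel economy as a function of weight and horsepower.
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
lm_coefs <- coef(fit_lm)                    # intercept and two slopes
r_squared <- summary(fit_lm)$r.squared      # proportion of variance explained

# Generalized linear model: logistic regression for transmission type
# (am: 0 = automatic, 1 = manual) given weight.
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability of a manual transmission for a 3,000 lb car
# (wt is recorded in units of 1,000 lb).
p_manual <- predict(fit_glm, newdata = data.frame(wt = 3), type = "response")
```

Most modeling packages on CRAN follow the same pattern of a formula, a data argument, and coef(), summary(), and predict() methods on the fitted object, which is much of what makes the ecosystem feel coherent.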
For machine learning specifically, R has long had wrappers around the standard algorithm families:
| Package | Author / year | Method |
|---|---|---|
| randomForest | Andy Liaw & Matthew Wiener (2002), wrapping Breiman & Cutler's Fortran code | Random forest |
| ranger | Marvin N. Wright (2015) | Faster random forests, especially for high dimensions |
| xgboost | Tianqi Chen et al. (2014) | XGBoost gradient boosting, R binding |
| lightgbm | Microsoft (2017) | LightGBM, R binding |
| glmnet | Friedman, Hastie, Tibshirani (2010) | Lasso and elastic-net regularized regression |
| e1071 | David Meyer et al. | LIBSVM bindings, naive Bayes, k-means |
| kernlab | Alexandros Karatzoglou et al. | Kernel methods, SVM, kernel PCA |
| nnet | Brian Ripley | Single-hidden-layer neural networks (shipped with R as a recommended package for decades) |
| mboost | Hothorn et al. | Model-based boosting |
| MASS | Venables & Ripley | LDA, QDA, ordinal logistic regression (polr), classic methods |
The caret package (Classification And REgression Training) was built by Max Kuhn starting in 2007 to wrap the dozens of inconsistent ML packages behind a single API: a unified train() function that handles preprocessing, resampling, hyperparameter tuning, and prediction across hundreds of models. For about a decade, caret was how most R users did ML.
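To make the resampling part concrete, here is a hand-rolled sketch of k-fold cross-validation in base R. It is not caret's implementation and does not call train(); it only illustrates the kind of bookkeeping such a wrapper automates. The model, the mtcars dataset, and the choice of five folds are assumptions made for the example.

```r
set.seed(1)                                   # reproducible fold assignment
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# For each fold: fit on the other folds, then score the held-out rows.
fold_rmse <- sapply(seq_len(k), function(i) {
  train_rows <- mtcars[folds != i, ]
  test_rows  <- mtcars[folds == i, ]
  fit <- lm(mpg ~ wt + hp, data = train_rows)
  sqrt(mean((test_rows$mpg - predict(fit, newdata = test_rows))^2))
})

cv_rmse <- mean(fold_rmse)                    # resampled error estimate
```

A unified interface earns its keep once this loop has to be repeated for many model types, each with its own preprocessing and tuning grid.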
Kuhn later rewrote the framework with the same philosophy but tidyverse-native idioms; the result is tidymodels, a meta-package and collection of components (parsnip for models, recipes for preprocessing, rsample for resampling, yardstick for metrics, tune for hyperparameter search, workflows for combining them) released to CRAN in 2020 and now the default for new code. mlr3, by Bernd Bischl's group at LMU Munich, is the third generation of the mlr framework and a more object-oriented alternative to tidymodels. It uses R6 classes throughout, has good support for benchmarking and pipelines, and is favored in some research labs for experiment management.
R is not where most deep learning research happens. The community of paper authors writing in R is small relative to the Python/PyTorch crowd, and the papers-with-code culture grew up around Python notebooks. But R has good bindings for the major frameworks.
The tensorflow R package binds TensorFlow's Python API through reticulate. The companion keras package wraps the Keras high-level API. Both were released by RStudio starting in 2017, with JJ Allaire as lead author. François Chollet and Allaire co-wrote Deep Learning with R (2018), an R port of Chollet's Deep Learning with Python; the second edition (2022) added Tomasz Kalinowski as co-author.
The torch R package, written by Daniel Falbel and the mlverse team and first released to CRAN in October 2020, is a different beast. It does not bind PyTorch through Python; it links directly against libtorch, the C++ library underlying PyTorch. That means torch for R has no Python dependency, runs in a single process, and gets near-native speed. The API mirrors PyTorch closely (autograd, nn.Module-equivalent, datasets and dataloaders), and the mlverse organization on GitHub maintains extensions including torchvision, torchaudio, tabnet, and luz (a high-level training loop akin to PyTorch Lightning).
R's deep-learning footprint is biggest in research areas where R was already entrenched: epidemiological modeling, ecology, psychometrics, single-cell genomics. Most production deep-learning systems still run in Python, but a researcher who needs to fit a CNN as part of a larger statistical analysis can stay inside R if they want to.
The reticulate package, released by RStudio in 2018, embeds a Python interpreter inside an R session and translates objects between the two. You can call Python functions from R, source .py files, use NumPy arrays and pandas DataFrames as if they were R vectors and data frames, and run a Python REPL inside the R console. Reticulate is what makes the R-side Keras and TensorFlow bindings work, and it is what makes mixed-language Quarto documents practical. The other direction, calling R from Python, is handled by rpy2 (Laurent Gautier, 2008). Apache Arrow's R and Python bindings share an in-memory format, so large data frames move between languages with no serialization cost. In practice many data teams write data pipelines and ML training in Python and reach for R for the final modeling and reporting steps.
The rivalry between R and Python has cooled. Both communities mostly accept that they overlap heavily and complement each other for the rest. A rough division of labor:
| Task | R is typically stronger | Python is typically stronger |
|---|---|---|
| Classical statistics (mixed models, survival, GAMs, Bayesian inference) | Yes, by a wide margin | Catching up via PyMC and statsmodels but still behind |
| Data wrangling | The tidyverse (dplyr, tidyr) and data.table are excellent | pandas is excellent; Polars is excellent |
| Static plotting | ggplot2 is widely considered best in class | matplotlib, plotly, seaborn are all good |
| Interactive dashboards | Shiny | Streamlit, Dash, Gradio, Shiny for Python |
| Reproducible reports | R Markdown, Quarto | Jupyter, Quarto |
| ML pipelines | tidymodels, mlr3 | scikit-learn |
| Deep learning | torch, keras (R bindings to the same engines) | PyTorch, TensorFlow, JAX (the actual research happens here) |
| LLM tooling | Some, mainly through ellmer and Ollama bindings | Vast: LangChain, LlamaIndex, transformers, vLLM, etc. |
| Bioinformatics / genomics | Bioconductor is the standard | Biopython exists but Bioconductor is bigger |
| Production web services | Plumber, Shiny | FastAPI, Django, Flask |
| General-purpose programming | Possible but awkward | Designed for it |
Most serious data teams now use both. There is no Python equivalent of Bioconductor for genomics, and no R equivalent of the PyTorch / Hugging Face ecosystem for deep learning. Pick the language that matches where the rest of the work in your subfield already lives.
R is the working language of a sizable chunk of academic and applied statistics. Specific strongholds: pharma and clinical trials (the FDA accepts R for regulatory submissions, and frameworks like the R Validation Hub are part of the toolchain); official statistics (Eurostat, the U.S. Bureau of Labor Statistics, Statistics Canada, Statistics NZ); genomics and biostatistics (Bioconductor and single-cell packages like Seurat are the default); econometrics and quantitative finance; ecology and geospatial analysis through vegan, sf, and terra; data journalism at outlets like the BBC, FiveThirtyEight, and the Financial Times; and most public NBA/NFL analytics work.
The textbook An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani (2013, second edition 2021) was originally written with R examples and has become one of the most widely used ML textbooks in undergraduate statistics. A companion Python edition appeared in 2023, which is itself a small data point about the shifting balance.
R is released under the GNU General Public License version 2 or 3, at the user's option. CRAN does not require GPL for contributed packages; common alternatives include MIT, Apache 2.0, and BSD.
Major version timeline:
| Version | Release | Notes |
|---|---|---|
| R 0.16 | 1995 | First public version |
| R 0.49 | April 1997 | First on CRAN |
| R 0.60 | December 1997 | Became a GNU project |
| R 1.0.0 | February 29, 2000 | First stable release |
| R 2.0.0 | October 4, 2004 | Lazy-loading package data |
| R 3.0.0 | April 3, 2013 | Long vectors; broke binary compatibility |
| R 3.4.0 | April 2017 | JIT compilation by default |
| R 3.5.0 | April 2018 | ALTREP framework |
| R 4.0.0 | April 24, 2020 | stringsAsFactors = FALSE default |
| R 4.1.0 | May 2021 | Native pipe `\|>` |
| R 4.2.0 | April 2022 | Native pipe placeholder; UTF-8 on Windows |
| R 4.3.0 | April 2023 | Pipe placeholder _ allowed at the head of an extraction chain (e.g. _$coef) |
| R 4.4.0 | April 2024 | Language-level enhancements |
| R 4.5.0 | April 2025 | Routine release |
| R 4.6.0 | April 24, 2026 | "Because it was There"; current stable |
R is productive for what it was designed to do, but it has rough edges that practitioners have lived with for decades.
- … data.table or Arrow than in base R.
- … Rcpp package by Dirk Eddelbuettel and Romain François (2008), which has its own thriving ecosystem.
- … NA handling, factor defaults, base graphics versus ggplot2, and a dozen other small things. The tidyverse cleaned up a lot of this, but old code still mixes idioms.
- … parallel, future, and mirai packages provide multi-process backends but there is nothing like Python's asyncio built into the language.
Most R users would say none of these are dealbreakers for the work R is actually used for, and the ones that matter (Rcpp for speed, Arrow for memory, Plumber for serving) have working answers.