Data Science
Last reviewed
May 4, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 4,306 words
Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms, and systems to extract knowledge and insights from noisy, structured, and unstructured data. It combines elements of mathematics, computer science, domain expertise, and information theory to study, analyze, and act upon data of many different kinds. The field exists at the intersection of statistical reasoning and software engineering, with the goal of producing data-driven understanding that is reproducible, scalable, and useful to decision makers.
The modern practice of data science is deeply intertwined with machine learning and artificial intelligence. Most production data science work involves training predictive models, evaluating them against held-out data, and deploying them as services that score new observations. Common techniques range from classical regression to gradient-boosted trees to deep learning architectures used in computer vision and natural language processing. The boundary between data science and applied machine learning has steadily eroded since the late 2010s, and most large technology companies now treat the two as overlapping specializations within a broader analytics organization.
Although the term "data science" appeared in computer science writing as early as 1974, the modern job title "data scientist" was popularized in 2008 by DJ Patil at LinkedIn and Jeff Hammerbacher at Facebook. A 2012 Harvard Business Review article describing the role as "the sexiest job of the 21st century" cemented the title in mainstream business vocabulary. Today the practice spans government, healthcare, finance, retail, telecommunications, and academic research, and it is taught as a major or minor at hundreds of universities. The arrival of foundation models and large language models after 2022 has begun a fresh wave of change in how data scientists work, shifting effort away from training models from scratch and toward prompting, fine-tuning, and evaluating pre-trained systems.
There is no single agreed definition of data science. The term has been used since at least 1974 to describe a wide range of activities concerning data: collection, storage, transformation, analysis, modeling, visualization, and communication. Most contemporary descriptions converge on a few common ideas. Data science is interdisciplinary, drawing on probability and statistics, computer science, optimization, and a substantive application domain. It is empirical, in that it begins with measurements rather than theory. It is computational, in that the volume and variety of modern data force practitioners to rely on programming and distributed systems. And it is goal-oriented, aiming at decisions, predictions, or new knowledge rather than at proofs.
It is useful to compare data science with several adjacent fields. Statistics provides much of the inferential machinery used by data scientists, but classical statistics has traditionally focused on small samples, model assumptions, and rigorous uncertainty quantification, while data science often works with large, observational data sets and emphasizes predictive accuracy. Machine learning supplies the algorithmic toolkit used to fit complex predictive models, and many data science teams treat machine learning as a subset of their work. Artificial intelligence is broader than data science and includes symbolic reasoning, planning, and robotics, but in industry the two terms are often used interchangeably when the underlying systems are statistical. Business intelligence overlaps with descriptive analytics within data science but typically stops at dashboards and reports rather than predictive modeling. Data engineering builds and operates the pipelines that data scientists rely on, and the boundary between data science and data engineering is one of the most contested in industry hiring.
In 2001, William S. Cleveland of Bell Labs proposed treating data science as an enlarged version of statistics that explicitly incorporates multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory. His paper argued for a six-part allocation of effort across these areas and recommended that university statistics departments adopt the new name. The proposal was influential in academia and helped shape the curricula that began appearing in the 2010s.
The history of data science is the history of repeated attempts to expand statistics outward, of computer science reaching toward applications in measurement, and of industry recognizing that data work needed a name. The following timeline highlights the most frequently cited milestones.
In 1962, the American statistician John W. Tukey published "The Future of Data Analysis" in the Annals of Mathematical Statistics. The paper was 67 pages long and called for a new field, distinct from mathematical statistics, that would emphasize learning from data through exploration, graphics, and computation. Tukey is often called the first data scientist for this reason, although he did not use the term. He also coined the terms "bit" and "exploratory data analysis," introduced the box plot, and co-developed the Cooley-Tukey Fast Fourier Transform algorithm with James Cooley.
In 1974, the Danish computer scientist Peter Naur, who would go on to win the Turing Award in 2005, published "Concise Survey of Computer Methods." The book is widely cited as the first printed use of the phrase "data science." Naur described the field as the science of dealing with data once they have been established, distinct from the question of what those data represent.
In 1996, the International Federation of Classification Societies held its biennial conference in Kobe, Japan. The proceedings were titled "Data Science, Classification, and Related Methods," the first time the term appeared in the title of a major international conference. The volume covered classification, clustering, exploratory and multivariate data analysis, and knowledge discovery.
In November 1997, C.F. Jeff Wu delivered his inaugural lecture as the H.C. Carver Professor at the University of Michigan with the title "Statistics = Data Science?" Wu argued that statistics should be renamed data science and statisticians should be renamed data scientists, reflecting a broader engagement with computing and applications.
In 2001, William S. Cleveland published "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics" in the International Statistical Review. The paper laid out a curriculum and research agenda for a new field defined by the integration of statistics with computing and substantive applications.
In 2008, DJ Patil at LinkedIn and Jeff Hammerbacher at Facebook compared notes on the difficulty of recruiting analysts whose work spanned engineering, statistics, and product. The pair settled on "data scientist" as a job title, in part because internal experiments at LinkedIn showed that the title attracted more qualified applicants than alternatives such as "research scientist," "statistician," or "analyst." Both companies adopted the title, and other Silicon Valley firms followed.
In October 2012, Thomas H. Davenport and DJ Patil published "Data Scientist: The Sexiest Job of the 21st Century" in the Harvard Business Review. The article argued that data scientists combined the skills of a hacker, an analyst, a communicator, and a trusted adviser, and that demand for the role had outstripped supply. The phrase entered popular culture and accelerated the founding of degree programs and bootcamps.
From 2013 onward, universities began launching dedicated programs. New York University founded its Center for Data Science in 2013. Berkeley, the Massachusetts Institute of Technology, and Carnegie Mellon followed with majors, minors, master's degrees, and dedicated schools. Online education made the field broadly accessible: the Coursera Machine Learning course taught by Andrew Ng became one of the most enrolled courses in the world after its 2011 launch, and platforms such as edX and DataCamp added specialized data science tracks. Kaggle, a competition platform founded in 2010 and acquired by Google in 2017, served as a global training ground.
In February 2015, President Barack Obama appointed DJ Patil as the first U.S. Chief Data Scientist within the Office of Science and Technology Policy. Patil's tenure helped launch the Police Data Initiative, the Data-Driven Justice Initiative, and the Precision Medicine Initiative, and it signaled to other governments that the role was now a fixture of public policy.
The second half of the decade saw the deep learning boom transform data science practice. Frameworks such as TensorFlow (released by Google in 2015) and PyTorch (released by Facebook in 2016) made it possible for individual practitioners to train large neural networks. Cloud platforms commoditized storage and compute. Tools such as Apache Spark, Snowflake, and Databricks absorbed much of the heavy lifting of distributed computing.
The public release of ChatGPT in November 2022 began a new chapter. Foundation models trained on internet-scale corpora can now perform many of the tasks that data scientists previously built bespoke models for, including text classification, summarization, and basic code generation. ChatGPT's Code Interpreter feature (later renamed Advanced Data Analysis) executes Python in a sandbox, allowing non-technical users to upload a spreadsheet and ask for an analysis in natural language. GitHub Copilot, announced in 2021 and steadily expanded, helps data scientists write code in Jupyter notebooks and other environments. Specialized analytics tools such as Hex, Mode, and Deepnote have integrated AI assistants directly into the notebook interface.
The role itself is shifting. Data scientists in 2026 spend less time training small custom models from scratch and more time evaluating, fine-tuning, and orchestrating foundation models. Skills in prompt design, retrieval-augmented generation, and model evaluation have grown in importance. The traditional core of the field, statistical inference applied to messy real-world data, remains essential.
Data science projects typically follow a recognizable sequence of phases. Two reference frameworks are widely cited. CRISP-DM, the Cross-Industry Standard Process for Data Mining, was published in 1999 by a consortium led by SPSS, DaimlerChrysler, and NCR. It describes six iterative phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. OSEMN, pronounced "awesome" and proposed by Hilary Mason and Chris Wiggins in 2010, summarizes the workflow with five verbs: Obtain, Scrub, Explore, Model, Interpret. Most contemporary teams use a hybrid that resembles the table below.
| Phase | Typical activities | Common artifacts |
|---|---|---|
| Problem framing | Translate business or research question into measurable objective; agree on success metric and constraints | Project brief, success criteria |
| Data collection | Identify sources, request access, write extraction queries, gather logs, run surveys | Raw data dump, data dictionary |
| Data wrangling | Parse, normalize, deduplicate, join, handle missing values, fix encoding errors | Cleaned table, transformation script |
| Exploratory data analysis | Compute summary statistics, plot distributions, identify outliers, test hypotheses informally | Notebook, chart gallery |
| Feature engineering | Construct predictors from raw signals; encode categorical variables; build embeddings | Feature store entry, transformation pipeline |
| Modeling | Choose algorithm family, train candidates, tune hyperparameters, run cross-validation | Trained model artifact, training log |
| Evaluation | Score on held-out data, compare against baseline, examine errors, check fairness and calibration | Evaluation report, slice analysis |
| Deployment | Package model, expose as service or batch job, integrate with downstream system | API endpoint, scheduled pipeline |
| Monitoring | Track input drift, output distribution, performance against ground truth, retrain when needed | Dashboard, alerting rules |
In practice the phases are not strictly sequential. Findings during exploration often send a project back to data collection. A model evaluation that exposes a fairness problem can require redesigning features. Time spent on cleaning and wrangling commonly dominates the calendar. A widely cited industry survey from 2017 reported that data scientists spend roughly 60 percent of their time preparing data and only about 20 percent building models, and the ratio has shifted only modestly since.
Data science draws on a wide methodological catalog. The choice of method depends on the goal (description, prediction, causal inference, generation), the type of data, and the resources available.
Most projects begin with descriptive statistics: counts, means, medians, standard deviations, quantiles, correlation, and visualization. Inferential techniques such as confidence intervals, hypothesis tests, bootstrap resampling, and Bayesian estimation provide ways to reason about uncertainty when the data are a sample from a larger population.
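As an illustration of the resampling idea mentioned above, the following is a minimal sketch of a percentile bootstrap confidence interval for a mean, written with the Python standard library only. The function name `bootstrap_mean_ci`, the sample values, and the choice of 2,000 resamples are assumptions made for this example and do not come from any particular source.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements, made up for illustration.
observations = [12.1, 9.8, 11.4, 10.9, 13.2, 8.7, 10.3, 11.8, 9.5, 12.6]
point_estimate = statistics.fmean(observations)
ci_low, ci_high = bootstrap_mean_ci(observations)
print(f"mean = {point_estimate:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The percentile method shown here is the simplest bootstrap variant; with very small or heavily skewed samples, other variants may behave better.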
Supervised learning fits a function from features to a labeled target. Linear regression, logistic regression, decision trees, random forests, gradient boosting machines such as XGBoost, support vector machines, and neural networks are the workhorses. Regression problems predict continuous values such as price or temperature; classification problems predict discrete labels such as spam or fraud.
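The regression case can be sketched in a few lines. The example below fits a one-feature least-squares line on a training split and scores it on held-out rows against a naive baseline. The data are synthetic, and the helper names `fit_line` and `mean_squared_error` are assumptions for illustration; in practice a library such as scikit-learn would normally handle both steps.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_squared_error(actual, predicted):
    """Average squared difference between targets and predictions."""
    return statistics.fmean((a - p) ** 2 for a, p in zip(actual, predicted))


# Synthetic data (an assumption for illustration): y = 3x + 5 plus Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 5.0 + rng.gauss(0, 1.0) for x in xs]

# Hold out the last 25% of rows; the model never sees them during fitting.
split = int(0.75 * len(xs))
x_train, x_test = xs[:split], xs[split:]
y_train, y_test = ys[:split], ys[split:]

slope, intercept = fit_line(x_train, y_train)
model_mse = mean_squared_error(y_test, [slope * x + intercept for x in x_test])

# Baseline: always predict the training mean of the target.
baseline = statistics.fmean(y_train)
baseline_mse = mean_squared_error(y_test, [baseline] * len(y_test))

print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"held-out MSE: model={model_mse:.2f} baseline={baseline_mse:.2f}")
```

Comparing against a trivial baseline on data the model has not seen is what makes the reported error meaningful; an error measured on the training rows would be optimistic.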
Unsupervised methods find structure in data without labels. Clustering algorithms (k-means, DBSCAN, hierarchical clustering) group similar observations. Dimensionality reduction techniques compress high-dimensional data into a smaller representation, with principal component analysis the most common linear approach and t-SNE and UMAP the most common nonlinear ones. Association rule mining and topic models are also part of this family.
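To make the clustering idea concrete, here is a minimal sketch of k-means using Lloyd's algorithm in plain Python. The two synthetic blobs, the fixed iteration count, and the random initialization from data points are assumptions chosen for brevity; production implementations typically add smarter initialization and a convergence check.

```python
import math
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Two synthetic, well-separated blobs (made up for illustration).
rng = random.Random(1)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
blob_b = [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]
centroids, labels = kmeans(blob_a + blob_b, k=2)
print(sorted((round(x, 1), round(y, 1)) for x, y in centroids))
```

No labels are supplied at any point: the grouping emerges only from distances between observations, which is what distinguishes this family from supervised learning.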
Time series data, including financial prices, sensor readings, and web traffic, require methods that respect temporal ordering. Classical tools include ARIMA, exponential smoothing, and state-space models. Modern alternatives include Prophet, gradient-boosted trees on lagged features, and recurrent or transformer-based neural networks.
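Simple exponential smoothing, one of the classical tools named above, can be sketched as follows. The series values and the smoothing weight `alpha=0.5` are hypothetical, and the sketch covers only the level-only variant without trend or seasonality.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    level_t = alpha * y_t + (1 - alpha) * level_{t-1}, with level_0 = y_0.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    levels = [series[0]]
    for y in series[1:]:
        # Observations are consumed strictly in time order.
        levels.append(alpha * y + (1.0 - alpha) * levels[-1])
    return levels


# Hypothetical daily page views, made up for illustration.
page_views = [120, 132, 128, 141, 150, 147, 160]
levels = simple_exponential_smoothing(page_views, alpha=0.5)
forecast = levels[-1]  # the one-step-ahead forecast is the last smoothed level
print(f"next-day forecast: {forecast:.1f}")
```

Because each level depends only on earlier observations, the method respects temporal ordering by construction. The same discipline applies to evaluation: a forecasting model is usually tested on the most recent observations rather than on a random shuffle.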
Natural language processing covers tasks such as classification, named entity recognition, machine translation, and question answering. Computer vision covers image classification, object detection, segmentation, and generation. Both subfields have been transformed by deep learning since the early 2010s, and most production systems in 2026 build on pre-trained transformer models rather than training from scratch.
Deep learning uses multilayer neural networks to learn representations directly from raw inputs. It has become the dominant approach when data are abundant and when the input is unstructured, such as images, audio, or text. Geoffrey Hinton, Yann LeCun, and Yoshua Bengio shared the 2018 Turing Award for their foundational work in the area.
Many business questions are causal. Would changing a button color increase clicks? Does a training program improve outcomes? Randomized experiments, often called A/B tests, are the gold standard. When randomization is not possible, observational causal inference techniques such as instrumental variables, regression discontinuity, propensity score matching, and difference-in-differences provide alternatives, each with its own assumptions and risks.
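One common way to analyze an A/B test on conversion rates is a two-proportion z-test. The sketch below uses only the Python standard library; the experiment counts are hypothetical, and the normal approximation it relies on assumes reasonably large samples in both arms.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null hypothesis
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment counts, made up for illustration.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The test supports a causal reading only because users were assigned to the two arms at random; applying the same arithmetic to observational groups would not remove confounding.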
The practical work of data science is mediated by a large and rapidly evolving toolchain. The table below lists representative tools across the major categories.
| Category | Examples |
|---|---|
| Languages | Python, R, SQL, Julia, Scala |
| Notebooks and IDEs | Jupyter, Google Colab, Databricks notebooks, Kaggle Kernels, VS Code, RStudio |
| Numerical and ML libraries (Python) | NumPy, pandas, scikit-learn, statsmodels, XGBoost, LightGBM, PyTorch, TensorFlow, JAX |
| R packages | tidyverse, data.table, caret, tidymodels, ggplot2 |
| Visualization | matplotlib, seaborn, plotly, Altair, Tableau, Power BI, Looker |
| Big data and distributed compute | Apache Spark, Hadoop, Dask, Polars, Ray |
| Workflow and orchestration | Airflow, Prefect, Dagster, dbt |
| MLOps and experiment tracking | MLflow, Weights & Biases, Kubeflow, Neptune, Comet |
| Cloud platforms | AWS SageMaker, Google Vertex AI, Azure ML, Databricks, Snowflake |
| Foundation model APIs | Anthropic Claude, OpenAI GPT, Google Gemini, Cohere |
Python has become the dominant language for data science, with pandas for tabular manipulation, NumPy for numerical arrays, and scikit-learn for classical machine learning. R remains popular in statistics and biostatistics, particularly within academia. SQL is essentially universal: nearly every data scientist queries a relational warehouse on a daily basis. Tableau and Power BI dominate enterprise dashboarding. TensorFlow and PyTorch are the leading deep learning frameworks, with PyTorch ahead in research and roughly even with TensorFlow in industrial deployment.
Formal education in data science has expanded rapidly. The first standalone undergraduate major in the United States was launched at the College of Charleston in 2007. UC Berkeley assembled a campus-wide task force in 2014 and graduated its first dedicated data science majors in 2018. Stanford University, MIT, Carnegie Mellon, and New York University followed with majors, minors, master's degrees, and entire schools or colleges of computing and data. The MIT Stephen A. Schwarzman College of Computing was announced in 2018 with a one-billion-dollar commitment, and the Berkeley Division of Computing, Data Science, and Society was established the same year.
Massive open online courses (MOOCs) opened the field to learners worldwide. Andrew Ng's Machine Learning course on Coursera, launched in 2011, became the most popular MOOC in history, with millions of enrollments. The Coursera Data Science Specialization from Johns Hopkins, the IBM Data Science Professional Certificate, the Microsoft Professional Program, the fast.ai Practical Deep Learning courses, and the DeepLearning.AI specializations have together trained millions of additional learners. Kaggle competitions, with cash prizes and a public leaderboard, became a parallel proving ground.
Bootcamps including General Assembly, Metis, Springboard, and Insight Data Science offered intensive immersion programs targeted at career changers. Many have closed or pivoted as the labor market has saturated and as universities have caught up, but the format remains influential.
Academic certification has been complemented by community-driven credentials such as the Hugging Face certifications for natural language processing and by the associate-level data and machine learning certifications offered by the cloud providers AWS, Google Cloud, and Microsoft Azure.
The "data scientist" title now describes a wide range of jobs. Most organizations distinguish among several closely related roles, although boundaries blur and titles vary. The table below summarizes the most common distinctions in industry as of 2026.
| Role | Primary focus | Typical tools |
|---|---|---|
| Data analyst | Descriptive analytics, reporting, dashboards, ad-hoc analysis | SQL, Excel, Tableau, Power BI |
| Data scientist | Statistical analysis, predictive modeling, experimentation | Python, R, SQL, Jupyter, scikit-learn |
| Machine learning engineer | Building, deploying, and scaling production ML systems | Python, PyTorch, TensorFlow, Kubernetes, MLflow |
| Data engineer | Designing and operating data pipelines, warehouses, and lakes | SQL, Spark, Airflow, dbt, cloud data platforms |
| Statistician | Study design, inference, causal analysis, especially in clinical or government settings | R, SAS, Stata |
| Research scientist | Original ML or AI research, often with publications | PyTorch, JAX, large compute clusters |
| Analytics engineer | Bridging data engineering and analysis through modeled tables | dbt, SQL, BI tools |
Industries hiring data scientists include technology, finance, healthcare and life sciences, retail and e-commerce, consumer packaged goods, telecommunications, energy, transportation, government, and consulting. Tech and finance pay the highest base salaries, while healthcare and government often offer mission-driven work and stronger job security.
The U.S. Bureau of Labor Statistics reported a median annual wage of 112,590 dollars for data scientists in May 2024, the most recent year available. The lowest-earning ten percent made less than 63,650 dollars, while the top ten percent made more than 194,410 dollars. The median wage was higher in scientific research and development services at 120,090 dollars. The Bureau projects 34 percent employment growth between 2024 and 2034, much faster than the average for all occupations, with about 23,400 openings projected each year.
Compensation at major technology firms can substantially exceed the BLS median once equity is included. Salary aggregator Levels.fyi consistently shows total compensation for senior data scientists and machine learning engineers at large tech companies in the 250,000 to 500,000 dollar range, with staff and principal levels reaching higher.
The deployment of capable foundation models has changed daily practice. Several patterns are now widespread.
First, code generation has become a default. Tools such as GitHub Copilot, Anthropic Claude, and ChatGPT can produce functional pandas code, scikit-learn pipelines, SQL queries, and visualizations from natural language descriptions. Most data scientists use these tools to scaffold analyses and to recall syntax for less familiar libraries, then edit and verify the results.
Second, exploratory analysis can be conversational. ChatGPT's Advanced Data Analysis (originally launched in 2023 as Code Interpreter) accepts uploaded files, executes Python in a sandbox, and reports results with charts. Anthropic's Claude offers an analogous Analysis tool that runs JavaScript or Python on the user's data. Specialized notebooks such as Hex, Mode, and Deepnote have integrated similar assistants, and many established platforms have added their own.
Third, the modeling step has been partly displaced. For text classification, named entity recognition, summarization, and question answering, calling a hosted large language model is often faster and more accurate than training a bespoke model. The data scientist's job becomes one of evaluation, prompt design, and orchestration: building retrieval-augmented generation pipelines, choosing the right model, deciding when fine-tuning is justified, and monitoring outputs for hallucination, bias, and drift.
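The retrieval step of a retrieval-augmented generation pipeline can be illustrated with a toy sketch. The example below ranks a handful of passages by cosine similarity over raw word counts and assembles a prompt from the best match. Everything in it is an assumption for illustration: the helper names, the knowledge-base snippets, and the prompt wording are hypothetical, real systems usually rely on learned embeddings and a vector database rather than word counts, and no model API is called.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, documents, top_k=1):
    """Rank documents by similarity to the query and return the best top_k."""
    q = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda d: cosine_similarity(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, passages):
    """Assemble the text that would be sent to a hosted language model."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical knowledge-base snippets, made up for illustration.
docs = [
    "Refunds are issued within 14 days of a return request.",
    "Standard shipping takes 3 to 5 business days.",
    "Support is available by email on weekdays.",
]
question = "How long does shipping take?"
passages = retrieve(question, docs, top_k=1)
prompt = build_prompt(question, passages)
print(prompt)
```

Separating retrieval from prompt assembly makes each piece testable on its own, which matters for the evaluation and monitoring work the paragraph above describes.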
Fourth, the boundary with software engineering has tightened. Production AI systems involve agents, tools, retrieval indices, vector databases, observability, and complex prompt assembly. Many data science teams now require engineering practices that were once optional.
The traditional core of data science remains intact. Cleaning data, framing problems, designing experiments, building reliable predictive models, and communicating results to non-technical audiences are still central. Foundation models do not replace causal inference, time series forecasting, or rigorous A/B testing.
A growing number of universities have organized substantial portions of their faculty around data science.
| Institution | Unit | Notes |
|---|---|---|
| UC Berkeley | Division of Computing, Data Science, and Society | Established 2018; offers undergraduate major and graduate programs |
| MIT | Stephen A. Schwarzman College of Computing | Announced 2018; one billion dollar commitment |
| Stanford University | Stanford Data Science | Cross-school initiative; AI Index report |
| New York University | Center for Data Science | Founded 2013; offers Ph.D. and master's programs |
| Carnegie Mellon University | Machine Learning Department, Statistics & Data Science | First Ph.D. in machine learning, founded 2006 |
| Columbia University | Data Science Institute | Founded 2012 |
| University of Michigan | Michigan Institute for Data Science | Founded 2015 |
| University of Washington | eScience Institute | Founded 2008 |
| Harvard University | Harvard Data Science Initiative | Founded 2017 |
Leading journals and venues include the Journal of Data Science (founded 2003), the Harvard Data Science Review (founded 2019), the International Statistical Review, the Journal of Machine Learning Research, and the proceedings of conferences such as NeurIPS, ICML, KDD, and ICLR. Industry-facing venues such as the O'Reilly Strata Data Conference and Spark + AI Summit have also shaped the practical literature.
Data science attracts steady criticism, much of which is well grounded.
The "sexiest job" framing oversold the day-to-day reality. Surveys have repeatedly found that data scientists spend most of their time on data cleaning, plumbing, and stakeholder management rather than on modeling. Many practitioners have written about the gap between the rich machine learning curricula taught in universities and the unglamorous warehouse work that dominates their first years on the job.
Reproducibility has been a persistent concern. Predictive models published in academic papers often fail to generalize. A widely cited 2023 review identified data leakage as a pervasive problem in applied machine learning research, with a substantial share of published claims overstated because evaluation data leaked into training. Industry teams have built MLOps tooling in part to address these issues.
Algorithmic bias and ethics have moved from a fringe topic to a central one. Models trained on historical data can perpetuate or amplify discrimination in lending, hiring, criminal justice, and healthcare. The 2016 ProPublica investigation of the COMPAS recidivism algorithm and Joy Buolamwini's research on facial recognition accuracy by skin tone are commonly cited examples. Regulators in the European Union, the United States, and elsewhere have introduced rules requiring documentation, explanation, and impact assessment for high-risk automated decisions.
Title inflation and skill overlap have created confusion. A survey of job postings shows that the "data scientist" label can refer to a SQL analyst, a Python modeler, a research scientist with a Ph.D., or a machine learning engineer. Some companies have stopped using the title in favor of more specific roles such as analytics engineer or applied scientist.
Finally, the rapid arrival of foundation models has produced anxiety about the longevity of the role. Tasks that once filled a junior data scientist's quarter can now be done in minutes by a chatbot. Most observers expect the role to continue evolving, with humans focusing on framing, validation, and stakeholder communication, but the transition is uncomfortable for many practitioners.