# Data Science

> Source: https://aiwiki.ai/wiki/data_science
> Updated: 2026-06-25
> Categories: Computer Science, Education AI, Machine Learning, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Data science** is an interdisciplinary field that uses [statistics](/wiki/statistics), programming, and domain expertise to extract knowledge and insights from structured and unstructured data. It combines mathematics, computer science, and a substantive application area to study, analyze, and act on data, with the goal of producing data-driven understanding that is reproducible, scalable, and useful to decision makers. The practitioners who do this work are called data scientists, a title that Thomas Davenport and DJ Patil famously described in 2012 as "the sexiest job of the 21st century." [6]

The field exists at the intersection of statistical reasoning and software engineering. A widely cited 2012 Harvard Business Review article defined a data scientist as "a high-ranking professional with the training and curiosity to make discoveries in the world of [big data](/wiki/big_data)," adding that "more than anything, what data scientists do is make discoveries while swimming in data." [6] Demand for the role has grown accordingly: the U.S. Bureau of Labor Statistics reported a median annual wage of 112,590 dollars for data scientists in May 2024 and projects 34 percent employment growth from 2024 to 2034, much faster than the average for all occupations. [8]

The modern practice of data science is deeply intertwined with [machine learning](/wiki/machine_learning) and artificial intelligence. Most production data science work involves training predictive models, evaluating them against held-out data, and deploying them as services that score new observations. Common techniques range from classical regression to gradient-boosted trees to [deep learning](/wiki/deep_learning) architectures used in computer vision and natural language processing. The boundary between data science and applied machine learning has steadily eroded since the late 2010s, and most large technology companies now treat the two as overlapping specializations within a broader analytics organization.

While the term "data science" first appeared in the 1960s and 1970s in statistical and computer science writing, the modern profession was named in 2008 by [DJ Patil](/wiki/dj_patil) at LinkedIn and Jeff Hammerbacher at Facebook. [12] The 2012 Harvard Business Review article cemented the title in mainstream business vocabulary. [6] Today the practice spans government, healthcare, finance, retail, telecommunications, and academic research, and it is taught as a major or minor at hundreds of universities. The arrival of foundation models and [large language models](/wiki/large_language_model) after 2022 has begun a fresh wave of change in how data scientists work, shifting effort away from training models from scratch and toward prompting, fine-tuning, and evaluating pre-trained systems.

## What is data science?

There is no single agreed definition of data science. The term has been used since at least 1974 to describe a wide range of activities concerning data: collection, storage, transformation, analysis, modeling, visualization, and communication. [2] Most contemporary descriptions converge on a few common ideas. Data science is interdisciplinary, drawing on probability and statistics, computer science, optimization, and a substantive application domain. It is empirical, in that it begins with measurements rather than theory. It is computational, in that the volume and variety of modern data force practitioners to rely on programming and distributed systems. And it is goal-oriented, aiming at decisions, predictions, or new knowledge rather than at proofs.

It is useful to compare data science with several adjacent fields. Statistics provides much of the inferential machinery used by data scientists, but classical statistics has traditionally focused on small samples, model assumptions, and rigorous uncertainty quantification, while data science often works with large, observational data sets and emphasizes predictive accuracy. Machine learning supplies the algorithmic toolkit used to fit complex predictive models, and many data science teams treat machine learning as a subset of their work. Artificial intelligence is broader than data science and includes symbolic reasoning, planning, and robotics, but in industry the two terms are often used interchangeably when the underlying systems are statistical. Business intelligence overlaps with descriptive analytics within data science but typically stops at dashboards and reports rather than predictive modeling. [Data engineering](/wiki/data_engineering) builds and operates the pipelines that data scientists rely on, and the boundary between data science and data engineering is one of the most contested in industry hiring.

In 2001, [William S. Cleveland](/wiki/william_s_cleveland) of Bell Labs proposed treating data science as an enlarged version of statistics organized around six technical areas: multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory. [5] His paper argued for a six-part allocation of effort across these areas and recommended that university statistics departments adopt the new name. The proposal was influential in academia and helped shape the curricula that began appearing in the 2010s.

## How is data science different from machine learning, AI, and statistics?

Data science, machine learning, artificial intelligence, and statistics overlap heavily and are routinely confused, but they are not synonyms. Statistics is the mathematical foundation: it is the discipline of collecting, analyzing, and drawing inferences from data, with deep theory around sampling, estimation, and uncertainty. Machine learning is a set of algorithms that learn patterns from data to make predictions, and it can be viewed as one toolkit inside the larger data science workflow. Artificial intelligence is the broadest umbrella, covering any system that performs tasks associated with human intelligence, including symbolic reasoning, planning, robotics, and perception, not only statistical learning. Data science is the applied, end-to-end practice that ties these together: it frames a real-world problem, acquires and cleans the data, applies statistical and machine learning methods, and communicates the result to decision makers.

| Field | Core question | Typical output |
| --- | --- | --- |
| Statistics | What can we infer about a population from a sample, and how certain are we? | Estimates, confidence intervals, hypothesis tests |
| Machine learning | Which algorithm best predicts the target from the features? | Trained predictive model |
| Artificial intelligence | Can a system perform a task that normally requires human intelligence? | Autonomous or assistive system |
| Data science | What decision or insight can we extract from this data, end to end? | Analysis, prediction, recommendation, product feature |

In practice the labels blur. A single data scientist may use statistical inference one day, train a machine learning model the next, and call a pre-trained AI system the day after. The distinctions matter most for organizing teams, curricula, and job titles rather than for drawing hard methodological lines.

## What is the history of data science?

The history of data science is the history of repeated attempts to expand statistics outward, of computer science reaching toward applications in measurement, and of industry recognizing that data work needed a name. The following timeline highlights the most frequently cited milestones.

### Origins in statistics, 1962 to 1997

In 1962, the American statistician John W. Tukey published "The Future of Data Analysis" in the Annals of Mathematical Statistics. [1] The paper was 67 pages long and called for a new field, distinct from mathematical statistics, that would emphasize learning from data through exploration, graphics, and computation. Tukey is often called the first data scientist for this reason, although he did not use the term. He is also credited with coining "bit" (in the 1940s) and "exploratory data analysis," introducing the box plot, and co-developing the Cooley-Tukey Fast Fourier Transform algorithm.

In 1974, the Danish computer scientist Peter Naur, who would go on to win the Turing Award in 2005, published "Concise Survey of Computer Methods." [2] The book is widely cited as the first printed use of the phrase "data science." Naur described the field as the science of dealing with data once they have been established, distinct from the question of what those data represent.

In 1996, the International Federation of Classification Societies held its biennial conference in Kobe, Japan. The proceedings were titled "Data Science, Classification, and Related Methods," the first time the term appeared in the title of a major international conference. [3] The volume covered classification, clustering, exploratory and multivariate data analysis, and knowledge discovery.

In November 1997, C.F. Jeff Wu delivered his inaugural lecture as the H.C. Carver Professor at the University of Michigan with the title "Statistics = Data Science?" [4] Wu argued that statistics should be renamed data science and statisticians should be renamed data scientists, reflecting a broader engagement with computing and applications.

### From action plan to job title, 2001 to 2012

In 2001, William S. Cleveland published "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics" in the International Statistical Review. [5] The paper laid out a curriculum and research agenda for a new field defined by the integration of statistics with computing and substantive applications.

In 2008, DJ Patil at LinkedIn and Jeff Hammerbacher at Facebook compared notes on the difficulty of recruiting analysts whose work spanned engineering, statistics, and product. [12] The pair settled on "data scientist" as a job title, in part because internal experiments at LinkedIn showed that the title attracted more qualified applicants than alternatives such as "research scientist," "statistician," or "analyst." Both companies adopted the title, and other Silicon Valley firms followed.

In October 2012, Thomas H. Davenport and DJ Patil published "Data Scientist: The Sexiest Job of the 21st Century" in the Harvard Business Review. [6] The article argued that data scientists combined the skills of a hacker, an analyst, a communicator, and a trusted adviser, and that demand for the role had outstripped supply. It defined the data scientist as "a high-ranking professional with the training and curiosity to make discoveries in the world of big data." [6] The phrase entered popular culture and accelerated the founding of degree programs and bootcamps.

### Mainstream adoption, 2013 to 2021

From 2013 onward, universities began launching dedicated programs. New York University founded its Center for Data Science in 2013. Berkeley, the Massachusetts Institute of Technology, and Carnegie Mellon followed with majors, minors, master's degrees, and dedicated schools. Online education made the field broadly accessible: the Coursera Machine Learning course taught by [Andrew Ng](/wiki/andrew_ng) became one of the most enrolled courses in the world after its 2011 launch, and platforms such as edX and DataCamp added specialized data science tracks. [Kaggle](/wiki/kaggle), a competition platform founded in 2010 and acquired by Google in 2017, served as a global training ground.

In February 2015, President Barack Obama appointed DJ Patil as the first U.S. Chief Data Scientist within the Office of Science and Technology Policy. [7] Patil's tenure helped launch the Police Data Initiative, the Data-Driven Justice Initiative, and the Precision Medicine Initiative, and it signaled to other governments that the role was now a fixture of public policy.

The second half of the decade saw the deep learning boom transform data science practice. Frameworks such as TensorFlow (released by Google in 2015) and [PyTorch](/wiki/pytorch) (released by Facebook in 2016) made it possible for individual practitioners to train large neural networks. Cloud platforms commoditized storage and compute. Tools such as Apache Spark, [Snowflake](/wiki/snowflake), and [Databricks](/wiki/databricks) absorbed much of the heavy lifting of distributed computing.

### The foundation model era, 2022 to present

The public release of ChatGPT in November 2022 began a new chapter. Foundation models trained on internet-scale corpora can now perform many of the tasks that data scientists previously built bespoke models for, including text classification, summarization, and basic code generation. ChatGPT's Code Interpreter feature (later renamed Advanced Data Analysis) executes Python in a sandbox, allowing non-technical users to upload a spreadsheet and ask for an analysis in natural language. GitHub Copilot, announced in 2021 and steadily expanded, helps data scientists write code in [Jupyter notebooks](/wiki/jupyter_notebook) and other environments. Specialized analytics tools such as Hex, Mode, and Deepnote have integrated AI assistants directly into the notebook interface.

The role itself is shifting. Data scientists in 2026 spend less time training small custom models from scratch and more time evaluating, fine-tuning, and orchestrating foundation models. Skills in prompt design, retrieval-augmented generation, and model evaluation have grown in importance. The traditional core of the field, statistical inference applied to messy real-world data, remains essential.

## What does a data scientist do? The data science workflow

Data science projects typically follow a recognizable sequence of phases. Two reference frameworks are widely cited. CRISP-DM, the Cross-Industry Standard Process for Data Mining, was published in 1999 by a consortium led by SPSS, Daimler-Chrysler, and NCR; it describes six iterative phases. [10] OSEMN, pronounced "awesome" and proposed by Hilary Mason and Chris Wiggins in 2010, summarizes the workflow with five verbs: Obtain, Scrub, Explore, Model, Interpret. Most contemporary teams use a hybrid that resembles the table below.

| Phase | Typical activities | Common artifacts |
| --- | --- | --- |
| Problem framing | Translate business or research question into measurable objective; agree on success metric and constraints | Project brief, success criteria |
| Data collection | Identify sources, request access, write extraction queries, gather logs, run surveys | Raw data dump, data dictionary |
| Data wrangling | Parse, normalize, deduplicate, join, handle missing values, fix encoding errors | Cleaned table, transformation script |
| Exploratory data analysis | Compute summary statistics, plot distributions, identify outliers, test hypotheses informally | Notebook, chart gallery |
| Feature engineering | Construct predictors from raw signals; encode categorical variables; build embeddings | Feature store entry, transformation pipeline |
| Modeling | Choose algorithm family, train candidates, tune hyperparameters, run cross-validation | Trained model artifact, training log |
| Evaluation | Score on held-out data, compare against baseline, examine errors, check fairness and calibration | Evaluation report, slice analysis |
| Deployment | Package model, expose as service or batch job, integrate with downstream system | API endpoint, scheduled pipeline |
| Monitoring | Track input drift, output distribution, performance against ground truth, retrain when needed | Dashboard, alerting rules |

In practice the phases are not strictly sequential. Findings during exploration often send a project back to data collection. A model evaluation that exposes a fairness problem can require redesigning features. Time spent on cleaning and wrangling commonly dominates the calendar. A widely cited industry survey from 2017 reported that data scientists spend roughly 60 percent of their time preparing data and only about 20 percent building models, and the ratio has shifted only modestly since.
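The wrangling phase in the table above can be illustrated with a minimal sketch that uses only the Python standard library. The rows, the field names (`email`, `age`), and the choice of median imputation are assumptions made for illustration; real projects would more often use pandas and a documented imputation policy.

```python
import statistics

# Hypothetical raw export: inconsistent casing, stray whitespace,
# a duplicate row, and missing or invalid ages.
raw_rows = [
    {"email": " Ada@Example.com ", "age": "36"},
    {"email": "ada@example.com", "age": "36"},
    {"email": "grace@example.com", "age": ""},
    {"email": "alan@example.com", "age": "41"},
    {"email": "EDSGER@example.com", "age": "n/a"},
]


def parse_age(text):
    """Return the age as an int, or None when the field is missing or invalid."""
    text = text.strip()
    return int(text) if text.isdigit() else None


def clean(rows):
    """Normalize keys, drop duplicates, and impute missing ages."""
    seen = set()
    cleaned = []
    for row in rows:
        email = row["email"].strip().lower()  # normalize the key
        if email in seen:  # deduplicate on the normalized key
            continue
        seen.add(email)
        cleaned.append({"email": email, "age": parse_age(row["age"])})

    # Impute missing ages with the median of the observed ones.
    observed = [r["age"] for r in cleaned if r["age"] is not None]
    median_age = statistics.median(observed)
    for r in cleaned:
        r["age_was_missing"] = r["age"] is None  # keep a flag for later analysis
        if r["age"] is None:
            r["age"] = median_age
    return cleaned


for record in clean(raw_rows):
    print(record)
```

Keeping an `age_was_missing` flag is one possible design choice: it lets later modeling steps distinguish observed values from imputed ones.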

## What methods and techniques does data science use?

Data science draws on a wide methodological catalog. The choice of method depends on the goal (description, prediction, causal inference, generation), the type of data, and the resources available.

### Descriptive and inferential statistics

Most projects begin with descriptive statistics: counts, means, medians, standard deviations, quantiles, correlation, and visualization. Inferential techniques such as confidence intervals, hypothesis tests, bootstrap resampling, and Bayesian estimation provide ways to reason about uncertainty when the data are a sample from a larger population.
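As a rough illustration of both ideas, the sketch below computes a few descriptive statistics and then a percentile bootstrap confidence interval for the mean. It uses only the Python standard library; the `orders` sample, the function name `bootstrap_mean_ci`, and the default of 2,000 resamples are assumptions chosen for the example.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lower_index = round((alpha / 2) * n_resamples)
    upper_index = round((1 - alpha / 2) * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical sample: daily order counts observed over two weeks.
orders = [12, 15, 9, 20, 14, 13, 17, 11, 16, 18, 10, 14, 15, 13]

print("mean:", statistics.fmean(orders))
print("median:", statistics.median(orders))
print("sample std dev:", statistics.stdev(orders))
print("95% bootstrap CI for the mean:", bootstrap_mean_ci(orders))
```

The percentile method shown here is the simplest bootstrap variant; other variants (such as bias-corrected intervals) exist and may be preferable for skewed data.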

### Supervised learning

Supervised learning fits a function from features to a labeled target. [Linear regression](/wiki/linear_regression), logistic regression, decision trees, [random forests](/wiki/random_forest), gradient boosting machines such as [XGBoost](/wiki/xgboost), support vector machines, and neural networks are the workhorses. Regression problems predict continuous values such as price or temperature; classification problems predict discrete labels such as spam or fraud.
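A minimal regression sketch, assuming a single feature and synthetic data, shows the basic pattern of fitting on one part of the data and scoring on held-out rows. The data-generating rule (price = 3 × size + 50 plus noise) and all function names are hypothetical; in practice a library such as scikit-learn would usually replace the hand-written fit.

```python
import random
import statistics


def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return statistics.fmean(abs(t - p) for t, p in zip(y_true, y_pred))


# Synthetic data (an assumption for illustration): price = 3 * size + 50 + noise.
rng = random.Random(42)
sizes = [rng.uniform(20, 200) for _ in range(200)]
prices = [3.0 * s + 50.0 + rng.gauss(0, 10) for s in sizes]

# Hold out the last 25% of rows; the model never sees them during fitting.
split = int(0.75 * len(sizes))
train_x, test_x = sizes[:split], sizes[split:]
train_y, test_y = prices[:split], prices[split:]

slope, intercept = fit_simple_linear_regression(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]

# Baseline: always predict the training mean, a floor any useful model should beat.
baseline = [statistics.fmean(train_y)] * len(test_y)

model_mae = mean_absolute_error(test_y, predictions)
baseline_mae = mean_absolute_error(test_y, baseline)
print("slope, intercept:", round(slope, 2), round(intercept, 2))
print("model MAE:", round(model_mae, 2))
print("baseline MAE:", round(baseline_mae, 2))
```

Comparing against a trivial baseline is a cheap sanity check: a model that cannot beat "predict the mean" has not learned anything useful from the features.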

### Unsupervised learning

Unsupervised methods find structure in data without labels. Clustering algorithms (k-means, DBSCAN, hierarchical clustering) group similar observations. Dimensionality reduction techniques compress high-dimensional data into a smaller representation, with [principal component analysis](/wiki/principal_component_analysis_pca) the most common linear approach and t-SNE and UMAP the most common nonlinear ones. Association rule mining and topic models are also part of this family. Much of this work overlaps with [data mining](/wiki/data_mining), the discovery of patterns in large data sets.
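To make the clustering idea concrete, here is a small sketch of k-means (Lloyd's algorithm) in plain Python. The two synthetic blobs, the fixed iteration count, and the random initialization from data points are assumptions for illustration; production code would normally rely on a library implementation with better initialization and a convergence check.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, n_iterations=20, seed=0):
    """Lloyd's algorithm: returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centroids, labels


# Two hypothetical, well-separated groups of 2-D observations.
rng = random.Random(1)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
blob_b = [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]

centroids, labels = k_means(blob_a + blob_b, k=2)
print("centroids:", [tuple(round(v, 2) for v in c) for c in centroids])
```

No labels are supplied at any point; the grouping emerges only from distances between observations.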

### Time series analysis

Time series data, including financial prices, sensor readings, and web traffic, require methods that respect temporal ordering. Classical tools include ARIMA, exponential smoothing, and state-space models. Modern alternatives include Prophet, gradient-boosted trees on lagged features, and recurrent or transformer-based neural networks.
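Simple exponential smoothing, the most basic member of the exponential smoothing family, can be sketched in a few lines. The `page_views` series and the candidate `alpha` values are hypothetical; the key point is that each forecast uses only observations from earlier time steps.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialize the level with the first observation
    levels = [level]
    for value in series[1:]:
        # New level is a weighted blend of the latest value and the old level.
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels


def one_step_forecasts(series, alpha):
    """Forecast for time t is the smoothed level at time t - 1."""
    levels = simple_exponential_smoothing(series, alpha)
    return levels[:-1]  # aligned with series[1:]


# Hypothetical daily page views for a small site.
page_views = [120, 132, 128, 141, 150, 147, 160, 158, 171, 169]

for alpha in (0.2, 0.5, 0.8):
    forecasts = one_step_forecasts(page_views, alpha)
    errors = [abs(a - f) for a, f in zip(page_views[1:], forecasts)]
    print("alpha:", alpha, "one-step MAE:", round(sum(errors) / len(errors), 2))
```

A larger `alpha` reacts faster to recent changes, while a smaller one smooths more aggressively; this variant has no trend or seasonal component, so it will lag a steadily rising series.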

### Natural language processing and computer vision

Natural language processing covers tasks such as classification, named entity recognition, machine translation, and question answering. Computer vision covers image classification, object detection, segmentation, and generation. Both subfields have been transformed by deep learning since the early 2010s, and most production systems in 2026 build on pre-trained transformer models rather than training from scratch.

### Deep learning

Deep learning uses multilayer neural networks to learn representations directly from raw inputs. It has become the dominant approach when data are abundant and when the input is unstructured, such as images, audio, or text. Researchers including [Geoffrey Hinton](/wiki/geoffrey_hinton), Yann LeCun, and Yoshua Bengio shared the 2018 Turing Award for their foundational work.

### A/B testing and causal inference

Many business questions are causal: Would changing a button color increase clicks? Does a training program improve outcomes? Randomized experiments, often called A/B tests, are the gold standard. When randomization is not possible, observational causal inference techniques such as instrumental variables, regression discontinuity, propensity score matching, and difference-in-differences provide alternatives, each with its own assumptions and risks.
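One common way to analyze a simple A/B test on conversion rates is a two-proportion z-test. The sketch below is one such analysis under the usual large-sample normal approximation; the visitor and conversion counts are invented for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided z-test for a difference in conversion rates.

    Returns (z, p_value). Uses the pooled standard error, which assumes the
    null hypothesis that both variants share one true conversion rate.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: variant B changes the button color.
z, p_value = two_proportion_z_test(
    conversions_a=200, visitors_a=5000,
    conversions_b=260, visitors_b=5000,
)
print("z statistic:", round(z, 3))
print("two-sided p-value:", round(p_value, 4))
```

The test only supports a causal reading because assignment to A or B was randomized; the same arithmetic applied to observational groups would not.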

## What tools do data scientists use?

The practical work of data science is mediated by a large and rapidly evolving toolchain. The table below lists representative tools across the major categories.

| Category | Examples |
| --- | --- |
| Languages | Python, R, SQL, Julia, Scala |
| Notebooks and IDEs | Jupyter, Google Colab, Databricks notebooks, Kaggle Kernels, VS Code, RStudio |
| Numerical and ML libraries (Python) | NumPy, pandas, scikit-learn, statsmodels, XGBoost, LightGBM, PyTorch, TensorFlow, JAX |
| R packages | tidyverse, data.table, caret, tidymodels, ggplot2 |
| Visualization | matplotlib, seaborn, plotly, Altair, Tableau, Power BI, Looker |
| Big data and distributed compute | Apache Spark, Hadoop, Dask, Polars, Ray |
| Workflow and orchestration | Airflow, Prefect, Dagster, dbt |
| MLOps and experiment tracking | MLflow, Weights & Biases, Kubeflow, Neptune, Comet |
| Cloud platforms | AWS SageMaker, Google Vertex AI, Azure ML, Databricks, Snowflake |
| Foundation model APIs | Anthropic Claude, OpenAI GPT, Google Gemini, Cohere |

[Python](/wiki/python) has become the dominant language for data science, with [pandas](/wiki/pandas) for tabular manipulation, [NumPy](/wiki/numpy) for numerical arrays, and [scikit-learn](/wiki/scikit_learn) for classical machine learning. [R](/wiki/r_programming_language) remains popular in statistics and biostatistics, particularly within academia. [SQL](/wiki/sql) is essentially universal: nearly every data scientist queries a relational warehouse on a daily basis. [Tableau](/wiki/tableau) and [Power BI](/wiki/power_bi) dominate enterprise dashboarding. [TensorFlow](/wiki/tensorflow) and PyTorch are the leading deep learning frameworks, with PyTorch ahead in research and roughly even with TensorFlow in industrial deployment.

## How big is the data science market?

Data science is one of the fastest-growing segments of the technology industry, although market sizing estimates vary widely depending on how analysts draw the boundary around platforms, services, and adjacent analytics tools. IMARC Group valued the global data science platform market at 19.3 billion dollars in 2025 and projects it to reach 163.4 billion dollars by 2034, a compound annual growth rate of 26 percent over 2026 to 2034. [13] Other research firms publish figures spanning from roughly 15 billion to more than 200 billion dollars for the same period, reflecting differences in scope and methodology rather than a settled consensus.

The labor market has expanded in parallel. The U.S. Bureau of Labor Statistics classified data scientist as one of the fastest-growing occupations in the United States, projecting about 23,400 openings each year on average over the 2024 to 2034 decade, driven by rising demand for data-driven decisions across industries. [8]

## How do you become a data scientist? Education and training

Formal education in data science has expanded rapidly. The first standalone undergraduate major in the United States was launched at the College of Charleston in 2007. UC Berkeley assembled a campus-wide task force in 2014 and graduated its first dedicated data science majors in 2018. [Stanford University](/wiki/stanford_university), [MIT](/wiki/mit), Carnegie Mellon, and New York University followed with majors, minors, master's degrees, and entire schools or colleges of computing and data. The MIT Stephen A. Schwarzman College of Computing was announced in 2018 with a one billion dollar commitment, and the Berkeley Division of Computing, Data Science, and Society was established the same year.

Massive open online courses (MOOCs) opened the field to learners worldwide. Andrew Ng's Machine Learning course on Coursera, launched in 2011, became one of the most popular MOOCs in history, with millions of enrollments. The Coursera Data Science Specialization from Johns Hopkins, the IBM Data Science Professional Certificate, the Microsoft Professional Program, the fast.ai Practical Deep Learning courses, and the DeepLearning.AI specializations have together trained millions of additional learners. Kaggle competitions, with cash prizes and a public leaderboard, became a parallel proving ground.

Bootcamps including General Assembly, Metis, Springboard, and Insight Data Science offered intensive immersion programs targeted at career changers. Many have closed or pivoted as the labor market has saturated and as universities have caught up, but the format remains influential.

Academic certification has been complemented by community- and vendor-driven credentials, such as the Hugging Face certifications for natural language processing and the associate-level data and machine learning certifications offered by the cloud providers AWS, Google Cloud, and Microsoft Azure.

## What jobs and salaries exist in data science?

The "data scientist" title now describes a wide range of jobs. Most organizations distinguish among several closely related roles, although boundaries blur and titles vary. The table below summarizes the most common distinctions in industry as of 2026.

| Role | Primary focus | Typical tools |
| --- | --- | --- |
| Data analyst | Descriptive analytics, reporting, dashboards, ad-hoc analysis | SQL, Excel, Tableau, Power BI |
| Data scientist | Statistical analysis, predictive modeling, experimentation | Python, R, SQL, Jupyter, scikit-learn |
| Machine learning engineer | Building, deploying, and scaling production ML systems | Python, PyTorch, TensorFlow, Kubernetes, MLflow |
| Data engineer | Designing and operating data pipelines, warehouses, and lakes | SQL, Spark, Airflow, dbt, cloud data platforms |
| Statistician | Study design, inference, causal analysis, especially in clinical or government settings | R, SAS, Stata |
| Research scientist | Original ML or AI research, often with publications | PyTorch, JAX, large compute clusters |
| Analytics engineer | Bridging data engineering and analysis through modeled tables | dbt, SQL, BI tools |

Industries hiring data scientists include technology, finance, healthcare and life sciences, retail and e-commerce, consumer packaged goods, telecommunications, energy, transportation, government, and consulting. Tech and finance pay the highest base salaries, while healthcare and government often offer mission-driven work and stronger job security.

### How much do data scientists earn?

The U.S. Bureau of Labor Statistics reported a median annual wage of 112,590 dollars for data scientists in May 2024, the most recent year available. [8] The lowest-earning ten percent made less than 63,650 dollars, while the top ten percent made more than 194,410 dollars. [8] The median wage was higher in scientific research and development services at 120,090 dollars. The Bureau projects 34 percent employment growth between 2024 and 2034, much faster than the average for all occupations, with about 23,400 openings projected each year. [8]

Compensation at major technology firms can substantially exceed the BLS median once equity is included. Salary aggregator Levels.fyi consistently shows total compensation for senior data scientists and machine learning engineers at large tech companies in the 250,000 to 500,000 dollar range, with staff and principal levels reaching higher.

## How are large language models changing data science?

The deployment of capable foundation models has changed daily practice. Several patterns are now widespread.

First, code generation has become a default. Tools such as GitHub Copilot, Anthropic Claude, and ChatGPT can produce functional pandas code, scikit-learn pipelines, SQL queries, and visualizations from natural language descriptions. Most data scientists use these tools to scaffold analyses and to recall syntax for less familiar libraries, then edit and verify the results.

Second, exploratory analysis can be conversational. ChatGPT's Advanced Data Analysis (originally launched in 2023 as Code Interpreter) accepts uploaded files, executes Python in a sandbox, and reports results with charts. Anthropic's Claude offers analogous capabilities, including an Analysis tool that runs JavaScript or Python on the user's data. Specialized notebooks such as Hex, Mode, and Deepnote have integrated similar assistants, and many established platforms have added their own.

Third, the modeling step has been partly displaced. For text classification, named entity recognition, summarization, and question answering, calling a hosted large language model is often faster and more accurate than training a bespoke model. The data scientist's job becomes one of evaluation, prompt design, and orchestration: building retrieval-augmented generation pipelines, choosing the right model, deciding when fine-tuning is justified, and monitoring outputs for hallucination, bias, and drift.
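The evaluation side of that job can be sketched without committing to any particular vendor. In the example below, `keyword_stub_classifier` is a hypothetical stand-in for a call to a hosted model (no real API is used), and the six labeled examples are invented; the harness itself only needs a callable that maps text to a label.

```python
from collections import Counter


def keyword_stub_classifier(text):
    """Hypothetical stand-in for a hosted model call; swap in a real client."""
    lowered = text.lower()
    return "complaint" if "refund" in lowered or "broken" in lowered else "other"


def evaluate(classifier, labeled_examples):
    """Return accuracy plus per-label precision and recall."""
    true_positive, predicted, actual = Counter(), Counter(), Counter()
    correct = 0
    for text, label in labeled_examples:
        guess = classifier(text)
        predicted[guess] += 1
        actual[label] += 1
        if guess == label:
            correct += 1
            true_positive[label] += 1
    report = {"accuracy": correct / len(labeled_examples)}
    for label in actual:
        report[label] = {
            "precision": (
                true_positive[label] / predicted[label] if predicted[label] else 0.0
            ),
            "recall": true_positive[label] / actual[label],
        }
    return report


# Hypothetical hand-labeled evaluation set, kept separate from any prompt examples.
labeled_examples = [
    ("I want a refund for this order", "complaint"),
    ("The screen arrived broken", "complaint"),
    ("This is unacceptable service", "complaint"),
    ("What are your opening hours?", "other"),
    ("Thanks, the refund policy page was helpful", "other"),
    ("Do you ship to Canada?", "other"),
]

report = evaluate(keyword_stub_classifier, labeled_examples)
print(report)
```

Because the harness treats the classifier as an opaque callable, the same labeled set can be reused to compare different models or prompt variants.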

Fourth, the boundary with software engineering has tightened. Production AI systems involve agents, tools, retrieval indices, vector databases, observability, and complex prompt assembly. Many data science teams now require engineering practices that were once optional.

The traditional core of data science remains intact. Cleaning data, framing problems, designing experiments, building reliable predictive models, and communicating results to non-technical audiences are still central. Foundation models do not replace causal inference, time series forecasting, or rigorous A/B testing.

## What institutions and journals shape data science?

A growing number of universities have organized substantial portions of their faculty around data science.

| Institution | Unit | Notes |
| --- | --- | --- |
| UC Berkeley | Division of Computing, Data Science, and Society | Established 2018; offers undergraduate major and graduate programs |
| MIT | Stephen A. Schwarzman College of Computing | Announced 2018; one billion dollar commitment |
| Stanford University | Stanford Data Science | Cross-school initiative; AI Index report |
| New York University | Center for Data Science | Founded 2013; offers Ph.D. and master's programs |
| Carnegie Mellon University | Machine Learning Department, Statistics & Data Science | First Ph.D. in machine learning, founded 2006 |
| Columbia University | Data Science Institute | Founded 2012 |
| University of Michigan | Michigan Institute for Data Science | Founded 2015 |
| University of Washington | eScience Institute | Founded 2008 |
| Harvard University | Harvard Data Science Initiative | Founded 2017 |

Leading journals and venues include the Journal of Data Science (founded 2003), the Harvard Data Science Review (founded 2019), the International Statistical Review, the Journal of Machine Learning Research, and the proceedings of conferences such as NeurIPS, ICML, KDD, and ICLR. Industry-facing venues such as the O'Reilly Strata Data Conference and Spark + AI Summit have also shaped the practical literature.

## What are the main criticisms of data science?

Data science attracts steady criticism, much of which is well grounded.

The "sexiest job" framing oversold the day-to-day reality. Surveys have repeatedly found that data scientists spend most of their time on data cleaning, plumbing, and stakeholder management rather than on modeling. Many practitioners have written about the gap between the rich machine learning curricula taught in universities and the unglamorous warehouse work that dominates their first years on the job.

Reproducibility has been a persistent concern. Predictive models published in academic papers often fail to generalize. A widely cited 2023 review identified data leakage as a pervasive problem in applied machine learning research, with a substantial share of published claims overstated because evaluation data leaked into training. Industry teams have built MLOps tooling in part to address these issues.
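One frequent form of leakage is preprocessing that is fitted on the full data set before the train/test split. The sketch below contrasts a leak-free standardization, where the mean and standard deviation come from the training rows only, with a leaky one. The feature values and function names are hypothetical, and this illustrates only one kind of leakage among several.

```python
import statistics


def fit_standardizer(values):
    """Learn the mean and population standard deviation from `values` only."""
    return statistics.fmean(values), statistics.pstdev(values)


def standardize(values, mean, std):
    """Apply a previously fitted mean and standard deviation."""
    return [(v - mean) / std for v in values]


# Hypothetical time-ordered feature; the last two rows are the evaluation set.
feature = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 30.0, 32.0]
train, test = feature[:6], feature[6:]

# Leak-free: the statistics come from the training rows alone and are
# then reused unchanged on the evaluation rows.
mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)

# Leaky (shown for contrast): statistics computed on train and test together,
# so information about the evaluation rows shapes the training inputs.
leaky_mean, leaky_std = fit_standardizer(feature)

print("train-only mean/std:", round(mean, 2), round(std, 2))
print("leaky mean/std:", round(leaky_mean, 2), round(leaky_std, 2))
```

The gap between the two sets of statistics shows how much the evaluation rows would have influenced the training inputs had the scaler been fitted on everything.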

Algorithmic bias and ethics have moved from a fringe topic to a central one. Models trained on historical data can perpetuate or amplify discrimination in lending, hiring, criminal justice, and healthcare. The 2016 ProPublica investigation of the COMPAS recidivism algorithm and Joy Buolamwini's research on facial recognition accuracy by skin tone are commonly cited examples. Regulators in the European Union, the United States, and elsewhere have introduced rules requiring documentation, explanation, and impact assessment for high-risk automated decisions.

Title inflation and skill overlap have created confusion. A survey of job postings shows that the "data scientist" label can refer to a SQL analyst, a Python modeler, a research scientist with a Ph.D., or a machine learning engineer. Some companies have stopped using the title in favor of more specific roles such as analytics engineer or applied scientist.

Finally, the rapid arrival of foundation models has produced anxiety about the longevity of the role. Some tasks that once filled much of a junior data scientist's quarter can now be drafted in minutes with a chatbot, although the output still needs checking. Most observers expect the role to continue evolving, with humans focusing on framing, validation, and stakeholder communication, but the transition is uncomfortable for many practitioners.

## ELI5: What is data science?

Imagine you have a giant pile of LEGO bricks all mixed together, plus a notebook where you wrote down every time it rained for a year. Data science is the job of sorting through big messy piles of information like that, finding the hidden patterns (maybe red bricks show up more on rainy days), and then using those patterns to make a smart guess about what happens next. A data scientist is part detective, part math whiz, and part storyteller: they clean up the mess, do the math, and then explain what they found in a way that helps people make better decisions.

## See also

- [Machine learning](/wiki/machine_learning)
- [Deep learning](/wiki/deep_learning)
- [Statistics](/wiki/statistics)
- [Big data](/wiki/big_data)
- [Data mining](/wiki/data_mining)
- [Python](/wiki/python)
- [Large language model](/wiki/large_language_model)

## References

1. Tukey, John W. "The Future of Data Analysis." Annals of Mathematical Statistics, vol. 33, no. 1, 1962, pp. 1-67. https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-33/issue-1/The-Future-of-Data-Analysis/10.1214/aoms/1177704711.full
2. Naur, Peter. Concise Survey of Computer Methods. Studentlitteratur, 1974. http://www.naur.com/Conc.Surv.html
3. Hayashi, Chikio, et al., editors. Data Science, Classification, and Related Methods: Proceedings of the Fifth Conference of the International Federation of Classification Societies (IFCS-96), Kobe, Japan. Springer, 1998. https://link.springer.com/book/10.1007/978-4-431-65950-1
4. Wu, C. F. Jeff. "Statistics = Data Science?" Inaugural lecture, H.C. Carver Professorship, University of Michigan, 1997. https://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf
5. Cleveland, William S. "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics." International Statistical Review, vol. 69, no. 1, 2001, pp. 21-26. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1751-5823.2001.tb00477.x
6. Davenport, Thomas H., and DJ Patil. "Data Scientist: The Sexiest Job of the 21st Century." Harvard Business Review, October 2012. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
7. The White House. "The White House Names Dr. DJ Patil as the First U.S. Chief Data Scientist." Obama White House Archives, 18 February 2015. https://obamawhitehouse.archives.gov/blog/2015/02/18/white-house-names-dr-dj-patil-first-us-chief-data-scientist
8. U.S. Bureau of Labor Statistics. "Data Scientists." Occupational Outlook Handbook, 2025 edition. https://www.bls.gov/ooh/math/data-scientists.htm
9. Donoho, David. "50 Years of Data Science." Journal of Computational and Graphical Statistics, vol. 26, no. 4, 2017, pp. 745-766. https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734
10. "Cross-Industry Standard Process for Data Mining (CRISP-DM)." Wikipedia. https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
11. Davenport, Thomas H., and DJ Patil. "Is Data Scientist Still the Sexiest Job of the 21st Century?" Harvard Business Review, July-August 2022. https://hbr.org/2022/07/is-data-scientist-still-the-sexiest-job-of-the-21st-century
12. Press, Gil. "A Very Short History of Data Science." Forbes, 28 May 2013. https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/
13. IMARC Group. "Data Science Platform Market Size, Share, Trends and Forecast 2026-2034." IMARC Group, 2025. https://www.imarcgroup.com/data-science-platform-market