Data Analysis
Last reviewed
Sources
15 citations
Review status
Source-backed
Revision
v4 · 3,352 words
Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making.[1] It combines techniques from statistics, mathematics, and computer science. Since 2023 it has also been reshaped by tools built on large language models, such as OpenAI's Code Interpreter (later renamed Advanced Data Analysis), which let users analyze uploaded data sets in natural language by generating and running Python code in a sandbox. Data analysis plays a central role in fields ranging from scientific research and business intelligence to machine learning and artificial intelligence.
Data analysis is not a single step but an iterative workflow. Analysts cycle through phases of collection, cleaning, exploration, modeling, and communication, refining their understanding of a data set at each pass. The discipline has deep roots in classical statistics, but its modern form has been shaped by the explosive growth of digital data, the open-source Python and R ecosystems, and, most recently, AI assistants that automate large portions of the workflow.
The foundations of data analysis trace back to early statistical methods developed between the 17th and early 19th centuries, when scholars such as John Graunt and Carl Friedrich Gauss introduced techniques for summarizing and interpreting numerical observations. However, the modern concept of data analysis as a distinct discipline gained prominence in the mid-20th century.
In 1962, the mathematician and statistician John W. Tukey published "The Future of Data Analysis," arguing that statistics should broaden its focus beyond formal inference to include the practical art of examining data.[2] Tukey went on to develop Exploratory Data Analysis (EDA), publishing his landmark book of the same name in 1977.[3] In this work, Tukey contended that too much emphasis had been placed on confirmatory hypothesis testing, and that analysts needed robust visual and numerical tools to let the data itself suggest hypotheses worth testing.
Tukey's advocacy spurred the development of statistical computing environments, most notably the S programming language at Bell Labs, which later inspired S-PLUS and, ultimately, the R programming language. A second major shift arrived in 2008, when Wes McKinney began building pandas while working at the quantitative investment firm AQR Capital Management; he made the project public in 2009.[5] pandas brought R-style data frames to Python and turned Python into the dominant language for tabular data analysis. These tools made it practical for analysts to apply EDA techniques at scale, laying the groundwork for the data-driven workflows that dominate data science and machine learning today.
Data analysis can be categorized into several distinct types, each addressing different questions and objectives.
| Type | Core Question | Description |
|---|---|---|
| Descriptive | What happened? | Summarizes historical data using aggregates such as means, medians, counts, and percentages. Dashboards and standard reports are common outputs. |
| Diagnostic | Why did it happen? | Drills into data to identify causes and correlations behind observed patterns. Techniques include root-cause analysis and drill-down analysis. |
| Predictive | What is likely to happen? | Uses statistical models and machine learning algorithms to forecast future outcomes based on historical data. |
| Prescriptive | What should we do? | Recommends optimal actions by combining predictive models with optimization and simulation techniques. |
| Exploratory (EDA) | What patterns exist? | Uses visualization and summary statistics to discover previously unknown patterns, trends, and anomalies in data. |
| Confirmatory (CDA) | Is my hypothesis supported? | Tests pre-specified hypotheses using formal statistical methods, controlling for error rates. |
Although specific workflows vary by domain, most data analysis projects follow a common sequence of phases. These phases are iterative: findings in later stages often require revisiting earlier steps.
Before any data is collected, analysts must clarify the questions they want to answer and the decisions the analysis will support. This stage involves identifying relevant variables, defining success metrics, and setting the scope of the investigation.
Data is gathered from sources such as databases, APIs, sensors, surveys, web scraping, or public data set repositories. The choice of sources depends on availability, reliability, and relevance to the analysis goals.
Raw data almost always contains errors, inconsistencies, and gaps. Data cleaning (also called data cleansing or data scrubbing) addresses these issues through techniques such as handling missing values, removing duplicate records, and correcting inconsistent data types.
Data cleaning is often the most time-consuming phase of analysis. Surveys of data professionals consistently find that 60 to 80 percent of project time is spent on data preparation.[4]
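A minimal sketch of a cleaning pass in pandas, using a small made-up table. The column names, the values, and the choice of median imputation are all assumptions for illustration; the right fix for missing or inconsistent values depends on the data set.

```python
import pandas as pd

# Hypothetical raw records: one duplicate row, one missing age,
# ages stored as text, and inconsistent city formatting.
raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cy", "Dee"],
    "age": ["34", "41", "41", None, "29"],
    "city": [" london", "Paris", "Paris", "PARIS ", "London"],
})

clean = raw.drop_duplicates().copy()                         # remove exact duplicate rows
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")  # text -> numbers, unparseable -> NaN
clean["age"] = clean["age"].fillna(clean["age"].median())    # impute missing ages with the median
clean["city"] = clean["city"].str.strip().str.title()        # standardize whitespace and casing
clean = clean.reset_index(drop=True)

print(clean)
```

Median imputation is only one option; dropping incomplete rows or flagging them for review may be more appropriate when missing values are not random.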
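Once data is clean, it must be restructured into a format suitable for analysis. Common transformations include filtering, grouping and aggregating, merging, and reshaping tables, as well as scaling numerical values and encoding categorical variables.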
These steps overlap significantly with feature engineering, especially in machine learning workflows.
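As a rough illustration, the following pandas sketch applies four common transformations to a made-up `orders` table: deriving a column, min-max scaling, one-hot encoding, and aggregation. The table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical order data.
orders = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 20, 30, 40],
    "price": [2.0, 2.5, 2.0, 3.0],
})

# Derive a new column from existing ones.
orders["revenue"] = orders["units"] * orders["price"]

# Min-max scale a numerical column to the [0, 1] range.
units = orders["units"]
orders["units_scaled"] = (units - units.min()) / (units.max() - units.min())

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(orders, columns=["region"], dtype=int)

# Aggregate to one row per region.
by_region = orders.groupby("region", as_index=False)["revenue"].sum()

print(encoded)
print(by_region)
```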
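EDA is the phase where analysts develop an intuitive understanding of the data through visualization and summary statistics. Tukey described EDA as "detective work" that precedes formal testing.[3] A typical EDA workflow includes reviewing summary statistics, checking for missing values, plotting distributions, and examining correlations between variables.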
Modern tools such as pandas-profiling (now YData Profiling) can automate much of this workflow, generating comprehensive reports that include distribution plots, missing-value summaries, correlation heatmaps, and interaction visualizations from a single line of Python code.[6]
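Without a profiling tool, a first exploratory pass can be approximated with a few plain pandas calls. The sketch below uses a made-up table of study hours and exam scores, and it assumes pandas 1.5 or later for the `numeric_only` argument to `corr`.

```python
import pandas as pd

# Hypothetical data set with one missing exam score.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 70, None, 85],
    "group": ["a", "b", "a", "b", "a", "b"],
})

print(df.shape)                            # number of rows and columns
print(df.dtypes)                           # inferred type of each column

missing = df.isna().sum()                  # missing values per column
summary = df.describe()                    # count, mean, std, min, quartiles, max
group_counts = df["group"].value_counts()  # frequency of each category
corr = df.corr(numeric_only=True)          # pairwise Pearson correlations

print(missing, summary, group_counts, corr, sep="\n\n")
```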
With a solid understanding of the data in hand, analysts apply formal methods to answer their research questions. These methods range from simple descriptive statistics to complex predictive models.
Findings must be communicated clearly to stakeholders through reports, dashboards, or presentations. Effective data communication matches the level of detail to the audience and uses appropriate visualizations to highlight key insights.
Descriptive statistics condense a data set into a handful of numbers that capture its essential characteristics. They form the backbone of almost every data analysis project.
| Measure | Definition | Best Used When |
|---|---|---|
| Mean | The arithmetic average of all values. | Data is roughly symmetric with few outliers. |
| Median | The middle value when data is sorted. | Data is skewed or contains extreme values. |
| Mode | The most frequently occurring value. | Data is categorical or you need the most common outcome. |
| Measure | Definition | Interpretation |
|---|---|---|
| Range | Difference between the maximum and minimum values. | Gives the total spread but is sensitive to outliers. |
| Variance | The average of squared deviations from the mean. | Quantifies overall dispersion; units are squared. |
| Standard deviation | The square root of variance. | Same units as the original data; the most commonly reported spread measure. |
| Interquartile range (IQR) | Difference between the 75th and 25th percentiles. | Robust to outliers; used in box plots. |
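The measures in the two tables above can be computed with Python's standard `statistics` module. The salary figures below are made up, with one deliberate outlier to show how the mean and the range react while the median and the IQR barely move. The sketch uses the population variance (`pvariance`), which matches the "average of squared deviations" definition; `statistics.variance` would give the sample version instead.

```python
import statistics

# Hypothetical salaries in thousands; 200 is a deliberate outlier.
salaries = [40, 42, 45, 47, 50, 52, 55, 200]

mean = statistics.mean(salaries)          # pulled upward by the outlier
median = statistics.median(salaries)      # robust to the outlier
data_range = max(salaries) - min(salaries)
variance = statistics.pvariance(salaries) # population variance, in squared units
std_dev = statistics.pstdev(salaries)     # square root of the variance

# Quartiles; other quantile conventions give slightly different cut points.
q1, q2, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1

# The mode is most useful for categorical data.
favorite = statistics.mode(["red", "blue", "blue", "green"])

print(mean, median, data_range, variance, std_dev, iqr, favorite)
```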
Visualization transforms numbers into pictures, making patterns and anomalies visible at a glance. Common chart types used in data analysis include:
| Visualization | Purpose | Example Use Case |
|---|---|---|
| Histogram | Shows the frequency distribution of a single numerical variable. | Checking whether exam scores follow a bell curve. |
| Box plot | Displays the five-number summary (min, Q1, median, Q3, max) and highlights outliers. | Comparing salary distributions across departments. |
| Scatter plot | Reveals the relationship between two numerical variables. | Exploring the correlation between advertising spend and revenue. |
| Heatmap | Uses color intensity to represent values in a matrix. | Visualizing a correlation matrix of data set features. |
| Bar chart | Compares quantities across categories. | Showing product sales by region. |
| Line chart | Tracks changes over time. | Displaying stock price trends over a year. |
Popular visualization libraries include matplotlib, Seaborn, Plotly, and ggplot2 (in R). Business intelligence platforms such as Tableau and Power BI provide drag-and-drop interfaces for creating interactive dashboards without writing code.
Correlation analysis measures the strength and direction of the relationship between two variables. The correlation coefficient ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship), with 0 indicating no linear association.
| Method | What It Measures | When to Use |
|---|---|---|
| Pearson correlation | Linear relationship between two continuous variables. | Both variables are approximately normally distributed with a linear trend. |
| Spearman correlation | Monotonic (but not necessarily linear) relationship based on ranks. | Data contains outliers, is ordinal, or the relationship is non-linear but monotonic. |
| Kendall's tau | Concordance between paired observations. | Small sample sizes or when a robust, distribution-free measure is needed. |
A critical principle to keep in mind is that correlation does not imply causation. Two variables may move together because they share a common underlying cause (a confounding variable) rather than because one directly affects the other.[7]
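To make the difference between Pearson and Spearman concrete, here is a small standard-library sketch on made-up data where `y` grows as the cube of `x`: the relationship is perfectly monotonic but not linear. The helper names `pearson` and `ranks` are illustrative, and the ranking helper assumes there are no tied values.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sx * sy)

def ranks(values):
    """Rank values from 1 to n (simplification: assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # monotonic but non-linear

pearson_r = pearson(x, y)                 # below 1: the trend is not a straight line
spearman_r = pearson(ranks(x), ranks(y))  # Spearman is Pearson applied to the ranks

print(pearson_r, spearman_r)
```

Spearman reports a perfect association here because it only looks at ordering, while Pearson is penalized for the curvature.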
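Hypothesis testing is a framework for making statistical decisions using data. The general procedure involves stating a null hypothesis and an alternative hypothesis, choosing a significance level (commonly 0.05), computing a test statistic and its p-value from the data, and rejecting the null hypothesis if the p-value falls below the chosen level.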
Common statistical tests include the t-test (comparing means of two groups), chi-square test (testing independence of categorical variables), ANOVA (comparing means of three or more groups), and the Mann-Whitney U test (a non-parametric alternative to the t-test).
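It is important to distinguish statistical significance from practical significance. A result can be statistically significant (for example, p < 0.05) while being too small in magnitude to matter in practice; with a very large sample, even a negligible difference between groups can produce a small p-value.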
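The sketch below walks through a two-group comparison on made-up measurements. It uses an exact permutation test rather than the t-test named above, because a permutation test needs only the standard library; the null hypothesis is that the group labels are interchangeable, and the test statistic is the difference in means. The data and the 0.05 threshold are assumptions for illustration.

```python
from itertools import combinations
from statistics import mean

# Hypothetical measurements for two groups.
control = [12.1, 11.8, 12.4, 12.0, 11.9]
treatment = [12.9, 13.1, 12.7, 13.4, 12.8]

observed = mean(treatment) - mean(control)  # test statistic on the real labels
pooled = control + treatment
n = len(treatment)

# Under the null hypothesis every relabeling is equally likely, so
# enumerate each way of calling n of the pooled values "treatment".
extreme = 0
total = 0
for idx in combinations(range(len(pooled)), n):
    chosen = set(idx)
    group_a = [pooled[i] for i in chosen]
    group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    diff = mean(group_a) - mean(group_b)
    if abs(diff) >= abs(observed) - 1e-12:  # two-sided: at least as extreme
        extreme += 1
    total += 1

p_value = extreme / total  # share of relabelings as extreme as the data
alpha = 0.05               # significance level chosen in advance
reject_null = p_value < alpha

print(f"difference={observed:.2f}, p={p_value:.4f}, reject null: {reject_null}")
```

Full enumeration is only feasible for small samples; larger ones are usually handled by sampling random relabelings or by a parametric test such as the t-test.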
Since 2023, large language models have moved from generating code snippets to running entire analysis workflows end to end. The most influential example is OpenAI's Code Interpreter, which OpenAI began rolling out in beta to ChatGPT Plus users on July 6, 2023, and later renamed Advanced Data Analysis.[9][10] The feature lets a user upload a file (CSV, Excel, JSON, or even a SQLite database), describe a goal in plain English, and have ChatGPT write Python, execute it in a sandboxed environment, and return both the result and the code that produced it.
The sandbox runs server-side with no internet access from user code, and it comes pre-loaded with the standard Python data stack, including pandas, NumPy, matplotlib, Seaborn, scikit-learn, and statsmodels.[10] Because the model can read errors and revise its own code, it iterates toward a working analysis without the user writing any Python. A 2023 Datanami report described the tool as "your personal data analyst."[11]
OpenAI expanded these capabilities on May 16, 2024, days after the launch of GPT-4o. The update added interactive tables and charts and the ability to upload files directly from Google Drive and Microsoft OneDrive, and it was rolled out to ChatGPT Plus, Team, and Enterprise users.[12]
| Capability | What the LLM does | Practical effect |
|---|---|---|
| Natural-language querying | Translates a plain-English question into pandas or SQL code. | Non-programmers can run analyses without learning syntax. |
| Automated cleaning | Detects missing values, inconsistent types, and duplicates, then proposes fixes. | Speeds up the most time-consuming phase of analysis. |
| EDA and charting | Generates summary statistics and renders matplotlib or Seaborn plots from uploaded data. | Produces a first pass of exploratory analysis in seconds. |
| Code generation and self-correction | Writes Python, runs it, reads the error trace, and revises. | Reduces the manual debugging loop. |
| Explanation | Describes results and methods in prose. | Makes findings accessible to non-technical stakeholders. |
Beyond single-turn assistants, a newer class of AI data agents chains many tool calls together to complete a multi-step analysis with minimal supervision. An agent can plan a workflow, query a database, clean and join tables, fit a model, and assemble a report, deciding for itself which step comes next. This builds on the same execute-observe-revise loop that Code Interpreter popularized, extended across an entire pipeline. Such systems still require human review: LLMs can misread a schema, choose an inappropriate statistical test, or hallucinate a column that does not exist, so analysts must verify generated code and results rather than trusting them blindly.
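The execute-observe-revise loop can be sketched in a few lines. Everything here is hypothetical: `fake_model` is a stand-in for a real LLM call that returns buggy code first and a corrected version once it is shown an error, and `run_with_retries` is an illustrative name, not part of any product's API. A real system would run the code in an isolated sandbox rather than with `exec`.

```python
import traceback

def fake_model(prompt, error=None):
    """Hypothetical stand-in for an LLM: buggy code first, a fix after seeing an error."""
    if error is None:
        return "result = sum(values) / len(vals)"  # bug: 'vals' is undefined
    return "result = sum(values) / len(values)"

def run_with_retries(prompt, env, max_attempts=3):
    """Generate code, execute it, and feed any error trace back to the model."""
    error = None
    for attempt in range(1, max_attempts + 1):
        code = fake_model(prompt, error)
        scope = dict(env)  # fresh copy of the input data for each attempt
        try:
            exec(code, {}, scope)  # illustration only; not a safe sandbox
            return scope["result"], attempt
        except Exception:
            error = traceback.format_exc()  # observe the failure, then revise
    raise RuntimeError("no working code after retries")

result, attempts = run_with_retries("average the values", {"values": [2, 4, 6]})
print(result, attempts)
```

The loop only guarantees that the code runs, not that it answers the right question, which is why the human review described above remains necessary.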
Exploratory data analysis is not just a standalone activity; it is a critical step in any machine learning pipeline. EDA informs decisions at nearly every subsequent stage, including data cleaning, feature engineering, and model selection.
Automated EDA tools such as YData Profiling, Sweetviz, and D-Tale accelerate this phase by generating comprehensive reports with minimal code.[6]
The ecosystem of data analysis tools has grown substantially, spanning programming languages, libraries, AI assistants, and commercial platforms.
| Tool | Type | Key Strengths |
|---|---|---|
| pandas | Python library | De facto standard for tabular data manipulation; rich API for filtering, grouping, merging, and reshaping data. |
| NumPy | Python library | Provides efficient array operations and mathematical functions that underpin most scientific Python libraries. |
| matplotlib | Python library | Foundational plotting library offering fine-grained control over chart customization. |
| Seaborn | Python library | Built on matplotlib; provides high-level statistical visualization functions with attractive defaults. |
| Code Interpreter / Advanced Data Analysis | LLM tool | Runs Python in a ChatGPT sandbox from natural-language prompts; cleans, analyzes, and charts uploaded files without manual coding. |
| R | Programming language | Designed for statistical computing; strong in academic and research settings; ggplot2 offers a powerful grammar of graphics. |
| SQL | Query language | Essential for extracting and aggregating data stored in relational databases; remains one of the most-used languages among professional developers in the 2025 Stack Overflow Developer Survey.[13] |
| Tableau | BI platform | Drag-and-drop visualization tool; strong with large data sets and real-time analytics dashboards. |
| Power BI | BI platform | Integrates with the Microsoft 365 ecosystem; cost-effective for organizations already using Microsoft products. |
| Polars | Python / Rust library | A newer alternative to pandas that offers significant speed improvements for memory-intensive operations through lazy evaluation and multi-threaded execution. |
| Excel | Spreadsheet | Widely accessible; suitable for small-scale analysis with pivot tables, charts, and built-in statistical functions. |
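Effective data analysis requires more than technical skill. Analysts should keep in mind the principles covered above: define the question before collecting data, remember that correlation does not imply causation, distinguish statistical significance from practical significance, verify AI-generated code and results, and match the level of detail to the audience.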
Imagine you have a big box of crayons all jumbled up. Data analysis is like sorting those crayons so you can understand what you have. First, you take out any broken ones and throw away duplicates (that is data cleaning). Then you sort them by color (that is organizing and transforming). Next, you count how many of each color you have and maybe line them up from lightest to darkest to see if you have more blues or more reds (that is exploratory data analysis). Finally, you tell your friend, "We have mostly blue and green crayons, and hardly any orange ones" (that is communicating your findings). Today you can even ask a smart computer helper to do a lot of the sorting and counting for you, but you still have to double-check that it sorted the crayons the way you meant. Data analysis helps people look at big piles of information and turn them into simple answers.