# Data Analysis

> Source: https://aiwiki.ai/wiki/data_analysis
> Updated: 2026-06-22
> Categories: Data Science, Machine Learning, Statistics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Data analysis** is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making.[1] It combines techniques from statistics, mathematics, and computer science, and since 2023 it has been reshaped by tools built on [large language models](/wiki/large_language_model), such as OpenAI's [Code Interpreter (Advanced Data Analysis)](/wiki/code_interpreter), which lets users analyze uploaded data sets in natural language by generating and running Python code in a sandbox. It plays a central role in fields ranging from scientific research and business intelligence to [machine learning](/wiki/machine_learning) and artificial intelligence.

Data analysis is not a single step but an iterative workflow. Analysts cycle through phases of collection, cleaning, exploration, modeling, and communication, refining their understanding of a [data set](/wiki/data_set_or_dataset) at each pass. The discipline has deep roots in classical statistics, but its modern form has been shaped by the explosive growth of digital data, the open-source Python and R ecosystems, and, most recently, AI assistants that automate large portions of the workflow.

## History

The foundations of data analysis trace back to early statistical methods developed between the 17th and early 19th centuries, when scholars such as John Graunt and Carl Friedrich Gauss introduced techniques for summarizing and interpreting numerical observations. However, the modern concept of data analysis as a distinct discipline gained prominence in the mid-20th century.

In 1962, the mathematician and statistician John W. Tukey published "The Future of Data Analysis," arguing that statistics should broaden its focus beyond formal inference to include the practical art of examining data.[2] Tukey went on to develop **Exploratory Data Analysis (EDA)**, publishing his landmark book of the same name in 1977.[3] In this work, Tukey contended that too much emphasis had been placed on confirmatory hypothesis testing, and that analysts needed robust visual and numerical tools to let the data itself suggest hypotheses worth testing.

Tukey's advocacy spurred the development of statistical computing environments, most notably the S programming language at Bell Labs, which later inspired S-PLUS and, ultimately, the R programming language. A second major shift arrived in 2008, when Wes McKinney began building [pandas](/wiki/pandas) while working at the quantitative investment firm AQR Capital Management; he made the project public in 2009.[5] pandas brought R-style data frames to Python and turned Python into the dominant language for tabular data analysis. These tools made it practical for analysts to apply EDA techniques at scale, laying the groundwork for the data-driven workflows that dominate data science and [machine learning](/wiki/machine_learning) today.

## Types of Data Analysis

Data analysis can be categorized into several distinct types, each addressing different questions and objectives.

| Type | Core Question | Description |
|------|--------------|-------------|
| **Descriptive** | What happened? | Summarizes historical data using aggregates such as means, medians, counts, and percentages. Dashboards and standard reports are common outputs. |
| **Diagnostic** | Why did it happen? | Drills into data to identify causes and correlations behind observed patterns. Techniques include root-cause analysis and drill-down analysis. |
| **Predictive** | What is likely to happen? | Uses statistical models and [machine learning](/wiki/machine_learning) algorithms to forecast future outcomes based on historical data. |
| **Prescriptive** | What should we do? | Recommends optimal actions by combining predictive models with optimization and simulation techniques. |
| **Exploratory (EDA)** | What patterns exist? | Uses visualization and summary statistics to discover previously unknown patterns, trends, and anomalies in data. |
| **Confirmatory (CDA)** | Is my hypothesis supported? | Tests pre-specified hypotheses using formal statistical methods, controlling for error rates. |

## The Data Analysis Process

Although specific workflows vary by domain, most data analysis projects follow a common sequence of phases. These phases are iterative: findings in later stages often require revisiting earlier steps.

### 1. Defining Requirements

Before any data is collected, analysts must clarify the questions they want to answer and the decisions the analysis will support. This stage involves identifying relevant variables, defining success metrics, and setting the scope of the investigation.

### 2. Data Collection

Data is gathered from sources such as databases, APIs, sensors, surveys, web scraping, or public [data set](/wiki/data_set_or_dataset) repositories. The choice of sources depends on availability, reliability, and relevance to the analysis goals.

### 3. Data Cleaning

Raw data almost always contains errors, inconsistencies, and gaps. Data cleaning (also called data cleansing or data scrubbing) addresses these issues through a range of techniques:

- **Handling missing values:** Imputing with mean, median, or mode values; using model-based imputation; or removing incomplete records when appropriate.
- **Removing duplicates:** Identifying and eliminating repeated records that could distort analysis results.
- **Correcting errors:** Fixing typos, standardizing formats (dates, units, naming conventions), and resolving conflicting entries.
- **Outlier detection:** Flagging data points that fall far outside expected ranges for further investigation. Outliers may indicate measurement errors or genuinely unusual observations.
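
As a minimal sketch of the techniques above, the following pandas snippet cleans a small, made-up orders table. The column names, the sample values, median imputation, and the 1.5 times IQR outlier rule are all illustrative assumptions; a real project would choose each of them based on the data.

```python
import pandas as pd

# Hypothetical raw data with inconsistent text, a duplicate, a gap, and an extreme value.
raw = pd.DataFrame({
    "customer": ["Ann", "ann ", "Bob", "Cara", "Dev", "Eli"],
    "city": ["Paris", "Paris", "Lyon", "lyon", "Nice", "Paris"],
    "order_total": [28.0, 28.0, None, 35.0, 30.0, 900.0],
})

clean = raw.copy()

# Correct errors first: standardize whitespace and capitalization in text columns,
# so that "ann " and "Ann" are recognized as the same customer.
for column in ["customer", "city"]:
    clean[column] = clean[column].str.strip().str.title()

# Remove duplicates that only become visible after standardization.
clean = clean.drop_duplicates().reset_index(drop=True)

# Handle missing values: impute with the median, which is robust to the 900.0 entry.
median_total = clean["order_total"].median()
clean["order_total"] = clean["order_total"].fillna(median_total)

# Flag (do not delete) outliers that fall outside 1.5 * IQR from the quartiles.
q1 = clean["order_total"].quantile(0.25)
q3 = clean["order_total"].quantile(0.75)
iqr = q3 - q1
clean["is_outlier"] = ~clean["order_total"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(clean)
```

Flagging rather than dropping the extreme order keeps the decision with the analyst, since an outlier may be a data-entry error or a genuine large purchase.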

Data cleaning is often the most time-consuming phase of analysis. Industry surveys, such as CrowdFlower's 2016 Data Science Report, are widely cited as finding that 60 to 80 percent of a data scientist's time goes to data preparation.[4]

### 4. Data Transformation

Once data is clean, it must be restructured into a format suitable for analysis. Common transformations include:

- **[Normalization](/wiki/normalization) and scaling:** Adjusting numerical features to a common range (for example, 0 to 1) so that no single variable dominates distance-based algorithms.
- **Encoding categorical variables:** Converting text categories into numerical representations using one-hot encoding, label encoding, or target encoding.
- **Aggregation:** Summarizing granular data at higher levels (daily sales rolled up to monthly totals, for instance).
- **Log and power transformations:** Reducing skewness in distributions to better meet the assumptions of parametric statistical tests.

These steps overlap significantly with [feature engineering](/wiki/feature_engineering), especially in [machine learning](/wiki/machine_learning) workflows.
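
The transformations listed above can be illustrated with a short pandas and NumPy sketch. The `sales` table and its column names are hypothetical, and only one variant of each technique is shown (min-max scaling, one-hot encoding, monthly aggregation, and a `log1p` transform).

```python
import numpy as np
import pandas as pd

# Hypothetical transaction-level data.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-17"]),
    "region": ["north", "south", "north", "south"],
    "revenue": [100.0, 300.0, 200.0, 1000.0],
})

# Min-max scaling: map revenue onto the range 0 to 1.
revenue = sales["revenue"]
sales["revenue_scaled"] = (revenue - revenue.min()) / (revenue.max() - revenue.min())

# Log transformation: log1p computes log(1 + x), which also handles zeros.
sales["revenue_log"] = np.log1p(revenue)

# One-hot encoding: one 0/1 indicator column per category.
encoded = pd.get_dummies(sales, columns=["region"], dtype=int)

# Aggregation: roll individual transactions up to monthly totals.
monthly = sales.groupby(sales["date"].dt.to_period("M"))["revenue"].sum()

print(encoded)
print(monthly)
```

In a modeling workflow, scaling parameters such as the minimum and maximum would normally be computed on the training split only and then reused on new data.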

### 5. Exploratory Data Analysis (EDA)

EDA is the phase where analysts develop an intuitive understanding of the data through visualization and summary statistics. Tukey described EDA as "detective work" that precedes formal testing.[3] A typical EDA workflow includes:

1. Computing descriptive statistics (mean, median, standard deviation, quantiles).
2. Plotting distributions with histograms and box plots.
3. Examining relationships between variables using scatter plots and correlation matrices.
4. Identifying clusters, trends, and anomalies.
5. Formulating hypotheses for later confirmatory testing.

Modern tools such as YData Profiling (formerly pandas-profiling, built on [pandas](/wiki/pandas)) can automate much of this workflow, generating comprehensive reports that include distribution plots, missing-value summaries, correlation heatmaps, and interaction visualizations from only a few lines of Python code.[6]
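
A hand-written first pass over the numeric parts of the EDA workflow might look like the following sketch. It uses plain pandas rather than a profiling tool, and the `df` table of study hours and exam scores is invented for illustration.

```python
import pandas as pd

# Hypothetical data set: study time, exam result, and a categorical field.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "exam_score": [52, 55, 61, 64, 70, 74, 79, 85],
    "major": ["math", "art", "math", "art", "math", "math", "art", "math"],
})

numeric = df.select_dtypes(include="number")

# Step 1: descriptive statistics (count, mean, std, min, quartiles, max).
summary = numeric.describe()

# Data-quality check: number of missing values per column.
missing = df.isna().sum()

# Step 3: pairwise relationships between numeric variables.
correlations = numeric.corr()

# Step 4: compare groups to spot trends or anomalies.
by_major = df.groupby("major")["exam_score"].agg(["count", "mean"])

print(summary, missing, correlations, by_major, sep="\n\n")
```

Patterns seen here, such as the strong association between study time and score, are candidates for hypotheses to test later, not conclusions in themselves.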

### 6. Modeling and Statistical Analysis

With a solid understanding of the data in hand, analysts apply formal methods to answer their research questions. These methods range from simple descriptive statistics to complex predictive models.

### 7. Communication

Findings must be communicated clearly to stakeholders through reports, dashboards, or presentations. Effective data communication matches the level of detail to the audience and uses appropriate visualizations to highlight key insights.

## Descriptive Statistics

Descriptive statistics condense a [data set](/wiki/data_set_or_dataset) into a handful of numbers that capture its essential characteristics. They form the backbone of almost every data analysis project.

### Measures of Central Tendency

| Measure | Definition | Best Used When |
|---------|-----------|----------------|
| **Mean** | The arithmetic average of all values. | Data is roughly symmetric with few outliers. |
| **Median** | The middle value when data is sorted. | Data is skewed or contains extreme values. |
| **Mode** | The most frequently occurring value. | Data is categorical or you need the most common outcome. |

### Measures of Spread

| Measure | Definition | Interpretation |
|---------|-----------|----------------|
| **Range** | Difference between the maximum and minimum values. | Gives the total spread but is sensitive to outliers. |
| **Variance** | The average of squared deviations from the mean (dividing by n for a population, or by n - 1 for a sample). | Quantifies overall dispersion; units are squared. |
| **Standard deviation** | The square root of variance. | Same units as the original data; the most commonly reported spread measure. |
| **Interquartile range (IQR)** | Difference between the 75th and 25th percentiles. | Robust to outliers; used in box plots. |
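
The measures in both tables can be computed with Python's standard `statistics` module. The sample below is made up and deliberately includes one extreme value (40) to show how the mean and range react to an outlier while the median and IQR do not.

```python
import statistics

values = [2, 4, 4, 5, 7, 9, 40]  # hypothetical sample with one extreme value

# Central tendency
mean = statistics.mean(values)      # about 10.14, pulled upward by the value 40
median = statistics.median(values)  # 5
mode = statistics.mode(values)      # 4

# Spread
value_range = max(values) - min(values)              # 38, dominated by the outlier
population_variance = statistics.pvariance(values)   # divides by n
sample_variance = statistics.variance(values)        # divides by n - 1
sample_std = statistics.stdev(values)                # square root of the sample variance

# Quartiles using the module's default "exclusive" method
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1                                         # 5.0

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"range={value_range} std={sample_std:.2f} IQR={iqr}")
```

Quartile conventions differ between libraries, so the IQR reported by another tool for the same data may not match this value exactly.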

## Data Visualization

Visualization transforms numbers into pictures, making patterns and anomalies visible at a glance. Common chart types used in data analysis include:

| Visualization | Purpose | Example Use Case |
|--------------|---------|------------------|
| **Histogram** | Shows the frequency distribution of a single numerical variable. | Checking whether exam scores follow a bell curve. |
| **Box plot** | Displays the five-number summary (min, Q1, median, Q3, max) and highlights outliers. | Comparing salary distributions across departments. |
| **Scatter plot** | Reveals the relationship between two numerical variables. | Exploring the correlation between advertising spend and revenue. |
| **Heatmap** | Uses color intensity to represent values in a matrix. | Visualizing a correlation matrix of [data set](/wiki/data_set_or_dataset) features. |
| **Bar chart** | Compares quantities across categories. | Showing product sales by region. |
| **Line chart** | Tracks changes over time. | Displaying stock price trends over a year. |

Popular visualization libraries include [matplotlib](/wiki/matplotlib), Seaborn, Plotly, and ggplot2 (in R). Business intelligence platforms such as Tableau and Power BI provide drag-and-drop interfaces for creating interactive dashboards without writing code.
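
As a rough sketch, three of the chart types above can be drawn with matplotlib as follows. The advertising and revenue figures are synthetic (generated from a seeded random number generator), and the non-interactive `Agg` backend is an assumption made so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window instead
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: revenue rises roughly linearly with advertising spend, plus noise.
rng = np.random.default_rng(seed=0)
ad_spend = rng.uniform(1, 10, size=200)
revenue = 3.0 * ad_spend + rng.normal(0, 2, size=200)

fig, (ax_hist, ax_box, ax_scatter) = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: frequency distribution of a single numerical variable.
counts, bin_edges, _ = ax_hist.hist(revenue, bins=20)
ax_hist.set_title("Histogram of revenue")

# Box plot: median, quartiles, whiskers, and any points beyond the whiskers.
ax_box.boxplot(revenue)
ax_box.set_title("Box plot of revenue")

# Scatter plot: relationship between two numerical variables.
ax_scatter.scatter(ad_spend, revenue, s=10)
ax_scatter.set_xlabel("Advertising spend")
ax_scatter.set_ylabel("Revenue")
ax_scatter.set_title("Spend vs. revenue")

fig.tight_layout()
# fig.savefig("eda_charts.png") would write the figure to disk.
```

Seaborn and Plotly offer higher-level functions for the same charts; matplotlib is used here only because it underlies several of the other libraries.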

## Correlation Analysis

Correlation analysis measures the strength and direction of the relationship between two variables. The correlation coefficient ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship), with 0 indicating no linear association.

| Method | What It Measures | When to Use |
|--------|-----------------|-------------|
| **Pearson correlation** | Linear relationship between two continuous variables. | Both variables are approximately normally distributed with a linear trend. |
| **Spearman correlation** | Monotonic (but not necessarily linear) relationship based on ranks. | Data contains outliers, is ordinal, or the relationship is non-linear but monotonic. |
| **Kendall's tau** | Concordance between paired observations. | Small sample sizes or when a robust, distribution-free measure is needed. |
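
The difference between Pearson and Spearman correlation can be seen on a relationship that is perfectly monotonic but strongly non-linear. The sketch below uses only NumPy and computes Spearman's coefficient as the Pearson correlation of the ranks; the helper names are illustrative, and the simple ranking function assumes there are no tied values.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def ranks(values):
    """Ranks starting at 1; assumes no ties (an illustrative simplification)."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))

x = np.arange(1, 11, dtype=float)
y = np.exp(x)  # always increasing, but far from a straight line

pearson_r = pearson(x, y)
spearman_rho = spearman(x, y)

print(f"Pearson r    = {pearson_r:.2f}")    # well below 1: the trend is not linear
print(f"Spearman rho = {spearman_rho:.2f}")  # 1.00: the ordering is preserved exactly
```

A low Pearson value therefore does not rule out a strong relationship; it only indicates that the relationship is not well described by a straight line.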

A critical principle to keep in mind is that **correlation does not imply causation**. Two variables may move together because they share a common underlying cause (a confounding variable) rather than because one directly affects the other.[7]

## Hypothesis Testing

Hypothesis testing is a framework for making statistical decisions using data. The general procedure involves:

1. **Stating hypotheses.** The null hypothesis (H0) represents the default assumption (for example, "there is no difference between groups"). The alternative hypothesis (H1) represents the claim being tested.
2. **Choosing a significance level (alpha).** Conventionally set at 0.05, this is the threshold for deciding when results are unlikely enough under H0 to warrant rejection.
3. **Computing a test statistic and p-value.** The p-value is the probability of obtaining results at least as extreme as those observed, assuming H0 is true.[8]
4. **Making a decision.** If the p-value is less than alpha, the null hypothesis is rejected in favor of the alternative. Otherwise, the analyst fails to reject H0, which is not the same as showing that H0 is true.

Common statistical tests include the t-test (comparing means of two groups), chi-square test (testing independence of categorical variables), ANOVA (comparing means of three or more groups), and the Mann-Whitney U test (a non-parametric alternative to the t-test).
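
The four-step procedure can be made concrete with an exact permutation test. This is not one of the tests named above; it is used here as a sketch because it needs only the standard library and follows the definition of a p-value directly: it counts how many relabelings of the pooled data produce a difference in means at least as extreme as the observed one. The measurements and the function name are hypothetical.

```python
from itertools import combinations
from statistics import mean

def exact_permutation_test(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    indices = range(len(pooled))
    extreme = 0
    total = 0
    # Under H0 the group labels are arbitrary, so every split is equally likely.
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        relabeled_a = [pooled[i] for i in chosen]
        relabeled_b = [pooled[i] for i in indices if i not in chosen_set]
        difference = abs(mean(relabeled_a) - mean(relabeled_b))
        if difference >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        total += 1
    return observed, extreme / total

# Step 1: H0 = "the two groups have the same mean"; H1 = "the means differ".
treatment = [12.1, 11.8, 12.5, 12.9, 12.3]
control = [10.2, 10.9, 10.5, 11.1, 10.7]

# Step 2: choose the significance level before looking at the result.
alpha = 0.05

# Step 3: compute the test statistic and the p-value.
observed_difference, p_value = exact_permutation_test(treatment, control)

# Step 4: make a decision.
reject_h0 = p_value < alpha
print(f"difference={observed_difference:.2f}, p={p_value:.4f}, reject H0: {reject_h0}")
```

Here only 2 of the 252 possible splits are as extreme as the observed one, so the p-value is about 0.008. Exhaustive enumeration is practical only for small samples; larger data sets typically rely on random permutations or on the parametric tests listed above.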

It is important to distinguish statistical significance from practical significance. A result can be statistically significant (p < 0.05) while being too small in magnitude to matter in practice.

## How is AI used in data analysis?

Since 2023, [large language models](/wiki/large_language_model) have moved from generating code snippets to running entire analysis workflows end to end. The most influential example is OpenAI's [Code Interpreter](/wiki/code_interpreter), which OpenAI began rolling out in beta to [ChatGPT](/wiki/chatgpt) Plus users on July 6, 2023, and later renamed **Advanced Data Analysis**.[9][10] The feature lets a user upload a file (CSV, Excel, JSON, or even a SQLite database), describe a goal in plain English, and have ChatGPT write Python, execute it in a sandboxed environment, and return both the result and the code that produced it.

The sandbox runs server-side with no internet access from user code, and it comes pre-loaded with the standard Python data stack, including [pandas](/wiki/pandas), NumPy, [matplotlib](/wiki/matplotlib), Seaborn, scikit-learn, and statsmodels.[10] Because the model can read errors and revise its own code, it iterates toward a working analysis without the user writing any Python. A 2023 Datanami report described the tool as "your personal data analyst."[11]

OpenAI expanded these capabilities on May 16, 2024, days after the launch of [GPT-4o](/wiki/gpt_4o). The update added interactive tables and charts and the ability to upload files directly from Google Drive and Microsoft OneDrive, and it was rolled out to ChatGPT Plus, Team, and Enterprise users.[12]

### What can AI data analysis tools do?

| Capability | What the LLM does | Practical effect |
|-----------|-------------------|------------------|
| **Natural-language querying** | Translates a plain-English question into pandas or SQL code. | Non-programmers can run analyses without learning syntax. |
| **Automated cleaning** | Detects missing values, inconsistent types, and duplicates, then proposes fixes. | Speeds up the most time-consuming phase of analysis. |
| **EDA and charting** | Generates summary statistics and renders [matplotlib](/wiki/matplotlib) or Seaborn plots from uploaded data. | Produces a first pass of exploratory analysis in seconds. |
| **Code generation and self-correction** | Writes Python, runs it, reads the error trace, and revises. | Reduces the manual debugging loop. |
| **Explanation** | Describes results and methods in prose. | Makes findings accessible to non-technical stakeholders. |

### How do AI data agents differ from a chatbot?

Beyond single-turn assistants, a newer class of **AI data agents** chains many tool calls together to complete a multi-step analysis with minimal supervision. An agent can plan a workflow, query a database, clean and join tables, fit a model, and assemble a report, deciding for itself which step comes next. This builds on the same execute-observe-revise loop that Code Interpreter popularized, extended across an entire pipeline. Such systems still require human review: LLMs can misread a schema, choose an inappropriate statistical test, or hallucinate a column that does not exist, so analysts must verify generated code and results rather than trusting them blindly.
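
The execute-observe-revise loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any specific product is implemented: `fake_model` is a hypothetical stand-in that returns canned code strings instead of calling an LLM, and the bare `exec` call stands in for what a real system would run inside an isolated sandbox.

```python
import traceback

def fake_model(prompt, error=None):
    """Hypothetical stand-in for an LLM call; returns canned code strings."""
    if error is None:
        # First draft "hallucinates" a column name that is not in the data.
        return "result = sum(row['revenue_usd'] for row in rows)"
    # Revised draft, produced after the error trace is fed back.
    return "result = sum(row['revenue'] for row in rows)"

def execute_observe_revise(prompt, rows, max_attempts=3):
    """Generate code, run it, and feed any error trace back to the generator."""
    error = None
    history = []
    for attempt in range(1, max_attempts + 1):
        code = fake_model(prompt, error)
        namespace = {"rows": rows}
        try:
            exec(code, namespace)  # a real system would isolate this in a sandbox
        except Exception:
            error = traceback.format_exc()  # observe: capture the error trace
            history.append((attempt, code, "error"))
            continue                         # revise: try again with the trace
        history.append((attempt, code, "ok"))
        return namespace["result"], history
    raise RuntimeError(f"no working code after {max_attempts} attempts")

rows = [
    {"region": "north", "revenue": 120.0},
    {"region": "south", "revenue": 80.0},
]
result, history = execute_observe_revise("Total revenue?", rows)
print(result)  # 200.0
for attempt, code, status in history:
    print(attempt, status, code)
```

The loop only guarantees that the final code runs without raising an exception. It does not guarantee that the code answers the question that was asked, which is why the generated code and its output still need human review.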

## EDA in the Machine Learning Pipeline

Exploratory data analysis is not just a standalone activity; it is a critical step in any [machine learning](/wiki/machine_learning) pipeline. EDA informs decisions at nearly every subsequent stage:

- **[Feature engineering](/wiki/feature_engineering):** EDA reveals which variables carry predictive signal, which need transformation, and which can be dropped. Correlation heatmaps, for instance, help identify redundant features that could cause multicollinearity.
- **Model selection:** Understanding data distributions and relationships helps narrow the set of candidate algorithms. For example, if the data shows clear non-linear patterns, tree-based models or neural networks may be preferred over linear regression.
- **Handling class imbalance:** EDA exposes imbalanced target distributions that require techniques such as oversampling, undersampling, or cost-sensitive learning.
- **Detecting data leakage:** Inspecting feature distributions relative to the target variable can reveal features that inadvertently encode information about the outcome. Such features inflate measured performance during training and validation, but the model then fails in production, where that information is unavailable.
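
Two of these checks, class imbalance and redundant features, can be sketched with pandas as follows. The data set is synthetic, the column names are invented, and the 0.95 correlation threshold is an arbitrary illustrative cutoff. Leakage detection is not covered here, because it usually depends on knowing how and when each feature was recorded.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n = 500
height_cm = rng.normal(170, 10, size=n)

# Synthetic data: one feature is a unit conversion of another, and the target is imbalanced.
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,            # redundant copy in different units
    "age": rng.integers(18, 80, size=n),
    "churned": np.repeat([1, 0], [50, 450]),  # 10% positive class
})

# Class imbalance: share of each target class.
class_share = df["churned"].value_counts(normalize=True)

# Redundant features: pairs whose absolute correlation exceeds the threshold.
features = df.drop(columns="churned")
corr = features.corr().abs()
threshold = 0.95
redundant_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]

print(class_share)
print(redundant_pairs)  # [('height_cm', 'height_in')]
```

A flagged pair is a prompt for investigation rather than an automatic deletion, since which of the two features to keep depends on the model and the domain.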

Automated EDA tools such as YData Profiling, Sweetviz, and D-Tale accelerate this phase by generating comprehensive reports with minimal code.[6]

## Tools for Data Analysis

The ecosystem of data analysis tools has grown substantially, spanning programming languages, libraries, AI assistants, and commercial platforms.

| Tool | Type | Key Strengths |
|------|------|---------------|
| **[pandas](/wiki/pandas)** | Python library | De facto standard for tabular data manipulation; rich API for filtering, grouping, merging, and reshaping data. |
| **NumPy** | Python library | Provides efficient array operations and mathematical functions that underpin most scientific Python libraries. |
| **[matplotlib](/wiki/matplotlib)** | Python library | Foundational plotting library offering fine-grained control over chart customization. |
| **Seaborn** | Python library | Built on [matplotlib](/wiki/matplotlib); provides high-level statistical visualization functions with attractive defaults. |
| **[Code Interpreter / Advanced Data Analysis](/wiki/code_interpreter)** | LLM tool | Runs Python in a [ChatGPT](/wiki/chatgpt) sandbox from natural-language prompts; cleans, analyzes, and charts uploaded files without manual coding. |
| **R** | Programming language | Designed for statistical computing; strong in academic and research settings; ggplot2 offers a powerful grammar of graphics. |
| **SQL** | Query language | Essential for extracting and aggregating data stored in relational databases; remains one of the most-used languages among professional developers in the 2025 Stack Overflow Developer Survey.[13] |
| **Tableau** | BI platform | Drag-and-drop visualization tool; strong with large data sets and real-time analytics dashboards. |
| **Power BI** | BI platform | Integrates with the Microsoft 365 ecosystem; cost-effective for organizations already using Microsoft products. |
| **Polars** | Python / Rust library | A newer alternative to [pandas](/wiki/pandas) that offers significant speed improvements for memory-intensive operations through lazy evaluation and multi-threaded execution. |
| **Excel** | Spreadsheet | Widely accessible; suitable for small-scale analysis with pivot tables, charts, and built-in statistical functions. |

## Best Practices

Effective data analysis requires more than technical skill. Analysts should keep the following principles in mind:

- **Document every step.** Reproducibility is a cornerstone of credible analysis. Use version-controlled scripts or notebooks rather than manual, point-and-click workflows.
- **Check assumptions.** Statistical tests carry assumptions about distributions, independence, and sample size. Violating these assumptions can produce misleading results.
- **Guard against cognitive bias.** Confirmation bias leads analysts to seek patterns that support preconceptions while ignoring contradictory evidence. Pre-registering hypotheses and performing blind analyses can mitigate this risk.[14]
- **Distinguish correlation from causation.** Observational data can reveal associations but rarely proves that one variable causes changes in another without a controlled experimental design.
- **Normalize for comparison.** When comparing quantities across groups of different sizes, use per-capita rates, percentages, or other normalized measures to avoid misleading conclusions.
- **Communicate uncertainty.** Always report confidence intervals or margins of error alongside point estimates so that decision-makers understand the reliability of findings.
- **Verify AI-generated analysis.** Treat code and conclusions produced by LLM tools as a draft. Check that the right columns, statistical tests, and assumptions were used before trusting the output.
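
Two of these practices, normalizing for comparison and communicating uncertainty, lend themselves to a short worked sketch using only the standard library. The city figures and the sample are invented, and the confidence interval uses a normal approximation, which is an assumption that suits reasonably large samples; for small samples a t-based interval would be wider.

```python
import math
from statistics import NormalDist, mean, stdev

# Normalize for comparison: raw counts vs. rates per 100,000 residents.
incidents = {"City A": (500, 1_000_000), "City B": (100, 50_000)}  # (count, population)
rates_per_100k = {
    city: count * 100_000 / population
    for city, (count, population) in incidents.items()
}
print(rates_per_100k)  # City A has more incidents, but City B has the higher rate

# Communicate uncertainty: 95% confidence interval for a sample mean.
sample = [48 + (i % 5) for i in range(40)]  # hypothetical measurements
n = len(sample)
sample_mean = mean(sample)
standard_error = stdev(sample) / math.sqrt(n)
z = NormalDist().inv_cdf(0.975)             # about 1.96 for a 95% interval
margin = z * standard_error
lower, upper = sample_mean - margin, sample_mean + margin

print(f"mean = {sample_mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Reporting the interval alongside the mean tells the reader how much the estimate could plausibly move if the data were collected again.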

## Explain Like I'm 5 (ELI5)

Imagine you have a big box of crayons all jumbled up. Data analysis is like sorting those crayons so you can understand what you have. First, you take out any broken ones and throw away duplicates (that is data cleaning). Then you sort them by color (that is organizing and transforming). Next, you count how many of each color you have and maybe line them up from lightest to darkest to see if you have more blues or more reds (that is exploratory data analysis). Finally, you tell your friend, "We have mostly blue and green crayons, and hardly any orange ones" (that is communicating your findings). Today you can even ask a smart computer helper to do a lot of the sorting and counting for you, but you still have to double-check that it sorted the crayons the way you meant. Data analysis helps people look at big piles of information and turn them into simple answers.

## See Also

- [Machine Learning](/wiki/machine_learning)
- [Feature Engineering](/wiki/feature_engineering)
- [Data Set](/wiki/data_set_or_dataset)
- [Normalization](/wiki/normalization)
- [Pandas](/wiki/pandas)
- [Matplotlib](/wiki/matplotlib)
- [Code Interpreter (Advanced Data Analysis)](/wiki/code_interpreter)
- [ChatGPT](/wiki/chatgpt)
- [Large Language Model](/wiki/large_language_model)

## References

1. Judd, C. M.; McClelland, G. H.; Ryan, C. S. (2017). *Data Analysis: A Model Comparison Approach to Regression, ANOVA, and Beyond* (3rd ed.). Routledge.
2. Tukey, J. W. (1962). "The Future of Data Analysis." *The Annals of Mathematical Statistics*, 33(1), 1-67.
3. Tukey, J. W. (1977). *Exploratory Data Analysis*. Addison-Wesley.
4. CrowdFlower (2016). "Data Science Report." Survey of data scientists on time spent in data preparation.
5. Wikipedia contributors. "Wes McKinney." *Wikipedia, The Free Encyclopedia*. https://en.wikipedia.org/wiki/Wes_McKinney
6. YData (2024). "YData Profiling: Data Quality Profiling and Exploratory Data Analysis." https://github.com/ydataai/ydata-profiling
7. Aldrich, J. (1995). "Correlations Genuine and Spurious in Pearson and Yule." *Statistical Science*, 10(4), 364-376.
8. Wasserstein, R. L.; Lazar, N. A. (2016). "The ASA's Statement on p-Values: Context, Process, and Purpose." *The American Statistician*, 70(2), 129-133.
9. Pluralsight (2023). "ChatGPT's Code Interpreter is now Advanced Data Analysis." https://www.pluralsight.com/resources/blog/ai-and-data/ChatGPT-Advanced-Data-Analytics
10. OpenAI Help Center (2024). "Data analysis with ChatGPT." https://help.openai.com/en/articles/8437071-data-analysis-with-chatgpt
11. Datanami (2023). "OpenAI Releases ChatGPT Code Interpreter, 'Your Personal Data Analyst'." https://www.datanami.com/2023/07/11/openai-releases-chatgpt-code-interpreter-your-personal-data-analyst/
12. OpenAI (2024). "Improvements to data analysis in ChatGPT." https://openai.com/index/improvements-to-data-analysis-in-chatgpt/
13. Stack Overflow (2025). "2025 Developer Survey: Technology." https://survey.stackoverflow.co/2025/technology
14. Nosek, B. A.; Ebersole, C. R.; DeHaven, A. C.; Mellor, D. T. (2018). "The preregistration revolution." *Proceedings of the National Academy of Sciences*, 115(11), 2600-2606.
15. Wikipedia contributors (2025). "Data analysis." *Wikipedia, The Free Encyclopedia*. https://en.wikipedia.org/wiki/Data_analysis

