Databricks is an American enterprise software company that provides a unified data analytics and artificial intelligence platform built around the data lakehouse architecture. Founded in 2013 by the creators of Apache Spark, including Ali Ghodsi and Matei Zaharia, Databricks has grown from an open-source data processing company into one of the most valuable private technology companies in the world. The company's platform combines data engineering, data warehousing, and machine learning into a single environment, and its aggressive expansion into AI through the acquisition of MosaicML and the release of open-source models has positioned it as a major player in the enterprise AI market. As of early 2026, Databricks is valued at $134 billion and is preparing for a potential IPO.
Databricks was founded in 2013 by seven co-founders, all of whom were connected through the AMPLab (Algorithms, Machines, and People Lab) at the University of California, Berkeley:
| Co-founder | Role/Background |
|---|---|
| Ali Ghodsi | CEO; PhD from KTH Royal Institute of Technology; UC Berkeley researcher |
| Matei Zaharia | Chief Technologist; created Apache Spark during his PhD at UC Berkeley |
| Andy Konwinski | Co-creator of Apache Mesos |
| Arsalan Tavakoli-Shiraji | Former VP of Engineering |
| Ion Stoica | UC Berkeley professor; co-founder of Conviva |
| Patrick Wendell | Apache Spark release manager |
| Reynold Xin | Apache Spark contributor |
The company grew directly out of the Apache Spark project, which Matei Zaharia created during his doctoral research at Berkeley. Spark was designed as a fast, general-purpose cluster computing system that could handle both batch and streaming data processing. It quickly became one of the most popular open-source data processing frameworks in the world, and Databricks was founded to build a commercial platform and managed service around it.
From the beginning, Databricks embraced an open-source-first strategy, contributing heavily to Apache Spark (including major components such as Structured Streaming) and later creating additional open-source projects including Delta Lake (a storage layer for data lakes) and MLflow (a platform for the machine learning lifecycle).
Databricks is most closely associated with the data lakehouse concept, which it helped popularize. The data lakehouse combines the flexibility and low cost of a data lake with the data management and performance features of a traditional data warehouse.
Historically, organizations maintained separate systems for different data workloads:
| System | Strengths | Weaknesses |
|---|---|---|
| Data warehouse | Structured queries, ACID transactions, governance | Expensive, limited to structured data |
| Data lake | Cheap storage, supports all data types | Poor performance, no transactions, "data swamp" risk |
The lakehouse architecture merges these approaches by adding warehouse-like features (ACID transactions, schema enforcement, indexing) directly on top of data lake storage (typically cloud object storage like Amazon S3 or Azure Blob Storage). Databricks implemented this through Delta Lake, an open-source storage layer that brings reliability and performance to data lakes.
Delta Lake is the foundation of Databricks' lakehouse architecture. It uses the open Parquet file format for underlying storage, which means data stored in Delta Lake can be read by any tool that supports Parquet, avoiding vendor lock-in.
Delta Lake extends the Parquet format with a file-based transaction log that records every change to the data. This log enables several critical capabilities [11]:
| Feature | Description | Benefit |
|---|---|---|
| ACID transactions | Serializable isolation level via optimistic concurrency control | Multiple concurrent writers and readers without corruption |
| Schema enforcement | Validates data against the table schema on write | Prevents silent data quality degradation |
| Schema evolution | Supports adding, renaming, and dropping columns | Adapts to changing data requirements without downtime |
| Time travel | Query data at any point in its history using version numbers or timestamps | Audit trails, reproducibility, rollback capability |
| Data skipping | Maintains statistics (min, max, count) for each file | Queries skip irrelevant files, reducing I/O |
| Z-ordering | Co-locates related data within files based on specified columns | Dramatically improves query performance for filtered reads |
| Change data feed | Tracks row-level changes (inserts, updates, deletes) between versions | Efficient incremental processing |
Delta Lake is fully compatible with Apache Spark APIs and was developed for tight integration with Structured Streaming, allowing a single copy of data to serve both batch and streaming use cases. As of 2025, Delta Lake also supports interoperability with Apache Iceberg, allowing data stored in Delta format to be read by Iceberg-compatible tools and vice versa [11].
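The mechanics of the transaction log are simple to illustrate. The toy sketch below (plain Python, not the real Delta Lake implementation) models a table as an ordered log of commits that add or remove data files; replaying the log up to a given version reproduces a historical snapshot, which is essentially how time travel works.

```python
# Toy model of a Delta-style transaction log (illustration only, not the
# actual Delta Lake implementation). Each commit records files added or
# removed; replaying commits up to a version yields that table snapshot.

class ToyDeltaLog:
    def __init__(self):
        self.commits = []  # ordered list of {"add": [...], "remove": [...]}

    def commit(self, add=(), remove=()):
        self.commits.append({"add": list(add), "remove": list(remove)})
        return len(self.commits) - 1  # version number of this commit

    def snapshot(self, version=None):
        """Time travel: replay the log up to `version` (default: latest)."""
        if version is None:
            version = len(self.commits) - 1
        files = set()
        for c in self.commits[: version + 1]:
            files |= set(c["add"])
            files -= set(c["remove"])
        return sorted(files)

log = ToyDeltaLog()
v0 = log.commit(add=["part-000.parquet"])
v1 = log.commit(add=["part-001.parquet"])
v2 = log.commit(remove=["part-000.parquet"], add=["part-002.parquet"])

print(log.snapshot())    # latest version: part-001 and part-002
print(log.snapshot(v0))  # time travel to version 0: part-000 only
```

Because every writer appends a new commit rather than mutating files in place, concurrent readers always see a consistent snapshot, which is the basis of the optimistic concurrency control described above.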
In 2025, Databricks unveiled Lakebase, a Postgres-compatible transactional database engine built for the lakehouse, allowing teams to run OLTP-style applications directly on the same data infrastructure used for analytics and AI workloads.
Lakebase entered public preview at the 2025 Data + AI Summit and reached general availability on February 3, 2026. It represents Databricks' entry into the online transaction processing (OLTP) market, traditionally the domain of dedicated database systems like PostgreSQL, MySQL, and cloud-native databases [12].
Key Lakebase features include:
| Feature | Description |
|---|---|
| Postgres compatibility | Standard PostgreSQL wire protocol and SQL dialect |
| Serverless compute | Auto-scaling with scale-to-zero capability |
| Instant branching | Create database branches for development, testing, or experimentation |
| Point-in-time restore | Recover data to any previous point in time |
| Delta table sync | Managed synchronization between OLTP tables and Delta Lake analytics tables |
| Unity Catalog integration | Governance and access control through the same catalog as all other lakehouse assets |
| Postgres extension support | Compatible with PostgreSQL extensions for specialized functionality |
Lakebase bridges the gap between operational applications and analytical workloads. Instead of maintaining separate OLTP and analytics databases with complex ETL pipelines between them, organizations can use Lakebase for transactional workloads while the built-in sync keeps Delta Lake tables updated for analytics and AI [12].
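Postgres wire-protocol compatibility means that, in principle, existing Postgres clients and drivers connect to Lakebase unchanged. The sketch below builds a standard libpq-style connection string; the host, database, and credentials are placeholders, and the commented-out client call assumes an ordinary Postgres driver such as psycopg rather than anything Lakebase-specific.

```python
# Hypothetical sketch: connecting to a Postgres-compatible database such
# as Lakebase with a standard client. Host and credentials are placeholders.

def libpq_dsn(host, dbname, user, port=5432, sslmode="require"):
    # Standard libpq key=value connection string understood by common
    # Postgres drivers (psycopg, and the equivalent JDBC URL form).
    return f"host={host} port={port} dbname={dbname} user={user} sslmode={sslmode}"

dsn = libpq_dsn("my-instance.example.net", "appdb", "svc_app")
print(dsn)

# With a real endpoint, an ordinary Postgres client would then work as usual:
# import psycopg
# with psycopg.connect(dsn) as conn:
#     conn.execute("INSERT INTO orders (id, total) VALUES (%s, %s)", (1, 9.99))
```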
In June 2023, Databricks acquired MosaicML for $1.3 billion, marking its most significant move into the generative AI space. MosaicML had built tools and infrastructure that simplified and reduced the cost of training large language models, making it possible for enterprises to train custom models without the massive engineering teams that organizations like OpenAI or Google maintained.
Following the acquisition, MosaicML's technology was integrated into Databricks as Mosaic AI, which covers the full machine learning lifecycle from feature engineering and model training to deployment and monitoring. The acquisition brought key talent, including MosaicML's expertise in efficient training techniques, and gave Databricks the capability to offer foundation model training as a service to its enterprise customers.
The Mosaic AI Agent Framework is Databricks' solution for building production-quality AI agent systems, including retrieval-augmented generation (RAG) applications. The framework provides a suite of tooling for developing, evaluating, and deploying compound AI systems that leverage multiple components such as tuned models, retrieval, tool use, and reasoning agents [13].
Key capabilities include:
| Component | Function |
|---|---|
| Agent Bricks | Auto-optimized agent templates for common industry use cases (information extraction, knowledge assistance, text transformation) |
| Agent evaluation | Built-in tools for measuring agent accuracy, safety, and performance |
| Vector Search | Storage-optimized vector search supporting billions of vectors at 7x lower cost |
| Mosaic AI Gateway | Unified entry point for all AI services with centralized governance, usage logging, and control |
| Multi-agent systems | Support for building systems where multiple specialized agents collaborate |
Agent Bricks, announced at the 2025 Data + AI Summit, simplifies agent development by allowing users to provide a high-level description of the agent's task and connect enterprise data, with the system handling optimization automatically [13]. Databricks' "2026 State of AI Agents" report highlighted a 327% increase in multi-agent workflow adoption over the latter half of 2025.
In March 2024, Databricks released DBRX, its first foundation model, under the Databricks Open Model License. DBRX uses a mixture-of-experts (MoE) architecture built on the MegaBlocks open-source project. Key details of DBRX include:
| Specification | Details |
|---|---|
| Architecture | Mixture-of-experts (MoE) |
| Training cost | ~$10 million |
| License | Databricks Open Model License |
| Foundation | MegaBlocks open-source project |
| Training infrastructure | Databricks Mosaic AI |
| Serving | Available via pay-per-token and provisioned throughput endpoints |
DBRX was designed to demonstrate that enterprises could build competitive foundation models at a fraction of the cost of frontier models from major AI labs, reinforcing Databricks' pitch that companies should own and customize their AI models rather than relying entirely on third-party APIs.
In March 2023, shortly before the MosaicML acquisition, Databricks released Dolly, an open-source language model named after Dolly the sheep (the first cloned mammal). Dolly was a 6 billion parameter model based on EleutherAI's GPT-J, fine-tuned on a dataset of instruction-following examples generated by Databricks employees. Dolly was notable as one of the first demonstrations that a relatively small model could exhibit instruction-following capabilities similar to much larger models when fine-tuned on high-quality data.
Databricks later released Dolly 2.0, which used a commercially permissive training dataset created by Databricks employees, making it one of the first instruction-following LLMs that could be used for commercial purposes without licensing restrictions.
MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. Originally created by Databricks and later donated to the Linux Foundation, MLflow provides:
| Component | Function |
|---|---|
| MLflow Tracking | Logging experiments, parameters, metrics, and artifacts |
| MLflow Projects | Packaging ML code for reproducible runs |
| MLflow Models | Deploying models in diverse serving environments |
| MLflow Model Registry | Centralized model store with versioning and staging |
MLflow has become one of the most widely adopted ML lifecycle management tools in the industry, with integration support across major cloud platforms and ML frameworks. Its open-source nature and broad compatibility have helped Databricks build mindshare in the data science community.
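The data model behind MLflow Tracking is straightforward: a run owns parameters, metric histories (a metric can be logged repeatedly across steps), and artifacts. The pure-Python toy below mirrors that model for illustration only; the real library exposes analogous calls such as `mlflow.start_run()`, `mlflow.log_param()`, and `mlflow.log_metric()`.

```python
# Toy, in-memory illustration of what MLflow Tracking records per run
# (not the mlflow library itself). A run holds params, metric histories,
# and artifact paths.

class ToyRun:
    def __init__(self, run_id):
        self.run_id = run_id
        self.params = {}      # set once per key, as with MLflow params
        self.metrics = {}     # name -> list of (step, value) history
        self.artifacts = []   # logged file paths

    def log_param(self, key, value):
        self.params[key] = str(value)  # MLflow stores params as strings

    def log_metric(self, key, value, step=0):
        self.metrics.setdefault(key, []).append((step, float(value)))

    def log_artifact(self, path):
        self.artifacts.append(path)

run = ToyRun("run-001")
run.log_param("learning_rate", 0.01)
for step, loss in enumerate([0.9, 0.5, 0.3]):
    run.log_metric("loss", loss, step=step)
run.log_artifact("model/model.pkl")

print(run.params["learning_rate"])  # "0.01"
print(run.metrics["loss"][-1])      # (2, 0.3)
```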
MLflow 3, released in 2025, introduces significant enhancements to experiment tracking, observability, and performance evaluation for both traditional ML models and generative AI applications [14].
Key new concepts in MLflow 3 include:
| Feature | Description |
|---|---|
| Logged Models | Persistent model objects that track a model's progress throughout its lifecycle, across environments and runs |
| Deployment Jobs | First-class tracking of model deployment status and configuration |
| Enhanced Model Registry | Direct capture of parameters, metrics, and metadata available across all workspaces |
| GenAI observability | Tracing and evaluation capabilities designed for LLM-powered applications |
| Agent evaluation | Purpose-built metrics and evaluation frameworks for AI agents |
MLflow 3's integration with Unity Catalog means that models tracked in MLflow automatically benefit from centralized governance, lineage tracking, and access control across the Databricks platform [14].
Databricks Model Serving provides real-time and batch inference capabilities integrated with the lakehouse platform. It supports serving custom models trained on Databricks, foundation models accessed via APIs, and external models from providers like OpenAI and Anthropic. Model Serving integrates with MLflow for model versioning and lifecycle management, and with Unity Catalog for governance and access control.
The platform also includes Mosaic AI Model Serving for batch inference, which simplifies the infrastructure needed to process unstructured data at scale using large language models.
Mosaic AI Model Serving deploys models to REST API endpoints with automatic monitoring of requests and responses. The serving infrastructure supports several deployment patterns [14]:
| Deployment Type | Description | Billing |
|---|---|---|
| Pay-per-token endpoints | Serverless endpoints billed by tokens processed | Per-token pricing |
| Provisioned throughput | Dedicated compute with guaranteed capacity | Per-compute-hour |
| Custom model endpoints | Serve models trained on Databricks | Per-compute-hour |
| External model endpoints | Proxy to external providers (OpenAI, Anthropic) with governance | Per-token (pass-through + gateway fee) |
All served models are automatically registered in Unity Catalog, ensuring consistent governance and access control regardless of deployment type.
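From a client's perspective, a serving endpoint is just an authenticated REST call. The sketch below builds an invocation URL and a chat-style JSON payload; the workspace URL and endpoint name are placeholders, and the exact URL and payload shape should be checked against current documentation rather than taken from this illustration.

```python
import json

# Hypothetical sketch of preparing a request to a model-serving REST
# endpoint. Workspace URL and endpoint name are placeholders; the payload
# follows the common chat-completions shape used by many serving APIs.

def build_invocation(workspace_url, endpoint_name, user_message, max_tokens=256):
    url = f"{workspace_url}/serving-endpoints/{endpoint_name}/invocations"
    payload = {
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return url, json.dumps(payload)

url, body = build_invocation(
    "https://example.cloud.databricks.com", "my-llm", "Summarize Q3 sales."
)
print(url)

# A real call would then POST `body` with a bearer token in the
# Authorization header, e.g. via urllib.request or the requests library.
```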
Genie (officially AI/BI Genie) is Databricks' natural language interface for data analysis. Generally available as of 2025, Genie allows business users to query data, build visualizations, and receive AI-generated insights using conversational language, without writing SQL or code. Genie represents Databricks' push to make its platform accessible to non-technical users and to demonstrate the practical value of AI integration in everyday business analytics.
An API in public preview also enables developers to integrate Genie into custom-built applications and productivity tools.
Unity Catalog is Databricks' unified governance solution for data and AI assets. It provides a single place to manage access controls, auditing, lineage, and discovery across all data, ML models, notebooks, and dashboards within a Databricks workspace.
Unity Catalog addresses a critical enterprise need: as organizations deploy more AI models and manage more data, they require robust governance to ensure compliance with regulations, protect sensitive data, and maintain data quality.
Unity Catalog organizes assets in a three-level namespace: catalog, schema, and object. This hierarchy maps naturally to organizational structures and allows fine-grained access control [15].
| Level | Description | Example |
|---|---|---|
| Catalog | Top-level container, typically representing a business unit or environment | production, development, marketing |
| Schema | Groups related objects within a catalog | production.sales, production.finance |
| Object | Individual data or AI asset | Tables, views, volumes, models, functions |
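In practice, every asset is addressed by a fully qualified three-part name. The minimal sketch below (plain Python, with illustrative names) shows how such a name decomposes into its catalog, schema, and object components.

```python
# Minimal sketch of splitting a Unity Catalog-style three-level name
# (catalog.schema.object) into its parts. Names here are illustrative.

def parse_three_level_name(name):
    parts = name.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected catalog.schema.object, got: {name!r}")
    catalog, schema, obj = parts
    return {"catalog": catalog, "schema": schema, "object": obj}

ref = parse_three_level_name("production.sales.orders")
print(ref)  # {'catalog': 'production', 'schema': 'sales', 'object': 'orders'}
```

Because the catalog and schema levels carry their own grants, access control decided at any level of this hierarchy flows down to the objects beneath it.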
Unity Catalog governs the following asset types:
| Asset Type | Description |
|---|---|
| Managed tables | Delta Lake tables with storage managed by Unity Catalog |
| External tables | Tables pointing to data in customer-managed storage |
| Views | Virtual tables defined by SQL queries |
| Volumes | Managed and external file storage (images, documents, raw data) |
| ML models | Models registered through MLflow Model Registry |
| Functions | User-defined functions (UDFs) and AI functions |
| Connections | Metadata for external database connections (federation) |
Introduced in 2025, Unity Catalog Metrics extends governance to business metrics definitions, ensuring that key performance indicators (KPIs) are defined once and used consistently across dashboards, reports, and AI applications. This prevents the common problem of different teams calculating the same metric in different ways [15].
Databricks open-sourced Unity Catalog in 2024, allowing organizations to use its governance capabilities outside the Databricks platform. The open-source version supports Apache Iceberg, Delta Lake, and other table formats, reinforcing Databricks' strategy of building ecosystem adoption through open-source contributions.
Databricks has raised substantial funding across numerous rounds, reflecting its rapid growth:
| Round | Date | Amount | Valuation | Key Investors |
|---|---|---|---|---|
| Series A | 2013 | $14M | - | Andreessen Horowitz |
| Series B | 2014 | $33M | - | Andreessen Horowitz, New Enterprise Associates |
| Series C | 2016 | $60M | - | Various |
| Series D | 2017 | $140M | - | Andreessen Horowitz |
| Series E | 2019 | $250M | $2.75B | Andreessen Horowitz, Microsoft |
| Series F | October 2019 | $400M | $6.2B | Andreessen Horowitz |
| Series G | February 2021 | $1B | $28B | Franklin Templeton |
| Series H | August 2021 | $1.6B | $38B | Morgan Stanley (Counterpoint Global) |
| Series I | September 2023 | $500M | $43B | T. Rowe Price, Nvidia, Capital One |
| Series J | December 2024 | $10B | $62B | Thrive Capital, a16z, various |
| Series K | September 2025 | $1B | ~$100B | Various |
| Series L | December 2025 / February 2026 | $5B ($3B equity + $2B debt) | $134B | Insight Partners, Fidelity, JP Morgan |
The jump from $62 billion in December 2024 to $134 billion by early 2026 reflects the accelerating demand for unified data and AI platforms and Databricks' strong revenue growth.
Databricks has demonstrated strong financial metrics:
| Metric | Value |
|---|---|
| Revenue run-rate (Q3 2025) | $4.8 billion |
| Year-over-year growth | >55% |
| Data warehousing revenue run-rate | >$1 billion |
| AI products revenue run-rate | >$1 billion |
| Free cash flow | Positive (trailing 12 months) |
The transition to positive free cash flow is notable for a company of Databricks' size and growth rate, and it has been cited as a key factor in the company's readiness for a potential public listing.
The most frequently discussed competitive rivalry in the data platform market is between Databricks and Snowflake. The two companies approach the market from different directions:
| Dimension | Databricks | Snowflake |
|---|---|---|
| Origin | Open-source data processing (Spark) | Cloud data warehousing |
| Architecture | Data lakehouse | Shared-data cloud warehouse |
| AI/ML capabilities | Deep (Mosaic AI, MLflow, model training) | Growing (Cortex AI, Snowpark) |
| Open source commitment | Strong (Spark, Delta Lake, MLflow, Unity Catalog) | Moderate (Iceberg adoption, Open Catalog) |
| Data engineering | Native strength | Acquired capability |
| Data warehousing | Growing strength | Native strength |
| Pricing model | Consumption-based | Consumption-based |
| Unstructured data | Native support for text, images, files | Optimized for structured and semi-structured |
| Learning curve | Code-centric (Python, SQL, Scala) | SQL-first, analyst-friendly |
Databricks has traditionally been stronger in data engineering and machine learning, while Snowflake has dominated the data warehousing and analytics market. Both companies are now converging on each other's territory, with Databricks investing heavily in its SQL and warehousing capabilities and Snowflake expanding into AI and ML. The introduction of Databricks' Lakebase (Postgres-compatible transactional database) and Snowflake's Cortex AI in 2025 further blurred the lines between the two platforms.
In 2025, Snowflake responded to Databricks' AI advances by doubling down on openness with Open Catalog and native Iceberg support, enabling teams to work with data in open formats. Snowflake also unveiled Openflow, a low-code ingestion and transformation service built on Apache NiFi, aimed at simplifying data pipelines for less technical users [16].
Databricks countered with innovations of its own, most notably Lakebase and Agent Bricks.
Industry analysts generally view Databricks as having a deeper AI/ML stack due to the MosaicML acquisition and its extensive open-source ecosystem, while Snowflake retains advantages in ease of use for traditional analytics workloads and a larger installed base of SQL-focused users. The consensus for 2026 is that if the primary need is advanced analytics, machine learning, and unified data engineering, Databricks is the stronger choice; for SQL analytics, BI concurrency, and governed reporting, Snowflake typically fits better [16].
Databricks CEO Ali Ghodsi has said he would not rule out a 2026 initial public offering. As of early 2026, the company is generating positive free cash flow and its revenue growth rate exceeds 55% year over year. The $134 billion private valuation positions Databricks as potentially one of the largest technology IPOs in history if and when it proceeds. Industry observers expect that a Databricks IPO would be a landmark event for the enterprise AI and data platform market.
Databricks' open-source strategy has been central to its success. The company has consistently developed and contributed to open-source projects that form the foundation of its commercial platform:
| Project | Description | Status |
|---|---|---|
| Apache Spark | Distributed data processing framework | Apache Software Foundation |
| Delta Lake | ACID-compliant storage layer for data lakes | Linux Foundation |
| MLflow | ML lifecycle management platform | Linux Foundation |
| Unity Catalog | Unified governance for data and AI | Open source (2024) |
| DBRX | Mixture-of-experts language model | Open source |
| Dolly | Instruction-following language model | Open source |
This approach creates a broad ecosystem of users and contributors, many of whom eventually become Databricks customers. It also reduces vendor lock-in concerns, as organizations can use the open-source components independently of Databricks' commercial platform.
As of early 2026, Databricks is one of the most valuable private technology companies in the world at $134 billion. The company is approaching $5 billion in annual revenue, growing at over 55% year over year, and generating positive free cash flow. Its platform has expanded well beyond its Apache Spark roots to encompass data warehousing, AI model training and serving, natural language analytics (Genie), transactional databases (Lakebase), and comprehensive governance (Unity Catalog). The MosaicML acquisition, DBRX model, and Mosaic AI Agent Framework have established Databricks as a credible player in the foundation model and agentic AI spaces, while the company's open-source commitments and lakehouse architecture continue to differentiate it in the enterprise market. With a potential IPO on the horizon, Databricks is entering the next phase of its growth as a public-market-ready enterprise AI platform.