Databricks is an American enterprise software company that provides a unified data analytics and artificial intelligence platform built around the data lakehouse architecture. Founded in 2013 by the creators of Apache Spark, including Ali Ghodsi and Matei Zaharia, Databricks has grown from an open-source data processing company into one of the most valuable private technology companies in the world. The company's platform combines data engineering, data warehousing, and machine learning into a single environment, and its aggressive expansion into AI through the acquisition of MosaicML and the release of open-source models has positioned it as a major player in the enterprise AI market. As of early 2026, Databricks is valued at $134 billion and is preparing for a potential IPO.
Databricks was founded in 2013 by seven co-founders, all of whom were connected through the AMPLab (Algorithms, Machines, and People Lab) at the University of California, Berkeley:
| Co-founder | Role/Background |
|---|---|
| Ali Ghodsi | CEO; PhD from KTH Royal Institute of Technology; UC Berkeley researcher |
| Matei Zaharia | Chief Technologist; created Apache Spark during his PhD at UC Berkeley |
| Andy Konwinski | Co-creator of Apache Mesos |
| Arsalan Tavakoli-Shiraji | Former VP of Engineering |
| Ion Stoica | UC Berkeley professor; co-founder of Conviva |
| Patrick Wendell | Apache Spark release manager |
| Reynold Xin | Apache Spark contributor |
The company grew directly out of the Apache Spark project, which Matei Zaharia created during his doctoral research at Berkeley. Spark was designed as a fast, general-purpose cluster computing system that could handle both batch and streaming data processing. It quickly became one of the most popular open-source data processing frameworks in the world, and Databricks was founded to build a commercial platform and managed service around it.
From the beginning, Databricks embraced an open-source-first strategy, contributing heavily to Apache Spark (including major components such as Structured Streaming) and later creating additional open-source projects including Delta Lake (a storage layer for data lakes) and MLflow (a platform for the machine learning lifecycle).
Databricks is most closely associated with the data lakehouse concept, which it helped popularize. The data lakehouse combines the flexibility and low cost of a data lake with the data management and performance features of a traditional data warehouse.
Historically, organizations maintained separate systems for different data workloads:
| System | Strengths | Weaknesses |
|---|---|---|
| Data warehouse | Structured queries, ACID transactions, governance | Expensive, limited to structured data |
| Data lake | Cheap storage, supports all data types | Poor performance, no transactions, "data swamp" risk |
The lakehouse architecture merges these approaches by adding warehouse-like features (ACID transactions, schema enforcement, indexing) directly on top of data lake storage (typically cloud object storage like Amazon S3 or Azure Blob Storage). Databricks implemented this through Delta Lake, an open-source storage layer that brings reliability and performance to data lakes.
Delta Lake is the foundation of Databricks' lakehouse architecture. It uses the open Parquet file format for underlying storage, which means data stored in Delta Lake can be read by any tool that supports Parquet, avoiding vendor lock-in.
Delta Lake extends the Parquet format with a file-based transaction log that records every change to the data. This log enables several critical capabilities [11]:
| Feature | Description | Benefit |
|---|---|---|
| ACID transactions | Serializable isolation level via optimistic concurrency control | Multiple concurrent writers and readers without corruption |
| Schema enforcement | Validates data against the table schema on write | Prevents silent data quality degradation |
| Schema evolution | Supports adding, renaming, and dropping columns | Adapts to changing data requirements without downtime |
| Time travel | Query data at any point in its history using version numbers or timestamps | Audit trails, reproducibility, rollback capability |
| Data skipping | Maintains statistics (min, max, count) for each file | Queries skip irrelevant files, reducing I/O |
| Z-ordering | Co-locates related data within files based on specified columns | Dramatically improves query performance for filtered reads |
| Change data feed | Tracks row-level changes (inserts, updates, deletes) between versions | Efficient incremental processing |
Delta Lake is fully compatible with Apache Spark APIs and was developed for tight integration with Structured Streaming, allowing a single copy of data to serve both batch and streaming use cases. As of 2025, Delta Lake also supports interoperability with Apache Iceberg, allowing data stored in Delta format to be read by Iceberg-compatible tools and vice versa [11].
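The mechanics of the transaction log are simple to illustrate. The toy sketch below (plain Python, not the real Delta Lake implementation) models a table as an ordered log of commits that add or remove data files; replaying the log up to a given version reproduces a historical snapshot, which is essentially how time travel works.

```python
# Toy model of a Delta-style transaction log (illustration only, not the
# actual Delta Lake implementation). Each commit records files added or
# removed; replaying commits up to a version yields that table snapshot.

class ToyDeltaLog:
    def __init__(self):
        self.commits = []  # ordered list of {"add": [...], "remove": [...]}

    def commit(self, add=(), remove=()):
        self.commits.append({"add": list(add), "remove": list(remove)})
        return len(self.commits) - 1  # version number of this commit

    def snapshot(self, version=None):
        """Time travel: replay the log up to `version` (default: latest)."""
        if version is None:
            version = len(self.commits) - 1
        files = set()
        for c in self.commits[: version + 1]:
            files |= set(c["add"])
            files -= set(c["remove"])
        return sorted(files)

log = ToyDeltaLog()
v0 = log.commit(add=["part-000.parquet"])
v1 = log.commit(add=["part-001.parquet"])
v2 = log.commit(remove=["part-000.parquet"], add=["part-002.parquet"])

print(log.snapshot())    # latest version: part-001 and part-002
print(log.snapshot(v0))  # time travel to version 0: part-000 only
```

Because every writer appends a new commit rather than mutating files in place, concurrent readers always see a consistent snapshot, which is the basis of the optimistic concurrency control described above.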
In 2025, Databricks unveiled Lakebase, a Postgres-compatible transactional database engine built for the lakehouse, allowing teams to run OLTP-style applications directly on the same data infrastructure used for analytics and AI workloads.
Lakebase entered public preview at the 2025 Data + AI Summit and reached general availability on February 3, 2026. It represents Databricks' entry into the online transaction processing (OLTP) market, traditionally the domain of dedicated database systems like PostgreSQL, MySQL, and cloud-native databases [12].
Key Lakebase features include:
| Feature | Description |
|---|---|
| Postgres compatibility | Standard PostgreSQL wire protocol and SQL dialect |
| Serverless compute | Auto-scaling with scale-to-zero capability |
| Instant branching | Create database branches for development, testing, or experimentation |
| Point-in-time restore | Recover data to any previous point in time |
| Delta table sync | Managed synchronization between OLTP tables and Delta Lake analytics tables |
| Unity Catalog integration | Governance and access control through the same catalog as all other lakehouse assets |
| Postgres extension support | Compatible with PostgreSQL extensions for specialized functionality |
Lakebase bridges the gap between operational applications and analytical workloads. Instead of maintaining separate OLTP and analytics databases with complex ETL pipelines between them, organizations can use Lakebase for transactional workloads while the built-in sync keeps Delta Lake tables updated for analytics and AI [12].
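Postgres wire-protocol compatibility means that, in principle, existing Postgres clients and drivers connect to Lakebase unchanged. The sketch below builds a standard libpq-style connection string; the host, database, and credentials are placeholders, and the commented-out client call assumes an ordinary Postgres driver such as psycopg rather than anything Lakebase-specific.

```python
# Hypothetical sketch: connecting to a Postgres-compatible database such
# as Lakebase with a standard client. Host and credentials are placeholders.

def libpq_dsn(host, dbname, user, port=5432, sslmode="require"):
    # Standard libpq key=value connection string understood by common
    # Postgres drivers (psycopg, and the equivalent JDBC URL form).
    return f"host={host} port={port} dbname={dbname} user={user} sslmode={sslmode}"

dsn = libpq_dsn("my-instance.example.net", "appdb", "svc_app")
print(dsn)

# With a real endpoint, an ordinary Postgres client would then work as usual:
# import psycopg
# with psycopg.connect(dsn) as conn:
#     conn.execute("INSERT INTO orders (id, total) VALUES (%s, %s)", (1, 9.99))
```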
In June 2023, Databricks acquired MosaicML for $1.3 billion, marking its most significant move into the generative AI space. MosaicML had built tools and infrastructure that simplified and reduced the cost of training large language models, making it possible for enterprises to train custom models without the massive engineering teams that organizations like OpenAI or Google maintained.
Following the acquisition, MosaicML's technology was integrated into Databricks as Mosaic AI, which covers the full machine learning lifecycle from feature engineering and model training to deployment and monitoring. The acquisition brought key talent, including MosaicML's expertise in efficient training techniques, and gave Databricks the capability to offer foundation model training as a service to its enterprise customers.
The Mosaic AI Agent Framework is Databricks' solution for building production-quality AI agent systems, including retrieval-augmented generation (RAG) applications. The framework provides a suite of tooling for developing, evaluating, and deploying compound AI systems that leverage multiple components such as tuned models, retrieval, tool use, and reasoning agents [13].
Key capabilities include:
| Component | Function |
|---|---|
| Agent Bricks | Auto-optimized agent templates for common industry use cases (information extraction, knowledge assistance, text transformation) |
| Agent evaluation | Built-in tools for measuring agent accuracy, safety, and performance |
| Vector Search | Storage-optimized vector search supporting billions of vectors at 7x lower cost |
| Mosaic AI Gateway | Unified entry point for all AI services with centralized governance, usage logging, and control |
| Multi-agent systems | Support for building systems where multiple specialized agents collaborate |
Agent Bricks, announced at the 2025 Data + AI Summit, simplifies agent development by allowing users to provide a high-level description of the agent's task and connect enterprise data, with the system handling optimization automatically [13]. Databricks' "2026 State of AI Agents" report highlighted a 327% increase in multi-agent workflow adoption over the latter half of 2025.
In March 2024, Databricks released DBRX, its first foundation model, under the Databricks Open Model License. DBRX uses a mixture-of-experts (MoE) architecture built on the MegaBlocks open-source project. Key details of DBRX include:
| Specification | Details |
|---|---|
| Architecture | Mixture-of-experts (MoE) |
| Training cost | ~$10 million |
| License | Databricks Open Model License |
| Foundation | MegaBlocks open-source project |
| Training infrastructure | Databricks Mosaic AI |
| Serving | Available via pay-per-token and provisioned throughput endpoints |
DBRX was designed to demonstrate that enterprises could build competitive foundation models at a fraction of the cost of frontier models from major AI labs, reinforcing Databricks' pitch that companies should own and customize their AI models rather than relying entirely on third-party APIs.
In March 2023, shortly before the MosaicML acquisition, Databricks released Dolly, an open-source language model named after Dolly the sheep (the first cloned mammal). Dolly was a 6 billion parameter model based on EleutherAI's GPT-J, fine-tuned on a dataset of instruction-following examples generated by Databricks employees. Dolly was notable as one of the first demonstrations that a relatively small model could exhibit instruction-following capabilities similar to much larger models when fine-tuned on high-quality data.
Databricks later released Dolly 2.0, which used a commercially permissive training dataset created by Databricks employees, making it one of the first instruction-following LLMs that could be used for commercial purposes without licensing restrictions.
MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. Originally created by Databricks and later donated to the Linux Foundation, MLflow provides:
| Component | Function |
|---|---|
| MLflow Tracking | Logging experiments, parameters, metrics, and artifacts |
| MLflow Projects | Packaging ML code for reproducible runs |
| MLflow Models | Deploying models in diverse serving environments |
| MLflow Model Registry | Centralized model store with versioning and staging |
MLflow has become one of the most widely adopted ML lifecycle management tools in the industry, with integration support across major cloud platforms and ML frameworks. Its open-source nature and broad compatibility have helped Databricks build mindshare in the data science community.
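The data model behind MLflow Tracking is straightforward: a run owns parameters, metric histories (a metric can be logged repeatedly across steps), and artifacts. The pure-Python toy below mirrors that model for illustration only; the real library exposes analogous calls such as `mlflow.start_run()`, `mlflow.log_param()`, and `mlflow.log_metric()`.

```python
# Toy, in-memory illustration of what MLflow Tracking records per run
# (not the mlflow library itself). A run holds params, metric histories,
# and artifact paths.

class ToyRun:
    def __init__(self, run_id):
        self.run_id = run_id
        self.params = {}      # set once per key, as with MLflow params
        self.metrics = {}     # name -> list of (step, value) history
        self.artifacts = []   # logged file paths

    def log_param(self, key, value):
        self.params[key] = str(value)  # MLflow stores params as strings

    def log_metric(self, key, value, step=0):
        self.metrics.setdefault(key, []).append((step, float(value)))

    def log_artifact(self, path):
        self.artifacts.append(path)

run = ToyRun("run-001")
run.log_param("learning_rate", 0.01)
for step, loss in enumerate([0.9, 0.5, 0.3]):
    run.log_metric("loss", loss, step=step)
run.log_artifact("model/model.pkl")

print(run.params["learning_rate"])  # "0.01"
print(run.metrics["loss"][-1])      # (2, 0.3)
```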
MLflow 3, released in 2025, introduces significant enhancements to experiment tracking, observability, and performance evaluation for both traditional ML models and generative AI applications [14].
Key new concepts in MLflow 3 include:
| Feature | Description |
|---|---|
| Logged Models | Persistent model objects that track a model's progress throughout its lifecycle, across environments and runs |
| Deployment Jobs | First-class tracking of model deployment status and configuration |
| Enhanced Model Registry | Direct capture of parameters, metrics, and metadata available across all workspaces |
| GenAI observability | Tracing and evaluation capabilities designed for LLM-powered applications |
| Agent evaluation | Purpose-built metrics and evaluation frameworks for AI agents |
MLflow 3's integration with Unity Catalog means that models tracked in MLflow automatically benefit from centralized governance, lineage tracking, and access control across the Databricks platform [14].
Databricks Model Serving provides real-time and batch inference capabilities integrated with the lakehouse platform. It supports serving custom models trained on Databricks, foundation models accessed via APIs, and external models from providers like OpenAI and Anthropic. Model Serving integrates with MLflow for model versioning and lifecycle management, and with Unity Catalog for governance and access control.
The platform also includes Mosaic AI Model Serving for batch inference, which simplifies the infrastructure needed to process unstructured data at scale using large language models.
Mosaic AI Model Serving deploys models to REST API endpoints with automatic monitoring of requests and responses. The serving infrastructure supports several deployment patterns [14]:
| Deployment Type | Description | Billing |
|---|---|---|
| Pay-per-token endpoints | Serverless endpoints billed by tokens processed | Per-token pricing |
| Provisioned throughput | Dedicated compute with guaranteed capacity | Per-compute-hour |
| Custom model endpoints | Serve models trained on Databricks | Per-compute-hour |
| External model endpoints | Proxy to external providers (OpenAI, Anthropic) with governance | Per-token (pass-through + gateway fee) |
All served models are automatically registered in Unity Catalog, ensuring consistent governance and access control regardless of deployment type.
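From a client's perspective, a serving endpoint is just an authenticated REST call. The sketch below builds an invocation URL and a chat-style JSON payload; the workspace URL and endpoint name are placeholders, and the exact URL and payload shape should be checked against current documentation rather than taken from this illustration.

```python
import json

# Hypothetical sketch of preparing a request to a model-serving REST
# endpoint. Workspace URL and endpoint name are placeholders; the payload
# follows the common chat-completions shape used by many serving APIs.

def build_invocation(workspace_url, endpoint_name, user_message, max_tokens=256):
    url = f"{workspace_url}/serving-endpoints/{endpoint_name}/invocations"
    payload = {
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return url, json.dumps(payload)

url, body = build_invocation(
    "https://example.cloud.databricks.com", "my-llm", "Summarize Q3 sales."
)
print(url)

# A real call would then POST `body` with a bearer token in the
# Authorization header, e.g. via urllib.request or the requests library.
```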
Genie (officially AI/BI Genie) is Databricks' natural language interface for data analysis. Generally available as of 2025, Genie allows business users to query data, build visualizations, and receive AI-generated insights using conversational language, without writing SQL or code. Genie represents Databricks' push to make its platform accessible to non-technical users and to demonstrate the practical value of AI integration in everyday business analytics.
An API in public preview also enables developers to integrate Genie into custom-built applications and productivity tools.
Unity Catalog is Databricks' unified governance solution for data and AI assets. It provides a single place to manage access controls, auditing, lineage, and discovery across all data, ML models, notebooks, and dashboards within a Databricks workspace.
Unity Catalog addresses a critical enterprise need: as organizations deploy more AI models and manage more data, they require robust governance to ensure compliance with regulations, protect sensitive data, and maintain data quality.
Unity Catalog organizes assets in a three-level namespace: catalog, schema, and object. This hierarchy maps naturally to organizational structures and allows fine-grained access control [15].
| Level | Description | Example |
|---|---|---|
| Catalog | Top-level container, typically representing a business unit or environment | production, development, marketing |
| Schema | Groups related objects within a catalog | production.sales, production.finance |
| Object | Individual data or AI asset | Tables, views, volumes, models, functions |
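In practice, every asset is addressed by a fully qualified three-part name. The minimal sketch below (plain Python, with illustrative names) shows how such a name decomposes into its catalog, schema, and object components.

```python
# Minimal sketch of splitting a Unity Catalog-style three-level name
# (catalog.schema.object) into its parts. Names here are illustrative.

def parse_three_level_name(name):
    parts = name.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected catalog.schema.object, got: {name!r}")
    catalog, schema, obj = parts
    return {"catalog": catalog, "schema": schema, "object": obj}

ref = parse_three_level_name("production.sales.orders")
print(ref)  # {'catalog': 'production', 'schema': 'sales', 'object': 'orders'}
```

Because the catalog and schema levels carry their own grants, access control decided at any level of this hierarchy flows down to the objects beneath it.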
Unity Catalog governs the following asset types:
| Asset Type | Description |
|---|---|
| Managed tables | Delta Lake tables with storage managed by Unity Catalog |
| External tables | Tables pointing to data in customer-managed storage |
| Views | Virtual tables defined by SQL queries |
| Volumes | Managed and external file storage (images, documents, raw data) |
| ML models | Models registered through MLflow Model Registry |
| Functions | User-defined functions (UDFs) and AI functions |
| Connections | Metadata for external database connections (federation) |
Introduced in 2025, Unity Catalog Metrics extends governance to business metrics definitions, ensuring that key performance indicators (KPIs) are defined once and used consistently across dashboards, reports, and AI applications. This prevents the common problem of different teams calculating the same metric in different ways [15].
Databricks open-sourced Unity Catalog in 2024, allowing organizations to use its governance capabilities outside the Databricks platform. The open-source version supports Apache Iceberg, Delta Lake, and other table formats, reinforcing Databricks' strategy of building ecosystem adoption through open-source contributions.
Databricks has raised substantial funding across numerous rounds, reflecting its rapid growth:
| Round | Date | Amount | Valuation | Key Investors |
|---|---|---|---|---|
| Series A | 2013 | $14M | - | Andreessen Horowitz |
| Series B | 2014 | $33M | - | Andreessen Horowitz, New Enterprise Associates |
| Series C | 2016 | $60M | - | Various |
| Series D | 2017 | $140M | - | Andreessen Horowitz |
| Series E | 2019 | $250M | $2.75B | Andreessen Horowitz, Microsoft |
| Series F | October 2019 | $400M | $6.2B | Andreessen Horowitz |
| Series G | February 2021 | $1B | $28B | Franklin Templeton |
| Series H | August 2021 | $1.6B | $38B | Morgan Stanley (Counterpoint Global) |
| Series I | September 2023 | $500M | $43B | T. Rowe Price, Nvidia, Capital One |
| Series J | December 2024 | $10B | $62B | Thrive Capital, a16z, various |
| Series K | September 2025 | $1B | ~$100B | Various |
| Series L | December 2025 / February 2026 | $5B ($3B equity + $2B debt) | $134B | Insight Partners, Fidelity, JP Morgan |
The jump from $62 billion in December 2024 to $134 billion by early 2026 reflects the accelerating demand for unified data and AI platforms and Databricks' strong revenue growth.
Databricks has demonstrated strong financial metrics:
| Metric | Value |
|---|---|
| Revenue run-rate (Q3 2025) | $4.8 billion |
| Year-over-year growth | >55% |
| Data warehousing revenue run-rate | >$1 billion |
| AI products revenue run-rate | >$1 billion |
| Free cash flow | Positive (trailing 12 months) |
The transition to positive free cash flow is notable for a company of Databricks' size and growth rate, and it has been cited as a key factor in the company's readiness for a potential public listing.
The most frequently discussed competitive rivalry in the data platform market is between Databricks and Snowflake. The two companies approach the market from different directions:
| Dimension | Databricks | Snowflake |
|---|---|---|
| Origin | Open-source data processing (Spark) | Cloud data warehousing |
| Architecture | Data lakehouse | Shared-data cloud warehouse |
| AI/ML capabilities | Deep (Mosaic AI, MLflow, model training) | Growing (Cortex AI, Snowpark) |
| Open source commitment | Strong (Spark, Delta Lake, MLflow, Unity Catalog) | Moderate (Iceberg adoption, Open Catalog) |
| Data engineering | Native strength | Acquired capability |
| Data warehousing | Growing strength | Native strength |
| Pricing model | Consumption-based | Consumption-based |
| Unstructured data | Native support for text, images, files | Optimized for structured and semi-structured |
| Learning curve | Code-centric (Python, SQL, Scala) | SQL-first, analyst-friendly |
Databricks has traditionally been stronger in data engineering and machine learning, while Snowflake has dominated the data warehousing and analytics market. Both companies are now converging on each other's territory, with Databricks investing heavily in its SQL and warehousing capabilities and Snowflake expanding into AI and ML. The introduction of Databricks' Lakebase (Postgres-compatible transactional database) and Snowflake's Cortex AI in 2025 further blurred the lines between the two platforms.
In 2025, Snowflake responded to Databricks' AI advances by doubling down on openness with Open Catalog and native Iceberg support, enabling teams to work with data in open formats. Snowflake also unveiled Openflow, a low-code ingestion and transformation service built on Apache NiFi, aimed at simplifying data pipelines for less technical users [16].
Databricks countered with innovations of its own, most notably Lakebase and Agent Bricks.
Industry analysts generally view Databricks as having a deeper AI/ML stack due to the MosaicML acquisition and its extensive open-source ecosystem, while Snowflake retains advantages in ease of use for traditional analytics workloads and a larger installed base of SQL-focused users. The consensus for 2026 is that if the primary need is advanced analytics, machine learning, and unified data engineering, Databricks is the stronger choice; for SQL analytics, BI concurrency, and governed reporting, Snowflake typically fits better [16].
Databricks CEO Ali Ghodsi has said he would not rule out a 2026 initial public offering. As of early 2026, the company is generating positive free cash flow and its revenue growth rate exceeds 55% year over year. The $134 billion private valuation positions Databricks as potentially one of the largest technology IPOs in history if and when it proceeds. Industry observers expect that a Databricks IPO would be a landmark event for the enterprise AI and data platform market.
Databricks' open-source strategy has been central to its success. The company has consistently developed and contributed to open-source projects that form the foundation of its commercial platform:
| Project | Description | Status |
|---|---|---|
| Apache Spark | Distributed data processing framework | Apache Software Foundation |
| Delta Lake | ACID-compliant storage layer for data lakes | Linux Foundation |
| MLflow | ML lifecycle management platform | Linux Foundation |
| Unity Catalog | Unified governance for data and AI | Open source (2024) |
| DBRX | Mixture-of-experts language model | Open source |
| Dolly | Instruction-following language model | Open source |
This approach creates a broad ecosystem of users and contributors, many of whom eventually become Databricks customers. It also reduces vendor lock-in concerns, as organizations can use the open-source components independently of Databricks' commercial platform.
As of early 2026, Databricks is one of the most valuable private technology companies in the world at $134 billion. The company is approaching $5 billion in annual revenue, growing at over 55% year over year, and generating positive free cash flow. Its platform has expanded well beyond its Apache Spark roots to encompass data warehousing, AI model training and serving, natural language analytics (Genie), transactional databases (Lakebase), and comprehensive governance (Unity Catalog). The MosaicML acquisition, DBRX model, and Mosaic AI Agent Framework have established Databricks as a credible player in the foundation model and agentic AI spaces, while the company's open-source commitments and lakehouse architecture continue to differentiate it in the enterprise market. With a potential IPO on the horizon, Databricks is entering the next phase of its growth as a public-market-ready enterprise AI platform.