Introducing the AI Lakehouse

Extending the Lakehouse to support Batch, Real-Time, and LLM AI Systems
September 2, 2024
30 min read

Jim Dowling, CEO and Co-Founder, Hopsworks
Lex Avstreikh, Head of Strategy, Hopsworks

TL;DR

Applications need to be made intelligent to stay competitive. The Lakehouse is the dominant data platform for analyzing the historical data generated by applications. But the Lakehouse is not sufficient to augment applications with AI to make them more intelligent. Enterprises that have adopted the Lakehouse still have problems getting AI systems into production. Indeed, the highest value and most challenging use cases for AI, like TikTok's AI-powered recommender system, are today only deployed at a few companies globally. The current Lakehouse architecture lacks the capabilities to make applications AI-enabled. In this article, we describe the capabilities that need to be added to the Lakehouse to make it an AI Lakehouse that can support building and operating AI-enabled batch and real-time applications as well as LLM-powered applications.

1 - The Lakehouse: a unified data layer, but lacking AI capabilities
2 - From Lakehouse to AI Lakehouse
3 - Real-time AI
a. Real-Time Features
b. Add more Precomputed Features to Real-Time Models
c. Unified Data and Model Monitoring
4 - AI Query Engine for the Lakehouse
5 - Large Language Models and the Lakehouse
6 - Catalogs and Governance
7 - Conclusion
Lakehouses, as they exist today, lack capabilities to become the productive software factory for AI.

The Lakehouse: a unified data layer, but lacking AI capabilities

Just as the cloud revolutionized Enterprise computing by separating storage and compute, the Lakehouse is revolutionizing Enterprise data by separating data from its query engines. This separation should lead to lower cost Data Warehouses through open standards, commodity compute, and commodity storage. 

“Cloud-based object storage using open-source formats will be the OLAP DBMS archetype for the next ten years.” Pavlo and Stonebraker, SIGMOD RECORD 2024.

The Lakehouse is effectively a modular (pluggable) data warehouse (columnar store) that decouples the separate concerns of storage, transactions, compute (query/streaming engines), and metadata (catalog), see Figure 1. The Lakehouse layers are, from the bottom up (a minimal query sketch follows the list):

  • physical data using open-source tabular file formats (Parquet) on any store (object store, distributed file system),
  • transaction management for physical data to ensure data correctness and concurrent access, with table formats like Apache Iceberg, Delta Lake, and Apache Hudi providing ACID guarantees, concurrent reads/writes, retries, incremental updates, and time-travel,
  • a catalog that includes an information schema (a mapping of table metadata to storage information for the table); the catalog is responsible for creating, dropping, and renaming tables and for managing collections of tables in namespaces,
  • pluggable compute/query engines, so you can query/write/transform data on any compute infrastructure.
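
To make the decoupling concrete, here is a minimal sketch (assuming DuckDB with its httpfs extension and a hypothetical S3 bucket and schema) that queries Parquet files directly on object storage; any other engine could read or write the same files without moving the data.

```python
# Sketch: pluggable compute over open file formats. The bucket, path, and
# column names are hypothetical.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")  # extension for reading from object storage
con.execute("LOAD httpfs;")

# Query Parquet files directly on S3 - storage and compute are fully decoupled.
df = con.execute("""
    SELECT user_id, COUNT(*) AS n_clicks
    FROM read_parquet('s3://my-bucket/clickstream/*.parquet')
    GROUP BY user_id
""").df()
```
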
The AI Lakehouse Stack
Figure 1: The Lakehouse stack developed independently of MLOps platforms, leading to a disconnect between these two stacks and the siloing of data engineering and data science teams on their respective stacks. This is one reason why MLOps has not yet been a success.

In parallel with the development of the Lakehouse, MLOps platforms have emerged to address challenges with operationalizing AI applications: 

  • connecting data to models for both training and inference with a feature registry and real-time feature serving,
  • providing infrastructural services for model management (model registry/serving) and model monitoring,
  • supporting retrieval augmented generation (RAG) for Large Language Models (LLMs) with Vector Indexes,
  • and managing graphical processing units (GPUs) for model training and serving.

How successful have MLOps platforms been in productionizing AI applications? In 2024, nearly half of all models (48% according to Gartner) still fail to cross the chasm to production. Adoption of MLOps platforms has only marginally helped get more models and better models in production, faster. 

In 2024, 48% of all models fail to cross the chasm to production

Why are MLOps platforms failing to deliver on their promise? A key reason is the disconnect between existing MLOps platforms and the Lakehouse - the siloing of data between analytics teams and data science teams. The Lakehouse is becoming the source of truth for governed, historical data for both analytics and AI, and MLOps platforms need to natively integrate with it. But the Lakehouse by itself is missing many capabilities needed to become a factory for building AI applications. The Lakehouse lacks support for real-time data for AI, the high performance Python clients developers need to iterate faster, and infrastructural services (such as a model registry, model serving, and a feature serving database). We believe the starting point for the next generation of AI platforms is the Lakehouse, and what is needed is an AI Lakehouse that extends the Lakehouse with support for building and operating all types of AI applications - batch, real-time, and LLM AI systems.

From Lakehouse to AI Lakehouse

The Lakehouse is built on open source table formats, with the main standards being Apache Iceberg, Apache Hudi, and Delta Lake. There has been a proverbial Cambrian explosion in the number of compute engines that can natively write and query data stored in table formats. The major proprietary data warehouses, from Snowflake to AWS Redshift to BigQuery, have now opened up and support querying data stored in open table formats. Most near real-time massively parallel databases (ClickHouse, Dremio, StarRocks, Trino) support open table formats. Distributed batch and stream processing frameworks (Spark, Ray, Flink, Daft) and single-host SQL/Python/Rust batch and stream processing frameworks (DuckDB, Polars, Feldera) also support one or more of Iceberg, Hudi, and Delta.

This huge growth in the number of query engines means that you can pick the best tool for reading or writing workloads. In Figure 2, we can see some popular data processing engines used for feature engineering (transforming input data into features that are used as input for both training and inference in models).

The Lakehouse separates storage from query engines
Figure 2: The Lakehouse separates storage from query engines, enabling you to pick the best query engine for your read or write workload. You can choose your framework based on whether your workload will process large data (requiring a distributed query engine) or small data, and whether batch processing is acceptable or you need fresher data, necessitating stream processing.

What has changed in recent years is that batch feature engineering jobs that process a hundred GB of data previously needed a Spark cluster or a data warehouse, but can now easily run on a single virtual machine (VM) using DuckDB or Polars. Similarly, streaming feature pipelines can now be run on a single VM using SQL with Feldera or Python with Bytewax/Quix/Pathway. What these batch and streaming engines have in common is that they massively reduce the development and operational costs of feature pipelines.
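
As a rough illustration, the sketch below (Polars, with a hypothetical bucket and column names) shows a single-VM batch feature pipeline of the kind that previously might have required a Spark cluster:

```python
# Sketch of a single-VM batch feature pipeline; bucket and columns are hypothetical.
import polars as pl

lazy = pl.scan_parquet("s3://my-bucket/transactions/*.parquet")  # lazy, out-of-core scan

features = (
    lazy.group_by("card_id")
        .agg(
            pl.col("amount").mean().alias("avg_txn_amount"),
            pl.col("amount").count().alias("txn_count"),
        )
        .collect()  # only now is the query planned, optimized, and executed
)
```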

The AI Lakehouse requires query engines to read and write data that will be used for both analytics and AI. Databricks have been the leading company in developing and advocating the Lakehouse and have built many AI capabilities on top of their Lakehouse. With Unity Catalog, they now support a model registry (the MLflow model registry is now a catalog service), model serving, a feature serving database, and Python functions for computing features in real-time AI systems. However, they are still missing many read/write capabilities in their Lakehouse:

  • you can’t build the next TikTok recommender system on Databricks today due to its lack of support for real-time streaming data for AI, and 
  • Python is still a second-class citizen in Databricks when reading and writing from/to the Lakehouse, leading to slower developer iteration and poor integration with the Python ecosystem. Databricks serverless doesn't even support installing Python libraries.

Some of the most successful companies building AI systems on Lakehouse data have extended the Lakehouse to meet AI-specific data requirements: ByteDance extended it with real-time data support for its recommender systems, and Netflix built Python-native (Arrow-based) access to its Iceberg tables.

So, some progress has been made by Databricks in adding AI capabilities to its Lakehouse (despite ongoing concerns about vendor lock-in), while other companies (ByteDance, Netflix) have extended the Lakehouse for real-time AI and Python-native access.

In Figure 3, we illustrate the emerging AI Lakehouse architecture with the extensions that are needed to build and operate AI systems on Lakehouse data.

The AI Lakehouse
Figure 3: The AI Lakehouse requires AI pipelines, an AI query engine, catalog(s) for AI assets and metadata (feature/model registry, lineage, reproducibility), AI infrastructure services (model serving, a database for feature serving, a vector index for RAG, and governed datasets with unstructured data).

The AI Lakehouse is the set of platform services that need to be added to the Lakehouse so that it can be used to build and operate batch AI systems, real-time AI systems, and LLM-powered AI systems. We now present a gap analysis of the current Lakehouse for AI, covering the capabilities needed for real-time AI, Python support, model management, monitoring, vector search, and LLM support.

Real-time AI

For many Internet services, 'milliseconds cost millions' - slower responses lead to fewer users, reduced conversion, and lost revenue opportunities. This is as true for AI-powered services as it is for conventional Internet services. The Lakehouse today is not fit for purpose for building real-time AI systems.

Real-time AI systems are built on two types of input features - neither of which is supported by the Lakehouse. The first type is real-time features computed from request parameters passed by clients of the real-time AI system. The second type is precomputed features stored in a database. The real-time AI system retrieves precomputed features using IDs included in the user request (such as a user ID or a product ID in e-commerce). The Lakehouse has too high latency to be used as a database for precomputed feature data.

As presented earlier, the most valuable AI system in the world today, TikTok's personalized recommendation engine, needs extensions to the Lakehouse for real-time AI. TikTok processes user interactions (watch time, likes, and shares) and adjusts the content that will be recommended to users within a couple of seconds (it is a real-time AI system). Real-time adaptive recommendations are one of the main reasons for TikTok's success and its addictive nature (Andrej Karpathy called it "Digital Crack").

In Figure 4, you see an attempt at building TikTok on Delta Lake (Apache Iceberg/Hudi would be equivalent). First, when a user swipes or clicks the screen in TikTok, an event is pushed to a Kafka cluster (this takes less than 1 second). From there, features are computed on the data and the engineered features are written to a Delta Lake table (for example, with Kafka Delta Ingest). Then, a connector program that either runs 24x7 or on a schedule will read each commit to the Delta Lake table(s) and synchronize the data to the feature serving database. A real-time AI system can then read the precomputed features from the feature serving database to power its real-time predictions.

Building an AI system on Delta Lake
Figure 4: Building real-time AI systems with the Databricks Lakehouse architecture requires data first to land in Lakehouse tables before it is synchronized to a feature serving database, from where it is read by real-time AI systems. The end-to-end latency (feature freshness) for new information (for example, user clicks/swipes) to be available for AI systems is minutes. 

This TikTok-style recommender system that we just designed on Delta Lake adapts to your preferences only after minutes, not seconds, and would be dead on arrival.

While this Lakehouse powered TikTok recommender system can scale to handle massive volumes of data, it cannot provide the fresh data (a few seconds old, at most) needed by our AI recommender system. What is needed is support for streaming feature pipelines connected to a low latency feature serving database. The Lakehouse’s role in this real-time AI system should be to act as the low cost store of historical data (for model training and batch inference) - it should not be a message broker for our feature-serving database. 
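
A minimal sketch of that alternative path is shown below (using confluent-kafka, with a hypothetical broker, topic, and feature; a Python dict stands in for the low latency feature serving database). Fresh features are upserted into the online store within seconds of the event, while the same events can be written to Lakehouse tables asynchronously for training and batch inference.

```python
# Sketch: streaming feature pipeline writing directly to an online store,
# bypassing the Lakehouse on the serving path. Broker, topic, and the
# "recent_clicks" feature are hypothetical; the dict stands in for a real database.
import json
from confluent_kafka import Consumer

consumer = Consumer({"bootstrap.servers": "broker:9092", "group.id": "feature-pipeline"})
consumer.subscribe(["user-interactions"])

online_store: dict[int, dict] = {}  # stand-in for the feature serving database

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Upsert a fresh feature for this user immediately (seconds of freshness).
    features = online_store.setdefault(event["user_id"], {"recent_clicks": 0})
    features["recent_clicks"] += 1
```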

Now, we look at a couple of other real-time AI capabilities missing from the Lakehouse: computing real-time features and supporting normalized data models (the snowflake schema data model) for Lakehouse tables that are synchronized with the feature serving database.

Real-Time Features

The Lakehouse and its existing query engines do not support computing real-time features at request time. For example, when a TikTok user clicks on a video, a couple of real-time features need to be computed using request parameters: 

  • the category of video the user clicked on (that will drive recommendations for subsequent videos) and 
  • how long the user spent watching the previous video.

These real-time features are computed on-demand and the code used to create them should be the same code used to create features using historical data stored in the Lakehouse. If the code is not identical, there is a risk of offline-online skew between the historical and real-time feature data (causing poor model performance). 
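
As a sketch of this principle (with a hypothetical watch-ratio feature and column names), the same Python function can be applied both to request parameters at prediction time and to historical events read from the Lakehouse, so the two code paths cannot drift apart:

```python
# One on-demand feature function, reused online and offline to avoid skew.
import pandas as pd

def watch_ratio(watch_time_s: float, video_length_s: float) -> float:
    """Fraction of the previous video the user actually watched."""
    return min(watch_time_s / max(video_length_s, 1.0), 1.0)

# Online: computed from request parameters at prediction time.
online_value = watch_ratio(watch_time_s=12.0, video_length_s=30.0)

# Offline: the same function applied to historical events from the Lakehouse.
history = pd.DataFrame({"watch_time_s": [12.0, 45.0], "video_length_s": [30.0, 60.0]})
history["watch_ratio"] = [
    watch_ratio(w, l) for w, l in zip(history["watch_time_s"], history["video_length_s"])
]
```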

Add more Precomputed Features to Real-Time Models

Real-time AI systems use precomputed feature data that is served from a low latency database (an online feature store). That feature data originates in feature pipelines (either batch or stream processing) and an online table typically stores the latest feature data in a schema that matches the schema for a Lakehouse table that stores the historical feature data (this is a Type 4 SCD data model). 

The design of the table schemas is informed by the features the real-time AI systems need to retrieve at runtime and what entity IDs are available in those real-time AI systems to retrieve those features. Real-time AI systems read precomputed features from the online tables with primary key (or key-value) lookups using an entity ID (a userID, a sessionID, a credit card number, a storeID, etc). You also want to reuse the features in the online tables across models, so that each model does not require its own (denormalized) online table(s). 

Meta reported that “most features are used by many models”. They specifically said that the most popular 100 features are reused in over 100 different models. 

The solution is to support the snowflake schema data model for your precomputed features in both the Lakehouse and online tables. This will enable feature reuse, the retrieval of more features associated with more entities than entity IDs you have available in your real-time AI system, and a one-to-one mapping of Lakehouse tables to online tables, see Figure 5. 

The Snowflake Schema Data Model
Figure 5: The Snowflake Schema Data Model enables feature reuse and more features to be used in real-time AI systems than either the Star Schema or One Big Table data models.

In contrast, the one big table (OBT) data model cannot handle slowly changing dimensions (SCDs), which makes it unusable for storing time-series feature data. Feature reuse favors the snowflake schema over the star schema, which has denormalized dimensions, and the snowflake schema can retrieve more features from its normalized tables than the star schema because tables can have nested foreign keys to related entities. You can read more about how a snowflake schema data model enables the use of more real-time features in an article by Hopsworks.
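
The toy sketch below (in-memory dicts standing in for online feature tables, with hypothetical entities and features) shows why the snowflake schema lets a real-time AI system retrieve more features than it has entity IDs for: the user row carries a foreign key to a city row, so a single user ID from the request is enough to also fetch city-level features.

```python
# Sketch: nested foreign keys turn one request-time entity ID into features
# from several related entities. Tables and columns are hypothetical.
users = {42: {"user_id": 42, "avg_session_min": 17.5, "city_id": 7}}
cities = {7: {"city_id": 7, "population": 975_000, "median_income": 41_000}}

def get_feature_vector(user_id: int) -> dict:
    user = users[user_id]               # primary-key lookup on the user entity
    city = cities[user["city_id"]]      # follow the nested foreign key
    return {**user, **city}             # features from both entities, one request ID

print(get_feature_vector(42))
```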

Unified Data and Model Monitoring

The Lakehouse should enable a unified solution for monitoring both data quality for Lakehouse tables and feature/prediction quality for models. Model monitoring can be made equivalent to data monitoring by storing inference logs (features, predictions, and outcomes, if they are available) in Lakehouse tables. This enables a unified monitoring solution that computes profile and drift metrics on your Lakehouse tables, as well as dashboards to visualize monitoring metrics and alerts to identify problems early.

One challenge with model monitoring that is not widely understood is that inputs to models are often encoded (categorical variables) or scaled (numerical variables), which often makes statistical techniques for drift detection unusable. In Hopsworks, our solution is to support the data transformation taxonomy for AI, where feature transformations (encoding, scaling) are separated from feature engineering steps (aggregations, extraction, dimensionality reduction, etc.), enabling storage of untransformed feature data in inference logs.
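
A minimal sketch of the resulting monitoring job is shown below (SciPy with synthetic data): because the inference logs hold untransformed feature values, a standard two-sample test can compare them directly against the training baseline.

```python
# Sketch: drift detection on untransformed feature values from inference logs.
# The distributions here are synthetic stand-ins for Lakehouse table reads.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_values = rng.normal(loc=50.0, scale=10.0, size=5_000)   # training baseline
inference_values = rng.normal(loc=55.0, scale=10.0, size=1_000)  # recent inference logs

stat, p_value = ks_2samp(training_values, inference_values)
if p_value < 0.01:
    print(f"Feature drift detected (KS statistic = {stat:.3f})")
```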

AI Query Engine for the Lakehouse

The original Lakehouse was built around SQL query engines, with little to no thought given to the needs of AI systems and Python. Python, however, is the language of AI, and it suffers from very poor read and write throughput and high latency on most existing Lakehouses.

Netflix needed higher Python performance to improve the productivity of their data scientists, so they developed an Arrow-native transfer protocol from Iceberg tables to Pandas/Polars clients in Python (see Figure 6): "[Netflix] want to support last-mile data processing in Python, addressing use cases such as feature transformations, batch inference, and training."

How Netflix fixed performance bottleneck
Figure 6: Image by Netflix who fixed the performance bottleneck when using Apache Iceberg Tables in AI pipelines - they developed a fast Python client data library using Apache Arrow.

Many AI pipelines are run in Python. Training pipelines are typically run in Python, and they often need to read large volumes of training data from the Lakehouse. Batch inference pipelines are often run in Python and, again, often need to read and process large amounts of data from the Lakehouse. Single-VM feature pipelines can be run in Polars/DuckDB/Pandas, and they need higher write throughput for Lakehouse tables. And online inference pipelines, which power real-time AI systems, are nearly always run in Python.
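
For comparison, a sketch of an Arrow-native read path is shown below (PyIceberg, with a hypothetical REST catalog endpoint, table, and columns): the scan stays columnar end to end, with the filter pushed down to the table format rather than applied in the client.

```python
# Sketch: Arrow-native read from a Lakehouse table into Pandas.
# Catalog endpoint, table name, and columns are hypothetical.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default", uri="https://my-catalog.example.com")
table = catalog.load_table("features.user_activity")

df = (
    table.scan(
        row_filter="event_date >= '2024-08-01'",      # pushdown filter
        selected_fields=("user_id", "watch_time_s"),  # column pruning
    )
    .to_pandas()                                      # columnar all the way
)
```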

Existing Lakehouse platforms, such as AWS Sagemaker and Databricks, provide Python support for querying Lakehouse tables via JDBC/ODBC APIs. This results in poor read throughput, as data is first serialized and pivoted from column-oriented Lakehouse tables to a row-oriented over-the-wire format, and then deserialized and pivoted back to column-oriented format in clients (Pandas, Polars, DuckDB). Hopsworks built the Hopsworks Query Service to solve the problem of moving data at high speed from the Lakehouse to Python and to meet AI-specific requirements, such as the creation of point-in-time correct training data and the reproduction of training data that has been deleted. In a SIGMOD'24 research paper, Hopsworks showed a 9-45X speedup when reading Lakehouse data into Pandas clients compared to Databricks, AWS Sagemaker, and GCP Vertex. They also showed how it supports temporal JOINs over Lakehouse tables to build point-in-time correct training and batch inference datasets, pushdown filters to improve query performance, and credential vending to read Lakehouse data from S3 buckets. Figure 7 shows the higher throughput of the Hopsworks Query Service for Pandas clients compared to legacy Lakehouse solutions in Databricks, Vertex, and Sagemaker.

Feature Group Benchmark; Feature Groups using a Point-in-time Join Benchmark
Figure 7: Hopsworks provides more than an order of magnitude higher performance compared to AWS Sagemaker, GCP Vertex, and Databricks when reading from Lakehouse tables to Pandas/Polars/DuckDB clients.

The Hopsworks Query Service also addresses the problem (soon to be mandated by regulation) of reproducibly creating training data for AI systems. The key insight here is that reproducing training data from multiple Lakehouse tables containing time-series data requires ingestion timestamps - event-time columns in your Lakehouse tables are not enough to reproduce training data, see Figure 8.

Reproducible Training Data for AI
Figure 8: Reproducible Training Data for AI is based on data ingestion time, not event time. Here, we see some late arriving “Air Quality Measurement” data, just after training dataset v1 was created. To reliably reproduce v1, we need to know that the air quality measurements from the 6 previous days were ingested just after it was created.

The Hopsworks Query Service leverages the existing time-travel capabilities in table formats to enable the reproduction of training data using ingestion timestamps.
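
As a rough sketch of the idea (using the delta-rs Python package, with a hypothetical table path and commit version), re-reading the table as of the commit recorded when the training dataset was materialized excludes any late-arriving rows ingested afterwards:

```python
# Sketch: reproduce training data via table-format time travel, keyed on the
# ingestion (commit) version rather than event time. Path and version are hypothetical.
from deltalake import DeltaTable

# Commit version recorded at the moment training dataset v1 was created.
dt = DeltaTable("s3://my-bucket/features/air_quality", version=12)
training_df = dt.to_pandas()  # exactly the rows that existed at that commit
```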

Finally, Hopsworks Query Service enables high performance Python-based AI Systems using the FTI (feature, training, inference) architecture. The FTI architecture is a unified framework for building real-time, batch, and LLM AI systems where the AI pipelines fall into one of three different classes:

  • feature pipelines that take input data from a variety of data sources (including Lakehouse tables) and output feature data to Lakehouse tables (as well as feature serving tables if the features will be used by real-time AI systems),
  • training pipelines that take input feature data from Lakehouse tables and output a trained model to a model registry,
  • inference pipelines that take as input feature data from Lakehouse tables and one or more models, and output predictions. Batch, streaming, and real-time inference pipelines all belong to this class.

The FTI pipeline architecture reduces the cognitive and collaborative burden in building AI systems, as it decomposes the problems of data processing, model training, and inference into manageable modules that are naturally composed together by AI Lakehouse infrastructural services. The Hopsworks Query Service improves developer iteration speed by enabling faster data-centric approaches to experimentation when training models.
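
A toy, self-contained sketch of the FTI decomposition is shown below (in-memory stand-ins for the feature table and model registry, hypothetical feature and model names): each pipeline communicates with the others only through shared state, so each can be developed, scheduled, and scaled independently.

```python
# Sketch: the three FTI pipeline classes wired together through shared state.
feature_table: list[dict] = []  # stand-in for a Lakehouse/online feature table
model_registry: dict = {}       # stand-in for a model registry

def feature_pipeline(raw_events: list[dict]) -> None:
    """Turn raw events into features and write them to the feature table."""
    for e in raw_events:
        feature_table.append(
            {"user_id": e["user_id"], "watch_ratio": e["watch_s"] / e["length_s"]}
        )

def training_pipeline() -> None:
    """Read feature data and register a (trivial) trained model."""
    avg = sum(r["watch_ratio"] for r in feature_table) / len(feature_table)
    model_registry["recommender"] = {"global_avg_watch_ratio": avg}

def inference_pipeline(user_id: int) -> float:
    """Combine precomputed features with the registered model to predict."""
    model = model_registry["recommender"]
    rows = [r for r in feature_table if r["user_id"] == user_id]
    user_avg = sum(r["watch_ratio"] for r in rows) / len(rows) if rows else 0.0
    return 0.5 * user_avg + 0.5 * model["global_avg_watch_ratio"]

feature_pipeline([{"user_id": 1, "watch_s": 12, "length_s": 30}])
training_pipeline()
print(inference_pipeline(1))
```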

Large Language Models and the Lakehouse

The Lakehouse, as it exists today, is disconnected from LLMs and their data ecosystem that consists of vector databases for retrieval augmented generation (RAG) and instruction datasets for fine-tuning LLMs. Given that the most valuable Enterprise data is in the Lakehouse, that data should be made available for use in LLMs for both RAG and fine-tuning.

For example, consider the popular use case of using LLMs to automate customer support. An initial LLM-based AI system could involve ingesting all of the company's help documentation and customer cases into a vector database to be used for RAG. This system may add a lot of value, but as competitors catch up, there will be pressure to differentiate further. The next logical step is to enable the LLM to also reason about the customer's historical interactions with the company. This historical interaction data is not stored in the vector database - it is stored in the Lakehouse, and it needs to be made available for RAG so that LLMs can reason about it. Mechanisms need to be designed to enable Lakehouse data to be queried and included in LLM prompts. Hopsworks has made steps in this direction, using function calling to query Lakehouse data via the Hopsworks Query Service - all while maintaining data security and governance, see Figure 9.

Architecture for how Lakehouse data can be used to power RAG for LLMs and create instruction datasets for fine-tuning LLMs
Figure 9: There is no clear architecture today for how Lakehouse data can be used to power RAG for LLMs and create instruction datasets for fine-tuning LLMs.

The AI Lakehouse should become the source of truth for all data related to LLMs, including data used to fine-tune LLMs. Fine-tuned LLMs can deliver higher performance at lower cost on a limited number of specialized tasks than larger, general-purpose LLMs, and they additionally enable LLMs to run on sovereign data in private data centers. The AI Lakehouse should provide improved support for creating instruction datasets for fine-tuning from Lakehouse tables.
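
As a small sketch of what that support could look like (Pandas, with a hypothetical support-ticket table and columns), Lakehouse data can be exported as prompt/response pairs in the JSONL format commonly used for instruction fine-tuning:

```python
# Sketch: build an instruction dataset (JSONL) from a Lakehouse table.
# Table contents and column names are hypothetical.
import json
import pandas as pd

tickets = pd.DataFrame({
    "question": ["How do I reset my password?"],
    "resolution": ["Go to Settings > Security and click 'Reset password'."],
})

with open("instructions.jsonl", "w") as f:
    for _, row in tickets.iterrows():
        record = {"instruction": row["question"], "response": row["resolution"]}
        f.write(json.dumps(record) + "\n")
```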

Catalogs and Governance

The Hive Metastore has long been the de facto data catalog for mapping tables to files in analytical databases. With the introduction of the Lakehouse, new catalogs have appeared that manage an increasing number of assets, including AI assets. For example, Apache Iceberg introduced a REST API catalog, removing the need for the Hive Metastore. Databricks have also replaced the Hive Metastore with their open-source Unity Catalog, adding support for AI assets such as models, features, functions, vector databases, and unstructured datasets. Unity Catalog also bundles authentication, access control, and credential vending. Snowflake introduced a competing open-source Polaris catalog that currently supports only the Iceberg REST API, along with authentication, access control, and credential vending. By keeping security and asset sharing as proprietary services, both Databricks and Snowflake are making a less than transparent attempt at vendor lock-in to their new catalogs.

Whether the market adopts Unity Catalog, Polaris, or a mix of catalogs, the AI Lakehouse will need catalog support for:

  • centralized metadata: the ability to discover and use tables, views, and AI assets,
  • governance: the ability to set permissions to use data and AI assets. Authentication and access control can be either vendor supplied (e.g., Databricks, Snowflake) or pluggable (e.g., Active Directory and LDAP),
  • lineage to track the provenance and transformations of data and its use in AI systems,
  • data and AI asset sharing, including models, tables, views, and datasets containing unstructured data.

Conclusion

The Lakehouse will be the historical data layer for AI systems for the next 10 years. However, we need to extend it to become an AI Lakehouse to power all classes of AI systems - from batch to real-time to LLMs. At Hopsworks, we have extended the Lakehouse to support real-time AI, made Python a first-class citizen for querying data with the Hopsworks Query Service, and added support for LLMs using Lakehouse data. Our mission is to build an open, disaggregated AI Lakehouse stack that will power the AI systems of the future.

References