
AI Lakehouse

What is an AI Lakehouse?

An AI Lakehouse is an architectural paradigm that combines elements of data lakes and data warehouses to support advanced AI and machine learning (ML) workloads. This infrastructure approach allows organizations to manage vast amounts of structured and unstructured data while running AI and ML workloads on the same platform. The AI Lakehouse supports building and operating AI-enabled batch, real-time, and LLM-powered applications.

What are the differences between a Lakehouse and an AI Lakehouse?

The main difference between a Lakehouse and an AI Lakehouse lies in the specific infrastructure and capabilities they offer, particularly in relation to supporting AI and ML workloads. A Lakehouse is effectively a modular data warehouse that decouples the separate concerns of storage, transactions, compute, and metadata. An AI Lakehouse extends this architecture by adding components specifically designed for AI/ML, such as an Online Store and a Vector Index.

The AI Lakehouse therefore builds on the Lakehouse architecture and optimizes it for AI and ML applications, enabling a more robust MLOps approach to the deployment and management of AI projects. Below, you can see Hopsworks' AI Lakehouse architecture with the functionalities needed to build and operate AI systems and apply MLOps principles to Lakehouse data.

The AI Lakehouse Architecture

What capabilities are needed for an AI Lakehouse?

As shown in the diagram above, certain capabilities are needed on top of a regular Lakehouse infrastructure to build and operate AI systems. In the Hopsworks AI Lakehouse, these capabilities are the following:

  • AI pipelines: AI pipelines are the structured processes involved in developing, deploying, and maintaining ML and AI models. AI pipelines used in AI systems fall into three categories: feature, training, and inference pipelines (FTI pipelines); a minimal sketch of the three pipelines is shown after this list.
  • AI query engine (Hopsworks Query Service): The original Lakehouse infrastructure was built around SQL, so its query engines do not support AI systems and Python well. The Hopsworks Query Service was built to solve this challenge, enabling high-speed data access from the Lakehouse and meeting AI-specific requirements such as the creation of point-in-time correct training data and the reproduction of training data that has been deleted; a small example of the point-in-time semantics appears after this list.
  • Catalog(s) for AI assets and metadata: These catalogs cover AI assets such as the feature registry and model registry, together with lineage and reproducibility metadata.
  • AI infrastructure services: AI infrastructure services include model serving, a database for feature serving, a vector index for RAG, and governed datasets with unstructured data. 
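The FTI split can be made concrete with a minimal Python sketch. This is not the Hopsworks API; the file names, column names, and helper functions are hypothetical placeholders, and in an AI Lakehouse a feature group and model registry would replace the local files used here.

    # Minimal FTI sketch: three independent pipelines that share features and a model artifact.
    # All file and column names below are hypothetical placeholders.
    import pandas as pd
    import joblib
    from sklearn.linear_model import LogisticRegression

    def feature_pipeline():
        # Feature pipeline: turn raw data into features and persist them.
        raw = pd.read_csv("raw_events.csv")            # hypothetical raw source
        features = raw.assign(
            amount_7d_avg=raw["amount"].rolling(7, min_periods=1).mean()
        )
        features.to_parquet("features.parquet")        # stand-in for a feature group

    def training_pipeline():
        # Training pipeline: read features, train a model, persist the artifact.
        df = pd.read_parquet("features.parquet")
        X, y = df[["amount", "amount_7d_avg"]], df["is_fraud"]
        model = LogisticRegression().fit(X, y)
        joblib.dump(model, "model.pkl")                # stand-in for a model registry

    def inference_pipeline(new_rows: pd.DataFrame) -> pd.Series:
        # Inference pipeline: load the model and score fresh feature rows.
        model = joblib.load("model.pkl")
        return pd.Series(model.predict(new_rows), index=new_rows.index)

Keeping the three pipelines separate means each can run on its own schedule and infrastructure: features can be refreshed continuously, training can run on demand, and inference can serve batch or online requests.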
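The point-in-time correctness requirement mentioned above can be illustrated with pandas.merge_asof: each label row is joined only with feature values observed at or before its own timestamp, so no future information leaks into the training data. This is only a sketch of the semantics, with made-up columns; it is not the Hopsworks Query Service itself, which applies the same logic at Lakehouse scale.

    import pandas as pd

    # Labels with the time at which each label became known.
    labels = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "label_time": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
        "churned": [0, 1, 0],
    }).sort_values("label_time")

    # Feature observations with the time at which they were recorded.
    features = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "event_time": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-08"]),
        "logins_last_30d": [12, 3, 7],
    }).sort_values("event_time")

    # Point-in-time join: for each label, take the latest feature value
    # recorded at or before label_time for the same customer.
    training_df = pd.merge_asof(
        labels, features,
        left_on="label_time", right_on="event_time",
        by="customer_id", direction="backward",
    )
    print(training_df[["customer_id", "label_time", "logins_last_30d", "churned"]])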