
AI Lakehouse

What is an AI Lakehouse?

An AI Lakehouse is an architectural paradigm that combines elements of data lakes and data warehouses to support advanced AI and machine learning (ML) workloads. This infrastructure approach allows organizations to manage vast amounts of structured and unstructured data while running AI and ML workloads on the same platform. The AI Lakehouse supports building and operating AI-enabled batch, real-time, and LLM-powered applications.

What are the differences between a Lakehouse and an AI Lakehouse?

The main difference between a Lakehouse and an AI Lakehouse lies in the specific infrastructure and capabilities they offer, particularly in relation to supporting AI and ML workloads. A Lakehouse is effectively a modular data warehouse that decouples the separate concerns of storage, transactions, compute, and metadata. An AI Lakehouse extends this architecture by adding components specifically designed for AI/ML, such as an Online Store and a Vector Index.

The AI Lakehouse therefore builds on the Lakehouse architecture and optimizes it for AI and ML applications, enabling a more robust MLOps approach to the deployment and management of AI projects. Below, you can see Hopsworks' AI Lakehouse architecture with the functionalities needed to build and operate AI systems and apply MLOps principles to Lakehouse data.

The AI Lakehouse Architecture

What capabilities are needed for an AI Lakehouse?

As shown in the diagram above, certain capabilities are needed on top of a regular Lakehouse infrastructure to build and operate AI systems. In the Hopsworks AI Lakehouse, these capabilities are the following:

  • AI pipelines: AI pipelines are the structured processes involved in developing, deploying, and maintaining ML and AI models. AI pipelines used in AI systems fall into three categories: feature, training, and inference pipelines (FTI pipelines); a minimal sketch of the three pipelines is shown after this list.
  • AI query engine (Hopsworks Query Service): The original Lakehouse infrastructure was built around SQL, so its query engines do not support AI systems and Python well. The Hopsworks Query Service was built to solve this challenge, enabling high-speed data access from the Lakehouse and meeting AI-specific requirements such as the creation of point-in-time correct training data and the reproduction of training data that has been deleted; a small example of the point-in-time semantics appears after this list.
  • Catalog(s) for AI assets and metadata: These catalogs cover AI assets such as the feature registry and model registry, together with lineage and reproducibility metadata.
  • AI infrastructure services: AI infrastructure services include model serving, a database for feature serving, a vector index for RAG, and governed datasets with unstructured data. 
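The FTI split can be made concrete with a minimal Python sketch. This is not the Hopsworks API; the file names, column names, and helper functions are hypothetical placeholders, and in an AI Lakehouse a feature group and model registry would replace the local files used here.

    # Minimal FTI sketch: three independent pipelines that share features and a model artifact.
    # All file and column names below are hypothetical placeholders.
    import pandas as pd
    import joblib
    from sklearn.linear_model import LogisticRegression

    def feature_pipeline():
        # Feature pipeline: turn raw data into features and persist them.
        raw = pd.read_csv("raw_events.csv")            # hypothetical raw source
        features = raw.assign(
            amount_7d_avg=raw["amount"].rolling(7, min_periods=1).mean()
        )
        features.to_parquet("features.parquet")        # stand-in for a feature group

    def training_pipeline():
        # Training pipeline: read features, train a model, persist the artifact.
        df = pd.read_parquet("features.parquet")
        X, y = df[["amount", "amount_7d_avg"]], df["is_fraud"]
        model = LogisticRegression().fit(X, y)
        joblib.dump(model, "model.pkl")                # stand-in for a model registry

    def inference_pipeline(new_rows: pd.DataFrame) -> pd.Series:
        # Inference pipeline: load the model and score fresh feature rows.
        model = joblib.load("model.pkl")
        return pd.Series(model.predict(new_rows), index=new_rows.index)

Keeping the three pipelines separate means each can run on its own schedule and infrastructure: features can be refreshed continuously, training can run on demand, and inference can serve batch or online requests.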
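The point-in-time correctness requirement mentioned above can be illustrated with pandas.merge_asof: each label row is joined only with feature values observed at or before its own timestamp, so no future information leaks into the training data. This is only a sketch of the semantics, with made-up columns; it is not the Hopsworks Query Service itself, which applies the same logic at Lakehouse scale.

    import pandas as pd

    # Labels with the time at which each label became known.
    labels = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "label_time": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
        "churned": [0, 1, 0],
    }).sort_values("label_time")

    # Feature observations with the time at which they were recorded.
    features = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "event_time": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-08"]),
        "logins_last_30d": [12, 3, 7],
    }).sort_values("event_time")

    # Point-in-time join: for each label, take the latest feature value
    # recorded at or before label_time for the same customer.
    training_df = pd.merge_asof(
        labels, features,
        left_on="label_time", right_on="event_time",
        by="customer_id", direction="backward",
    )
    print(training_df[["customer_id", "label_time", "logins_last_30d", "churned"]])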