Modularity and Composability for AI Systems with AI Pipelines and Shared Storage

A Unified Architecture for Batch, Real-Time, and LLM AI Systems
June 25, 2024 · 25 min read

Jim Dowling
CEO and Co-Founder, Hopsworks

TL;DR

Modularity in software refers to decomposing a system into smaller, more manageable modules that can be independently developed and composed into a complete software system. Modularity helps us build better quality, more reliable software systems, as modules can be independently tested. AI systems can also benefit from modularity, enabling teams to build higher quality AI systems, faster. However, a lesson our community learnt by getting burnt with microservices was that modularity only helps if the modules can be easily composed into functioning systems. In this article, we argue that a shared storage layer with well-defined APIs should be the main mechanism for composing the modules that make up an AI system - from data collection and feature engineering, to model training, to inference. In particular, we introduce the feature/training/inference (FTI) architecture as a unified architecture for building real-time, batch, and LLM AI systems, where the feature store and model registry act as the shared storage layer. The feature store provides well-defined DataFrame APIs for reading and writing data (tabular data and embeddings), while the model registry provides APIs for storing/retrieving models and metadata around models. These highly available stateful services enable the modularization of AI systems into feature, training, and inference pipelines that provide a natural decomposition of work for data engineering, data science, and ML/AI engineering teams.

This article is part 1 in a 7 part series describing in lay terms concepts and results from a SIGMOD 2024 research paper on the Hopsworks Feature Store.

Other Parts: 2 (The Taxonomy for Data Transformations), 3 (Use all features: Snowflake Schema), 4 (Lakehouse for AI), 5 (From Lakehouse to AI Lakehouse), 6 (Real-Time AI Database), 7 (Reproducible Data).

Introduction

Modularity and composability are the Yin and Yang of systems development. In this article, we introduce a blueprint for a software factory for AI that shows you how to decompose an AI system into independent, modular components (AI pipelines) and then easily compose them into a complete AI system using a shared storage layer. Just like a factory, each AI pipeline will play a well-defined role in transforming input data into features and models, and using trained models and new input data to make predictions. Just like a factory, AI artifacts will be produced at intermediate stages and seamlessly integrated into the AI systems that generate value and justify the investment in the factory. We will pay attention to reuse of intermediate outputs to keep costs down and improve quality.

The main contribution of this article is a unified software architecture for batch, real-time, and LLM AI systems that is based on a shared storage layer and a decomposition of machine learning (ML) pipelines into feature pipelines (that transform input data to features/labels), training pipelines (that transform features/labels into trained models), and inference pipelines that transform new features into predictions using trained models.

A Brief History of Modularity and Composability for AI Systems

In the 1980s, with the advent of local area networking, software systems made the transition from monolithic application architectures to client-server systems. With the advent of the Internet and web applications in the 1990s, the industry moved to the 3-tier application architecture, where the business logic was separated from the presentation layer, with a database as the backend layer. This was a natural decomposition for web applications that served well for many years, until data volumes increased and more scalable architectures were needed. In the early 2010s, microservices emerged as an alternative architecture to the then-dominant monolithic 3-tier applications, which had become expensive to maintain and difficult to scale. By decomposing large systems into microservices, different teams could work independently, and with well-defined interfaces between loosely coupled microservices, systems could be composed as connected graphs of microservices.

As microservice architectures grew in complexity, see Figure 1, they introduced new problems when they were composed together into complete systems. When microservices are used in larger systems, it becomes hard to update their APIs without proper versioning support (which, in turn, requires supporting multiple API versions). When graphs of RPC calls become deep, they become hard to trace. When state is fragmented over many different databases (each microservice often has its own local database), system-level backup and recovery becomes harder. High, unpredictable latencies are a consequence of the tail-at-scale. And, in general, high availability is challenging (following Leslie Lamport’s maxim that "a distributed system is one where you can't get your work done because some machine you've never heard of is broken").

Figure 1: Microservices decompose systems into manageable modules, but introduce new challenges in composing them into highly available, performant, observable services. Image from Zhang et al.

There has, therefore, been a natural swing back towards more centralized (now serverless) architectures (aka macroservices) to avoid the composability and operational challenges that can spiral out of control in microservice architectures (see this funny video that captures the essence of the problem).

What is the lesson here for AI systems? 

The first lesson is that if you decompose your AI system into too many fine-grained services, you increase complexity when you need to compose your system. Alternatively, if you have a single monolithic end-to-end system, it will not be maintainable and there will be little to no reuse of its components across other projects. 

The second lesson is that AI systems are a diverse bunch. AI systems are not always operational systems (applications that run 24x7). Some AI systems are batch systems that run on a schedule, producing predictions (think of Spotify's weekly recommendations that suggest songs for the coming week). Other AI systems are operational machine-to-machine systems, such as a real-time credit-card fraud detection system. Still other AI systems are user-facing operational systems, such as an LLM-powered chatbot.

The third lesson is that all AI systems have some offline/batch component to them - whether that is collecting data to retrain models, ingesting data for RAG, or training models. In many production AI systems, model training is run on a schedule to prevent model degradation from negatively impacting the performance of the AI system.

Then, we have to consider the main technical and operational challenge in building AI systems: managing state. State complicates building modular and composable AI systems. The solutions that microservice architectures embrace are (1) local state stored at each microservice (which is problematic when you need transactional operations that cross multiple microservices, and also when you need to make your system highly available), and (2) stateless microservices that use one or more external data stores, such as a database, key-value store, or event bus. Operationally, the lowest cost architecture is often stateless microservices that share a common scalable operational data layer.

So, what is the equivalent state in AI systems? The minimal viable state that an AI system has to manage is:

  • data for training models;
  • the trained models themselves;
  • data for inference.

Data for training and inference is typically mutable data (data never stops coming), while the trained models are immutable. This decomposition of state in AI systems leads naturally to the prototypical 3-stage architecture for AI systems:

  • feature engineering to manage data for training and inference (the training datasets don’t create themselves, you know!), 
  • the offline model training process to create the trained models, and 
  • the (batch or online) inference systems that make the predictions with the model and inference data

You might think that these separate stages should all be connected in one directed acyclic graph (DAG), but you would be wrong. Training does not happen as part of inference - they are separate processes that run at their own cadences (you run the training process when you need a new model, and inference when you need to make a prediction). We will see later the benefits of making feature engineering its own process, ensuring consistent feature data for training and inference (preventing training/serving skew). You may think this decomposition is reasonable but too coarse-grained. We will also see later that if you want to further decompose any of these three stages, you can easily do so. The key technology enabling this decomposition is a stateful ML infrastructure layer. Let’s dive into managing state in AI systems.
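To make the separate cadences concrete, here is a minimal sketch of the three stages as independent programs that communicate only through shared storage. All names are illustrative placeholders (plain Python dictionaries stand in for a feature store and a model registry), not any particular product's API.

```python
# A minimal sketch: three independent programs that never call each other and
# only communicate through shared storage. The dictionaries below stand in
# for a feature store and a model registry; names and columns are illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURE_STORE = {}   # placeholder for a feature store (tables of feature data)
MODEL_REGISTRY = {}  # placeholder for a model registry (versioned models)


def feature_pipeline(raw: pd.DataFrame) -> None:
    """Runs on its own schedule (e.g., hourly): raw data -> features/labels."""
    features = raw.assign(
        amount_zscore=(raw["amount"] - raw["amount"].mean()) / raw["amount"].std()
    )
    FEATURE_STORE["transactions"] = features


def training_pipeline() -> None:
    """Runs only when a new model is needed: features/labels -> trained model."""
    df = FEATURE_STORE["transactions"]
    model = LogisticRegression().fit(df[["amount_zscore"]], df["is_fraud"])
    MODEL_REGISTRY["fraud_model"] = model


def inference_pipeline() -> pd.Series:
    """Runs per batch (or per request): model + new features -> predictions."""
    model = MODEL_REGISTRY["fraud_model"]
    features = FEATURE_STORE["transactions"]
    return pd.Series(model.predict(features[["amount_zscore"]]), index=features.index)
```

Each program can be scheduled at its own cadence (the feature pipeline hourly, the training pipeline weekly, the inference pipeline nightly or on demand); none of them invokes another directly.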

Case Study: the evolution of MLOps in GCP from Microservices to Shared State

A few years ago, Google Cloud promoted building real-time ML systems as compositions of microservices, see Figure 2. This was based on their TensorFlow Extended (TFX) architecture, which more recently morphed (along with Kubeflow) into Vertex AI Pipelines.

Figure 2: Should a composable real-time machine learning system be built from separate independent stateless microservices? Images from Lak Lakshmanan at Google Cloud.

Around 2023 (with the advent of GCP Vertex), Google started promoting an MLOps architecture for real-time AI systems that has some stateful services (the feature store, model registry, and ML metadata store) and only one monolithic AI pipeline (data extraction, data validation, data preparation, model training, model evaluation, model validation).

Figure 3: MLOps architecture for a real-time AI system by GCP (Image from GCP as of May 2024). The architecture has a mix of orchestrated tasks and stateful services, with numbers showing you where it starts and finishes. It is very confusing and a consultant’s dream.

Based on this MLOps architecture, an AI pipeline seems to take raw data as input and produce a model as output. The only definition I could find from GCP Vertex was this: “An AI pipeline is a portable and extensible description of an MLOps workflow as a series of steps called pipeline tasks. Each task performs a specific step in the workflow to train and/or deploy an ML model.” This definition implies that the output of an AI pipeline is a trained model and/or a deployed model. But is that the accepted definition of an AI pipeline? Not in this article, where we argue that feature engineering and inference pipelines are also part of both MLOps and AI systems in general.

What is an AI pipeline?

The unwritten assumption among many MLOps systems is that you can modularize an AI system by connecting independent AI pipelines together. But what if you only have one monolithic AI pipeline, like GCP? Is there an alternative, more fine-grained, decomposition of an AI system?

Yes. If you have a feature store, you can have feature pipelines that create feature data and store it there, along with labels (observations) for supervised machine learning (ML) models. The feature store enables a training pipeline that starts by reading training data from the feature store, trains a model, and saves the trained model to a model registry. The model registry, in turn, enables an inference pipeline that reads feature (inference) data (from the feature store or from client requests) and the trained model, and outputs predictions for use by an AI-enabled application.

So, what is an AI pipeline? An AI pipeline is a program that either runs on a schedule or continuously, has well-defined input data, and creates one or more AI artifacts as output. We typically name an AI pipeline after the AI artifact(s) it creates - a feature pipeline creates features, a training pipeline outputs a trained model, and an inference pipeline outputs predictions (makes inferences). Occasionally, you may name an AI pipeline based on how it modifies an AI artifact - such as a model or feature validation pipeline that asynchronously validates a model or feature data, respectively. Or you could have a training dataset pipeline that materializes feature data from the feature store as files. The point here is that the term AI pipeline is an abstraction, not a concrete pipeline. When you want to be precise in discussing your AI system, always refer to the concrete name for the AI pipeline based on the AI artifact it outputs. If somebody asks you how to automate feature engineering for their AI system, telling them to build an AI pipeline conveys less information than telling them to build a feature pipeline (which implies the input data is the raw data for features, and the output is reusable feature data stored in a feature store).
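As an illustration, a feature pipeline can be a short, self-contained program. The sketch below uses the Hopsworks Python API as I recall it from the public documentation (hopsworks.login, get_feature_store, get_or_create_feature_group, insert); treat the exact names, arguments, input file, and columns as indicative rather than definitive.

```python
# Sketch of a feature pipeline: raw data in, reusable feature data out.
# The Hopsworks calls are written from memory of the public Python API;
# the input file and column names are hypothetical.
import hopsworks
import pandas as pd


def feature_pipeline() -> None:
    # 1. Read raw input data (here a CSV; in production, a table or a stream).
    raw = pd.read_csv("transactions.csv")

    # 2. Transform raw data into features.
    features = raw.assign(
        amount_zscore=(raw["amount"] - raw["amount"].mean()) / raw["amount"].std()
    )

    # 3. Write the reusable feature data to the shared storage layer.
    project = hopsworks.login()
    fs = project.get_feature_store()
    fg = fs.get_or_create_feature_group(
        name="transactions",
        version=1,
        primary_key=["tx_id"],
        event_time="tx_time",
    )
    fg.insert(features)


if __name__ == "__main__":
    feature_pipeline()  # scheduled externally, e.g. hourly
```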

AI Systems as Modular Stateful Systems

An AI system that is trained on a single (static) dataset and only makes a single prediction with a test set from that static dataset can only generate value once. It’s not much of a system if it only runs once. AI systems need to manage state to be able to continually generate value with new data. In Figure 4, we can see how an AI system manages data by, over time, producing new training datasets, training new models, and making new inference data available to generate new predictions, continually generating value. Production AI systems are rarely trained on a static training dataset. Instead, they typically start with the batch/streaming/real-time data that is used to create the training datasets.

Figure 4: An AI system is a factory that produces ML assets, including static training datasets, batches of inference data, versioned models, and predictions.

With the advent of GenAI and pretrained models, you may think the above architecture does not apply to your AI system, as your large language model (LLM) is pre-trained! But if you plan to fine-tune an LLM or use RAG (retrieval-augmented generation), the above architecture still applies. You will need to create new training datasets for fine-tuning or update your indexes for RAG (e.g., in a vector database). So, whatever non-trivial AI system you build, you will need to manage newly arriving data. Your AI system will also need to manage the programs (AI pipelines) that create the features, models, and predictions from your data. So, let’s look now at the programs (pipelines) in AI systems.

The FTI Pipeline Architecture

The programs that make up an AI system handle its main concerns - ingesting and managing the training/inference data (cleaning, validating, transforming), training the models with training data, and inference (making predictions) with models and inference data.

There are many different types of AI systems - batch, real-time, streaming, and embedded systems - distinguished by how they make their predictions. Batch AI systems produce predictions in batches, on a schedule. Real-time systems take prediction requests and return low-latency prediction responses. Streaming applications can use a model to make predictions on incoming streaming data. Embedded AI systems are embedded applications that typically make predictions on data acquired locally through sensors or network devices. The type of AI system is independent of the type of model used - LLMs, decision trees, CNNs, logistic regression, and so on.

Despite this heterogeneity in the types of AI systems, they have commonality in their core architecture, see Table 1. They all have programs that implement a set of data transformation steps, from ingesting raw data to refining that data into features (inputs for training and inference) and labels. Model training and inference can also be seen as (data transformation) functions. Model training takes features and labels as input and transforms them into a trained model as output. Inference takes a trained model and features as input and transforms them into predictions as output.

So, at a minimum, all AI systems have data transformation steps and state in the form of features, models, and predictions. Data transformations are the functions, whilst features, models, and predictions are the state in our AI pipelines.
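Viewed as functions over state, training and inference can be sketched like this (scikit-learn is used purely as an example framework; the choice of model is arbitrary):

```python
# Training and inference as data transformation functions over state.
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier


def train(features: pd.DataFrame, labels: pd.Series) -> BaseEstimator:
    """(features, labels) -> trained model."""
    return RandomForestClassifier(n_estimators=100).fit(features, labels)


def infer(model: BaseEstimator, features: pd.DataFrame) -> pd.Series:
    """(trained model, features) -> predictions."""
    return pd.Series(model.predict(features), index=features.index)
```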

Table 1: The most common ML steps and the assets created at each step.

This commonality is illustrated in an architecture diagram in Figure 5 as a set of three AI pipelines, connected by a shared storage layer. We call this the FTI (feature, training, inference) architecture:

Figure 5 is an abstract representation of an AI system using the FTI architecture.

Figure 5: Feature pipelines, training pipelines, and inference pipelines are the independent AI pipelines that together make up an ML system.

You may also think that our decomposition of an AI system into FTI pipelines is too coarse-grained and that there are many systems that are not architected this way. However, the AI pipelines in the FTI architecture can be refactored into smaller, yet still composable pipelines, see Table 2, connected by the same data layer. A good practice for AI pipelines is to name them after the asset they produce - this naming pattern communicates their expected output in the AI system.

Table 2: Fine-grained AI pipelines, named after the assets they create.

Let’s examine the examples of fine-grained AI pipelines from Table 2. We can refactor our feature pipeline to consist of the original feature pipeline (create features from raw input data) and a feature validation pipeline that validates feature data asynchronously after it has landed in the feature store. Similarly, model validation can be refactored out of a training pipeline into its own model validation pipeline. You might need a separate model validation pipeline if model training uses expensive GPUs while model validation takes a long time and only needs CPUs. Feature monitoring and model monitoring often have their own pipelines, as does inference logging for real-time AI systems.
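As a sketch of what a refactored feature validation pipeline might look like (the column names and checks are invented for illustration; in practice you might use a data validation library such as Great Expectations):

```python
# Sketch of an asynchronous feature validation pipeline: it runs after feature
# data has landed in the feature store and flags problems. Column names and
# checks are illustrative.
import pandas as pd


def validate_features(features: pd.DataFrame) -> list[str]:
    """Return a list of human-readable validation failures (empty if OK)."""
    failures = []
    if features["amount"].isnull().any():
        failures.append("null values in 'amount'")
    if (features["amount"] < 0).any():
        failures.append("negative values in 'amount'")
    if features["tx_id"].duplicated().any():
        failures.append("duplicate primary keys in 'tx_id'")
    return failures


def feature_validation_pipeline(features: pd.DataFrame) -> None:
    failures = validate_features(features)
    if failures:
        # In a real system: raise an alert and/or quarantine the offending rows.
        print(f"Feature validation failed: {failures}")
```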

AI Pipelines as Contracts

AI pipelines also have a well-defined input and output interface (or schema). For any AI pipeline, you should be able to write down its contractual interface: not just its typed input and output data, but also its preconditions, postconditions, invariants, and non-functional requirements, as in Table 3.

Table 3: Examples of some of the information that you can capture in contracts describing AI pipelines. Contracts help downstream consumers of an AI pipeline understand how to use its outputs and what they can rely on.
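One lightweight way to make such a contract executable is to wrap the pipeline body in checks for its typed inputs/outputs, preconditions, postconditions, and invariants, as in this sketch (the schema, feature logic, and checks are invented for illustration):

```python
# Sketch: encoding an AI pipeline's contract as executable checks.
# The input schema, feature logic, and invariants are illustrative only.
import pandas as pd

INPUT_COLUMNS = {"tx_id", "amount", "tx_time"}


def check_preconditions(df: pd.DataFrame) -> None:
    assert INPUT_COLUMNS <= set(df.columns), "precondition: missing input columns"
    assert len(df) > 1, "precondition: input batch too small"


def check_postconditions(features: pd.DataFrame) -> None:
    assert not features["amount_zscore"].isnull().any(), "postcondition: nulls in output"
    assert features["tx_id"].is_unique, "invariant: duplicate primary keys"


def feature_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    check_preconditions(df)
    features = df.assign(
        amount_zscore=(df["amount"] - df["amount"].mean()) / df["amount"].std()
    )
    check_postconditions(features)
    return features
```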

Unified Architecture for AI Systems

The FTI architecture is a unified architecture for structuring AI systems, because the same architecture can be used to decompose:

  • LLM AI Systems
  • Batch AI Systems
  • Real-Time AI Systems

In the following sections, we describe these systems in terms of the FTI pipeline architecture.

LLM AI Systems

Table 4 shows a concrete example of the FTI architecture in terms of an LLM system that includes both fine-tuning and retrieval-augmented generation (RAG).

Table 4: The FTI pipeline architecture describes an AI system that performs both fine-tuning and RAG for LLMs. Feature pipelines can chunk text that is then transformed into vector embeddings and stored in a vector DB. The same text is used to create instruction datasets to fine-tune a foundation LLM. The AI system then combines the user prompt with any RAG data from the vector DB to query the LLM and return a response.

We notice from Table 4 that the output of our feature pipeline now includes vector embeddings that should be indexed for approximate nearest neighbor (ANN) search. You could use a vector database to index the vector embeddings, but some feature stores (e.g., Hopsworks) have been extended to support vector embeddings with ANN search, so you don’t have to add an extra data platform (a vector database) to your ML infrastructure. Your choice.
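As a sketch of the RAG side of such a feature pipeline (the chunking strategy, the embedding model, and the in-memory stand-in for a vector index are all illustrative assumptions, not a specific product API):

```python
# Sketch of a RAG feature pipeline: chunk documents, embed the chunks, and
# hand the embeddings to an index. The embedding model is just an example;
# the returned arrays would normally be written to a vector DB or to a
# feature store with ANN support.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available


def chunk(text: str, size: int = 500) -> list[str]:
    """Naive fixed-size character chunking (real pipelines chunk more carefully)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def rag_feature_pipeline(documents: list[str]) -> tuple[list[str], np.ndarray]:
    chunks = [c for doc in documents for c in chunk(doc)]
    model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
    embeddings = model.encode(chunks)                # shape: (num_chunks, dim)
    return chunks, np.asarray(embeddings)
```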

Batch AI Systems

A batch AI system uses one or more models to make predictions on a schedule using batches of new inference data. Table 5 shows the main AI pipelines from the FTI architecture in terms of a batch AI system.

Table 5: The FTI pipeline architecture describes a Batch AI system as a batch feature pipeline, a training pipeline, and a batch inference pipeline that runs on a schedule.

Batch AI systems run an inference pipeline on a schedule that takes new data and one or more trained ML models to make predictions that are typically stored in a database for later use. Batch AI systems are relatively easy to operate, as failures are not always time-critical - you only have to fix a broken inference pipeline before its next scheduled run. They can also be made to scale to huge data volumes with technologies such as PySpark.
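A scheduled batch inference pipeline can be a short program along these lines (the storage access is reduced to placeholder functions; a real pipeline would read features from the feature store, load the model from the model registry, and write predictions to a downstream database, possibly using PySpark at larger scale):

```python
# Sketch of a batch inference pipeline, run on a schedule (e.g., nightly).
# Storage access is reduced to placeholders so the shape of the pipeline is clear.
import pandas as pd


def read_inference_features() -> pd.DataFrame:
    """Placeholder: read the latest batch of inference features from the feature store."""
    raise NotImplementedError


def load_latest_model():
    """Placeholder: load the latest validated model from the model registry."""
    raise NotImplementedError


def write_predictions(predictions: pd.DataFrame) -> None:
    """Placeholder: write predictions to a database for downstream applications."""
    raise NotImplementedError


def batch_inference_pipeline() -> None:
    features = read_inference_features()
    model = load_latest_model()
    predictions = features.assign(prediction=model.predict(features))
    write_predictions(predictions)
```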

Real-Time AI Systems

A real-time (interactive) AI system takes user input and uses one or more models to make one or more predictions that are returned as a response to the client. Table 6 shows the main AI pipelines from the FTI architecture in terms of a real-time AI system.

Table 6: The FTI pipeline architecture describes a Real-Time (Interactive) AI system. Streaming feature pipelines result in fresher features.

Real-time AI systems typically use streaming feature pipelines if they need very fresh feature data in the feature store. For example, TikTok uses Flink to ensure that clicks by users are available for use as features within a few seconds. Online inference pipelines need to be available 24x7, and are operational services that are typically deployed along with the model in model-serving infrastructure, such as KServe, MLflow, Seldon, BentoML, AWS SageMaker, or GCP Vertex.
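An online inference pipeline, by contrast, is an always-on request/response service. The sketch below uses FastAPI only as an illustrative serving framework; the model loading and the low-latency feature lookup are hypothetical placeholders for a model registry and a feature store's online read API.

```python
# Sketch of an online inference pipeline as an always-on request/response service.
# FastAPI is used only as an example framework; the placeholders stand in for the
# model registry and the feature store's low-latency (online) read API.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class PredictionRequest(BaseModel):
    customer_id: str
    amount: float


def load_model_from_registry():
    """Placeholder: load the trained model from the model registry."""
    raise NotImplementedError


def lookup_precomputed_features(customer_id: str) -> list[float]:
    """Placeholder: low-latency lookup of precomputed features for this customer."""
    raise NotImplementedError


@app.post("/predict")
def predict(req: PredictionRequest) -> dict:
    model = load_model_from_registry()  # in practice, loaded once and cached at startup
    # Combine request-time features with precomputed features, then predict.
    features = [req.amount] + lookup_precomputed_features(req.customer_id)
    return {"prediction": float(model.predict([features])[0])}
```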

Summary

Break the monolith. Decompose your AI systems into modular, maintainable AI (or machine learning) pipelines with clear input and output interfaces. But remember, modularity without ease of composition of those modules is a fool’s errand. The most natural decomposition for AI systems is into a data preparation stage, a model training stage, and an inference stage. Different teams can take responsibility for the three different stages, and you can easily specify the contracts for the AI artifacts produced by the FTI pipelines in terms of preconditions, postconditions, invariants, and non-functional requirements. The FTI pipeline architecture uses shared storage - a feature store and a model registry - to connect your AI pipelines, and it has supplanted the shared-nothing storage architecture of microservices as the best practice for architecting AI systems.

References