A Machine Learning (ML) pipeline is a program that takes data as input and produces one or more ML artifacts as output. Typically, an ML pipeline is one of the following: a feature pipeline, a training pipeline, or an inference pipeline.
In the above figure, we can see three different examples of ML pipelines:
Feature pipelines can be batch programs or streaming programs. Inference pipelines can be batch programs or online inference pipelines that wrap models made accessible via a network endpoint using model-serving infrastructure.
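The three pipeline types can be sketched as plain Python functions. This is a deliberately minimal illustration, not a real API: the feature transformation and the threshold "model" are toy stand-ins for real feature engineering and model training.

```python
def feature_pipeline(raw_rows):
    """Feature pipeline: turn raw input rows into feature rows (here: scale one field)."""
    return [{"x": r["amount"] / 100.0, "label": r["label"]} for r in raw_rows]

def training_pipeline(feature_rows):
    """Training pipeline: produce a model artifact (here: a trivial threshold rule)."""
    positives = [r["x"] for r in feature_rows if r["label"] == 1]
    return {"threshold": min(positives)}

def inference_pipeline(model, feature_rows):
    """Batch inference pipeline: apply the model to new, unlabeled feature rows."""
    return [1 if r["x"] >= model["threshold"] else 0 for r in feature_rows]

raw = [{"amount": 50, "label": 0}, {"amount": 250, "label": 1}]
features = feature_pipeline(raw)
model = training_pipeline(features)
preds = inference_pipeline(model, [{"x": 3.0}, {"x": 0.1}])
```

Each function takes well-defined inputs and produces a well-defined output (features, a model, predictions), which is exactly what distinguishes an ML pipeline from a generic program.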
ML pipelines enable you to move from training ML models on static data and making a single prediction on your static dataset to working with dynamic data, so your model can continually generate value by making predictions on new data. ML pipelines also help ensure the reproducibility and scalability of machine learning workflows. By encapsulating the entire process in multiple pipelines, it is easier to manage, version control, and share the different stages of the process.
A monolithic ML pipeline is a single program that can be run as either (1) a feature pipeline followed by a training pipeline or (2) a feature pipeline followed by a batch inference pipeline.
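A monolithic pipeline is often a single program with a mode switch: the feature stage always runs, then either training or batch inference follows in the same process. A minimal sketch, with an illustrative mean-threshold model standing in for real training:

```python
def monolithic_pipeline(mode, raw_rows, model=None):
    """One program: a feature stage followed by training OR batch inference."""
    # Stage 1: feature pipeline (always runs, in-process)
    features = [r["amount"] / 100.0 for r in raw_rows]
    if mode == "train":
        # Stage 2a: training pipeline -- fit a trivial mean-threshold model
        return {"threshold": sum(features) / len(features)}
    if mode == "batch_inference":
        # Stage 2b: batch inference pipeline -- apply the supplied model
        return [x >= model["threshold"] for x in features]
    raise ValueError(f"unknown mode: {mode}")

model = monolithic_pipeline("train", [{"amount": 100}, {"amount": 300}])
preds = monolithic_pipeline("batch_inference", [{"amount": 250}], model=model)
```

Because the feature logic lives inside the same program as training and inference, it cannot be versioned, scheduled, or scaled independently, which motivates the decomposition described below.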
A data pipeline can be an ML pipeline, but the term "data pipeline" is too generic: it does not define what the pipeline's inputs and outputs are, making it an unclear term when communicating about ML pipelines and ML systems.
If you have a feature store, you can decompose a monolithic ML pipeline into separate feature, training, and inference pipelines. The feature store becomes the data layer for your ML pipelines: it stores the output of the feature pipeline and provides the inputs to the training and inference pipelines.
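The decomposition can be sketched as follows, with a plain Python dict standing in for a real feature store (a hypothetical stand-in, not the Hopsworks API). The key point is that the three pipelines no longer call each other; they communicate only through the shared data layer.

```python
# Hypothetical stand-in for a feature store: feature-group name -> feature rows.
feature_store = {}

def feature_pipeline(raw_rows):
    """Writes features to the store instead of handing them to a caller."""
    feature_store["transactions"] = [r["amount"] / 100.0 for r in raw_rows]

def training_pipeline():
    """Reads training data from the store, not from the feature pipeline."""
    xs = feature_store["transactions"]
    return {"threshold": sum(xs) / len(xs)}  # toy mean-threshold model

def batch_inference_pipeline(model):
    """Also reads its inputs from the store, decoupled from the other pipelines."""
    xs = feature_store["transactions"]
    return [x >= model["threshold"] for x in xs]

feature_pipeline([{"amount": 100}, {"amount": 300}])
model = training_pipeline()
preds = batch_inference_pipeline(model)
```

Because each pipeline reads from and writes to the store, they can be scheduled, versioned, and scaled independently, which is the property a monolithic pipeline lacks.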
Our research paper, "The Hopsworks Feature Store for Machine Learning", published at SIGMOD 2024, is the first feature store paper to appear at a top-tier database or systems conference. This article series describes, in lay terms, the concepts and results from that paper.