A feature pipeline is a program that orchestrates the execution of a dataflow graph of feature functions (transformations on input data that create unencoded feature data), writing the computed features to one or more feature groups. A feature pipeline can also include reading input data from data sources, validating the data, and any other steps needed to compute the features.
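The sketch below illustrates this structure in Python with pandas: a feature function transforms raw input into unencoded features, and the pipeline reads input data, runs the feature function, validates the result, and writes it to a feature group. The input path, the `feature_store` client, and its `get_or_create_feature_group`/`insert` API are illustrative placeholders, not a specific feature store's interface.

```python
import pandas as pd

def credit_card_features(txns: pd.DataFrame) -> pd.DataFrame:
    """Feature function: aggregate raw transactions into unencoded features."""
    return (
        txns.groupby("cc_num")
        .agg(
            avg_amount=("amount", "mean"),
            txn_count=("amount", "count"),
            max_amount=("amount", "max"),
        )
        .reset_index()
    )

def run_pipeline(feature_store) -> None:
    # Read input data from a data source (a Parquet path stands in here
    # for a warehouse table, message bus, or database query).
    txns = pd.read_parquet("s3://bucket/transactions/")  # hypothetical path

    # Execute the dataflow graph of feature functions.
    features = credit_card_features(txns)

    # Validate before writing (a minimal sanity check).
    assert features["txn_count"].ge(1).all(), "every card must have >= 1 txn"

    # Write the computed features to a feature group; this client API
    # is a placeholder for whichever feature store you use.
    fg = feature_store.get_or_create_feature_group(
        name="cc_features", version=1, primary_key=["cc_num"]
    )
    fg.insert(features)
```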
Feature pipelines enable features to be computed either on a schedule or, in the case of streaming feature pipelines, continuously (24x7). The feature pipeline encapsulates the logic for computing the features in feature groups, defines the data validation logic, and writes the features to feature groups. A batch feature pipeline needs to be run on a schedule by an orchestration engine, such as Airflow or Dagster, or, for simple cron-based scheduling, Modal.
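As a concrete example of scheduled orchestration, here is a minimal Airflow DAG that runs a batch feature pipeline once a day. The `run_pipeline` callable is assumed to be the pipeline sketched above; the DAG id and schedule are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_feature_pipeline() -> None:
    # Placeholder: call the feature pipeline's entry point here,
    # e.g. run_pipeline(feature_store) from the earlier sketch.
    ...

with DAG(
    dag_id="daily_feature_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # `schedule` replaces `schedule_interval` in Airflow >= 2.4
    catchup=False,
) as dag:
    PythonOperator(
        task_id="compute_features",
        python_callable=run_feature_pipeline,
    )
```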
Feature pipelines read their input data from data sources such as data warehouses, message buses, databases, object stores, and HTTP APIs. These data sources provide either live input data during scheduled executions or historical data when backfilling feature groups. Feature pipelines should scale to handle the largest expected input data volume.
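One way to support both modes with the same code is to parameterize the read by a time range, so a scheduled run reads the most recent window and a backfill replays historical windows one slice at a time. A sketch, assuming a Parquet source on an object store with an `event_time` column (the path and column names are hypothetical):

```python
from datetime import datetime, timedelta

import pandas as pd

def read_transactions(start: datetime, end: datetime) -> pd.DataFrame:
    # Read only the rows in [start, end); with the pyarrow engine,
    # `filters` prunes partitions/row groups instead of scanning everything.
    return pd.read_parquet(
        "s3://bucket/transactions/",  # hypothetical path
        filters=[("event_time", ">=", start), ("event_time", "<", end)],
    )

# Scheduled run: process the last day of live data.
now = datetime.utcnow()
daily_df = read_transactions(now - timedelta(days=1), now)

# Backfill: replay a historical range one day at a time, so no single
# run exceeds the input volume the pipeline is sized to handle.
day = datetime(2023, 1, 1)
while day < datetime(2024, 1, 1):
    backfill_df = read_transactions(day, day + timedelta(days=1))
    # ... compute and write features for this slice ...
    day += timedelta(days=1)
```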