A feature pipeline is a program that orchestrates the execution of a dataflow graph of feature functions (transformations on input data that create unencoded feature data), writing the computed features to one or more feature groups. A feature pipeline can also include reading input data from data sources, validating the data, and any other steps needed to compute the features.
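The sketch below illustrates this structure in Python with pandas: a feature function transforms raw input into unencoded features, and the pipeline reads input data, runs the feature function, validates the result, and writes it to a feature group. The input path, the `feature_store` client, and its `get_or_create_feature_group`/`insert` API are illustrative placeholders, not a specific feature store's interface.

```python
import pandas as pd

def credit_card_features(txns: pd.DataFrame) -> pd.DataFrame:
    """Feature function: aggregate raw transactions into unencoded features."""
    return (
        txns.groupby("cc_num")
        .agg(
            avg_amount=("amount", "mean"),
            txn_count=("amount", "count"),
            max_amount=("amount", "max"),
        )
        .reset_index()
    )

def run_pipeline(feature_store) -> None:
    # Read input data from a data source (a Parquet path stands in here
    # for a warehouse table, message bus, or database query).
    txns = pd.read_parquet("s3://bucket/transactions/")  # hypothetical path

    # Execute the dataflow graph of feature functions.
    features = credit_card_features(txns)

    # Validate before writing (a minimal sanity check).
    assert features["txn_count"].ge(1).all(), "every card must have >= 1 txn"

    # Write the computed features to a feature group; this client API
    # is a placeholder for whichever feature store you use.
    fg = feature_store.get_or_create_feature_group(
        name="cc_features", version=1, primary_key=["cc_num"]
    )
    fg.insert(features)
```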
Feature pipelines enable features to be computed either on a schedule or, in the case of streaming feature pipelines, continuously (24x7). The feature pipeline encapsulates the logic for computing the features in feature groups, defines the data validation logic, and writes the features to feature groups. A batch feature pipeline needs to be run on a schedule by an orchestration engine, such as Airflow or Dagster, or, for simple cron-based scheduling, Modal.
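As a concrete example of scheduled orchestration, here is a minimal Airflow DAG that runs a batch feature pipeline once a day. The `run_pipeline` callable is assumed to be the pipeline sketched above; the DAG id and schedule are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_feature_pipeline() -> None:
    # Placeholder: call the feature pipeline's entry point here,
    # e.g. run_pipeline(feature_store) from the earlier sketch.
    ...

with DAG(
    dag_id="daily_feature_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # `schedule` replaces `schedule_interval` in Airflow >= 2.4
    catchup=False,
) as dag:
    PythonOperator(
        task_id="compute_features",
        python_callable=run_feature_pipeline,
    )
```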
Feature pipelines read their input data from data sources such as data warehouses, message buses, databases, object stores, and HTTP APIs. These data sources provide either live input data during scheduled executions or historical data when backfilling feature groups. Feature pipelines should scale to handle the largest expected input data volume.
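One way to support both modes with the same code is to parameterize the read by a time range, so a scheduled run reads the most recent window and a backfill replays historical windows one slice at a time. A sketch, assuming a Parquet source on an object store with an `event_time` column (the path and column names are hypothetical):

```python
from datetime import datetime, timedelta

import pandas as pd

def read_transactions(start: datetime, end: datetime) -> pd.DataFrame:
    # Read only the rows in [start, end); with the pyarrow engine,
    # `filters` prunes partitions/row groups instead of scanning everything.
    return pd.read_parquet(
        "s3://bucket/transactions/",  # hypothetical path
        filters=[("event_time", ">=", start), ("event_time", "<", end)],
    )

# Scheduled run: process the last day of live data.
now = datetime.utcnow()
daily_df = read_transactions(now - timedelta(days=1), now)

# Backfill: replay a historical range one day at a time, so no single
# run exceeds the input volume the pipeline is sized to handle.
day = datetime(2023, 1, 1)
while day < datetime(2024, 1, 1):
    backfill_df = read_transactions(day, day + timedelta(days=1))
    # ... compute and write features for this slice ...
    day += timedelta(days=1)
```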