Automated Feature Engineering with FeatureTools

August 23, 2023

13 min read

Haziqa Sajid

Data Scientist

Freelance

Data Engineering

MLOps

Feature Store

TL;DR

A machine learning model's ability to learn patterns in data depends largely on the quality of its features. Well-designed features can significantly improve a model's performance while also requiring less input data than raw data would. With frameworks such as FeatureTools, ML practitioners can automate the feature engineering process, and combined with a feature store for storing features, ML models can be put into production rapidly.

Introduction

Features are the data points fed into a machine learning (ML) model. They are extracted from larger datasets and carry the vital information that describes the original data. In many cases, the original columns of the data can be used as features as they are. In more complicated cases, however, features are engineered by passing the data through statistical algorithms.

An ML model's performance largely depends on the quality and quantity of these features. The better the quality of the features, the easier it is for the model to learn the patterns in the data. This article discusses how ML practitioners can use automated feature engineering to rapidly create meaningful features and enhance the MLOps pipeline.

Why Engineer Features?

A machine learning model should be trained on a diverse set of features to perform well in practical scenarios. One way to diversify is to collect more data. However, data collection can be expensive and time-consuming.

Feature engineering uses aggregation techniques and data analysis to extract additional, yet vital, information from an existing dataset. This additional information helps ML models learn new patterns to achieve better performance.

Automated Feature Engineering

Deep learning automates feature engineering but requires large volumes of labeled data to work effectively. When insufficient training data is available for deep learning, explicit feature engineering can be used to create features from data sources. Feature engineering requires technical and in-depth domain knowledge for appropriate feature creation. In practical scenarios, the source data for features is often spread across various data tables and needs to be combined to create effective features. The overall process is labor-intensive and prone to errors, requiring iterative development approaches where new features are created and tested before being adopted by models.

There are, however, automated feature engineering tools to simplify the feature engineering process for tabular data. These libraries, such as FeatureTools, provide a simple API to carry out all necessary feature processing in an automated fashion. Furthermore, these libraries can be integrated into feature pipelines for production usage.

FeatureTools: Automated Feature Engineering in Python

FeatureTools is a popular open-source Python framework for automated feature engineering. It works across multiple related tables and applies various transformations for feature generation. The entire process is carried out using a technique called “Deep Feature Synthesis” (DFS) which recursively applies transformations across entity sets to generate complex features. The FeatureTools framework consists of the following components:

Entity Set
An entity is a data table holding information. It is the most fundamental building block of the framework. A collection of such entities is called an Entity Set. The entity sets also include additional information like schemas, metadata, and various entities' relationships.

Feature Primitives
Primitives are the statistical functions applied to transform the data present in the entity set. The functions include aggregations, ratios, percentages, etc. Primitives may process multiple data entities to create a single value, such as sum, min, or max, or apply a transformation on entire columns to create a new feature.

Deep Feature Synthesis (DFS)
DFS is the algorithm the framework uses to extract features automatically. It combines primitives and applies them to the entity set to generate features. Because primitives can be stacked, a new feature can be the result of several operations applied across different parts of the dataset.
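
To make the three components concrete, here is a minimal plain-Python sketch of the idea behind them. It does not use FeatureTools; the tables, column names, and the `aggregate` helper are hypothetical and only illustrate how aggregation primitives can be stacked across related tables.

```python
from collections import defaultdict
from statistics import mean

# A toy "entity set": three related tables (all names and values are hypothetical).
# customers -> orders -> order_items, linked by customer_id and order_id.
customers = [{"customer_id": "c1"}, {"customer_id": "c2"}]
orders = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 2, "customer_id": "c1"},
    {"order_id": 3, "customer_id": "c2"},
]
order_items = [
    {"order_id": 1, "price": 10.0},
    {"order_id": 1, "price": 30.0},
    {"order_id": 2, "price": 20.0},
    {"order_id": 3, "price": 5.0},
]


def aggregate(child_rows, foreign_key, value_column, func):
    """Aggregation primitive: collapse child rows to one value per parent key."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[foreign_key]].append(row[value_column])
    return {key: func(values) for key, values in groups.items()}


# Depth 1: SUM(order_items.price) for every order.
order_totals = aggregate(order_items, "order_id", "price", sum)

# Attach the depth-1 feature to the orders table so it can be aggregated again.
orders_with_totals = [
    {**order, "SUM(order_items.price)": order_totals.get(order["order_id"], 0.0)}
    for order in orders
]

# Depth 2: MEAN(orders.SUM(order_items.price)) for every customer.
mean_order_total = aggregate(
    orders_with_totals, "customer_id", "SUM(order_items.price)", mean
)

# One row per customer, one column per generated feature.
toy_feature_matrix = {
    c["customer_id"]: {
        "MEAN(orders.SUM(order_items.price))": mean_order_total.get(c["customer_id"])
    }
    for c in customers
}
```

The depth-2 feature is built by feeding the output of one primitive into another, which is the stacking that DFS automates across every valid combination of tables and primitives.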

FeatureTools Implementation

FeatureTools allows users to input their datasets, create EntitySets, and use the sets for automated feature engineering. For demonstration purposes, the framework includes dummy datasets that allow users to explore its functionality.

Let's test it. First, we have to install the framework. Run the following command in the terminal.


pip install featuretools

After installation is completed, we can load the library and the relevant data in the following manner.


# import featuretools
import featuretools as ft
# load dummy data
es = ft.demo.load_retail()

Let’s view the data.

 
print(es)

***Figure 1:*** *EntitySet: Demo retail data*

We can see that the EntitySet contains information about the individual entities and the relationships between them. Using this, we can create features by calling a single function.


feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["count", "sum", "min"],
    trans_primitives=["month"],
    max_depth=5,
)

The code snippet calls the `ft.dfs` function from the library. The function takes the EntitySet, the name of the target dataframe, and the primitives to apply when building the features; `max_depth` limits how many primitives can be stacked. Let's take a look at what the output looks like.


print(feature_matrix)

The dataframe contains one row per customer and one column per generated feature, ready to be used for training an ML model.

Custom Dataset

Ingesting multiple tables and establishing relationships is a vital part of FeatureTools. Let’s see how we can work with a custom dataset. For this exercise, we will use the Home Credit Default Risk dataset.

The dataset consists of eight CSV files (the application data is split into a train and a test file), each linked to at least one other table through a shared key column. Our first step is to load these datasets into memory.


import pandas as pd

# Load datasets
app_train = pd.read_csv('data/application_train.csv')
app_test = pd.read_csv('data/application_test.csv')
bureau = pd.read_csv('data/bureau.csv')
bureau_balance = pd.read_csv('data/bureau_balance.csv')
cash = pd.read_csv('data/POS_CASH_balance.csv')
credit = pd.read_csv('data/credit_card_balance.csv')
previous = pd.read_csv('data/previous_application.csv')
installments = pd.read_csv('data/installments_payments.csv')

We have separate training and testing datasets for the application data, so we need to combine them first.

 
import numpy as np

# The test file has no label column; add a placeholder so both frames share a schema
app_test['TARGET'] = np.nan

# Stack training and testing rows (DataFrame.append was removed in pandas 2.0)
app = pd.concat([app_train, app_test], ignore_index=True, sort=True)

Next, we need to perform some data analysis. Several processing techniques can be applied to the dataset, such as asserting data types or imputing missing data, but they are beyond the scope of this article. We will only check the dataset for NaN values.

 
# check NULL values in the datasets
print("app: ")
print(app.isnull().sum())
print("bureau: ")
print(bureau.isnull().sum())
print("bureau_balance: ")
print(bureau_balance.isnull().sum())
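
On wide tables, the raw counts printed above can be hard to scan. As a small sketch (the `null_report` helper and the miniature tables are hypothetical, not part of the original walkthrough), the same check can be condensed to the fraction of missing values for only those columns that have any:

```python
import numpy as np
import pandas as pd


def null_report(frames):
    """Map table name -> fraction of missing values, only for columns that have any."""
    report = {}
    for name, df in frames.items():
        fractions = df.isnull().mean()
        report[name] = fractions[fractions > 0].sort_values(ascending=False)
    return report


# Hypothetical miniature tables standing in for the real CSV files.
demo_frames = {
    "app": pd.DataFrame(
        {"SK_ID_CURR": [1, 2, 3, 4], "AMT_ANNUITY": [1.0, np.nan, 3.0, 4.0]}
    ),
    "bureau": pd.DataFrame(
        {"SK_ID_BUREAU": [10, 11], "AMT_CREDIT_SUM": [np.nan, np.nan]}
    ),
}

report = null_report(demo_frames)
for name, fractions in report.items():
    print(name)
    print(fractions)
```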

Several columns contain NaN values. For this demonstration, we fill them with zeros.

 
# fill all NaN values with zero so they do not interfere with processing
app.fillna(0, inplace=True)
bureau.fillna(0, inplace=True)
bureau_balance.fillna(0, inplace=True)
cash.fillna(0, inplace=True)
credit.fillna(0, inplace=True)
previous.fillna(0, inplace=True)
installments.fillna(0, inplace=True)
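
One caveat with the blanket zero-fill: it also replaces the placeholder NaN values in the `TARGET` column, so test rows can no longer be told apart from training rows labeled 0. A minimal sketch of one alternative, using a hypothetical miniature table, fills only the non-label columns:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the combined application table (hypothetical values).
app_demo = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 2, 3, 4],
        "AMT_INCOME_TOTAL": [100.0, np.nan, 250.0, np.nan],
        "TARGET": [0.0, 1.0, np.nan, np.nan],  # NaN marks rows from the test file
    }
)

# Fill every column except the label.
feature_columns = app_demo.columns.difference(["TARGET"])
app_demo[feature_columns] = app_demo[feature_columns].fillna(0)

# The NaN labels still identify the test rows after imputation.
train_rows = app_demo[app_demo["TARGET"].notna()]
test_rows = app_demo[app_demo["TARGET"].isna()]
```

Whether zero is a sensible fill value at all depends on the column; this sketch only shows how to keep the label out of the imputation step.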

Now let’s drop some columns that will not be used.


# SK_ID_CURR is redundant in these tables: they link to `previous` via SK_ID_PREV
installments = installments.drop(columns = ['SK_ID_CURR'])
credit = credit.drop(columns = ['SK_ID_CURR'])
cash = cash.drop(columns = ['SK_ID_CURR'])

Finally, we create the EntitySet using the data loaded from the CSV files.


# Empty entity set with id 'clients'
es = ft.EntitySet(id='clients')

# Dataframes that already have a unique index column
es = es.add_dataframe(dataframe_name='app', dataframe=app,
    index='SK_ID_CURR')
es = es.add_dataframe(dataframe_name='bureau', dataframe=bureau,
    index='SK_ID_BUREAU')
es = es.add_dataframe(dataframe_name='previous', dataframe=previous,
    index='SK_ID_PREV')

# Dataframes without a unique index: let FeatureTools create one
es = es.add_dataframe(dataframe_name='bureau_balance', dataframe=bureau_balance,
    make_index=True, index='bureaubalance_index')
es = es.add_dataframe(dataframe_name='cash', dataframe=cash,
    make_index=True, index='cash_index')
es = es.add_dataframe(dataframe_name='installments', dataframe=installments,
    make_index=True, index='installments_index')
es = es.add_dataframe(dataframe_name='credit', dataframe=credit,
    make_index=True, index='credit_index')

Note that, to be part of an EntitySet, every table must have a column that serves as a unique identifier. Our `app`, `bureau`, and `previous` dataframes already have such columns, but for the rest, we set the `make_index` flag to True so that FeatureTools creates an identifier itself.
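
To decide which tables need a generated identifier, it helps to check whether a candidate column is actually unique. The sketch below uses plain pandas on a hypothetical miniature table; the `ensure_index` helper is an illustration of the idea, not how FeatureTools implements `make_index` internally.

```python
import pandas as pd

# Hypothetical slice of a child table: SK_ID_BUREAU repeats once per month.
balance_demo = pd.DataFrame(
    {
        "SK_ID_BUREAU": [10, 10, 11],
        "MONTHS_BALANCE": [0, -1, 0],
    }
)


def ensure_index(df, candidate, surrogate_name):
    """Return (frame, index_name): reuse `candidate` if unique, else add a surrogate."""
    if df[candidate].is_unique:
        return df, candidate
    with_index = df.reset_index(drop=True)
    # Add a simple 0..n-1 counter as the first column.
    with_index.insert(0, surrogate_name, list(range(len(with_index))))
    return with_index, surrogate_name


indexed, index_name = ensure_index(
    balance_demo, "SK_ID_BUREAU", "bureaubalance_index"
)
```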


# view the set
print(es)

All our data is loaded into a single EntitySet, but no relationships have been established yet. FeatureTools needs to know how the tables relate to one another. All the tables in our dataset are linked through shared key columns, so we now need to declare these relationships within the FeatureTools EntitySet.

 
# Relationships as (parent dataframe, parent column, child dataframe, child column)
es = es.add_relationship('app', 'SK_ID_CURR', 'bureau', 'SK_ID_CURR')
es = es.add_relationship('bureau', 'SK_ID_BUREAU', 'bureau_balance', 'SK_ID_BUREAU')
es = es.add_relationship('app','SK_ID_CURR', 'previous', 'SK_ID_CURR')
es = es.add_relationship('previous', 'SK_ID_PREV', 'cash', 'SK_ID_PREV')
es = es.add_relationship('previous', 'SK_ID_PREV', 'installments', 'SK_ID_PREV')
es = es.add_relationship('previous', 'SK_ID_PREV', 'credit', 'SK_ID_PREV')

print(es)

***Figure 5***: *EntitySet after creating relationships*
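
When declaring relationships like these, it can be worth confirming that every key in a child table has a matching row in its parent. The following is a minimal pandas sketch with hypothetical data; `orphan_keys` is an illustrative helper and not a FeatureTools function.

```python
import pandas as pd

# Hypothetical parent and child tables sharing the SK_ID_CURR key.
parent = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
child = pd.DataFrame(
    {"SK_ID_BUREAU": [10, 11, 12], "SK_ID_CURR": [1, 1, 4]}
)


def orphan_keys(parent_df, parent_col, child_df, child_col):
    """Return the sorted child key values that have no matching parent row."""
    missing = ~child_df[child_col].isin(parent_df[parent_col])
    return sorted(child_df.loc[missing, child_col].unique().tolist())


orphans = orphan_keys(parent, "SK_ID_CURR", child, "SK_ID_CURR")
print(orphans)
```

An empty list means every child row can be attached to a parent; any values returned point to rows that an aggregation over the relationship would not be able to assign.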

The EntitySet contains all relevant entities and relationships. The set is ready for creating features using deep feature synthesis. We can use the same code as our previous example.


# Default primitives from featuretools
agg_primitives = ["sum", "std", "max", "skew", "min", "mean", "count",
    "percent_true", "num_unique", "mode"]
trans_primitives = ["day", "year", "month", "weekday", "haversine",
    "num_words", "num_characters"]

# DFS with specified primitives
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='app',
    trans_primitives=trans_primitives,
    agg_primitives=agg_primitives,
    max_depth=4,
    n_jobs=-1,
    verbose=1,
)

# view first 10 features
print(feature_defs[:10])

And just like that, with a single function call, we were able to generate a large number of features that can be used for model training. The extracted features can then be stored in a feature store.

Storing Features

Feature stores are a convenient way of storing calculated features. Whether using automated methods or manual calculations, the feature vectors need to be stored in a secure location for later use.

Hopsworks For Feature Storage

We can store the features we generated earlier in the Hopsworks feature store in a few lines of code. First, we install the library.


pip install hsfs

Once the installation is successful, we can connect to the feature store and save our dataframe.


import hsfs

# create connection to HSFS
connection = hsfs.connection()
# load the default feature store
fs = connection.get_feature_store()

# DFS returns SK_ID_CURR as the dataframe index; expose it as a regular column
features_df = feature_matrix.reset_index()

# initialize the feature group (one row per application, keyed by SK_ID_CURR)
fg = fs.create_feature_group("home_credit_features",
    version=1,
    description="Features created for the Home Credit data using FeatureTools",
    primary_key=['SK_ID_CURR'],
    online_enabled=True)

# save our created features to the feature group
fg.insert(features_df)
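
DFS names its features after the primitives it stacked, for example `SUM(bureau.AMT_CREDIT_SUM)`, and such names may not be accepted as column names by a feature store. The sketch below assumes the store expects lowercase letters, digits, and underscores, starting with a letter; that rule and the `sanitize_feature_name` helper are assumptions for illustration, so check the Hopsworks documentation for the exact naming constraints.

```python
import re


def sanitize_feature_name(name):
    """Lowercase a DFS feature name and replace other characters with underscores."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
    # Assumed rule: names should start with a letter, so add a prefix if needed.
    if not cleaned or not cleaned[0].isalpha():
        cleaned = "f_" + cleaned
    return cleaned


# Example names in the style DFS generates (hypothetical selection).
dfs_names = [
    "SUM(bureau.AMT_CREDIT_SUM)",
    "MEAN(previous.SUM(cash.SK_DPD))",
    "COUNT(bureau)",
]
clean_names = [sanitize_feature_name(n) for n in dfs_names]
print(clean_names)
```

Applied to the real data, this would be a one-line rename of `feature_matrix.columns` before inserting. Two different DFS names could collapse to the same cleaned name, so it is worth checking the result for duplicates.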

Summary

Engineering features is a tiresome process, which is why many engineers opt for automated feature engineering. Frameworks like FeatureTools can generate new features with a single function call. The framework organizes datasets into EntitySets and uses the Deep Feature Synthesis technique to generate features. DFS works across related data tables and applies primitives such as mean, sum, and count to create new features.

The engineered features are finally stored in a feature store, a centralized storage for features from various business domains. It serves as an access point for ML engineers, who use the stored features for model training, and it enables effective collaboration across teams.
