Haziqa Sajid
Data Scientist
Article updated on

Automated Feature Engineering with FeatureTools

August 23, 2023
Freelance

TL;DR

A machine learning model's ability to learn patterns in data depends largely on the quality of its features. Well-engineered features can significantly improve a model's performance and can represent the data more compactly than the raw inputs. With frameworks such as FeatureTools, ML practitioners can automate the feature engineering process and, combined with a feature store for storing the results, put ML models into production rapidly.

Introduction

Features are the inputs to a machine learning (ML) model. They are extracted from larger datasets and carry the information that describes the original data. In many cases, the original columns of the data can serve as features as they are. In more complicated cases, features are engineered by passing the data through statistical transformations.

An ML model's performance depends largely on the quality and quantity of these features. The better the quality of the features, the easier it is for the model to learn the patterns in the data. This article discusses how ML practitioners can use automated feature engineering to rapidly create meaningful features and enhance the MLOps pipeline.

Why Engineer Features?

A machine learning model should be trained on a diverse set of features for it to perform well in practical scenarios. One way to diversify the feature set is to collect more data. However, data collection can be expensive and time-consuming.

Feature engineering uses aggregation techniques and data analysis to extract additional, yet vital, information from an existing dataset. This additional information helps ML models learn new patterns to achieve better performance. 
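
As a small illustration of the aggregation idea, the sketch below derives per-customer features from a made-up transactions table using pandas. The table, column names, and values are hypothetical and are not part of the datasets used later in this article.

```python
import pandas as pd

# Hypothetical raw data: one row per transaction
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 35.0, 5.0, 12.5, 7.5],
})

# Aggregate the raw rows into one feature row per customer
customer_features = transactions.groupby("customer_id")["amount"].agg(
    total_spent="sum",
    mean_spent="mean",
    n_transactions="count",
)
print(customer_features)
```

None of these three columns exists in the raw table, yet each one summarizes a customer's behavior in a way a model can use directly.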

Automated Feature Engineering

Deep learning automates feature engineering but requires large volumes of labeled data to work effectively. When there is not enough training data for deep learning, explicit feature engineering can be used to create features from data sources. Creating appropriate features requires both technical skill and in-depth domain knowledge. In practical scenarios, the source data for features is often spread across several tables that need to be combined to create effective features. The overall process is labor-intensive and error-prone, and it calls for an iterative approach in which new features are created and tested before being adopted by models.

There are, however, automated feature engineering tools to simplify the feature engineering process for tabular data. These libraries, such as FeatureTools, provide a simple API to carry out all necessary feature processing in an automated fashion. Furthermore, these libraries can be integrated into feature pipelines for production usage.

FeatureTools: Automated Feature Engineering in Python

FeatureTools is a popular open-source Python framework for automated feature engineering. It works across multiple related tables and applies various transformations for feature generation. The entire process is carried out using a technique called “Deep Feature Synthesis” (DFS) which recursively applies transformations across entity sets to generate complex features. The FeatureTools framework consists of the following components:

  1. Entity Set
    An entity is a data table holding information. It is the most fundamental building block of the framework. A collection of such entities is called an Entity Set. An entity set also holds additional information such as schemas, metadata, and the relationships between the entities.
  2. Feature Primitives
    Primitives are the functions applied to transform the data in the entity set. They include aggregations, ratios, percentages, etc. An aggregation primitive, such as sum, min, or max, combines multiple rows into a single value, while a transformation primitive is applied to entire columns to create a new feature.
  3. Deep Feature Synthesis (DFS)
    DFS is the algorithm the framework uses to extract features automatically. It combines primitives and applies them across the entity set, stacking them so that a new feature can be the result of several operations applied across related tables.
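
To make the idea of stacking primitives concrete, here is a toy sketch in plain Python. It is not how FeatureTools is implemented; the tables, column names, and logic are made up for illustration. It builds a depth-1 feature (the total of each order) and then a depth-2 feature (the mean order total per customer) by following the parent-child links between tables.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical child table: order items, each linked to an order
order_items = [
    {"order_id": "o1", "price": 10.0},
    {"order_id": "o1", "price": 5.0},
    {"order_id": "o2", "price": 20.0},
    {"order_id": "o3", "price": 8.0},
]
# Hypothetical parent table: orders, each linked to a customer
orders = [
    {"order_id": "o1", "customer_id": "c1"},
    {"order_id": "o2", "customer_id": "c1"},
    {"order_id": "o3", "customer_id": "c2"},
]

# Depth 1: aggregate items up to their order -> SUM(order_items.price)
order_total = defaultdict(float)
for item in order_items:
    order_total[item["order_id"]] += item["price"]

# Depth 2: aggregate the depth-1 feature up to the customer
# -> MEAN(orders.SUM(order_items.price))
totals_by_customer = defaultdict(list)
for order in orders:
    totals_by_customer[order["customer_id"]].append(order_total[order["order_id"]])

mean_order_total = {cust: mean(vals) for cust, vals in totals_by_customer.items()}
print(mean_order_total)
```

Each extra level of depth reuses the features produced at the level below, which is why the number of generated features grows quickly as the depth increases.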

FeatureTools Implementation 

FeatureTools allows users to input their datasets, create EntitySets, and use the sets for automated feature engineering. For demonstration purposes, the framework includes dummy datasets that allow users to explore its functionality. 

Let's test it. First, we have to install the framework. Run the following command in the terminal.


pip install featuretools

After installation is completed, we can load the library and the relevant data in the following manner.


# import featuretools
import featuretools as ft
# load dummy data
es = ft.demo.load_retail()

Let’s view the data.

 
print(es)
Figure 1: EntitySet: Demo retail data

We can see that the EntitySet contains information about the individual entities and the relationships between them. Using this, we can create features by calling a single function.


feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["count", "sum", "min"],
    trans_primitives=["month"],
    max_depth=5,
)

The code snippet calls the `dfs` function from the library. The function takes the EntitySet, the name of the target dataframe (the table for which features are built), the aggregation and transformation primitives to apply, and the maximum depth to which primitives may be stacked. Let’s take a look at the output.


print(feature_matrix)
Figure 2: Feature matrix

The resulting dataframe has one row per customer and one column per generated feature, ready to be used as input for model training.

Custom Dataset

Ingesting multiple tables and establishing relationships is a vital part of FeatureTools. Let’s see how we can work with a custom dataset. For this exercise, we will use the Home Credit Default Risk dataset.

The dataset consists of 8 key tables, each related to the other via a common field. Our first step is to load these datasets into memory.


import pandas as pd

# Load datasets
app_train = pd.read_csv('data/application_train.csv')
app_test = pd.read_csv('data/application_test.csv')
bureau = pd.read_csv('data/bureau.csv')
bureau_balance = pd.read_csv('data/bureau_balance.csv')
cash = pd.read_csv('data/POS_CASH_balance.csv')
credit = pd.read_csv('data/credit_card_balance.csv')
previous = pd.read_csv('data/previous_application.csv')
installments = pd.read_csv('data/installments_payments.csv')

We have separate training and testing datasets for the application data, so we first need to combine them.

 
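import numpy as np

# Combine the train and test files for better processing;
# the test set has no labels, so mark them as missing
app_test['TARGET'] = np.nan

# Join together training and testing
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
app = pd.concat([app_train, app_test], ignore_index = True, sort = True)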
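
Because the test rows carry no label, the combined table can later be split back into its two parts by checking where `TARGET` is missing. The sketch below shows this on two tiny made-up frames that stand in for the application tables; it assumes `TARGET` has not been imputed in the meantime.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the application tables
train = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})
test = pd.DataFrame({"SK_ID_CURR": [3]})

# Mark test rows with a missing label, then stack the two tables
test["TARGET"] = np.nan
combined = pd.concat([train, test], ignore_index=True, sort=True)

# Split back: rows with a label are training rows, the rest are test rows
is_test = combined["TARGET"].isna()
train_again = combined[~is_test]
test_again = combined[is_test].drop(columns=["TARGET"])
print(len(train_again), len(test_again))
```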

Next, we need to perform some data analysis. Several processing techniques could be applied to the dataset, such as asserting data types or data imputation, but they are beyond the scope of this article. We will only check the dataset for NaN values.

 
# check NULL values in the datasets
print("app: ")
print(app.isnull().sum())
print("bureau: ")
print(bureau.isnull().sum())
print("bureau_balance: ")
print(bureau_balance.isnull().sum())
Figure 3: Columns with NaN values

There are several columns that contain NaN values. For this demonstration, we will fill them with zeros.

 
# fill all NaN values with zero so they do not interfere with the processing;
# TARGET is left untouched so the unlabeled test rows stay identifiable
feature_cols = app.columns.drop('TARGET')
app[feature_cols] = app[feature_cols].fillna(0)
bureau.fillna(0, inplace=True)
bureau_balance.fillna(0, inplace=True)
cash.fillna(0, inplace=True)
credit.fillna(0, inplace=True)
previous.fillna(0, inplace=True)
installments.fillna(0, inplace=True)
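
Filling with zeros is a shortcut, and it can shift aggregate features such as the mean. The following sketch, using made-up values, compares the mean of a column before and after zero-filling and shows one common alternative: keeping an indicator column that records where values were missing. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
df = pd.DataFrame({"AMT": [100.0, np.nan, 300.0]})

# pandas skips NaN by default when computing the mean
mean_ignoring_nan = df["AMT"].mean()

# Remember which rows were missing, then fill them with zero
df["AMT_was_missing"] = df["AMT"].isna()
df["AMT"] = df["AMT"].fillna(0)
mean_after_zero_fill = df["AMT"].mean()

print(mean_ignoring_nan, mean_after_zero_fill)
```

The zero-filled mean is lower because the imputed row is now counted as a real observation of zero. Whether that is acceptable depends on what a missing value means in each column.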

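Next, we drop the `SK_ID_CURR` column from the tables that will be linked to `app` through `previous`, so that FeatureTools does not build redundant features from it.


# drop the redundant client id from tables that are linked via SK_ID_PREV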
installments = installments.drop(columns = ['SK_ID_CURR'])
credit = credit.drop(columns = ['SK_ID_CURR'])
cash = cash.drop(columns = ['SK_ID_CURR'])

Finally, we create the EntitySet using the data loaded from the CSV files.


# Empty entity set with id 'clients'
es = ft.EntitySet(id = 'clients')

# Entities with a unique index
es = es.add_dataframe(dataframe_name = 'app', dataframe = app,
    index = 'SK_ID_CURR')
es = es.add_dataframe(dataframe_name = 'bureau', dataframe = bureau,
    index = 'SK_ID_BUREAU')
es = es.add_dataframe(dataframe_name = 'previous', dataframe = previous,
    index = 'SK_ID_PREV')
# Entities that do not have a unique index
es = es.add_dataframe(dataframe_name = 'bureau_balance', dataframe = bureau_balance,
    make_index = True, index = 'bureaubalance_index')
es = es.add_dataframe(dataframe_name = 'cash', dataframe = cash,
    make_index = True, index = 'cash_index')
es = es.add_dataframe(dataframe_name = 'installments', dataframe = installments,
    make_index = True, index = 'installments_index')
es = es.add_dataframe(dataframe_name = 'credit', dataframe = credit,
    make_index = True, index = 'credit_index')

Note that, to be part of an EntitySet, every table must have a column that serves as a unique identifier. Our `app`, `bureau`, and `previous` dataframes already have such a column; for the rest, we set the `make_index` flag to True so that FeatureTools creates an identifier itself.
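
Before choosing between an existing column and `make_index`, it can help to check whether a candidate column really is a unique identifier. The helper below is a hypothetical sketch using pandas, with made-up tables; it is not part of FeatureTools.

```python
import pandas as pd

def can_be_index(df: pd.DataFrame, column: str) -> bool:
    """Return True if `column` has no missing values and no duplicates."""
    return bool(df[column].notna().all() and df[column].is_unique)

# Hypothetical tables: the parent has one row per id, the child repeats ids
parent = pd.DataFrame({"SK_ID_PREV": [10, 11, 12]})
child = pd.DataFrame({"SK_ID_PREV": [10, 10, 11], "AMT": [1.0, 2.0, 3.0]})

parent_ok = can_be_index(parent, "SK_ID_PREV")
child_ok = can_be_index(child, "SK_ID_PREV")
print(parent_ok, child_ok)
```

A table for which the check fails on every column is a candidate for `make_index = True`.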


# view the set
print(es)
Figure 4: EntitySet

All our data is now loaded into a single EntitySet, but no relationships have been established yet. FeatureTools needs to know how the tables relate to each other. All the tables in our dataset are linked to one another via shared key columns, so we now declare these relationships in the EntitySet.

 
# Each call takes: parent dataframe, parent column, child dataframe, child column
# app -> bureau -> bureau_balance
es = es.add_relationship('app', 'SK_ID_CURR', 'bureau', 'SK_ID_CURR')
es = es.add_relationship('bureau', 'SK_ID_BUREAU', 'bureau_balance', 'SK_ID_BUREAU')
# app -> previous -> cash, installments, credit
es = es.add_relationship('app', 'SK_ID_CURR', 'previous', 'SK_ID_CURR')
es = es.add_relationship('previous', 'SK_ID_PREV', 'cash', 'SK_ID_PREV')
es = es.add_relationship('previous', 'SK_ID_PREV', 'installments', 'SK_ID_PREV')
es = es.add_relationship('previous', 'SK_ID_PREV', 'credit', 'SK_ID_PREV')

print(es)
Figure 5: EntitySet after creating relationships
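
A relationship is only meaningful if the key values in the child table actually occur in the parent table. The sketch below is a hypothetical pandas helper, not a FeatureTools function, that lists child keys with no matching parent row. The tables are made up for illustration.

```python
import pandas as pd

def orphan_keys(parent: pd.DataFrame, parent_col: str,
                child: pd.DataFrame, child_col: str) -> list:
    """Return the child key values that have no matching row in the parent."""
    missing = ~child[child_col].isin(parent[parent_col])
    return sorted(child.loc[missing, child_col].unique().tolist())

# Hypothetical tables: client 9 appears in the child but not in the parent
parent = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
child = pd.DataFrame({"SK_ID_BUREAU": [100, 101, 102],
                      "SK_ID_CURR": [1, 3, 9]})

orphans = orphan_keys(parent, "SK_ID_CURR", child, "SK_ID_CURR")
print(orphans)
```

Running such a check on each parent-child pair before calling DFS makes it easier to spot rows that would not contribute to any parent's aggregates.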

The EntitySet contains all relevant entities and relationships. The set is ready for creating features using deep feature synthesis. We can use the same code as our previous example.


# Default primitives from featuretools
agg_primitives = ["sum", "std", "max", "skew", "min", "mean", "count",
                  "percent_true", "num_unique", "mode"]
trans_primitives = ["day", "year", "month", "weekday", "haversine",
                    "num_words", "num_characters"]

# DFS with specified primitives
feature_matrix, feature_defs = ft.dfs(
    entityset = es,
    target_dataframe_name = 'app',
    trans_primitives = trans_primitives,
    agg_primitives = agg_primitives,
    max_depth = 4, n_jobs = -1, verbose = 1)

# view first 10 features
print(feature_defs[:10])
Figure 6: Engineered Features

And just like that, with a single function call, we generated a large set of features that can be used for model training. How long the call takes depends on the size of the tables and on `max_depth`. The extracted features can then be stored in a feature store.
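
Automated generation tends to produce some columns that carry no information, for example features that take the same value for every row. The sketch below shows one simple way to prune them with pandas. The helper and the demo frame are hypothetical; FeatureTools also ships its own selection utilities in the `featuretools.selection` module.

```python
import pandas as pd

def drop_constant_columns(matrix: pd.DataFrame) -> pd.DataFrame:
    """Keep only the columns that take more than one distinct value."""
    keep = [c for c in matrix.columns if matrix[c].nunique(dropna=False) > 1]
    return matrix[keep]

# Hypothetical feature matrix with one informative and one constant column
demo = pd.DataFrame({
    "COUNT(bureau)": [2, 0, 5],
    "MODE(bureau.FLAG)": ["N", "N", "N"],  # same value everywhere
})

pruned = drop_constant_columns(demo)
print(list(pruned.columns))
```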

Storing Features

Feature stores are a convenient way of storing calculated features. Whether using automated methods or manual calculations, the feature vectors need to be stored in a secure location for later use.

Hopsworks For Feature Storage

We can store the features we generated earlier in the Hopsworks feature store in a few lines of code. First, we install the library.


pip install hsfs

Once the installation is successful, we can connect to the feature store, create a feature group, and save our dataframe.


import hsfs

# create connection to HSFS
connection = hsfs.connection()
# load the default feature store
fs = connection.get_feature_store()

# the feature matrix is indexed by SK_ID_CURR (one row per application),
# so turn the index into a regular column to use it as the primary key
features_df = feature_matrix.reset_index()
# note: generated column names such as "SUM(bureau.AMT_ANNUITY)" may need to be
# renamed to satisfy the feature store's naming rules before inserting

# initialize the feature group
fg = fs.create_feature_group("home_credit_features",
    version=1,
    description="Features created for the Home Credit Default Risk data using FeatureTools",
    primary_key=['SK_ID_CURR'],
    online_enabled=True)

# save our created features to the feature group
fg.insert(features_df)
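
FeatureTools names its features after the primitives and columns involved, which produces names containing parentheses and dots. Feature stores typically expect plain lowercase names, so a renaming step is often needed before inserting. The helpers below are a hypothetical sketch in plain Python; the exact naming rules are an assumption and should be checked against the feature store's documentation.

```python
import re

def to_feature_name(name: str) -> str:
    """Lowercase a name and replace runs of non-alphanumeric characters with '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()

def sanitize_columns(columns):
    """Map original names to cleaned names, failing loudly on collisions."""
    mapping = {c: to_feature_name(c) for c in columns}
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("two columns map to the same cleaned name")
    return mapping

# Example names in the style FeatureTools generates
mapping = sanitize_columns([
    "SK_ID_CURR",
    "SUM(bureau.AMT_ANNUITY)",
    "MEAN(previous.SUM(cash.MONTHS_BALANCE))",
])
print(mapping)
```

The mapping can be applied with `DataFrame.rename(columns=mapping)` on the feature matrix after its index has been reset. The primary key passed to the feature group would then need to use the cleaned name as well.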

Summary

Engineering features by hand is a tiresome process, which is why many engineers opt for automated feature engineering. Frameworks like FeatureTools generate new features with a few lines of code. The framework organizes datasets into EntitySets and uses the Deep Feature Synthesis technique for feature generation. DFS works across related data tables and applies primitives such as mean, sum, and count to create new features.

The engineered features are finally stored in a feature store, a centralized storage for features from various business domains. It serves as an access point for ML engineers, who use the stored features for model training, and it enables effective collaboration across teams.
