In this blog we look at different Python libraries for Feature Engineering and how they power machine learning use cases. We put Pandas, Polars, and Pandas2 under the magnifying glass and run a comparative performance analysis to see which framework offers the highest performance gains. While Pandas remains one of the most popular frameworks, Polars has proven to be a tough contender, showing performance gains of up to 10x across every operation we compared.
In the world of Data Science, Python is one of the most common terms you will come across. Python is a general-purpose programming language loved and cherished by Data Science professionals. It is easy to use and hosts many libraries and frameworks designed to handle data-related workloads. These libraries provide convenient functions that help with tasks like loading, cleaning, transforming, and aggregating data.
This blog will talk about using Python for Feature Engineering tasks. More specifically, we will discuss libraries, including Pandas2 and Polars, and their role in enabling easy feature engineering for machine learning use cases. Moreover, we will conduct a comparative performance analysis to determine which framework is better suited to which scenario. Before any of that, let’s start by understanding what Feature Engineering is.
Feature engineering is a wide umbrella that encompasses various data processing techniques for generating new data from existing features. Features are the crucial pieces of information fed into a machine learning model; they help it understand patterns within the data. The model learns to connect these features to the output values and make predictions in unseen scenarios. Features that clearly represent the output variable are considered high-quality and help improve the model's performance.
However, in most practical cases, the features in raw datasets are insufficient for the model to learn from. Feature engineering allows data scientists to extract better representations of the information already present in the data by deriving new features. These new features help the model perform better and save engineers the hassle of additional data collection.
Feature engineering uses common mathematical and statistical functions on existing features to generate new values. Some common techniques include:

Date parsing: decomposing a timestamp into granular components such as year, month, day, hour, minute, and second. The individual values allow the model to emphasize each aspect of the date more.

Aggregations: grouping data by one or more columns, counting or summarizing occurrences, and joining the results back onto the original data as new features.
These techniques are difficult to understand without practical demonstration. Let’s try to apply them in Python on a real-world dataset.
There is little benefit to discussing such techniques without viewing their practical applications. For this purpose, we will use the data processing frameworks hosted in Python. We will carry out similar processing in Pandas and Polars, and then in Pandas2 (the latest major release of Pandas). This section will show feature engineering in action and compare the frameworks in terms of performance.
Here are the systems specs used to carry out the data processing:
Processor: AMD Ryzen 5 3600 (6 core 12 Thread)
RAM: 16 GB
Disk: 1 TB SATA SSD
Note: For a fair comparison, all frameworks are tested on the same system, and the processing times presented are from the first run; this is important since some frameworks can cache commands for better performance in subsequent runs.
We will be using the notebook cell magic `%%time` to evaluate the execution time of the commands; the time comparisons are discussed in the Performance Review section. You can access the Colab notebook here.
First, we need to install the appropriate version of Pandas. Run the following command in a notebook cell (or drop the leading `!` to run it in the terminal).
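A minimal sketch, assuming 1.5.3 as the last stable release before the 2.0 line:

```python
# Pin Pandas to the last pre-2.0 stable release (version is an assumption)
!pip install pandas==1.5.3
```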
Specifying the version number is important since the latest stable version is 2.0.3 at the time of writing. To test the conventional framework, we must use the older stable release.
Once installed, we can read our test dataset. The dataset we are using is the testing data from the TalkingData AdTracking Fraud Detection Challenge 2018. The dataset is 843 MB on disk. Let’s load it into memory.
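The load step might look like the following; the file name is an assumption, so point it at wherever you stored the dataset:

```python
%%time
import pandas as pd

# Load the TalkingData test set into memory
df = pd.read_csv("test.csv")
```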
Let’s see the shape of this dataset and take a look at the data present inside.
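For example:

```python
# Shape of the dataset and a quick preview of the first rows
print(df.shape)
df.head()
```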
Our dataset contains roughly 18 million rows!
It is also helpful to get some insights about the data. Luckily Pandas has a function for that.
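The `info()` method reports the columns, their dtypes, and an estimate of memory usage:

```python
# Summary of columns, dtypes, and approximate memory consumption
df.info()
```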
Looks like our dataset is consuming a little over 1 gigabyte of memory. We also see that it contains 7 columns, all of which are inferred as int64 except for `click_time`, which is loaded as an object. Our first step will be to convert this into a datetime type.
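A one-line conversion does the trick:

```python
%%time
# Convert the string column into a proper datetime64 dtype
df["click_time"] = pd.to_datetime(df["click_time"])
```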
Data types are all set now. Let’s move on to performing date parsing. We will extract Year, Month, Day, Hour, Minute, and Second from the date column.
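Using the `.dt` accessor, the extraction looks like this:

```python
%%time
# Extract granular date/time components as new feature columns
df["year"]   = df["click_time"].dt.year
df["month"]  = df["click_time"].dt.month
df["day"]    = df["click_time"].dt.day
df["hour"]   = df["click_time"].dt.hour
df["minute"] = df["click_time"].dt.minute
df["second"] = df["click_time"].dt.second
```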
Now, it's time for some complex calculations. For further feature engineering, we will calculate the number of clicks an app gets on an advertisement per second, minute, and hour. This requires applying a `groupby` clause across the time features and counting the number of occurrences against the app id. We will then append the values back to the dataframe using a join operation, and monitor the performance of all these actions combined to get a realistic view of how Pandas performs.
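A sketch of the combined groupby-and-join step; the names of the new feature columns are our own:

```python
%%time
# Count clicks per app at hourly, minute, and second granularity,
# then join the counts back onto the dataframe as new features
time_keys = {
    "hour":   ["app", "day", "hour"],
    "minute": ["app", "day", "hour", "minute"],
    "second": ["app", "day", "hour", "minute", "second"],
}
for unit, keys in time_keys.items():
    counts = df.groupby(keys).size().rename(f"clicks_per_{unit}")
    df = df.join(counts, on=keys)
```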
This operation concludes our feature engineering process. Let’s look at the final dataframe.
We will now perform these same operations in Polars.
Polars is a new contender to Pandas. It is implemented in Rust and offers impressive parallel processing and caching capabilities, which give it blazing-fast performance. Now we will see how Polars compares to Pandas in the type of operations we performed above. The first step is to install the framework.
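In a notebook, the install is a one-liner:

```python
# Install Polars (drop the leading ! if running in a terminal)
!pip install polars
```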
Next, we will carry out the same procedures as in Pandas. Luckily, many of the functions in Polars follow the same nomenclature as Pandas, so the learning curve is not too steep for new users.
First, we read the data file.
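Assuming the same file as before:

```python
%%time
import polars as pl

# Read the same CSV with Polars
df_pl = pl.read_csv("test.csv")
```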
Polars does not have an equivalent of Pandas' `info()` function, so we will use the `estimated_size` function to get the dataset's size in memory.
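For example:

```python
# Estimated in-memory size of the dataframe, in megabytes
df_pl.estimated_size("mb")
```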
The dataset, loaded in Polars, occupies a total of 1343.99 MB of space. Let's preview the dataset.
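A quick preview:

```python
# Preview the first rows along with each column's dtype
df_pl.head()
```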
Polars displays the dataframe in a similar fashion to Pandas, and it also displays the data type for each column. The `click_time` column is interpreted here as a string, so we will have to change that.
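A sketch of the conversion; the format string is an assumption based on how the timestamps appear in the file:

```python
%%time
# Parse the string column into a Polars Datetime
df_pl = df_pl.with_columns(
    pl.col("click_time").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S")
)
```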
Now we will parse the date field into granular components.
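The expression API makes this a single `with_columns` call:

```python
%%time
# Extract granular date/time components as new columns
df_pl = df_pl.with_columns([
    pl.col("click_time").dt.year().alias("year"),
    pl.col("click_time").dt.month().alias("month"),
    pl.col("click_time").dt.day().alias("day"),
    pl.col("click_time").dt.hour().alias("hour"),
    pl.col("click_time").dt.minute().alias("minute"),
    pl.col("click_time").dt.second().alias("second"),
])
```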
The last remaining step is to perform the aggregations that count total clicks on a per-second, per-minute, and per-hour basis.
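A sketch mirroring the Pandas version (on Polars releases older than 0.19, use `groupby` and `pl.count()` instead):

```python
%%time
# Count clicks per app at each granularity and join the counts back
time_keys = {
    "hour":   ["app", "day", "hour"],
    "minute": ["app", "day", "hour", "minute"],
    "second": ["app", "day", "hour", "minute", "second"],
}
for unit, keys in time_keys.items():
    counts = df_pl.group_by(keys).agg(pl.len().alias(f"clicks_per_{unit}"))
    df_pl = df_pl.join(counts, on=keys)
```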
That concludes our feature engineering code with Polars. The time comparisons have some very interesting results but we will discuss those in the Performance Review section. Now we will try out the new iteration of Pandas, i.e., Pandas2.
Pandas2 was released earlier this year with the aim of offering a significant performance gain over its predecessor. The main change comes in the form of Apache Arrow, which Pandas2 uses as a back-end. Arrow is designed to handle large-scale datasets and brings added benefits like parallel processing and lazy evaluation.
Let’s begin by installing Pandas and Pyarrow in a new environment. Note that presently the default stable version for Pandas is greater than 2.0 but we will specify it just to be sure.
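Pinning the version keeps the run reproducible:

```python
# Pin Pandas to a 2.x release and install PyArrow
# (2.0.3 was the latest stable version at the time of writing)
!pip install pandas==2.0.3 pyarrow
```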
After the installation is complete you will notice that not much has changed. The library will be used in the exact same way as its predecessor. The only change will occur when setting the back-end.
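The read call now takes the PyArrow engine and Arrow-backed dtypes:

```python
%%time
import pandas as pd

# Read with the PyArrow engine and Arrow-backed dtypes
df = pd.read_csv("test.csv", engine="pyarrow", dtype_backend="pyarrow")
```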
Note that this time we have specified the engine as ‘pyarrow’ and the data types as Arrow dtypes. The syntax will remain exactly the same for the rest of the processing.
The data types are now in the pyarrow format and it has auto-detected the `click_time` column as a timestamp so that’s one step we can skip in this code. Since all the processing remains the same, the entire code is presented as a single code block below.
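A sketch of the full pipeline under Pandas2, reusing the same steps as before:

```python
%%time
# Date parsing: click_time was already inferred as a timestamp,
# so no explicit type conversion is needed
df["year"]   = df["click_time"].dt.year
df["month"]  = df["click_time"].dt.month
df["day"]    = df["click_time"].dt.day
df["hour"]   = df["click_time"].dt.hour
df["minute"] = df["click_time"].dt.minute
df["second"] = df["click_time"].dt.second

# Aggregations: clicks per app at each time granularity
time_keys = {
    "hour":   ["app", "day", "hour"],
    "minute": ["app", "day", "hour", "minute"],
    "second": ["app", "day", "hour", "minute", "second"],
}
for unit, keys in time_keys.items():
    counts = df.groupby(keys).size().rename(f"clicks_per_{unit}")
    df = df.join(counts, on=keys)
```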
Each framework was evaluated on the time it took to perform the operations. The results are summarized below.
*type-casting was skipped with Pandas2 since the framework auto-inferred the correct data type.
We have some interesting results on our hands. While Polars took a few extra megabytes to store the data, it clearly outshines Pandas, with a performance gain of up to 10x, in every action we performed. Pandas2 does offer a performance gain over its older version, but it seems like a hit-or-miss affair: while it took less than half the time in our date parsing step, it performed worse than both its counterparts in other operations.
The decision to use a specific data framework depends on the data you are working with. We have already seen that Polars is a clear winner against Pandas in terms of performance, largely due to its ability to spread processing across multiple CPU cores. If you are working with a large-scale dataset spanning gigabytes or even terabytes, then Polars is a clear choice, especially now that it is stable enough. However, if you are used to Pandas and work with lightweight datasets, there is no reason to switch, since adjusting to something new takes time.
Pandas2, on the other hand, is a different story. While it does offer some performance benefits over Pandas, the processing times are not consistent. Despite regular updates, Pandas2 is still a work in progress and should not yet be relied on for day-to-day professional work.
The final step of a feature engineering pipeline is storing the features in a scalable, fault-tolerant feature store. The Hopsworks feature store offers a simple API to store and access data features. Let’s store our engineered features in a designated feature store.
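A minimal sketch using the Hopsworks Python API; the feature group name, version, and primary key are assumptions to adapt to your project:

```python
import hopsworks

# Log in and fetch the project's feature store
project = hopsworks.login()
fs = project.get_feature_store()

# Create (or retrieve) a feature group and insert the engineered features
clicks_fg = fs.get_or_create_feature_group(
    name="ad_click_features",       # name is an assumption
    version=1,
    primary_key=["click_id"],       # primary key is an assumption
    description="Engineered click-count features per app",
)
clicks_fg.insert(df)
```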
The features can now be accessed using the same API whenever required and can be versioned over various iterations of the feature engineering pipeline.
A machine learning model’s performance depends on quality training features, which most real-world datasets lack. Feature engineering techniques allow data scientists to extract meaningful information from the available data and append it as additional features. Python includes various libraries that facilitate the feature engineering process. Pandas and Polars are popular frameworks that include various functions to ease data processing, and throughout this blog we have compared Pandas, Polars, and Pandas2. While Pandas and Pandas2 are the go-to solutions for almost every data scientist, Polars is becoming increasingly popular due to its exceptional performance gains over the former. The last step of every successful feature engineering pipeline is storing the features in a suitable feature store such as Hopsworks, which provides fault tolerance and version control for your features and facilitates your machine learning experimentation.