The Journey from Star Schema to Snowflake Schema in the Feature Store

Add more features to your ML model with a Snowflake Schema

July 8, 2024 · 14 min read
Dhananjay Mukhedkar, Software Engineer, Hopsworks
Javier de la Rúa Martínez, Software Engineer, Hopsworks
Mikael Ronström, Head of Data, Hopsworks
Jim Dowling, CEO and Co-Founder, Hopsworks
Davit Bzhalava, Head of Data Science, Hopsworks

TL;DR

Features are the input data that determines the quality of the predictions produced by AI systems. In real-time AI systems, features can be (1) computed on-demand from user-input and (2) retrieved as precomputed features from a feature store using some entity ID(s) provided by the application or user-input. Precomputed features are low-latency and information-rich, as they provide history and context information not available in user-input. In existing feature stores, the more entity IDs your real-time AI system can provide as input, the more precomputed features you can retrieve for different entities. In general, as you add features, your AI system makes marginal gains in predictive power, hopefully translating into competitive advantage. But where do the entity IDs come from? In interactive applications, you often have to ask the user for their booking ID. But what happens if, for usability reasons, you don’t want to ask the user to enter their bank account number and social security number? For one use case, you could build and maintain a denormalized (flat) table in your operational database. But that requires building and operating ETL pipelines, leads to no feature reuse, and doesn’t scale in operational databases. Until now, existing feature stores only supported Star Schema data models, where clients have to provide all the entity ID(s). In this article, we introduce the Snowflake Schema data model for feature stores, and show how it helps you include more features to make better predictions - while keeping normalized tables (and reusing features). We also describe the database capabilities we built to support low latency queries (a pushdown LEFT JOIN in RonDB). There are bonus performance tips about how to optimize your feature store using projection pushdown and predicate pushdown.

This article is part 3 in a 7 part series describing in lay terms concepts and results from a SIGMOD 2024 research paper on the Hopsworks Feature Store.

Other Parts: 1 (Modular/Composable AI Systems) , 2 (The Taxonomy for Data Transformations), 4 (Lakehouse for AI), 5 (From Lakehouse to AI Lakehouse), 6 (Real-Time AI Database), 7 (Reproducible Data).

More Features Make for Better Predictions

Mary is building a real-time credit card fraud prediction model that has a strict 50ms latency SLO (service level objective). Every credit card payment arrives at a server with minimal information - the credit card number, the transaction amount, the longitude and latitude of where the transaction took place, and whether it was online or in the real world at a merchant point of payment. That information is not sufficient to train a high performance machine learning (ML) model to predict whether the incoming transaction is fraudulent or not. She learns that feature stores can provide her with low latency access to precomputed features, which could enable her to build a model rich enough to perform well at fraud detection. The credit card transactions land in a Kafka cluster, and a streaming feature pipeline can be used to compute aggregated usage statistics over windows of time (last 15 mins, hour, day, week). Anomalous patterns of credit card usage can be detected by the model - great. She could also use the location of the current credit card transaction and the location of the previous transaction to compute a feature that is predictive of a geographic attack - for example, if the card is used in Paris, then 1 hour later in New York.
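As an illustration, a minimal batch sketch of the kind of windowed aggregations such a streaming feature pipeline would maintain could look like this (the column names cc_num, ts, and amount are assumptions; a production pipeline would compute these incrementally in a streaming engine):

import pandas as pd

def aggregate_cc_transactions(df: pd.DataFrame, window: str = "15min") -> pd.DataFrame:
    """Per-card transaction count/sum/mean over a trailing time window."""
    df = df.sort_values("ts").set_index("ts")  # ts must be a datetime column
    agg = (
        df.groupby("cc_num")["amount"]
        .rolling(window)                    # e.g. "15min", "1h", "1d", "7d"
        .agg(["count", "sum", "mean"])      # usage statistics over the window
        .add_prefix(f"amount_{window}_")
    )
    return agg.reset_index()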

Star schema
Figure 1: Mary can’t use the cc_issuer_details or account_details features for the credit card transaction, as her feature store only supports a star schema data model. The credit card transaction does not include primary keys for cc_issuer_details or account_details, and it would introduce too much latency to resolve them from the existing operational database, so she doesn’t use those features in her fraud model.

Mary does some offline experiments that show that there are features related to the account holder and the credit card issuer (recently, there has been an increase in single use virtual credit cards) that would further help improve the accuracy of her model. However, she encounters a problem. Existing feature stores only support a star schema data model. That means, when the credit card transaction arrives at the online model, she needs to provide the primary keys for all features that are retrieved from the feature store. That was easy for credit card features (and merchant features, when the merchant identifier is included with the transaction) as she could use the credit card number to retrieve the precomputed features. The transaction did not include, however, an identifier for the card issuer or the account holder. She considered making calls to retrieve these 2 IDs using primary key lookups on two different tables in a Postgres database. However, operations were not happy about opening access to their production database and the ~15ms p95 latency would have blown her SLO budget. So, she just went to production without the card issuer and account holder features. Not her fault.

So, how could Mary have included the card issuer and account holder features? If her feature store supported a Snowflake Schema (nested tables), not just a star schema, she could have retrieved them all.

Snowflake schema
Figure 2: Mary can now use the cc_issuer_details and account_details features as her feature store supports the Snowflake Schema data model. This enables her to read these features in nested tables in a single pushdown LEFT JOIN query.

Snowflake Schema in Hopsworks

In Hopsworks, we call tables of features feature groups. Feature pipelines are programs (in Pandas, Polars, SQL, Spark, Flink, Feldera, Bytewax, Quix, etc) that periodically (or continually for streaming feature pipelines) update the feature data stored in feature groups. Each feature group stores its historical feature data in a table format (Apache Hudi or Delta Lake) on object storage and its low latency feature data for online models in a table in RonDB. So, how do we define a data model in Hopsworks?
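Before getting to the data model, here is roughly how a feature pipeline writes its output to a feature group (a minimal sketch; the feature group name, the primary key column, and the dataframe df produced by the pipeline are assumptions):

import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

# Create (or fetch) the feature group; online_enabled=True keeps the latest
# feature values in RonDB for low-latency online reads.
aggregated_cc_transactions = fs.get_or_create_feature_group(
    name="aggregated_cc_transactions",
    version=1,
    primary_key=["cc_num"],
    online_enabled=True,
)

aggregated_cc_transactions.insert(df)  # df is the dataframe computed by the feature pipeline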

Each feature group in Hopsworks has one or more columns as its primary key (the entity identifier). The foreign keys (shown in figures 1 and 2) are just columns in a feature group (there are no foreign key constraints). When you want to create a model and select features from more than one feature group, you have to use our feature selection API. For the Star Schema data model from figure 1, the code looks like this:

selected_features = (credit_card_transactions.select_all()
    .join(aggregated_cc_transactions.select_all())
    .join(account_details.select_all())
    .join(merchant_details.select_all())
    .join(cc_issuer_details.select_all()))

Here, for every child of the credit_card_transactions table in the Star Schema, we chain a join method call. All tables are children of the parent credit_card_transactions table. We pass the selected features to a feature view in Hopsworks, which is later used to create training data and to retrieve feature data for inference, ensuring there is no skew between training and inference.
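In code, passing the selected features to a feature view could look roughly like this (a minimal sketch; the feature view name, version, and label column are assumptions):

# Training data and online feature vectors are later served through this feature view.
feature_view = fs.create_feature_view(
    name="cc_fraud_model_fv",
    version=1,
    query=selected_features,
    labels=["is_fraud"],  # assumed label column for the fraud model
)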

In online inference, when you want to retrieve features in your online model, you have to provide all 4 foreign key values from the credit_card_transactions feature group to retrieve your precomputed feature values using the feature view.  

feature_vector = feature_view.get_feature_vector({
  "cc_num": "1234 5555 3333 8888",
  "issuer_id": 20440455,
  "merchant_id": 44208484,
  "account_id": 84403331
})

For the Snowflake Schema data model from figure 2, the feature selection code looks different, as we have to build up the subtrees:

nested_selection = (aggregated_cc_transactions.select_all()
    .join(account_details.select_all())
    .join(cc_issuer_details.select_all()))

selected_features = (credit_card_transactions.select_all()
    .join(nested_selection)
    .join(merchant_details.select_all()))

First, we build the subtrees of nested tables using joins, and then we join the subtrees to their parent (the root credit_card_transactions node).

Now we have the benefit that, in online inference, we only need to pass two foreign key values to retrieve the precomputed features:

feature_vector = feature_view.get_feature_vector({
  "cc_num": "1234 5555 3333 8888",
  "merchant_id": 44208484
})

Internally, Hopsworks will perform two parallel queries - a primary key lookup on the merchant_details table in RonDB, and a pushdown LEFT JOIN to retrieve the other features from the aggregated_cc_transactions, cc_issuer_details, and account_details tables.

Pushdown Left Join Performance Analysis

RonDB implements a pushdown LEFT JOIN to reduce the latency in retrieving nested features from a Snowflake schema data model. Figure 3 shows how we implement pushdown LEFT JOINS in our online feature store database, RonDB.

How we implemented pushdown LEFT JOINS in our online feature store database, RonDB.
Figure 3: In our SIGMOD paper, we show how we implement the pushdown LEFT JOIN, where the nested foreign keys are iteratively resolved in the database nodes, instead of in the API client (which would lead to higher latency). Star schema features are resolved with parallel primary key lookups to the tables, while nested features are resolved with a pushdown LEFT JOIN.

The Hopsworks feature store client issues parallel queries for all features from different tables (feature groups) in RonDB. For child tables (star schema), a primary key lookup is used to retrieve features from RonDB, while nested features are retrieved by issuing a pushdown LEFT JOIN to RonDB. For wide tables with many features, where only a small number of features is required, RonDB supports projection pushdown. For example, if you have a feature group (table) with hundreds of features, but only need a few of them for the current model, RonDB will only read and return the data for those features. Many databases or key-value stores used as feature stores, most notably Redis, do not support projection pushdown. Without projection pushdown, the entire row is returned and the filtering is performed in the client. For rows of 10s or 100s of KB, this could mean 100s of times more data is transferred than needed, negatively impacting latency and throughput (and potentially also cost).
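In Hopsworks, you can benefit from projection pushdown by selecting only the features your model needs when defining the feature view, instead of calling select_all(). A minimal sketch (the feature names below are assumptions):

# Only these columns are read and returned by RonDB at serving time,
# even if the underlying feature groups are much wider.
selected_features = (credit_card_transactions.select(["cc_num", "amount", "longitude", "latitude"])
    .join(aggregated_cc_transactions.select(["trans_count_1h", "avg_amount_1d"]))
    .join(merchant_details.select(["merchant_category", "merchant_risk_score"])))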

In Figure 4 below, we also see the performance improvement of pushing down the LEFT JOIN to the database.

Pushdown LEFT JOINs reduce latency and improve throughput for feature stores
Figure 4: Pushdown LEFT JOINs reduce latency and improve throughput for feature stores.

We didn’t include any performance numbers in the paper for projection pushdown or predicate pushdown (both supported by RonDB), as these are not novel concepts in the database world - research papers favor novelty over known concepts. However, projection pushdown and predicate pushdown are must-have functionality for your feature store if you care about performance at scale.

Conclusions

This article explores the limitations of existing feature stores that only support Star Schema data models and introduces the Snowflake Schema data model, now supported by the Hopsworks Feature Store. The Snowflake Schema enhances online models by incorporating additional features through nested tables. We also showed how the pushdown LEFT JOIN in RonDB reduces latency and improves throughput for nested queries in Snowflake Schema data models.
