Hopsworks compared to Sagemaker

Hopsworks Feature Store's capabilities and strengths compared to Sagemaker

Capabilities

Hopsworks
Version: 3.1

Sagemaker
Version: Jul 10th 2023

What is Hopsworks?

Hopsworks is a feature store that enables organizations to streamline their machine learning workflows. It is among the most feature-rich and versatile feature stores on the market, integrates readily with other ecosystems and a wide range of data sources, and provides easy-to-use Python APIs that give developers great flexibility.

With its extensive capabilities, Hopsworks enables a seamless feature engineering workflow, making it easy for data scientists to generate training datasets from raw data. Hopsworks' commitment to building the best feature store means that it focuses exclusively on providing the best solution for data scientists and ML engineers.

What is Sagemaker?

Sagemaker is an ML platform with a feature store, designed to build, train, and deploy models at scale. Sagemaker can be used to build end-to-end machine learning pipelines using pre-built algorithms and frameworks. However, Sagemaker has limitations to consider. There is a risk of vendor lock-in, as the platform is tied to the Amazon Web Services ecosystem. This can be problematic for businesses that want to use a variety of cloud services or that may want to migrate to a different cloud provider in the future. Sagemaker also offers limited capabilities compared to other feature stores, and its potentially high price may be a barrier for small or medium-sized businesses. That said, Sagemaker is a pragmatic solution for businesses that want to remain exclusively within the AWS ecosystem.

How to choose?

In terms of use cases, Hopsworks stands out for its rich capabilities and versatility, making it ideal for businesses that require low-latency data processing and support for multiple data sources or complex use cases.

Hopsworks' product philosophy, centered on feature, training, and inference pipelines, allows data scientists to define how features are computed and to read data from multiple sources. On the other hand, Sagemaker is a viable option for businesses that already use other AWS services and want to remain within the AWS ecosystem. While both solutions have their strengths and limitations, the choice ultimately depends on the specific needs and resources of the customer.
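The feature/training/inference pipeline pattern mentioned above can be sketched in a few lines of Python. Here a plain dict stands in for the feature store, and all names and values are illustrative, not part of either product's API:

```python
# Conceptual sketch of the feature/training/inference pipeline pattern.
# The dict-backed "feature store" is a stand-in for a real system such as
# Hopsworks or Sagemaker Feature Store.

feature_store = {}  # feature_name -> {entity_id: value}

def feature_pipeline(raw_events):
    """Compute features from raw data and write them to the store."""
    totals = {}
    for user_id, amount in raw_events:
        totals[user_id] = totals.get(user_id, 0.0) + amount
    feature_store["total_spend"] = totals

def training_pipeline():
    """Read features back out of the store to build a training dataset."""
    totals = feature_store["total_spend"]
    # Label: big spenders (threshold chosen arbitrarily for the sketch)
    return [(uid, spend, spend > 100.0) for uid, spend in sorted(totals.items())]

def inference_pipeline(user_id):
    """Look up precomputed features for a single entity at request time."""
    return feature_store["total_spend"].get(user_id, 0.0)

feature_pipeline([("u1", 80.0), ("u1", 40.0), ("u2", 10.0)])
dataset = training_pipeline()
```

The point of the pattern is the decoupling: the feature pipeline can be rerun on a schedule, while training and inference only ever read from the store.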

Feature Store Capabilities

Hopsworks vs Sagemaker

Engineering
Feature Computation Engines

What frameworks/languages are supported to create features?
Sagemaker: Any compute engine compatible with the Sagemaker Python SDK

Feature pipelines computed from multiple Data Sources

Some feature stores ingest only pre-computed data, while others support defining feature pipelines.
Sagemaker: Yes, using any data source supported by the Sagemaker Python SDK (Spark, Pandas, or PutRecord)

Creating Training Data and Batch Inference Data

How is feature data returned in batches for training or batch inference?
Hopsworks: Python/Spark job that returns training data or batch inference data as either a DataFrame or files (Parquet, TFRecord, CSV)
Sagemaker: Batch job using Python or Spark, returning either a CSV file or a Pandas DataFrame

On-Demand Features

Is there support for computing features on data only available from clients at request-time?
Hopsworks: Python UDFs
Sagemaker: N/A
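To illustrate the kind of logic an on-demand feature captures — a value computed from data only available at request time — here is a plain Python function of the sort that could be registered as a UDF. The feature (distance from a request-time location) and its name are our own example, not part of either product's API:

```python
import math

# Illustrative on-demand feature: great-circle distance between two points,
# where one pair of coordinates arrives only with the request.
def haversine_km(lat1, lon1, lat2, lon2):
    """Distance in kilometres between two (lat, lon) points, haversine formula."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

Because the input is unknown until the request arrives, this value cannot be precomputed and stored; it must be evaluated inside the online inference pipeline.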

Data types

What (Python) language-level data types are supported?
Hopsworks: Most Spark and Pandas datatypes (including timestamps and arrays)
Sagemaker: String, FP32, Int, EventTimestamp

Datatype for entity/primary keys

What (Python) language-level data types are supported by the feature store for defining primary keys for entities?
Unknown

Versioning

Does the platform provide support for versioning of features or Feature Tables/Groups?
None. Semantic versioning using names

Data Validation

Is there support for validating data in feature pipelines before the features are written to the feature store?
N/A

Feature Testing and CI/CD

Best practices for testing and CI/CD for feature development in machine learning.
Hopsworks: Supports industry-standard DevOps processes, with Git, PyTest, and CI/CD services (Jenkins, GitHub Actions, etc.)
Sagemaker: N/A
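In practice, these best practices amount to treating feature logic as ordinary code that can be unit-tested in CI. A minimal PyTest-style sketch (the feature function and its thresholds are hypothetical):

```python
# A feature function and a unit test for it, as would run in CI.
# PyTest discovers and runs functions named test_*; here we also call it
# directly so the sketch is self-contained.

def amount_zscore(amount, mean, std):
    """Standardize a transaction amount; guard against zero variance."""
    if std == 0:
        return 0.0
    return (amount - mean) / std

def test_amount_zscore():
    assert amount_zscore(15.0, 10.0, 5.0) == 1.0
    assert amount_zscore(10.0, 10.0, 0.0) == 0.0  # degenerate case

test_amount_zscore()
```

Because the feature logic is a plain function, the same test runs locally, in a Git pre-commit hook, or in a Jenkins/GitHub Actions pipeline.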

Retrieving Feature Vectors from Online Store

What APIs are supported for reading a row of feature values from the online feature store?
Python API

Operations

Pipeline Orchestration

How are the feature/training/inference pipelines that use the feature store scheduled to run? What orchestration engines are supported?
Hopsworks: Any Python or Spark orchestration tool (Airflow, Dagster, AWS Lambda, etc.)
Sagemaker: Any Python orchestration tool (Airflow, Dagster, AWS Lambda, etc.)

Offline Feature Store

What data warehouse / lakehouse / object store is supported for storing offline feature data?
Hopsworks: Hudi on HopsFS/S3, or external tables (Snowflake, S3, GCS, JDBC, etc.)
Sagemaker: AWS Glue or Iceberg on S3

Platform Support

What platforms is the feature store available on?
Hopsworks: AWS, Azure, GCP, on-prem
Sagemaker: AWS

Online Feature Store

What operational database is supported for storing online features?
Hopsworks: RonDB
Sagemaker: DynamoDB

Batch Ingestion

How are features written to the offline feature store?
Spark Job or Python Job to ingest features from a Data Source

Streaming Ingestion

Does the platform support computing features in a streaming application?
Sagemaker: PutRecord API (writes to DynamoDB)

Join Engine

A join engine can help achieve point-in-time correctness for training data.
N/A
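A join engine builds training rows by pairing each label event with the most recent feature value at or before that label's timestamp (an "as-of" join), so no information from the future leaks into the training data. A minimal pure-Python sketch of the idea, with made-up timestamps and values:

```python
import bisect

# Feature history for one entity: (timestamp, value), sorted by timestamp.
history = [(1, 10.0), (5, 12.5), (9, 14.0)]
times = [t for t, _ in history]

def as_of(ts):
    """Latest feature value at or before ts (None if ts predates all history)."""
    i = bisect.bisect_right(times, ts) - 1
    return history[i][1] if i >= 0 else None

# Label events: (timestamp, label) -> training rows with point-in-time features.
labels = [(4, 0), (5, 1), (10, 0)]
training_rows = [(ts, as_of(ts), y) for ts, y in labels]
```

Note that the label at timestamp 4 picks up the value written at timestamp 1, not the later value from timestamp 5: that is exactly the leakage a point-in-time join prevents.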

Reuse Features

Does the platform support feature encoding (model-dependent transformations) after the data has been stored in a Feature Table/Group?
N/A

Feature Monitoring

Is there support for identifying (and alerting) when there are anomalous changes in a feature as it is updated over time?
Hopsworks: Feature ingestion monitoring with Great Expectations, and alerting (email or Slack)
Sagemaker: N/A

Backfill Features

Is there any additional support for specifying a job to fill up a feature table/group with feature values from data source(s) that contains historical data?
Hopsworks: Repeated parameterized Python or Spark job
Sagemaker: Batch ingestion job
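A backfill typically re-runs the same parameterized ingestion job once per historical partition. A sketch with a stubbed ingestion step (all names are illustrative; a real job would read raw data for the given day and write computed features to the feature store):

```python
from datetime import date, timedelta

def ingest_partition(day):
    """Stand-in for a Python/Spark job that processes one day of raw data
    and writes the resulting features to the feature store."""
    return f"features_{day.isoformat()}"

def backfill(start, end):
    """Run the ingestion job for every day in [start, end], inclusive."""
    out = []
    d = start
    while d <= end:
        out.append(ingest_partition(d))
        d += timedelta(days=1)
    return out

runs = backfill(date(2023, 1, 1), date(2023, 1, 3))
```

Parameterizing the job by partition means the same code serves both the daily scheduled run and the one-off historical backfill.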

Ranking and Retrieval Architecture Support

If you are using the feature store to build a personalized recommendation or search system, what support is there for vector DB integration?
Hopsworks: Out-of-the-box, with OpenSearch k-NN included; external vector databases can also be integrated
Sagemaker: External vector databases can be integrated

Model Registry & Model Serving Support

Is there support for storing the models in a registry and for running the online inference pipelines in a model serving platform?
Hopsworks: Yes, with KServe for model serving
Sagemaker: SageMaker Inference and Training APIs

Security & Governance

Access Control

What support is there in the platform for authenticating users and then defining policies?
Hopsworks: Platform-level access control, and project membership RBAC inside projects
Sagemaker: AWS KMS permissions

Custom metadata and search

What type of tags can be created - string-based or schematized tags? How is search performed?
Hopsworks: Names, descriptions, keywords, and schematized tags, with free-text search
Sagemaker: Name, description, and key-value tags

Provenance

What support is there for tracking the lineage of features - what raw data are they computed on, what training data or models are they used in?
N/A

If you would like a more detailed comparison and complete review of the above products, feel free to contact us.