Hopsworks compared to Databricks

Hopsworks Feature Store's capabilities and strengths compared to Databricks

Capabilities


Hopsworks

Version:
3.1

Databricks

Version:
12.0

What is Hopsworks?

Hopsworks is a machine learning platform built around a state-of-the-art feature store, one of the most feature-rich and versatile on the market. It integrates with a wide range of ecosystems and data sources, and its easy-to-use Python APIs give developers great flexibility. Because it can draw on many sources, Hopsworks supports a seamless feature engineering workflow, making it easy for data scientists to generate training datasets from raw data. Hopsworks is well suited to businesses that require low-latency data processing and support for multiple data sources.
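The feature engineering workflow described above can be illustrated with a minimal sketch in plain Python (this is not the Hopsworks API): a feature pipeline aggregates raw transaction events into per-entity feature rows, which a feature store would then materialize for training and serving. The entity and feature names here are hypothetical.

```python
# Minimal sketch of what a feature pipeline computes (illustrative only,
# not a Hopsworks API): aggregate raw transaction events into one feature
# row per customer.
from collections import defaultdict

def compute_features(transactions):
    """transactions: iterable of (customer_id, amount) tuples.
       Returns one feature row (dict) per customer."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cid, amount in transactions:
        totals[cid] += amount
        counts[cid] += 1
    return {
        cid: {"txn_count": counts[cid],
              "txn_sum": round(totals[cid], 2),
              "txn_avg": round(totals[cid] / counts[cid], 2)}
        for cid in totals
    }
```

In a feature store, rows like these would be written to a feature group/table keyed on `customer_id`, rather than kept in memory.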

What is Databricks?

Databricks is a unified data analytics platform that lets businesses build data pipelines and collaborate on workflows. While Databricks provides a wide range of capabilities, its feature store is lighter in technical capabilities than most other feature store solutions: it can only ingest pre-computed data and does not support defining feature pipelines. While this can be limiting, Databricks remains highly versatile, making it a good option for businesses that want a broader data analytics platform.

How to Choose?

While Hopsworks provides a state-of-the-art feature store with a multitude of sources, Databricks provides a comprehensive data analytics platform with a lighter feature store. Businesses looking for a solution centered on a feature store, with a high level of integrability and support for multiple data sources, should consider Hopsworks. In contrast, businesses that want a more comprehensive data analytics platform, where a feature store is included but is not the main requirement, should consider Databricks.

Feature Store Capabilities

Hopsworks
Databricks
Engineering

Feature Computation Engines

What frameworks/languages are supported to create features?
Databricks: Spark on Databricks

Feature pipelines computed from multiple Data Sources

Some feature stores ingest only pre-computed data, while others support defining feature pipelines.
Hopsworks: Yes, using any data sources supported by Spark

Creating Training Data and Batch Inference Data

How is feature data returned in batches for training or batch inference?
Hopsworks: Python/Spark job that returns Training Data or Batch Inference Data as either a DataFrame or files (Parquet, TFRecord, CSV)
Databricks: Spark job returns a Spark DataFrame

On-Demand Features

Is there support for computing features on data only available from clients at request-time?
Hopsworks: Python UDFs
Databricks: Python UDFs in MLflow
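To make the concept concrete, here is a plain-Python sketch (not a product API) of an on-demand feature: a value that can only be computed at request time because it depends on data the client supplies, such as its current location. The feature and field names are hypothetical.

```python
# Illustration of an on-demand feature (not a Hopsworks or Databricks API):
# combine precomputed features read from the online store with a value
# that only exists at request time (the client's current position).
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def feature_vector(precomputed, request):
    """precomputed: features from the online store; request: client-side data."""
    fv = dict(precomputed)
    fv["distance_km"] = haversine_km(request["lat"], request["lon"],
                                     precomputed["store_lat"], precomputed["store_lon"])
    return fv
```

In either platform, a UDF like `haversine_km` would be registered so the same logic runs in training pipelines and at serving time.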

Data types

What (Python) language-level data types are supported?
Hopsworks: Most Spark and Pandas datatypes (including timestamps and arrays)
Databricks: Most PySpark data types

Datatype for entity/primary keys

What (Python) language-level data types are supported by the feature store for defining primary keys for entities?
String, Int, Long, Date

Versioning

Does the platform provide support for versioning of features or Feature Tables/Groups?
Databricks: N/A - semantic versioning using names

Data Validation

Is there support for validating data in feature pipelines before the features are written to the feature store?
Databricks: N/A

Feature Testing and CI/CD

Best practices for testing and CI/CD for feature development in machine learning.
Hopsworks: Supports industry-standard DevOps processes, with Git, PyTest, and CI/CD services (Jenkins, GitHub Actions, etc.)
Databricks: The same testing practices as you use for PySpark on Databricks
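Since both columns point to standard testing practices, a unit test for a feature function might look like the following PyTest-style sketch. The `days_since` function is hypothetical, not an API of either platform.

```python
# PyTest-style tests for a hypothetical feature function. PyTest discovers
# and runs functions named test_*; plain asserts double as the checks.
from datetime import date

def days_since(event_date: date, as_of: date) -> int:
    """Feature: whole days elapsed between an event and a reference date."""
    if event_date > as_of:
        raise ValueError("event_date is in the future")
    return (as_of - event_date).days

def test_days_since():
    assert days_since(date(2023, 1, 1), date(2023, 1, 11)) == 10

def test_days_since_rejects_future_events():
    try:
        days_since(date(2024, 1, 1), date(2023, 1, 1))
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Tests like these run locally and in CI (Jenkins, GitHub Actions) before feature pipelines are deployed.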

Retrieving Feature Vectors from Online Store

What APIs are supported for reading a row of feature values from the online feature store?
Hopsworks: Python or REST API

Operations

Pipeline Orchestration

How are the feature/training/inference pipelines that use the feature store scheduled to run? What orchestration engines are supported?
Hopsworks: Any Python or Spark orchestration tool (Airflow, Dagster, AWS Lambda, etc.)
Databricks: Databricks Workflow Orchestrator

Offline Feature Store

What data warehouse / lakehouse / object store is supported for storing offline feature data?
Hopsworks: Hudi on HopsFS/S3 or External Tables (Snowflake, S3, GCS, JDBC, etc.)
Databricks: Delta Lake

Platform Support

What platforms is the feature store available on?
Hopsworks: AWS, Azure, GCP, On-Prem
Databricks: AWS, Azure, GCP

Online Feature Store

What operational database is supported for storing online features?
Hopsworks: RonDB
Databricks: DynamoDB or MySQL (Aurora or RDS)

Batch Ingestion

How are features written to the offline feature store?
Spark DataFrame API

Streaming Ingestion

Does the platform support computing features in a streaming application?
Databricks: N/A

Join Engine

A join engine can help achieve point-in-time correctness for training data.
Spark
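What a join engine does for point-in-time correctness can be sketched in plain Python (an illustration of the technique, not either platform's implementation): for each training label, pick the latest feature value whose timestamp does not exceed the label's event time, so no future data leaks into training.

```python
# Point-in-time (as-of) join sketch: for each label row, select the most
# recent feature value at or before the label's event time, per entity.
from bisect import bisect_right

def point_in_time_join(labels, feature_rows):
    """labels: [(entity_id, event_ts, label)]
       feature_rows: [(entity_id, ts, value)] -- may be unsorted."""
    by_entity = {}
    for eid, ts, val in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        by_entity.setdefault(eid, ([], []))
        by_entity[eid][0].append(ts)   # timestamps, sorted per entity
        by_entity[eid][1].append(val)  # values aligned with timestamps
    out = []
    for eid, ev_ts, label in labels:
        ts_list, vals = by_entity.get(eid, ([], []))
        i = bisect_right(ts_list, ev_ts)  # count of feature rows at/before ev_ts
        out.append((eid, ev_ts, vals[i - 1] if i else None, label))
    return out
```

At scale this is what Spark's as-of/window joins compute across feature tables when assembling training data.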

Reuse Features

Does the platform support feature encoding (model-dependent transformations) after the data has been stored in a Feature Table/Group?
Databricks: N/A

Feature Monitoring

Is there support for identifying (and alerting) when there are anomalous changes in a feature as it is updated over time?
Hopsworks: Feature ingestion monitoring with Great Expectations and alerting (email or Slack)
Databricks: N/A
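The kind of check feature monitoring performs can be illustrated with a minimal sketch (real systems, such as Great Expectations suites, are far richer): flag an incoming batch whose mean shifts beyond a threshold from a reference statistic. The threshold here is an arbitrary example value.

```python
# Toy feature-monitoring check (illustrative only): alert when a batch's
# mean deviates from a reference mean by more than a relative threshold.
from statistics import mean

def mean_shift_alert(reference, batch, rel_threshold=0.25):
    """Return True if the batch mean deviates from the reference mean
       by more than rel_threshold (relative)."""
    ref_mean = mean(reference)
    if ref_mean == 0:
        return mean(batch) != 0
    return abs(mean(batch) - ref_mean) / abs(ref_mean) > rel_threshold
```

A production setup would track many statistics per feature (nulls, ranges, distributions) and route alerts to email or Slack.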

Backfill Features

Is there any additional support for specifying a job to fill up a feature table/group with feature values from data source(s) that contains historical data?
Hopsworks: Repeated parameterized Python or Spark job
Databricks: Batch ingestion Spark job

Ranking and Retrieval Architecture Support

If you are using the feature store to build a personalized recommendation or search system, what support is there for vector DB integration?
Hopsworks: Out-of-the-box, with OpenSearch k-NN included; external vector databases can be integrated
Databricks: External vector databases can be integrated
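The retrieval step a vector database performs can be sketched as brute-force cosine-similarity k-NN (illustrative only; a real system such as OpenSearch k-NN uses an approximate index instead of scanning every item):

```python
# Brute-force k-NN retrieval by cosine similarity, the operation a vector
# database accelerates with an ANN index in a recommendation/search system.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn(query, items, k=2):
    """items: {item_id: embedding}. Returns the k most similar item ids."""
    scored = sorted(items, key=lambda i: cosine(query, items[i]), reverse=True)
    return scored[:k]
```

In a ranking-and-retrieval architecture, the candidates returned by `knn` are then re-ranked by a model fed with feature vectors from the online store.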

Model Registry & Model Serving Support

Is there support for storing the models in a registry and for running the online inference pipelines in a model serving platform?
Hopsworks: Yes, with KServe for model serving
Databricks: Yes, with MLflow

Security & Governance

Access Control

What support is there in the platform for authenticating users and then defining policies?
Hopsworks: Platform-level access control and project-membership RBAC inside projects
Databricks: RBAC for Feature Tables

Custom metadata and search

What type of tags can be created - string-based or schematized tags? How is search performed?
Hopsworks: Names, descriptions, keywords, schematized tags - with free-text search
Databricks: Name, description, and tags

Provenance

What support is there for tracking the lineage of features - what raw data are they computed on, what training data or models are they used in?
Databricks: N/A

If you would like a more detailed comparison and a complete review of the above products, feel free to contact us.