Hopsworks allows you to manage all your data for machine learning on a Feature Store platform that integrates with GCP services such as Google Cloud Storage, BigQuery, Dataproc, and Vertex AI. Hopsworks also integrates with almost all other data platforms, including Databricks and Snowflake.
Why Hopsworks on Google Cloud?
Open Source
Hopsworks is a versatile open-source platform that offers open APIs and seamless integration with a variety of feature computation engines such as Python, Spark, Beam, Flink, and SQL. It can also work with any model training platform and model inference system.
Highest Performance
Hopsworks Feature Store is built on RonDB (our cloud-native distribution of MySQL Cluster), which powers most of the world's network operator databases with 7 nines of availability. It is the feature store with the highest throughput and lowest latency available today.
Python Native
Hopsworks supports feature engineering in Spark, Flink, and SQL, with unique Python APIs that provide high-performance reading and writing of features from any Python environment (including Amazon SageMaker, Kubeflow pipelines, AWS MWAA (Airflow), Dagster, Astronomer, or notebooks).
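For example, a minimal sketch of that Python workflow, where the cluster host, API key, and feature names are all placeholders:

```python
import hopsworks
import pandas as pd

# Connect to the Hopsworks cluster (host, project, and key are placeholders)
project = hopsworks.login(
    host="my-cluster.hopsworks.ai",
    project="my_project",
    api_key_value="MY_API_KEY",
)
fs = project.get_feature_store()

# Write engineered features from any Python environment
df = pd.DataFrame({"customer_id": [1, 2], "avg_spend_30d": [42.0, 17.5]})
fg = fs.get_or_create_feature_group(
    name="customer_spend",
    version=1,
    primary_key=["customer_id"],
    online_enabled=True,
)
fg.insert(df)
```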
Integration with GCS, BigQuery, Vertex AI, and GKE
Hopsworks stores its offline feature groups as Hudi tables in a Google Cloud Storage (GCS) bucket. Hopsworks also supports external feature groups in its offline store with connectors for BigQuery and GCS. This means you can keep your existing feature pipelines that create tables in BigQuery or Parquet files on GCS and simply mount them as external feature groups in Hopsworks.
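As a sketch of mounting such an existing table, assuming a BigQuery storage connector named bq_prod has already been configured for the cluster, and continuing with the fs handle from above:

```python
# Mount an existing BigQuery table as an external feature group.
# The connector name, table, and columns are illustrative placeholders.
bq = fs.get_storage_connector("bq_prod")

orders = fs.create_external_feature_group(
    name="orders_bq",
    version=1,
    query="SELECT customer_id, order_total, order_ts FROM shop.orders",
    storage_connector=bq,
    primary_key=["customer_id"],
    event_time="order_ts",
)
orders.save()  # registers the metadata; the data itself stays in BigQuery
```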
You can also write feature pipelines in Beam/Python/Spark/Flink that read from almost any data source and write to Hopsworks feature groups in Hudi. This is a lower-cost, higher-performance alternative to storing tables of features in a data warehouse. Hudi tables are read from your GCS bucket via a high-performance caching layer, called HopsFS, that provides NVMe read performance for your working set instead of GCS bucket latencies. In Python, we provide lightning-fast access via our FlyingDuck service, which transfers data to Pandas clients in Arrow format (without serialization/deserialization) and, server-side, uses DuckDB to perform push-down filtering and point-in-time correct joins across feature groups.
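To illustrate the read path, here is a sketch of a point-in-time correct join across the two feature groups defined above (names remain placeholders), with the result streamed to Pandas as Arrow data:

```python
# Select and join features across feature groups; Hopsworks performs the
# point-in-time correct join server-side and streams the result to Pandas.
query = fg.select(["avg_spend_30d"]).join(orders.select(["order_total"]))

fv = fs.get_or_create_feature_view(
    name="customer_features",
    version=1,
    query=query,
)
batch_df = fv.get_batch_data()  # returned as a Pandas DataFrame
```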
Hopsworks can also be connected to a GKE cluster, enabling you, from within Hopsworks, to run Python jobs, Jupyter notebooks, and model serving with KServe on the managed Kubernetes cluster.
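A sketch of that serving path via the Hopsworks model registry, where the model name and artifact directory are assumptions:

```python
# Register a trained model and deploy it with KServe on the attached
# GKE cluster. Model name and local artifact path are placeholders.
mr = project.get_model_registry()

model = mr.python.create_model(
    name="churn_model",
    description="Customer churn classifier",
)
model.save("model_dir")  # uploads the artifacts to the registry

deployment = model.deploy(name="churnmodel")  # creates a KServe deployment
deployment.start()
```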
Hopsworks.ai for managed Hopsworks Clusters on GCP
Hopsworks.ai manages your Hopsworks cluster securely inside your GCP project. In your GCP account, you create a service account with an IAM role (see our docs for details), after which the Hopsworks control plane can manage your Hopsworks clusters for you, leaving you free to put your models to work rather than maintain your infrastructure. The Hopsworks.ai control plane supports the following cluster commands using either the UI or a REST API (a scripting sketch follows the list):
cluster create/destroy (Terraform support available)
cluster stop/start
cluster backup/restore
cluster upgrade
online add/remove compute workers (optionally using spot pricing)
online add/remove feature store query nodes
online scale up/down RonDB data nodes
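These commands are also scriptable. As a rough sketch against the REST API, where the endpoint paths and payload fields are illustrative assumptions rather than the documented API:

```python
import requests

BASE = "https://api.hopsworks.ai/api/clusters"   # illustrative base URL
HEADERS = {"x-api-key": "MY_HOPSWORKS_AI_KEY"}   # placeholder API key

# Stop a running cluster (cluster id and endpoint shape are assumptions)
requests.post(f"{BASE}/my-cluster-id/stop", headers=HEADERS, timeout=30)

# Add two compute workers (payload fields are illustrative)
requests.post(
    f"{BASE}/my-cluster-id/workers",
    headers=HEADERS,
    json={"instanceType": "n2-standard-8", "count": 2},
    timeout=30,
)
```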
A Hopsworks cluster running in your cloud account periodically sends heartbeats (containing minimal information about cluster status) to the Hopsworks control plane so that it knows the operational status of your cluster. For easier debugging and monitoring by Hopsworks engineers, you can also optionally export your monitoring metrics and logs to Hopsworks.ai.
Using your managed Hopsworks cluster on GCP
In Hopsworks.ai, you can configure your Hopsworks cluster(s) to support single sign-on (SSO) via Active Directory, LDAP, or OAuth2. This way, members of your organization can simply be given a URL to the cluster, and they will authenticate through your existing SSO mechanism. Hopsworks users (data scientists, ML/data engineers, analysts) work directly with the Hopsworks cluster and do not need access to the Hopsworks.ai control plane. Only administrators, responsible for managing the size and lifecycle of your clusters, require access to Hopsworks.ai.