Hopsworks 3.1 is now generally available. This version includes feature store improvements (time-series splits for training data and support for managing thousands of models), as well as stability and user-interface improvements.
Hopsworks 3.0 introduced new Python APIs to extend its support to moderate-sized data challenges, where Python and Pandas are important technologies for feature pipelines to create features, training pipelines to create models, and batch inference pipelines to produce predictions.
One of the challenges when creating training data for a model is splitting it into train, test, and validation sets. When the data is time-independent (such as a static dataset that does not change over time), a random split into train/test/validation sets is appropriate. However, much enterprise data is time-dependent, such as consumer, sales, or order data that is seasonal and affected by exogenous shocks and slower-moving changes in human behavior. To this end, we introduce API support for time-series splits of training data from feature views. The example below shows how to create train/test Pandas DataFrames split by time range from a feature view containing electricity prices.
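The original code listing is not reproduced here, but the split semantics can be sketched in plain Pandas (the DataFrame and column names below are illustrative; in Hopsworks, the data would come from the feature view, with the train/test time ranges passed to its training-data API):

```python
import pandas as pd

# Illustrative stand-in for the feature view's data: daily electricity
# prices. In Hopsworks this would be read through the feature view.
df = pd.DataFrame({
    "timestamp": pd.date_range("2022-01-01", periods=10, freq="D"),
    "price": [30.0 + i for i in range(10)],
})

# Time-range split: rows before the cutoff train the model,
# later rows are held out for testing. No shuffling, so the test
# set is strictly in the "future" relative to the training set.
train_end = pd.Timestamp("2022-01-07")
train_df = df[df["timestamp"] < train_end]
test_df = df[df["timestamp"] >= train_end]
```

Because the split is by time rather than at random, evaluation on the test set reflects how the model will actually be used: predicting future prices from past data.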
Companies are gaining competitive advantage by personalizing their AI, and training models for individual customers or groups of users. Many of these personalized models share the same set of features for training and inference, but differ in the data that is used to train them. For example, you may have 3 models for customers in the US, EU, and Asia. You define the same set of features for all customers, but when you want to train a model or get a batch of inference data for one region, you only want the features for customers from that one region - e.g., training data for customers in the EU.
With training data filters, you can now easily create training data that includes only the desired group. The example below shows how to retrieve the electricity price features for the region “SE1”, which we can then use to train a model to predict electricity prices for that region. Similar code retrieves training data for models for the regions “SE2”, “SE3”, and “SE4”. When you create batch inference data (get_batch_data) using the same feature view and training dataset identifier, it inherits the same training data filters, making it easier to implement batch inference pipelines at scale for each of your personalized models. Training data filters scale to thousands of training datasets, simplifying the management of training data for personalized models.
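The filtering pattern can be sketched in plain Pandas (column names are hypothetical; in Hopsworks, the filter is stored with the feature view's training dataset, which is how get_batch_data inherits it for inference):

```python
import pandas as pd

# Illustrative feature data shared by all personalized models;
# in Hopsworks this would be read through a feature view.
features = pd.DataFrame({
    "region": ["SE1", "SE2", "SE1", "SE3", "SE4", "SE2"],
    "demand": [120.0, 95.0, 130.0, 88.0, 70.0, 99.0],
    "price":  [31.2, 28.4, 33.0, 27.5, 25.1, 29.0],
})

# One filtered training set per personalized model: same feature
# definitions, different data.
regions = ["SE1", "SE2", "SE3", "SE4"]
training_sets = {r: features[features["region"] == r] for r in regions}

se1_train = training_sets["SE1"]  # rows for the "SE1" model only
```

The key design point is that the filter is defined once, alongside the training dataset, rather than re-implemented in every training and inference pipeline.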
We have deprecated our older example notebooks in favor of new tutorials, now available on GitHub. We also modified the RBAC capabilities of the Data Scientist role: data scientists can no longer create new feature groups or edit existing feature store entities that they did not create. Finally, feature stores can now be shared with "read-only" access rights only.
We also explain why HopsFS is a great choice as a distributed file system (DFS) at a time when a DFS is becoming increasingly indispensable as a central store for training data, logs, model serving, and checkpoints.