From Horse Racing Calculation to Anti-Cheat Lookups
We have been working with Paddy Power, an Irish gambling company, to help them calculate horse racing odds using two models: one for All Weather races and one for Flat races. Paddy Power also uses Hopsworks with batch predictions as an anti-cheating system, making sure that no user “knows more than the betting company”.
The Challenges:
- Data scientists were unable to easily discover and experiment with existing features and pipelines
- Sharing features across models was not possible
- Re-using features between the All Weather model and the Flat racing model was difficult
- The infrastructure depended heavily on a small, dedicated team to maintain it
- The data warehouse did not provide feature statistics or metadata, slowing down the process of feature engineering
- Python, the preferred programming language of most data scientists, is not supported in Redshift
- No centralized storage/sharing of features
- Maintenance issues
- Lacked the ability to collaborate
Why Hopsworks?
- Paddy Power integrated the Hopsworks Feature Store with its existing AWS SageMaker architecture, as a repository of features ready to be used for training models.
- Data scientists and analysts can now browse available features, inspect their metadata, investigate pre-computed statistics, and preview sample feature values.
- Hopsworks also allows better centralization and accessibility of data as well as collaboration between teams.
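To make the idea of browsable features with pre-computed statistics more concrete, here is a minimal sketch in plain Python. It does not use the Hopsworks API: the `FeatureInfo` class, the `describe_feature` helper, and the example feature name and values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class FeatureInfo:
    """Metadata and summary statistics kept next to a feature (hypothetical)."""
    name: str
    dtype: str
    description: str
    count: int
    minimum: float
    maximum: float
    mean: float
    stddev: float


def describe_feature(name, description, values):
    """Compute the statistics once, so later readers can look them up
    instead of rescanning the raw data."""
    return FeatureInfo(
        name=name,
        dtype=type(values[0]).__name__,
        description=description,
        count=len(values),
        minimum=min(values),
        maximum=max(values),
        mean=mean(values),
        stddev=pstdev(values),  # population standard deviation
    )


# Illustrative values only, not real racing data.
registry = {}
info = describe_feature(
    "days_since_last_race",
    "Days between a horse's previous race and the current one",
    [14, 21, 7, 35, 28],
)
registry[info.name] = info

# "Browsing" a feature is then a lookup rather than a warehouse query.
print(registry["days_since_last_race"])
```

The point of the sketch is only that statistics and metadata are computed once and stored with the feature, which is the property the list above attributes to the feature store.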
Results:
Improved Feature Quality
Improved models that generate more revenue
Faster Feature Engineering
Access to statistics and metadata, decreasing the time to generate training datasets
Exploratory Data Analysis
Discover pre-computed features, types of those features, descriptive statistics and the distribution of feature values
Feature Reusability
Previously engineered and quality-assured features become available for reuse, ready for training
Consolidated Feature Engineering Pipelines
Feature engineering code is not duplicated across applications; instead, a single pipeline computes features for both serving and training
Faster Models to Production
Data scientists can concentrate on improving models rather than on the complex infrastructure needed to keep training and serving pipelines in sync
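The consolidated-pipeline result can be illustrated with a small sketch: one feature function that both the training path and the serving path call, so the two cannot drift apart. This is an assumption-laden illustration in plain Python, not Paddy Power's pipeline or the Hopsworks API. The function names, the record layout, and the two features are hypothetical.

```python
from datetime import date


def compute_features(runner, race_date):
    """Single source of truth for the feature logic (hypothetical features)."""
    starts = runner["starts"]
    return {
        "days_since_last_race": (race_date - runner["last_race"]).days,
        "win_rate": runner["wins"] / starts if starts else 0.0,
    }


def build_training_rows(history):
    """Training path: apply the shared function to historical records."""
    return [
        (compute_features(rec["runner"], rec["race_date"]), rec["won"])
        for rec in history
    ]


def build_serving_vector(runner, race_date):
    """Serving path: apply the same function to a live request."""
    return compute_features(runner, race_date)


# Illustrative record only, not real racing data.
runner = {"last_race": date(2021, 3, 1), "wins": 3, "starts": 12}
history = [{"runner": runner, "race_date": date(2021, 3, 15), "won": True}]

training_rows = build_training_rows(history)
serving_vector = build_serving_vector(runner, date(2021, 3, 15))

# Identical inputs give identical features on both paths.
print(training_rows[0][0] == serving_vector)
```

Because both paths go through `compute_features`, a change to a feature definition is made in one place and reaches training and serving together.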