The Hopsworks feature store can be configured to leverage the content of data warehouses to simplify the data science workflow. For data scientists, using data directly from a data warehouse presents three challenges. First, data in the warehouse is frequently updated, making it impossible to reproduce previously generated training data and experiments. Second, data warehouses often lack a historical view of the data, leaving data scientists with the chore of building one themselves. Finally, productionizing a model often requires building additional pipelines to make the same data available in a low-latency database for online serving.
In this talk we will discuss how Hopsworks can be connected to existing cloud-native data warehouses such as Snowflake, Redshift, and BigQuery, and how to use them as a source of data for building historical, reproducible training datasets. We will also show how to leverage the core functionality of Hopsworks, including Python-centric APIs, time travel, statistics, search, and data validation, to build clean, reproducible datasets for training and productionizing machine learning models.
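As a rough sketch of what this looks like in practice, the snippet below uses the hsfs Python library to register a Snowflake table as an on-demand feature group and materialize a versioned training dataset from it. The hostname, project, credentials, connector name, and table are hypothetical placeholders, and exact method names may differ between hsfs versions.

```python
import hsfs

# Connect to the Hopsworks feature store (host, project and API key
# are placeholders for your own deployment)
connection = hsfs.connection(
    host="my-instance.hopsworks.ai",
    project="fraud_demo",
    api_key_value="<API_KEY>",
)
fs = connection.get_feature_store()

# Reuse a storage connector previously configured in Hopsworks
# ("snowflake_connector" is a hypothetical name)
snowflake = fs.get_storage_connector("snowflake_connector")

# Expose a warehouse table as an on-demand (external) feature group,
# so its schema, statistics and lineage are tracked by the feature store
transactions = fs.create_on_demand_feature_group(
    name="transactions",
    version=1,
    query="SELECT * FROM transactions",
    storage_connector=snowflake,
)
transactions.save()

# Materialize a versioned, reproducible training dataset from the
# feature group, decoupled from subsequent updates to the warehouse
td = fs.create_training_dataset(
    name="fraud_training",
    version=1,
    data_format="csv",
)
td.save(transactions.select_all())
```

Because the training dataset is versioned and stored by the feature store, later updates to the underlying warehouse table no longer affect previously generated training data, which addresses the reproducibility challenge described above.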