SIGMOD 2024

The Hopsworks Feature Store for Machine Learning

Javier de la Rúa Martínez, Fabio Buso, Antonios Kouzoupis, Alexandru A. Ormenisan, Salman Niazi, Davit Bzhalava, Kenneth Mak, Victor Jouffrey, Mikael Ronström, Raymond Cunningham, Ralfs Zangis, Dhananjay Mukhedkar, Ayushman Khazanchi, Vladimir Vlassov, Jim Dowling

Data management is the most challenging aspect of building Machine Learning (ML) systems. ML systems can read large volumes of historical data when training models, but inference workloads are more varied, depending on whether the system is a batch or online ML system. The feature store for ML has recently emerged as a single data platform for managing ML data throughout the ML lifecycle, from feature engineering to model training to inference. In this paper, we present the Hopsworks feature store for machine learning as a highly available platform for managing feature data with API support for columnar, row-oriented, and similarity search query workloads. We introduce and address the challenges that feature stores solve: how to reuse features, how to organize data transformations, and how to ensure correct and consistent data across feature engineering, model training, and model inference. We present the engineering challenges in building high-performance query services for a feature store and show how Hopsworks outperforms existing cloud feature stores for training and online inference query workloads.
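
To make the three query workloads concrete, here is a minimal sketch using the hopsworks Python client. The project credentials, feature view and feature group names, and the primary key are illustrative assumptions, and the similarity-search call assumes a feature group with an embedding index and a recent client version; method names can differ between releases.

    import hopsworks

    # Connect to a Hopsworks project and its feature store
    project = hopsworks.login()                  # assumes credentials are already configured
    fs = project.get_feature_store()

    # Columnar (offline) workload: read historical features to train a model
    fv = fs.get_feature_view("transactions_fraud", version=1)        # hypothetical feature view
    X_train, X_test, y_train, y_test = fv.train_test_split(test_size=0.2)

    # Row-oriented (online) workload: low-latency single-entity lookup at inference time
    features = fv.get_feature_vector({"cc_num": 4567890123456789})   # hypothetical primary key

    # Similarity-search workload: nearest neighbours on an embedding-indexed feature group
    # (call name assumed; only available for feature groups created with an embedding index)
    fg = fs.get_feature_group("article_embeddings", version=1)       # hypothetical feature group
    neighbours = fg.find_neighbors([0.12, -0.03, 0.88], k=5)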

Download Paper

All Publications

2023-07-12

Multi-Year Mapping of Water Demand at Crop Level

Giulio Weikmann, Daniele Marinelli, Claudia Paris, Silke Migdall, Eva Gleisberg, Florian Appel, Heike Bach, Jim Dowling, Lorenzo Bruzzone

This article presents a novel system that produces multiyear high-resolution irrigation water demand maps for agricultural areas, enabling a new level of detail in irrigation support for farmers and agricultural stakeholders. The system is based on a scalable distributed deep learning (DL) model trained on dense time series of Sentinel-2 images and a large training set for the first year of observation, and fine-tuned on new labeled data for the subsequent years. The trained models are used to generate multiyear crop type maps, which are assimilated together with the Sentinel-2 dense time series and the meteorological data into a physically based agrohydrological model to derive the irrigation water demand for different crops. To process the required large volume of multiyear Copernicus Sentinel-2 data, the software architecture of the proposed system has been built on the integration of the Food Security Thematic Exploitation Platform (TEP) and the data-intensive artificial intelligence Hopsworks platform. While the Food Security TEP provides easy access to Sentinel-2 data and the possibility of developing processing algorithms directly in the cloud, the Hopsworks platform has been used to train DL algorithms in a distributed manner. The experimental analysis was carried out in the upper part of the Danube Basin for the years 2018, 2019, and 2020, considering 37 Sentinel-2 tiles acquired in Austria, Moravia, Hungary, Slovakia, and Germany.

Download Paper

2023-04-05

ANIARA Project - Automation of Network Edge Infrastructure and Applications with AI

Wolfgang John, Ali Balador, Jalil Taghia, Andreas Johnsson, Johan Sjöberg, Ian Marsh, Jonas Gustafsson, Federico Tonini, Paolo Monti, Pontus Sköldström, Jim Dowling

Emerging use-cases like smart manufacturing and smart cities pose challenges in terms of latency that cannot be satisfied by traditional centralized infrastructure. Edge networks, which bring computational capacity closer to the users/clients, are a promising solution for supporting these critical low-latency services. Unlike traditional centralized networks, the edge is distributed by nature and is usually equipped with limited compute capacity. This creates a complex network to operate, subject to failures of many different kinds, that requires novel solutions to work in practice. To reduce complexity, edge application technology enablers and advanced infrastructure and application orchestration techniques need to be in place, with AI and ML as key players.

Download Paper

2022-04-14

Scalable Artificial Intelligence for Earth Observation Data Using Hopsworks

Desta Haileselassie Hagos, Theofilos Kakantousis, Sina Sheikholeslami, Tianze Wang, Vladimir Vlassov, Amir Hossein Payberah, Moritz Meister, Robin Andersson, Jim Dowling

This paper introduces the Hopsworks platform to the entire Earth Observation (EO) data community and the Copernicus programme. Hopsworks is a scalable data-intensive open-source Artificial Intelligence (AI) platform that was jointly developed by Logical Clocks and the KTH Royal Institute of Technology for building end-to-end Machine Learning (ML)/Deep Learning (DL) pipelines for EO data. It provides the full stack of services needed to manage the entire life cycle of data in ML. In particular, Hopsworks supports the development of horizontally scalable DL applications in notebooks and the operation of workflows to support those applications, including parallel data processing, model training, and model deployment at scale. To the best of our knowledge, this is the first work that demonstrates the services and features of the Hopsworks platform, which provide users with the means to build scalable end-to-end ML/DL pipelines for EO data, as well as support for the discovery and search for EO metadata. This paper serves as a demonstration and walkthrough of the stages of building a production-level model that includes data ingestion, data preparation, feature extraction, model training, model serving, and monitoring. To this end, we provide a practical example that demonstrates the aforementioned stages with real-world EO data and includes source code that implements the functionality of the platform. We also perform an experimental evaluation of two frameworks built on top of Hopsworks, namely Maggy and AutoAblation. We show that using Maggy for hyperparameter tuning results in roughly half the wall-clock time required to execute the same number of hyperparameter tuning trials using Spark while providing linear scalability as more workers are added. Furthermore, we demonstrate how AutoAblation facilitates the definition of ablation studies and enables the asynchronous parallel execution of ablation trials.

Download Paper

2021-12-01

HEAP - Human Exposome Assessment Platform

Roxana Merino Martinez, Heimo Müller, Stefan Negru, Alex Ormenisan, Laila Sara Arroyo Mühr, Xinyue Zhang, Frederik Trier Møller, Mark S Clements, Zisis Kozlakidis, Ville N Pimenoff, Bartlomiej Wilkowski, Martin Boeckhout, Hanna Öhman, Steven Chong, Andreas Holzinger, Matti Lehtinen, Evert-Ben van Veen, Piotr Bała, Martin Widschwendter, Jim Dowling, Juha Törnroos, Michael P Snyder, Joakim Dillner

The Human Exposome Assessment Platform (HEAP) is a research resource for the integrated and efficient management and analysis of human exposome data. The project will provide the complete workflow for obtaining exposome actionable knowledge from population-based cohorts. HEAP is a state-of-the-science service composed of computational resources from partner institutions, accessed through a software framework that provides the world’s fastest Hadoop platform for data warehousing and applied artificial intelligence (AI). The software will provide a decision support system for researchers and policymakers. All the data managed and processed by HEAP, together with the analysis pipelines, will be available for future research. In addition, the platform enables adding new data and analysis pipelines. HEAP’s final product can be deployed in multiple instances to create a network of shareable and reusable knowledge on the impact of exposures on public health.

Download Paper

2021-08-26

ExtremeEarth meets Satellite Data from Space

Desta Haileselassie Hagos, Theofilos Kakantousis, Vladimir Vlassov, Sina Sheikholeslami, Tianze Wang, Jim Dowling, Claudia Paris, Daniele Marinelli, Giulio Weikmann, Lorenzo Bruzzone, Salman Khaleghian, Thomas Kræmer, Torbjørn Eltoft, Andrea Marinoni, Despina-Athanasia Pantazi, George Stamoulis, Dimitris Bilidas, George Papadakis, George Mandilaras, Manolis Koubarakis, Antonis Troumpoukis, Stasinos Konstantopoulos, Markus Muerth, Florian Appel, Andrew Fleming, Andreas Cziferszky

Bringing together a number of cutting-edge technologies that range from storing extremely large volumes of data all the way to developing scalable machine learning and deep learning algorithms in a distributed manner, and having them operate over the same infrastructure, poses unprecedented challenges. One of these challenges is the integration of the European Space Agency's (ESA) Thematic Exploitation Platforms (TEPs) and data and information access service platforms with a data platform, namely Hopsworks, which enables scalable data processing, machine learning, and deep learning on Copernicus data, as well as the development of very large training datasets for deep learning architectures targeting the classification of Sentinel images. In this article, we present the software architecture of ExtremeEarth, which aims at the development of scalable deep learning and geospatial analytics techniques for processing and analyzing petabytes of Copernicus data. The ExtremeEarth software infrastructure seamlessly integrates existing and novel software platforms and tools for storing, accessing, processing, analyzing, and visualizing large amounts of Copernicus data. New techniques in the areas of remote sensing and artificial intelligence, with an emphasis on deep learning, are developed. These techniques and the corresponding software presented in this article are to be integrated with and used in two ESA TEPs, namely the Polar and Food Security TEPs. Furthermore, we present the integration of Hopsworks with the Polar and Food Security use cases and the flow of events for the products offered through the TEPs.

Download Paper

2021-04-16

AutoAblation: Automated Parallel Ablation Studies for Deep Learning

Sina Sheikholeslami, Moritz Meister, Tianze Wang, Amir H Payberah, Vladimir Vlassov, Jim Dowling

Ablation studies provide insights into the relative contribution of different architectural and regularization components to machine learning models' performance. In this paper, we introduce AutoAblation, a new framework for the design and parallel execution of ablation experiments. AutoAblation provides a declarative approach to defining ablation experiments on model architectures and training datasets, and enables the parallel execution of ablation trials. This reduces the execution time and allows more comprehensive experiments by exploiting larger amounts of computational resources. We show that AutoAblation can provide near-linear scalability by performing an ablation study on the modules of the Inception-v3 network trained on the TenGeoPSAR dataset.
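
As a rough illustration of the declarative pattern, rather than AutoAblation's actual API, the self-contained sketch below lists the components to ablate once and then runs one trial per excluded component in parallel; all names and the dummy metric are made up for the example.

    from concurrent.futures import ProcessPoolExecutor

    # Declarative "study": which components of the base model to ablate (illustrative names)
    ABLATION_STUDY = {
        "base_components": ["conv_block_1", "conv_block_2", "dropout", "batch_norm"],
        "ablate": ["conv_block_2", "dropout", "batch_norm"],   # each trial drops one of these
    }

    def train_and_evaluate(excluded_component):
        """Stand-in training function: build the model without one component, return a score."""
        components = [c for c in ABLATION_STUDY["base_components"] if c != excluded_component]
        # ... build and train the reduced model here; a dummy metric keeps the sketch runnable
        return {"excluded": excluded_component, "accuracy": 0.9 - 0.01 * len(components)}

    if __name__ == "__main__":
        trials = [None] + ABLATION_STUDY["ablate"]         # None = full model baseline
        with ProcessPoolExecutor() as pool:                # ablation trials execute in parallel
            results = list(pool.map(train_and_evaluate, trials))
        for r in results:
            print(r)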

Download Paper

2021-02-01

DeepCube: Explainable AI Pipelines for Big Copernicus Data

Ioannis Papoutsis, Alkyoni Baglatzi, Souzana Touloumtzi, Markus Reichstein, Nuno Carvalhais, Fabian Gans, Gustau Camps-Valls, Maria Piles, Theofilos Kakantousis, Jim Dowling, Manolis Koubarakis, Dimitris Bilidas, Despina-Athanasia Pantazi, George Stamoulis, Christophe Demange, Léo-Gad Journel, Marco Bianchi, Chiara Gervasi, Alessio Rucci, Ioannis Tsampoulatidis, Eleni Kamateri, Tarek Habib, Alejandro Dıaz Bolıvar, Zisoula Ntasiou, Anastasios Paschalis

The H2020 DeepCube project leverages advances in the fields of Artificial Intelligence and Semantic Web to unlock the potential of Copernicus Big Data and contribute to the Digital Twin Earth initiative. DeepCube aims to address problems of high socio-environmental impact and enhance our understanding of Earth’s processes correlated with Climate Change. To achieve this, the project employs novel technologies, such as the Earth System Data Cube, the Semantic Cube, the Hopsworks platform for distributed deep learning, and visual analytics tools, integrating them into an open, cloud-interoperable platform. DeepCube will develop Deep Learning architectures that extend to non-conventional data, apply hybrid modeling for data-driven AI models that respect physical laws, and open up the Deep Learning black box with Explainable Artificial Intelligence and Causality.

Download Paper

2020-11-19

Distributed Hierarchical File Systems Strike Back in the Cloud

Mahmoud Ismail, Salman Niazi, Mauritz Sundell, Mikael Ronström, Seif Haridi, Jim Dowling

Cloud service providers have aligned on availability zones as an important unit of failure and replication for storage systems. An availability zone (AZ) has independent power, networking, and cooling systems and consists of one or more data centers. Multiple AZs in close geographic proximity form a region that can support replicated low-latency storage services that can survive the failure of one or more AZs. Recent reductions in inter-AZ latency have made synchronous replication protocols increasingly viable, instead of traditional quorum-based replication protocols. We introduce HopsFS-CL, a distributed hierarchical file system with support for high availability (HA) across AZs, backed by AZ-aware synchronously replicated metadata and AZ-aware block replication. HopsFS-CL is a redesign of HopsFS, a version of HDFS with distributed metadata, and its design involved making replication protocols and block placement protocols AZ-aware at all layers of its stack: the metadata serving, the metadata storage, and block storage layers. In experiments on a real-world workload from Spotify, we show that HopsFS-CL, deployed in HA mode over 3 AZs, reaches 1.66 million ops/s, and has similar performance to HopsFS when deployed in a single AZ, while preserving the same semantics.

Download Paper

2020-11-19

HopsFS-S3: Extending Object Stores with POSIX-like Semantics

Mahmoud Ismail, Salman Niazi, Gautier Berthou, Mikael Ronström, Seif Haridi, Jim Dowling

Object stores have become the de-facto platform for storage in the cloud due to their scalability, high availability, and low cost. However, they provide weaker metadata semantics and lower performance compared to distributed hierarchical file systems. In this paper, we introduce HopsFS-S3, a hybrid distributed hierarchical file system backed by an object store while preserving the file system’s strong consistency semantics. We base our implementation on HopsFS, a next-generation distribution of HDFS with distributed metadata. We redesigned HopsFS’ block storage layer to transparently use an object store to store the file’s blocks without sacrificing the file system’s semantics. We also introduced a new block caching service to leverage faster NVMe storage for hot blocks. In our experiments, we show that HopsFS-S3 outperforms EMRFS for IO-bound workloads, with up to 20% higher performance, and delivers up to 3.4X the aggregated read throughput of EMRFS. Moreover, we demonstrate that metadata operations on HopsFS-S3 (such as directory rename) are up to two orders of magnitude faster than EMRFS. Finally, HopsFS-S3 opens up the currently closed metadata in object stores, enabling correctly-ordered change notifications with HopsFS’ change data capture (CDC) API and customized extensions to metadata.

Download Paper

2020-11-19

Maggy: Scalable Asynchronous Parallel Hyperparameter Search

Moritz Meister, Sina Sheikholeslami, Amir H. Payberah, Vladimir Vlassov, Jim Dowling

Running extensive experiments is essential for building Machine Learning (ML) models. Such experiments usually require iterative execution of many trials with varying run times. In recent years, Apache Spark has become the de-facto standard for parallel data processing in the industry, in which iterative processes are implemented within the bulk-synchronous parallel (BSP) execution model. The BSP approach is also being used to parallelize ML trials in Spark. However, the BSP task synchronization barriers prevent asynchronous execution of trials, which leads to a reduced number of trials that can be run on a given computational budget. In this paper, we introduce Maggy, an open-source framework based on Spark, to execute ML trials asynchronously in parallel, with the ability to early-stop poorly performing trials. In the experiments, we compare Maggy with the BSP execution of parallel trials in Spark and show that, for random hyperparameter search on a convolutional neural network for the Fashion-MNIST dataset, Maggy reduces the time required to execute a fixed number of trials by 33% to 58%, without any loss in the final model accuracy.
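
The contrast with bulk-synchronous execution can be sketched in plain Python: trials free their worker as soon as they finish, and a simple early-stopping rule can terminate a lagging trial without waiting for a synchronization barrier. This illustrates the execution model only, not Maggy's API; the learning curve, stopping rule, and worker counts are all made up.

    import random
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def run_trial(trial_id, learning_rate, finished_scores):
        """Stand-in for training one hyperparameter configuration."""
        score = 0.0
        for epoch in range(10):
            score += random.random() * learning_rate            # fake learning curve
            # After a warm-up, stop early if this trial badly trails the median finished score
            if epoch >= 5 and finished_scores:
                median = sorted(finished_scores)[len(finished_scores) // 2]
                if score < 0.5 * median:
                    return trial_id, round(score, 3), "early-stopped"
        finished_scores.append(score)
        return trial_id, round(score, 3), "completed"

    if __name__ == "__main__":
        finished_scores = []
        configs = [(i, random.uniform(0.01, 0.3)) for i in range(16)]
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = [pool.submit(run_trial, i, lr, finished_scores) for i, lr in configs]
            # No synchronization barrier: each result is consumed as soon as its trial finishes
            for fut in as_completed(futures):
                print(fut.result())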

Download Paper

2020-07-28

Time Travel and Provenance for Machine Learning Pipelines

Alexandru A. Ormenisan, Moritz Meister, Fabio Buso, Robin Andersson, Seif Haridi, Jim Dowling

Machine learning pipelines have become the de facto paradigm for productionizing machine learning applications, as they clearly abstract the processing steps involved in transforming raw data into the engineered features that are then used to train models. In this paper, we use a bottom-up method for capturing provenance information regarding the processing steps and artifacts produced in ML pipelines. Our approach is based on replacing traditional intrusive hooks in application code (to capture ML pipeline events) with standardized change-data-capture support in the systems involved in ML pipelines: the distributed file system, the feature store, the resource manager, and the applications themselves. In particular, we leverage data versioning and time-travel capabilities in our feature store to show how provenance can enable model reproducibility and debugging.
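
The time-travel capability referred to here is of the kind exposed by table formats such as Apache Hudi, which backs offline feature data in Hopsworks. Below is a minimal PySpark sketch of a point-in-time read, assuming a Hudi-backed table path, a made-up timestamp, and a Spark session with the Hudi bundle on its classpath.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("feature-time-travel")
             .getOrCreate())                    # assumes the hudi-spark bundle is on the classpath

    base_path = "hdfs:///projects/demo/fs/transactions_fg"   # hypothetical feature group path

    # Read the feature data exactly as it was at a past instant (Hudi time-travel query)
    df_as_of = (spark.read.format("hudi")
                .option("as.of.instant", "2020-06-01 00:00:00")
                .load(base_path))

    df_as_of.show(5)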

Download Paper

2020-03-02

Towards Distribution Transparency for Supervised ML With Oblivious Training Functions

Moritz Meister, Sina Sheikholeslami, Robin Andersson, Alexandru A. Ormenisan, Jim Dowling

Building and productionizing Machine Learning (ML) models is a process of interdependent steps of iterative code updates, including exploratory model design, hyperparameter tuning, ablation experiments, and model training. Industrial-strength ML involves doing this at scale, using many compute resources, and this requires rewriting the training code to account for distribution. The result is that moving from a single-host program to a cluster hinders iterative development of the software, as it would require multiple versions of the software to be maintained and kept consistent. In this paper, we introduce the distribution-oblivious training function as an abstraction for ML development in Python, whereby developers can reuse the same training function when running a notebook on a laptop or performing scale-out hyperparameter search and distributed training on clusters. Programs written in our framework look like industry-standard ML programs, as we factor out dependencies using best-practice programming idioms (such as functions to generate models and data batches). We believe that our approach takes a step towards unifying single-host and distributed ML development.
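
The shape of such a distribution-oblivious training function can be sketched as follows: it takes its hyperparameters as arguments, builds its model through a factory function, and contains no cluster-specific code, so the same function can be called directly during development or shipped to workers by a launcher. The launcher call in the final comment is hypothetical, not the framework's actual API, and the model and training loop are stand-ins.

    def build_model(hidden_units):
        """Best-practice idiom: a factory that returns a fresh model (stand-in implementation)."""
        return {"hidden_units": hidden_units, "weights": [0.0] * hidden_units}

    def train(lr, hidden_units):
        """Distribution-oblivious training function: no cluster or device code inside."""
        model = build_model(hidden_units)
        loss = 1.0
        for _ in range(100):
            loss *= (1.0 - lr)        # stand-in for an optimization step
        return loss                   # the metric a search would optimize

    # Single-host, iterative development: just call it.
    print(train(lr=0.05, hidden_units=32))

    # Scale-out, conceptually: the same function object is handed to a launcher, e.g.
    #   launcher.run(train, search_space={"lr": [0.01, 0.05], "hidden_units": [32, 64]})
    # (the launcher name and signature above are hypothetical)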

Download Paper

2020-02-24

Implicit Provenance for Machine Learning Artifacts

Alexandru A. Ormenisan, Mahmoud Ismail, Seif Haridi, Jim Dowling

Machine learning (ML) presents new challenges for reproducible software engineering, as the artifacts required for repeatably training models are not just versioned code, but also hyperparameters, code dependencies, and the exact version of the training data. Existing systems for tracking the lineage of ML artifacts, such as TensorFlow Extended or MLflow, are invasive, requiring developers to refactor their code, which is then controlled by the external system. In this paper, we present an alternative approach, which we call implicit provenance, where we instrument a distributed file system and its APIs to capture changes to ML artifacts; combined with file naming conventions, this means that full lineage can be tracked for TensorFlow/Keras/PyTorch programs without requiring code changes. We address challenges related to adding strongly consistent metadata extensions to the distributed file system, while minimizing provenance overhead, and ensuring transparent, eventually consistent replication of extended metadata to an efficient search engine, Elasticsearch. Our provenance framework is integrated into the open-source Hopsworks framework, and is used in production to enable full provenance for end-to-end machine learning pipelines.

Download Paper

2019-07-05

Scalable Block Reporting for HopsFS - Best Student Paper award at IEEE BigData Congress’19

Mahmoud Ismail, August Bonds, Salman Niazi, Seif Haridi, Jim Dowling

Distributed hierarchical file systems typically decouple the storage of the file system's metadata from the data (file system blocks) to enable the scalability of the file system. This decoupling, however, requires the introduction of a periodic synchronization protocol to ensure the consistency of the file system's metadata and its blocks. Apache HDFS and HopsFS implement a protocol, called block reporting, where each data server periodically sends ground truth information about all its file system blocks to the metadata servers, allowing the metadata to be synchronized with the actual state of the data blocks in the file system. The network and processing overhead of the existing block reporting protocol, however, increases with cluster size, ultimately limiting cluster scalability. In this paper, we introduce a new block reporting protocol for HopsFS that reduces the protocol bandwidth and processing overhead by up to three orders of magnitude, compared to HDFS/HopsFS' existing protocol. Our new protocol removes a major bottleneck that prevented HopsFS clusters scaling to tens of thousands of servers.

Download Paper

2019-05-22

ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata

Mahmoud Ismail, Mikael Ronström, Seif Haridi, Jim Dowling

Distributed OLTP databases are now used to manage metadata for distributed file systems, but they cannot also efficiently support complex queries or aggregations. To solve this problem, we introduce ePipe, a databus that both creates a consistent change stream for a distributed, hierarchical file system (HopsFS) and eventually delivers the correctly ordered stream with low latency to downstream clients. ePipe can be used to provide polyglot storage for file system metadata, allowing metadata queries to be handled by the most efficient engine for that query. For file system notifications, we show that ePipe achieves up to 56X throughput improvement over HDFS INotify and Trumpet with up to 3 orders of magnitude lower latency. For Spotify's Hadoop workload, we show that ePipe can replicate all file system changes from HopsFS to Elasticsearch with an average replication lag of only 330 ms.

Download Paper

2019-03-20

Horizontally Scalable ML Pipelines with a Feature Store

Alexandru A. Ormenisan, Mahmoud Ismail, Kim Hammar, Robin Andersson, Ermias Gebremeskel, Theofilos Kakantousis, Antonios Kouzoupis, Fabio Buso, Gautier Berthou, Jim Dowling, Seif Haridi

Machine Learning (ML) pipelines are the fundamental building block for productionizing ML models. However, much introductory material for machine learning and deep learning emphasizes ad-hoc feature engineering and training pipelines to experiment with ML models. Such pipelines have a tendency to become complex over time and do not allow features to be easily re-used across different pipelines. Duplicating features can even lead to correctness problems when features have different implementations for training and serving. In this demo, we introduce the Feature Store as a new data layer in horizontally scalable machine learning pipelines.

Download Paper

2018-12-18

Size Matters: Improving the Performance of Small Files in Hadoop

Salman Niazi, Seif Haridi, Mikael Ronström, Jim Dowling

The Hadoop Distributed File System (HDFS) is designed to handle massive amounts of data, preferably stored in very large files. The poor performance of HDFS in managing small files has long been a bane of the Hadoop community. In many production deployments of HDFS, almost 25% of the files are less than 16 KB in size and as much as 42% of all the file system operations are performed on these small files. We have designed an adaptive tiered storage, using in-memory and on-disk tables stored in a high-performance distributed database, to efficiently store and improve the performance of small files in HDFS. Our solution is completely transparent, and it does not require any changes in the HDFS clients or the applications using the Hadoop platform. In experiments, we observed up to 61 times higher throughput in writing files, and for real-world workloads from Spotify our solution reduces the latency of reading and writing small files by factors of 3.15 and 7.39, respectively.

Download Paper

2017-05-24

Scaling HDFS to more than 1 million operations per second with HopsFS

Salman Niazi, Mahmoud Ismail, Mikael Ronström, Seif Haridi, Jim Dowling

HopsFS is an open-source, next-generation distribution of the Apache Hadoop Distributed File System (HDFS) that replaces the main scalability bottleneck in HDFS, the single-node in-memory metadata service, with a no-shared-state distributed system built on a NewSQL database. By removing the metadata bottleneck in Apache HDFS, HopsFS enables significantly larger cluster sizes, more than an order of magnitude higher throughput, and significantly lower client latencies for large clusters. In this paper, we detail the techniques and optimizations that enable HopsFS to surpass 1 million file system operations per second - at least 16 times higher throughput than HDFS. In particular, we discuss how we exploit recent high-performance features of NewSQL databases, such as application-defined partitioning, partition-pruned index scans, and distribution-aware transactions. Together with more traditional techniques, such as batching and write-ahead caches, we show how many incremental optimizations have enabled a revolution in distributed hierarchical file system performance.

Download Paper

2017-02-07

HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases

Salman Niazi, Mahmoud Ismail, Mikael Ronström, Steffen Grohsschmiedt, Seif Haridi, Jim Dowling

Recent improvements in both the performance and scalability of shared-nothing, transactional, in-memory NewSQL databases have reopened the research question of whether distributed metadata for hierarchical file systems can be managed using commodity databases. In this paper, we introduce HopsFS, a next-generation distribution of the Hadoop Distributed File System (HDFS) that replaces HDFS’ single-node in-memory metadata service with a distributed metadata service built on a NewSQL database. By removing the metadata bottleneck, HopsFS enables an order of magnitude larger and higher-throughput clusters compared to HDFS. Metadata capacity has been increased to at least 37 times HDFS’ capacity, and in experiments based on a workload trace from Spotify, we show that HopsFS supports 16 to 37 times the throughput of Apache HDFS. HopsFS also has lower latency for many concurrent clients, and no downtime during failover. Finally, as metadata is now stored in a commodity database, it can be safely extended and easily exported to external systems for online analysis and free-text search.

Download Paper

2015-06-18

Leader Election Using NewSQL Database Systems

Salman Niazi, Mahmoud Ismail, Gautier Berthou, Jim Dowling

Leader election protocols are a fundamental building block for replicated distributed services. They ease the design of leader-based coordination protocols that tolerate failures. In partially synchronous systems, designing a leader election algorithm that does not permit multiple leaders while the system is unstable is a complex task. As a result, many production systems use third-party distributed coordination services, such as ZooKeeper and Chubby, to provide a reliable leader election service. However, adding a third-party service such as ZooKeeper to a distributed system incurs additional operational costs and complexity. ZooKeeper instances must be kept running on at least three machines to ensure its high availability. In this paper, we present a novel leader election protocol using NewSQL databases for partially synchronous systems that ensures at most one leader at any given time. The leader election protocol uses the database as distributed shared memory. Our work enables distributed systems that already use NewSQL databases to save the operational overhead of managing an additional third-party service for leader election. Our main contribution is the design, implementation, and validation of a practical leader election algorithm, based on NewSQL databases, that has performance comparable to a leader election implementation using a state-of-the-art distributed coordination service, ZooKeeper.
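
To illustrate the "database as distributed shared memory" idea, here is a minimal lease-based sketch in Python; sqlite3 stands in for a NewSQL database purely to keep the example self-contained and runnable, and the schema, lease length, and renewal period are illustrative rather than the paper's actual protocol.

    import sqlite3, time, uuid

    LEASE_SECONDS = 5
    NODE_ID = str(uuid.uuid4())

    # A shared database stands in for the NewSQL store; every node runs this same code
    conn = sqlite3.connect("leader.db", isolation_level=None)   # autocommit; we issue BEGIN ourselves
    conn.execute("""CREATE TABLE IF NOT EXISTS leader (
                        id INTEGER PRIMARY KEY CHECK (id = 0),
                        holder TEXT,
                        expires REAL)""")
    conn.execute("INSERT OR IGNORE INTO leader (id, holder, expires) VALUES (0, NULL, 0)")

    def try_acquire():
        """Take or renew the lease inside a single transaction: at most one holder per lease period."""
        now = time.time()
        conn.execute("BEGIN IMMEDIATE")                          # exclusive write transaction
        holder, expires = conn.execute(
            "SELECT holder, expires FROM leader WHERE id = 0").fetchone()
        if holder == NODE_ID or expires < now:                   # lease is ours, free, or expired
            conn.execute("UPDATE leader SET holder = ?, expires = ? WHERE id = 0",
                         (NODE_ID, now + LEASE_SECONDS))
            conn.execute("COMMIT")
            return True
        conn.execute("COMMIT")
        return False

    for _ in range(5):                                           # in practice this loop runs indefinitely
        print("leader" if try_acquire() else "follower")
        time.sleep(LEASE_SECONDS / 2)                            # renew well before the lease expires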

Download Paper