SIGMOD 2024

The Hopsworks Feature Store for Machine Learning

Javier de la Rúa Martínez, Fabio Buso, Antonios Kouzoupis, Alexandru A. Ormenisan, Salman Niazi, Davit Bzhalava, Kenneth Mak, Victor Jouffrey, Mikael Ronström, Raymond Cunningham, Ralfs Zangis, Dhananjay Mukhedkar, Ayushman Khazanchi, Vladimir Vlassov, Jim Dowling

Data management is the most challenging aspect of building Machine Learning (ML) systems. ML systems can read large volumes of historical data when training models, but inference workloads are more varied, depending on whether the system is a batch or online ML system. The feature store for ML has recently emerged as a single data platform for managing ML data throughout the ML lifecycle, from feature engineering to model training to inference. In this paper, we present the Hopsworks feature store for machine learning as a highly available platform for managing feature data with API support for columnar, row-oriented, and similarity search query workloads. We introduce and address the challenges that feature stores solve: how to reuse features, how to organize data transformations, and how to ensure correct and consistent data across feature engineering, model training, and model inference. We present the engineering challenges in building high-performance query services for a feature store and show how Hopsworks outperforms existing cloud feature stores for training and online inference query workloads.
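
To make the three query workloads concrete, here is a minimal sketch using the hopsworks Python client. The project credentials, feature view and feature group names, and the primary key are illustrative assumptions, and the similarity-search call assumes a feature group with an embedding index and a recent client version; method names can differ between releases.

    import hopsworks

    # Connect to a Hopsworks project and its feature store
    project = hopsworks.login()                  # assumes credentials are already configured
    fs = project.get_feature_store()

    # Columnar (offline) workload: read historical features to train a model
    fv = fs.get_feature_view("transactions_fraud", version=1)        # hypothetical feature view
    X_train, X_test, y_train, y_test = fv.train_test_split(test_size=0.2)

    # Row-oriented (online) workload: low-latency single-entity lookup at inference time
    features = fv.get_feature_vector({"cc_num": 4567890123456789})   # hypothetical primary key

    # Similarity-search workload: nearest neighbours on an embedding-indexed feature group
    # (call name assumed; only available for feature groups created with an embedding index)
    fg = fs.get_feature_group("article_embeddings", version=1)       # hypothetical feature group
    neighbours = fg.find_neighbors([0.12, -0.03, 0.88], k=5)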

Download Paper

All Publications

2023-07-12

Multi-Year Mapping of Water Demand at Crop Level

Giulio Weikmann, Daniele Marinelli, Claudia Paris, Silke Migdall, Eva Gleisberg, Florian Appel, Heike Bach, Jim Dowling, Lorenzo Bruzzone

This article presents a novel system that produces multiyear high-resolution irrigation water demand maps for agricultural areas, enabling a new level of detail in irrigation support for farmers and agricultural stakeholders. The system is based on a scalable distributed deep learning (DL) model trained on dense time series of Sentinel-2 images and a large training set for the first year of observation, and fine-tuned on new labeled data for the subsequent years. The trained models are used to generate multiyear crop type maps, which are assimilated together with the Sentinel-2 dense time series and the meteorological data into a physically based agrohydrological model to derive the irrigation water demand for different crops. To process the required large volume of multiyear Copernicus Sentinel-2 data, the software architecture of the proposed system has been built on the integration of the Food Security Thematic Exploitation Platform (TEP) and the data-intensive artificial intelligence Hopsworks platform. While the Food Security TEP provides easy access to Sentinel-2 data and the possibility of developing processing algorithms directly in the cloud, the Hopsworks platform has been used to train DL algorithms in a distributed manner. The experimental analysis was carried out in the upper part of the Danube Basin for the years 2018, 2019, and 2020, considering 37 Sentinel-2 tiles acquired in Austria, Moravia, Hungary, Slovakia, and Germany.

Download Paper

2023-04-05

ANIARA Project - Automation of Network Edge Infrastructure and Applications with AI

Wolfgang John, Ali Balador, Jalil Taghia, Andreas Johnsson, Johan Sjöberg, Ian Marsh, Jonas Gustafsson, Federico Tonini, Paolo Monti, Pontus Sköldström, Jim Dowling

Emerging use-cases like smart manufacturing and smart cities pose challenges in terms of latency that cannot be satisfied by traditional centralized infrastructure. Edge networks, which bring computational capacity closer to the users/clients, are a promising solution for supporting these critical low-latency services. Unlike traditional centralized networks, the edge is distributed by nature and is usually equipped with limited compute capacity. This creates a complex network to operate, subject to failures of many different kinds, that requires novel solutions to work in practice. To reduce complexity, edge application technology enablers and advanced infrastructure and application orchestration techniques need to be in place, with AI and ML as key players.

Download Paper

2022-04-14

Scalable Artificial Intelligence for Earth Observation Data Using Hopsworks

Desta Haileselassie Hagos, Theofilos Kakantousis, Sina Sheikholeslami, Tianze Wang, Vladimir Vlassov, Amir Hossein Payberah, Moritz Meister, Robin Andersson, Jim Dowling

This paper introduces the Hopsworks platform to the entire Earth Observation (EO) data community and the Copernicus programme. Hopsworks is a scalable data-intensive open-source Artificial Intelligence (AI) platform that was jointly developed by Logical Clocks and the KTH Royal Institute of Technology for building end-to-end Machine Learning (ML)/Deep Learning (DL) pipelines for EO data. It provides the full stack of services needed to manage the entire life cycle of data in ML. In particular, Hopsworks supports the development of horizontally scalable DL applications in notebooks and the operation of workflows to support those applications, including parallel data processing, model training, and model deployment at scale. To the best of our knowledge, this is the first work that demonstrates the services and features of the Hopsworks platform, which provide users with the means to build scalable end-to-end ML/DL pipelines for EO data, as well as support for the discovery and search for EO metadata. This paper serves as a demonstration and walkthrough of the stages of building a production-level model that includes data ingestion, data preparation, feature extraction, model training, model serving, and monitoring. To this end, we provide a practical example that demonstrates the aforementioned stages with real-world EO data and includes source code that implements the functionality of the platform. We also perform an experimental evaluation of two frameworks built on top of Hopsworks, namely Maggy and AutoAblation. We show that using Maggy for hyperparameter tuning results in roughly half the wall-clock time required to execute the same number of hyperparameter tuning trials using Spark while providing linear scalability as more workers are added. Furthermore, we demonstrate how AutoAblation facilitates the definition of ablation studies and enables the asynchronous parallel execution of ablation trials.

Download Paper

2021-12-01

HEAP - Human Exposome Assessment Platform

Roxana Merino Martinez, Heimo Müller, Stefan Negru, Alex Ormenisan, Laila Sara Arroyo Mühr, Xinyue Zhang, Frederik Trier Møller, Mark S Clements, Zisis Kozlakidis, Ville N Pimenoff, Bartlomiej Wilkowski, Martin Boeckhout, Hanna Öhman, Steven Chong, Andreas Holzinger, Matti Lehtinen, Evert-Ben van Veen, Piotr Bała, Martin Widschwendter, Jim Dowling, Juha Törnroos, Michael P Snyder, Joakim Dillner

The Human Exposome Assessment Platform (HEAP) is a research resource for the integrated and efficient management and analysis of human exposome data. The project will provide the complete workflow for obtaining exposome actionable knowledge from population-based cohorts. HEAP is a state-of-the-science service composed of computational resources from partner institutions, accessed through a software framework that provides the world’s fastest Hadoop platform for data warehousing and applied artificial intelligence (AI). The software will provide a decision support system for researchers and policymakers. All the data managed and processed by HEAP, together with the analysis pipelines, will be available for future research. In addition, the platform enables adding new data and analysis pipelines. HEAP’s final product can be deployed in multiple instances to create a network of shareable and reusable knowledge on the impact of exposures on public health.

Download Paper

2021-08-26

ExtremeEarth meets Satellite Data from Space

Desta Haileselassie Hagos, Theofilos Kakantousis, Vladimir Vlassov, Sina Sheikholeslami, Tianze Wang, Jim Dowling, Claudia Paris, Daniele Marinelli, Giulio Weikmann, Lorenzo Bruzzone, Salman Khaleghian, Thomas Kræmer, Torbjørn Eltoft, Andrea Marinoni, Despina-Athanasia Pantazi, George Stamoulis, Dimitris Bilidas, George Papadakis, George Mandilaras, Manolis Koubarakis, Antonis Troumpoukis, Stasinos Konstantopoulos, Markus Muerth, Florian Appel, Andrew Fleming, Andreas Cziferszky

Bringing together a number of cutting-edge technologies that range from storing extremely large volumes of data all the way to developing scalable machine learning and deep learning algorithms in a distributed manner, and having them operate over the same infrastructure, poses unprecedented challenges. One of these challenges is the integration of the European Space Agency's (ESA) Thematic Exploitation Platforms (TEPs) and data and information access service platforms with a data platform, namely Hopsworks, which enables scalable data processing, machine learning, and deep learning on Copernicus data, as well as the development of very large training datasets for deep learning architectures targeting the classification of Sentinel images. In this article, we present the software architecture of ExtremeEarth, which aims at the development of scalable deep learning and geospatial analytics techniques for processing and analyzing petabytes of Copernicus data. The ExtremeEarth software infrastructure seamlessly integrates existing and novel software platforms and tools for storing, accessing, processing, analyzing, and visualizing large amounts of Copernicus data. New techniques in the areas of remote sensing and artificial intelligence, with an emphasis on deep learning, are developed. These techniques and the corresponding software presented in this article are to be integrated with and used in two ESA TEPs, namely the Polar and Food Security TEPs. Furthermore, we present the integration of Hopsworks with the Polar and Food Security use cases and the flow of events for the products offered through the TEPs.

Download Paper

2021-04-16

AutoAblation: Automated Parallel Ablation Studies for Deep Learning

Sina Sheikholeslami, Moritz Meister, Tianze Wang, Amir H Payberah, Vladimir Vlassov, Jim Dowling

Ablation studies provide insights into the relative contribution of different architectural and regularization components to machine learning models' performance. In this paper, we introduce AutoAblation, a new framework for the design and parallel execution of ablation experiments. AutoAblation provides a declarative approach to defining ablation experiments on model architectures and training datasets, and enables the parallel execution of ablation trials. This reduces the execution time and allows more comprehensive experiments by exploiting larger amounts of computational resources. We show that AutoAblation can provide near-linear scalability by performing an ablation study on the modules of the Inception-v3 network trained on the TenGeoPSAR dataset.
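
As a rough illustration of the declarative pattern, rather than AutoAblation's actual API, the self-contained sketch below lists the components to ablate once and then runs one trial per excluded component in parallel; all names and the dummy metric are made up for the example.

    from concurrent.futures import ProcessPoolExecutor

    # Declarative "study": which components of the base model to ablate (illustrative names)
    ABLATION_STUDY = {
        "base_components": ["conv_block_1", "conv_block_2", "dropout", "batch_norm"],
        "ablate": ["conv_block_2", "dropout", "batch_norm"],   # each trial drops one of these
    }

    def train_and_evaluate(excluded_component):
        """Stand-in training function: build the model without one component, return a score."""
        components = [c for c in ABLATION_STUDY["base_components"] if c != excluded_component]
        # ... build and train the reduced model here; a dummy metric keeps the sketch runnable
        return {"excluded": excluded_component, "accuracy": 0.9 - 0.01 * len(components)}

    if __name__ == "__main__":
        trials = [None] + ABLATION_STUDY["ablate"]         # None = full model baseline
        with ProcessPoolExecutor() as pool:                # ablation trials execute in parallel
            results = list(pool.map(train_and_evaluate, trials))
        for r in results:
            print(r)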

Download Paper

2021-02-01

DeepCube: Explainable AI Pipelines for Big Copernicus Data

Ioannis Papoutsis, Alkyoni Baglatzi, Souzana Touloumtzi, Markus Reichstein, Nuno Carvalhais, Fabian Gans, Gustau Camps-Valls, Maria Piles, Theofilos Kakantousis, Jim Dowling, Manolis Koubarakis, Dimitris Bilidas, Despina-Athanasia Pantazi, George Stamoulis, Christophe Demange, Léo-Gad Journel, Marco Bianchi, Chiara Gervasi, Alessio Rucci, Ioannis Tsampoulatidis, Eleni Kamateri, Tarek Habib, Alejandro Dıaz Bolıvar, Zisoula Ntasiou, Anastasios Paschalis

The H2020 DeepCube project leverages advances in the fields of Artificial Intelligence and Semantic Web to unlock the potential of Copernicus Big Data and contribute to the Digital Twin Earth initiative. DeepCube aims to address problems of high socio-environmental impact and enhance our understanding of Earth’s processes correlated with Climate Change. To achieve this, the project employs novel technologies, such as the Earth System Data Cube, the Semantic Cube, the Hopsworks platform for distributed deep learning, and visual analytics tools, integrating them into an open, cloud-interoperable platform. DeepCube will develop Deep Learning architectures that extend to non-conventional data, apply hybrid modeling for data-driven AI models that respect physical laws, and open up the Deep Learning black box with Explainable Artificial Intelligence and Causality.

Download Paper

2020-11-19

Distributed Hierarchical File Systems Strike Back in the Cloud

Mahmoud Ismail, Salman Niazi, Mauritz Sundell, Mikael Ronström, Seif Haridi, Jim Dowling

Cloud service providers have aligned on availability zones as an important unit of failure and replication for storage systems. An availability zone (AZ) has independent power, networking, and cooling systems and consists of one or more data centers. Multiple AZs in close geographic proximity form a region that can support replicated low-latency storage services that can survive the failure of one or more AZs. Recent reductions in inter-AZ latency have made synchronous replication protocols increasingly viable, instead of traditional quorum-based replication protocols. We introduce HopsFS-CL, a distributed hierarchical file system with support for high availability (HA) across AZs, backed by AZ-aware synchronously replicated metadata and AZ-aware block replication. HopsFS-CL is a redesign of HopsFS, a version of HDFS with distributed metadata, and its design involved making replication protocols and block placement protocols AZ-aware at all layers of its stack: the metadata serving, the metadata storage, and block storage layers. In experiments on a real-world workload from Spotify, we show that HopsFS-CL, deployed in HA mode over 3 AZs, reaches 1.66 million ops/s, and has similar performance to HopsFS when deployed in a single AZ, while preserving the same semantics.

Download Paper

2020-11-19

HopsFS-S3: Extending Object Stores with POSIX-like Semantics

Mahmoud Ismail, Salman Niazi, Gautier Berthou, Mikael Ronström, Seif Haridi, Jim Dowling

Object stores have become the de-facto platform for storage in the cloud due to their scalability, high availability, and low cost. However, they provide weaker metadata semantics and lower performance compared to distributed hierarchical file systems. In this paper, we introduce HopsFS-S3, a hybrid distributed hierarchical file system backed by an object store while preserving the file system’s strong consistency semantics. We base our implementation on HopsFS, a next-generation distribution of HDFS with distributed metadata. We redesigned HopsFS’ block storage layer to transparently use an object store to store the file’s blocks without sacrificing the file system’s semantics. We also introduced a new block caching service to leverage faster NVMe storage for hot blocks. In our experiments, we show that HopsFS-S3 outperforms EMRFS for IO-bound workloads, with up to 20% higher performance, and delivers up to 3.4X the aggregated read throughput of EMRFS. Moreover, we demonstrate that metadata operations on HopsFS-S3 (such as directory rename) are up to two orders of magnitude faster than EMRFS. Finally, HopsFS-S3 opens up the currently closed metadata in object stores, enabling correctly-ordered change notifications with HopsFS’ change data capture (CDC) API and customized extensions to metadata.

Download Paper

2020-11-19

Maggy: Scalable Asynchronous Parallel Hyperparameter Search

Moritz Meister, Sina Sheikholeslami, Amir H. Payberah, Vladimir Vlassov, Jim Dowling

Running extensive experiments is essential for building Machine Learning (ML) models. Such experiments usually require iterative execution of many trials with varying run times. In recent years, Apache Spark has become the de-facto standard for parallel data processing in the industry, in which iterative processes are implemented within the bulk-synchronous parallel (BSP) execution model. The BSP approach is also being used to parallelize ML trials in Spark. However, the BSP task synchronization barriers prevent asynchronous execution of trials, which leads to a reduced number of trials that can be run on a given computational budget. In this paper, we introduce Maggy, an open-source framework based on Spark, to execute ML trials asynchronously in parallel, with the ability to early-stop poorly performing trials. In the experiments, we compare Maggy with the BSP execution of parallel trials in Spark and show that, for random hyperparameter search on a convolutional neural network for the Fashion-MNIST dataset, Maggy reduces the time required to execute a fixed number of trials by 33% to 58%, without any loss in the final model accuracy.
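
The contrast with bulk-synchronous execution can be sketched in plain Python: trials free their worker as soon as they finish, and a simple early-stopping rule can terminate a lagging trial without waiting for a synchronization barrier. This illustrates the execution model only, not Maggy's API; the learning curve, stopping rule, and worker counts are all made up.

    import random
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def run_trial(trial_id, learning_rate, finished_scores):
        """Stand-in for training one hyperparameter configuration."""
        score = 0.0
        for epoch in range(10):
            score += random.random() * learning_rate            # fake learning curve
            # After a warm-up, stop early if this trial badly trails the median finished score
            if epoch >= 5 and finished_scores:
                median = sorted(finished_scores)[len(finished_scores) // 2]
                if score < 0.5 * median:
                    return trial_id, round(score, 3), "early-stopped"
        finished_scores.append(score)
        return trial_id, round(score, 3), "completed"

    if __name__ == "__main__":
        finished_scores = []
        configs = [(i, random.uniform(0.01, 0.3)) for i in range(16)]
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = [pool.submit(run_trial, i, lr, finished_scores) for i, lr in configs]
            # No synchronization barrier: each result is consumed as soon as its trial finishes
            for fut in as_completed(futures):
                print(fut.result())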

Download Paper

2020-07-28

Time Travel and Provenance for Machine Learning Pipelines

Alexandru A. Ormenisan, Moritz Meister, Fabio Buso, Robin Andersson, Seif Haridi, Jim Dowling

Machine learning pipelines have become the de facto paradigm for productionizing machine learning applications, as they clearly abstract the processing steps involved in transforming raw data into the engineered features that are then used to train models. In this paper, we use a bottom-up method for capturing provenance information regarding the processing steps and artifacts produced in ML pipelines. Our approach is based on replacing traditional intrusive hooks in application code (to capture ML pipeline events) with standardized change-data-capture support in the systems involved in ML pipelines: the distributed file system, the feature store, the resource manager, and the applications themselves. In particular, we leverage data versioning and time-travel capabilities in our feature store to show how provenance can enable model reproducibility and debugging.
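
The time-travel capability referred to here is of the kind exposed by table formats such as Apache Hudi, which backs offline feature data in Hopsworks. Below is a minimal PySpark sketch of a point-in-time read, assuming a Hudi-backed table path, a made-up timestamp, and a Spark session with the Hudi bundle on its classpath.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("feature-time-travel")
             .getOrCreate())                    # assumes the hudi-spark bundle is on the classpath

    base_path = "hdfs:///projects/demo/fs/transactions_fg"   # hypothetical feature group path

    # Read the feature data exactly as it was at a past instant (Hudi time-travel query)
    df_as_of = (spark.read.format("hudi")
                .option("as.of.instant", "2020-06-01 00:00:00")
                .load(base_path))

    df_as_of.show(5)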

Download Paper

2020-03-02

Towards Distribution Transparency for Supervised ML With Oblivious Training Functions

Moritz Meister, Sina Sheikholeslami, Robin Andersson, Alexandru A. Ormenisan, Jim Dowling

Building and productionizing Machine Learning (ML) models is a process of interdependent steps of iterative code updates, including exploratory model design, hyperparameter tuning, ablation experiments, and model training. Industrial-strength ML involves doing this at scale, using many compute resources, and this requires rewriting the training code to account for distribution. The result is that moving from a single-host program to a cluster hinders iterative development of the software, as it would require multiple versions of the software to be maintained and kept consistent. In this paper, we introduce the distribution-oblivious training function as an abstraction for ML development in Python, whereby developers can reuse the same training function when running a notebook on a laptop or performing scale-out hyperparameter search and distributed training on clusters. Programs written in our framework look like industry-standard ML programs, as we factor out dependencies using best-practice programming idioms (such as functions to generate models and data batches). We believe that our approach takes a step towards unifying single-host and distributed ML development.
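
The shape of such a distribution-oblivious training function can be sketched as follows: it takes its hyperparameters as arguments, builds its model through a factory function, and contains no cluster-specific code, so the same function can be called directly during development or shipped to workers by a launcher. The launcher call in the final comment is hypothetical, not the framework's actual API, and the model and training loop are stand-ins.

    def build_model(hidden_units):
        """Best-practice idiom: a factory that returns a fresh model (stand-in implementation)."""
        return {"hidden_units": hidden_units, "weights": [0.0] * hidden_units}

    def train(lr, hidden_units):
        """Distribution-oblivious training function: no cluster or device code inside."""
        model = build_model(hidden_units)
        loss = 1.0
        for _ in range(100):
            loss *= (1.0 - lr)        # stand-in for an optimization step
        return loss                   # the metric a search would optimize

    # Single-host, iterative development: just call it.
    print(train(lr=0.05, hidden_units=32))

    # Scale-out, conceptually: the same function object is handed to a launcher, e.g.
    #   launcher.run(train, search_space={"lr": [0.01, 0.05], "hidden_units": [32, 64]})
    # (the launcher name and signature above are hypothetical)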

Download Paper

2020-02-24

Implicit Provenance for Machine Learning Artifacts

Alexandru A. Ormenisan, Mahmoud Ismail, Seif Haridi, Jim Dowling

Machine learning (ML) presents new challenges for reproducible software engineering, as the artifacts required for repeatably training models are not just versioned code, but also hyperparameters, code dependencies, and the exact version of the training data. Existing systems for tracking the lineage of ML artifacts, such as TensorFlow Extended or MLflow, are invasive, requiring developers to refactor their code, which is then controlled by the external system. In this paper, we present an alternative approach, which we call implicit provenance, where we instrument a distributed file system and its APIs to capture changes to ML artifacts; combined with file naming conventions, this means that full lineage can be tracked for TensorFlow/Keras/PyTorch programs without requiring code changes. We address challenges related to adding strongly consistent metadata extensions to the distributed file system, while minimizing provenance overhead, and ensuring transparent, eventually consistent replication of extended metadata to an efficient search engine, Elasticsearch. Our provenance framework is integrated into the open-source Hopsworks framework, and is used in production to enable full provenance for end-to-end machine learning pipelines.

Download Paper

2019-07-05

Scalable Block Reporting for HopsFS - Best Student Paper award at IEEE BigData Congress’19

Mahmoud Ismail, August Bonds, Salman Niazi, Seif Haridi, Jim Dowling

Distributed hierarchical file systems typically decouple the storage of the file system's metadata from the data (file system blocks) to enable the scalability of the file system. This decoupling, however, requires the introduction of a periodic synchronization protocol to ensure the consistency of the file system's metadata and its blocks. Apache HDFS and HopsFS implement a protocol, called block reporting, where each data server periodically sends ground truth information about all its file system blocks to the metadata servers, allowing the metadata to be synchronized with the actual state of the data blocks in the file system. The network and processing overhead of the existing block reporting protocol, however, increases with cluster size, ultimately limiting cluster scalability. In this paper, we introduce a new block reporting protocol for HopsFS that reduces the protocol bandwidth and processing overhead by up to three orders of magnitude, compared to HDFS/HopsFS' existing protocol. Our new protocol removes a major bottleneck that prevented HopsFS clusters scaling to tens of thousands of servers.

Download Paper

2019-05-22

ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata

Mahmoud Ismail, Mikael Ronström, Seif Haridi, Jim Dowling

Distributed OLTP databases are now used to manage metadata for distributed file systems, but they cannot also efficiently support complex queries or aggregations. To solve this problem, we introduce ePipe, a databus that both creates a consistent change stream for a distributed, hierarchical file system (HopsFS) and eventually delivers the correctly ordered stream with low latency to downstream clients. ePipe can be used to provide polyglot storage for file system metadata, allowing metadata queries to be handled by the most efficient engine for that query. For file system notifications, we show that ePipe achieves up to 56X throughput improvement over HDFS INotify and Trumpet with up to 3 orders of magnitude lower latency. For Spotify's Hadoop workload, we show that ePipe can replicate all file system changes from HopsFS to Elasticsearch with an average replication lag of only 330 ms.

Download Paper

2019-03-20

Horizontally Scalable ML Pipelines with a Feature Store

Alexandru A. Ormenisan, Mahmoud Ismail, Kim Hammar, Robin Andersson, Ermias Gebremeskel, Theofilos Kakantousis, Antonios Kouzoupis, Fabio Buso, Gautier Berthou, Jim Dowling, Seif Haridi

Machine Learning (ML) pipelines are the fundamental building block for productionizing ML models. However, much introductory material for machine learning and deep learning emphasizes ad-hoc feature engineering and training pipelines to experiment with ML models. Such pipelines have a tendency to become complex over time and do not allow features to be easily re-used across different pipelines. Duplicating features can even lead to correctness problems when features have different implementations for training and serving. In this demo, we introduce the Feature Store as a new data layer in horizontally scalable machine learning pipelines.

Download Paper

2018-12-18

Size Matters: Improving the Performance of Small Files in Hadoop

Salman Niazi, Seif Haridi, Mikael Ronström, Jim Dowling

The Hadoop Distributed File System (HDFS) is designed to handle massive amounts of data, preferably stored in very large files. The poor performance of HDFS in managing small files has long been a bane of the Hadoop community. In many production deployments of HDFS, almost 25% of the files are less than 16 KB in size and as much as 42% of all the file system operations are performed on these small files. We have designed an adaptive tiered storage, using in-memory and on-disk tables stored in a high-performance distributed database, to efficiently store and improve the performance of small files in HDFS. Our solution is completely transparent, and it does not require any changes in the HDFS clients or the applications using the Hadoop platform. In experiments, we observed up to 61 times higher throughput in writing files, and for real-world workloads from Spotify our solution reduces the latency of reading and writing small files by factors of 3.15 and 7.39, respectively.

Download Paper

2017-05-24

Scaling HDFS to more than 1 million operations per second with HopsFS

Salman Niazi, Mahmoud Ismail, Mikael Ronström, Seif Haridi, Jim Dowling

HopsFS is an open-source, next-generation distribution of the Apache Hadoop Distributed File System (HDFS) that replaces the main scalability bottleneck in HDFS, the single-node in-memory metadata service, with a no-shared-state distributed system built on a NewSQL database. By removing the metadata bottleneck in Apache HDFS, HopsFS enables significantly larger cluster sizes, more than an order of magnitude higher throughput, and significantly lower client latencies for large clusters. In this paper, we detail the techniques and optimizations that enable HopsFS to surpass 1 million file system operations per second - at least 16 times higher throughput than HDFS. In particular, we discuss how we exploit recent high-performance features of NewSQL databases, such as application-defined partitioning, partition-pruned index scans, and distribution-aware transactions. Together with more traditional techniques, such as batching and write-ahead caches, we show how many incremental optimizations have enabled a revolution in distributed hierarchical file system performance.

Download Paper

2017-02-07

HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases

Salman Niazi, Mahmoud Ismail, Mikael Ronström, Steffen Grohsschmiedt, Seif Haridi, Jim Dowling

Recent improvements in both the performance and scalability of shared-nothing, transactional, in-memory NewSQL databases have reopened the research question of whether distributed metadata for hierarchical file systems can be managed using commodity databases. In this paper, we introduce HopsFS, a next-generation distribution of the Hadoop Distributed File System (HDFS) that replaces HDFS’ single-node in-memory metadata service with a distributed metadata service built on a NewSQL database. By removing the metadata bottleneck, HopsFS enables an order of magnitude larger and higher-throughput clusters compared to HDFS. Metadata capacity has been increased to at least 37 times HDFS’ capacity, and in experiments based on a workload trace from Spotify, we show that HopsFS supports 16 to 37 times the throughput of Apache HDFS. HopsFS also has lower latency for many concurrent clients, and no downtime during failover. Finally, as metadata is now stored in a commodity database, it can be safely extended and easily exported to external systems for online analysis and free-text search.

Download Paper

2015-06-18

Leader Election Using NewSQL Database Systems

Salman Niazi, Mahmoud Ismail, Gautier Berthou, Jim Dowling

Leader election protocols are a fundamental building block for replicated distributed services. They ease the design of leader-based coordination protocols that tolerate failures. In partially synchronous systems, designing a leader election algorithm that does not permit multiple leaders while the system is unstable is a complex task. As a result, many production systems use third-party distributed coordination services, such as ZooKeeper and Chubby, to provide a reliable leader election service. However, adding a third-party service such as ZooKeeper to a distributed system incurs additional operational costs and complexity. ZooKeeper instances must be kept running on at least three machines to ensure its high availability. In this paper, we present a novel leader election protocol using NewSQL databases for partially synchronous systems that ensures at most one leader at any given time. The leader election protocol uses the database as distributed shared memory. Our work enables distributed systems that already use NewSQL databases to save the operational overhead of managing an additional third-party service for leader election. Our main contribution is the design, implementation, and validation of a practical leader election algorithm, based on NewSQL databases, that has performance comparable to a leader election implementation using a state-of-the-art distributed coordination service, ZooKeeper.
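
To illustrate the "database as distributed shared memory" idea, here is a minimal lease-based sketch in Python; sqlite3 stands in for a NewSQL database purely to keep the example self-contained and runnable, and the schema, lease length, and renewal period are illustrative rather than the paper's actual protocol.

    import sqlite3, time, uuid

    LEASE_SECONDS = 5
    NODE_ID = str(uuid.uuid4())

    # A shared database stands in for the NewSQL store; every node runs this same code
    conn = sqlite3.connect("leader.db", isolation_level=None)   # autocommit; we issue BEGIN ourselves
    conn.execute("""CREATE TABLE IF NOT EXISTS leader (
                        id INTEGER PRIMARY KEY CHECK (id = 0),
                        holder TEXT,
                        expires REAL)""")
    conn.execute("INSERT OR IGNORE INTO leader (id, holder, expires) VALUES (0, NULL, 0)")

    def try_acquire():
        """Take or renew the lease inside a single transaction: at most one holder per lease period."""
        now = time.time()
        conn.execute("BEGIN IMMEDIATE")                          # exclusive write transaction
        holder, expires = conn.execute(
            "SELECT holder, expires FROM leader WHERE id = 0").fetchone()
        if holder == NODE_ID or expires < now:                   # lease is ours, free, or expired
            conn.execute("UPDATE leader SET holder = ?, expires = ? WHERE id = 0",
                         (NODE_ID, now + LEASE_SECONDS))
            conn.execute("COMMIT")
            return True
        conn.execute("COMMIT")
        return False

    for _ in range(5):                                           # in practice this loop runs indefinitely
        print("leader" if try_acquire() else "follower")
        time.sleep(LEASE_SECONDS / 2)                            # renew well before the lease expires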

Download Paper