Nearly a year ago, the Hopsworks team embarked on a journey to migrate its infrastructure to Kubernetes. In this article we describe three main pillars of our Kubernetes migration.
We describe the capabilities that need to be added to the Lakehouse to make it an AI Lakehouse that can support building and operating AI-enabled batch and real-time applications as well as LLM applications.
We present how Hopsworks leverages its time-travel capabilities for feature groups to support reproducible creation of training datasets using metadata.
Learn more about how Hopsworks (RonDB) outperforms AWS Sagemaker and GCP Vertex in latency for real-time AI databases, based on a peer-reviewed SIGMOD 2024 benchmark.
Read how Hopsworks generates temporal queries from Python code, and how a native query engine built on Arrow can massively outperform JDBC/ODBC APIs.
This article introduces a taxonomy for data transformations in AI applications that is fundamental for any AI system that wants to reuse feature data in more than one model.
We present a unified software architecture for batch, real-time, and LLM AI systems that is based on a shared storage layer and a decomposition of machine learning pipelines.
The third edition of the LLM Makerspace dived into an example of an LLM system for detecting check fraud.
This article covers the different aspects of Job Scheduling in Hopsworks, including how simple jobs can be scheduled through the Hopsworks UI by non-technical users.
A summary from our LLM Makerspace event where we built our own PDF Search Tool using RAG and fine-tuning in one platform. Follow along on the journey to build an LLM application from scratch.
The decision to build versus buy a feature store has both strategic and technical components to consider, as it impacts both cost and technical debt.
This is a summary of our latest LLM Makerspace event where we pulled back the curtain on an exciting paradigm in AI – function calling with LLMs.
We go through the most common error messages in Pandas and offer solutions to these errors, as well as provide efficiency tips for Pandas code.
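The post's full list of errors isn't reproduced here, but a minimal sketch of one frequent pitfall it alludes to — chained indexing triggering `SettingWithCopyWarning` — with the idiomatic `.loc` fix and a vectorization tip (the DataFrame and column names are illustrative, not from the article):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 62, 41], "income": [30_000, 55_000, 48_000]})

# Pitfall: chained indexing like df[df["age"] > 40]["income"] = ... raises
# SettingWithCopyWarning and may silently fail to update df.
# Fix: a single .loc call combining the row mask and column selection.
df.loc[df["age"] > 40, "income"] = df.loc[df["age"] > 40, "income"] + 1_000

# Efficiency tip: prefer vectorized column operations over row-wise .apply
df["income_k"] = df["income"] / 1_000
```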
Read about the advantages of using DBT for data warehouses and how it's positioned as a preferred solution for many data analytics and engineering teams.
We review Python libraries for feature engineering, such as Pandas, Pandas 2 and Polars, evaluate their performance, and explore how they power machine learning use cases.
Delve into the profound implications of machine learning embeddings, their diverse applications, and their crucial role in reshaping the way we interact with data.
We explain a new framework for ML systems as three independent ML pipelines: feature pipelines, training pipelines, and inference pipelines, creating a unified MLOps architecture.
Unlock the power of Apache Airflow in the context of feature engineering. We will delve into building a feature pipeline using Airflow, focusing on two tasks: feature binning and aggregations.
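The Airflow DAG itself isn't reproduced here; as a hedged sketch, the two tasks the post names — feature binning and aggregations — might each wrap Pandas logic along these lines (the `transactions` schema and bin edges are hypothetical):

```python
import pandas as pd

transactions = pd.DataFrame({
    "user_id": [1, 1, 2],
    "amount": [5.0, 120.0, 40.0],
})

# Task 1: feature binning — bucket raw amounts into coarse categories
transactions["amount_bin"] = pd.cut(
    transactions["amount"],
    bins=[0, 10, 100, float("inf")],
    labels=["small", "medium", "large"],
)

# Task 2: aggregations — per-user summary features
user_features = (
    transactions.groupby("user_id")["amount"]
    .agg(total="sum", avg="mean")
    .reset_index()
)
```

In an Airflow pipeline, each function would typically become its own task (e.g. via `PythonOperator` or the TaskFlow API), with the binned output passed downstream to the aggregation step.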
An ML model’s ability to learn and read data patterns largely depends on feature quality. With frameworks such as FeatureTools, ML practitioners can automate the feature engineering process.
In this article, we outline how we leveraged ArrowFlight with DuckDB to build a new service that massively improves the performance of Python clients reading lakehouse data from the Feature Store.
Find out how to use Flink to compute real-time features and make them available to online models within seconds using Hopsworks.
Explore the power of feature engineering for categorical features using Pandas. Learn essential techniques for handling categorical variables and creating new features.
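As a minimal sketch of the kind of techniques the post covers — one-hot encoding a low-cardinality column and ordinal-encoding an ordered one — with illustrative column names not taken from the article:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Stockholm", "London", "Stockholm"],
    "plan": ["free", "pro", "pro"],
})

# One-hot encode a low-cardinality categorical feature
one_hot = pd.get_dummies(df["city"], prefix="city")

# Ordinal-encode a categorical with a meaningful order
plan_order = pd.CategoricalDtype(["free", "pro"], ordered=True)
df["plan_code"] = df["plan"].astype(plan_order).cat.codes
```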
Learn more about how Hopsworks stores both data and validation artifacts, enabling easy monitoring on the Feature Group UI page.
In this blog, we introduce the Hopsworks Connector API, which is used to mount a table in an external data source as an external feature group in Hopsworks.
Learn how the Hopsworks feature store APIs work and what it takes to go from a Pandas DataFrame to features used by models for both training and inference.
In this blog post we showcase the results of a study that examined point-in-time join optimization using Apache Spark in Hopsworks.
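The study's Spark implementation isn't reproduced here, but the core idea of a point-in-time join — for each labeled event, pick the latest feature value committed at or before the event time, so no future data leaks into training — can be sketched in Pandas with `merge_asof` (the schemas below are hypothetical):

```python
import pandas as pd

# Label events: (entity id, event timestamp, label)
labels = pd.DataFrame({
    "id": [1, 1],
    "ts": pd.to_datetime(["2024-01-05", "2024-01-20"]),
    "label": [0, 1],
})

# Feature values with their commit timestamps
features = pd.DataFrame({
    "id": [1, 1, 1],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-10", "2024-01-18"]),
    "balance": [100, 150, 120],
})

# Point-in-time join: latest feature row with ts <= event ts, per entity
training = pd.merge_asof(
    labels.sort_values("ts"),
    features.sort_values("ts"),
    on="ts", by="id", direction="backward",
)
```

In Spark, the same semantics are typically expressed as a join-plus-window (rank by feature timestamp per event and keep the most recent), which is exactly the query shape whose optimization the study examines.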
Programmers know data types, but what is a feature type to a programmer new to machine learning, given no mainstream programming language has native support for them?
Operational machine learning requires the offline and online testing of both features and models. In this article, we show you how to design, build, and run tests for features.
We are introducing a new feature in the Hopsworks UI - feature code preview - the ability to view the notebook used to create a Feature Group or Training Dataset.
In this blog post we demonstrate how to build a machine learning pipeline with real-world data in order to develop an iceberg classification model.
Hopsworks brings support for scale-out AI with the ExtremeEarth project, which focuses on the pressing issues of food security and sea mapping.
This tutorial gives an overview of how to work with Jupyter on the platform and train a state-of-the-art ML model using the fastai Python library.
Many developers believe S3 is the "end of file system history": it is impossible to build a file/object storage system on AWS that can compete with S3 on cost. But what if you could build on top of S3?
Read how Hopsworks supports easy hyperparameter optimization (both synchronous and asynchronous search) and distributed training using PySpark.
Hopsworks is replacing Horovod with Keras/TensorFlow’s new CollectiveAllReduceStrategy, part of the Keras/TensorFlow Estimator framework.