The rapid advancement of machine learning (ML) and artificial intelligence (AI) has brought immense opportunities but also significant challenges, particularly in managing the high costs of training and deploying complex models. At Hopsworks, we have seen organizations grapple with escalating compute expenses as they move their applications from experiments to production systems. What seemed like a fairly small cost in development often balloons to unaffordable proportions that jeopardize the entire project.
This post will explore several key strategies to mitigate these costs, focusing on leveraging GPUs, transitioning to on-premise infrastructure, and implementing advanced resource management techniques, all while highlighting how Hopsworks can facilitate these optimizations.
Traditionally, CPUs have been the workhorses of computing, but for machine learning workloads GPUs have emerged as the more efficient and cost-effective option. GPUs are designed for parallel processing, making them ideal for the matrix operations at the heart of deep learning. With thousands of cores executing calculations simultaneously, compared with the dozens on a typical CPU, they deliver far more compute per dollar for these workloads. In practice, GPU clusters also complete training jobs faster than CPU clusters, which helps teams meet their SLAs. And GPUs are not limited to model training: they can accelerate other data science tasks such as ETL jobs and data wrangling.
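To make that gap concrete, here is a minimal benchmark sketch in PyTorch that times the same dense matrix multiplication on a CPU and, if one is available, a GPU. The matrix size and repeat count are arbitrary choices for the example:

```python
import time
import torch

def time_matmul(device: str, n: int = 4096, repeats: int = 10) -> float:
    """Average time of an n x n matrix multiplication on the given device."""
    x = torch.randn(n, n, device=device)
    y = torch.randn(n, n, device=device)
    torch.matmul(x, y)  # warm-up so lazy initialization doesn't skew timing
    if device == "cuda":
        torch.cuda.synchronize()  # GPU kernels run async; wait before timing
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(x, y)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul('cpu'):.4f} s per matmul")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.4f} s per matmul")
```

On typical hardware the GPU figure comes out one to two orders of magnitude lower, which is exactly the effect that translates into a lower cost per training run.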
While cloud computing has made ML workloads accessible and scalable, the cost can quickly become prohibitive, especially for long-term production projects with five-nines availability requirements and the mission-critical governance controls that sovereign applications demand. At Hopsworks, we have found that customers can achieve three goals at once: stronger data sovereignty, better performance with lower response times, and reduced costs.
Transitioning to on-premise infrastructure can be a strategic move for cost optimization: on-premise ML infrastructure typically has lower long-term costs than cloud services, especially for production-level, 24x7 resource usage. It also allows greater control over hardware configurations and faster network connectivity than is typically offered in standard public cloud infrastructure, leading to improved performance. Dedicated rather than shared resources ensure consistent performance, and data locality avoids the latency and cost of transferring data to and from the cloud.
When deciding between on-premise and cloud solutions, Hopsworks always recommends weighing long-term needs and project requirements to maximize cost efficiency. For stable, long-running workloads, on-premise solutions often come out cheaper over time.
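To put "cheaper over time" in numbers, the short calculation below estimates the breakeven point between renting GPUs on demand and buying a server. Every figure in it is an assumption made up for the example, not a vendor quote, so substitute your own prices:

```python
# Illustrative breakeven estimate: all prices below are assumptions
# chosen for the example, not actual vendor quotes.
num_gpus = 8
cloud_gpu_hourly = 3.00         # assumed on-demand price per GPU-hour (USD)
hours_per_month = 730           # 24x7 production usage

onprem_server_capex = 200_000   # assumed price of one 8-GPU server (USD)
onprem_monthly_opex = 3_000     # assumed power, cooling, hosting, admin (USD)

cloud_monthly = num_gpus * cloud_gpu_hourly * hours_per_month
# Months until cumulative cloud spend exceeds on-prem capex plus opex
breakeven_months = onprem_server_capex / (cloud_monthly - onprem_monthly_opex)
print(f"Cloud spend: ${cloud_monthly:,.0f} per month")
print(f"On-prem pays for itself after ~{breakeven_months:.1f} months")
```

With these assumed numbers the server pays for itself in just over a year; shorter project horizons or low utilization shift the answer back toward the cloud.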
Once you have a GPU-based, on-premise machine learning infrastructure, implementing advanced resource management and scheduling techniques is essential to maximize resource utilization and reduce costs.
For example, several strategies are available to help enterprises optimize GPU usage and performance:

- Time-slicing to share GPUs among workloads
- Multi-Instance GPU (MIG) to partition GPUs for isolation
- Dynamic energy optimization to adjust energy use based on workload
- Batch processing to leverage parallel processing
- Right-sizing GPU instances to match workload demands
- Monitoring and analysis to track performance
- Checkpointing to assess and, if necessary, terminate unproductive training runs
- Mixed precision training to improve performance with lower-precision arithmetic (see the sketch after this list)
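Several of these techniques come down to a few lines of framework code. As one example, here is a minimal mixed-precision training step using PyTorch's automatic mixed precision (AMP) utilities; the tiny linear model and random batch are placeholders for a real workload:

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=device == "cuda")
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(64, 512, device=device)       # placeholder batch
targets = torch.randint(0, 10, (64,), device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, enabled=device == "cuda"):
    # Forward pass runs selected ops in float16, cutting memory and time
    loss = loss_fn(model(inputs), targets)
scaler.scale(loss).backward()   # scale the loss to avoid float16 underflow
scaler.step(optimizer)
scaler.update()
```

Running selected operations in float16 roughly halves activation memory and, on GPUs with tensor cores, can substantially speed up training, while the gradient scaler guards against underflow.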
With the Hopsworks AI Lakehouse, users get not just the feature store foundation but also a centralized MLOps platform built around it, designed for both cloud and on-premise deployments and offering flexibility in governance and cost management. Hopsworks provides the essential resource management and scheduling features that let users optimize their ML infrastructure.
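Resource management of this kind ultimately rests on per-GPU telemetry. Purely as an illustration of what that telemetry looks like, independent of Hopsworks' own interfaces, here is a sketch that reads utilization and memory figures through NVIDIA's NVML bindings (the nvidia-ml-py package):

```python
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent busy
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
        print(f"GPU {i} ({name}): {util.gpu}% util, "
              f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB")
finally:
    pynvml.nvmlShutdown()
```

Feeding numbers like these into dashboards and schedulers is what makes right-sizing and time-slicing decisions evidence-based rather than guesswork.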
The AI Lakehouse is therefore how the GPU and on-premise cost optimizations are brought to life: it lets platform and operations teams dig into the details and make the optimizations that really contribute to the bottom line.
Optimizing machine learning costs requires a strategic approach that encompasses the choice of hardware, infrastructure, and resource management techniques. By adopting GPUs, moving to on-premise infrastructure, implementing advanced scheduling, and utilizing platforms like the Hopsworks AI Lakehouse, organizations can significantly reduce their ML expenses while maintaining high performance. Continuous monitoring, experimentation, and adaptation are key to achieving the most efficient and cost-effective machine learning workflows.