By following this tutorial you can evaluate how to get flexible, cost-effective, topology-aware NVMe provisioning without sacrificing performance by using Amazon FSx for NetApp ONTAP along with Hopsworks 4.x, gaining operational simplicity and more controlled storage management.
Introduction
This blog post describes the usage of Amazon FSx for NetApp ONTAP in a Hopsworks 4.x deployment on Amazon Elastic Kubernetes Service (Amazon EKS). We attach NVMe volumes on demand as Kubernetes Persistent Volume Claims (PVCs) for the RonDB service.
Amazon FSx for NetApp ONTAP provides a scalable, highly available, and secure data management platform that integrates well with Kubernetes environments, especially EKS clusters. By leveraging Amazon FSx for NetApp ONTAP in our Hopsworks 4.x deployment on AWS EKS, we can take advantage of its robust and cost-effective features for persistent storage, such as automated data tiering, snapshotting, and high-performance NVMe volumes.
Key points:
Test the interoperability of Amazon FSx for NetApp ONTAP with Hopsworks. In other words, we go step by step to connect an EKS cluster to an Amazon FSx for NetApp ONTAP file system and then install Hopsworks.
Compare the performance of the regular EBS gp3 volumes of AWS EKS against NVMe disks provided by Amazon FSx for NetApp ONTAP.
Results: You can get flexible, topology-aware NVMe provisioning without sacrificing performance. The benchmark numbers are essentially unchanged, so the advantage isn't about speed; it's about operational simplicity and more controlled storage management.
Requirements
kubectl CLI tool installed on your machine
helm CLI tool installed on your machine
AWS account
Setting Up the Cluster
The first step involves creating an EKS cluster with nodes that are prepared for NFS, iSCSI, and NVMe over TCP. Notice the preBootstrapCommands and the AmazonFSxFullAccess policy in the snippet below. Place the snippet into a `cluster_def.yaml` file.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-netapp-ontap
  region: eu-central-1
  version: "1.29"
iam:
  withOIDC: true
managedNodeGroups:
  - name: ng-1
    amiFamily: AmazonLinux2
    instanceType: m6i.2xlarge
    minSize: 1
    maxSize: 9
    desiredCapacity: 9
    volumeSize: 256
    ssh:
      allow: true # will use ~/.ssh/id_rsa.pub as the default ssh key
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonS3FullAccess
        # This is needed for FSx for NetApp ONTAP
        - arn:aws:iam::aws:policy/AmazonFSxFullAccess
      withAddonPolicies:
        imageBuilder: true
    preBootstrapCommands:
      - "sudo yum install -y nfs-utils"             # For NFS support for ontap-nas
      - "sudo yum install -y iscsi-initiator-utils" # For iSCSI support for ontap-san
      - "sudo systemctl enable iscsid"              # Enable iSCSI service on startup
      - "sudo systemctl start iscsid"               # Start iSCSI service
      - "sudo yum install -y nvme-cli"              # For NVMe support for ontap-san
      - "sudo yum install -y linux-modules-extra-$(uname -r)"
      - "sudo modprobe nvme-tcp"
addons:
  - name: aws-ebs-csi-driver
    wellKnownPolicies: # add IAM and service account
      ebsCSIController: true
Then create the cluster by using the eksctl cli tool.
eksctl create cluster -f cluster_def.yaml
For the sake of benchmarking, we label the nodes into three groups, as described below.
We set 3 machines out of 9 to provide NVMe volumes from Amazon FSx for NetApp ONTAP (hw-group=nvme). 4 machines out of 9 are selected to run Hopsworks services (hw-group=hw). Finally, the remaining 2 machines are used to launch locust benchmarking workers (hw-group=locust). Notice that the NVMe-labeled machines are also used for Hopsworks services, so the Hopsworks deployment effectively runs on 7 m6i.2xlarge EKS nodes.
If you prefer to apply this labeling automatically, execute the following script.
# Get nodes in zone eu-central-1b and mark them as nvme
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  # use label topology.kubernetes.io/zone
  zone=$(kubectl get node $node -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
  # label is hw-group=nvme
  if [ "$zone" == "eu-central-1b" ]; then
    kubectl label node $node hw-group=nvme && echo "labeled as nvme"
  else
    kubectl label node $node hw-group=hw && echo "labeled as hw"
  fi
done

# label two of the hw-group=hw nodes and label them as locust
COUNT=0
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  if [ "$COUNT" -eq 2 ]; then
    break
  fi
  hw_group=$(kubectl get node $node -o jsonpath='{.metadata.labels.hw-group}')
  if [ "$hw_group" == "hw" ]; then
    kubectl label node $node hw-group=locust --overwrite && echo "labeled as locust"
    COUNT=$((COUNT+1))
  fi
done
Create the Amazon FSx for NetApp ONTAP File System
Find the FSx category in the AWS console. Then create an Amazon FSx for NetApp ONTAP file system as the image below shows. Create the file system with 3 GB/s throughput capacity.
When creating the file system, select the VPC of the previously created cluster; this effectively connects both networks and makes the file system reachable from the EKS cluster nodes.
After creating the file system (it usually takes around 20 minutes), make sure the route table of the file system is propagated to the EKS cluster's VPC. Select all private subnets for all availability zones, as the image below shows.
Checkpoint: check the EKS cluster networking. If everything is set up correctly, you should see two additional entries in the route table, as in the image below: one for the FSx management endpoint and one for its storage virtual machine.
Set the credentials for the file system's storage virtual machine (SVM). You can do so in the FSx console, as the image below shows.
These credentials are used by the EKS nodes to interact with the file system. This is handled automatically by the Trident Kubernetes operator, whose installation we describe in the following steps.
Once you have set the credentials on the FSx storage virtual machine, create an AWS Secrets Manager secret with the same credentials. Use your terminal to run the following command.
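Below is a minimal sketch of creating that secret with the AWS CLI. The secret name fsx-svm-credentials and the username vsadmin are assumptions; use the SVM user and password you configured above, and keep the secret in the same region as the cluster.

aws secretsmanager create-secret \
  --name fsx-svm-credentials \
  --description "Credentials for the FSx for NetApp ONTAP SVM" \
  --secret-string '{"username":"vsadmin","password":"<YOUR SVM PASSWORD>"}' \
  --region eu-central-1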
The Trident Operator from NetApp is a Kubernetes operator that automates the provisioning and management of persistent storage volumes using NetApp storage solutions. It enables seamless integration between Kubernetes workloads and NetApp's storage systems, such as the previously mentioned Amazon FSx for NetApp ONTAP.
To install the Trident operator, the first step is to create the roles and permissions it needs in AWS. Create an IAM policy for the Trident user: save a policy document like the one below as `policy.json`. Notice that there is a placeholder in the snippet; replace it with the ARN of the FSx SVM credentials secret you created in the step above.
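As a reference, here is a sketch of what policy.json could look like, assuming Trident only needs FSx volume-management actions plus read access to the credentials secret; check the Trident documentation for the exact action list required by your version.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "fsx:DescribeFileSystems",
        "fsx:DescribeStorageVirtualMachines",
        "fsx:DescribeVolumes",
        "fsx:CreateVolume",
        "fsx:UpdateVolume",
        "fsx:DeleteVolume",
        "fsx:TagResource",
        "fsx:UntagResource"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret"
      ],
      "Resource": "<THE ARN OF THE FSX SVM CREDENTIALS SECRET>"
    }
  ]
}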
Once the policy file is created, run the following command.
aws iam create-policy --policy-name AmazonFSxNCSIDriverPolicy \
  --policy-document file://policy.json \
  --description "This policy grants access to Trident CSI to FSxN and Secret manager" \
  --region eu-central-1
The previous command outputs the policy ARN, which is needed to attach these permissions to a role. Save it for the next step.
The next step involves creating the Kubernetes service account permissions (an IAM role for service accounts) for the Trident operator that will be installed. Run the following command with the ARN output from the previous step.
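A sketch of this step using eksctl is shown below. The service account name trident-controller and the role name fsx-trident-role are assumptions (keep fsx-trident in the role name so the lookup later in this post finds it); --role-only leaves the creation of the actual service account to the Trident operator.

eksctl create iamserviceaccount \
  --name trident-controller \
  --namespace trident \
  --cluster eks-netapp-ontap \
  --region eu-central-1 \
  --attach-policy-arn <THE POLICY ARN FROM THE PREVIOUS STEP> \
  --role-name fsx-trident-role \
  --role-only \
  --approve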
Install and configure Trident from its Helm chart. Using the AWS add-on is not recommended because of this issue https://github.com/NetApp/trident/issues/906 and because the add-on installs an old version of the Trident operator. Use the snippet below instead.
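A minimal sketch of the Helm installation, assuming NetApp's public chart repository and the trident namespace (pin a chart --version if you need a specific Trident release):

helm repo add netapp-trident https://netapp.github.io/trident-helm-chart
helm repo update
helm install trident netapp-trident/trident-operator \
  --namespace trident --create-namespace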
Modify the TridentOrchestrator custom resource so that Trident runs only on nodes in the same availability zone as the FSx file system. Concretely, we want to use only the nodes labeled as NVMe in the very first step of this blog post. Set the fields controllerPluginNodeSelector, nodePluginNodeSelector, cloudIdentity, and cloudProvider.
First, get the ARN of the fsx-trident role we created previously.
aws iam list-roles | grep fsx-trident
Then, edit the custom resource so it contains the following fields.
# You can use Lens for a better editing experience
cloudIdentity: "eks.amazonaws.com/role-arn: <THE ARN OF THE fsx-trident ROLE>"
cloudProvider: "AWS"
controllerPluginNodeSelector:
  topology.kubernetes.io/zone: eu-central-1b
nodePluginNodeSelector:
  topology.kubernetes.io/zone: eu-central-1b
Checkpoint: Check the Trident deployment by executing the following command. Since we limit Trident to the NVMe-labeled nodes, you should see only 3 trident-node pods.
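For example, assuming Trident was installed in the trident namespace (pod names will differ):

kubectl get pods -n trident -o wide

You should see one trident-controller pod plus three trident-node pods, all scheduled on nodes in eu-central-1b.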
Enable Trident Backends and Kubernetes Storage Classes
For Trident to provide NVMe volumes based on Kubernetes definitions, we need to install the backend custom resources. Create a file named `backend.yaml` with the snippet below. It creates storage backends for ontap-san (NVMe) and ontap-nas (NFS); the protocol is specified in the storageDriverName field.
For this PoC we use topology-aware backends and storage classes, since the FSx file system is assumed to live in the same availability zone as the NVMe nodes.
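A sketch of what backend.yaml could look like is shown below. Field names such as sanType and the aws/credentials sections follow recent Trident releases and may differ in yours; depending on the version you may also need to set the SVM name explicitly.

apiVersion: trident.netapp.io/v1
kind: TridentBackendConfig
metadata:
  name: backend-fsx-ontap-san
  namespace: trident
spec:
  version: 1
  storageDriverName: ontap-san               # block backend used for NVMe volumes
  backendName: fsx-ontap-san
  sanType: nvme                               # NVMe/TCP instead of iSCSI
  aws:
    fsxFilesystemID: <GET THE FILESYSTEM ID FROM THE AWS CONSOLE>
  credentials:
    name: <THE SECRET NAME ARN CREATED FOR ACCESSING THE FSX VM>
    type: awsarn
  supportedTopologies:
    - topology.kubernetes.io/zone: eu-central-1b
      topology.kubernetes.io/region: eu-central-1
---
apiVersion: trident.netapp.io/v1
kind: TridentBackendConfig
metadata:
  name: backend-fsx-ontap-nas
  namespace: trident
spec:
  version: 1
  storageDriverName: ontap-nas                # NFS backend
  backendName: fsx-ontap-nas
  aws:
    fsxFilesystemID: <GET THE FILESYSTEM ID FROM THE AWS CONSOLE>
  credentials:
    name: <THE SECRET NAME ARN CREATED FOR ACCESSING THE FSX VM>
    type: awsarn
  supportedTopologies:
    - topology.kubernetes.io/zone: eu-central-1b
      topology.kubernetes.io/region: eu-central-1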
Notice there are two placeholders in the snippet above, <GET THE FILESYSTEM ID FROM THE AWS CONSOLE> and <THE SECRET NAME ARN CREATED FOR ACCESSING THE FSX VM>. The first one you can get from the AWS console. The second one is the ARN of the secret we created previously for the FSx storage virtual machine credentials.
After completing the snippet above, run the following command.
kubectl apply -f backend.yaml
Create the Kubernetes Storage Classes
The last step is to create the Kubernetes storage classes that use the Trident backends. Create a file named `storageclass.yaml` with the following snippet.
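As a sketch, a topology-aware class for the NVMe (ontap-san) backend could look like the following; the class name protection-gold6 matches the name referenced in the values.yaml later in this post, and an analogous class can be defined for the ontap-nas backend.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: protection-gold6
provisioner: csi.trident.netapp.io
parameters:
  backendType: "ontap-san"                  # matches the NVMe backend defined above
  fsType: "ext4"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer     # topology-aware: bind in the pod's zone
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - eu-central-1b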
Then add the storage classes to the Kubernetes cluster.
kubectl apply -f storageclass.yaml
Benchmarking
After the EKS cluster and the Amazon FSx for NetApp ONTAP file system are deployed, we can install and test Hopsworks.
The focus of this benchmark is RonDB. We have two setups. In the first, the volumes used by RonDB are the standard EBS volumes of EKS with 3 GB/s throughput. In the second, we reinstall the cluster and back the RonDB volumes with NetApp ONTAP NVMe volumes.
Installing Hopsworks
To install Hopsworks, first create a values.yaml file to control where the services are scheduled based on the node labels we set previously. We have prepared the following values.yaml file. Notice that we use an S3 bucket named netapp-poc; make sure such a bucket exists and is empty in the AWS portal.
Moreover, the snippet below is used for both benchmarking scenarios. Notice that the storage classes used by RonDB are commented out. To test the NVMe volumes, we just uncomment them, reinstall the cluster, and run the benchmarks again.
global:
  _hopsworks:
    storageClassName: null
    cloudProvider: "AWS"
    managedDockerRegistery:
      enabled: true
      domain: ".dkr.ecr.eu-central-1.amazonaws.com"
      namespace: "netapp_poc"
      credHelper:
        enabled: true
        secretName: &awsregcred "awsregcred"
    managedObjectStorage:
      enabled: true
      s3:
        bucket:
          name: &bucket "netapp-poc"
          region: &region "eu-central-1"
        endpoint: &awsendpoint "https://s3-accesspoint.eu-central-1.amazonaws.com"
        secret:
          name: &awscredentialsname "aws-credentials"
          acess_key_id: &awskeyid "access-key-id"
          secret_key_id: &awsaccesskey "secret-access-key"
    minio:
      enabled: false

hopsworks:
  # Not in the same machine as mysql
  nodeSelector:
    hw-group: hw
  variables:
    docker_operations_managed_docker_secrets: *awsregcred
    # We *need* to put awsregcred here because this is the list of
    # Secrets that are copied from hopsworks namespace to Projects namespace
    # during project creation.
    docker_operations_image_pull_secrets: "awsregcred"
  dockerRegistry:
    preset:
      usePullPush: false
      secrets:
        - *awsregcred

certs-operator:
  ca:
    httpTimeout: 120s

# Less consul workers
consul:
  consul:
    server:
      enabled: true
      replicas: 1

hopsfs:
  datanode:
    count: 3
    storage:
      # storageClassName:
      size: 256Gi
    nodeSelector:
      hw-group: hw
  namenode:
    nodeSelector:
      hw-group: hw
  objectStorage:
    enabled: true
    provider: "S3"
    s3:
      bucket:
        name: *bucket
        region: *region

# Taken from the large bench test
rondb:
  # Go to NVME type of machines
  nodeSelector:
    mgmd:
      hw-group: nvme
    ndbmtd:
      hw-group: nvme
    rdrs:
      hw-group: nvme
  clusterSize:
    activeDataReplicas: 1
    numNodeGroups: 1
    minNumMySQLServers: 1
    maxNumMySQLServers: 1
    minNumRdrs: 1
    maxNumRdrs: 1
  resources:
    limits:
      cpus:
        mgmds: 0.2
        ndbmtds: 6
        mysqlds: 6
        rdrs: 2
        benchs: 2
      memory:
        ndbmtdsMiB: 9000
        rdrsMiB: 700
        benchsMiB: 700
    requests:
      cpus:
        mgmds: 0.2
        mysqlds: 6
        rdrs: 2
        benchs: 1
      memory:
        rdrsMiB: 100
        benchsMiB: 100
  storage:
    # If commented will use default ebs storage class
    classes:
      # default: protection-gold6
      # diskColumns: protection-gold6 # nvme
    diskColumnGiB: 32
    redoLogGiB: 32
    undoLogsGiB: 32
    slackGiB: 2
  rondbConfig:
    InitialTablespaceSizeGiB: 10

airflow:
  enabled: false

hive:
  nodeSelector:
    hw-group: hw
  metastore:
    deployment:
      replicas: 2
    jvm_resources:
      xms: 7g
      xmx: 7g

olk:
  logstash:
    nodeSelector:
      hw-group: hw
  dashboard:
    nodeSelector:
      hw-group: hw
  opensearch:
    nodeSelector:
      hw-group: hw

backup:
  # We enable it again in the upgrade stage of the CI
  enabled: false
Once the values.yaml file is created, run the following command to deploy Hopsworks.
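As a sketch, assuming you have added the Hopsworks Helm repository under the name hopsworks and deploy into a hopsworks namespace (the exact repository URL, chart name, and version come from your Hopsworks installation instructions), the deployment looks like this:

helm install hopsworks hopsworks/hopsworks \
  --namespace hopsworks --create-namespace \
  --values values.yaml \
  --timeout 60m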
You can get the external IP of the Hopsworks ingress by executing the following command.
kubectl get svc -n ingress-nginx ingress-nginx-controller
Once you have the IP, access the Hopsworks application. Use the default credentials: user admin@hopsworks.ai and password `admin`. In the Hopsworks application, create a project named `test` and an API key. Copy the API key for the next step.
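How the API key reaches the benchmark depends on your benchmark definition; one common pattern (purely illustrative, the secret name hopsworks-api-key is an assumption) is to store it as a Kubernetes secret that the locust pods can read:

kubectl create secret generic hopsworks-api-key \
  --namespace hopsworks \
  --from-literal=token='<PASTE THE API KEY HERE>'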
Once you have created the benchmark.yaml file, execute the following command to deploy its definitions.
kubectl apply -f benchmark.yaml
The benchmark takes nearly 10 minutes to be ready to run. Once it finishes, check the locust head deployment; it has a sidecar container from which you can collect the results.
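For example (the pod and container names below are hypothetical; check the actual names in your namespace):

kubectl get pods -n hopsworks | grep locust
kubectl logs -n hopsworks <locust-head-pod> -c <results-sidecar>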
Benchmark Insights
The results indicate that the Median Response Time, Average Response Time, and Requests/s are identical in both of our scenarios. Therefore, using Amazon FSx for NetApp ONTAP provides more reliable volume provisioning while maintaining the expected performance for a Hopsworks deployment.
Standard EBS Volumes Results
These are the locust benchmark results for the regular EBS volumes provided by EKS, i.e., with no storage class set in the values.yaml discussed previously.
Request statistics
Response time statistics
NetApp ONTAP NVMe Volumes Results
These are the locust benchmark results for the NetApp ONTAP NVMe volumes, i.e., with the storage classes in the values.yaml discussed previously set to `protection-gold6`.
Request statistics
Response time statistics
Conclusion
By following this tutorial, you can evaluate the interoperability between Hopsworks 4.x and Amazon FSx for NetApp ONTAP. Our results demonstrate successful integration with topology-aware NVMe provisioning, without compromising performance. The benchmark metrics remain unchanged in both scenarios (standard EBS gp3 disks for EKS and Amazon FSx for NetApp ONTAP NVMe disks), underscoring that the primary advantages lie in operational simplicity and more efficient storage management rather than raw speed. Consequently, Amazon FSx for NetApp ONTAP provides more reliable volume provisioning while maintaining the expected performance for a Hopsworks deployment.