Release 4.0 marks the transition of Hopsworks to a Kubernetes-native platform. This is a major change from an infrastructure point of view, and it challenged many of our core subsystems. In this article we explore how our Public Key Infrastructure has changed over the years to reach its current form as a Kubernetes first-class citizen.
For every product out there, a major release is a milestone and a reason to celebrate. Hopsworks is no different. Each major release brings new features and functionality, establishing Hopsworks in the AI Lakehouse area. The 4.0 release, though, is special: it is the first release in which Hopsworks runs natively on Kubernetes. Until now, Hopsworks was deployed in physical or virtualized environments, with Kubernetes as an integration. Today you can install Hopsworks on any Kubernetes cluster with a single script.
This made me recall how much a cornerstone module in Hopsworks has changed over the years. A module that nobody really sees, unless something goes wrong :) At its core, Hopsworks is a platform for storing and processing data. Public data, sensitive data, medical data, intellectual property, etc. are common types of data that machine learning models use to make more accurate predictions. Obviously, protecting this data and guarding access to it is of paramount importance to us.
Since the early days of Hopsworks we realized that enterprise authentication systems such as Kerberos, Active Directory, etc. were not the right fit for internal use in the system. Steve Loughran, a core committer of Apache Hadoop, quotes H.P. Lovecraft in his book “Hadoop and Kerberos: The Madness Beyond the Gate”:
What he wrote was true: there are some things humanity was not meant to know.
Of course we do integrate with enterprise identity management systems such as Active Directory, but only for logging into Hopsworks and for project management. Service-to-service and user-to-service authentication and authorization are done using Public Key Infrastructure and TLS. That means we have to run a Certificate Authority inside each cluster, and everybody knows that operating a PKI is not an easy task.
This section serves as a primer on X.509 certificates and Public Key Infrastructure. If you are familiar with the concepts, feel free to jump ahead. At this point I have to add a disclaimer: this is not a lecture on cryptography. I am not qualified to give one, and that is not the aim of this article.
Asymmetric cryptography is based on algorithms that easily generate a pair of numbers which belong together, yet make it “almost” impossible to derive one of them from the other. RSA is one of the oldest and most widely used public-key cryptosystems. The name public comes from the fact that you can safely distribute one part of the pair as long as you keep the other part safe. This brings us to digital certificates and X.509. Similarly to a paper certificate, a digital certificate is a file which proves the authenticity of the holder. A digital certificate carries the public part of the key pair; the private part you should never expose and should always keep secure.
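To make the idea of a key pair concrete, here is a minimal sketch of textbook RSA in Python. The primes are tiny and there is no padding, so this is only an illustration of the arithmetic, not something to use for real keys. All names (`sign`, `verify`, `public_key`, `private_key`) are chosen for this example.

```python
# Textbook RSA with toy numbers: illustration only, never use keys this small.
p, q = 61, 53                    # two secret primes
n = p * q                        # modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                           # public exponent, coprime with phi
d = pow(e, -1, phi)              # private exponent (modular inverse, Python 3.8+)

public_key = (e, n)              # safe to distribute
private_key = (d, n)             # must be kept secret


def sign(message, key):
    """Sign an integer message (smaller than n) with the private key."""
    exponent, modulus = key
    return pow(message, exponent, modulus)


def verify(message, signature, key):
    """Check a signature using only the public key."""
    exponent, modulus = key
    return pow(signature, exponent, modulus) == message


message = 65
signature = sign(message, private_key)
print(verify(message, signature, public_key))       # True
print(verify(message + 1, signature, public_key))   # False, message was altered
```

Anyone holding the public key can check the signature, but only the holder of the private key could have produced it.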
A certificate by itself does not mean much. As in real life, a passport that you issued yourself probably won’t get you far. It must be issued by a trusted authority, such as your local police department. Then again, a police department in Sweden and a police department in Australia probably do not have much in common. But Sweden and Australia have signed treaties under which they trust each other's passports. So if you get checked in a village in Australia, your passport will be accepted.
The analogy is exactly the same in the digital world. X.509 certificates are issued and signed by a trusted entity, whose certificate is signed by another trusted entity and so on. These trusted entities are called Certificate Authorities. So, in the figure above all certificates trust each other although they are not signed by the same CA.
Certificates encode important information about their holder, and this information is digitally signed. If somebody tampers with the information, the signature verification will fail. Also, like passports, certificates are not valid forever: they expire, and an expired certificate is not valid. Finally, a certificate may be revoked before its expiration date, in which case it becomes invalid. Certificate Authorities publish a list of revoked certificates, called a Certificate Revocation List.
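The checks described so far (chain of trust, expiry, revocation) can be sketched with a simplified model. The `Cert` class below is a hypothetical stand-in for an X.509 certificate: it links a certificate to its issuer by name instead of verifying a cryptographic signature, which a real implementation must do.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Cert:
    """Hypothetical, heavily simplified stand-in for an X.509 certificate."""
    subject: str
    issuer: str           # subject of the CA that issued this certificate
    serial: int
    not_after: datetime   # expiration date


def is_valid(cert, by_subject, trusted_roots, revoked_serials, now):
    """Walk from `cert` up to a trusted root, checking expiry and revocation."""
    current, seen = cert, set()
    while True:
        if now > current.not_after or current.serial in revoked_serials:
            return False
        if current.subject in trusted_roots:
            return True
        if current.subject in seen:          # issuer loop, no trusted root reached
            return False
        seen.add(current.subject)
        parent = by_subject.get(current.issuer)
        if parent is None:                   # unknown issuer
            return False
        current = parent


NOW = datetime(2025, 1, 1, tzinfo=timezone.utc)
root = Cert("Root CA", "Root CA", 1, NOW + timedelta(days=3650))
intermediate = Cert("Intermediate CA", "Root CA", 2, NOW + timedelta(days=1825))
leaf = Cert("hopsfs", "Intermediate CA", 3, NOW + timedelta(days=30))
by_subject = {c.subject: c for c in (root, intermediate, leaf)}

print(is_valid(leaf, by_subject, {"Root CA"}, set(), NOW))                       # True
print(is_valid(leaf, by_subject, {"Root CA"}, {3}, NOW))                         # False, revoked
print(is_valid(leaf, by_subject, {"Root CA"}, set(), NOW + timedelta(days=31)))  # False, expired
```

Note that revoking the intermediate certificate also invalidates every certificate issued below it, because the walk fails at that link.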
Now that we know the basics of public-key cryptography let’s see how it is applied in Hopsworks. We will not examine every single component of Hopsworks and we will not dive into technical details as this would require a series of articles.
For this example we will use one of our core systems, HopsFS, our award-winning, POSIX-compliant distributed filesystem. As far as security is concerned, POSIX compliance brings the regular file permissions and ownership. So imagine a file with owner bob, group engineer and permissions 740. This means Bob can read, write and execute the file. Other users in the engineer group can only read it, and nobody else can read, write or execute it.
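As a small illustration of how mode 740 is interpreted, the sketch below decodes the permission bits with Python's standard `stat` module. The users `alice` and `mallory` and the `allowed` helper are made up for this example and have nothing to do with the HopsFS code base.

```python
import stat


def allowed(mode, owner, group, user, user_groups, want):
    """Return True if `user` may perform `want` ('r', 'w' or 'x') on the file."""
    owner_bit, group_bit, other_bit = {
        "r": (stat.S_IRUSR, stat.S_IRGRP, stat.S_IROTH),
        "w": (stat.S_IWUSR, stat.S_IWGRP, stat.S_IWOTH),
        "x": (stat.S_IXUSR, stat.S_IXGRP, stat.S_IXOTH),
    }[want]
    if user == owner:                 # the owner class is checked first
        return bool(mode & owner_bit)
    if group in user_groups:          # then the group class
        return bool(mode & group_bit)
    return bool(mode & other_bit)     # everybody else


MODE = 0o740
print(stat.filemode(stat.S_IFREG | MODE))                               # -rwxr-----

print(allowed(MODE, "bob", "engineer", "bob", {"engineer"}, "w"))       # True
print(allowed(MODE, "bob", "engineer", "alice", {"engineer"}, "r"))     # True
print(allowed(MODE, "bob", "engineer", "alice", {"engineer"}, "w"))     # False
print(allowed(MODE, "bob", "engineer", "mallory", {"sales"}, "r"))      # False
```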
Since HopsFS is a distributed filesystem, all operations are remote procedure calls. A user at a remote location claims to be Bob and wants to edit the file. If there were no validation, anybody could claim to be Bob, but there is only one Bob. Conversely, a malicious user could start a HopsFS server and advertise it as the correct server to connect to. Bob could believe it and write his sensitive data to the fake filesystem.
To solve these problems we employ X.509 certificates and mutual TLS authentication. With mutual TLS, both the user and the server must present a valid X.509 certificate. Authentication and authorization happen in two layers: the transport layer and the application layer.
First the transport layer validation happens. If any of the checks do not pass, the connection is closed. Due to mutual-TLS the same checks must happen on both the client and the server side.
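For readers who want to see what mutual TLS looks like in code, here is a minimal sketch using Python's standard `ssl` module. It is not how HopsFS (a Java system) configures TLS; it only shows the general idea that both sides demand and verify a certificate. The function names and file paths are assumptions of this example.

```python
import ssl
from typing import Optional


def _load(ctx, ca_file, cert_file, key_file):
    """Load the trusted CA bundle and our own certificate, when paths are given."""
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)   # CAs we trust
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)  # our identity


def server_context(ca_file: Optional[str] = None,
                   cert_file: Optional[str] = None,
                   key_file: Optional[str] = None) -> ssl.SSLContext:
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # A TLS server does not ask for a client certificate by default.
    # CERT_REQUIRED is what makes the handshake mutual.
    ctx.verify_mode = ssl.CERT_REQUIRED
    _load(ctx, ca_file, cert_file, key_file)
    return ctx


def client_context(ca_file: Optional[str] = None,
                   cert_file: Optional[str] = None,
                   key_file: Optional[str] = None) -> ssl.SSLContext:
    # PROTOCOL_TLS_CLIENT verifies the server certificate and hostname by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    _load(ctx, ca_file, cert_file, key_file)
    return ctx


server = server_context()
client = client_context()
print(server.verify_mode == ssl.CERT_REQUIRED)   # True
print(client.verify_mode == ssl.CERT_REQUIRED)   # True
```

In a real deployment each side would pass its own files, for example `server_context("ca.pem", "server.pem", "server.key")`, where the file names are placeholders. If the peer's certificate is not signed by a trusted CA or has expired, the handshake fails and the connection is closed.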
If the process above finished successfully, the application layer validation takes place.
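As an illustration of what an application-layer check can look like, the sketch below assumes that the service compares the Common Name in the verified client certificate with the user name claimed in the request. That rule is an assumption made for this example, motivated by the "everybody could claim they're Bob" problem described earlier, and not a description of the exact checks Hopsworks performs. The dictionary layout is the one returned by `ssl.SSLSocket.getpeercert()` in Python.

```python
def common_name(peercert):
    """Extract the subject Common Name from a getpeercert()-style dictionary."""
    for rdn in peercert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def authorize(peercert, claimed_user):
    """Assumed rule: the claimed user must match the certificate's Common Name."""
    cn = common_name(peercert)
    return cn is not None and cn == claimed_user


# Shape of the value returned by ssl.SSLSocket.getpeercert() after a handshake.
bob_cert = {
    "subject": (
        (("localityName", "stockholm"),),
        (("commonName", "bob"),),
    ),
}

print(authorize(bob_cert, "bob"))     # True
print(authorize(bob_cert, "alice"))   # False, certificate belongs to bob
```

Only after the identity is established this way does it make sense to evaluate file ownership and permissions for the request.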
In the example above we used HopsFS, but largely the same process applies to every service-to-service or user-to-service communication. It’s not easy, but Hopsworks is a multi-tenant platform by design, so we have to make sure only the right people can access data.
So far we have seen how X.509 certificates are used within Hopsworks and how we leverage a chain of trust, but we haven’t touched on certificate issuing at all.
As already mentioned, we decided from the very beginning of Hopsworks to use X.509 certificates for our internal authentication, authorization and encryption. We had to solve the PKI problem before we could move on to implementing the rest of the system.
Premature optimization is the root of all evil
– Donald Knuth
It’s a common observation that we engineers tend to overcomplicate solutions. We optimize for that 3% or we build complex systems for that rare edge case. The result is a solution which is extremely complicated, delayed and most likely prone to bugs due to its inherent complexity.
The first implementation of the Hopsworks Certificate Authority came out, and it was practically a wrapper around openssl, probably the most famous cryptographic library. OpenSSL provides command line tools for generating key pairs, signing certificate requests, revoking certificates, etc. All the functionality a CA needs was there, but it was a CLI tool.
We didn’t want our users to have anything to do with certificates; we wanted everything to be transparent to them and done programmatically. So we built a Java wrapper around openssl with a RESTful interface that other programs could consume. Authentication to this interface was already handled by the main Hopsworks application.
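To give a feeling for what wrapping the openssl CLI involves, here is a small Python sketch that builds the command lines for creating a Certificate Signing Request, signing it, and revoking a certificate. The original wrapper was written in Java, and its actual commands are not shown in this article. The helper names and all file paths below are hypothetical, while the openssl subcommands and flags are standard ones.

```python
import shlex


def csr_command(key_path, csr_path, common_name, locality):
    """Generate a new RSA key pair and a Certificate Signing Request."""
    subject = f"/L={locality}/CN={common_name}"
    return ["openssl", "req", "-new", "-newkey", "rsa:2048", "-nodes",
            "-keyout", key_path, "-out", csr_path, "-subj", subject]


def sign_command(csr_path, cert_path, ca_cert, ca_key, days=365):
    """Sign a CSR with the CA certificate and private key."""
    return ["openssl", "x509", "-req", "-in", csr_path,
            "-CA", ca_cert, "-CAkey", ca_key, "-CAcreateserial",
            "-out", cert_path, "-days", str(days)]


def revoke_command(cert_path, ca_config):
    """Mark a certificate as revoked in the file-based CA database."""
    return ["openssl", "ca", "-config", ca_config, "-revoke", cert_path]


# Hypothetical paths, printed rather than executed.
for cmd in (
    csr_command("hopsfs.key", "hopsfs.csr", "hopsfs", "stockholm"),
    sign_command("hopsfs.csr", "hopsfs.pem", "ca.pem", "ca.key"),
    revoke_command("hopsfs.pem", "openssl.cnf"),
):
    print(shlex.join(cmd))
```

A wrapper would pass each list to `subprocess.run(cmd, check=True)` and expose the result over its API. Note how every step reads or writes files on the local disk.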
And so the Hopsworks Certificate Authority was born, issuing certificates for our users and internal services. We came up with the following chain of trust, which remains almost the same today.
The Root CA does not sign any user or service certificates directly. Instead, we delegate that responsibility to an Intermediate CA, and to a Kubernetes CA which signs the certificates of Kubernetes infrastructure components such as the kubelet and the Kubernetes API server.
If it ain’t broke, don’t fix it
– Everybody
The setup above served us well for a few years. There were a few issues here and there but nothing major. The biggest disadvantage of that solution was that openssl is file based: everything lived on the local filesystem. Consistent backups were problematic and access control was difficult, but most importantly High Availability was virtually impossible. CA keys, certificates, certificate revocation lists, indexes, etc. were on the hard drive of the server running the Hopsworks application. How could we safely fail over to another CA server?
That’s when we decided that it was time to move on. We rewrote our Certificate Authority as a pure Java application using the industry-standard BouncyCastle cryptographic library. From a user perspective nothing changed. We maintained the same API, but all cryptographic material is now stored in the database. Working with the cryptographic APIs of BouncyCastle is not for the faint-hearted, but in the end it was really worth the effort, as it gave us more freedom.
Besides having more control over the process, we can now easily run multiple Hopsworks instances side by side with multiple CA servers. The database is the synchronization point. Backups are a no-brainer as everything is stored in the database. Highly available multi-region configurations are also possible. RonDB, our database of choice, replicates all certificates, keys, revocation lists, etc. to the other site, ready to be picked up should disaster strike.
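The benefit of using the database as the synchronization point can be sketched in a few lines. The example below uses an in-memory SQLite database as a stand-in for RonDB, and the table layout and the `CaInstance` class are invented for illustration. They do not reflect the actual Hopsworks schema.

```python
import sqlite3


class CaInstance:
    """Hypothetical stateless CA server: all state lives in the shared database."""

    def __init__(self, db):
        self.db = db

    def issue(self, subject, pem):
        cur = self.db.execute(
            "INSERT INTO certs (subject, pem, revoked) VALUES (?, ?, 0)",
            (subject, pem))
        self.db.commit()
        return cur.lastrowid          # the row id doubles as the serial number

    def revoke(self, serial):
        self.db.execute("UPDATE certs SET revoked = 1 WHERE serial = ?", (serial,))
        self.db.commit()

    def revocation_list(self):
        rows = self.db.execute(
            "SELECT serial FROM certs WHERE revoked = 1 ORDER BY serial")
        return [row[0] for row in rows]


db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE certs (
    serial  INTEGER PRIMARY KEY AUTOINCREMENT,
    subject TEXT NOT NULL,
    pem     TEXT NOT NULL,
    revoked INTEGER NOT NULL)""")

ca_a, ca_b = CaInstance(db), CaInstance(db)       # two CA servers, one database
serial = ca_a.issue("hopsfs", "<PEM placeholder>")
ca_b.revoke(serial)                               # revoked through the other server
print(ca_a.revocation_list() == [serial])         # True
```

Because neither instance keeps state of its own, either one can be restarted or replaced without losing certificates or revocation information.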
As mentioned in the introduction of this article, 4.0 is a major release that brings Hopsworks natively to Kubernetes. In Kubernetes the de facto way of deploying and configuring services is with YAML files. You declare how you would like your service to be deployed and Kubernetes does it for you - assuming all resources are available, conditions are met, etc.
Suddenly the “old-fashioned” way of generating a key pair and a Certificate Signing Request, sending it to the Hopsworks CA, getting back the signed certificate and then deploying a service did not fit.
To solve this challenge we implemented certs-operator, a custom Kubernetes operator watching for our HopsworksCert CustomResourceDefinition. Hopsworks certificates became first-class Kubernetes citizens. The CRD describes in YAML what the certificate should look like. For example, the YAML below will issue a certificate with Locality=stockholm and CommonName=hopsfs in the hopsworks Namespace.
When you apply this YAML file, certs-operator will forward the request to Hopsworks CA. It will get back the signed certificate and it will create a Kubernetes Secret with the certificate, key and chain of trust in the formats that different services expect.
Something we didn’t touch upon previously is certificate rotation: how do you initiate it and redistribute the new certificate? Now it has become a trivial operation. You only need to edit the YAML file. The operator will see the change, issue a new certificate, update the Secret and revoke the old certificate.
Finally, it will optionally set the Owner of the HopsworksCert instance to another Kubernetes object. In the example above, the owner will be the Deployment “hopsfs”. If you delete the deployment, the deletion will be cascaded to HopsworksCert. The operator will observe this change and it will revoke the certificate before automatically deleting the Secret from the cluster.
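The lifecycle described above (issue on create, rotate on change, revoke on delete) can be summarised as a reconcile loop. The following Python sketch simulates it entirely in memory. `FakeCA`, `CertsOperator` and the `spec` keys are hypothetical names for this illustration, and the real certs-operator talks to the Kubernetes API and the Hopsworks CA instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class FakeCA:
    """Hypothetical stand-in for the Hopsworks CA: hands out serial numbers."""

    def __init__(self):
        self.next_serial = 1
        self.revoked: List[int] = []

    def sign(self, common_name, locality):
        serial = self.next_serial
        self.next_serial += 1
        return serial

    def revoke(self, serial):
        self.revoked.append(serial)


@dataclass
class CertsOperator:
    ca: FakeCA
    secrets: Dict[str, dict] = field(default_factory=dict)  # stands in for Secrets

    def reconcile(self, name: str, spec: Optional[dict]):
        """Bring the Secret for `name` in line with the desired `spec`."""
        current = self.secrets.get(name)
        if spec is None:                          # resource was deleted
            if current is not None:
                self.ca.revoke(current["serial"])  # revoke first,
                del self.secrets[name]             # then delete the Secret
            return
        if current is not None and current["spec"] == spec:
            return                                # nothing changed
        serial = self.ca.sign(spec["commonName"], spec["locality"])
        self.secrets[name] = {"spec": dict(spec), "serial": serial}
        if current is not None:                   # rotation: revoke the old one
            self.ca.revoke(current["serial"])


op = CertsOperator(FakeCA())
op.reconcile("hopsfs", {"commonName": "hopsfs", "locality": "stockholm"})  # create
op.reconcile("hopsfs", {"commonName": "hopsfs", "locality": "stockholm"})  # no-op
op.reconcile("hopsfs", {"commonName": "hopsfs", "locality": "uppsala"})    # rotate
print(op.secrets["hopsfs"]["serial"], op.ca.revoked)   # 2 [1]
op.reconcile("hopsfs", None)                                               # delete
print(op.secrets, op.ca.revoked)                       # {} [1, 2]
```

The loop is idempotent: applying the same specification twice issues nothing new, which is what makes a declarative workflow safe to retry.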
The Hopsworks CA was an independent module in the Hopsworks application, but it was still running in the same application server as the main application. This was overkill for the CA, so we decided to decouple it entirely from Hopsworks.
The CA now runs in its own lightweight application server in separate Pods. Startup time has significantly improved and we can manage resources, ACLs and lifetime separately. We can run as many instances as we want for HA, restart them separately and apply stricter network ACLs.
That was the story, so far, of an unseen hero of Hopsworks throughout the years. Transitioning to Kubernetes has challenged a lot of our past architectural decisions, but it was certainly worth it. If you want to try out the latest and greatest of Hopsworks for free, I urge you to register on our Serverless platform. Happy coding!