“If you learn how to build Machine Learning systems, a real system with real data coming in that’s making predictions. I think if you work on that, that's probably where you can make the most impact most quickly.”
In our new ‘5-minute interview’ series we are going to meet AI and ML professionals who will share their experiences on working in the field. How did they start out and what aspects of AI and ML make them tick? What are their passions and what drives them to create major tech innovations? Follow along on this journey to find out!
In the New Year's special of our 5-minute interview series we meet Jim Dowling, CEO and Co-founder of Hopsworks. Jim tells us how the idea for the Hopsworks feature store came about, what big trends are emerging in AI and ML, and why you should start building Machine Learning systems.
Jim:
My name is Jim Dowling and I'm the CEO and Co-founder of Hopsworks. I have a background primarily in research, apart from a stint at MySQL. I did a PhD in Distributed Systems and AI (Reinforcement Learning) back in 2004. I'm originally from Ireland, but I currently live in Stockholm, Sweden. The company came out of ideas from both MySQL and the work we were doing on scalable, distributed deep learning and machine learning.
I've been working in AI for a while: in the 90s I did my bachelor's on Hidden Markov Models (HMMs), and I worked with distributed reinforcement learning into the 2000s. I also taught the first course on deep learning in Sweden, which is kind of surprising considering I'm a systems guy. So I have a pretty long background, not just in AI, but in systems research as well.
Jim:
We built a scalable data science platform called Hopsworks, and we secured venture capital investment to develop it as a next-generation data science platform. Before then, data science platforms ran on a single server, but we argued that data science platforms should use lots of servers and GPUs (because of deep learning). So we built a platform with a two-layer architecture, where we stored large volumes of data on a file system called HopsFS. On top of that we had a scale-out metadata layer called RonDB (based on MySQL Cluster at the time), and we built this as part of the architecture.
Around that time, Uber published a blog post about its feature store. I said, hang on, we have that two-layer architecture. And if Uber had this problem of managing data for machine learning, then everyone else who's going to be doing machine learning at scale is going to have the same problem. We realized that we already had an architecture in place that fit this. So once we realized that, it was basically just a question of who on our team could work on the first version of the feature store. When we started working on the feature store, it was very quiet in the space. We did some content, workshops and talks on it. But then we got our first customer and realized that there is indeed a market for this, and that this is something we should focus on.
Jim:
Someone who had a big influence on me, early in my career, was a researcher called Richard Hamming. He always asked the same questions of every researcher: What's the most important topic in your field? Why are you not working on it?
If you're in computer science right now, I can't think of a more important topic than AI. It's redefining who we are as a species. What if we're not the most intelligent, sentient beings on the planet? That would be pretty revolutionary. We're not there yet, but we're in exciting times and the pace of innovation is massive.
So since I have a background in computer science, systems research and AI, I got drawn into where my skill set really comes to the fore, which is: how do we provide system support for these AI applications? That ended up primarily being at the data layer: how do we manage the data for AI, and how do we make that data available for training models, but also for building Machine Learning systems?
Jim:
Back in the 90s we always heard about Moore's Law. Moore's Law said the number of transistors on a chip doubles every 18 months, and we knew that the physics behind Moore's Law would eventually wear out. We've seen something similar happen for deep learning. There was a pretty interesting heuristic that came out a few years ago. It said that for many different classes of deep learning systems, anything from image classification to natural language processing (i.e. translation in Google Translate, or large language models), roughly every order-of-magnitude increase in the amount of data and the model size yields a constant, logarithmic improvement in model performance. That's a kind of rule of thumb.
So the question everybody has is related to Moore's Law and is about the amount of data we have available to train models on. We're basically running out of human-generated text to train large language models (LLMs) on, so where will the new data come from? Will it be generated by other LLMs, or will we hit those scaling limits soon? And when we hit those limits, what does that mean? Something interesting that I think will happen is that we'll invent new techniques to train models. I think by being able to learn new behaviors from a smaller number of samples, we'll be able to make another jump in progress. But it's very hard to say when that will happen.
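The rule of thumb above can be sketched numerically. If test error follows a power law in dataset size, error ≈ a·N^(−b), then each 10× increase in data shrinks the error by the same constant factor, i.e. a linear improvement per decade of data on a log scale, with diminishing absolute returns. The constants below are made up purely for illustration, not taken from any real study:

```python
# Hypothetical power-law scaling: error = a * N^(-b).
# The constants a and b are illustrative, not from any real study.
a, b = 10.0, 0.1

def error(n_samples: float) -> float:
    """Test error under an assumed power-law scaling in dataset size."""
    return a * n_samples ** (-b)

# Each order-of-magnitude increase in data...
for n in [1e6, 1e7, 1e8, 1e9]:
    print(f"N = {n:.0e}  error = {error(n):.3f}")

# ...shrinks error by the same constant factor (10^-b ~ 0.794 here):
# a fixed step in log(error) per decade of data, so absolute gains shrink.
ratio = error(1e7) / error(1e6)
print(f"error ratio per 10x data: {ratio:.3f}")
```

This is why "just add 10× more data" gets expensive: each decade of data buys roughly the same relative improvement, but the absolute improvement keeps shrinking.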
Jim: Firstly, I wouldn't recommend getting interested in the politics of all of this; there is a lot of nonsense out there. Since I'm a builder, I think the best thing you can do to get started in the field is just to build things: build Machine Learning systems.
In a course I did on Serverless Machine Learning, the first tutorial you build is a Machine Learning system: a real system with real data coming in that's making predictions. I think if you work on that, that's probably where you can make the most impact most quickly. It's really just Python programming. You don't need to get into too many system details to actually build real things.
For example, if you can find an air quality sensor near where you live (one that makes its measurements available on the internet), you can build a prediction engine for local air quality in just a couple of hours. Machine Learning systems like this can help people in your community and bring real value to society.
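To make the idea concrete, here is a minimal sketch of such a system. It uses synthetic hourly PM2.5 readings in place of a real sensor feed (the `fetch_hourly_pm25` function is a stand-in you would replace with calls to a public air quality API), and a naive same-hour-yesterday baseline in place of a trained model:

```python
import math
import random

random.seed(42)

# Stand-in for a real sensor feed: synthetic hourly PM2.5 readings with a
# daily cycle plus noise. In a real system you would fetch these from a
# public air quality API for a sensor near you.
def fetch_hourly_pm25(hours: int) -> list:
    return [
        20 + 10 * math.sin(2 * math.pi * h / 24) + random.gauss(0, 2)
        for h in range(hours)
    ]

readings = fetch_hourly_pm25(24 * 14)  # two weeks of hourly data

# Baseline "model": predict each hour from the same hour the day before.
# A real tutorial would typically train e.g. a gradient-boosted model on
# such lagged features instead.
predictions = readings[:-24]   # value 24 hours earlier
actuals = readings[24:]

mae = sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)
print(f"MAE of same-hour-yesterday baseline: {mae:.2f} ug/m3")
```

Swapping the synthetic feed for a real sensor and the baseline for a proper model turns this into exactly the kind of end-to-end system described above: real data coming in, real predictions going out.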
Github: Air Quality Prediction Tutorial
Listen to this and other episodes on Spotify: