Episode 20: Tun Shwe, VP of Data/DevRel - Quix
July 8, 2024
6 min read
Hopsworks Team

TL;DR

In the early days, we were trying to create structure from data through feature engineering to handle unstructured or semi-structured data. Now, we're in this magical time with vector databases where you don't need to worry about structure and can deal with unstructured data.

We’re at the 20th episode in this series! We chat with the VP of Data/DevRel at Quix, Tun Shwe. Tun highlights the evolving challenges and innovations in data engineering and shares his journey from product engineering to data science.

Tell us a little bit about yourself.

My name is Tun Shwe. I'm the VP of Data, and I look after developer relations at Quix. Quix is a developer tools company focused on stream processing in Python. We created Quix Streams, an open-source, pure Python stream processing library. The remarkable thing about it is that it works with streaming data frames. We also manage a product called Quix Cloud, a serverless SaaS platform that enables you to build out your data pipelines and deploy each step in the pipeline as a Docker container.

What is stream processing in the context of ML and AI?

Stream processing is about processing data as it is generated. Oftentimes, we work with batch data. That seems to be the majority of what people do; they allow data to be generated, collect it over time, and then, at some point in the future, schedule a job to read and process it. With stream processing, you are processing that data as it's generated. So, we're talking low latency and no delays.
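The batch-versus-streaming distinction above can be sketched in a few lines of plain Python. This is an illustrative toy, not the Quix Streams API: the batch version waits for all the data before computing anything, while the streaming version updates its result as each event arrives, so the latest answer is always available with low latency.

```python
def batch_average(events):
    # Batch: let the data accumulate, then run one job over all of it.
    return sum(events) / len(events)

class StreamingAverage:
    # Streaming: maintain running state and update it per event,
    # so a fresh result is available immediately after each arrival.
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count  # current average after this event

events = [3.0, 5.0, 10.0]

agg = StreamingAverage()
running = [agg.update(v) for v in events]

print(batch_average(events))  # 6.0 -- available only after all events
print(running)                # [3.0, 4.0, 6.0] -- available per event
```

Both end at the same answer; the difference is *when* you get it, which is exactly the low-latency point made above.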

How did you get into the field?

I used to run a Product Engineering studio, where we were solving what was called "big data" problems for companies, using a tech stack with tools like Hadoop and Spark. Around that time, I found out about this fascinating creature called the Data Scientist, essentially a Statistician who could code. It was an interesting intersection of two things I liked: math and software engineering. I wanted to understand that more and maybe work with them. I ended up getting a job with a large scientific publishing company, and my first machine learning projects were around recommender systems. I spent my early years covering principles that you see again today with LLMs and techniques like RAG, such as top-k retrieval and nearest-neighbor search. So, it's kind of come full circle with the power of LLMs.
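The top-k nearest-neighbor idea mentioned above is the same primitive in classic recommenders and in today's RAG-style vector search. Here is a minimal, brute-force sketch in plain Python with cosine similarity; the item names and vectors are hypothetical example data, not from any real system.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, items, k=2):
    # Score every item against the query, keep the k most similar.
    # Vector databases do this at scale with approximate indexes;
    # the brute-force version makes the concept clear.
    scored = sorted(items.items(),
                    key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

items = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}

print(top_k([1.0, 0.0, 0.0], items, k=2))  # ['doc_a', 'doc_b']
```

Swap the toy vectors for embeddings and the brute-force loop for an approximate index, and this becomes the retrieval step of a RAG pipeline.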

Why do you think this is such an interesting field?

I quickly learned that when you start working with big data or data at high volumes, you uncover some very unique challenges, which often tease out the best parts of data engineering. All the best practices from software engineering, especially distributed programming, come into play. I was really fascinated by that niche of software engineering where you deal with really difficult problems at scale. If you go back nine years, it was still very early days for data science. Companies realized the power of deploying machine learning models. In the early days, you'd have them in Jupyter notebooks. Then, it became about how to go from a Jupyter notebook to something running as a service in production, which required a different toolchain and a different set of skills. The skills I had built up over the years helped me apply the right principles to put those models into production. The fascinating part is it's come full circle. In the early days, we were trying to create structure from data through feature engineering to handle unstructured or semi-structured data. Now, we're in this magical time with vector databases where you don't need to worry about structure and can deal with unstructured data. It's a really interesting time where you can use almost anything as an input. That will never get boring.

Any resources you can recommend?

I guess the mission that I'm on right now is to bring stream processing to Python developers everywhere and let them access real-time data in Kafka. I recommend checking out Quix Streams on YouTube; the videos are code-along style, building things and learning the concepts from scratch. As I mentioned, I ran a product engineering studio, and I think a lot of Data Scientists and Data Engineers would benefit from bringing product thinking to their work. One of the books I usually recommend is "Sprint" by a couple of authors from Google Ventures. It's about creating value in a short space of time and testing it, which fits well with the constraints of scientific research and putting things into production. I always recommend learning about product thinking.
