Hi everybody, this is Julien from Arcee and welcome to episode 6 of my podcast. Don't forget to subscribe to be notified of future episodes. In this podcast, I'm talking to my friend Cosmin from Denmark. Cosmin is a data scientist, a blogger, and he also runs the Apache MXNet meetup in Copenhagen. We talk about getting started with ML, running your machine learning projects right, best practices, and a whole bunch of different things. I'm sure you will enjoy the conversation and learn a few things. Let's not wait; let's just listen to Cosmin. Cosmin, thank you very much for taking the time to speak to me today. I guess we need to start with an introduction. So tell us a little bit about you, what you're doing today, and how you got started with machine learning.
Right. So I started with data in general about 10 years ago. I was very interested in doing reporting and fiddling with database management. So then I moved into doing more and more data engineering. At the same time, I was lucky enough to work for companies that spearheaded machine learning efforts in Denmark. So I naturally became interested in the data science domain.
Okay. So can you tell us a little bit about your company and the kind of projects you work on on a daily basis?
Right. Just to give a little bit of an overview before I dive into details. Audience Project helps brands, agencies, and publishers to plan, optimize, and validate digital online campaigns. At the same time, we're also helping our customers to grow audience segments that are of high value. To achieve that, we use data science and machine learning at several levels in our organization, from the actual projects to the operations. One example is our solution, Audience Hub, which helps our customers, like publishers, for example, to grow audience segments from deterministic data. We use extrapolation driven by machine learning models to grow these audience segments. Another example is our true-frequency graph, which we use to understand how many times an average person has been exposed to an online campaign. For that, we don't necessarily use machine learning but graph algorithms. One last example that you might relate to is where we have used machine learning to understand which Amazon availability zones are best to bid in for EC2 spot instances, so that we can have stability over time and also a low price. So these are examples of how we have used machine learning and data science at different levels in our organization to deliver value.
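The spot-instance idea Cosmin mentions can be sketched in a few lines. This is a minimal illustration, not his actual system: the prices below are made up (in practice they would come from the EC2 spot price history API), and the scoring rule that balances average price against volatility is a hypothetical stand-in for a real model.

```python
import statistics

# Hypothetical historical spot prices (USD/hour) per availability zone.
# In a real system these would come from the EC2 spot price history API.
price_history = {
    "eu-west-1a": [0.32, 0.35, 0.31, 0.90, 0.33],  # cheap on average, but spiky
    "eu-west-1b": [0.41, 0.40, 0.42, 0.41, 0.40],  # stable and reasonably priced
    "eu-west-1c": [0.55, 0.54, 0.56, 0.55, 0.54],  # stable but expensive
}

def score(prices, stability_weight=2.0):
    """Lower is better: mean price plus a penalty for price volatility."""
    return statistics.mean(prices) + stability_weight * statistics.pstdev(prices)

# Pick the zone that best balances low price and stability over time.
best_az = min(price_history, key=lambda az: score(price_history[az]))
print(best_az)  # the stable, mid-priced zone wins over the spiky cheap one
```

With these numbers, the volatile zone loses despite its lower average price, which is exactly the stability-versus-cost trade-off described above.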
Okay, that's pretty cool. So tell us about a typical project. How do you get started? People tend to focus on algorithms and the technicalities. Of course, it depends very much on the problem at hand. The way I usually approach an ML project is to use my previous knowledge and experience, and then do some research online. I essentially trust the community, the wisdom of the crowd. I try to find example projects similar to what I'm trying to do and fit that into a solution for my problem. The point is to become familiar with the problem so that I'm confident enough to discuss it. Then I would move towards exploratory data analysis: understanding the data and doing cleaning. I would also approach my colleagues and ask for opinions and validation. I'm fortunate to be surrounded by smart people, so that's really helpful; there's always good feedback. And then I would move towards implementation. I try to productionize the work to have a working prototype as soon as possible. That gives me a framework for doing multiple iterations towards better results.
I think that's a very important point, because one of my beliefs is that machine learning is software engineering: you need tooling, agile techniques, and iterations. Sometimes I meet people who tell me, well, I just spent six months researching the thing, and I'll tell you in six months if I can build a model or not. For sure. If you're doing pure research, that's okay. But you're working for a private company, right? I have business constraints. Exactly. We need to be pragmatic within the constraints of doing something very good, in a fixed amount of time, and within a certain budget.
I also wanted to add one more thing that I believe is important. When I, or we, start building a solution, we build it towards change. It's important to assume that things will change, especially in ML. The model might change, the data might change, and assumptions you made at first might not be realistic in a production scenario. So assume change and engineer your design towards it: stay agile and validate your assumptions all the time.
Any advice you would give to a young ML engineer to get started right? What should they focus on in the early steps of their projects and careers?
To see the pros and cons of different approaches. You might say that neural networks are very powerful and you can solve some problems with them, but how about explainability? That's a trade-off. It might work in some cases and not in others. You need to understand your context very well, and not have a hammer and then look for nails, where you know a library very well and try to use it for everything. Again, this goes hand in hand with the pros-and-cons advice. If something doesn't work, maybe it's time to look at something else, not to force your problem into a certain box. Create a project for yourself; that's what I like doing. I create artificial data sets, or I derive data sets from existing ones, and then I try to solve a problem. This is what I do, for example, on my blog: I create some data sets, artificially create a problem, and try to solve it. So this is one way to gain experience, and I think experience is the most important trait to have. Experience gives you intuition, and intuition is extremely important in data science. It allows you to choose one model over another. Ideally, you would explore scientifically all reasonable paths, but in practice that might not be feasible. At that point, you have to use your experience to narrow down where you need to look: what is the limited set of possible solutions to your problem? The experience of the team is also important. The same project might be handled in a different way by a different team with a different skill set, and still arrive at the same product with the same quality. Where I work, we have some experience and we work with the tools we find most comfortable; another team might find a different combination of tools comfortable. I guess the moral is: use what you know, and use the best tool for the job at any given point.
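The "create an artificial data set with a known answer" practice Cosmin describes can be sketched as follows. Everything here is invented for illustration: the labeling rule (label 1 when x + y > 1) and the deliberately simple nearest-centroid "model" are stand-ins, the point being that a known ground truth lets you check whether your approach recovers it.

```python
import random

random.seed(42)

# Artificial binary-classification data set with a known rule,
# so we can verify that a model actually recovers it.
data = [(random.random(), random.random()) for _ in range(1000)]
labels = [1 if x + y > 1 else 0 for (x, y) in data]

def centroid(points):
    """Mean point of a list of (x, y) pairs."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# A deliberately simple model: classify by the nearest class centroid.
c0 = centroid([p for p, l in zip(data, labels) if l == 0])
c1 = centroid([p for p, l in zip(data, labels) if l == 1])

def predict(p):
    d0 = (p[0] - c0[0]) ** 2 + (p[1] - c0[1]) ** 2
    d1 = (p[0] - c1[0]) ** 2 + (p[1] - c1[1]) ** 2
    return 1 if d1 < d0 else 0

# Because we invented the rule, we know what "good" looks like.
accuracy = sum(predict(p) == l for p, l in zip(data, labels)) / len(data)
print(f"accuracy: {accuracy:.2f}")
```

For this rule the two class centroids sit symmetrically around the true boundary, so even this toy model scores very high, and any approach that doesn't is immediately suspect. That quick feedback loop is what makes self-made data sets good practice material.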
I think there's a balance between using the right tool for the job and using what you know. If you only use what you know, you might be missing out. For example, XGBoost was very popular and is one of my go-to tools. But if I tried to use XGBoost for everything, I might not get the appropriate results. There's a certain type of problem I would apply XGBoost to, and for others, different solutions would be better. If the problem fits and my experience fits, I will use XGBoost; otherwise, I would have to look for a different library. So curiosity is important, right? Trying out new algorithms and maintaining a balance.
Yeah, I agree. So you mentioned you were using SageMaker. Can you tell us a little bit about that? What you like about it and what you think the really strong areas are in SageMaker?
So I think the most important thing for us is that it gives us resources that are already provisioned with the libraries we need, and we don't have to maintain that. We're using Apache MXNet, and I'm a great fan of Apache MXNet. It comes pre-installed in the SageMaker notebooks, which makes our life very easy. We start the notebook instance, and in a few minutes we're ready to dive into the fun stuff. No fuss, no setup. Another thing that is great, and I don't think I've heard it mentioned many times, is that SageMaker gracefully encourages best practices. It's like a framework for doing data science and machine learning that doesn't force you into a certain way of working, but it certainly encourages best practices. You can easily provision machines dedicated to training and validation, and you can provision endpoints, so you have some separation in that sense. The documentation and implementation are also very well targeted towards best practices. The last point would be efficiency: the way resources are provisioned is very efficient. In the past, I would maybe launch a GPU cluster on EMR just to have MXNet running on it, and that would work just fine, but there would be a lot of wasted dollars. In SageMaker, I can do my data engineering on a cheaper machine that works just fine, and then launch my training on a very powerful GPU machine for just a few minutes, so every dollar is spent towards actual training. There are some recent developments in SageMaker, like SageMaker Experiments, that I was very interested in. I was so interested that, before I knew SageMaker Experiments would appear, I made a pull request to MLflow from Databricks for a plugin for Apache MXNet Gluon. But now we have SageMaker Experiments, so I'll see which one I'm going to use.
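Cosmin's efficiency point, doing data engineering on a cheap machine and reserving the GPU for the short training run, can be quantified with a back-of-the-envelope calculation. The instance prices and durations below are purely hypothetical round numbers, not real AWS pricing; the arithmetic is what matters.

```python
# Back-of-the-envelope cost comparison (all numbers hypothetical, USD/hour).
CPU_PRICE = 0.10   # modest instance for data engineering and exploration
GPU_PRICE = 3.00   # powerful GPU instance for training

prep_hours = 4.0   # exploratory analysis and feature engineering
train_hours = 0.25 # actual GPU training time

# One big GPU machine for the whole session vs. splitting the workload.
all_gpu_cost = GPU_PRICE * (prep_hours + train_hours)
split_cost = CPU_PRICE * prep_hours + GPU_PRICE * train_hours

print(f"all-GPU: ${all_gpu_cost:.2f}, split: ${split_cost:.2f}")
# With these assumed numbers: all-GPU $12.75 vs. split $1.15.
```

Under these assumptions the split workflow costs roughly a tenth of running everything on the GPU machine, which is the "every dollar goes towards actual training" effect described above.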
Well, you know, just try both. Let us know if we're missing anything.
So, Cosmin, we're almost done. Any last words?
Just thank you for inviting me. It's always a pleasure to have a talk with you, Julien.
Well, thank you very much. Thanks for your time. And thanks for sharing the knowledge. I'm sure this will be much appreciated by the listeners. That's it for this episode. I hope you enjoyed it. Don't forget to subscribe to my channel. And I'll see you soon with more conversations and more content. Until then, keep rocking.