AWS AI Machine Learning Podcast Episode 4

January 05, 2020
In this episode, I have a chat with Pavlos Mitsoulis-Ntompos, a Data Scientist for the Expedia Group and an AWS Machine Learning Hero. We talk about real-life ML, and what it takes for ML projects to be successful.

⭐️⭐️⭐️ Don't forget to subscribe and to enable notifications ⭐️⭐️⭐️

Check out Sagify, an excellent CLI tool for Amazon SageMaker written by Pavlos:
* https://kenza-ai.github.io/sagify/
* End-to-end demo: https://youtu.be/cWv8zR2Qu94

This podcast is also available in audio at https://julsimon.buzzsprout.com/

For more content, follow me on:
* Medium: https://medium.com/@julsimon
* Twitter: https://twitter.com/julsimon

Transcript

Hi everybody, this is Julien from Arcee. Welcome to episode 4 of my podcast. Don't forget to subscribe to be notified of future episodes. In this episode, I'm actually talking to one of our machine learning heroes, Pavlos, working for the Expedia Group. This is an interview that I did a few weeks ago for the AWS Innovate online conference, but I thought it was so good that I should use it for the podcast as well. So without further ado, let's listen to what Pavlos has to say on machine learning. See you in a while. So Pavlos will introduce himself, but just to give you a little background, I actually started following him on Twitter a while ago because he was sharing all kinds of really cool stuff, including AWS content and SageMaker. I thought, okay, I gotta talk to this guy. And then completely by chance, we ended up speaking at the same meetup in Athens. It was the big data meetup. So hi to everybody from Greece, if you're watching, looking forward to being back there. And we said, oh yeah, we know each other, right? And then we talked a little more, and it was a very easy decision for us to decide that Pavlos should be a machine learning hero. If you're curious about machine learning heroes, you probably know about AWS Heroes: members of the AWS community who do a lot to help other developers with tools, blogs, and projects. Now we have machine learning heroes. So, Pavlos, you're one of those. Yes, thank you. Very selective. I'm really excited to be an AWS ML hero. So, well, yeah, I'm a staff data scientist working for Expedia Group and an ML hero. I'm so excited about machine learning and really privileged to live in this era. I started working in ML maybe seven or eight years ago, when it wasn't so fancy. Did you actually study it or did you fake it until you made it? Well, I studied operations research and then computer science. I took many courses related to ML, or about optimization, but there wasn't any course back then that specialized in machine learning. So I had some formal training, but that's all. Scikit-learn wasn't a thing back then. I remember it was WEKA, which was a good tool for small data. And, yeah, MapReduce was a big thing. I published a paper about MapReduce. So people were starting to talk about big data, how to store all this data, how to process it. So that was the beginning, the first step for machine learning. Yes, it's a good point. I have a slightly similar story, although I didn't take any formal courses. But big data, you know, 2010, 2011, right? It was all about Hadoop and big data and piling up web logs and trying to figure out what to do with them. And machine learning kind of followed on, right? I thought, oh, we have data, we have computing power, and we have to do something with them. Let's do something. We can't just do basic aggregation or whatever. Let's be smart. So we had all the data on S3, you know, terabytes of data, and the question was, okay, what are we doing now? And then we had to use all these MapReduce and Hadoop processing tools to prepare the data for the machine learning models. And then it was a painful era where we didn't have a lot of good ML tools to process big data. Yes, and I keep saying it: machine learning is not something on the side, right? Machine learning is just one step in your project, and you need data and big data tools. We had a session today on the big data services on AWS that can help you with that, you know, Glue and Athena and EMR. So data engineering seems to be the buzzword now. 
Data engineering is really important, and ML just follows that, right? Exactly, it's a big process that involves many people from different disciplines: data engineers, software engineers, data scientists, machine learning engineers, even product managers. So it's really challenging to make all these people work together. And now we have all these tools, so it's more of an organizational problem. Interesting. So you're part of the Expedia Group. Tell us about what kind of machine learning projects you work on and how you organize them. Absolutely, tech is on the table, we can pick what we need and assemble things and build, but actually running the project, getting from the business question to the actual prediction that helps improve the business KPI, that's the big story. So tell us a little bit about what you do on a daily basis. Well, I think 90% of my job is to prepare data and mine data, and the remaining 10% is actually machine learning. That's the number I keep hearing, 80% at least. Yes, exactly. So I'm really happy, because I think in most other jobs the fun part is one or two percent. So relatively speaking, it's good. At 10%, it's the absolute minimum, right? But I'm really privileged to be part of Expedia because it's a big company, and they have an ML platform that essentially helps data scientists unlock their talent. I think one key thing to make a machine learning project successful in a big company or a small company, it doesn't matter, is to have a really good machine learning platform. That's the key thing. And that's why, for example, SageMaker can help machine learning power many different features at the companies out there. Yes, I hear that from a lot of customers. Of course, when we do those small demos like we've done today, we look at toy problems, very small data sets, and we try to solve one thing. But I guess, you know, when you're working on ML at Expedia or any other company, you're maybe looking at 50 problems and 200 data sets and thousands of models. Because if you're trying to build a classifier, there's no one way to do it. There are so many ways. And then what about hyperparameter tuning, which we talked about? More models, more models, more models. And if you have a larger team... if you have one data scientist, that person's already going to build lots of things. If you have 10 or 50 or 100, then you're looking at potentially thousands of training jobs every day, plus the cleaning processes, etc. Right? So that's the main issue right now. And you need to make sure that you're not duplicating any work, because probably someone else in another office has built a similar classifier. So discovering ML models in a company is another big problem. And here SageMaker can help you: you can search for the different training jobs, see what the parameters were, what the input data was, how often each model is trained, whether it is deployed, all this stuff. Yes, so data wrangling and data engineering are important. Model discovery, model versioning, and data set preparation are important. What about the actual training and deployment process? I mean, that's a question I ask everyone. How much automation do you have today? Do you still have a human in the loop for model validation, whatever, model QA? Or, you know, what's the automated part? What's the manual part? And how far are you willing to go? And you can't lie, right? There are a lot of questions. I think that at the moment the deployment process is not fully automated. 
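To make the model discovery point above a bit more concrete, here is a minimal sketch using the Amazon SageMaker Search API through boto3. The job name filter ("xgboost-churn") and the printed fields are illustrative assumptions, not an actual Expedia or Sagify workflow.

```python
# Minimal sketch: discover existing SageMaker training jobs and inspect
# their parameters, input data, and model artifacts.
import boto3

sm = boto3.client("sagemaker")

response = sm.search(
    Resource="TrainingJob",
    SearchExpression={
        "Filters": [
            # Hypothetical naming convention: jobs whose name contains a project prefix
            {"Name": "TrainingJobName", "Operator": "Contains", "Value": "xgboost-churn"}
        ]
    },
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=10,
)

for result in response["Results"]:
    job = result["TrainingJob"]
    print(job["TrainingJobName"], job["TrainingJobStatus"])
    print("  hyperparameters:", job.get("HyperParameters", {}))
    print("  input channels:", [c["ChannelName"] for c in job.get("InputDataConfig", [])])
    print("  model artifact:", job.get("ModelArtifacts", {}).get("S3ModelArtifacts"))
```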
So I have a CLI tool that deploys a model, for example, at Expedia. But I have to remind myself that this model needs to be retrained on new data. Exactly. So I think that, universally speaking, there is no continuous delivery for machine learning out there. Essentially, you could have a file, like a Jenkinsfile, that tells your CD tool: okay, go and retrain the model every week using this data, then deploy the machine learning model as a RESTful endpoint on, say, three instances. Oh, and before we deploy it, make sure that, for example, the precision and recall are higher than a specific threshold. So I don't think we have something like that. So SageMaker is... Do you think it's possible? I mean, do you think we'll get to one-click automation, just like we've done for web apps and containers and everything else? Or do we always need a human at some point looking at the model and deciding, OK, this is a good one or this is a bad one, and then clicking so it gets pushed, which is automated from there. What's your gut feeling here? I think it's possible, but it will take time. I think the whole machine learning community needs to agree on the best practices. There are many different ways to do it. I think that, for example, SageMaker enforces best practices in a good way. And a company like Amazon, which has been doing machine learning for maybe decades, knows all the painful points. Essentially, you know what not to do. That's very important. So the community can learn from that and converge on the best practices out there. I think it's possible. Now we have SageMaker, which is essentially the ML infrastructure. But let's take a step back. An ML platform consists of two things: an ML infrastructure like SageMaker, and an interface between the data scientists and the ML infrastructure. That interface needs to be a user interface and a CLI tool at the same time. So Sagify, the CLI tool I've built... Yes, we're going to get a demo of that later in the session, okay? It's a very cool tool. So DevOps, right? And that's why we wanted to have automation and DevOps and data wrangling sessions today. Because I don't know if you see that as well, but a lot of customer discussions around ML tend to drift to neural networks and SGD and the crazy stuff in five minutes. And I'm always saying, whoa, whoa, okay. Next person who says SGD just leaves the room, you know? What's the business problem, right? Exactly. So I guess my question is, and without disclosing anything, of course, how fancy are real-life models, right? You know, Hacker News and arXiv papers about the crazy new GANs or the new crazy NLP models, BERT and all the variations around BERT, are really, really exciting. Now, if we come back to reality, isn't everyone just using linear regression and XGBoost and... In 90% of the cases, yeah, I think so. It's really funny, because linear regression has been out there for maybe 100 years, something like that. And yeah, someone uses linear regression and says, OK, I'm doing AI. And then statisticians get a little bit angry and they're like, oh, I was doing AI 25 years ago. But it's a good thing. We need baselines. Every time you start a new ML project, you need a baseline. Just use linear regression or logistic regression. Maybe use a random classifier. So would you actually recommend, because we have a lot of people who are new to machine learning also watching us, is that something you do every day? Like, let's try the simple things first, the simple or basic algos. 
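Stepping back to the continuous delivery idea Pavlos describes above (retrain on a schedule, then deploy only if the metrics clear a threshold), here is a minimal sketch of such a quality gate. The thresholds, the deploy() placeholder, and the scheduled-job usage are assumptions for illustration, not the actual Expedia or Sagify pipeline.

```python
# Minimal sketch of a metric gate: evaluate a retrained candidate model on a
# held-out set and only deploy it if precision and recall clear a threshold.
from sklearn.metrics import precision_score, recall_score

PRECISION_THRESHOLD = 0.80  # illustrative values, not a recommendation
RECALL_THRESHOLD = 0.75

def passes_quality_gate(y_true, y_pred):
    """Return True only if both metrics clear their thresholds."""
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    print(f"precision={precision:.3f} recall={recall:.3f}")
    return precision >= PRECISION_THRESHOLD and recall >= RECALL_THRESHOLD

def deploy(model_artifact):
    # Placeholder: in a real pipeline this would call your deployment tooling,
    # e.g. the SageMaker SDK or Sagify, to create or update a RESTful endpoint
    # backed by, say, three instances.
    print(f"Deploying {model_artifact} ...")

# Hypothetical usage inside a scheduled (e.g. weekly) CI/CD job:
# y_true, y_pred = evaluate_candidate_model(validation_data)
# if passes_quality_gate(y_true, y_pred):
#     deploy("s3://my-bucket/models/candidate/model.tar.gz")
# else:
#     raise SystemExit("Candidate model failed the quality gate; not deploying.")
```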
And I mean that in a positive sense. Simple is good. We love simple. I love simple models because I understand them. So would you run those first before going crazy with neural networks and everything else? Definitely. There is Occam's razor, the principle that the simplest solution is usually the best. So definitely, because it will make your whole process simpler. You will deploy a model to production in much less time, and that will make everyone in the business happy. This way, you have something in production, you get quick feedback on whether the model really works well in production, and you have simple code that is understandable by most engineers and product managers. And then, if everyone is happy and wants to move on and make it more accurate, you can move on to deep learning or fancier tools, algorithms, etc. So that's great advice to everybody out there, especially the people who are kind of new to ML. Again, that's why we wanted to have this session on the introduction to ML with Scikit-Learn and go through linear regression, logistic regression, trees, and PCA. A lot of people ask me, oh, I want to learn about deep learning. How do I do that? My answer is always the same: how much do you know about statistical ML? You have to walk before you run. So spend some time learning those, because like Pavlos just said, a lot of companies just use that. I mean, XGBoost is still winning on Kaggle all the time, right? It's winning competitions, it's still very good, and a lot of people still use it. It's probably the most popular algo today, right? Exactly. And if you're starting out, it's really good to find a good mentor, an informal one, someone who has been doing machine learning for many years, and try to learn from that person what worked and what didn't in the workplace, which open-source tools they use, how they collaborate with other members of the company. I think that's a key thing before you start applying this stuff in an industrial setting. Yes, that's good advice. So, you know, I'm going to be more brutal about it: don't believe the hype, I guess, okay? You know, just like everyone else, I read the blogs and the arXiv papers and the technical blogs from cool companies, etc. But real life is usually more boring, and that's great, because boring is nice. Boring technology is simple to understand and simple to explain. So if you solve your problem with random forests or linear regression, then fine, you are doing machine learning. Okay, that's the end of this episode. I hope you enjoyed the conversation, and don't forget to subscribe to my channel. I'll be back next week with more AWS news and demos and God knows what else. Until then, keep rocking.
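To make the "start simple" advice from the conversation concrete, here is a minimal scikit-learn sketch that establishes a dummy baseline and a plain logistic regression before reaching for anything fancier; the synthetic dataset is purely illustrative.

```python
# Minimal baseline workflow: dummy classifier first, then logistic regression.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real, prepared dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline 1: always predict the most frequent class (a naive classifier)
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Baseline 2: plain logistic regression
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for name, model in [("dummy", dummy), ("logistic regression", logreg)]:
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy={acc:.3f}")
```

Only if the simple model's numbers are not good enough for the business is it worth moving on to more complex algorithms.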

Tags

Machine Learning, Data Engineering, AWS, SageMaker, Expedia Group