In this episode, I have a chat with Ségolène Dessertine-Panhard, a Data Scientist with the AWS Machine Learning Solutions Lab (https://aws.amazon.com/ml-solutions-lab/). As she spends her time working on customer projects, we talk about ML in the trenches: framing problems, building datasets, picking algos, deploying to production, and more.
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future episodes ⭐️⭐️⭐️
This podcast is also available in audio at https://julsimon.buzzsprout.com
For more content, follow me on:
- Medium: https://medium.com/@julsimon
- Twitter: https://twitter.com/julsimon
Transcript
Hi, this is Julien from Arcee. Welcome to episode 8 of my podcast. Don't forget to subscribe to be notified of future episodes. In this episode, I talk to my colleague, Ségolène. Ségolène is a data scientist who works for the ML Solutions Lab inside of AWS. We talk about machine learning in the trenches. What does it take to have successful ML projects, from framing the problem to building datasets, to picking algos, to deploying in production? Lots of really good tips in there. Let's not wait and jump to the discussion.
Ségolène, thank you very much for taking the time to talk to me today. Can you tell us a little bit about your role at AWS?
Thank you, Julien, for having me. Good morning everyone. I'm a data scientist in the Machine Learning Solutions Lab at AWS. My team is dedicated to helping our customers do machine learning on AWS.
Okay, that's a big topic and it's honestly a dream job. Is your team hiring a lot?
Yes, we are. So get in touch, right?
What's the first thing customers need to focus on when they start a machine learning project?
I think the first thing is to focus on the business problem they want to solve and how to frame it. The business problem should be something you can write on a whiteboard, something simple. This is one thing I always say to customers at the very first step when I meet them: keep it simple, the KISS method. Keep it simple, stupid. It's very important to be able to formulate what you want to solve in a clear manner. Otherwise, the rest of the project will be very hard.
Do you recommend identifying key business metrics that would be improved by the prediction?
Exactly, and this is exactly what we do when we meet the customer for the first time. We try to understand what business problem they want to solve and what are the KPIs. That's really a keyword for us, the KPI which can be improved by the ML or deep learning model.
Okay, so write something simple on the whiteboard with metrics and then what? What's the next step?
The next step is to get your hands dirty and try to see if you have some data to solve this problem. Sometimes we can see that with some customers, they have a big problem, but most of the time they don't have the data or they don't know where the data is, etc. So the second step is to understand and know your data.
In large companies, I suppose this means lots of investigation, discussion with different teams involved. So who would you need to sit around the table to be successful here?
In this case, I will need some business people, so a stakeholder of the business who knows the business, and IT guys who know the data and the infrastructure. These two types of people are super important when you try to do ML projects.
How important is it to understand the baseline? Some machine learning projects are completely new, but a lot of them actually try to replace a manual process or a traditional IT system. And I guess you need to understand that as well, right?
Exactly. Sometimes, customers say, "We want to automate this task because we know that we've got a lot of human errors." One of the quick questions I ask them every time is, "Who is in charge of this manual process?" Because you need to talk with the people in charge of the manual process to understand how they do it. You can only improve the task when you understand who is doing what and why they make mistakes. If you don't understand how a human does it, then how could you build a model that does it?
Even in a very big customer with a lot of data, they might tell you, "We've got four people in charge of doing that." Before doing some crazy deep learning things on the data, you really need to understand the human behavior.
When you're looking at KPIs that are going to be improved by the model, what would be the best practice in trying to improve those KPIs? Are you looking at reasonable improvements, like 5%, or should you be more aggressive and aim for 50%?
From my own experience, it's always better to keep it simple at the beginning. Once you see some improvement, it means you understand the process, and it will be easier to improve faster. It's good to understand the baseline and set expectations with the business stakeholders. You will iterate anyway because data discovery is not something you complete in one go. You always find new sources of data.
The more you go into your ML project, the more you have new ideas, new hypotheses, and you want to test them. This is why it's very important to start with a very simple model. New ideas, new data, and new people will want to be involved in the project. It's iteration, iteration, iteration, and baseline.
Now let's talk about datasets because they're the blood of machine learning models. What best practices would you recommend to customers, or what mistakes do you see customers making when building and curating datasets?
Sometimes they don't know their data. One of my first recommendations is to spend time understanding where the data comes from and why there are missing values, bad quality, or errors. Spending time on exploratory data analysis is such an important step in any kind of ML project. People want to rush into the ML things, but it's super important to take your time. As data scientists, we say, "Garbage in, garbage out." It was true in the 70s, it's true in the 2020s, and it will be true for the next 50 years.
If you do computer vision and get a lot of noise in your data, it's important to understand the data before doing any kind of transformation.
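The exploratory checks Ségolène describes can be sketched in a few lines of pandas. This is a minimal illustration on a synthetic dataset with hypothetical column names (`age`, `income`) and simulated missing values, not a recipe for any particular customer's data:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; replace with your own data source and columns.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.integers(18, 90, 1000).astype(float),
    "income": rng.normal(50000, 15000, 1000),
})
# Simulate missing values in 50 distinct rows, as you often find in real data.
df.loc[rng.choice(1000, 50, replace=False), "income"] = np.nan

# Basic checks before any modeling: distributions, missing values, duplicates.
print(df.describe())           # mean, std, quartiles per column
print(df.isna().mean())        # fraction of missing values per column
print(df.duplicated().sum())   # number of exact duplicate rows
```

Running checks like these first is what surfaces the "why is this value missing?" conversations with the business and IT stakeholders mentioned earlier.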
We discussed the role of the data engineer and how difficult it is to access data for data scientists. Do you see this role as a well-defined role with customers at the moment, or is it still emerging? What's your opinion on how to be a successful data engineer?
In my team at the Machine Learning Solutions Lab, we have different types of profiles. We have data scientists, deep learning architects who create the pipelines for ML and DL projects, and data engineers. These three roles depend heavily on each other. Data engineering and data pipelines are crucial because, as a data scientist, if I don't have access to data, I cannot do anything. Once the data is here, you need someone to make sense of it. Sometimes, customers have only one person in charge of everything, acting as both data scientist and data engineer. We provide tools and services to help in this area.
Building a dataset is never something that gets done. It's never over because you're going to have new ideas to test, new models, and new data to add. It's a very iterative and dynamic process, so it's an ongoing activity, and you need to plan for that.
Now it's time to try out algos. It's a maze of algos, from statistical machine learning to deep learning. How do you get started here?
One of my most important recommendations is to start with the baseline. Take the easiest, go very simple, see what happens, and then introduce more complex stuff. You might start with a subset of the data and try statistical algorithms like logistic regression to see if you can get to 80% accuracy. It's super important to play with the data, do exploratory data analysis, and see the distribution of the data. Then add more features, try different models, compare with the baseline, and see where the improvement is.
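The "simple baseline first" advice above can be sketched with scikit-learn. This is a hedged illustration on a synthetic dataset: the point is only to produce a single baseline number that every more complex model must beat:

```python
# Minimal baseline sketch: logistic regression on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for a real (subset of a) customer dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The simplest reasonable model; its score is the baseline to beat.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = accuracy_score(y_test, baseline.predict(X_test))
print(f"Baseline accuracy: {acc:.2f}")
```

Any deep learning model you try later should be compared against this one number; if it can't clearly beat it, the added complexity isn't paying for itself.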
For instance, Amazon Personalize and Amazon Forecast work like this. You start with a simple model, which is your baseline, and then try different algorithms. You might do some hyperparameter tuning with Forecast. The very important thing is to have a baseline number. Convince yourself you have something interesting, and then you can unleash a collection of algos and do hyperparameter tuning. Tooling is important because you don't want to do that stuff manually, but sometimes it's important to do manual exploration first and then automate.
When do you know it's time to stop?
When you are tired, when you dream about it during the night, and you say, "Okay, now I'm done, it drives me crazy." The difference between pure research and business is that in business, it doesn't make sense to spend months just to improve by 1% or 2%. If you are at 70% accuracy, you won't spend another eight weeks just to get to 72%. At some point, you say, "It's not perfect, but it will work, and I will monitor and retrain it in the future." You can have a human in the loop to monitor the model and catch issues. It's super important to have a human in the loop, because once you put it into production, the job is not done. You need someone to take care of the algorithm in production.
Once you reach the accuracy level that is good enough for your business, you have to deploy in production. I keep saying this is the hardest part and the most dangerous part. Do you agree?
Yes, it's very tricky. Putting a model into production without A/B testing can lead to issues; customers might call you asking, "What happened?" You need to be very rigorous about each step of productionizing your model: how the data is ingested, which metrics you can follow, and the data ingestion pipeline. Services like SageMaker help you deploy to production in one click, but you need to ensure that CloudWatch, Step Functions, and outputs work well. It's important to keep control, because data can evolve a lot in a short time. Data drift and missing features can corrupt the data sent to the model. SageMaker Model Monitor helps solve that problem.
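The data drift idea mentioned here can be illustrated with a toy check. This is not how SageMaker Model Monitor works internally, just a minimal sketch of the underlying intuition: compare a production batch against the training distribution and alert when a feature has shifted. The threshold of 0.5 standard deviations is an arbitrary assumption for the example:

```python
import numpy as np

def drift_score(train_col, prod_col):
    """Rough drift signal: shift of the production mean, in units of training std."""
    std = train_col.std() or 1e-9  # guard against a zero-variance feature
    return abs(prod_col.mean() - train_col.mean()) / std

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)       # feature distribution at training time
prod_ok = rng.normal(0.05, 1.0, 1_000)     # production batch, essentially unchanged
prod_drifted = rng.normal(2.0, 1.0, 1_000) # production batch with a clear shift

for name, batch in [("ok", prod_ok), ("drifted", prod_drifted)]:
    score = drift_score(train, batch)
    flag = "ALERT" if score > 0.5 else "ok"  # 0.5 is an illustrative threshold
    print(f"{name}: score={score:.2f} {flag}")
```

In practice you would run a check like this per feature on every ingestion batch and wire the alert into your monitoring, which is exactly the gap that managed tools like Model Monitor fill.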
Production is the hardest part, but once it's in production, it's super cool because you see that all the work you've done is real. You can look at your KPIs and show that the improvement is real. The truth lies in production. If you can replicate in production the results you had in the sandbox, then congratulations. It's not easy, but it's rewarding when you see the improvement and people understand the work and the power.
We can continue for hours, but we're almost out of time. So, we'll play the top 3 game. The top 3 things that are important for a machine learning project are data, people, and business outcomes.
The top 3 things that kill ML projects are also data, people, and business outcomes. If the business outcome is not well defined, people won't be motivated to find the good data. On the other hand, if you have a strong business outcome with motivated people and good data, you will succeed.
If you need help, the ML Solutions Lab, Ségolène, and the rest of the team, who are amazing, are here to help. So get in touch.
Thank you very much for sharing the real-life stories and advice for customers and everyone else. Much appreciated. I wish you many successful projects.
Thank you very much. Thank you, Julien.