AWS AI Machine Learning Podcast Episode 9

Transcript

Hello everyone, this is Julien from Arcee, and I'm not in my usual setting because I am in Ireland today, getting ready for the AWS user group in Dublin tonight. But I wanted to get this new episode out, and it's episode 9 already. I had a great conversation with my friend Leo Suke. Leo is a data scientist, currently finishing his PhD in Paris. Believe it or not, he also co-founded a school a few years ago. This school is called the Data Science Tech Institute, and they train data scientists and data engineers with a lot of emphasis on math and theory, my favorite subject. Anyway, Leo has a great perspective on data engineers, data scientists, and what it takes to be successful, what those jobs are, how important they are, and how they relate to each other. So let's not wait, let's listen to this conversation. And while you're doing this, I'll have more coffee for the meetup, not beer yet.
Leo, thank you very much for taking the time to talk to me today. I have plenty of questions, and I guess the first one is, can you tell us a little bit about DSTI?
Thank you for having me today. DSTI is a postgraduate school that aims to train skilled data scientists and data engineers.
Okay, and when did you start?
We started in October 2015, and we now have two cohorts per year, one in spring and one in autumn.
Okay, and you have an online training and an in-class training, right?
We have three different modes, exactly. One is the regular on-campus training, where you come in for classes on campus. We have the equivalent online, so you follow classes live online from home, wherever you are around the world. And we have a third mode, which we call the SPOC mode, standing for self-paced online courses, designed for people who work and will follow the classes over the course of three years at their own pace.
Okay. So you mentioned you're training data engineers and data scientists. Tell us a little bit about those two roles.
It's an interesting question because data scientists have lately been in the spotlight. It's the job of the 21st century, the sexiest job, etc. But data engineers are here as well. For us, the data scientist is more the math athlete, the math side of the force, whereas the data engineer is more related to infrastructure. In fact, a lot of different studies from Gartner and others say that in a data team, you should have at least two or three data engineers for one data scientist. This is an interesting point to highlight: in our opinion, data engineers suffer a bit from the shadow of the data scientists' fame, but they are equally crucial, if not even a bit more.
Okay, interesting. I'm now wondering what happens when you have a team of data scientists with zero data engineers.
This is a team that has been wrongly sized, or hired out of, you know, "we need to do AI, so let's bring in data scientists." They usually struggle. In every such team, one of them will end up doing the data engineer role.
Okay, so let's zoom in on the two roles. Let's start with the data engineer. I suppose they sit upstream of data scientists. What does a data engineer do? What kind of unique skills do they need to have? What are their daily tools?
First, I would like to specify that they do sit upstream but also downstream. What I mean is they are here to master infrastructure. Ideally, big data infrastructure, but infrastructure in general, including all network-related issues, system-related issues, storage. Their role is to collect, sometimes pre-process, clean the data, store it, and make it available for data scientists to play with. But they also, in some companies, sit downstream because once the data scientist has developed the models, they will industrialize the models, putting into production what has been developed by the data scientists. So, under the big hood of data engineering, there are many different skills, of course.
It can be split into different areas, but all related to a strong IT background. They need to be strong with IT, strong with development, and aware of trends. They don't have to be specialized in a specific language or tool. It's more about a state of mind, a philosophy, like DevOps. It's a set of tools and technologies, but also a way to optimize and automate processes. They also need to master SQL.
We had this conversation last week with Francesco, and he said the top skill you need in machine learning is SQL. No one is going to get your data for you.
They need to be aware of DevOps, data ops, CI/CD pipelines, and cloud infrastructure. They allow you to have the power. So, a strong IT background and mindset are essential. And they need to be knowledgeable about the domain. Otherwise, you have a bunch of SQL tables or logs, and what do you do with that, right? It's not just plumbing and IT. They need to understand what they're doing, especially if they want to scrape data or gather external sources. They also need to know, and I know IT people would disagree, a little bit of math to discuss and collaborate with data scientists, who are usually math people. They don't have to go into the equations, but they should know what linear regression is, just enough to communicate.
So the data engineer is a unicorn too, right?
Actually, yes. Skills like these are rare, and after rushing to hire data scientists, companies realize they need data engineers, the IT guys. It's a good career path. We have companies, for instance, a big French group, who say they struggle to hire them and would pay them 1.5 times the salary they pay data scientists. If you hear that, get in touch with me, or if you need contacts, we try to help. We try to get those IT guys out of their development jobs and say, okay, you can go further. If you're bored with vanilla DevOps, you can become a skilled data engineer.
You can probably transition to a data engineer, with a lot of work, of course, but it's along your path.
Now, what about data scientists? Your vision, because I know you have a very specific approach. How do you see those people?
For us, they are, as I said, math athletes. They are good in math and IT because they need to develop and prove that the idea works. When they want to run a training job on AWS in the cloud, they just need to do that themselves. Launch your own machine, that's nothing complicated, but you need to be independent on that side. But our little difference is that DSTI's curriculum emphasizes math. The trend nowadays is to use out-of-the-box, black-box tools, off-the-shelf models, like the famous XGBoost, which usually works well. But our concern is that if you need to use them, you need to know the math behind them. You're not going to be crunching equations every day, but the better you understand the algorithm you're using, the better you will be able to get value out of it. We had feedback from students a few cohorts back that going from 0 to 90% in predictions is rather easy with all the tools. At DSTI, we learn how to go from 90 to 95%. This is the difficult part because how do you optimize your model? How do you clean your data and handle math-related issues? You also need to prove what you're doing. It's not just, "Here, it works," but if you were to put an algorithm in a plane, you would need to know how and why the algorithm actually works.
Interesting. It's a good way of putting it. But I ask myself, where would I go next? How do I understand how to get to 95 or 97%? That's what you're talking about: looking at those missed predictions and understanding what happens exactly and how to get further, how to explain what you've done, how to prove it.
These are the kinds of questions we try to answer by understanding the math behind it and going beyond just out-of-the-box tools.
We also had an experience where, for example, if you take Python, you have plenty of libraries. What if one of these calculations is actually wrong? You shouldn't rely entirely on the result. You need to understand if the output makes sense, if there's logic behind it. As you say, going further, understanding the outliers, understanding the residuals, and those kinds of things. You need to understand the math behind it.
So let's keep digging and tell us about two or three math domains that help you do that. When you look at all those great MOOCs, like Andrew Ng's or Fast.ai, there is some element of math, but not as deep as what you offer. So tell us about a couple of those courses and how they help.
For instance, statistics, like the famous linear regression. But what about multivariate linear regression, logistic regression? In which context can you actually use them? Are your datasets properly designed, and is your data correctly distributed so you can use linear regression? When do you use Kolmogorov-Smirnov tests? In which context? I know they're difficult to pronounce, but they're also difficult to use. All these statistical concepts are crucial. For instance, we also have distributed environments nowadays, and our old statistical algorithms need to be distributable. Are the results correct when you distribute? Because you have concurrent calculations, you need to add sums together. So all these challenges require understanding the math behind it. Also, since you mentioned deep learning, it's very easy to go, especially with Keras, a few layers, add, add, add, boom, train, and you get 92%. Keras, you're the best. But what is behind it? It's the whole field of optimization. How does stochastic gradient descent actually work? When you look at your validation accuracy curve, validation loss curve, how do you know you're on the right path? What would lead to any improvements?
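To make the question about stochastic gradient descent concrete, here is a minimal sketch in plain Python, not something taken from the conversation or from the DSTI curriculum. It fits a line to toy data by updating the slope and intercept from one sample at a time and records the loss after each epoch, which is the kind of curve mentioned above. The function name `sgd_linear_regression`, the learning rate, the number of epochs, and the data are all assumptions chosen for illustration.

```python
import random


def sgd_linear_regression(xs, ys, lr=0.1, epochs=200, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    loss_per_epoch = []
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b,
            # estimated from this single sample only.
            w -= lr * error * xs[i]
            b -= lr * error
        # Mean squared error over the data after this epoch.
        mse = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        loss_per_epoch.append(mse)
    return w, b, loss_per_epoch


# Toy, noise-free data generated from y = 2x + 1 (made up for illustration).
xs = [i / 49 for i in range(50)]
ys = [2.0 * x + 1.0 for x in xs]

w, b, losses = sgd_linear_regression(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
print(f"loss after first epoch: {losses[0]:.6f}, after last: {losses[-1]:.2e}")
```

Each update uses a single sample, so individual steps are noisy, but on this toy data the per-epoch loss shrinks toward zero. Changing `lr` is a quick way to see how the learning rate affects whether and how fast that happens.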
Understanding the field of optimization, which is the foundation of neural networks, is crucial. We go very deep on that. This is the point of DSTI. For instance, time series analysis is a huge field where regular algorithms cannot help, and deep learning is not the best. How do you handle time series analysis? Once again, if you just use off-the-shelf algorithms, you might end up making the wrong decision. Understanding the math behind it always helps you make the right prediction. And if you were to present in front of your CEO or board, you can say, "This is my prediction, but this is my confidence interval. This is what it means. I'm that confident, or the algorithm is not good." This is also a huge concern. Tools will always give you a result, and it's easy to misinterpret them, especially for those who don't understand them.
When you see what you want to see.
Exactly. So you need to know if the model is not good and why. These are examples where understanding the math would significantly help you be a better data scientist.
How far can we actually go on explaining algorithms and models?
I would say the limit tends to be neural networks and deep learning. Up to random forests and CART algorithms, these can be mathematically proven, and you need to be very bright, but it's doable. You can prove that the algorithm actually works. But on deep learning, there's a lot of work on explaining how the algorithm came to that conclusion and certifying that this algorithm will surely work in a specific context. This is the limit where I draw the line in the latest advances in deep learning because the model works, but can you explain why? If you're going on a plane, are you driven by a deep learning model? Today, I wouldn't say so. Specific industries like aeronautics or defense are working on these techniques, but putting them into production is another story. That's why a good linear regression, well, it works very well. That's where I draw the line.
It doesn't mean they're not interesting, but you need to know these limits and how far you can use them.
So what's your opinion on our data, big data, and machine learning services?
I would say one of the milestones has been SageMaker. As a teacher, when we set up Jupyter and you have to do it by hand, installing Jupyter and everything, some data scientists don't want to do too much system IT work and are more than happy when you click a button, and there's your Jupyter, ready to go. But yes, SageMaker is probably my favorite because it makes my life much easier, and the fact that you can just launch and get your training on other instances, all that kind of stuff. We are also looking forward to the DeepRacer, which will be launched in France for the school. Have you heard that? So yes, we are even more excited. This is not getting cut. We need a DeepRacer in France. We need to be able to buy it. I'm just asking you to buy it. And elsewhere, not just France. No, yeah. I'm brushing my own.
And the one I use the most is Rekognition. To be able to quickly show, okay, you want to do some AI proof of concept. You don't have to go and implement the whole deep learning model, train it for weeks. Someone already did it for you. If you want to prove that facial recognition works or object detection works, here's a simple API call, play around with it. In a few hours, you can make a fairly good proof of concept that will prove to your manager or end users that this is what you will actually do. There is a business case. And this is what data scientists tend to forget: the actual business case. They tend to focus on how to get 99%. But do you need 99%? Should I use Adam or should I use SGD? Or should I write my own?
Good luck, my friend. So Leo, this is great. We could continue for hours, but we're almost out of time. So I guess my last question is, how do you start your data engineer career?
If you're a developer and you're interested in that role, how do you get started?
My best advice is to find one of the certifications from one of the providers you have online and try to follow the courses, because passing those professional certifications usually gives you a well-structured path. Look into the industry you're interested in and the kind of technology and IT challenges they encounter. There's no need to go into DevOps if you're not in an industry mature enough for that kind of thing. If you're in a company that isn't mature yet, you will probably end up handling a lot of data wrangling, a lot of SQL, all that kind of thing. So DevOps might be a bit further down the road, depending on the industry, but those online platforms really get you started on high-quality courses. So this is where I would start.
And I'm afraid of the answer, but then if you want to be a data scientist, where do you start?
I would say the Andrew Ng machine learning course would be a good start. Andrew Ng, everyone loves you. That course has changed lives, I think. For me, it's the perfect trade-off between IT and math. So I would start there and then, once again, you can't be a full data scientist, meaning you understand every single aspect of everything. It's more about what kind of application you want to do. If you want to do pure industrialized applications, you're probably just going to need the basics, linear regression, logistic regression, up to XGBoost or so. If you're really interested in advanced AI, decide whether you're interested in computer vision or NLP, then take those frameworks out of the box to help you get something quick, and then specialize in one of those areas to develop your real expertise.
Real expertise, exactly.
Because if you want to be a full data scientist, you're going to need to spend time in class, obviously. But getting into your job, that would be good.
And of course, for every single data scientist, don't be afraid of IT because you will do a lot of IT.
See, it's not just me. Learn about the cloud, guys. Leo, this was really great. Thanks again. Thank you very much for joining me today, and I'll put all the details about DSTI in the video description. Check them out, get in touch with Leo on LinkedIn if you have questions, and I'll see you soon with another episode.
Thank you very much.
