Expert corner: Let's chat with Francesco Pochetti, AWS Machine Learning Hero (February 2021)

February 24, 2021
Session from AWS Innovate AI & Machine Learning EMEA: https://aws.amazon.com/events/aws-innovate/machine-learning/online/emea/

In this session, Julien chats with Francesco Pochetti, an AWS Machine Learning Hero and a seasoned data scientist. Discussing the concrete tasks that data scientists work on daily, as well as new requirements such as fairness and explainability, they share as many best practices and tips as they can fit in 30 minutes, to help you build high-quality models faster.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future episodes ⭐️⭐️⭐️

For more content:
* AWS blog: https://aws.amazon.com/blogs/aws/
* Medium blog: https://julsimon.medium.com/
* YouTube: https://youtube.com/juliensimonfr
* Podcast: http://julsimon.buzzsprout.com
* Twitter: https://twitter.com/@julsimon

Transcript

Hi everybody, my name is Julien and I'm a dev advocate focusing on AI and machine learning. In this session, we're super lucky to have Francesco, who's going to share lots of data science insights with us. Francesco, welcome, tell us a little bit about you.

Hello guys, thanks for having me. I'm Francesco, a senior machine learning engineer at Faction, a Belgian startup in the AI domain. I'm a chemist who converted into data science after a few years in the aerospace industry. I joined Amazon and worked in the Kindle business, which was books, then moved to ride-hailing, and then a bit of fintech. Now I'm back in Belgium with Faction. It's been a lot of fun with machine learning and data science so far.

All right. Francesco also has a really cool data science blog, so make sure you go and check that out. The reason I wanted to have you on this session today is that, as you said, you've covered a lot of ground with data science, working in different companies and on different use cases. This is great because we're in the data scientist track, so we can get a sense of the state of data science today: what works, what is still a huge challenge, and what we can do about it. So, tell us a little bit about the state of data science today compared to when you started a few years ago. How is it looking? How much progress have we made, and what is still very difficult?

Sure, it's a very good question. To understand the state of data science now, we need to review what the machine learning pipeline looks like. You start with a business problem, because you don't just wake up one day and decide to apply machine learning to something. You have your business problem, and then hopefully you get your data. You decide what to optimize for, and that's the second part. You do exploratory data analysis, then move to the modeling side, which includes feature engineering and modeling. After that, there's the production part, which is the big monster, including monitoring.
To answer your question, the only part of this workflow that works really well is the modeling part. We have tons of frameworks, from statistical learning libraries like XGBoost, LightGBM, and CatBoost, to deep learning frameworks like TensorFlow, PyTorch, and MXNet. We can train models in just a few lines of code, and there's a lot of research behind those lines. We have models on GitHub, and state-of-the-art work gets published with code you can fine-tune. So we have a luxury of choice and the ability to train models easily. This part has moved the fastest; it's not fully solved, but it's the bit that works best. For the rest, we've made a lot of progress since I started, but there's still a lot to do.

Let's go from A to Z, starting with the business side.

Absolutely. The business side involves communication with business stakeholders, which is often overlooked. Data scientists and machine learning scientists want to talk about the technical stuff, but a project doesn't exist if you can't communicate it. This is really important, and it's not working very well; we need to make a lot of progress here.

Next is data, particularly versioning. Data gathering, exploration, and cleaning are well-known challenges, but in production, versioning is a big issue. You train a model on a dataset, and three months later, good luck knowing which dataset you used, which features you engineered, and how you engineered them.

Then there's the myth that deep learning eliminates the need for feature engineering. In practice, deep learning covers only a minority of use cases; ensembles and linear models are powerful and still require good feature engineering. Moving from one domain to another changes everything, so this needs to be addressed.

Finally, there's production, which involves automation, monitoring, and continuous integration and deployment (CI/CD). We could talk for hours about this, but it's crucial for making sure models are reliable and up to date.
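The point that models can now be trained in just a few lines of code is easy to demonstrate. Below is a minimal sketch using scikit-learn's gradient boosting as a stand-in for libraries like XGBoost or LightGBM; the dataset and settings are illustrative assumptions, not from the conversation:

```python
# Train and evaluate a gradient boosting model in a few lines:
# load data, split, fit, score. The same pattern applies almost
# unchanged across XGBoost, LightGBM, and CatBoost.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```

This load-split-fit-evaluate loop being so short is exactly why the modeling step is the most mature part of the pipeline; everything around it (data, versioning, production) takes far more work.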
Okay, so let's look at each of these problems in detail. Last time we met, you had interesting insights on working with business stakeholders. It's not a technical topic, but it's critical: if you don't align with business stakeholders, it's unclear what you're building. So, let's talk about working with non-technical teams and business stakeholders to frame the project.

Absolutely. This is probably the hardest thing I still have to do, and it's not something you learn from a book; you learn it the hard way, often through failure. The first rule is to assume that the person in front of you doesn't have the same understanding as you. Setting expectations is crucial: explain what machine learning can and cannot do. Second, understand the business problem deeply. If someone asks for a machine learning solution, they likely have an existing solution, even if it's rule-based. They want you to improve it, so you need to understand the current situation thoroughly. Talk to your business stakeholders and understand their entire pipeline, including inputs, outputs, and how they use the results. Don't assume what people want; understand their expectations and how they will use the output. Finally, make sure to provide feedback every week. Machine learning is more like a research project than a typical engineering project: you don't know if it will work, so communicate your progress and insights regularly. This helps manage expectations and keeps everyone aligned.

Let's move on to the next challenge: understanding and labeling data. What insights and best practices can you share on data?

The single biggest piece of advice I have is to look at your data. Open up a table and stare at your rows and columns. Understand the interactions between columns. For example, if you're building a credit scoring model, check whether the income column is consistent with the debts column. Spend time looking at your data to catch nuances. Data labeling is a big challenge, especially with images.
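The income-versus-debt check described above can be as simple as a boolean filter. Here's a minimal pandas sketch; the column names and values are invented purely for illustration:

```python
# "Open up a table and stare at your rows and columns": flag rows
# where monthly debt exceeds monthly income -- an interaction a model
# would happily ingest, but one a human should question first.
import pandas as pd

df = pd.DataFrame({
    "applicant_id": [1, 2, 3, 4],
    "monthly_income": [3200, 0, 4500, 2800],
    "monthly_debt": [900, 1200, 300, 4000],
})

suspicious = df[df["monthly_debt"] > df["monthly_income"]]
print(suspicious)
```

Staring at the flagged rows (a zero income with outstanding debt, debt payments above income) is how you catch data-entry errors or genuine edge cases before they silently distort a model.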
You often have to label data yourself, which is time-consuming. There's no magic solution, but it forces you to look at your data closely. Versioning is also crucial: you need to keep track of the data you use, especially as new data comes in daily. The machine learning community still needs to address these issues.

If you have labeling pains, consider services like SageMaker Ground Truth, which lets you annotate all types of data, including images, and scale labeling jobs using a third-party workforce. SageMaker Pipelines can help with traceability and model lineage, making it easier to track from a model back to its dataset.

Let's talk about bias and explainability. This is a big problem. Have you faced it, and what should we look for?

Absolutely. I faced this in the financial industry, working on application credit scoring. Bias is a serious issue, especially in regulated domains. It often goes undetected, and when it is detected, it's hard to fix. For example, if gender is a feature and it works well, you need to understand why, and whether it should be a feature at all. Dropping the feature is a common but not ideal solution. Services like SageMaker Clarify help detect bias at different stages, from data to modeling to post-processing. Clarify alerts you to discrepancies, but you need to decide whether they are acceptable.

Let's move on to feature engineering. What advice do you have?

The best advice is to check out Kaggle competitions; they're a gold mine of feature engineering techniques. Feature engineering is domain-dependent, and there's no one-size-fits-all solution. Talk to business stakeholders about domain knowledge and build features based on that. AutoML is trying to solve this, but for complex use cases, especially in life sciences and chemistry, it's still a challenge. SageMaker Data Wrangler and Glue DataBrew can help with building and exporting transformation pipelines.

Finally, let's discuss production. It's a monster. How do you handle it?
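One concrete way to quantify the kind of bias discussed in the credit scoring example is the disparate impact ratio: the positive-outcome rate of the least-favored group divided by that of the most-favored group. The sketch below is a hand-rolled illustration with made-up data, not the SageMaker Clarify API (Clarify automates checks like this across the data, modeling, and post-processing stages):

```python
# Compute the disparate impact ratio for a set of binary decisions.
# A ratio of 1.0 means parity; values below 0.8 are a common
# warning threshold (the "four-fifths rule").
def disparate_impact(approved, group):
    # approved: list of 0/1 decisions; group: parallel list of group labels
    rates = {}
    for g in set(group):
        decisions = [a for a, gr in zip(approved, group) if gr == g]
        rates[g] = sum(decisions) / len(decisions)
    return min(rates.values()) / max(rates.values())

# Illustrative decisions for two groups, A and B
approved = [1, 1, 0, 1, 0, 0, 1, 0]
group    = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(f"disparate impact: {disparate_impact(approved, group):.2f}")
```

A low ratio tells you a discrepancy exists; as noted above, deciding whether it is acceptable, and what to do about it, remains a human judgment call.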
Production is a real challenge, from monitoring to deploying new models. The biggest one is streamlining the training-to-production pipeline, similar to CI/CD in software engineering. SageMaker Pipelines addresses this by ensuring that code changes trigger a series of checks and tests before reaching production. We need to remember that we are software engineers first: machine learning should be done the way a great software developer works, not just the way a great machine learning scientist does.

What's the best advice you could give to a junior data scientist?

Number one, be a good software engineer; contribute to open source projects to improve your coding skills. Number two, start a blog; document your experiments and write about new algorithms and frameworks. This improves your communication skills and is great for job hunting. Number three, pick a project and stick to it. Data science is overwhelming, so focus on one topic, dive deep, and write about it.

Great stuff. Thanks, Francesco, for this invaluable exchange and the great advice. I hope you enjoyed it as much as I did. See you soon, and everyone, enjoy the rest of the conference.

Thank you very much. Bye-bye. Thank you so much.

Tags

Data Science, Machine Learning, Business Stakeholder Communication, Feature Engineering, Model Production

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.