Detect potential bias in your datasets and explain how your models predict

February 24, 2021
Session from AWS Innovate AI & Machine Learning EMEA: https://aws.amazon.com/events/aws-innovate/machine-learning/online/emea/

As ML models are built by training algorithms that learn statistical patterns present in datasets, several questions immediately come to mind. First, can we ever hope to explain why our ML model comes up with a particular prediction? Second, what if our dataset doesn't faithfully describe the real-life problem we were trying to model? Could we even detect such issues? Would they introduce some sort of bias in imperceptible ways? These are not speculative questions, and their implications can be far-reaching. Unfortunately, even with the best of intentions, bias issues may exist in datasets and be introduced into models, with business, ethical, and legal consequences. It is thus important for model builders and administrators to be aware of potential sources of bias in production systems. In addition, many companies and organizations need ML models to be explainable before they can be used in production. In fact, some regulations explicitly require model explainability for consequential decision making.

In this hands-on session, you'll learn how Amazon SageMaker Clarify can help you tackle bias and explainability issues, and how to use it with both the SageMaker Studio user interface and the SageMaker SDK. You'll also see how it works together with SageMaker Model Monitor to track bias metrics over time on your prediction endpoints.

For more content:
* AWS blog: https://aws.amazon.com/blogs/aws/
* Medium blog: https://julsimon.medium.com/
* YouTube: https://youtube.com/juliensimonfr
* Podcast: http://julsimon.buzzsprout.com
* Twitter: https://twitter.com/@julsimon

Transcript

Hi, my name is Julien and I'm a dev advocate focusing on AI and machine learning. In this session, we're going to discuss two significant problems for machine learning projects: bias and explainability. Before we jump into the demo, let me explain what these issues are. Bias, in general, is an unfair representation of reality. As we use datasets to build machine learning models, there's always a risk that those datasets do not fully and fairly represent the reality we're trying to model. Unfortunately, these problems are often related to sensitive attributes such as age, sex, or place of residence. With all those different groups in your dataset, it's possible that one or several of them are underrepresented, giving the algorithm fewer instances to learn from. Eventually, this could lead to a model that predicts accurately and favorably for one group of instances, and less accurately or favorably for another. This is what we mean by bias. There are many angles to this problem, but this is the kind of issue we're going to look at today. The second issue is explainability. As models get more complex, especially deep learning models, it's becoming almost impossible to understand exactly why a model comes up with a certain prediction. You might think: if I evaluate on the test set and get high accuracy, then it's fine; I don't need to know the reason behind a prediction, I just need to know it's the right one. While accuracy and testing are important, organizations increasingly want to understand what's happening inside the model. In particular, they want to understand why a model comes up with a certain outcome for a certain instance: which features were important, and, down to the level of individual feature values, which values push the prediction toward a positive outcome and which push it toward a negative one.
Think about hiring, credit decisions, or health applications: these are critical areas where we really need to understand what's going on. Additionally, organizations and companies have regulatory and legal obligations to explain how their models work. That's why explainability is such an important issue. To highlight these two problems and how to solve them, we're going to jump to the demo now. We'll start with the Adult dataset, extracted from US census data. This dataset includes demographic information and is labeled with whether a given individual earns more than $50,000 per year. We'll train a binary classification model with the XGBoost algorithm to predict whether a given citizen makes more or less than $50,000 a year. Our focus is not just on solving the problem, but on identifying bias in the dataset and model, and on understanding how the model predicts. We'll use a new SageMaker capability called SageMaker Clarify. I'll share the URL to this notebook on the final slide, so you can easily reproduce this. The first step is to download the dataset, which is already split for training and validation. We load both files with pandas and assign column names. The dataset includes age, work class, education, marital status, ethnic group, sex, capital gain, capital loss, country, and the label indicating whether the individual earns more or less than $50,000 per year. Early on, you would want to get a sense of what the data looks like, so you would run various visualizations. For the sake of time, I'll jump directly to a potential problem. Let's look at sex, counting how many men and women are in the dataset. It's about one-third female and two-thirds male, a mild imbalance. When we look at the labels, the imbalance is more severe, with a ratio of about six or seven to one. This means the algorithm will see six to seven times fewer female instances labeled as earning more than $50,000.
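The imbalance check described in the transcript can be sketched with pandas. The rows below are made up for illustration (the real Adult training set has roughly 32,000 rows), but the two checks are the same: group counts overall, then positive-label counts per group.

```python
import pandas as pd

# Toy stand-in for the Adult dataset; values are invented for illustration,
# with the same rough shape as the transcript describes (2:1 sex imbalance,
# a stronger imbalance on positive labels).
df = pd.DataFrame({
    "Sex": ["Male"] * 6 + ["Female"] * 3,
    "Label": [">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K",
              ">50K", "<=50K", "<=50K"],
})

# Group-level counts: is either sex underrepresented overall?
print(df["Sex"].value_counts())

# Label counts per group: the imbalance that matters most for bias is the
# ratio of positive labels (>50K) between the two groups.
positives = df[df["Label"] == ">50K"]["Sex"].value_counts()
print(positives)
```

On the real dataset, the second count is where the six-to-one ratio shows up, even though the overall sex split is only about two to one.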
This could make it more difficult for the model to pick up the right patterns, potentially leading to bias towards the majority class. We encode the data to replace strings with integers using scikit-learn's LabelEncoder. The female sex value is encoded as zero, and male as one. We need to remember these values when configuring the bias analysis. The dataset is then uploaded to S3, and we configure the estimator with the XGBoost algorithm, setting hyperparameters, including the binary logistic objective for binary classification. We also add early stopping to avoid overfitting. After training, we create the model in SageMaker so that post-training metrics can be computed: SageMaker Clarify will automatically deploy it to a temporary endpoint for analysis. Next, we run the bias analysis with SageMaker Processing, configuring it to compute both pre-training and post-training metrics. We define the dataset location, label name, column names, and model name. The key part is specifying the facet (sex) and the feature value for the potentially biased group (women, encoded as zero). We also specify the positive label value (1, meaning earning more than $50,000). The processing job runs, computing various metrics and generating a report. The report shows class imbalance (CI) and difference in proportions of labels (DPL), confirming that the majority class (men) has a higher proportion of desirable outcomes (earning more than $50,000). Post-training metrics, such as disparate impact, show that the model predicts more favorably for the majority class (men) than for the minority class (women). To learn more about these metrics, you can refer to the white papers and the Amazon SageMaker Clarify repository, which contains the open-source code for metric computation. For model explainability, we use SageMaker Clarify with SHAP (SHapley Additive exPlanations). We define a SHAP config, provide a baseline, and compute SHAP values for the dataset.
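The configuration steps above can be sketched with the SageMaker Python SDK's `clarify` module. This is not the exact notebook code: every role ARN, S3 URI, and model name below is a placeholder to substitute with your own, and the column headers are shortened to those mentioned in the transcript.

```python
from sagemaker import clarify

# Placeholder role and compute resources for the Clarify processing job.
clarify_processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/adult/train.csv",    # placeholder
    s3_output_path="s3://my-bucket/adult/clarify-output",   # placeholder
    label="Label",                      # 1 = earns more than $50,000
    headers=["Age", "Workclass", "Education", "Marital Status",
             "Ethnic Group", "Sex", "Capital Gain", "Capital Loss",
             "Country", "Label"],
    dataset_type="text/csv",
)

model_config = clarify.ModelConfig(
    model_name="my-xgboost-model",      # Clarify spins up a temporary endpoint
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],      # positive outcome: >$50K
    facet_name="Sex",
    facet_values_or_threshold=[0],      # women were label-encoded as 0
)

# Pre-training and post-training bias metrics in one processing job.
clarify_processor.run_bias(
    data_config=data_config,
    bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=clarify.ModelPredictedLabelConfig(
        probability_threshold=0.5),
    pre_training_methods="all",
    post_training_methods="all",
)

# Explainability: SHAP values against a baseline dataset stored on S3.
shap_config = clarify.SHAPConfig(
    baseline="s3://my-bucket/adult/baseline.csv",           # placeholder
    num_samples=100,
    agg_method="mean_abs",
)
clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
```

Each `run_*` call launches a SageMaker Processing job and writes its report (including CI, DPL, disparate impact, and SHAP values) to the configured S3 output path.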
The most important features are capital gain, age, and relationship. Individual SHAP values show how specific feature values impact the predicted outcome. These values can be visualized using the SHAP library to understand how the model works. Finally, you can use SageMaker Model Monitor to track bias metrics over time in production, helping you detect drift and changes in data properties. That's pretty much it. Here are some resources:

* Documentation for bias analysis and model explainability
* White papers on bias metrics
* SageMaker Clarify repository
* SHAP repository
* My own repository with the notebook used today

I hope you had a good time, and enjoy the rest of the conference. Bye.
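A quick note on how the global feature ranking above is derived: global importance is typically the mean absolute SHAP value per feature, averaged over all instances. Here is a tiny stdlib-only illustration; the SHAP values are invented, not taken from the actual Clarify report.

```python
# Hypothetical per-instance SHAP values (rows = instances, columns = features);
# real values would come from the Clarify explainability job output.
features = ["Capital Gain", "Age", "Relationship"]
shap_values = [
    [0.30, -0.10, 0.05],
    [-0.25, 0.20, -0.02],
    [0.35, 0.15, 0.01],
]

# Global importance: mean of the absolute SHAP value for each feature.
# Taking absolute values first matters, so that positive and negative
# contributions don't cancel out.
importance = {
    name: sum(abs(row[i]) for row in shap_values) / len(shap_values)
    for i, name in enumerate(features)
}
ranked = sorted(importance, key=importance.get, reverse=True)
print(ranked)
```

With these made-up numbers, capital gain ranks first, mirroring the ordering reported in the session.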

Tags

Machine Learning, Bias, Explainable AI, SageMaker Clarify, Data Imbalance, SHAP Values

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.