Introducing Amazon SageMaker Clarify, Part 1: Bias Detection (AWS re:Invent 2020)

December 08, 2020
In this video, I show you how to use the bias detection capability in Amazon SageMaker Clarify, using bias metrics computed on a credit dataset and on a classification model trained on this dataset. https://aws.amazon.com/sagemaker/ https://aws.amazon.com/blogs/aws/new-amazon-sagemaker-clarify-detects-bias-and-increases-the-transparency-of-machine-learning-models/

Transcript

Hi everybody, this is Julien from Arcee. In this video, I would like to introduce Amazon SageMaker Clarify, a new capability that was just announced at AWS re:Invent. SageMaker Clarify helps you understand if there's any bias in your datasets and models, and it also helps you understand how your models make predictions, using SHAP values. Let's see how this works. We're going to run an example based on the German credit dataset. It's a binary classification problem where we start from customer features and build a model deciding whether a certain customer gets their credit approved or not. We're going to train a model on this dataset using the XGBoost algorithm. Then we will use SageMaker Clarify to analyze both the dataset and the trained model for bias, using a number of bias metrics. Finally, we'll use SHAP values to understand how the model predicts: which features are the most important, and so on. Pretty cool stuff.

You could use SageMaker Clarify to analyze the dataset only, and you would get pre-training bias metrics. Instead, here, I'm going to train a model so that we can run the bias analysis both on the dataset, for pre-training metrics, and on the trained model, for post-training metrics. This will let us see if the algorithm actually reduced bias.

Training the model itself is straightforward. We download the German credit dataset, and it's anonymized and encoded. If you want to know what those A values mean, you can go to the dataset repository. For example, attribute 3 is credit history: A30 means no credits taken, A31 means all credits paid back, and so on. Attribute 4 is the credit purpose. Attribute 5 is a numerical attribute, the credit amount, and so on. We give names to those columns, and the last column is the label column, a one or zero value indicating whether a certain credit has been approved or not. We use one-hot encoding for the categorical columns, separate samples and labels, and then split the dataset for training, validation, and testing.
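As a rough sketch (not shown in the video), the preparation steps above could look like this in pandas, on a tiny synthetic stand-in for the German credit dataset. The column names and values are illustrative, not the real ones:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the German credit dataset; columns are illustrative.
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "CreditHistory": rng.choice(["A30", "A31", "A32"], size=n),
    "Purpose": rng.choice(["A40", "A41", "A42"], size=n),
    "Amount": rng.integers(250, 20000, size=n),
    "ForeignWorker": rng.integers(0, 2, size=n),
    "Label": rng.integers(0, 2, size=n),  # 1 = credit approved
})

# One-hot encode the categorical columns, then separate samples and labels
X = pd.get_dummies(df.drop(columns="Label"), columns=["CreditHistory", "Purpose"])
y = df["Label"]

# Shuffle, then split 70/15/15 into training, validation, and test sets
idx = rng.permutation(n)
train_idx, val_idx, test_idx = np.split(idx, [int(0.7 * n), int(0.85 * n)])
print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```

The three index sets would then be used to write the training, validation, and test CSV files mentioned below.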
We save these three datasets to CSV files, upload them to Amazon S3, and train the model as usual: we grab the container for XGBoost, configure the estimator, set the hyperparameters, define the location of the datasets, and train. We then deploy the model to an endpoint and run some predictions, plotting the ROC curve. The purpose here is just to train a model that we can analyze; nothing here is specific to SageMaker Clarify at all.

Now we can move on to the next step, which is running the analysis. Obviously, we need to know which model we're going to analyze. The way you run this analysis is very simple: you run it as a batch analysis job using a built-in image for SageMaker Clarify. First, we grab the dataset we want to analyze for bias. Then we use a processor object from the SageMaker SDK, passing the name of the SageMaker Clarify image and our infrastructure requirements. On top of this, we need to pass an analysis configuration, where we ask for certain metrics and provide some information. Here's what the file looks like. The dataset type is CSV, and the column headers are defined. We name the two facets we want to look at and the values we should test for: here, we're building bias metrics for data instances that are not foreign workers, and for data instances where 40 is the threshold for age. We also need to pass the label value we're interested in. Remember, this is the label column name, and getting your credit approved means this label is set to one. This configuration tells SageMaker Clarify the facets, values, and label we're interested in. There's also a SHAP baseline here, which is optional; if you pass it, SHAP information will be computed as well. We'll probably zoom in on this in a future video, but it's the baseline you can compute on your golden dataset. Finally, we ask for all pre-training bias metrics and all post-training bias metrics.
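To give an idea of what such a configuration contains, here is a hedged sketch built as a Python dict. The field names follow the Clarify analysis configuration schema as I understand it, and the schema has evolved since this video, so treat the exact keys and the headers as illustrative and check the current documentation:

```python
import json

# Illustrative Clarify analysis configuration; field names and headers are
# assumptions based on the analysis_config.json schema, not verbatim from
# the video.
analysis_config = {
    "dataset_type": "text/csv",
    "headers": ["CreditHistory", "Purpose", "Amount", "Age",
                "ForeignWorker", "Label"],      # hypothetical column names
    "label": "Label",
    "label_values_or_threshold": [1],           # 1 = credit approved
    "facet": [
        # Data instances that are not foreign workers
        {"name_or_index": "ForeignWorker", "value_or_threshold": [0]},
        # Data instances with 40 as the age threshold
        {"name_or_index": "Age", "value_or_threshold": [40]},
    ],
    "methods": {
        "pre_training_bias": {"methods": "all"},
        "post_training_bias": {"methods": "all"},
        # An optional "shap" section with a baseline computed on a golden
        # dataset would also enable SHAP explainability.
    },
}
print(json.dumps(analysis_config, indent=2))
```

This JSON file is what gets passed as an input to the processing job, alongside the dataset itself.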
To compute the post-training metrics, the analysis job automatically deploys a temporary endpoint, runs predictions with it, measures the bias metrics, and then takes the endpoint down, so you don't have to manage any of it yourself. The key things in the configuration are the label value that says, "Yes, your credit is approved," and the two facets we want to compute metrics for, along with the values or thresholds we're interested in. This JSON file is defined as an input for our processing job, and the dataset itself is another input. The output will be a report. We just run this, passing the inputs and the output, and it runs to completion: it analyzes the dataset, deploys the model on a temporary endpoint, computes the post-training metrics, and takes down the endpoint.

Once the job is complete, I have information in S3, and I also see information in SageMaker Studio. If you find your job in the experiment and click on "Open in Trial Details," you'll see two new views: Bias Report and Model Insights, which is model explainability. Let's look at the Bias Report first. We asked about credit approval (label value 1) for the foreign worker facet with a value of 0, so now we can see, with these metrics, whether being a domestic worker helps or hurts your chances. If you want to understand these metrics in detail, you can click on any of them for additional information. You can also read the technical paper and white paper written by Amazon teams, which go into great detail on how these metrics are computed and what they mean.

The first metric is class imbalance. We have far fewer domestic workers than foreign workers. This could be a problem because we only have a thousand samples here: with a 92% imbalance, we only have a handful of domestic workers, which might not be enough for the algorithm to successfully pick up the statistical patterns. Class imbalance is always something to look for. The next metric is the difference in positive proportions in labels (DPL).
This metric tells you whether one group has a significantly larger or smaller proportion of positive labels. Domestic workers have a smaller proportion of positive labels, which could indicate bias. However, it's too early to say: we would need to look at the data instances in detail to see if there's a good reason this group gets more negative answers. We also get post-training metrics, such as the difference in positive proportions in predicted labels (DPPL). This measures whether the model predicts different proportions of positive labels for different groups. The metric is still negative, indicating that domestic workers get a smaller proportion of positive predictions as well. Again, this needs investigation to determine whether it's legitimate or biased.

Looking at age, we compare the group of customers aged 40 to 75 to the rest. There's a slight class imbalance, but it's not as bad as in the domestic versus foreign worker situation. The difference in positive proportions in labels is slightly negative, indicating a slight difference, and the DPPL is about the same, suggesting no big difference here. Overall, the gap between this age group and the other group is not as large as the gap between the domestic worker and foreign worker groups.

As you can see, it's pretty easy to compute bias metrics on datasets and models and to visualize the results. Interpreting these metrics requires more work, and only you can do it: you understand your business context and your dataset, so you have to figure out what these numbers mean in your context, and whether they reveal bias in the dataset or model.
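The metrics discussed above are simple enough to compute by hand. As a rough sketch (not from the video), here is how class imbalance, DPL, and DPPL could be computed on made-up counts and labels; the numbers are purely illustrative:

```python
def class_imbalance(n_adv: int, n_dis: int) -> float:
    # CI = (n_a - n_d) / (n_a + n_d), in [-1, 1]; values near 1 mean the
    # disadvantaged facet is barely represented in the dataset.
    return (n_adv - n_dis) / (n_adv + n_dis)

def label_proportion_difference(labels, facet):
    # DPL when applied to observed labels, DPPL when applied to model
    # predictions: positive rate of the advantaged group minus that of
    # the facet of interest.
    adv = [lab for lab, f in zip(labels, facet) if not f]
    dis = [lab for lab, f in zip(labels, facet) if f]
    return sum(adv) / len(adv) - sum(dis) / len(dis)

# Illustrative counts: 963 vs. 37 samples out of 1000 would give the
# roughly 92% class imbalance mentioned above.
print(round(class_imbalance(963, 37), 3))  # 0.926

# Tiny synthetic example: 10 observed labels and a facet membership mask
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
facet = [False, False, False, False, False, True, True, True, True, True]
print(round(label_proportion_difference(labels, facet), 3))  # 0.6

# The same function applied to model predictions gives DPPL
preds = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(round(label_proportion_difference(preds, facet), 3))  # 0.8
```

Note the sign convention here (advantaged group minus facet) is flipped relative to the negative values discussed above, where the facet of interest came first; only the magnitude and direction of the gap matter.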

Tags

Amazon SageMaker Clarify, Bias Analysis, Machine Learning Model Evaluation, SHAP Values, German Credit Dataset

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.