SageMaker Fridays Season 2 Episode 4 Credit Application Explainability October 2020
October 31, 2020
Broadcast live on October 30, 2020. Join us for more episodes at https://amazonsagemakerfridays.splashthat.com/
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️
This project provides an end-to-end solution to score credit applications, and to explain the decisions made by the model. We first build a synthetic dataset with Faker and AWS Glue. Then, we use SageMaker to train a LightGBM model with Scikit-Learn. Finally, we deploy the model to a real-time endpoint, and we use SHAP to explain predictions.
Transcript
All right, good morning, good afternoon, good evening, everyone. It's Julien and Ségolène, and we're back for episode four of SageMaker Fridays. We have a huge change this week. As you can see, we are not in the Paris office. We are in separate locations because we are in lockdown again. Yes, but this is not going to stop us from meeting all of you out there and helping you with machine learning, right? So, I'll just introduce myself again, but you know us by now. I'm Julien. I'm a principal developer advocate working on AI and machine learning. And, of course, we have Ségolène with us. Hello, everyone. My name is Ségolène, and I'm a senior data scientist working with the AWS Machine Learning Solutions Lab. All right, and thank you again for joining us, Ségolène. We're ready for this; it's a bit of a crazy time for everyone here, but we'll manage it.
We are live, as you can guess, and please ask us all your questions. We have machine learning specialists waiting for your questions, and don't be shy. I keep saying this, but don't be shy. Ask all your questions and make sure you learn as much as possible. Our team will help you out, and thanks to everyone who's helping us with this session today. We appreciate it. So let's get started now.
In previous episodes, we talked about predictive maintenance, demand forecasting, and last week we did fraud detection. This week, we're going to do something a little different. Of course, we're going to look at a business problem, which is credit decisions. But we're also going to focus on explaining how the model predicts. The whole explainability topic is a really big one. So, Ségolène, can you tell us a little bit about the content today? What are we going to learn?
So, today we are going to work on a very interesting real-world use case. Let's say you want to apply for credit at your bank to buy a new dishwasher because yours is at the end of its life. Oh yeah, it broke. All right. And based on some of your personal data, the bank will approve or reject your application. But you can be very frustrated if the bank denies it without giving you any explanation. You might even say that the decision is quite arbitrary, and you would like to know what kind of factors had a negative impact on your application. This is what we are going to do today. We're going to first build a classifier to approve or reject credit applications based on personal data. Then, we will understand how the model predicts the response by introducing some explainability into the results of our ML model.
Okay, that's nice. Model explainability is the degree to which humans can understand the cause of decisions made by machine learning models. After this episode, you won't say anymore that ML models are just black boxes full of witchery. So, when you say humans can understand machine learning models, you mean real humans, right? Not just data scientists. This is great. We really need to crack the mystery of machine learning predictions. It's a big concern for a lot of companies, a lot of customers, especially if you work in highly regulated industries like financial services or healthcare. You have to be able to explain why you came up with the answer. So, tell us a little more about the tools we're going to use today, and then we'll discuss the use case in more detail.
So, today we are going to build a classification model based on a dataset containing features that describe a credit application and its applicants. In order to do so, we are not going to use the famous XGBoost, which we already mentioned last week, but another gradient boosting model, LightGBM, and we are going to use Scikit-learn on Amazon SageMaker to do so. Then, we will explain the results of our model using SHAP. In a few words, it will help us understand which features have the most impact on the bank's decision.
Okay, that's really good stuff. SHAP, or SHAPE, or however you want to pronounce it, is a very French term, but let's go with SHAP. It's one of those techniques for model explainability, and I think SHAP is a really interesting one. I'm really curious what we're going to learn here. We're going to go pretty deep again. We can't help it; we love it. We hope you love it, too. So, get some coffee, get anything you need to keep you awake in the next hour. We're going to get ready. As usual, all of it is online. Let me show you the repo we're going to use. Here it is. Hopefully, you can see my screen. So, it's another great repository built by our machine learning teams. You can find all of these under the AWS Labs repo, and this one is called SageMaker Explaining Credit Decisions. This one is particularly interesting because, as we will see later, it uses additional AWS services. It's not just training and deploying; it also generates a dataset and uses a bunch of extra services in the process. We'll talk a little bit about that, but it's a very, very good one. I really, really recommend it.
So, we're going to dive into this, but as always, our focus is not just clicking through the notebook. We want to understand why we're even doing this. So, let's discuss the problem and how we're going to solve it. Yes, it is a classification problem. We already discussed this in the fraud detection episode, and we use binary classification. It's kind of a similar problem, right? Binary classification, yes or no. Does your credit get approved or not? We'll look at this, of course, but it's not so different from what we've done before. I think we should focus on the explainability problem. But, of course, we have to explain what explainability means, right? And Ségolène, help us! Big question!
So, let's start with a simple linear regression model. It's quite straightforward to read the model outputs. The sign and magnitude of each coefficient indicate the direction and strength of the relationship between an input variable and the predicted response. The same goes for a single decision tree. The model is highly interpretable, and my grandmother can read it. But most of the time, a single decision tree is not enough to model complex tasks on top of huge datasets. So, you need to use advanced models with a lot of parameters, and you can easily get lost when you read the model outputs. But the good news is that nowadays many methods exist for formulating explanations from complex models that are interpretable and faithful. As a result, the explanation will give you a way to understand the relationships and patterns learned by the machine learning model. This means more confidence in the reliability and robustness of the model, of course, for real-world deployments. It is crucial for building trust in the system.
Yeah, I see what you mean. Basic models are easy to understand, but when you move to bigger things and deep learning, it's just impossible to understand what's going on. It's a major concern for customers, and sometimes business and legal and even human stakes can be very high. So, credit application, of course, but I guess energy production, healthcare, pharmaceuticals. If you find a cure for something or use machine learning to predict whether someone has a condition, you want accuracy, but you also want to explain why you came to this conclusion before you can act on it. So, it's really, really critical to understand how these models work.
Okay, we understand the problem a little bit more, and let's take a look at the dataset. Here, it's a little bit different. Let me show you what the data looks like. Okay, zooming in. It's JSON. I know Ségolène is a big fan of JSON, like everyone else. Yes, absolutely. So, where does this come from? This is a synthetic dataset, built on purpose for this example. When we run the example in this repository, we start from a CloudFormation template. You can launch the CloudFormation template here and build a stack, and it's going to create a whole bunch of different things. When it comes to the dataset, let's focus on those icons on the left. The first thing it's going to create is a Lambda function. This piece of code running on managed infrastructure is going to generate the synthetic data using a library called Faker. You'll find the URL on the last slide. So, Faker lets us generate addresses, personal information, and so on. We generate three datasets: one credit dataset, one people dataset, and one contacts dataset. Faker is a very cool library. If you're not familiar with it, it's nice. So, we have these fake files in S3. Then, we run a Glue job to generate the actual dataset. This is the data written back to S3 as a training set, a validation set, etc.
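To make this more concrete, here is a minimal sketch of what a Faker-based generator could look like. The field names, value ranges, and output file are illustrative assumptions, not the actual Lambda code from the repository.

```python
# Hypothetical sketch of generating fake credit applicants with Faker.
# Field names and ranges are made up for illustration.
import json
import random
from faker import Faker

fake = Faker()

def generate_applicant(applicant_id):
    return {
        "applicant_id": applicant_id,
        "name": fake.name(),
        "address": fake.address().replace("\n", ", "),
        "email": fake.email(),
        "employment_duration": random.randint(0, 30),
        "credit_amount": random.randint(500, 20000),
    }

# Write 1,000 fake applicants as JSON Lines, ready to be uploaded to S3
with open("applicants.jsonl", "w") as f:
    for i in range(1000):
        f.write(json.dumps(generate_applicant(i)) + "\n")
```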
And, why did I click on this? Sorry. Maybe I should show you what Faker is. Here it is. Okay, so very simple library to generate fake datasets. Again, all the links will be on the last slide, so don't worry about the resources. Okay, right. So, once we have this data ready, as a training set and test set, etc., we use it on SageMaker as we usually do. Glue is a fascinating topic, but we're not going to go extremely deep on it. I'm just going to show you quickly what we're doing here. So, first, we have a crawler. A crawler is exactly what the name suggests. We fetch data from the raw data written in S3 by the Lambda function. We fetch it, and then we run a job. As you can imagine, all this stuff takes place on fully managed infrastructure. Running the job really means running a script. Glue is based on Spark, so if you like Spark, you can bring your own script. You can see here, basically, we're loading the data. Let me zoom in a bit. It's probably a little hard to read. Then, we join the three JSON files we generated and write everything back to S3.
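The real Glue job uses the awsglue libraries, but since Glue runs Spark under the hood, a plain PySpark sketch gives the idea of the join step. The bucket paths, join key, and column names below are assumptions.

```python
# Illustrative PySpark sketch of the join step performed by the Glue job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("credit-dataset-join").getOrCreate()

# Read the three raw JSON datasets written to S3 by the Lambda function
credits = spark.read.json("s3://my-bucket/raw/credits/")    # hypothetical paths
people = spark.read.json("s3://my-bucket/raw/people/")
contacts = spark.read.json("s3://my-bucket/raw/contacts/")

# Join them on a shared applicant identifier (hypothetical key name)
dataset = (credits
           .join(people, on="applicant_id", how="inner")
           .join(contacts, on="applicant_id", how="inner"))

# Write the joined dataset back to S3 as JSON Lines
dataset.write.mode("overwrite").json("s3://my-bucket/processed/dataset/")
```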
So, we get JSON lines with numbers and features. Each feature belongs to a certain category. We have a credit category, finance category, employment category, etc. And within those categories, you have the actual features: employment type, employment permit, checking accounts, checking balance, etc. This is what the data looks like. No one would actually use real-life, non-anonymized data for this, right? So, if you ask your bank for a dataset, they'll probably say no. This is a good option. The last thing I want to mention is that we have a schema as well. It's generated by the script, so we have the type and description for each feature, which is useful when we're going to do feature engineering and feature processing later on. We have about 1,000 credit applications in this case. Not a big dataset, but we're just trying to understand how this model predicts. So, toy example, not a lot of data needed.
Okay, I think we're good on the dataset. Now, let's talk about the algorithm. So, Ségolène, you mentioned the LightGBM algorithm. Yeah, the new one. I hope it's not going to be a light topic, but I'll do my best. Can you explain LightGBM?
Sure. For people interested in this algorithm, please have a look at the paper; it's super well explained. So, what is LightGBM? It stands for light gradient boosting machine. It's a gradient boosting framework that uses tree-based learning algorithms. We are going to use it through Scikit-learn. This algorithm is designed to be distributed and efficient. It can be used for regression, binary classification, multi-class classification, etc. But let me explain what it means for real-life data. It handles pretty much the same problems as XGBoost, right? Yeah, exactly, like classification. But depending on the data you have, the amount of data, sometimes you need to change algorithms or compare the performance of different algorithms. LightGBM can be a good candidate in some circumstances. LightGBM is a newer implementation of the well-known family of gradient boosting decision tree algorithms, GBDT, which we quickly explained last week when we discussed XGBoost. In summary, GBDT algorithms build an ensemble of tree-based models where each tree tries to fix the prediction mistakes made by the previous trees.
The new thing with LightGBM is that this algorithm adds two new techniques that address some potential weaknesses of the GBDT implementation, especially when applied to large datasets, and especially if these large datasets are very sparse. The main problem is that in order to build the prediction trees, this implementation of GBDT typically scans all the data points to estimate the information gain for all possible split points when the trees are built. As a consequence, the computational complexity grows quickly with the number of features and data points. This is why implementing traditional GBDT on large datasets can be very time-consuming. So, you actually spend lots of time in the training process exploring all possible splits on all possible data points.
Now, you mentioned two techniques. What are the names of those two techniques? It's not scary. The first one is called GOSS, which means Gradient-Based One-Side Sampling, and the second one is called EFB, Exclusive Feature Bundling. The intuition is quite simple because you want to filter a little bit your data, and these two techniques act like filters. GOSS, Gradient-Based One-Side Sampling, is an effective sampling technique that helps reduce the number of data points. EFB, Exclusive Feature Bundling, reduces the number of features by grouping mutually exclusive features, treating them as a single feature. The idea is to work both on the number of data points and the number of features.
Thanks to this combination, the authors of the LightGBM algorithm show that this algorithm can significantly outperform XGBoost and stochastic gradient boosting in terms of computational speed and memory consumption. I understand the intuition. Reduce the number of data points and combine features to reduce the number of dimensions to the problem. Exactly. This should help with training speed, and it should go faster. What about accuracy? Do we also get an accuracy improvement compared to other implementations?
You can get a little more accuracy too, but the main idea is to improve both the speed of training and the accuracy. GBDT implementations typically grow trees with a level-wise strategy. In contrast, LightGBM uses a leaf-wise tree growth strategy. We have a slide to explain this. So, let me show that slide. Level-wise, we're stacking levels, right? We're stacking layers on the tree. LightGBM grows trees in the other direction. There's a lot of math associated with this, and you can read a great summary in the LightGBM documentation and the research paper.
So, we understand LightGBM a little better. It's a very versatile algorithm, just like XGBoost, but it uses fancy techniques to speed up training and, in some cases, achieve higher accuracy. We're going to use an existing implementation with Scikit-learn, as you mentioned. Let's take a look at the training script and start looking at some code. We're going to focus on the most important bits. This is a Scikit-learn script that we're going to run inside a Scikit-learn container on SageMaker. We're building a custom container because we need to have LightGBM and SHAP installed.
Let's quickly look at the training function. It's pretty simple because it's Scikit-learn, and it's always very neat. First, we load the data, which is organized into different datasets. We have the training data, testing data, training labels, and testing labels. We load these from locations passed to the script by SageMaker. This is called script mode, a simple way to run your framework code on SageMaker, receiving hyperparameters and dataset locations on the command line. We read them as command line arguments. Then, we have very simple preprocessing. It converts all numerical values to 32-bit floating points and one-hot encodes categorical variables. It's a five-line function.
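Here is a minimal sketch of that script-mode pattern. The hyperparameter and channel names are illustrative; the repository's script defines its own.

```python
# Minimal sketch of SageMaker script-mode argument handling.
import argparse
import os

def parse_args():
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments
    parser.add_argument("--max-depth", type=int, default=4)
    parser.add_argument("--num-leaves", type=int, default=16)
    parser.add_argument("--n-estimators", type=int, default=100)
    # Channel locations are exposed as SM_CHANNEL_<NAME> environment variables
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    # The model must be written under SM_MODEL_DIR so SageMaker uploads it to S3
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    # ... load the datasets from args.train and args.test, then train ...
```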
Then, we create the actual classifier. We have some parameters here. Max depth is how deep the tree can grow, which is super important to fight overfitting, especially when your dataset is small. Number of leaves is the maximum number of leaves on the tree; keep the tree small enough to avoid overfitting. Number of estimators is how many trees we're building. Min child samples is the minimum number of samples in one leaf, another parameter to play with to avoid overfitting. Boosting type is how the trees are actually built. The default is GBDT, but in our case, we use DART, which stands for Dropouts meet Multiple Additive Regression Trees. Depending on the data, you can play with this hyperparameter.
We use a standard Scikit-learn pipeline to combine the preprocessing and training steps. We train the pipeline on the training set and run the test set through the pipeline to get some metrics, and then we save the model in joblib format in the location passed as a command line argument. This is really vanilla Scikit-learn. The only difference is that we read hyperparameters, dataset, and model path on the command line and use them inside the script. If you have existing Scikit-learn code, it's very easy to adapt it and make it run on SageMaker.
So, what metric will we use for this model? LightGBM supports lots of different metrics, but in our case, we are going to use AUC, the area under the curve, which is a good metric for binary classifiers. Our focus today is not to get the maximum accuracy but to train a model that we can explain.
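Putting those pieces together, a condensed sketch of the training logic could look like this. The column names, hyperparameter values, and step names are assumptions, and the train/test dataframes and args are assumed to come from the loading and parsing code above.

```python
# Condensed sketch of the training step: one-hot encode categoricals,
# train a DART-boosted LightGBM classifier, report AUC, save with joblib.
import os
import joblib
import lightgbm as lgb
from sklearn.compose import ColumnTransformer
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_columns = ["employment__type", "contact__country"]  # hypothetical names

preprocessor = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore", sparse=False), categorical_columns)],
    remainder="passthrough",
)

classifier = lgb.LGBMClassifier(
    boosting_type="dart",        # Dropouts meet Multiple Additive Regression Trees
    max_depth=args.max_depth,    # limit depth to fight overfitting on a small dataset
    num_leaves=args.num_leaves,  # cap the number of leaves per tree
    n_estimators=args.n_estimators,
    min_child_samples=10,        # minimum number of samples per leaf
)

pipeline = Pipeline([("preprocess", preprocessor), ("classifier", classifier)])
pipeline.fit(X_train, y_train)

# Evaluate with AUC on the held-out test set
auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
print(f"test AUC: {auc:.3f}")

# Save the full pipeline so the endpoint can reload it with joblib
joblib.dump(pipeline, os.path.join(args.model_dir, "model.joblib"))
```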
We have a dataset, an algorithm, and training code. Let's put all these pieces together and train the model. Let me switch to the training notebook. This example is broken into a number of notebooks. The first notebook is dataset generation, starting from the Lambda-generated data with Faker, running the Glue job, and ending up with the JSON file and schema I showed you. Part two is training. We need to train, and we mentioned we would create a custom container. We showed this in a previous episode, but this example does exactly that. Instead of running AWS CLI calls in the notebook, it uses another AWS service called CodeBuild. CodeBuild is a managed service that lets you build pretty much anything. It's completely agnostic with respect to the language and technology you're using.
We start with a project. You get all the logging and lifecycle information you could want. You can trigger builds, see metrics, and get lots of stuff. Most importantly, you see all the details, and it comes down to the build script you pass. This is a YAML file called a build specification. Here's the one we're using. It's pretty simple, with two phases: install dependencies and build. But you can have more steps. The commands we use here are the exact same ones we would run in the notebook: we log into our ECR registry, build the image, tag it, and push it.
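If you prefer to kick off that build from Python rather than the console, a hedged sketch with boto3 might look like this; the project name below is a placeholder for whatever the CloudFormation stack created.

```python
# Hypothetical sketch: trigger the CodeBuild project that builds and pushes the container.
import time
import boto3

codebuild = boto3.client("codebuild")

# Start a build of the project (placeholder name)
build_id = codebuild.start_build(projectName="explaining-credit-decisions-containers")["build"]["id"]

# Poll until the build leaves the IN_PROGRESS state
while True:
    status = codebuild.batch_get_builds(ids=[build_id])["builds"][0]["buildStatus"]
    if status != "IN_PROGRESS":
        print("Build finished with status:", status)
        break
    time.sleep(15)
```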
If you've never used CodeBuild and find yourself building containers a lot, CodeBuild is a really good way to do it on managed infrastructure, simplify everything, and automate everything. It's fully integrated with CodeDeploy. If we were pushing this container to a cluster, like ECS or EKS, we could build pipelines to automate everything. All these tools make sense here as well.
This is how we build our container. This is the script. Pretty simple. Now, let's go back to the notebook. Here, we see the output of that container. CodeBuild succeeded and pushed the container to Amazon ECR. We're back into SageMaker territory. We use the Scikit-learn estimator from the SageMaker SDK. We pass the name of the image we just built, the script, and the entry point for training, which calls the train function I showed you before. Hyperparameters, infrastructure, and all the paths for the model. There's nothing really new here. The only thing is some extra parameters like source_dir and dependencies because we want to inject extra source files and dependencies. We have a homemade package we want to load in the script. We can use these estimator parameters to pass them to the entry point.
We create the estimator and call fit, passing the different channels. In all the examples we've used, we only had one or two channels, like training and test. People ask if we can have more, and the answer is yes. Here, we can see five channels: training data, test data, training labels, test labels, and the schema. The script can understand which type each feature has for simple processing. You can pass multiple channels, and if your training data is broken into different pieces all over S3, you can have training channel one, training channel two, etc.
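As a rough sketch, the estimator setup and the multi-channel fit call could look like this. The image URI, script names, channel names, and S3 locations are illustrative, and role is assumed to be the notebook's execution role.

```python
# Sketch of a Scikit-learn estimator using the custom image and five input channels.
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="train.py",                # calls the train() function shown earlier
    source_dir="src",                      # extra source files packaged with the script
    dependencies=["my_package"],           # a local package injected into the container
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/lightgbm-shap:latest",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    hyperparameters={"max-depth": 4, "num-leaves": 16, "n-estimators": 100},
)

# Five channels: data, labels, and the schema generated by the Glue job
estimator.fit({
    "train": train_data_s3_uri,
    "train_labels": train_labels_s3_uri,
    "test": test_data_s3_uri,
    "test_labels": test_labels_s3_uri,
    "schema": schema_s3_uri,
})
```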
We call fit, and it's business as usual. We train for 75 seconds. We didn't set up spot training here, but we could save some money by doing that. We showed you this in a previous example, so go and look at those. Ségolène, you told us we would use the area under the curve, and we get almost 80%. Is that okay? Super okay. If you're happy, I'm happy. The metrics are relative, so you need to create a baseline and try different hyperparameters to improve accuracy. It depends on the data, and here, it's not so bad. For a first attempt, 80% is pretty good.
We have a model in S3, and now we can deploy it. This time, we're going to deploy to a real-time endpoint and use a custom script with SHAP to return feature importance. We're almost at the end of the workflow. Data generation, training, deploying, and predicting. It's a really fun and efficient way to combine all these services.
We looked at endpoints before, so we're going to look at predictors, but it's nothing new. Let's focus on SHAP. It's a very popular topic but a bit mysterious. If it's the first time you hear about it, it sounds a little strange. So, Ségolène, we need your help once again. What is SHAP, and how does it help? Why is this a good tool for explainability?
SHAP stands for SHapley Additive exPlanations. It relates to a game theory concept called Shapley values, used to create explanations. In game theory, the Shapley value describes the marginal contribution of each player when considering all possible coalitions. The idea is to capture each player's weight across the set of all possible coalitions. It came from game theory but is used here in a machine learning context. The Shapley value describes the marginal contribution of each feature when considering all possible sets of features. The additive part of the name means these Shapley values can be summed together to give the final model prediction. In a nutshell, the contribution of each feature is a positive or negative number that either increases or decreases the predicted output.
In our case, the output is a probability. It's a yes or no problem, but it's a 0 or 1 probability on your credit risk. Zero means you're a super safe customer, and 100% means you will certainly default on your credit, which is not good. SHAP will say which feature and the value of that feature for a specific customer is increasing or decreasing the predicted credit default.
So, the output is a probability, and SHAP will tell us how each feature contributes to that probability. For each individual sample, we can see the values for each feature. Once you train your model, you can see for each individual data point how each feature contributes to the final prediction.
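In code, computing these contributions takes only a few lines with the shap library. A minimal sketch, assuming classifier is the fitted LightGBM model and X_sample is a batch of preprocessed features:

```python
import shap

# TreeExplainer works directly on tree ensembles such as a fitted LGBMClassifier
explainer = shap.TreeExplainer(classifier)

# Depending on the SHAP version, binary classifiers return either a single array
# or a list with one array per class; keep the positive class in the latter case.
shap_values = explainer.shap_values(X_sample)
if isinstance(shap_values, list):
    shap_values = shap_values[1]

# Additivity: base value + sum of per-feature contributions = model output
# (in log-odds space for LightGBM, before the sigmoid)
print("per-feature contributions for the first sample:", shap_values[0])
```

Helpers such as shap.force_plot and shap.summary_plot can then turn these values into visualizations like the ones you'll see in the notebook.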
Let's go to the prediction notebook now. We're getting the model we just trained. If all of this were running inside the same notebook, we would just call estimator.deploy in the previous notebook. Here, we're running this in a different notebook, so we don't have access to that estimator object. We have to rebuild that context using the Scikit-learn model object from the SageMaker SDK, which references the trained model artifact and gives us an object we can call deploy on. This is a good technique because you might train on Monday and deploy on Tuesday or Wednesday.
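A minimal sketch of that pattern, with a placeholder S3 path, image URI, and entry point:

```python
# Recreate a deployable model object from the training artifact, then deploy it.
from sagemaker.sklearn.model import SKLearnModel

model = SKLearnModel(
    model_data="s3://my-bucket/output/model.tar.gz",  # artifact produced by the training job
    role=role,
    entry_point="predict.py",                         # inference script with model_fn / predict_fn
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/lightgbm-shap:latest",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
)
```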
We deploy as usual on a C5 instance. We wait for a few minutes and have an endpoint. Now, we're ready to predict. We're using a custom container for this. Let me show you the prediction script. The model function is responsible for loading the model. When the endpoint comes up, SageMaker calls the model_fn function to load the model. We load everything and instantiate it. We have a model ready to go.
There's a little preprocessing to ensure the incoming data is in a format LightGBM can predict with. Then, there's the predict function, which is a bit complicated because it accounts for different scenarios. We predict the probabilities on the features and output all the SHAP information. Most of this is vanilla SHAP code, but it looks a bit complicated because we account for different scenarios. You can simplify this for your own example. The SHAP repo has some examples, and you can run those first.
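Here is a simplified sketch of that inference-script pattern; the real script in the repository handles more input and output formats, and the step names below follow the training sketch above rather than the actual code.

```python
# Simplified sketch of the SageMaker inference script: model_fn loads the pipeline,
# predict_fn returns the default probability plus SHAP contributions.
import os
import joblib
import shap

def model_fn(model_dir):
    # Called once when the endpoint starts
    return joblib.load(os.path.join(model_dir, "model.joblib"))

def predict_fn(input_data, pipeline):
    # Apply the same preprocessing used at training time
    features = pipeline.named_steps["preprocess"].transform(input_data)
    classifier = pipeline.named_steps["classifier"]

    # Predicted probability of credit default for each sample
    probabilities = classifier.predict_proba(features)[:, 1]

    # Per-feature SHAP contributions for each sample
    shap_values = shap.TreeExplainer(classifier).shap_values(features)
    if isinstance(shap_values, list):
        shap_values = shap_values[1]

    return {"probabilities": probabilities.tolist(),
            "shap_values": shap_values.tolist()}
```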
Let's take a sample here. Here's a sample customer. We see all those features. We predict by pushing the data to the HTTP endpoint we created. We see a default risk of 27.62%. If you say no to my dishwasher credit and tell me there's a 27% chance I'll default, that's not working for me. Plus, it doesn't tell the whole story. This is how SHAP is actually working. There's some nice visualization here.
What do we see here? 27% is the actual output. The red and green bars are based on SHAP scores. Depending on the score and the contribution of each feature on the output of your model, you can see which one has the most impact on the bank's decision. The features in the employment category seem to increase the risk pretty significantly, like plus five or plus six percent, while the finance features reduce the risk. For this guy, employment duration is zero, which is not reassuring for credit. On the other hand, finance has a positive impact.
We can see the detailed explanation for each individual feature. For example, employment duration zero is a very negative feature. The banker could tell me, "Sorry, you have no employment history. I won't lend you the money." It's a reasonable decision. If I challenge the decision, they could say, "It's our policy to say no in this situation."
What's really interesting is that ML models are seen as black boxes, but thanks to SHAP values, you can understand the decision made by the algorithm. You can visualize it, which is good. Here, I just printed out the actual values. You can see the feature name and the SHAP value. As a last example, let's change one feature. This customer doesn't have a checking account. Now, let's change the feature and say the customer does have a checking account, but the balance is negative.
If we predict again, the credit risk is now 50%. This feature, the negative checking balance, is not helping at all. Everything looks bad here. If we did this for thousands of customers and averaged everything, we could see feature importance. We could say the feature that contributes most negatively is employment duration. If your employment duration is zero or less than 12, it's very negative for your credit approval.
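As a hypothetical what-if sketch, you could tweak a single feature and send both versions to the endpoint; the feature names, payload format, and response fields below assume the JSON-style inference script sketched earlier, not the repository's actual schema.

```python
# Hypothetical what-if check: flip one feature and compare the predicted risk.
import copy
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor.serializer = JSONSerializer()      # assuming the script accepts JSON
predictor.deserializer = JSONDeserializer()  # and returns JSON

sample = {
    "employment__duration": 0,               # hypothetical feature names
    "finance__accounts__checking__balance": 2500,
    # ... the rest of the applicant's features ...
}

what_if = copy.deepcopy(sample)
what_if["finance__accounts__checking__balance"] = -500   # negative balance scenario

for payload in (sample, what_if):
    response = predictor.predict(payload)
    print(payload["finance__accounts__checking__balance"],
          "-> predicted default risk:", response["probabilities"][0])
```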
Sometimes, it's also a way to detect potential bias. If age was a feature and you could see that if you're over 45, your credit chance is not as good, even though the risk is identical to a 25-year-old, you could say, "Maybe I'm being treated unfairly." Bias is a different story, but SHAP values are one way to understand what could be going wrong in your model.
I think we're almost out of time. Let me show you the last slide with the resources. All the usual ones, the SageMaker URLs, and here's the repo we used, the Faker information, the LightGBM paper, and the LightGBM and SHAP documentation. Don't forget, re:Invent is coming this year. In a few weeks, it's free, and we hope to see you there. It's going to be a lot of great content, especially SageMaker content.
I mentioned it so many times, it's embarrassing, but I have this SageMaker book out. For a few more weeks, you can get a good discount on the paper edition and the ebook edition. I'll leave it on for a few more seconds so you can take a screenshot. Thank you, everyone, for watching us again today. Thank you, Ségolène, for the invaluable explanations on data science and machine learning. I learned a lot again today.
Next week, we're going to switch topics and talk about natural language processing and a technique called topic modeling, which is pretty cool. It's an unsupervised learning technique to group look-alike or related documents, and we'll use the Amazon reviews dataset, which is very fun to work with. There's going to be a lot of crazy stuff to run again, so make sure to join us next week. Thanks again, Ségolène, and thanks to all our AWS colleagues who helped with this session today. We couldn't do it without you. We hope everyone had a great time and can't wait to see you next week.
Until then, if you're in lockdown, stay safe, be reasonable, don't do anything stupid, take care of your loved ones, read about machine learning, and keep rocking with machine learning. See you next week. Bye-bye.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.