SageMaker Fridays Season 4, Episode 2: Detecting fraudulent insurance claims

August 24, 2021
Broadcast on August 13, 2021. Join us for more episodes at https://pages.awscloud.com/SageMakerFridays

Notebook: https://github.com/aws/amazon-sagemaker-examples/tree/master/end_to_end/fraud_detection

Transcript

Hi everybody, and welcome to this new episode of SageMaker Fridays, season 4. My name is Julien and I'm a principal developer advocate focusing on AI and machine learning. Once again, please meet my co-presenter. Hi everyone, my name is Ségolène, I'm a senior data scientist working with the AWS Machine Learning Solutions Lab. My role is to help customers get their ML projects on the right track so they can create business value as fast as possible. Thank you for being with us once again.

This new season has three parts. The first part is about building models, specifically high-quality models: we look at things like data preparation, feature stores, bias analysis, and more. Today we focus on a financial services use case, fraud detection, which we'll discuss in a few minutes. The second part of the season will revisit some of these use cases with a focus on automation and ops in general. Finally, we'll close the season with a third group of episodes on AutoML, where we'll explore an easier approach to model building and compare the results to our manual efforts to see which one does a better job.

This is the GitHub repository for today's episode. Take a screenshot; we'll show it again at the end of the episode. If you have any questions during the episode, please ask. We have friendly moderators here to help, so make sure you learn as much as possible.

So, Ségolène, what are we doing this week? This week, we are working on a fraud detection use case: we will try to determine whether an auto insurance claim is fraudulent or not. You love fraud detection, right? Yeah, it's a good use case. We will also analyze bias in the dataset and apply mitigation techniques to build a better model, potentially with better accuracy.

Let's look at the pipeline. We have claims for car accidents, and we'll see the dataset in a minute. There's plenty of information on the claims and the customers, and our goal is to determine whether each claim is genuine or fraudulent; in other words, to prevent people from stealing money from the insurance company. This is the pipeline, or architecture, we'll work on. We start with a CSV dataset, which is reasonably simple. We process it using scripts, train a model, and run a bias analysis with SageMaker Clarify to identify potential bias issues. We then work on mitigating those issues and retrain the model. By doing this, we aim to remove potential bias and improve model accuracy.

Let's get started. We'll close this slide and come back to the repo later. I'll keep my demo glasses on. First, let's look at the dataset. The problem is binary classification. The dataset is not very large: 5,000 claims and 5,000 customers, so it's small scale and training will be fast. It's a synthetic dataset, completely fake, with no personal information. If we look at the claims, we see a policy ID, the driver's relationship to the policy owner (self, spouse, child), the type of incident, the location of the collision, the severity, the number of vehicles involved, the number of injuries, police reports, the date, and the fraud label. We see mostly zeros and a few ones, indicating that most people are honest, but some are trying to defraud the insurance company. Now, let's look at the customer dataset. We have the policy ID, the customer's age, how long they have been a customer, the number of claims in the last year, the state of policy registration, the premium, and the gender. Gender is a sensitive attribute that can introduce bias, so we need to be cautious.
We also see missing values, which we'll address. As you can see, it's a simple dataset and a reasonably simple binary classification problem. We have numerical and categorical values, so we need to apply some processing before training. For this, we'll use SageMaker Data Wrangler. If you watched last week's episode, we already covered Data Wrangler, but let's look at it again. It's a simple and intuitive way to work on your data. For example, let's look at the claims data. The first step is to import your data, which is just one click away; here's my data in S3. You can apply data types, and Data Wrangler generally picks the right type for each column, but you may want to double-check.

Then, you have steps. We've already done this previously, so let's look at the transforms we applied. We have quite a few, including custom pandas code, string formatting, encoding, one-hot encoding for categorical variables, and more. Let's look at the string transforms. We see a sample of the dataset and the list of transforms. Data Wrangler offers hundreds of built-in transforms for missing values, outliers, column operations, and so on. If you don't find what you need, you can write custom formulas and custom transforms in pandas, PySpark, or SQL. For example, we removed symbols from the driver relationship column to clean the data, and we did the same for the collision time. We iteratively added transforms to clean and engineer the data. We also applied one-hot encoding and custom pandas transforms, such as replacing education levels with integers, and we created a new column with a timestamp for the engineered features.

Now that we've applied these transformations to both datasets, it's time to process the data, and there are several ways to do this. You can preview the transforms to make sure they work as intended. If you want to run them on the full dataset, you can export your steps: select the steps, click the export button, and you get four options. The basic option applies all steps to your data and saves the result to S3 using a SageMaker Processing job; this generates a Jupyter notebook with all the necessary code, including the inputs and outputs. We upload the inputs to S3 and run the notebook, which processes the data and saves it to S3. You can also export to plain Python code, which is independent of any AWS service; to another notebook that transforms the data and uploads it to SageMaker Feature Store, which is what we did for this example; or to a pipeline built with SageMaker Pipelines, which we'll cover in another episode. In the interest of time, we have processed versions of the dataset in the repo: the processed claims and processed customers files. Feel free to export and run the processing job to get the same files in S3. Let's open one of them to check that the transformations were applied correctly; we should see one-hot encoding and the other transformations reflected in the data.

Now that we've applied reasonable transformations, let's visualize the data. We could visualize the raw data, but it's more useful to visualize the processed data. We load the preprocessed files and plot some visualizations. We see some imbalance, but it's not severe: for example, we have about 70% male and 30% female customers, which is not a huge imbalance. Some problems have much more severe imbalances, like 1 to 100 or 1 to 1,000. Here, we might still want to fix the imbalance, especially if regulations mandate it. For now, let's train the model and see whether it's a problem.
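As an illustration of the kind of transforms described above, here is a small, hedged pandas sketch. The file and column names ("claims.csv", "driver_relationship", "incident_severity") are assumptions for illustration only, since the actual steps were built interactively in Data Wrangler.

```python
import pandas as pd

# Minimal sketch of the kind of cleanup Data Wrangler applied; column names are assumptions.
claims = pd.read_csv("claims.csv")

# Strip stray symbols from a string column, e.g. the driver relationship
claims["driver_relationship"] = (
    claims["driver_relationship"].str.replace(r"[^a-zA-Z ]", "", regex=True).str.strip()
)

# One-hot encode categorical columns
claims = pd.get_dummies(
    claims, columns=["driver_relationship", "incident_severity"], prefix_sep="_"
)

# Add an event-time column, which SageMaker Feature Store will need later
claims["event_time"] = pd.Timestamp.now(tz="UTC").timestamp()

print(claims.head())
```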
We can also plot fraud versus non-fraud. Most people are honest, and only a few percent of the claims are fraudulent, which makes this a more imbalanced problem: the model will have to work hard to detect the few fraudulent cases. We can also plot education levels and claim amounts, and look at fraud by gender. The ratio is similar to the overall gender ratio in the dataset, so there is no obvious bias in fraud by gender. In short, we have processed data with some imbalance on a sensitive attribute. Let's proceed with training and see whether it's a problem. We use the preprocessed files, but if you ran your own SageMaker Data Wrangler flows, you would get the same results.

Now, we want to move the processed data to SageMaker Feature Store, which requires a schema. You can describe the data with an explicit JSON dictionary, or let Feature Store infer the types at ingestion time. We're ingesting from a pandas DataFrame, so either option works; personally, I prefer to be explicit about what gets ingested, but if your DataFrame has the right types, inference works too. We create feature groups, which reflect data sources. Here, we have CSV files, but you could have SQL tables or DynamoDB tables; generally, you have one feature group per data source. We create one for claims and one for customers, give them names, and load the feature definitions. We call create to create the feature groups, which have two important columns: the record ID and the event time. The record ID is the unique identifier for a row in the feature group, and the event time indicates when the feature was created. We use the policy ID as the unique identifier in both datasets. We create the feature groups and wait for them to be created, which is very fast.

Now we're ready to ingest the data. We can run a SageMaker Processing job to ingest the data automatically, or we can ingest it explicitly by running code in the notebook. We call the ingest API, passing the DataFrame, the number of workers, and whether to wait for the operation to complete. If some rows fail to ingest, you get the IDs of the failed rows and can retry. The data is written to S3 in the offline store; there's also an online store for low-latency predictions. We wait for the data to flow and check the bucket for objects. Once the objects are there, we've processed the data, pushed it to the feature groups, and it sits in S3.

Now it's time to train, and first we need to build a dataset. We have data sources, but we might need to do a little more work. Sometimes the dataset is a single CSV file, but here we have two data sources, and we may want to use different features and subsets of the data; this is exactly why the Feature Store is useful. We process everything, push it to feature groups, and then pick what we want. When we create feature groups, a table is automatically created in the AWS Glue Data Catalog, which we can query with Amazon Athena. Let's look at the Athena console. We see the tables for claims and customers, and we can preview the data, which is simply Athena querying the data in S3. There's no extra database or backend; it's just data in S3 with a table and schema in Athena. This is useful for data exploration: we can write queries here and move them back to the notebook once they work. Our query joins claims and customers on the policy ID. We run the query, which returns a DataFrame. We could save the results as a CSV file, but we'll continue in the notebook. We move the fraud column to the front, because XGBoost requires the label to be the first column. We split the data into training and test sets, save them to CSV files, and upload them to S3, where SageMaker expects the data.
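Here is a minimal sketch of what the feature group creation, ingestion, and Athena query can look like with the SageMaker Python SDK. The feature group name, S3 prefixes, file name, and column names (policy_id, event_time) are assumptions for illustration; the episode notebook uses its own names.

```python
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

# Processed claims data; string columns may need the pandas "string" dtype
# before feature definitions can be inferred.
claims_df = pd.read_csv("claims_preprocessed.csv")

claims_fg = FeatureGroup(name="fraud-detect-claims", sagemaker_session=session)
claims_fg.load_feature_definitions(data_frame=claims_df)  # infer schema from dtypes

claims_fg.create(
    s3_uri=f"s3://{bucket}/fraud-detect/feature-store",  # offline store location
    record_identifier_name="policy_id",                  # unique ID per row
    event_time_feature_name="event_time",                # when the feature was created
    role_arn=role,
    enable_online_store=True,
)

# Wait until the feature group is active, then ingest the DataFrame
while claims_fg.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)
claims_fg.ingest(data_frame=claims_df, max_workers=3, wait=True)

# The offline store is queryable with Athena; in the notebook we join claims
# and customers on policy_id, here we just preview the claims table.
query = claims_fg.athena_query()
query.run(
    query_string=f'SELECT * FROM "{query.table_name}" LIMIT 10',
    output_location=f"s3://{bucket}/fraud-detect/athena-results",
)
query.wait()
print(query.as_dataframe().head())
```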
We see the training set with fraud as the first column, followed by all the features. Now we're ready to train. Training is simple: we define the infrastructure requirements and the hyperparameters, and create an XGBoost estimator. XGBoost in SageMaker is available both as a built-in algorithm and as a framework. The built-in algorithm only requires a dataset, while framework mode gives you more flexibility because you pass your own script. We define the script, which receives the hyperparameters on the command line; the rest is standard XGBoost. We configure the estimator with the infrastructure requirements, the hyperparameters, and where to store the model, and then we call fit. This creates a managed instance, loads the XGBoost container, the script, and the dataset, and starts training. The job takes 77 seconds, and the AUC, our metric, is 0.79. Let's remember that number: we trained a model, and it's okay, but not awesome.

Now, let's come back to the gender imbalance. Using SageMaker Clarify, we can run a bias analysis on the dataset. We set up objects for the infrastructure requirements, the dataset configuration, and the bias configuration. We run a pre-training analysis on the dataset and a post-training analysis on the model to see whether bias persists. We look at the customer gender (female) column and compute bias metrics for the not-fraud label: we want to see whether female customers are treated unfairly when being predicted as honest. We run the bias analysis, which computes various metrics, and the report shows both pre-training and post-training results. For class imbalance and most of the other bias metrics, a value of zero means no bias; non-zero values indicate potential issues. The metrics here are not hugely worrying, but some are significant enough to deserve attention. We can try to rebalance the dataset and train again to see whether we get a better model.

The imbalance problem is easy to fix. We have too many men and not enough women, so we can either remove male samples or add female samples. One way to add female samples is to create synthetic ones using SMOTE (Synthetic Minority Over-sampling Technique), which generates new instances of the minority class with similar statistical properties. Initially, we have more men than women; using the imbalanced-learn library, we can resample the data in one line of code. We keep all the male samples and generate enough synthetic female samples to match, which solves the balance problem. We then train again with the same code and script. After training, we see that the AUC has improved significantly, by almost six points. This is interesting because the imbalance was not severe, yet rebalancing improved accuracy: the model gets more samples to learn from, which helps it identify patterns specific to male and female customers. It also suggests that SMOTE is doing a good job; if the synthetic samples were poor, the AUC would be worse. We can run the bias analysis again to see the impact. The class imbalance is now zero, and the other metrics haven't changed much. We might need to investigate further to understand exactly where the extra accuracy comes from, but these metrics are easy to compute and give a good sense of what's happening.

To recap, we started with a binary classification problem, processed two CSV files with Data Wrangler, uploaded them to SageMaker Feature Store, and used Athena to build our datasets. We trained models, evaluated them for accuracy and bias, and used SageMaker Clarify to analyze and mitigate bias. We'll cover more topics in future episodes, such as deployment, model registration, automation, and lineage. Let me show you the repo again if you want to take a screenshot.
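As a companion to the training step described above, here is a minimal sketch of an XGBoost framework-mode estimator with the SageMaker Python SDK. The script name, hyperparameters, framework version, instance type, and S3 paths are assumptions for illustration, not the exact values from the episode notebook.

```python
import sagemaker
from sagemaker.xgboost import XGBoost
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

# Framework mode: we pass our own training script (hypothetical name) that
# parses hyperparameters from the command line and calls xgboost.train().
estimator = XGBoost(
    entry_point="xgboost_fraud.py",
    framework_version="1.2-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,
    output_path=f"s3://{bucket}/fraud-detect/output",
    hyperparameters={
        "max_depth": 5,
        "eta": 0.2,
        "objective": "binary:logistic",
        "num_round": 100,
    },
)

# The label (fraud) must be the first column of the CSV files
train_input = TrainingInput(f"s3://{bucket}/fraud-detect/train/train.csv", content_type="text/csv")
test_input = TrainingInput(f"s3://{bucket}/fraud-detect/test/test.csv", content_type="text/csv")

estimator.fit({"train": train_input, "validation": test_input})
```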
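Similarly, here is a hedged sketch of a pre-training bias analysis with SageMaker Clarify. The label and facet column names ("fraud", "customer_gender_female"), the local file, and the S3 paths are assumptions; the episode notebook also runs a post-training analysis against the trained model, which requires an additional ModelConfig.

```python
import pandas as pd
import sagemaker
from sagemaker import clarify

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

# Local copy of the training set, with the "fraud" label as the first column (assumed)
train_df = pd.read_csv("train.csv")

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

data_config = clarify.DataConfig(
    s3_data_input_path=f"s3://{bucket}/fraud-detect/train/train.csv",
    s3_output_path=f"s3://{bucket}/fraud-detect/clarify-output",
    label="fraud",
    headers=train_df.columns.to_list(),
    dataset_type="text/csv",
)

# Measure bias on the gender facet (women), for the "not fraud" outcome (label value 0)
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[0],
    facet_name="customer_gender_female",
    facet_values_or_threshold=[1],
)

# Pre-training bias looks only at the dataset, not at model predictions
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods=["CI", "DPL"],  # class imbalance, difference in positive proportions in labels
)
```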
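Finally, the SMOTE rebalancing step can be sketched as follows with imbalanced-learn, assuming the processed training set is fully numeric and the gender facet is a one-hot column named customer_gender_female (a hypothetical name).

```python
import pandas as pd
from imblearn.over_sampling import SMOTE

# Hypothetical local copy of the processed, fully numeric training set
train_df = pd.read_csv("train.csv")

# Here the "class" we balance is the gender column, not the fraud label:
# SMOTE keeps all majority samples (men) and synthesizes minority samples
# (women) until both groups are the same size.
gender = train_df["customer_gender_female"]
print(gender.value_counts())

smote = SMOTE(random_state=42)
train_balanced, gender_balanced = smote.fit_resample(train_df, gender)

print(pd.Series(gender_balanced).value_counts())  # now balanced 50/50
```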
Go and run this, have fun, ask your questions, and get in touch. Ségolène, thanks again for your help in preparing and delivering this episode. We'll see you next week with another example, on computer vision for cancer detection. It's a serious topic, but an important one, and we'll see how machine learning can help with image datasets. Hopefully, we'll see you there on August 20th. Ségolène, thanks again. Bye-bye. Bye-bye, everybody. Have a good week. See you soon. Cheers.

Tags

MachineLearning, FraudDetection, SageMaker, DataWrangler, BiasAnalysis