SageMaker Fridays Season 2, Episode 3: Fraud Detection (October 2020)

October 24, 2020
Broadcast live on 23/10/2020. Join us for more episodes at https://amazonsagemakerfridays.splashthat.com/ ⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️ This project provides an end-to-end solution for fraud detection, using a combination of unsupervised learning with Random Cut Forest and supervised learning with XGBoost. We also discuss imbalanced datasets and how to handle them.

Transcript

Welcome to episode three of this new season of SageMaker Fridays. My name is Julien, and I'm a principal developer advocate focusing on AI and machine learning. Please meet my co-presenter today. Hi everyone. My name is Ségolène, and I am a senior data scientist working with the AWS Machine Learning Solutions Lab. Great. It's nice to have you on board to help us understand machine learning. So, everybody watching this, let me remind you that episodes are live. Feel free to ask all your questions in the chat. We have friendly moderators to answer everything. Okay? And remember, don't be shy. There are no silly questions. Make sure you learn as much as possible, which is really our main purpose today. Okay, so let's get started with this new episode. In the first two episodes, we discussed predictive maintenance and demand forecasting, and we dived pretty deep into time series and LSTMs. I hope you all recovered. That was quite fun, and it's time to talk about something else. This week, we're going to explore another very important topic and a popular use case for machine learning: fraud detection. So, Ségolène, can you introduce that topic for us? It's a big one too. Fraud is a very serious problem that can cost businesses billions of dollars annually and damage customer trust. Many companies use a rule-based approach to detect fraudulent activity, where fraud patterns are defined as rules. However, implementing and maintaining a rule-based system can be very complex and time-consuming, because fraud is constantly evolving and rules can only capture known fraud patterns. This can lead to false positives and false negatives. The idea is to better understand fraud, how it happens, and manage the risk. Today, we will explore how machine learning can help. Yeah, I can see why every company selling goods or services online should worry about fraud. Fraudsters are extremely creative, and new types of fraud can be amazingly clever, making it very difficult to keep track and come up with business rules that detect everything. So, machine learning to the rescue. I'm sure we can use various algorithms for this problem. What are we going to rely on today? Today, we are going to work with an interesting dataset: a publicly available, anonymized credit card transaction dataset. We will use a combination of different ML techniques and algorithms. We will combine unsupervised and supervised machine learning and see how the Random Cut Forest and XGBoost models can be very complementary, creating a robust framework for fraud detection. Oh, so two algorithms? Yes, two algorithms. One is not enough. But don't worry, we are going to use built-in algorithms in SageMaker. So, what's a built-in algorithm? Amazon SageMaker provides implementations of the most common machine learning algorithms, which you can use for a variety of problem types. These implementations are already optimized: they minimize data transfer between instances and make effective use of GPUs when needed. They are highly scalable, especially compared to stock open-source implementations, saving you time and money. That's a relief, because working with one algorithm is already a bit of work, and two would be too much for me. I love these built-in algos because they are off the shelf. With very little code, we can train and deploy models. They are great because even if you're not familiar with machine learning algorithms, you can find one that works well for your problem and get to work.
By saving time, you can focus on exploring, understanding, and processing data, which is often the key to great results. We won't dive into algorithm code, but we will explain them a bit. Stay tuned for that. Get some coffee, anything to keep you awake for the next 45 or 50 minutes. This will be another fun episode, and we hope you learn a lot. We're using a GitHub repository again. Let me share my screen and show you what we're going to use. You can find it on GitHub, of course, under AWS Labs: fraud detection using machine learning. You can clone this and work with the notebook in the source repository. There's also a CloudFormation template if you want to create all the AWS resources, a notebook instance, and so on, to run your notebook, as well as some automation for creating a Lambda function, etc. In the interest of time, we'll focus on the machine learning part, but there's a lot to study here, like feeding data to the models and predicting with them. Before we dive into the code, let's discuss the machine learning problem and how we're going to solve it. We mentioned we would be working on fraud detection with credit card data. Can you explain the problem a bit more? The idea is that you have a lot of historical credit card transactions. Most of the time, you have big files, and you want to detect fraudulent transactions to prevent issues for customers. The data is anonymized, so we won't see the exact features, but we can imagine features like the time of day, day of the week, transaction amount, location, and customer ID. Before we look at the data: fraud detection is a wide topic. Here, we're looking at credit cards, but fraud comes in many forms. For example, building fake domains for phishing purposes, trying to fool customers into believing they are legitimate domains. We have some cool use cases from customers, like Euler Hermes and Infoblox, which use natural language processing techniques to detect fake domains. We'll include links in the final slide. Financial services and credit card companies like NuData Security (part of Mastercard) and Coinbase also use AWS to build fraud detection solutions. In the telecom industry, fraud detection is popular, trying to identify fraudulent calls that trick people into calling expensive numbers. We did a session with Lebara, a telecom operator, where they identified these schemes, which are amazingly clever and really require machine learning. Why don't we use statistical models for this? Most fraud detection is still handled by rule-based processes, which are hard to maintain. ML models don't use predefined rules to determine whether activity is fraudulent. Instead, they are trained to recognize fraud patterns in data, and they are self-learning, adapting to new and unknown fraud patterns. This is crucial because fraud is a continuously evolving problem. Some ML models, like unsupervised ones, allow us to extract knowledge from unlabeled data, which is very important. I was thinking that dealing with credit card transactions or phone calls involves staggering amounts of data. Labeling data is time-consuming, and you don't know what you don't know. Labeling such high volumes for complex use cases is a big problem, which is why unsupervised learning is interesting. Can you tell us more about the algorithms we could consider today? Today, we will show why ML and AI are fascinating. Data scientists have a wide range of modeling techniques and algorithms at their disposal, and we can combine them for complex problems.
Fraud detection within millions of observations is hard and complex. We will combine different types of ML. The primary colors for a painter are red, blue, and yellow; when mixed, they create new colors. Similarly, we want to define fraud as an anomaly and classify transactions into two classes, fraud and non-fraud, to detect unlabeled observations in real time. We will use unsupervised learning techniques to create labels, and then supervised learning to build a binary classification model with XGBoost. If we have labeled data, we can use supervised learning. If not, or if we don't have enough, we can use unsupervised learning, and we can combine the two. We can do ensemble prediction and score transactions with both models. We will try that. Let's discuss the dataset now. Let me share my screen for a second. This is the dataset we're using today, available on Kaggle. It's anonymized because it's credit card data. The original dataset has been transformed using PCA (Principal Component Analysis), so the sensitive features are masked, and it looks like this. PCA, by the way, is another built-in algorithm in SageMaker. Data anonymization is important, and PCA is one technique, but there are others, like encryption. We see 28 features, and the actual transaction amount has not been anonymized. The label is zero for non-fraudulent and one for fraudulent. We have about 285,000 transactions. Is that enough data? Fraudulent activity is not common, so you need enough data to have enough fraud records to learn from. Out of 285,000 transactions, only 0.17% are fraudulent, about 500. This is a hard problem to solve because you're looking for a needle in a haystack. We will use SMOTE, a technique from the open-source Imbalanced-Learn library, to address the imbalance problem. Now, let's talk about the algorithms. We said we would use XGBoost and Random Cut Forest. I have a couple of slides to introduce these algorithms without going too deep into the details. I'll do XGBoost, and you can do Random Cut Forest. XGBoost is a very popular algorithm that has won many Kaggle competitions. It's versatile and effective on a wide range of problems. It's open source, and you can find it on GitHub. It's based on decision trees and can be used for regression, classification, and ranking. It builds an ensemble of trees, but unlike a random forest, XGBoost builds each new tree to correct the prediction errors made by the previous ones. Each tree fixes the mistakes of its predecessors, creating a cumulative way of predicting and learning. This makes XGBoost very accurate. It's robust to missing values, works well with sparse data, and supports distributed training. On SageMaker, you can train XGBoost on large datasets that don't fit in RAM. XGBoost is a must-know algorithm in machine learning. It has many hyperparameters that can be fine-tuned, but sticking to the defaults is often a good start. Random Cut Forest is an unsupervised algorithm for detecting anomalous data points. It associates an anomaly score with each data point, where a low score indicates normal data and a high score indicates an anomaly. Scores beyond three standard deviations from the mean score are considered anomalous. At training time, we build an ensemble of tree-based models on random subsets of the data, using random features. When predicting on a new data point, we insert it into every tree and measure the disruption it causes. If the point fits in nicely, it's not anomalous. If it causes significant disruption, it's probably an anomaly. The score measures this disruption, so the name "Random Cut Forest" is fitting.
The math is heavy, but the intuition is clear. Adding a point that fits well doesn't change the tree much, while an outlier causes significant disruption. Let's talk about some hyperparameters. For XGBoost, there are many, but we need to specify that we want binary classification. We also need to handle the class imbalance. In our dataset, only 0.17% of transactions are fraudulent, creating a highly imbalanced class distribution. Traditional classification algorithms like XGBoost can perform poorly if this imbalance isn't modeled. We need to give extra weight to the minority class (fraudulent transactions) to make sure it is not ignored. We can do this by setting a scaling weight parameter, which here is the square root of the ratio of the majority class to the minority class. Let's go into the notebook and look at the code. I'm using SageMaker Studio, a web-based IDE for machine learning based on JupyterLab. We start by reading our dataset, which has 285,000 samples. We compute statistics, but the most important numbers are the counts of fraudulent and non-fraudulent transactions. We separate features and labels and split the dataset for training and validation. We first train the Random Cut Forest algorithm, which is unsupervised and uses only the features. We import the SageMaker SDK, grab an S3 bucket, and use the Random Cut Forest estimator. We train 50 trees, each built on 512 samples, so about 25,000 samples in total. We train the model on one ml.c4.xlarge instance and deploy it to a real-time endpoint. We predict anomaly scores for the test set and plot their distribution. There is a clear separation between the two classes, showing that the model can capture the difference between fraud and non-fraud. Now, let's try the other algorithm. This time, we use supervised learning with the labels. We upload the labeled data to S3, grab the container for the XGBoost algorithm, and set the hyperparameters, including the scaling weight for the class imbalance. We use the binary logistic regression objective to build a binary classifier and set the evaluation metric. We train the model on one ml.c4.xlarge instance and deploy it to an endpoint. We predict the test set and get classification probabilities. We set a threshold to decide what's fraudulent and plot the confusion matrix. The model performs well, with very few misclassified transactions. To address the class imbalance further, we can use techniques from the Imbalanced-Learn package, such as oversampling, undersampling, or generating synthetic samples with SMOTE. We oversample the minority class from 0.17% to 50%, creating a perfectly balanced dataset. We train the model again, removing the scaling weight parameter, and deploy it. We predict the test set and get a balanced accuracy of 0.91255, which is better than before. However, the Cohen's kappa metric is not as good, and the confusion matrix shows more mistakes. Oversampling from 0.17% to 50% is very aggressive and might shift the data distribution too much. In real-world scenarios, these techniques can dramatically improve a model, but you need to decide what's worse for your business: false negatives or false positives.
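If you'd like to reproduce the main steps outside the notebook, here is a minimal sketch of the Random Cut Forest part, assuming SageMaker Python SDK v2 running in a SageMaker environment. `train_features` and `test_features` are placeholder NumPy arrays, not variables from the actual notebook, and the three-standard-deviation cutoff mentioned earlier is applied at the end.

```python
import numpy as np
import sagemaker
from sagemaker import RandomCutForest

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes a SageMaker notebook or Studio role
bucket = session.default_bucket()

# Built-in Random Cut Forest: 50 trees, 512 samples per tree, as in the episode
rcf = RandomCutForest(
    role=role,
    instance_count=1,
    instance_type="ml.c4.xlarge",
    num_trees=50,
    num_samples_per_tree=512,
    data_location=f"s3://{bucket}/rcf/input",
    output_path=f"s3://{bucket}/rcf/output",
)

# train_features: 2-D float32 NumPy array of unlabeled transaction features
rcf.fit(rcf.record_set(train_features))

# Deploy to a real-time endpoint and score the test set
# (for a large test set, send the data in smaller batches)
predictor = rcf.deploy(initial_instance_count=1, instance_type="ml.c4.xlarge")
results = predictor.predict(test_features)
scores = np.array([r.label["score"].float32_tensor.values[0] for r in results])

# Flag scores more than three standard deviations above the mean score
cutoff = scores.mean() + 3 * scores.std()
anomalies = test_features[scores > cutoff]
```

Remember to call `predictor.delete_endpoint()` when you're done experimenting, or the endpoint will keep running and billing.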
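For the supervised model, here is a rough sketch of the built-in XGBoost training job, again with SageMaker SDK v2. The S3 paths and the `n_nonfraud`/`n_fraud` counts are placeholders, and the CSV files are assumed to have the label in the first column with no header, as the built-in XGBoost expects.

```python
import numpy as np
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Scaling weight for the rare positive (fraud) class: square root of the
# majority/minority ratio, e.g. sqrt(284315 / 492) is roughly 24
scale_pos_weight = float(np.sqrt(n_nonfraud / n_fraud))

# Retrieve the built-in XGBoost container for the current region
image_uri = sagemaker.image_uris.retrieve(
    "xgboost", session.boto_region_name, version="1.2-1"
)

xgb = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.c4.xlarge",
    output_path=f"s3://{session.default_bucket()}/xgb/output",
)
xgb.set_hyperparameters(
    objective="binary:logistic",        # binary classifier, outputs probabilities
    eval_metric="auc",
    scale_pos_weight=scale_pos_weight,  # extra weight on the minority class
    num_round=100,
)

xgb.fit({
    "train": TrainingInput(train_s3_uri, content_type="text/csv"),
    "validation": TrainingInput(validation_s3_uri, content_type="text/csv"),
})
predictor = xgb.deploy(initial_instance_count=1, instance_type="ml.c4.xlarge")
```

The endpoint returns a probability between 0 and 1 for each transaction; you then pick a threshold (0.5 is the usual default, but it's worth tuning) to decide what counts as fraud.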
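Finally, the oversampling experiment. SMOTE lives in the open-source `imbalanced-learn` package (`pip install imbalanced-learn`); this sketch uses its real `fit_resample` API, with `X` and `y` as placeholder training features and labels.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# X: feature matrix, y: 0/1 fraud labels (assumed already loaded)
print("before:", Counter(y))            # e.g. Counter({0: 284315, 1: 492})

# Generate synthetic minority samples until both classes are the same size
smote = SMOTE(sampling_strategy=1.0, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("after:", Counter(y_resampled))   # Counter({0: 284315, 1: 284315})
```

A less aggressive `sampling_strategy` (say, 0.1 instead of 1.0) is often a better starting point than full balancing, for exactly the reason discussed above.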
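To compare the models the way the episode does, scikit-learn provides all three metrics mentioned. In this sketch, `y_true` holds the test labels and `y_prob` the probabilities returned by the endpoint (both placeholder names).

```python
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score, confusion_matrix

# Turn probabilities into hard 0/1 predictions with a threshold
threshold = 0.5
y_pred = (y_prob > threshold).astype(int)

print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
```

Balanced accuracy averages recall over both classes, while Cohen's kappa corrects for chance agreement, which is why the two can disagree on an imbalanced test set, just as they did here.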
To wrap up, we learned how to use Amazon SageMaker to train fraud detection models using built-in algorithms and how to handle class imbalance. We covered a lot, and there is more to learn. Thank you for watching. If you have questions, send us an email. Join us next Friday for another episode of SageMaker Fridays. Thanks to my AWS colleagues for answering questions and helping with this Twitch session. Thank you, Ségolène, and until next week, keep rocking with machine learning. Bye-bye.

Tags

Machine Learning, Fraud Detection, SageMaker, XGBoost, Random Cut Forest