SageMaker Fridays Season 2 Episode 1 Predictive Maintenance October 2020
October 12, 2020
Broadcast live on 9/10/2020. Join us for more episodes at https://amazonsagemakerfridays.splashthat.com/
This project uses Amazon SageMaker to train a model predicting the remaining useful life of aircraft engines. The model is a stacked bidirectional LSTM neural network implemented with the MXNet deep learning framework.
Transcript
Alright, good morning, everyone, and welcome to season 2 of SageMaker Fridays. My name is Julien, and I'm a principal developer advocate focusing on AI and machine learning. Before we explain what SageMaker Fridays are about, please meet my co-presenter. Hi, everyone, my name is Segolen, and I'm a senior data scientist working with the AWS Machine Learning Solutions Lab. My role is to help customers get their ML projects on the right track to create business value as fast as possible.
Alright, so we're in the same room, but we have two different cameras, so don't get confused. We're live from the Paris office. So, in a way, today it's French machine learning cuisine, right? Hopefully, it's going to be tasty. And it's really cool to have you on board because we are going to need your expertise. So let's quickly explain what SageMaker Fridays are about.
Every week from today until November 13th, so six episodes, we will discuss a real-life use case for machine learning and see how it can be solved using a service called Amazon SageMaker, which was launched almost three years ago now. As you can imagine, SageMaker is a fully managed service for machine learning. All episodes are 100% slide-free because we know you want to see some code, and that's what you're going to get. Lots of code. So I hope that's all right. And in addition, all episodes are live, as I said. Feel free to ask all your questions in the chat. We have a team of machine learning specialists helping us answer the questions. So thank you very much for helping us, guys. We really, really appreciate it. And by the way, there are no silly questions. So please don't be shy and ask all your questions. Learn as much as possible. We're really here to help you understand more about ML, AWS, and SageMaker. So all questions are welcome. Please use the opportunity.
Okay, so it's time to get started. Segolen, what is this episode about?
In this episode, we are going to talk about predictive maintenance, which deals with predicting when a piece of equipment will fail. By predicting failures in advance, maintenance teams can fix or replace the equipment and avoid unplanned downtime. That's nice. I really like that topic. I wish my dishwasher could do that. Those things break down at the worst possible time, and then you're stuck doing the dishes by hand until you get a new one, which I don't really enjoy. So it would be nice to get advance warning.
Exactly, that's the point. But today, we are not going to work with data from your dishwasher. We are going to use NASA turbofan engine datasets, which contain time series data representing sensor information coming from aircraft engines. With this dataset, we are going to train a deep learning model based on a CNN bidirectional LSTM architecture on SageMaker to predict the remaining useful life of an engine.
That sounds amazing. And yes, it sounds more sophisticated than dishwashers. So grab a cup of coffee. It looks like we're going to learn a lot about SageMaker and deep learning today. By the way, all the material we're going to use is online. We're using a GitHub repository, and I will share the URL later so you can replay exactly what we're doing today. You can easily clone the repo and get to work. There's also a CloudFormation template. CloudFormation is our infrastructure as code service, and it builds all the resources you need, including a Lambda function for automation, but we'll talk about automation later in the show.
Okay. So before we dive into the code, let's take a few minutes to discuss the machine learning problem itself and how we're going to solve it. Let's start from the first step: what are we trying to solve, and what data do we need? Why would we use the algorithm we're using today? This is what a machine learning team would do—analyze the business problem and figure out where to go next. So let's start with a very simple question. What's the problem we're really trying to solve from a data perspective?
When we talk about predictive maintenance, it's a big one, because what you want to do is look into the future. In French, we have a proverb that says it is better to prevent than to cure. This is really the goal of today's session because, in any kind of manufacturing, downtime can be super expensive, especially because of the domino effect. People look for solutions to forecast the state of their equipment in the coming minutes, hours, or days. If you can model it, you can make the right decision at the right moment and avoid disastrous consequences due to unexpected downtime. It's a real management problem.
We can give you a real-life example. We have a customer called Veolia Water Technologies, the global leader in water processing. They used machine learning and SageMaker to predict when they should change water filters in water processing plants. It's a really cool example of predictive maintenance: when do you need to replace the equipment? They have a very good video explaining this project, and I will share that URL with you at the end of the session. Predictive maintenance is a common problem in industry and a very important one because it's very expensive. Unexpected failures can cost a lot of money, so it's better to avoid them. But if you replace equipment too soon, it costs money as well, so you have to find the right balance.
Okay, so my next question is, why should we even use machine learning for this? I know machine learning is exciting, and we all want to learn more about it, but why can't we use traditional statistical techniques? We can do forecasting with traditional techniques, so why not?
The idea is that when you open a file containing logs from sensors, you see it's a big file, sometimes very messy, and traditional statistical models struggle to deal with that amount of data. You really need deep learning models to ingest all the data generated by the equipment, especially if you are at a very fine-grained resolution, maybe down to the second. Traditional models can't keep up. Predictive maintenance also involves data from many sensors, so you have multivariate time series, not just one signal, and you need to be able to combine the patterns across the series. So, deep learning all the way, then.
Yes. So, what kind of data do we even need here? When you start a predictive maintenance project, you first have to look for data generated by the equipment of interest, and most of the time that means data from sensors. When you do predictive maintenance projects, take your time and grab an extra coffee, because you're going to spend a lot of time gathering, denoising, and understanding your data. Data cleaning is typically 80% of the work, sometimes 90%.
Fortunately, we have a very friendly dataset today. It's ready to go. So, we said deep learning is probably an interesting solution to deal with the volume of data and multivariate time series. What kind of algorithm could we consider using here? To understand predictive maintenance, you need to see it as a kind of survival analysis, a branch of statistics for quantifying the time it takes for an event of interest to occur—in our case, machine failure. Time is really the core aspect of predictive maintenance. You need to use a dynamic algorithm that allows you to model not only single data points but also an entire sequence of data to predict what's going to happen at T0 and T+1. This is why we are going to use recurrent networks.
Yes. Experience and intuition are important. Deep learning is crazy complicated, and you can build an infinite combination of layers and models. But you work with these topics all the time, so you naturally know what kind of algorithm is going to work well. As a starting point, try a baseline, maybe a quite simple deep learning algorithm. If you want to add complexity, modify it again. Start simple, start small, is always good advice, especially with deep learning.
Okay, so now that we understand the problem, let's talk about the dataset we're going to use today. Tell us a little bit about our dataset.
When you start a predictive maintenance project, you use data from a sensor network, which can be embedded within an IoT environment. But today, we are going to use time series data from a NASA simulator called C-MAPSS, which stands for Commercial Modular Aero-Propulsion System Simulation. It's an openly available dataset that simulates the behavior of aircraft engines, so you can test the performance of your algorithm and validate generic engine models. It's a very clean dataset, so we won't be running much preprocessing on it. But in real life, you would typically spend a lot of time cleaning your data, denoising it, and understanding the noise. The raw data quality from sensors can be super bad, and you need to get your hands dirty cleaning it.
Here, the dataset we are going to use is already clean and aligned, and we're just going to normalize the columns. Maybe let's take a look at the dataset. Let me switch to the other display and give me a second here.
So, let's explain a little bit what we have. This is the NASA dataset called C-MAPSS, and it comes in four parts. We have data for four different engines, so four files numbered 1 to 4. We have a training set and a test set. Here, we see data for engine one. What are those columns? The first number is called the trajectory. We have multiple trajectories, and I think we have 100 in this dataset. Each trajectory is a full run for the engine, simulating takeoff, flying, and landing. We have a number of runs, and for each trajectory, we have cycles. Each cycle is a moment in time where we capture data. Then we see sensor data: 25 individual values. The first few columns are operational settings, and the rest are sensors. These are the time series, and we can see them cycle after cycle.
We want to predict the next failure, or more precisely the remaining useful life (RUL), which is also part of the dataset. We have 100 values, one per trajectory. At the beginning of trajectory one, the RUL is 112 cycles; at the beginning of trajectory two, it's 98, and so on. This is the ground truth, and that's what we're going to predict. The dataset looks like this, and we have a test set that looks like the training set. We have different files, and we're just going to use this file for training today. To recap: we have 100 trajectories, each representing a full run for the engine. Each trajectory is a series of steps called cycles, and each cycle captures the sensor values. These sensor readings over time build our time series, and we're trying to predict the RUL value.
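For readers following along at home, here is a minimal sketch of how you could load one of those training files with pandas. The file name and column names are assumptions based on the standard C-MAPSS layout (trajectory id, cycle, three operational settings, 21 sensors); they are not taken from the repo, which has its own preprocessing script.

```python
import pandas as pd

# Assumed C-MAPSS layout: trajectory id, cycle, 3 operational settings, 21 sensors.
cols = (["trajectory", "cycle"]
        + [f"setting_{i}" for i in range(1, 4)]
        + [f"sensor_{i}" for i in range(1, 22)])

# The raw files are space-separated with no header row.
train = pd.read_csv("train_FD001.txt", sep=r"\s+", header=None, names=cols)
print(train.head())
print(train.groupby("trajectory")["cycle"].max().describe())  # length of each run
```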
It's not a big dataset, right? No, and that's probably a good thing: it's clean and not too big, so it's easy to work with. A question I get a lot is: how much data do I need? How do we know it's enough? Here, we'll trust NASA. But in general, it really depends on the machine you are studying and what you want to predict. You need enough data to cover several full cycles, so you can capture potential seasonality and patterns. Deep learning models need data to extract those patterns, so don't be shy: collect plenty of data.
Okay, so let's see what kind of results we get. Now let's talk about the algorithm. Let me switch to the code because, of course, we're going to look at the code. What are we using to train this model? The notion of time and dynamic sequences is really at the core of predictive maintenance. This solution leverages a custom stacked long short-term memory neural network, or LSTM model. It comes from the RNN family, and it allows you to capture and store long-term dependencies in your time series through a memory cell.
Yes, we use a convolutional bidirectional LSTM model. What does that mean? You're going to use a convolution, the CNN part, to identify patterns in our time series. The convolution will act like an information filter. The LSTM part will help us capture the temporal evolution of this pattern. So, two sides: CNN for pattern extraction and LSTM for temporal dependencies.
Here's the model, and we're using Gluon, an API part of Apache MXNet. Gluon is a high-level API on top of MXNet, which is very flexible and great for experimentation. We'll focus on the architecture and explain the rationale. The individual parameters don't matter much for this discussion.
We use a lambda layer to put the data dimensions in the right order. The first dimension of the input tensor is the batch size, which is one by default. The second dimension is the number of cycles, and the third dimension is the number of time series, which is 25. This line of code just swaps the second and third dimensions, so the shape of the tensor becomes batch size one, 25 time series, and the number of cycles. We do this because the next layer is a convolution, which is performed on the last dimension of the tensor.
We run a convolution, which is well known for images, but here we use one-dimensional filters. We have 32 one-dimensional kernels, and these are parameters that will be learned. The output is still a tensor of shape one, 25 time series, and a number of convolved features. We do it again to extract more patterns, so the shape of the data becomes one, 25, and some variable length. Then we transpose it back to the original order and run the LSTM.
The LSTM layers have parameters like the number of units and layers, and you can iterate on those to see if you can improve the metric and the performance of your algorithm. A dense layer, a fully connected layer, reduces the output of the LSTM into a single value, the RUL. Even if you don't know Gluon, you get an idea of what we're doing: convolving twice to extract patterns and then sending them to the LSTM to find the time element and let the LSTM do its magic. Gluon makes it very compact to define even a slightly complicated model.
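To make this concrete, here is a minimal Gluon sketch of the architecture just described: a transpose, two 1D convolutions, a transpose back, a bidirectional LSTM, and a dense layer producing the RUL. This is not the exact script from the repo; the kernel sizes, the ReLU activations, and the last-time-step readout are assumptions.

```python
from mxnet import gluon, nd

class ConvBiLSTM(gluon.nn.HybridBlock):
    """Sketch of a 1D-CNN + bidirectional LSTM regressor for remaining useful life."""
    def __init__(self, num_units=8, num_layers=2, **kwargs):
        super().__init__(**kwargs)
        with self.name_scope():
            # 1D convolutions act as pattern filters along the time axis
            self.conv1 = gluon.nn.Conv1D(channels=32, kernel_size=3, activation="relu")
            self.conv2 = gluon.nn.Conv1D(channels=32, kernel_size=3, activation="relu")
            # Bidirectional LSTM captures how those patterns evolve over time
            self.lstm = gluon.rnn.LSTM(num_units, num_layers=num_layers,
                                       bidirectional=True, layout="NTC")
            # Fully connected layer collapses the sequence summary into one RUL value
            self.out = gluon.nn.Dense(1)

    def hybrid_forward(self, F, x):
        # x: (batch, cycles, sensors) -> (batch, sensors, cycles) for Conv1D (NCW layout)
        y = F.transpose(x, axes=(0, 2, 1))
        y = self.conv2(self.conv1(y))
        # Back to (batch, cycles, channels) for the LSTM (NTC layout)
        y = F.transpose(y, axes=(0, 2, 1))
        y = self.lstm(y)
        # Keep only the last time step as the sequence summary
        y = F.slice_axis(y, axis=1, begin=-1, end=None)
        return self.out(y)

net = ConvBiLSTM()
net.initialize()
print(net(nd.random.uniform(shape=(1, 50, 25))).shape)   # (1, 1)
```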
Let's talk about hyperparameters. Some are linked to SageMaker, which will automatically pass the location of the training set and the location to save the trained model. This is called script mode in SageMaker, and it's how you run framework code on SageMaker. If you have scikit-learn, TensorFlow, PyTorch, or MXNet code, you just need to integrate that code using environment variables passed by SageMaker. Adapting code for SageMaker is really simple. Script mode is what you need to look for.
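In practice, a script-mode training script picks up these values through command-line arguments and the SM_* environment variables that SageMaker sets inside the container. A typical entry point starts roughly like this; the channel name `train` is an assumption and the environment variable follows whatever channel name you pass to `fit()`.

```python
import argparse
import os

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments
    parser.add_argument("--epochs", type=int, default=200)
    parser.add_argument("--batch-size", type=int, default=1)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--num-units", type=int, default=1)
    parser.add_argument("--num-layers", type=int, default=1)
    # Data and model locations arrive as environment variables set by SageMaker
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    args, _ = parser.parse_known_args()
    # ... load data from args.train, train the model, save it under args.model_dir
```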
If we look at architecture parameters, we see batch size, epochs, learning rate, number of layers, and number of units. These are the default values, so one unit, one layer. We'll see if we end up using those. The main question is, what parameters would give us the best results? No one really knows, but you could use SageMaker's automatic model tuning to explore parameter ranges automatically and find high-performing combinations. You could launch a tuning job and say, "Try one, two, or three layers, from one to 16 units, and go figure it out." Run 10 jobs, and automatic model tuning would quickly find the optimal combination. We're not going to do that today, but generally, automatic model tuning is a super useful feature. I highly recommend using it, even for architecture search.
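We won't run it today, but for reference, a tuning job along those lines might look like this with the SageMaker SDK. The metric regex and the `estimator` and `train_data_location` variables are assumptions; they refer to the estimator and S3 location set up later in the walkthrough, and the regex must match what the training script actually prints.

```python
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

# Hypothetical tuning job exploring the architecture parameters mentioned above.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="rmse",
    objective_type="Minimize",
    metric_definitions=[{"Name": "rmse", "Regex": "RMSE: ([0-9\\.]+)"}],
    hyperparameter_ranges={
        "num-layers": IntegerParameter(1, 3),
        "num-units": IntegerParameter(1, 16),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"train": train_data_location})
```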
What else could we talk about? Maybe we can take a quick look at data loading. The loss function we're using is RMSE (root mean square error). The idea of time series prediction is to compare the predicted versus the actual value. RMSE is a good metric for a regression problem, trying to predict a unique value.
The training loop looks quite standard: iterating over epochs, getting the next batch of data, reading the data and label, recording gradients automatically, running forward propagation, reading predictions, applying the loss function, and running backward propagation. Then the optimizer takes a step. It's very typical for a deep learning training job. At the end, we save the model in the location SageMaker pointed us to; saving is just saving the parameters. Hybridization is interesting. Gluon is a very dynamic library, great for experimentation, but sometimes that comes at the cost of performance. Hybridization is a way to optimize that code, improving speed and memory allocation, getting close to the performance of a fully static library like TensorFlow or vanilla MXNet.
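Here is a condensed sketch of such a Gluon training loop, including the hybridization step just mentioned. The `net`, `train_loader`, `epochs`, and `model_dir` names are placeholders, and the RMSE computation assumes Gluon's L2Loss (0.5 × squared error) as the loss.

```python
from mxnet import autograd, gluon

l2_loss = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), "adam", {"learning_rate": 0.001})
net.hybridize()                                      # compile the dynamic graph for speed

for epoch in range(epochs):
    cumulative_loss, num_batches = 0.0, 0
    for data, label in train_loader:                 # one batch of sequences and RUL labels
        with autograd.record():                      # record operations for backprop
            prediction = net(data)                   # forward pass
            loss = l2_loss(prediction, label)        # 0.5 * squared error on the RUL
        loss.backward()                              # backward pass
        trainer.step(data.shape[0])                  # Adam update
        cumulative_loss += loss.mean().asscalar()
        num_batches += 1
    rmse = (2 * cumulative_loss / num_batches) ** 0.5   # undo the 0.5 factor of L2Loss
    print(f"Epoch {epoch} RMSE: {rmse:.2f}")

# Save only the learned parameters; SageMaker copies model_dir back to S3.
net.save_parameters(f"{model_dir}/model.params")
```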
We apply learning rates, etc., and train. When it comes to predicting, we have a function to load the model, which is quite simple. Predicting is just forwarding data and reading the output. Batch prediction is a good fit for predictive maintenance. It's a good practice in this area. Usually, we pull data from machines, and it's not really a real-time process, though it could be, depending on the machine's importance in the supply chain. But we're going to use batch.
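Loading the model back for prediction is indeed short. A sketch, reusing the hypothetical `ConvBiLSTM` class from earlier; the parameter file name and the random input standing in for a real test trajectory are placeholders.

```python
import mxnet as mx

# Reload the trained parameters into the same architecture and run a forward pass.
net = ConvBiLSTM(num_units=8, num_layers=2)
net.load_parameters("model.params", ctx=mx.cpu())

# One sequence of cycles for one engine: shape (batch=1, cycles, sensors).
sequence = mx.nd.random.uniform(shape=(1, 50, 25))    # stand-in for a real test trajectory
predicted_rul = net(sequence)
print(predicted_rul.asscalar())
```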
Now it's time to put everything together and start running this code, training the model, and using SageMaker. I am using SageMaker Studio, our machine learning IDE, launched at re:Invent just about a year ago. You can find it in the AWS console. SageMaker is available in many regions, and Studio is available in a smaller number of regions, including a couple of American and European regions. I'm using the Ireland region, eu-west-1. If you are in a supported region, you just click on this link. Creating a user for SageMaker Studio takes 30 seconds. You have a simple wizard, just click, click, click. Then you click on Open Studio, and you will jump into something that looks like this.
Studio is a web-based IDE based on JupyterLab, so it will look and feel very familiar. We also added integrations with SageMaker's capabilities. We have SageMaker experiments to manage and compare all the jobs associated with your experiments, processing jobs, training jobs, batch transform jobs, etc. You can track everything nicely here. You can also create an experiment and launch a SageMaker Autopilot job, an AutoML capability that lets you build regression and classification models from tabular data. This is a no-code experience, and there are more features like endpoints, SageMaker endpoints, which are HTTPS prediction APIs based on your models.
Now it's time to look at this notebook, available in a repo. If you want to follow along, it's the AWS Labs predictive maintenance using machine learning. If you go to source notebooks, you'll find that notebook. I'll wait for a few seconds if you want to type that name, "predictive maintenance using machine learning." You can follow along, clone it, and run it.
First, we install the SageMaker SDK, the Python SDK that drives all the training and deployment activity. We had a major SDK release in early August, SDK 2.x, with a few breaking changes, but nothing really bad. You shouldn't have much trouble migrating your notebooks from V1 to V2. I'm using V2 here, of course. We import some local libraries from the repo and the SageMaker SDK. We grab the dataset from S3, extract it, and apply preprocessing, which is normalization. We normalize the columns of the dataset to compare and study time series on the same scale. One sensor could be recording values between zero and 1000, and another between minus 10 and plus 10. You want the same scale to compare.
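A minimal way to do that normalization, assuming the `train` and `test` DataFrames from the earlier loading sketch; the repo's preprocessing script may scale things differently.

```python
# !pip install --upgrade "sagemaker>=2.0"   # first step of the notebook: install SDK v2
from sklearn.preprocessing import MinMaxScaler

# Scale every setting/sensor column to [0, 1]; fit on the training set only,
# then apply the same scaling to the test set.
feature_cols = [c for c in train.columns if c.startswith(("setting", "sensor"))]
scaler = MinMaxScaler()
train[feature_cols] = scaler.fit_transform(train[feature_cols])
test[feature_cols] = scaler.transform(test[feature_cols])
```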
We can plot some values. It's important to have a good visualization of the sensors before doing any deep learning. For instance, sensor number five has the same flat value throughout, so at some point you will probably drop sensor five. It's important to get an idea of the behavior through exploratory analysis. If you see the same value over the whole series, either the sensor is broken, or it's more of a setting than a time series. It's important to visualize your data.
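A quick exploratory plot along those lines might look like this, using the column names from the earlier sketch.

```python
import matplotlib.pyplot as plt

# One trajectory, a few sensors: a flat line (like sensor 5 here) suggests a
# constant setting or a dead sensor that can probably be dropped.
traj = train[train["trajectory"] == 1]
for s in ["sensor_2", "sensor_5", "sensor_7"]:
    plt.plot(traj["cycle"], traj[s], label=s)
plt.xlabel("cycle")
plt.ylabel("normalized value")
plt.legend()
plt.show()
```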
We can see something going up, maybe engine temperature or vibration, and then it breaks. It's a good thing I'm not on that plane. The risk is that it might be a terminal fault. I'll never feel safe on a plane now. No, don't say that. It's bad luck.
So, we need to put the data in S3 because that's where SageMaker expects the data. We can store data in Amazon EFS or Amazon FSx for Lustre for high-performance computing, but for this, S3 is fine. We take the normalized data, the test set, and the train set, and upload them to an S3 bucket. We can see the location and the data. We see the ID for the trajectory, the cycles for each trajectory, the normalized sensor values and settings, and the RUL. The pre-processing script also computes the RUL. It starts from the ground truth value and computes the actual value at each cycle. At the beginning of trajectory one, RUL is 191. After one cycle, it's 190, and so on. It goes all the way to zero, where the engine fails.
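A sketch of both steps: computing the per-cycle RUL on the training set, then uploading the processed files to S3 with the SageMaker session. The local paths and S3 prefix are illustrative.

```python
import sagemaker

# Per-cycle RUL for the training set: remaining cycles before the last observed
# cycle of that trajectory (which is the failure point in the training data).
max_cycle = train.groupby("trajectory")["cycle"].transform("max")
train["RUL"] = max_cycle - train["cycle"]

# Save locally, then upload to the session's default bucket.
train.to_csv("data/train.csv", index=False)
test.to_csv("data/test.csv", index=False)
session = sagemaker.Session()
train_data_location = session.upload_data("data/train.csv", key_prefix="pred-maintenance/train")
test_data_location = session.upload_data("data/test.csv", key_prefix="pred-maintenance/test")
print(train_data_location)
```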
The idea is to see the link between all these multivariate time series and the variable of interest. You want to see the correlation between different time series. For example, if S2, S3, and S4 are going up, it might be a bad sign. If only one is going up, you might be fine.
We have data in S3, and we'll store all training artifacts and the model in S3 as well. This is the training script, the Gluon script we looked at. Now we get to the point where we train. This is my favorite part of SageMaker because it's really one line of code. We use the MXNet estimator and pass it the location of our training script, which is local to SageMaker Studio. We ask for one training instance of type ml.p3.2xlarge, a GPU instance, because we're running deep learning, and GPUs are a good way to accelerate that. This is the only infrastructure work you need to do. Depending on the workload, you might need different types of instances. If you need more GPUs, you can use an ml.p3.16xlarge with eight GPUs, and MXNet would automatically leverage those GPUs.
We enable spot instances to use unused compute capacity at a deep discount. We have some hyperparameters, and I looked in my crystal ball and decided to use eight LSTM units, two layers, 200 epochs, and the Adam optimizer. We see the RMSE going down, but are these the best values? No. You would need to use automatic model tuning to find those values.
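Putting those settings together, the estimator might look like this with SDK v2. The script name, framework and Python versions, output locations, and spot time limits are assumptions; the hyperparameter values are the ones mentioned above.

```python
import sagemaker
from sagemaker.mxnet import MXNet

estimator = MXNet(
    entry_point="train.py",                   # the Gluon training script
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",            # single-GPU instance
    framework_version="1.6.0",
    py_version="py3",
    hyperparameters={
        "num-units": 8,
        "num-layers": 2,
        "epochs": 200,
        "optimizer": "adam",
    },
    # Managed Spot Training: use spare capacity at a discount
    use_spot_instances=True,
    max_run=3600,                             # cap on training time (seconds)
    max_wait=7200,                            # cap on waiting for spot capacity
)
estimator.fit({"train": train_data_location})
```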
When we run this, we see the training log, which is also available in Amazon CloudWatch. We see epochs going by, and the RMSE starts at 30 something and goes down. It's about two seconds per epoch, so this runs very quickly, thanks to the GPU instance. It goes down to single digits, which is okay. We could do better by tweaking parameters. The model is saved in S3.
We trained for a little less than 10 minutes, but we were only billed for 167 seconds, thanks to Spot, a 70.1% discount. This is a feature you want to use. It's called Managed Spot Training, and all it takes is setting those parameters in the estimator.
Now we get to the most profitable three lines of code ever. We create a transformer object with the instance type we want to use and unleash it on data stored in S3. If you wanted to deploy to a real-time endpoint instead, you would call the deploy method, and that's one line of code too. Most operations are one line of code; it's a very intuitive workflow and a reasonably simple SDK. You can learn it in a day. We process some data, transform it, and end up with transformed data in S3. We can view the results. We see fractional values, which look a little different from the RUL: the output was normalized, so we need an additional transformation to get back to the actual scale of the data. But it works.
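Those "three lines" look roughly like this; the transform instance type and content type are assumptions, and the deploy call is shown for comparison.

```python
# Batch transform: spin up a transient instance, score the test data in S3, shut down.
transformer = estimator.transformer(instance_count=1, instance_type="ml.m5.xlarge")
transformer.transform(test_data_location, content_type="text/csv")
transformer.wait()

# Alternatively, a real-time HTTPS endpoint is a single call:
# predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```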
We're really close to the end of the episode. If you have more questions, we have a few more minutes left. So ask them now. Predictive maintenance is quite powerful and a pretty interesting technique. You don't rely on a crystal ball, which is great, especially if I'm on a plane. I'll share a link with you to another example of predictive maintenance in the aviation industry. It's a session I ran at re:Invent last year with British Airways, explaining how they pull data from their airline fleet and process it to find faults before they happen. It's very cool.
It's time to wrap up. Let me show the resources to our friends. Thank you for joining us. Today, we learned how to use Amazon SageMaker to train and deploy a predictive maintenance model implemented with a CNN bidirectional LSTM to predict time series data. If you want more, please have a look at the different resources you can find related to this episode. If you have questions and feedback, please don't hesitate to send them to sagemaker-fridays@amazon.com. We are looking forward to your feedback.
I published a book on SageMaker, and you can buy a copy at a pretty sweet discount. The paper edition is discounted on Amazon.com, but it's only for the US website. Anyone can order, but it's only on Amazon.com. If you're interested in a nice discount on the ebook, you can buy it from the publisher's website and use the 20SageMaker discount code to get 20% off. This is only valid until November 11th, so don't wait too long.
I think we're done, and we're out of time. We're very excited about machine learning and could keep going for days. See you next week. I hope you learned a lot. It's great to have Segolen and her expertise with us. Thank you again. Thank you to our friendly moderators and experts answering questions. Next week, we'll be live again from the Paris office, hopefully, and we will talk about LSTMs with demand forecasting. In future episodes, we'll talk about computer vision, natural language processing, and fraud detection. We have a lot more, but it's more than enough for today. Thanks again. Thanks to the AWS team for their support. Thank you for all the other people watching this and for your trust. We'll see you next week. Have a great weekend. Learn a lot. See you on Friday. Bye-bye. Au revoir.