Hi everybody, this is Julien from Arcee. Welcome to episode 18 of my podcast. Don't forget to subscribe to be notified of future videos. I hope you're still safe wherever you are. As for me, things are improving where I live, so I can actually go out a little bit and sit in front of the house. In this episode, we're going to talk about AWS news as usual: high-level services, deep learning containers, and so on. And I'm going to do a demo of profanity filtering in Amazon Transcribe again, so make sure the kids are not in front of the screen. Let's get started.
As usual, let's start with the high-level services. The first feature I want to talk about is in Amazon Forecast. Amazon Forecast is a managed service that lets you easily build forecasting models for time series. This is a very useful feature: it automatically fills in missing data in your target time series and related time series. You can select from different filling techniques like median, min, max, etc. Any missing or NaN (not a number) value in the time series is going to be automatically fixed. This is really important because time series models are very sensitive to missing values. It works with the target value itself, the actual variable you want to predict, and also with related time series. If you're talking about retail, that could be stock information, price information, etc. If you have missing values in those, you can now automatically replace them as well. This is a very nice feature that should help you increase the accuracy of your models without any pre-processing.
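To make this concrete, here is a minimal boto3 sketch of how you could configure filling when creating a predictor. The dataset group ARN, attribute names, and fill values are placeholders for illustration only; check the Forecast documentation for the exact options supported for target and related time series.

```python
import boto3

forecast = boto3.client('forecast')

dataset_group_arn = '<your dataset group ARN>'  # created beforehand

# Hypothetical predictor with per-attribute filling configuration.
forecast.create_predictor(
    PredictorName='retail_demand_predictor',
    AlgorithmArn='arn:aws:forecast:::algorithm/Deep_AR_Plus',
    ForecastHorizon=14,
    InputDataConfig={'DatasetGroupArn': dataset_group_arn},
    FeaturizationConfig={
        'ForecastFrequency': 'D',
        'Featurizations': [
            {   # target time series: fill gaps in the middle with the median
                'AttributeName': 'demand',
                'FeaturizationPipeline': [{
                    'FeaturizationMethodName': 'filling',
                    'FeaturizationMethodParameters': {
                        'frontfill': 'none',
                        'middlefill': 'median',
                        'backfill': 'min'
                    }
                }]
            },
            {   # related time series, e.g. price: fill missing values with the mean
                'AttributeName': 'price',
                'FeaturizationPipeline': [{
                    'FeaturizationMethodName': 'filling',
                    'FeaturizationMethodParameters': {
                        'middlefill': 'mean',
                        'backfill': 'mean',
                        'futurefill': 'mean'
                    }
                }]
            }
        ]
    }
)
```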
The next one is a CodeGuru feature. CodeGuru is a managed service that does two things. It automatically reviews your code using the pull request workflow and identifies potential problems, such as security issues and performance issues; this is CodeGuru Reviewer. There's also a profiler module, CodeGuru Profiler, that inspects your applications while they're running in production and gives you performance reports, helping you pinpoint potential performance problems. Here, we're talking about a new Reviewer feature: you can now connect Bitbucket repositories to CodeGuru. Initially, you could use CodeCommit and GitHub, so now we can use Bitbucket as well. Pretty nice.
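For reference, here is a hedged boto3 sketch of what associating a Bitbucket repository could look like. The repository name, workspace, and CodeStar Connections ARN are placeholders; you would create the connection beforehand.

```python
import boto3

codeguru = boto3.client('codeguru-reviewer')

# The Bitbucket repository is reached through an AWS CodeStar Connections
# connection; the ARN below is a placeholder, not a real connection.
response = codeguru.associate_repository(
    Repository={
        'Bitbucket': {
            'Name': 'my-repository',    # hypothetical repository name
            'Owner': 'my-workspace',    # Bitbucket workspace / owner
            'ConnectionArn': 'arn:aws:codestar-connections:eu-west-1:123456789012:connection/abcd1234'
        }
    }
)

# The association goes through a few states before Reviewer starts working.
print(response['RepositoryAssociation']['State'])
```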
The next feature is vocabulary filtering. You may remember in a past episode I showed you this feature in Transcribe, our speech-to-text service. It was available in batch mode and now it's available in real-time mode. So, this is not safe for work, not safe for kids. Please turn the volume down if you don't have a headset and make sure no one is going to be offended by what I say here. But again, I'm French, so I can't help being offensive. It's in my DNA, I guess. The way this works is you just go to the Amazon Transcribe console, select real-time transcription, and we can pick from English and Spanish. I can curse pretty well in English, but not in Spanish, so I'm going to go for English. We want to go to additional settings and enable vocabulary filtering. You select a vocabulary filter, which is a list of words, one per line, that you want to either highlight or remove from the output; I already uploaded a profanity list here. We can mask the word, remove it, or tag it, which leaves it in the transcript but flags it as a word that should be filtered. I'm going to go for mask. I'm using the Ireland region here. Let's allow my mic. Good morning. My name is Julien. And I'm really, really tired of talking to you, motherfucker. I'm really, really tired of dealing with such an asshole. You feed me so much crap and so much bullshit. I think that's enough. So this is really simple to use. Just upload your filter, one word per line in a text file, and off you go. It's now available in streaming mode as well as batch mode, so profanity and any unwanted words can be highlighted or masked. Pretty good. Another really cool feature from the Transcribe team. Well done, guys.
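If you prefer the API to the console, here is a minimal boto3 sketch using the batch APIs, which are the simplest to show here; the streaming demo above uses the same vocabulary filter through the real-time console or the streaming SDK. Bucket paths and names are placeholders.

```python
import boto3

transcribe = boto3.client('transcribe')

# Create the vocabulary filter from a text file in S3, one word per line
# (bucket and file names are placeholders).
transcribe.create_vocabulary_filter(
    VocabularyFilterName='profanity-filter',
    LanguageCode='en-US',
    VocabularyFilterFileUri='s3://my-bucket/profanity.txt'
)

# Apply the filter to a batch transcription job: 'mask' replaces filtered
# words with '***', 'remove' deletes them, 'tag' keeps and flags them.
transcribe.start_transcription_job(
    TranscriptionJobName='filtered-demo',
    LanguageCode='en-US',
    MediaFormat='wav',
    Media={'MediaFileUri': 's3://my-bucket/audio.wav'},
    Settings={
        'VocabularyFilterName': 'profanity-filter',
        'VocabularyFilterMethod': 'mask'
    }
)
```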
Let's move on to deep learning containers. This team has been on fire. What's going on with you guys? They released a bunch of updates in very rapid fashion. PyTorch 1.5 is available in the deep learning containers, as well as TensorFlow 1.15.2 with Python 3.7, a Python update. TensorFlow 2.2, the brand new version, is now available too. They also added Elastic Inference support for more recent versions of TensorFlow and PyTorch, namely PyTorch 1.3.1, TensorFlow 1.15, and TensorFlow 2.0. Elastic Inference support requires extra work, so usually you'll see the newer framework versions pop up in the deep learning containers first, and then they get Elastic Inference support pretty quickly.
I want to talk about deep learning containers again. These are AWS managed containers for the most popular deep learning libraries. You can run them on EC2, cluster services like ECS and EKS, and SageMaker. I think they are terrific. They save you from building and managing your own containers. Of course, you can find these containers on GitHub as well, so you can build them yourself, inspect them, customize them, and run them on your own server or laptop if you want to. If you've never looked at this and you train deep learning models, I sincerely think you're missing out. They are extremely useful.
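As an example, with a recent version of the SageMaker Python SDK (v2 or later), you could look up the ECR URI of one of these containers and pull it yourself; the region, versions, and instance type below are just examples, so adjust them to what you actually use.

```python
from sagemaker import image_uris  # SageMaker Python SDK v2+

# Look up the URI of the TensorFlow 2.2 GPU training container in eu-west-1.
uri = image_uris.retrieve(
    framework='tensorflow',
    region='eu-west-1',
    version='2.2',
    py_version='py37',
    instance_type='ml.g4dn.xlarge',
    image_scope='training'
)
print(uri)

# After authenticating to the ECR registry that hosts the deep learning
# containers, you could 'docker pull' this URI and run it on your own machine.
```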
I also want to talk about elastic inference again because it looks like there are still a bunch of customers who've never heard about it, and it has a lot of value. Elastic inference is the ability to add fractional GPU acceleration to any instance. You can apply this to EC2 instances, notebook instances in SageMaker, and SageMaker endpoints. By fractional GPU acceleration, I mean that for some models, CPU prediction is too slow, and GPU prediction is required. But maybe your model or use case doesn't fully use that GPU instance. You might be paying for a P3 instance or a G4 instance and not keeping the GPU busy enough. Sure, it's fast, but when you look at the cost-performance ratio, it's not the best. For those use cases where CPU is too slow and GPU is too much, that's when you should look at elastic inference.
This service came out a while ago with first-generation elastic inference accelerators, and now we have a second generation. You can use these to benchmark your application and find the cost-performance ratio that works for you. This is available on EC2, and the TensorFlow, PyTorch, and MXNet libraries have been extended to support it: AWS maintains packages and extensions for these libraries, a version of TensorFlow Serving, etc., that support Elastic Inference. Of course, it's available on SageMaker too.
I want to quickly walk you through a simple example to show how easy it is to test and how quickly you could realize some serious savings. In this example, I'm training a convolutional neural network on the Fashion-MNIST dataset using TensorFlow 2.0. It's an image classification problem with 10 classes. I download the dataset and build a Keras model: I stack some Keras layers for convolution, etc., and use script mode to read hyperparameters and extra training information. It's vanilla Keras code adapted for script mode on SageMaker. At the end, we save the model. I upload my dataset to S3, the training set and the validation set, and can train on a GPU instance. Here, I use the TensorFlow estimator, passing my script, and train on an ml.g4dn.xlarge instance, which is the most cost-effective GPU instance we have today and still quite powerful. I train for 10 epochs, and training takes about 400 seconds.
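Here is a minimal sketch of what that estimator code could look like, using SageMaker Python SDK v2 parameter names (in v1 they were train_instance_count and train_instance_type); the entry point script name and S3 paths are placeholders, not the exact code from the demo.

```python
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()  # SageMaker execution role

# Script-mode estimator: 'fmnist.py' is a placeholder name for the Keras
# training script described above.
estimator = TensorFlow(
    entry_point='fmnist.py',
    role=role,
    instance_count=1,
    instance_type='ml.g4dn.xlarge',   # cost-effective GPU instance
    framework_version='2.0.0',
    py_version='py3',
    hyperparameters={'epochs': 10}
)

# Train on the datasets uploaded to S3 beforehand (placeholder paths).
estimator.fit({
    'training': 's3://my-bucket/fashion-mnist/training/',
    'validation': 's3://my-bucket/fashion-mnist/validation/'
})
```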
For deployment, let's say we need GPU prediction for this. You would look at the pricing and say the most cost-effective instance type is ml.g4dn.xlarge. For example, in eu-west-1, it costs $0.822 per hour. But maybe, looking at the CloudWatch metrics for the endpoint, you see the GPU is only 20% to 30% busy. It never goes up to 100% because you're predicting at low volume and don't need that much performance anyway. The alternative is to deploy on a CPU instance, an ml.c5.large, which is very cost-effective, and add an Elastic Inference accelerator. They come in three sizes: medium, large, and xlarge. Each size gives you a certain number of teraflops. When we predict with our model, the CPU instance runs the Python part of the prediction code, but the actual model inference runs on the elastic accelerator. If you add up those two prices, you get a fraction of the g4dn cost, probably at least two-thirds cheaper, and you still get good performance. If medium is too slow, you might want to try large and benchmark again. For these use cases, you will get a very significant discount, maybe 30%, 50%, or 70%. You can save a ton of money compared to using a full-fledged GPU instance, especially if you run those endpoints 24/7 for a month; we're talking hundreds of dollars per instance per month. That's serious money you can save by just trying this and experimenting with it. Then we just predict as we would normally, calling the predict API on the endpoint, and it works just fine. For a more significant model, you could find the exact level of performance and cost that works for you, saving hundreds of dollars per month per instance if you do it right. I think it's worth trying out.
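And here is what the deployment step could look like, again as a sketch rather than the exact demo code; 'x_val' is a placeholder batch of test images.

```python
# Baseline for comparison: a full GPU endpoint.
# predictor = estimator.deploy(initial_instance_count=1,
#                              instance_type='ml.g4dn.xlarge')

# Cheaper alternative: a small CPU instance plus an Elastic Inference accelerator.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.large',
    accelerator_type='ml.eia2.medium'   # try ml.eia2.large / xlarge if too slow
)

# Predict exactly as usual; the accelerator is transparent to the client code.
# 'x_val' is a placeholder NumPy batch of validation images.
result = predictor.predict(x_val)
print(result)
```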
That's it for this episode. I really want to apologize for that demo on profanity filtering. My evil twin made me do it. I'm not responsible in any way. Again, don't forget to subscribe for more news and more silly demos. Stay safe, and until I see you again, keep rocking.