Good morning, everybody, or good afternoon, depending on where you are. Welcome to episode six of season three of SageMaker Fridays. My name is Julien and I'm a dev advocate focusing on AI and machine learning. And once again, please meet my co-presenter. Hi, everyone. My name is Segelen and I'm a senior data scientist working with AWS Machine Learning Solutions Lab. My role is to help customers get their ML project on the right track in order to create business value as fast as possible. Thank you. As usual, we're live, no slides, and we have friendly moderators waiting for your questions. So please ask all your questions. Make sure you learn a lot, and I think there's going to be a lot to learn. It's a very action-packed and dense episode today. It's also a little bit different. So what is this episode about? In this episode, we are going to focus on optimizing costs with machine learning, looking at both the business angle and the technical angle. OK, so tell me about the business angle. That's new. Yes, we are going to have a chat with our special guest, Greg Kokio, technology risk manager working for Amazon. We will work through a large-scale document processing automation project that he is currently working on for a B2B customer operating in chemicals. Okay, so it's the first time we have a guest, so that's going to be pretty cool, and I hope the discussion will be interesting. What about the technical angle? We're still going to run code, right? Exactly! In the second part of the episode, we will dive into a large-scale computer vision workload running on SageMaker and pull out all the stops to optimize cost from image labeling to training to predicting. Along the way, you will learn about labeling with SageMaker Ground Truth, right-sizing your training infrastructure, training with managed spot training and pipe mode, deploying with elastic inference, and much more. Okay, so let's not waste any time. We still have an hour. 
Hopefully, we won't go too much longer than that. Right? So let's introduce Greg. Welcome, Greg. Welcome to SageMaker Fridays. Can you please introduce yourself and explain your role at Amazon, please?
Absolutely. Thank you, Julien. Thank you, Segelen, for having me. I'm Greg Kokio, technology manager for Amazon's private brands division. I provide compliance as a service to my customers, promoting trust by ensuring that the products being sold are safe and meet import and regulatory requirements in all countries where we do business, through automated workflows. Currently, I'm responsible for the technology roadmap and drive AI and ML adoption for my department by leading projects in collaboration with global teams of data scientists and software engineers. Most of my projects fall into the innovation category for our customers, and the inherent uncertainty embedded in AI/ML projects is something I enjoy. The strategy I typically use is prioritizing business use cases based on needs, effort, and impact, followed by building and testing a minimum viable product that can scale once we test and validate it.
This is great. Thank you so much, Greg, for this introduction. Can you tell us a little bit more about your customer and the challenge they were facing?
Sure thing. Our customer is a B2B multinational working in the chemical industry. As you may know, per regulatory requirements, manufacturers in the chemical industry must collect material safety data sheets from all of their suppliers and produce their own material safety data sheets containing elements from the documents collected from their suppliers, so they can sell the products to their own customers. These documents are often received in different languages and come from all over the world. Let's show my screen for a second, and we can see some examples. There's actually a good example on Wikipedia if you're interested. So, this is a material safety data sheet with a product name, chromium(III) acetate hydroxide. That's the information you were referring to, Greg, right?
That is correct. Walk us through this example. This example, at a high level, is a document that helps you, as a consumer of that product, understand how to store and manipulate this chemical. It also provides emergency response information, such as how to protect yourself, when to call 911, or when to provide immediate emergency response in case of an accident. There are multiple claims and technical information about the chemical. As a manufacturer, you want to collect a lot of information about this document because you're taking all these components and making a finished product that requires its own material safety data sheet for the consumer. Here's another example from the European Community. Correct. Similar text, labels, pictograms, and so on. Exactly. If you have 20 chemicals that go into your finished product, you want to extract the information from the 20 data sheets and consolidate everything into a single data sheet for your own product. It's about extracting information from such documents, especially the safety-critical aspects when you mix them together. You want to ensure you capture that. You need to extract text, signatures, pictograms, addresses, and make sure you're not leaving any suppliers out because you need every supplier's material safety data sheets in your database.
Thank you so much, Greg. It seems very important in your case. Can you tell me about how your customer is currently doing the extraction and so on today?
Previously, the work was performed by a global team of 60 employees processing 100 to 150 documents daily, at a yearly headcount cost of $2.1 million. Bottlenecks included data latency, since employees had to key in the extracted data, and a lack of traceability from supplier MSDS to the MSDS produced for finished goods. As a result, customers could receive products without an MSDS, which is a red flag if audited by the government. Customers are also subject to regulatory fines when found noncompliant. Another bottleneck was the lack of multilingual subject matter experts, so it took time to translate and validate foreign documents.
So I think I heard a bottleneck: manual work. Why is it a good problem to solve with machine learning?
Well, document text extraction is a big problem many companies are going through, and machine learning is a great solution because it can automate this revision process through natural language processing. When you think about how a human reviews a document, you realize most of the work includes text detection, translation, entity extraction, like identifying a chemical name or a company address. Part of that automation also includes computer vision for detection and extraction of pictograms, which tell you whether a chemical is dangerous or semi-dangerous, and even signatures for document authentication. So, text detection, text translation, and computer vision are key components. What does the automated workflow look like?
As a solution, we created a web portal for suppliers to submit the MSDS for their raw materials at the time of shipping. This submission triggers a series of Lambda functions that perform different tasks. We go through API calls to AI services such as Amazon Textract, Translate, Comprehend, and Rekognition. This is where we extract different things: translate from German to English, identify the supplier and their location, detect a pictogram for safety measures, or identify key claims inside the document. Let's show my screen; we have a sample architecture. It's not the actual architecture we implemented, but it's close enough and typical of building such a pipeline. So, Greg, walk us through the main blocks here and how they work together.
Absolutely. So, at a high level, you have the ingestion module, the OCR module, the NLP module, and the computer vision module. It starts with ingesting the document into an S3 bucket and pushing it through a Lambda function. A Lambda function drives traffic and guides it to different parts of the architecture. It lands on Amazon Textract, where you extract metadata about the document and push it through Comprehend, exporting all that data into a Redshift table or another S3 bucket, depending on your solution needs. In this case, the whole extraction process goes through an analytics module leveraging Neptune or other solutions. We have a different infrastructure leveraging Redshift tables and Amazon Elasticsearch service to create a searchable document repository.
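As a minimal sketch of the OCR step in this pipeline, here is what a document-processing helper could look like, using the boto3 Textract client. The function names `extract_lines` and `process_document` are hypothetical, not part of the actual solution, and the bucket and key are placeholders:

```python
def extract_lines(textract_response):
    """Collect the text of the LINE blocks from a Textract
    detect_document_text response."""
    return [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]

def process_document(bucket, key):
    """Run OCR on a document stored in S3 and return its text lines."""
    import boto3  # imported here so the parsing helper stays dependency-free
    textract = boto3.client("textract")
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return extract_lines(response)
```

In a real Lambda function, the extracted lines would then be handed to Translate, Comprehend, and the downstream storage layer, as described above.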
What about metrics? What's the improvement you saw?
Big improvements in this case include cost reductions. We reduced headcount to 20 people for a total of $700,000 a year. We also saw a 90% plus reduction in manual errors and a 50% throughput improvement, allowing our customer to scale and onboard multiple suppliers. The average model performance was around 96%. However, we focused on minimizing false negatives instead of false positives because we wouldn't want to misidentify a dangerous chemical as not dangerous.
Definitely. And I love such metrics. Accuracy is very important, especially in your case and for safety applications. How do you make sure this information is correct?
Absolutely. We also use Amazon Augmented AI to build a human-in-the-loop system for our customer. This allows users to review model inferences that didn't cross a set confidence threshold. The good thing about Amazon A2I is that it provides a continuous improvement environment for the model, allowing them to retrain using human-reviewed predictions.
Cool. Thank you. So, human in the loop plus a pure deep learning model. Any tips and best practices you could share with us?
Yeah, absolutely. I have a couple of do's. Start small, for example, by selecting suppliers based on one country to test and validate your proof of concept. It's important to fully explore the end-to-end business processes, making sure you understand how transactional data flows throughout these processes. Lastly, programs precede technology. Focus on creating the program and workflows with a clear understanding of where technology needs to be injected. A couple of don'ts: Do not try to solve all issues at the same time. The first phase of your project can just be text extraction, which can be a great start to build on. One thing I wish I knew earlier was that our customer was also building an ERP tool in-house. Now they want us to integrate our solution within their tool, which will take working with their team of software engineers. However, I see this as an extension of the current project more than anything else.
Okay, thank you so much, Greg. That's a really nice project, and we wish you all the best for the next step. But I think now it's high time to dive deep into cost optimization on SageMaker.
Absolutely. And thank you, Greg. This was a very interesting discussion. I learned quite a lot. It was a pleasure to have you on the episode. Let's stay in touch. I'm curious what you're going to build next. Thank you very much. Thanks a lot.
So now it's time to talk about SageMaker. We saw in Greg's project how we could integrate different high-level services. You could always find room for custom modeling, maybe if you want to do very specific image detection in those documents. You may want to build a computer vision model. We're going to work with a large-scale dataset, the ImageNet dataset, which is pretty big, with over a million images in a thousand classes. We're training from scratch because you might have a unique dataset, and you can't find a pre-trained model that works for you. Or maybe you want to build a pre-trained model that your data scientists can use as a baseline for fine-tuning on other tasks. So there are good reasons to train from scratch. Can you say a few words about ImageNet?
ImageNet is the reference dataset for many computer vision applications. It has revolutionized the field of large-scale visual recognition and serves as a benchmark for many computer vision models. It was launched more than 10 years ago by Fei-Fei Li to provide researchers with high-quality image datasets. ImageNet has over 15 million labeled high-resolution images belonging to roughly 22,000 categories organized in a hierarchical structure. The version we're using is a bit smaller, with 1.2 million images in a thousand classes. Let's start by looking at the dataset. The most difficult thing with the ImageNet dataset is downloading it because it's 150 gigabytes and takes quite a while. It took me five days using an EC2 instance. You go to the ImageNet website and use a script from the TensorFlow repository to download it. Make sure you launch this in a way that doesn't get interrupted because it takes days. Once you have the dataset, you get the training set and the validation set, totaling just under 150 gigabytes. When you extract it, you get a file tree with image categories. Each category contains images. We don't want to move all those images around because it takes a long time. Instead, we're going to pack them into roughly 150 files. This makes it easier to move the files around and distribute them to different GPUs. The technique I'm using relies on a file format called RecordIO, part of Apache MXNet. I'm using this because I'm training with the image classification algorithm in SageMaker, which is implemented with Apache MXNet. If you use TensorFlow, there's a TFRecord format that's very similar. I'm not sure about PyTorch, but there's likely an equivalent format. We can run a simple tool called im2rec, which converts images to RecordIO. We just say, "Give me six chunks for the validation dataset and 140 chunks for the training set." This reduces the 1.2 million images to 140 training files, each about 300 megabytes.
This is large enough to reduce the dataset size but small enough to distribute efficiently across the training cluster. This process takes a few minutes. Once you're done, put everything into S3, and SageMaker can start working with it.
There's another problem, though. We have a labeled dataset, but what if we need to start from scratch? Let's talk about data labeling. Data labeling is a cost problem because manual work is very expensive. If you need to label millions of images, you need a large team, and it takes a lot of time. Instead, we can use SageMaker Ground Truth. I'll show you a couple of labeling steps. For an end-to-end demo, please check my YouTube channel, where there's a four-part series on Ground Truth. Here, I have a simple demo with a workforce setup, just me, a single worker. I created a labeling job where I'm labeling guitars. I have a few images, and I want to do semantic segmentation on guitars. Let's log in to the worker portal, and I can start working. I'm presented with images and detailed instructions. I need to segment guitars and bass guitars. Let's try this. Is this a guitar or a bass guitar? No, it's a guitar. I can use the auto-segmentation tool, and voila. I can see a bit that hasn't been properly labeled, so I can use the brush to fix it. It's like coloring, a very relaxing activity. I can zoom out, and it looks good to me. Submit. I'll do another one. Here's another one. No, it's a guitar. Let's do this one. The tools in SageMaker Ground Truth make it very easy. Once you're done, you have a fully labeled dataset. You can distribute this to a bigger private workforce, an AWS partner, or scale out on Mechanical Turk. Once you're done, you can see the labeled images and get information in S3, which I copied to my EC2 instance. We see the augmented manifest, which is the list of images with annotation information. In this case, the mask points to the image mask I drew on the guitars. I can use this for training.
Now, if we had a million images to label, we wouldn't do it manually. SageMaker Ground Truth has an automatic labeling feature using active learning techniques to cut down on data labeling costs. Active learning identifies data that should be labeled by workers and data that can be automatically labeled. We start manually, and as the model improves, it starts labeling at scale, significantly speeding up the process and reducing costs.
So now we know how to accelerate our labeling efforts. Let's talk about data storage and data loading. We split the dataset and synced it to an S3 bucket. Once we have that, we can get to work. Let's jump to the notebook. We've synced our training and validation sets into S3. The .rec files are in S3, and we have the training and validation channels. Remember, it's 150 gigabytes. Who wants to copy 150 gigabytes to each training instance every time you start a training job? Even with an instance with 100 gigabit networking, it takes more than 20 minutes, and you pay for that time. So we use pipe mode. When you define your training input, the default value is "file," which means copy everything to the instance. Instead, we set it to "pipe," which means stream the data from S3 to your training instances. We'll see in the training log that there isn't any time spent on downloading because we're streaming. This makes it easy to distribute the data. One RecordIO file can go to one instance, and another to another instance. We use a shuffle config with a random seed to shuffle the RecordIO files and distribute different files to each instance. This ensures each instance gets a different subset of the data set.
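As a sketch, configuring pipe mode and per-instance shuffling with the SageMaker Python SDK might look like this; the bucket name, prefixes, and seed are illustrative:

```python
from sagemaker.inputs import TrainingInput, ShuffleConfig

# "Pipe" streams data from S3 to the training instances instead of copying
# it first; ShardedByS3Key sends different RecordIO files to each instance,
# and the seeded shuffle varies which files each instance receives.
train_input = TrainingInput(
    "s3://my-bucket/imagenet/train/",        # illustrative bucket/prefix
    content_type="application/x-recordio",
    input_mode="Pipe",
    distribution="ShardedByS3Key",
    shuffle_config=ShuffleConfig(seed=1234),
)
validation_input = TrainingInput(
    "s3://my-bucket/imagenet/validation/",
    content_type="application/x-recordio",
    input_mode="Pipe",
)
```

These inputs are then passed as the training and validation channels when launching the job.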
Now let's move on to training. We have our data in S3, we define our channels, and we grab the image classification algorithm. We configure the training job. Which instance should we pick? It's pretty obvious we want to go with GPUs. We're using a ResNet image classifier with deep learning, which means GPUs. We should start small. I ran a quick test on a P3.2xlarge instance, a single GPU instance, with a batch size of 128. It was training at about 335 images per second. Given that we have over 1.2 million images, one epoch will take about an hour and four minutes. Training 150 epochs would take 158 hours, or about 6.5 days. From a business perspective, this is too long. From a cost perspective, this instance costs $3.825 per hour, so training costs would be about $600, which sounds reasonable, but the 6.5 days are terrible. The productivity waste makes the training cost irrelevant. You can't iterate if you have to wait a week.
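To make the arithmetic above explicit, here is the back-of-the-envelope calculation, using the throughput measured in the quick test and the price quoted in the episode:

```python
# Single ml.p3.2xlarge (one V100), measured throughput from the quick test.
images = 1_281_167        # ImageNet-1K training images
throughput = 335          # images per second
epochs = 150
price_per_hour = 3.825    # on-demand price quoted for ml.p3.2xlarge

epoch_hours = images / throughput / 3600      # ~1.06 h, about an hour per epoch
total_hours = epoch_hours * epochs            # ~159 h, about 6.5 days
total_cost = total_hours * price_per_hour     # ~$600
```

The cost is tolerable; the wall-clock time is the real problem, which is why the next step is scaling up and out.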
The next step would be to try a multi-GPU instance. I took the biggest, a P3DN.24xlarge, which has eight GPUs. I set the instance storage to the minimum value, one gigabyte, because we're streaming and don't need local storage. We just need a little bit of storage to save the model. The default value is 30 gigabytes, but we can save a few cents. Hyperparameters: ResNet with 50 layers, trained from scratch, 1000 classes, batch size of 1024, learning rate of 0.4, and image augmentation. We're going to use synchronized gradient updates, which is usually more accurate. This is based on the native distributed training feature in MXNet. If you use TensorFlow, PyTorch, etc., you can use different options. We train, and the time per epoch is down to 727 seconds. If we train for 150 epochs, it now takes 30 hours instead of 158 hours. The cost is twice as much, but some people will say it's a good deal to go five times faster for twice the cost. Others will want to go much faster and pay even less. That's the point of this episode: it should be fast, inexpensive, and easy.
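A sketch of this job configuration with the SageMaker Python SDK, using the built-in image classification algorithm; the hyperparameter values are the ones mentioned above, and the S3 paths are illustrative:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
container = image_uris.retrieve("image-classification", session.boto_region_name)

estimator = Estimator(
    container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3dn.24xlarge",   # 8x V100
    volume_size=1,                      # minimal EBS: we stream with pipe mode
    sagemaker_session=session,
)
estimator.set_hyperparameters(
    num_layers=50,                 # ResNet-50, trained from scratch
    use_pretrained_model=0,
    num_classes=1000,
    num_training_samples=1281167,
    mini_batch_size=1024,
    learning_rate=0.4,
    epochs=150,
    augmentation_type="crop_color_transform",
    kv_store="dist_sync",          # synchronized gradient updates
)
estimator.fit(
    {"train": "s3://my-bucket/imagenet/train/",
     "validation": "s3://my-bucket/imagenet/validation/"}
)
```

The channels can also be `TrainingInput` objects configured for pipe mode, as discussed earlier.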
We have a good speedup, but we're spending a little too much. One quick optimization is to look at the training job we ran and check the CloudWatch metrics for GPU utilization and GPU memory utilization. In this case, looking at CloudWatch, I can see GPU memory utilization is only about 300% out of a possible 800% across the eight GPUs, meaning I'm only using 37.5% of my GPU memory. So I can increase the batch size quite a lot. I can also use the profiling capability of SageMaker Debugger to generate a profiling report, which surfaces this information automatically. I train again with a batch size of 2736. GPU utilization is very high, and GPU memory utilization is pretty good, though I could still push it a bit further. Surprisingly, the larger batch size didn't really speed things up; I think this is because of the cost of gradient synchronization in distributed training. Still, it's a good idea to max out GPU memory.
Now we know we're making good use of that one instance. We can add another one, but we don't want to pay twice as much. So we introduce managed spot training: we can get access to unused EC2 capacity at a significant discount. I bump this job to two instances and set up spot instances. The cost is divided by three, and the time per epoch is 378 seconds, about twice as fast. Compared to the previous scenario, we're twice as fast and three times cheaper. Let's try four instances. Time per epoch is now 198 seconds, twice as fast again, with a very minor cost increase. It's almost linear scaling. Let's try eight instances. Now we're getting into crazy territory with 64 GPUs, two terabytes of GPU RAM, and eight petaflops of FMA operations. Time per epoch is 99 seconds, twice as fast again, at the same cost. We started at 158 hours and ended at 4.12 hours, about 38 times faster, and not even twice the initial cost. Who wouldn't want a 38x speedup for not even twice the cost?
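Turning on managed spot training is a handful of extra Estimator parameters; a sketch with the SageMaker Python SDK, where the time limits and S3 paths are illustrative:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
container = image_uris.retrieve("image-classification", session.boto_region_name)

estimator = Estimator(
    container,
    role=sagemaker.get_execution_role(),
    instance_count=8,                    # scale out across 8 x p3dn.24xlarge
    instance_type="ml.p3dn.24xlarge",
    volume_size=1,
    use_spot_instances=True,             # train on spare EC2 capacity
    max_run=24 * 3600,                   # cap on actual training time (seconds)
    max_wait=48 * 3600,                  # must be >= max_run; includes spot waits
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # resume after interruptions
    sagemaker_session=session,
)
```

The training log then reports both the billable seconds and the savings percentage achieved by spot capacity.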
Scaling and cost optimization are linked. People think scaling is just about increasing capacity, but if you scale right, you can save money. From one to two to four to eight instances, it's almost a straight line. If we wanted to train for a long time, we could use SageMaker Debugger to detect problems and stop training jobs that are bound to fail, saving money. We can set up built-in rules to detect overfitting, loss not decreasing, vanishing gradients, and exploding tensors. SageMaker will detect these issues and send a notification that can trigger a Lambda function to stop the job. We can also use early stopping and learning rate scheduling. After training, we want to deploy the model. For a low-traffic endpoint, we could use the least expensive GPU instance, an ml.g4dn.xlarge, which comes with a single NVIDIA T4. It's not expensive, and we can predict. However, if we don't need that much GPU power, we can use Amazon Elastic Inference. We deploy on a CPU instance and add an Elastic Inference accelerator, which comes in three sizes, from one to four teraflops: a fraction of a full GPU's performance. The combination of a CPU instance and an accelerator is very cost-effective compared to a full GPU instance, saving 50-60% of the inference cost.
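Attaching an Elastic Inference accelerator is a single extra argument at deployment time. A sketch with the SageMaker Python SDK, assuming the trained model artifact sits in S3 (the bucket and paths are illustrative):

```python
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
model = Model(
    image_uri=sagemaker.image_uris.retrieve(
        "image-classification", session.boto_region_name
    ),
    model_data="s3://my-bucket/output/model.tar.gz",  # illustrative artifact path
    role=sagemaker.get_execution_role(),
    sagemaker_session=session,
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.large",         # inexpensive CPU instance
    accelerator_type="ml.eia2.medium",   # smallest of the three EI sizes
)
```

If you deploy straight from a fitted estimator, `estimator.deploy(...)` accepts the same `accelerator_type` argument.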
In a nutshell, scaling and cost on AWS are always linked. If you scale right, you shouldn't spend much more money and might even save money. SageMaker is a good example. We can save even more with savings plans, a new feature that offers deep discounts in exchange for a commitment. You commit to spending a certain amount of money on SageMaker for one year or three years. The commitment is measured in dollars per hour, and you get a discount in return. In my account, I spend a little more than $3,000 per month on SageMaker. If I commit to spending $2.205 per hour for a year, I would save about $500 per month, roughly 17% compared to the on-demand cost. For a three-year plan with no upfront payment, I could save 32%. If I commit to an upfront payment, I could save up to 36%. In a nutshell, savings plans can save you a significant amount of money on your SageMaker usage.
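The quick math behind those one-year figures, with all numbers as quoted in the episode:

```python
on_demand_monthly = 3000.0   # current on-demand SageMaker spend, USD/month
hours_per_month = 730        # average hours in a month
monthly_saving = 500.0       # quoted saving with the one-year plan

on_demand_hourly = on_demand_monthly / hours_per_month  # ~ $4.11/hour
saving_pct = monthly_saving / on_demand_monthly * 100   # ~ 17%
```

Comparing your on-demand hourly spend against a candidate commitment like this is a simple way to size a savings plan before committing.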
In just a few minutes, we covered a lot. I will put the notebook in the usual repository and provide links to all the blog posts. The first one is the one we looked at with Greg, and then we have SageMaker Debugger, SageMaker Profiling, how to trigger Lambda functions to stop bad training jobs detected by Debugger, model tuning, spot training, elastic inference, and savings plans. You have a lot to read until next time. Thank you, Greg, for joining us. Thanks to all the colleagues who organized this and answered your questions. Thank you, Segelen, for the great interview and moral support in scaling this thing. We'll see you in two weeks. We're going to do SageMaker Autopilot, so all you lazy machine learning engineers out there, don't miss this one. One click, build a model, go and have coffee. We're done. Amazing. Thank you very much. Have a great weekend. Feel free to connect and ask us questions. Happy to help you out. Thank you. Bye-bye. See you next time.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.