Amazon Textract: Extracting text, tables, and forms from documents

Transcript

In this video, I would like to show you Amazon Textract, a high-level service that lets you extract text, forms, and tables from documents, either images or PDF files. Textract was launched at re:Invent last year, and we just announced improvements to the service, so I would like to use a few sample documents and show you how well it does.

Let's start with the first document. It's a research article, a pretty structured document with some graphics, tables, and different formats, including two columns, etc. I'm using the AWS console here. Of course, you can also use AWS APIs, either with the command line or in your favorite SDK. However, for a YouTube video, it's more appropriate to use the console. You would just go to the console and upload documents to see results. In the interest of time, I've already uploaded the document. It just takes a few seconds. The document gets copied to an S3 bucket in the same region and then processed.

We see that this article was nicely recognized by Textract. We can get the raw text, and everything is basically correct. We can click on the text boxes to figure out where the text was detected. We can check that the text has been correctly processed. We can download results to a JSON file, which is what the actual API call returns.

If we look at forms, there are no forms in this document. If we look at tables, we can see the large table here that's correctly detected. The important bit is not just detecting the text but extracting the structure. In the JSON document that the API returns, you get structure information on that table and other tables as well. If you were parsing that JSON file in a backend for further processing, you would get the text and the structure, including rows and columns. It looks like we've done a pretty good job on this one.

Let's look at another one, something a little more difficult. This is a financial document for an ETF. You can see this one is less structured than the previous one.
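To make the table structure described for the research article more concrete, here is a minimal Python sketch of how a backend might rebuild rows and columns from the JSON. It assumes the response follows the documented Textract block layout (`Blocks`, `BlockType`, `Relationships`, `RowIndex`, `ColumnIndex`). The helper names and the tiny `sample_response` with its IDs and values are made up for illustration and are not taken from the video; merged cells and selection elements are ignored.

```python
def child_ids(block):
    """Collect the IDs listed under a block's CHILD relationships."""
    ids = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            ids.extend(rel["Ids"])
    return ids


def block_text(block, blocks_by_id):
    """Join the text of the WORD blocks that are children of a block."""
    words = []
    for cid in child_ids(block):
        child = blocks_by_id[cid]
        if child["BlockType"] == "WORD":
            words.append(child["Text"])
    return " ".join(words)


def extract_tables(response):
    """Return each TABLE block as a list of rows, each row a list of cell texts."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cid in child_ids(block):
            cell = blocks_by_id[cid]
            if cell["BlockType"] == "CELL":
                # RowIndex and ColumnIndex are 1-based in the response.
                position = (cell["RowIndex"], cell["ColumnIndex"])
                cells[position] = block_text(cell, blocks_by_id)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        grid = [
            [cells.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)
        ]
        tables.append(grid)
    return tables


def _cell(cell_id, row, col, word_id):
    """Build a hypothetical CELL block holding a single word."""
    return {
        "Id": cell_id, "BlockType": "CELL",
        "RowIndex": row, "ColumnIndex": col,
        "Relationships": [{"Type": "CHILD", "Ids": [word_id]}],
    }


def _word(word_id, text):
    """Build a hypothetical WORD block."""
    return {"Id": word_id, "BlockType": "WORD", "Text": text}


# Hypothetical, hand-written response fragment (not real Textract output).
sample_response = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
        _cell("c1", 1, 1, "w1"), _cell("c2", 1, 2, "w2"),
        _cell("c3", 2, 1, "w3"), _cell("c4", 2, 2, "w4"),
        _word("w1", "Item"), _word("w2", "Amount"),
        _word("w3", "Fee"), _word("w4", "10"),
    ]
}

tables = extract_tables(sample_response)
for row in tables[0]:
    print(row)
```

In a real backend, the same function could be fed the JSON file downloaded from the console, or the dictionary returned by an SDK call, instead of the hand-written sample.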
It has all kinds of text blocks, graphs, and text inside graphs, as well as a column of text and a table. It's not a proper article but just blocks of text and graphics. Let's see how we do here. I've uploaded this previously, so if we take a quick look at the document, every bit of text was correctly detected. We can see the title, etc.

What about forms? We can see this box on the right-hand side was detected as a form because it doesn't have cells like a proper table would have. In the case of forms, you would get that structure in the JSON document, including key-value information. For example, the key would be "fund launch date," and the value would be "May 15, 2000." For each key-value pair, you have proper structure in the JSON answer.

What about tables? This was an easy one. It does look like a table with proper cells and lines. We know exactly where each bit of information sits, on what row and column. This is super useful if you want to move that data to a database. It's already structured.

Let's try the last one, a proper form. I just grabbed a random form from the web, a health insurance claim form. Every company has to process forms, and they're structured. Every cell has a proper name and expects a value, or some cells are optional and stay blank. It's complicated because of multi-line text and numbers that might be confusing, and it's not properly aligned. This is not a super easy one.

Raw text is not an issue. Let's look at the forms. We're doing okay. We find the patient's name, and that information is available in the JSON answer. If you were looking for the person's name, you would just look for the key called "patient's name," and the value would be "Smith Bob." The same goes for the address, etc. The tiny numbers haven't confused Textract, and the multi-line text was not a problem either. It looks like we've done well.

This was not detected as a form but as a table, with proper cells, rows, and columns.
Even though some data was not really aligned, and we have descriptions spanning multiple columns, we're doing okay.

It looks like Textract did a really good job on these three examples, and they are pretty random examples. You can easily test this by logging into the AWS console, going to Textract, which is available in all those regions, and making up your own mind.

Last but not least, Textract has just become PCI DSS certified, so if you need to process credit card information, you can do it. I will include the link to the blog post in the video description. That's it for today. I hope you liked it, and I'll see you soon for another video. Bye-bye.
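For readers who want to try the key-value lookups described in the video outside the console, here is a minimal Python sketch. It assumes the JSON follows the documented Textract block layout, where a `KEY_VALUE_SET` block with entity type `KEY` points to its `VALUE` block through a `VALUE` relationship. The function names, block IDs, and the hand-written `sample_response` are hypothetical; only the two key-value pairs are the ones quoted in the video. Checkboxes (selection elements) are not handled.

```python
def related_ids(block, relation_type):
    """Collect the IDs listed under relationships of the given type."""
    ids = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == relation_type:
            ids.extend(rel["Ids"])
    return ids


def block_text(block, blocks_by_id):
    """Join the text of the WORD blocks that are children of a block."""
    words = []
    for cid in related_ids(block, "CHILD"):
        child = blocks_by_id[cid]
        if child["BlockType"] == "WORD":
            words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response):
    """Map each form key to the text of its value block(s)."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue  # VALUE blocks are reached through their KEY block
        key = block_text(block, blocks_by_id)
        values = [
            block_text(blocks_by_id[vid], blocks_by_id)
            for vid in related_ids(block, "VALUE")
        ]
        pairs[key] = " ".join(v for v in values if v)
    return pairs


def _words(prefix, text):
    """Build hypothetical WORD blocks for each word in text."""
    return [
        {"Id": f"{prefix}{i}", "BlockType": "WORD", "Text": w}
        for i, w in enumerate(text.split())
    ]


def _pair(name, key_text, value_text):
    """Build hypothetical KEY and VALUE blocks plus their WORD children."""
    key_words = _words(f"{name}-kw", key_text)
    value_words = _words(f"{name}-vw", value_text)
    key_block = {
        "Id": f"{name}-k", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
        "Relationships": [
            {"Type": "VALUE", "Ids": [f"{name}-v"]},
            {"Type": "CHILD", "Ids": [w["Id"] for w in key_words]},
        ],
    }
    value_block = {
        "Id": f"{name}-v", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
        "Relationships": [{"Type": "CHILD", "Ids": [w["Id"] for w in value_words]}],
    }
    return [key_block, value_block] + key_words + value_words


# Hypothetical, hand-written response fragment (not real Textract output).
sample_response = {
    "Blocks": _pair("a", "Fund launch date", "May 15, 2000")
    + _pair("b", "Patient's name", "Smith Bob")
}

fields = extract_key_values(sample_response)
print(fields["Patient's name"])
```

Returning a plain dictionary keeps the lookup as simple as the video suggests: find the key, read the value. Real forms can repeat a key or label it slightly differently, so a production parser would likely need to normalize keys and keep duplicates.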
