AWS AI Machine Learning Podcast, Episode 13: Amazon Kendra special
March 11, 2020
In this episode, I focus on Amazon Kendra, an enterprise search service powered by machine learning... but you don't need any ML skills to set it up and use it! I show you how to create an index, add data sources, and then I run queries using the AWS console and the AWS CLI.
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future episodes ⭐️⭐️⭐️
https://aws.amazon.com/kendra/
This podcast is also available in audio at https://julsimon.buzzsprout.com.
For more content, follow me on:
* Medium https://medium.com/@julsimon
* Twitter https://twitter.com/@julsimon
Transcript
Hey, good morning, everyone. This is Julien from Arcee. Welcome to episode 13 of my podcast. Don't forget to subscribe to my channel to be notified of future videos. In this episode, we're going to move away from machine learning for a second and focus on a new service that was announced at re:Invent a few months ago. This service is Kendra. Kendra is a search engine that makes it really easy to create an index from data located in different sources and then query that index using natural language. A really cool service, and of course, it is powered by deep learning under the hood, but you don't need to know the first thing about that. Let's not wait; let me show you how to use it. Let's take a look at the console first. This is the Kendra console, and the first step is to create an index. I created one already because this operation takes a bit of time, but let me show you how to create a new one. Simply click on "Create Index," give it a name, a description if you'd like, and a role, because, as you can expect, Kendra needs permission to fetch data from the different data sources we're going to look at in a second, such as S3, RDS, etc. You can create a role here or use an existing one; just make sure it has permission to access your buckets, etc. Click on "Create," and that's it. It's going to run for a while, generating that new index, and then you can start adding data to it.
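The console steps above map onto the CreateIndex API. Here is a minimal sketch of the request you would send with boto3; the index name and role ARN are placeholders, and the actual client call is left in a comment so the snippet stays runnable without AWS credentials:

```python
# Sketch of creating a Kendra index programmatically.
# The name and role ARN below are placeholder values.
create_index_params = {
    "Name": "my-search-index",
    "Description": "Demo index for S3 and RDS documents",
    # Kendra assumes this role to access your data sources
    # and write logs, as described above.
    "RoleArn": "arn:aws:iam::123456789012:role/KendraIndexRole",
}

# With boto3 installed and credentials configured, the call would be:
#   import boto3
#   kendra = boto3.client("kendra")
#   response = kendra.create_index(**create_index_params)
#   index_id = response["Id"]
# CreateIndex is asynchronous: poll DescribeIndex until the index
# status becomes ACTIVE before adding data sources.
```
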
My index has already been created; it's active, and now you need to add data sources. Data sources can be either an S3 bucket, an RDS database, or SharePoint Online. Here, I simply added an S3 bucket, and this is exactly what you would think. Give it a name, pass the location of your bucket. You can pass metadata, so if you have extra information on those files, you can have a separate file in that bucket with that extra information. Here, that's not what I'm doing; I'm just passing bulk data, the IAM role, and a sync schedule. You can go from on-demand to hourly, daily, weekly, etc. Click "Next," and that's it. Super easy. The same for RDS. You can pick the engine type and, of course, the connection information to fetch the data and the run schedule. Nothing weird here.
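The S3 data source form corresponds to the CreateDataSource API. A sketch of the request, under the assumption that you sync on demand (bucket name, prefixes, and role ARN are placeholders):

```python
# Sketch of attaching an S3 bucket as a Kendra data source.
# All names and ARNs are placeholders.
data_source_params = {
    "IndexId": "index-id-from-create-index",
    "Name": "my-s3-documents",
    "Type": "S3",
    "Configuration": {
        "S3Configuration": {
            "BucketName": "my-document-bucket",
            # Optional: a prefix holding per-document metadata files.
            "DocumentsMetadataConfiguration": {"S3Prefix": "metadata/"},
        }
    },
    "RoleArn": "arn:aws:iam::123456789012:role/KendraDataSourceRole",
    # Omit "Schedule" for on-demand syncs; otherwise pass a cron
    # expression, e.g. "cron(0 2 * * ? *)" for a daily 02:00 UTC sync.
}

# With boto3, you would create the source and trigger a sync with:
#   kendra.create_data_source(**data_source_params)
#   kendra.start_data_source_sync_job(Id=..., IndexId=...)
```
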
Let's take a look at my S3 bucket. This is what I have in there: a bunch of PDF files, a couple of Word documents, and a lot of plain-text files. Those are from the newsgroup dataset that you may know, a collection of text messages from newsgroups. I have some slides, some of my PowerPoint decks, and a ton of Wikimedia files, about 50,000 files or something. That's a very small fraction of Wikipedia. If we look at the documentation for Kendra, we can see the types of documents that you can index: HTML, PowerPoint, Word, plain text, and PDF. I have a bit of each. You can also add questions and answers, which is structured information. If you want really precise, predefined answers to common questions, you can do that as well. All of these are supported at the moment. As we saw, each source has a run schedule. Here, I'm running on-demand, so if I clicked on "Sync Now," the index would be refreshed with the data in that bucket. I ran this a couple of times and can see the number of documents that have been added or updated. That's my total number of documents, over 57,000 here. Super simple.
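The extra per-document information mentioned earlier lives in a companion metadata file next to each document (for `deck.pptx`, a `deck.pptx.metadata.json` in the metadata prefix). A hypothetical example; the exact field names and the content-type values are my recollection of the S3 metadata format, so check the Kendra documentation before relying on them:

```json
{
  "Title": "Amazon SageMaker overview deck",
  "ContentType": "PPT",
  "Attributes": {
    "_category": "slides",
    "_created_at": "2020-01-15T12:00:00Z"
  }
}
```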
FAQs are easy as well. All you have to do is upload a CSV file with a column for the question, a column for the answer, and an optional column for a URL, if you want to attach extra information to the answer. Just put that file in an S3 bucket, upload it, and it becomes an FAQ that your index will serve when user queries match those questions. Before we query, you can also use facets. Facets are fields you can use to filter your content. Here, I have some predefined fields like document title, last update, view count, etc. I could make those visible in my search UI, and users could filter their requests based on them. You can add custom fields as well. This is a little too involved for this short demo, but please take a look at the documentation. It's really not difficult.
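Building that three-column FAQ file is a few lines of Python; the questions and answers below are made up for illustration, and you should check the Kendra docs for the exact header conventions of the CSV format you pick:

```python
import csv

# Build a simple FAQ file: one row per question, with an answer
# and an optional source URL, matching the layout described above.
faqs = [
    ("What is Amazon Kendra?",
     "An ML-powered enterprise search service.",
     "https://aws.amazon.com/kendra/"),
    ("What data sources does Kendra support?",
     "S3, RDS, and SharePoint Online, among others.",
     "https://aws.amazon.com/kendra/"),
]

with open("faq.csv", "w", newline="") as f:
    csv.writer(f).writerows(faqs)

# Upload faq.csv to S3, then register it with the CreateFaq API:
#   kendra.create_faq(IndexId=..., Name="demo-faq",
#                     S3Path={"Bucket": ..., "Key": "faq.csv"},
#                     RoleArn=...)
```
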
Once you've indexed your data sources, you want to query. Let's take a look at that. We have a built-in search console here. Let's try and run some queries. I'm going to use natural language because that's what we do; we don't want to use keywords and complex query languages. We just want to use natural language. So, what is Amazon SageMaker? I see this Amazon Kendra suggested answer, which is really the best answer that Kendra can find. What did it find? It found text in one of my PowerPoint decks, actually in speaker notes, and as you can see, it's highlighting the text that it thinks is the best answer. This is a very good answer; it's proper language, not just a hit on the document. The next one has more keywords that are matched, so I could say, "Well, this is a good one, thanks." Let's try another one. There is a feature in SageMaker called pipe mode, and I'm sure it's in those documents. Let's take a look. Here again, I have a suggested answer, the top answer with a proper piece of text, and this is the definition of pipe mode: a feature that streams data from Amazon S3 to training instances. This is in one of my Word documents. That's pretty cool. The rest is mostly keywords, so as you can see, Kendra highlights the top answer, the one that really contains natural language that answers the question, not just matching keywords.
Let's try to find information in those 50,000 files. For the record, I indexed articles starting with "th," so that's why you're going to see a lot of those. It's really just a small subset of Wikipedia. Let's try this: who's Thad Jones? Thad Jones is an American jazz trumpeter, composer, and band leader. Again, this is a really good answer because it's exactly what I was looking for, and the fact that the title and the file name contain Thad Jones obviously helps Kendra find the right answer for me. Let's try a few more. Maybe I want to know what instrument Thad Jones plays. Here again, this is a really good answer because it pulls one of the Thad Jones articles and highlights the meaningful words. So Thad Jones is in there because it's in my query, and "trumpet" is highlighted. There's definitely an association here between Thad Jones and the trumpet, which is exactly what my query was about. Once again, there was nothing in my query that said "trumpet": Kendra was able to understand the context of my query, find the right answer, and highlight the right bit of text inside it.
Maybe a last one: where was Thad Jones born? Here we get the actual answer; Pontiac, Michigan is highlighted. This is really nice because it is in the Wikipedia article for Thad Jones, and Kendra can extract that information from the article and promote it. We see natural language processing at work: context being extracted from the query, and text strings extracted from the top-ranking article, pointing me to the exact answer. I used the console here, but of course, we could use the CLI. Let's take a look. We have a bunch of Kendra APIs. Let's list indexes. Let's not argue whether indexes or indices is the right word; that's one for the scholars. I see we have an index here. Let's see if we can query it: kendra query, with the index ID and the query text. So, where was Thad Jones born? I get a JSON answer, which is just a JSON representation of what we saw in the console: a list of answers with URLs to the documents, offsets to the relevant highlights, etc. There you go. This is the 10- or 15-minute demo of Kendra. Super easy to use, pretty powerful, deep learning under the hood, but just a couple of clicks, a couple of API calls, and you can start indexing your data. S3, RDS, SharePoint Online, and more connectors in the future, I'm sure. This is the test console. What would you do next? Well, I guess the next step would be to start integrating those different widgets into your own application. We have a bunch of documents here that show you how to do that. I'm not going to go deep on that because it would make the video too long, and it involves front-end skills, which I definitely don't have. So, sorry about that, but basically, you can go and fetch those different components and integrate them. We have a sample app that shows you how to do that, so if you're a front-end person, you'll get it in no time. You know me; it's a challenge for sure.
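That JSON answer is easy to pick apart in your application. A sketch of pulling out the "suggested answer" and its highlighted snippet; the response below is a hand-made sample in the shape the Query API returns (ResultItems with a type, an excerpt, and highlight offsets), not a real API payload:

```python
# Hand-made sample in the shape of a Kendra Query response, e.g. from:
#   aws kendra query --index-id <id> --query-text "where was Thad Jones born"
sample_response = {
    "ResultItems": [
        {
            "Type": "ANSWER",
            "DocumentTitle": {"Text": "Thad Jones"},
            "DocumentExcerpt": {
                "Text": "Thad Jones was born in Pontiac, Michigan.",
                "Highlights": [{"BeginOffset": 23, "EndOffset": 40}],
            },
        },
        # Keyword-match results follow the suggested answer.
        {"Type": "DOCUMENT", "DocumentTitle": {"Text": "Jazz trumpet"}},
    ]
}

def top_answer(response):
    """Return (excerpt, highlighted snippet) of the first ANSWER item."""
    for item in response["ResultItems"]:
        if item["Type"] == "ANSWER":
            excerpt = item["DocumentExcerpt"]["Text"]
            h = item["DocumentExcerpt"]["Highlights"][0]
            return excerpt, excerpt[h["BeginOffset"]:h["EndOffset"]]
    return None, None

excerpt, snippet = top_answer(sample_response)
print(snippet)  # -> Pontiac, Michigan
```
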
The last thing I want to say is, please check out the service page. You can find more information on features, pricing, which is important. Kendra is probably more expensive than the services you're used to, so make sure you understand pricing before you start indexing tons of documents. And of course, you'll find customer stories here as well. All right, this is it for this episode. I hope you liked it and learned a few things. Again, don't forget to subscribe to my channel, and I'll see you soon with more videos. Until then, fuck the virus and keep rocking!
Tags
AWS Kendra, Natural Language Search, Data Indexing Service
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.