SageMaker Fridays Season 2 Episode 5 Natural language processing topic modeling November 2020
November 09, 2020
Broadcast live on 06/11/2020. Join us for more episodes at https://amazonsagemakerfridays.splashthat.com/
This project provides an end-to-end solution for grouping news headlines by topic. Using the Million News Headlines dataset from Kaggle, we use SageMaker to train a model with the Neural Topic Model (NTM) algorithm. We also discuss how NTM differs from Latent Dirichlet Allocation (LDA), another topic modeling algorithm. Finally, we deploy the model to a real-time endpoint and use it to predict topics for news headlines.
Transcript
Hey, good morning, everyone. Good afternoon. Good evening, depending on where you are. Welcome to episode five of this new season of SageMaker Fridays. This is almost the last episode. We have one more to go, so still lots of good content to cover. My name is Julien. I'm a principal developer advocate focusing on AI and machine learning. Please meet my co-presenter. Hello, everyone. My name is Sego-Len, and I'm a senior data scientist working with the AWS Machine Learning Solutions Lab.
All right, Sego, thanks for joining us again. I'm sure you're going to help us understand a few more complex concepts this week. Let me remind you before we start that we are live. We're in lockdown, not in the AWS office, but we're going to unlock machine learning at least, right? So we're live. You can ask all your questions. We have friendly moderators waiting for them. Ask anything you like about machine learning. Don't be shy. I keep saying there are no silly questions. This is about learning. So make sure you learn as much as possible. Okay?
So it's time to get started. We've covered quite a few topics in previous episodes, and this week we're going to focus on a wide and important topic for machine learning: natural language processing. Sego, tell us more.
Indeed, Julien. Today we are going to talk about a very practical NLP topic: topic modeling. Topic modeling is used to organize a corpus of documents into topics, which is a grouping based on the statistical distribution of words within the documents themselves. The key idea is to convert unstructured data into meaningful and useful data for later analysis, thanks to advanced ML models. There are many practical use cases for topic modeling, such as document classification based on detected topics, automatic content tagging using a tag map to a set of topics, document summarization, information retrieval using topics, content recommendations based on topic similarities, and so on.
Yeah, sounds like a use case that everyone can relate to. Every company and organization has mountains of text documents they'd like to organize, right? Customer emails, internal documents, news articles, archives, and so on. There's so much out there, and it's impossible to go through all of it and label it. So we'll see how to handle that. Today, we're going to cover how to group those documents. Document and NLP work, especially with the huge amount of documents available everywhere, is a very exciting topic. But it's also quite difficult due to the complexity of language and the importance of data processing.
So we are going to show how another unsupervised Amazon SageMaker built-in algorithm can help us get good results out of the box. Indeed, we are going to use a new one, the NTM (Neural Topic Model) algorithm, on top of the open-source Million News Headlines dataset, which comes from the Australian ABC website. We will see how to process text datasets using open-source tools such as NLTK and Gensim, and of course, how to train, evaluate, and deploy the topic modeling algorithm.
Wow. Okay. So new algorithms with more complex names, one million news headlines, and interesting open-source libraries. What's not to like? I think we're going to dive quite deep today. So you know what to do right now. You need to get some coffee or caffeine or whatever you need to stay awake for an hour or so and learn a lot.
Okay, so let me show you the GitHub repo. It's here on my screen. Don't worry about the URL right now; it's going to be on the final slide, as always. This is actually one of the examples from my SageMaker book, which I've talked too much about already. But it's all in there. There is an example using two algorithms. I'm showing you how to build topic modeling using NTM, which Sego mentioned, and another one we'll quickly mention called LDA. It's also interesting to compare the performance and efficiency of these algorithms. All the code is in there. Don't worry about it; it's on the final slide, and we can point you to it later.
As usual, before we dive into the code, we need to understand the machine learning process and how we're going to solve it. As we just explained, the problem we're trying to solve is finding related text documents inside a pretty large collection. Here we have a million documents, which are news headlines, so they're pretty short. Still, each headline can be seen as a document. We want to use this unsupervised learning technique, topic modeling. So, Sego, can you explain and maybe give us an example of topic modeling in action?
Sure. The technical definition of topic modeling is that each topic is a distribution of words, and each document is a mixture of topics across a set of documents. As an example, if you have a collection of documents that contain frequent occurrences of words such as bike, car, mile, or brake, they are likely to share a topic on transportation.
You got that, right? Are you following? She's making sure you're following. Exactly. So, in other words, the topic transportation is a distribution made up of bike, car, mile, etc. I'll give you another example to ensure everyone understands. If you have another collection of documents that share words such as SCSI, port, floppy, or PCI, they are likely to discuss a topic about computers. The process of topic modeling is to infer hidden variables, such as the word distribution for each topic and the topic mixture distribution for each document, by observing the entire collection of documents.
Okay, I see. So the intuition is pretty simple, right? We have a collection of documents, and we assume there is a hidden list of topics. We're going to work with news headlines, so you can start thinking about what those topics can be. A topic is just a group of words that are very meaningful and frequently appear together. The first step is to identify what those topics are by finding groups of words that statistically frequently appear together. Once we have those topics and know those words, we can score each document against the list of topics. For example, if we have 10 topics, we get a score or some kind of indicator telling us which topics a document is mostly about.
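To make that intuition concrete, here's a tiny sketch with NumPy: two made-up topics over a six-word vocabulary, and one document scored against them. The words, counts, and distributions are all illustrative, not taken from the actual model.

```python
import numpy as np

# Toy example: two hypothetical topics over a six-word vocabulary.
# Each topic is a word distribution (rows sum to 1).
vocab = ["bike", "car", "brake", "profit", "market", "share"]
topics = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "transportation" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "finance" topic
])

# Bag-of-words counts for one short document: "car brake car market"
doc = np.array([0, 2, 1, 0, 1, 0])

# Score the document against each topic and normalize into a mixture.
scores = topics @ doc
mixture = scores / scores.sum()
print(mixture)  # the "transportation" topic dominates this document
```

With 10 topics instead of 2, the mixture would be a 10-element vector telling us how much of each topic is present in the document.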
The thing that always hits me is that it doesn't require any labeling. It is unsupervised, so we can start right away with whatever documents are available to train a model. Before we dive into this a little more, I have a quick question. I remember that we have a high-level service called Amazon Comprehend, which can do topic modeling in a managed way. You put your documents in S3, Comprehend does its magic, and you get your topics and scores. I like simple things, so I like Comprehend for that. Why should I even pay attention to NTM? What's the bonus of training with NTM?
So, of course, we have this fully managed text analytics service called Comprehend, which provides a pre-configured topic modeling API best suited for the most popular NLP use cases. It is a suggested topic modeling choice for customers as it removes a lot of the routine steps associated with topic modeling, like tokenization, training a model, and adjusting parameters. But sometimes, you need finer control over the training, custom optimization, or you deal with a particular writing style or domain. This is why you might want to implement your own custom topic modeling model, such as the NTM model on SageMaker.
Okay, makes sense. So just more control, more tuning, and tweaking opportunities if we work with NTM. But keep in mind, Comprehend can do this as well and could be a good baseline.
Exactly. Okay, so we understand the problem and the intuition of what we're trying to achieve. Let's take a look at the dataset. You can see it on my screen right now. It's called the Million News Headlines dataset, a million news headlines from the Australian ABC media. It's not the American one. It's the Australian ABC. The dataset is available on Kaggle and looks something like this. There's a date column, but we're not going to use it at all, so we'll drop it quickly. Then there's the headline, which is lowercase and in English. We have a million of these, and as for topics, we don't have a list of them. That's the fun part of this episode—we don't know what we're going to get. It's like a box of chocolates. Hopefully, we find some good ones.
So, very simple dataset. But you mentioned early on that NLP requires a lot of data preparation. What would we do? At a high level, what do we need to do to use this data?
So you need a good format for your data, and you need to convert it. But when you think about the language itself, you'll want to remove a lot of words that don't bring any information. Unlike, say, time series data, real-world text data needs a lot of processing before you can use ML models on it. For example, short words like "to" and "in" don't carry much meaning; in the case of topic modeling, they don't add value, so we need to clean things up. We're going to use the NTM algorithm, one of the built-in algorithms in SageMaker, and it expects data in a certain format: we need to convert the sentences into a bag-of-words representation, which is impossible for humans to read but is meant to be fed into ML algorithms.
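As a rough illustration of that conversion, here's a minimal pure-Python sketch using a tiny hypothetical vocabulary (the real pipeline, covered later, uses Gensim and a much larger one):

```python
from collections import Counter

# Hypothetical four-word vocabulary mapping each kept word to an integer id.
vocab = {"council": 0, "vote": 1, "road": 2, "upgrade": 3}

def to_bag_of_words(tokens, vocab):
    """Map tokens to sorted (word_id, count) pairs; out-of-vocabulary
    words are simply dropped."""
    counts = Counter(t for t in tokens if t in vocab)
    return sorted((vocab[w], c) for w, c in counts.items())

print(to_bag_of_words(["council", "vote", "road", "road", "tax"], vocab))
# → [(0, 1), (1, 1), (2, 2)]  ("tax" is not in the vocabulary)
```

Word order is discarded entirely, which is exactly why it's called a bag of words.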
The dataset processing will be covered later when we look at the code, but there's going to be some work there. Let's talk about the algorithm in more detail. So, neural topic modeling. I had my coffee, so you can go and explain what that is. I'll try to keep it simple. But before explaining the NTM algorithm, let me describe the other well-known topic modeling algorithm, LDA, because it will help us better understand NTM.
So, LDA is a generative probabilistic model and can be seen as dimensionality reduction for text, similar to multinomial PCA. LDA discovers topics through posterior inference and learns a word distribution per topic. However, this kind of statistical analysis relies on some assumptions, such as a fixed and known number of topics, which can sometimes lead to poor quality topics. This is where the NTM algorithm comes into play and addresses this potential pitfall.
Before we talk about NTM, just to make sure I have the intuition, LDA assumes a certain number of topics that match a certain type of distribution, right? It assumes a beta distribution, and the Dirichlet distribution is a multivariate generalization of the beta distribution. If you accept that your topics look like multiple beta distributions, then you're happy with LDA. But there's something here that I don't quite like: assuming that words fit into distributions with a certain shape. It works, but there's something I don't like. What about NTM? How is NTM different?
With LDA, you have strong statistical assumptions to make before the algorithm works. On the other side, the Amazon NTM, described as a coherence-aware NTM in the research paper, offers a more flexible framework to accommodate more expressive models. LDA has a fixed and known number of topics, but NTM provides a more flexible framework thanks to its neural variational inference architecture. NTM incorporates topic coherence objectives into the training process, which is one of the most important differences from LDA. You are not limited to a statistical framework, and you have a topic coherence objective during training.
To summarize the difference, LDA makes strong statistical assumptions and tries to discover topics that fit these assumptions. NTM doesn't make such assumptions. It uses pattern extraction and neural networks, with encoder and decoder architectures, to figure out what the topics are. The end result is similar, but the approach is different. Let me show this on the slide.
Oops, sorry, I stopped sharing my screen. It's going to come back, don't worry. I clicked on the wrong thing. All right, here we are. So this is the network architecture from the research paper. It looks a bit complicated, but we're not going to zoom in on every detail. Basically, it's an encoder-decoder architecture. On the left-hand side is the input layer, where we put the bag-of-words representation of the words in a headline. We feed that into an encoder, which has a smaller dimension to extract the meaningful information. Then it decodes it to rebuild the input bag of words. Ideally, we would get the exact same thing, but it's not going to be exactly the same, because the purpose is to throw away noise or non-meaningful stuff.
Unlike most neural networks where we do supervised learning with labels, here we don't care about the output layer. The output layer is just an encoded, decoded representation of the input layer. What we care about are the internal parameters learned during training, which are the topic weights. If you have 10 topics, we'll see 10 weights representing how much of a topic is present in a document. That's a non-mathematical explanation of how NTM works. In this series, we focus on intuition, not math. If you want the math, it's in the paper, but it's pretty dense.
Do you want to add anything here, Sego? Did I forget anything?
Oh, yeah. Maybe explain the metrics. How do we actually optimize for this? The NTM gives you two types of metrics to ensure the topics are good and significant. The first is WETC, meaning word embedding topic coherence, and the other is TU, topic uniqueness. WETC tells us how semantically close the topic words are, with a value between 0 and 1, where higher is better. It's computed using the cosine similarity of the corresponding word vectors in a pre-trained GloVe model. The TU metric tells us how unique a topic is, i.e., whether its words are found in other topics. Again, the value is between 0 and 1, and the higher the score, the more unique the topic. These metrics are added within the NTM framework to ensure the topics are useful.
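As a rough sketch of how these two metrics can be computed, here are simplified stand-ins: the real WETC uses cosine similarity over pre-trained GloVe vectors, and these are not the exact formulas from the paper, just the idea.

```python
import numpy as np
from collections import Counter

def wetc(word_vectors):
    """Mean pairwise cosine similarity of a topic's top-word embeddings,
    a simplified stand-in for word embedding topic coherence."""
    v = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    sim = v @ v.T
    n = len(v)
    return (sim.sum() - n) / (n * (n - 1))  # average off-diagonal entry

def topic_uniqueness(topics_top_words):
    """TU: each top word contributes 1 / (number of topics it appears in);
    1.0 means no two topics share a top word."""
    counts = Counter(w for topic in topics_top_words for w in topic)
    per_topic = [sum(1.0 / counts[w] for w in topic) / len(topic)
                 for topic in topics_top_words]
    return sum(per_topic) / len(per_topic)

print(topic_uniqueness([["market", "profit"], ["goal", "match"]]))   # → 1.0
print(topic_uniqueness([["market", "profit"], ["market", "goal"]]))  # → 0.75
```

Fully distinct topics score a TU of 1.0; the more top words the topics share, the lower the score.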
We'll see these metrics in the training log, and they are very helpful in understanding what we've trained. LDA says, "I trained a model and you've got 10 topics," and you have to figure out what those topics are by scoring documents. NTM goes further with WETC and TU, making it much simpler to understand the topics.

We have hyperparameters, right? The number of topics is probably a hyperparameter. SageMaker NTM supports a list of hyperparameters for fine-tuning model performance. You can configure the number of topics to extract, the number of epochs, the learning rate, and more. In our case, we are going to set a few of them: the number of topics, the feature dimension (which is the vocabulary size), and the optimizer; by default it is AdaDelta, but we are going to use Adam. We also set the mini-batch size and the number of patience epochs, which controls early stopping: the algorithm stops training if validation loss has not improved within the last `num_patience_epochs` epochs.
It's always good practice. How long should I train? The answer is, I don't know. Just use a huge number and set early stopping. If the metric stops improving over 10 or 20 epochs, then stop. That's a machine learning mystery solved.
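The patience rule described here can be sketched in a few lines; `best_epoch_with_patience` is a hypothetical helper that simplifies what the algorithm does internally, with a list standing in for the per-epoch validation loss stream.

```python
def best_epoch_with_patience(val_losses, patience=10):
    """Sketch of the early-stopping rule: stop once validation loss has
    not improved for `patience` consecutive epochs, keep the best epoch."""
    best_loss, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Loss improves until epoch 2, then stalls: training stops early.
print(best_epoch_with_patience([5.0, 4.1, 3.2, 3.4, 3.5, 3.6, 3.7],
                               patience=3))  # → 2
```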
We have a dataset, an algorithm, and we understand the metrics. Now it's time to switch to the demo. Let me go to the notebook. I'm using SageMaker Studio, our machine learning IDE. If you've never tried Studio, it's web-based, and you can create a user in the SageMaker console in a minute. Then you can open it and get to work.
As usual, we need some dependencies. We'll be doing some preprocessing using open-source libraries like NLTK and Gensim. We use Pandas to open our dataset, which has 1 million lines. We ignore any errors or bad lines. The dataset has two columns: the publish date, which we drop, and the headlines, which are usually pretty short. Headlines can be dense and compact, sometimes missing words, but they are short. We're trying to build topics on very short documents, so it's a challenge.
We need to process this. We'll do a number of things. We have a short Python function that we apply to each headline. It removes punctuation, digits, and numbers, converts everything to lowercase, and splits the headlines into individual words. We remove stop words, which are short, useless words, using a built-in list in NLTK. We also use lemmatization to keep the root of the words, similar to stemming. For example, "swimming" becomes "swim." We're trying to get rid of variations to focus on the core meaning. This is basic processing, but it's a simple example. We don't need to check for HTML tags or other typical issues in text datasets, but there's more you could do.
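A minimal sketch of that cleaning step, with a hand-rolled stop word list standing in for NLTK's full English list and lemmatizer (illustrative only, not the notebook's exact function):

```python
import re

# Tiny illustrative subset: NLTK ships a full English stop word list
# and a WordNet lemmatizer; we hard-code a few words to stay runnable.
STOP_WORDS = {"to", "in", "the", "a", "of", "for", "on", "and"}

def preprocess(headline):
    """Lowercase, strip punctuation and digits, tokenize, drop stop words
    and single-character leftovers."""
    text = re.sub(r"[^a-z\s]", " ", headline.lower())
    return [w for w in text.split() if w not in STOP_WORDS and len(w) > 1]

print(preprocess("Council to vote on $2m road upgrade in 2020"))
# → ['council', 'vote', 'road', 'upgrade']
```

The real pipeline also lemmatizes each token ("swimming" becomes "swim") before moving on.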
We apply this function to every headline, and it takes 44 seconds. Then we visualize the dataset again. We don't have sentences; we have arrays of words. We build the dictionary, which is an important parameter. The size of the vocabulary is almost 75,000 words. Is this too large? Too small? It depends on the data. In this case, with headlines, maybe it's enough. We start and see what happens, and we can iterate on this step. When I started, I felt 75,000 words was a lot. So I used a function in Gensim called `filter_extremes` to remove any word that appears in more than 50% of the headlines. I kept the top 512 words, a brutal word selection. We only have 512 words, which we save to a text file. We use the dictionary to build the bag of words representation.
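For illustration, here is a pure-Python approximation of what Gensim's `Dictionary.filter_extremes` does; `build_vocab` is a hypothetical helper, not the actual Gensim API, and the documents are made up.

```python
from collections import Counter

def build_vocab(tokenized_docs, no_above=0.5, keep_n=512):
    """Drop words appearing in more than `no_above` of the documents,
    then keep the `keep_n` most frequent remaining words."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # document frequency, not raw counts
    n_docs = len(tokenized_docs)
    kept = [(w, f) for w, f in doc_freq.items() if f / n_docs <= no_above]
    top = sorted(kept, key=lambda wf: (-wf[1], wf[0]))[:keep_n]
    return {w: i for i, (w, _) in enumerate(top)}

docs = [["market", "rise"], ["market", "fall"], ["market", "rise"], ["storm"]]
print(build_vocab(docs))
# "market" is dropped: it appears in 3 of 4 documents (75% > 50%)
```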
So, the bag of words representation is a format for the algorithm to ingest. Each word is given an identifier, and we count how many times each word appears in a given headline. For example, word nine appears once, word 10 appears once, and so on. It's a mapping of actual words to IDs and frequencies. The order doesn't matter, so it's called a bag of words. We do this for 1 million headlines and save it to S3. We use a sparse matrix object from SciPy to avoid storing zeros. We create a sparse matrix and write it to a buffer using a SageMaker utility function. We upload this directly from memory to S3.
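A small sketch of the sparse matrix construction with SciPy, using hypothetical word IDs and counts (the real notebook fills one row per headline, a million in total):

```python
import numpy as np
from scipy.sparse import lil_matrix

# Hypothetical bag-of-words rows: (word_id, count) pairs for two
# headlines, with the 512-word vocabulary used in the episode.
vocab_size = 512
docs = [[(9, 1), (10, 1)], [(3, 2), (200, 1)]]

matrix = lil_matrix((len(docs), vocab_size), dtype=np.float32)
for row, pairs in enumerate(docs):
    for word_id, count in pairs:
        matrix[row, word_id] = count
matrix = matrix.tocsr()  # compressed sparse rows: zeros are never stored

# sagemaker.amazon.common.write_spmatrix_to_sparse_tensor(buffer, matrix)
# would then serialize this to the RecordIO-protobuf format NTM expects.
print(matrix.nnz)  # → 4 non-zero entries out of 2 x 512 cells
```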
We've built everything. It took a few minutes, and we have our protobuf encoded sparse matrix and vocabulary file in S3. Now, we train. We retrieve the name of the NTM container for the region we're running in. We create an estimator, pass the container, permissions in the form of an IAM role, and select a GPU instance for this because it's deep learning. We set parameters: 10 topics, the Adam optimizer, a batch size of 100, 100 epochs, and a patience of 10.
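For reference, the hyperparameters mentioned here could be collected like this. The names follow the SageMaker NTM documentation and the values mirror the ones above; the estimator call itself is omitted since it needs an AWS session.

```python
# SageMaker NTM hyperparameters used in this episode (values from the demo).
ntm_hyperparameters = {
    "num_topics": 10,           # how many topics to extract
    "feature_dim": 512,         # must equal the vocabulary size
    "optimizer": "adam",        # the default is adadelta
    "mini_batch_size": 100,
    "epochs": 100,
    "num_patience_epochs": 10,  # early-stopping window
}
# On a configured estimator, this would be applied with:
# estimator.set_hyperparameters(**ntm_hyperparameters)
```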
How do we choose 10 topics? When you do topic modeling, you need to understand the framework of the documents. For news headlines, go to a news website and see the categories. For example, the ABC website has politics, work, business, analysis, etc. Most news websites have around eight to 10 top categories. If you were working with enterprise data, you could use the same intuition. If you have customer emails and five different products, maybe there are five to 10 topics. The ballpark estimate should be easy to find.
We train, passing the channels: the training data and the auxiliary channel, which is the vocabulary file. We see the vocabulary file, the GloVe embeddings being downloaded, and the epochs going by. We see the reconstruction loss and the KLD (Kullback-Leibler divergence). The training objective is to minimize the reconstruction error and the KLD. It trains for a while, and we see the best epoch was epoch 24. We have 10 topics, and the metrics are WETC 0.39 and TU 0.82.
TU is topic uniqueness, which tells us how unique the word groups are. 0.82 is pretty good. The higher the score, the more unique the topic. We see individual topic uniqueness scores, with some as high as 0.92. WETC tells us how closely related the words are using GloVe embeddings. 0.39 is average, but there's some meaning in there. We have 10 topics, and we can figure out what they are. For example, one topic has words like market, share, profit, dollar, rise, interest, business, price, which looks like a finance topic. Another is clearly sports, and another is crime. Some topics are less clear, but we can see the ones with high topic uniqueness and similarity scores are more meaningful.
We deploy to a real-time endpoint and predict some samples using the same processing. For example, "Major tariffs expected to end Australian barley trade to China" scores as finance (35%) and international (22%). "US woman wanted over federal crash asks for release after coronavirus holds extradition" scores as crime (40%) and international (10%). "50 trains out of service as fault forces Adelaide passengers to pack like sardines" scores as crime, unknown, and sports, which isn't a great prediction. "Germany's Bundesliga plans its return from lockdown as football world watches" scores as sports (30%). "RFS volunteer in custody for allegedly lighting fires" scores as disasters.
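To illustrate how such predictions can be read, here's a hypothetical sketch of parsing an endpoint response. The JSON shape follows what the NTM real-time endpoint returns, but the weights are made up, and the topic labels are assigned by hand after inspecting each topic's top words.

```python
import numpy as np

# Hypothetical endpoint response: one topic_weights vector per headline.
response = {"predictions": [
    {"topic_weights": [0.05, 0.35, 0.03, 0.22, 0.05,
                       0.08, 0.07, 0.06, 0.05, 0.04]},
]}

# Illustrative hand-assigned labels for the 10 learned topics.
labels = ["sports", "finance", "weather", "international", "crime",
          "politics", "health", "disasters", "business", "unknown"]

for pred in response["predictions"]:
    weights = np.array(pred["topic_weights"])
    top = weights.argsort()[::-1][:2]  # the two strongest topics
    print([(labels[i], float(weights[i])) for i in top])
# → [('finance', 0.35), ('international', 0.22)]
```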
Even with short headlines and minimal tweaking, NTM is pretty good at figuring out topics and helping us organize documents. I think it's a satisfactory model.
It's almost time, so here's the most important slide. All the URLs and repos you need are here. The dataset URL, the research paper URL, NLTK, Gensim, and more. Don't forget about re:Invent, which is online and free this year. My SageMaker book, where this example was built, is also available. I'll leave this on for a few more seconds, but you can review it in the online video. All the videos are available on Twitch.
Thank you again to Sego-Len for her help in understanding all these complex things. Thank you for watching. I hope you learned a lot and got your questions answered. Thanks to all my AWS colleagues who've been helping with this Twitch episode and our moderators. We'll see you next week for the final episode. It's already the last one, but re:Invent is coming. We'll be back next week, and we'll talk about computer vision, specifically training at scale. Get ready for really large-scale training. Thanks again. Have a great weekend. Stay safe wherever you are. We'll see you next Friday. Thank you very much. Bye-bye.
Tags
Natural Language Processing, Topic Modeling, Amazon SageMaker, Neural Topic Modeling, Machine Learning Techniques
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.