Analyze SEC filings with Transformers for fun and profit

December 16, 2021
In this video, I show you how to:
- easily download SEC filings with a bespoke AWS SDK,
- process and break down filings into individual text sections,
- extract insights using Hugging Face models for sentiment analysis, emotion detection, and summarization.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️

Blog post: https://aws.amazon.com/blogs/machine-learning/create-a-dashboard-with-sec-text-for-financial-nlp-in-amazon-sagemaker-jumpstart/
SDK: https://github.com/aws/sagemaker-jumpstart-industry-pack
Notebooks: https://github.com/juliensimon/huggingface-demos/tree/main/sec-filings

New to Transformers? Check out the Hugging Face course at https://huggingface.co/course

Transcript

Hi everybody, this is Julien from Hugging Face. Processing financial documents is a very strong use case for natural language processing and transformers. In fact, most organizations in financial services and beyond need to work with a mountain of documents, from company filings to analyst reports and more, in order to extract information, classify documents, run sentiment analysis, summarize, and so on. There are really so many things you could do here. In this video, I'm going to work with company filings downloaded from the SEC website, using a pretty cool SDK that I found to do this. Then we'll process the filings and run a few transformer models, namely summarization and sentiment analysis. Okay, let's get started.

Let's take a look at the raw material, the documents. Starting on the SEC website, we go to filings and select EDGAR, which is the name of the database that holds company filings. We could look for any particular company just to take a look at one of those documents before we start processing them. Why not pick Amazon? We see different filing types: 10-K, the annual reports; 10-Q, the quarterly reports; and 8-K, the current reports. We'll be able to work with all of these in our example. Let's take a look at the latest quarterly report. The structure of this document is well defined. The table of contents is always the same, so we'll find lots of financial information and a section called "management's discussion and analysis of financial condition and results of operations." This is usually very interesting because you get the management's opinion on how things are going, highlighting what's going well and what's not. For financial analysis, this is quite interesting. We'll come back to this section when we start processing documents.

You can go and find the information you need here manually, but of course, what we want to do is automate that process: download the documents, process them, extract the text sections we're interested in, and run a transformer model. There is an API for EDGAR documents, but I recently found that some of my former colleagues at AWS have built an SDK that lets you easily retrieve SEC filings and break them down into individual sections that we can save to a CSV file and process. This is a huge time-saver. Personally, I'm not a big fan of text processing and regular expressions, so these people have done the heavy lifting for us. Thank you very much. They've also written a couple of blog posts on the SDK and how to use it for different use cases. The notebooks are integrated into SageMaker JumpStart, which lets you easily deploy machine learning solutions on SageMaker. I'll put all the links in the video description, and you can go and read about it. I'll just show you how to use the SDK. Thank you for building this; it's quite useful. The SDK is on GitHub and is called the JumpStart Industry SDK, although it only covers financial documents for now. Maybe we'll see other industries added over time.

Okay, so that's what we're going to do: grab some filings from the SEC website, process them with this SDK to get a data frame with individual document sections in different columns, and then apply some transformer magic to that. I've broken down my example into two notebooks. The first notebook is data prep: downloading, processing, and saving to CSV files. The second notebook is about running transformers. This makes it a little easier to keep track of everything. Here we go.
For reference, I include the name of the JumpStart solution I grabbed some code from, called "Dashboarding SEC text for financial NLP," and this is the SDK we're using. Let me zoom in a bit. First, we need to install the SageMaker SDK and the JumpStart Industry SDK, import them, and create some objects we'll need: an S3 bucket to download filings into, the IAM role for permissions, the usual stuff.

Next, we can configure the downloading job, which is based on SageMaker Processing, an easy way to run batch jobs on SageMaker. We need a bucket to store the processed documents, so the job downloads the raw documents to one bucket and saves the processed documents to another, and this is the file name we're going to save the documents to. The first step is to create an EDGAR dataset config object. The first line here specifies which tickers or company IDs we want to include; here, I'm only fetching Amazon documents. These are stock tickers, so for Google you'd say GOOG, for Tesla you'd say TSLA, and so on. You can have as many as you want. Form types: here, I'm grabbing 10-Ks and 10-Qs; the SDK supports other document types, and you'll find the details in the SDK documentation. Then the start date and end date: I'm grabbing documents filed in 2019 and 2020. This is just a bogus email address we need to provide to retrieve documents; you don't need a user account to grab them. Throttling is a thing, though, so don't expect to pull 2 million documents in 5 minutes without being throttled. You have to be considerate here.

That's what we want to get. Now for the infrastructure setup, which is based on SageMaker Processing: we'll run this on one ml.c5.2xlarge instance, which should be large enough. No problem here. We just launch the data loading job with the buckets and the file name to save everything to. It runs for a little while; it's based on PySpark, which SageMaker Processing supports. The internal code name for this was Gecko, and if you've seen the movie Wall Street with Michael Douglas, you know why. Good name for this.

At the end of that job, all I have is this CSV file in S3. I can go and grab it, read it, and this is what I get: the ticker code, the form type, some numbers and dates, the full text of the filing, and the management discussion and analysis, broken out as a separate column because this is probably the most interesting part of the document. If we want to go one step further, we can break down the full-text column into individual sections. Going back to my example, let's look at the table of contents: we want every single item here as a separate column, because we may want to run some analysis on it. This is where it's great that it's all done for you, because you really don't want to write those regular expressions yourself. That code is included in a notebook called "SEC functions." I cleaned it up a bit because it included other stuff for visualization and so on, but all that code is part of the SageMaker JumpStart solution. They extract all the items, and you can see the regex, which is pretty intense. Good thing I didn't have to write that. We'll just use those functions, map the column names to the right names, and so on. The only bit I wrote, or rather adapted, is a simple function that takes a data frame with filings and the form type we want to process, and does everything automatically: it applies all those regexes and returns a data frame with one section in each column. That's what we have for data prep, starting from this initial data frame. This is pretty nice.
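To make the data loading step more concrete, here is a minimal sketch of what the configuration described above might look like with the smjsindustry package from the JumpStart Industry SDK. It follows the linked blog post, but the exact class and parameter names (EDGARDataSetConfig, tickers_or_ciks, email_as_user_agent, and so on) may differ slightly between SDK versions, and the bucket prefix and file name are placeholders:

```python
import sagemaker
from smjsindustry.finance import DataLoader
from smjsindustry.finance.processor_config import EDGARDataSetConfig

session = sagemaker.Session()
bucket = session.default_bucket()         # bucket for the processed output
role = sagemaker.get_execution_role()     # IAM role with S3 and SageMaker permissions

# Which filings to fetch: Amazon 10-Ks and 10-Qs filed in 2019 and 2020.
dataset_config = EDGARDataSetConfig(
    tickers_or_ciks=['amzn'],             # add 'goog', 'tsla', ... as needed
    form_types=['10-K', '10-Q'],
    filing_date_start='2019-01-01',
    filing_date_end='2020-12-31',
    email_as_user_agent='test-user@test.com')  # contact email for EDGAR, no account needed

# The download runs as a SageMaker Processing job on a single ml.c5.2xlarge instance.
data_loader = DataLoader(
    role=role,
    instance_count=1,
    instance_type='ml.c5.2xlarge',
    volume_size_in_gb=30,
    sagemaker_session=session)

# Raw filings are fetched from EDGAR and a single processed CSV file is written to S3.
data_loader.load(
    dataset_config,
    f's3://{bucket}/sec-filings',         # output S3 prefix (placeholder)
    'amazon_10k_10q.csv',                 # output file name (placeholder)
    wait=True,
    logs=True)
```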
Just pass this data frame, say "process all the 10-Ks here," and it returns another data frame. Now I can see that the individual sections of the filing have been broken out into individual columns, making it easier to zoom in on and process particular items instead of dealing with one huge blob of text. I do the same for the 10-Q filings, with a different data frame and different section names, and save those two processed data frames to CSV files. Simple enough. Now I've got my two files. I can open one of them and see exactly the same thing, all nicely broken down into text sections. Perfect. Now we can move on to the second notebook and process the data.

The original blog post runs some NLP scoring functions on the text. You may want to read about that, although it feels more like traditional statistics and NLP; there's no machine learning at work. There's a bit of Hugging Face summarization, but generally they're using traditional techniques. Still, that's part of the SDK, so if you want to compute those NLP scores, you can. I installed a few dependencies: pandas, NLTK to break text sections into individual sentences, and Transformers for the models. This is the tokenizer I'm going to use to break text into individual sentences. I'm loading my two files into data frames. Here, we only extracted a handful of documents to keep the demo short; you can extract many more. I have two 10-Ks, for 2019 and 2020, and six 10-Qs, because there are three quarterly reports in between the 10-Ks for each year. Makes sense.

Which models are we going to use? I figured, let's do classification: I'm going to analyze individual sentences in the management discussion section and keep only the most significant ones, those that are either highly positive or highly negative, and ignore everything else. Then I'll concatenate all the positive sentences and summarize them, and do the same for the negative sentences. For summarization, I'm going to use the T5 model, t5-base. For classification, I've tried a couple of models. I've tried FinBERT, a BERT variant fine-tuned on financial documents that does sentiment analysis with three classes. I've also found a DistilBERT model fine-tuned for emotion detection with a wider range of labels: anger, disgust, fear, joy, sadness, surprise. It picks up interesting things, so let's give it a go. I'm using the pipeline API from the Transformers library, so loading a model is really a one-liner.

Here's how we process those docs. The first function, `find_emotional_sentences`, takes a text section, tokenizes it into sentences, and classifies each sentence. If a sentence is not neutral and its score is higher than a certain threshold, I keep it, appending it to a per-emotion list stored in a dictionary. I print a summary and return the dictionary with a list of sentences for each emotion. Then I have a summarization function that joins all the sentences for each emotion, summarizes the result, and prints the summary. You could certainly improve this function, because if the concatenated sentences exceed the maximum sequence length of the model, they get truncated and you lose information; chunking the text into, say, 512-token chunks would yield a better summary. Exercise for the reader (see the sketch after the next paragraph). Let's try it. We'll take one of those 10-Qs and point at the management discussion section, looking for non-neutral sentences with a score higher than 0.95. This section is already quite long: over 40,000 characters and 174 sentences.
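As a rough illustration of those two functions, here is a minimal sketch using the Transformers pipeline API and NLTK sentence tokenization. The model IDs (ProsusAI/finbert for financial sentiment, t5-base for summarization), the function names, and the thresholds are assumptions based on the description above, not the exact notebook code:

```python
import nltk
from nltk.tokenize import sent_tokenize
from transformers import pipeline

nltk.download('punkt')

# Off-the-shelf models, loaded as one-liners with the pipeline API.
classifier = pipeline('text-classification', model='ProsusAI/finbert')  # positive / negative / neutral
summarizer = pipeline('summarization', model='t5-base')

def find_emotional_sentences(text, classifier, threshold=0.95):
    """Keep only sentences whose predicted label is non-neutral and above the threshold."""
    sentences_by_label = {}
    sentences = sent_tokenize(text)
    for sentence in sentences:
        prediction = classifier(sentence, truncation=True)[0]
        if prediction['label'] != 'neutral' and prediction['score'] > threshold:
            sentences_by_label.setdefault(prediction['label'], []).append(sentence)
    print(f'{len(sentences)} sentences processed, kept per label:',
          {label: len(kept) for label, kept in sentences_by_label.items()})
    return sentences_by_label

def summarize_sentences(sentences_by_label, summarizer, min_length=20, max_length=100):
    """Concatenate the kept sentences for each label and summarize the result."""
    for label, sentences in sentences_by_label.items():
        text = ' '.join(sentences)  # long inputs get truncated; chunking would help here
        summary = summarizer(text, min_length=min_length, max_length=max_length, truncation=True)
        print(label, ':', summary[0]['summary_text'])
```

You would then point it at the management discussion column of one of the data frames, for example `find_emotional_sentences(df_10q.loc[0, 'MGMT_DISCUSSION'], classifier)` followed by `summarize_sentences(...)`, where the data frame and column names are just placeholders for whatever the processed CSV actually contains.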
I've got nine highly positive sentences and nine highly negative sentences. Let's look at the negative ones: these assumptions about the future disposition of inventory are inherently uncertain; liquidity is also affected by restricted cash balances; changes in foreign currency exchange rates impacted net sales by 814 million. These are all pretty negative things. Now we'll concatenate all the negative sentences and all the positive sentences, and summarize each set. I'm running this on a GPU instance in SageMaker Studio because it runs faster. The positive summary is: "As we utilize our federal net operating losses and tax credit, we expect cash paid for taxes to increase." You could argue that's positive. AWS sales increased 37%. The negative summary is: "Change in foreign exchange rates impacted sales. Decrease in North America operating income is primarily due to increased marketing expenses. Increase in international operating loss is due to increased marketing expense." These are negative things. You can control the output with the minimum and maximum length settings of the summarizer.

Now, let's try the emotion detection model and see what it picks up. With six or seven emotion classes, 0.95 is a very high threshold, so maybe I need to lower it a little. Let's say 0.7. It didn't pick up much, so let's try 0.2. Now we have lots of joyful sentences. Who thought Amazon reports would be joyful? Let's look at the angry ones: we are also currently subject to tax controversies in various jurisdictions; developments in an audit, investigation, or other tax controversy could have a material effect on our operating results. Summarizing anger: "Tax controversy could have a material impact on our results." For fear and joy, the input is longer than the maximum sequence length of the model, so we'd want to chunk the fear text into 512-token blocks and summarize each one. Joy is probably too long to be properly summarized. Sadness gives us: "Guidance anticipates an unfavorable impact of approximately 30 basis points from foreign exchange rates." That's definitely not a happy fact. As you can see, you need to tweak this a little and probably do more processing on the text, chunking it to get a better summary. But this isn't too bad for just a few lines of code.

So, that's pretty much what I wanted to show you today. If you thought building a solution that analyzes SEC filings was a crazy amount of work, it's actually not. A lot of the heavy lifting is done by the JumpStart Industry SDK; then you get your CSV file and can just grab off-the-shelf models. I didn't fine-tune anything here, I just started processing that data. Of course, you could even train your own language model on that data if you download enough of it. I've downloaded a full year of 10-Ks and 10-Qs for all S&P 500 companies, which is a few thousand documents and almost a gigabyte of data. If you want to go back in time and download, say, 20 years, it's going to be a long download, which you'll probably need to break into different jobs because you'll be throttled. It's going to be a lot of data, but this data is out there, and you can process it very easily. You can build pretty cool models, and maybe we'll see more financial models on the Hugging Face Hub thanks to this. If you need help, ping me. I'll put all the links in the video description. I hope this was informative and fun. See you soon with more content. Until next time, have fun, keep learning. Bye-bye.
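As a follow-up to the chunking suggestion in the video (the "exercise for the reader"), here is a minimal sketch of how you might split a long block of text into token-sized chunks before summarizing, assuming the same t5-base summarization pipeline; the chunk size and helper name are illustrative, not from the original notebooks:

```python
from transformers import AutoTokenizer, pipeline

summarizer = pipeline('summarization', model='t5-base')
tokenizer = AutoTokenizer.from_pretrained('t5-base')

def summarize_long_text(text, summarizer, tokenizer, chunk_size=512, max_length=100):
    """Split the text into chunks of roughly chunk_size tokens and summarize each chunk."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    summaries = []
    for start in range(0, len(token_ids), chunk_size):
        # Decode each slice of token ids back to text and summarize it separately,
        # so nothing is silently truncated by the model's maximum sequence length.
        chunk = tokenizer.decode(token_ids[start:start + chunk_size])
        result = summarizer(chunk, max_length=max_length, truncation=True)
        summaries.append(result[0]['summary_text'])
    return ' '.join(summaries)
```

Applied to the concatenated "fear" or "joy" sentences, this would produce one short summary per 512-token chunk instead of a single summary of a truncated input.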

Tags

Financial Document Processing, Transformer Models, SEC Filings Analysis