How Witty Works leverages Hugging Face to scale inclusive language

February 20, 2023
During this webinar, Elena Nazarenko, Lead Data Scientist at Witty Works, Lukas Kahwe Smith, CTO & Co-Founder at Witty Works, and Julien Simon, Chief Evangelist at Hugging Face, discuss how Witty Works leverages Hugging Face to scale inclusive language. [No HD version, sorry]

Topics:
- The impact of Transformers on text classification use cases
- How Witty Works leverages Hugging Face to scale inclusive language
- How to perform domain-adaptive pretraining on a transformer model

Speakers

- Elena Nazarenko, Lead Data Scientist at Witty Works
- Lukas Kahwe Smith, CTO & Co-Founder at Witty Works
- Julien Simon, Chief Evangelist at Hugging Face

About Witty Works

Witty is a Digital Writing Assistant for Inclusive Language that enables organizations to detect their own bias, in writing and in behavior, and fix it. Because language builds culture.

About Hugging Face

Hugging Face is a wildly popular community-based repository for open-source ML technology. It is a platform that stores, serves, and manages the latest and greatest open-source ML models, including enabling customers to fine-tune these models and deploy them at scale. Hugging Face is one of the most used platforms and is empowering 10,000 companies to integrate artificial intelligence into their products or workflows.

Transcript

All right, I think we can get started. Good morning or good afternoon, everyone, depending on where you are. Feel free to post a message saying where you're from. It's always nice to know where our audience is based. I'm outside Paris. Lukas, where are you, by the way? I'm in the mountains in Switzerland. Okay, lucky you. That's why you have great weather. And Elena? I'm in Zurich. Okay, Zurich. Not as great as the mountains. No, no, not as great as what Lukas has. Oh, no, come on. You're making everyone jealous, Lukas. No, that's not fair. Welcome, everyone. We have a few more folks joining. Let's give them a few more seconds. Everyone's late after the 4 p.m. meetings. Story of my life.

All right. Welcome, everyone. Let's get started. We have quite a few things to cover. Thank you very much to our friends at Witty Works for joining us today. Let's first introduce everyone. My name is Julien. I'm the Chief Evangelist for Hugging Face, based outside Paris. I'm very happy to have two guests today. I'll give them an opportunity to introduce themselves. Elena, tell us a little bit about you.

Hi, everyone. I'm Elena Nazarenko. I joined Witty Works as a data scientist and NLP developer a little over a year and a half ago. Prior to Witty Works, I worked as an NLP developer building chatbots for tech support centers and improving free-text search on web platforms. I have a PhD in physics from the University of Grenoble Alpes in France, specializing in theoretical and computational physics, and I worked as a researcher at research institutes in Sweden and Switzerland for a few years. Here at Witty Works, I work on the core NLP algorithms for the writing assistant, focusing on inclusive language. All right. Machine learning is a hobby for you, right? Exactly. Sounds good.

And we're lucky to have Lukas, who's one of the co-founders of Witty Works. Welcome, Lukas, and tell us a little bit about you. Sure. I'm originally from Berlin, Germany, but I'm also living in Switzerland. My background is more in development, so I work with Elena to make sure the data science results and algorithms are built on good software engineering principles and that our API is scalable.

All right, sounds good. Before we get started with the use case, let's talk about Witty Works. You guys have the coolest URL, witty.works. I love that, very clever. But Lukas, tell us a little bit about the company. When was it founded? What's the vision? What's the mission you've set for yourselves?

Witty Works was founded in 2018 as a consulting company. The idea was based on the realization that IT is shaping more and more of our realities, but there's a significant lack of diversity in this space. The initial approach was to see what could be done for companies to become more diverse, rather than teaching people from marginalized communities how to survive in the current business world. The first project was to manually rewrite job ads to make them more inclusive. In 2019, we decided to turn this into a software product to be more scalable. We created a solution for job ads in German and French, which was quite successful. However, we realized the scope should be much bigger, because it doesn't help people if the job ad is inclusive but the rest of their work experience isn't. So we created a digital writing assistant that can work in any context, whether you're writing an email internally or externally, on LinkedIn, in Gmail, or with job ads. It works as a browser extension that you can install to give you feedback.
Most importantly, it doesn't just tell you what you did wrong but explains why, educating people about the biases they have learned through socialization and language, with the goal of changing their behaviors in other contexts. That's an interesting angle. We'll look at some examples and dive into the solutions you've built.

I want to take 30 seconds to talk about Hugging Face, although this is really about Witty Works today. Hugging Face is trying to build the best ML community, a central place where we can find and discover great machine learning models to use in our projects and share our own. People call us the GitHub of machine learning, which is a fair analogy. You go to GitHub to find code, libraries, and tools, and you can do the same for machine learning models and datasets on the Hugging Face Hub at huggingface.co. Some quick numbers: as of today, we have over 130,000 models, all open source, which you can download in one line of code and use in your projects. Elena will tell us in a few minutes how she did that. We also have over 20,000 datasets, again all open source, which you can use for various purposes, such as fine-tuning models. We have over 10,000 companies and organizations contributing models and datasets to the Hub. We're one of the most popular open source projects ever, with our Transformers library recently surpassing 80,000 GitHub stars. We're looking forward to hitting 100K. We're very proud to serve the community with open source models and datasets.

Today, we're here to discuss how Witty Works has leveraged Hugging Face models and support to build their application and get it into production. Before we dive into the machine learning part, let's talk about the problem you've been trying to solve. Tell us a little more about the problem statement, Lukas. You started to explain it, but can we dive in a bit more? Sure. When I joined Witty Works, the writing assistant system didn't exist yet. We needed to start from scratch. The initial product was based on regular expressions and was built specifically for job ad analysis. We started with a simple approach using transfer learning with the spaCy library for German and English to analyze text. We performed linguistic analysis: extracting linguistic features, lemmatizing words, and labeling parts of speech. We also did named entity recognition to extract geographical locations, job titles, and other relevant information. After that, we searched our knowledge base of inclusive and non-inclusive terms to see if any detected words belonged to it. We filtered these words using specific linguistic labels to make sure we highlighted the correct nouns, verbs, and so on, and then showed alternatives.

Can we do a quick demo, Lukas? Sure. Let me give you the screen back for a second. We'll show the actual thing, and then we'll talk about the machine learning that powers it. Okay, let me share. Here's the community screen. You can see this is just a text area. If I click into it, it calls back to our API and highlights specific words. For example, the word "fossil" can be used as a derogatory term, but here it refers to the fossil fuel industry, so we don't highlight it because it's not derogatory in this context. Words like "guys" are highlighted, and you can click to accept an alternative. Users can also create their own language rules, such as capitalizing the "W" in "Witty Works." You can configure options like using singular "they" and support for plain language. It's very intuitive. Now, let's talk about how this works.
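As a rough illustration of the first pass Lukas describes (linguistic analysis with spaCy followed by a knowledge-base lookup and part-of-speech filtering), here is a minimal sketch. The pipeline name, the tiny word list, and the suggested alternatives are placeholders for illustration only, not Witty Works' actual rules or knowledge base.

```python
# Minimal sketch of the first-pass approach: lemmatize, tag parts of speech,
# and look lemmas up in a (hypothetical) non-inclusive word list.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # Witty Works also uses German models

# Hypothetical knowledge base: lemma -> (POS tags to flag, suggested alternatives)
NON_INCLUSIVE = {
    "guy": ({"NOUN"}, ["folks", "everyone", "team"]),
    "manpower": ({"NOUN"}, ["workforce", "staffing"]),
}

def flag_words(text: str):
    doc = nlp(text)
    hits = []
    for token in doc:
        entry = NON_INCLUSIVE.get(token.lemma_.lower())
        # Filter on the part-of-speech label so only the intended usage is flagged.
        if entry and token.pos_ in entry[0]:
            hits.append((token.text, entry[1]))
    return hits

print(flag_words("Hey guys, we need more manpower for this project."))
# e.g. [('guys', ['folks', 'everyone', 'team']), ('manpower', ['workforce', 'staffing'])]
```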
Elena, you started telling us about the initial approach. Yes, the initial approach was based on pre-trained spaCy models and lemmatization. It works well for 85% of the vocabulary, but context-dependent non-inclusive words are challenging. For example, take "flexible" in "you will have a flexible schedule" versus "you should keep your schedule flexible." In one case we need to highlight "flexible," but not in the other. We tried using vanilla transformers to generate word embeddings and calculate cosine similarity, but the accuracy was only 0.7, which is too low. We realized we needed to use sentence transformers to extract sentence embeddings and build a sentence classifier, not just a word-based classifier. We tried sentence transformers based on models like RoBERTa and BERT, but achieving high accuracy required about 100 sentences per word, which is impractical for building a large dataset.

Our mentor from Hugging Face, Florent, suggested using SetFit, a framework for few-shot learning. SetFit is an open-source library built by Hugging Face that uses sentence transformers. It allows you to build classifiers based on semantic similarity without needing a huge dataset. We ended up using about 20 sentences per word, usually 15, which is much more manageable. SetFit is very efficient, requiring less computational time and making it easy to update the model. We also focused on creating a balanced dataset, which is crucial for good performance. SetFit saved us a lot of time and money by avoiding the need to build a massive dataset, and the training time is much shorter.

The model should be fast, especially for real-time applications. When we saw the demo, the response time was very snappy. We didn't need to optimize the model further because it was already fast. We trained and deployed the model on Azure, which worked well thanks to the smaller size of sentence transformers.

The collaboration with Hugging Face was invaluable. We wanted to deploy our models, and Hugging Face provided the expertise to guide us through the vast array of transformers and show us the best possible approach. This saved us a lot of time. The Hugging Face community, including the newsletter and blog, provided a lot of useful information and practical examples. The current accuracy of our upgraded solution is 0.92, and we have a well-established workflow: we create real data, train and test the model on Google Colab, push the model to the Hugging Face Hub, deploy it on Azure Cloud, and integrate it into our backend solution.

For developers and data scientists facing similar problems, our advice is to try SetFit for text classification. It works well even with synthetic data and zero-shot learning. Focus on finding a pre-trained model that fits your case, measure its performance on real data, and fine-tune if necessary. The process is iterative, and using pre-trained models and SetFit can accelerate development and save time and money.

We have three main projects for the future: improving grammar handling, especially in German; creating a tool to analyze and suggest improvements for job ads; and exploring style transfer to transform sentences from non-inclusive to inclusive. For style transfer, we need to collect more data.

Lukas, do you have any insights on the relationship with Hugging Face and what's coming next for Witty Works? We're a small startup with six people, and the collaboration with Hugging Face was a significant investment. However, it was the most economical way to provide Elena with a team of sparring partners.
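For readers who want to try the few-shot approach Elena describes, here is a minimal sketch with the SetFit library, using the v0.x SetFitTrainer API that was current around the time of this webinar (newer releases expose a Trainer/TrainingArguments API instead). The checkpoint, the example sentences, and the Hub repository name are illustrative assumptions, not Witty Works' actual data or models.

```python
# Few-shot sentence classification with SetFit, sketched for one
# context-dependent word ("flexible"): label 1 = highlight, 0 = leave as is.
# Requires: pip install setfit datasets
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

train_ds = Dataset.from_dict({
    "text": [
        "You should keep your schedule flexible.",
        "We expect you to stay flexible about working hours.",
        "You will have a flexible schedule.",
        "We offer flexible working hours and remote days.",
    ],
    "label": [1, 1, 0, 0],
})

# Start from a pretrained sentence-transformer checkpoint.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    num_iterations=20,  # contrastive text pairs generated per example
)
trainer.train()

# Push the fine-tuned classifier to the Hugging Face Hub (placeholder repo name),
# so it can later be pulled into a deployment, e.g. on Azure.
# trainer.push_to_hub("my-org/inclusive-language-flexible")

print(model.predict(["Please remain flexible about your availability."]))
```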
Whenever you're stuck on an IT project, having someone to bounce ideas off is invaluable. We also have a button for users to ignore suggestions, which helps us detect false positives. However, we need to keep this process supervised to manage biases and provide relevant information to users. We're taking a more supervised approach because we believe it gives the best results for this specific domain.

Elena, do you have anything to add about the Azure deployment? Yes, the Azure deployment was a pilot project for Hugging Face. We received a lot of support to deploy the model on Azure, which was crucial for our success. Hugging Face can help not just with the machine learning part but also with DevOps and deploying models on different clouds. This has been a very successful project. Congratulations to the Witty Works team for your efficiency and quick iteration. Go check out Witty Works at witty.works. Thank you, Lukas and Elena, for taking the time to speak with us today. We have time for some questions, so don't be shy. There are no silly questions. Please go ahead and ask us anything.

We have a question from Charles: Do you classify sentences or words? We classify sentences. Initially, we thought we could extract word embeddings and use cosine similarity, but that wasn't accurate enough. We moved to sentence embeddings using SetFit, which is more effective.

We have a question from Vladimir: Do you use a part-of-speech approach? Yes, we use part-of-speech tagging. For example, some words need to be highlighted only if they are verbs, nouns, or adjectives. We use pre-trained models for part-of-speech tagging and filter the results accordingly.

We have a question from Chantal: How well does your model generalize? Does it flag non-inclusive words not in your database? Can you use these predictions to extend the database? Currently, we only highlight words we know need to be highlighted. To generalize, we would need to collect more data and understand the context in which words are harmful or biased. We take a supervised approach to manage biases and provide relevant information to users. We might explore more automated approaches in the future, but for now, a supervised approach is crucial.

We have a question from Martin: Have you applied domain adaptation to an existing model before fine-tuning for classification? We haven't applied domain adaptation specifically, but we've considered it. For now, our approach works well, and we haven't needed it.

We have a question from Gabrielle: What dataset do you need, and how big is it? Our dataset isn't huge. We have about 2,200 words in English and German, and for context-dependent words, we have around 300 sentences. German is very different from English, especially in terms of gender-related words, so we need to build separate models for each language. Are you planning to add new languages? Yes, we plan to add Spanish this year, followed by other Romance languages like French and Italian. We hope to continuously add more languages.

We have a question from Jane: Are there any exciting projects happening at Hugging Face? We only do exciting projects. Check out our blog at huggingface.co/blog for updates on new releases, models, and partnerships.

Do the 15 sentences per word have a balance of positive and negative samples? Yes, a balanced dataset is crucial for SetFit. We aim for a balanced set of positive and negative samples, typically around 10 to 15 sentences per word.
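Putting the two answers above together, a hypothetical inference step might first check the part-of-speech tag of the candidate word and only then ask the sentence classifier whether this particular sentence should be flagged. The repository name, the label convention, and the POS set below are assumptions for illustration, not Witty Works' actual setup.

```python
# Hypothetical combination of part-of-speech filtering (spaCy) with a
# SetFit sentence classifier for context-dependent words.
import spacy
from setfit import SetFitModel

nlp = spacy.load("en_core_web_sm")
# Placeholder repo id; Witty Works' actual models are not public.
classifier = SetFitModel.from_pretrained("my-org/inclusive-language-flexible")

ALLOWED_POS = {"ADJ", "NOUN", "VERB"}  # only these usages are candidates

def should_highlight(sentence: str, word: str) -> bool:
    doc = nlp(sentence)
    # Skip the sentence if the word never appears with an eligible part of speech.
    if not any(t.lemma_.lower() == word and t.pos_ in ALLOWED_POS for t in doc):
        return False
    # Label 1 means "highlight in this context" under the convention used above.
    return int(classifier.predict([sentence])[0]) == 1

print(should_highlight("You should keep your schedule flexible.", "flexible"))
```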
We have a final question from Vladimir: Do you use entity recognition? Yes, we use named entity recognition, especially for geographical locations and job titles. For example, "international" is highlighted if the company isn't really an international company, but not if it refers to a specific location like London. Thank you, Lukas and Elena, for your time and for answering all the great questions. You're building something really nice, and I love the efficiency of the project. Looking forward to the next iterations. Thank you, everyone, for joining us today. If you have more questions, connect with me on LinkedIn and send them my way. I'll share them with Elena and Lukas. Thank you to my colleague Florent, who supported Witty Works on this project, and to Violette, who organized this webinar. Thanks again, everyone. Take care, and we'll see you next time. Bye-bye, everyone. Thank you, Julien. Thank you very much. Bye-bye.

Tags

NLP, Inclusive Language, Machine Learning, SetFit, Witty Works