Open Source AI with Hugging Face Dallas AI meetup 05 2024
June 14, 2024
Join us for the Dallas AI Meetup held in Austin, Texas, on May 29, 2024 (https://dallas-ai.org).
Learn how Hugging Face is changing the ML landscape, get practical insights on working with large language models, and dive into the details of Retrieval-Augmented Generation (RAG). Watch a demo on building a chatbot for the energy sector, and hear lessons learned from over 200 customer meetings about deploying LLMs in real-world settings. Discover how to choose the best models using Hugging Face leaderboards and see the latest trends in ML engineering.
00:00 Introduction
02:15 Hugging Face
10:10 Working with Large Language Models
25:30 Double-clicking on Retrieval-Augmented Generation (RAG)
31:10 RAG demo - building a chatbot for the energy domain
44:20 LLMs from the trenches: lessons from 200+ customer meetings
1:03:15 Picking models with the Hugging Face leaderboards
1:08:55 ML engineering is on fire
#LargeLanguageModels #HuggingFace #MachineLearning #DeepLearning #AI #opensource
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️
Notebook: https://gitlab.com/juliensimon/huggingface-demos/-/tree/main/langchain
Transcript
So, thank you, everyone, for making it to the office, and hi to everyone on Zoom. It's a pleasure to be here. My name is Julien, and I'm the Chief Evangelist at Hugging Face. That's what the slide says. What I do is travel quite a bit, meeting with customers and trying to explain to them that open source AI is the way to go in this AI business. I also work with our partners, including cloud and hardware partners, which takes up a fair amount of my time.
In this presentation, I'd love to keep things as interactive as possible, so please jump in, ask questions, and interrupt me. It's difficult for me to keep track of questions on Zoom, so if you could kindly relay the questions from our friends, I don't want anyone to feel excluded from asking questions. In the room, you have no excuse—just wave at me, yell at me, or throw things at me. Okay. Let's have some fun and learn together.
Today, I'll cover a quick background on Hugging Face. We've been doing quite a few things lately, and you may not be completely up to speed. Then, we'll chat about some of the latest trends in LLMs. This is where I go super opinionated, and I'm trying to force you to react to some of this stuff to get the conversation going. So, just have at it. We're in Texas, so we can. Okay, no guns. But all right, that's fine. I don't have mine. We'll do demos. I'll show you some of the cool things we've been building, and answer as many questions as we can.
A quick word on Hugging Face: Hugging Face is the home of open source AI. In a few years, we've become the de facto place where the community comes to find models and datasets and also share them. Who has shared a model on Hugging Face or uploaded a model on Hugging Face? Okay, all of you need to do more. Thank you. We need more. We always need more. Folks have called us, and still call us, the GitHub of machine learning. I think that's fine. I don't mind the analogy, but you'll see we're building more and trying to be more than just a nice collection of models and datasets.
We have some great companies as investors as of last summer, which is good validation for us. We'll try to put that money to good use for the community. More importantly, it's great validation for open source AI generally. Almost all of these companies are also building closed models, but it's never black or white. If you came here to hear me trash closed models, I'll go a little more French than I used to with customers, but I'm not going to do that. Closed models have their own use cases, and I think open source AI is generally a better idea. But there's probably room for everyone here.
The mile-high view on Hugging Face: If you looked at Hugging Face a year ago or have been experimenting casually, you're probably aware of the models and datasets in the bottom right corner. There's an insane number. I checked this morning and updated the slide. So, 680,000, it might be 690,000. Who knows? All of these are open-sourced, though they have different open-source licenses. Please be mindful of that. Not everything is fine for commercial usage, but if you pick stuff with an Apache 2.0, MIT, or Llama license, you're good to go. Just be careful.
These models and datasets are community models. We're the stewards, the model herders. We welcome them and try to take good care of them, but they come from the community, including Google, Microsoft, Meta, universities, startups, and individuals. These datasets and models are the raw material for your AI projects. The next logical step is to work with these models and datasets through our open-source libraries. The most popular one is called Transformers, which made Hugging Face popular. Over time, we added quite a few more. If you go to our repo on GitHub, you'll find many more, such as Diffusers for stable diffusion models, text-to-image, and text-to-video. Accelerate is for distributed training made simple, and Text Generation Inference (TGI) is our own inference server. We use TGI to deploy LLMs and in our own services and cloud integrations. There are many more libraries you can read about.
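For reference, here is a minimal sketch of the high-level API these libraries expose, using the Transformers pipeline mentioned above; the model name is an illustrative choice, not one used in the talk:

```python
# A minimal sketch of the Transformers pipeline API. The model name is an
# illustrative choice, not one used in the talk.
from transformers import pipeline

# Downloads a sentiment-analysis model from the Hugging Face hub and runs it locally
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("Open source AI is the way to go."))
# [{'label': 'POSITIVE', 'score': 0.99...}]
```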
Over the course of building Hugging Face, we've also implemented a few cloud services. Spaces is basically machine learning demos—small web apps you can write and host on our infrastructure to showcase models in a web app, not just a Jupyter notebook. Inference Endpoints is our model deployment service. You can one-click deploy 99.99% of our models on any of the three major clouds. This is a fully managed service, so you deploy it, and we take care of everything else. The core of the platform also includes the Enterprise Hub, which adds security and compliance features for enterprise users, such as SSO and auditing.
We're not a model-building company, but from time to time, there's an opportunity for Hugging Face to add new models to the open-source collection. Bloom, launched in 2022, was the first open-source LLM to compete with GPT-3.5, a very large model. StarCoder is a code generation model, and Idefics is a visual large language model, allowing you to chat with images. HuggingChat is our open-source chatbot. If you want a fully privacy-preserving chatbot, this is it. It's based on a curated list of the best open-source models, which we update regularly. You can select the model you want to chat with, and everything from the UI to the backend to the models is open source. There's also an iOS app, so you can chat with the best open-source LLMs on your iPhone.
We have cloud partners, and we integrate our open-source ecosystem into cloud environments, mostly machine learning services. We also have hardware partners, focusing on accelerating training and inference across the board. This work lives under the Optimum umbrella. Optimum is a collection of open-source libraries, such as Optimum Intel and Optimum AMD, providing transformers-like APIs with built-in acceleration. You just import the Optimum library that matches your hardware, and we automatically accelerate and optimize training and inference.
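To illustrate the Optimum idea of swapping a Transformers class for a hardware-optimized counterpart, here is a hedged sketch using the OpenVINO backend from Optimum Intel; the model name and settings are assumptions, not taken from the talk:

```python
# A hedged sketch of the Optimum idea: swap a Transformers class for its
# hardware-optimized counterpart. Assumes optimum-intel with the OpenVINO
# backend is installed; the model name is illustrative.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# export=True converts the checkpoint to OpenVINO format on the fly
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("Open source AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```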
Lastly, we have a consulting and professional services program called Expert Support. We can engage directly with customers to help them build and bring their models to production more quickly.
Now, let's double-click on LLMs a bit. All of you have tried ChatGPT at some point, right? That's fine; that's all the background we need here. When you work with LLMs, the first step is to use them as is—ask a question and get an answer. That's easy and what LLMs are for. But very quickly, you realize the answer might be correct but not in the tone of voice you want, or it might be too long or too short. You need to start providing instructions, which is called prompting. Please don't call it prompt engineering. Any prompt engineers in the room? Just asking. If you're on Zoom, you can log off and yell at me on LinkedIn or Twitter, or you can hang on and take it.
Prompting is useful for tone of voice, safety guidelines, and output control, such as brevity and formatting. It works well for these tasks. It breaks down when you try to teach the model new things through prompting. Showing five examples of what you're after is not enough to teach the model how to generalize. This is called few-shot prompting, and it works for very basic things but not for complex tasks. The analogy I use is that I never learned COBOL. If you showed me 20 examples of buggy COBOL snippets, I could pick up a few things by analogy, but I wouldn't become an expert. The same goes for trying to turn vanilla LLMs into legal or engineering experts with a few examples. If you need to bake new domain knowledge into LLMs, the procedure is called fine-tuning. Fine-tuning means training the model for a specific task. For example, you might want an LLM to answer legal questions in the oil and gas domain. You need a narrow slice of knowledge, but you want it to be deep. You won't ask for cooking recipes or astronomy questions, only legal questions. Fine-tuning lets you do that.
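To make the prompting and few-shot ideas concrete, here is a small illustrative prompt template; the wording and the worked example are assumptions, not taken from the talk:

```python
# An illustrative prompt that controls tone, brevity, and formatting, plus a
# single worked example. The wording and the example are assumptions, not
# taken from the talk.
prompt_template = """You are a concise, factual assistant for the energy sector.
Answer in at most three sentences. If you are not sure, say so.

Example:
Q: What is the capacity factor of a typical onshore wind farm?
A: Typically in the 25-45% range, depending on location and turbine technology.

Q: {question}
A:"""

print(prompt_template.format(question="How did solar capacity grow in Europe last year?"))
```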
The next level up is continuous pre-training, which is incremental training. Initial training or pre-training is training from scratch. If you have a million pages of legal documents for the oil and gas industry, you can train your model from scratch on that corpus. This is a big, expensive effort. If you have a thousand new pages every month, you can take your vanilla LLM, train it on those thousand pages, and then train it again on new data incrementally. This is continuous pre-training. You can also adapt a model trained on oil and gas legal to nuclear energy legal, making it handle multiple domains.
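As a rough sketch of what continuous pre-training looks like in code, here is a minimal causal-language-modeling run that resumes from an existing checkpoint; the model name, file name, and hyperparameters are assumptions, not from the talk:

```python
# A hedged sketch of continuous pre-training: resume causal language modeling
# on new, unlabeled domain text from an existing checkpoint. The model name,
# file name, and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# "new_legal_pages.txt" stands in for this month's batch of new documents
dataset = load_dataset("text", data_files="new_legal_pages.txt")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued-pretraining",
                           num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```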
Retrieval-Augmented Generation (RAG) has spread like wildfire. We'll do a RAG demo, and you'll see the benefits. The higher you go in these techniques, the better the domain adaptation, but the more complicated, expensive, and time-consuming it gets. Initial training can cost hundreds of thousands or millions of dollars. Fine-tuning can be very cheap, so start at the bottom, evaluate in terms of accuracy, and only move up if needed.
The question is, for fine-tuning and continuous pre-training, do I expect companies to build one model per use case per domain, or will we see industry models from large players? So far, attempts at industry models haven't been super successful. Some companies have tried healthcare and legal LLMs, which can serve as a foundation for your own fine-tuning efforts. If someone trains a model on a million pages of legal documents, that's great, but if you work at Amazon and need a legal LLM for the retail domain, it might be too general. It's difficult to see an external vendor training a model that is so good and relevant to your specific domain knowledge, vocabulary, product names, and policies that you would use it out of the box. Healthcare might be different because of public knowledge, but even then, there's a lot of confidential knowledge that needs to be injected. My dream scenario is to see open-source LLMs for particular industries, trained on as much public data as possible, and then companies fine-tuning them on their data.
It could be incremental on the same domain or adding a few different domains. Generally, it's about improving the same domain. You do a first round, evaluate, and fix the issues. You add more examples and train a little more until you get to a good enough model. You could retrain from scratch, but if you have something you like, you don't want to tear it down and remix the datasets, which could lead to wildly different results.
I've been hearing about small language models. Are they for embedded devices? No, we don't need more buzzwords. Large language models are called LLMs, and anything smaller than, let's say, 70 billion parameters is a small language model. But it's a bullshit definition. I prefer to talk about open source versus closed. Where's the limit for small? Is 70 billion a small model? Or is it 13? Or is it seven? If GPT-4 has a trillion parameters, then 70 billion is small. Most customers are in the orange box, starting with prompting, adding RAG, and eventually doing some fine-tuning for further adaptation.
Let's double-click on RAG and do the first demo. This is the high-level architecture for RAG. RAG is about adding external data to the mix for generation. Instead of relying on what the LLM knows, you add fresh, confidential data from your company. The only way to get freshness and access to company data is by having an external source of truth. You take the data, run it through an embedding model, turn it into high-dimensional vectors, and store those vectors in a backend. When a user asks a question, you convert the question into a high-dimensional vector, run a similarity search, and retrieve the top 5 or 10 vectors that closely match the query. These correspond to documents that hopefully contain the answer. You pull those documents back and inject them into the prompt. This is how you can inject information that happened five minutes ago and is very private and confidential.
This looks like a complicated slide, but I'll show you how to do this in a single notebook. RAG brings data freshness and access to company data. It's a good first thing to try. If you want to make it better, you can fine-tune the LLM or the embedding model to specialize it for your domain.
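Here is a minimal sketch of the retrieval step described above: embed a few documents and a query, then rank the documents by similarity. The embedding model name and sample texts are illustrative, not from the demo:

```python
# A minimal sketch of the retrieval step: embed a few documents and a query,
# then rank documents by similarity. The model name and sample texts are
# illustrative.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Global solar investment reached record levels in 2023.",
    "Offshore wind projects face rising financing costs.",
    "Grid-scale storage deployments doubled year over year.",
]
doc_vectors = embedder.encode(documents, convert_to_tensor=True)

query = "What is the trend for solar investments?"
query_vector = embedder.encode(query, convert_to_tensor=True)

# Retrieve the top-2 most similar documents; in a RAG pipeline these chunks
# would be injected into the prompt.
hits = util.semantic_search(query_vector, doc_vectors, top_k=2)[0]
for hit in hits:
    print(documents[hit["corpus_id"]], round(hit["score"], 3))
```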
Let's look at the demo. I'm deploying the latest version of Mistral 7B on Amazon SageMaker. This is the full code, and it shows the SageMaker integration for Hugging Face. You just define the model, create a Hugging Face model object in the SageMaker SDK, and deploy it on a small GPU instance. Deploying LLMs is not a project; it's simple. If you look at the model on the hub and click "Deploy SageMaker," we generate the code for you. Just copy and paste. Deploying LLMs on AWS is as simple as that.
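For reference, the deployment code looks roughly like this, close in spirit to the snippet the hub generates; the instance type, container settings, and model ID are assumptions:

```python
# A hedged sketch of the SageMaker deployment, close in spirit to the snippet
# the hub generates. The instance type, container settings, and model ID are
# assumptions.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface"),  # TGI container
    env={
        "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.2",
        "MAX_INPUT_LENGTH": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
    role=role,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # a small single-GPU instance
)

print(predictor.predict({"inputs": "What is retrieval-augmented generation?"}))
```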
If you want to do it on EC2, EKS, or any other compute environment, be my guest, but there's zero reason to do that. I'm super happy with copy-pasting and having a production-grade LLM in the cloud in seven minutes. This is what we've been building with AWS for three years.
A couple of requests from our online friends: could I share the link for this, and could I repeat the questions coming from the room. So, the question was, "I don't want to use SageMaker because I love to reinvent the wheel on EC2. How do I do it?" The answer is: go do it. You don't need me. I'll copy the link and send it to you.
Now we have the LLM deployed as a SageMaker endpoint. We need to plug this into LangChain. LangChain needs to know how to serialize and deserialize the information going to and coming from the endpoint, in other words how to send data and how to parse what comes back. This is what it looks like. We can try asking a question directly. As a helpful energy specialist, please answer the question, focusing on numerical data. Don't invent facts if you can't provide a factual answer. Create a template, create a chain, and ask the question: "What is the trend for solar investments in China in 2023 and beyond?" The answer is not bad. The model says it doesn't have real-time data or the ability to predict future events but provides information from the International Energy Agency. It's a good answer because the model doesn't hallucinate.
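A hedged sketch of that LangChain glue: a content handler that serializes requests to the TGI endpoint and parses its responses, plus a simple prompt template and chain. The endpoint name and generation parameters are assumptions:

```python
# A hedged sketch of the LangChain glue: a content handler that serializes
# requests to the TGI endpoint and parses its responses, plus a simple prompt
# template and chain. The endpoint name and generation parameters are assumptions.
import json

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import SagemakerEndpoint
from langchain_community.llms.sagemaker_endpoint import LLMContentHandler


class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        return json.loads(output.read().decode("utf-8"))[0]["generated_text"]


llm = SagemakerEndpoint(
    endpoint_name="mistral-7b-endpoint",  # hypothetical endpoint name
    region_name="us-east-1",
    content_handler=ContentHandler(),
    model_kwargs={"max_new_tokens": 512, "temperature": 0.1},
)

template = PromptTemplate.from_template(
    "As a helpful energy specialist, please answer the question, focusing on "
    "numerical data. Don't invent facts. Question: {question}"
)
chain = LLMChain(llm=llm, prompt=template)
print(chain.invoke(
    {"question": "What is the trend for solar investments in China in 2023 and beyond?"}
))
```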
Let's add RAG. I'm grabbing three PDF files from the International Energy Agency, copying them to S3, and extracting the text using AWS Textract. I chunk the text into 256-byte chunks and run them through an embedding model to turn them into vectors. I store these vectors in an in-memory database. Now, I can retrieve them. I configure everything to use my collection of vectors as the source of truth, retrieve 10 chunks for each query, and inject them into the prompt. I ask the same question again: "What is the trend for solar investments in China in 2023 and beyond?" This time, the answer is much better. It says, "Solar investments in China will continue to be significant in 2023 and beyond. In 2023, approximately 380 billion is expected to be invested in solar globally." The model tells me its answer is based on the provided context, which comes from the retrieved documents. This is RAG in action. If you ask about 2024, the model won't know the new information. The model acts as a writing assistant, using the provided context to write a story. This means you don't need a very large model. A large model has more parameters to store more knowledge, but if you're not using that built-in knowledge, why overspend? Smaller models, like Mistral (7 billion parameters), or even smaller ones like Phi-3 (3.8 billion parameters), are often sufficient.
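A hedged sketch of that retrieval chain, using an in-memory FAISS index and reusing the SagemakerEndpoint `llm` object from the previous sketch; the file name, chunk size, and embedding model are assumptions (the demo extracted the text with Amazon Textract and retrieved 10 chunks per query):

```python
# A hedged sketch of the retrieval chain, using an in-memory FAISS index and
# reusing the SagemakerEndpoint `llm` from the previous sketch. The file name,
# chunk size, and embedding model are assumptions; the demo extracted the text
# with Amazon Textract and retrieved 10 chunks per query.
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

with open("iea_reports.txt") as f:  # hypothetical text extracted from the PDFs
    raw_text = f.read()

splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=32)
chunks = splitter.split_text(raw_text)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_texts(chunks, embeddings)

rag_chain = RetrievalQA.from_chain_type(
    llm=llm,  # the SagemakerEndpoint LLM defined earlier
    retriever=vector_store.as_retriever(search_kwargs={"k": 10}),
)
print(rag_chain.invoke(
    {"query": "What is the trend for solar investments in China in 2023 and beyond?"}
))
```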
Regarding context, AI marketing often emphasizes large context sizes, like 100K or 1 million tokens. This is unnecessary and expensive. A typical novel is about 100,000 words or 130,000 tokens. Passing 100,000 tokens is like passing a full novel, which is rarely needed. For most use cases, 8K or 16K context is sufficient and much more cost-effective. Model customization has also evolved. Fine-tuning used to be expensive, but now, with Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA and QLoRA, you can fine-tune a model for as little as $10 to $20. These techniques fine-tune only a small percentage of the model's parameters, achieving results close to full fine-tuning.
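To illustrate how lightweight LoRA-style fine-tuning is to set up, here is a minimal PEFT sketch; the rank, target modules, and base model are illustrative choices, not from the talk:

```python
# A hedged sketch of parameter-efficient fine-tuning with LoRA via the PEFT
# library. The rank, target modules, and base model are illustrative choices.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Only a fraction of a percent of the weights are trainable; training then
# proceeds as usual, e.g. with the Trainer or TRL's SFTTrainer.
```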
Reinforcement Learning from Human Feedback (RLHF) is powerful but requires human oversight, which can be costly and error-prone. Techniques like DPO, PPO, and ORPO can help eliminate the need for human feedback, making the process simpler and faster. Model merging is another interesting technique where you combine different models to create a new one with combined capabilities. For example, you can merge a math model, a code model, and a legal model to create a single model that can handle all three domains. This is done by averaging the weights of the models, which is computationally efficient and can be done on a laptop.
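The simplest form of model merging, uniformly averaging the weights of two fine-tuned variants of the same base model, can be sketched like this; the model names are hypothetical, and real-world merges typically use dedicated tooling such as mergekit:

```python
# A hedged sketch of the simplest form of model merging: uniformly averaging
# the weights of two fine-tuned variants of the same base model. The model
# names are hypothetical; real merges typically use tooling such as mergekit.
from transformers import AutoModelForCausalLM

model_a = AutoModelForCausalLM.from_pretrained("my-org/math-finetune")  # hypothetical
model_b = AutoModelForCausalLM.from_pretrained("my-org/code-finetune")  # hypothetical

state_a = model_a.state_dict()
state_b = model_b.state_dict()

# Uniform average of every weight tensor; both models must share an architecture
merged_state = {name: (state_a[name] + state_b[name]) / 2.0 for name in state_a}

model_a.load_state_dict(merged_state)
model_a.save_pretrained("merged-math-code-model")
```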
Moving to production is crucial. People want to see AI in action, not just proof of concepts (POCs). Cost performance is key. If you don't evaluate latency, throughput, and ROI, your project might be too slow or expensive. Inference is the major cost, not training. Fine-tuning and model merging can reduce training costs, leaving inference as the primary expense. Inference optimization is essential. You can use mid-range GPUs, like AWS G5 instances, which cost about $1 per hour and are suitable for 7-8 billion parameter models. Local CPU inference is also becoming viable, and companies are working on ways to charge for it.
To determine which models are the best, we have leaderboards like the Open LLM Leaderboard. Models are constantly improving, so you should keep an eye on new releases and evaluate them on your data. Prompts are technical debt; they need to be rewritten when switching models. Keeping prompts generic can help. Performance leaderboards help you find the right model size for your infrastructure. Embeddings are also important, and there are leaderboards for that as well. The latest release of Sentence Transformers v3 is worth checking out.
There's a lot of focus on models, but the secret sauce is the data, especially for RAG and fine-tuning. Machine learning engineering is crucial for cost performance. Techniques like model compilation, quantization, and merging can significantly improve efficiency. For low-scale, low-domain adaptation projects, use a model API and get it done quickly. For high-scale, low-domain adaptation, use smaller models to reduce costs. For high-scale, high-domain adaptation, RAG and fine-tuning are essential. For high-domain adaptation, low-scale projects, consider whether automation is worth the effort.
Thank you for joining. I'll send you the slides and links to the notebooks. Feel free to play with this stuff. Thanks, everyone.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.