Building a RAG chatbot with LangChain, Chroma, Hugging Face, and Arcee Conductor

March 31, 2025
In this video, we build a retrieval-augmented generation chatbot to query PDF files (research articles, in this case). We use LangChain for orchestration, a Hugging Face model for embeddings, Chroma for vector search, Gradio for the user interface, and Arcee Conductor to optimize inference. We first run a local version and then push it to a Hugging Face Space.

Arcee Conductor (https://www.arcee.ai/product/arcee-conductor) is an inference platform that intelligently routes any query to the best model, efficiently delivering precise and cost-effective results for any task. If you'd like to understand how Arcee AI can help your organization build scalable and cost-efficient AI solutions, don't hesitate to contact sales@arcee.ai or book a demo at https://www.arcee.ai/book-a-demo.

Don't forget to subscribe to be notified of future videos. You can also follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com.

* Arcee Conductor: https://conductor.arcee.ai
* Arcee Conductor product page: https://www.arcee.ai/product/arcee-conductor
* Code: https://github.com/juliensimon/arcee-demos/tree/main/conductor-rag

Transcript

Hi everybody, this is Julien from Arcee. In this video, we're going to build a retrieval-augmented generation chatbot to query PDF documents. I will use research articles, but you can use any PDF files that you like. To build the chatbot, we're going to use a Hugging Face model for embeddings, Chroma for vector search, LangChain for orchestration, and Gradio to build the UI. In fact, I will publish the chatbot as a Hugging Face Space. And, as usual, you'll get all the code and you'll be able to replicate everything. Oh, and last but not least, I will use Arcee Conductor to select the best SLM or LLM for each of our queries. Sounds good? Let's get started. All my code is on GitHub, and I'll put the URL in the video description. We have two Python scripts in there: `demo.py`, which is a standalone script that lets me run vanilla queries without RAG, and `ragpower_demos.py`, the RAG-powered version, which is the one we'll look at. Then I've got the Gradio app, which builds the UI and lets us run this either as a Gradio app on the laptop or as a Hugging Face Space. We'll look at that too. There's a folder called PDF where, as you might guess, I've uploaded some research articles. You can use those, replace them, or do anything you like. Let's take a quick look at the demo script and then we'll look at the Gradio app. As Arcee Conductor is compatible with the OpenAI API, we can use all the OpenAI objects in LangChain. The only thing we need to do is set the API key. Here, I'm storing it as an environment variable; you can create the key in Conductor and just store it there. We also need to set the base URL, which sends the queries to the router model that automatically picks the best SLM or LLM. That's about it. That's the only thing you need to do; everything else will just work as with an OpenAI model. In fact, I'm setting the model to `auto`, which means the router model. We could also use one of the individual models available there, but here we want to use the router.
This is the path for the vector store, and here I'm storing it locally. If you want to leverage persistent storage on Hugging Face, you actually have to use `/data/whatever`, because persistent storage lives under `/data`. Then there's the path for the PDF files. Creating the LLM is very straightforward: we just use the `ChatOpenAI` object and pass the name of the model, the key, and the URL. Very clear. Then we create the embeddings model. Here, I'm using the `bge-small-en` model, which runs nicely on CPU. Feel free to try something else. Then we have a text splitter to create the chunks. We get the list of PDF files living in that PDF directory. I have a function to remove sections that I don't want from those documents, like references or appendices; usually, there's not a lot of useful information there for RAG. Then we process the documents, split them, and return the chunks. If a vector store already exists, we load it so we don't have to embed the documents every single time. If the vector store doesn't exist, we create it. We load the vector database, get the files, and if they don't exist, return an error. Then we retrieve the vector store collection. If any new files are present in the folder, we update the vector store, so you can just keep dumping PDFs in there and only the new ones get embedded. We have a function to create the vector store; nothing really complicated here. Then we create the RAG-powered chain in that function, passing the context and any chat history, and setting the vector store, Chroma, as the retriever. Here, we're going to retrieve three chunks. We have a function to create the vanilla response: we just query Conductor, passing no context from the RAG, and print that out. Then we have a function to run the RAG-powered search. We invoke the Q&A chain, which retrieves the context and invokes Conductor, and Conductor selects the model. Then we print out the answer.
If we have source documents, which we should if RAG actually finds chunks, we return the list of documents, sort them, and make sure we don't duplicate filenames. That's it. There's really nothing complicated here: it's vanilla LangChain, and the only specifics are that I'm using Chroma and, of course, Conductor. Let's run this, and our query will be "Tell me about Arcee Fusion." Of course, before you're able to run the script, you need to install some dependencies. You'll find all of them here, and I would recommend a virtual environment to avoid messing up your Python setup. Once you've installed everything, you can just run the script. First, we're going to get the vanilla answer. Here it is, and it's not a really good answer: "I don't have specific information about Arcee Fusion. If it's a particular product, company, or concept, I'm not familiar with it." At least the model is telling the truth, not hallucinating or making things up. Arcee Fusion is very new, so I'm not surprised the model doesn't know anything about it. That's the main benefit of RAG: we can inject fresh domain knowledge. Here, we can see the chunks that have been retrieved, and there are mentions of Arcee Fusion because of the research articles in my document collection. Now, here's the RAG-powered answer: "Arcee Fusion is a model merging technique designed to maintain stability and avoid over-updating by focusing on the most significant changes in model parameters," etc. We got information on performance, and so on. So that's a much better answer, obviously. And it looks like we got a couple of hits in our document collection. So RAG is working here, and we can see the benefit of embedding those documents. Again, we can use Conductor in a simple way, just like we would use any other model. Now, let's look at the Gradio app. In the Gradio app, we import the main functions from the demo script to be able to run our queries and initialize our chain again. The rest is really Gradio-specific.
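The query-plus-deduplicated-sources flow can be sketched like this. It is a simplified stand-in for the script's Q&A chain, with function names of my own; `retriever` and `llm` are the objects built earlier.

```python
from types import SimpleNamespace  # only used for the usage example below


def unique_sources(docs) -> list:
    """Return source filenames in retrieval order, without duplicates."""
    seen, names = set(), []
    for doc in docs:
        name = doc.metadata.get("source", "unknown")
        if name not in seen:
            seen.add(name)
            names.append(name)
    return names


def rag_answer(question: str, retriever, llm):
    """Retrieve chunks, stuff them into the prompt, and return answer + sources."""
    docs = retriever.invoke(question)
    context = "\n\n".join(d.page_content for d in docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.invoke(prompt).content, unique_sources(docs)
```

For example, three retrieved chunks coming from two papers collapse to two source names: `unique_sources([SimpleNamespace(metadata={"source": "a.pdf"}), SimpleNamespace(metadata={"source": "b.pdf"}), SimpleNamespace(metadata={"source": "a.pdf"})])` returns `["a.pdf", "b.pdf"]`.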
We have multi-turn conversation enabled, and the layout includes a checkbox to switch RAG on and off, a text box to display the retrieved context, an input area for our query, and a couple of buttons to submit, clear, and update the UI when we get an answer. Let's give this a shot. Now we can run this with the `gradio` command, and the benefit of running `gradio` instead of `python` is that it watches the directory: if you update the sources, it reloads automatically. Here, you can see I've already embedded my docs, so Chroma is up to date and I don't need to recreate the vector DB. I can just go to this page and load it. Let's first disable RAG and run a vanilla query. Let's try "Arcee Fusion" again. Now we're sending this to Conductor, which will automatically select the best and most cost-efficient model. Once again, we get a horrible answer, because it looks like the model is confusing Arcee Fusion, our merging technique, with Arcee, the Transformers robot. That's completely wrong. So, amazing hallucination here. Now, let's enable RAG and ask the question again. We should see the context being retrieved, and it's the same, right? No surprise. And we get a really good answer here. Once again, pretty nice. And I think I have multi-turn working here. Let's give it a shot: "How does Fusion compare to model soups, which is another merging technique?" Now we're going to run RAG again and factor in the previous history. It looks like it's working: "Arcee Fusion and model soups both serve as methods for model merging, but they employ different approaches," etc. Arcee Fusion compares favorably to model soups in terms of performance and stability. And we see our context again, just three chunks, not a ton. Let's try another one. Let's disable RAG and ask, "What is the main innovation of DELLA merging?" Another technique. Once again, the answer is completely wrong.
According to the model, DELLA merging is a plant hormone signaling pathway, with a protein discovered by the team of Professor Aritzatka. That might be true, but it's not what we wanted. So, let's clear this, enable RAG, and run it again. Now, hopefully, yes, because as you would expect, I have the DELLA paper in my index: "The main innovation in DELLA merging is the integration of magnitude-based sampling of delta parameters, which is crucial for reducing interference during the model merging process. This technique, known as MagPrune, involves pruning parameters based on their magnitude." We can see we hit the DELLA paper, 2406, and if we check 2406, yes, it is the DELLA paper. So here we have a local version that works. Now, let's put this thing on Hugging Face. In the repo, I've included instructions in the README file to create the Space, if you're not too familiar with this. It's pretty simple. The only trick is, for safety reasons, you want to keep your API key private and set it as an environment variable in the Space. If you go to the Space settings, you can add secrets there, so the code will find it and it's not exposed to anyone. Again, you can use your own files if you want to: just put them in the PDF folder and restart the Space to re-embed them. I've done this, and it looks like we have the Space running. It's private, so you won't be able to access it, but you can create your own, enter your key, and you're good to go. Let's check that this is working correctly. Just click submit. Yes, here it is. So that's pretty cool. Here, I'm running on CPU upgrade, which gives me a few more vCPUs, so it's a little faster. This runs okay on the free CPU tier, which I think is only two vCPUs: embedding is a bit slow, but queries run fine. So this is how you can easily build your RAG chatbot with a collection of SLMs and LLMs that get picked automatically, your favorite embedding model, Chroma, and LangChain.
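The secret-handling trick mentioned above amounts to reading the key from the environment and failing loudly when it is missing. A small sketch, assuming an env-var name of my choosing (use whatever your code expects):

```python
import os


def get_api_key() -> str:
    """Read the Conductor API key from the environment.

    On a Hugging Face Space, add it as a secret in the Space settings:
    it is injected as an environment variable at runtime and never
    appears in the repository.
    """
    key = os.environ.get("CONDUCTOR_API_KEY")  # hypothetical variable name
    if not key:
        raise RuntimeError(
            "CONDUCTOR_API_KEY is not set -- add it as a Space secret "
            "or export it locally before launching the app."
        )
    return key
```

Raising early gives a clear error in the Space logs instead of an opaque authentication failure on the first query.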
It's not a ton of code, and you can see that it works. You can certainly make it better, and that's why I'm sharing the code. Let's take a quick look at the Conductor UI. In the API history, we can see our queries being sent: "What is the main innovation in DELLA merging?" So that's the basic query, and then we have the full LangChain prompt running, etc. If you look at the price, this isn't a big query, but it's very cost-effective. This is probably not Claude 3.7 Sonnet running here; the price looks too low for that. Generally, when you run Sonnet, you'll see more expensive prices, something like five cents for a bigger query like this one. So I'm guessing this is not Sonnet running. We could find out, but who really cares? It's working, we got a low price, and if we have really complex queries, then Conductor will route them to the more powerful but more expensive models. So that's nice: only spend more when it's really needed. All right, that's what I wanted to show you today: a nice little chatbot with RAG, a bunch of really cool open-source libraries, and the power of Conductor to select the best model and optimize your inference costs. Hope you liked it. Much more coming soon, as you would expect. Until then, my friends, keep rocking.

Tags

RAG, Chatbot, Hugging Face, LangChain, Gradio

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.