Building an AI Meeting Companion with AFM 4.5B and llama.cpp

July 10, 2025
Ever been in a meeting thinking, “I should be taking notes, but I’m too busy actually participating”? Or worse—walked away with no idea what your next steps are? ⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. You can also follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️ Sure, commercial AI tools promise live summaries and action items, but they run entirely in the cloud, raising privacy concerns for many teams and organizations. So I thought: What if I could prototype these features—locally, on my laptop? Would that be possible: In this video, I simulate a live transcript and showcase: ✅ Real-time meeting summaries ✅ Automatic extraction of action items and open points ✅ A chat interface to ask questions about the meeting All wrapped in a sleek Gradio app running local inference with AFM-4.5-Preview from Arcee. No LLM in the cloud. No data leakage. Just a private, powerful AI assistant. ** AFM https://www.arcee.ai/blog/announcing-the-arcee-foundation-model-family ⭐️⭐️⭐️ While you're here, I’ve got a great deal for you! If you care about your online security, you need Proton Pass — the ultra-secure password manager from the creators of Proton Mail. GET 60% OFF at https://go.getproton.me/aff_c?offer_id=42&aff_id=13055&url_id=994 ⭐️⭐️⭐️

Transcript

Hi everybody, this is Julien from Arcee. I'm sure you spend a lot of time in online meetings, so I thought: can I build an AI-powered meeting companion that runs locally on my machine, for full privacy, and gives me real-time insights about the meeting? For example, a live summary, good questions to ask, or the ability to chat with the transcript to clarify something that was said. So I built it, using the recently released Arcee Foundation Model running locally on my machine. This is going to be fun. Let's go.

Let me tell you how I got the idea. Like many of you, I use Zoom frequently for online meetings, and I got an email saying they had launched something called Zoom AI Companion: a bunch of AI features to increase productivity during meetings. I thought it was pretty cool. Before you ask, no, I didn't get paid by Zoom to talk about this feature; it's just something I use. Zoom, if you want to send me a t-shirt, I'll take it, though.

I started looking into it more, tested it, and felt it was pretty cool. But then I wondered: how does it work? Where is the AI running? I took a look at the security and privacy documents Zoom provides, and kudos to them, they are pretty extensive. Looking at the data flows and transmission to third parties, I found an architecture diagram showing that everything except the user sits in the cloud. There's an LLM service and this mysterious Zoom AI, all running in the cloud, so I assume that's where the companion features run. You can ask questions about the transcript, they go all the way to the Zoom cloud, to the LLM service, and then you get an answer. There's nothing wrong with that; I'm sure Zoom is taking all the right security measures, like encryption, as the documents mention.

But I thought: can I do the same locally? Can I provide similar features for demo purposes? Obviously, I'm not claiming to build something as good or as production-ready as what Zoom has built, but can I demonstrate similar features with a small language model running on my local machine? The benefits would be low latency, no round trip to the cloud, and, more importantly, full privacy. Maybe I'm asking sensitive questions about the transcript, or I'm simply uncomfortable with my prompts and data going to a third-party LLM. This is what I tried building, and this is what I'm going to show you now.

So, what do we have here? Let's look at the two windows on the right. Here, I've got my app: a web app implemented with Gradio, running on my local machine. And here, I've got the AFM 4.5-billion-parameter model running locally on my Mac. This is the 8-bit quantized version of the preview model we announced a couple of weeks ago; I expect the final model to be even better.

In the web app, I've entered information that helps the app personalize the content for me: who I am, who I work for, and my goal for the meeting. Not everyone has the same goal: if you're on the finance team, you're more interested in the finance discussions; if you're in R&D, you're more interested in the tech, and so on. The companion can use this information to personalize the suggested questions, the summary, and more.

For the transcript, I'm using a real transcript from a real call we had with our Intel friends a couple of weeks ago. Real discussion, real people. I'm going to stream it to simulate real-time transcription from a live call.
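Here is one way the two pieces just described could be prototyped: a personalized system prompt built from the profile fields, and a generator that replays a saved transcript to simulate live speech-to-text. The field names, file format, and one-second delay are illustrative assumptions, not the app's actual internals.

```python
import time

# Illustrative profile fields; the app asks for the same three pieces of information.
PROFILE = {
    "name": "Julien",
    "company": "Arcee",
    "goal": "Find concrete ways to promote Arcee SLMs on Intel platforms",
}

# Personalized system prompt, so summaries and questions match my goal.
SYSTEM_PROMPT = (
    f"You are a meeting companion for {PROFILE['name']}, who works at {PROFILE['company']}. "
    f"Their goal for this meeting is: {PROFILE['goal']}. "
    "Tailor every summary, suggested question, and answer to that goal."
)

def stream_transcript(path: str, delay: float = 1.0):
    """Replay a saved transcript line by line to simulate live speech-to-text."""
    lines = []
    with open(path) as f:
        for line in f:
            lines.insert(0, line.strip())  # newest sentence first, no endless scrolling
            yield "\n".join(lines)
            time.sleep(delay)
```

In Gradio, a generator like this can be bound to a Textbox so the window updates on every yield, which is enough to fake a live transcription feed.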
The app uses this information to tailor the summary and questions, so that I get what I want from the meeting. I won't get a vanilla summary or vanilla questions; I'll get insights that match my interests. We'll see them as we go. I can also chat with the transcript in real time.

Let's start the transcription and speed it up a bit. We start streaming the text. Assume you're in the meeting: we've got Julien, Shannon, Andrew, etc., and the speech-to-text model starts transcribing. The most recent sentence is at the top, to save me from scrolling endlessly in that window. Every 60 seconds, we get an updated live summary, and we can see the inference is already running. Every 30 seconds, I get items to discuss based on the transcript so far; the model is prompted to suggest things that should be discussed to have a good meeting and help me reach my goal. "Clarify the specific areas where Arcee SLMs can be promoted on Intel platforms." Someone probably mentioned AWS, and now I'm getting "Clarify integration requirements for Arcee SLMs with Intel AWS offerings," and so on. We can see it's all flowing.

Maybe I was distracted for a second and I'm not sure who Shannon is. So: who's Shannon? I'm chatting with the transcript so far. Shannon works at Intel and is involved with the ISV partner motion and the global partners team. Awesome. Do we have more people here? Oh yeah, we've got Cole. Hi, Cole, if you're watching this. Maybe I was distracted: what does Cole do? He works with cloud service providers and is based in Washington, DC. Interesting. And maybe I want to know if he's on the same team as Shannon. Yes, he is. Great. So you can see how quickly and privately we can chat with the transcript.

We can see the model inferencing here. It generates about 57-58 tokens per second, which is very good. Prompt processing is pretty fast too, almost a thousand tokens per second, so even with a long context, meaning a long meeting, we can factor everything in.

Open points are piling up and getting more specific as the meeting goes on: "Clarify the SLMs that Intel is interested in promoting and how they align with Intel's strategy for AI on CPUs and Gaudi systems." So maybe we could ask who's the best person for those Intel Gaudi discussions. I should talk to Cole; he's focused on optimizing and going to market with Arcee on Intel offerings. Great.

The meeting goes on, and we see the live summary being updated. I can check whether the summary so far aligns with my goal for the meeting. If I were specifically interested in getting Intel marketing to promote my products or events, I could look at the summary and flag that we haven't discussed it at all: the meeting is interesting, but I'm not getting what I want out of it.

I'll pause the video and wait until the end of the transcription, because I have a few more features to show you. The meeting is over, and we can generate action points or even write a follow-up email directly here: from me to Cole, about cloud service providers, using the full transcript, by clicking the button to send the follow-up email. This looks good; I could use it directly, copy, paste, and send it. We could also generate action points, and there are lots more features we could add. The model is running fast enough, and I'm only using eight cores on the Mac, so there's definitely room for more inference and more features.

That's what I wanted to show you. Privacy, security, compliance, and all that good stuff are super important for enterprise users.
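The timed updates could be sketched as follows, assuming the local llama-server endpoint from the first snippet. The prompts, function names, and 30/60-second cadence mirror what the demo shows, but the app's actual implementation may differ.

```python
import time

from openai import OpenAI

# Same local llama-server endpoint as in the first sketch.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def ask_model(system_prompt: str, instruction: str, transcript_so_far: str) -> str:
    """One-shot call to the local model, grounded in the transcript so far."""
    response = client.chat.completions.create(
        model="afm-4.5b-preview",  # informational; llama-server serves one model
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user",
             "content": f"{instruction}\n\nTranscript so far:\n{transcript_so_far}"},
        ],
    )
    return response.choices[0].message.content

def refresh_loop(system_prompt, get_transcript, on_summary, on_open_points):
    """Refresh the open points every 30 seconds and the live summary every 60."""
    elapsed = 0
    while True:
        time.sleep(30)
        elapsed += 30
        text = get_transcript()
        on_open_points(ask_model(
            system_prompt,
            "List the open points that should still be discussed to reach my goal.",
            text,
        ))
        if elapsed % 60 == 0:
            on_summary(ask_model(
                system_prompt,
                "Write a concise live summary of the meeting so far.",
                text,
            ))

# In the app, this loop would run in a background thread, e.g.:
#   threading.Thread(target=refresh_loop, args=(...), daemon=True).start()
```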
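Chatting with the transcript and drafting the follow-up email are then just two more instructions through the same helper. This sketch reuses ask_model() and SYSTEM_PROMPT from the snippets above; the transcript filename is an assumption.

```python
# Reuses ask_model() and SYSTEM_PROMPT from the sketches above.
transcript = open("meeting_transcript.txt").read()  # assumed filename

# Chat with the transcript, e.g. to figure out who said what.
print(ask_model(SYSTEM_PROMPT, "Who is Shannon, and what is her role?", transcript))

# Draft the follow-up email once the meeting is over.
print(ask_model(
    SYSTEM_PROMPT,
    "Write a follow-up email from Julien to Cole about next steps with "
    "cloud service providers, based only on what was said in the meeting.",
    transcript,
))
```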
Given all that, I would definitely prefer to run as much as I can on my local machine and keep all my data private. This was just a small demo. Zoom, if you're watching this, let's talk. And yes, I'll take that t-shirt. Thanks, everyone, for watching. Until next time, keep rocking.

Tags

AI-powered meeting companion, Local AI model, Real-time meeting insights, Privacy and security, Zoom AI Companion