Retail AI at the edge: Cisco Live 2025 interview

July 01, 2025
The retail landscape is rapidly evolving, with AI-driven solutions reshaping how stores operate and engage with customers. Today's consumers expect personalized, efficient service, while staff need immediate access to accurate information on customer traffic, inventory, sales, and other key metrics. Meeting these demands requires powerful edge computing solutions that can process and deliver insights in real time. Powered by Intel Xeon 6 CPUs running in a Cisco UCS server, the Edge IQ Retail Assistant exemplifies this potential. This technical demonstrator will be featured in the Intel Showcase (#3035) at Cisco Live 2025, taking place in San Diego, CA, from June 8 to 12, 2025. Attendees will get a firsthand experience of how generative AI can transform retail operations without relying on GPUs. Thanks to a chatbot interface powered by open-source small language models and real-time data analytics, store associates can interact naturally through voice or text, receiving immediate information about product availability from Chooch's inventory system or crowd density from WaitTime's analytics platform. The assistant seamlessly translates these inquiries into actionable insights, helping staff make informed decisions that enhance customer experience while optimizing store operations, all powered by CPU processing. You can also read our blog post at https://www.arcee.ai/blog/building-an-ai-retail-assistant-at-the-edge-with-small-language-models-and-intel-xeon-cpus ⭐️⭐️⭐️ While you're here, I’ve got a great deal for you! If you care about your online security, you need Proton Pass — the ultra-secure password manager from the creators of Proton Mail. GET 60% OFF at https://go.getproton.me/aff_c?offer_id=42&aff_id=13055&url_id=994 ⭐️⭐️⭐️

Transcript

Welcome to Intel QuickBytes. We're here to talk about Arcee, a leading vendor in the small language model market. One of the challenges we hear from our customers is that they want to deploy more AI solutions at the edge and in retail locations. While it's great for increasing employee productivity, there are more dashboards and different areas they need to look at to get a complete picture of store status. So we partnered with Arcee, and I'm joined here by Julien Simon, Chief Evangelist at Arcee, to create a retail digital assistant chat that aggregates and provides real-time information to store managers. Julien, can you share more about who Arcee is?

Arcee is a US startup with a heavy research focus. We're a model builder. We started by improving the best open-source models available on Hugging Face through our post-training stack, which is built from open-source libraries like MergeKit and others. Now we're also training net-new foundation models. We take those models and host them everywhere we can, from devices to edge servers to the cloud.

Awesome. And as far as the RetailIQ Digital Assistant chat goes, can you share more about what the solution looks like from an architecture standpoint?

Sure. Working with some of the other technology partners, let me show you the architecture. Everything is running on a Cisco server. We've got the Chooch stack giving us inventory: cameras watch the products in the bins. And we have the WaitTime system with crowd cameras. Both Chooch and WaitTime expose APIs, which are queried by our solution in the middle. We have a chatbot UI, which we'll take a look at afterwards, and a small language model optimized for Intel CPUs. Users can ask questions about inventory or crowd statistics, all on the same server, running on CPU.

And can you share a little more about the small language model, its size and performance, especially how it runs on a CPU-based Xeon platform?

Yes. In this particular example, we use an 8-billion-parameter small language model that we built a few months ago. It's a Llama 3.1 variant, and when we released it, it was the best Llama 3.1 8B variant available on Hugging Face. To run it on a CPU, we optimized it: we used the OpenVINO toolkit to quantize the model to 4-bit precision, making it smaller and more CPU-friendly without degrading it in any significant way. We run this optimized model on the CPU, and it runs fast enough, well above 10 tokens per second, which is what you need for a good user experience. All running on Cisco Unified Computing System servers.

Well, I know what everyone else is thinking, so can you show a quick demo of what this digital assistant chat looks like?

Sure. We see the chatbot user interface, which is running on the server, and we can ask questions about inventory. So we could say, "Hey, show me potato chips in the inventory," or "Do we have Sprite in stock?" or "Show me out-of-stock products." Using the inventory information coming from the API, we get the appropriate data, and the small language model builds a conversation on it. The same goes for crowd statistics: "Show me crowd stats" pulls from the WaitTime API, and we can see how dense the crowd is in a particular area of the booth. And of course, you can ask follow-up questions because we have a language model that can maintain context.
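For readers who want to try the quantization step Julien describes, here is a minimal sketch using Hugging Face Optimum Intel (the OpenVINO integration) to export a Llama 3.1 8B model with 4-bit weights and run it on a Xeon CPU. The model ID, quantization settings, and output path are illustrative assumptions, not details confirmed in the interview.

```python
# pip install "optimum[openvino]" transformers
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

# Assumed model ID: an Arcee Llama 3.1 8B variant on Hugging Face.
model_id = "arcee-ai/Llama-3.1-SuperNova-Lite"

# Export the model to OpenVINO IR and quantize weights to 4-bit precision.
q_config = OVWeightQuantizationConfig(bits=4, group_size=128, ratio=1.0)
model = OVModelForCausalLM.from_pretrained(
    model_id, export=True, quantization_config=q_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the optimized model for deployment on the Xeon-based edge server.
model.save_pretrained("llama31-8b-int4-ov")
tokenizer.save_pretrained("llama31-8b-int4-ov")

# Quick CPU inference check.
inputs = tokenizer("Do we have Sprite in stock?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The 4-bit weight quantization is what brings the 8B model down to a size and memory bandwidth footprint that a CPU can serve at interactive speed.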
Instead of just looking at information on a million dashboards, it's all in one place, and we use human language to access it. Can you give some other examples of store information this language model could access?

Yeah, in the context of a store, as a store manager you could ask about staff information: "Who's the manager on duty tomorrow?" "Who's the barista tomorrow?" "Is that person also working the next day?" You could manage all that information. You could ask questions about kitchen equipment: "Do we have any equipment issues?" assuming you have a back-office system telling you, "Hey, maybe the oven is broken or the grills need maintenance," etc. You could have sales information, of course, plugging into your CRM and sales systems: "How many customers did we serve yesterday?" "What was the revenue for last week?" So you could unify all those different IT systems running in your enterprise and access them at the edge in a very simple, friendly way. And you could also use speech-to-text and text-to-speech in the demo if you didn't even want to type anything on a keyboard.

That's amazing. While we're showing this use case for retail, I'm assuming you can use this for other verticals or departments that need a digital assistant chat.

Sure. There are a lot of businesses that need access to information in real time, anywhere they are. Imagine civil engineering and construction, healthcare, mining, any activity where you're out in the field and need to access systems. It's not easy to carry a full-fledged laptop or system. You can just use a tablet and a mic, ask questions, connect back to your local server, and get immediate access without having to type anything on a tiny keyboard. So, yeah, I think it could apply to a ton of different use cases.

That's great. Julien, thanks for sharing. Everyone, you can see the power of RetailIQ and partners like Arcee delivering a small language model digital assistant anywhere you really need it. If you want to learn more, visit intel.com/cisco.
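As a rough illustration of how an assistant like this grounds its answers on live store data, here is a minimal Python sketch that routes a question to either an inventory or a crowd-analytics endpoint and builds a prompt for the quantized model shown above. The endpoint URLs, keyword routing, and prompt format are all hypothetical; the actual Chooch and WaitTime APIs are not documented in this interview.

```python
import requests

# Hypothetical local endpoints standing in for the Chooch and WaitTime APIs.
INVENTORY_API = "http://edge-server.local:8001/inventory"
CROWD_API = "http://edge-server.local:8002/crowd"

def fetch_context(question: str) -> str:
    """Very simple keyword router: pick the API whose data the question needs."""
    if any(w in question.lower() for w in ("stock", "inventory", "product", "shelf")):
        data = requests.get(INVENTORY_API, timeout=5).json()
    else:
        data = requests.get(CROWD_API, timeout=5).json()
    return str(data)

def build_prompt(question: str) -> str:
    """Ground the small language model on live store data before it answers."""
    context = fetch_context(question)
    return (
        "You are a retail assistant. Answer using only the data below.\n"
        f"Store data: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

# The prompt can then be passed to the quantized model shown earlier, e.g.:
# inputs = tokenizer(build_prompt("Do we have Sprite in stock?"), return_tensors="pt")
# print(tokenizer.decode(model.generate(**inputs, max_new_tokens=128)[0]))
```

The same pattern extends to staffing, equipment, or sales systems: each one becomes another endpoint the router can query, and the language model turns the returned data into a conversational answer.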

Tags

RetailIQ, Edge AI, Digital Assistant, Small Language Model, Real-Time Information