Hi everybody, this is Julien from Arcee. In this video, I'm very happy to introduce a new model called Spotlight. Spotlight is a visual language model that lets us interact with image content using natural language. First, I will demonstrate Spotlight in our inference platform called Model Engine, and then I'll demonstrate it programmatically using the OpenAI API. Ready?
In a previous video, I already showed you the Model Engine. In a nutshell, the Model Engine is our inference platform: it hosts our small language models and lets you use them through OpenAI-compatible APIs with pay-per-token billing. Feel free to check the product page and the previous video; I will put all the links in the video description. Now, let's take a look at Spotlight.
Here's the list of our SLMs that are available in Model Engine. We can see that Blitz, which I covered recently, is already there. But let's click on Spotlight. We can see that Spotlight is a 7-billion-parameter model. It's actually based on Qwen2.5-VL, improved by our amazing research team. The context length is 32K, and pricing is 10 cents per million input tokens and 40 cents per million output tokens.
Let's try Spotlight with an image. Here it is, and we're going with a simple prompt to see what comes up. The image shows Arcee AI on the Nasdaq building a few months ago. Cool stuff. The image shows a bustling urban scene, likely in a financial district, dominated by a large colorful advertisement on a building. The ad is for Arcee AI and announces a $24 million Series A raise. The area is busy with pedestrians and vehicles, including a couple of trucks. Construction cones suggest ongoing work, which is typical of New York. The sky is clear and blue, suggesting a sunny day. This is a pretty cool description. It's quite a complex picture, and it did pick up a lot of the elements, including the text and the logo, which is usually a problem for a lot of image models. They're not too good at picking up text. So this is good; it's working.
Now, let's give it a shot with the API. As mentioned, Model Engine uses the OpenAI API, which is cool because it means we can use the OpenAI client. The only things we have to configure are the base URL, pointing at the Model Engine endpoint, the API key, which you get when you sign up for Model Engine, and an HTTP/2 client for efficiency. Of course, we define the model name as Spotlight. Let's create the client and use a small utility function to print out streaming responses.
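Here's a minimal sketch of that setup. The endpoint URL, the `ARCEE_API_KEY` environment variable, and the `spotlight` model identifier are placeholders I'm assuming for illustration; check your Model Engine account for the exact values.

```python
import os

import httpx
from openai import OpenAI

# Assumptions: the base_url and model identifier below are placeholders --
# use the values shown in your Model Engine console.
client = OpenAI(
    base_url="https://models.arcee.ai/v1",   # hypothetical Model Engine endpoint
    api_key=os.environ["ARCEE_API_KEY"],     # key from your Model Engine account
    http_client=httpx.Client(http2=True),    # HTTP/2 client (requires httpx[http2])
)

MODEL = "spotlight"  # hypothetical model identifier


def print_streaming_response(stream):
    """Print the chunks of a streaming chat completion as they arrive."""
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()
```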
Let's try an image. Here's my image, pretty recognizable. My prompt will be, "Where was this picture taken?" When we work with images here, we can pass them to the model in two ways. The first way is to simply pass the URL. Here's the prompt and here's the URL. As usual, we can set typical parameters, such as streaming enabled. Let's run this and see what happens. The picture appears to have been taken at the Arc de Triomphe. The fireworks and the tricolore red, white, and blue smoke trails are reminiscent of Bastille Day celebrations, which usually occur on July 14th. This is correct. Military parades, Champs-Elysées, etc. So, it was an easy image, but the description and context are great. Once again, you saw how fast this model was. It's a 7B model, very lightweight, so inference is blazing fast, which is great if you need to process a lot of images or want low latency.
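A sketch of that first request, passing the image by URL: the message content mixes a text part and an image_url part, following the OpenAI chat format. The image URL is a placeholder; substitute any publicly accessible image.

```python
# Pass the image by URL and stream the response.
stream = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Where was this picture taken?"},
                # Placeholder URL -- point this at your own image.
                {"type": "image_url", "image_url": {"url": "https://example.com/fireworks.jpg"}},
            ],
        }
    ],
    stream=True,
)
print_streaming_response(stream)
```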
Here's another one. "Write a short and precise caption for this picture." Airshow over the Arc de Triomphe, colorful trails paint the sky above the Champs-Elysées. We can use this for different tasks, such as descriptions, and one of my favorites is metadata generation. When you have a lot of images to process, you might want more than just captions. You'll want that data in a specific format that you can ingest into a data store for image search, similarity search, content management, and more. Let's try generating JSON metadata for this picture, including country, city, landmark, short description, detailed description, themes, keywords, etc. Here it comes. France, Paris, Arc de Triomphe, short description, detailed description, themes, keywords, etc. Again, a super fast and elaborate answer. You could tweak the prompt to make it even better, but you see the interest of these visual language models. It's not only interacting with images using natural language but also generating structured content, metadata content, with millions of applications.
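The exact prompt from the demo isn't reproduced here, but a metadata request along these lines works the same way; the field list is illustrative, so adapt it to your own schema.

```python
# Ask for structured JSON metadata instead of a free-form caption.
metadata_prompt = (
    "Generate JSON metadata for this picture with the following fields: "
    "country, city, landmark, short_description, detailed_description, "
    "themes, keywords. Return only valid JSON."
)

stream = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": metadata_prompt},
                {"type": "image_url", "image_url": {"url": "https://example.com/fireworks.jpg"}},
            ],
        }
    ],
    stream=True,
)
print_streaming_response(stream)
```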
The first way to work with images is to pass the URL. Another way is to pass the image inline. In this case, we have to load the image and encode it in base64 format. This function loads the image and returns the base64 string. Now, I can load the image and pass it inline in the query. The field is still called image_url, but we are passing the base64 content inline. This should work the same. We get the same results, except this time we're passing the image data directly, which might be more convenient if you have a ton of local images and don't want to serve them over HTTP.
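A sketch of the inline variant, assuming a local JPEG file; the helper name and file path are hypothetical.

```python
import base64


def image_to_base64_url(path: str, mime_type: str = "image/jpeg") -> str:
    """Load a local image and return it as a base64 data URL."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Same request shape as before, but the image travels inline in the message body.
stream = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Write a short and precise caption for this picture."},
                {"type": "image_url", "image_url": {"url": image_to_base64_url("fireworks.jpg")}},
            ],
        }
    ],
    stream=True,
)
print_streaming_response(stream)
```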
That's what I wanted to show you today. Arcee Spotlight, our new VLM, and I'm sure you can put it to work in many interesting ways. If you want to try it, just go to models.arcee.ai, sign up for the Model Engine, and you can start testing immediately. Of course, you can try all the other models too, not just Spotlight. That's it for Spotlight, and I'll see you soon with more content. Until then, keep rocking.
Tags
Visual Language Model, Spotlight, Model Engine, Image Captioning, Metadata Generation