Semantic search on images and videos with BridgeTower

February 07, 2023
BridgeTower is a new vision-language multimodal model by Intel and Microsoft. In this video, I show you two quick demos for semantic search on images and videos. ⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️

- Research paper: https://arxiv.org/abs/2206.08657
- Image demo: https://huggingface.co/spaces/juliensimon/bridgetower-demo
- Video demo: https://huggingface.co/spaces/juliensimon/bridgetower-video-search

Transcript

Hi everybody, this is Julien from Hugging Face. In this video, I'd like to show you a couple of demos I've built with an exciting new model called BridgeTower. It was released just a couple of weeks ago by Intel and Microsoft. It's a multimodal model for language and vision, and you can use it to score pieces of text against images, which is pretty innovative. So, I have a small demo for images and a small demo for videos. Let's take a look.

Here's the research paper for BridgeTower. I'll put the link in the video description. What I find very interesting is this high-level view of the model architecture. So far, similar models have used a text encoder and an image encoder to learn their representations, and then, at the end, a cross-modal encoder to put those representations together. Where BridgeTower is different, as you can see here, is that the cross-modal encoding happens within the layers of the text encoder and the visual encoder. The intuition is that we can extract fine-grained representations from text and images and learn from both repeatedly, instead of waiting until the final representations have been computed. I think that's pretty clever, and you can go and read the paper for the details.

So, how do we use it? Let's look at an example. This one is for images, and the code is very simple: download the model and the processor, pass an image and a list of comma-separated texts, iterate over the texts to get a score for each one, and output the scores.
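For reference, here is a minimal sketch of that scoring loop, using the BridgeTower image-and-text retrieval classes from the transformers library; the checkpoint name matches the public BridgeTower release, but the image path and candidate texts are just placeholders, and the actual demo Space may differ in the details.

```python
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval

# Download the processor and the image-text matching model (CPU is fine)
checkpoint = "BridgeTower/bridgetower-base-itm-mlm"
processor = BridgeTowerProcessor.from_pretrained(checkpoint)
model = BridgeTowerForImageAndTextRetrieval.from_pretrained(checkpoint)

image = Image.open("concert.jpg")  # placeholder: any local image
texts = [
    "a metal band on stage",
    "a chamber orchestra on stage",
    "a giant rubber duck",
    "a machine learning meetup",
]

# Score each candidate text against the image
scores = {}
for text in texts:
    encoding = processor(image, text, return_tensors="pt")
    outputs = model(**encoding)
    scores[text] = outputs.logits[0, 1].item()  # logit at index 1 is the "match" score

# Print the texts from best to worst match
for text, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.2f}  {text}")
```

The model returns a raw matching logit for each image-text pair, so higher simply means a better match.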
So let's give that a try. I have a couple of sample images, and this is running on CPU, by the way. Here's the first test image. The predictions have been cached, so no need to wait, although it normally only takes a few seconds. My texts are: a metal band on stage, a chamber orchestra on stage, a giant rubber duck, a machine learning meetup. The highest score for this image is "a giant rubber duck", which is indeed quite difficult to miss. It's clearly not a machine learning meetup, although if you have a meetup that looks like that, I want to speak at it. And it's definitely not a chamber orchestra on stage either. It's probably a metal band.

You can use this model to find the best caption for an image if you're hesitating between a few of them: try different captions and see which one the model scores highest. And of course, you can do semantic search. If you have a huge collection of images and you're looking for a particular piece of text, you can run the model on the images, score each image against the text, and return the top-ranking images. Here's another example: a group of angels, medieval art, religious art, a movie poster. No, I don't think so. Super simple to use and pretty accurate. Go and try it with your own images and texts, and let me know how it works for you.

I went one step further and tried to implement video search. The idea is to take a video, grab one frame every n seconds, score those frames against the text query, and return the longest possible video intervals that match the query with a minimum score. In this case, I am looking for wild bears. The code is not difficult at all; it's quite similar to the previous one. The frame-scoring code is identical to the image code I just showed you, and processing the video is just a bit longer: open the video, grab one frame every n seconds, score it against the query, and if the score is higher than my minimum score, start marking a clip until the next frame whose score falls below the minimum. That's how I catch the interval, and for each interval I return its boundaries, a screenshot of the beginning of the clip, and the score. I'll sketch that loop below.

So, what we see here is that we find three segments: from 25 seconds to 30 seconds, from 35 to 50, and from 115 to 120. They're rounded to five seconds because of the five-second sampling rate, but you can go more fine-grained if you want. I can see the scores returned by the model and the screenshot at the start of each clip. Definitely bears. Again, it's a first try, and it's not so bad. You can go and try it yourself. I'll keep working on this to make it faster.
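Here is a rough sketch of that sampling loop, assuming OpenCV for frame extraction and the same BridgeTower checkpoint as above; the five-second sampling interval, the score threshold, and the file name are illustrative values rather than the exact settings of the Space.

```python
import cv2
import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval

checkpoint = "BridgeTower/bridgetower-base-itm-mlm"
processor = BridgeTowerProcessor.from_pretrained(checkpoint)
model = BridgeTowerForImageAndTextRetrieval.from_pretrained(checkpoint)

def find_matching_segments(video_path, query, sample_every_s=5, min_score=3.0):
    """Return (start, end) intervals, in seconds, whose sampled frames all match the query."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = int(fps * sample_every_s)  # grab one frame every n seconds
    segments, start, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            # Score this frame against the text query, exactly like the image demo
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            encoding = processor(image, query, return_tensors="pt")
            with torch.no_grad():
                score = model(**encoding).logits[0, 1].item()
            t = frame_idx / fps
            if score >= min_score and start is None:
                start = t                    # open a matching interval
            elif score < min_score and start is not None:
                segments.append((start, t))  # close it at the first low-scoring frame
                start = None
        frame_idx += 1
    if start is not None:                    # the video ended inside a matching interval
        segments.append((start, frame_idx / fps))
    cap.release()
    return segments

print(find_matching_segments("bears.mp4", "wild bears"))  # e.g. [(25.0, 30.0), (35.0, 50.0), ...]
```

Decoding every frame just to skip most of them is part of what makes this slow; seeking directly to the sampled timestamps with cap.set(cv2.CAP_PROP_POS_FRAMES, ...) is one obvious way to speed it up.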
That's it for today. I hope you liked it. BridgeTower is a really cool model. I'll be back soon with more content. Until next time, keep rocking.

Tags

BridgeTower, Multi-modal Model, Image and Text Matching, Video Search, Machine Learning Demo