Computer Vision Meetup Intro to Hugging Face Transformers
January 13, 2023
In this code-level talk, Julien will show you how to quickly build and deploy computer vision applications based on Transformer models. Along the way, you'll learn about the portfolio of open source and commercial Hugging Face solutions, and how they can help you deliver high-quality solutions faster than ever before.
Julien Simon is currently Chief Evangelist at Hugging Face. He previously spent 6 years at Amazon Web Services where he was the Global Technical Evangelist for AI & Machine Learning. Prior to joining AWS, Julien served for 10 years as CTO/VP Engineering in large-scale startups.
Get a summary of this presentation in the recap:
https://voxel51.com/blog/recapping-the-computer-vision-meetup-january-2023/
Join the Computer Vision Meetup closest to your timezone:
https://www.meetup.com/pro/computer-vision-meetups/
Recorded on Jan 12, 2023 at the virtual Computer Vision Meetup.
#computervision #machinelearning #datascience #ai
Transcript
All right, so you should see my screen right now. In this talk, I want to give you an overview of what it means to work with computer vision models hosted on the Hugging Face Hub. It's a whirlwind tour of the different features, from finding a model to training one, and options for deployment, etc.
The first thing to note is that computer vision is incredibly important. I remember when I first got involved with deep learning around 2015, computer vision and the ImageNet dataset and related competitions got me into it. When you say "deep learning" to developers or people less familiar with the science, they immediately talk about computer vision. It's very striking. Anyone can understand it. You can do demos with images and videos for business users and stakeholders, and they'll immediately understand what you're showing them. With NLP or other types of data, it's not always so obvious.
Computer vision is critical, and many enterprise and startup customers are keen to build computer vision use cases. For a long time, this meant working with convolutional neural networks (CNNs). I assume everyone here is familiar with CNNs, so I won't go into all the details. The key point is that we train CNNs on image datasets to learn convolution filters that extract the right features, which the model then uses to classify images or detect objects. The model focuses on edges and geometric features, and pooling layers reduce the spatial resolution, ideally keeping the important information.
CNNs are still great and evolving, often mixed with transformer models. However, the golden age of CNNs is probably behind us, thanks to the Vision Transformer that Google released a little over two years ago. The Vision Transformer uses the vanilla transformer architecture with its attention mechanism. The clever part is treating an image as a sequence of tokens: the image is split into patches, which are flattened and treated as a sequence of tokens, with position embeddings to indicate the relative positions of the patches.
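To make the "image as a sequence of tokens" idea concrete, here is a minimal sketch (not from the talk) of how a 224x224 image is cut into 16x16 patches and flattened into a sequence, using PyTorch; the tensor is random and the shapes are only illustrative.

```python
import torch

# A 224x224 RGB image with 16x16 patches yields 14*14 = 196 patches,
# each flattened to a 16*16*3 = 768-dimensional vector.
image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
patch_size = 16

# unfold extracts non-overlapping patches along height and width
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# -> (1, 3, 14, 14, 16, 16); regroup into a sequence of flattened patches
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, -1)
print(patches.shape)  # torch.Size([1, 196, 768])

# In the real model, each flattened patch goes through a linear projection,
# a learned position embedding is added, and the resulting sequence is fed
# to a standard transformer encoder.
```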
This NLP-inspired technique applied to images has proven extremely effective. The benchmarks from the research paper show the Vision Transformer generally outperforming ResNet-152, which was state of the art at the time. The key difference is that attention gives the model a global view: every patch can attend to every other patch in the image. That kind of sequence modeling on raw images never worked well with fully connected networks, but the attention mechanism works exceptionally well here.
Training this model on TPUs was four times less compute-intensive than training ResNet-152. Transfer learning is a key property of transformers, whether for NLP or computer vision: you pre-train the model on a large dataset to learn general-purpose representations, then specialize it for downstream tasks with a little bit of your own data.
Two years ago, when you talked about transformers, you meant NLP. Google BERT was released in 2018, and NLP was the primary use case. In 2020, only 2% of research papers using transformers were computer vision-related. Fast forward two years, and NLP is still the biggest category, but it's less than 50%. There's been an 11x increase in computer vision transformer research. Video has gone from zero to 5%. This makes sense, as computer vision put deep learning on the IT radar for enterprises, and people are trying to capture that market with transformer models.
This brings us to Hugging Face, which answers the question of where to find and use these models. In 2018, using BERT out of the box was difficult, with complex GitHub repos and fine-tuning scripts. We're trying to solve this by building open-source libraries, a website, and commercial tools that make it easy to work with state-of-the-art models, even if you're not an expert. We're often called the GitHub of machine learning, and it's a decent analogy. We all go to GitHub to share and find code for projects, and many people go to the Hugging Face Hub to find and share models.
When I updated this slide a few days ago, we had 110K models. Now, we're at 120K-plus, with more than a thousand models added every day. We have about 18K datasets and support 25-plus machine learning libraries. You can find Keras, scikit-learn, and fastai models on the Hub, with the same user interface and features as transformer models. We have about 10,000 organizations contributing models, from Google, Meta, and Microsoft to research labs, universities, and individual developers. We have about 100,000 active users and a million model downloads daily.
Zooming in on computer vision, we have about 4,000 models for various tasks, from image classification and object detection to generative models. I've broken these models into four categories: PyTorch Image Models (timm), computer vision transformers, multi-modal transformers, and generative models.
We have about 300 models from the timm library, a collection of computer vision models that includes traditional architectures like ResNet, ResNeXt, and MobileNet, as well as transformer models. These models are now available on the Hugging Face Hub, making it easy to pull and use them.
The bulk of the CV models are transformer models. On the Hub, you can find 1,700-plus models for classification, sorted by downloads. You can learn about each model, see code snippets, and use the inference widget to predict images directly in the browser. Links to associated datasets and Spaces are also provided.
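You can also query the Hub programmatically rather than browsing. Here is a small sketch (not shown in the talk) using the huggingface_hub library; attribute names such as modelId can vary slightly between library versions.

```python
from huggingface_hub import HfApi

# List the five most downloaded image-classification models on the Hub.
api = HfApi()
models = api.list_models(filter="image-classification", sort="downloads", direction=-1, limit=5)
for m in models:
    print(m.modelId, m.downloads)
```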
Spaces are one of the coolest Hugging Face features. A Space is a small application you can write with Streamlit or Gradio and push to the Hugging Face Hub, where it automatically runs in a container. Here, I've built a simple UI to classify images with multiple models, using the Transformers library's pipeline API. It takes just 52 lines of code to create this app, making it easy to experiment with models.
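The exact 52-line app isn't reproduced in the transcript, but a stripped-down Gradio sketch along the same lines might look like this, using a single public ViT checkpoint (google/vit-base-patch16-224) instead of several models.

```python
import gradio as gr
from transformers import pipeline

# Load an image-classification pipeline from a public Hub checkpoint.
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

def predict(image):
    # The pipeline returns a list of {"label", "score"} dicts;
    # Gradio's Label component expects a {label: score} mapping.
    results = classifier(image)
    return {r["label"]: r["score"] for r in results}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Image(type="pil"),
    outputs=gr.Label(num_top_classes=5),
)
demo.launch()
```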
We have models for detection, segmentation, and more. Teams often publish their models to Hugging Face within hours of publishing their papers. Using these models is straightforward, as shown by the simple code for object detection. For production, you can run the pipeline directly if you don't need complex pre-processing.
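The detection code shown on the slide isn't in the transcript, but a minimal pipeline sketch, assuming the public DETR checkpoint facebook/detr-resnet-50 and a placeholder image path, looks like this.

```python
from transformers import pipeline

# Object detection with a pre-trained DETR model from the Hub.
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
results = detector("street_scene.jpg")  # placeholder image path

for r in results:
    # Each result contains a label, a confidence score, and a bounding box.
    print(f"{r['label']}: {r['score']:.2f} at {r['box']}")
```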
For more advanced models, I've picked three: image captioning, zero-shot segmentation using a text prompt, and audio classification using a spectrogram. These models combine modalities (text, vision, and audio) and demonstrate the flexibility of the transformer architecture. For example, the Audio Spectrogram Transformer builds a spectrogram from an audio clip and uses a vision transformer to classify it, achieving better accuracy and speed than previous models.
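As one concrete example, image captioning is available through the image-to-text pipeline. Here is a hedged sketch, assuming the public nlpconnect/vit-gpt2-image-captioning checkpoint and a placeholder image path.

```python
from transformers import pipeline

# Image captioning: a ViT encoder paired with a GPT-2 decoder.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
print(captioner("beach_photo.jpg"))  # placeholder image path
# Example output (will vary): [{'generated_text': 'a group of people standing on a beach'}]
```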
Generative models, like Stable Diffusion, are also available. I've shown an example of generating an image with Stable Diffusion, which is surprisingly simple to use despite its complexity. We can also do image inpainting and other creative tasks with these models.
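The generation example relies on the diffusers library. A minimal sketch, assuming the runwayml/stable-diffusion-v1-5 checkpoint and a CUDA GPU:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load Stable Diffusion in half precision and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate an image from a text prompt and save it.
image = pipe("an astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")
```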
For more advanced use cases, you can train and deploy models using Hugging Face's open-source libraries and services. We have libraries for distributed training, hardware acceleration, and AutoML. You can deploy applications on Spaces, or use our Inference API and Inference Endpoints services for managed infrastructure with auto-scaling and security options.
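For instance, calling the hosted Inference API for an image-classification model is a single HTTP request. A hedged sketch, with a placeholder token and image path:

```python
import requests

# Send an image to the hosted Inference API and print the predictions.
API_URL = "https://api-inference.huggingface.co/models/google/vit-base-patch16-224"
headers = {"Authorization": "Bearer hf_xxx"}  # your Hugging Face token (placeholder)

with open("cat.jpg", "rb") as f:  # placeholder image path
    response = requests.post(API_URL, headers=headers, data=f.read())

print(response.json())  # list of {"label": ..., "score": ...} predictions
```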
To get started, check out the Hugging Face tasks and course, which will take you from a beginner to an expert. We have documentation on datasets, transformers, and diffusers, and you can find code, training scripts, and examples on GitHub. If you have questions, visit the Hugging Face forum. For commercial support, contact us on the support page. You can also find me on Twitter, YouTube, and LinkedIn.
Thank you for inviting me. I hope this was useful. If you have questions, I'd be happy to answer them. Thanks to everyone who listened.
We have two questions in the Q&A. Is ViT already used in production somewhere? Yes, we have customers using ViT or other vision transformers in production. The main challenges are fine-tuning the model on your own data and managing inference latency. While training on image datasets is simple with our libraries, transformers can still be too large for edge devices. However, there are smaller, mobile-compatible versions of Vision Transformers, some as small as 2.3 megabytes.
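For reference, the fine-tuning workflow mentioned here follows a fairly standard recipe with the Trainer API. A hedged sketch, assuming the public beans dataset (three classes) and purely illustrative hyperparameters:

```python
import torch
from datasets import load_dataset
from transformers import (AutoImageProcessor, AutoModelForImageClassification,
                          Trainer, TrainingArguments)

# Fine-tune a pre-trained ViT on a small image dataset from the Hub.
dataset = load_dataset("beans")
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

def transform(batch):
    # Resize and normalize the PIL images into pixel_values tensors.
    inputs = processor([img.convert("RGB") for img in batch["image"]], return_tensors="pt")
    inputs["labels"] = batch["labels"]
    return inputs

dataset = dataset.with_transform(transform)

def collate_fn(examples):
    return {
        "pixel_values": torch.stack([e["pixel_values"] for e in examples]),
        "labels": torch.tensor([e["labels"] for e in examples]),
    }

model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=3
)

args = TrainingArguments(
    output_dir="vit-beans",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
    remove_unused_columns=False,  # keep the "image" column for the transform
)

trainer = Trainer(model=model, args=args, data_collator=collate_fn,
                  train_dataset=dataset["train"], eval_dataset=dataset["validation"])
trainer.train()
```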
Can we target edge devices with Hugging Face models? Yes, you can, but the hardware requirements of transformer models can still be overwhelming for edge devices. Work is ongoing to shrink these models through techniques like quantization and pruning. The Optimum library integrates optimization tools like Intel OpenVINO, making it feasible to use transformers at the edge.
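As an illustration of that workflow, here is a hedged sketch of exporting a ViT classifier to OpenVINO with Optimum. Class and argument names follow optimum-intel at the time of writing and may differ across versions; treat this as a sketch of the idea, not a definitive API reference.

```python
from optimum.intel import OVModelForImageClassification
from transformers import AutoImageProcessor, pipeline

model_id = "google/vit-base-patch16-224"
# export=True converts the PyTorch checkpoint to the OpenVINO IR format
# (older optimum-intel releases may use a different flag name).
ov_model = OVModelForImageClassification.from_pretrained(model_id, export=True)
processor = AutoImageProcessor.from_pretrained(model_id)

# The optimized model plugs into the regular pipeline API.
classifier = pipeline("image-classification", model=ov_model, image_processor=processor)
print(classifier("cat.jpg"))  # placeholder image path
```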
When playing with applications in spaces, are user-uploaded images ultimately archived or deleted? No, they are not stored. Unless you explicitly add storage code, nothing is stored. Our privacy policy is clear that none of the data you send to spaces or inference widgets is stored.
You mentioned 4x better efficiency for ViT compared to ResNets. Is that for training? Yes, the ViT paper reports a 4x training speedup compared to ResNet-152. Transformers generally perform better with large datasets, while CNNs tend to do better with smaller datasets. The huge benefit of transformers is transfer learning, and many customers achieve good results with pre-trained models and fine-tuning.
Can you compare the size of transformers to CNNs? The Vision Transformer is about 300 to 400 megabytes, but downscaled versions can be as small as tens of megabytes, making them suitable for small devices.
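If you want to check model size yourself, here is a quick sketch (not from the talk) that counts parameters for the base ViT checkpoint and estimates its fp32 footprint.

```python
from transformers import AutoModelForImageClassification

# Count parameters and estimate on-disk size at 4 bytes per parameter (fp32).
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.0f}M parameters, ~{num_params * 4 / 1e6:.0f} MB in fp32")
# ViT-Base has roughly 86M parameters, i.e. ~340 MB in fp32, consistent with
# the 300-to-400-megabyte figure mentioned above.
```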
Thank you, Julien, for the presentation. It was a pleasure to talk to the Computer Vision Meetup. I hope you found it useful. Thank you.