Vision transformers (ViT) were introduced to the literature two years ago and have since become a central part of computer vision research. Taking an architecture that performed exceptionally well on language tasks and carrying it over to computer vision was a bold move, but it worked. Since then, progress in the field has accelerated.
Vision transformers today differ from their natural language processing (NLP) counterparts. The field is dominated by hybrid, vision-specific transformers that use vision-specific attention modules. Adding these vision-specific biases makes the hybrid models more efficient.
Vanilla ViTs still have many desirable properties despite being outperformed on the cost-versus-performance trade-off. They consist of simple matrix multiplications, which makes them faster in practice than their raw FLOP count would suggest. They also support strong self-supervised pre-training techniques such as MAE (masked autoencoders), which can produce state-of-the-art results while being quick to train. And because they make no assumptions about the data, they can be applied to many modalities with little to no change.
As good as that sounds, everything has a cost, and for ViTs that cost is sheer size. Running these massive models just to reproduce their results can be a problem.
There have been studies aimed at this problem, and token pruning is one of them. Because transformers are agnostic to the number of input tokens, tokens can be pruned at runtime to make the model faster. However, this approach has several drawbacks, the main one being the information lost when tokens are discarded. You can't prune every token: there is a limit to how many can be removed before the information loss gets too high. In addition, existing methods require retraining the model for it to work well with pruned tokens.
So token pruning is not the way to go, yet we still want to use ViTs, and in many cases they are simply too slow. What could be the solution? How could we speed up ViTs the way pruning does while keeping higher accuracy than pruning? There is an answer to these questions, and it's called Token Merging.
Token Merging (or ToMe) combines tokens instead of pruning them, and thanks to its custom matching algorithm, it is as fast as pruning while being more accurate. Plus, it works without requiring additional training, so you can use it on huge models to speed them up without sacrificing much accuracy.
The goal is to integrate a token merging module into an existing ViT to increase training and inference throughput by combining redundant tokens, without necessarily requiring training.
Token merging is applied between the attention and MLP branches of each transformer block. This allows information to be propagated from the tokens that are about to be merged, and it lets the ViT reuse features from the attention module to decide what to merge.
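The placement described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `block_with_merging`, `add_residual`, and the three `toy_*` stand-ins are hypothetical names invented for this example, not part of the ToMe code, and the stand-ins do no real attention or MLP computation. The point is the order of operations: attention first, then merging (using the keys that attention produced), then the MLP on the reduced token set.

```python
def add_residual(x, update):
    """Element-wise residual connection: x + update, token by token."""
    return [[a + b for a, b in zip(tok, upd)] for tok, upd in zip(x, update)]


def block_with_merging(x, attention, mlp, merge):
    """One transformer block with a merge step between attention and MLP."""
    attn_out, keys = attention(x)   # attention also hands back its keys
    x = add_residual(x, attn_out)
    x = merge(x, keys)              # fewer tokens from this point on
    x = add_residual(x, mlp(x))     # the MLP only sees the merged tokens
    return x


# Hypothetical stand-ins so the sketch runs end to end.
def toy_attention(x):
    # No real attention: zero update, and the tokens double as keys.
    return [[0.0] * len(tok) for tok in x], x


def toy_mlp(x):
    # No real MLP: zero update.
    return [[0.0] * len(tok) for tok in x]


def toy_merge(x, keys):
    # Placeholder merge: average the last two tokens into one.
    merged = [(a + b) / 2 for a, b in zip(x[-2], x[-1])]
    return x[:-2] + [merged]


tokens = [[1.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
out = block_with_merging(tokens, toy_attention, toy_mlp, toy_merge)
print(out)  # three tokens in, two tokens out
```

Because the merge step sits before the MLP, every later layer works on fewer tokens, which is where the speed-up would come from.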
The first step in merging is to determine which tokens are similar. This is relatively easy in a ViT, thanks to the QKV (query, key, value) features that attention has already computed. The keys already summarize the information each token contains, so all that remains is to apply a dot-product similarity metric, such as cosine similarity, between the keys of each token.
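As a rough illustration of this step, the snippet below computes pairwise similarity between key vectors in plain Python. It assumes cosine similarity (a dot product of length-normalized vectors) as the metric; the function names and the toy key values are made up for this example, and a real model would do the same thing with batched tensor operations.

```python
import math


def cosine_similarity(a, b):
    """Dot product of two vectors after normalizing each to unit length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def key_similarity_matrix(keys):
    """Pairwise similarity between the key vectors of all tokens."""
    return [[cosine_similarity(ki, kj) for kj in keys] for ki in keys]


# Toy keys for three tokens (made-up values).
# The first two point in the same direction; the third is orthogonal to both.
keys = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
sim = key_similarity_matrix(keys)
```

Here tokens 0 and 1 come out as maximally similar and would be good candidates for merging, while token 2 is unrelated to both.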
Once token similarities are found, the next step is to match tokens with each other. This is the tricky part: the matching has to be very fast, which rules out existing solutions such as k-means clustering or graph cuts. Token Merging solves this matching problem with a new bipartite soft matching algorithm.
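A minimal sketch of how such a bipartite matching could work is shown below. It follows the general recipe of splitting tokens into two sets, linking each token in the first set to its most similar token in the second, and merging only the `r` strongest links. The details are assumptions made for illustration: the alternating split, the plain average used for merging, and the names `bipartite_soft_matching` and `cosine` are not taken from the authors' code, and this simplified version does not track how many original tokens each merged token represents.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def bipartite_soft_matching(tokens, keys, r):
    """Merge up to r tokens and return the reduced token list."""
    # 1. Split the tokens into two sets, A and B, by alternating index.
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    r = min(r, len(a_idx))
    if r <= 0 or not b_idx:
        return [list(t) for t in tokens]

    # 2. Draw one edge from each token in A to its most similar token in B.
    edges = []
    for i in a_idx:
        best = max(b_idx, key=lambda j: cosine(keys[i], keys[j]))
        edges.append((cosine(keys[i], keys[best]), i, best))

    # 3. Keep only the r most similar edges.
    edges.sort(key=lambda e: e[0], reverse=True)
    kept_edges = edges[:r]
    merged_a = {i for _, i, _ in kept_edges}

    # 4. Merge connected tokens by averaging their features (simplification).
    groups = {j: [tokens[j]] for j in b_idx}
    for _, i, j in kept_edges:
        groups[j].append(tokens[i])
    merged_b = [[sum(col) / len(col) for col in zip(*groups[j])] for j in b_idx]

    # 5. Put the unmerged A tokens and the (possibly merged) B tokens together.
    kept_a = [list(tokens[i]) for i in a_idx if i not in merged_a]
    return kept_a + merged_b


# Toy example (made-up values); the features double as keys here.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = bipartite_soft_matching(tokens, keys=tokens, r=1)
print(out)  # token 0 is merged into token 1; four tokens become three
```

Because each token in the first set only needs its single best match, there is no iterative clustering step, which is what keeps this kind of matching cheap compared with k-means or graph cuts.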
This was a brief summary of Token Merging, a technique for increasing the throughput and real-world training speed of ViT models. Using token merging can double training speed in some cases. It can be applied to image, video, and audio tasks while keeping accuracy competitive.
This article was written as a research summary by Marktechpost staff, based on the research paper 'Token Merging: Your ViT But Faster'. All credit for this research goes to the researchers on this project. Check out the paper and GitHub code.
Ekrem Çetinkaya obtained his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a doctoral degree at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networks.