Review of the article "Visual Transformers": a new approach to training computer vision models based on visual tokens

This work is interesting because the authors propose a new approach to training models on images: instead of relying only on pixels and convolutions, images are represented as visual tokens, and transformers are trained on them. Compared to a plain ResNet architecture, the proposed approach reduces the number of MACs (multiply-accumulate operations) by 6.9 times and increases top-1 accuracy by 4.53 points on the ImageNet classification task.



image



Motivation for the approach



The generally accepted approach to computer vision tasks is to represent images as 3D arrays (height, width, number of channels) and apply convolutions to them. This approach has several disadvantages:



  • Not all pixels are created equal. For example, in a classification task the object itself matters more to us than the background. Interestingly, at this point the authors do not mention that attention is already used in computer vision tasks;
  • Convolutions do not work well enough with pixels that are far apart. There are approaches with dilated convolutions and global average pooling, but they do not solve the underlying problem;
  • Convolutions are not efficient enough in very deep neural networks.


As a result, the authors propose the following: convert images into a kind of visual tokens and feed them to a transformer.



image



  • First, a regular backbone is used to extract feature maps
  • Next, the feature map is converted into visual tokens
  • The tokens are fed to transformers
  • The transformer output can be used for classification tasks
  • And if you combine the transformer output with the feature map, you can get predictions for segmentation tasks (a rough sketch of the whole pipeline is given after this list)
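
To make this pipeline concrete, here is a minimal sketch of how the pieces could be wired together. The names VisualTransformerNet and vt_block are hypothetical, and vt_block stands in for the tokenizer, transformer, and projector described below.

```python
import torch.nn as nn

class VisualTransformerNet(nn.Module):
    """Hypothetical sketch: backbone -> visual tokens -> transformer -> task heads."""
    def __init__(self, backbone, vt_block, token_channels, num_classes):
        super().__init__()
        self.backbone = backbone    # regular convolutional backbone producing feature maps
        self.vt_block = vt_block    # tokenizer + transformer + projector (see below)
        self.classifier = nn.Linear(token_channels, num_classes)

    def forward(self, image):
        feature_map = self.backbone(image)             # (B, C, H, W)
        tokens, new_feature_map = self.vt_block(feature_map)
        logits = self.classifier(tokens.mean(dim=1))   # pool tokens for classification
        # new_feature_map can instead be passed to a segmentation head
        return logits, new_feature_map
```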


Among related work, the authors do mention attention, but note that attention is usually applied to individual pixels, which greatly increases the computational complexity. They also discuss work on improving the efficiency of neural networks, but argue that in recent years such methods have been delivering smaller and smaller gains, so other approaches are needed.



Visual transformer



Now let's take a closer look at how the model works.



As mentioned above, the backbone retrieves feature maps, and they are passed to the visual transformer layers.



Each visual transformer consists of three parts: a tokenizer, a transformer, and a projector.



Tokenizer



image



The tokenizer extracts visual tokens. In essence, we take the feature map, reshape it to (H * W, C), and obtain the tokens from it.



image
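
In code, a minimal sketch of such a static (filter-based) tokenizer might look like this; the single learned weight matrix and the shapes are assumptions based on the description above.

```python
import torch
import torch.nn as nn

class StaticTokenizer(nn.Module):
    """Sketch of a filter-based tokenizer: pixels -> L visual tokens."""
    def __init__(self, channels, num_tokens):
        super().__init__()
        # one learned filter per token, applied to every pixel
        self.token_filters = nn.Linear(channels, num_tokens, bias=False)

    def forward(self, feature_map):
        b, c, h, w = feature_map.shape
        x = feature_map.flatten(2).transpose(1, 2)     # (B, H*W, C)
        attn = self.token_filters(x).softmax(dim=1)    # (B, H*W, L), softmax over pixels
        tokens = attn.transpose(1, 2) @ x              # (B, L, C): weighted sums of pixels
        return tokens, attn
```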



The visualization of the coefficients for the tokens looks like this:



image



Position encoding



As usual, transformers need not only tokens, but also information about their position.



image



First, we downsample, then multiply the result by learnable weights and concatenate it with the tokens. To adjust the number of channels, a 1D convolution can be added.
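
A rough sketch of this step, assuming that what gets downsampled is the pixel-to-token attention map from the tokenizer; downsampled_size and pos_channels are made-up values, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenPositionEncoding(nn.Module):
    """Sketch: attach coarse spatial information to each visual token."""
    def __init__(self, token_channels, downsampled_size=7, pos_channels=64):
        super().__init__()
        self.downsampled_size = downsampled_size
        # learned projection from the downsampled spatial map to positional features
        self.pos_weights = nn.Linear(downsampled_size * downsampled_size, pos_channels, bias=False)
        # 1D convolution (kernel size 1) to bring the channel count back to token_channels
        self.adjust = nn.Conv1d(token_channels + pos_channels, token_channels, kernel_size=1)

    def forward(self, tokens, attn, h, w):
        b, hw, l = attn.shape
        # reshape the pixel-to-token attention into a spatial map and downsample it
        attn_map = attn.transpose(1, 2).reshape(b * l, 1, h, w)
        attn_small = F.adaptive_avg_pool2d(attn_map, self.downsampled_size)
        attn_small = attn_small.reshape(b, l, -1)                   # (B, L, ds*ds)
        pos = self.pos_weights(attn_small)                          # (B, L, pos_channels)
        tokens = torch.cat([tokens, pos], dim=-1)                   # concatenate along channels
        tokens = self.adjust(tokens.transpose(1, 2)).transpose(1, 2)
        return tokens
```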



Transformer



Finally, the transformer itself.



image
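
As a minimal sketch, this can be a standard transformer encoder block applied to the L tokens; the number of heads and the MLP size below are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class TokenTransformerBlock(nn.Module):
    """Sketch of a standard transformer block applied to visual tokens."""
    def __init__(self, token_channels, num_heads=4, mlp_ratio=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(token_channels)
        self.attn = nn.MultiheadAttention(token_channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(token_channels)
        self.mlp = nn.Sequential(
            nn.Linear(token_channels, mlp_ratio * token_channels),
            nn.ReLU(),
            nn.Linear(mlp_ratio * token_channels, token_channels),
        )

    def forward(self, tokens):
        # self-attention between tokens (there are only L of them, so this is cheap)
        normed = self.norm1(tokens)
        attn_out, _ = self.attn(normed, normed, normed)
        tokens = tokens + attn_out
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens
```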



Combining visual tokens and feature map



This is done by the projector.



image



image
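
A rough sketch of such a projector, assuming it updates every pixel of the feature map through a cross-attention from pixels (queries) to tokens (keys and values); the weight shapes here are assumptions.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Sketch: project refined visual tokens back onto the feature map."""
    def __init__(self, feature_channels, token_channels):
        super().__init__()
        self.query = nn.Linear(feature_channels, token_channels, bias=False)  # from pixels
        self.key = nn.Linear(token_channels, token_channels, bias=False)      # from tokens
        self.value = nn.Linear(token_channels, feature_channels, bias=False)  # tokens -> pixel space

    def forward(self, feature_map, tokens):
        b, c, h, w = feature_map.shape
        x = feature_map.flatten(2).transpose(1, 2)                            # (B, H*W, C)
        attn = (self.query(x) @ self.key(tokens).transpose(1, 2)).softmax(dim=-1)  # (B, H*W, L)
        x = x + attn @ self.value(tokens)                                     # residual update of every pixel
        return x.transpose(1, 2).reshape(b, c, h, w)
```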



Dynamic tokenization



After the first layer of transformers, we can not only extract new visual tokens but also reuse those extracted at the previous steps. Learned weights are used to combine them:



image
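
A minimal sketch of such a recurrent tokenizer, where the filters used to extract new tokens are generated from the tokens of the previous layer; for simplicity, the token channels are assumed equal to the feature-map channels.

```python
import torch
import torch.nn as nn

class RecurrentTokenizer(nn.Module):
    """Sketch: extract new tokens, conditioning the token filters on previous tokens."""
    def __init__(self, channels):
        super().__init__()
        # learned map from previous tokens to the filters used for the new tokens
        self.token_to_filter = nn.Linear(channels, channels, bias=False)

    def forward(self, feature_map, prev_tokens):
        b, c, h, w = feature_map.shape
        x = feature_map.flatten(2).transpose(1, 2)            # (B, H*W, C)
        filters = self.token_to_filter(prev_tokens)           # (B, L, C)
        attn = (x @ filters.transpose(1, 2)).softmax(dim=1)   # (B, H*W, L), softmax over pixels
        tokens = attn.transpose(1, 2) @ x                     # (B, L, C)
        return tokens
```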



Using visual transformers to build computer vision models



Further, the authors describe how the model is applied to computer vision problems. Transformer blocks have three hyperparameters: the number of channels in the feature map C, the number of channels in the visual token Ct, and the number of visual tokens L.



If the number of channels turns out to be unsuitable when moving between the blocks of the model, 1D and 2D convolutions are used to obtain the required number of channels.

To speed up computation and reduce the size of the model, group convolutions are used; an illustration is given below.
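
As an illustration, channel adjustment and group convolutions might look like this in PyTorch; the concrete channel counts are placeholders, not the paper's configuration.

```python
import torch.nn as nn

# 1x1 ("pointwise") 2D convolution to change the channel count of a feature map
adjust_feature_channels = nn.Conv2d(in_channels=256, out_channels=512, kernel_size=1)

# kernel-size-1 1D convolution to change the channel count of a token sequence (B, C, L)
adjust_token_channels = nn.Conv1d(in_channels=512, out_channels=1024, kernel_size=1)

# grouped convolution: each group of channels is processed independently,
# which reduces both the parameter count and the number of MACs
grouped_conv = nn.Conv2d(512, 512, kernel_size=3, padding=1, groups=32)
```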

The authors include pseudocode for the blocks in the article; the full code is promised to be released in the future.



Image classification



We take ResNet and create visual-transformer-ResNets (VT-ResNet) based on it.

We keep stages 1 through 4, but replace the last stage with visual transformers.



The backbone output is a 14 x 14 feature map with 512 or 1024 channels, depending on the depth of the VT-ResNet. From it, 8 visual tokens with 1024 channels are created. The transformer output goes to the classification head.
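
A rough sketch of assembling such a VT-ResNet, assuming torchvision's resnet18 as the backbone and reusing the StaticTokenizer and TokenTransformerBlock sketched earlier; the channel counts here follow resnet18 (256 channels at the 14 x 14 stage) rather than the paper's 512/1024 configuration.

```python
import torch.nn as nn
from torchvision.models import resnet18

class VTResNetSketch(nn.Module):
    """Sketch: keep ResNet stages 1-4, replace the last stage with visual transformers."""
    def __init__(self, num_classes=1000, num_tokens=8, token_channels=256):
        super().__init__()
        backbone = resnet18(weights=None)
        # everything up to and including layer3 gives a 14x14 feature map for 224x224 input
        self.stem = nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
            backbone.layer1, backbone.layer2, backbone.layer3,
        )
        self.tokenizer = StaticTokenizer(channels=token_channels, num_tokens=num_tokens)
        self.transformer = TokenTransformerBlock(token_channels=token_channels)
        self.head = nn.Linear(token_channels, num_classes)

    def forward(self, image):
        feature_map = self.stem(image)             # (B, 256, 14, 14) for resnet18
        tokens, _ = self.tokenizer(feature_map)    # (B, 8, 256)
        tokens = self.transformer(tokens)
        return self.head(tokens.mean(dim=1))       # average the tokens, then classify
```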



image



Semantic segmentation



For this task, the panoptic feature pyramid network (FPN) is taken as the base model.



image



In FPN, convolutions operate on high-resolution feature maps, so the model is computationally heavy. The authors replace these operations with visual transformers, again with 8 tokens and 1024 channels.



Experiments



ImageNet classification



The models are trained for 400 epochs with RMSProp. Training starts with a learning rate of 0.01, which is increased to 0.16 over 5 warm-up epochs and then multiplied by 0.9875 after every epoch. Batch normalization and a batch size of 2048 are used, along with label smoothing, AutoAugment, stochastic depth with survival probability 0.9, dropout 0.2, and EMA with decay 0.99985.
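
For reference, the described learning-rate schedule (warm-up from 0.01 to 0.16 over 5 epochs, then multiplying by 0.9875 each epoch) can be written out directly; this is only an illustration of the schedule, not the authors' training code.

```python
def learning_rate(epoch, base_lr=0.01, peak_lr=0.16, warmup_epochs=5, decay=0.9875):
    """Learning rate at the start of a given epoch (0-indexed)."""
    if epoch < warmup_epochs:
        # linear warm-up from base_lr to peak_lr
        return base_lr + (peak_lr - base_lr) * epoch / warmup_epochs
    # exponential decay after warm-up
    return peak_lr * decay ** (epoch - warmup_epochs)

# learning_rate(0) == 0.01, learning_rate(5) == 0.16, learning_rate(6) == 0.158
```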



That is how many experiments they must have run to tune all of this...



The graph below shows that the approach achieves higher quality with fewer computations and a smaller model size.



image



image



Paper titles for the compared models:



  • ResNet + CBAM - Convolutional block attention module
  • ResNet + SE - Squeeze-and-excitation networks
  • LR-ResNet - Local relation networks for image recognition
  • StandAlone - Stand-alone self-attention in vision models
  • AA-ResNet - Attention augmented convolutional networks
  • SAN - Exploring self-attention for image recognition



Ablation study



To speed up the experiments, the authors took VT-ResNet-{18, 34} and trained for 90 epochs.



image



Using transformers instead of convolutions gives the biggest gain. Dynamic tokenization instead of static tokenization also gives a noticeable boost. Position encoding yields only a slight improvement.



Segmentation results



image



As you can see, the metric has grown only slightly, but the model uses 6.5 times fewer MACs.



Potential future of the approach



Experiments have shown that the proposed approach makes it possible to build models that are more efficient in terms of computational cost while achieving better quality. The proposed architecture works successfully on various computer vision tasks, and one can hope that its application will help improve systems that use computer vision: AR/VR, autonomous cars, and others.



The review was prepared by Andrey Lukyanenko, lead developer at MTS.


