A Silent Revolution and a New Wild West in Computer Vision

It might seem that the revolution in Computer Vision has already happened. In 2012, algorithms based on convolutional neural networks took off. By 2014 they had reached production, and by 2016 they were everywhere. But at the end of 2020 a new round began, and this time it took one year rather than four. Let's talk about Transformers in Computer Vision. This article gives an overview of the new work that has appeared over the last year. If you prefer, the article is also available as a video on YouTube.

Transformers are a type of neural network introduced in 2017. Initially, they were used for translation:

But, as it turned out, they worked as a universal model of language. And off we went. The famous GPT-3, in fact, is built on transformers.
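For readers who have not met the mechanism before: the core of a transformer is scaled dot-product attention. Each query is compared with every key, the scores are normalized with a softmax, and the output is a weighted sum of the values. Below is a minimal single-head sketch in plain Python. It assumes no learned projections, masking, or batching, and the function names are illustrative rather than taken from any library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Self-attention over two toy "tokens": queries, keys and values coincide.
tokens = [[1.0, 0.0], [0.0, 1.0]]
mixed = attention(tokens, tokens, tokens)
```

Each output row is a blend of all input tokens, weighted towards the ones most similar to the query. Real models add learned linear projections and several heads on top of this.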

ComputerVision?

. , . - , . . , . CV.

DETR

2020. . ? . , DETR (End-to-End Object Detection with Transformers), 2020 . , :

, ReInspect 2015 - , BackBone . - ReInspect Detr. .

, , DETR ( , ). .

, DETR ComputerVision. ? ? :

- , . Deformable DETR.
DETR . . iterdet. - ( - https://paperswithcode.com/sota/panoptic-segmentation-on-coco-panoptic ).

DETR Visual Transformer ( + ) . Feature map backbone:

Visual Transformer , . backbone .

VIT

. ViT:

2020 (). -. . - 16*16. “”, .
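ViT treats an image as a sequence of fixed-size patches, 16*16 pixels each, and every patch is flattened into a vector before it reaches the transformer. Here is a minimal sketch of that splitting step in plain Python. It assumes a single-channel image stored as nested lists, and `split_into_patches` is a hypothetical helper, not a library function.

```python
def split_into_patches(image, patch_size=16):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches.

    Patches come back in row-major order, i.e. the order in which a
    ViT-style model would read them as a token sequence.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical 224x224 image whose pixel value encodes its position.
image = [[row * 224 + col for col in range(224)] for row in range(224)]
patches = split_into_patches(image)
```

For a 224x224 input this gives 14 * 14 = 196 patches of 256 values each. In the actual model each flattened patch is then linearly projected and combined with a positional embedding, which this sketch leaves out.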

, , . ( state-of-art). 14 - .

. FaceBook - Deit. .

- https://paperswithcode.com/paper/going-deeper-with-image-transformers

- . , ~2-3 , . ResNet .

CLIP

. CLIP. . CLIP . , . , - :

, . . :

:

, - :

ResNet50. , 100 .

, /. CLIP . CLIP . . , :

Vision Transformers for Dense Prediction

, , - “Vision Transformers for Dense Prediction”, . Vit/Detr. , .

/, / . State-of-art , RealTime. @AlexeyAB ( Yolov4 ), . , , . - , :

---------------------------------------

. - :

1-2

- / . .

PoseFormer

Pose3D. , , :

3 . CherryLabs ( ) 3 , , . , , . - 3D, :

- . ( ). .

, . / .

TransPose

, . TransPose - :

( OpenPose)

. . , , :

SWIN

Intel. SWIN Microsoft , RealTime. VIT/Deit, :

, , - https://paperswithcode.com/paper/swin-transformer-hierarchical-vision

LOFTR

. . SIFT/SURF+RANSAK ( + ). SuperGlue- Graph Neural Network ComputerVision. SuperGlue . , LOFTR End-To-End:

, :

, , , . : (Video Transformer Network, ActionBert). MMAction.

. , . , - STARK:

, . . , , . , , . . BBOX + , ,

TransTrack

TransT

.

ReID

, . 20 ReID - .

:

. VIT (1,2):

(1,2):

- OCR . , - :

state-of-art . . - 2 . - .

, . , , :

ComputerVision. , , .

. . , - , 2 . , -

, . . - . / - https://t.me/CVML_team ( https://vk.com/cvml_team ).

, , youtube:
