Rethinking Attention with Performers

Transformer-based models have achieved outstanding results in a wide variety of domains, including conversational AI, natural language processing, image processing, and even music. The core component of every Transformer architecture is the attention module, which computes similarity scores for all pairs of positions in the input sequence. However, it scales poorly with the length of the input sequence: it requires quadratic computation time to produce all similarity scores, as well as quadratic memory to construct the matrix that stores them.
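To make the quadratic cost concrete, here is a minimal NumPy sketch of standard softmax attention; the function name and shapes are our own illustrative choices, not taken from any particular codebase. The full L x L matrix of similarity scores has to be materialized.

```python
# Minimal sketch of standard softmax attention with quadratic time and memory cost.
import numpy as np

def softmax_attention(Q, K, V):
    """Q, K: (L, d) queries/keys; V: (L, d) values. Returns (L, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (L, L): quadratic time and memory
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax
    return A @ V

L, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
out = softmax_attention(Q, K, V)                   # stores a 1024 x 1024 attention matrix
```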



For applications that require long-range attention, several faster and more compact proxies have been proposed, such as memory caching techniques, but the far more common solution is sparse attention. Sparse attention reduces the computation time and memory requirements of the attention mechanism by computing only a limited number of similarity scores from a sequence rather than all possible pairs, yielding a sparse rather than a full matrix; a toy example is sketched below. These sparse entries can be proposed manually, found with optimization methods, learned, or even randomized, as demonstrated by techniques such as Sparse Transformers, Longformers, Routing Transformers, Reformers, and Big Bird. Since sparse matrices can also be represented by graphs and edges, sparsification methods are also motivated by the graph neural network literature, in particular by the attention mechanism outlined in Graph Attention Networks. Such sparsity-based architectures usually require additional layers to implicitly produce a full attention mechanism.
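As a toy illustration of the idea (the local-window pattern below is our own simplified example, not the scheme of any specific paper cited above), each token attends only to a window of w neighbors, so only on the order of L*w similarity scores are kept instead of L^2.

```python
# Illustrative sketch of one simple sparsity pattern: a local attention window.
import numpy as np

def local_window_mask(L, w):
    """Boolean (L, L) mask that keeps scores only within a +/- w window."""
    idx = np.arange(L)
    return np.abs(idx[:, None] - idx[None, :]) <= w

mask = local_window_mask(L=8, w=2)
print(mask.astype(int))   # banded matrix: the kept (sparse) attention entries
```

Real implementations never materialize the full mask; they compute only the scores inside the band.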



image12



Standard sparsification techniques. Left: example of a sparsity pattern in which tokens attend only to other nearby tokens. Right: in Graph Attention Networks, tokens attend only to their neighbors in the graph, which are assumed to be more relevant than other nodes. See "Efficient Transformers: A Survey" for a comprehensive categorization of such methods.



Unfortunately, sparse attention methods still suffer from a number of limitations. (1) They require efficient sparse-matrix multiplication operations, which are not available on every accelerator; (2) they usually do not come with rigorous theoretical guarantees on their representational power; (3) they are optimized primarily for Transformer models and generative pre-training; and (4) they usually stack more attention layers to compensate for the sparse representation, which makes them hard to combine with other pre-trained models and thus requires retraining and significant energy consumption. Moreover, sparse attention mechanisms are often still insufficient for the full range of problems to which regular attention is applied, such as Pointer Networks. Finally, some operations cannot be sparsified at all, for example the softmax operation, which normalizes the similarity scores in the attention mechanism and is heavily used in industrial-scale recommender systems.



To address these problems, we introduce the Performer, a Transformer architecture whose attention scales linearly, which makes training faster and lets the model handle longer sequences, as required, for example, by image datasets such as ImageNet64 and text datasets such as PG-19. The Performer uses an efficient (linear) framework of generalized attention, which admits a broad class of attention mechanisms based on different similarity measures (kernels). The framework is implemented by our new algorithm (Fast Attention Via Positive Orthogonal Random Features, FAVOR+), which provides a scalable, low-variance, and unbiased estimate of attention mechanisms that can be expressed through random feature map decompositions (in particular, regular softmax-attention). The method comes with strong accuracy guarantees while preserving linear space and time complexity, and it can also be applied to standalone softmax operations.





In the regular attention mechanism, the queries and keys, corresponding respectively to the rows and columns of a matrix, are multiplied together and passed through a softmax operation to form the attention matrix, which stores the similarity scores. With this approach, the query-key product cannot be decomposed back into its original query and key components once it has passed through the nonlinear softmax operation. However, the attention matrix itself can be decomposed into a product of random nonlinear functions of the original queries and keys, known as random features, which makes it possible to encode the similarity information in a more efficient way.



image8



Left: the standard attention matrix, which contains all similarity scores for every pair of entries and is formed by applying a softmax operation to the products of the queries and keys, denoted q and k. Right: the standard attention matrix can be approximated by lower-rank randomized matrices Q' and K', whose rows encode potentially randomized nonlinear functions of the original queries/keys. For regular softmax-attention this transformation is very compact and involves an exponential function together with random Gaussian projections.
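A minimal sketch of such a random-feature decomposition for the softmax kernel; the input scaling and shapes are illustrative assumptions, and FAVOR+ additionally orthogonalizes the rows of the projection matrix, which the sketch omits.

```python
# Positive random features: phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m).
# In expectation, phi(q) . phi(k) = exp(q . k), the unnormalized softmax similarity.
import numpy as np

def positive_random_features(X, W):
    """X: (L, d) queries or keys; W: (m, d) Gaussian projections -> (L, m) features."""
    m = W.shape[0]
    sq_norms = 0.5 * np.sum(X**2, axis=-1, keepdims=True)   # (L, 1)
    return np.exp(X @ W.T - sq_norms) / np.sqrt(m)          # strictly positive entries

rng = np.random.default_rng(0)
L, d, m = 256, 16, 4096
scale = 0.2                                   # small norms keep this toy example well-conditioned
Q = scale * rng.normal(size=(L, d))
K = scale * rng.normal(size=(L, d))
W = rng.normal(size=(m, d))                   # FAVOR+ also orthogonalizes these rows

Q_prime = positive_random_features(Q, W)
K_prime = positive_random_features(K, W)
A_hat = Q_prime @ K_prime.T                   # low-rank, unbiased estimate of exp(Q K^T)
A_exact = np.exp(Q @ K.T)
print(np.mean(np.abs(A_hat - A_exact) / A_exact))   # error shrinks as m grows
```

Because every feature is positive, the estimate of each similarity score is itself positive, which matters for the stability point made below.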



Regular softmax-attention can be viewed as a special case of this scheme, with the nonlinear functions defined by exponential functions and Gaussian projections. One can also reason in the opposite direction: first choose more general nonlinear functions, which implicitly define other kinds of similarity measures, or kernels, on the query-key product. We frame this as generalized attention, building on earlier work on kernel methods. Although closed-form formulas do not exist for most kernels, our mechanism can still be applied, because it does not rely on them.



To the best of our knowledge, we are the first to show that any attention matrix can be effectively approximated in downstream Transformer applications using random features. The key new mechanism that makes this possible is the use of positive random features, that is, positive-valued nonlinear functions of the original queries and keys, which prove to be crucial for avoiding instabilities during training and provide a more accurate approximation of regular softmax-attention.





The decomposition described above allows the implicit attention matrix to be stored with linear rather than quadratic memory. It also yields a linear-time attention mechanism. Whereas the original mechanism multiplies the stored attention matrix by the value input to obtain the final result, after decomposing the attention matrix one can rearrange the matrix multiplications so that the result of regular attention is approximated without ever explicitly constructing the quadratically sized attention matrix. This ultimately leads to FAVOR+.



image10



Left: the standard attention module, in which the final result is computed by multiplying the attention matrix A by the value tensor V. Right: by decoupling the matrices Q' and K' used in the low-rank decomposition of A and performing the matrix multiplications in the order indicated by the dashed boxes, we obtain a linear attention mechanism that never explicitly constructs A or its approximation.
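The reordering in the figure amounts to multiplying K' transposed by V first. A minimal sketch, under the assumption that Q' and K' are positive random features as above; the function and variable names are our own.

```python
# Linear attention via matrix associativity: compute Q' (K'^T V) instead of (Q' K'^T) V,
# never forming an L x L matrix. Normalization reuses K'^T applied to a vector of ones.
import numpy as np

def linear_attention(Q_prime, K_prime, V):
    """Q_prime, K_prime: (L, m) positive random features; V: (L, d) values."""
    KV = K_prime.T @ V                   # (m, d)  -- O(L m d)
    normalizer = K_prime.sum(axis=0)     # (m,)    -- equals K'^T @ ones
    out = Q_prime @ KV                   # (L, d)  -- O(L m d)
    norm = Q_prime @ normalizer          # (L,)
    return out / norm[:, None]

rng = np.random.default_rng(0)
L, m, d = 4096, 256, 64
Q_prime = np.exp(rng.normal(size=(L, m)))   # stand-ins for positive random features
K_prime = np.exp(rng.normal(size=(L, m)))
V = rng.normal(size=(L, d))
out = linear_attention(Q_prime, K_prime, V)  # no (L, L) matrix is ever built
```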



The analysis above applies to so-called bidirectional, or non-causal, attention, where there is no notion of past and future. For unidirectional (causal) attention, where tokens do not attend to tokens that appear later in the sequence, we slightly modify the approach and use prefix-sum computations, storing only running totals of the matrix computations instead of an explicit lower-triangular attention matrix.



image4



Left: standard unidirectional attention requires masking the attention matrix to obtain its lower-triangular part. Right: an unbiased approximation of the left-hand side can be obtained with a prefix-sum mechanism, in which prefix sums of the outer products of the key random-feature vectors and the value vectors are built on the fly and multiplied on the left by the query random-feature vector to produce a new row of the output matrix.
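A loop-form sketch of the prefix-sum mechanism, written for clarity rather than speed; the variable names are our own, and Q' and K' are again assumed to be positive random features.

```python
# Causal (unidirectional) linear attention: running sums of k'_j v_j^T and k'_j are
# maintained, so token i only sees positions j <= i and the lower-triangular attention
# matrix is never materialized.
import numpy as np

def causal_linear_attention(Q_prime, K_prime, V):
    """Q_prime, K_prime: (L, m); V: (L, d). Returns (L, d)."""
    L, m = Q_prime.shape
    d = V.shape[1]
    S = np.zeros((m, d))        # prefix sum of outer products k'_j v_j^T
    z = np.zeros(m)             # prefix sum of k'_j (for normalization)
    out = np.empty((L, d))
    for i in range(L):
        S += np.outer(K_prime[i], V[i])
        z += K_prime[i]
        out[i] = (Q_prime[i] @ S) / (Q_prime[i] @ z)
    return out

rng = np.random.default_rng(0)
L, m, d = 512, 128, 32
Qp, Kp = np.exp(rng.normal(size=(L, m))), np.exp(rng.normal(size=(L, m)))
V = rng.normal(size=(L, d))
out = causal_linear_attention(Qp, Kp, V)   # row i depends only on positions 0..i
```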





We first benchmark the space and time complexity of the Performer and show that the attention speed-ups and memory savings are empirically almost optimal, that is, very close to not using an attention mechanism in the model at all.



image7



Bidirectional timing for the regular Transformer model on a log-log plot of time (T) versus sequence length (L). Lines end at the limit of GPU memory. The black line (X) denotes the maximum possible memory compression and speed-up when a "dummy" attention block is used, which essentially bypasses attention calculations and shows the maximum possible efficiency of the model. The Performer comes close to this optimal performance in the attention component.



We also show that the Performer, thanks to its unbiased approximation of softmax attention, is backwards compatible with pretrained Transformer models after a small amount of fine-tuning, which can reduce energy costs by reusing existing pretrained weights instead of training from scratch.



image13



On the One Billion Word Benchmark (LM1B), we transferred the weights of the original pretrained Transformer into the Performer, which initially yields an accuracy of 0.07 (orange dotted line). After fine-tuning, however, the Performer recovers accuracy within a small fraction of the original number of gradient steps.



Example Application: Protein Modeling



Proteins are large molecules with a complex three-dimensional structure and specific functions that are essential for life. Like words, proteins are described as linear sequences in which each character is one of 20 amino acid building blocks. Applying Transformers to large unlabeled corpora of protein sequences (for example, UniRef) yields models that can be used to make accurate predictions about the folded, functional macromolecule. Performer-ReLU (which uses ReLU-based attention, an instance of generalized attention that differs from softmax) performs strongly at modeling protein sequence data, while Performer-Softmax matches the accuracy of the regular Transformer, consistent with our theoretical results.
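For illustration only: "ReLU-based attention" here means replacing the exponential feature map with a ReLU of random projections; the exact feature map used in the experiments is not spelled out in this post, so the sketch below is an assumed variant.

```python
# Assumed sketch of a ReLU-based generalized-attention feature map.
import numpy as np

def relu_random_features(X, W):
    """Element-wise ReLU of random projections; X: (L, d), W: (m, d) -> (L, m)."""
    return np.maximum(X @ W.T, 0.0) / np.sqrt(W.shape[0])

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 64))                               # (m, d) random projections
Q_prime = relu_random_features(rng.normal(size=(100, 64)), W)
K_prime = relu_random_features(rng.normal(size=(100, 64)), W)
# Q_prime and K_prime then feed the same linear-attention pipeline sketched earlier.
```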



image2



Performance at modeling protein sequences. Dashed lines show performance on the training set (Train), solid lines on the validation set (Validation); (U) denotes unidirectional models and (B) bidirectional ones. All runs use the 36-layer model parameters from ProGen (2019), each on a 16x16 TPU-v2. Batch sizes were maximized for each run within the available compute.



Below we visualize a protein Performer model trained with the ReLU-based generalized attention mechanism. Using the Performer to estimate similarities between amino acids recovers a structure similar to well-known substitution matrices obtained by analyzing evolutionary substitution patterns in carefully curated sequence alignments. More broadly, we find local and global attention patterns consistent with Transformer models trained on protein data. The Performer's dense attention approximation has the potential to capture global interactions across multiple protein sequences. As a proof of concept, we train models on long concatenated protein sequences, which overflows the memory of a regular Transformer but not of the Performer, thanks to its space efficiency.



image17-2



Left: the amino acid similarity matrix estimated from attention weights. The model recognizes highly similar pairs of amino acids, such as (D, E) and (F, Y), even though it only has access to protein sequences and no prior information about biochemistry. Right: attention matrices from 4 layers (rows) and 3 selected "heads" (columns) for the protein BPT1_BOVIN, one of the most intensively studied protein structures.



image5



For concatenated protein sequences of length 8192, a regular Transformer overloads accelerator memory. To fit it into TPU memory, its size (number of layers and embedding dimensions) had to be reduced, whereas the Performer trains at its full size.





Our work adds to the recent line of non-sparsity-based methods and kernel-based interpretations of Transformers. Our method is compatible with other techniques, such as reversible layers, and we have even integrated FAVOR into the Reformer code. We also provide links to the paper and to the Performer code. We believe that our research opens up a completely new way of thinking about attention, Transformer architectures, and even kernel methods.





  • Authors of the original article - Krzysztof Choromanski, Lucy Colwell
  • Translation -
  • Editing and layout - Sergey Shkarin


