Compressing transformers: simple, versatile and practical ways to make them compact and fast








These days in ML one constantly hears about the incredible "successes" of transformers in all kinds of fields. But more and more articles argue that many of these successes are, to put it mildly, overstated (recent examples that come to mind: an article on pre-training large CNNs for computer vision, a huge MLP-only network, and an article deconstructing the reported achievements of transformers).














Still, transformers are genuinely useful in practice, so for us the question is not whether to use them but how to make them compact and fast. The usual suspect is the self-attention mechanism, whose cost grows quadratically with sequence length.



In this article we will go over four simple and practical approaches: (i) quantization; (ii) factorization of the linear layers; (iii) replacing self-attention with an FFT (following FNet); (iv) graph freezing and operator fusion together with inference mode in recent PyTorch.







A lot of research goes into making self-attention itself more "efficient" (Linformer is a typical example). In practice, however, such modified self-attention rarely delivers what the papers promise, and the "efficiency" often evaporates on real workloads. By now there are on the order of 500 papers in this vein (Google alone has surveyed the area).







Our requirements for a production-ready way of speeding up self-attention, and transformers in general, boil down to what the title promises: the approach has to be simple, versatile and practical.

We needed this for our own models, so everything below was tested in a practical setting (not on academic benchmarks) on a fairly ordinary sequence-to-sequence transformer.









Before optimizing anything, it is worth looking at where the model's "weight" actually sits, both in parameters and in compute. In our case the picture is roughly this:

  • the self-attention blocks themselves (2 of them, with 8-head attention);
  • the linear layers around self-attention plus 2 small task-specific "heads";

In other words, most of the parameters and most of the compute live in ordinary linear layers rather than in attention as such, and this is exactly what the methods below exploit.
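A quick way to get such a breakdown is to group parameter counts by leaf-module type; a minimal sketch (the helper name is ours, any PyTorch model will do):

from collections import Counter

import torch.nn as nn

def param_breakdown(model: nn.Module) -> Counter:
    """Count parameters per leaf-module type (Linear, LayerNorm, Embedding, ...)."""
    counts = Counter()
    for module in model.modules():
        if next(module.children(), None) is None:  # leaf modules only
            counts[type(module).__name__] += sum(p.numel() for p in module.parameters())
    return counts

# usage: param_breakdown(my_transformer).most_common()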







Given this breakdown, the realistic options are:

  • compressing and speeding up the linear layers themselves;
  • simply making the model smaller (for example, halving its width gives roughly a 2x reduction);

And what we deliberately did not bet on:

  • exotic "efficient" attention variants, which, as noted above, rarely pay off in practice;
  • the ever-growing zoo of architectural tricks (conveniently collected, for example, in x-transformers), most of which are unproven on real tasks;




Dynamic quantization has been available in PyTorch since version 1.3 and covers Linear and LSTM layers. Applying it to an already trained model is literally one call:







quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
      
      





And that is basically it.
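To double-check what that one call actually did, it is handy to compare module types and on-disk sizes before and after; a minimal sketch (the toy model below is just a stand-in for a real transformer):

import os

import torch
import torch.nn as nn

def serialized_size_mb(module: nn.Module, path: str = "tmp_weights.pt") -> float:
    """Save the state_dict to disk and return the file size in megabytes."""
    torch.save(module.state_dict(), path)
    size_mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return size_mb

# toy stand-in for a real transformer block
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

print(serialized_size_mb(model), "MB (float32)")
print(serialized_size_mb(quantized_model), "MB (int8, roughly 4x smaller)")
print(quantized_model)  # the Linear layers are now their dynamically quantized counterparts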







Pros:

  • it works out of the box for our sequence-to-sequence models (everything needed is already in PyTorch; the quantized weights are simply stored with an _q suffix);
  • the model shrinks roughly 4x (float32 => int8);
  • the speedup is real on Intel CPUs (on AMD the results are noticeably worse, by 10% to 70% depending on the case);
  • it is literally 1 line of code;


Cons:

  • it does not (yet) work on GPU;
  • if the model has to run downstream outside of PyTorch, for example via ONNX, things get complicated;
  • on AMD processors the gains are smaller (by the same 10% to 70%);
  • quantizing self-attention itself in PyTorch is a separate story (it took another 3 or 4 releases!);




The next method is factorization of the linear layers: the weight matrix of each Linear layer is decomposed with Singular Value Decomposition (available out of the box in PyTorch) and approximated by the product of two much smaller matrices.







Technically this is most easily done via monkey-patching: every Linear module in the network is simply swapped for a FactorizedLinear module built from its weights.
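A minimal sketch of what such a replacement can look like (the FactorizedLinear implementation and the rank parameter below are illustrative, assuming plain nn.Linear layers; the exact code behind the article may differ):

import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """Approximate an existing nn.Linear with two smaller linear layers via truncated SVD."""

    def __init__(self, linear: nn.Linear, rank: int):
        super().__init__()
        W = linear.weight.data                              # (out_features, in_features)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]

        # y = x @ W.T + b  ~=  ((x @ (diag(S) @ Vh).T) @ U.T) + b
        self.first = nn.Linear(W.shape[1], rank, bias=False)
        self.second = nn.Linear(rank, W.shape[0], bias=linear.bias is not None)
        self.first.weight.data = torch.diag(S) @ Vh         # (rank, in_features)
        self.second.weight.data = U                          # (out_features, rank)
        if linear.bias is not None:
            self.second.bias.data = linear.bias.data

    def forward(self, x):
        return self.second(self.first(x))


def factorize_linears(model: nn.Module, rank: int) -> nn.Module:
    """Monkey-patch: recursively replace every (large enough) nn.Linear with FactorizedLinear."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear) and min(child.weight.shape) > rank:
            setattr(model, name, FactorizedLinear(child, rank))
        else:
            factorize_linears(child, rank)
    return model

The smaller the rank, the more you compress and the more quality you lose, which is why fine-tuning usually follows.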







Pros:

  • a substantial reduction in size (around 70% of the weights go away, depending on the rank);
  • works both on CPU and on GPU;
  • SVD is built into PyTorch, no extra dependencies;


Cons:

  • out of the box the quality drops, so the factorized model has to be fine-tuned;
  • the compression / quality trade-off (i.e. the rank of the decomposition) has to be picked by hand;
  • the actual speedup depends on layer shapes and hardware, so it has to be measured rather than assumed;




It turns out, however, that one can go further still.







The authors of FNet propose an even blunter trick: throw self-attention away entirely and replace it with a Fourier transform. We tried it, and it does noticeably speed things up, especially on GPU (on CPU the effect is smaller). The price is a drop in quality of roughly 10%.
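The core of the trick fits in a few lines; below is a sketch of an FNet-style mixing layer as a drop-in replacement for the self-attention sub-layer (the surrounding encoder block is a simplified stand-in, not the exact model from this article):

import torch
import torch.nn as nn

class FourierMixing(nn.Module):
    """FNet-style token mixing: a 2D FFT over (sequence, hidden) dims, keeping the real part.
    It has no trainable parameters and replaces the self-attention sub-layer."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden)
        return torch.fft.fft2(x, dim=(-2, -1)).real


class EncoderBlock(nn.Module):
    """Simplified encoder block with the attention sub-layer swapped for Fourier mixing."""

    def __init__(self, hidden: int = 512, ffn: int = 2048):
        super().__init__()
        self.mixing = FourierMixing()           # was: multi-head self-attention
        self.norm1 = nn.LayerNorm(hidden)
        self.ff = nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x):
        x = self.norm1(x + self.mixing(x))
        return self.norm2(x + self.ff(x))

Since the mixing layer is parameter-free, a model modified this way has to be trained (or at least heavily fine-tuned) with it, which is exactly where the quality cost mentioned above comes from.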







Pros:

  • PyTorch has a fast native FFT2 implementation (on both CPU and GPU);
  • the replacement has no parameters at all, so there is nothing extra to store or tune;
  • the attention part of the network becomes several times faster;


Cons:

  • as with SVD factorization, quality suffers and the model has to be (re)trained with the new layer;
  • the big wins are on GPU; on CPU the overall effect is modest;


The last batch of speedups comes from PyTorch itself: operator fusion (a Linear followed by relu, for instance, can be fused into a single kernel) and, since version 1.9, the new inference mode. Just switching to inference mode gave us about 14% extra speed. The effect varies a lot from operator to operator: in some places the difference reaches x2, in others it is barely measurable, so the only reliable way to know is to benchmark your own model.







Pros:

  • it requires no changes to the model and does not affect quality;
  • the speedup comes essentially "for free", a couple of lines of code;


Cons:

  • the gains are modest, tens of percent at best;
  • how much you get depends both on the model (which operators it uses) and on the hardware (CPU or GPU);




Method | Compression | Speedup | Notes
Halving the model's hyper-parameters | 2x | 2-3x | -
Dynamic quantization | 4x | ~2x | CPU only
Factorization of linear layers | 2-4x | 2-3x | GPU and CPU, needs fine-tuning
attention => FFT | 2x | 2x on CPU, up to 7x on GPU | -
Freezing + fusion | - | 15-25% | operator fusion, inference mode




And here is what the methods cost in terms of quality and additional training:







Method | Effect on quality
Dynamic quantization | essentially none
Freezing + fusion | none
Halving the hyper-parameters | a new model has to be trained
Factorization of linear layers | drops (to roughly 70%), fine-tuning required
attention => FFT | about 10% worse, requires retraining




To sum up, the methods fall into two groups: (i) those that can be applied to an already trained model essentially for free, and (ii) those that change the model and therefore require (re)training or fine-tuning.







The first group includes freezing and quantization. Together they give a nice reduction in model size (4x) and a speedup in the region of 2-3x on the CPU.







The second group includes factorization and FFT. They should be treated as optional extra optimizations, and they are most likely mutually exclusive. Combined with the first group they bring the total reduction in model size to almost an order of magnitude, with a speedup of roughly the same order. If you also tweak the model's hyper-parameters, a full order of magnitude no longer looks unattainable.







To be honest, I don't know how to get a speedup of two orders of magnitude. Perhaps you do?







