😳 🐲 🧜🏻 Compressing transformers: simple, versatile and practical ways to make them compact and fast 🤱🏻 📍 🔁

transformer_press

Now in the field of ML, one constantly hears about the incredible "successes" of transformers in various fields. But more and more articles appear that many of these successes are, to put it mildly, far-fetched (from the recent I remember an article about pre-training large CNNs in computer vision, a huge MLP grid , an article about deconstructing achievements in the field of transformers).

— , .

self-attention

: (i) (ii) (iii) ( ) (iv) ( 5).

self-attention ( Linformer ). , - self-attention , , "" . 500 ( Google).

— self-attention production-ready :

;
;
, ;
;
, ;
;

, , . ( ), sequence-to-sequence .

- . "" ( , ) , :

Self-attention 2 8- ;
self-attention 2 "", , ;

, , - , .

;
( 2 );

. , "" ;
, ( x-transformers

);

PyTorch - 1.3 Linear

LSTM

. ( ) :

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

sequence-to-sequence ( , , PyTorch ). , ( _q

);
4 (float32

=> int8

);
Intel ( AMD , 10% 70% ).
(1 );
;

( ) GPU;
, , downstream , ONNX;
AMD ( , 10% 70%);
self-attention PyTorch ( 3 4 !);
, . ;

. , , Singular Value Decomposition ( PyTorch) , .

, monkey-patching , Linear

FactorizedLinear

.

;
(70 , );
, , GPU;
SVD ;

;
, , ;
, ;
, , ;

, , . .

FNet

. self-attention

. , , GPU ( CPU). 10%.

PyTorch FFT2 ( );
, "";
, ;
self-attention ;

, , SVD, ;
- GPU, , — ;

— PyTorch

fusion , relu. 1.9

"" inference mode. , inference mode , 14% . , , . , - , x2. , - .

;
"" ;

;
, ( ), ( );


2	2x	2-3x	-
	4x	2x CPU	CPU
	2-4x	2-3x GPU, CPU	,
attention FFT	2x	2x CPU, 7x GPU
	-	15-25%	, fusion

, , :


		,
	,
2	,
	70
attention FFT	10	,

. - , (i) , (ii) , .

The first includes freezing and quantization. In total, they give a nice reduction in model size (4x) and speed up in the region of 2-3x on the CPU.

The second includes factorization and FFT. They should be considered as some kind of additional optimization, and they are most likely mutually exclusive. Together with the first types of methods, you can get a total reduction in the size of the model by almost an order of magnitude and speedup is also almost an order of magnitude. If we also tweak the hyper-parameters of the model, then the "order" in principle does not seem unattainable.

To be honest, I don’t know how to accelerate by two orders of magnitude. Perhaps you know?

Compressing transformers: simple, versatile and practical ways to make them compact and fast

FNet

— PyTorch

More articles: