Now in the field of ML, one constantly hears about the incredible "successes" of transformers in various fields. But more and more articles appear that many of these successes are, to put it mildly, far-fetched (from the recent I remember an article about pre-training large CNNs in computer vision, a huge MLP grid , an article about deconstructing achievements in the field of transformers).
β , .
self-attention
: (i) (ii) (iii) ( ) (iv) ( 5).
self-attention ( Linformer ). , - self-attention , , "" . 500 ( Google).
β self-attention production-ready :
- ;
- ;
- , ;
- ;
- , ;
- ;
, , . ( ), sequence-to-sequence .
- Self-attention 2 8- ;
- self-attention 2 "", , ;
, , - , .
:
- ;
- ( 2 );
:
PyTorch - 1.3 Linear
LSTM
. ( ) :
quantized_model = torch.quantization.quantize_dynamic( model, {torch.nn.Linear}, dtype=torch.qint8 )
:
- sequence-to-sequence ( , , PyTorch ). , (
_q
); - 4 (
float32
=>int8
); - Intel ( AMD , 10% 70% ).
- (1 );
- ;
:
- ( ) GPU;
- , , downstream , ONNX;
- AMD ( , 10% 70%);
- self-attention PyTorch ( 3 4 !);
- , . ;
. , , Singular Value Decomposition ( PyTorch) , .
, monkey-patching , Linear
FactorizedLinear
.
:
- ;
- (70 , );
- , , GPU;
- SVD ;
:
- ;
- , , ;
- , ;
- , , ;
, , . .
FNet
. self-attention
. , , GPU ( CPU). 10%.
:
- PyTorch FFT2 ( );
- , "";
- , ;
- self-attention ;
:
- , , SVD, ;
- - GPU, , β ;
β PyTorch
fusion , relu. 1.9
"" inference mode. , inference mode , 14% . , , . , - , x2. , - .
:
- ;
- "" ;
:
- ;
- , ( ), ( );
2 | 2x | 2-3x | - |
4x | 2x CPU | CPU | |
2-4x | 2-3x GPU, CPU | , | |
attention FFT | 2x | 2x CPU, 7x GPU | |
- | 15-25% | , fusion |
, , :
, | ||
, | ||
2 | , | |
70 | ||
attention FFT | 10 | , |
. - , (i) , (ii) , .
The first includes freezing and quantization. In total, they give a nice reduction in model size (4x) and speed up in the region of 2-3x on the CPU.
The second includes factorization and FFT. They should be considered as some kind of additional optimization, and they are most likely mutually exclusive. Together with the first types of methods, you can get a total reduction in the size of the model by almost an order of magnitude and speedup is also almost an order of magnitude. If we also tweak the hyper-parameters of the model, then the "order" in principle does not seem unattainable.
To be honest, I donβt know how to accelerate by two orders of magnitude. Perhaps you know?