TabNet: A Few Details

TabNet: a deep, high-performance neural network for tabular data

Introduction 

Deep neural networks (DNNs) have become one of the most attractive tools for building artificial intelligence systems, used for example in speech recognition, natural language processing, and computer vision [2-3]. This is largely due to the ability of a DNN to automatically extract the important, defining features and relationships from data. Neural network architectures (neocognitrons, convolutional networks, deep belief networks, etc.) and models and algorithms for training DNNs (autoencoders, Boltzmann machines, gated recurrent units, etc.) continue to be developed. DNNs remain difficult to train, mainly because of the vanishing gradient problem.





This article discusses TabNet, a new canonical DNN architecture for tabular data designed to mimic a decision tree. The goal is to inherit the advantages both of tree-based methods (interpretability, sparse feature selection) and of DNN-based methods (gradient-based, end-to-end learning). Specifically, TabNet addresses two key needs: high performance and interpretability. High performance alone is often not enough; to replace tree-based methods, a DNN must also be interpretable.





TabNet is a neural network built from fully connected layers with a sequential attention mechanism, which:





  • uses instance-wise sparse feature selection learned from the training data;





  • builds a sequential multi-step architecture in which each decision step contributes a part of the overall decision based on the features it selected;





  • improves learning capacity through non-linear transformations of the selected features;





  • mimics an ensemble by using higher-dimensional representations and more decision steps.









Each layer of the architecture (Fig. 1) is a decision step that contains a block of fully connected layers for transforming features, the Feature Transformer, and an attention mechanism for determining the importance of the original input features.





Figure 1. TabNet architecture
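To make the data flow of one decision step concrete before the individual blocks are described, here is a rough NumPy sketch: an attentive mask selects input features, a heavily simplified feature transformer processes them, and a Split block divides the result into a decision part d[i] and a part a[i] handed to the next step. All layer shapes and the single-FC blocks are illustrative assumptions, and softmax temporarily stands in for the sparsemax used by TabNet; this is not the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_d, n_a = 8, 4, 4                    # input features, decision width, attention width

W_att = rng.normal(size=(D, n_a))        # attentive transformer (FC part only)
W_ft  = rng.normal(size=(D, n_d + n_a))  # feature transformer (a single FC layer here)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decision_step(features, a_prev):
    """One schematic decision step: mask -> transform -> split."""
    mask = softmax(W_att @ a_prev)                # feature importances (sparsemax in TabNet)
    masked = mask * features                      # instance-wise feature selection
    transformed = masked @ W_ft                   # stand-in for the Feature Transformer
    d, a = transformed[:n_d], transformed[n_d:]   # Split block
    return d, a, mask

x = rng.normal(size=D)
d1, a1, m1 = decision_step(x, a_prev=np.zeros(n_a))
d2, a2, m2 = decision_step(x, a_prev=a1)                     # next step is fed a[i-1]
decision_embedding = np.maximum(d1, 0) + np.maximum(d2, 0)   # ReLU(d[i]) summed over steps
```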





1. The Feature Transformer

1.1. Batch normalization 

During training of a deep network, the distribution of the inputs to each layer changes as the parameters of the preceding layers change; this phenomenon is known as internal covariate shift.

Because every layer has to keep adapting to a shifting input distribution, training slows down: it requires lower learning rates and careful parameter initialization, and networks with saturating nonlinearities become especially hard to train.

To deal with this problem, the batch normalization technique was proposed in 2015 [4].

Batch normalization works as follows.





1. For a d-dimensional input x = (x^{(1)}, ..., x^{(d)}), each component x^{(k)} is normalized (whitened) independently:

\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}

2. Simply normalizing every input of a layer can change what the layer is able to represent. For example, normalizing the inputs of a sigmoid

\sigma(x) = \frac{1}{1 + \exp(-x)}

would constrain them to the almost linear part of the nonlinearity (roughly the interval [−1, 1]).

To preserve the representational power of the network, each normalized value is therefore scaled and shifted:

y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)},

where γ and β are trainable parameters.

3. In practice the statistics are computed not over the whole training set but over each mini-batch:

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2, \quad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y_i = \gamma \hat{x}_i + \beta,

where B = \{x_1^{(k)}, ..., x_m^{(k)}\} is the current mini-batch for the k-th neuron.

4. At inference time, fixed population statistics (moving averages of the mini-batch means and variances accumulated during training) are used in place of the per-batch statistics, so the normalization becomes a deterministic transformation of the input.
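The following NumPy sketch illustrates the training-time computation from steps 1-3 (mini-batch statistics, normalization, learned scale and shift). It is a simplified illustration; the inference-time moving averages from step 4 are only hinted at in a comment.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch X of shape (m, d).

    gamma, beta: trainable scale and shift, shape (d,).
    Returns y = gamma * x_hat + beta along with the batch statistics.
    """
    mu = X.mean(axis=0)                      # per-feature mini-batch mean
    var = X.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (X - mu) / np.sqrt(var + eps)    # whitened activations
    y = gamma * x_hat + beta                 # restore representational power
    return y, mu, var

# toy usage: one mini-batch of m=4 examples with d=3 features
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(4, 3))
gamma, beta = np.ones(3), np.zeros(3)
y, mu, var = batch_norm_train(X, gamma, beta)

# at inference, moving averages of mu/var collected during training
# would be used instead of the per-batch statistics.
```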





Batch normalization gives the following practical benefits:

  • it allows much higher learning rates to be used without the risk of divergence;

  • it makes training less sensitive to the initialization of the parameters;

  • it acts as a regularizer, in some cases reducing or even removing the need for Dropout;

  • it significantly reduces the number of training steps needed to reach a given accuracy.





1.2. GLU

In [5] the Gated Linear Unit (GLU) is used as a gating mechanism; it is simpler than the gating employed, for example, in LSTM networks.





GLU was originally introduced in the context of language modeling with convolutional networks.





The task of a language model is to predict the next word given the previous ones. For a sequence of words w_0, ..., w_N the model estimates P(w_i | h_i), where H = [h_0, ..., h_N] is a representation of the context. Recurrent models compute this context with a function f applied step by step, h_i = f(h_{i−1}, w_{i−1}), which processes the sequence strictly one position at a time and therefore cannot be parallelized over i.





In convolutional language models the function f is a convolution, H = f * w, so the representation of every position is computed from a fixed-size window of preceding words and the whole sequence can be processed in parallel. This is the approach taken by the gated convolutional networks considered in [5]: stacking convolutions yields a hierarchical representation of the context whose effective size grows with depth.





Fig. 2 shows the architecture. Words are represented by an embedding matrix D of size |V| × e, where |V| is the vocabulary size and e is the embedding dimension. The input sequence w_0, ..., w_N is mapped to E = [Dw_0, ..., Dw_N]. The hidden layers h_0, ..., h_L are then computed as





h_l(X) = (X * W + b) ⊗ σ(X * V + c),

where m and n are the numbers of input and output features, k is the kernel (patch) width, X ∈ R^{N×m} is the input of layer h_l (either the word embeddings or the outputs of the previous layer), W ∈ R^{k×m×n}, b ∈ R^n, V ∈ R^{k×m×n} and c ∈ R^n are learned parameters, σ is the sigmoid function, and ⊗ denotes element-wise multiplication.





Care is taken that h_i contains no information about future words: the convolutions are masked by shifting the input and padding the beginning of the sequence with k − 1 zeros, so that a kernel of width k sees only the current and the preceding words.









Figure 2. Gated convolutional network architecture for language modeling

The output of each layer is a linear projection X * W + b modulated elementwise by the gates σ(X * V + c). As in an LSTM, the gates control which information is passed on, but the linear path X * W + b contains no additional nonlinearity, so gradients flow through it without attenuation. This mechanism is called the gated linear unit (GLU). The full model applies the stacked layers to the embeddings E: H = h_L ∘ ... ∘ h_0(E).





Thus the GLU keeps the nonlinear gating needed for an expressive model while providing a linear path for the gradient, which mitigates the vanishing-gradient problem in deep stacks.
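A minimal NumPy sketch of the GLU computation, written over a dense layer rather than a convolution to keep it short; the layer sizes and names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu_layer(X, W, b, V, c):
    """Gated linear unit: (X W + b) gated elementwise by sigmoid(X V + c).

    X: (batch, m) inputs; W, V: (m, n) weights; b, c: (n,) biases.
    The linear path X W + b carries the signal; the sigmoid branch
    decides how much of each output unit to let through.
    """
    return (X @ W + b) * sigmoid(X @ V + c)

rng = np.random.default_rng(0)
m, n = 6, 4
X = rng.normal(size=(2, m))
W, V = rng.normal(size=(m, n)), rng.normal(size=(m, n))
b, c = np.zeros(n), np.zeros(n)
print(glu_layer(X, W, b, V, c).shape)   # (2, 4)
```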





3.3  LSTM 

LSTM (long short-term memory) is a special kind of recurrent neural network capable of learning long-term dependencies; its gating mechanism is the one that GLU simplifies [5].





LSTMs were designed specifically to avoid the long-term dependency problem: remembering information over long intervals is essentially their default behavior, not something they struggle to learn.





Any recurrent network has the form of a chain of repeating modules. In a standard RNN the repeating module has a very simple structure, for example a single tanh layer.





An LSTM has the same chain structure, but its repeating module is organized differently.





Instead of a single layer, the LSTM module contains four interacting layers.





The key component of the LSTM is the cell state, a kind of conveyor belt that runs through the whole chain with only minor linear interactions.





The LSTM can remove information from the cell state or add information to it; this is regulated by structures called gates. A gate is built from a sigmoid layer applied to h_{t−1} and x_t, which outputs, for each component of the cell state C, a number between 0 and 1: a value of 1 means "let everything through", a value of 0 means "let nothing through".

The first step of the LSTM is to decide which information to throw away from the cell state. This decision is made by the sigmoid forget gate, which looks at h_{t−1} and x_t and outputs a number between 0 and 1 for every element of C_{t−1}:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)





The next step is to decide which new information to store in the cell state. It has two parts: a sigmoid layer, the input gate, decides which values will be updated, and a tanh layer builds a vector of candidate values C̃_t that could be added to the state. These two parts are then combined to form the state update:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \quad \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)





Now the old cell state C_{t−1} is updated to the new state C_t. The old state is multiplied by f_t, forgetting what was decided to be forgotten, and then i_t ⊗ C̃_t is added, i.e. the new candidate values scaled by how much each component of the state should be updated:

C_t = f_t \otimes C_{t-1} + i_t \otimes \tilde{C}_t





Finally, the LSTM decides what to output. The output is based on the cell state, but in a filtered form: a sigmoid layer, the output gate, decides which parts of the cell state will be output, and the cell state itself is passed through tanh (to push its values into the range [−1, 1]) and multiplied by the output of this gate:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \quad h_t = o_t \otimes \tanh(C_t)





As a result, the hidden state h_t that is passed on (and used as the output at this step) contains only the information that the output gate selected from the cell state.
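The gate equations above translate directly into code. The NumPy sketch below runs one LSTM step; the concatenated weight layout is an illustrative assumption and is not tied to any particular library's parameterization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step for a single example.

    x_t: (d_in,), h_prev/C_prev: (d_h,).
    W: (4*d_h, d_in + d_h) stacked weights for the f, i, candidate and o
    branches; b: (4*d_h,) stacked biases.
    """
    d_h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0*d_h:1*d_h])            # forget gate
    i = sigmoid(z[1*d_h:2*d_h])            # input gate
    C_tilde = np.tanh(z[2*d_h:3*d_h])      # candidate cell values
    o = sigmoid(z[3*d_h:4*d_h])            # output gate
    C_t = f * C_prev + i * C_tilde         # update the cell state
    h_t = o * np.tanh(C_t)                 # filtered output
    return h_t, C_t

rng = np.random.default_rng(0)
d_in, d_h = 3, 5
W = rng.normal(scale=0.1, size=(4 * d_h, d_in + d_h))
b = np.zeros(4 * d_h)
h, C = np.zeros(d_h), np.zeros(d_h)
for t in range(4):                          # unroll over a toy sequence
    h, C = lstm_step(rng.normal(size=d_in), h, C, W, b)
```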





Using GLU in TabNet





 





Figure 3.3 Description of GLU Operation

3.4. The Split block

The output of the Feature Transformer is split into two parts. One part is used to form the decision output of the current step; the other is passed to the Attentive Transformer, which determines which input features will be used at the next step. During training, backpropagation adjusts the masks produced by the Attentive Transformer so that "useful" features receive larger weights while irrelevant ones are suppressed (masked out). As a result, each decision step works with its own sparse subset of features selected by the Attentive Transformer.





SPLIT 





The Split block (see Fig. 1) simply divides the vector produced by the Feature Transformer into two parts.





One part, d[i], contributes to the overall decision output of the network; the other, a[i], is fed to the Attentive Transformer of the next step.





Fig. 3 shows the structure of the transformation block. Each FC layer is followed by BN and a GLU nonlinearity, and the sub-blocks are connected by residual connections scaled by √0.5, which stabilizes training by keeping the variance roughly constant throughout the network. For all BN layers except the one applied to the raw input features, ghost batch normalization is used, with a virtual batch size B_V and momentum m_B; the input features are processed with ordinary BN. Part of the layers is shared across all decision steps and part is specific to each step, as shown in Fig. 3.
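As an illustration of the FC → BN → GLU sub-block with the √0.5-scaled residual connection described above, here is a schematic NumPy sketch; the shapes and the plain BN stand-in (instead of ghost BN) are simplifying assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fc_bn_glu(X, W, gamma, beta, eps=1e-5):
    """One feature-transformer sub-block: FC -> BN -> GLU.

    W maps d_in features to 2*d_out units; the GLU halves them back
    to d_out by gating one half with the sigmoid of the other.
    """
    Z = X @ W                                                        # FC (bias folded into BN)
    Z = gamma * (Z - Z.mean(0)) / np.sqrt(Z.var(0) + eps) + beta     # BN
    d_out = Z.shape[1] // 2
    return Z[:, :d_out] * sigmoid(Z[:, d_out:])                      # GLU

rng = np.random.default_rng(0)
B, d = 8, 6
X = rng.normal(size=(B, d))
W1 = rng.normal(scale=0.1, size=(d, 2 * d))
W2 = rng.normal(scale=0.1, size=(d, 2 * d))
g, b = np.ones(2 * d), np.zeros(2 * d)

h = fc_bn_glu(X, W1, g, b)
h = (h + fc_bn_glu(h, W2, g, b)) * np.sqrt(0.5)   # residual connection scaled by sqrt(0.5)
```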









For classification, the aggregated decision embedding is mapped through a final linear layer followed by softmax during training (argmax is used at inference).





Figure 3. SPLIT block

4. The attention mechanism

Attention allows a model to concentrate on the most relevant parts of its input. In the general scheme, a score is computed for every element, the scores are turned into weights (for example with Softmax), and the elements are combined according to these weights: the larger the weight, the more the corresponding element influences the result.





Consider the simplest attention mechanism.





Suppose there is a set of vectors h_t, t = 1, ..., m, each of dimension d, from which the most relevant ones must be picked out.





Figure 4. Attention layer

For this, each h_i is compared with a query vector of the same dimension d; in TabNet the output of the previous decision step plays this role.





The scalar scores s_i measure how well each vector h_i matches this "query".





The scores s are then normalized with softmax: e = softmax(s).





softmax is defined as:

\mathrm{softmax}(s)_i = \frac{\exp(s_i)}{\sum_{j}\exp(s_j)}





The output of the attention layer is then:

c = \sum_{i} e_i h_i





That is, the context vector c is the sum of the vectors h_i weighted by the coefficients e_i.
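A compact NumPy sketch of this weighted-sum attention; the dot-product scoring is an assumption here, since the article does not fix a particular scoring function.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attention(H, q):
    """H: (m, d) vectors h_i, q: (d,) query.

    Scores are dot products (one possible choice), the weights come from
    softmax, and the output is the weighted sum of the h_i.
    """
    s = H @ q                 # s_i = <h_i, q>
    e = softmax(s)            # attention weights, sum to 1
    return e @ H, e           # context vector c and the weights

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 3))
q = rng.normal(size=3)
c, e = attention(H, q)
```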





TabNet uses attention not over a sequence of hidden states but over the input features themselves: at every decision step it must decide which features to attend to. Instead of Softmax it uses Sparsemax. The reason is that Softmax always assigns non-zero weights to all elements, so every feature would keep contributing at least a little; Sparsemax, by contrast, can assign exactly zero to irrelevant features, turning the "soft" selection into an almost "hard" one and making the feature masks sparse and interpretable.





5.   SPARSEMAX 

Like softmax, sparsemax maps a vector of scores z to a probability distribution p, but it does so by Euclidean projection of z onto the probability simplex:

\mathrm{sparsemax}(z) = \arg\min_{p \in \Delta^{K-1}} \| p - z \|^2





The solution has a simple closed form, sparsemax(z)_i = [z_i − τ(z)]_+, where τ(z) is a threshold chosen so that the resulting vector p sums to one and S(z) = {i : z_i > τ(z)} is its support. Unlike softmax, which always assigns a non-zero probability to every component, sparsemax sets the probabilities of low-scoring components exactly to zero.





The difference between the two mappings is easiest to see in two dimensions, where softmax reduces to the logistic sigmoid while sparsemax becomes a piecewise-linear "hard" version of it:





\mathrm{sparsemax}([t, 0])_1 = 0 \text{ for } t \le -1, \quad \frac{t+1}{2} \text{ for } -1 < t < 1, \quad 1 \text{ for } t \ge 1,

as illustrated in Fig. 5.





Figure 5. Difference between softmax and sparsemax

Thus sparsemax, like softmax, returns a valid probability distribution, but its support is determined by the threshold:





\tau(z) = \frac{\sum_{j \in S(z)} z_j - 1}{|S(z)|}





For the TabNet feature-selection masks, where only a few features should be active at each step, this sparsity is exactly what is needed, which is why Sparsemax is used.
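A small NumPy sketch of the sparsemax projection described above (the sort-based evaluation of the closed form, vector input only):

```python
import numpy as np

def sparsemax(z):
    """Project the score vector z onto the probability simplex.

    Implements the closed form p_i = max(z_i - tau(z), 0), where the
    threshold tau(z) is found from the sorted scores.
    """
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum           # which components stay non-zero
    k_z = k[support][-1]                          # size of the support |S(z)|
    tau = (cumsum[support][-1] - 1.0) / k_z       # threshold tau(z)
    return np.maximum(z - tau, 0.0)

z = np.array([2.0, 1.1, 0.3, -1.0])
print(sparsemax(z))        # sparse: the trailing components are exactly 0
print(sparsemax(z).sum())  # 1.0
```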





Fig. 6 shows how this attention mechanism is built into TabNet.





Figure 6. Representation of the attention mechanism

6. Feature selection in TabNet

For tabular data it is important not only that the model is accurate but also that it is clear which features it relies on, and usually only a small subset of the features is truly informative for each sample. TabNet therefore performs soft feature selection: at every decision step a learnable mask decides which features this step will use, while the remaining features are suppressed. Different steps (and different samples) may select different features, so the capacity of the model is spent on the most relevant inputs.





The masks are constructed as follows.





The mask M[i] of step i has the same dimensionality as the feature matrix f, and feature selection is simply element-wise multiplication, M[i] · f. The mask (see Fig. 1) is produced by the Attentive Transformer from the output of the previous step, a[i − 1].





Sparsemax normalization [6] makes each row of the mask a sparse probability distribution over the features: the entries sum to one and most of them are exactly zero.





The mask is computed as

M[i] = \mathrm{sparsemax}(P[i-1] \cdot h_i(a[i-1])),

where h_i is a trainable function (an FC layer followed by BN, as in Fig. 4) and P[i] is the prior-scale term that reflects how much each feature has already been used at the previous steps:

P[i] = \prod_{j=1}^{i} (\gamma - M[j]),





where γ is a relaxation parameter: with γ = 1 each feature can be used at only one decision step, and as γ grows a feature may be used at several steps. P[0] is initialized with all ones, so at the first step all features are equally available; if some features are not to be used (for example, they are missing), the corresponding entries of P[0] can be set to zero. To make the selected feature sets sparser still, an entropy-like sparsity regularization term is added to the loss:

L_{sparse} = \sum_{i=1}^{N_{steps}} \sum_{b=1}^{B} \sum_{j=1}^{D} \frac{-M_{b,j}[i]}{N_{steps} \cdot B} \log\left(M_{b,j}[i] + \epsilon\right),





where ε is a small constant for numerical stability. This term enters the overall loss with a coefficient λ_sparse, which controls the desired degree of sparsity of the masks.
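Putting the mask, the prior scales and the sparsity penalty together, here is a schematic NumPy sketch of the attentive-transformer update across steps; the FC-only h_i (no BN), the single-example form and the chosen γ are simplifying assumptions.

```python
import numpy as np

def sparsemax(z):
    # projection of a score vector onto the probability simplex (see above)
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum
    tau = (cumsum[support][-1] - 1.0) / k[support][-1]
    return np.maximum(z - tau, 0.0)

def attentive_step(a_prev, P_prev, W_att, gamma=1.3, eps=1e-8):
    """One attentive-transformer update for a single example.

    a_prev: (n_a,) output of the previous step's Split block.
    P_prev: (D,) prior scales; W_att: (D, n_a) weights of h_i (FC only here).
    Returns the mask M, the updated prior scales and this step's entropy term.
    """
    M = sparsemax(P_prev * (W_att @ a_prev))        # M[i] = sparsemax(P[i-1] * h_i(a[i-1]))
    P = P_prev * (gamma - M)                        # P[i] = P[i-1] * (gamma - M[i])
    sparsity = np.sum(-M * np.log(M + eps))         # entropy-style penalty for this step
    return M, P, sparsity

rng = np.random.default_rng(0)
D, n_a, n_steps = 8, 4, 3
P = np.ones(D)                                      # P[0] = 1: all features available
a = rng.normal(size=n_a)
loss_sparse = 0.0
for i in range(n_steps):
    W = rng.normal(size=(D, n_a))
    M, P, s = attentive_step(a, P, W)
    loss_sparse += s / n_steps                      # averaged over steps (and over the batch in general)
    a = rng.normal(size=n_a)                        # stand-in for the next a[i]
```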





Conclusion

Tabular data are among the most common kinds of data in practical applications, and for a long time tree-based ensembles have been the methods of choice for them, above all because of their interpretability and efficiency on such data. The experiments reported in [5] suggest that a deep architecture designed specifically for tabular data can compete with these methods.





TabNet is such an architecture. It uses sequential attention to select, at each decision step, a sparse subset of the most significant features, which provides both high performance and interpretability, local as well as global.





Thus TabNet combines the strengths of tree-based methods and deep neural networks, which makes it an attractive choice for tabular problems.









  1. Sozykin A.V. An Overview of Methods for Deep Learning in Neural Networks // Bulletin of the South Ural State University. Series: Computational Mathematics and Software Engineering. 2017. Vol.6, No.3. Pp.28–59. DOI: 10.14529/cmse170303





  2. LeCun Y., Bengio Y., Hinton G. Deep Learning // Nature. 2015. Vol.521. Pp.436–444. DOI: 10.1038/nature14539. 





  3. Ravì D., Wong Ch., Deligianni F., et al. Deep Learning for Health Informatics // IEEE Journal of Biomedical and Health Informatics. 2017. Vol.21, No.1. Pp.4–21. DOI: 10.1109/JBHI.2016.2636665.





  4. Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift // Proceedings of The 32nd International Conference on Machine Learning (2015), pp.448–456.





  5. Sercan O. Arik, Tomas Pfister. TabNet: Attentive Interpretable Tabular Learning // ICLR 2020 Conference Blind Submission 25 Sept 2019 (modified: 24 Dec 2019). URL:https://drive.google.com/file/d/1oLQRgKygAEVRRmqCZTPwno7gyTq22wbb/view?usp=sharing 





  6. André F. T. Martins and Ramón Fernández Astudillo. 2016. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. arXiv:1602.02068.















