Training on tabular data. TabNet. Part 1

We present a translation of an interesting article about training neural networks on tabular data. The second part is here.


The article introduces TabNet, a new high-performance and interpretable canonical deep learning architecture for tabular data. TabNet uses sequential attention to choose which features to reason from at each decision step. This makes the model interpretable and the learning more efficient, because the learning capacity is spent on the most salient features. TabNet is shown to outperform other neural network and decision tree variants on a wide range of tabular datasets, and it yields interpretable feature attributions that give insight into the behavior of the overall model. Finally, for the first time to our knowledge, we demonstrate self-supervised learning for tabular data, which significantly improves performance when unlabeled data is abundant.

1. Introduction

Deep neural networks (DNNs) have been notably successful with images [21, 50], text [9, 34] and audio [1, 56]. For these data types, a major driver of progress is the availability of canonical architectures that efficiently encode the raw data into meaningful representations and deliver high performance on new datasets and related tasks with little extra effort. For example, in image understanding, variants of residual convolutional networks (in particular, ResNet [21]) provide reasonably good performance on new image datasets or related visual recognition problems (e.g., classification, taxonomy). The one data type for which a canonical DNN architecture has not yet succeeded is tabular data. Although it is the most common data type in AI implementations [8], deep learning for tabular data remains under-explored, and variants of ensemble decision trees still dominate most applications [28].

Why is this so? First, tree-based approaches have advantages that make them popular: (i) they are representationally efficient (and therefore often high-performing) for decision manifolds with approximately hyperplane boundaries, which are common in tabular data; (ii) they are highly interpretable in their basic form (for example, by tracking decision nodes), and there are effective post-hoc explanation methods for their ensemble form [36], which matters in many real-world applications (for example, in financial services, where trust in high-risk actions is critical); (iii) they are fast to train. Second, previously proposed DNN architectures are not well suited to tabular data: conventional DNNs based on convolutional layers or multilayer perceptrons (MLPs) are heavily overparametrized, and the lack of an appropriate inductive bias often prevents them from finding optimal solutions for tabular decision manifolds [17].

Why study deep learning for tabular data? One obvious reason is that, as in other domains, DNN-based architectures can be expected to improve performance, especially on large datasets [22]. In addition, unlike tree learning, which does not backpropagate an error signal to guide learning, DNNs enable gradient-descent-based end-to-end learning for tabular data, with many benefits demonstrated in other domains: (i) efficiently encoding multiple data types, such as images together with tabular data; (ii) reducing or removing the need for feature engineering, which is currently a key part of tree-based tabular learning; (iii) learning from streaming data: tree learning needs global statistics to select split points, and simple modifications such as [4] usually give lower accuracy than learning from the full dataset, whereas DNNs show greater potential for continual learning [44]; and (iv) end-to-end representation learning, which enables valuable new application scenarios, including data-efficient domain adaptation [17], generative modeling [46] and semi-supervised learning [11].

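We propose TabNet, a new canonical DNN architecture for tabular data. The main contributions are: (1) TabNet takes raw tabular data as input without any preprocessing and is trained with gradient-descent-based optimization, which allows flexible integration into end-to-end learning; (2) TabNet uses sequential attention to choose which features to reason from at each decision step, which gives interpretability and better learning because the learning capacity is spent on the most salient features (see Fig. 1); this feature selection is instance-wise, that is, it can differ for each input, and unlike other instance-wise feature selection methods such as [6] or [61], TabNet uses a single deep learning architecture for both feature selection and reasoning.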

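Figure 1. TabNet's sparse feature selection, illustrated for Adult Census Income prediction [14]. Sparse feature selection gives interpretability and better learning, because the capacity is used for the most salient features. TabNet uses several decision blocks, each of which processes a subset of the input features. The two decision blocks shown as examples process features related to occupation and to investments, respectively, in order to predict the income level.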

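(3) These design choices lead to two valuable properties: (a) TabNet outperforms or is on par with other tabular learning models on various classification and regression datasets from different domains; (b) TabNet provides two kinds of interpretability: local interpretability, which visualizes the importance of features and how they are combined, and global interpretability, which quantifies the contribution of each feature to the trained model.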

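Figure 2. Self-supervised tabular learning. Real-world tabular datasets have interdependent feature columns: for example, the education level can be guessed from the occupation, or the gender from the relationship. Unsupervised representation learning by masked self-supervised learning gives an improved encoder model for the supervised learning task.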

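(4) Finally, for the first time for tabular data, we show significant performance improvements from unsupervised pre-training that predicts masked features (see Fig. 2).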



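Feature selection: Feature selection broadly means choosing a subset of features based on their usefulness for prediction. Commonly used techniques, such as forward selection and LASSO regularization [20], assign feature importance based on the entire training data and are called global methods. Instance-wise feature selection picks features individually for each input; it is studied in [6] with an explainer model, and in [61] with an "actor-critic" framework. Unlike these, TabNet uses soft feature selection with controllable sparsity in end-to-end learning: a single model performs feature selection and output mapping (jointly), which gives better performance with compact representations.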

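Tree-based learning: Decision trees are commonly used for tabular data. Their main strength is efficient selection of the global features with the most statistical information gain [18]. To improve on standard decision trees, a common approach is ensembling (to reduce variance). Among ensemble methods, random forests [23] grow many trees on random subsets of data with randomly selected features. XGBoost [7] and LightGBM [30] are two recent ensemble decision tree approaches that dominate most recent data science competitions. Our experimental results on various datasets show that tree-based models can be outperformed when representation capacity is improved with deep learning while their feature selection property is kept.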

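Integration of DNNs into decision trees: Representing decision trees with DNN building blocks, as in [26], leads to redundant representations and inefficient learning. Soft (neural) decision trees [33, 58] use differentiable decision functions instead of non-differentiable axis-aligned splits. However, losing automatic feature selection often degrades performance. In [60], a soft binning function is proposed to simulate decision trees in DNNs, at the cost of enumerating all possible decisions. [31] proposes a DNN architecture that explicitly uses expressive feature combinations, but its learning relies on transferring knowledge from gradient-boosted decision trees. [53] proposes a DNN architecture that grows adaptively from "primitive blocks" (edges, routing functions and leaf nodes) while learning representations. TabNet differs from these methods in that it embeds soft feature selection with controllable sparsity through sequential attention.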

: - , [3, 35] . , .

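Self-supervised learning: Unsupervised representation learning improves supervised learning, especially when labeled data is scarce [47]. Recent work for text [13] and image [55] data has shown significant advances, driven by the choice of the unsupervised learning objective and by attention-based deep learning.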

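Figure 3. Illustration of decision-tree-like classification using conventional DNN blocks (left) and the corresponding decision manifold (right). Relevant features are selected with multiplicative sparse masks on the inputs. The selected features are linearly transformed, and after a bias addition (to represent the boundaries) ReLU performs region selection by zeroing out the other regions. Multiple regions are aggregated by addition. As C1 and C2 get larger, the decision boundary gets sharper after Softmax.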


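Decision trees are successful at learning from real-world tabular datasets. With a specific design, conventional DNN building blocks can implement a decision-tree-like output manifold (see Fig. 3). In such a design, individual feature selection is the key to obtaining decision boundaries in hyperplane form, which can be generalized to a linear combination of features whose coefficients determine the proportion of each feature. TabNet is built on this functionality. It outperforms decision trees while keeping their benefits, through a careful design that: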

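(i) uses sparse instance-wise feature selection, learned from data; (ii) builds a sequential multi-step architecture, where each step contributes a portion of the decision, based on the selected features; (iii) improves the learning capacity through nonlinear processing of the selected features; (iv) mimics ensembling through higher dimensions and more steps.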

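Figure 4. (a) The TabNet encoder, composed of a feature transformer, an attentive transformer and feature masking. A split block divides the processed representation so that it can be used by the attentive transformer of the next step as well as for the overall output. At each step, the feature selection mask gives interpretable information about how the model works, and the masks can be aggregated to obtain global feature importance. (b) The TabNet decoder, composed of a feature transformer block at each step. (c) An example feature transformer block: a 4-layer network, where 2 layers are shared across all decision steps and 2 are specific to the decision step. Each layer consists of a fully connected (FC) layer, batch normalization (BN) and a gated linear unit (GLU) nonlinearity. (d) An example attentive transformer block: a single-layer mapping is modulated by prior scale information, which aggregates how much each feature has been used before the current decision step. Normalization of the coefficients with sparsemax [37] gives a sparse selection of the salient features.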

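Fig. 4 shows the TabNet architecture for encoding tabular data. Raw numerical features are used as they are, and categorical features are mapped with trainable embeddings. No global feature normalization is applied, only batch normalization (BN). The same D-dimensional features

f \in \mathbb{R}^{B \times D}

are passed to each decision step, where B is the batch size. TabNet's encoding is based on sequential multi-step processing with N_{steps} decision steps.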

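The i-th step takes the processed information from the (i - 1)-th step, decides which features to use, and outputs the processed feature representation to be aggregated into the overall decision. The idea of top-down attention in sequential form is inspired by its use in processing visual and text data (e.g., [25]) and in reinforcement learning [40].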

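Feature selection: For soft selection of the salient features we use a learnable mask

M[i] \in \mathbb{R}^{B \times D}

Because only the most salient features are selected, the learning capacity of a decision step is not wasted on irrelevant ones, so the model becomes more parameter-efficient. The masking is multiplicative, M[i] · f. An attentive transformer (see Fig. 4) computes the masks from the processed features of the preceding step, a[i − 1]: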

M[i] = \mathrm{sparsemax}(P[i-1] \cdot h_i(a[i-1])) \quad (1)

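Sparsemax normalization [37] encourages sparsity by mapping the scores onto the probability simplex with a Euclidean projection, which matches the goal of sparse feature selection for explainability.

Note that the mask entries of each sample sum to 1:

\sum_{j=1}^{D} M_{b,j}[i] = 1

h_i is a trainable function, shown in Fig. 4 as an FC layer followed by BN. P[i] is the prior scale term, which records how much each feature has been used previously: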

P[i] = \prod_{j=1}^{i} (\gamma - M[j]) \quad (2)

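γ is a relaxation parameter: when γ = 1, a feature can be used at only one decision step, and as γ increases, a feature can be used more flexibly at several decision steps. P[0] is initialized as all ones,

\mathbf{1}^{B \times D}

that is, without any prior on the masked features. If some features are unused (as in self-supervised learning), the corresponding entries of P[0] are set to 0 to help the model learn. To further control the sparsity of the selected features, sparsity regularization in the form of entropy is used [19]: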

L_{sparse} = \sum_{i=1}^{N_{steps}} \sum_{b=1}^{B} \sum_{j=1}^{D} \frac{-M_{b,j}[i]}{N_{steps} \cdot B} \log(M_{b,j}[i] + \epsilon)

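Here ϵ is a small number for numerical stability. The sparsity regularization is added to the overall loss with a coefficient λ. Sparsity gives a favorable inductive bias for datasets where most features are redundant.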
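The mask computation in equation (1), the prior scale update in equation (2) and the entropy regularizer can be traced by hand on a single sample. The following is a minimal sketch in plain Python, not the authors' implementation: the scores stand in for the output of h_i and are hypothetical values, and the function names are chosen here for illustration.

```python
import math

def sparsemax(z):
    """Euclidean projection of a score vector z onto the probability simplex."""
    cumsum, tau = 0.0, 0.0
    for k, zk in enumerate(sorted(z, reverse=True), start=1):
        cumsum += zk
        if 1.0 + k * zk > cumsum:       # the k-th largest score is in the support
            tau = (cumsum - 1.0) / k    # threshold for a support of size k
    return [max(zi - tau, 0.0) for zi in z]

def attentive_mask(prior, scores):
    """Equation (1) for one sample: M[i] = sparsemax(P[i-1] * h_i(a[i-1]))."""
    return sparsemax([p * s for p, s in zip(prior, scores)])

def update_prior(prior, mask, gamma):
    """Equation (2), applied incrementally: P[i] = P[i-1] * (gamma - M[i])."""
    return [p * (gamma - m) for p, m in zip(prior, mask)]

def sparsity_loss(masks, eps=1e-15):
    """Entropy regularizer; masks[i][b][j] is the mask of step i, sample b, feature j."""
    n_steps, batch = len(masks), len(masks[0])
    total = sum(-m * math.log(m + eps)
                for step in masks for row in step for m in row)
    return total / (n_steps * batch)

# Hypothetical scores standing in for h_i(a[i-1]); one sample, three features.
scores = [1.0, 0.8, 0.1]
prior = [1.0, 1.0, 1.0]                  # P[0]: all ones
mask_1 = attentive_mask(prior, scores)   # the third feature gets weight exactly 0
prior = update_prior(prior, mask_1, gamma=1.0)
mask_2 = attentive_mask(prior, scores)   # features used in step 1 are down-weighted
loss = sparsity_loss([[mask_1], [mask_2]])
```

With these scores the first mask puts all its weight on the first two features. After the prior update with γ = 1, the second step shifts weight toward the feature that was not used before.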

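Feature processing: The filtered features are processed by a feature transformer (see Fig. 4) and then split into the decision step output and the information for the next step,

[d[i], a[i]] = f_i(M[i] \cdot f), where d[i] \in \mathbb{R}^{B \times N_d} and a[i] \in \mathbb{R}^{B \times N_a}.

For parameter-efficient and robust learning with high capacity, a feature transformer should contain layers that are shared across all decision steps (since the same features are input at every step), as well as layers that are specific to a decision step.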

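Fig. 4 shows the implementation as a concatenation of two shared layers and two decision-step-dependent layers. Each FC layer is followed by BN and a gated linear unit (GLU) nonlinearity [12], and the layers are joined by normalized residual connections. Normalization with √0.5 helps to stabilize learning, because it ensures that the variance does not change dramatically throughout the network [15]. For faster training, large batch sizes are used with BN. Therefore, except for the BN applied to the input features, the ghost BN form [24] is used, with a virtual batch size B_V and momentum m_B. For the input features, low-variance averaging is beneficial, so ghost BN is not used there. Finally, inspired by decision-tree-like aggregation as in Fig. 3, the overall decision embedding is constructed as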

d_{out} = \sum_{i=1}^{N_{steps}} \mathrm{ReLU}(d[i])


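A linear mapping

W_{final} d_{out}

then gives the output. For discrete outputs, softmax is additionally applied during training (and argmax during inference).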
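The pieces of the feature transformer and the output aggregation described above can be sketched in a few lines. This is a simplified illustration for a single sample, not the original implementation: batch normalization is left out, and all weights and decision vectors are hypothetical values chosen to make the arithmetic easy to follow.

```python
import math

def linear(x, weights, bias):
    """Fully connected layer: one row of `weights` per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def glu(x):
    """Gated linear unit: first half of x, gated by the sigmoid of the second half."""
    half = len(x) // 2
    return [a / (1.0 + math.exp(-g)) for a, g in zip(x[:half], x[half:])]

def fc_glu(x, weights, bias):
    """FC followed by GLU (BN is omitted in this sketch)."""
    return glu(linear(x, weights, bias))

def scaled_residual(x, y):
    """Residual connection normalized by sqrt(0.5)."""
    return [(a + b) * math.sqrt(0.5) for a, b in zip(x, y)]

def aggregate(decisions):
    """d_out = sum over decision steps of ReLU(d[i])."""
    return [sum(max(v, 0.0) for v in column) for column in zip(*decisions)]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical FC weights: 2 inputs -> 4 units, which GLU halves back to 2.
weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0, 0.0]
h1 = fc_glu([1.0, 2.0], weights, bias)                # gates are sigmoid(0) = 0.5
h2 = scaled_residual(h1, fc_glu(h1, weights, bias))   # second layer with residual

# Hypothetical decision outputs d[1], d[2] of two steps, then W_final * d_out.
d_out = aggregate([[1.0, -2.0], [0.5, 3.0]])
w_final = [[1.0, 0.0], [0.0, 1.0]]
logits = linear(d_out, w_final, [0.0, 0.0])
probs = softmax(logits)                               # used during training
predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax at inference
```

The ReLU inside the aggregation means that a step can only add to the decision embedding, never subtract from it.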


, , , .

