DeepLIFT: Learning Important Features Through Propagating Activation Differences

Abstract

The perceived black-box nature of neural networks is an obstacle to their adoption in applications where interpretability is essential. Here we present DeepLIFT (Deep Learning Important FeaTures), a method for decomposing the output prediction of a neural network on a specific input by backpropagating the contributions of all neurons (nodes) of the network to every feature of the input. DeepLIFT compares the activation of each neuron to its "reference activation" and assigns contribution scores according to the difference. By considering positive and negative contributions separately, DeepLIFT can also reveal dependencies that other approaches miss. The scores can be computed efficiently in a single backward pass. We apply DeepLIFT to models trained on MNIST and on simulated genomic data, and show significant advantages over gradient-based methods.





Video tutorial: http://goo.gl/qKb7pL





ICML slides: bit.ly/deeplifticmlslides





ICML talk: https://vimeo.com/238275076





Code: http://goo.gl/RM8jvH





1. Introduction 

As neural networks become increasingly popular, their "black box" reputation is a barrier to adoption in applications where interpretability is paramount. Here we present DeepLIFT (Deep Learning Important FeaTures), a new algorithm for assigning importance scores to the inputs of a network for a given output. Our approach is distinctive in two respects. First, it frames the question of importance in terms of differences from a "reference" state, where the reference is chosen according to the problem at hand. In contrast to most gradient-based methods, using a difference-from-reference allows DeepLIFT to propagate an importance signal even in situations where the gradient is zero, and avoids artifacts caused by discontinuities in the gradient. Second, by optionally giving separate consideration to the effects of positive and negative contributions at nonlinearities, DeepLIFT can reveal dependencies missed by other approaches. Because DeepLIFT scores are computed with a backpropagation-like algorithm, they can be obtained efficiently in a single backward pass after a prediction has been made.





2. Previous work

This section reviews existing approaches for assigning importance scores to the inputs of a network for a given task and input example.





2.1. Perturbation-based approaches and gradient-based approaches





Perturbation-based approaches perturb individual inputs or groups of inputs and observe the effect on the output of the network. Zeiler & Fergus (Zeiler & Fergus, 2013 [12]) occluded different segments of an input image and visualized the change in the activations of later layers. "In-silico mutagenesis" (Zhou & Troyanskaya, 2015 [13]) evaluates the effect of substituting each base of a DNA sequence on the output of a genomic model. Zintgraf et al. (Zintgraf et al., 2017 [14]) proposed a strategy for analyzing the change in a prediction after marginalizing over each input patch. Such methods are computationally inefficient, since every perturbation requires a separate forward pass through the network, and they can underestimate the importance of features whose contribution to the output has saturated (Fig. 1).









Figure 1. Perturbation-based approaches and gradient-based approaches fail to model saturation.

Illustrated is a simple network that saturates in its inputs. At the point where i1 = 1 and i2 = 1, perturbing either i1 or i2 to 0 produces no change in the output, and the gradient of the output with respect to the inputs is also zero whenever i1 + i2 > 1.





2.2. Backpropagation-based approaches





Unlike perturbation methods, backpropagation-based approaches propagate an importance signal from an output neuron backwards through the layers to the input in a single pass, which makes them far more efficient. DeepLIFT is one such approach.





2.2.1. Gradients, deconvolutional networks and Guided Backpropagation





Simonyan et al. (Simonyan et al., 2013 [9]) proposed using the gradient of the output with respect to the input pixels to compute a "saliency map" of an image. The authors showed that this is similar to deconvolutional networks (Zeiler & Fergus, 2013 [12]) except for the handling of the nonlinearity at rectified linear units (ReLUs). When backpropagating importance with gradients, the signal coming into a ReLU during the backward pass is zeroed out if the input to the ReLU during the forward pass was negative. In deconvolutional networks, by contrast, the signal coming into a ReLU during the backward pass is zeroed out if and only if that signal itself is negative, regardless of the sign of the forward-pass input to the ReLU. Springenberg et al. (Springenberg et al., 2014 [10]) combined the two into Guided Backpropagation, which zeroes out the signal at a ReLU if either the forward-pass input to the ReLU was negative or the backward-pass signal is negative. Because negative gradients are discarded, Guided Backpropagation and deconvolutional networks can fail to highlight inputs that contribute negatively to the output. Moreover, none of these three approaches solve the saturation problem of Fig. 1: the gradient of y with respect to h is negative (so Guided Backpropagation and deconvolutional networks assign zero importance), and the gradient of h with respect to both i1 and i2 is zero when i1 + i2 > 1 (so plain gradients and Guided Backpropagation are also zero). Discontinuities in the gradient can additionally cause undesirable artifacts (Fig. 2).





2.2.2. Layer-wise Relevance Propagation and gradient × input





Bach et al. (Bach et al., 2015 [1]) proposed an approach for propagating importance scores called Layer-wise Relevance Propagation (LRP). Shrikumar et al. and Kindermans et al. (Shrikumar et al., 2016 [8]; Kindermans et al., 2016 [4]) showed that, absent modifications for numerical stability, the original LRP rules are equivalent, up to a scaling factor, to an elementwise product between the saliency maps of Simonyan et al. and the input (that is, gradient × input). In our experiments we therefore compare DeepLIFT to gradient × input, which is easily implemented on a GPU, whereas the available LRP implementation lacks GPU support; we expect the results to be very similar.





Although gradient × input is often preferable to plain gradients because it uses the sign and magnitude of the input, it still does not address the saturation problem of Fig. 1 or the thresholding artifact of Fig. 2.





2.2.3. Integrated gradients





Instead of computing the gradient only at the current value of the input, one can integrate the gradients as the input is scaled up from a starting value (for example, all zeros) to its actual value (Sundararajan et al., 2016 [11]). This addresses the saturation and thresholding problems of Fig. 1 and Fig. 2, but obtaining high-quality numerical integrals adds computational overhead, and the approach can still give misleading results (see Section 3.4.3).
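To make the integration idea concrete, here is a minimal numpy sketch of the integrated-gradients approximation for a toy differentiable model; the two-feature sigmoid model, its weights and the grad_fn below are made up for illustration and are not taken from the paper:

import numpy as np

def integrated_gradients(x, x_ref, grad_fn, n_steps=50):
    # Approximate attribution: (x - x_ref) * average gradient along the
    # straight-line path from the reference x_ref to the input x.
    alphas = (np.arange(n_steps) + 0.5) / n_steps            # midpoints of n_steps intervals
    path = x_ref + alphas[:, None] * (x - x_ref)             # points along the path
    avg_grad = np.mean([grad_fn(p) for p in path], axis=0)   # numerically averaged gradient
    return (x - x_ref) * avg_grad                            # per-feature attribution

# Toy model: y = sigmoid(3*x1 + 1*x2), with an analytic gradient.
w = np.array([3.0, 1.0])
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
grad_fn = lambda p: sigmoid(w @ p) * (1.0 - sigmoid(w @ p)) * w

x, x_ref = np.array([2.0, 2.0]), np.zeros(2)
attr = integrated_gradients(x, x_ref, grad_fn)
print(attr, attr.sum(), sigmoid(w @ x) - sigmoid(w @ x_ref))
# The attributions approximately sum to the change in output from the reference.

Note the extra passes along the path: this is the computational overhead mentioned above.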





2.3. Grad-CAM and Guided Grad-CAM





Grad-CAM (Selvaraju et al., 2016 [7]) computes a coarse-grained map of feature importance by associating class-specific gradients with the feature maps of the final convolutional layer; for fine-grained importance, the authors multiply this map elementwise with the output of Guided Backpropagation (Guided Grad-CAM). It is not obvious how to apply the approach to non-convolutional architectures, and because of the use of Guided Backpropagation the fine-grained maps may inherit its limitations. We therefore did not compare against this method.





3. The DeepLIFT method

3.1. The DeepLIFT philosophy





DeepLIFT explains the difference of the output from some "reference" output in terms of differences of the inputs from their "reference" inputs. The "reference" input represents a default or "neutral" input chosen according to the problem at hand (see Section 3.3). Formally, let t be a target output neuron of interest and let x1, x2, ..., xn be neurons in some intermediate layer (or set of layers) that are necessary and sufficient to compute t. Let t0 be the reference activation of t. We define the difference-from-reference ∆t = t − t0. DeepLIFT assigns contribution scores C_{∆x_i ∆t} to the ∆x_i such that

\sum_{i=1}^{n} C_{\Delta x_i \Delta t} = \Delta t \qquad (1)

We call Eq. 1 the summation-to-delta property. C_{∆x_i ∆t} can be thought of as the amount of the difference-from-reference in t that is attributed to, or "blamed" on, the difference-from-reference of x_i.





C_{\Delta x_i \Delta t} \;\text{can be nonzero even when}\; \frac{\partial t}{\partial x_i} \;\text{is zero.}

This allows DeepLIFT to address a fundamental limitation of gradients: as illustrated in Fig. 1, a neuron can carry meaningful information even in the regime where its gradient is zero. Another drawback of gradients, addressed by DeepLIFT and illustrated in Fig. 2, is that discontinuities in the gradient (for example at the bias threshold of a ReLU) cause sudden jumps in the importance score over infinitesimal changes in the input, whereas the difference-from-reference changes continuously.





Figure 2. Discontinuous gradients can give misleading importance scores.

Shown is a single rectified linear unit with a bias of -10. Both the gradient and gradient × input have a discontinuity at x = 10: at x = 10 + ε, gradient × input assigns a contribution of 10 + ε to x and -10 to the bias term (ε is a small positive number). When x < 10, the contributions of x and the bias term are both 0. By contrast, the difference-from-reference grows continuously with x.





3.2. Multipliers and the chain rule





3.2.1. Definition of multipliers





For a given input neuron x with difference-from-reference ∆x, and a target neuron t with difference-from-reference ∆t whose contribution we wish to compute, we define the multiplier m_{∆x∆t} as:

m_{\Delta x \Delta t} = \frac{C_{\Delta x \Delta t}}{\Delta x} \qquad (2)





In other words, the multiplier m_{∆x∆t} is the contribution of ∆x to ∆t divided by ∆x. Note the close analogy to a partial derivative: ∂t/∂x is the infinitesimal change in t caused by an infinitesimal change in x, divided by that infinitesimal change in x. The multiplier is similar in spirit, but defined over finite differences rather than infinitesimal ones.





3.2.2. The chain rule for multipliers





Assume we have an input layer with neurons x1, ..., xn, a hidden layer with neurons y1, ..., yn, and a target output neuron t.





Given the multipliers m_{∆x_i∆y_j} and m_{∆y_j∆t}, the following definition of m_{∆x_i∆t} is consistent with the summation-to-delta property of Eq. 1 (see Appendix A for the proof):

m_{\Delta x_i \Delta t} = \sum_{j} m_{\Delta x_i \Delta y_j}\, m_{\Delta y_j \Delta t} \qquad (3)





We call Eq. 3 the chain rule for multipliers. Given the multipliers of each neuron with respect to its immediate successors, the chain rule lets us compute the multiplier of any neuron with respect to a target neuron efficiently by backpropagation, analogous to how the chain rule for partial derivatives lets us compute gradients of the output with respect to the input.
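For a purely linear toy network the multipliers are exactly the weights, so the chain rule reduces to a matrix product and summation-to-delta can be checked in a few lines. This is a minimal numpy sketch for illustration, not the released DeepLIFT code; the weights and the reference are arbitrary:

import numpy as np

# Toy two-layer linear network: y = W1 @ x + b1, t = w2 @ y + b2
W1 = np.array([[1.0, -2.0, 0.5],
               [0.3,  0.8, -1.0]])
b1 = np.array([0.1, -0.2])
w2 = np.array([2.0, -1.5])
b2 = 0.3
forward = lambda x: w2 @ (W1 @ x + b1) + b2

x     = np.array([1.0, 2.0, -1.0])   # actual input
x_ref = np.array([0.0, 0.5,  0.0])   # reference input (arbitrary choice)
delta_x = x - x_ref

# For linear layers the multiplier of an input to an output is just the weight
# (the Linear rule of Section 3.5.1), so the chain rule is a matrix product.
m_x_to_y = W1.T                      # m_{dx_i, dy_j}
m_y_to_t = w2                        # m_{dy_j, dt}
m_x_to_t = m_x_to_y @ m_y_to_t       # Eq. 3: sum_j m_{dx_i, dy_j} * m_{dy_j, dt}

contributions = m_x_to_t * delta_x
print(contributions, contributions.sum(), forward(x) - forward(x_ref))
# The contributions sum exactly to delta_t (summation-to-delta).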





3.3. Defining the reference





When formulating the DeepLIFT rules of Section 3.5, we assume that the reference of a neuron is its activation on the reference input. Formally, let y be a neuron with inputs x1, x2, ..., such that y = f(x1, x2, ...).





Given the reference activations x^0_1, x^0_2, ... of the inputs, the reference activation y^0 of the output is

y^0 = f(x^0_1, x^0_2, \ldots)





In other words, the references of all neurons can be found by choosing a reference input and propagating activations through the network.





The choice of reference input is critical for obtaining meaningful results from DeepLIFT. In practice it relies on domain-specific knowledge, and in some cases it may be best to compute DeepLIFT scores against several different references. As a guiding principle, we can ask ourselves: "What am I interested in measuring differences against?". For MNIST we use a reference of all zeros, since this is the background of the images. For the binary classification tasks on DNA sequences (strings over the alphabet {A,C,G,T}) we obtained sensible results using a reference containing the expected background frequencies of A, C, G and T (Fig. 5); results with references generated by shuffling the original sequences are shown in Appendix J.
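To make these choices concrete, the sketch below constructs both references: an all-zero image for MNIST and, for one-hot encoded DNA, a sequence filled with the background base frequencies. This is a simplified illustration of the setup described in the text, with an assumed 28x28x1 image shape and 200-bp sequence length:

import numpy as np

# MNIST: the reference is simply a blank (all-zero) image.
mnist_reference = np.zeros((28, 28, 1), dtype=np.float32)

# DNA: sequences are one-hot encoded over {A, C, G, T}; the reference at every
# position holds the expected background frequency of each base.
seq_len = 200
background_freqs = np.array([0.3, 0.2, 0.2, 0.3], dtype=np.float32)  # A, C, G, T
dna_reference = np.tile(background_freqs, (seq_len, 1))              # shape (200, 4)

def one_hot_encode(seq):
    # One-hot encode a DNA string into a (len(seq), 4) array.
    mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        out[i, mapping[base]] = 1.0
    return out

# Difference-from-reference for an example sequence:
delta_x = one_hot_encode('ACGT' * 50) - dna_reference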





Note that gradient × input implicitly uses a reference of all zeros (it is equivalent to a first-order Taylor approximation of gradient × ∆input, where ∆ is measured with respect to an all-zero input). Integrated gradients (Section 2.2.3) likewise require the user to specify a starting point for the integral, which is conceptually similar to specifying a reference for DeepLIFT. Guided Backpropagation and plain gradients use no reference at all, which we regard as a limitation: these methods describe only the local behaviour of the output at the specific input value, without considering how the output behaves over a range of inputs.





3.4. Separating positive and negative contributions





As we will see in Section 3.5.3, in some situations it is essential to treat positive and negative contributions differently. To do this, for every neuron y we introduce ∆y+ and ∆y−, the positive and negative components of ∆y, such that:

\Delta y = \Delta y^{+} + \Delta y^{-}
C_{\Delta y \Delta t} = C_{\Delta y^{+} \Delta t} + C_{\Delta y^{-} \Delta t}









∆y+ and ∆y− are found by writing ∆y as a sum of terms involving the ∆x_i of the inputs and grouping the positive and negative terms together. Under the RevealCancel rule (Section 3.5.3), the positive and negative components may have different multipliers with respect to the target t, namely m_{∆y+∆t} and m_{∆y−∆t}. Under the Linear and Rescale rules (Sections 3.5.1 and 3.5.2) they are equal: m_{∆y∆t} = m_{∆y+∆t} = m_{∆y−∆t}.





3.5. Rules for assigning contribution scores





We now describe the rules for assigning contribution scores from each neuron to its immediate inputs. Combined with the chain rule for multipliers (Section 3.2), these rules can be used to find the contributions of any input (not just the immediate inputs) to a target output via backpropagation.





3.5.1. The Linear rule





This rule applies to dense and convolutional layers (excluding the nonlinearity). Let y be a linear function of its inputs x_i such that





y = b + \sum_{i=1}^{n} w_i x_i

We then have:





\Delta y = \sum_{i} w_i \Delta x_i

We define the positive and negative parts of ∆y as:

\Delta y^{+} = \sum_{i} 1\{w_i \Delta x_i > 0\}\, w_i \Delta x_i = \sum_{i} 1\{w_i \Delta x_i > 0\}\, w_i (\Delta x_i^{+} + \Delta x_i^{-})
\Delta y^{-} = \sum_{i} 1\{w_i \Delta x_i < 0\}\, w_i \Delta x_i = \sum_{i} 1\{w_i \Delta x_i < 0\}\, w_i (\Delta x_i^{+} + \Delta x_i^{-})





which leads to the following choice for the contributions:

C_{\Delta x_i^{+} \Delta y^{+}} = 1\{w_i \Delta x_i > 0\}\, w_i \Delta x_i^{+} \qquad C_{\Delta x_i^{-} \Delta y^{+}} = 1\{w_i \Delta x_i > 0\}\, w_i \Delta x_i^{-}
C_{\Delta x_i^{+} \Delta y^{-}} = 1\{w_i \Delta x_i < 0\}\, w_i \Delta x_i^{+} \qquad C_{\Delta x_i^{-} \Delta y^{-}} = 1\{w_i \Delta x_i < 0\}\, w_i \Delta x_i^{-}





The multipliers then follow from the definition in Section 3.2.1, giving m_{\Delta x_i^{+} \Delta y^{+}} = m_{\Delta x_i^{-} \Delta y^{+}} = 1\{w_i \Delta x_i > 0\}\, w_i, and analogously for ∆y−.





What about when ∆x_i = 0? Setting the multipliers to 0 in this case would be consistent with summation-to-delta, but it is possible for ∆x_i^+ and ∆x_i^− to be nonzero while canceling each other out, in which case zero multipliers would fail to propagate importance to them. To avoid this, we set





m_{\Delta x_i^{+} \Delta y^{+}} = m_{\Delta x_i^{+} \Delta y^{-}} = 0.5\, w_i

when ∆x_i is 0 (and similarly for ∆x_i^−).





Appendix B describes how to compute these multipliers efficiently using standard neural-network operations.
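A minimal numpy sketch of the Linear rule for a single dense neuron, including the 0.5·w convention for inputs whose difference-from-reference is exactly zero (a toy illustration of the formulas above, not the library implementation):

import numpy as np

def linear_rule_multipliers(w, delta_x):
    # Multipliers of dx+ and dx- to dy+ and dy- for y = b + sum_i w_i * x_i.
    pos_term = (w * delta_x > 0)    # inputs contributing positively to delta_y
    neg_term = (w * delta_x < 0)    # inputs contributing negatively to delta_y
    zero     = (w * delta_x == 0)   # ambiguous inputs: split the weight evenly
    m_to_y_pos = np.where(pos_term, w, 0.0) + np.where(zero, 0.5 * w, 0.0)
    m_to_y_neg = np.where(neg_term, w, 0.0) + np.where(zero, 0.5 * w, 0.0)
    return m_to_y_pos, m_to_y_neg   # applied to both dx+ and dx- of each input

w       = np.array([2.0, -1.0, 0.5])
delta_x = np.array([1.0,  3.0, 0.0])
m_pos, m_neg = linear_rule_multipliers(w, delta_x)

delta_y_pos = np.sum(m_pos * delta_x)   # =  2.0 (only w_0 * dx_0 is positive)
delta_y_neg = np.sum(m_neg * delta_x)   # = -3.0 (w_1 * dx_1 is negative)
print(delta_y_pos + delta_y_neg, np.sum(w * delta_x))   # both equal delta_y = -1.0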





3.5.2. The Rescale rule





This rule applies to nonlinear transformations that take a single input, such as ReLU, tanh or sigmoid. Let y be a nonlinear transformation of its input x such that y = f(x). Because y has only one input, summation-to-delta gives





C_{\Delta x \Delta y} = \Delta y, \quad\text{and therefore}\quad m_{\Delta x \Delta y} = \frac{\Delta y}{\Delta x}

For the Rescale rule, we set ∆y+ and ∆y− proportional to ∆x+ and ∆x−:

\Delta y^{+} = \frac{\Delta y}{\Delta x}\,\Delta x^{+} = C_{\Delta x^{+} \Delta y^{+}}
\Delta y^{-} = \frac{\Delta y}{\Delta x}\,\Delta x^{-} = C_{\Delta x^{-} \Delta y^{-}}





From this we obtain:

m_{\Delta x^{+} \Delta y^{+}} = m_{\Delta x^{-} \Delta y^{-}} = m_{\Delta x \Delta y} = \frac{\Delta y}{\Delta x}





In the case where x → x^0 (so that ∆x → 0 and ∆y → 0), the definition of the multiplier approaches the derivative:





m_{\Delta x \Delta y} \to \frac{dy}{dx}, \quad\text{where}\; \frac{dy}{dx} \;\text{is evaluated at}\; x = x^0.

We therefore use the gradient in place of the multiplier when x is close to its reference, which avoids numerical instability caused by a very small denominator.





The Rescale rule addresses both the saturation problem of Fig. 1 and the thresholding problem of Fig. 2. In Fig. 1, assuming





i^0_1 = i^0_2 = 0, \;\text{then for}\; i_1 + i_2 > 1: \;\; \Delta h = \text{-}1, \;\; \Delta y = 1, \;\; m_{\Delta h \Delta y} = \frac{\Delta y}{\Delta h} = \text{-}1, \;\text{while}\; \frac{dy}{dh} = 0

(in other words, the difference-from-reference lets information flow even when the gradient is zero). In Fig. 2, assuming x^0 = 0 and y^0 = 0, at x = 10 + ε we have ∆y = ε, so the Rescale rule assigns a contribution of ε to x, in contrast to gradient × input, which assigns a contribution of 10 + ε to x and -10 to the bias term (DeepLIFT never assigns importance to bias terms).
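The numbers in the Fig. 2 example can be checked directly; the sketch below compares gradient × input with the Rescale-rule contribution for the biased ReLU y = max(0, x − 10), with the gradient written out by hand (a toy check, not library code):

import numpy as np

relu_bias = lambda x: np.maximum(0.0, x - 10.0)   # ReLU with a bias of -10

eps   = 0.1
x, x0 = 10.0 + eps, 0.0                 # input and reference
dx    = x - x0
dy    = relu_bias(x) - relu_bias(x0)    # = eps

grad             = 1.0 if x - 10.0 > 0 else 0.0   # dy/dx at the input
grad_times_inp   = grad * x                       # = 10 + eps
rescale_mult     = dy / dx                        # m_{dx,dy} = eps / (10 + eps)
deeplift_contrib = rescale_mult * dx              # = eps

print(grad_times_inp, deeplift_contrib)
# gradient*input assigns 10 + eps to x (with -10 going to the bias term);
# the Rescale rule assigns only eps, and the score varies continuously with x.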









As noted by Lundberg & Lee (Lundberg & Lee, 2016 [6]), there is a connection between DeepLIFT and Shapley values. Briefly, the Shapley values measure the average marginal effect of including an input, taken over all possible orderings in which inputs can be included. If "including" an input means setting it to its actual value rather than its reference value, DeepLIFT can be viewed as a fast approximation to the Shapley values. At the time, Lundberg & Lee cited an early preprint of DeepLIFT that did not include the RevealCancel rule described below.





3.5.3. An improved approximation to the Shapley values: the RevealCancel rule





While the Rescale rule improves on simply using gradients, there are still situations where it gives misleading results. Consider the min(i1, i2) operation shown in Fig. 3, with reference values i1 = 0 and i2 = 0. Under the Rescale rule, all importance is assigned to either i1 or i2 (whichever is the minimum), which hides the fact that both inputs matter for the min operation.





To see why this happens, consider the case when

i_1 > i_2. \;\text{Then}\; h_1 = (i_1 - i_2) > 0 \;\text{and}\; h_2 = \max(0, h_1) = h_1.

By the Linear rule, we calculate that

C_{\Delta i_1 \Delta h_1} = i_1, \qquad C_{\Delta i_2 \Delta h_1} = \text{-}i_2.

By the Rescale rule, the multiplier

m_{\Delta h_1 \Delta h_2} = \frac{\Delta h_2}{\Delta h_1} = 1,

and thus, by the chain rule,

C_{\Delta i_1 \Delta h_2} = m_{\Delta h_1 \Delta h_2} C_{\Delta i_1 \Delta h_1} = i_1, \qquad C_{\Delta i_2 \Delta h_2} = m_{\Delta h_1 \Delta h_2} C_{\Delta i_2 \Delta h_1} = \text{-}i_2.

The total contribution of i_1 to the output o = i_1 - h_2 is then

(i_1 - C_{\Delta i_1 \Delta h_2}) = (i_1 - i_1) = 0,

while the total contribution of i_2 to o is \text{-}C_{\Delta i_2 \Delta h_2} = i_2.

This is misleading, because it ignores the fact that

C_{\Delta i_2 \Delta h_2} \;\text{would be}\; 0 \;\text{if}\; i_1 \;\text{were}\; 0;

in other words, it discounts the dependency between i_1 and i_2 induced by i_2 canceling out i_1 in the nonlinear neuron h_2. A similar failure occurs when i_1 < i_2; in that case the Rescale rule gives

C_{\Delta i_1 \Delta o} = i_1, \qquad C_{\Delta i_2 \Delta o} = 0.

Note that gradients, gradient × input, Guided Backpropagation and integrated gradients would also assign all importance to one of i1 or i2, because for any given input the gradient of o is zero with respect to one of them (see Appendix C for a detailed calculation).









One way to address this is to treat positive and negative contributions separately. Again consider a nonlinear neuron y = f(x). Instead of assuming that ∆y+ and ∆y− are proportional to

\Delta x^{+} \;\text{and}\; \Delta x^{-}, \;\text{with}\; m_{\Delta x^{+} \Delta y^{+}} = m_{\Delta x^{-} \Delta y^{-}} = m_{\Delta x \Delta y}

(as the Rescale rule does), we define them as follows:

\Delta y^{+} = \frac{1}{2}\left(f(x^0 + \Delta x^{+}) - f(x^0)\right) + \frac{1}{2}\left(f(x^0 + \Delta x^{-} + \Delta x^{+}) - f(x^0 + \Delta x^{-})\right)
\Delta y^{-} = \frac{1}{2}\left(f(x^0 + \Delta x^{-}) - f(x^0)\right) + \frac{1}{2}\left(f(x^0 + \Delta x^{+} + \Delta x^{-}) - f(x^0 + \Delta x^{+})\right)
m_{\Delta x^{+} \Delta y^{+}} = \frac{\Delta y^{+}}{\Delta x^{+}}, \qquad m_{\Delta x^{-} \Delta y^{-}} = \frac{\Delta y^{-}}{\Delta x^{-}}





In other words, we measure the effect of the positive contributions ∆x+ in the absence and in the presence of the negative contributions ∆x−, and the effect of ∆x− in the absence and in the presence of ∆x+, and average the two. This can be thought of as computing the Shapley values of ∆x+ and ∆x− contributing to y.





By considering the impact of the positive terms in the presence of the negative terms, and vice versa, we alleviate the Rescale rule's problem of ignoring such dependencies. Returning to the min(i1, i2) example of Fig. 3, the RevealCancel rule assigns a contribution of 0.5 min(i1, i2) to each input (see Appendix C).
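The 0.5·min(i1, i2) result can be verified numerically. The sketch below hand-codes the network of Fig. 3 (h1 = i1 − i2, h2 = ReLU(h1), o = i1 − h2) and applies the Linear and RevealCancel rules, tracking the positive and negative components explicitly (a toy calculation specific to this network, not general-purpose code):

relu = lambda z: max(0.0, z)

def reveal_cancel(f, x0, dx_pos, dx_neg):
    # RevealCancel decomposition of dy into dy+ and dy- for y = f(x).
    dy_pos = 0.5 * (f(x0 + dx_pos) - f(x0)) + \
             0.5 * (f(x0 + dx_neg + dx_pos) - f(x0 + dx_neg))
    dy_neg = 0.5 * (f(x0 + dx_neg) - f(x0)) + \
             0.5 * (f(x0 + dx_pos + dx_neg) - f(x0 + dx_pos))
    return dy_pos, dy_neg

i1, i2 = 3.0, 2.0            # inputs; the references are 0, so delta_i = i
# h1 = i1 - i2 (Linear rule): i1 feeds dh1+, i2 feeds dh1-
dh1_pos, dh1_neg = i1, -i2
C_i1_h1, C_i2_h1 = i1, -i2

# h2 = ReLU(h1) (RevealCancel rule), reference h1^0 = 0
dh2_pos, dh2_neg = reveal_cancel(relu, 0.0, dh1_pos, dh1_neg)
m_pos = dh2_pos / dh1_pos    # multiplier from dh1+ to dh2+
m_neg = dh2_neg / dh1_neg    # multiplier from dh1- to dh2-

# Chain rule through h2
C_i1_h2 = m_pos * C_i1_h1
C_i2_h2 = m_neg * C_i2_h1

# o = i1 - h2 (Linear rule): direct path for i1 plus the -h2 path
C_i1_o = i1 - C_i1_h2
C_i2_o = 0.0 - C_i2_h2

print(C_i1_o, C_i2_o, 0.5 * min(i1, i2))   # both contributions equal 0.5 * min(i1, i2)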





While RevealCancel also avoids the saturation and thresholding pitfalls of Fig. 1 and Fig. 2, there are circumstances where the Rescale rule may be preferable. Consider a thresholded ReLU for which ∆y > 0 iff ∆x ≥ b. If ∆x < b is caused mostly by noise, the Rescale rule assigns it near-zero importance, which may be the desired ("noise-suppressing") behaviour. RevealCancel, by contrast, compares the impact of ∆x+ against ∆x− rather than looking only at the final ∆x, and may therefore assign nonzero contributions to the noise.





Figure 3. Network computing o = min(i1, i2).

Assume i^0_1 = i^0_2 = 0. When i_1 < i_2, then do/di_2 = 0, and when i_2 < i_1, then do/di_1 = 0. Using any of the backpropagation approaches described in Section 2.2, importance would be assigned only to i1 or only to i2; the RevealCancel rule assigns 0.5 min(i1, i2) to both inputs.









3.6. Choice of target layer





In the case of softmax or sigmoid outputs, we may prefer to compute contributions to the linear layer preceding the final nonlinearity rather than to the final nonlinearity itself; doing so is more consistent with the intuition of Section 3.1. To see why, let o = σ(y), where σ is the sigmoid and y is the logit.





Assume y = x_1 + x_2, with x^0_1 = x^0_2 = 0. When x_1 = 50 and x_2 = 0, the output o saturates near 1 and the contributions of x_1 and x_2 to o are 0.5 and 0 respectively. However, when x_1 = 100 and x_2 = 100, the output o is still near 1 but the contributions of x_1 and x_2 are now 0.25 each, even though x_1 alone would have been sufficient to saturate the output; this can be misleading when comparing DeepLIFT scores across examples. To avoid the issue, we compute contributions to y (the logit) rather than to o.
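The numbers in this example can be reproduced in a few lines, assuming the Rescale rule is used through the sigmoid (a toy check):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def contribs_to_output(x1, x2):
    # Rescale-rule contributions of x1 and x2 to o = sigmoid(x1 + x2), reference 0.
    y, y0 = x1 + x2, 0.0
    o, o0 = sigmoid(y), sigmoid(y0)
    m = (o - o0) / (y - y0)        # multiplier through the sigmoid
    return m * x1, m * x2          # contributions of x1 and x2 to delta_o

print(contribs_to_output(50.0, 0.0))     # ~ (0.5, 0.0)
print(contribs_to_output(100.0, 100.0))  # ~ (0.25, 0.25)
# Computing contributions to the logit y instead gives (50, 0) and (100, 100),
# which better reflects that x1 alone saturates the output.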





Adjustments for softmax layers





If we compute contributions to the linear layer preceding the softmax rather than to the softmax output, a complication arises: the softmax output depends only on the differences between its inputs, so a contribution that increases the inputs of all softmax classes equally does not change the output. To address this, we can normalize the contributions to each class by subtracting the mean contribution over all classes. Formally, if n is the number of softmax classes,





C_{\Delta x \Delta c_i} is the unnormalized contribution of x to the pre-softmax input c_i of class i, and C'_{\Delta x \Delta c_i} is the normalized contribution, then:

C'_{\Delta x \Delta c_i} = C_{\Delta x \Delta c_i} - \frac{1}{n} \sum_{j=1}^{n} C_{\Delta x \Delta c_j}









As a justification for this normalization, note that subtracting a fixed amount from the inputs of all softmax classes leaves the output of the softmax unchanged.
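The normalization is a one-liner; in the sketch below, contribs is a hypothetical array holding the contributions of a single input x to the n pre-softmax class inputs:

import numpy as np

def normalize_softmax_contribs(contribs):
    # Subtract the mean contribution over classes from each class's contribution.
    # contribs: array of shape (n_classes,) holding C_{dx, dc_i} for one input x.
    return contribs - contribs.mean()

contribs = np.array([1.2, 0.2, 0.1, 0.5])     # hypothetical contributions to 4 class inputs
print(normalize_softmax_contribs(contribs))   # a uniform shift across classes maps to zero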





4. Results

4.1. Digit classification (MNIST)





We trained a convolutional network on MNIST (LeCun et al., 1999 [5]) using Keras (Chollet, 2015 [2]) and obtained 99.2% test-set accuracy. The network consists of convolutional layers followed by a fully connected layer and a softmax output (see Appendix D for the architecture). We used convolutions with stride > 1 instead of pooling layers, which does not appear to hurt performance (Springenberg et al., 2014 [10]). For DeepLIFT and integrated gradients we used a reference input of all zeros (the background value of the images).





To evaluate the importance scores produced by the different methods, we designed the following task: given an image that the network classifies as class c_o, identify which pixels to erase (i.e. set to the reference value of 0) in order to convert the image to a target class c_t. To do this, we compute





S_{x_i\,\mathrm{diff}} = S_{x_i c_o} - S_{x_i c_t} \quad (\text{where } S_{x_i c} \text{ is the score of pixel } x_i \text{ for class } c)

and erase up to 157 pixels (20% of the image), ranked in descending order of S_{x_i diff}, restricted to pixels for which S_{x_i diff} > 0.

We then evaluate the change in the log-odds score between classes c_o and c_t for the original image and the image with the selected pixels erased.
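A sketch of this evaluation procedure, assuming the per-pixel score maps S_co and S_ct and a log-odds function for the trained model have already been computed (all names here are hypothetical placeholders, not code from the paper):

import numpy as np

def erase_top_pixels(image, S_co, S_ct, max_pixels=157):
    # Erase (set to the reference value 0) up to max_pixels pixels, ranked in
    # descending order of S_diff = S_co - S_ct, keeping only pixels with S_diff > 0.
    s_diff = (S_co - S_ct).ravel()
    order = np.argsort(-s_diff)                      # descending by S_diff
    order = order[s_diff[order] > 0][:max_pixels]    # keep positive scores only
    erased = image.copy().ravel()
    erased[order] = 0.0
    return erased.reshape(image.shape)

# Hypothetical usage:
# masked = erase_top_pixels(img, scores[original_class], scores[target_class])
# delta  = log_odds(masked, target_class, original_class) - \
#          log_odds(img, target_class, original_class)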





Figure 4. DeepLIFT with the RevealCancel rule better identifies the pixels to erase in order to convert one digit into another.

Left: result of masking the pixels ranked most important for the original class (8) relative to the target class (3 or 6); under each image is the change in the log-odds score between 8 and the target class after the indicated pixels are erased. The 8→6 conversion marked with * is a failure case. Right: boxplots of the increase in the log-odds score of the target class versus the original class after erasing pixels, over 1K test images originally predicted as 8. "Integrated gradients-n" refers to numerically integrating the gradients over n evenly spaced intervals.





Figure 5. DeepLIFT with RevealCancel gives the qualitatively desired behavior on the TAL-GATA simulation.

(a) Scatter plots of importance score versus the strength of a TAL1 motif match, for different tasks and methods (see Appendix G for GATA1). For each sequence, the top 5 motif matches are shown. X-axis: log-odds of the TAL1 motif match versus background. Y-axis: total importance assigned to the match for the given task. Points are colored according to whether the sequence contains both TAL1 and GATA1 motifs, only GATA1, or only TAL1. "DeepLIFT-fc-RC-conv-RS" denotes using the RevealCancel rule on the fully connected layer and the Rescale rule on the convolutional layers, which appears to reduce noise relative to using RevealCancel everywhere.





(b) Proportion of strong matches (log-odds > 7) to the TAL1 motif, in sequences containing both TAL1 and GATA1, that received a total score ≤ 0 for task 0; grad×inp and DeepLIFT without the RevealCancel rule show clear failures here, unlike the RevealCancel variants (compare with the task panels in (a)).





4.2. Classifying regulatory DNA sequences (genomics)





Next we compared the importance-scoring methods on classification tasks over DNA sequences (strings over the alphabet {A,C,G,T}). Genomic DNA sequences (typically 200-1000 bases long) contain short functional subsequences, called motifs, that are bound by regulatory proteins (RPs). An RP (for example GATA1) can usually bind several similar sequences (for example GATAA and GATTA) with different affinities. Given a classifier that predicts whether an RP binds a sequence, we would like the importance scores to highlight the parts of the input (the motifs) that drive the classification. As we show below, DeepLIFT recovers these motifs more reliably than the gradient-based alternatives.





We generated 200-bp sequences in which each position was sampled from {A,C,G,T} with background probabilities 0.3, 0.2, 0.2 and 0.3 respectively, and then embedded (see Appendix F) 0-3 instances of motifs for the regulatory proteins GATA1 and TAL1 (Fig. 6a) (Kheradpour & Kellis, 2014 [3]) into each sequence. We trained a multi-task network with three tasks: task 0 was "both GATA1 and TAL1 motifs present", task 1 was "GATA1 motif present", and task 2 was "TAL1 motif present". A quarter of the sequences contained both GATA1 and TAL1 (labels 111), a quarter contained GATA1 only (labels 010), a quarter contained TAL1 only (labels 001), and a quarter contained neither (labels 000). Details of the architecture and training are given in Appendix F. Importance scores were computed against a reference input containing the expected background frequencies of A, C, G and T at every position (i.e. 0.3, 0.2, 0.2 and 0.3; see Appendix J for results with references generated by shuffling the sequences). For the gradient-based baselines we computed gradient × difference-from-reference rather than gradient × input (referred to in the figures simply as "grad×inp"), measured against the same reference, so that the comparison with DeepLIFT is fair.
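A simplified sketch of this data-generation procedure (the real simulation samples motif instances from PWMs and is described in Appendix F; GATA1_EXAMPLE and TAL1_EXAMPLE below are illustrative placeholder strings, and each present motif is embedded 1-3 times):

import numpy as np

rng = np.random.default_rng(0)
BASES = np.array(list('ACGT'))
BG_FREQS = [0.3, 0.2, 0.2, 0.3]
GATA1_EXAMPLE, TAL1_EXAMPLE = 'GATAAG', 'CAGATG'   # placeholder motif instances

def random_background(length=200):
    return ''.join(rng.choice(BASES, size=length, p=BG_FREQS))

def embed(seq, motif):
    # Overwrite a random stretch of the background sequence with a motif instance.
    start = rng.integers(0, len(seq) - len(motif) + 1)
    return seq[:start] + motif + seq[start + len(motif):]

def make_example():
    # Return (sequence, labels) with labels = [both present, GATA1 present, TAL1 present].
    has_gata1, has_tal1 = rng.random() < 0.5, rng.random() < 0.5
    seq = random_background()
    for _ in range(rng.integers(1, 4)):            # 1-3 instances of each present motif
        if has_gata1: seq = embed(seq, GATA1_EXAMPLE)
        if has_tal1:  seq = embed(seq, TAL1_EXAMPLE)
    return seq, [int(has_gata1 and has_tal1), int(has_gata1), int(has_tal1)]

seq, labels = make_example()
print(labels, seq[:40])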





To evaluate the methods, we identified the locations of the embedded motifs and computed the total score over the length of each motif, summing over the ACGT channels at each position. Fig. 5 shows scatter plots of the total score of a motif match versus the log-odds of the match for TAL1 (see Appendix E for GATA1). We expect the following behaviour: (1) high scores for TAL1 motifs for task 2, (2) negligible scores for TAL1 motifs for task 1, which depends only on GATA1, (3) analogous behaviour for GATA1 motifs (with tasks 1 and 2 swapped), (4) for task 0, high scores for TAL1 motifs only when a GATA1 motif is also present, and (5) scores that increase with the strength (log-odds) of the motif match (motifs truncated at the edge of a sequence may receive lower scores; see Fig. 5).





As Fig. 5 shows, grad×inp and integrated gradients can violate property (2), assigning substantial scores to TAL1 motifs for task 1 (see Appendix H), and can also violate property (4), assigning scores to TAL1 motifs for task 0 even when no GATA1 motif is present. Guided Backprop×inp and gradient×inp also show failures of property (3): a noticeable fraction of strong motif matches (log-odds greater than 7) receive near-zero or negative scores for the tasks on which they should be important. For both Guided Backprop×inp and gradient×inp this behaviour resembles the thresholding artifact discussed in Section 2.2 and illustrated in Fig. 2, where small changes in the input (or in the output y) produce abrupt changes in the assigned importance.





We evaluated three DeepLIFT configurations: the Rescale rule at all nonlinearities (DeepLIFT-Rescale), the RevealCancel rule at all nonlinearities (DeepLIFT-RevealCancel), and the Rescale rule on the convolutional layers with RevealCancel on the fully connected layer (DeepLIFT-fc-RC-conv-RS). In contrast to the MNIST experiments, we found that DeepLIFT-fc-RC-conv-RS gave somewhat less noisy results than applying RevealCancel to all layers. We speculate this relates to the noise-suppression behaviour discussed in Section 3.5.3: the convolutional layers act as motif detectors, and inputs that fall below a detector's threshold are often noise that is better suppressed by the Rescale rule (see Fig. 6 for an example).





Both gradient×inp and DeepLIFT-Rescale sometimes assign zero or negative total scores to TAL1 motifs for task 0 (Fig. 5b), a failure mode that RevealCancel avoids (see also Fig. 6). We believe this is another consequence of failing to model saturation and cancellation; as detailed in Appendix I, it can occur, for example, when a sequence contains more than one TAL1 motif and the contribution of one match saturates or cancels the contribution of another.









Figure 6. RevealCancel highlights both the TAL1 and GATA1 motifs for task 0.

(a) PWM representations of the GATA1 and TAL1 motifs used in the simulation. (b) Scores assigned to an example sequence containing both a TAL1 and a GATA1 motif, for task 0; letter height reflects the score. Boxes mark the locations of the embedded GATA1 and TAL1 motifs, and a separate box marks a spurious TAL1-like match (CAGTTG instead of CAGATG). Both the TAL1 and the GATA1 motif should be highlighted for task 0. Note that applying RevealCancel only to the fully connected layer reduces the highlighting of the spurious motif compared with applying RevealCancel to all layers.





5. Conclusion

We have presented DeepLIFT, a new approach to computing importance scores that explains the difference of the output from a "reference" output in terms of differences of the inputs from their "reference" inputs. Using the difference-from-reference allows information to propagate even when the gradient is zero (Fig. 1), which could prove especially useful in recurrent neural networks, where saturating activations such as sigmoid and tanh are common. DeepLIFT also avoids placing misleading importance on bias terms (in contrast to gradient × input; see Fig. 2). By treating positive and negative contributions separately, the DeepLIFT-RevealCancel rule can identify dependencies missed by other methods (Fig. 3). Open directions include: (a) applying DeepLIFT to RNNs, (b) computing a good reference empirically from the data, and (c) better ways of propagating importance through "max" operations (as in Maxout or Maxpooling neurons) beyond simply using the gradients.









[1] Bach, Sebastian, Binder, Alexander, Montavon, Gregoire, Klauschen, Frederick, Muller, Klaus-Robert, and Samek, Wojciech. On Pixel-Wise explanations for Non-Linear classifier decisions by Layer-Wise relevance propagation. PLoS One, 10(7):e0130140, 10 July 2015.





[2] Chollet, François. keras. https://github.com/fchollet/keras, 2015.





[3] Kheradpour, Pouya and Kellis, Manolis. Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Research, 42(5):2976–2987, 2014.





[4] Kindermans, Pieter-Jan, Schütt, Kristof, Müller, Klaus-Robert, and Dähne, Sven. Investigating the influence of noise and distractors on the interpretation of neural networks. CoRR, abs/1611.07270, 2016. URL https://arxiv.org/abs/1611.07270.





[5] LeCun, Yann, Cortes, Corinna, and Burges, Christopher J.C. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1999.





[6] Lundberg, Scott and Lee, Su-In. An unexpected unity among methods for interpreting model predictions. CoRR, abs/1611.07478, 2016. URL http://arxiv.org/abs/1611.07478.





[7] Selvaraju, Ramprasaath R., Das, Abhishek, Vedantam, Ramakrishna, Cogswell, Michael, Parikh, Devi, and Batra, Dhruv. Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization. CoRR, abs/1610.02391, 2016. URL http://arxiv.org/abs/1610.02391.





[8] Shrikumar, Avanti, Greenside, Peyton, Shcherbina, Anna, and Kundaje, Anshul. Not just a black box: Learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713, 2016.





[9] Simonyan, Karen, Vedaldi, Andrea, and Zisserman, Andrew. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.





[10] Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, and Riedmiller, Martin A. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014. URL http://arxiv.org/abs/1412.6806.





[11] Sundararajan, Mukund, Taly, Ankur, and Yan, Qiqi. Gradients of counterfactuals. CoRR, abs/1611.02639, 2016. URL http://arxiv.org/abs/1611.02639.





[12] Zeiler, Matthew D. and Fergus, Rob. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013. URL http://arxiv.org/abs/1311.2901.





[13] Zhou, Jian and Troyanskaya, Olga G. Predicting effects of noncoding variants with deep learning-based sequence model. Nature Methods, 12:931–934, Oct 2015. ISSN 1548-7105. doi: 10.1038/nmeth.3547.





[14] Zintgraf, Luisa M, Cohen, Taco S, Adel, Tameem, and Welling, Max. Visualizing deep neural network decisions: Prediction difference analysis. ICLR, 2017. URL https://openreview.net/pdf?id=BJ5UeU9xx











