Language Modeling with Gated Convolutional Networks

The dominant approach to language modeling today is based on recurrent neural networks. Their success is often attributed to their ability to handle unbounded context. In this paper, we develop a finite-context approach using stacked convolutions, which can be more efficient because they allow parallelization over sequential tokens. We propose a novel simplified gating mechanism that outperforms the one proposed by Oord et al. (2016b) [26] and investigate the impact of key architectural decisions. The proposed approach achieves state-of-the-art results on the WikiText-103 benchmark, even though it features long-term dependencies, as well as competitive results on the Google Billion Word benchmark. Our model reduces the latency of scoring a sentence by an order of magnitude compared to a recurrent baseline. To our knowledge, this is the first time a non-recurrent approach has been competitive with strong recurrent models on these large-scale language tasks.

1. Introduction

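Statistical language models estimate the probability distribution of a sequence of words by modeling the probability of the next word given the preceding words, i.e.

P(w_0, \ldots, w_N) = P(w_0) \prod_{i=1}^{N} P(w_i | w_0, \ldots, w_{i-1}),

where w_i are discrete word indices in a vocabulary. Language models are a critical part of systems for speech recognition (Yu & Deng, 2014) [34] and machine translation (Koehn, 2010) [17].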

, (Bengio et al., 2003 [1]; Mikolov et al., 2010 [2]; Jozefowicz et al., 2016 [14]) n- (Kneser & Ney, 1995 [16]; Chen & Goodman, 1996 [3]). , , , . , . (LSTM; Hochreiter et al., 1997[12]), .

. (LeCun & Bengio, 1995 [19]). ,

\mathcal{O} (N / k )

N k. ,

\mathcal {O} (N)

, , , , (Manning & Schutze, 1999 [20]; Steedman, 2002 [31]). , , (Glorot & Bengio, 2010 [6]).

. ( ) , . , , ( 2).

, (Jozefowicz et al., 2016 [14]). (GLU) , ( 5.2).

, , , , LSTM, Google Billion Word Benchmark (Chelba et al., 2013 [2]). WikiText-103, , , (Merity et al., 2016 [21]). , , (GLU) , LSTM- Oord et al. (2016 [26]; 4, 5).


, , , . (Bengio et al., 2003 [1])

H = [h_0,. . . , h_N]

w_0,. . . , w_N,

P (w_i | h_i).

f H

h_i = f(h_{i-1}, w_{i-1}).

, i ( ).

f  { }  H = f \ast w

, , , . . , , , , ( 5).

1 . ,

D^{|V| \times e},  |V| -   ( ),  e -  .

w_0,. . . , w_N,

E = [D_{w0},. . . , D_{wN}].

h_0, \ldots, h_L

h_l(X) = (X \ast W + b) \otimes \sigma(X \ast V + c),

m, n – , , k - ,

X \in \mathbb{R}^{N \times m} -   h_l

( , ),

W \in \mathbb{R}^{k \times m \times n}, b \in \mathbb{R}^n, V \in \mathbb{R}^{k \times m \times n}, c \in \mathbb{R}^n  -  , \sigma      \otimes  .



. , (Oord et al., 2016a [25]). ,


, - , ,

k -  .
  1.       .
1. .

X \ast W + b,    \sigma(X \ast V + c).


X \ast W + b

, . (GLU). E

H = h_L \circ \ldots \circ h_0(E).
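As an illustration of how the two terms X ∗ W + b and σ(X ∗ V + c) combine into a gated layer, and how layers compose as H = h_L ◦ … ◦ h_0(E), a minimal pure-Python sketch follows. It is not the authors' Torch implementation: the function names (`causal_conv`, `glu_layer`, `stack`), the nested-list tensor layout, and the choice of left zero-padding by k − 1 positions are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def causal_conv(X, W, b):
    """1-D convolution over time with k-1 zero rows prepended (assumed padding).

    X: N rows of m features; W: kernel of shape k x m x n; b: n biases.
    Output row i depends only on input rows i-k+1 .. i, never on later rows.
    """
    k, m, n = len(W), len(W[0]), len(b)
    padded = [[0.0] * m for _ in range(k - 1)] + X
    out = []
    for i in range(len(X)):
        row = list(b)
        for t in range(k):
            for a in range(m):
                x = padded[i + t][a]
                for j in range(n):
                    row[j] += x * W[t][a][j]
        out.append(row)
    return out


def glu_layer(X, W, b, V, c):
    """Linear path (X * W + b) scaled elementwise by the gate sigmoid(X * V + c)."""
    lin = causal_conv(X, W, b)
    gate = causal_conv(X, V, c)
    return [[l * sigmoid(g) for l, g in zip(lr, gr)]
            for lr, gr in zip(lin, gate)]


def stack(E, layers):
    """Apply h_0 first and h_L last; `layers` is a list of (W, b, V, c) tuples."""
    H = E
    for W, b, V, c in layers:
        H = glu_layer(H, W, b, V, c)
    return H
```

With V and c set to zero, every gate equals σ(0) = 0.5, so the layer reduces to half of the linear convolution, which makes a convenient sanity check.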

(GLU) , (He et al., 2015a [10]). , 5 ().

- softmax, , , (Gutmann & Hyvarinen [9]) softmax (Morin & Bengio, 2005 [24]). , softmax. ( , ) – (Grave et al., 2016a [7]). – , .


, , (Hochreiter & Schmidhuber, 1997 [12]). LSTMs , () . . . , , , () .

, , , . , , , . Oord et al. (2016b [26]) LSTM

\tanh(X \ast W + b) \otimes \sigma(X \ast V + c)

. Kalchbrenner et al. (2016 [15]) .

(GLU) - , Dauphin & Grangier (2015) [35] , . , . LSTM, gated tanh unit (GTU),

, ( , ) -

\tanh'(X)      \sigma'(X).

, (GLU)

\nabla X \otimes \sigma(X)

\sigma(X). , . §5.2 , (GLU) .
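The contrast between the two gradients can be checked numerically. The sketch below treats the scalar case with the same input on both paths (so ∇X = 1), an assumption made only to keep the example short; all function names are illustrative. For the GTU, both product-rule terms carry a derivative factor, tanh'(x) or σ'(x), that shrinks as |x| grows, whereas the GLU term ∇X ⊗ σ(X) carries no such factor.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2


def gtu(x):
    return math.tanh(x) * sigmoid(x)


def glu(x):
    return x * sigmoid(x)


def grad_gtu(x):
    # Product rule: tanh'(x) * sigmoid(x) + sigmoid'(x) * tanh(x).
    return d_tanh(x) * sigmoid(x) + d_sigmoid(x) * math.tanh(x)


def grad_glu(x):
    # Product rule: sigmoid(x) + x * sigmoid'(x).
    # The first term is the linear path: no derivative factor scales it down.
    return sigmoid(x) + x * d_sigmoid(x)


def numeric_grad(f, x, eps=1e-6):
    # Central finite difference, used only to cross-check the formulas.
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)
```

Comparing `grad_gtu(x)` and `grad_glu(x)` for inputs of growing magnitude shows the GTU gradient decaying toward zero while the GLU gradient stays close to σ(x).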



. -, Google Billion Word (Chelba et al., 2013 [2]) , , 800 . . , 3 , . 30,301,028 , . -, WikiText-103 - , 100 . , 200 . (Merity et al., 2016 [21]). GBW, , , . <S> </S> . Google Billion Word , WikiText-103 . <S> </S> , </S>. ,

e^{\frac{1}{N} \sum_{i}^{N} -\log p(w_i | \ldots, w_{i-1})}
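The perplexity formula above translates directly into code. The helper below is a minimal sketch; the name `perplexity` and the convention of passing a list of the probabilities the model assigned to each target word are assumptions made for illustration.

```python
import math


def perplexity(probs):
    """Exponential of the mean negative log-probability of the target words.

    probs: one probability per predicted word, each in the interval (0, 1].
    """
    if not probs:
        raise ValueError("need at least one probability")
    mean_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(mean_nll)
```

A model that assigns the same probability 1/V to every target word has perplexity V, so lower values indicate a model that is less uncertain about the next word.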



Torch (Collobert et al., 2011 [5]) Tesla M40. , . 8 , , 1/8 . Nvidia NCCL. .

, (Sutskever et al., 2013 [32]). , . (Pascanu et al., 2013 [27]) (Salimans & Kingma, 2016 [28]).

1. . [k, n]. «B» .

Pascanu et al. (2013) [27] , , RNN. RNN, .

, . , , 1.


. {1, ..., 10}, {128, ..., 256}, {128, ..., 2048}, - {3, ..., 5}. , , , , . , (He et al., 2015b [11]), [1., 2.], 0.99 0.1. , .



, GCNN LSTM Google . , softmax (Grave et al., 2016a [7]), . GCNN 38.1 , LSTM 39.8 ( 2).

2. Google Billion Word. GCNN LSTM .

, GCNN . 2 , , softmax softmax. softmax, GCNN . GCNN , LSTM Jozefowicz et al. (2016 [14]), , softmax. , , , 31.9 30.6 , 2 8 3 32 LSTM. , , (Shazeer et al., 2017 [30]), .

2. (Jozefowicz et al., 2016 [14]), softmax, softmax , .

, GCNN . Google Billion Word - 20 . WikiText-103, , , . WikiText-103 , , 4000 . GCNN LSTM ( 3). GCNN-8 8 800 , LSTM - 1024 . , GCNN .

3. WikiText-103.

Gigaword Chen et al. (2016 [4]) . , , , 55.6 29.4. Penn Treebank. , GCNN LSTM : 108.7 109.3 . , , . LSTM, , GCNN , , .


. . , . , () . , () - , . , , , , . , . , , .

, 43.9 Google Billion Word. LSTM 2048 2, GCNN-8Bottleneck 7 ResNet, , (He et al., 2015a [10]), GCNN-8 . () k > 1 k = 1. k = 1 , . , .

4. . LSTM 2048 GCNN 43.9 Google Billion Word. GCNN 20 .

LSTM 750 20, 15,000 . - 15,000 . 4 , LSTM GCNN . LSTM , 750 . , LSTM cuDNN, cuDNN , . , 1- cuDNN. LSTM, GCNN , , GCNN 20 .

3. WikiText-103 () Google Billion Word () . (GLU) .

5.2 ()

, . (GTU) LSTM

\tanh(X \ast W + b) \otimes \sigma(X \ast V + c)

(Oord et al., 2016b [26]) , ReLU Tanh. , . 3 () , GLU , WikiText-103. , ReLU , . ReLU, GLU. , Tanh, GTU , , . GTU , , .

GTU Tanh , Tanh GTU . ( 3, ) , , GTU Tanh. , ReLU GLU,

ReLU(X) = X \otimes (X > 0)

, . GLU .
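The identity ReLU(X) = X ⊗ (X > 0) can be verified elementwise. In the sketch below (function names are illustrative), the ReLU gate is a hard 0/1 mask computed from the input itself, while a GLU-style gate is a sigmoid of a separate pre-activation, written here as a hypothetical argument `g`.

```python
import math


def relu(x):
    return max(x, 0.0)


def relu_as_gate(x):
    # Hard gate: 1 where the input is positive, 0 elsewhere.
    gate = 1.0 if x > 0 else 0.0
    return x * gate


def glu_gate(x, g):
    # Soft gate: the sigmoid of a separate pre-activation g scales x.
    return x * (1.0 / (1.0 + math.exp(-g)))


# The two ReLU formulations agree on every input tried here.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert relu(x) == relu_as_gate(x)
```

Seen this way, the two units differ only in how the gate is produced: ReLU ties it to the sign of the input, while the GLU computes it from a second set of weights.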

3 () Google Billion Word. 100 - , . WikiText-103, . 5 GLU ReLU, LSTM RNN, (Jozefowicz et al., 2016 [14]) .


, , , . GLU , , GLU. (Manning & Schutze, 1999 [20]). , GLU

h_l(X) = X \ast W + b

- , softmax, -. GLU - (Mnih & Hinton, 2007 [23]),

h_l(X) = (X \ast W + b) \otimes (X \ast V + c).

4. Google Billion Word () WikiText-103 (). , , 20.

5 , GLU, , . 40 , GLU 20 . (115) 67.6 5- -, . , , 61 Google Billion Word, 5- -, (Ji et al., 2015 [13]).


5. Google Billion Word .

4 CNN. . , , , 40 , WikiText-103, . , , , . , , 40 . 4 , WikiText-103 , Google Billion Word, . WikiText-103 , Google Billion Word, 20. , 4000 , , 30 .


. , . - . 6 , . , . (1 0.01), . , ,   , .

6. Google Billion Word.


. , , , () . , , , . , WikiText-103. Google Billion Word , .

