🤪 🕐 🕖 Improving the markup of multimodal data: fewer assessors, more layers 🍹 🚧 🏛️

Hello! We - scientists at ITMO's Machine Learning laboratory and the Core ML team on VKontakte - are doing joint research. One of the important tasks of VK is the automatic classification of posts: it is necessary not only to generate thematic feeds, but also to identify unwanted content. Assessors are involved for such processing of records. At the same time, the cost of their work can be significantly reduced using such a machine learning paradigm as active learning.

It is about its application for the classification of multimodal data that will be discussed in this article. We will tell you about the general principles and methods of active learning, the peculiarities of their application to the task, as well as the insights obtained during the research.

Introduction

— machine learning, . , , , .

, (, Amazon Mechanical Turk, .) . — reCAPTCHA, , , , — Google Street View. — .

. , Voyage — , . , , . , .

Amazon DALC (Deep Active Learning from targeted Crowds). , . Monte Carlo Dropout ( ). — noisy annotation. , « , », .

Amazon . : / . , , . , : , . .

— ! , . pool-based sampling.

Figure: 1. General scheme of a pool-based scenario of active learning

. 1. pool-based

. , , ( ). : , .

, — . (. — query). , . ( , ) .

, , .

, — . ( ). ≈250 . . () 50 — — :

, (. embedding), ;
.

, (. . 2).

. 2 —

. 2 —

ML — . , .

. , . , , , . , , early stopping. , .

. residual , highway , (. encoder). , (. fusion): , .

— , . -.

, — , . , .

. , (. 3):

. 3.

. 3.

. , . , , . , ( + ) — .

, . 3, :

. 4.

. 4.

, , . , ó , , .

, : ? :

;
;
.

. : maximum likelihood , - . :

L = \frac{1}{σ_{1}^{2}} L_{1} + \frac{1}{σ_{2}^{2}} L_{2} + \frac{1}{σ_{3}^{2}} L_{3} + \log σ_{1} + \log σ_{2} + \log σ_{3}

$L = \frac{1}{\sigma_1 ^ 2}L_1 + \frac{1}{\sigma_2 ^ 2}L_2 + \frac{1}{\sigma_3 ^ 2}L_3 + \log{\sigma_1} + \log{\sigma_2} + \log{\sigma_3}$

$L_1, L_2, L_3$ — ( -), $\sigma_1, \sigma_2, \sigma_3$ — , .

Pool-based sampling

— , . pool-based sampling :

- .
.
, , .
.
( ).
3–5 (, ).

, 3–6 — .

, , :

, . , : . , , , . . , 2 000.
. , . ( ). , , . , . 20 .

. , . — , . 100 200.

, , , .

№1: batch size

baseline , ( ) (. 5).

. 5. baseline- .

random state. .

. «» , , .

, (. batch size). 512 — - (50). , batch size . . :

upsample, ;
, .

batch size: (1).

c u r r e n t_b a t c h_s i z e = b + ⌊ \frac{n \mod b}{⌊ \frac{n}{b} ⌋} ⌋ [1]

$current\_batch\_size =b + \Big \lfloor\frac{n \mod b}{\lfloor\frac{n}{b}\rfloor}\Big\rfloor [1]$

$b$ — batch size, $n$ — .

“” (. 6).

. 6. batch size (passive ) (passive + flexible )

: c . , , batch size . .

Uncertainty

— uncertainty sampling. , , .

1. (. Least confident sampling)

, :

x_{L C}^{*} = \underset{x}{\arg max} 1 - P_{θ} (\hat{y} | x) [2]

$x^{*}_{LC} = \underset{x}{\arg\max} \ 1 - P_{\theta }(\hat{y}|x) [2]$

$\hat{y} = \underset{y}{\arg\max}\ P_{\theta}(y|x)$ — , $y$ — , $x$ — , $x^{*}_{LC}$ — , .

. , $1-\hat{y}$ . , . .

. , : {0,5; 0,49; 0,01}, — {0,49; 0,255; 0,255}. , (0,49) , (0,5). , ó : . , .

2. (. Margin sampling)

, , , :

x_{M}^{*} = \underset{x}{\arg min} P_{θ} ({\hat{y}}_{1} | x) - P_{θ} ({\hat{y}}_{2} | x) [3]

$x^{*}_{M} = \underset{x}{\arg\min} \ P_{\theta }(\hat{y}_{1}|x) - P_{\theta }(\hat{y}_{2}|x)[3]$

$\hat{y}_1$ — $x$ , $\hat{y}_2$ — .

, . , . , , MNIST ( ) — , . .

3. (. Entropy sampling)

x_{H}^{*} = \underset{x}{\arg max} - \sum P_{θ} (y_{i} | x) \log P_{θ} (y_{i} | x) [4]

$x^{*}_{H} = \underset{x}{\arg\max} -\sum \ P_{\theta }(y_{i}|x)\log{P_{\theta }(y_{i}|x)}[4]$

$y_{i}$ — $i$ - $x$ .

, , . :

, , ;
, .

, , . , entropy sampling .

(. 7).

. 7. uncertainty sampling ( — , — , — )

, least confident entropy sampling , . margin sampling .

, , : MNIST. , , entropy sampling , . , .

. $O(p\log{q})$ , $p$ — , $q$ — . , .

BALD

, , — BALD sampling (Bayesian Active Learning by Disagreement). .

, query-by-committee (QBC). — . uncertainty sampling. , . QBC Monte Carlo Dropout, .

, , — . dropout . dropout , ( ). , dropout- (. 8). Monte Carlo Dropout (MC Dropout) . , . ( dropout) Mutual Information (MI). MI , , — , . .

. 8. MC Dropout BALD

, QBC MC Dropout uncertainty sampling. , (. 9).

. 9. uncertainty sampling QBC ( - , - , - )

. 9. uncertainty sampling ( QBC ) ( — , — , — )

BALD. , Mutual Information :

a_{B A L D} = H (y_{1}, . . ., y_{n}) - E [H (y_{1}, . . ., y_{n} | ω)] [5]

$a_{BALD}=\mathbb{H}(y_1,...,y_n)-\mathbb{E}[\mathbb{H}(y_1,...,y_n|\omega)] [5]$

E [H (y_{1}, . . ., y_{n} | w)] = \frac{1}{k} \sum_{i = 1}^{n} \sum_{j = 1}^{k} H (y_{i} | w_{j}) [6]

$\mathbb{E}[\mathbb{H}(y_1,...,y_n|w)]=\frac{1}{k}\sum_{i=1}^{n}\sum_{j=1}^{k}\mathbb{H}(y_i|w_j) [6]$

$n$ — , $k$ — .

(5) , — . , , . BALD . 10.

. 10. BALD

, , .

query-by-committee BALD , . , uncertainty sampling. , — $O(kp\log(q))$ , $p$ — , $q$ — , $k$ — , .

BALD tf.keras, . PyTorch, dropout , batch normalization , .

№2: batch normalization

batch normalization. batch normalization — , . , , , , batch normalization. , . , . BALD. (. 11).

. 11. batch normalization BALD

, , .

batch normalization, . , .

Learning loss

. , . , .

, . — . , . learning loss, . , (. 12).

. 12. Learning loss

learning loss . .

. , . «» learning loss: , , . ideal learning loss (. 13).

. 13. ideal learning loss

, learning loss.

, . , , - , . :

(2000 ), ;
10000 ( );
;
;
100 ;
, , 1;
.

, , . , ( margin sampling).

1.

		p-value
loss	-0,2518	0,0115
margin	0,2461	0,0136

, margin sampling — , , , . c .

: ?

, , (. 14).

. 14. ideal learning loss ideal learning loss

, MNIST :

2. MNIST

		p-value
loss	0,2140	0,0326
	0,2040	0,0418

ideal learning loss , (. 15).

Figure: 15. Actively training the character classifier from the MNIST dataset with the ideal learning loss strategy. Blue graph - ideal learning loss, orange - passive learning

. 15. MNIST ideal learning loss. — ideal learning loss, —

, , , , . .

learning loss , uncertainty sampling: $O(p\log{q})$ , $p$ — , $q$ — . , , . , .

, . . , margin sampling — . 16.

Figure: 16. Comparison of training on randomly selected data (passive training) and on data selected by the margin sampling strategy

. 16. ( ) , margin sampling

: ( — margin sampling), — , , . ≈25 . . 25% — .

, . , , .

, , . , :

batch size;
, , — , batch normalization.

Improving the markup of multimodal data: fewer assessors, more layers