Cross-sampling, or how to squeeze a few thousandths out of the dataset

This article is about images and classification: a small study of the properties of MNIST, one more touch to its portrait (and a hint for solving other similar problems).

There are many publications online about interpreting a particular neural network and about the significance and contribution of individual points (pixels) to learning. There is a lot of work on finding whiskers, tails and other parts and on assessing their importance. I will not play librarian and compile a list here. I will just tell you about my experiment.

It all started with the excellent video talk “How robots think. Interpretation of ML-models”, which I watched on the advice of a smart person and which, like any worthwhile piece of work, raised many questions. For example: how unique are the key points of a dataset?

Or another question: there are many articles online about how changing a single point of an image can significantly distort a network's prediction. (Let me remind you that this article considers only classification problems.) How unique is this insidious point? Are there such points in the natural MNIST data, and if they are found and thrown out, will the accuracy of the trained network be higher?

The author, following his traditional method of getting rid of everything unnecessary, decided not to lump everything together and chose a simple, reliable and effective way to study the questions posed.

As the experimental problem, I took the familiar MNIST ( yann.lecun.com/exdb/mnist ) and its classification.

As the experimental network, I chose the classic example network from the Keras team, recommended for beginners:

github.com/keras-team/keras/blob/master/examples/mnist_cnn.py

The study itself is very simple.

Let's train the Keras network with the following stopping criterion: no further increase in accuracy on the test set. That is, we train the network until the training accuracy becomes noticeably greater than the validation accuracy and the validation accuracy has not improved for 15 epochs. In other words, the network has stopped learning and overfitting has begun.
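The stopping rule just described can be sketched as a small helper. This is only an illustration, not the code used in the experiment: the function name `should_stop` and the parameters `ratio` and `patience` are introduced here, while the values 0.995 and 15 are taken from the training loop shown later in the article. The per-epoch history is made up.

```python
def should_stop(train_acc, val_acc, stale_epochs, ratio=0.995, patience=15):
    """Return True when training should stop.

    train_acc    -- accuracy on the training data for the current epoch
    val_acc      -- accuracy on the held-out data for the current epoch
    stale_epochs -- epochs since the best val_acc last improved
    """
    overfitting = train_acc * ratio > val_acc   # training runs ahead of validation
    stalled = stale_epochs > patience           # no new best for too long
    return overfitting and stalled


# Hypothetical per-epoch history: (training accuracy, validation accuracy)
history = [(0.90, 0.91), (0.95, 0.94)] + [(0.999, 0.93)] * 20

best_val, stale, stopped_at = 0.0, 0, None
for epoch, (train_acc, val_acc) in enumerate(history):
    if val_acc > best_val:
        best_val, stale = val_acc, 0
    else:
        stale += 1
    if should_stop(train_acc, val_acc, stale):
        stopped_at = epoch
        break

print("stopped at epoch", stopped_at, "best validation accuracy", best_val)
```

Both conditions must hold at once: a stalled validation accuracy alone does not stop training while the training accuracy is still close to it.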

From the MNIST dataset we will make 324 new datasets by discarding groups of points, and we will train the same network on each of them under exactly the same conditions and with the same initial weights.

Let's get started. I think it is right to lay out all the code, from the first line to the last, even if readers have obviously seen it many times.

We load the libraries and the MNIST dataset, downloading it if it has not been downloaded yet.

Then we convert it to 'float32' and normalize it to the range 0.0 to 1.0.

The preparation is over.

'''Trains a simple convnet on the MNIST dataset.
Gets to 99.25% test accuracy after 12 epochs
(there is still a lot of margin for parameter tuning).
16 seconds per epoch on a GRID K520 GPU.
'''

from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

import numpy as np
import os

num_classes = 10

# input image dimensions
img_rows, img_cols = 28, 28

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= np.max(x_train)
x_test /= np.max(x_test)

XX_test = np.copy(x_test)
XX_train = np.copy(x_train)
YY_test = np.copy(y_test)
YY_train = np.copy(y_train)


print('x_train shape:', XX_train.shape)
print('x_test shape:', XX_test.shape)

Let us store in variables the file names for the model and the weights, as well as the accuracy metric and the loss function of our network. This is not in the original code, but it is needed for the experiment.

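os.makedirs("./data", exist_ok=True)  # make sure the output directory exists

f_model = "./data/mnist_cnn_model.h5"
f_weights = "./data/mnist_cnn_weights.h5"
accu_f = 'accuracy'
# the original Keras example uses categorical_crossentropy;
# all results in this article were obtained with binary_crossentropy
loss_f = 'binary_crossentropy'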

The network architecture is exactly the same as in

github.com/keras-team/keras/blob/master/examples/mnist_cnn.py .

We save the network and its weights to disk. All our training runs will start from the same initial weights:

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model = Sequential()

model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=[loss_f], optimizer=Adam(lr=1e-4), metrics=[accu_f])
model.summary()

model.save_weights(f_weights)
model.save(f_model)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 24, 24, 64)        18496     
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 12, 12, 64)        0         
_________________________________________________________________
dropout (Dropout)            (None, 12, 12, 64)        0         
_________________________________________________________________
flatten (Flatten)            (None, 9216)              0         
_________________________________________________________________
dense (Dense)                (None, 128)               1179776   
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1290      
=================================================================
Total params: 1,199,882
Trainable params: 1,199,882
Non-trainable params: 0
_________________________________________________________________

Let's start by training on the original MNIST to get a benchmark, the baseline accuracy.

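x_test = np.copy(XX_test)
x_train = np.copy(XX_train)
batch_size = 5000  # same batch size as in the main experiment
max_accu = 0.      # best validation accuracy seen so far
s0 = 0             # epochs since max_accu last improved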

if os.path.isfile(f_model):
    model = load_model(f_model)
    model.load_weights(f_weights, by_name=False)

    step = 0
    while True:
        
        fit = model.fit(x_train, y_train,
                  batch_size=batch_size,
                  epochs=1,
                  verbose=0,
                  validation_data=(x_test, y_test)
                )
        
        current_accu = fit.history[accu_f][0]
        current_loss = fit.history['loss'][0]
        val_accu = fit.history['val_'+accu_f][0]
        val_loss = fit.history['val_loss'][0]
        print("\x1b[2K","accuracy {0:12.10f} loss {1:12.10f} step {2:5d} val_accu {3:12.10f} val_loss {4:12.10f}  ".\
                          format(current_accu, current_loss, step, val_accu, val_loss), end="\r")
    
        step += 1
        if val_accu > max_accu:
            s0 = 0
            max_accu = val_accu
        else:
            s0 += 1
        if current_accu * 0.995 > val_accu and s0 > 15:
            break
else:
    print("model not found ")

accuracy 0.9967333078 loss 0.0019656278 step   405 val_accu 0.9916999936 val_loss 0.0054226643

Now let's start the main experiment. We take all 60,000 labeled training pictures from the original dataset and zero out points around a 9x9 square whose top-left corner is at (i, j); in the code below this clears the block above and to the left of the square and the block below and to the right of it. This gives us 324 experimental datasets, and we compare the result of training the network on each of them with training on the original dataset. We train the same network with the same initial weights.

batch_size = 5000
s0 = 0
max_accu = 0.

for i in range(28 - 9):
    for j in range(28 - 9):
        print("\ni= ", i, "  j= ",j)
        x_test = np.copy(XX_test)
        x_train = np.copy(XX_train)

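        # Clear the block above and to the left of the 9x9 square at (i, j)
        # (the indexing assumes the channels_last layout) ...
        x_train[:,:i,:j,:] = 0.
        x_test [:,:i,:j,:] = 0.

        # ... and the block below and to the right of it. Rows and columns
        # beside the square are left untouched by these slices.
        x_train[:,i+9:,j+9:,:] = 0.
        x_test [:,i+9:,j+9:,:] = 0.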

        if os.path.isfile(f_model):
            model = load_model(f_model)
            model.load_weights(f_weights, by_name=False)
        else:
            print("model not found ")
            break

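        step = 0
        s0 = 0          # reset the patience counter for this dataset
        max_accu = 0.   # reset the best validation accuracy for this dataset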
        while True:
            
            fit = model.fit(x_train, y_train,
                      batch_size=batch_size,
                      epochs=1,
                      verbose=0,
                      validation_data=(x_test, y_test)
                    )
            
            current_accu = fit.history[accu_f][0]
            current_loss = fit.history['loss'][0]
            val_accu = fit.history['val_'+accu_f][0]
            val_loss = fit.history['val_loss'][0]
            print("\x1b[2K","accuracy {0:12.10f} loss {1:12.10f} step {2:5d} val_accu {3:12.10f} val_loss {4:12.10f}  ".\
                              format(current_accu, current_loss, step, val_accu, val_loss), end="\r")
        
            step += 1
            if val_accu > max_accu:
                s0 = 0
                max_accu = val_accu
            else:
                s0 += 1
            if current_accu * 0.995 > val_accu and s0 > 15:
                break
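To see exactly which points the slicing in the loop above removes, it can be applied to a dummy all-ones picture and compared with a mask that keeps strictly the 9x9 square. This is a sketch under assumptions: the helper names `zero_corners` and `keep_window` are introduced here for illustration, the strict window is an alternative and not the article's code, and the array uses the channels_last layout.

```python
import numpy as np

SIZE, WIN = 28, 9


def zero_corners(x, i, j, win=WIN):
    """Same slicing as in the loop above: clears two corner blocks."""
    out = x.copy()
    out[:, :i, :j, :] = 0.0
    out[:, i + win:, j + win:, :] = 0.0
    return out


def keep_window(x, i, j, win=WIN):
    """Alternative (assumption): keep only the win x win square at (i, j)."""
    out = np.zeros_like(x)
    out[:, i:i + win, j:j + win, :] = x[:, i:i + win, j:j + win, :]
    return out


# Dummy batch with one all-ones "picture", channels_last layout
x = np.ones((1, SIZE, SIZE, 1), dtype="float32")

i, j = 5, 7
kept_by_corners = int(zero_corners(x, i, j).sum())
kept_by_window = int(keep_window(x, i, j).sum())
positions = [(a, b) for a in range(SIZE - WIN) for b in range(SIZE - WIN)]

print("points kept by the loop's slicing:", kept_by_corners)
print("points kept by a strict 9x9 window:", kept_by_window)
print("window positions visited by the loops:", len(positions))
```

For i = 5, j = 7 the loop's slicing keeps 581 of the 784 points, while a strict window would keep 81. The two nested `range(28 - 9)` loops visit 19 x 19 = 361 positions, which is worth checking against the dataset count quoted in the text before repeating the experiment.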

It makes no sense to post all 324 results here; if anyone is interested, I can send them personally. The calculation takes several days, in case someone wants to repeat it.

As it turned out, a network trained on such clipped pictures can learn worse, which is obvious, but also better, which is not at all obvious.

For example:

i = 0 j = 14

accuracy 0.9972333312 loss 0.0017946947 step 450 val_accu 0.9922000170 val_loss 0.0054322388

i = 18, j = 1

accuracy 0.9973166585 loss 0.0019487827 step 415 val_accu 0.9922000170 val_loss 0.0053000450

We throw away part of each picture of handwritten digits, keeping the 9x9 square, and the quality of learning and recognition improves!

It is also clear that there is more than one such special area that improves network quality, and more than two: these two are given only as examples.
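To put these numbers in perspective, the difference from the baseline can be converted into a count of test pictures. A small sketch, assuming the standard MNIST test set of 10,000 images; the accuracy values are copied from the logs above, and the names `baseline`, `runs` and `TEST_IMAGES` are introduced here only for illustration.

```python
# Validation accuracies copied from the training logs in this article
baseline = 0.9916999936
runs = {(0, 14): 0.9922000170, (18, 1): 0.9922000170}

TEST_IMAGES = 10000  # size of the standard MNIST test set

gains = {}
for (i, j), acc in runs.items():
    delta = acc - baseline
    # number of additional test pictures classified correctly
    gains[(i, j)] = round(delta * TEST_IMAGES)
    print("i =", i, "j =", j, "gain", round(delta, 4), "=", gains[(i, j)], "images")
```

In both runs the network classifies five more test pictures correctly than the baseline, a gain of about five ten-thousandths.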

The result of this experiment and preliminary conclusions.

Any natural dataset (I do not think that LeCun deliberately distorted anything) contains not only points that are essential for learning, but also points that interfere with it. The task of finding these "harmful" points becomes pressing: they exist, even if they are not visible.
You can stack and blend not only along the dataset, selecting images in groups, but also across it, selecting areas of the images for splitting and then proceeding as usual. In this case such an approach improves the quality of training, and there is hope that in a similar task this stacking "across" will also add quality. And on kaggle.com, a few ten-thousandths sometimes (almost always) allow you to significantly raise your standing and rating.

Thank you for your attention.
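The article does not show code for blending "across", so the following is only one possible reading: average the class probabilities of several models, each trained on a differently clipped version of the pictures. The function name `blend` and the probability arrays are hypothetical; with real Keras models each array would come from `model.predict` on the correspondingly clipped test pictures.

```python
import numpy as np


def blend(prob_list):
    """Average class probabilities predicted by several models."""
    return np.mean(np.stack(prob_list, axis=0), axis=0)


# Hypothetical softmax outputs of three models for two pictures, three classes
p_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p_b = np.array([[0.5, 0.4, 0.1], [0.4, 0.3, 0.3]])
p_c = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])

blended = blend([p_a, p_b, p_c])
labels = blended.argmax(axis=1)  # class with the highest averaged probability

print(blended)
print("predicted classes:", labels)
```

Plain averaging is the simplest choice; weighting the models by their validation accuracy or training a second-level model on their outputs are common variations.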
