Applying the VGG16 Pretrained Model to Recommendations Based on Product Images

Today I want to share my experience of using a neural network to find similar products for an online store's recommendation system. I will focus mainly on the technical side. I decided to write this article on Habr because, when I was starting this project, I found one suitable solution on Habr, but it turned out to be outdated and had to be modified. So I decided to update the material for anyone who needs a similar solution.






I should also say that this is my first more or less serious Data Science project, so if more experienced colleagues see something that could be improved, I will be glad to hear their advice.





I'll start with a little background on why this recommendation logic was chosen, namely recommendations based on similar products rather than, for example, collaborative filtering. The system was developed for an online store that sells watches, where up to 90% of the users who come to the site never return, so there is little per-user history for collaborative filtering to work with. The task was to increase the number of page views from users who land on specific product pages through advertising. Such users viewed one page and left the site if the product did not suit them.





I should say that in this project I had no opportunity to integrate with the store's backend, a classic story for small and medium-sized online stores. I could rely only on a system that I would build separately from the site. So, as the visual solution on the site itself, I decided to make a pop-up js widget. A single line of html adds the js to the page; the js reads the title of the page the user landed on and passes it to the service's backend. If the backend finds the product in its database of pre-loaded products, it then looks up the recommendations in the pre-built recommendations database and returns them to the js, which displays them in the widget. Also, to reduce the impact on page loading speed, the js creates an iframe in which it does all the work of displaying the widget. Among other things, this also avoids clashes between the widget's css classes and the site's.
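
The lookup that the backend performs can be sketched roughly as follows. This is only an illustration and not the service's actual code: `PRODUCTS`, `RECOMMENDATIONS` and `recommend_for_title` are hypothetical names, and the data is made up.

```python
# Hypothetical in-memory stand-ins for the two pre-built databases.
PRODUCTS = {            # page title -> internal product id
    "Watch A": 0,
    "Watch B": 1,
    "Watch C": 2,
}
RECOMMENDATIONS = {     # internal product id -> ids of similar products
    0: [2, 1],
    1: [0, 2],
    2: [0, 1],
}

def recommend_for_title(page_title):
    """Return ids of similar products, or an empty list if the page is unknown."""
    product_id = PRODUCTS.get(page_title.strip())
    if product_id is None:
        return []  # no matching product: nothing to recommend
    return RECOMMENDATIONS.get(product_id, [])

print(recommend_for_title("Watch A"))
print(recommend_for_title("About us"))
```

Returning an empty list for an unknown title is an assumption made here so that the widget can simply stay hidden on pages that are not product pages.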





, Data Science. , . , , - .










( , A/B-) - ; , , - .










Import the libraries:





!pip install theano

%matplotlib inline
from keras.models import Sequential
from keras.layers.core import Flatten, Dense, Dropout
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D
from keras.optimizers import SGD
import cv2, numpy as np
import os
import h5py
from matplotlib import pyplot as plt

from keras.applications import vgg16
from keras.applications import Xception
from keras.preprocessing.image import load_img,img_to_array
from keras.models import Model
from keras.applications.imagenet_utils import preprocess_input

from PIL import Image
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

import theano
theano.config.openmp = True
      
      



Map external product IDs (the numeric image file names) to internal indices and back:





import re
def sorted_alphanumeric(data):
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ] 
    return sorted(data, key=alphanum_key)

dirlist = sorted_alphanumeric(os.listdir('images'))

r1 = []
r2 = []
for i,x in enumerate(dirlist):
    if x.endswith(".jpg"):
        r1.append((int(x[:-4]),i))
        r2.append((i,int(x[:-4])))

extid_to_intid_dict = dict(r1)
intid_to_extid_dict = dict(r2)
      
      



Set the parameters:





imgs_path = "images/"
imgs_model_width, imgs_model_height = 224, 224

nb_closest_images = 3  # number of most similar images to retrieve
      
      



Load the model (pretrained on ImageNet):





vgg_model = vgg16.VGG16(weights='imagenet')
      
      



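We drop the last layer (the 1000-class ImageNet classifier), since we do not need class predictions. We need the 4096-dimensional vector produced by the layer before it, which serves as the image's feature vector.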





— , .





Build the feature extractor:





feat_extractor = Model(inputs=vgg_model.input, outputs=vgg_model.get_layer("fc2").output)
      
      



Print the layers of the resulting CNN:





feat_extractor.summary()
      
      



Collect the image files (the images are assumed to have been downloaded already from the store's xml file):





files = [imgs_path + x for x in os.listdir(imgs_path) if x.endswith(".jpg")]

print("number of images:",len(files))
      
      



Sort the files in natural order:





import re

def atof(text):
    try:
        retval = float(text)
    except ValueError:
        retval = text
    return retval

def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    float regex comes from https://stackoverflow.com/a/12643073/190597
    '''
    return [ atof(c) for c in re.split(r'[+-]?([0-9]+(?:[.][0-9]*)?|[.][0-9]+)', text) ]

files.sort(key=natural_keys)
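
Why natural order matters is easy to see on a toy list. The sketch below uses a simplified integer-only key (`natural_key` is an illustrative name, not one of the functions above) and made-up file names. With plain string sorting, 10.jpg lands before 2.jpg, so the row order of the feature matrix would no longer match the numeric product ids.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

names = ["10.jpg", "2.jpg", "1.jpg", "21.jpg"]

print(sorted(names))                    # plain string order
print(sorted(names, key=natural_key))   # numeric-aware order
```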
      
      



Load an image in PIL format:





original = load_img(files[1], target_size=(imgs_model_width, imgs_model_height))
plt.imshow(original)
plt.show()
print("image loaded successfully!")
      
      



Convert the PIL image to a numpy array:

In PIL the order is width, height, channel

In Numpy the order is height, width, channel





numpy_image = img_to_array(original)  # convert the PIL image to a numpy array
      
      



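Convert the image to batch format.

expand_dims adds an extra dimension to the data at a given axis.

The network expects its input as a matrix of shape (batchsize, height, width, channels). So we add the extra dimension at axis 0.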





image_batch = np.expand_dims(numpy_image, axis=0)  # add the batch dimension at axis 0
print('image batch size', image_batch.shape)
      
      



Prepare the image for the VGG model:





processed_image = preprocess_input(image_batch.copy())  # preprocess a copy of the batch for VGG
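
For reference, here is roughly what this preprocessing does, as far as I understand the default ("caffe"-style) mode of `preprocess_input`: it flips RGB to BGR and subtracts the per-channel ImageNet mean, without scaling. The function below is a numpy sketch for illustration only. `vgg_preprocess_sketch` is a hypothetical name, and in real code the Keras function should be used.

```python
import numpy as np

# Per-channel ImageNet means in BGR order, as used by "caffe"-style preprocessing.
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def vgg_preprocess_sketch(batch_rgb):
    """Approximate preprocess_input for VGG16 on a channels-last RGB batch."""
    batch = batch_rgb.astype(np.float32)
    batch = batch[..., ::-1]           # RGB -> BGR
    return batch - IMAGENET_BGR_MEAN   # zero-center each channel, no scaling

demo = np.full((1, 2, 2, 3), 128.0)    # a tiny uniformly grey "image" batch
out = vgg_preprocess_sketch(demo)
print(out.shape)
print(out[0, 0, 0])
```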
      
      



Extract the image features (the output of the fc2 layer):





img_features = feat_extractor.predict(processed_image)
      
      



Check the result:





print("features successfully extracted!")
print("number of image features:",img_features.size)
img_features
      
      



Now do the same for all the images: load them and preprocess them.





importedImages = []

for f in files:
    filename = f
    original = load_img(filename, target_size=(224, 224))
    numpy_image = img_to_array(original)
    image_batch = np.expand_dims(numpy_image, axis=0)
    
    importedImages.append(image_batch)
    
images = np.vstack(importedImages)

processed_imgs = preprocess_input(images.copy())
      
      



Extract the features for all images:





imgs_features = feat_extractor.predict(processed_imgs)

print("features successfully extracted!")
imgs_features.shape
      
      



Compute the cosine similarities between the images:





cosSimilarities = cosine_similarity(imgs_features)
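
Cosine similarity of two vectors a and b is a·b / (|a|·|b|). A plain numpy sketch of the pairwise version, on made-up toy vectors, shows what this step computes (`cosine_similarity_matrix` is an illustrative name; the article itself uses the sklearn function):

```python
import numpy as np

def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # guard against all-zero rows
    return unit @ unit.T

toy = np.array([[1.0, 0.0],
                [2.0, 0.0],
                [0.0, 3.0]])
sims = cosine_similarity_matrix(toy)
print(np.round(sims, 2))
```

The first two toy vectors point in the same direction, so their similarity is 1 regardless of length, while the third is orthogonal to both and gets 0.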
      
      



Save the result to a pandas dataframe:





columns_name = re.findall(r'[0-9]+', str(files))

cos_similarities_df = pd.DataFrame(cosSimilarities, columns=files, index=files)
cos_similarities_df.head()
      
      



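Next comes a practical problem: size. The store has about 6000 SKUs, so the similarity matrix is 6000 × 6000. Each value is a float between 0 and 1 that takes 8 bytes, so the saved file is large: about 430 MB (130 MB compressed). I wanted to keep the file on GitHub, which does not accept files larger than 100 MB. So the matrix had to be made smaller :) The first step is to round the values: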





cos_similarities_df_2 = cos_similarities_df.round(2)  # cos_similarities_df_2 is the rounded copy of the similarity dataframe
      
      



, . float. pandas float float16 - .





Convert to int:





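cos_similarities_df_2 = cos_similarities_df_2.apply(lambda x: x * 100)

# round before casting: astype truncates, so 28.999999999999996 would become 28
cos_similarities_df_2 = cos_similarities_df_2.apply(lambda x: x.round().astype(np.uint8))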
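
The effect of the dtype change on the raw matrix size is easy to estimate. The sketch below is back-of-the-envelope arithmetic for a 6000 × 6000 matrix plus the same conversion on a made-up 2 × 2 matrix. It ignores file-format overhead, so real file sizes will differ.

```python
import numpy as np

n_items = 6000  # number of SKUs

bytes_float64 = n_items * n_items * np.dtype(np.float64).itemsize
bytes_uint8 = n_items * n_items * np.dtype(np.uint8).itemsize
print(bytes_float64 / 1e6, "MB as float64")
print(bytes_uint8 / 1e6, "MB as uint8")

# The same conversion on a toy matrix: scale to 0..100, round, store as one byte.
sims = np.array([[1.0, 0.8349], [0.8349, 1.0]])
compact = np.round(sims * 100).astype(np.uint8)
print(compact)

# Casting without rounding truncates: 0.29 * 100 is slightly below 29.
truncated = int(np.array([0.29 * 100]).astype(np.uint8)[0])
print(truncated)
```

Two decimal places are usually enough to rank neighbours, and values scaled to 0..100 fit comfortably into a single unsigned byte.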
      
      



31 . .





Save to h5:





cos_similarities_df_2.to_hdf('storage/cos_similarities.h5', key='data')
      
      



The resulting file is about 40 MB, which fits within GitHub's limit :)





Now a function that, for a given product, shows the most similar ones:





import re

# function to retrieve the most similar products for a given one

def retrieve_most_similar_products(given_img):

    print("-----------------------------------------------------------------------")
    print("original product:")
    original = load_img(given_img, target_size=(imgs_model_width, imgs_model_height))
    original_img = int(re.findall(r'[0-9]+', given_img)[0])
    print((df_items_2.iloc[[original_img]]['name'].iat[0], df_items_2.iloc[[original_img]]['pricer_uah'].iat[0], df_items_2.iloc[[original_img]]['url'].iat[0]))
   
    plt.imshow(original)
    plt.show()

    print("-----------------------------------------------------------------------")
    print("most similar products:")

    # sort once; position 0 is the product itself, so skip it
    closest_imgs_scores = cos_similarities_df[given_img].sort_values(ascending=False)[1:nb_closest_images+1]
    closest_imgs = closest_imgs_scores.index

    for i in range(0,len(closest_imgs)):
        original = load_img(closest_imgs[i], target_size=(imgs_model_width, imgs_model_height))
        item = int(re.findall(r'[0-9]+', closest_imgs[i])[0])
        print(item)
        print((df_items_2.iloc[[item]]['name'].iat[0], df_items_2.iloc[[item]]['pricer_uah'].iat[0], df_items_2.iloc[[item]]['url'].iat[0]))
        plt.imshow(original)
        plt.show()
        print("similarity score : ", closest_imgs_scores.iloc[i])

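kbr = ''  # name of the product to find recommendations for
find_rec = int(df_items_2.index[df_items_2['name'] == kbr].tolist()[0])  # df_items_2 is the dataframe of store products (name, price, url, image)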
print(find_rec)

retrieve_most_similar_products(files[find_rec])
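
For the service itself, the plotting is not needed: only the indices of the nearest products matter. A minimal numpy sketch of that lookup is below. `top_k_similar` and the toy matrix are illustrative assumptions, not part of the project's code.

```python
import numpy as np

def top_k_similar(sim_matrix, item, k):
    """Indices of the k items most similar to `item`, excluding the item itself."""
    order = np.argsort(sim_matrix[item], kind="stable")[::-1]  # descending similarity
    order = order[order != item]                                # drop the query item
    return order[:k].tolist()

sims = np.array([[100,  83,  40,  91],
                 [ 83, 100,  55,  60],
                 [ 40,  55, 100,  20],
                 [ 91,  60,  20, 100]], dtype=np.uint8)
print(top_k_similar(sims, item=0, k=2))
```

Filtering out the query index explicitly, instead of skipping the first position, is a deliberate choice in this sketch: if a duplicate image scores the same as the product itself, the product is still the one that gets excluded.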
      
      



:)










, - :





First, create the folders for the data and the images:





import os

if not os.path.exists('storage'):
    os.makedirs('storage')

if not os.path.exists('images'):
    os.makedirs('images')
      
      



, xml - .





Then a function that downloads the image for each product:





# importing required modules
import urllib.error
import urllib.request

image_counter = 0

error_list = []

# download the image for one dataframe row and save it as images/<counter>.jpg
def image_from_df(row):
    global image_counter

    item_id = image_counter

    filename = f'images/{item_id}.jpg'
    image_url = f'{row.image}'

    try:
        conn = urllib.request.urlopen(image_url)

    except urllib.error.HTTPError as e:
        # Return code error (e.g. 404, 501, ...)
        error_list.append(item_id)

    except urllib.error.URLError as e:
        # Not an HTTP-specific error (e.g. connection refused)
        print('URLError: {}'.format(e.reason))
        # record this failure too, so the row is dropped with the other failed downloads
        error_list.append(item_id)

    else:
        # 200: close the probe connection, then download the file
        conn.close()
        urllib.request.urlretrieve(image_url, filename)
        image_counter += 1
      
      



Run it for every product loaded from the xml:





df_items_2.apply(lambda row: image_from_df(row), axis=1)
      
      



, . . . , xml , . , , , , , .





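# error_list holds the counter value at the moment of each failure; the counter
# advances only on success, so dropping these positions in order (re-indexing
# after every drop) keeps dataframe rows aligned with the image file numbers
for i in error_list:
    df_items_2.drop(df_items_2.index[i], inplace=True)
    df_items_2.reset_index(drop=True, inplace=True)

print(f'Positions of items with download errors: {error_list}')
print(len(error_list))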
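
Why dropping by these positions keeps the rows and the image files aligned is not obvious at first glance, so here is a small simulation of the same logic on plain Python lists. `simulate_download` and the toy data are illustrative and not part of the project.

```python
def simulate_download(urls_ok):
    """Mimic the counter logic: the counter advances only on success."""
    counter, errors, saved = 0, [], []
    for row_number, ok in enumerate(urls_ok):
        if ok:
            saved.append((counter, row_number))  # file <counter>.jpg holds this row
            counter += 1
        else:
            errors.append(counter)               # position to drop later
    return errors, saved

rows = ["a", "b", "c", "d", "e"]
errors, saved = simulate_download([True, False, False, True, True])

for i in errors:          # same order as the drop loop; a list re-indexes itself
    del rows[i]

print(errors)
print(rows)
print([file_id for file_id, _ in saved])
```

Two consecutive failures record the same position twice, and because the rows are re-indexed after every drop, removing that position twice removes exactly the two failed rows.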
      
      



, . , - ! )





, - )





P.S. , VGG - VGG19. , .





P.S.S , : , Senior JavaScript Developer ( js CORS-); , Senior Python Developer Senior Engineer ( Docker CI/CD pipeline); SkillFactory, SkillFactory Accelerator ( , Data Science ); (, A/B- ); (another mentor who helped with the understanding of NLP problems and, in particular, the work of the tycoons when creating chat bots (another project that I worked on as part of the accelerator and which I may talk about a little later); this is Emil Maggeramov (mentor, who in general oversaw my progress in the accelerator for the creation of this project); these are classmates Valery Kuryshev and Georgy Bregman (regularly called once a week and shared the experience gained during the week).







