OpenAI's CLIP Neural Network: A Classifier That Doesn't Need to Be Trained. Long Live Learning Without Training

Can you imagine an image classifier that solves almost any task and doesn't need to be trained at all? Got the picture? So it would have to be a universal classifier, right? Exactly! This is CLIP, a new neural network from OpenAI. This breakdown of CLIP is part of the series "Disassembling and Assembling Neural Networks", with Star Wars as the running example!

Examples of image classification by the CLIP neural network using the "learning without training" (zero-shot) method on various datasets, including ImageNet

At its core, CLIP is a pairwise cosine similarity between text representations and image representations.
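
The idea of scoring images against texts by cosine similarity can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names and the toy 3-dimensional embeddings are hypothetical and are not real CLIP outputs or part of any OpenAI API. In practice the embeddings would come from the image and text encoders.

```python
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def cosine_similarity_matrix(image_emb, text_emb):
    """Return an [n_images, n_texts] matrix of pairwise cosine similarities."""
    return l2_normalize(image_emb) @ l2_normalize(text_emb).T

def zero_shot_classify(image_emb, text_emb):
    """For every image, pick the index of the most similar text prompt."""
    return np.argmax(cosine_similarity_matrix(image_emb, text_emb), axis=1)

# Toy embeddings (hypothetical values standing in for encoder outputs).
text_emb = np.array([[1.0, 0.0, 0.0],   # stands in for a prompt such as "a photo of a cat"
                     [0.0, 1.0, 0.0]])  # stands in for a prompt such as "a photo of a dog"
image_emb = np.array([[0.9, 0.1, 0.0],
                      [0.2, 2.0, 0.1]])

similarities = cosine_similarity_matrix(image_emb, text_emb)
predictions = zero_shot_classify(image_emb, text_emb)
```

Because both sides are normalized first, the length of an embedding does not affect the score. Only its direction does, which is why the second toy image is still matched to the second prompt despite its larger norm.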

Picture.  Darth Vader kills his son

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
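
The pseudocode above leaves l2_normalize and cross_entropy_loss undefined. A runnable NumPy sketch of the same symmetric loss might look like the following. The helper implementations, the random stand-ins for encoder outputs, and the value used for t are assumptions made for illustration. It computes the loss only, with no gradients or training loop.

```python
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    """Scale vectors along `axis` to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(x, axis):
    """Numerically stable log-softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def cross_entropy_loss(logits, labels, axis):
    """Mean cross-entropy with the softmax taken along `axis`.

    axis=1: each row is a distribution over columns, labels[k] picks the column.
    axis=0: each column is a distribution over rows, labels[k] picks the row.
    """
    log_probs = log_softmax(logits, axis=axis)
    idx = np.arange(len(labels))
    picked = log_probs[idx, labels] if axis == 1 else log_probs[labels, idx]
    return -picked.mean()

def clip_loss(I_f, T_f, W_i, W_t, t):
    """Symmetric contrastive loss over a batch of n matched image-text pairs."""
    I_e = l2_normalize(I_f @ W_i)            # [n, d_e]
    T_e = l2_normalize(T_f @ W_t)            # [n, d_e]
    logits = (I_e @ T_e.T) * np.exp(t)       # [n, n] scaled cosine similarities
    labels = np.arange(logits.shape[0])      # pair k matches pair k (the diagonal)
    loss_i = cross_entropy_loss(logits, labels, axis=0)
    loss_t = cross_entropy_loss(logits, labels, axis=1)
    return (loss_i + loss_t) / 2

# Random stand-ins for encoder outputs and projections (illustrative only).
rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 4, 8, 6, 5
I_f = rng.normal(size=(n, d_i))
T_f = rng.normal(size=(n, d_t))
W_i = rng.normal(size=(d_i, d_e))
W_t = rng.normal(size=(d_t, d_e))
loss = clip_loss(I_f, T_f, W_i, W_t, t=np.log(1 / 0.07))  # t value is an assumption
```

The correct pairs sit on the diagonal of logits, so the loss is small when each image is most similar to its own text and large when the pairing is scrambled.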

Zero-shot CLIP turns out to be more robust to distribution shift than a model trained on ImageNet.

Two possible modes of using the CLIP hybrid neural network

