
Paper Insights: Emerging Properties in Self-Supervised Vision Transformers


I am continuing my journey through influential self-supervised learning papers, and this week I decided to read the DINO paper. I was particularly interested in this one since DINOv3 from Meta came out a month ago, and Dr. Jia-Bin Huang released a paper summary on YouTube about the DINO series (I highly recommend his detailed videos: link). Although this work was released in 2021, it has gone through two more iterations over the last four years. In this article, I won’t cover the two most recent papers, since most of the novelty and fundamental findings come from the first one. At first glance, this paper and methodology seem very similar to BYOL, but there are still some significant differences.

Vision Transformers:

Since this paper was released in 2021, Vision Transformers were just starting to become popular in computer vision, so the paper spends considerable time on their benefits. The authors wanted to see whether Vision Transformers perform better than convnets in self-supervised settings, and whether they learn representations that convnets do not. The drawback of transformers is that they are computationally expensive and require lots of data. Transformers found success in natural language processing through the self-supervised setting, in works such as BERT and GPT. Thus, the main motivation for this paper was to investigate the impact of self-supervised learning on Vision Transformers and their features.

Approach:

Knowledge Distillation: A student network g_θs is trained to match the output of a teacher network g_θt, where θs and θt denote their respective parameters.

General Idea:

Given an input image x, both networks produce an encoded representation and then apply a temperature-scaled softmax to obtain probability distributions P_s and P_t. With a dynamically built teacher network g_θt, we want to match P_s to P_t by minimizing the cross-entropy H(a, b) = −a log b.

Normalized softmax to create probability distributions
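As written in the paper, the student distribution over the K output dimensions uses a softmax with temperature τ_s (the teacher is analogous, with temperature τ_t):

```latex
P_s(x)^{(i)} = \frac{\exp\big(g_{\theta_s}(x)^{(i)} / \tau_s\big)}{\sum_{k=1}^{K} \exp\big(g_{\theta_s}(x)^{(k)} / \tau_s\big)}
```

A lower temperature sharpens the distribution; DINO uses a lower temperature for the teacher than for the student.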

Augmentations (Cropping):

They first create numerous crops (views) of the given image using a multi-crop strategy. They generate 2 global views, x_1g and x_2g, and numerous local views at a smaller resolution. The standard setting in this paper is 2 global views at resolution 224² and several local views at 96².

Both local and global crops are passed through the student network, while the teacher only sees the global crops, as in the sketch below.
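Here is a minimal sketch of what such a multi-crop pipeline could look like in PyTorch; the scale ranges, number of local views, and omitted color augmentations are assumptions for illustration, not values checked against the official code:

```python
from torchvision import transforms

# Global crops at 224x224 and local crops at 96x96 (the paper's
# standard setting); the scale ranges below are illustrative assumptions.
global_crop = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.4, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
local_crop = transforms.Compose([
    transforms.RandomResizedCrop(96, scale=(0.05, 0.4)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def multi_crop(image, n_local=6):
    """Return 2 global views followed by n_local local views."""
    views = [global_crop(image), global_crop(image)]   # x_1g, x_2g
    views += [local_crop(image) for _ in range(n_local)]
    return views

# Routing: the student sees all views, the teacher only the global ones.
# student_outs = [student(v) for v in views]
# teacher_outs = [teacher(v) for v in views[:2]]
```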

Loss:

The loss minimizes the cross-entropy between P_t(x) and P_s(x’), where x ranges over the global views and x’ over all global and local views other than x. They optimize with stochastic gradient descent.

Minimizing cross-entropy between student and teacher networks
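Written out, the objective from the paper is:

```latex
\min_{\theta_s} \sum_{x \in \{x_1^{g},\, x_2^{g}\}} \;\; \sum_{\substack{x' \in V \\ x' \neq x}} H\big(P_t(x),\, P_s(x')\big), \qquad H(a, b) = -a \log b
```

where V is the set of all views (global and local).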

Although both networks (student and teacher) have the same architecture, they don’t share the same parameters. The teacher network is not trained directly; rather, it is built entirely from past states of the student network. Similar to BYOL, they use an EMA (exponential moving average) momentum encoder to update the teacher weights:

Exponential Moving Average update of teacher network parameters
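In symbols, the update is:

```latex
\theta_t \leftarrow \lambda\, \theta_t + (1 - \lambda)\, \theta_s
```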

with λ following a cosine schedule from 0.996 to 1. Interestingly, the teacher consistently performs better than the student throughout training.
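A minimal sketch of the momentum update in PyTorch; the exact shape of the cosine schedule is an assumption based on the description above:

```python
import math
import torch

def momentum_schedule(step, total_steps, base=0.996, final=1.0):
    # Cosine schedule taking lambda from 0.996 at step 0 to 1.0 at the end.
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return final - (final - base) * cos

@torch.no_grad()
def ema_update(student, teacher, lam):
    # theta_t <- lam * theta_t + (1 - lam) * theta_s
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(lam).add_(p_s, alpha=1 - lam)
```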

Architecture:

For the architecture of the teacher and student networks, they use a ViT or ResNet backbone followed by a projection head: a 3-layer MLP, l_2 normalization, and a weight-normalized fully connected layer. Additionally, the head uses no batch normalization, so with a ViT backbone DINO is entirely BN-free.
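A sketch of what this head could look like; the hidden, bottleneck, and output dimensions here are assumptions for illustration:

```python
import torch.nn as nn
import torch.nn.functional as F

class DINOHead(nn.Module):
    """3-layer MLP -> l2 normalization -> weight-normalized linear layer.

    Dimensions are illustrative assumptions, not the paper's exact values.
    """
    def __init__(self, in_dim, out_dim=65536, hidden_dim=2048, bottleneck_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        self.last_layer = nn.utils.weight_norm(
            nn.Linear(bottleneck_dim, out_dim, bias=False))

    def forward(self, x):
        x = self.mlp(x)
        x = F.normalize(x, dim=-1, p=2)  # l2 normalization
        return self.last_layer(x)
```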

Collapse has been an issue in SSL methods and is usually avoided with normalization or contrastive learning. The authors instead use centering, which prevents any single dimension from dominating. They implement it by adding a bias term c to the teacher output, updating c with an EMA over batch statistics. Since centering alone would push the output toward a uniform distribution, it is balanced by sharpening the teacher output with a low softmax temperature.

EMA update of the bias term c used to avoid collapse
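The center update from the paper, for a batch of size B:

```latex
c \leftarrow m\, c + (1 - m)\, \frac{1}{B} \sum_{i=1}^{B} g_{\theta_t}(x_i)
```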
Overall representation of the DINO architecture
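Putting the pieces together, here is a simplified sketch of the loss computation; the temperatures, the centering momentum, and subtracting the center before the teacher softmax follow common implementations and should be read as assumptions:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_outs, teacher_outs, center,
              tau_s=0.1, tau_t=0.04, center_momentum=0.9):
    """Cross-entropy between teacher (global views) and student (all views).

    student_outs: list of logit tensors, one per view (2 global + n local).
    teacher_outs: list of logit tensors for the 2 global views only.
    center: running bias term c of shape (1, K).
    """
    total, n_terms = 0.0, 0
    for t_idx, t_out in enumerate(teacher_outs):
        # Teacher: center, then sharpen with a low temperature; no gradients.
        p_t = F.softmax((t_out - center) / tau_t, dim=-1).detach()
        for s_idx, s_out in enumerate(student_outs):
            if s_idx == t_idx:  # skip the view the teacher itself saw
                continue
            log_p_s = F.log_softmax(s_out / tau_s, dim=-1)
            total += -(p_t * log_p_s).sum(dim=-1).mean()
            n_terms += 1
    # EMA update of the center from the teacher's batch statistics.
    batch_center = torch.cat(teacher_outs).mean(dim=0, keepdim=True)
    new_center = center_momentum * center + (1 - center_momentum) * batch_center
    return total / n_terms, new_center
```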

Results:

On ImageNet:

  • With ResNet-50, they perform on par with state-of-the-art methods.
  • With ViT, they show improvements over BYOL, MoCo v2, and SwAV.

DINO performs even better when scaled up to larger ViT models.

DINO also performs well on image retrieval, copy detection, video instance segmentation, and other downstream transfer learning tasks.


In conclusion, self-supervised learning with ViTs can extract useful features that outperform both supervised pretraining and convnets.