
Abstract
Would self-supervised learning (SSL) behave differently on a ViT than on a convnet?
Unlike supervised ViTs and convnets, self-supervised ViT features contain explicit information about the semantic segmentation of an image.

These features are also excellent for kNN classification, reaching 78.3% top-1 accuracy on ImageNet even with a small ViT.
We also reveal the importance of 1) a momentum encoder*, 2) multi-crop training, and 3) small patches with ViT. Based on these findings, we propose DINO (self-distillation with no labels). With linear evaluation, DINO + ViT achieves 80.1% top-1 accuracy on ImageNet. *MoCo: Momentum Contrast for Unsupervised Visual Representation Learning. Here it simply means that the teacher network is updated as a moving average of the student.
Introduction
ViT has performed comparably to convnets, but its advantages were not clear: it requires more computation and more training data, and its features did not show any special properties. In this paper, we argue that Transformers' success in vision, as in NLP, lies not in supervised learning but in self-supervised learning.
In vision, SSL methods usually share a similar overall structure but differ in small elements designed to avoid trivial solutions (collapse) or to increase performance. We want to apply these ideas to ViT features.

After doing so, we found the properties described in the abstract and propose DINO based on them. DINO directly predicts the output of a teacher network built with a momentum encoder and is trained with a standard cross-entropy loss. DINO works well with just centering and sharpening of the teacher output, which makes it very simple and flexible compared to other methods that rely on predictors, advanced normalizations, or contrastive losses. We also experimented with combining DINO with small-patch ViTs, and with different configurations depending on the available GPU resources.
Approach
SSL with Knowledge Distillation

DINO follows the overall structure of modern SSL and has some similarities to knowledge distillation (KD).
KD is a method in which a student network is trained to match the output of a given teacher network. Given an input image x, the K-dimensional output of the student is normalized with a temperature softmax to give a probability distribution P_s.
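The original notes omit the equation here; following the paper's notation (g_{\theta_s} is the student network), the student distribution is a temperature-scaled softmax over the K output dimensions:

```latex
P_s(x)^{(i)} = \frac{\exp\!\left(g_{\theta_s}(x)^{(i)} / \tau_s\right)}{\sum_{k=1}^{K} \exp\!\left(g_{\theta_s}(x)^{(k)} / \tau_s\right)}
```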

The temperature parameter \tau_s controls the sharpness of the output distribution. Given a fixed teacher network, the student is trained to minimize the cross-entropy between the teacher's and the student's distributions.
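The missing objective here is, as in the paper, a cross-entropy minimization over the student parameters:

```latex
\min_{\theta_s} \; H\!\left(P_t(x), P_s(x)\right), \qquad H(a, b) = -\sum_{i=1}^{K} a^{(i)} \log b^{(i)}
```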

We now adapt this loss to SSL. First, we use a multi-crop strategy to build a set V of different distorted views (crops) of an image. This set contains two global views (large crops) and several lower-resolution local views. All crops are passed through the student, while only the global views are passed through the teacher, encouraging a "local-to-global" correspondence; our loss is rewritten accordingly.
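Following the DINO paper's notation (x_1^g, x_2^g are the global views, V the full set of crops), the multi-crop loss is:

```latex
\min_{\theta_s} \sum_{x \in \{x_1^g,\, x_2^g\}} \;\; \sum_{\substack{x' \in V \\ x' \neq x}} H\!\left(P_t(x), P_s(x')\right)
```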

This loss works for any number of views, even only two. We use global views at 224 × 224 resolution covering a large (more than 50%) area of the original image, and local views at 96 × 96 resolution covering less than 50%. The two networks share the same architecture but have different parameters, and only the student's parameters are trained with SGD.
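A minimal numpy sketch of this local-to-global loss, assuming teacher and student outputs are already softmax probabilities (the names `teacher_probs`/`student_probs` and the mean reduction over pairs are illustrative choices, not the paper's exact implementation):

```python
import numpy as np

def cross_entropy(t, s, eps=1e-12):
    # H(t, s) = -sum_i t_i * log(s_i)
    return -np.sum(t * np.log(s + eps))

def multicrop_loss(teacher_probs, student_probs):
    """teacher_probs: list over the global views only;
    student_probs: list over all views, with the globals first
    so index si == ti identifies the same view."""
    total, n_terms = 0.0, 0
    for ti, t in enumerate(teacher_probs):      # teacher sees only globals
        for si, s in enumerate(student_probs):  # student sees every crop
            if si == ti:                        # skip the same-view pair
                continue
            total += cross_entropy(t, s)
            n_terms += 1
    return total / n_terms
```

With identical uniform distributions on every view, each term reduces to the entropy log K, so the loss equals log K.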
Teacher Network
Unlike standard KD, we do not have a teacher given a priori, so we build one from past iterations of the student network. We experimented with different rules for updating the teacher and found that freezing it for an epoch works surprisingly well; in the end we use an exponential moving average (EMA) of the student weights. (like #25 )
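The EMA update rule referenced here (missing from the notes) is, with student parameters \theta_s and teacher parameters \theta_t:

```latex
\theta_t \leftarrow \lambda\, \theta_t + (1 - \lambda)\, \theta_s
```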

Here, \lambda follows a cosine schedule from 0.996 up to 1 during training. This EMA teacher has a model-ensembling effect, similar to Polyak-Ruppert averaging.
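A sketch of the cosine momentum schedule and the per-parameter EMA update, assuming a simple list-of-floats parameter representation (function names are illustrative):

```python
import math

def ema_momentum(step, total_steps, base=0.996, final=1.0):
    # Cosine schedule: starts at `base` and rises to `final` over training.
    return final - (final - base) * (math.cos(math.pi * step / total_steps) + 1) / 2

def ema_update(teacher_params, student_params, lam):
    # theta_t <- lam * theta_t + (1 - lam) * theta_s, applied per parameter.
    return [lam * t + (1 - lam) * s
            for t, s in zip(teacher_params, student_params)]
```

At step 0 the momentum equals the base value 0.996, and at the final step it reaches 1.0, freezing the teacher.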
Network architecture
The network g consists of a backbone f (ViT or ResNet) and a projection head h. The features used in downstream tasks are the outputs of the backbone f. h is a 3-layer MLP with a hidden dimension of 2048 followed by l2 normalization (a SwAV-like structure). ViT does not use batch normalization anywhere.
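A plain-numpy sketch of such a head, showing the 3-layer MLP with hidden dimension 2048 followed by l2 normalization; the input/output dimensions, GELU approximation, and weight initialization are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class HeadSketch:
    """3-layer MLP (hidden dim 2048) followed by l2 normalization,
    roughly mirroring the SwAV-like projection head described above."""
    def __init__(self, in_dim, hidden_dim=2048, out_dim=256, seed=0):
        rng = np.random.default_rng(seed)
        dims = [in_dim, hidden_dim, hidden_dim, out_dim]
        self.weights = [rng.normal(0, 0.02, (a, b)) for a, b in zip(dims, dims[1:])]

    def __call__(self, x):
        for i, w in enumerate(self.weights):
            x = x @ w
            if i < len(self.weights) - 1:  # activation between hidden layers
                x = gelu(x)
        # l2-normalize each output feature vector
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)
```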
Avoiding collapse
SSL methods take different approaches to avoid collapse. DINO can also be stabilized with such normalizations, but centering and sharpening the teacher outputs turned out to be enough to prevent collapse. Centering prevents one dimension from dominating but encourages collapse to the uniform distribution, while sharpening has the opposite effect; applied together, the two operations balance each other and prevent collapse.
Centering can be interpreted as adding a bias term c to the teacher output. The center c is updated with an EMA of the batch-mean teacher outputs.
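A numpy sketch of centering plus sharpening, following the common implementation that subtracts a running mean of the teacher logits (equivalent to the bias-term view above up to sign convention); the momentum 0.9 and teacher temperature 0.04 are illustrative values:

```python
import numpy as np

def softmax(x, tau):
    # numerically stable temperature softmax
    z = (x - x.max(axis=-1, keepdims=True)) / tau
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class TeacherCenter:
    """Keeps a running center c of teacher outputs, updated with an EMA."""
    def __init__(self, dim, momentum=0.9):
        self.c = np.zeros(dim)
        self.m = momentum

    def sharpen_and_center(self, teacher_logits, tau_t=0.04):
        # centering: subtract the running center c
        # sharpening: use a low teacher temperature tau_t
        probs = softmax(teacher_logits - self.c, tau_t)
        # EMA update of the center: c <- m*c + (1-m)*batch_mean(logits)
        self.c = self.m * self.c + (1 - self.m) * teacher_logits.mean(axis=0)
        return probs
```

With a low temperature, the sharpened distribution concentrates most of its mass on the largest logit, counteracting the centering push toward uniformity.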

Result
Main Result

- DINO outperforms BYOL, MoCo v2, and SwAV.
- When DINO is applied to ViT, kNN classification alone is almost as good as linear probing (74.5 vs 77.0). This does not happen with DINO + convnet, so it is a property of the ViT architecture.
- We found that a patch size of 8 performed better than a patch size of 16.
Properties of ViT trained with SSL
We evaluated the properties of DINO features for nearest neighbor search, information about object location, and transferability to other downstream tasks.
image retrieval
A task to retrieve the images that match a given query image.

copy detection
A task to recognize copies of images that have been distorted by blur, insertions, or print-and-scan.
segmentation

For a ViT-S model trained with DINO, we visualize the self-attention of the [CLS] token for each head in the last layer; the maps contain segmentation information, as shown below.

DINO yields better segmentation than supervised ViT training. Below, the self-attention maps are thresholded to keep the top 60% of the attention mass.

- transfer learning

Ablation
Importance of components

Effect of patch size

As the patch size gets smaller, the parameter count stays the same, but the throughput drops. Still, the smaller the patch size, the better the performance.
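The throughput drop follows from simple arithmetic: halving the patch size quadruples the number of tokens the ViT must process, while the weights (which are shared across patches) are unchanged. A quick illustration:

```python
def num_patches(img_size=224, patch_size=16):
    # A ViT splits the image into (img_size / patch_size)^2 patch tokens.
    return (img_size // patch_size) ** 2

# 224x224 image: patch 16 -> 196 tokens, patch 8 -> 784 tokens (4x more)
```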
Analyzing training dynamic

Avoiding Collapse

Cross-entropy is decomposed into two terms, entropy and KL divergence, and each is plotted. Without either sharpening or centering, the KL divergence goes to zero, meaning the teacher and student outputs become constant, i.e., collapse occurs. On the other hand, the entropy converges to 0 with sharpening alone and to -log(1/K) with centering alone, meaning the two settings collapse in different directions; the two operations must therefore balance each other.
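The decomposition used in this analysis is the standard identity relating cross-entropy, entropy, and KL divergence:

```latex
H(P_t, P_s) = \underbrace{h(P_t)}_{\text{entropy}} + \underbrace{D_{\mathrm{KL}}\!\left(P_t \,\|\, P_s\right)}_{\text{KL divergence}},
\qquad h(P_t) = -\sum_{i} P_t^{(i)} \log P_t^{(i)}
```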
