
Introduction
Bootstrap* Your Own Latent (BYOL) is designed so that two networks, the online network and the target network, interact and learn from each other. The online network is trained to predict the target network's representation of the same image under a different augmentation. At the same time, the target network is updated as a slow-moving average of the online network. Current SOTA self-supervised methods rely on negative pairs, but BYOL achieves a new SOTA without them.
*"bootstrap" here is used not as the ML/statistics term but in its everyday sense:
to improve your situation or become more successful, without help from others or without advantages that others have.

- While previous work has bootstrapped from pseudo-labels, cluster indicators, or a handful of labels, our work bootstraps the representation directly.
- Because it does not rely on negative pairs, BYOL is more robust to the choice of image augmentations.
- Methodologies such as #9 are trained to produce the same representation for different augmented views of the same image; posed directly in representation space, this prediction problem admits collapsed solutions (e.g., a constant output for every input). Contrastive methods avoid this by additionally pushing apart representations of different images, but they have the limitation of requiring a very large number of negative samples.
- A simple way to avoid collapse without negative samples is to use a fixed, randomly initialized network as the target of the prediction. This prevents collapse but performs poorly on its own. Surprisingly, though, while linear evaluation of the randomly initialized network itself reaches only 1.4% accuracy, a network trained to predict the fixed random network's output reaches 18.8%. This experiment was the motivation for BYOL.
- Given a target representation (= target network), we can train a new online network to predict it. By repeating this procedure, setting each trained online network as the next target, we can learn representations of increasing quality. In practice, we bootstrap by using an exponential moving average of the online network as the target.
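The bootstrapping step above amounts to a plain exponential moving average over parameters. A minimal NumPy sketch (`tau=0.9` is an illustrative decay rate, not the paper's schedule):

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.9):
    """One bootstrap step: move the target slowly toward the online network.

    tau is the decay rate; values near 1 make the target slow-moving.
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]

# toy example: one weight matrix per network
online = [np.ones((2, 2))]
target = [np.zeros((2, 2))]
target = ema_update(target, online, tau=0.9)
# every target entry is now 0.9 * 0 + 0.1 * 1 = 0.1
```

Note that the target network receives no gradients; this update is its only learning rule.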
BYOL

- The online network consists of an encoder, a projector, and a predictor, and has weights \theta.
- The target network has the same encoder and projector as the online network (but no predictor) and a different set of weights \psi; it serves to provide regression targets for the online network. The parameters \psi are an exponential moving average of the online parameters \theta.
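A toy NumPy sketch of the online branch's shape plumbing (the dimensions follow the implementation details below; batch norm is omitted for brevity, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(d_in, d_hid, d_out):
    # toy linear -> ReLU -> linear block standing in for the
    # projector/predictor MLPs (batch norm omitted for brevity)
    w1 = rng.standard_normal((d_in, d_hid)) * 0.01
    w2 = rng.standard_normal((d_hid, d_out)) * 0.01
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

encoder_dim, proj_dim = 2048, 256           # ResNet-50 feature / projection sizes
projector = mlp(encoder_dim, 4096, proj_dim)
predictor = mlp(proj_dim, 4096, proj_dim)   # the predictor exists only online

y = rng.standard_normal((4, encoder_dim))   # stand-in for encoder features
z = projector(y)                            # online projection
q = predictor(z)                            # online prediction of the target
```

The target branch stops at the projection; only the online branch adds the predictor on top, which breaks the symmetry between the two networks.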

From one image, create two augmented views \nu and \nu' and feed one to each network. Then compare the output of the online network's final prediction with the target network's projection using an MSE loss (on the l2-normalized outputs).

Then we swap the views, feeding \nu' to the online network and \nu to the target, and compute the loss again. We sum the two losses and minimize with respect to \theta only.
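The symmetrized loss above can be sketched as follows. The MSE between l2-normalized vectors equals 2 − 2 × cosine similarity; the function names are illustrative:

```python
import numpy as np

def byol_loss(pred, target_proj):
    """MSE between l2-normalized vectors: 2 - 2 * cosine similarity."""
    p = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    z = target_proj / np.linalg.norm(target_proj, axis=-1, keepdims=True)
    return np.sum((p - z) ** 2, axis=-1)

def symmetric_loss(q_v, z_vp, q_vp, z_v):
    # online prediction of view v vs. target projection of view v',
    # plus the swapped pair; in training only theta receives gradients
    return byol_loss(q_v, z_vp) + byol_loss(q_vp, z_v)
```

Identical (up to scale) prediction and target give a loss of 0; orthogonal vectors give the maximum per-pair loss of 2.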

Implementation details
- Image Augmentation : the same augmentation set as #9 — 224 x 224 random crop, random horizontal flip, …
- Architecture : ResNet-50 for the encoder, average pooling for the representation layer, and an MLP (linear 4096 -> BN -> ReLU -> linear 256) for the projector and predictor; no batch norm on the final output.
- Optimization : LARS optimizer with cosine learning-rate decay, …
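The cosine decay can be sketched as a simple schedule (LARS itself is omitted; `base_lr` and `total_steps` are placeholders, and warm-up is left out):

```python
import math

def cosine_decay_lr(base_lr, step, total_steps):
    """Cosine-annealed learning rate, decaying from base_lr down to 0."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))
```

The rate starts at `base_lr`, passes through half that value at the midpoint, and reaches 0 at the final step.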
Results
Linear evaluation on ImageNet

Fine-tuning (= semi-supervised training) on ImageNet

Transfer to other classification tasks

Transfer to other vision tasks

Ablation

Compared to SimCLR, BYOL's performance dropped less as the batch size was reduced and as augmentations were removed.

The ablations confirmed that using an exponential moving average for the target is beneficial.

They also confirmed the benefit of having a separate target network at all.