paper

TL;DR

  • task : Measure how sparse the activations of trained Transformers are, and under what circumstances this sparsity emerges
  • architecture : T5 (encoder–decoder), ViT-B/16
  • data : C4, ImageNet-21K
  • contribution : an empirical study of activation sparsity in Transformer MLP layers

Details

  • For ViT and for both the T5 encoder and decoder, activation sparsity is high regardless of the module; in all but the first layer, the fraction of active neurons is within 10%. image

This is not because some neurons are permanently inactive: essentially every neuron fires for some input, but each individual neuron's probability of being active is low. image
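The sparsity being measured can be sketched as the fraction of zeros after the ReLU in an MLP block. A minimal NumPy illustration (function name and shapes are ours, not the paper's):

```python
import numpy as np

def activation_sparsity(x, W1, b1):
    """Fraction of zero entries after the ReLU of an MLP's first layer.

    Hypothetical shapes: x is (tokens, d_model), W1 is (d_model, d_ff).
    """
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU activations
    return float(np.mean(h == 0.0))   # proportion of inactive neurons

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
W1 = rng.standard_normal((16, 64))
b1 = np.zeros(64)
s = activation_sparsity(x, W1, b1)
# With random Gaussian weights, roughly half the pre-activations are
# negative, so s is near 0.5; the paper's point is that *trained*
# Transformers are far sparser than this baseline.
```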

  • Sparsity increases with layer depth and with layer width. image

    1. Is there human annotation bias in the labels?
    2. Do natural images themselves carry a bias?
    3. Does the model simply have more capacity than the data requires? image

To test these three hypotheses, the authors 1) randomized the labels, 2) randomized the images, and 3) made the training data effectively infinite. Sparsity did not change noticeably in any case, suggesting that sparsity is an inherent property of Transformer training rather than of the data.

  • Sparsity can be exploited to reduce FLOPs: only the active neurons need to participate in the MLP computation. image
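The FLOPs saving comes from skipping the zeroed neurons: rows of the second MLP matrix whose activation is zero contribute nothing, so only active rows need to be multiplied. A sketch under the assumption of a standard two-layer ReLU MLP (function names are ours):

```python
import numpy as np

def mlp_dense(x, W1, W2):
    """Standard dense MLP block: full matmuls regardless of sparsity."""
    h = np.maximum(x @ W1, 0.0)
    return h @ W2

def mlp_sparse(x, W1, W2):
    """Same result, but the second matmul only touches active neurons."""
    out = np.zeros((x.shape[0], W2.shape[1]))
    for i, xi in enumerate(x):
        h = np.maximum(xi @ W1, 0.0)
        idx = np.nonzero(h)[0]       # indices of active neurons
        out[i] = h[idx] @ W2[idx]    # cost scales with len(idx), not d_ff
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))
W1 = rng.standard_normal((8, 32))
W2 = rng.standard_normal((32, 8))
```

With high sparsity, `len(idx)` is a small fraction of the hidden width, which is where the FLOPs reduction comes from (real speedups additionally require hardware-friendly sparse kernels).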

  • When sparsity is enforced explicitly with a Top-K activation (keeping only the K largest activations per token), accuracy matches the standard Transformer, with better robustness and confidence calibration.

image image
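The Top-K idea above can be sketched as a drop-in replacement for the MLP activation: zero everything except the K largest entries per token. A minimal NumPy version (our naming; the paper's exact placement of ReLU vs. masking may differ):

```python
import numpy as np

def topk_relu(h, k):
    """Keep only the k largest entries per row, zero the rest, then ReLU."""
    out = np.zeros_like(h)
    # Indices of the k largest entries along the last axis.
    idx = np.argpartition(h, -k, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(h, idx, axis=-1), axis=-1)
    return np.maximum(out, 0.0)  # negatives among the top-k are still zeroed

h = np.array([[3.0, -1.0, 2.0, 0.5]])
y = topk_relu(h, 2)  # keeps only the entries 3.0 and 2.0
```

This makes the sparsity level a controllable hyperparameter rather than an emergent property.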

ECE (expected calibration error): the average gap between a model's predicted confidence and how often those predictions are actually correct.
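Concretely, ECE is usually computed by binning predictions by confidence and taking a weighted average of the |accuracy − confidence| gap per bin. A minimal sketch (function name and binning scheme are ours):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin predictions by confidence, average |accuracy - confidence|
    over bins, weighted by the fraction of samples in each bin.

    conf: predicted confidences in [0, 1]; correct: 0/1 outcomes.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight by bin occupancy
    return ece

# Perfectly calibrated toy case: 95% confidence, 95% accuracy -> ECE = 0.
conf = np.full(20, 0.95)
correct = np.array([1] * 19 + [0], dtype=float)
ece_good = expected_calibration_error(conf, correct)

# Overconfident case: 100% confidence, 50% accuracy -> ECE = 0.5.
ece_bad = expected_calibration_error(np.ones(10), np.array([1, 0] * 5, dtype=float))
```

A lower ECE means the model's stated confidence tracks its real accuracy, which is the calibration benefit the Top-K variant reportedly improves.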