
TL;DR
- task: measure how sparse Transformer activations are, and under what circumstances sparsity appears
- architecture: T5, ViT-B/16
- data: C4, ImageNet-21K
- contribution: measuring the activation sparsity of Transformers
Details
- For both ViT and the T5 encoder–decoder, sparsity is high regardless of whether we look at the encoder or the decoder; in all layers except the first, the fraction of active neurons stays within 10%.

This shows that the sparsity is not because some neurons are simply never active. The probability of a neuron being active was

Sparsity increases both with layer depth and with layer width.
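As a concrete illustration of what "activation sparsity" means here, this is a minimal sketch (not the paper's code; all names are illustrative) that measures the fraction of post-ReLU FFN hidden activations that are exactly zero:

```python
import numpy as np

def ffn_activation_sparsity(x, w1, b1):
    """Fraction of post-ReLU FFN activations that are exactly zero.

    x: (tokens, d_model) inputs; w1: (d_model, d_ff) first FFN weight.
    Illustrative sketch, not the paper's measurement code.
    """
    h = np.maximum(x @ w1 + b1, 0.0)   # ReLU hidden activations
    return float((h == 0.0).mean())    # fraction of zeroed-out units

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 64))
w1 = rng.standard_normal((64, 256)) / np.sqrt(64)
b1 = np.zeros(256)
print(ffn_activation_sparsity(x, w1, b1))  # ~0.5 for random Gaussian inputs
```

With random Gaussian weights and inputs the pre-activations are symmetric around zero, so roughly half the units are inactive; the paper's point is that trained Transformers become far sparser than this baseline.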

- 1) Is there human annotation bias in the labels? 2) Do natural images themselves carry a bias? 3) Does the model have more capacity than the data requires?

To test these three hypotheses, the authors 1) randomized the labels, 2) randomized the images, and 3) made the training data effectively infinite; sparsity did not change noticeably in any case, suggesting it is an inherent property of Transformers.
FLOPs drop thanks to sparsity: only the active hidden units need to contribute to the FFN computation.
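A back-of-envelope sketch of why sparsity saves compute (assumed cost model, not figures from the paper): each FFN matmul costs about 2·d_model·d_ff FLOPs per token, and with sparse activations only the active fraction of hidden units contributes to the second matmul.

```python
def ffn_flops(d_model, d_ff, active_frac=1.0):
    """Per-token FFN FLOPs under a simple cost model (illustrative only).

    First matmul is assumed dense; the second only touches the
    active_frac of hidden units that survived the ReLU.
    """
    first = 2 * d_model * d_ff
    second = 2 * d_model * int(d_ff * active_frac)
    return first + second

dense = ffn_flops(768, 3072)               # e.g. a BERT/T5-base-sized FFN
sparse = ffn_flops(768, 3072, active_frac=0.1)
print(sparse / dense)  # ~0.55: the second matmul shrinks by 10x
```

With top-k style gating the first matmul can be skipped for inactive units as well, giving a larger reduction than this conservative estimate.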

When activations are restricted to the top-K values, performance matches the standard Transformer, with improved robustness and confidence calibration.
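A minimal sketch of top-K activation masking, assuming the paper's mechanism is "keep the K largest post-ReLU activations per token, zero the rest" (function name and details are illustrative):

```python
import numpy as np

def topk_relu(h, k):
    """Keep only the k largest post-ReLU activations per row, zero the rest.

    Ties at the threshold may keep slightly more than k units;
    this is an illustrative sketch, not the paper's implementation.
    """
    h = np.maximum(h, 0.0)                 # ReLU
    if k >= h.shape[-1]:
        return h
    # k-th largest value of each row, used as a per-row threshold
    thresh = np.partition(h, -k, axis=-1)[..., -k:-k + 1]
    return np.where(h >= thresh, h, 0.0)

h = np.array([[3.0, 1.0, 2.0, -1.0]])
print(topk_relu(h, 2))  # [[3. 0. 2. 0.]]
```

Because only K units stay active per token, the enforced sparsity is constant regardless of input, which is what makes the FLOPs savings predictable.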

ECE (expected calibration error): the gap between a model's predicted confidence and whether that prediction was actually correct, averaged over predictions.
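The standard binned ECE computation can be sketched as follows (a minimal version of the usual definition; bin count and binning scheme are conventional choices, not taken from this paper):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average |mean confidence - accuracy| per bin."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    # Assign each prediction to an equal-width confidence bin in [0, 1]
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # Weight each bin's calibration gap by its share of predictions
            ece += mask.mean() * abs(conf[mask].mean() - corr[mask].mean())
    return ece

# 10 predictions at 80% confidence, 8 of which are correct: perfectly calibrated
print(expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2))  # 0.0
```

A lower ECE means the model's confidence tracks its actual accuracy, which is the sense in which top-K sparsity improves "confidence" above.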