
TL;DR
- task: language modeling
- problem: Transformers are large and slow at inference. It would be nice if the network could configure itself to meet an inference latency target.
- idea: use NAS to design the FFN, MHA, and MoE-FFN layers of Transformer-XL subject to a given latency target.
- architecture: Transformer-XL, using GumbelSoftmax when NAS selects blocks + reinforcement based search.
- objective: cross-entropy loss + latency loss (expected latency = the probability of each super block being selected times that super block's latency), with the latency loss added only when it exceeds the target latency.
- baselines: Transformer-XL, PAR Transformer, Sandwich Transformer
- data: WikiText-103 (wt103), enwik8
- result: ~2x lower latency at similar performance. In the iso-parameter setting, the MoE variant shows higher normalized latency relative to PPL than a same-size model.
- contribution: NAS that targets inference latency
- limitation / something I don't understand: MHA is never combined with MoE. The paper attributes this to runtime overhead introduced by dynamic behavior, but I don't understand what that means.
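A minimal sketch of the objective described above: Gumbel-Softmax over candidate super blocks, expected latency as a probability-weighted sum of per-block latencies, and a penalty applied only above the target. All function names, values, and the pure-Python formulation are my own assumptions, not the paper's implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gumbel_softmax(logits, gumbel_noise, tau=1.0):
    # Differentiable relaxation of block selection: add Gumbel noise to the
    # logits, then softmax with temperature tau. Noise is passed in explicitly
    # here so the function stays deterministic for illustration.
    return softmax([(l + g) / tau for l, g in zip(logits, gumbel_noise)])

def latency_loss(select_probs, block_latency_ms, target_ms):
    # Expected latency = sum over super blocks of P(select block) * latency(block).
    # The hinge means the loss is zero whenever the expectation meets the target,
    # matching the note above that the latency term is only added when exceeded.
    expected = sum(p * t for p, t in zip(select_probs, block_latency_ms))
    return max(0.0, expected - target_ms)
```

For example, with selection probabilities concentrated on a 5 ms block and a 4 ms target, `latency_loss([1.0, 0.0, 0.0], [5.0, 2.0, 8.0], 4.0)` yields a 1 ms penalty, while a selection of the 2 ms block incurs none.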
Details
(figure/table notes)
- latency of each Transformer layer
- MHA / FFN: latency comparison when varying each hyperparameter
- model architecture configurations discovered by the NAS: the number and dimension of MHA layers are reduced, and MoE or plain FFN layers are added instead
- search space for the NAS
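As a rough illustration of what a per-layer super-block search space could look like: each layer picks one candidate block (MHA, FFN, or MoE-FFN) with its own hyperparameters. All concrete values and names below are hypothetical, not taken from the paper.

```python
from itertools import product

# Hypothetical per-layer search space: each super block type exposes a small
# grid of hyperparameters; the NAS picks one concrete block per layer.
SEARCH_SPACE = {
    "mha": {"num_heads": [2, 4, 8], "head_dim": [32, 64]},
    "ffn": {"inner_dim": [1024, 2048, 4096]},
    "moe_ffn": {"num_experts": [4, 8, 16], "inner_dim": [1024, 2048]},
}

def enumerate_blocks(space):
    """Flatten the space into a list of concrete candidate blocks."""
    candidates = []
    for block_type, params in space.items():
        names = sorted(params)  # fixed order so the enumeration is deterministic
        for combo in product(*(params[n] for n in names)):
            candidates.append({"type": block_type, **dict(zip(names, combo))})
    return candidates
```

With the grid above this yields 6 MHA + 3 FFN + 6 MoE-FFN = 15 candidate blocks per layer; the Gumbel-Softmax selection would then operate over (a subset of) these candidates.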
