
TL;DR
- task: language modeling
- problem: Transformers are large and slow at inference. It would be nice if the network could configure itself to meet an inference latency target.
- idea: use NAS to design the FFN, MHA, and MoE-FFN layers of Transformer-XL subject to a given latency target.
- architecture: Transformer-XL, using GumbelSoftmax when NAS selects blocks + reinforcement based search.
- objective: cross-entropy loss + latency loss (expected latency = the probability of each super block being selected times that super block's latency), with the latency loss added only when it exceeds the target latency.
- baselines: Transformer-XL, PAR Transformer, Sandwich Transformer
- data: WikiText-103 (wt103), enwik8
- result: ~2x lower latency at similar performance. In the iso-parameter setting, the MoE variant shows higher normalized latency relative to PPL than a same-size model.
- contribution: NAS that targets inference latency
- limitation / something I don't understand: MHA is never combined with MoE. The paper attributes this to runtime overhead introduced by dynamic behavior, but I don't understand what that means.
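A minimal sketch of the objective described above: Gumbel-Softmax over candidate super blocks, expected latency as a probability-weighted sum of per-block latencies, and a penalty applied only above the target. All function names, values, and the pure-Python formulation are my own assumptions, not the paper's implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gumbel_softmax(logits, gumbel_noise, tau=1.0):
    # Differentiable relaxation of block selection: add Gumbel noise to the
    # logits, then softmax with temperature tau. Noise is passed in explicitly
    # here so the function stays deterministic for illustration.
    return softmax([(l + g) / tau for l, g in zip(logits, gumbel_noise)])

def latency_loss(select_probs, block_latency_ms, target_ms):
    # Expected latency = sum over super blocks of P(select block) * latency(block).
    # The hinge means the loss is zero whenever the expectation meets the target,
    # matching the note above that the latency term is only added when exceeded.
    expected = sum(p * t for p, t in zip(select_probs, block_latency_ms))
    return max(0.0, expected - target_ms)
```

For example, with selection probabilities concentrated on a 5 ms block and a 4 ms target, `latency_loss([1.0, 0.0, 0.0], [5.0, 2.0, 8.0], 4.0)` yields a 1 ms penalty, while a selection of the 2 ms block incurs none.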
Details
(figure/table notes)
- latency of each Transformer layer
- MHA / FFN: latency comparison when varying each hyperparameter
- model architecture configurations discovered by the NAS: the number and dimension of MHA layers are reduced, and MoE or plain FFN layers are added instead
- search space for the NAS
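As a rough illustration of what a per-layer super-block search space could look like: each layer picks one candidate block (MHA, FFN, or MoE-FFN) with its own hyperparameters. All concrete values and names below are hypothetical, not taken from the paper.

```python
from itertools import product

# Hypothetical per-layer search space: each super block type exposes a small
# grid of hyperparameters; the NAS picks one concrete block per layer.
SEARCH_SPACE = {
    "mha": {"num_heads": [2, 4, 8], "head_dim": [32, 64]},
    "ffn": {"inner_dim": [1024, 2048, 4096]},
    "moe_ffn": {"num_experts": [4, 8, 16], "inner_dim": [1024, 2048]},
}

def enumerate_blocks(space):
    """Flatten the space into a list of concrete candidate blocks."""
    candidates = []
    for block_type, params in space.items():
        names = sorted(params)  # fixed order so the enumeration is deterministic
        for combo in product(*(params[n] for n in names)):
            candidates.append({"type": block_type, **dict(zip(names, combo))})
    return candidates
```

With the grid above this yields 6 MHA + 3 FFN + 6 MoE-FFN = 15 candidate blocks per layer; the Gumbel-Softmax selection would then operate over (a subset of) these candidates.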
