
paper

TL;DR

  • task: language modeling
  • problem: Transformers are large and heavy; it would be nice if the network could configure itself to meet an inference-latency target.
  • idea: use NAS to design the MHA, FFN, and MoE FFN layers of Transformer-XL given a target inference latency.
  • architecture: Transformer-XL; the NAS selects blocks with Gumbel-Softmax plus a reinforcement-learning-based search.
  • objective: cross-entropy loss + latency loss (= probability of each super block being selected times that super block's latency), with the latency loss added only when it exceeds the target latency.
  • baselines: Transformer-XL, PAR Transformer, Sandwich Transformer
  • data: wt103, enwik8
  • result: ~2x lower latency at similar performance; in the iso-parametric setting, a same-size model with MoE shows higher normalized latency relative to its PPL.
  • contribution: NAS targeting inference latency
  • limitation / something I don't understand: they avoid applying MoE to MHA, citing the runtime overhead introduced by dynamic behavior, but I don't know exactly what that means.
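The objective above can be sketched in a few lines. This is my reading, not the paper's code: the function and argument names are hypothetical, and the hinge-style penalty reflects "latency loss only added if it is higher than the target latency".

```python
def latency_aware_loss(ce_loss, block_probs, block_latencies, target_latency):
    """Sketch of the search objective (hypothetical names, not the paper's code).

    ce_loss         : cross-entropy of the sampled architecture
    block_probs     : P(super block i is selected), from the NAS distribution
    block_latencies : measured latency of each super block
    target_latency  : the inference-latency budget
    """
    # Expected latency of the architecture: sum_i P(block_i) * latency_i.
    expected_latency = sum(p * t for p, t in zip(block_probs, block_latencies))
    # The latency term is a hinge: it contributes only above the budget.
    latency_penalty = max(0.0, expected_latency - target_latency)
    return ce_loss + latency_penalty
```

When the expected latency is under budget, the objective reduces to plain cross-entropy, so the search only trades perplexity for speed once the budget is violated.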

Details

  • latency for each layer of the transformer (figure)

  • MHA / FFN: comparison of latency when varying each hyperparameter (figure)

  • model architecture configurations discovered by the NAS (figure)

The NAS tends to reduce the number and dimension of MHA layers while adding MoE or FFN layers.
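As a reminder of why swapping FFNs for MoE FFNs helps here, a minimal top-1 routing sketch is below (all names are mine, not the paper's): a softmax gate picks one expert FFN per input, so parameter count grows with the number of experts while per-input compute stays close to a single FFN.

```python
import math

def top1_moe_ffn(x, expert_fns, gate_logits):
    """Minimal top-1 MoE FFN sketch (illustrative, not the paper's code).

    x           : input vector (list of floats)
    expert_fns  : candidate expert FFNs, each a callable on x
    gate_logits : router scores, one per expert
    """
    # Softmax over the gate logits (numerically stabilized).
    m = max(gate_logits)
    exps = [math.exp(g - m) for g in gate_logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Route to the single highest-probability expert; only it runs.
    k = max(range(len(probs)), key=probs.__getitem__)
    # Scale the chosen expert's output by its gate probability.
    return [probs[k] * v for v in expert_fns[k](x)]
```

The dynamic routing is also the source of the "runtime overhead introduced by dynamic behavior" noted in the limitations: which expert runs is only known at runtime, per input.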

  • MoE (figure)

  • search space for the NAS (figure)
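The block selection during search can be sketched as a Gumbel-Softmax over the candidate super blocks of one layer slot. This is a generic sketch of the standard Gumbel-Softmax relaxation, with an illustrative candidate list and logits; it is not the paper's implementation.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0):
    """Relaxed (soft) sample over candidate blocks via Gumbel-Softmax."""
    # Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1); eps avoids log(0).
    eps = 1e-12
    gumbels = [-math.log(-math.log(random.random() + eps) + eps) for _ in logits]
    scores = [(l + g) / tau for l, g in zip(logits, gumbels)]
    # Softmax over the perturbed scores (numerically stabilized).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Illustrative candidate super blocks for one layer slot.
candidates = ["MHA", "FFN", "MoE-FFN"]
weights = gumbel_softmax([0.2, 0.5, 1.0], tau=0.5)
# During search, the candidates' outputs are mixed by these weights;
# the final architecture keeps the argmax candidate per slot.
chosen = candidates[max(range(len(weights)), key=weights.__getitem__)]
```

A low temperature `tau` pushes the weights toward one-hot, so the search gradually commits to a single block per slot while staying differentiable.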