paper

TL;DR

  • task : language modeling
  • problem : Transformers are too large and heavy. It would be nice if the network configured itself to meet an inference-latency target.
  • idea : given a latency budget, use NAS to design the FFN, MHA, and MoE FFN layers of Transformer-XL.
  • architecture : Transformer-XL; the NAS uses Gumbel-Softmax when selecting blocks, plus reinforcement-learning-based search.
  • objective : cross-entropy loss + latency loss (= each super block's selection probability times that super block's latency); the latency loss is added only when the expected latency exceeds the target latency.
  • baseline : Transformer-XL, PAR Transformer, Sandwich Transformer
  • data : wt103, enwik8
  • result : similar performance at 2x lower latency. Compared with a same-size model without MoE (iso-parametric setting), higher normalized latency for a given PPL.
  • contribution : NAS for inference latency
  • limitation or things I don't understand : nobody seems to think about touching MHA with MoE.. why? -> the paper says "runtime overhead introduced by dynamic behavior", but I'm not sure what that means.
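The objective described above (cross-entropy plus a latency penalty that only kicks in above the target, with block-selection probabilities from a Gumbel-Softmax) can be sketched roughly as below; the temperature `tau`, the penalty weight `lam`, and all function names are my own assumptions, not the paper's notation:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Soft (differentiable) selection probabilities over candidate blocks."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

def latency_loss(probs, block_latencies, target_latency):
    """Expected latency of the selected super block, penalized only when it
    exceeds the target (hinge), as described in the summary."""
    expected = float(np.dot(probs, block_latencies))
    return max(0.0, expected - target_latency)

def total_loss(ce_loss, logits, block_latencies, target_latency, lam=1.0):
    """cross-entropy + lam * latency penalty (lam is an assumed hyperparameter)."""
    probs = gumbel_softmax(np.asarray(logits, dtype=float))
    return ce_loss + lam * latency_loss(
        probs, np.asarray(block_latencies, dtype=float), target_latency
    )
```

For example, with per-block latencies `[5, 2, 8]` ms and a 4 ms target, a selection concentrated on the 2 ms block contributes zero penalty, while one concentrated on the 5 ms block is penalized by the 1 ms overshoot only.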

Details

  • ํŠธ๋žœ์Šคํฌ๋จธ์˜ ๊ฐ ๋ ˆ์ด์–ด ๋ณ„ latency image

  • MSA / FFN ๊ฐ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ๋ณ€๊ฒฝํ•  ๋•Œ์˜ latency ๋น„๊ต image

  • NAS๊ฐ€ ์„œ์น˜ํ•œ ๋ชจ๋ธ ์•„ํ‚คํ…์ณ ๊ตฌ์„ฑ๋“ค image

MHA๋ ˆ์ด์–ด์˜ ๊ฐœ์ˆ˜์™€ ์ฐจ์›์„ ์ค„์ด๊ณ , MoE๋‚˜ FFN์„ ์ถ”๊ฐ€ํ•˜๋Š” ์–‘์ƒ.

  • MoE image

  • search space for NAS image
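A minimal sketch of what a per-layer search space like this might look like; the candidate block types, hyperparameter names, and value ranges here are illustrative assumptions, not the paper's exact space:

```python
import random

# Hypothetical per-layer search space: each super block picks one
# candidate operation plus its hyperparameters.
SEARCH_SPACE = {
    "block_type": ["MHA", "FFN", "MoE-FFN"],
    "num_heads": [2, 4, 8],              # used by MHA blocks
    "ffn_inner_dim": [512, 1024, 2048],  # used by FFN / MoE-FFN blocks
    "num_experts": [2, 4, 8],            # used by MoE-FFN blocks
}

def sample_architecture(num_layers, rng):
    """Randomly draw one candidate architecture from the space."""
    return [
        {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        for _ in range(num_layers)
    ]
```

A NAS controller would replace the uniform `rng.choice` with learned (e.g. Gumbel-Softmax) selection probabilities, but the structure of the space stays the same.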