
paper

TL;DR

  • task : large language modeling, domain-incremental learning
  • problem : the core idea is nearly identical to DeMix, but the authors want to cut the communication cost of multi-node synchronization.
  • idea : build expert LMs (ELMs) that share no parameters across domains (previous MoE LMs kept separate FFNs but shared everything else) and train them with Branch-Train-Merge (BTM). BTM's main idea: when a new domain arrives, find the nearest existing LMs, average their parameters to initialize a new branch, train that branch on the new domain, and add it to the branch forest. At inference time, the domain posterior is estimated via Bayes' rule and the final prediction is a weighted sum over the experts.
  • architecture : vanilla Transformer
  • objective : cross-entropy loss
  • baseline : Transformer LM (GPT), DeMix
  • data : Wikipedia, C4, StackOverflow, JavaScript, etc.
  • result : better out-of-domain perplexity; with incremental learning over 64 domains, performance comparable to a Transformer LM 2.5× its size.
  • contribution : MoE without shared parameters.
  • limitation / open questions :
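The branch-initialization step of BTM can be sketched as follows. This is a hedged toy sketch, not the paper's implementation: `experts` maps a domain name to a flat parameter vector (a real implementation would average per-tensor state dicts), and "nearest" is approximated here by lowest validation perplexity on the incoming domain.

```python
import numpy as np

def branch_init(experts, ppl_on_new_domain, k=2):
    """Initialize a new branch as the average of the k nearest experts."""
    # "nearest" = lowest validation perplexity on the new domain (assumption)
    nearest = sorted(ppl_on_new_domain, key=ppl_on_new_domain.get)[:k]
    return np.stack([experts[d] for d in nearest]).mean(axis=0)

experts = {"wiki": np.array([1.0, 2.0]),
           "code": np.array([3.0, 4.0]),
           "news": np.array([9.0, 9.0])}
ppl = {"wiki": 12.0, "code": 15.0, "news": 40.0}

# wiki and code are nearest, so the new branch starts from their average;
# it is then trained on the new domain and added to the forest.
init = branch_init(experts, ppl, k=2)  # array([2., 3.])
```

The averaged parameters are only a starting point; the branch is then fine-tuned on the new domain's data before joining the forest.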

Details

Branch-Train-Merge (BTM)


Inference


It is true that every ELM must be forwarded, but one can see that the set of selected ELMs ends up sparse.
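A minimal sketch of the Bayes-rule ensemble (not the paper's exact implementation): each expert scores the context, giving log p(context | expert_j); with a uniform prior, the posterior p(expert_j | context) is the softmax of those log-likelihoods, and the next-token distribution is the posterior-weighted sum of the experts' predictions.

```python
import numpy as np

def ensemble_next_token(context_log_liks, next_token_probs):
    ll = np.asarray(context_log_liks, dtype=float)
    w = np.exp(ll - ll.max())                        # numerically stable softmax
    posterior = w / w.sum()                          # p(expert_j | context)
    return posterior @ np.asarray(next_token_probs)  # (n_experts, vocab) -> (vocab,)

# Two experts over a toy 3-token vocab; the first expert fits the context far
# better, so the posterior is effectively one-hot (the sparse selection noted
# above) and the mixture collapses onto its prediction.
mix = ensemble_next_token([-5.0, -50.0],
                          [[0.7, 0.2, 0.1],
                           [0.1, 0.1, 0.8]])
```

Even though all experts are forwarded, the posterior concentrates on a few of them, which is what makes the selection effectively sparse.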

Data


DeMix

DeMix, 2021

  • https://arxiv.org/pdf/2108.05036.pdf
  • problem : lower the perplexity of a single LM trained on corpora from multiple domains, where the domain of each example is known.
  • solution : give each corpus domain its own FFN expert (as in the Switch Transformer) and train them jointly. When a new domain appears at inference time, either 1) forward through all FFNs and combine the outputs with a Bayes-rule weighted sum, or 2) add a dedicated FFN for that domain.
  • result : improves LM perplexity while increasing training efficiency; shows that new domains can be added or removed without forgetting in the existing experts.
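The DeMix routing described above can be sketched like this. All names, shapes, and initialization are hypothetical assumptions: one FFN per domain inside a Transformer layer, with the known domain label used as a hard router (attention and residual connections are omitted).

```python
import numpy as np

class DomainFFNLayer:
    """Toy Transformer sublayer with one FFN expert per domain (assumed shapes)."""

    def __init__(self, domains, d_model=8, d_ff=16, seed=0):
        rng = np.random.default_rng(seed)
        self._new_ffn = lambda: (rng.standard_normal((d_model, d_ff)) * 0.02,
                                 rng.standard_normal((d_ff, d_model)) * 0.02)
        self.ffns = {d: self._new_ffn() for d in domains}

    def forward(self, x, domain):
        w1, w2 = self.ffns[domain]            # hard routing by the known label
        return np.maximum(x @ w1, 0.0) @ w2   # ReLU FFN

    def add_domain(self, domain):
        # a new domain just adds a fresh FFN; existing experts are untouched,
        # which is why there is no forgetting
        self.ffns[domain] = self._new_ffn()

layer = DomainFFNLayer(["wiki", "code"])
out = layer.forward(np.ones((4, 8)), "wiki")  # shape (4, 8)
layer.add_domain("legal")
```

Because experts are independent dictionaries of weights, removing a domain is just deleting its entry, mirroring the add/remove claim in the result bullet.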