image

paper problem : multi-lingual ์…‹ํŒ…์—์„œ BPE๋ฅผ ํ•˜๋ฉด, ์ž˜ ๋‚˜์˜ค์ง€ ์•Š๋Š” ์บ๋ฆญํ„ฐ๋“ค ๋•Œ๋ฌธ์— vocab์ˆ˜๋ฅผ ์žก์•„๋จน๋Š”๋‹ค. ์ค‘๊ตญ์–ด์˜ ๊ฒฝ์šฐ์—๋Š” ๊ธ€์ž๊ฐ€ ๋‹ค๋ฅธ ๊ธ€์ž์˜ ์ผ๋ถ€์ธ ๊ฒฝ์šฐ๋„ ์žˆ๋Š”๋ฐ(่™ซ, ่Ÿฒ), ์บ๋ฆญํ„ฐ ๋ ˆ๋ฒจ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๊ด€๊ณ„๋ฅผ ์•Œ๊ธฐ ์–ด๋ ต๋‹ค. solution : ๊ธ€์ž๋“ค์„ ‘utf-8’๋กœ ์ธ์ฝ”๋”ฉํ•œ ๋’ค์— BPE๋ฅผ ์ ์šฉํ•˜์ž. result : 1) ๋” ์ ์€ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ BPE์™€ ๋น„์Šทํ•œ ์„ฑ๋Šฅ 2) BPE๋ณด๋‹ค ๋” ์งง์€ sequence length์„ ๋งŒ๋“ค์–ด train / inference ์†๋„์—๋„ ์šฉ์ด 3) transfer learning์— ์šฉ์ด(OOV๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐ) details :

  • multi-lingual์„ ํ•  ๋•Œ, ์ค‘๊ตญ์–ด๋‚˜ ์ผ๋ณธ์–ด์—์„œ ์ž˜ ๋‚˜์˜ค์ง€ ์•Š๋Š” ์บ๋ฆญํ„ฐ๋“ค ๋•Œ๋ฌธ์— vocab์ˆ˜๋ฅผ ์žก์•„๋จน์Œ.
>>> '่Ÿฒ'.encode('utf-8')
b'\xe8\x9f\xb2'
>>> '่™ซ'.encode('utf-8')
b'\xe8\x99\xab'
>>> '์•ˆ'.encode('utf-8')
b'\xec\x95\x88'
>>> '์•Š'.encode('utf-8')
b'\xec\x95\x8a'
  • utf-8 ์ธ์ฝ”๋”ฉ์„ ํ•œ ๋’ค, n-gram์„ ํ†ตํ•œ BPE vocab set์„ ๋งŒ๋“ฆ

  • encoding์€ transformer๋ฅผ ์‚ฌ์šฉํ•จ

  • decoder๋Š” encoder์— ๋น„ํ•ด BBPE๋ฅผ ์ ์šฉํ•˜๊ธฐ๊ฐ€ ์–ด๋ ค์šด๋ฐ, ๋ชจ๋“  ์บ๋ฆญํ„ฐ๋Š” byte sequence๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ๋ฐ˜๋Œ€์˜ ๊ฒฝ์šฐ์—๋Š” invalidํ•œ byte sequence๊ฐ€ ๋‚˜์˜ฌ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ํ•™์Šต๋œ ๋ชจ๋ธ์—์„œ ์ด๋Ÿฌํ•œ ํ˜„์ƒ์€ ๊ฑฐ์˜ ๋ฐœ์ƒํ•˜์ง€ ์•Š์•˜๋‹ค.

    • ํ•™์Šต ์ค‘๊ฐ„์—๋Š” ๋ถˆํ•„์š”ํ•˜๊ฒŒ byte๋ฅผ ๋ฐ˜๋ณตํ•˜๋Š” ํ˜„์ƒ์ด ์žˆ์—ˆ๋Š”๋ฐ, ์šฐ๋ฆฌ๋Š” ์ด๋Ÿฌํ•œ error pattern์„ ์ตœ๋Œ€ํ•œ ๋งŽ์€ character๋กœ ์›๋ณตํ•˜๋Š” ์‹œ์Šคํ…œ์„ ๋งŒ๋“ค์—ˆ๋‹ค.
  • ์—ฌ๋Ÿฌ MT ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ•™์Šตํ–ˆ๊ณ , beam search 4 ์‚ฌ์šฉ, ํ‰๊ฐ€๋กœ๋Š” tokenized BLEU(sacreBLEU )๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค.

  • Symbol Frequency Distribution : ๊ฐ€๋กœ๊ฐ€ symbol, ์„ธ๋กœ๊ฐ€ frequency. BBPE๊ฐ€ frequency์—์„œ ๋” consistentํ•œ distribution์„ ๊ฐ€์ง. image

  • Cross-Lingual Sharing : ๋‹ค๋ฅธ ์–ธ์–ด ์‚ฌ์ด์—์„œ๋„ writing ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ๋‹ค๋ฅด์ง€๋งŒ ๊ฐ™์€ symbol์„ ๊ณต์œ ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ์ƒ๊น€. image

  • Impact on Sequence Length : BPE์™€ ๋‹ฌ๋ฆฌ BBPE๋Š” ๋‹จ์œ„๊ฐ€ ์งง๊ธฐ ๋•Œ๋ฌธ์—, sequence๊ฐ€ ๊ธธ์–ด์ง€๊ณ  ์ด์— ๋”ฐ๋ผ train, inference๊ฐ€ ๋” ์˜ค๋ž˜ ๊ฑธ๋ฆฐ๋‹ค. ํ•˜์ง€๋งŒ BPE์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ์••์ถ•์„ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, BPE๋ณด๋‹ค ๋” ์งง์€ ์‹œํ€€์Šค๋ฅผ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋‹ค. (X-En์˜ ๊ฒฝ์šฐ 1/5)

  • BBPE on Nosiy Character Set : En-De์—์„œ๋Š” nosiy ํ•œ ๋ฌธ์žฅ์ด ๋ช‡๊ฐœ ์žˆ์—ˆ๋Š”๋ฐ, En-De ๋ชจ๋‘ 30๊ฐœ์˜ ์•ŒํŒŒ๋ฒณ์œผ๋กœ ์ด๋ฃจ์–ด์ ธ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๋” ๋‚ญ๋น„๊ฐ€ ์‹ฌํ–ˆ๋‹ค. BBPE๋ฅผ ํ†ตํ•ด 2K, 4K๋กœ ๋งŒ๋“ค์—ˆ์„ ๋•Œ BPE 32K๋กœ ๋งŒ๋“ ๊ฒƒ๊ณผ ๋ชจ๋ธ ํฌ๊ธฐ๋Š” ์ž‘์ง€๋งŒ ๋น„์Šทํ•œ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ๊ฒƒ์„ ํ™•์ธํ–ˆ๋‹ค. image

  • BBPE on Character-Rich Languages ์ค‘๊ตญ์–ด๋‚˜ ์ผ๋ณธ์–ด ๊ฐ™์€ ๊ฒฝ์šฐ์— 50K์˜ ๊ธ€์ž๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์ง€๋งŒ ์ผ๋ถ€๋ถ„๋งŒ ๋งŽ์ด ์‚ฌ์šฉ๋œ๋‹ค. Ja-En ๋ฐ์ดํ„ฐ์…‹์—์„œ 7.9K์˜ ์บ๋ฆญํ„ฐ ์ค‘์— 2.4K์˜ ์บ๋ฆญํ„ฐ๊ฐ€ ๋นˆ๋„์˜ 99.99%๋ฅผ ์ฐจ์ง€ํ–ˆ๋‹ค. BBPE๋กœ ์ „์ฒด ์บ๋ฆญํ„ฐ์˜ ๋ฐ˜์ธ 4K๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค. ์ด๋•Œ BPE 16K๋ฅผ ์‚ฌ์šฉํ•œ ๊ฒƒ๊ณผ ์„ฑ๋Šฅ์ด ์œ ์‚ฌํ–ˆ์œผ๋ฉฐ, big model์—์„œ๋Š” ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€๋‹ค. image

  • BBPE on Many-to-En Translation multilingual setting์—์„œ BBPE๋Š” BPE๋ณด๋‹ค ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ ๊ฒŒ ์“ฐ๋ฉด์„œ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ–ˆ๋‹ค. ์ด๋Š” BBPE๊ฐ€ sequence length๊ฐ€ ๋” ์งง๊ธฐ ๋•Œ๋ฌธ์ผ ์ˆ˜๋„ ์žˆ์ง€๋งŒ, ์–ด์จŒ๋“  ์„ฑ๋Šฅ๊ณผ ์†๋„๊ฐ€ ๋” ์ข‹๊ธฐ ๋•Œ๋ฌธ์— ๊ดœ์ฐฎ๋‹ค ^^ image

  • Transfer Learning on Unseen Characters BBPE๋Š” ๋ชจ๋“  utf-8 bytes๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๊ณ , OOV token์ด ์—†๊ธฐ ๋•Œ๋ฌธ์—, BBPE ๋ชจ๋ธ์€ ๋‹ค๋ฅธ ์–ธ์–ด๋กœ transfer๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค. ๋ฐ˜๋ฉด์— character-based vocab์€ ์ƒˆ๋กœ์šด character๊ฐ€ ์ถ”๊ฐ€๋˜๋ฉด vocab์„ ๋ฐ”๊พธ๊ณ  ์ฒ˜์Œ๋ถ€ํ„ฐ ๋‹ค์‹œ ํ•™์Šตํ•ด์•ผํ•œ๋‹ค. X-En์— ์—†๋Š” ์บ๋ฆญํ„ฐ ์…‹์ธ Si(Sinhala)-En ๋ฐ์ดํ„ฐ๋กœ BBPE ๋ชจ๋ธ๋กœ transfer-learning์„ ํ•œ ๊ฒƒ์€ (์‹ฌ์ง€์–ด ๊ณต์œ ํ•˜๋Š” char ์—†๋Š”๋ฐ๋„) baseline ๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์ข‹์•˜๋‹ค. image