
paper, page

TL;DR

  • I read this because.. : its 4D parallelism and related techniques are frequently cited
  • task : foundation model
  • idea : more data, more optimization, more modalities
  • architecture : largely the same as Llama 2; the stated differences are the use of GQA, a larger vocabulary, and a higher RoPE base frequency. Released at 8B, 70B, and 405B scales. // for vision, cross-attention layers
  • objective : CE loss / DPO loss
  • baseline : Llama 2, Claude, ChatGPT, Mistral, Mixtral, Gemini, Gemma
  • data : pretraining data, SFT data (via rejection sampling), RM data (human annotation)
  • evaluation : ..
  • result :
  • contribution : a strong open-source model; optimization + exploration across many areas
  • etc. : the report records even the small details, which made it a fun read.

Details

๋‚ด์šฉ์ด ๋งŽ์•„์„œ ํฅ๋ฏธ ์œ„์ฃผ๋กœ ์ •๋ฆฌ

Pretraining

  • model arch (figure omitted)

  • training recipe: starting with a small batch size and ramping it up over training was good for stability (batch size of 4M tokens with 4,096-length sequences -> 8M tokens with 8,192-length sequences -> 16M tokens ..). Mid-training, they also raised the proportion of non-English and math data.
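The staged ramp above can be sketched as a simple schedule function; the token-count stage boundaries below are made-up assumptions for illustration (the paper specifies its own switch points).

```python
# Sketch of the staged batch-size / sequence-length ramp. The stage
# boundaries (250e9, 3e12 tokens) are illustrative assumptions.
def batch_config(tokens_seen: int) -> tuple[int, int]:
    """Return (batch_size_in_tokens, sequence_length) for the current stage."""
    if tokens_seen < 250e9:        # stage 1 (assumed boundary)
        return 4_000_000, 4096     # 4M-token batches, 4k sequences
    elif tokens_seen < 3e12:       # stage 2 (assumed boundary)
        return 8_000_000, 8192     # 8M-token batches, 8k sequences
    else:                          # stage 3
        return 16_000_000, 8192    # 16M-token batches
```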

  • annealing data: in the final phase of pretraining, performance is improved by upweighting high-quality mathematical data (improving GSM8K and MATH val by 24% and 6.4%). They also run this in reverse: anneal on a small candidate dataset, and if performance does not improve, judge that the data is probably not high quality.

  • parallelism for model scaling: applied in the order {TP, CP, PP, DP}, since communication gets more frequent toward TP. MFU came out around 38–43%. (figure omitted)

configuration์€ ์ด๋Ÿฌํ•จ. TP: PP๋น„์œจ์ด 1:2์ด๊ณ  DP๋Š” ๊ทธ๋ƒฅ ๋‚˜๋จธ์ง€! PP ๊ฐœ์„ ์— ๋Œ€ํ•ด์„œ๋„ ๋‚˜์™€์žˆ๋Š”๋ฐ, bubble์„ ์ค„์ด๊ธฐ ์œ„ํ•ด interleaved schedule์„ ์ ์šฉํ–ˆ๊ณ , ์ฒซ ๋ ˆ์ด์–ด์˜ ์ž„๋ฒ ๋”ฉ๊ณผ ๋งˆ์ง€๋ง‰์˜ Output ์˜ˆ์ธก ๋ถ€๋ถ„์€ ๊ทธ๋ƒฅ ํ•˜๋‚˜์˜ gpu๊ฐ€ ๋‹ด๋‹นํ•˜๋„๋ก ํ–ˆ๋‹ค๊ณ ํ•จ. ๊ทธ๋ฆฌ๊ณ  PP์—์„œ asynch Point-to-point communication ์ผ๋‹ค๊ณ  ํ•จ.

bs์— ๋Œ€ํ•ด gpu ๊ฐœ์ˆ˜๋กœ ์ œ์•ฝ์ด ์ƒ๊ธฐ๋Š”๊ฑด ์•„๋ž˜์™€ ๊ฐ™์ด ํ•ด๊ฒฐํ–ˆ๋‹ค๋Š”๋ฐ ์ดํ•ด๋ฅผ ๋ชปํ•จ

(figure omitted)
  • numerical stability: they are careful about precision, e.g. FP32 for reduce-scatter and FP32 for accumulation.
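A toy illustration of why FP32 accumulation matters: summing many small gradient values directly in FP16 stalls once the accumulator dwarfs each addend, while accumulating in FP32 does not (the numbers here are synthetic).

```python
import numpy as np

# 100,000 small "gradient" values; their true sum is about 10.0.
grads = np.full(100_000, 1e-4, dtype=np.float16)

acc16 = np.float16(0.0)
for g in grads:                  # naive FP16 accumulation: once the
    acc16 = np.float16(acc16 + g)  # accumulator >> addend, additions round away

acc32 = np.float32(grads.astype(np.float32).sum())  # accumulate in FP32

print(float(acc16), float(acc32))  # FP16 result stalls far below 10.0
```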

  • collective communication: they built NCCLX, an improved in-house version of NCCL. The main changes: tuned data chunking and data transfer, and higher priority for small control messages.

Post Training

  • Reward Model: trained on preference data ranked edited > chosen > rejected. The responses are concatenated and processed in a single row, with no performance degradation from doing so.
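The three-way preference order can be read as pairwise Bradley–Terry comparisons. A minimal sketch over scalar reward scores (this scalar toy is my own simplification, not the paper's exact batching or loss details):

```python
import math

def rm_pairwise_loss(r_edited: float, r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry ranking loss over the order edited > chosen > rejected,
    where each argument is a scalar reward-model score for one response."""
    def nll(better: float, worse: float) -> float:
        # negative log of sigmoid(score difference)
        return -math.log(1.0 / (1.0 + math.exp(-(better - worse))))
    pairs = [(r_edited, r_chosen), (r_edited, r_rejected), (r_chosen, r_rejected)]
    return sum(nll(b, w) for b, w in pairs) / len(pairs)
```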

  • SFT: RFT data + synthetic data + a small amount of human-curated data. RFT data (their shorthand for rejection sampling): for each human-annotated prompt, sample K responses from the most recent model and keep the best via rejection sampling for training. -> Since most of the data is produced this way, it goes through several rounds of filtering.

The synthetic data is mostly code-related. Two fun details: to diversify prompts, they give the model a code snippet and ask it to 'generate a prompt inspired by this'; they also turn pretraining data into QA-style examples.
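The rejection-sampling loop for SFT data can be sketched as follows; `generate` and `reward` are hypothetical stand-ins for the latest model and the reward model.

```python
# Minimal sketch of rejection sampling for SFT data: draw K samples from the
# latest model for a human-annotated prompt and keep the best-scoring one.
# `generate` and `reward` are hypothetical stand-ins, not real APIs.
def rejection_sample(prompt, generate, reward, k: int = 10):
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda resp: reward(prompt, resp))
```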

  • DPO: chosen over PPO because it is more efficient at large scale, with the following modifications:
  1. masking formatting tokens – fixes failure modes such as abruptly ended answers and repetition at the end of responses. When the same token appears in both the winning and rejected responses, there is a conflict in the loss, which masking resolves.
  2. regularization with an NLL loss: adding it with a coefficient of 0.2 improves stability.
  • Model averaging – models trained across different data and hyperparameters are all averaged.
  • And the whole pipeline was repeated over 6 rounds.
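A minimal sketch of the modified objective, assuming scalar sequence log-probs (the real loss operates on per-token log-probs with formatting tokens masked out of both responses; beta here is illustrative):

```python
import math

def dpo_with_nll(logp_w: float, logp_l: float,
                 ref_logp_w: float, ref_logp_l: float,
                 beta: float = 0.1, nll_coeff: float = 0.2) -> float:
    """DPO loss plus an NLL regularizer on the chosen response (coeff 0.2).
    logp_* are policy log-probs, ref_logp_* are reference-model log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = -math.log(1.0 / (1.0 + math.exp(-margin)))
    nll = -logp_w                      # NLL of the chosen response
    return dpo + nll_coeff * nll
```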

Visual Experiments

  • Uses ViT-H/14 as the image encoder.
  • Uses cross-attention. For fine-grained ability, hidden states from layers {4, 9, 16, 24, 31} are used together with the final-layer features.
  • Pretrained on 6B-scale image-text pairs at 336 x 336 resolution with an any-resolution scheme of up to 4 grid tiles.
  • Afterwards, similarly to the language model: SFT + RM + DPO, with a final DPO pass trained only on very-high-quality data.
  • benchmark results (figure omitted)

Summary by Seonghyeon: The Llama 3 Herd of Models. The long-awaited technical report is out, alongside the release of the Llama 3 405B model. It contains far more information than the Llama 2 report, which makes it very interesting.

  1. ํ”„๋ฆฌํŠธ๋ ˆ์ด๋‹ ์›น ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ์— ๋Œ€ํ•ด ๋น„๊ต์  ์ƒ์„ธํ•˜๊ฒŒ ๊ธฐ์ˆ ํ•˜๊ณ  ์žˆ๋‹ค. ์ž์ฒด ๊ฐœ๋ฐœํ•œ Main Content Extractor๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค. ์ „์ฒด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ Deduplication์„ ์ˆ˜ํ–‰ํ–ˆ๋‹ค๊ณ  ํ•˜๋Š”๋ฐ Global Deduplication์„ ์‹œ์‚ฌํ•˜๋Š” ๊ฒƒ์ธ์ง€ ๊ถ๊ธˆํ•œ ์ ์ด ์žˆ๋‹ค. ์ถ”๊ฐ€์ ์œผ๋กœ CCNet ์Šคํƒ€์ผ์˜ Line Deduplication์œผ๋กœ Boilerplate๋ฅผ ์ถ”๊ฐ€ ์ œ๊ฑฐ. C4/Gopher ์Šคํƒ€์ผ ํœด๋ฆฌ์Šคํ‹ฑ ํ•„ํ„ฐ์™€ (์•„๋งˆ๋„ LM ๊ธฐ๋ฐ˜์˜) ํ…์ŠคํŠธ ๋ถ„ํฌ์—์„œ์˜ ์•„์›ƒ๋ผ์ด์–ด๋“ค์— ๋Œ€ํ•œ ํ•„ํ„ฐ๋ง๋„ ์‚ฌ์šฉ. Wikipedia๋ฅผ ๋ ˆํผ๋Ÿฐ์Šค๋กœ ์žก์€ fastText, Llama 2 ์˜ˆ์ธก ๊ธฐ๋ฐ˜์˜ DistilRobert ๋ถ„๋ฅ˜๊ธฐ๋กœ ํ€„๋ฆฌํ‹ฐ ํ•„ํ„ฐ๋ง. DeepSeek ์Šคํƒ€์ผ์˜ ์ˆ˜ํ•™๊ณผ ์ฝ”๋“œ ๋„๋ฉ”์ธ์— ํŠนํ™”ํ•œ ์›น ํŽ˜์ด์ง€ ์ถ”์ถœ. ๋ถ„๋ฅ˜๊ธฐ๋ฅผ ์‚ฌ์šฉํ•ด ์›น ๋ฐ์ดํ„ฐ์˜ ๋„๋ฉ”์ธ์„ ๋ถ„๋ฅ˜ํ•˜๊ณ  Scaling Law ์ถ”์ •์„ ํ†ตํ•ด ๋ฐ์ดํ„ฐ ๋ฏน์Šค๋ฅผ ๊ฒฐ์ •. ์ผ๋ฐ˜ ์ง€์‹ 50%, ์ˆ˜ํ•™ ๋ฐ ์ถ”๋ก  ๊ด€๋ จ 25%, ์ฝ”๋“œ 17%, ๋‹ค๊ตญ์–ด 8%. ๊ณ ํ’ˆ์งˆ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต ์ตœ์ข… ๋‹จ๊ณ„์—์„œ Annealing ์ ์šฉ. ์—ญ์œผ๋กœ Annealing์„ ์‚ฌ์šฉํ•ด ๋ฐ์ดํ„ฐ์…‹์˜ ํ€„๋ฆฌํ‹ฐ๋ฅผ ๊ฒ€์ฆํ•˜๊ธฐ๋„ ํ•จ. ๋ฌธ์„œ ๊ฐ„ Attention์„ ์ฐจ๋‹จํ•˜๊ธฐ ์œ„ํ•œ ๋งˆ์Šคํ‚น ์ ์šฉ. Long Context ์ถ”๊ฐ€ ํ•™์Šต์— ์œ ์šฉํ–ˆ๋‹ค๊ณ  ์–ธ๊ธ‰. Polyak Averaging๋„ ์ ์šฉ. ๋‹ค์šด์ŠคํŠธ๋ฆผ ๊ณผ์ œ์— ๋Œ€ํ•œ Likelihood์— ๋Œ€ํ•ด Scaling Law๋ฅผ ์ถ”์ •ํ•œ ๋‹ค์Œ Likelihood์™€ ๊ณผ์ œ์— ๋Œ€ํ•œ ์Šค์ฝ”์–ด์˜ ํ•จ์ˆ˜๋ฅผ ์ถ”์ •ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๋‹ค์šด์ŠคํŠธ๋ฆผ ๊ณผ์ œ์— ๋Œ€ํ•œ Scaling Law๋ฅผ ์ถ”์ •. Pipeline Parallel์˜ ๋ฐฐ์น˜ ํฌ๊ธฐ ์ œ์•ฝ์„ ์™„ํ™”ํ•˜๊ณ  Ring ๋Œ€์‹  All-Gather ๊ธฐ๋ฐ˜ Context Parallel์„ ์‚ฌ์šฉ.
  2. ํฌ์ŠคํŠธํŠธ๋ ˆ์ด๋‹ Reward Modeling์—์„œ ์‹œ์ž‘ํ•ด์„œ Rejection Sampling์œผ๋กœ ์ƒ์„ฑํ•œ ๋ฐ์ดํ„ฐ๋กœ SFT๋ฅผ ํ•˜๊ณ  DPO๋ฅผ ํ•˜๋Š” ํ๋ฆ„. ์ฆ‰ PPO๋ฅผ ์“ฐ์ง€ ์•Š๊ณ  SFT ๋‹จ๊ณ„์—์„œ๋„ ๋ชจ๋ธ ์ƒ์„ฑ ๋ฐ์ดํ„ฐ๊ฐ€ ์ฃผ์ถ•์ด ๋œ๋‹ค. ํฌ์ŠคํŠธํŠธ๋ ˆ์ด๋‹ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ๋Š” ๊ฐ•ํ•˜๊ฒŒ ํ€„๋ฆฌํ‹ฐ ์ปจํŠธ๋กค์„ ํ–ˆ๋‹ค. ์ˆ˜์ž‘์—…์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ํ•„ํ„ฐ๋งํ•˜๊ณ , ๋ชจ๋ธ ๊ธฐ๋ฐ˜์˜ ํ€„๋ฆฌํ‹ฐ ํ•„ํ„ฐ๋ง๊ณผ ๋‚œ์ด๋„์— ๋”ฐ๋ฅธ ๋น„์œจ ์กฐ์ •, ๊ทธ๋ฆฌ๊ณ  Semantic Deduplication์„ ์ ์šฉ. ํฌ์ŠคํŠธํŠธ๋ ˆ์ด๋‹ ์‹œ์ ์—์„œ๋Š” ๊ฐ ๋„๋ฉ”์ธ์— ๋Œ€ํ•ด ํŠนํ™”๋œ ๋ฐ์ดํ„ฐ ๊ตฌ์ถ• ์ž‘์—…๋“ค์„ ์ง„ํ–‰ํ–ˆ๋‹ค. 2.1 ์ฝ”๋“œ ์ฝ”๋“œ ํŠนํ™” ๋ชจ๋ธ์„ ๊ตฌ์ถ•ํ•˜๋Š” ๊ฒƒ์—์„œ ์‹œ์ž‘. ๋ ˆํฌ ๋ ˆ๋ฒจ ๋ฐ์ดํ„ฐ๋„ ํ™œ์šฉํ–ˆ๋‹ค. ์ปดํŒŒ์ผ๋Ÿฌ ํ”ผ๋“œ๋ฐฑ๊ณผ ๋ชจ๋ธ๋กœ ์ƒ์„ฑํ•œ ์œ ๋‹› ํ…Œ์ŠคํŠธ๋ฅผ ์‚ฌ์šฉํ•œ ํ”ผ๋“œ๋ฐฑ์„ ์‚ฌ์šฉํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐœ์„ ํ•˜๊ณ  ํ•™์Šต. ์„œ๋กœ ๋‹ค๋ฅธ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์–ธ์–ด๊ฐ„ ๋ฒˆ์—ญ, ์ฝ”๋“œ ์„ค๋ช…, ์ƒ์„ฑ, ๋ฌธ์„œํ™”, ๋””๋ฒ„๊น… ๋“ฑ์˜ ๊ณผ์ œ์— ๋Œ€ํ•ด์„œ ๋ชจ๋ธ์ด ์‘๋‹ต์„ ์ƒ์„ฑํ•˜๊ฒŒ ํ•œ ๋‹ค์Œ ์› ์ฝ”๋“œ๋กœ Backtranslation์„ ํ•˜๊ฒŒ ํ•˜๊ณ  ์ถœ๋ ฅ ๊ฒฐ๊ณผ์˜ ํ€„๋ฆฌํ‹ฐ๋ฅผ ์‚ฌ์šฉํ•ด ํ•„ํ„ฐ๋ง. 2.2 ๋‹ค๊ตญ์–ด ๋‹ค๊ตญ์–ด์— ๋Œ€ํ•ด์„œ๋„ ํŠนํ™” ๋ชจ๋ธ์„ ํ•™์Šต. NLP ๋ฐ์ดํ„ฐ์…‹๊ณผ ์‚ฌ๋žŒ์ด ์ž‘์„ฑํ•œ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  Rejection Sampling์„ ์ ์šฉํ•œ ๋‹ค์Œ ๋ฃฐ ๊ธฐ๋ฐ˜ ํ€„๋ฆฌํ‹ฐ ํ•„ํ„ฐ๋ง. ๊ธฐ๊ณ„ ๋ฒˆ์—ญ์€ ์˜๋„์ ์œผ๋กœ ์ œ์™ธํ•˜๋ ค๊ณ  ๋…ธ๋ ฅ. 2.3 ์ˆ˜ํ•™ ํ”„๋ฆฌํŠธ๋ ˆ์ด๋‹ ๋ฐ์ดํ„ฐ์™€ ์‚ฌ๋žŒ์„ ํ†ตํ•ด ํ”„๋กฌํ”„ํŠธ๋ฅผ ๊ตฌ์ถ•. ๋ชจ๋ธ๋กœ CoT ์‘๋‹ต์„ ์ƒ์„ฑํ•œ ๋‹ค์Œ ๋ชจ๋ธ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ฒ€์ฆ. Process Reward Model์„ ์‚ฌ์šฉํ•ด ํ•„ํ„ฐ๋งํ•˜๊ณ , ์–ด๋ ค์šด ๋ฌธ์ œ์˜ ๊ฒฝ์šฐ์—๋Š” MCTS์™€ Process Reward Model์„ ์‚ฌ์šฉํ•ด ์‘๋‹ต์„ ์ƒ์„ฑ. 2.4 ์ถ”๋ก  ํ…์ŠคํŠธ์™€ ์ฝ”๋“œ๋ฅผ ์‚ฌ์šฉํ•ด ์ถ”๋ก  ๋ฌธ์ œ๋ฅผ ํ’€๋„๋ก ํ•™์Šต. ์ฝ”๋“œ ์‹คํ–‰ ํ”ผ๋“œ๋ฐฑ์„ ์‚ฌ์šฉํ•˜๊ณ  ์ž˜๋ชป๋œ ์ƒ์„ฑ ๊ฒฐ๊ณผ๋ฅผ ๋ชจ๋ธ์„ ํ†ตํ•ด ์˜ค๋ฅ˜๋ฅผ ๊ต์ •. 
  2.5 Long context: alongside the traditional approach of chunking long documents and generating QA per chunk, they also summarize each chunk and then summarize the summaries. For Python code, they do dependency sorting, delete the most-referenced file, and have the model regenerate that file's code.
  2.6 Tool use: the model is taught to use web search, a Python interpreter, and Wolfram Alpha. Here the data is mostly human-written, though they first instill basic tool-use ability with synthetic data.
  2.7 Factuality: documents are taken from pretraining, the model generates questions, and responses are sampled. Using the document and the response, correctness and informativeness are judged Llama 3-based. Where the model consistently answers incorrectly, a refusal is generated instead.
  2.8 Steerability: annotators write system prompts, then carry out and annotate conversations.
  2.9 Safety: adversarial prompts are built via annotators, and automatic red teaming is applied as well.
  3. Vision, audio. For vision, cross-attention over the image encoder; for audio, the encoder output is projected directly into the model input.

Thoughts: at a high level, the web-data processing adopts the methods now considered canonical. DeepSeek-style mining of code- and math-domain data is validated once again as an important lever. The systematic approach of deciding the data mix via scaling laws is another interesting point. For stable and efficient training infrastructure, the report demonstrates that you have to get your hands dirty at the level of communication and parallelism. Post-training shifts its center of gravity even more decisively toward preference data. SFT, too, consists mostly of training on model-generated samples rejection-sampled with the reward model. Since training effectively happens on online samples, DPO alone, without PPO, becomes a perfectly good choice. For code and math, code-execution/compiler feedback and process reward models/MCTS are key components. Each of the post-training methods mentioned could be a paper in its own right, and post-training is built by combining all of them. The same goes for pretraining and for building the training infrastructure. Frontier models today are made by concentrating a wide range of cutting-edge techniques into a single model. This tends to be hidden behind explicit compute figures like GPU counts, but it suggests that effectively concentrating these many efforts is an extremely important factor.
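The two-step downstream scaling-law estimation mentioned above can be sketched as: first fit a power law from compute to downstream NLL, then a sigmoidal mapping from NLL to task score. All coefficients below are made up for illustration.

```python
import math

# Two-step downstream scaling-law sketch. Step 1: power law from compute
# (FLOPs) to downstream negative log-likelihood. Step 2: sigmoidal mapping
# from NLL to task accuracy. All coefficients are made-up illustrations.
def predict_nll(flops: float, a: float = 30.0, alpha: float = 0.05) -> float:
    return a * flops ** (-alpha)          # step 1: NLL falls with compute

def predict_accuracy(nll: float, k: float = 6.0, x0: float = 1.5) -> float:
    return 1.0 / (1.0 + math.exp(k * (nll - x0)))  # step 2: NLL -> score

def downstream_score(flops: float) -> float:
    return predict_accuracy(predict_nll(flops))
```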