
paper

TL;DR

  • I read this because: it came up while searching for agent + VLM work, and I'm interested in gaming RL. ICML'24 oral
  • task : world modeling, embodied agents (agents that carry out language instructions)
  • problem : existing approaches take a language instruction and act on it, but the real world is closer to language, actions, and video streaming in and out continuously
  • idea : can language be used not just as instructions, but also to acquire knowledge and predict the future?
  • input/output : (world model) {video, text, action} -> {representation of future, (optional) language} (agent) state -> action
  • architecture : (vision encoder) strided image encoder (vision decoder) strided image decoder (text embedding) embedding from scratch or T5 (sequence modeling) GRU /// (policy model) DreamerV3
  • objective : (world model) reconstruction error + regularization + next representation prediction (policy model) maximize expected reward
  • baseline : (model-free RL) IMPALA, R2D2, (task-specific model) EMMA (Messenger)
  • data : (world model) replay buffer from {HomeGrid, Messenger, VLN-CE, LangRoom} (pretraining) Messenger manuals (in-domain), TinyStories (general)
  • evaluation : HomeGrid (proposed), Messenger, VLN-CE, LangRoom (proposed)
  • result :
  • contribution : effective world-model learning from streaming information (the single "text"-modality pretraining seems to be the main contribution?) – training a world model jointly with an actor-critic is DreamerV3's contribution (https://arxiv.org/abs/2301.04104).
  • etc. :

Details


problem setting

  • action: $a_t$ – discrete action
  • reward $r_t$
  • episode end $c_t$ ($c_t = 0$ when the episode ends)
  • observation $o_t$ -> multimodal observation (visual $x_t$, textual $l_t$)
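The per-timestep tuple above can be sketched as a minimal interface; the class and function names here are hypothetical, not from the paper's code:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Observation:
    """Multimodal observation o_t = (visual x_t, textual l_t)."""
    image: List[float]     # stand-in for the visual frame x_t
    token: Optional[int]   # one language token l_t per timestep, None if silent

@dataclass
class Step:
    """What the agent receives after taking a discrete action a_t."""
    obs: Observation
    reward: float          # r_t
    cont: int              # c_t: 0 when the episode ends, else 1

def dummy_step(action: int) -> Step:
    """Hypothetical environment step; a real env would advance its state."""
    return Step(Observation(image=[0.0] * 4, token=None), reward=0.0, cont=1)
```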

world model learning

  • Recurrent State Space Model (RSSM) – uses a GRU-based sequence model
  • $z_t$ : representation -> the model predicts $\hat{z}_{t+1}$
  • $h_t$ : recurrent state
  • multimodal representation
    • compressed into $z_t$ with a variational autoencoder objective; the reward $\hat r_t$ and continuation $\hat c_t$ are then predicted from $z_t$.
    • additionally, a regularizer keeps $z_t$ and $\hat z_t$ from drifting too far apart
  • future prediction
    • ํ˜„์žฌ์˜ model state $z_{t-1}$, $h_{t-1}$์—์„œ ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•œ $\hat {z_t}$๊ฐ€ ์‹ค์ œ ๋‹ค์Œ step์˜ $z_t$์™€ match ๋˜๋„๋ก ํ•™์Šต.
    • world model์ด ๋ฏธ๋ž˜์˜ ํ‘œํ˜„์— ๋Œ€ํ•œ $\hat z_t$๋ฅผ ์˜ˆ์ธกํ•˜๊ฒŒ ํ•จ์œผ๋กœ์„œ ๋ฏธ๋ž˜์˜ image, language, reward๋ฅผ ์˜ˆ์ธกํ•˜๊ณ  ๋‹ค์–‘ํ•œ multiple modalities์˜ correlation์„ ํ•™์Šตํ•˜๋„๋ก ํ•จ
  • single modality pretraining
    • world model์€ offline์œผ๋กœ๋„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— text-only, video-only data๋กœ world model์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Œ
    • text only์˜ ๊ฒฝ์šฐ image, action input์„ zero๋กœ ๋‘๊ณ  decoder loss coefficient๋ฅผ 0์œผ๋กœ ๋‘๋ฉด pretraining์„ ํ•  ์ˆ˜ ์žˆ์Œ.
    • language modeling loss์™€ ๋‹ฌ๋ฆฌ ๋‹ค์Œ์˜ representation์„ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ํ•™์Šต๋จ
    • actor, critic ์„ ์ดˆ๊ธฐํ™”ํ•œ ๋’ค ๊ฐ๊ฐ์˜ modality ์— ๋Œ€ํ•ด ์ด์™€ ๊ฐ™์ด pretraining ํ•  ์ˆ˜ ์žˆ์Œ
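The per-step objective described above (reconstruction + regularization + future-representation matching) can be sketched as a toy loss; plain lists and squared error stand in for DreamerV3's actual categorical distributions and KL terms:

```python
def world_model_step_loss(z_post, z_prior, recon, obs, beta=0.1):
    """Toy stand-in for the world-model objective, not the paper's code.
    z_post:  posterior representation z_t from the encoder
    z_prior: predicted prior z_hat_t from the previous state (h_{t-1}, z_{t-1})
    recon/obs: decoder output vs. the actual multimodal observation
    beta: made-up regularization weight
    """
    # reconstruction error over the (flattened) multimodal observation
    recon_loss = sum((r - o) ** 2 for r, o in zip(recon, obs))
    # future prediction: train the prior z_hat_t to match z_t, and
    # regularize z_t toward z_hat_t so the two don't drift apart
    # (DreamerV3 splits this into two KL terms with stop-gradients)
    match_loss = sum((p - q) ** 2 for p, q in zip(z_post, z_prior))
    return recon_loss + beta * match_loss
```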

policy learning

  • actor-critic, adopting the DreamerV3 architecture as-is
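DreamerV3-style policy learning trains the actor-critic on imagined rollouts inside the world model; a toy sketch (hypothetical callables, simple discounted return instead of the actual λ-returns):

```python
def imagine_rollout(policy, dynamics, z0, horizon=5, gamma=0.99):
    """Roll the learned dynamics forward from a starting latent z0 and
    accumulate discounted predicted rewards. `policy(z) -> action` and
    `dynamics(z, a) -> (z_next, reward)` are hypothetical stand-ins for
    the actor and the world model's prior + reward head."""
    z, ret, discount = z0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(z)          # actor acts on the latent state
        z, r = dynamics(z, a)  # imagine the next latent and its reward
        ret += discount * r
        discount *= gamma
    return ret                 # the critic would regress such returns
```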

experiment

  • RQ1: feeding {image, language} as per-timestep pairs should work better

  • RQ2: since diverse language is used during training, performance should hold up against model-free baselines when diverse kinds of language are given.

  • RQ3: feeding instructions into the world model should be no worse than using a language-conditioned policy.

  • RQ4: shows that the multimodal generative model enables grounded language generation and learning from offline text-only data

  • RQ1: feeding {image, language} as per-timestep pairs should work better

    • compared to other language-conditioned baselines (~10M params), Dynalang performed best.
  • RQ2 & 3:

    • proposes HomeGrid, an environment with language hints in addition to language instructions.
    • the reward is for completing as many quests as possible within 100 steps
    • future observations: where an object is
    • dynamics: what actions are needed to open the trash bin
    • corrections: when the agent moves farther from the current goal, it is told things like "no, turn around"
    • performs better when given hints, and also performs better on task-only instructions.
  • performance on the Messenger game, which provides a game manual

  • Vision-Language Navigation in Continuous Environments (VLN-CE)

    • ๊ธธ์„ ์ฐพ๋Š” task์ด๊ณ  ์กฐ๊ธˆ ๋” action์ด low-level์ธ ์…‹ํŒ…์ด continuous environment
    • goal๊ณผ์˜ ๊ฑฐ๋ฆฌ์™€ ๊ด€๋ จ๋œ dense reward๋ฅผ ๋ฐ›๊ณ , ์„ฑ๊ณตํ•˜๋ฉด ์„ฑ๊ณต reward๋ฅผ ๋ฐ›๋Š” ํ˜•ํƒœ
    • ์•„๋‹ˆ r2d2๋Š” ์•„์˜ˆ ์„ฑ๊ณต์„ ๋ชปํ•˜์ž๋‚˜ ใ…‹ใ…‹ ๋ฒ ์ด์Šค๋ผ์ธ์ด ์ด๊ฒŒ ๋งž๋‚˜
  • LangRoom : embodied question answering

    • shows that the model can also generate language along the way
    • a question-answering setting where utterances are grounded in perception
    • with too large a vocab size, performance failed to converge without a prior
    • this is fixed by adding an entropy regularizer to the world model
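A minimal sketch of what such an entropy regularizer could look like on the token-prediction head; the function, coefficient, and loss form here are illustrative, not the paper's exact formulation:

```python
import math

def token_head_loss(probs, target_idx, ent_coeff=0.01):
    """Toy token-prediction loss with an entropy bonus: negative
    log-likelihood of the target token, minus a small multiple of the
    predicted distribution's entropy so it doesn't collapse early when
    the vocab is large. ent_coeff is a made-up value."""
    nll = -math.log(probs[target_idx])
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return nll - ent_coeff * entropy
```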
  • text-only pretraining

    • the experiments so far used online experience; offline data also seemed necessary, so text-only pretraining was tried.
    • in-domain ์€ manuals from Messenger S2 games
    • domain-general text๋Š” GPT-4๋กœ ์ƒ์„ฑ๋œ 2M short story
    • T5๋ฅผ ์“ฐ๋Š” ๊ฒƒ๋ณด๋‹ค one-hot from scratch๋กœ general domain์— ๋Œ€ํ•ด ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ๋” ์„ฑ๋Šฅ์ด ์ข‹์•˜์Œ.
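The text-only pretraining recipe described earlier (zero the image and action inputs, zero the image-decoder loss coefficient, keep next-representation prediction) might look like this; the names, keys, and dimensions are made up for illustration:

```python
def text_only_batch(tokens, image_dim=4):
    """Hypothetical conversion of a token stream into world-model inputs
    for text-only pretraining: visual and action inputs are zeroed, and
    the image-decoder loss is switched off via its coefficient."""
    batch = []
    for tok in tokens:
        batch.append({
            "token": tok,              # the only live modality
            "image": [0.0] * image_dim,  # zeroed visual input
            "action": 0,               # zeroed / null action
        })
    # next-representation prediction stays on; only the image decoder is off
    loss_coeffs = {"image_decoder": 0.0, "token_decoder": 1.0}
    return batch, loss_coeffs
```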

  • actor model config