
paper

TL;DR

  • I read this because: it came up while searching for agent + VLM work, and I'm interested in gaming RL. ICML'24 oral
  • task : world modeling, embodied agents (agents that carry out language instructions)
  • problem : existing approaches take a language instruction and act on it, but the real world is closer to language, actions, and video streaming in and out continuously
  • idea : can language be used not just as instructions, but also to acquire knowledge and predict the future?
  • input/output : (world model) {video, text, action} -> {representation of future, (optional) language} (agent) state -> action
  • architecture : (vision encoder) strided image encoder (vision decoder) strided image decoder (text embedding) embedding from scratch or T5 (sequence modeling) GRU /// (policy model) DreamerV3
  • objective : (world model) reconstruction error + regularization + next representation prediction (policy model) maximize expected reward
  • baseline : (model-free RL) IMPALA, R2D2, (task-specific model) EMMA (Messenger)
  • data : (world model) replay buffer from {HomeGrid, Messenger, VLN-CE, LangRoom} (pretraining) Messenger manuals (in-domain), TinyStories (general)
  • evaluation : HomeGrid (proposed), Messenger, VLN-CE, LangRoom (proposed)
  • result :
  • contribution : effective world-model learning from streaming information (the single "text"-modality pretraining seems to be the main contribution?) – training a world model jointly with an actor-critic is DreamerV3's contribution (https://arxiv.org/abs/2301.04104).
  • etc. :

Details


problem setting

  • action: $a_t$ – discrete action
  • reward $r_t$
  • episode end $c_t$ ($c_t = 0$ when the episode ends)
  • observation $o_t$ -> multimodal observation (visual $x_t$, textual $l_t$)
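The per-timestep tuple above can be sketched as a minimal interface; the class and function names here are hypothetical, not from the paper's code:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Observation:
    """Multimodal observation o_t = (visual x_t, textual l_t)."""
    image: List[float]     # stand-in for the visual frame x_t
    token: Optional[int]   # one language token l_t per timestep, None if silent

@dataclass
class Step:
    """What the agent receives after taking a discrete action a_t."""
    obs: Observation
    reward: float          # r_t
    cont: int              # c_t: 0 when the episode ends, else 1

def dummy_step(action: int) -> Step:
    """Hypothetical environment step; a real env would advance its state."""
    return Step(Observation(image=[0.0] * 4, token=None), reward=0.0, cont=1)
```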

world model learning

  • Recurrent State Space Model (RSSM) – uses a GRU-based sequence model
  • $z_t$ : representation -> the model predicts $\hat{z}_{t+1}$
  • $h_t$ : recurrent state
  • multimodal representation
    • compressed into $z_t$ with a variational autoencoder objective; the reward $\hat r_t$ and continuation $\hat c_t$ are then predicted from $z_t$.
    • additionally, a regularizer keeps $z_t$ and $\hat z_t$ from drifting too far apart
  • future prediction
    • ํ˜„์žฌ์˜ model state $z_{t-1}$, $h_{t-1}$์—์„œ ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•œ $\hat {z_t}$๊ฐ€ ์‹ค์ œ ๋‹ค์Œ step์˜ $z_t$์™€ match ๋˜๋„๋ก ํ•™์Šต.
    • world model์ด ๋ฏธ๋ž˜์˜ ํ‘œํ˜„์— ๋Œ€ํ•œ $\hat z_t$๋ฅผ ์˜ˆ์ธกํ•˜๊ฒŒ ํ•จ์œผ๋กœ์„œ ๋ฏธ๋ž˜์˜ image, language, reward๋ฅผ ์˜ˆ์ธกํ•˜๊ณ  ๋‹ค์–‘ํ•œ multiple modalities์˜ correlation์„ ํ•™์Šตํ•˜๋„๋ก ํ•จ
  • single modality pretraining
    • world model์€ offline์œผ๋กœ๋„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— text-only, video-only data๋กœ world model์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Œ
    • text only์˜ ๊ฒฝ์šฐ image, action input์„ zero๋กœ ๋‘๊ณ  decoder loss coefficient๋ฅผ 0์œผ๋กœ ๋‘๋ฉด pretraining์„ ํ•  ์ˆ˜ ์žˆ์Œ.
    • language modeling loss์™€ ๋‹ฌ๋ฆฌ ๋‹ค์Œ์˜ representation์„ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ํ•™์Šต๋จ
    • actor, critic ์„ ์ดˆ๊ธฐํ™”ํ•œ ๋’ค ๊ฐ๊ฐ์˜ modality ์— ๋Œ€ํ•ด ์ด์™€ ๊ฐ™์ด pretraining ํ•  ์ˆ˜ ์žˆ์Œ
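The per-step objective described above (reconstruction + regularization + future-representation matching) can be sketched as a toy loss; plain lists and squared error stand in for DreamerV3's actual categorical distributions and KL terms:

```python
def world_model_step_loss(z_post, z_prior, recon, obs, beta=0.1):
    """Toy stand-in for the world-model objective, not the paper's code.
    z_post:  posterior representation z_t from the encoder
    z_prior: predicted prior z_hat_t from the previous state (h_{t-1}, z_{t-1})
    recon/obs: decoder output vs. the actual multimodal observation
    beta: made-up regularization weight
    """
    # reconstruction error over the (flattened) multimodal observation
    recon_loss = sum((r - o) ** 2 for r, o in zip(recon, obs))
    # future prediction: train the prior z_hat_t to match z_t, and
    # regularize z_t toward z_hat_t so the two don't drift apart
    # (DreamerV3 splits this into two KL terms with stop-gradients)
    match_loss = sum((p - q) ** 2 for p, q in zip(z_post, z_prior))
    return recon_loss + beta * match_loss
```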

policy learning

  • actor-critic, adopting the DreamerV3 architecture as-is
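DreamerV3-style policy learning trains the actor-critic on imagined rollouts inside the world model; a toy sketch (hypothetical callables, simple discounted return instead of the actual λ-returns):

```python
def imagine_rollout(policy, dynamics, z0, horizon=5, gamma=0.99):
    """Roll the learned dynamics forward from a starting latent z0 and
    accumulate discounted predicted rewards. `policy(z) -> action` and
    `dynamics(z, a) -> (z_next, reward)` are hypothetical stand-ins for
    the actor and the world model's prior + reward head."""
    z, ret, discount = z0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(z)          # actor acts on the latent state
        z, r = dynamics(z, a)  # imagine the next latent and its reward
        ret += discount * r
        discount *= gamma
    return ret                 # the critic would regress such returns
```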

experiment

  • RQ1: feeding {image, language} as per-timestep pairs should work better

  • RQ2: since diverse language is used during training, performance should hold up against model-free baselines when diverse kinds of language are given.

  • RQ3: feeding instructions into the world model should be no worse than using a language-conditioned policy.

  • RQ4: shows that the multimodal generative model enables grounded language generation and learning from offline text-only data

  • RQ1: feeding {image, language} as per-timestep pairs should work better

    • compared to other language-conditioned baselines (~10M params), Dynalang performed best.
  • RQ2 & 3:

    • proposes HomeGrid, an environment with language hints in addition to language instructions.
    • the reward is for completing as many quests as possible within 100 steps
    • future observations: where an object is
    • dynamics: what actions are needed to open the trash bin
    • corrections: when the agent moves farther from the current goal, it is told things like "no, turn around"
    • performs better when given hints, and also performs better on task-only instructions.
  • performance on the Messenger game, which provides a game manual

  • Vision-Language Navigation in Continuous Environments (VLN-CE)

    • ๊ธธ์„ ์ฐพ๋Š” task์ด๊ณ  ์กฐ๊ธˆ ๋” action์ด low-level์ธ ์…‹ํŒ…์ด continuous environment
    • goal๊ณผ์˜ ๊ฑฐ๋ฆฌ์™€ ๊ด€๋ จ๋œ dense reward๋ฅผ ๋ฐ›๊ณ , ์„ฑ๊ณตํ•˜๋ฉด ์„ฑ๊ณต reward๋ฅผ ๋ฐ›๋Š” ํ˜•ํƒœ
    • ์•„๋‹ˆ r2d2๋Š” ์•„์˜ˆ ์„ฑ๊ณต์„ ๋ชปํ•˜์ž๋‚˜ ใ…‹ใ…‹ ๋ฒ ์ด์Šค๋ผ์ธ์ด ์ด๊ฒŒ ๋งž๋‚˜
  • LangRoom : embodied question answering

    • shows that the model can also generate language along the way
    • a question-answering setting where utterances are grounded in perception
    • with too large a vocab size, performance failed to converge without a prior
    • this is fixed by adding an entropy regularizer to the world model
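A minimal sketch of what such an entropy regularizer could look like on the token-prediction head; the function, coefficient, and loss form here are illustrative, not the paper's exact formulation:

```python
import math

def token_head_loss(probs, target_idx, ent_coeff=0.01):
    """Toy token-prediction loss with an entropy bonus: negative
    log-likelihood of the target token, minus a small multiple of the
    predicted distribution's entropy so it doesn't collapse early when
    the vocab is large. ent_coeff is a made-up value."""
    nll = -math.log(probs[target_idx])
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return nll - ent_coeff * entropy
```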
  • text-only pretraining

    • the experiments so far used online experience; offline data also seemed necessary, so text-only pretraining was tried.
    • in-domain ์€ manuals from Messenger S2 games
    • domain-general text๋Š” GPT-4๋กœ ์ƒ์„ฑ๋œ 2M short story
    • T5๋ฅผ ์“ฐ๋Š” ๊ฒƒ๋ณด๋‹ค one-hot from scratch๋กœ general domain์— ๋Œ€ํ•ด ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ๋” ์„ฑ๋Šฅ์ด ์ข‹์•˜์Œ.
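The text-only pretraining recipe described earlier (zero the image and action inputs, zero the image-decoder loss coefficient, keep next-representation prediction) might look like this; the names, keys, and dimensions are made up for illustration:

```python
def text_only_batch(tokens, image_dim=4):
    """Hypothetical conversion of a token stream into world-model inputs
    for text-only pretraining: visual and action inputs are zeroed, and
    the image-decoder loss is switched off via its coefficient."""
    batch = []
    for tok in tokens:
        batch.append({
            "token": tok,              # the only live modality
            "image": [0.0] * image_dim,  # zeroed visual input
            "action": 0,               # zeroed / null action
        })
    # next-representation prediction stays on; only the image decoder is off
    loss_coeffs = {"image_decoder": 0.0, "token_decoder": 1.0}
    return batch, loss_coeffs
```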

  • actor model config