Image

paper

TL;DR

  • I read this because: I came across it while searching for agent + VLM work, since I'm interested in gaming RL. ICML'24 oral
  • task : world modeling, embodied agent (agent that performs language instructions)
  • problem : Existing methods take a language instruction and output actions, but the real world is more like a continuous stream of language, actions, and video.
  • IDEA : Couldn’t we use language not only for instruction, but also for acquiring knowledge and predicting the future?
  • input/output : (world model) {video, text, action} -> {representation of future, (optional) language} (agent) state -> action
  • architecture : (vision encoder) strided image encoder (vision decoder) strided image decoder (text embedding) embedding from scratch or T5 (sequence modeling) GRU /// (policy model) DreamerV3
  • objective : (world model) reconstruction error + regularization + next representation prediction (policy model) maximize expected reward
  • baseline : (model-free RL) IMPALA, R2D2, (task-specific model) EMMA (Messenger)
  • data : (world model) replay buffer from {homegrid, messenger, vln-ce, langroom} (pretraining) messenger manual(in-domain), tiny stories(general)
  • evaluation : HomeGrid (proposed), Messenger, VLN-CE, LangRoom (proposed)
  • result :
  • contribution : Effective world model learning from streaming input (single-modality "text" pretraining seems to be the main contribution?) – learning with a world model + actor-critic is a contribution of DreamerV3 (https://arxiv.org/abs/2301.04104).
  • etc. :

Details

Image Image

problem setting

  • action: $a_t$ – discrete action
  • reward $r_t$
  • episode end $c_t$ ($c_t = 0$ when the episode ends)
  • observation $o_t$ -> multimodal observation (visual $x_t$, textual $l_t$)
Image Image
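The tuple above can be written out as a minimal interface sketch (field names and shapes are my own; the one-token-per-timestep text stream matches my reading of the paper):

```python
from typing import NamedTuple
import numpy as np

class TimeStep(NamedTuple):
    """One step of the multimodal setting above (names/shapes are mine)."""
    x: np.ndarray  # visual observation x_t, e.g. an RGB frame
    l: np.ndarray  # textual observation l_t, e.g. one token id per step
    a: int         # discrete action a_t
    r: float       # reward r_t
    c: int         # continue flag c_t (0 when the episode ends)

# Example: a 64x64 RGB frame paired with a single language token.
step = TimeStep(
    x=np.zeros((64, 64, 3), dtype=np.uint8),
    l=np.array([17]),  # hypothetical token id from the text stream
    a=2,
    r=0.0,
    c=1,
)
```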

world model learning

  • Image
  • Recurrent State Space Model (RSSM) – uses a GRU-based sequence model
  • Predicts the next representation: $z_t$ -> $\hat{z}_{t+1}$
    • $h_t$ : recurrent state
  • multimodal representation
  • Compress $z_t$ with a variational autoencoder objective, then predict the reward $\hat r_t$ and continue flag $\hat c_t$ from $z_t$ as well.
    • Image
  • Additionally, a regularizer keeps $z_t$ and the prediction $\hat z_t$ from drifting too far apart
  • future prediction
    • Image
  • Learn to make the prediction $\hat{z}_t$ from the current model state ($z_{t-1}$, $h_{t-1}$) match the actual $z_t$ at the next step.
  • Letting the world model predict future representations $\hat z_t$ allows it to predict future images, language, and rewards, and to learn correlations across modalities
  • single modality pretraining
  • World models can be trained offline, so they can also be trained on text-only or video-only data
  • For text-only data, pretrain by zeroing the image and action inputs and setting the image decoder loss coefficient to zero.
  • Unlike a language-modeling loss, training predicts the representation of the next step rather than the next token.
  • After pretraining the world model on a single modality like this, the actor and critic can be initialized and trained on top of it
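A toy numpy sketch of how these pieces fit together, as I read them (all sizes and parameter names are mine; a single tanh layer stands in for the GRU, and plain MSE terms stand in for the VAE/KL objectives):

```python
import numpy as np

rng = np.random.default_rng(0)
D_H, D_Z, D_X, D_L, D_A = 32, 16, 64, 8, 4  # arbitrary toy sizes

# Random stand-in parameters; a real model would learn these.
W_enc = rng.normal(0, 0.1, (D_X + D_L, D_Z))        # multimodal encoder
W_rnn = rng.normal(0, 0.1, (D_H + D_Z + D_A, D_H))  # GRU stand-in
W_pri = rng.normal(0, 0.1, (D_H, D_Z))              # predicts z_hat_t from h_t
W_dec = rng.normal(0, 0.1, (D_Z + D_H, D_X + D_L))  # reconstructs x_t, l_t

def encode(x, l):
    """Representation z_t from the multimodal observation (x_t, l_t)."""
    return np.tanh(np.concatenate([x, l]) @ W_enc)

def recurrent_step(h, z, a):
    """Recurrent state update h_t = f(h_{t-1}, z_{t-1}, a_{t-1})."""
    return np.tanh(np.concatenate([h, z, a]) @ W_rnn)

def world_model_loss(h_prev, z_prev, a_prev, x, l):
    """Reconstruction + future prediction + regularizer, as in the bullets above."""
    h = recurrent_step(h_prev, z_prev, a_prev)
    z = encode(x, l)                      # actual representation z_t
    z_hat = np.tanh(h @ W_pri)            # predicted representation
    recon = np.tanh(np.concatenate([z, h]) @ W_dec)
    target = np.concatenate([x, l])
    loss_recon = np.mean((recon - target) ** 2)  # decode x_t, l_t back out
    loss_pred = np.mean((z_hat - z) ** 2)        # make z_hat_t match z_t
    loss_reg = 0.1 * np.mean(z ** 2)             # keeps z_t well-behaved
    return loss_recon + loss_pred + loss_reg

# One transition's worth of losses on dummy data.
loss = world_model_loss(np.zeros(D_H), np.zeros(D_Z), np.zeros(D_A),
                        rng.normal(size=D_X), rng.normal(size=D_L))
```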

policy learning

  • Replicates DreamerV3's actor-critic structure
Image
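The imagination-based actor-critic loop can be sketched under my own simplifications (abstract callables instead of real networks; the lambda-return recursion follows the DreamerV3 form with a continue flag):

```python
def imagine_rollout(h0, world_step, actor, horizon=15):
    """Roll out entirely in latent space: no environment interaction."""
    hs, actions = [h0], []
    for _ in range(horizon):
        a = actor(hs[-1])                 # policy acts on the latent state
        hs.append(world_step(hs[-1], a))  # world model predicts the next state
        actions.append(a)
    return hs, actions

def lambda_returns(rewards, values, cont, gamma=0.997, lam=0.95):
    """Bootstrapped lambda-returns, computed backwards over the rollout.

    `values` has one more entry than `rewards` (the bootstrap value).
    """
    ret = values[-1]
    out = []
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * cont[t] * (
            (1 - lam) * values[t + 1] + lam * ret)
        out.append(ret)
    return out[::-1]

# Toy usage with trivial dynamics/policy (not real networks).
states, acts = imagine_rollout(0.0, lambda h, a: h + 1.0, lambda h: 0, horizon=3)
```

The actor is then trained to maximize these returns and the critic to regress them, all on imagined trajectories.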

experiment

  • RQ1: Feeding {image, language} as a pair at each timestep should work better

  • RQ2: Since diverse language is used during training, performance should improve over model-free baselines when different types of language are included.

  • RQ3: Putting instructions into the world model is no worse than using language-conditioned policies.

  • RQ4: Show that a multimodal generative model enables grounded language generation and learning from offline text-only data

  • RQ1: Feeding {image, language} as a pair at each timestep should work better

    • Image
  • Dynalang performed best compared to other baselines that were language conditioned (~10M).

  • RQ2 & 3:

  • Proposes HomeGrid, an environment with language hints in addition to language instructions.

    • Image
  • Reward for completing many quests within 100 steps

  • future observation : where object is

  • dynamics: what action must be taken to open the trash can

  • correction: the environment says things like "no, turn around" when the agent is moving away from its current goal

  • It performs better with hints, and also performs better even with task-only instructions.

    • Image
  • Performance for Messenger games for which a game manual is provided

  • Image
  • Vision-Language Navigation in Continuous Environments (VLN-CE)

    • Image
  • A continuous-environment setting: a wayfinding task with low-level actions

  • Dense rewards related to distance to goal, with success rewards if successful

  • Wait, R2D2 doesn't succeed at all here. lol Is this really the baseline?

  • LangRoom : embodied question answering

  • Demonstrates that the model can also generate language along the way

  • A setting where the agent perceives the scene and answers questions by producing utterances as actions

    • Image
  • When the vocab size was increased too much, performance did not converge without a prior

    • Image
  • Added an entropy regularizer to the world model to resolve this

    • Image
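A guess at what that regularizer looks like (exactly which distribution it is applied to is not clear from my notes; this just shows the generic "subtract beta times entropy" shape):

```python
import numpy as np

def entropy(probs, eps=1e-8):
    """Shannon entropy of a categorical distribution."""
    p = np.asarray(probs, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def regularized_loss(base_loss, probs, beta=1e-3):
    """Subtracting beta * entropy rewards keeping the distribution spread out."""
    return base_loss - beta * entropy(probs)

# A uniform distribution over 4 outcomes has the maximum entropy, ln(4).
h_uniform = entropy([0.25, 0.25, 0.25, 0.25])
```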
  • text-only pretraining

  • They tried text-only pretraining because the experiments so far were on online experience, and offline training seemed necessary as well.

    • Image
  • in-domain text is manuals from Messenger S2 games

  • domain-general text is 2M short stories generated with GPT-4

  • For domain-general text, learning token embeddings from scratch (one-hot) performed better than using T5 embeddings.
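The masking trick from the single-modality pretraining bullets, sketched with hypothetical names (batch keys, shapes, and coefficient names are all mine):

```python
import numpy as np

def make_text_only_batch(tokens, img_shape=(64, 64, 3)):
    """Text-only pretraining batch: image and action inputs zeroed out."""
    T = len(tokens)
    return {
        "image": np.zeros((T, *img_shape), dtype=np.float32),  # masked out
        "token": np.asarray(tokens, dtype=np.int64),           # the text stream
        "action": np.zeros(T, dtype=np.int64),                 # masked out
    }

# Loss coefficients: the image decoder term is switched off for text-only data;
# the next-representation prediction objective stays on.
LOSS_COEFS = {"image_recon": 0.0, "token_recon": 1.0, "next_rep": 1.0}

batch = make_text_only_batch([4, 8, 15])
```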


  • actor model config
    • Image