Image

paper

TL;DR

  • I read this because: I came across it while searching for agent + VLM work, since I'm interested in gaming RL. ICML'24 oral
  • task : world modeling, embodied agent (agent that performs language instructions)
  • problem : Existing methods take a language instruction and output actions, but the real world is more like a continuous stream of language, actions, and video.
  • IDEA : Couldn’t we use language not only for instruction, but also for acquiring knowledge and predicting the future?
  • input/output : (world model) {video, text, action} -> {representation of future, (optional) language} (agent) state -> action
  • architecture : (vision encoder) strided image encoder (vision decoder) strided image decoder (text embedding) embedding from scratch or T5 (sequence modeling) GRU /// (policy model) DreamerV3
  • objective : (world model) reconstruction error + regularization + next representation prediction (policy model) maximize expected reward
  • baseline : (model-free RL) IMPALA, R2D2, (task-specific model) EMMA (Messenger)
  • data : (world model) replay buffer from {homegrid, messenger, vln-ce, langroom} (pretraining) messenger manual(in-domain), tiny stories(general)
  • evaluation : HomeGrid (proposed), Messenger, VLN-CE, LangRoom (proposed)
  • result :
  • contribution : Effective world model learning from streaming input (single-modality "text" pretraining seems to be the main contribution?) – learning with a world model + actor-critic is a contribution of DreamerV3 (https://arxiv.org/abs/2301.04104).
  • etc. :

Details

Image Image

problem setting

  • action: $a_t$ – discrete action
  • reward $r_t$
  • episode end $c_t$ ($c_t = 0$ when the episode ends)
  • observation $o_t$ -> multimodal observation (visual $x_t$, textual $l_t$)
Image Image
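The tuple above can be written out as a minimal interface sketch (field names and shapes are my own; the one-token-per-timestep text stream matches my reading of the paper):

```python
from typing import NamedTuple
import numpy as np

class TimeStep(NamedTuple):
    """One step of the multimodal setting above (names/shapes are mine)."""
    x: np.ndarray  # visual observation x_t, e.g. an RGB frame
    l: np.ndarray  # textual observation l_t, e.g. one token id per step
    a: int         # discrete action a_t
    r: float       # reward r_t
    c: int         # continue flag c_t (0 when the episode ends)

# Example: a 64x64 RGB frame paired with a single language token.
step = TimeStep(
    x=np.zeros((64, 64, 3), dtype=np.uint8),
    l=np.array([17]),  # hypothetical token id from the text stream
    a=2,
    r=0.0,
    c=1,
)
```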

world model learning

  • Image
  • Recurrent State Space Model (RSSM) – uses a GRU-based sequence model
  • Predicts the next representation: $z_t$ -> $\hat{z}_{t+1}$
    • $h_t$ : recurrent state
  • multimodal representation
  • Compress $z_t$ with a variational autoencoder objective, then predict the reward $\hat r_t$ and continue flag $\hat c_t$ from $z_t$ as well.
    • Image
  • Additionally, a regularizer keeps $z_t$ and the prediction $\hat z_t$ from drifting too far apart
  • future prediction
    • Image
  • Learn to make the prediction $\hat{z}_t$ from the current model state ($z_{t-1}$, $h_{t-1}$) match the actual $z_t$ at the next step.
  • Letting the world model predict future representations $\hat z_t$ allows it to predict future images, language, and rewards, and to learn correlations across modalities
  • single modality pretraining
  • World models can be trained offline, so they can also be trained on text-only or video-only data
  • For text-only data, pretrain by zeroing the image and action inputs and setting the image decoder loss coefficient to zero.
  • Unlike a language-modeling loss, training predicts the representation of the next step rather than the next token.
  • After pretraining the world model on a single modality like this, the actor and critic can be initialized and trained on top of it
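A toy numpy sketch of how these pieces fit together, as I read them (all sizes and parameter names are mine; a single tanh layer stands in for the GRU, and plain MSE terms stand in for the VAE/KL objectives):

```python
import numpy as np

rng = np.random.default_rng(0)
D_H, D_Z, D_X, D_L, D_A = 32, 16, 64, 8, 4  # arbitrary toy sizes

# Random stand-in parameters; a real model would learn these.
W_enc = rng.normal(0, 0.1, (D_X + D_L, D_Z))        # multimodal encoder
W_rnn = rng.normal(0, 0.1, (D_H + D_Z + D_A, D_H))  # GRU stand-in
W_pri = rng.normal(0, 0.1, (D_H, D_Z))              # predicts z_hat_t from h_t
W_dec = rng.normal(0, 0.1, (D_Z + D_H, D_X + D_L))  # reconstructs x_t, l_t

def encode(x, l):
    """Representation z_t from the multimodal observation (x_t, l_t)."""
    return np.tanh(np.concatenate([x, l]) @ W_enc)

def recurrent_step(h, z, a):
    """Recurrent state update h_t = f(h_{t-1}, z_{t-1}, a_{t-1})."""
    return np.tanh(np.concatenate([h, z, a]) @ W_rnn)

def world_model_loss(h_prev, z_prev, a_prev, x, l):
    """Reconstruction + future prediction + regularizer, as in the bullets above."""
    h = recurrent_step(h_prev, z_prev, a_prev)
    z = encode(x, l)                      # actual representation z_t
    z_hat = np.tanh(h @ W_pri)            # predicted representation
    recon = np.tanh(np.concatenate([z, h]) @ W_dec)
    target = np.concatenate([x, l])
    loss_recon = np.mean((recon - target) ** 2)  # decode x_t, l_t back out
    loss_pred = np.mean((z_hat - z) ** 2)        # make z_hat_t match z_t
    loss_reg = 0.1 * np.mean(z ** 2)             # keeps z_t well-behaved
    return loss_recon + loss_pred + loss_reg

# One transition's worth of losses on dummy data.
loss = world_model_loss(np.zeros(D_H), np.zeros(D_Z), np.zeros(D_A),
                        rng.normal(size=D_X), rng.normal(size=D_L))
```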

policy learning

  • Replicates DreamerV3's actor-critic structure
Image
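The imagination-based actor-critic loop can be sketched under my own simplifications (abstract callables instead of real networks; the lambda-return recursion follows the DreamerV3 form with a continue flag):

```python
def imagine_rollout(h0, world_step, actor, horizon=15):
    """Roll out entirely in latent space: no environment interaction."""
    hs, actions = [h0], []
    for _ in range(horizon):
        a = actor(hs[-1])                 # policy acts on the latent state
        hs.append(world_step(hs[-1], a))  # world model predicts the next state
        actions.append(a)
    return hs, actions

def lambda_returns(rewards, values, cont, gamma=0.997, lam=0.95):
    """Bootstrapped lambda-returns, computed backwards over the rollout.

    `values` has one more entry than `rewards` (the bootstrap value).
    """
    ret = values[-1]
    out = []
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * cont[t] * (
            (1 - lam) * values[t + 1] + lam * ret)
        out.append(ret)
    return out[::-1]

# Toy usage with trivial dynamics/policy (not real networks).
states, acts = imagine_rollout(0.0, lambda h, a: h + 1.0, lambda h: 0, horizon=3)
```

The actor is then trained to maximize these returns and the critic to regress them, all on imagined trajectories.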

experiment

  • RQ1: Feeding {image, language} as a pair at each timestep should work better

  • RQ2: Since diverse language is used during training, performance should improve over model-free baselines when different types of language are included.

  • RQ3: Putting instructions into the world model is no worse than using language-conditioned policies.

  • RQ4: Show that a multimodal generative model enables grounded language generation and learning from offline text-only data

  • RQ1: Feeding {image, language} as a pair at each timestep should work better

    • Image
  • Dynalang performed best compared to other baselines that were language conditioned (~10M).

  • RQ2 & 3:

  • Proposes HomeGrid, an environment with language hints in addition to language instructions.

    • Image
  • Reward for completing many quests within 100 steps

  • future observation : where object is

  • dynamics: what action must be taken to open the trash can

  • correction: the environment says things like "no, turn around" when the agent is moving away from its current goal

  • It performs better with hints, and also performs better even with task-only instructions.

    • Image
  • Performance for Messenger games for which a game manual is provided

  • Image
  • Vision-Language Navigation in Continuous Environments (VLN-CE)

    • Image
  • A continuous-environment setting: a wayfinding task with low-level actions

  • Dense rewards related to distance to goal, with success rewards if successful

  • Wait, R2D2 doesn't succeed at all here. lol Is this really the baseline?

  • LangRoom : embodied question answering

  • Demonstrates that the model can also generate language along the way

  • A setting where the agent perceives the scene and answers questions by producing utterances as actions

    • Image
  • When the vocab size was increased too much, performance did not converge without a prior

    • Image
  • Added an entropy regularizer to the world model to resolve this

    • Image
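A guess at what that regularizer looks like (exactly which distribution it is applied to is not clear from my notes; this just shows the generic "subtract beta times entropy" shape):

```python
import numpy as np

def entropy(probs, eps=1e-8):
    """Shannon entropy of a categorical distribution."""
    p = np.asarray(probs, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def regularized_loss(base_loss, probs, beta=1e-3):
    """Subtracting beta * entropy rewards keeping the distribution spread out."""
    return base_loss - beta * entropy(probs)

# A uniform distribution over 4 outcomes has the maximum entropy, ln(4).
h_uniform = entropy([0.25, 0.25, 0.25, 0.25])
```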
  • text-only pretraining

  • They tried text-only pretraining because the experiments so far were on online experience, and offline training seemed necessary as well.

    • Image
  • in-domain text is manuals from Messenger S2 games

  • domain-general text is 2M short stories generated with GPT-4

  • For domain-general text, learning token embeddings from scratch (one-hot) performed better than using T5 embeddings.
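The masking trick from the single-modality pretraining bullets, sketched with hypothetical names (batch keys, shapes, and coefficient names are all mine):

```python
import numpy as np

def make_text_only_batch(tokens, img_shape=(64, 64, 3)):
    """Text-only pretraining batch: image and action inputs zeroed out."""
    T = len(tokens)
    return {
        "image": np.zeros((T, *img_shape), dtype=np.float32),  # masked out
        "token": np.asarray(tokens, dtype=np.int64),           # the text stream
        "action": np.zeros(T, dtype=np.int64),                 # masked out
    }

# Loss coefficients: the image decoder term is switched off for text-only data;
# the next-representation prediction objective stays on.
LOSS_COEFS = {"image_recon": 0.0, "token_recon": 1.0, "next_rep": 1.0}

batch = make_text_only_batch([4, 8, 15])
```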


  • actor model config
    • Image