[216] Emerging Properties in Unified Multimodal Pretraining

TL;DR

I read this because…: unified multimodal model, open-source model against proprietary models like GPT-4o/Gemini 2.0. arXiv 2025. a.k.a. BAEGEL
task: unified multimodal understanding & generation (text, image, video)
Problem: Existing open-source models have separated understanding and generation, and have a large performance gap with proprietary models. Need a unified model trained on interleaved multimodal data
Idea: Mixture-of-Transformer-Experts (MoT) architecture for both multimodal understanding and generation in a single model, unleashing emergent capabilities
input/output: (understanding) {text, image, video} -> text response (generation) text prompt -> image/video (editing) {text instruction, image} -> edited image
architecture: (visual encoder) VAE encoder (FLUX) + ViT (Siglip-SO400M) (Transformer) 14A7B Transformer(Qwen2.5 LLM) (visual decoder) VAE decoder (FLUX)
objective: next group of token prediction (text tokens + visual tokens) + reconstruction loss + generation loss
baseline: (VLM understanding) Qwen2.5-VL, InternVL-2.5 (T2I generation) SD3-Medium, FLUX-1-dev (image editing) InstructPix2Pix, Janus-Pro-7B
data: (pretraining) trillions of interleaved text, image, video, web data (continued training) refined multimodal datasets (supervised finetuning) specific task datasets
evaluation: MME, MMBench, MM-Vet, MathVista (understanding), GenEval (T2I), GEdit-Bench-EN, IntelligentBench (editing)
results: (1) outperforms top-tier open models (Qwen2.5-VL, InternVL-2.5) on VLM benchmarks (2) performs competitively with specialist T2I models with a GenEval of 0.88 (3) emergent capabilities: basic multimodal understanding/generation, traditional editing, intelligent editing + world modeling
contribution:
Unify multimodal understanding + generation + editing in a single unified model
Prove the scalability of your MoT architecture
Express emergent capabilities (free-form editing, future frame prediction, 3D manipulation, world navigation)
Demonstrate the effectiveness of interleaved multimodal pretraining
etc: Fully open source with Apache 2.0 license. Developed by the ByteDance Seed team

Details

architecture

There are methods such as quantized auto-regressive and external diffusion, but the former has too low performance and the latter has limited expressiveness because the vector must be compressed through an adapter. Therefore, an integrated transformer structure is used.

arch design choice
- visual understanding: Siglip-SO400M/14, NaViT, 980x980 maximum input size
- visual generation: pretrained VAE from FLUX – down sample ratio 8 – 2x2 patch embedding
causal attention
noised VAE tokens: Diffusion noise added to the VAE latent. Created with Rectified flow and only used to score mse loss
clean VAE tokens: clean VAE latents used as conditions for generating images or text.
ViT tokens: Tokens for understanding images in interleaved data
The above three clean VAE and ViT tokens are only causally attended to each other, so they can see each other (and would see each other if they came before), but the noisy VAE tokens are masked from attending.
Mixture-of-Transformer
Make two copies of the Qwen 2.5 LLM and use each as an expert for understanding and generation, but only share the attention layer.
Explain that the learning curve was better than MoE, especially in generation loss

data

Trying to make a very complex and detailed Interleaved.

text-only data
vision-text paired data
vision-text interleaved data
video data: Koala36M (looks like clean segmented video data), MVImgNet2.0 (images viewed in multi-view)
- web data : OmniCorpus, image-editing data
data filtering
video data: breaking it into segments and clips with shot detection
Merge segments based on visual semilarity
logo, black border cleared
Finally, filter by length, Resolution, claritiy, motion clarity, dedup with CLIP
web data: Similar to DeepSeekMath, LLM is used to classify document topics, then fastText is used to train a classifier to filter, and then LLM is used for fine-grained filtering.
data construction
- interleaved data from videos
Generate descriptions of visual changes in consecutive frames, creating captions with Qwen2.5-VL-7B (limiting generation to 30 tokens to reduce hallucination). Generate captions by pulling an average of 4 frames per video
- interleaved data from webs
Created with a captioner at the end to ensure the image and text align well
- reasoning-augmented data
t2i generation -> free-form image manipulation -> conceptual edits Generate this process using different models. for example, reasoning, generate with deepseek-r1.

training

result

World modeling / world navigation skills manifested.

TL;DR#

Details#

architecture#

data#

training#

result#