TL;DR
- I read this because : time to do VLMs… probably the first public interleaved image-text dataset; used to train OpenFlamingo.
- task : data
- problem : open interleaved image-text data
- IDEA : Acquire images starting from a common crawl.
- input/output : sequence of images, sequence of texts -> text
- architecture : OpenFlamingo (3B) https://github.com/long8v/PTIR/issues/118
- objective : CE loss
- baseline : Flamingo trained with LAION-2B only
- data : Multimodal C4 (mmc4), Multimodal C4 fewer-faces (mmc4-ff), mmc4-core, mmc4-core-ff -> evaluated on COCO captioning
- evaluation : zero-shot captioning, 4-, 8-shot captioning
- result : much better performance than the LAION-2B-pretrained baseline (amazing…)
- contribution : first public open interleaved image-text data
- etc. :
Details
mmc4
Data Curation Process
- source : the clean version of C4 (365M documents, 156B tokens), built from the April 2019 Common Crawl snapshot (the commonly used dump)
- images : Re-download the original web pages from C4, then download their images. Drop image URLs containing the words logo, button, icon, plugin, or widget, and keep only the png / jpeg / jpg extensions. Resize so the longer side is 800 pixels. -> 115M documents / 1.37B images
- dedup + small resolution
- Use dedup: https://gitlab.com/opennota/findimagedupes
- small resolution: remove if the shorter side is less than 150 pixels
- Remove if the aspect ratio is greater than or equal to 2 or less than 0.5 (reported to help remove banner-style ads)
- In a sample of 3.7K images, about 2.5% were ads
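The URL and geometry filters above can be sketched roughly as follows. The banned-word list, extension whitelist, and thresholds come from the note; the function names are hypothetical helpers, not mmc4's actual code:

```python
import os
from urllib.parse import urlparse

# Filter rules described in the note (thresholds from mmc4's curation).
BANNED_WORDS = {"logo", "button", "icon", "plugin", "widget"}
ALLOWED_EXTS = {".png", ".jpeg", ".jpg"}
MIN_SHORT_SIDE = 150      # drop images whose shorter side is < 150 px
MAX_ASPECT_RATIO = 2.0    # drop if long/short ratio >= 2 (banner-like ads)

def keep_image_url(url: str) -> bool:
    """URL-level filter: extension whitelist plus banned keywords."""
    path = urlparse(url).path.lower()
    if os.path.splitext(path)[1] not in ALLOWED_EXTS:
        return False
    return not any(word in path for word in BANNED_WORDS)

def keep_image_size(width: int, height: int) -> bool:
    """Pixel-level filter: minimum resolution plus aspect-ratio bounds."""
    short, long_ = min(width, height), max(width, height)
    if short < MIN_SHORT_SIDE:
        return False
    return long_ / short < MAX_ASPECT_RATIO

print(keep_image_url("https://example.com/images/cat.jpg"))   # True
print(keep_image_url("https://example.com/assets/logo.png"))  # False
print(keep_image_size(800, 600))   # True
print(keep_image_size(1200, 200))  # False: banner-like aspect ratio
```

Checking the ratio on `long/short` covers both directions of the ">= 2 or <= 0.5" rule with one comparison.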
- NSFW
- Uses the NSFW binary classifier from the dataset2metadata package
- The classifier was trained on images labeled NSFW in LAION-2B
- Aligning images and sentences
- C4 is a preprocessed (text-only) version, while the images were downloaded from the full pages, so an image may have no corresponding text
- Ideally one would inspect the HTML DOM, but they don't
- First, compute the pairwise similarity between every image and every sentence in the document.
- Remove an image if its similarity to every sentence is below 0.15.
- Then assign images to sentences via bipartite matching, so that each sentence gets at most one image.
- This gives better coverage than simply assigning each image to its most-similar sentence.
- After assignment, each image is placed either before or after its sentence, following how Flamingo interleaves images and text.
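The threshold-then-match step above can be sketched with SciPy's Hungarian-algorithm solver; the 0.15 threshold is from the note, while the similarity matrix below is toy data and the exact matching algorithm mmc4 uses may differ:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

SIM_THRESHOLD = 0.15  # images below this similarity to every sentence are dropped

def align_images_to_sentences(sim: np.ndarray) -> dict:
    """sim[i, j] = similarity between image i and sentence j.
    Returns {image_index: sentence_index}; each sentence gets at most one image."""
    # Drop images whose best similarity is still below the threshold.
    keep = np.where(sim.max(axis=1) >= SIM_THRESHOLD)[0]
    if keep.size == 0:
        return {}
    # Maximize total similarity under a one-image-per-sentence constraint.
    rows, cols = linear_sum_assignment(sim[keep], maximize=True)
    return {int(keep[r]): int(c) for r, c in zip(rows, cols)}

# Toy similarity matrix: 3 images x 4 sentences (made-up numbers).
sim = np.array([
    [0.30, 0.10, 0.05, 0.02],
    [0.28, 0.25, 0.08, 0.04],  # greedy argmax would also pick sentence 0
    [0.05, 0.06, 0.09, 0.10],  # below threshold everywhere -> dropped
])
print(align_images_to_sentences(sim))  # {0: 0, 1: 1}
```

Note how image 1 gets sentence 1 rather than sentence 0: the matching spreads images across sentences, which is exactly the coverage argument the note makes against a plain similarity-argmax assignment.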
Real-world examples
Exploring mmc4
url source
topics : LDA topics -> top frequent words per topic -> named with GPT-4
Result
Comparing a model trained on mmc4 (OpenFlamingo) with one trained on LAION-2B
Retrieval
COCO caption
Comparison of MSCOCO captioning zero-shot / 4-shot / 8-shot performance against a model trained on a 15M subset of LAION-2B. Red is zero-shot performance; the reason it is worse than 4- and 8-shot may be that LAION-2B contains only short texts, so the model struggles with longer text. Shouldn't it be compared against the full 2B, though?
(from FLAMINGO, coco dev set, 4shot)