TL;DR
- I read this because : time to do VLMs… probably the first public interleaved image-text dataset; used to train OpenFlamingo.
- task : data
- problem : open interleaved image-text data
- IDEA : Acquire images starting from a common crawl.
- input/output : sequence of images, sequence of texts -> text
- architecture : OpenFlamingo (3B) https://github.com/long8v/PTIR/issues/118
- objective : CE loss
- baseline : Flamingo trained with LAION-2B only
- data : Multimodal C4 (mmc4), Multimodal C4 fewer-faces (mmc4-ff), mmc4-core, mmc4-core-ff -> evaluated on COCO captioning
- evaluation : zero-shot captioning, 4-, 8-shot captioning
- result : much better performance than the LAION-2B-pretrained baseline (amazing…)
- contribution : first public open interleaved image-text data
- etc. :
Details
mmc4
Data Curation Process
- source : the clean version of C4 (365M documents, 156B tokens), built from the April 2019 Common Crawl snapshot (the commonly used dump)
- images : Re-download the original web pages from C4, then download their images. Drop image URLs containing the words logo, button, icon, plugin, or widget, and keep only the png / jpeg / jpg extensions. Resize so the longer side is 800 pixels. -> 115M documents / 1.37B images
- dedup + small resolution
- Use dedup: https://gitlab.com/opennota/findimagedupes
- small resolution: remove if the shorter side is less than 150 pixels
- Remove if the aspect ratio is greater than or equal to 2 or less than 0.5 (reported to help remove banner-style ads)
- In a sample of 3.7K images, about 2.5% were ads
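The URL and geometry filters above can be sketched roughly as follows. The banned-word list, extension whitelist, and thresholds come from the note; the function names are hypothetical helpers, not mmc4's actual code:

```python
import os
from urllib.parse import urlparse

# Filter rules described in the note (thresholds from mmc4's curation).
BANNED_WORDS = {"logo", "button", "icon", "plugin", "widget"}
ALLOWED_EXTS = {".png", ".jpeg", ".jpg"}
MIN_SHORT_SIDE = 150      # drop images whose shorter side is < 150 px
MAX_ASPECT_RATIO = 2.0    # drop if long/short ratio >= 2 (banner-like ads)

def keep_image_url(url: str) -> bool:
    """URL-level filter: extension whitelist plus banned keywords."""
    path = urlparse(url).path.lower()
    if os.path.splitext(path)[1] not in ALLOWED_EXTS:
        return False
    return not any(word in path for word in BANNED_WORDS)

def keep_image_size(width: int, height: int) -> bool:
    """Pixel-level filter: minimum resolution plus aspect-ratio bounds."""
    short, long_ = min(width, height), max(width, height)
    if short < MIN_SHORT_SIDE:
        return False
    return long_ / short < MAX_ASPECT_RATIO

print(keep_image_url("https://example.com/images/cat.jpg"))   # True
print(keep_image_url("https://example.com/assets/logo.png"))  # False
print(keep_image_size(800, 600))   # True
print(keep_image_size(1200, 200))  # False: banner-like aspect ratio
```

Checking the ratio on `long/short` covers both directions of the ">= 2 or <= 0.5" rule with one comparison.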
- NSFW
- Uses the NSFW binary classifier from the dataset2metadata package
- The classifier was trained on images labeled NSFW in LAION-2B
- Aligning images and sentences
- C4 is a preprocessed (text-only) version, while the images were downloaded from the full pages, so an image may have no corresponding text
- Ideally one would inspect the HTML DOM, but they don't
- First, compute the pairwise similarity between every image and every sentence in the document.
- Remove an image if its similarity to every sentence is below 0.15.
- Then assign images to sentences via bipartite matching, so that each sentence gets at most one image.
- This gives better coverage than simply assigning each image to its most-similar sentence.
- After assignment, each image is placed either before or after its sentence, following how Flamingo interleaves images and text.
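The threshold-then-match step above can be sketched with SciPy's Hungarian-algorithm solver; the 0.15 threshold is from the note, while the similarity matrix below is toy data and the exact matching algorithm mmc4 uses may differ:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

SIM_THRESHOLD = 0.15  # images below this similarity to every sentence are dropped

def align_images_to_sentences(sim: np.ndarray) -> dict:
    """sim[i, j] = similarity between image i and sentence j.
    Returns {image_index: sentence_index}; each sentence gets at most one image."""
    # Drop images whose best similarity is still below the threshold.
    keep = np.where(sim.max(axis=1) >= SIM_THRESHOLD)[0]
    if keep.size == 0:
        return {}
    # Maximize total similarity under a one-image-per-sentence constraint.
    rows, cols = linear_sum_assignment(sim[keep], maximize=True)
    return {int(keep[r]): int(c) for r, c in zip(rows, cols)}

# Toy similarity matrix: 3 images x 4 sentences (made-up numbers).
sim = np.array([
    [0.30, 0.10, 0.05, 0.02],
    [0.28, 0.25, 0.08, 0.04],  # greedy argmax would also pick sentence 0
    [0.05, 0.06, 0.09, 0.10],  # below threshold everywhere -> dropped
])
print(align_images_to_sentences(sim))  # {0: 0, 1: 1}
```

Note how image 1 gets sentence 1 rather than sentence 0: the matching spreads images across sentences, which is exactly the coverage argument the note makes against a plain similarity-argmax assignment.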
Real-world examples
Exploring mmc4
url source
topics : LDA topics -> top frequent words per topic -> named with GPT-4
Result
Comparing a model trained on mmc4 (OpenFlamingo) with one trained on LAION-2B
Retrieval
COCO caption
Comparison of MSCOCO captioning zero-shot / 4-shot / 8-shot performance against a model trained on a 15M subset of LAION-2B. Red is zero-shot performance; the reason it is worse than 4- and 8-shot may be that LAION-2B contains only short texts, so the model struggles with longer text. Shouldn't it be compared against the full 2B, though?
(from FLAMINGO, coco dev set, 4shot)