[111] Perceiver IO: A General Architecture for Structured Inputs & Outputs

paper

TL;DR

I read this because : It was mentioned in the CS330 lecture. #118 also mentioned that using Perceiver IO doesn’t make much of a difference.
task : image classification, language modeling, optical flow, StarCraft II, …
problem : Each domain/task has its own model. Life would be easier if a single NN could handle them all.
idea : Keep the transformer encoder-decoder structure, but use the Perceiver structure for the encoder (the input modality enters through cross-attention, CA) and add output queries for the decoder.
input : (encoder) N x D-dimensional latent array / (decoder) positional embedding or task embedding
output : (encoder) context vector / (decoder) class (for image classification), token id (for MLM), …
architecture : The encoder is a Perceiver (text, image, video, etc. enter through the CA) / the decoder is a CA between the encoder context vector and the output query.
objective : objective function for each task
baseline : GLUE (BERT), Image Classification (ViT-B), Optical Flow (PWCNet, RAFT), StarCraft (Transformer), AudioSet Classification (Perceiver IO)
data : English Wikipedia + C4, ImageNet, JFT, …
result : Better performance than BERT on GLUE at the same FLOPs. Optical flow also performs well on several metrics compared to the baselines. The rest show decent performance, but not the best.
contribution : Tested on quite a few modalities. Isn’t the way task embeddings / positional embeddings are fed into the decoder the contribution point?! The rest doesn’t seem new.
etc. :

Details

Architecture
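
The encoder / decoder split described in the TL;DR can be sketched in a few lines of NumPy. This is only a minimal sketch under several assumptions, not the paper's implementation: attention is single-head, there are no layer norms or MLPs, the latent self-attention layers share one set of weights, and all names and sizes (`attend`, `perceiver_io_forward`, `M`, `C`, `N`, `D`, `O`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys_values, w_q, w_k, w_v):
    """Single-head attention: rows of `queries` attend over rows of `keys_values`."""
    q = queries @ w_q
    k = keys_values @ w_k
    v = keys_values @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def perceiver_io_forward(inputs, latents, output_queries, params, num_latent_layers=2):
    # Encode: the N latents query the M inputs -> (N, D). Cost grows with M * N.
    z = attend(latents, inputs, *params["encode"])
    # Process: self-attention among latents only -> cost grows with N * N.
    # (Weights are shared across layers here purely to keep the sketch short.)
    for _ in range(num_latent_layers):
        z = z + attend(z, z, *params["process"])
    # Decode: the O output queries read from the latents -> (O, D).
    return attend(output_queries, z, *params["decode"])


rng = np.random.default_rng(0)
M, C, N, D, O = 1000, 16, 32, 64, 5  # illustrative sizes, not from the paper


def init(rows, cols):
    return rng.normal(scale=rows ** -0.5, size=(rows, cols))


params = {
    "encode": (init(D, D), init(C, D), init(C, D)),   # w_q, w_k, w_v
    "process": (init(D, D), init(D, D), init(D, D)),
    "decode": (init(D, D), init(D, D), init(D, D)),
}

inputs = rng.normal(size=(M, C))          # flattened input modality
latents = rng.normal(size=(N, D))         # learned latent array in a real model
output_queries = rng.normal(size=(O, D))  # task or positional embeddings

outputs = perceiver_io_forward(inputs, latents, output_queries, params)
print(outputs.shape)  # (5, 64): one row per output query
```

The point of the sketch is that the input length `M` only appears in the encode step, and the number of outputs `O` only in the decode step, so neither is multiplied by itself.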

Output Queries

Classification tasks, such as image classification, can be handled by simply using a task embedding as the output query.
For multi-task setups, use multiple task embeddings.
For MLM, use 2048 positional embeddings, one per output position.
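
The three cases above differ only in how many query rows are handed to the decoder. A rough sketch, assuming the decoder emits one output row per query row; the query width, the task count of 8, and the function name `make_output_queries` are hypothetical, and the random arrays stand in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
QUERY_DIM = 64  # hypothetical query width

# Random stand-ins for what would be learned embeddings in a real model.
task_embedding = rng.normal(size=(QUERY_DIM,))            # single classification task
multi_task_embeddings = rng.normal(size=(8, QUERY_DIM))   # hypothetical: 8 tasks
position_embeddings = rng.normal(size=(2048, QUERY_DIM))  # MLM: one per position


def make_output_queries(task):
    """Return the (num_queries, QUERY_DIM) array that is fed to the decoder CA."""
    if task == "classification":
        return task_embedding[None, :]   # one query -> one output vector
    if task == "multi_task":
        return multi_task_embeddings     # one query per task
    if task == "mlm":
        return position_embeddings       # one query per token position
    raise ValueError(f"unknown task: {task}")


for name in ("classification", "multi_task", "mlm"):
    print(name, make_output_queries(name).shape)
```

Because the output size follows the number of queries, the same decoder can in principle produce a single class vector or 2048 per-position vectors without any architectural change.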

Architecture Details

Result

tasks
GLUE

The introduction also emphasizes using UTF-8 bytes, but I don’t know if this counts as a contribution (is there prior work like BBPE?). Using bytes makes max_len longer, which is expensive under the $$O(n^2)$$ cost of self-attention, so the fact that the architecture’s complexity grows only linearly with input length seems to be the real contribution! In this table, it has far more parameters than BERT, but lower FLOPs. The model reduces the hidden dim and increases the depth by a lot. Compared to BERT, max_len was increased from 512 to 2048 and the vocab size was reduced to 256.
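
To make the length argument concrete, here is a small back-of-the-envelope sketch. It assumes raw UTF-8 bytes as token IDs (no special tokens) and counts only query-key score pairs, ignoring projections and MLPs; the layer counts and the latent count of 256 are illustrative assumptions, not figures taken from the notes above.

```python
def byte_tokenize(text, max_len=2048):
    """Map text to UTF-8 byte IDs in the range 0..255, truncated to max_len."""
    return list(text.encode("utf-8"))[:max_len]


def self_attention_scores(seq_len, num_layers):
    # Every layer compares all positions with all positions: quadratic in seq_len.
    return num_layers * seq_len * seq_len


def perceiver_io_scores(seq_len, num_latents, num_latent_layers):
    encode = seq_len * num_latents                            # latents attend to inputs
    process = num_latent_layers * num_latents * num_latents   # latents attend to latents
    decode = seq_len * num_latents                            # output queries attend to latents
    return encode + process + decode                          # linear in seq_len


ids = byte_tokenize("café")
print(ids)  # [99, 97, 102, 195, 169]: "é" takes two bytes

# Illustrative settings (assumptions): 12 self-attention layers vs. 256 latents, 26 latent layers.
for seq_len in (512, 2048):
    print(
        seq_len,
        self_attention_scores(seq_len, num_layers=12),
        perceiver_io_scores(seq_len, num_latents=256, num_latent_layers=26),
    )
```

Going from 512 to 2048 positions multiplies the self-attention count by 16, while the latent-based count only gains the extra encode and decode terms, which is the linear growth the paragraph above refers to.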

image classification

Compared against ViT-B/16. First of all, it seems to perform worse than ViT. With JFT pretraining it scored 86.4, which falls somewhat short of ViT-H/14’s 88.6 (although with about 1/3 of the parameters). In the end, the best-performing variant is the one with Conv. Other than that, it looks like it’s better than its predecessor, Perceiver?

AudioSet Classification
StarCraft II
