image

paper

TL;DR

  • I read this because.. : Mentioned in CS330 lecture. In #118, it was also mentioned that using Perceiver doesn’t make any difference to IO and spring
  • task : image classification, language modeling, optical flow, StarCraft II, …
  • problem : I have models for each domain/task. Life would be easier if I could handle them with one NN
  • idea : transformer encoder-decoder structure, but let’s use perceiver structure (where input modality goes to CA) + output query
  • input : (encoder) N x D-dimensional latent array (decoder) positional embedding or task embedding
  • output : (encoder) context vector (decoder) class(for image classification), token id(for MLM), …
  • architecture : But the encoder is a perceiver (text, image, video, etc. go into the CA) / decoder is a CA between the encoder context vector and the output query.
  • objective : objective function for each task
  • baseline : GLUE(BERT), Image Classification(ViT-B), Optical Flow(PWCNet, RAFT), StarCraft(Transformer), AudioSet Classification(Perceiver IO)
  • data : English Wikipedia + C4, ImageNet, JFT….
  • result : Better performance on GLUE vs. BERT for the same FLOPS. Optical flowbars also perform well against a few metrics compared to baseline. The rest have decent performance, but not the best.
  • contribution : test. for quite a few modalities. isn’t the way you put task embedding/PE embedding in the decoder a contribution point?! The rest of it seems like it’s not new
  • etc. :

Details

Architecture

image

Output Queries

image
  • Classification, such as image classification, can be done by simply embedding the task in the
  • If multi task, multiple task embeddings
  • For MLM, 2048 Positional Embeddings

Architecture Details

image

Result

  • tasks image

  • GLUE image

The introduction also emphasizes using UTF-8 bytes, but I don’t know if this is a contribution (is there any prior work like BBPE?). This makes max_len longer than $$O(n**2)$$, and the structural linear increase in complexity seems to be a contribution! In this table, it has much larger parameters than BERT, but lower FLOPS. The parameter decreases the hidden dim and increases the depth by a lot. Compared to BERT, max_len was increased from 512 -> 2048 and vocab size was reduced to 256.

  • image classification image

ViT-B/16ã¨ã"ã¨ã"ã’ˆã£ã¦éžå¸¸ã"ãªã’Šã¾ã-ãŸã€’ First of all, it seems worse than ViT Performance JFT pretraining scored 86.4 points, which is a bit different from ViT-H/14’s 88.6 points per viewer (although the number of parameters is 1/3). In the end, the best performance is the one with Conv. Other than that, it looks like it’s better than its predecessor Perceiver?

  • AudioSet Classification image

  • StarCraft II image