paper, code

Problem: Traditional absolute positional embeddings degrade performance when the input is longer than the training length, and they are not translation-invariant. Relative positional encodings are computationally expensive and perform poorly on image classification because they lack absolute position information.

Solution: Generate the position embedding from each token's neighboring tokens, so the embedding changes conditionally on the input. Concretely, the token sequence from the ViT encoder is reshaped to N x H x W x C, passed through a zero-padded convolution, and the output is used as the position embedding.

Result: Outperforms ViT and DeiT, and can be plugged seamlessly into existing transformer architectures. Replacing the [CLS] token with global average pooling (GAP) further improves results to state of the art.

Details: paper summary
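A minimal sketch of the mechanism described above, in pure Python: tokens are reshaped from a sequence back into an H x W map, a 3x3 zero-padded depthwise convolution is applied, and its output is added to the tokens as a conditional position embedding. Shapes, the function name `peg`, and the residual add are my assumptions for illustration, not the authors' implementation (which would use a framework conv layer).

```python
def peg(tokens, H, W, kernel):
    """Conditional position embedding sketch (assumed interface).

    tokens: list of H*W channel vectors (each of length C)
    kernel: per-channel 3x3 weights, shape [C][3][3]
    Returns tokens + zero-padded 3x3 depthwise conv of the token map.
    """
    C = len(tokens[0])
    # Reshape the 1-D token sequence back into a 2-D feature map.
    fmap = [[tokens[y * W + x] for x in range(W)] for y in range(H)]
    out = []
    for y in range(H):
        for x in range(W):
            conv = []
            for c in range(C):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        # Zero padding: positions outside the map contribute 0,
                        # which is what gives border tokens absolute-position cues.
                        if 0 <= yy < H and 0 <= xx < W:
                            acc += kernel[c][dy + 1][dx + 1] * fmap[yy][xx][c]
                conv.append(acc)
            # Residual: token plus its input-dependent position embedding.
            out.append([t + v for t, v in zip(tokens[y * W + x], conv)])
    return out
```

Because the embedding is produced by a convolution over the tokens themselves, it extends naturally to sequences longer than the training length, unlike a learned absolute embedding table.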