image paper

Problem: Replacing Swin Transformer's Local Self-Attention (LSA) with Depthwise Convolution (DwConv) or the Decoupled Dynamic Filter (DDF) yields better results, raising the question of why LSA underperforms.

Solution: Expressed DwConv, DDF, and LSA in a common attention formulation and ran an ablation study. Found that increasing the number of heads and using a sliding window are the important factors for performance, and proposed Hadamard attention and a ghost head, which are more efficient than the standard dot-product formulation.

Result: Higher FLOPs at an LSA-like parameter count, with improved Swin Transformer performance on classification tasks.

I noticed: A neighboring (sliding) window performs better than a local window. As with the previous paper, the more of CNN's methodology is applied, the better the results get.

Details: paper summary
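The dot-product vs. Hadamard distinction can be sketched in a few lines. This is a minimal illustration, not the paper's actual module: here I assume "Hadamard attention" means keeping the element-wise product q ⊙ k over the local window and mapping it to a scalar weight with a learned projection, instead of the fixed channel sum used by the dot product (the names `dot_product_local_attn`, `hadamard_local_attn`, and the `proj` parameter are my own).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_local_attn(q, k, v):
    # q: (N, d) queries; k, v: (N, W, d) keys/values inside each
    # query's local window. Standard attention: one scalar weight per
    # (query, window position), obtained by summing q*k over channels.
    logits = np.einsum('nd,nwd->nw', q, k) / np.sqrt(q.shape[-1])
    w = softmax(logits, axis=-1)
    return np.einsum('nw,nwd->nd', w, v)

def hadamard_local_attn(q, k, v, proj):
    # Hadamard variant (sketch): keep the element-wise product q ⊙ k
    # (shape (N, W, d)), then reduce each d-dim product to a scalar
    # with a learned projection `proj` rather than a fixed all-ones sum.
    logits = (q[:, None, :] * k) @ proj / np.sqrt(q.shape[-1])
    w = softmax(logits, axis=-1)
    return np.einsum('nw,nwd->nd', w, v)

N, W, d = 4, 9, 8              # 4 queries, 3x3 window, 8 channels
q = rng.standard_normal((N, d))
k = rng.standard_normal((N, W, d))
v = rng.standard_normal((N, W, d))

out_dot = dot_product_local_attn(q, k, v)
# With proj = all-ones, the Hadamard form reduces exactly to the dot product,
# showing dot-product attention is a special case of this formulation.
out_had = hadamard_local_attn(q, k, v, proj=np.ones(d))
print(np.allclose(out_dot, out_had))  # True
```

The point of the sketch is that the Hadamard form generalizes the dot product: a learned per-channel reduction can reweight channels, while the dot product fixes that reduction to a uniform sum.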