Problem : SwinTransformer’s shifted-window scheme is complicated and poorly supported by deployment frameworks such as TensorRT.

Solution : Split the feature map into an m by n grid of sub-windows as in Swin and run local self-attention (LSA) within each window. Pool each window’s tokens down to a single representative value (e.g. with a strided convolution), then run global sub-sampled attention (GSA), in which every token attends across windows using this m by n matrix of representatives as keys/values. Repeat the LSA + GSA pair, and add conditional positional encoding (CPE).

Result : Better performance than SwinTransformer with simpler, framework-friendly operations. Adding conditional positional encoding to PVT also outperforms Swin.

What I thought : Vision transformers keep borrowing ideas from CNNs. It’s good that this is simpler than Swin yet performs well… I’ll read CPVT as my next paper.

details : paper summary
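The LSA + GSA idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper’s implementation: learned Q/K/V projections and multi-head splitting are omitted, and mean pooling stands in for the strided-convolution pooling; the function name `lsa_gsa` is my own.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention: (Nq, C), (Nk, C) -> (Nq, C)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def lsa_gsa(x, win):
    """One LSA + GSA step on an (H, W, C) feature map.

    LSA: self-attention restricted to each win x win sub-window.
    GSA: every token attends to one pooled key/value per sub-window
    (the m x n matrix of window representatives).
    Simplifications vs. the paper: no learned projections, single head,
    mean pooling instead of a strided conv.
    """
    H, W, C = x.shape
    m, n = H // win, W // win  # m x n grid of sub-windows
    # split into (m*n, win*win, C) windows
    wins = (x.reshape(m, win, n, win, C)
             .transpose(0, 2, 1, 3, 4)
             .reshape(m * n, win * win, C))
    # LSA: local self-attention inside each window
    wins = np.stack([attention(w_, w_, w_) for w_ in wins])
    # pool each window to a single representative token -> (m*n, C)
    pooled = wins.mean(axis=1)
    # GSA: every token uses the m*n pooled tokens as keys/values
    tokens = wins.reshape(m * n * win * win, C)
    out = attention(tokens, pooled, pooled)
    # restore the (H, W, C) layout
    return (out.reshape(m, n, win, win, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(H, W, C))

x = np.random.randn(8, 8, 16)
y = lsa_gsa(x, win=4)
print(y.shape)  # (8, 8, 16)
```

Because GSA only attends to m*n pooled tokens instead of all H*W tokens, the global step stays cheap, and every operation here is a plain reshape, matmul, or pooling, which is what makes the design easy to export to inference frameworks.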