
pre-norm vs post-norm
In the vanilla Transformer, LayerNorm is applied after the self-attention output and the input have been added together (the residual connection). This is called post-norm.

def forward(self, src, src_mask):
    # src = [batch size, src len, hid dim]
    # src_mask = [batch size, 1, 1, src len]
    # self attention
    _src, _ = self.self_attention(src, src, src, src_mask)
    # dropout, residual connection, then layer norm (post-norm)
    src = self.self_attn_layer_norm(src + self.dropout(_src))
Subsequent studies have shown that applying LayerNorm before self-attention (pre-norm) leads to more stable training, and is particularly effective when the network is deep. Note that with pre-norm one must append an additional normalization to the output of the last layer of both the encoder and the decoder, so that their final outputs are appropriately scaled.

def forward(self, src, src_mask):
    # src = [batch size, src len, hid dim]
    # src_mask = [batch size, 1, 1, src len]
    # layer norm is applied before self attention (pre-norm)
    src_norm = self.self_attn_layer_norm(src)
    _src, _ = self.self_attention(src_norm, src_norm, src_norm, src_mask)
    # dropout and residual connection; no norm after the residual
    src = src + self.dropout(_src)
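The extra normalization that pre-norm requires after the final layer can be sketched as follows. This is a minimal illustration, not the original implementation; the class name PreNormEncoder and the attribute final_norm are hypothetical.

```python
import torch
import torch.nn as nn

class PreNormEncoder(nn.Module):
    # Hypothetical minimal pre-norm encoder: stacks pre-norm layers and
    # applies one extra LayerNorm to the output of the last layer, so the
    # final representations are properly scaled.
    def __init__(self, layers, hid_dim):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.final_norm = nn.LayerNorm(hid_dim)  # the additional normalization

    def forward(self, src, src_mask):
        for layer in self.layers:
            src = layer(src, src_mask)
        return self.final_norm(src)
```

The decoder would get the same treatment: one LayerNorm after its last layer.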
weight initialization

We will use the Xavier initializer. With post-norm, convergence can fail because Xavier-normal initialization gives the attention layers too large a scale. In the Transformer, the dot products are divided by sqrt(hid_dim) (scaled dot-product attention), which is sqrt(512) ≈ 22.6 for hid_dim = 512. The FFN weights already have a small standard deviation: with shape (hid_dim, fcn_dim), where fcn_dim is roughly 4 × hid_dim, Xavier gives std = sqrt(2 / (d + 4d)). The suggestion is to shrink the initialization of the attention layers to a similarly small scale as well (SmallInit).
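One way to realize this suggestion is to re-initialize the attention projections with the FFN's smaller Xavier std, sqrt(2 / (hid_dim + ffn_dim)), instead of the default sqrt(2 / (2 · hid_dim)). This is a sketch of the idea, not the paper's exact recipe; the helper name small_init_ is hypothetical.

```python
import math
import torch
import torch.nn as nn

def small_init_(linear, hid_dim, ffn_dim):
    # Hypothetical helper: give an attention projection the same (smaller)
    # std that Xavier would assign to an FFN layer of shape (hid_dim, ffn_dim).
    std = math.sqrt(2.0 / (hid_dim + ffn_dim))
    nn.init.normal_(linear.weight, mean=0.0, std=std)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)

# For hid_dim = 512 and ffn_dim = 4 * 512, std = sqrt(2 / (5 * 512)) ≈ 0.028,
# much smaller than the default Xavier std sqrt(2 / 1024) ≈ 0.044.
```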

Scaled L2 norm and FixNorm
Both batch norm and layer norm were designed to reduce covariate shift, but studies have shown that the real benefit comes from smoothing the loss landscape: for example, dividing by an L_p norm rather than by the variance gave similar or better performance in image classification.
We propose to replace LayerNorm with a scaled L2 norm (ScaleNorm): ScaleNorm(x) = g * x / ||x||_2, with a single learned scalar g per layer.
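A minimal ScaleNorm module might look like the following. The initialization of g to sqrt(hid_dim) and the eps clamp are assumptions, not details from the text above.

```python
import torch
import torch.nn as nn

class ScaleNorm(nn.Module):
    # ScaleNorm(x) = g * x / ||x||_2, with one learned scalar g per layer
    # (initialized here to sqrt(hid_dim); an assumption, not prescribed above).
    def __init__(self, hid_dim, eps=1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(hid_dim ** 0.5))
        self.eps = eps

    def forward(self, x):
        # normalize each position's vector to unit L2 norm, then rescale by g
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return self.g * x / norm
```

Compared with LayerNorm, this replaces the per-feature mean/variance statistics and the hid_dim-sized gain/bias with a single scalar.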

In the last layer, embeddings with larger norms make the output distribution sharper (lower variance), which causes frequent words to end up with larger norms than infrequent words. To improve this, it has been suggested to apply FixNorm, which fixes the norm of the word embeddings to a constant, to the last layer.

To keep the scale parameter g trainable, ScaleNorm and FixNorm can be applied at once, which can be written as g * (w . x) / (||w|| ||x||).

With this formulation, the last layer is equivalent to cosine normalization (Luo et al., 2018) with a learned scale.
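The combined FixNorm + ScaleNorm output layer can be sketched as computing scaled cosine similarities between the hidden state and each output embedding. The class name CosineOutput and the initialization g = 1 are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineOutput(nn.Module):
    # Sketch of the combined FixNorm + ScaleNorm output layer:
    # logits = g * (W x) / (||W|| ||x||), i.e. the cosine similarity between
    # the hidden state and each output embedding, times a learned scale g.
    def __init__(self, hid_dim, vocab_size):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, hid_dim))
        self.g = nn.Parameter(torch.tensor(1.0))  # assumed initial scale

    def forward(self, x):
        w = F.normalize(self.weight, dim=-1)  # FixNorm: unit-norm embeddings
        x = F.normalize(x, dim=-1)            # ScaleNorm: unit-norm hidden state
        return self.g * F.linear(x, w)        # scaled cosine similarities
```

Because every logit is a cosine times g, the logit range is controlled by a single learned scalar rather than by the embedding norms, which addresses the frequent-word bias described above.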

Summary of the findings:
- Pre-norm converges faster and performs better after convergence.
- FixNorm converges faster, but its converged performance is similar to the baseline.
- The best performance comes from applying FixNorm + ScaleNorm together.
- ScaleNorm converges more slowly with warmup, and its performance is similar with or without warmup.
