
pre-norm vs post-norm
In the vanilla Transformer, LayerNorm is applied after the self-attention output and the input have been added together (the residual connection). This is called post-norm.

def forward(self, src, src_mask):
    # src = [batch size, src len, hid dim]
    # src_mask = [batch size, 1, 1, src len]
    # self attention
    _src, _ = self.self_attention(src, src, src, src_mask)
    # dropout, residual connection, then layer norm (post-norm)
    src = self.self_attn_layer_norm(src + self.dropout(_src))
Subsequent studies have shown that applying LayerNorm before self-attention (pre-norm) leads to more stable training, and is particularly effective when the network is deep. Note that with pre-norm one must append an additional normalization to the output of the last layer of both the encoder and the decoder, so that their final outputs are appropriately scaled.

def forward(self, src, src_mask):
    # src = [batch size, src len, hid dim]
    # src_mask = [batch size, 1, 1, src len]
    # layer norm is applied before self attention (pre-norm)
    src_norm = self.self_attn_layer_norm(src)
    _src, _ = self.self_attention(src_norm, src_norm, src_norm, src_mask)
    # dropout and residual connection; no norm after the residual
    src = src + self.dropout(_src)
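The extra normalization that pre-norm requires after the final layer can be sketched as follows. This is a minimal illustration, not the original implementation; the class name PreNormEncoder and the attribute final_norm are hypothetical.

```python
import torch
import torch.nn as nn

class PreNormEncoder(nn.Module):
    # Hypothetical minimal pre-norm encoder: stacks pre-norm layers and
    # applies one extra LayerNorm to the output of the last layer, so the
    # final representations are properly scaled.
    def __init__(self, layers, hid_dim):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.final_norm = nn.LayerNorm(hid_dim)  # the additional normalization

    def forward(self, src, src_mask):
        for layer in self.layers:
            src = layer(src, src_mask)
        return self.final_norm(src)
```

The decoder would get the same treatment: one LayerNorm after its last layer.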
weight initialization

We will use the Xavier initializer. With post-norm, convergence can fail because Xavier-normal initialization gives the attention layers too large a scale. In the Transformer, the dot products are divided by sqrt(hid_dim) (scaled dot-product attention), which is sqrt(512) ≈ 22.6 for hid_dim = 512. The FFN weights already have a small standard deviation: with shape (hid_dim, fcn_dim), where fcn_dim is roughly 4 × hid_dim, Xavier gives std = sqrt(2 / (d + 4d)). The suggestion is to shrink the initialization of the attention layers to a similarly small scale as well (SmallInit).
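One way to realize this suggestion is to re-initialize the attention projections with the FFN's smaller Xavier std, sqrt(2 / (hid_dim + ffn_dim)), instead of the default sqrt(2 / (2 · hid_dim)). This is a sketch of the idea, not the paper's exact recipe; the helper name small_init_ is hypothetical.

```python
import math
import torch
import torch.nn as nn

def small_init_(linear, hid_dim, ffn_dim):
    # Hypothetical helper: give an attention projection the same (smaller)
    # std that Xavier would assign to an FFN layer of shape (hid_dim, ffn_dim).
    std = math.sqrt(2.0 / (hid_dim + ffn_dim))
    nn.init.normal_(linear.weight, mean=0.0, std=std)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)

# For hid_dim = 512 and ffn_dim = 4 * 512, std = sqrt(2 / (5 * 512)) ≈ 0.028,
# much smaller than the default Xavier std sqrt(2 / 1024) ≈ 0.044.
```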

Scaled L2 norm and FixNorm
Both batch norm and layer norm were designed to reduce covariate shift, but studies have shown that the real benefit comes from smoothing the loss landscape: for example, dividing by an L_p norm rather than by the variance gave similar or better performance in image classification.
We propose to replace LayerNorm with a scaled L2 norm (ScaleNorm): ScaleNorm(x) = g * x / ||x||_2, with a single learned scalar g per layer.
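A minimal ScaleNorm module might look like the following. The initialization of g to sqrt(hid_dim) and the eps clamp are assumptions, not details from the text above.

```python
import torch
import torch.nn as nn

class ScaleNorm(nn.Module):
    # ScaleNorm(x) = g * x / ||x||_2, with one learned scalar g per layer
    # (initialized here to sqrt(hid_dim); an assumption, not prescribed above).
    def __init__(self, hid_dim, eps=1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(hid_dim ** 0.5))
        self.eps = eps

    def forward(self, x):
        # normalize each position's vector to unit L2 norm, then rescale by g
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return self.g * x / norm
```

Compared with LayerNorm, this replaces the per-feature mean/variance statistics and the hid_dim-sized gain/bias with a single scalar.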

In the last layer, embeddings with larger norms make the output distribution sharper (lower variance), which causes frequent words to end up with larger norms than infrequent words. To improve this, it has been suggested to apply FixNorm, which fixes the norm of the word embeddings to a constant, to the last layer.

To keep the scale parameter g trainable, ScaleNorm and FixNorm can be applied at once, which can be written as g * (w . x) / (||w|| ||x||).

With this formulation, the last layer is equivalent to cosine normalization (Luo et al., 2018) with a learned scale.
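The combined FixNorm + ScaleNorm output layer can be sketched as computing scaled cosine similarities between the hidden state and each output embedding. The class name CosineOutput and the initialization g = 1 are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineOutput(nn.Module):
    # Sketch of the combined FixNorm + ScaleNorm output layer:
    # logits = g * (W x) / (||W|| ||x||), i.e. the cosine similarity between
    # the hidden state and each output embedding, times a learned scale g.
    def __init__(self, hid_dim, vocab_size):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, hid_dim))
        self.g = nn.Parameter(torch.tensor(1.0))  # assumed initial scale

    def forward(self, x):
        w = F.normalize(self.weight, dim=-1)  # FixNorm: unit-norm embeddings
        x = F.normalize(x, dim=-1)            # ScaleNorm: unit-norm hidden state
        return self.g * F.linear(x, w)        # scaled cosine similarities
```

Because every logit is a cosine times g, the logit range is controlled by a single learned scalar rather than by the embedding norms, which addresses the frequent-word bias described above.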

Summary of the findings:
- Pre-norm converges faster and performs better after convergence.
- FixNorm converges faster, but its converged performance is similar to the baseline.
- The best performance comes from applying FixNorm + ScaleNorm together.
- ScaleNorm converges more slowly with warmup, and its performance is similar with or without warmup.
