
paper

TL;DR

  • task : large language modeling, domain incremental learning
  • problem : The high-level idea is the same as DeMix, but the goal is to reduce communication in the multi-node synchronization part of training.
  • idea : Build expert LMs (ELMs) that share no parameters across domains (previous MoE LMs made only the FFNs domain-specific) and train them with Branch-Train-Merge (BTM). When a new domain arrives, BTM finds the closest existing ELMs, initializes a new expert from their parameter average, branches it off, trains it on the new domain, and adds it to the forest of experts. At inference, Bayes' rule estimates a posterior over domains, and the final prediction is the posterior-weighted sum of the experts' outputs.
  • architecture : vanilla Transformer
  • objective : cross-entropy loss
  • baseline : Transformer LM (GPT), DeMix
  • data : Wikipedia, C4, StackOverflow, JavaScript, … etc.
  • result : Better perplexity out-of-domain; when incrementally trained on 64 domains, performance is comparable to a Transformer LM 2.5x its size.
  • contribution : an MoE-style LM whose experts share no parameters.
  • Limitations or things I don’t understand :

Details

Branch-Train-Merge (BTM)

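The branch step (initialize a new expert from the average of the ELMs closest to the new domain, then train it on that domain) can be sketched as below. This is a minimal illustration: the dict-of-floats LM representation and the names `branch_init`, `perplexity`, and `toy_ppl` are assumptions for the sketch, not the paper's code.

```python
# Sketch of the BTM branch step, assuming each expert LM is a dict of
# parameters (plain floats here) and `perplexity` scores an LM on a
# sample of the new domain. All names are illustrative.

def branch_init(elms, perplexity, sample, k=2):
    """Average the k ELMs closest to the new domain to seed a new branch."""
    nearest = sorted(elms, key=lambda lm: perplexity(lm, sample))[:k]
    # Parameter-wise average of the nearest experts.
    return {name: sum(lm[name] for lm in nearest) / k for name in nearest[0]}

# Toy example: three "experts" with one parameter each; the two with the
# lowest (toy) perplexity on the new-domain sample are averaged.
experts = [{"w": 1.0}, {"w": 3.0}, {"w": 100.0}]
toy_ppl = lambda lm, sample: abs(lm["w"] - sample)  # stand-in scorer
init = branch_init(experts, toy_ppl, sample=2.0, k=2)
# init == {"w": 2.0}
```

The new expert is then trained on the new domain from this initialization and added to the set, so no gradient synchronization is needed across experts.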

Inference


Strictly speaking, the input should be forwarded through all ELMs, but in practice the posterior concentrates on a few experts, so the set of ELMs that actually contribute is sparse.
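The ensemble step can be sketched as follows: each ELM scores the context, Bayes' rule converts those likelihoods (times a prior) into a posterior over domains, and the final next-token distribution is the posterior-weighted sum of the experts' predictions. Function and argument names here are illustrative, not from the paper's code.

```python
import math

def ensemble_step(context_logliks, next_token_probs, prior=None):
    """context_logliks[j]: log p(context | domain j) from ELM j.
    next_token_probs[j]: ELM j's next-token distribution over the vocab."""
    k = len(context_logliks)
    prior = prior or [1.0 / k] * k  # uniform domain prior by default
    # Posterior over domains via Bayes' rule (log-sum-exp for stability).
    logp = [ll + math.log(p) for ll, p in zip(context_logliks, prior)]
    m = max(logp)
    z = sum(math.exp(l - m) for l in logp)
    post = [math.exp(l - m) / z for l in logp]
    # Posterior-weighted sum of the experts' next-token distributions.
    vocab = len(next_token_probs[0])
    return [sum(post[j] * next_token_probs[j][v] for j in range(k))
            for v in range(vocab)]
```

With equal context likelihoods and a uniform prior, the posterior is uniform and the output is the plain average of the experts' distributions; as one expert fits the context much better, its prediction dominates.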

Data


DeMix

DeMix, 2021

  • https://arxiv.org/pdf/2108.05036.pdf
  • problem : We want to reduce the perplexity of training a corpus of multiple domains with a single LM, where we know the domain of each piece of data.
  • solution : Train one FFN expert per domain in the corpus (analogous to the Switch Transformer's expert FFNs). When a new domain appears at inference time, you can either 1) forward through all FFNs and take a Bayesian-weighted sum of their outputs, or 2) add a new FFN expert for that domain.
  • result : improved LM perplexity with better training efficiency, and new domains can be added or removed without forgetting previously trained experts.
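The per-domain FFN routing can be sketched as below: during training the known domain label hard-routes to one expert, while at inference on an unknown domain the layer mixes all experts with given weights (option 1 above). The class and names are illustrative, not from the DeMix codebase.

```python
# Minimal sketch of DeMix-style routing, assuming one FFN expert per
# labeled domain; experts are plain callables here for illustration.

class DemixFFN:
    def __init__(self, experts):
        self.experts = experts  # one FFN per domain

    def forward(self, x, domain=None, weights=None):
        if domain is not None:            # training: hard routing by label
            return self.experts[domain](x)
        # inference on unknown domain: weighted mixture over all experts
        return sum(w * e(x) for w, e in zip(weights, self.experts))

# Toy usage with scalar "FFNs":
layer = DemixFFN([lambda x: x + 1, lambda x: 2 * x])
layer.forward(3.0, domain=0)            # -> 4.0
layer.forward(3.0, weights=[0.5, 0.5])  # -> 5.0
```

Because each expert only ever sees its own domain's batches during training, experts can be added or removed independently, which is what BTM pushes further by making the experts fully separate LMs.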