[60] Efficient Sparsely Activated Transformers

September 2, 2022 Β· 1 min Β· long8v

[54] Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

August 25, 2022 Β· 2 min Β· long8v

MoEBERT code reading

May 23, 2022 Β· 1 min Β· long8v

[26] Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

May 13, 2022 Β· 1 min Β· long8v

Sparse MoE code reading

May 10, 2022 Β· 1 min Β· long8v