[60] Efficient Sparsely Activated Transformers

September 2, 2022 · 1 min · long8v

[54] Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

August 25, 2022 · 2 min · long8v

MoEBERT code reading

May 23, 2022 · 1 min · long8v

[26] Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

May 13, 2022 · 1 min · long8v

Sparse MoE code reading

May 10, 2022 · 1 min · long8v