The following publications are possible variants of this publication:
- Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers. Tianlong Chen, Zhenyu Zhang, Ajay Kumar Jaiswal, Shiwei Liu, Zhangyang Wang. ICLR 2023.
- Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks. Hao Li, Jinguo Zhu, Xiaohu Jiang, Xizhou Zhu, Hongsheng Li, Chun Yuan, Xiaohua Wang, Yu Qiao, Xiaogang Wang, Wenhai Wang, Jifeng Dai. CVPR 2023: 2691-2700.
- MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism. Zheng Zhang, Yaqi Xia, Hulin Wang, Donglin Yang, Chuang Hu, Xiaobo Zhou, Dazhao Cheng. TPDS, 35(6):843-856, June 2024.
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-Level Sparsity via Mixture-of-Experts. Rishov Sarkar, Hanxue Liang, Zhiwen Fan, Zhangyang Wang, Cong Hao. ICCAD 2023: 1-9.