A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training