Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wen Heng

MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production

May 19, 2025

Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang(+9 more)

Abstract:We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems experience a degradation in training efficiency, exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with adjusted communication patterns to lower precision, further improving training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving the efficiency by 1.88$\times$ compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that by offering our insights in system design, this work will motivate future research in MoE systems.

Via

Access Paper or Ask Questions

Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens

Mar 01, 2023

Sen Yang, Wen Heng, Gang Liu, Guozhong Luo, Wankou Yang, Gang Yu

Abstract:In this paper we present a novel method to estimate 3D human pose and shape from monocular videos. This task requires directly recovering pixel-alignment 3D human pose and body shape from monocular images or videos, which is challenging due to its inherent ambiguity. To improve precision, existing methods highly rely on the initialized mean pose and shape as prior estimates and parameter regression with an iterative error feedback manner. In addition, video-based approaches model the overall change over the image-level features to temporally enhance the single-frame feature, but fail to capture the rotational motion at the joint level, and cannot guarantee local temporal consistency. To address these issues, we propose a novel Transformer-based model with a design of independent tokens. First, we introduce three types of tokens independent of the image feature: \textit{joint rotation tokens, shape token, and camera token}. By progressively interacting with image features through Transformer layers, these tokens learn to encode the prior knowledge of human 3D joint rotations, body shape, and position information from large-scale data, and are updated to estimate SMPL parameters conditioned on a given image. Second, benefiting from the proposed token-based representation, we further use a temporal model to focus on capturing the rotational temporal information of each joint, which is empirically conducive to preventing large jitters in local parts. Despite being conceptually simple, the proposed method attains superior performances on the 3DPW and Human3.6M datasets. Using ResNet-50 and Transformer architectures, it obtains 42.0 mm error on the PA-MPJPE metric of the challenging 3DPW, outperforming state-of-the-art counterparts by a large margin. Code will be publicly available at https://github.com/yangsenius/INT_HMR_Model

* 17 pages, 12 figures. ICLR 2023 (spotlight)

Via

Access Paper or Ask Questions

RIFE: Real-Time Intermediate Flow Estimation for Video Frame Interpolation

Nov 17, 2020

Zhewei Huang, Tianyuan Zhang, Wen Heng, Boxin Shi, Shuchang Zhou

Figure 1 for RIFE: Real-Time Intermediate Flow Estimation for Video Frame Interpolation

Figure 2 for RIFE: Real-Time Intermediate Flow Estimation for Video Frame Interpolation

Figure 3 for RIFE: Real-Time Intermediate Flow Estimation for Video Frame Interpolation

Figure 4 for RIFE: Real-Time Intermediate Flow Estimation for Video Frame Interpolation

Abstract:We propose RIFE, a Real-time Intermediate Flow Estimation algorithm for Video Frame Interpolation (VFI). Most existing methods first estimate the bi-directional optical flows and then linearly combine them to approximate intermediate flows, leading to artifacts on motion boundaries. RIFE uses a neural network named IFNet that can directly estimate the intermediate flows from images. With the more precise flows and our simplified fusion process, RIFE can improve interpolation quality and have much better speed. Based on our proposed leakage distillation loss, RIFE can be trained in an end-to-end fashion. Experiments demonstrate that our method is significantly faster than existing VFI methods and can achieve state-of-the-art performance on public benchmarks. The code is available at https://github.com/hzwer/arXiv2020-RIFE.

* 10 pages, 8 figures

Via

Access Paper or Ask Questions

Single Path One-Shot Neural Architecture Search with Uniform Sampling

Apr 06, 2019

Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, Jian Sun

Figure 1 for Single Path One-Shot Neural Architecture Search with Uniform Sampling

Figure 2 for Single Path One-Shot Neural Architecture Search with Uniform Sampling

Figure 3 for Single Path One-Shot Neural Architecture Search with Uniform Sampling

Figure 4 for Single Path One-Shot Neural Architecture Search with Uniform Sampling

Abstract:One-shot method is a powerful Neural Architecture Search (NAS) framework, but its training is non-trivial and it is difficult to achieve competitive results on large scale datasets like ImageNet. In this work, we propose a Single Path One-Shot model to address its main challenge in the training. Our central idea is to construct a simplified supernet, Single Path Supernet, which is trained by an uniform path sampling method. All underlying architectures (and their weights) get trained fully and equally. Once we have a trained supernet, we apply an evolutionary algorithm to efficiently search the best-performing architectures without any fine tuning. Comprehensive experiments verify that our approach is flexible and effective. It is easy to train and fast to search. It effortlessly supports complex search spaces (e.g., building blocks, channel, mixed-precision quantization) and different search constraints (e.g., FLOPs, latency). It is thus convenient to use for various needs. It achieves start-of-the-art performance on the large dataset ImageNet.

Via

Access Paper or Ask Questions

Learning to Paint with Model-based Deep Reinforcement Learning

Mar 22, 2019

Zhewei Huang, Wen Heng, Shuchang Zhou

Figure 1 for Learning to Paint with Model-based Deep Reinforcement Learning

Figure 2 for Learning to Paint with Model-based Deep Reinforcement Learning

Figure 3 for Learning to Paint with Model-based Deep Reinforcement Learning

Figure 4 for Learning to Paint with Model-based Deep Reinforcement Learning

Abstract:We show how to teach machines to paint like human painters, who can use a few strokes to create fantastic paintings. By combining the neural renderer and model-based Deep Reinforcement Learning (DRL), our agent can decompose texture-rich images into strokes and make long-term plans. For each stroke, the agent directly determines the position and color of the stroke. Excellent visual effect can be achieved using hundreds of strokes. The training process does not require experience of human painting or stroke tracking data.

* 11 pages, 13 figures

Via

Access Paper or Ask Questions

Stroke-based Character Reconstruction

Oct 17, 2018

Zhewei Huang, Wen Heng, Yuanzheng Tao, Shuchang Zhou

Figure 1 for Stroke-based Character Reconstruction

Figure 2 for Stroke-based Character Reconstruction

Figure 3 for Stroke-based Character Reconstruction

Figure 4 for Stroke-based Character Reconstruction

Abstract:Character reconstruction for noisy character images or character images from real scene is still a challenging problem, due to the bewildering backgrounds, uneven illumination, low resolution and different distortions. We propose a stroke-based character reconstruction(SCR) method that use a weighted quadratic Bezier curve(WQBC) to represent strokes of a character. Only training on our synthetic data, our stroke extractor can achieve excellent reconstruction effect in real scenes. Meanwhile. It can also help achieve great ability in defending adversarial attacks of character recognizers.

* 6 pages, 7 figures

Via

Access Paper or Ask Questions

Harmonic Adversarial Attack Method

Aug 08, 2018

Wen Heng, Shuchang Zhou, Tingting Jiang

Figure 1 for Harmonic Adversarial Attack Method

Figure 2 for Harmonic Adversarial Attack Method

Figure 3 for Harmonic Adversarial Attack Method

Figure 4 for Harmonic Adversarial Attack Method

Abstract:Adversarial attacks find perturbations that can fool models into misclassifying images. Previous works had successes in generating noisy/edge-rich adversarial perturbations, at the cost of degradation of image quality. Such perturbations, even when they are small in scale, are usually easily spottable by human vision. In contrast, we propose Harmonic Adversar- ial Attack Methods (HAAM), that generates edge-free perturbations by using harmonic functions. The property of edge-free guarantees that the generated adversarial images can still preserve visual quality, even when perturbations are of large magnitudes. Experiments also show that adversaries generated by HAAM often have higher rates of success when transferring between models. In addition, we find harmonic perturbations can simulate natural phenomena like natural lighting and shadows. It would then be possible to help find corner cases for given models, as a first step to improving them.

Via

Access Paper or Ask Questions