Yiran Xing

ERNIE 5.0 Technical Report

Feb 04, 2026

Haifeng Wang, Hua Wu, Tian Wu, Yu Sun, Jing Liu, Dianhai Yu, Yanjun Ma, Jingzhou He, Zhongjun He, Dou Hong(+425 more)

Abstract: In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.

3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers

Oct 17, 2021

Zai Shi, Zhao Meng, Yiran Xing, Yunpu Ma, Roger Wattenhofer

Abstract: 3D reconstruction aims to reconstruct 3D objects from 2D views. Previous works for 3D reconstruction mainly focus on feature matching between views or using CNNs as backbones. Recently, Transformers have been shown to be effective in multiple applications of computer vision. However, whether or not Transformers can be used for 3D reconstruction is still unclear. In this paper, we fill this gap by proposing 3D-RETR, which is able to perform end-to-end 3D REconstruction with TRansformers. 3D-RETR first uses a pretrained Transformer to extract visual features from 2D input images. 3D-RETR then uses another Transformer Decoder to obtain the voxel features. A CNN Decoder then takes as input the voxel features to obtain the reconstructed objects. 3D-RETR is capable of 3D reconstruction from a single view or multiple views. Experimental results on two datasets show that 3D-RETR reaches state-of-the-art performance on 3D reconstruction. An additional ablation study also demonstrates that 3D-RETR benefits from using Transformers.

* BMVC 2021

KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation

Jan 02, 2021

Yiran Xing, Zai Shi, Zhao Meng, Yunpu Ma, Roger Wattenhofer

Abstract: We present Knowledge Enhanced Multimodal BART (KM-BART), a Transformer-based sequence-to-sequence model capable of reasoning about commonsense knowledge from multimodal inputs of images and texts. We extend the popular BART architecture to a multimodal model. We design a new pretraining task to improve the model's performance on the Visual Commonsense Generation task. Our pretraining task improves Visual Commonsense Generation performance by leveraging knowledge from a large language model pretrained on an external knowledge graph. To the best of our knowledge, we are the first to propose a dedicated task for improving model performance on Visual Commonsense Generation. Experimental results show that by pretraining, our model reaches state-of-the-art performance on the Visual Commonsense Generation task.

* Work in progress. The first three authors contribute equally to this work
