Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shengcong Chen

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Aug 07, 2025

Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo(+4 more)

Abstract:We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.

* https://genie-envisioner.github.io/

Via

Access Paper or Ask Questions

EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models

May 14, 2025

Hu Yue, Siyuan Huang, Yue Liao, Shengcong Chen, Pengfei Zhou, Liliang Chen, Maoqing Yao, Guanghui Ren

Abstract:Recent advances in creative AI have enabled the synthesis of high-fidelity images and videos conditioned on language instructions. Building on these developments, text-to-video diffusion models have evolved into embodied world models (EWMs) capable of generating physically plausible scenes from language commands, effectively bridging vision and action in embodied AI applications. This work addresses the critical challenge of evaluating EWMs beyond general perceptual metrics to ensure the generation of physically grounded and action-consistent behaviors. We propose the Embodied World Model Benchmark (EWMBench), a dedicated framework designed to evaluate EWMs based on three key aspects: visual scene consistency, motion correctness, and semantic alignment. Our approach leverages a meticulously curated dataset encompassing diverse scenes and motion patterns, alongside a comprehensive multi-dimensional evaluation toolkit, to assess and compare candidate models. The proposed benchmark not only identifies the limitations of existing video generation models in meeting the unique requirements of embodied tasks but also provides valuable insights to guide future advancements in the field. The dataset and evaluation tools are publicly available at https://github.com/AgibotTech/EWMBench.

* Website: https://github.com/AgibotTech/EWMBench

Via

Access Paper or Ask Questions

EnerVerse-AC: Envisioning Embodied Environments with Action Condition

May 14, 2025

Yuxin Jiang, Shengcong Chen, Siyuan Huang, Liliang Chen, Pengfei Zhou, Yue Liao, Xindong He, Chiming Liu, Hongsheng Li, Maoqing Yao(+1 more)

Abstract:Robotic imitation learning has advanced from solving static tasks to addressing dynamic interaction scenarios, but testing and evaluation remain costly and challenging due to the need for real-time interaction with dynamic environments. We propose EnerVerse-AC (EVAC), an action-conditional world model that generates future visual observations based on an agent's predicted actions, enabling realistic and controllable robotic inference. Building on prior architectures, EVAC introduces a multi-level action-conditioning mechanism and ray map encoding for dynamic multi-view image generation while expanding training data with diverse failure trajectories to improve generalization. As both a data engine and evaluator, EVAC augments human-collected trajectories into diverse datasets and generates realistic, action-conditioned video observations for policy testing, eliminating the need for physical robots or complex simulations. This approach significantly reduces costs while maintaining high fidelity in robotic manipulation evaluation. Extensive experiments validate the effectiveness of our method. Code, checkpoints, and datasets can be found at <https://annaj2178.github.io/EnerverseAC.github.io>.

* Website: https://annaj2178.github.io/EnerverseAC.github.io

Via

Access Paper or Ask Questions

EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

Jan 03, 2025

Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Peng Gao, Hongsheng Li, Maoqing Yao, Guanghui Ren

Figure 1 for EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

Figure 2 for EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

Figure 3 for EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

Figure 4 for EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

Abstract:We introduce EnerVerse, a comprehensive framework for embodied future space generation specifically designed for robotic manipulation tasks. EnerVerse seamlessly integrates convolutional and bidirectional attention mechanisms for inner-chunk space modeling, ensuring low-level consistency and continuity. Recognizing the inherent redundancy in video data, we propose a sparse memory context combined with a chunkwise unidirectional generative paradigm to enable the generation of infinitely long sequences. To further augment robotic capabilities, we introduce the Free Anchor View (FAV) space, which provides flexible perspectives to enhance observation and analysis. The FAV space mitigates motion modeling ambiguity, removes physical constraints in confined environments, and significantly improves the robot's generalization and adaptability across various tasks and settings. To address the prohibitive costs and labor intensity of acquiring multi-camera observations, we present a data engine pipeline that integrates a generative model with 4D Gaussian Splatting (4DGS). This pipeline leverages the generative model's robust generalization capabilities and the spatial constraints provided by 4DGS, enabling an iterative enhancement of data quality and diversity, thus creating a data flywheel effect that effectively narrows the sim-to-real gap. Finally, our experiments demonstrate that the embodied future space generation prior substantially enhances policy predictive capabilities, resulting in improved overall performance, particularly in long-range robotic manipulation tasks.

* Website: https://sites.google.com/view/enerverse

Via

Access Paper or Ask Questions

DARC: Distribution-Aware Re-Coloring Model for Generalizable Nucleus Segmentation

Sep 01, 2023

Shengcong Chen, Changxing Ding, Dacheng Tao, Hao Chen

Abstract:Nucleus segmentation is usually the first step in pathological image analysis tasks. Generalizable nucleus segmentation refers to the problem of training a segmentation model that is robust to domain gaps between the source and target domains. The domain gaps are usually believed to be caused by the varied image acquisition conditions, e.g., different scanners, tissues, or staining protocols. In this paper, we argue that domain gaps can also be caused by different foreground (nucleus)-background ratios, as this ratio significantly affects feature statistics that are critical to normalization layers. We propose a Distribution-Aware Re-Coloring (DARC) model that handles the above challenges from two perspectives. First, we introduce a re-coloring method that relieves dramatic image color variations between different domains. Second, we propose a new instance normalization method that is robust to the variation in foreground-background ratios. We evaluate the proposed methods on two H$\&$E stained image datasets, named CoNSeP and CPM17, and two IHC stained image datasets, called DeepLIIF and BC-DeepLIIF. Extensive experimental results justify the effectiveness of our proposed DARC model. Codes are available at \url{https://github.com/csccsccsccsc/DARC

* Accepted by MICCAI 2023

Via

Access Paper or Ask Questions

CPP-Net: Context-aware Polygon Proposal Network for Nucleus Segmentation

Feb 13, 2021

Shengcong Chen, Changxing Ding, Minfeng Liu, Dacheng Tao

Figure 1 for CPP-Net: Context-aware Polygon Proposal Network for Nucleus Segmentation

Figure 2 for CPP-Net: Context-aware Polygon Proposal Network for Nucleus Segmentation

Figure 3 for CPP-Net: Context-aware Polygon Proposal Network for Nucleus Segmentation

Figure 4 for CPP-Net: Context-aware Polygon Proposal Network for Nucleus Segmentation

Abstract:Nucleus segmentation is a challenging task due to the crowded distribution and blurry boundaries of nuclei. Recent approaches represent nuclei by means of polygons to differentiate between touching and overlapping nuclei and have accordingly achieved promising performance. Each polygon is represented by a set of centroid-to-boundary distances, which are in turn predicted by features of the centroid pixel for a single nucleus. However, using the centroid pixel alone does not provide sufficient contextual information for robust prediction. To handle this problem, we propose a Context-aware Polygon Proposal Network (CPP-Net) for nucleus segmentation. First, we sample a point set rather than one single pixel within each cell for distance prediction. This strategy substantially enhances contextual information and thereby improves the robustness of the prediction. Second, we propose a Confidence-based Weighting Module, which adaptively fuses the predictions from the sampled point set. Third, we introduce a novel Shape-Aware Perceptual (SAP) loss that constrains the shape of the predicted polygons. Here, the SAP loss is based on an additional network that is pre-trained by means of mapping the centroid probability map and the pixel-to-boundary distance maps to a different nucleus representation. Extensive experiments justify the effectiveness of each component in the proposed CPP-Net. Finally, CPP-Net is found to achieve state-of-the-art performance on three publicly available databases, namely DSB2018, BBBC06, and PanNuke. Code of this paper will be released.

Via

Access Paper or Ask Questions

Boundary-assisted Region Proposal Networks for Nucleus Segmentation

Jun 04, 2020

Shengcong Chen, Changxing Ding, Dacheng Tao

Figure 1 for Boundary-assisted Region Proposal Networks for Nucleus Segmentation

Figure 2 for Boundary-assisted Region Proposal Networks for Nucleus Segmentation

Figure 3 for Boundary-assisted Region Proposal Networks for Nucleus Segmentation

Figure 4 for Boundary-assisted Region Proposal Networks for Nucleus Segmentation

Abstract:Nucleus segmentation is an important task in medical image analysis. However, machine learning models cannot perform well because there are large amount of clusters of crowded nuclei. To handle this problem, existing approaches typically resort to sophisticated hand-crafted post-processing strategies; therefore, they are vulnerable to the variation of post-processing hyper-parameters. Accordingly, in this paper, we devise a Boundary-assisted Region Proposal Network (BRP-Net) that achieves robust instance-level nucleus segmentation. First, we propose a novel Task-aware Feature Encoding (TAFE) network that efficiently extracts respective high-quality features for semantic segmentation and instance boundary detection tasks. This is achieved by carefully considering the correlation and differences between the two tasks. Second, coarse nucleus proposals are generated based on the predictions of the above two tasks. Third, these proposals are fed into instance segmentation networks for more accurate prediction. Experimental results demonstrate that the performance of BRP-Net is robust to the variation of post-processing hyper-parameters. Furthermore, BRP-Net achieves state-of-the-art performances on both the Kumar and CPM17 datasets. The code of BRP-Net will be released at https://github.com/csccsccsccsc/brpnet.

* Early Acception by MICCAI 2020

Via

Access Paper or Ask Questions