Abstract:We present R3-Avatar, incorporating a temporal codebook, to overcome the inability of human avatars to be both animatable and of high-fidelity rendering quality. Existing video-based reconstruction of 3D human avatars either focuses solely on rendering, lacking animation support, or learns a pose-appearance mapping for animating, which degrades under limited training poses or complex clothing. In this paper, we adopt a "record-retrieve-reconstruct" strategy that ensures high-quality rendering from novel views while mitigating degradation in novel poses. Specifically, disambiguating timestamps record temporal appearance variations in a codebook, ensuring high-fidelity novel-view rendering, while novel poses retrieve corresponding timestamps by matching the most similar training poses for augmented appearance. Our R3-Avatar outperforms cutting-edge video-based human avatar reconstruction, particularly in overcoming visual quality degradation in extreme scenarios with limited training human poses and complex clothing.
Abstract:Group Activity Understanding is predominantly studied as Group Activity Recognition (GAR) task. However, existing GAR benchmarks suffer from coarse-grained activity vocabularies and the only data form in single-view, which hinder the evaluation of state-of-the-art algorithms. To address these limitations, we introduce SGA-INTERACT, the first 3D skeleton-based benchmark for group activity understanding. It features complex activities inspired by basketball tactics, emphasizing rich spatial interactions and long-term dependencies. SGA-INTERACT introduces Temporal Group Activity Localization (TGAL) task, extending group activity understanding to untrimmed sequences, filling the gap left by GAR as a standalone task. In addition to the benchmark, we propose One2Many, a novel framework that employs a pretrained 3D skeleton backbone for unified individual feature extraction. This framework aligns with the feature extraction paradigm in RGB-based methods, enabling direct evaluation of RGB-based models on skeleton-based benchmarks. We conduct extensive evaluations on SGA-INTERACT using two skeleton-based methods, three RGB-based methods, and a proposed baseline within the One2Many framework. The general low performance of baselines highlights the benchmark's challenges, motivating advancements in group activity understanding.
Abstract:In this paper, we introduce the HexPlane representation for 3D semantic scene understanding. Specifically, we first design the View Projection Module (VPM) to project the 3D point cloud into six planes to maximally retain the original spatial information. Features of six planes are extracted by the 2D encoder and sent to the HexPlane Association Module (HAM) to adaptively fuse the most informative information for each point. The fused point features are further fed to the task head to yield the ultimate predictions. Compared to the popular point and voxel representation, the HexPlane representation is efficient and can utilize highly optimized 2D operations to process sparse and unordered 3D point clouds. It can also leverage off-the-shelf 2D models, network weights, and training recipes to achieve accurate scene understanding in 3D space. On ScanNet and SemanticKITTI benchmarks, our algorithm, dubbed HexNet3D, achieves competitive performance with previous algorithms. In particular, on the ScanNet 3D segmentation task, our method obtains 77.0 mIoU on the validation set, surpassing Point Transformer V2 by 1.6 mIoU. We also observe encouraging results in indoor 3D detection tasks. Note that our method can be seamlessly integrated into existing voxel-based, point-based, and range-based approaches and brings considerable gains without bells and whistles. The codes will be available upon publication.
Abstract:Traffic simulation, complementing real-world data with a long-tail distribution, allows for effective evaluation and enhancement of the ability of autonomous vehicles to handle accident-prone scenarios. Simulating such safety-critical scenarios is nontrivial, however, from log data that are typically regular scenarios, especially in consideration of dynamic adversarial interactions between the future motions of autonomous vehicles and surrounding traffic participants. To address it, this paper proposes an innovative and efficient strategy, termed IntSim, that explicitly decouples the driving intentions of surrounding actors from their motion planning for realistic and efficient safety-critical simulation. We formulate the adversarial transfer of driving intention as an optimization problem, facilitating extensive exploration of diverse attack behaviors and efficient solution convergence. Simultaneously, intention-conditioned motion planning benefits from powerful deep models and large-scale real-world data, permitting the simulation of realistic motion behaviors for actors. Specially, through adapting driving intentions based on environments, IntSim facilitates the flexible realization of dynamic adversarial interactions with autonomous vehicles. Finally, extensive open-loop and closed-loop experiments on real-world datasets, including nuScenes and Waymo, demonstrate that the proposed IntSim achieves state-of-the-art performance in simulating realistic safety-critical scenarios and further improves planners in handling such scenarios.
Abstract:Multi-class segmentation of the aorta in computed tomography angiography (CTA) scans is essential for diagnosing and planning complex endovascular treatments for patients with aortic dissections. However, existing methods reduce aortic segmentation to a binary problem, limiting their ability to measure diameters across different branches and zones. Furthermore, no open-source dataset is currently available to support the development of multi-class aortic segmentation methods. To address this gap, we organized the AortaSeg24 MICCAI Challenge, introducing the first dataset of 100 CTA volumes annotated for 23 clinically relevant aortic branches and zones. This dataset was designed to facilitate both model development and validation. The challenge attracted 121 teams worldwide, with participants leveraging state-of-the-art frameworks such as nnU-Net and exploring novel techniques, including cascaded models, data augmentation strategies, and custom loss functions. We evaluated the submitted algorithms using the Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD), highlighting the approaches adopted by the top five performing teams. This paper presents the challenge design, dataset details, evaluation metrics, and an in-depth analysis of the top-performing algorithms. The annotated dataset, evaluation code, and implementations of the leading methods are publicly available to support further research. All resources can be accessed at https://aortaseg24.grand-challenge.org.
Abstract:While 3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in novel view synthesis and real-time rendering, the high memory consumption due to the use of millions of Gaussians limits its practicality. To mitigate this issue, improvements have been made by pruning unnecessary Gaussians, either through a hand-crafted criterion or by using learned masks. However, these methods deterministically remove Gaussians based on a snapshot of the pruning moment, leading to sub-optimized reconstruction performance from a long-term perspective. To address this issue, we introduce MaskGaussian, which models Gaussians as probabilistic entities rather than permanently removing them, and utilize them according to their probability of existence. To achieve this, we propose a masked-rasterization technique that enables unused yet probabilistically existing Gaussians to receive gradients, allowing for dynamic assessment of their contribution to the evolving scene and adjustment of their probability of existence. Hence, the importance of Gaussians iteratively changes and the pruned Gaussians are selected diversely. Extensive experiments demonstrate the superiority of the proposed method in achieving better rendering quality with fewer Gaussians than previous pruning methods, pruning over 60% of Gaussians on average with only a 0.02 PSNR decline. Our code can be found at: https://github.com/kaikai23/MaskGaussian
Abstract:Speech Emotion Recognition (SER) plays a critical role in enhancing user experience within human-computer interaction. However, existing methods are overwhelmed by temporal domain analysis, overlooking the valuable envelope structures of the frequency domain that are equally important for robust emotion recognition. To overcome this limitation, we propose TF-Mamba, a novel multi-domain framework that captures emotional expressions in both temporal and frequency dimensions.Concretely, we propose a temporal-frequency mamba block to extract temporal- and frequency-aware emotional features, achieving an optimal balance between computational efficiency and model expressiveness. Besides, we design a Complex Metric-Distance Triplet (CMDT) loss to enable the model to capture representative emotional clues for SER. Extensive experiments on the IEMOCAP and MELD datasets show that TF-Mamba surpasses existing methods in terms of model size and latency, providing a more practical solution for future SER applications.
Abstract:Image captioning models often suffer from performance degradation when applied to novel datasets, as they are typically trained on domain-specific data. To enhance generalization in out-of-domain scenarios, retrieval-augmented approaches have garnered increasing attention. However, current methods face two key challenges: (1) image features used for retrieval are often optimized based on ground-truth (GT) captions, which represent the image from a specific perspective and are influenced by annotator biases, and (2) they underutilize the full potential of retrieved text, typically relying on raw captions or parsed objects, which fail to capture the full semantic richness of the data. In this paper, we propose Dive Into Retrieval (DIR), a method designed to enhance both the image-to-text retrieval process and the utilization of retrieved text to achieve a more comprehensive understanding of the visual content. Our approach introduces two key innovations: (1) diffusion-guided retrieval enhancement, where a pretrained diffusion model guides image feature learning by reconstructing noisy images, allowing the model to capture more comprehensive and fine-grained visual information beyond standard annotated captions; and (2) a high-quality retrieval database, which provides comprehensive semantic information to enhance caption generation, especially in out-of-domain scenarios. Extensive experiments demonstrate that DIR not only maintains competitive in-domain performance but also significantly improves out-of-domain generalization, all without increasing inference costs.
Abstract:3D human avatars, through the use of canonical radiance fields and per-frame observed warping, enable high-fidelity rendering and animating. However, existing methods, which rely on either spatial SMPL(-X) poses or temporal embeddings, respectively suffer from coarse rendering quality or limited animation flexibility. To address these challenges, we propose GAST, a framework that unifies 3D human modeling with 3DGS by hierarchically integrating both spatial and temporal information. Specifically, we design a sequential conditioning framework for the non-rigid warping of the human body, under whose guidance more accurate 3D Gaussians can be obtained in the observation space. Moreover, the explicit properties of Gaussians allow us to embed richer sequential information, encompassing both the coarse sequence of human poses and finer per-vertex motion details. These sequence conditions are further sampled across different temporal scales, in a coarse-to-fine manner, ensuring unbiased inputs for non-rigid warping. Experimental results demonstrate that our method combined with hierarchical spatio-temporal modeling surpasses concurrent baselines, delivering both high-quality rendering and flexible animating capabilities.
Abstract:The development of 3D human avatars from multi-view videos represents a significant yet challenging task in the field. Recent advancements, including 3D Gaussian Splattings (3DGS), have markedly progressed this domain. Nonetheless, existing techniques necessitate the use of high-quality sharp images, which are often impractical to obtain in real-world settings due to variations in human motion speed and intensity. In this study, we attempt to explore deriving sharp intrinsic 3D human Gaussian avatars from blurry video footage in an end-to-end manner. Our approach encompasses a 3D-aware, physics-oriented model of blur formation attributable to human movement, coupled with a 3D human motion model to clarify ambiguities found in motion-induced blurry images. This methodology facilitates the concurrent learning of avatar model parameters and the refinement of sub-frame motion parameters from a coarse initialization. We have established benchmarks for this task through a synthetic dataset derived from existing multi-view captures, alongside a real-captured dataset acquired through a 360-degree synchronous hybrid-exposure camera system. Comprehensive evaluations demonstrate that our model surpasses existing baselines.