Abstract: This technical report analyzes non-contrast CT image segmentation in computer vision. It revisits a proposed method, examines the background of non-contrast CT imaging, and highlights the significance of segmentation. The study reviews representative methods, including convolution-based and CNN-Transformer hybrid approaches, discussing their contributions, advantages, and limitations. Among these, nnUNet stands out as the state-of-the-art method across various segmentation tasks. The report explores the relationship between the proposed method and existing approaches, emphasizing the role of global context modeling in semantic labeling and mask generation. Future directions include addressing the long-tail problem, utilizing pre-trained models for medical imaging, and exploring self-supervised or contrastive pre-training techniques. This report offers insights into non-contrast CT image segmentation and potential advancements in the field.
Abstract: This report reviews recent advancements in human motion prediction, reconstruction, and generation. Human motion prediction focuses on forecasting future poses and movements from historical data, addressing challenges like nonlinear dynamics, occlusions, and motion style variations. Reconstruction aims to recover accurate 3D human body movements from visual inputs, often leveraging transformer-based architectures, diffusion models, and physical consistency losses to handle noise and complex poses. Motion generation synthesizes realistic and diverse motions from action labels, textual descriptions, or environmental constraints, with applications in robotics, gaming, and virtual avatars. Additionally, text-to-motion generation and human-object interaction modeling have gained attention, enabling fine-grained and context-aware motion synthesis for augmented reality and robotics. This review highlights key methodologies, datasets, challenges, and future research directions driving progress in these fields.
Abstract: Segmentation of 3D medical images is a critical task for accurate diagnosis and treatment planning. Convolutional neural networks (CNNs) have dominated the field, achieving significant success in 3D medical image segmentation. However, CNNs struggle to capture long-range dependencies and global context, which limits their performance, particularly on fine and complex structures. Recent transformer-based models, such as TransUNet and nnFormer, have shown promise in addressing these limitations, though they still rely on hybrid CNN-transformer architectures. This paper introduces a novel, fully convolution-free model built on the transformer architecture and self-attention mechanisms for 3D medical image segmentation. Our approach focuses on improving multi-semantic segmentation accuracy and addressing domain adaptation challenges between thick-slice and thin-slice CT images. We propose a joint loss function that enables effective segmentation of thin slices from thick-slice annotations, overcoming limitations in dataset availability. Furthermore, we present a benchmark dataset for multi-semantic segmentation on thin slices, addressing a gap in current medical imaging research. Our experiments demonstrate the superiority of the proposed model over traditional and hybrid architectures, offering new insights into the future of convolution-free medical image segmentation.
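The contrast between local convolutions and global self-attention mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's model: the helper names `patchify` and `self_attention`, the patch size, and the identity projections (Q = K = V) are all assumptions chosen to keep the example short. It only shows that once a 3D volume is split into patch tokens, every token receives a nonzero attention weight from every other token in a single layer.

```python
import math

def patchify(volume, p):
    """Split a D x H x W nested-list volume into non-overlapping p*p*p patches.

    Each patch is flattened into one token vector. Assumes D, H and W are
    divisible by p (hypothetical helper, not from the paper).
    """
    D, H, W = len(volume), len(volume[0]), len(volume[0][0])
    tokens = []
    for z in range(0, D, p):
        for y in range(0, H, p):
            for x in range(0, W, p):
                tokens.append([volume[z + dz][y + dy][x + dx]
                               for dz in range(p)
                               for dy in range(p)
                               for dx in range(p)])
    return tokens

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections, so queries, keys and values are the
    tokens themselves. A real model would use learned projection matrices.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = [softmax([scale * sum(a * b for a, b in zip(q, k)) for k in tokens])
               for q in tokens]
    out = [[sum(w * v[j] for w, v in zip(w_row, tokens)) for j in range(d)]
           for w_row in weights]
    return out, weights

# Toy 4 x 4 x 4 "CT volume" with synthetic intensities.
volume = [[[float((z + y + x) % 3) for x in range(4)]
           for y in range(4)]
          for z in range(4)]
tokens = patchify(volume, 2)           # 8 tokens, each of dimension 8
out, weights = self_attention(tokens)  # weights is an 8 x 8 attention matrix
```

Each row of `weights` is a probability distribution over all patches, so distant regions of the volume can influence one another directly, whereas a convolution with a small kernel would need many stacked layers to connect them.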
Abstract: Human motion generation is a significant pursuit in generative computer vision with widespread applications in film-making, video games, AR/VR, and human-robot interaction. Current methods mainly utilize either diffusion-based generative models or autoregressive models for text-to-motion generation. However, they face two significant challenges: (1) The generation process is time-consuming, posing a major obstacle for real-time applications such as gaming, robot manipulation, and other online settings. (2) These methods typically learn a relative motion representation guided by text, making it difficult to generate motion sequences with precise joint-level control. These challenges significantly hinder progress and limit the real-world application of human motion generation techniques. To address these challenges, we propose a simple yet effective architecture consisting of two key components. Firstly, we aim to improve hardware efficiency and reduce computational complexity in transformer-based diffusion models for human motion generation. By customizing flash linear attention, we can optimize these models specifically for generating human motion efficiently. Furthermore, we will customize the consistency model in the motion latent space to further accelerate motion generation. Secondly, we introduce Motion ControlNet, which enables more precise joint-level control of human motion compared to previous text-to-motion generation methods. These contributions represent a significant advancement for text-to-motion generation, bringing it closer to real-world applications.
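The efficiency argument behind linear attention in this abstract can be made concrete with a small sketch. It shows generic, non-causal linear attention with the feature map elu(x) + 1, which is an assumption made here for illustration; it is not the customized flash linear attention kernel the abstract refers to, and the function names are hypothetical. The point is that summarizing keys and values once gives the same output as the quadratic form without ever building the N x N similarity matrix.

```python
import math

def phi(x):
    """Positive feature map elu(v) + 1 applied elementwise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def quadratic_attention(Q, K, V):
    """Reference form: computes all N x N similarities phi(q) . phi(k)."""
    fq = [phi(q) for q in Q]
    fk = [phi(k) for k in K]
    dv = len(V[0])
    out = []
    for q in fq:
        sims = [sum(a * b for a, b in zip(q, k)) for k in fk]
        z = sum(sims)  # normalizer, strictly positive because phi > 0
        out.append([sum(s * v[j] for s, v in zip(sims, V)) / z
                    for j in range(dv)])
    return out

def linear_attention(Q, K, V):
    """Linear form: accumulate S = sum_k phi(k) v^T and z = sum_k phi(k) once.

    Cost grows linearly with sequence length instead of quadratically.
    """
    d, dv = len(Q[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(K, V):
        fk = phi(k)
        for i in range(d):
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(a * b for a, b in zip(fq, z))
        out.append([sum(fq[i] * S[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out

# Toy sequence of 6 "motion frames" with synthetic features.
Q = [[math.sin(i + j) for j in range(3)] for i in range(6)]
K = [[math.cos(i * j) for j in range(3)] for i in range(6)]
V = [[float(i), float(i % 2)] for i in range(6)]

ref = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
max_diff = max(abs(a - b) for r, f in zip(ref, fast) for a, b in zip(r, f))
```

Here `max_diff` stays at floating-point rounding level, which confirms that the two forms agree. For long motion sequences the linear form avoids the quadratic memory and compute cost, which is the property that makes it attractive for real-time generation.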