Abstract:Crystal structure forms the foundation for understanding the physical and chemical properties of materials. Generative models have emerged as a new paradigm in crystal structure prediction(CSP), however, accurately capturing key characteristics of crystal structures, such as periodicity and symmetry, remains a significant challenge. In this paper, we propose a Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction (TransVAE-CSP), who learns the characteristic distribution space of stable materials, enabling both the reconstruction and generation of crystal structures. TransVAE-CSP integrates adaptive distance expansion with irreducible representation to effectively capture the periodicity and symmetry of crystal structures, and the encoder is a transformer network based on an equivariant dot product attention mechanism. Experimental results on the carbon_24, perov_5, and mp_20 datasets demonstrate that TransVAE-CSP outperforms existing methods in structure reconstruction and generation tasks under various modeling metrics, offering a powerful tool for crystal structure design and optimization.
Abstract:Human mesh recovery can be approached using either regression-based or optimization-based methods. Regression models achieve high pose accuracy but struggle with model-to-image alignment due to the lack of explicit 2D-3D correspondences. In contrast, optimization-based methods align 3D models to 2D observations but are prone to local minima and depth ambiguity. In this work, we leverage large vision-language models (VLMs) to generate interactive body part descriptions, which serve as implicit constraints to enhance 3D perception and limit the optimization space. Specifically, we formulate monocular human mesh recovery as a distribution adaptation task by integrating both 2D observations and language descriptions. To bridge the gap between text and 3D pose signals, we first train a text encoder and a pose VQ-VAE, aligning texts to body poses in a shared latent space using contrastive learning. Subsequently, we employ a diffusion-based framework to refine the initial parameters guided by gradients derived from both 2D observations and text descriptions. Finally, the model can produce poses with accurate 3D perception and image consistency. Experimental results on multiple benchmarks validate its effectiveness. The code will be made publicly available.
Abstract:Crystal structure predictions based on the combination of first-principles calculations and machine learning have achieved significant success in materials science. However, most of these approaches are limited to predicting specific systems, which hinders their application to unknown or unexplored domains. In this paper, we present CrySPAI, a crystal structure prediction package developed using artificial intelligence (AI) to predict energetically stable crystal structures of inorganic materials given their chemical compositions. The software consists of three key modules, an evolutionary optimization algorithm (EOA) that searches for all possible crystal structure configurations, density functional theory (DFT) that provides the accurate energy values for these structures, and a deep neural network (DNN) that learns the relationship between crystal structures and their corresponding energies. To optimize the process across these modules, a distributed framework is implemented to parallelize tasks, and an automated workflow has been integrated into CrySPAI for seamless execution. This paper reports the development and implementation of AI AI-based CrySPAI Crystal Prediction Software tool and its unique features.
Abstract:Dynamic multi-person mesh recovery has broad applications in sports broadcasting, virtual reality, and video games. However, current multi-view frameworks rely on a time-consuming camera calibration procedure. In this work, we focus on multi-person motion capture with uncalibrated cameras, which mainly faces two challenges: one is that inter-person interactions and occlusions introduce inherent ambiguities for both camera calibration and motion capture; the other is that a lack of dense correspondences can be used to constrain sparse camera geometries in a dynamic multi-person scene. Our key idea is to incorporate motion prior knowledge to simultaneously estimate camera parameters and human meshes from noisy human semantics. We first utilize human information from 2D images to initialize intrinsic and extrinsic parameters. Thus, the approach does not rely on any other calibration tools or background features. Then, a pose-geometry consistency is introduced to associate the detected humans from different views. Finally, a latent motion prior is proposed to refine the camera parameters and human motions. Experimental results show that accurate camera parameters and human motions can be obtained through a one-step reconstruction. The code are publicly available at~\url{https://github.com/boycehbz/DMMR}.
Abstract:CNN-based object detection models that strike a balance between performance and speed have been gradually used in polyp detection tasks. Nevertheless, accurately locating polyps within complex colonoscopy video scenes remains challenging since existing methods ignore two key issues: intra-sequence distribution heterogeneity and precision-confidence discrepancy. To address these challenges, we propose a novel Temporal-Spatial self-correction detector (TSdetector), which first integrates temporal-level consistency learning and spatial-level reliability learning to detect objects continuously. Technically, we first propose a global temporal-aware convolution, assembling the preceding information to dynamically guide the current convolution kernel to focus on global features between sequences. In addition, we designed a hierarchical queue integration mechanism to combine multi-temporal features through a progressive accumulation manner, fully leveraging contextual consistency information together with retaining long-sequence-dependency features. Meanwhile, at the spatial level, we advance a position-aware clustering to explore the spatial relationships among candidate boxes for recalibrating prediction confidence adaptively, thus eliminating redundant bounding boxes efficiently. The experimental results on three publicly available polyp video dataset show that TSdetector achieves the highest polyp detection rate and outperforms other state-of-the-art methods. The code can be available at https://github.com/soleilssss/TSdetector.
Abstract:Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration, but overlook the modeling of close interactions. In this work, we tackle the task of reconstructing closely interactive humans from a monocular video. The main challenge of this task comes from insufficient visual information caused by depth ambiguity and severe inter-person occlusion. In view of this, we propose to leverage knowledge from proxemic behavior and physics to compensate the lack of visual information. This is based on the observation that human interaction has specific patterns following the social proxemics. Specifically, we first design a latent representation based on Vector Quantised-Variational AutoEncoder (VQ-VAE) to model human interaction. A proxemics and physics guided diffusion model is then introduced to denoise the initial distribution. We design the diffusion model as dual branch with each branch representing one individual such that the interaction can be modeled via cross attention. With the learned priors of VQ-VAE and physical constraint as the additional information, our proposed approach is capable of estimating accurate poses that are also proxemics and physics plausible. Experimental results on Hi4D, 3DPW, and CHI3D demonstrate that our method outperforms existing approaches. The code is available at \url{https://github.com/boycehbz/HumanInteraction}.
Abstract:We present Knowledge NeRF to synthesize novel views for dynamic scenes. Reconstructing dynamic 3D scenes from few sparse views and rendering them from arbitrary perspectives is a challenging problem with applications in various domains. Previous dynamic NeRF methods learn the deformation of articulated objects from monocular videos. However, qualities of their reconstructed scenes are limited. To clearly reconstruct dynamic scenes, we propose a new framework by considering two frames at a time.We pretrain a NeRF model for an articulated object.When articulated objects moves, Knowledge NeRF learns to generate novel views at the new state by incorporating past knowledge in the pretrained NeRF model with minimal observations in the present state. We propose a projection module to adapt NeRF for dynamic scenes, learning the correspondence between pretrained knowledge base and current states. Experimental results demonstrate the effectiveness of our method in reconstructing dynamic 3D scenes with 5 input images in one state. Knowledge NeRF is a new pipeline and promising solution for novel view synthesis in dynamic articulated objects. The data and implementation are publicly available at https://github.com/RussRobin/Knowledge_NeRF.
Abstract:The prediction of chemical synthesis pathways plays a pivotal role in materials science research. Challenges, such as the complexity of synthesis pathways and the lack of comprehensive datasets, currently hinder our ability to predict these chemical processes accurately. However, recent advancements in generative artificial intelligence (GAI), including automated text generation and question-answering systems, coupled with fine-tuning techniques, have facilitated the deployment of large-scale AI models tailored to specific domains. In this study, we harness the power of the LLaMA2-7B model and enhance it through a learning process that incorporates 13,878 pieces of structured material knowledge data. This specialized AI model, named MatChat, focuses on predicting inorganic material synthesis pathways. MatChat exhibits remarkable proficiency in generating and reasoning with knowledge in materials science. Although MatChat requires further refinement to meet the diverse material design needs, this research undeniably highlights its impressive reasoning capabilities and innovative potential in the field of materials science. MatChat is now accessible online and open for use, with both the model and its application framework available as open source. This study establishes a robust foundation for collaborative innovation in the integration of generative AI in materials science.
Abstract:This is a technical report for the GigaCrowd challenge. Reconstructing 3D crowds from monocular images is a challenging problem due to mutual occlusions, server depth ambiguity, and complex spatial distribution. Since no large-scale 3D crowd dataset can be used to train a robust model, the current multi-person mesh recovery methods can hardly achieve satisfactory performance in crowded scenes. In this paper, we exploit the crowd features and propose a crowd-constrained optimization to improve the common single-person method on crowd images. To avoid scale variations, we first detect human bounding-boxes and 2D poses from the original images with off-the-shelf detectors. Then, we train a single-person mesh recovery network using existing in-the-wild image datasets. To promote a more reasonable spatial distribution, we further propose a crowd constraint to refine the single-person network parameters. With the optimization, we can obtain accurate body poses and shapes with reasonable absolute positions from a large-scale crowd image using a single-person backbone. The code will be publicly available at~\url{https://github.com/boycehbz/CrowdRec}.
Abstract:Due to the mutual occlusion, severe scale variation, and complex spatial distribution, the current multi-person mesh recovery methods cannot produce accurate absolute body poses and shapes in large-scale crowded scenes. To address the obstacles, we fully exploit crowd features for reconstructing groups of people from a monocular image. A novel hypergraph relational reasoning network is proposed to formulate the complex and high-order relation correlations among individuals and groups in the crowd. We first extract compact human features and location information from the original high-resolution image. By conducting the relational reasoning on the extracted individual features, the underlying crowd collectiveness and interaction relationship can provide additional group information for the reconstruction. Finally, the updated individual features and the localization information are used to regress human meshes in camera coordinates. To facilitate the network training, we further build pseudo ground-truth on two crowd datasets, which may also promote future research on pose estimation and human behavior understanding in crowded scenes. The experimental results show that our approach outperforms other baseline methods both in crowded and common scenarios. The code and datasets are publicly available at https://github.com/boycehbz/GroupRec.