Abstract:Despite the impressive capabilities of large language models (LLMs), they currently exhibit two primary limitations, \textbf{\uppercase\expandafter{\romannumeral 1}}: They struggle to \textbf{autonomously solve the real world engineering problem}. \textbf{\uppercase\expandafter{\romannumeral 2}}: They remain \textbf{challenged in reasoning through complex logic problems}. To address these challenges, we developed the \textsc{Infant Agent}, integrating task-aware functions, operators, a hierarchical management system, and a memory retrieval mechanism. Together, these components enable large language models to sustain extended reasoning processes and handle complex, multi-step tasks efficiently, all while significantly reducing API costs. Using the \textsc{Infant Agent}, GPT-4o's accuracy on the SWE-bench-lite dataset rises from $\mathbf{0.33\%}$ to $\mathbf{30\%}$, and in the AIME-2024 mathematics competition, it increases GPT-4o's accuracy from $\mathbf{13.3\%}$ to $\mathbf{37\%}$.
Abstract:Dynamic 3D point cloud sequences serve as one of the most common and practical representation modalities of dynamic real-world environments. However, their unstructured nature in both spatial and temporal domains poses significant challenges to effective and efficient processing. Existing deep point cloud sequence modeling approaches imitate the mature 2D video learning mechanisms by developing complex spatio-temporal point neighbor grouping and feature aggregation schemes, often resulting in methods lacking effectiveness, efficiency, and expressive power. In this paper, we propose a novel generic representation called \textit{Structured Point Cloud Videos} (SPCVs). Intuitively, by leveraging the fact that 3D geometric shapes are essentially 2D manifolds, SPCV re-organizes a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points. The structured nature of our SPCV representation allows for the seamless adaptation of well-established 2D image/video techniques, enabling efficient and effective processing and analysis of 3D point cloud sequences. To achieve such re-organization, we design a self-supervised learning pipeline that is geometrically regularized and driven by self-reconstructive and deformation field learning objectives. Additionally, we construct SPCV-based frameworks for both low-level and high-level 3D point cloud sequence processing and analysis tasks, including action recognition, temporal interpolation, and compression. Extensive experiments demonstrate the versatility and superiority of the proposed SPCV, which has the potential to offer new possibilities for deep learning on unstructured 3D point cloud sequences. Code will be released at https://github.com/ZENGYIMING-EAMON/SPCV.
Abstract:Object rearrangement, a fundamental challenge in robotics, demands versatile strategies to handle diverse objects, configurations, and functional needs. To achieve this, the AI robot needs to learn functional rearrangement priors in order to specify precise goals that meet the functional requirements. Previous methods typically learn such priors from either laborious human annotations or manually designed heuristics, which limits scalability and generalization. In this work, we propose a novel approach that leverages large models to distill functional rearrangement priors. Specifically, our approach collects diverse arrangement examples using both LLMs and VLMs and then distills the examples into a diffusion model. During test time, the learned diffusion model is conditioned on the initial configuration and guides the positioning of objects to meet functional requirements. In this manner, we create a handshaking point that combines the strengths of conditional generative models and large models. Extensive experiments on multiple domains, including real-world scenarios, demonstrate the effectiveness of our approach in generating compatible goals for object rearrangement tasks, significantly outperforming baseline methods.
Abstract:Point clouds are characterized by irregularity and unstructuredness, which pose challenges in efficient data exploitation and discriminative feature extraction. In this paper, we present an unsupervised deep neural architecture called Flattening-Net to represent irregular 3D point clouds of arbitrary geometry and topology as a completely regular 2D point geometry image (PGI) structure, in which coordinates of spatial points are captured in colors of image pixels. \mr{Intuitively, Flattening-Net implicitly approximates a locally smooth 3D-to-2D surface flattening process while effectively preserving neighborhood consistency.} \mr{As a generic representation modality, PGI inherently encodes the intrinsic property of the underlying manifold structure and facilitates surface-style point feature aggregation.} To demonstrate its potential, we construct a unified learning framework directly operating on PGIs to achieve \mr{diverse types of high-level and low-level} downstream applications driven by specific task networks, including classification, segmentation, reconstruction, and upsampling. Extensive experiments demonstrate that our methods perform favorably against the current state-of-the-art competitors. We will make the code and data publicly available at https://github.com/keeganhk/Flattening-Net.
Abstract:Motivated by the intuition that the critical step of localizing a 2D image in the corresponding 3D point cloud is establishing 2D-3D correspondence between them, we propose the first feature-based dense correspondence framework for addressing the image-to-point cloud registration problem, dubbed CorrI2P, which consists of three modules, i.e., feature embedding, symmetric overlapping region detection, and pose estimation through the established correspondence. Specifically, given a pair of a 2D image and a 3D point cloud, we first transform them into high-dimensional feature space and feed the resulting features into a symmetric overlapping region detector to determine the region where the image and point cloud overlap each other. Then we use the features of the overlapping regions to establish the 2D-3D correspondence before running EPnP within RANSAC to estimate the camera's pose. Experimental results on KITTI and NuScenes datasets show that our CorrI2P outperforms state-of-the-art image-to-point cloud registration methods significantly. We will make the code publicly available.
Abstract:We propose WarpingGAN, an effective and efficient 3D point cloud generation network. Unlike existing methods that generate point clouds by directly learning the mapping functions between latent codes and 3D shapes, Warping-GAN learns a unified local-warping function to warp multiple identical pre-defined priors (i.e., sets of points uniformly distributed on regular 3D grids) into 3D shapes driven by local structure-aware semantics. In addition, we also ingeniously utilize the principle of the discriminator and tailor a stitching loss to eliminate the gaps between different partitions of a generated shape corresponding to different priors for boosting quality. Owing to the novel generating mechanism, WarpingGAN, a single lightweight network after one-time training, is capable of efficiently generating uniformly distributed 3D point clouds with various resolutions. Extensive experimental results demonstrate the superiority of our WarpingGAN over state-of-the-art methods in terms of quantitative metrics, visual quality, and efficiency. The source code is publicly available at https://github.com/yztang4/WarpingGAN.git.
Abstract:This paper investigates the problem of temporally interpolating dynamic 3D point clouds with large non-rigid deformation. We formulate the problem as estimation of point-wise trajectories (i.e., smooth curves) and further reason that temporal irregularity and under-sampling are two major challenges. To tackle the challenges, we propose IDEA-Net, an end-to-end deep learning framework, which disentangles the problem under the assistance of the explicitly learned temporal consistency. Specifically, we propose a temporal consistency learning module to align two consecutive point cloud frames point-wisely, based on which we can employ linear interpolation to obtain coarse trajectories/in-between frames. To compensate the high-order nonlinear components of trajectories, we apply aligned feature embeddings that encode local geometry properties to regress point-wise increments, which are combined with the coarse estimations. We demonstrate the effectiveness of our method on various point cloud sequences and observe large improvement over state-of-the-art methods both quantitatively and visually. Our framework can bring benefits to 3D motion data acquisition. The source code is publicly available at https://github.com/ZENGYIMING-EAMON/IDEA-Net.git.
Abstract:One-bit compressive sensing is concerned with the accurate recovery of an underlying sparse signal of interest from its one-bit noisy measurements. The conventional signal recovery approaches for this problem are mainly developed based on the assumption that an exact knowledge of the sensing matrix is available. In this work, however, we present a novel data-driven and model-based methodology that achieves blind recovery; i.e., signal recovery without requiring the knowledge of the sensing matrix. To this end, we make use of the deep unfolding technique and develop a model-driven deep neural architecture which is designed for this specific task. The proposed deep architecture is able to learn an alternative sensing matrix by taking advantage of the underlying unfolded algorithm such that the resulting learned recovery algorithm can accurately and quickly (in terms of the number of iterations) recover the underlying compressed signal of interest from its one-bit noisy measurements. In addition, due to the incorporation of the domain knowledge and the mathematical model of the system into the proposed deep architecture, the resulting network benefits from enhanced interpretability, has a very small number of trainable parameters, and requires very small number of training samples, as compared to the commonly used black-box deep neural network alternatives for the problem at hand.
Abstract:This paper addresses the problem of computing dense correspondence between 3D shapes in the form of point clouds, which is a challenging and fundamental problem in computer vision and digital geometry processing. Conventional approaches often solve the problem in a supervised manner, requiring massive annotated data, which is difficult and/or expensive to obtain. Motivated by the intuition that one can transform two aligned point clouds to each other more easily and meaningfully than a misaligned pair, we propose CorrNet3D -- the first unsupervised and end-to-end deep learning-based framework -- to drive the learning of dense correspondence by means of deformation-like reconstruction to overcome the need for annotated data. Specifically, CorrNet3D consists of a deep feature embedding module and two novel modules called correspondence indicator and symmetric deformation. Feeding a pair of raw point clouds, our model first learns the pointwise features and passes them into the indicator to generate a learnable correspondence matrix used to permute the input pair. The symmetric deformer, with an additional regularized loss, transforms the two permuted point clouds to each other to drive the unsupervised learning of the correspondence. The extensive experiments on both synthetic and real-world datasets of rigid and non-rigid 3D shapes show our CorrNet3D outperforms state-of-the-art methods to a large extent, including those taking meshes as input. CorrNet3D is a flexible framework in that it can be easily adapted to supervised learning if annotated data are available.
Abstract:Downsampling is a commonly-used technique in 3D point cloudprocessing for saving storage space, transmission bandwidth andcomputational complexity. The classic downsampling methods,such as farthest point sampling and Poisson disk sampling, thoughwidely used for decades, are independent of the subsequent applications, since the downsampled point cloud may compromisetheir performance severely. This paper explores the problem oftask-oriented downsampling, which aims to downsample a pointcloud and maintain the performance of subsequent applications asmuch as possible. We propose MOPS-Net, a novel end-to-end deepneural network which is designed from the perspective of matrixoptimization, making it fundamentally different from the existing deep learning-based methods. Due to its discrete and combinatorial nature, it is difficult to solve the downsampling problem directly. To tackle this challenge, we relax the binary constraint of each variableand lift 3D points to a higher dimensional feature space, in which a constrained and differentiable matrix optimization problem withan implicit objective function is formulated. We then propose a deep neural network architecture to mimic the matrix optimization problem by exploring both the local and the global structures of the input data. With a task network, MOPS-Net can be end-to-endtrained. Moreover, it is permutation-invariant, making it robust to input data. We also extend MOPS-Net such that a single network after one-time training is capable of handling arbitrary downsam-pling ratios. Extensive experimental results show that MOPS-Net can achieve favorable performance against state-of-the-art deeplearning-based method over various tasks, including point cloud classification, retrieval and reconstruction.