Abstract:Encouraged by the growing availability of pre-trained 2D diffusion models, image-to-3D generation by leveraging Score Distillation Sampling (SDS) is making remarkable progress. Most existing methods combine novel-view lifting from 2D diffusion models, which usually take the reference image as a condition, with hard L2 image supervision at the reference view. Yet adhering too closely to the image tends to corrupt the inductive knowledge of the 2D diffusion model, frequently leading to flat or distorted 3D generation. In this work, we reexamine image-to-3D from a novel perspective and present Isotropic3D, an image-to-3D generation pipeline that takes only an image CLIP embedding as input. Isotropic3D allows the optimization to be isotropic w.r.t. the azimuth angle by relying solely on the SDS loss. The core of our framework is a two-stage diffusion model fine-tuning. First, we fine-tune a text-to-3D diffusion model by substituting its text encoder with an image encoder, by which the model preliminarily acquires image-to-image capabilities. Second, we perform fine-tuning using our Explicit Multi-view Attention (EMA), which combines noisy multi-view images with the noise-free reference image as an explicit condition. The CLIP embedding is sent to the diffusion model throughout the whole process, while reference images are discarded after fine-tuning. As a result, with a single image CLIP embedding, Isotropic3D is capable of generating mutually consistent multi-view images and also a 3D model with more symmetrical and neat content, well-proportioned geometry, richly colored texture, and less distortion compared with existing image-to-3D methods, while still largely preserving the similarity to the reference image. The project page is available at https://isotropic3d.github.io/. The code and models are available at https://github.com/pkunliu/Isotropic3D.
Abstract:3D semantic scene completion (SSC) is an ill-posed task that requires inferring a dense 3D scene from incomplete observations. Previous methods either explicitly incorporate 3D geometric input or rely on learned 3D priors behind monocular RGB images. However, 3D sensors such as LiDAR are expensive and intrusive, while monocular cameras face challenges in modeling precise geometry due to the inherent ambiguity. In this work, we propose StereoScene for SSC, which explores taking full advantage of light-weight camera inputs without resorting to any external 3D sensors. Our key insight is to leverage stereo matching to resolve geometric ambiguity. To improve its robustness in unmatched areas, we introduce a bird's-eye-view (BEV) representation to inspire hallucination ability with rich context information. On top of the stereo and BEV representations, a mutual interactive aggregation (MIA) module is carefully devised to fully unleash their power. Specifically, a Bi-directional Interaction Transformer (BIT) augmented with confidence re-weighting is used to encourage reliable prediction through mutual guidance, while a Dual Volume Aggregation (DVA) module is designed to facilitate complementary aggregation. Experimental results on SemanticKITTI demonstrate that the proposed StereoScene outperforms the state-of-the-art camera-based methods by a large margin, with a relative improvement of 26.9% in geometry and 38.6% in semantics.
Abstract:Finding the global minimum point of an optimization problem is of interest in engineering fields, yet it is difficult, especially for a nonconvex optimization problem. In this article, we combine a quasi-genetic algorithm with the continuation Newton method to address this problem. First, we use the continuation Newton method with the deflation technique to find as many critical points of the objective function as possible. Then, we use those critical points as the initial evolutionary seeds of the quasi-genetic algorithm. After evolving for several generations (for example, twenty), we obtain a suboptimal point of the optimization problem. Finally, we use this suboptimal point as the initial point of the continuation Newton method to obtain a critical point of the original objective function, and output whichever of this final critical point and the suboptimal point of the quasi-genetic algorithm has the smaller objective value as the global minimum point of the original optimization problem. Numerical results show that the proposed method is quite reliable in finding the global optimal point of the unconstrained optimization problem, compared to the multi-start method (the built-in subroutine GlobalSearch.m of the MATLAB R2020a environment).
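The deflation idea mentioned in the abstract above can be illustrated with a minimal one-dimensional sketch. It assumes plain Newton iterations (not the continuation Newton method), omits the quasi-genetic stage entirely, and uses a hypothetical test objective f(x) = x^4 - 4x^2 + x; the function names are illustrative and are not taken from the paper. After each critical point r is found, the gradient g is divided by (x - r) so that the next Newton run from the same starting point is pushed toward a different critical point.

```python
def newton_deflated(g, dg, x0, known_roots, tol=1e-12, max_iter=200):
    """Newton's method applied to G(x) = g(x) / prod(x - r for r in known_roots)."""
    x = x0
    for _ in range(max_iter):
        gx = g(x)
        # Since G'/G = g'/g - sum(1/(x - r)), the Newton step G/G' equals
        # g / (g' - g * sum(1/(x - r))).
        s = sum(1.0 / (x - r) for r in known_roots)
        denom = dg(x) - gx * s
        if denom == 0.0:
            raise ZeroDivisionError("deflated derivative vanished")
        step = gx / denom
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton iteration did not converge")


def find_critical_points(g, dg, x0, count):
    """Find `count` distinct roots of g, deflating each one once it is found."""
    roots = []
    for _ in range(count):
        roots.append(newton_deflated(g, dg, x0, roots))
    return roots


# Hypothetical nonconvex objective with two local minima and one local maximum.
def f(x):
    return x**4 - 4.0 * x**2 + x


def grad(x):
    return 4.0 * x**3 - 8.0 * x + 1.0


def hess(x):
    return 12.0 * x**2 - 8.0


if __name__ == "__main__":
    points = find_critical_points(grad, hess, x0=0.0, count=3)
    best = min(points, key=f)  # critical point with the smallest objective value
    print(sorted(points), best)
```

In this toy setting the candidate with the smallest objective value is simply picked from the list of critical points; in the method described in the abstract, those points would instead seed the quasi-genetic algorithm.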
Abstract:We consider the weak target detection problem with an unknown parameter in colocated multiple-input multiple-output (MIMO) radar. To cope with the sheer amount of data in large-size systems, a multi-bit quantizer is utilized in the sampling process. As a low-complexity alternative to the classic generalized likelihood ratio test (GLRT) for quantized data, we propose a multi-bit detector based on the Rao test with a closed-form test statistic, whose theoretical asymptotic distribution is provided to characterize the detection performance. In addition, we refine the quantizer design by optimizing the quantization thresholds, which are obtained using the popular particle swarm optimization algorithm (PSOA). Simulations are conducted to demonstrate the performance variations of detectors based on unquantized and quantized data. The numerical results corroborate our theoretical analyses and show that the performance with 3-bit quantization approaches that of the case without quantization.
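As a rough illustration of what a multi-bit quantizer does to each real-valued sample, the sketch below maps a sample to one of 2^q integer codes by comparing it against 2^q - 1 thresholds. It assumes uniformly spaced thresholds on a hypothetical range [lo, hi]; the abstract above instead optimizes the thresholds with PSOA, which is not reproduced here, and the function names are illustrative only.

```python
from bisect import bisect_right


def uniform_thresholds(bits, lo, hi):
    """Return the 2**bits - 1 interior thresholds that split [lo, hi] evenly.

    Uniform spacing is an assumption made for illustration only.
    """
    levels = 2 ** bits
    step = (hi - lo) / levels
    return [lo + k * step for k in range(1, levels)]


def quantize(sample, thresholds):
    """Map a sample to an integer code in 0 .. len(thresholds).

    The code is the number of thresholds that are <= sample, so samples
    below the first threshold get code 0 and samples at or above the last
    threshold get the largest code.
    """
    return bisect_right(thresholds, sample)


if __name__ == "__main__":
    th = uniform_thresholds(bits=2, lo=-2.0, hi=2.0)
    print(th, [quantize(x, th) for x in (-5.0, -0.5, 0.5, 3.0)])
```

Swapping in a different threshold list (for example, one produced by an optimizer) leaves `quantize` unchanged, since it only requires the thresholds to be sorted in ascending order.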
Abstract:Scene text detection has received attention for years and achieved impressive performance across various benchmarks. In this work, we propose an efficient and accurate approach to detect multi-oriented text in scene images. The proposed feature fusion mechanism allows us to use a shallower network to reduce the computational complexity. A self-attention mechanism is adopted to suppress false positive detections. Experiments on public benchmarks including ICDAR 2013, ICDAR 2015 and MSRA-TD500 show that our proposed approach can achieve better or comparable performance with fewer parameters and less computational cost.
Abstract:In this paper, we consider the block-sparse signal recovery problem in the context of multiple measurement vectors (MMV) with common row sparsity patterns. We develop a new method for the recovery of common row sparsity MMV signals, in which a pattern-coupled hierarchical Gaussian prior model is introduced to characterize both the block-sparsity of the coefficients and the statistical dependency between neighboring coefficients of the common row sparsity MMV signals. Unlike many other methods, the proposed method is able to automatically capture the block-sparse structure of the unknown signal. Our method is developed using an expectation-maximization (EM) framework. Simulation results show that our proposed method offers competitive performance in recovering block-sparse MMV signals with common row sparsity patterns.