Abstract: We explore Bird's-Eye View (BEV) generation, converting a BEV map into its corresponding multi-view street images. Valued for its unified spatial representation, which aids multi-sensor fusion, BEV is pivotal for various autonomous driving applications. Creating accurate street-view images from BEV maps is essential for portraying complex traffic scenarios and for enhancing driving algorithms. Concurrently, diffusion-based conditional image generation models have demonstrated remarkable results, producing diverse, high-quality, and condition-aligned outputs. Nonetheless, training these models demands substantial data and computational resources. Hence, fine-tuning such advanced models, like Stable Diffusion, for specific conditional generation tasks emerges as a promising avenue. In this paper, we introduce a practical framework for generating images from a BEV layout. Our approach comprises two main components: Neural View Transformation and Street Image Generation. The Neural View Transformation phase converts the BEV map into aligned multi-view semantic segmentation maps by learning the shape correspondence between the BEV and perspective views. The Street Image Generation phase then uses these segmentation maps as conditions to guide a fine-tuned latent diffusion model. The fine-tuning process ensures both view and style consistency. Our model leverages the generative capacity of large pretrained diffusion models within traffic contexts, yielding diverse and condition-coherent street-view images.
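A minimal sketch of the two-stage pipeline described above, assuming hypothetical module names (`NeuralViewTransform`, `StreetImageGenerator`) and toy tensor shapes; the actual view-transformation network and the fine-tuned latent diffusion model are far more involved than these stand-ins.

```python
# Illustrative sketch only: module names, layer choices, and tensor shapes are
# assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class NeuralViewTransform(nn.Module):
    """Maps a BEV semantic map to N aligned perspective-view segmentation maps."""
    def __init__(self, num_classes=8, num_views=6):
        super().__init__()
        self.num_views = num_views
        # Stand-in for the learned BEV-to-perspective shape correspondence.
        self.net = nn.Conv2d(num_classes, num_classes * num_views, kernel_size=3, padding=1)

    def forward(self, bev):                       # bev: (B, C, H, W)
        b, c, h, w = bev.shape
        out = self.net(bev)                       # (B, C * num_views, H, W)
        return out.view(b, self.num_views, c, h, w)

class StreetImageGenerator(nn.Module):
    """Stand-in for a latent diffusion model conditioned on segmentation maps."""
    def __init__(self, num_classes=8):
        super().__init__()
        self.decoder = nn.Conv2d(num_classes, 3, kernel_size=3, padding=1)

    def forward(self, seg):                       # seg: (B, V, C, H, W)
        b, v, c, h, w = seg.shape
        imgs = self.decoder(seg.view(b * v, c, h, w))
        return imgs.view(b, v, 3, h, w)           # multi-view street images

bev = torch.randn(1, 8, 64, 64)                   # toy BEV semantic map
views = NeuralViewTransform()(bev)                # (1, 6, 8, 64, 64)
images = StreetImageGenerator()(views)            # (1, 6, 3, 64, 64)
print(images.shape)
```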
Abstract: Deep neural networks often generalize poorly because their loss landscapes are complex and non-convex. Sharpness-Aware Minimization (SAM) is a popular remedy that smooths the loss landscape by minimizing the maximum increase of the training loss under a perturbation of the weights. However, SAM's indiscriminate perturbation of all parameters is suboptimal and incurs excessive computation, doubling the overhead of common optimizers such as Stochastic Gradient Descent (SGD). In this paper, we propose Sparse SAM (SSAM), an efficient and effective training scheme that applies sparse perturbation through a binary mask. To obtain the sparse mask, we provide two solutions based on Fisher information and dynamic sparse training, respectively. We investigate the impact of different masks, including unstructured, structured, and $N$:$M$ structured patterns, as well as explicit and implicit forms of implementing sparse perturbation. We theoretically prove that SSAM converges at the same rate as SAM, i.e., $O(\log T/\sqrt{T})$. Sparse SAM thus has the potential to accelerate training while still smoothing the loss landscape effectively. Extensive experiments on CIFAR and ImageNet-1K confirm that our method is more efficient than SAM, and that performance is preserved or even improved at a perturbation sparsity of merely 50\%. Code is available at https://github.com/Mi-Peng/Systematic-Investigation-of-Sparse-Perturbed-Sharpness-Aware-Minimization-Optimizer.
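A minimal sketch of one Sparse SAM update, assuming a diagonal-Fisher-style mask built from squared gradients and a fixed 50% sparsity; the paper's actual mask construction (Fisher information or dynamic sparse training), mask update schedule, and optimizer details are not reproduced here.

```python
# Illustrative sketch: the mask criterion, rho, and sparsity level are
# assumptions chosen for brevity, not the paper's exact recipe.
import torch

def ssam_step(model, loss_fn, data, target, base_opt, rho=0.05, sparsity=0.5):
    params = [p for p in model.parameters() if p.requires_grad]

    # First pass: gradients at the current weights.
    base_opt.zero_grad()
    loss_fn(model(data), target).backward()

    # Binary mask: keep the top (1 - sparsity) fraction of entries ranked by
    # squared gradient (a diagonal approximation of Fisher information).
    flat = torch.cat([p.grad.detach().pow(2).flatten() for p in params])
    k = max(1, int((1 - sparsity) * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    masks = [(p.grad.detach().pow(2) >= threshold).float() for p in params]

    # Sparse ascent step: perturb only the masked coordinates.
    grad_norm = torch.norm(torch.stack([p.grad.detach().norm(2) for p in params]))
    eps = []
    with torch.no_grad():
        for p, m in zip(params, masks):
            e = rho * p.grad / (grad_norm + 1e-12) * m
            p.add_(e)
            eps.append(e)

    # Second pass: gradients at the perturbed weights, then restore and update.
    base_opt.zero_grad()
    loss_fn(model(data), target).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    base_opt.step()
```

Here the mask is recomputed from the current gradients at every step purely for simplicity; a dynamic-sparse-training variant would instead update it only periodically.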
Abstract: In this paper, we identify two factors that prevent POMs from achieving high perceptual quality: 1) the center-oriented optimization (COO) problem and 2) the model's low-frequency tendency. First, POMs tend to generate a super-resolved (SR) image whose position in feature space is closest to the distribution center of all potential high-resolution (HR) images, causing them to lose high-frequency details. Second, $90\%$ of an image's area consists of low-frequency signals, whereas human perception relies on an image's high-frequency details; because POMs apply the same computation to regions of different frequency, they tend to restore mainly the low-frequency regions. Based on these two factors, we propose a Detail Enhanced Contrastive Loss (DECLoss), which combines a high-frequency enhancement module with a spatial contrastive learning module to reduce the influence of the COO problem and the low-frequency tendency. Experimental results demonstrate the efficiency and effectiveness of applying DECLoss to several standard SR models. For example, on EDSR, our method achieves a 3.60$\times$ faster learning speed than a GAN-based method, with only a slight degradation in visual quality. In addition, an SR network equipped with our DECLoss generates more realistic and visually pleasing textures than state-of-the-art methods.
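A minimal sketch of the ideas behind DECLoss, assuming a simple blur-subtraction high-pass filter for the high-frequency enhancement module and an InfoNCE-style patch contrastive term; the paper's exact modules, patch sampling strategy, and temperature are not specified here.

```python
# Illustrative sketch: the high-pass filter, patch size, and temperature are
# assumptions, not the paper's exact loss definition.
import torch
import torch.nn.functional as F

def high_freq(x):
    # High-frequency enhancement stand-in: subtract a box-blurred copy.
    kernel = torch.ones(x.size(1), 1, 3, 3, device=x.device) / 9.0
    blur = F.conv2d(x, kernel, padding=1, groups=x.size(1))
    return x - blur

def dec_loss(sr, hr, patch=8, tau=0.1):
    # Work on high-frequency residuals so low-frequency regions dominate less.
    sr_hf, hr_hf = high_freq(sr), high_freq(hr)

    def patches(x):
        # Non-overlapping patches flattened to vectors: (B*N, C*patch*patch)
        p = F.unfold(x, kernel_size=patch, stride=patch)   # (B, C*patch*patch, N)
        return p.transpose(1, 2).reshape(-1, p.size(1))

    q = F.normalize(patches(sr_hf), dim=1)   # anchors from the SR image
    k = F.normalize(patches(hr_hf), dim=1)   # positives: co-located HR patches

    logits = q @ k.t() / tau                  # off-diagonal entries act as negatives
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

sr = torch.rand(2, 3, 64, 64)
hr = torch.rand(2, 3, 64, 64)
print(dec_loss(sr, hr).item())
```

The contrastive term pulls each SR patch toward its co-located HR patch and pushes it away from the other patches, which is one simple way to emphasize high-frequency detail rather than the distribution center of plausible HR images.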