Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hongjun Wang

Senior member, IEEE

Panoptic Captioning: Seeking An Equivalency Bridge for Image and Text

May 22, 2025

Kun-Yu Lin, Hongjun Wang, Weining Ren, Kai Han

Abstract:This work introduces panoptic captioning, a novel task striving to seek the minimum text equivalence of images. We take the first step towards panoptic captioning by formulating it as a task of generating a comprehensive textual description for an image, which encapsulates all entities, their respective locations and attributes, relationships among entities, as well as global image state.Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning. To address this, we propose an effective data engine named PancapEngine to produce high-quality data and a novel method named PancapChain to improve panoptic captioning. Specifically, our PancapEngine first detects diverse categories of entities in images by an elaborate detection suite, and then generates required panoptic captions using entity-aware prompts. Additionally, our PancapChain explicitly decouples the challenging panoptic captioning task into multiple stages and generates panoptic captions step by step. More importantly, we contribute a comprehensive metric named PancapScore and a human-curated test set for reliable model evaluation.Experiments show that our PancapChain-13B model can beat state-of-the-art open-source MLLMs like InternVL-2.5-78B and even surpass proprietary models like GPT-4o and Gemini-2.0-Pro, demonstrating the effectiveness of our data engine and method. Project page: https://visual-ai.github.io/pancap/

* Project page: https://visual-ai.github.io/pancap/

Via

Access Paper or Ask Questions

NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

Apr 19, 2025

Xin Li, Yeying Jin, Xin Jin, Zongwei Wu, Bingchen Li, Yufei Wang, Wenhan Yang, Yu Li, Zhibo Chen, Bihan Wen(+127 more)

Abstract:This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includes day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. There are a total of 361 participants in the competition, and 32 teams submitting valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/.

* Challenge Report of CVPR NTIRE 2025; 26 pages; Methods from 32 teams

Via

Access Paper or Ask Questions

Robustifying Fourier Features Embeddings for Implicit Neural Representations

Feb 08, 2025

Mingze Ma, Qingtian Zhu, Yifan Zhan, Zhengwei Yin, Hongjun Wang, Yinqiang Zheng

Abstract:Implicit Neural Representations (INRs) employ neural networks to represent continuous functions by mapping coordinates to the corresponding values of the target function, with applications e.g., inverse graphics. However, INRs face a challenge known as spectral bias when dealing with scenes containing varying frequencies. To overcome spectral bias, the most common approach is the Fourier features-based methods such as positional encoding. However, Fourier features-based methods will introduce noise to output, which degrades their performances when applied to downstream tasks. In response, this paper initially hypothesizes that combining multi-layer perceptrons (MLPs) with Fourier feature embeddings mutually enhances their strengths, yet simultaneously introduces limitations inherent in Fourier feature embeddings. By presenting a simple theorem, we validate our hypothesis, which serves as a foundation for the design of our solution. Leveraging these insights, we propose the use of multi-layer perceptrons (MLPs) without additive

Via

Access Paper or Ask Questions

Parallel Sequence Modeling via Generalized Spatial Propagation Network

Jan 21, 2025

Hongjun Wang, Wonmin Byeon, Jiarui Xu, Jinwei Gu, Ka Chun Cheung, Xiaolong Wang, Kai Han, Jan Kautz, Sifei Liu

Figure 1 for Parallel Sequence Modeling via Generalized Spatial Propagation Network

Figure 2 for Parallel Sequence Modeling via Generalized Spatial Propagation Network

Figure 3 for Parallel Sequence Modeling via Generalized Spatial Propagation Network

Figure 4 for Parallel Sequence Modeling via Generalized Spatial Propagation Network

Abstract:We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. Existing attention models, including transformers, linear attention, and state-space models like Mamba, process multi-dimensional data as 1D sequences, compromising spatial coherence and efficiency. GSPN overcomes these limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. Central to GSPN is the Stability-Context Condition, which ensures stable, context-aware propagation across 2D sequences and reduces the effective sequence length to $\sqrt{N}$ for a square map with N elements, significantly enhancing computational efficiency. With learnable, input-dependent weights and no reliance on positional embeddings, GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation. Notably, GSPN accelerates SD-XL with softmax-attention by over $84\times$ when generating 16K images.

* Project page: http://whj363636.github.io/GSPN/

Via

Access Paper or Ask Questions

Unveiling the Inflexibility of Adaptive Embedding in Traffic Forecasting

Nov 18, 2024

Hongjun Wang, Jiyuan Chen, Lingyu Zhang, Renhe Jiang, Xuan Song

Figure 1 for Unveiling the Inflexibility of Adaptive Embedding in Traffic Forecasting

Figure 2 for Unveiling the Inflexibility of Adaptive Embedding in Traffic Forecasting

Figure 3 for Unveiling the Inflexibility of Adaptive Embedding in Traffic Forecasting

Figure 4 for Unveiling the Inflexibility of Adaptive Embedding in Traffic Forecasting

Abstract:Spatiotemporal Graph Neural Networks (ST-GNNs) and Transformers have shown significant promise in traffic forecasting by effectively modeling temporal and spatial correlations. However, rapid urbanization in recent years has led to dynamic shifts in traffic patterns and travel demand, posing major challenges for accurate long-term traffic prediction. The generalization capability of ST-GNNs in extended temporal scenarios and cross-city applications remains largely unexplored. In this study, we evaluate state-of-the-art models on an extended traffic benchmark and observe substantial performance degradation in existing ST-GNNs over time, which we attribute to their limited inductive capabilities. Our analysis reveals that this degradation stems from an inability to adapt to evolving spatial relationships within urban environments. To address this limitation, we reconsider the design of adaptive embeddings and propose a Principal Component Analysis (PCA) embedding approach that enables models to adapt to new scenarios without retraining. We incorporate PCA embeddings into existing ST-GNN and Transformer architectures, achieving marked improvements in performance. Notably, PCA embeddings allow for flexibility in graph structures between training and testing, enabling models trained on one city to perform zero-shot predictions on other cities. This adaptability demonstrates the potential of PCA embeddings in enhancing the robustness and generalization of spatiotemporal models.

Via

Access Paper or Ask Questions

Learning to Balance: Diverse Normalization for Cloth-Changing Person Re-Identification

Oct 14, 2024

Hongjun Wang, Jiyuan Chen, Zhengwei Yin, Xuan Song, Yinqiang Zheng

Figure 1 for Learning to Balance: Diverse Normalization for Cloth-Changing Person Re-Identification

Figure 2 for Learning to Balance: Diverse Normalization for Cloth-Changing Person Re-Identification

Figure 3 for Learning to Balance: Diverse Normalization for Cloth-Changing Person Re-Identification

Figure 4 for Learning to Balance: Diverse Normalization for Cloth-Changing Person Re-Identification

Abstract:Cloth-Changing Person Re-Identification (CC-ReID) involves recognizing individuals in images regardless of clothing status. In this paper, we empirically and experimentally demonstrate that completely eliminating or fully retaining clothing features is detrimental to the task. Existing work, either relying on clothing labels, silhouettes, or other auxiliary data, fundamentally aim to balance the learning of clothing and identity features. However, we practically find that achieving this balance is challenging and nuanced. In this study, we introduce a novel module called Diverse Norm, which expands personal features into orthogonal spaces and employs channel attention to separate clothing and identity features. A sample re-weighting optimization strategy is also introduced to guarantee the opposite optimization direction. Diverse Norm presents a simple yet effective approach that does not require additional data. Furthermore, Diverse Norm can be seamlessly integrated ResNet50 and significantly outperforms the state-of-the-art methods.

Via

Access Paper or Ask Questions

On Feature Decorrelation in Cloth-Changing Person Re-identification

Oct 07, 2024

Hongjun Wang, Jiyuan Chen, Renhe Jiang, Xuan Song, Yinqiang Zheng

Figure 1 for On Feature Decorrelation in Cloth-Changing Person Re-identification

Figure 2 for On Feature Decorrelation in Cloth-Changing Person Re-identification

Figure 3 for On Feature Decorrelation in Cloth-Changing Person Re-identification

Figure 4 for On Feature Decorrelation in Cloth-Changing Person Re-identification

Abstract:Cloth-changing person re-identification (CC-ReID) poses a significant challenge in computer vision. A prevailing approach is to prompt models to concentrate on causal attributes, like facial features and hairstyles, rather than confounding elements such as clothing appearance. Traditional methods to achieve this involve integrating multi-modality data or employing manually annotated clothing labels, which tend to complicate the model and require extensive human effort. In our study, we demonstrate that simply reducing feature correlations during training can significantly enhance the baseline model's performance. We theoretically elucidate this effect and introduce a novel regularization technique based on density ratio estimation. This technique aims to minimize feature correlation in the training process of cloth-changing ReID baselines. Our approach is model-independent, offering broad enhancements without needing additional data or labels. We validate our method through comprehensive experiments on prevalent CC-ReID datasets, showing its effectiveness in improving baseline models' generalization capabilities.

Via

Access Paper or Ask Questions

Robust Traffic Forecasting against Spatial Shift over Years

Oct 01, 2024

Hongjun Wang, Jiyuan Chen, Tong Pan, Zheng Dong, Lingyu Zhang, Renhe Jiang, Xuan Song

Figure 1 for Robust Traffic Forecasting against Spatial Shift over Years

Figure 2 for Robust Traffic Forecasting against Spatial Shift over Years

Figure 3 for Robust Traffic Forecasting against Spatial Shift over Years

Figure 4 for Robust Traffic Forecasting against Spatial Shift over Years

Abstract:Recent advancements in Spatiotemporal Graph Neural Networks (ST-GNNs) and Transformers have demonstrated promising potential for traffic forecasting by effectively capturing both temporal and spatial correlations. The generalization ability of spatiotemporal models has received considerable attention in recent scholarly discourse. However, no substantive datasets specifically addressing traffic out-of-distribution (OOD) scenarios have been proposed. Existing ST-OOD methods are either constrained to testing on extant data or necessitate manual modifications to the dataset. Consequently, the generalization capacity of current spatiotemporal models in OOD scenarios remains largely underexplored. In this paper, we investigate state-of-the-art models using newly proposed traffic OOD benchmarks and, surprisingly, find that these models experience a significant decline in performance. Through meticulous analysis, we attribute this decline to the models' inability to adapt to previously unobserved spatial relationships. To address this challenge, we propose a novel Mixture of Experts (MoE) framework, which learns a set of graph generators (i.e., graphons) during training and adaptively combines them to generate new graphs based on novel environmental conditions to handle spatial distribution shifts during testing. We further extend this concept to the Transformer architecture, achieving substantial improvements. Our method is both parsimonious and efficacious, and can be seamlessly integrated into any spatiotemporal model, outperforming current state-of-the-art approaches in addressing spatial dynamics.

Via

Access Paper or Ask Questions

STGformer: Efficient Spatiotemporal Graph Transformer for Traffic Forecasting

Oct 01, 2024

Hongjun Wang, Jiyuan Chen, Tong Pan, Zheng Dong, Lingyu Zhang, Renhe Jiang, Xuan Song

Figure 1 for STGformer: Efficient Spatiotemporal Graph Transformer for Traffic Forecasting

Figure 2 for STGformer: Efficient Spatiotemporal Graph Transformer for Traffic Forecasting

Figure 3 for STGformer: Efficient Spatiotemporal Graph Transformer for Traffic Forecasting

Figure 4 for STGformer: Efficient Spatiotemporal Graph Transformer for Traffic Forecasting

Abstract:Traffic forecasting is a cornerstone of smart city management, enabling efficient resource allocation and transportation planning. Deep learning, with its ability to capture complex nonlinear patterns in spatiotemporal (ST) data, has emerged as a powerful tool for traffic forecasting. While graph neural networks (GCNs) and transformer-based models have shown promise, their computational demands often hinder their application to real-world road networks, particularly those with large-scale spatiotemporal interactions. To address these challenges, we propose a novel spatiotemporal graph transformer (STGformer) architecture. STGformer effectively balances the strengths of GCNs and Transformers, enabling efficient modeling of both global and local traffic patterns while maintaining a manageable computational footprint. Unlike traditional approaches that require multiple attention layers, STG attention block captures high-order spatiotemporal interactions in a single layer, significantly reducing computational cost. In particular, STGformer achieves a 100x speedup and a 99.8\% reduction in GPU memory usage compared to STAEformer during batch inference on a California road graph with 8,600 sensors. We evaluate STGformer on the LargeST benchmark and demonstrate its superiority over state-of-the-art Transformer-based methods such as PDFormer and STAEformer, which underline STGformer's potential to revolutionize traffic forecasting by overcoming the computational and memory limitations of existing approaches, making it a promising foundation for future spatiotemporal modeling tasks.

Via

Access Paper or Ask Questions

Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks

Aug 30, 2024

Hongjun Wang, Sagar Vaze, Kai Han

Figure 1 for Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks

Figure 2 for Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks

Figure 3 for Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks

Figure 4 for Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks

Abstract:Detecting test-time distribution shift has emerged as a key capability for safely deployed machine learning models, with the question being tackled under various guises in recent years. In this paper, we aim to provide a consolidated view of the two largest sub-fields within the community: out-of-distribution (OOD) detection and open-set recognition (OSR). In particular, we aim to provide rigorous empirical analysis of different methods across settings and provide actionable takeaways for practitioners and researchers. Concretely, we make the following contributions: (i) We perform rigorous cross-evaluation between state-of-the-art methods in the OOD detection and OSR settings and identify a strong correlation between the performances of methods for them; (ii) We propose a new, large-scale benchmark setting which we suggest better disentangles the problem tackled by OOD detection and OSR, re-evaluating state-of-the-art OOD detection and OSR methods in this setting; (iii) We surprisingly find that the best performing method on standard benchmarks (Outlier Exposure) struggles when tested at scale, while scoring rules which are sensitive to the deep feature magnitude consistently show promise; and (iv) We conduct empirical analysis to explain these phenomena and highlight directions for future research. Code: https://github.com/Visual-AI/Dissect-OOD-OSR

* Accepted to IJCV, preprint version; v2: add supplementary

Via

Access Paper or Ask Questions