Wanzeng Kong

3D Scene-Camera Representation with Joint Camera Photometric Optimization

Jun 26, 2025

Weichen Dai, Kangcheng Ma, Jiaxin Wang, Kecen Pan, Yuhang Ming, Hua Zhang, Wanzeng Kong

Abstract: Representing scenes from multi-view images is a crucial task in computer vision with extensive applications. However, inherent photometric distortions in camera imaging can significantly degrade image quality. Without accounting for these distortions, the 3D scene representation may inadvertently incorporate erroneous information unrelated to the scene, diminishing the quality of the representation. In this paper, we propose a novel 3D scene-camera representation with joint camera photometric optimization. By introducing internal and external photometric models, we propose a full photometric model and a corresponding camera representation. By simultaneously optimizing the parameters of the camera representation, the proposed method effectively separates scene-unrelated information from the 3D scene representation. Additionally, during the optimization of the photometric parameters, we introduce a depth regularization to prevent the 3D scene representation from fitting scene-unrelated information. By incorporating the camera model as part of the mapping process, the proposed method constructs a complete map that includes both the scene radiance field and the camera photometric model. Experimental results demonstrate that the proposed method can achieve high-quality 3D scene representations, even under conditions of imaging degradation, such as vignetting and dirt.
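
As a rough illustration of what a camera photometric term can look like, the sketch below models vignetting as a radial polynomial gain applied to scene radiance. This is a commonly used generic form, not the model from the paper: the abstract does not give the formulation, and the function names, the radius normalization, and the coefficients are all assumptions made for this example.

```python
def vignette_gain(x, y, width, height, coeffs):
    """Radial polynomial gain 1 + a1*r^2 + a2*r^4 + a3*r^6 (hypothetical model).

    r is the distance from the image center, normalized so that r = 1 at the
    image corners. Assumes width > 1 and height > 1.
    """
    cx, cy = (width - 1) / 2, (height - 1) / 2
    r2 = ((x - cx) ** 2 + (y - cy) ** 2) / (cx ** 2 + cy ** 2)
    a1, a2, a3 = coeffs
    return 1 + a1 * r2 + a2 * r2 ** 2 + a3 * r2 ** 3


def apply_vignette(radiance, x, y, width, height, coeffs):
    """Simulate the observed intensity: scene radiance attenuated by the gain."""
    return radiance * vignette_gain(x, y, width, height, coeffs)


def remove_vignette(observed, x, y, width, height, coeffs):
    """Recover scene radiance by dividing out the (assumed known) gain."""
    return observed / vignette_gain(x, y, width, height, coeffs)


if __name__ == "__main__":
    coeffs = (-0.3, -0.1, 0.0)  # illustrative values only
    observed = apply_vignette(0.8, 0, 0, 5, 5, coeffs)
    print(observed, remove_vignette(observed, 0, 0, 5, 5, coeffs))
```

Once the gain is explicit, dividing it out recovers the scene radiance. The abstract describes optimizing the camera parameters jointly with the scene, so that effects of this kind are attributed to the camera rather than absorbed into the scene representation.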

Uncertainty Aware Human-machine Collaboration in Camouflaged Object Detection

Feb 12, 2025

Ziyue Yang, Kehan Wang, Yuhang Ming, Yong Peng, Han Yang, Qiong Chen, Wanzeng Kong

Abstract: Camouflaged Object Detection (COD), the task of identifying objects concealed within their environments, has seen rapid growth due to its wide range of practical applications. A key step toward developing trustworthy COD systems is the estimation and effective utilization of uncertainty. In this work, we propose a human-machine collaboration framework for classifying the presence of camouflaged objects, leveraging the complementary strengths of computer vision (CV) models and noninvasive brain-computer interfaces (BCIs). Our approach introduces a multiview backbone to estimate uncertainty in CV model predictions, utilizes this uncertainty during training to improve efficiency, and defers low-confidence cases to human evaluation via RSVP-based BCIs during testing for more reliable decision-making. We evaluated the framework on the CAMO dataset, achieving state-of-the-art results with an average improvement of 4.56% in balanced accuracy (BA) and 3.66% in the F1 score compared to existing methods. For the best-performing participants, the improvements reached 7.6% in BA and 6.66% in the F1 score. Analysis of the training process revealed a strong correlation between our confidence measures and precision, while an ablation study confirmed the effectiveness of the proposed training policy and the human-machine collaboration strategy. Overall, this work reduces human cognitive load, improves system reliability, and provides a strong foundation for advancements in real-world COD applications and human-computer interaction. Our code and data are available at: https://github.com/ziyuey/Uncertainty-aware-human-machine-collaboration-in-camouflaged-object-identification.
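
For readers unfamiliar with the two reported metrics, the sketch below computes balanced accuracy and F1 from binary labels using their standard definitions, and shows one simple way low-confidence predictions could be routed to a human. The routing function, its name, and the threshold are hypothetical; the abstract does not state how the authors' deferral rule is defined.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = camouflaged object present)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:  # t == 1 and p == 0
            fn += 1
    return tp, fp, tn, fn


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity. Assumes both classes occur in y_true."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall. Assumes at least one positive."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)


def split_by_confidence(confidences, threshold):
    """Hypothetical deferral rule: indices below the threshold go to a human."""
    keep = [i for i, c in enumerate(confidences) if c >= threshold]
    defer = [i for i, c in enumerate(confidences) if c < threshold]
    return keep, defer


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
    print(balanced_accuracy(y_true, y_pred), f1_score(y_true, y_pred))
    print(split_by_confidence([0.9, 0.4, 0.75, 0.6], threshold=0.7))
```

Balanced accuracy weights both classes equally, which matters when images with and without a camouflaged object are not equally frequent.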

SLC$^2$-SLAM: Semantic-guided Loop Closure with Shared Latent Code for NeRF SLAM

Jan 15, 2025

Yuhang Ming, Di Ma, Weichen Dai, Han Yang, Rui Fan, Guofeng Zhang, Wanzeng Kong

Abstract: Targeting the notorious cumulative drift errors in NeRF SLAM, we propose a Semantic-guided Loop Closure with Shared Latent Code, dubbed SLC$^2$-SLAM. Specifically, we argue that the latent codes stored in many NeRF SLAM systems are not fully exploited, as they are only used for better reconstruction. In this paper, we propose a simple yet effective way to detect potential loops using the same latent codes as local features. To further improve the loop detection performance, we use semantic information, which is also decoded from the same latent codes, to guide the aggregation of local features. Finally, with the potential loops detected, we close them with a graph optimization followed by bundle adjustment to refine both the estimated poses and the reconstructed scene. To evaluate the performance of SLC$^2$-SLAM, we conduct extensive experiments on the Replica and ScanNet datasets. Our proposed semantic-guided loop closure significantly outperforms the pre-trained NetVLAD and ORB combined with Bag-of-Words, which are used in all the other NeRF SLAM systems with loop closure. As a result, SLC$^2$-SLAM also demonstrates better tracking and reconstruction performance, especially in larger scenes with more loops, like ScanNet.

* 8 pages, 5 figures, 4 tables

Pattern-Matching Dynamic Memory Network for Dual-Mode Traffic Prediction

Aug 12, 2024

Wenchao Weng, Mei Wu, Hanyu Jiang, Wanzeng Kong, Xiangjie Kong, Feng Xia

Abstract: In recent years, deep learning has increasingly gained attention in the field of traffic prediction. Existing traffic prediction models often rely on GCNs or attention mechanisms with O(N^2) complexity to dynamically extract traffic node features, which makes them inefficient and far from lightweight. Additionally, these models typically only utilize historical data for prediction, without considering the impact of the target information on the prediction. To address these issues, we propose a Pattern-Matching Dynamic Memory Network (PM-DMNet). PM-DMNet employs a novel dynamic memory network to capture traffic pattern features with only O(N) complexity, significantly reducing computational overhead while achieving excellent performance. PM-DMNet also introduces two prediction methods, Recursive Multi-step Prediction (RMP) and Parallel Multi-step Prediction (PMP), which leverage the time features of the prediction targets to assist in the forecasting process. Furthermore, a transfer attention mechanism is integrated into PMP, transforming historical data features to better align with the predicted target states, thereby capturing trend changes more accurately and reducing errors. Extensive experiments demonstrate the superiority of the proposed model over existing benchmarks. The source code is available at: https://github.com/wengwenchao123/PM-DMNet.
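
The O(N) versus O(N^2) contrast can be made concrete with a generic attention-over-memory sketch. This is not the authors' PM-DMNet: the abstract does not describe the mechanism, so the function names, the dot-product scoring, and the softmax mixing below are assumptions chosen only to show why matching N nodes against a fixed bank of M patterns costs O(N * M), which is linear in N, whereas comparing every node with every other node costs O(N^2).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_patterns(node_features, memory):
    """Mix memory patterns for each node by dot-product similarity.

    node_features: N vectors of length d (one per traffic node).
    memory: M pattern vectors of length d (hypothetical learned pattern bank).
    The loop does N * M dot products, so the cost grows linearly with N
    when M is fixed.
    """
    dim = len(memory[0])
    outputs = []
    for feat in node_features:
        scores = [sum(f * m for f, m in zip(feat, pattern)) for pattern in memory]
        weights = softmax(scores)
        mixed = [
            sum(w * pattern[k] for w, pattern in zip(weights, memory))
            for k in range(dim)
        ]
        outputs.append(mixed)
    return outputs


if __name__ == "__main__":
    nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    bank = [[10.0, 0.0], [0.0, 10.0]]
    for row in match_patterns(nodes, bank):
        print(row)
```

In this toy setup each node is summarized by the patterns it most resembles, and no node-to-node comparison is needed.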

VIPeR: Visual Incremental Place Recognition with Adaptive Mining and Lifelong Learning

Jul 31, 2024

Yuhang Ming, Minyang Xu, Xingrui Yang, Weicai Ye, Weihan Wang, Yong Peng, Weichen Dai, Wanzeng Kong

Abstract: Visual place recognition (VPR) is an essential component of many autonomous and augmented/virtual reality systems. It enables the systems to robustly localize themselves in large-scale environments. Existing VPR methods demonstrate attractive performance at the cost of heavy pre-training and limited generalizability. When deployed in unseen environments, these methods exhibit significant performance drops. Targeting this issue, we present VIPeR, a novel approach for visual incremental place recognition with the ability to adapt to new environments while retaining performance on previous environments. We first introduce an adaptive mining strategy that balances the performance within a single environment and the generalizability across multiple environments. Then, to prevent catastrophic forgetting in lifelong learning, we draw inspiration from human memory systems and design a novel memory bank for VIPeR. Our memory bank contains a sensory memory, a working memory and a long-term memory, with the first two focusing on the current environment and the last one covering all previously visited environments. Additionally, we propose a probabilistic knowledge distillation to explicitly safeguard the previously learned knowledge. We evaluate VIPeR on three large-scale datasets, namely Oxford Robotcar, Nordland, and TartanAir. For comparison, we first set a baseline performance with naive finetuning. Then, several more recent lifelong learning methods are compared. VIPeR achieves better performance in almost all aspects, with the biggest improvement being 13.65% in average performance.

* 8 pages, 4 figures

Enhanced Coherence-Aware Network with Hierarchical Disentanglement for Aspect-Category Sentiment Analysis

Mar 15, 2024

Jin Cui, Fumiyo Fukumoto, Xinfeng Wang, Yoshimi Suzuki, Jiyi Li, Noriko Tomuro, Wanzeng Kong

Abstract: Aspect-category-based sentiment analysis (ACSA), which aims to identify aspect categories and predict their sentiments, has been intensively studied due to its wide range of NLP applications. Most approaches mainly utilize intrasentential features. However, a review often includes multiple different aspect categories, and some of them do not explicitly appear in the review. Even a single sentence may contain more than one aspect category, each with its own sentiment, and these are entangled within the sentence, which makes it hard for a model to preserve all sentiment characteristics distinctly. In this paper, we propose an enhanced coherence-aware network with hierarchical disentanglement (ECAN) for ACSA tasks. Specifically, we explore coherence modeling to capture the contexts across the whole review and to help identify implicit aspects and sentiments. To address the issue of multiple aspect categories and sentiment entanglement, we propose a hierarchical disentanglement module to extract distinct category and sentiment features. Extensive experimental and visualization results show that ECAN effectively decouples multiple categories and sentiments entangled in the coherence representations and achieves state-of-the-art (SOTA) performance. Our code and data are available online: https://github.com/cuijin-23/ECAN.

* Accepted by LREC-COLING 2024

AEGIS-Net: Attention-guided Multi-Level Feature Aggregation for Indoor Place Recognition

Dec 15, 2023

Yuhang Ming, Jian Ma, Xingrui Yang, Weichen Dai, Yong Peng, Wanzeng Kong

Abstract: We present AEGIS-Net, a novel indoor place recognition model that takes in RGB point clouds and generates global place descriptors by aggregating lower-level color and geometry features and higher-level implicit semantic features. Rather than relying on simple feature concatenation, it employs self-attention modules to select the most important local features that best describe an indoor place. AEGIS-Net consists of a semantic encoder, a semantic decoder and an attention-guided feature embedding. The model is trained in a 2-stage process, with the first stage focusing on an auxiliary semantic segmentation task and the second one on the place recognition task. We evaluate AEGIS-Net on the ScanNetPR dataset and compare its performance with a pre-deep-learning feature-based method and five state-of-the-art deep-learning-based methods. AEGIS-Net achieves exceptional performance and outperforms all six methods.

* Accepted by 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)

InterMulti: Multi-view Multimodal Interactions with Text-dominated Hierarchical High-order Fusion for Emotion Analysis

Dec 20, 2022

Feng Qiu, Wanzeng Kong, Yu Ding

Abstract: Humans are adept at reading interlocutors' emotions from multimodal signals, such as speech content, voice tone and facial expressions. However, machines might struggle to understand various emotions due to the difficulty of effectively decoding emotions from the complex interactions between multimodal signals. In this paper, we propose a multimodal emotion analysis framework, InterMulti, to capture complex multimodal interactions from different views and identify emotions from multimodal signals. Our proposed framework decomposes signals of different modalities into three kinds of multimodal interaction representations: a modality-full interaction representation, a modality-shared interaction representation, and three modality-specific interaction representations. Additionally, to balance the contribution of different modalities and learn a more informative latent interaction representation, we developed a novel Text-dominated Hierarchical High-order Fusion (THHF) module. The THHF module integrates the above three kinds of representations into a comprehensive multimodal interaction representation. Extensive experimental results on widely used datasets, i.e., MOSEI, MOSI and IEMOCAP, demonstrate that our method outperforms the state-of-the-art.

* 9 pages, 3 figures. arXiv admin note: text overlap with arXiv:2212.08661

EffMulti: Efficiently Modeling Complex Multimodal Interactions for Emotion Analysis

Dec 16, 2022

Feng Qiu, Chengyang Xie, Yu Ding, Wanzeng Kong

Abstract: Humans are skilled at reading an interlocutor's emotion from multimodal signals, including spoken words, simultaneous speech, and facial expressions. It is still a challenge to effectively decode emotions from the complex interactions of multimodal signals. In this paper, we design three kinds of multimodal latent representations to refine the emotion analysis process and capture complex multimodal interactions from different views: an intact three-modal integrating representation, a modality-shared representation, and three modality-individual representations. Then, a modality-semantic hierarchical fusion is proposed to incorporate these representations into a comprehensive interaction representation. The experimental results demonstrate that our EffMulti outperforms the state-of-the-art methods. The compelling performance benefits from its well-designed framework with ease of implementation, lower computational complexity, and fewer trainable parameters.

* 6 pages, 1 figure
