Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Peng Hu

Robust Duality Learning for Unsupervised Visible-Infrared Person Re-Identfication

May 05, 2025

Yongxiang Li, Yuan Sun, Yang Qin, Dezhong Peng, Xi Peng, Peng Hu

Abstract:Unsupervised visible-infrared person re-identification (UVI-ReID) aims to retrieve pedestrian images across different modalities without costly annotations, but faces challenges due to the modality gap and lack of supervision. Existing methods often adopt self-training with clustering-generated pseudo-labels but implicitly assume these labels are always correct. In practice, however, this assumption fails due to inevitable pseudo-label noise, which hinders model learning. To address this, we introduce a new learning paradigm that explicitly considers Pseudo-Label Noise (PLN), characterized by three key challenges: noise overfitting, error accumulation, and noisy cluster correspondence. To this end, we propose a novel Robust Duality Learning framework (RoDE) for UVI-ReID to mitigate the effects of noisy pseudo-labels. First, to combat noise overfitting, a Robust Adaptive Learning mechanism (RAL) is proposed to dynamically emphasize clean samples while down-weighting noisy ones. Second, to alleviate error accumulation-where the model reinforces its own mistakes-RoDE employs dual distinct models that are alternately trained using pseudo-labels from each other, encouraging diversity and preventing collapse. However, this dual-model strategy introduces misalignment between clusters across models and modalities, creating noisy cluster correspondence. To resolve this, we introduce Cluster Consistency Matching (CCM), which aligns clusters across models and modalities by measuring cross-cluster similarity. Extensive experiments on three benchmarks demonstrate the effectiveness of RoDE.

Via

Access Paper or Ask Questions

Toward Onboard AI-Enabled Solutions to Space Object Detection for Space Sustainability

May 03, 2025

Wenxuan Zhang, Peng Hu

Abstract:The rapid expansion of advanced low-Earth orbit (LEO) satellites in large constellations is positioning space assets as key to the future, enabling global internet access and relay systems for deep space missions. A solution to the challenge is effective space object detection (SOD) for collision assessment and avoidance. In SOD, an LEO satellite must detect other satellites and objects with high precision and minimal delay. This paper investigates the feasibility and effectiveness of employing vision sensors for SOD tasks based on deep learning (DL) models. It introduces models based on the Squeeze-and-Excitation (SE) layer, Vision Transformer (ViT), and the Generalized Efficient Layer Aggregation Network (GELAN) and evaluates their performance under SOD scenarios. Experimental results show that the proposed models achieve mean average precision at intersection over union threshold 0.5 (mAP50) scores of up to 0.751 and mean average precision averaged over intersection over union thresholds from 0.5 to 0.95 (mAP50:95) scores of up to 0.280. Compared to the baseline GELAN-t model, the proposed GELAN-ViT-SE model increases the average mAP50 from 0.721 to 0.751, improves the mAP50:95 from 0.266 to 0.274, reduces giga floating point operations (GFLOPs) from 7.3 to 5.6, and lowers peak power consumption from 2080.7 mW to 2028.7 mW by 2.5\%.

* This paper has been accepted at the 18th International Conference on Space Operations (SpaceOps 2025)

Via

Access Paper or Ask Questions

Rotational ultrasound and photoacoustic tomography of the human body

Apr 22, 2025

Yang Zhang, Shuai Na, Jonathan J. Russin, Karteekeya Sastry, Li Lin, Junfu Zheng, Yilin Luo, Xin Tong, Yujin An, Peng Hu(+4 more)

Abstract:Imaging the human body's morphological and angiographic information is essential for diagnosing, monitoring, and treating medical conditions. Ultrasonography performs the morphological assessment of the soft tissue based on acoustic impedance variations, whereas photoacoustic tomography (PAT) can visualize blood vessels based on intrinsic hemoglobin absorption. Three-dimensional (3D) panoramic imaging of the vasculature is generally not practical in conventional ultrasonography with limited field-of-view (FOV) probes, and PAT does not provide sufficient scattering-based soft tissue morphological contrast. Complementing each other, fast panoramic rotational ultrasound tomography (RUST) and PAT are integrated for hybrid rotational ultrasound and photoacoustic tomography (RUS-PAT), which obtains 3D ultrasound structural and PAT angiographic images of the human body quasi-simultaneously. The RUST functionality is achieved in a cost-effective manner using a single-element ultrasonic transducer for ultrasound transmission and rotating arc-shaped arrays for 3D panoramic detection. RUST is superior to conventional ultrasonography, which either has a limited FOV with a linear array or is high-cost with a hemispherical array that requires both transmission and receiving. By switching the acoustic source to a light source, the system is conveniently converted to PAT mode to acquire angiographic images in the same region. Using RUS-PAT, we have successfully imaged the human head, breast, hand, and foot with a 10 cm diameter FOV, submillimeter isotropic resolution, and 10 s imaging time for each modality. The 3D RUS-PAT is a powerful tool for high-speed, 3D, dual-contrast imaging of the human body with potential for rapid clinical translation.

Via

Access Paper or Ask Questions

LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification

Apr 15, 2025

Yiding Lu, Mouxing Yang, Dezhong Peng, Peng Hu, Yijie Lin, Xi Peng

Abstract:Traditional text-based person ReID assumes that person descriptions from witnesses are complete and provided at once. However, in real-world scenarios, such descriptions are often partial or vague. To address this limitation, we introduce a new task called interactive person re-identification (Inter-ReID). Inter-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. To facilitate the study of this new task, we construct a dialogue dataset that incorporates multiple types of questions by decomposing fine-grained attributes of individuals. We further propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts to elicit additional details about the target person. Leveraging a looking-forward strategy, we prioritize the most informative questions as supervision during training. Experimental results on both Inter-ReID and text-based ReID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines.

Via

Access Paper or Ask Questions

IQPFR: An Image Quality Prior for Blind Face Restoration and Beyond

Mar 12, 2025

Peng Hu, Chunming He, Lei Xu, Jingduo Tian, Sina Farsiu, Yulun Zhang, Pei Liu, Xiu Li

Abstract:Blind Face Restoration (BFR) addresses the challenge of reconstructing degraded low-quality (LQ) facial images into high-quality (HQ) outputs. Conventional approaches predominantly rely on learning feature representations from ground-truth (GT) data; however, inherent imperfections in GT datasets constrain restoration performance to the mean quality level of the training data, rather than attaining maximally attainable visual quality. To overcome this limitation, we propose a novel framework that incorporates an Image Quality Prior (IQP) derived from No-Reference Image Quality Assessment (NR-IQA) models to guide the restoration process toward optimal HQ reconstructions. Our methodology synergizes this IQP with a learned codebook prior through two critical innovations: (1) During codebook learning, we devise a dual-branch codebook architecture that disentangles feature extraction into universal structural components and HQ-specific attributes, ensuring comprehensive representation of both common and high-quality facial characteristics. (2) In the codebook lookup stage, we implement a quality-conditioned Transformer-based framework. NR-IQA-derived quality scores act as dynamic conditioning signals to steer restoration toward the highest feasible quality standard. This score-conditioned paradigm enables plug-and-play enhancement of existing BFR architectures without modifying the original structure. We also formulate a discrete representation-based quality optimization strategy that circumvents over-optimization artifacts prevalent in continuous latent space approaches. Extensive experiments demonstrate that our method outperforms state-of-the-art techniques across multiple benchmarks. Besides, our quality-conditioned framework demonstrates consistent performance improvements when integrated with prior BFR models. The code will be released.

Via

Access Paper or Ask Questions

MaIR: A Locality- and Continuity-Preserving Mamba for Image Restoration

Dec 28, 2024

Boyun Li, Haiyu Zhao, Wenxin Wang, Peng Hu, Yuanbiao Gou, Xi Peng

Abstract:Recent advancements in Mamba have shown promising results in image restoration. These methods typically flatten 2D images into multiple distinct 1D sequences along rows and columns, process each sequence independently using selective scan operation, and recombine them to form the outputs. However, such a paradigm overlooks two vital aspects: i) the local relationships and spatial continuity inherent in natural images, and ii) the discrepancies among sequences unfolded through totally different ways. To overcome the drawbacks, we explore two problems in Mamba-based restoration methods: i) how to design a scanning strategy preserving both locality and continuity while facilitating restoration, and ii) how to aggregate the distinct sequences unfolded in totally different ways. To address these problems, we propose a novel Mamba-based Image Restoration model (MaIR), which consists of Nested S-shaped Scanning strategy (NSS) and Sequence Shuffle Attention block (SSA). Specifically, NSS preserves locality and continuity of the input images through the stripe-based scanning region and the S-shaped scanning path, respectively. SSA aggregates sequences through calculating attention weights within the corresponding channels of different sequences. Thanks to NSS and SSA, MaIR surpasses 40 baselines across 14 challenging datasets, achieving state-of-the-art performance on the tasks of image super-resolution, denoising, deblurring and dehazing. Our codes will be available after acceptance.

Via

Access Paper or Ask Questions

DiFiC: Your Diffusion Model Holds the Secret to Fine-Grained Clustering

Dec 25, 2024

Ruohong Yang, Peng Hu, Xi Peng, Xiting Liu, Yunfan Li

Abstract:Fine-grained clustering is a practical yet challenging task, whose essence lies in capturing the subtle differences between instances of different classes. Such subtle differences can be easily disrupted by data augmentation or be overwhelmed by redundant information in data, leading to significant performance degradation for existing clustering methods. In this work, we introduce DiFiC a fine-grained clustering method building upon the conditional diffusion model. Distinct from existing works that focus on extracting discriminative features from images, DiFiC resorts to deducing the textual conditions used for image generation. To distill more precise and clustering-favorable object semantics, DiFiC further regularizes the diffusion target and guides the distillation process utilizing neighborhood similarity. Extensive experiments demonstrate that DiFiC outperforms both state-of-the-art discriminative and generative clustering methods on four fine-grained image clustering benchmarks. We hope the success of DiFiC will inspire future research to unlock the potential of diffusion models in tasks beyond generation. The code will be released.

Via

Access Paper or Ask Questions

ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL

Dec 13, 2024

Yang Qin, Chao Chen, Zhihang Fu, Ze Chen, Dezhong Peng, Peng Hu, Jieping Ye

Figure 1 for ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL

Figure 2 for ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL

Figure 3 for ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL

Figure 4 for ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL

Abstract:Despite the significant advancements in Text-to-SQL (Text2SQL) facilitated by large language models (LLMs), the latest state-of-the-art techniques are still trapped in the in-context learning of closed-source LLMs (e.g., GPT-4), which limits their applicability in open scenarios. To address this challenge, we propose a novel RObust mUltitask Tuning and collaboration mEthod (ROUTE) to improve the comprehensive capabilities of open-source LLMs for Text2SQL, thereby providing a more practical solution. Our approach begins with multi-task supervised fine-tuning (SFT) using various synthetic training data related to SQL generation. Unlike existing SFT-based Text2SQL methods, we introduced several additional SFT tasks, including schema linking, noise correction, and continuation writing. Engaging in a variety of SQL generation tasks enhances the model's understanding of SQL syntax and improves its ability to generate high-quality SQL queries. Additionally, inspired by the collaborative modes of LLM agents, we introduce a Multitask Collaboration Prompting (MCP) strategy. This strategy leverages collaboration across several SQL-related tasks to reduce hallucinations during SQL generation, thereby maximizing the potential of enhancing Text2SQL performance through explicit multitask capabilities. Extensive experiments and in-depth analyses have been performed on eight open-source LLMs and five widely-used benchmarks. The results demonstrate that our proposal outperforms the latest Text2SQL methods and yields leading performance.

Via

Access Paper or Ask Questions

Sensing for Space Safety and Sustainability: A Deep Learning Approach with Vision Transformers

Dec 12, 2024

Wenxuan Zhang, Peng Hu

Abstract:The rapid increase of space assets represented by small satellites in low Earth orbit can enable ubiquitous digital services for everyone. However, due to the dynamic space environment, numerous space objects, complex atmospheric conditions, and unexpected events can easily introduce adverse conditions affecting space safety, operations, and sustainability of the outer space environment. This challenge calls for responsive, effective satellite object detection (SOD) solutions that allow a small satellite to assess and respond to collision risks, with the consideration of constrained resources on a small satellite platform. This paper discusses the SOD tasks and onboard deep learning (DL) approach to the tasks. Two new DL models are proposed, called GELAN-ViT and GELAN-RepViT, which incorporate vision transformer (ViT) into the Generalized Efficient Layer Aggregation Network (GELAN) architecture and address limitations by separating the convolutional neural network and ViT paths. These models outperform the state-of-the-art YOLOv9-t in terms of mean average precision (mAP) and computational costs. On the SOD dataset, our proposed models can achieve around 95% mAP50 with giga-floating point operations (GFLOPs) reduced by over 5.0. On the VOC 2012 dataset, they can achieve $\geq$ 60.7% mAP50 with GFLOPs reduced by over 5.2.

* To be published in the 12th Annual IEEE International Conference on Wireless for Space and Extreme Environments (WiSEE 2024)

Via

Access Paper or Ask Questions

Test-time Adaptation for Cross-modal Retrieval with Query Shift

Oct 21, 2024

Haobin Li, Peng Hu, Qianjun Zhang, Xi Peng, Xiting Liu, Mouxing Yang

Figure 1 for Test-time Adaptation for Cross-modal Retrieval with Query Shift

Figure 2 for Test-time Adaptation for Cross-modal Retrieval with Query Shift

Figure 3 for Test-time Adaptation for Cross-modal Retrieval with Query Shift

Figure 4 for Test-time Adaptation for Cross-modal Retrieval with Query Shift

Abstract:The success of most existing cross-modal retrieval methods heavily relies on the assumption that the given queries follow the same distribution of the source domain. However, such an assumption is easily violated in real-world scenarios due to the complexity and diversity of queries, thus leading to the query shift problem. Specifically, query shift refers to the online query stream originating from the domain that follows a different distribution with the source one. In this paper, we observe that query shift would not only diminish the uniformity (namely, within-modality scatter) of the query modality but also amplify the gap between query and gallery modalities. Based on the observations, we propose a novel method dubbed Test-time adaptation for Cross-modal Retrieval (TCR). In brief, TCR employs a novel module to refine the query predictions (namely, retrieval results of the query) and a joint objective to prevent query shift from disturbing the common space, thus achieving online adaptation for the cross-modal retrieval models with query shift. Expensive experiments demonstrate the effectiveness of the proposed TCR against query shift. The code will be released upon acceptance.

* 22 pages, 8 figures

Via

Access Paper or Ask Questions