Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuhui Zheng

Chain-of-Talkers (CoTalk): Fast Human Annotation of Dense Image Captions

May 28, 2025

Yijun Shen, Delong Chen, Fan Liu, Xingyu Wang, Chuanyi Zhang, Liang Yao, Yuhui Zheng

Abstract:While densely annotated image captions significantly facilitate the learning of robust vision-language alignment, methodologies for systematically optimizing human annotation efforts remain underexplored. We introduce Chain-of-Talkers (CoTalk), an AI-in-the-loop methodology designed to maximize the number of annotated samples and improve their comprehensiveness under fixed budget constraints (e.g., total human annotation time). The framework is built upon two key insights. First, sequential annotation reduces redundant workload compared to conventional parallel annotation, as subsequent annotators only need to annotate the ``residual'' -- the missing visual information that previous annotations have not covered. Second, humans process textual input faster by reading while outputting annotations with much higher throughput via talking; thus a multimodal interface enables optimized efficiency. We evaluate our framework from two aspects: intrinsic evaluations that assess the comprehensiveness of semantic units, obtained by parsing detailed captions into object-attribute trees and analyzing their effective connections; extrinsic evaluation measures the practical usage of the annotated captions in facilitating vision-language alignment. Experiments with eight participants show our Chain-of-Talkers (CoTalk) improves annotation speed (0.42 vs. 0.30 units/sec) and retrieval performance (41.13\% vs. 40.52\%) over the parallel method.

Via

Access Paper or Ask Questions

RemoteSAM: Towards Segment Anything for Earth Observation

May 23, 2025

Liang Yao, Fan Liu, Delong Chen, Chuanyi Zhang, Yijun Wang, Ziyun Chen, Wei Xu, Shimin Di, Yuhui Zheng

Abstract:We aim to develop a robust yet flexible visual foundation model for Earth observation. It should possess strong capabilities in recognizing and localizing diverse visual targets while providing compatibility with various input-output interfaces required across different task scenarios. Current systems cannot meet these requirements, as they typically utilize task-specific architecture trained on narrow data domains with limited semantic coverage. Our study addresses these limitations from two aspects: data and modeling. We first introduce an automatic data engine that enjoys significantly better scalability compared to previous human annotation or rule-based approaches. It has enabled us to create the largest dataset of its kind to date, comprising 270K image-text-mask triplets covering an unprecedented range of diverse semantic categories and attribute specifications. Based on this data foundation, we further propose a task unification paradigm that centers around referring expression segmentation. It effectively handles a wide range of vision-centric perception tasks, including classification, detection, segmentation, grounding, etc, using a single model without any task-specific heads. Combining these innovations on data and modeling, we present RemoteSAM, a foundation model that establishes new SoTA on several earth observation perception benchmarks, outperforming other foundation models such as Falcon, GeoChat, and LHRS-Bot with significantly higher efficiency. Models and data are publicly available at https://github.com/1e12Leon/RemoteSAM.

Via

Access Paper or Ask Questions

Self-Assessed Generation: Trustworthy Label Generation for Optical Flow and Stereo Matching in Real-world

Oct 14, 2024

Han Ling, Yinghui Sun, Quansen Sun, Ivor Tsang, Yuhui Zheng

Figure 1 for Self-Assessed Generation: Trustworthy Label Generation for Optical Flow and Stereo Matching in Real-world

Figure 2 for Self-Assessed Generation: Trustworthy Label Generation for Optical Flow and Stereo Matching in Real-world

Figure 3 for Self-Assessed Generation: Trustworthy Label Generation for Optical Flow and Stereo Matching in Real-world

Figure 4 for Self-Assessed Generation: Trustworthy Label Generation for Optical Flow and Stereo Matching in Real-world

Abstract:A significant challenge facing current optical flow and stereo methods is the difficulty in generalizing them well to the real world. This is mainly due to the high costs required to produce datasets, and the limitations of existing self-supervised methods on fuzzy results and complex model training problems. To address the above challenges, we propose a unified self-supervised generalization framework for optical flow and stereo tasks: Self-Assessed Generation (SAG). Unlike previous self-supervised methods, SAG is data-driven, using advanced reconstruction techniques to construct a reconstruction field from RGB images and generate datasets based on it. Afterward, we quantified the confidence level of the generated results from multiple perspectives, such as reconstruction field distribution, geometric consistency, and structural similarity, to eliminate inevitable defects in the generation process. We also designed a 3D flight foreground automatic rendering pipeline in SAG to encourage the network to learn occlusion and motion foreground. Experimentally, because SAG does not involve changes to methods or loss functions, it can directly self-supervised train the state-of-the-art deep networks, greatly improving the generalization performance of self-supervised methods on current mainstream optical flow and stereo-matching datasets. Compared to previous training modes, SAG is more generalized, cost-effective, and accurate.

Via

Access Paper or Ask Questions

Vision Transformer based Random Walk for Group Re-Identification

Oct 08, 2024

Guoqing Zhang, Tianqi Liu, Wenxuan Fang, Yuhui Zheng

Figure 1 for Vision Transformer based Random Walk for Group Re-Identification

Figure 2 for Vision Transformer based Random Walk for Group Re-Identification

Figure 3 for Vision Transformer based Random Walk for Group Re-Identification

Figure 4 for Vision Transformer based Random Walk for Group Re-Identification

Abstract:Group re-identification (re-ID) aims to match groups with the same people under different cameras, mainly involves the challenges of group members and layout changes well. Most existing methods usually use the k-nearest neighbor algorithm to update node features to consider changes in group membership, but these methods cannot solve the problem of group layout changes. To this end, we propose a novel vision transformer based random walk framework for group re-ID. Specifically, we design a vision transformer based on a monocular depth estimation algorithm to construct a graph through the average depth value of pedestrian features to fully consider the impact of camera distance on group members relationships. In addition, we propose a random walk module to reconstruct the graph by calculating affinity scores between target and gallery images to remove pedestrians who do not belong to the current group. Experimental results show that our framework is superior to most methods.

* 6 pages

Via

Access Paper or Ask Questions

Training-Free Instance Segmentation from Semantic Image Segmentation Masks

Aug 02, 2023

Yuchen Shen, Dong Zhang, Yuhui Zheng, Zechao Li, Liyong Fu, Qiaolin Ye

Abstract:In recent years, the development of instance segmentation has garnered significant attention in a wide range of applications. However, the training of a fully-supervised instance segmentation model requires costly both instance-level and pixel-level annotations. In contrast, weakly-supervised instance segmentation methods (i.e., with image-level class labels or point labels) struggle to satisfy the accuracy and recall requirements of practical scenarios. In this paper, we propose a novel paradigm for instance segmentation called training-free instance segmentation (TFISeg), which achieves instance segmentation results from image masks predicted using off-the-shelf semantic segmentation models. TFISeg does not require training a semantic or/and instance segmentation model and avoids the need for instance-level image annotations. Therefore, it is highly efficient. Specifically, we first obtain a semantic segmentation mask of the input image via a trained semantic segmentation model. Then, we calculate a displacement field vector for each pixel based on the segmentation mask, which can indicate representations belonging to the same class but different instances, i.e., obtaining the instance-level object information. Finally, instance segmentation results are obtained after being refined by a learnable category-agnostic object boundary branch. Extensive experimental results on two challenging datasets and representative semantic segmentation baselines (including CNNs and Transformers) demonstrate that TFISeg can achieve competitive results compared to the state-of-the-art fully-supervised instance segmentation methods without the need for additional human resources or increased computational costs. The code is available at: TFISeg

* 14 pages,5 figures

Via

Access Paper or Ask Questions

Deep High-Resolution Representation Learning for Cross-Resolution Person Re-identification

May 25, 2021

Guoqing Zhang, Yu Ge, Zhicheng Dong, Hao Wang, Yuhui Zheng, Shengyong Chen

Figure 1 for Deep High-Resolution Representation Learning for Cross-Resolution Person Re-identification

Figure 2 for Deep High-Resolution Representation Learning for Cross-Resolution Person Re-identification

Figure 3 for Deep High-Resolution Representation Learning for Cross-Resolution Person Re-identification

Figure 4 for Deep High-Resolution Representation Learning for Cross-Resolution Person Re-identification

Abstract:Person re-identification (re-ID) tackles the problem of matching person images with the same identity from different cameras. In practical applications, due to the differences in camera performance and distance between cameras and persons of interest, captured person images usually have various resolutions. We name this problem as Cross-Resolution Person Re-identification which brings a great challenge for matching correctly. In this paper, we propose a Deep High-Resolution Pseudo-Siamese Framework (PS-HRNet) to solve the above problem. Specifically, in order to restore the resolution of low-resolution images and make reasonable use of different channel information of feature maps, we introduce and innovate VDSR module with channel attention (CA) mechanism, named as VDSR-CA. Then we reform the HRNet by designing a novel representation head to extract discriminating features, named as HRNet-ReID. In addition, a pseudo-siamese framework is constructed to reduce the difference of feature distributions between low-resolution images and high-resolution images. The experimental results on five cross-resolution person datasets verify the effectiveness of our proposed approach. Compared with the state-of-the-art methods, our proposed PS-HRNet improves 3.4\%, 6.2\%, 2.5\%,1.1\% and 4.2\% at Rank-1 on MLR-Market-1501, MLR-CUHK03, MLR-VIPeR, MLR-DukeMTMC-reID, and CAVIAR datasets, respectively. Our code is available at \url{https://github.com/zhguoqing}.

* 12 pages

Via

Access Paper or Ask Questions

TIPCB: A Simple but Effective Part-based Convolutional Baseline for Text-based Person Search

May 25, 2021

Yuhao Chen, Guoqing Zhang, Yujiang Lu, Zhenxing Wang, Yuhui Zheng, Ruili Wang

Figure 1 for TIPCB: A Simple but Effective Part-based Convolutional Baseline for Text-based Person Search

Figure 2 for TIPCB: A Simple but Effective Part-based Convolutional Baseline for Text-based Person Search

Figure 3 for TIPCB: A Simple but Effective Part-based Convolutional Baseline for Text-based Person Search

Figure 4 for TIPCB: A Simple but Effective Part-based Convolutional Baseline for Text-based Person Search

Abstract:Text-based person search is a sub-task in the field of image retrieval, which aims to retrieve target person images according to a given textual description. The significant feature gap between two modalities makes this task very challenging. Many existing methods attempt to utilize local alignment to address this problem in the fine-grained level. However, most relevant methods introduce additional models or complicated training and evaluation strategies, which are hard to use in realistic scenarios. In order to facilitate the practical application, we propose a simple but effective end-to-end learning framework for text-based person search named TIPCB (i.e., Text-Image Part-based Convolutional Baseline). Firstly, a novel dual-path local alignment network structure is proposed to extract visual and textual local representations, in which images are segmented horizontally and texts are aligned adaptively. Then, we propose a multi-stage cross-modal matching strategy, which eliminates the modality gap from three feature levels, including low level, local level and global level. Extensive experiments are conducted on the widely-used benchmark dataset (CUHK-PEDES) and verify that our method outperforms the state-of-the-art methods by 3.69%, 2.95% and 2.31% in terms of Top-1, Top-5 and Top-10. Our code has been released in https://github.com/OrangeYHChen/TIPCB.

* 27 pages

Via

Access Paper or Ask Questions

Reference-Aided Part-Aligned Feature Disentangling for Video Person Re-Identification

Mar 21, 2021

Guoqing Zhang, Yuhao Chen, Yang Dai, Yuhui Zheng, Yi Wu

Figure 1 for Reference-Aided Part-Aligned Feature Disentangling for Video Person Re-Identification

Figure 2 for Reference-Aided Part-Aligned Feature Disentangling for Video Person Re-Identification

Figure 3 for Reference-Aided Part-Aligned Feature Disentangling for Video Person Re-Identification

Figure 4 for Reference-Aided Part-Aligned Feature Disentangling for Video Person Re-Identification

Abstract:Recently, video-based person re-identification (re-ID) has drawn increasing attention in compute vision community because of its practical application prospects. Due to the inaccurate person detections and pose changes, pedestrian misalignment significantly increases the difficulty of feature extraction and matching. To address this problem, in this paper, we propose a \textbf{R}eference-\textbf{A}ided \textbf{P}art-\textbf{A}ligned (\textbf{RAPA}) framework to disentangle robust features of different parts. Firstly, in order to obtain better references between different videos, a pose-based reference feature learning module is introduced. Secondly, an effective relation-based part feature disentangling module is explored to align frames within each video. By means of using both modules, the informative parts of pedestrian in videos are well aligned and more discriminative feature representation is generated. Comprehensive experiments on three widely-used benchmarks, i.e. iLIDS-VID, PRID-2011 and MARS datasets verify the effectiveness of the proposed framework. Our code will be made publicly available.

* 6 pages, accepted by ICME 2021

Via

Access Paper or Ask Questions

Hybrid-Attention Guided Network with Multiple Resolution Features for Person Re-Identification

Sep 16, 2020

Guoqing Zhang, Junchuan Yang, Yuhui Zheng, Yi Wu, Shengyong Chen

Figure 1 for Hybrid-Attention Guided Network with Multiple Resolution Features for Person Re-Identification

Figure 2 for Hybrid-Attention Guided Network with Multiple Resolution Features for Person Re-Identification

Figure 3 for Hybrid-Attention Guided Network with Multiple Resolution Features for Person Re-Identification

Figure 4 for Hybrid-Attention Guided Network with Multiple Resolution Features for Person Re-Identification

Abstract:Extracting effective and discriminative features is very important for addressing the challenging person re-identification (re-ID) task. Prevailing deep convolutional neural networks (CNNs) usually use high-level features for identifying pedestrian. However, some essential spatial information resided in low-level features such as shape, texture and color will be lost when learning the high-level features, due to extensive padding and pooling operations in the training stage. In addition, most existing person re-ID methods are mainly based on hand-craft bounding boxes where images are precisely aligned. It is unrealistic in practical applications, since the exploited object detection algorithms often produce inaccurate bounding boxes. This will inevitably degrade the performance of existing algorithms. To address these problems, we put forward a novel person re-ID model that fuses high- and low-level embeddings to reduce the information loss caused in learning high-level features. Then we divide the fused embedding into several parts and reconnect them to obtain the global feature and more significant local features, so as to alleviate the affect caused by the inaccurate bounding boxes. In addition, we also introduce the spatial and channel attention mechanisms in our model, which aims to mine more discriminative features related to the target. Finally, we reconstruct the feature extractor to ensure that our model can obtain more richer and robust features. Extensive experiments display the superiority of our approach compared with existing approaches. Our code is available at https://github.com/libraflower/MutipleFeature-for-PRID.

* 11 pages, 8 figures, 66 conferences

Via

Access Paper or Ask Questions

An Efficient Algorithm for the Piecewise-Smooth Model with Approximately Explicit Solutions

Dec 08, 2016

Huihui Song, Yuhui Zheng, Kaihua Zhang

Figure 1 for An Efficient Algorithm for the Piecewise-Smooth Model with Approximately Explicit Solutions

Figure 2 for An Efficient Algorithm for the Piecewise-Smooth Model with Approximately Explicit Solutions

Abstract:This paper presents an efficient approach to image segmentation that approximates the piecewise-smooth (PS) functional in [12] with explicit solutions. By rendering some rational constraints on the initial conditions and the final solutions of the PS functional, we propose two novel formulations which can be approximated to be the explicit solutions of the evolution partial differential equations (PDEs) of the PS model, in which only one PDE needs to be solved efficiently. Furthermore, an energy term that regularizes the level set function to be a signed distance function is incorporated into our evolution formulation, and the time-consuming re-initialization is avoided. Experiments on synthetic and real images show that our method is more efficient than both the PS model and the local binary fitting (LBF) model [4], while having similar segmentation accuracy as the LBF model.

Via

Access Paper or Ask Questions