Abstract: Biometric recognition has primarily addressed closed-set identification, assuming all probe subjects are in the gallery. However, most practical applications involve open-set biometrics, where probe subjects may or may not be present in the gallery. This poses distinct challenges in effectively distinguishing individuals in the gallery while minimizing false detections. While it is commonly believed that powerful biometric models can excel in both closed- and open-set scenarios, existing loss functions are inconsistent with open-set evaluation. They treat genuine (mated) and imposter (non-mated) similarity scores symmetrically and neglect the relative magnitudes of imposter scores. To address these issues, we simulate open-set evaluation using minibatches during training and introduce novel loss functions: (1) the identification-detection loss optimized for open-set performance under selective thresholds and (2) relative threshold minimization to reduce the maximum negative score for each probe. Across diverse biometric tasks, including face recognition, gait recognition, and person re-identification, our experiments demonstrate the effectiveness of the proposed loss functions, significantly enhancing open-set performance while positively impacting closed-set performance. Our code and models are available at https://github.com/prevso1088/open-set-biometrics.
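To make the second loss concrete, the following is a minimal sketch (not the authors' released code; names such as `relative_threshold_loss` are hypothetical) of how the maximum imposter score per probe could be penalized within a minibatch that simulates open-set evaluation.

```python
# Hypothetical sketch of relative threshold minimization: within a minibatch
# split into probe and gallery embeddings, penalize each probe's highest
# imposter (non-mated) similarity so that a single high imposter score cannot
# dominate the detection threshold.
import torch
import torch.nn.functional as F

def relative_threshold_loss(probe_emb, gallery_emb, probe_labels, gallery_labels):
    probe_emb = F.normalize(probe_emb, dim=1)
    gallery_emb = F.normalize(gallery_emb, dim=1)
    sims = probe_emb @ gallery_emb.t()                        # cosine similarity matrix
    imposter = probe_labels[:, None] != gallery_labels[None, :]
    # Maximum non-mated score for each probe; softplus gives a smooth penalty.
    max_imposter = sims.masked_fill(~imposter, float('-inf')).max(dim=1).values
    return F.softplus(max_imposter).mean()
```

In practice such a term would be combined with an identification objective; the exact formulation in the paper may differ.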
Abstract: In this paper, we address the challenge of making ViT models more robust to unseen affine transformations. Such robustness becomes useful in various recognition tasks, such as face recognition, when image alignment failures occur. We propose a novel method called KP-RPE, which leverages key points (e.g., facial landmarks) to make ViT more resilient to scale, translation, and pose variations. We begin with the observation that Relative Position Encoding (RPE) is a good way to bring affine-transform generalization to ViTs. RPE, however, can only inject the model with the prior knowledge that nearby pixels are more important than far pixels. Keypoint RPE (KP-RPE) is an extension of this principle, where the significance of pixels is not solely dictated by their proximity but also by their relative positions to specific keypoints within the image. By anchoring the significance of pixels around keypoints, the model can more effectively retain spatial relationships, even when those relationships are disrupted by affine transformations. We show the merit of KP-RPE in face and gait recognition. The experimental results demonstrate its effectiveness in improving face recognition performance on low-quality images, particularly where alignment is prone to failure. Code and pre-trained models are available.
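As an illustration only (a simplified assumption, not the paper's exact formulation), the snippet below sketches a keypoint-conditioned attention bias: each patch's bias is predicted from its offsets to the detected keypoints and added to the attention logits, so pixel importance depends on position relative to landmarks rather than on raw distance alone.

```python
# Simplified sketch of a keypoint-conditioned relative position bias.
# An MLP maps each patch's offsets to the keypoints into a per-head bias,
# which is broadcast over queries and added to the attention logits.
import torch
import torch.nn as nn

class KeypointRPEBias(nn.Module):
    def __init__(self, num_heads: int, num_keypoints: int, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_keypoints, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_heads),
        )

    def forward(self, patch_xy: torch.Tensor, keypoints_xy: torch.Tensor) -> torch.Tensor:
        # patch_xy: (N, 2) patch-center coordinates; keypoints_xy: (B, K, 2) landmarks.
        offsets = patch_xy[None, :, None, :] - keypoints_xy[:, None, :, :]   # (B, N, K, 2)
        bias = self.mlp(offsets.flatten(-2))                                 # (B, N, heads)
        # Key-side bias broadcast over queries -> (B, heads, N, N), added to attention logits.
        return bias.permute(0, 2, 1).unsqueeze(2).expand(-1, -1, patch_xy.size(0), -1)
```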
Abstract: The zero-shot open-vocabulary challenge in image classification is tackled by pretrained vision-language models like CLIP, which benefit from incorporating class-specific knowledge from large language models (LLMs) like ChatGPT. However, biases in CLIP lead to similar descriptions for distinct but related classes. This motivates our novel image classification framework based on hierarchical comparisons: we use LLMs to recursively group classes into hierarchies and classify images by comparing image-text embeddings at each level of the hierarchy, resulting in an intuitive, effective, and explainable approach.
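A minimal sketch of the hierarchical comparison idea is given below, under the assumption that the LLM-generated hierarchy is a nested dictionary and that `encode_text` wraps CLIP's text encoder; the names and structure are illustrative, not the released implementation.

```python
# Hypothetical sketch: descend an LLM-built class hierarchy by comparing a
# CLIP image embedding with CLIP text embeddings of the candidate groups at
# each level, until a leaf class is reached.
import torch
import torch.nn.functional as F

def classify_hierarchically(image_emb, node, encode_text):
    """node: nested dict {'children': {group_description: child_node_or_class_name}}."""
    img = F.normalize(image_emb, dim=-1)                           # (D,)
    while isinstance(node, dict):
        descriptions = list(node['children'].keys())
        text_emb = F.normalize(encode_text(descriptions), dim=-1)  # (G, D)
        best = int((text_emb @ img).argmax())                      # most similar group
        node = node['children'][descriptions[best]]                # descend one level
    return node                                                    # leaf = predicted class
```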
Abstract: The audio-visual sound separation field typically assumes that sound sources are visible in the video, which excludes invisible sounds beyond the camera's view. Current methods struggle with such sounds because they lack visible cues. This paper introduces a novel "Audio-Visual Scene-Aware Separation" (AVSA-Sep) framework. It includes a semantic parser for visible and invisible sounds and a separator for scene-informed separation. AVSA-Sep successfully separates both sound types, with joint training and cross-modal alignment enhancing its effectiveness.
Abstract: Whole-body biometric recognition is an important area of research due to its vast applications in law enforcement, border security, and surveillance. This paper presents the end-to-end design, development, and evaluation of FarSight, an innovative software system for whole-body (fusion of face, gait, and body shape) biometric recognition. FarSight accepts videos captured from elevated platforms and drones as input and outputs a candidate list of identities from a gallery. The system is designed to address several challenges, including (i) low-quality imagery, (ii) large yaw and pitch angles, (iii) robust feature extraction that accommodates large intra-person variability and large inter-person similarity, and (iv) the large domain gap between training and test sets. FarSight combines the physics of imaging with deep learning models to enhance image restoration and biometric feature encoding. We test FarSight's effectiveness on the newly acquired IARPA Biometric Recognition and Identification at Altitude and Range (BRIAR) dataset. Notably, FarSight demonstrated a substantial performance increase on the BRIAR dataset, with gains of +11.82% in Rank-20 identification accuracy and +11.3% in TAR@1% FAR.
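The abstract does not specify how the face, gait, and body-shape scores are combined; as one plausible illustration (an assumption, not FarSight's actual fusion rule), a weighted score-level fusion could look like the following.

```python
# Hypothetical score-level fusion over per-modality similarity matrices
# (face, gait, body shape). Each modality is z-normalized so the weights
# are comparable, then combined into a single ranking score.
import numpy as np

def fuse_scores(score_matrices, weights):
    """score_matrices: list of (num_probes, num_gallery) arrays, one per modality."""
    fused = np.zeros_like(score_matrices[0], dtype=np.float64)
    for scores, weight in zip(score_matrices, weights):
        fused += weight * (scores - scores.mean()) / (scores.std() + 1e-8)
    return fused  # rank gallery identities by descending fused score
```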
Abstract: This technical report summarizes and compiles the submissions to the Actor-Action video classification challenge held as a final project in the CSC 249/449 Machine Vision course (Spring 2020) at the University of Rochester.