Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Annan Li

OctoNav: Towards Generalist Embodied Navigation

Jun 11, 2025

Chen Gao, Liankai Jin, Xingyu Peng, Jiazhao Zhang, Yue Deng, Annan Li, He Wang, Si Liu

Abstract:Embodied navigation stands as a foundation pillar within the broader pursuit of embodied AI. However, previous navigation research is divided into different tasks/capabilities, e.g., ObjNav, ImgNav and VLN, where they differ in task objectives and modalities, making datasets and methods are designed individually. In this work, we take steps toward generalist navigation agents, which can follow free-form instructions that include arbitrary compounds of multi-modal and multi-capability. To achieve this, we propose a large-scale benchmark and corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench features continuous environments and is constructed via a designed annotation pipeline. We thoroughly craft instruction-trajectory pairs, where instructions are diverse in free-form with arbitrary modality and capability. Also, we construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to provide the thinking process behind actions. For OctoNav-R1, we build it upon MLLMs and adapt it to a VLA-type model, which can produce low-level actions solely based on 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) that consists of three stages, i.e., Action-/TBA-SFT, Nav-GPRO, and Online RL stages. Each stage contains specifically designed learning policies and rewards. Importantly, for TBA-SFT and Nav-GRPO designs, we are inspired by the OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answer. Thus, we aim to investigate how to achieve thinking-before-action in the embodied navigation field, to improve model's reasoning ability toward generalists. Specifically, we propose TBA-SFT to utilize the TBA-CoT dataset to fine-tune the model as a cold-start phrase and then leverage Nav-GPRO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with previous methods.

* 31 pages, 25 figures

Via

Access Paper or Ask Questions

CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

Aug 20, 2024

Yuwei Zhao, Ziyang Luo, Yuchen Tian, Hongzhan Lin, Weixiang Yan, Annan Li, Jing Ma

Figure 1 for CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

Figure 2 for CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

Figure 3 for CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

Figure 4 for CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

Abstract:Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities. Our benchmark will be available at \url{https://github.com/CodeLLM-Research/CodeJudge-Eval}.

* Work in progress

Via

Access Paper or Ask Questions

GLGait: A Global-Local Temporal Receptive Field Network for Gait Recognition in the Wild

Aug 13, 2024

Guozhen Peng, Yunhong Wang, Yuwei Zhao, Shaoxiong Zhang, Annan Li

Figure 1 for GLGait: A Global-Local Temporal Receptive Field Network for Gait Recognition in the Wild

Figure 2 for GLGait: A Global-Local Temporal Receptive Field Network for Gait Recognition in the Wild

Figure 3 for GLGait: A Global-Local Temporal Receptive Field Network for Gait Recognition in the Wild

Figure 4 for GLGait: A Global-Local Temporal Receptive Field Network for Gait Recognition in the Wild

Abstract:Gait recognition has attracted increasing attention from academia and industry as a human recognition technology from a distance in non-intrusive ways without requiring cooperation. Although advanced methods have achieved impressive success in lab scenarios, most of them perform poorly in the wild. Recently, some Convolution Neural Networks (ConvNets) based methods have been proposed to address the issue of gait recognition in the wild. However, the temporal receptive field obtained by convolution operations is limited for long gait sequences. If directly replacing convolution blocks with visual transformer blocks, the model may not enhance a local temporal receptive field, which is important for covering a complete gait cycle. To address this issue, we design a Global-Local Temporal Receptive Field Network (GLGait). GLGait employs a Global-Local Temporal Module (GLTM) to establish a global-local temporal receptive field, which mainly consists of a Pseudo Global Temporal Self-Attention (PGTA) and a temporal convolution operation. Specifically, PGTA is used to obtain a pseudo global temporal receptive field with less memory and computation complexity compared with a multi-head self-attention (MHSA). The temporal convolution operation is used to enhance the local temporal receptive field. Besides, it can also aggregate pseudo global temporal receptive field to a true holistic temporal receptive field. Furthermore, we also propose a Center-Augmented Triplet Loss (CTL) in GLGait to reduce the intra-class distance and expand the positive samples in the training stage. Extensive experiments show that our method obtains state-of-the-art results on in-the-wild datasets, $i.e.$, Gait3D and GREW. The code is available at https://github.com/bgdpgz/GLGait.

* Accepted by ACM MM2024

Via

Access Paper or Ask Questions

RealGait: Gait Recognition for Person Re-Identification

Jan 13, 2022

Shaoxiong Zhang, Yunhong Wang, Tianrui Chai, Annan Li, Anil K. Jain

Figure 1 for RealGait: Gait Recognition for Person Re-Identification

Figure 2 for RealGait: Gait Recognition for Person Re-Identification

Figure 3 for RealGait: Gait Recognition for Person Re-Identification

Figure 4 for RealGait: Gait Recognition for Person Re-Identification

Abstract:Human gait is considered a unique biometric identifier which can be acquired in a covert manner at a distance. However, models trained on existing public domain gait datasets which are captured in controlled scenarios lead to drastic performance decline when applied to real-world unconstrained gait data. On the other hand, video person re-identification techniques have achieved promising performance on large-scale publicly available datasets. Given the diversity of clothing characteristics, clothing cue is not reliable for person recognition in general. So, it is actually not clear why the state-of-the-art person re-identification methods work as well as they do. In this paper, we construct a new gait dataset by extracting silhouettes from an existing video person re-identification challenge which consists of 1,404 persons walking in an unconstrained manner. Based on this dataset, a consistent and comparative study between gait recognition and person re-identification can be carried out. Given that our experimental results show that current gait recognition approaches designed under data collected in controlled scenarios are inappropriate for real surveillance scenarios, we propose a novel gait recognition method, called RealGait. Our results suggest that recognizing people by their gait in real surveillance scenarios is feasible and the underlying gait pattern is probably the true reason why video person re-idenfification works in practice.

Via

Access Paper or Ask Questions

Will You Ever Become Popular? Learning to Predict Virality of Dance Clips

Nov 06, 2021

Jiahao Wang, Yunhong Wang, Nina Weng, Tianrui Chai, Annan Li, Faxi Zhang, Sansi Yu

Figure 1 for Will You Ever Become Popular? Learning to Predict Virality of Dance Clips

Figure 2 for Will You Ever Become Popular? Learning to Predict Virality of Dance Clips

Figure 3 for Will You Ever Become Popular? Learning to Predict Virality of Dance Clips

Figure 4 for Will You Ever Become Popular? Learning to Predict Virality of Dance Clips

Abstract:Dance challenges are going viral in video communities like TikTok nowadays. Once a challenge becomes popular, thousands of short-form videos will be uploaded in merely a couple of days. Therefore, virality prediction from dance challenges is of great commercial value and has a wide range of applications, such as smart recommendation and popularity promotion. In this paper, a novel multi-modal framework which integrates skeletal, holistic appearance, facial and scenic cues is proposed for comprehensive dance virality prediction. To model body movements, we propose a pyramidal skeleton graph convolutional network (PSGCN) which hierarchically refines spatio-temporal skeleton graphs. Meanwhile, we introduce a relational temporal convolutional network (RTCN) to exploit appearance dynamics with non-local temporal relations. An attentive fusion approach is finally proposed to adaptively aggregate predictions from different modalities. To validate our method, we introduce a large-scale viral dance video (VDV) dataset, which contains over 4,000 dance clips of eight viral dance challenges. Extensive experiments on the VDV dataset demonstrate the efficacy of our model. Extensive experiments on the VDV dataset well demonstrate the effectiveness of our approach. Furthermore, we show that short video applications like multi-dimensional recommendation and action feedback can be derived from our model.

* Accepted by TOMM

Via

Access Paper or Ask Questions

Video Person Re-identification using Attribute-enhanced Features

Aug 16, 2021

Tianrui Chai, Zhiyuan Chen, Annan Li, Jiaxin Chen, Xinyu Mei, Yunhong Wang

Figure 1 for Video Person Re-identification using Attribute-enhanced Features

Figure 2 for Video Person Re-identification using Attribute-enhanced Features

Figure 3 for Video Person Re-identification using Attribute-enhanced Features

Figure 4 for Video Person Re-identification using Attribute-enhanced Features

Abstract:Video-based person re-identification (Re-ID) which aims to associate people across non-overlapping cameras using surveillance video is a challenging task. Pedestrian attribute, such as gender, age and clothing characteristics contains rich and supplementary information but is less explored in video person Re-ID. In this work, we propose a novel network architecture named Attribute Salience Assisted Network (ASA-Net) for attribute-assisted video person Re-ID, which achieved considerable improvement to existing works by two methods.First, to learn a better separation of the target from background, we propose to learn the visual attention from middle-level attribute instead of high-level identities. The proposed Attribute Salient Region Enhance (ASRE) module can attend more accurately on the body of pedestrian. Second, we found that many identity-irrelevant but object or subject-relevant factors like the view angle and movement of the target pedestrian can greatly influence the two dimensional appearance of a pedestrian. This problem can be mitigated by investigating both identity-relevant and identity-irrelevant attributes via a novel triplet loss which is referred as the Pose~\&~Motion-Invariant (PMI) triplet loss.

Via

Access Paper or Ask Questions

Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning

Aug 15, 2021

Jiahao Wang, Yunhong Wang, Sheng Liu, Annan Li

Figure 1 for Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning

Figure 2 for Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning

Figure 3 for Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning

Figure 4 for Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning

Abstract:Fine-grained action recognition is attracting increasing attention due to the emerging demand of specific action understanding in real-world applications, whereas the data of rare fine-grained categories is very limited. Therefore, we propose the few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only few samples given for each class. Although progress has been made in coarse-grained actions, existing few-shot recognition methods encounter two issues handling fine-grained actions: the inability to capture subtle action details and the inadequacy in learning from data with low inter-class variance. To tackle the first issue, a human vision inspired bidirectional attention module (BAM) is proposed. Combining top-down task-driven signals with bottom-up salient stimuli, BAM captures subtle action details by accurately highlighting informative spatio-temporal regions. To address the second issue, we introduce contrastive meta-learning (CML). Compared with the widely adopted ProtoNet-based method, CML generates more discriminative video representations for low inter-class variance data, since it makes full use of potential contrastive pairs in each training episode. Furthermore, to fairly compare different models, we establish specific benchmark protocols on two large-scale fine-grained action recognition datasets. Extensive experiments show that our method consistently achieves state-of-the-art performance across evaluated tasks.

* Accepted in ACM Multimedia 2021

Via

Access Paper or Ask Questions

Silhouette based View embeddings for Gait Recognition under Multiple Views

Aug 12, 2021

Tianrui Chai, Xinyu Mei, Annan Li, Yunhong Wang

Figure 1 for Silhouette based View embeddings for Gait Recognition under Multiple Views

Figure 2 for Silhouette based View embeddings for Gait Recognition under Multiple Views

Figure 3 for Silhouette based View embeddings for Gait Recognition under Multiple Views

Figure 4 for Silhouette based View embeddings for Gait Recognition under Multiple Views

Abstract:Gait recognition under multiple views is an important computer vision and pattern recognition task. In the emerging convolutional neural network based approaches, the information of view angle is ignored to some extent. Instead of direct view estimation and training view-specific recognition models, we propose a compatible framework that can embed view information into existing architectures of gait recognition. The embedding is simply achieved by a selective projection layer. Experimental results on two large public datasets show that the proposed framework is very effective.

* ICIP 2021

Via

Access Paper or Ask Questions

Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction

Jun 17, 2021

Yaojie Lu, Hongyu Lin, Jin Xu, Xianpei Han, Jialong Tang, Annan Li, Le Sun, Meng Liao, Shaoyi Chen

Figure 1 for Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction

Figure 2 for Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction

Figure 3 for Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction

Figure 4 for Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction

Abstract:Event extraction is challenging due to the complex structure of event records and the semantic gap between text and event. Traditional methods usually extract event records by decomposing the complex structure prediction task into multiple subtasks. In this paper, we propose Text2Event, a sequence-to-structure generation paradigm that can directly extract events from the text in an end-to-end manner. Specifically, we design a sequence-to-structure network for unified event extraction, a constrained decoding algorithm for event knowledge injection during inference, and a curriculum learning algorithm for efficient model learning. Experimental results show that, by uniformly modeling all tasks in a single model and universally predicting different labels, our method can achieve competitive performance using only record-level annotations in both supervised learning and transfer learning settings.

* Accepted to ACL2021 (main conference)

Via

Access Paper or Ask Questions

ISCAS at SemEval-2020 Task 5: Pre-trained Transformers for Counterfactual Statement Modeling

Sep 17, 2020

Yaojie Lu, Annan Li, Hongyu Lin, Xianpei Han, Le Sun

Figure 1 for ISCAS at SemEval-2020 Task 5: Pre-trained Transformers for Counterfactual Statement Modeling

Figure 2 for ISCAS at SemEval-2020 Task 5: Pre-trained Transformers for Counterfactual Statement Modeling

Figure 3 for ISCAS at SemEval-2020 Task 5: Pre-trained Transformers for Counterfactual Statement Modeling

Figure 4 for ISCAS at SemEval-2020 Task 5: Pre-trained Transformers for Counterfactual Statement Modeling

Abstract:ISCAS participated in two subtasks of SemEval 2020 Task 5: detecting counterfactual statements and detecting antecedent and consequence. This paper describes our system which is based on pre-trained transformers. For the first subtask, we train several transformer-based classifiers for detecting counterfactual statements. For the second subtask, we formulate antecedent and consequence extraction as a query-based question answering problem. The two subsystems both achieved third place in the evaluation. Our system is openly released at https://github.com/casnlu/ISCAS-SemEval2020Task5.

Via

Access Paper or Ask Questions