Abstract:First-Person-View (FPV) holds immense potential for revolutionizing the trajectory of Unmanned Aerial Vehicles (UAVs), offering an exhilarating avenue for navigating complex building structures. Yet, traditional Neural Radiance Field (NeRF) methods face challenges such as sampling single points per iteration and requiring an extensive array of views for supervision. UAV videos exacerbate these issues with limited viewpoints and significant spatial scale variations, resulting in inadequate detail rendering across diverse scales. In response, we introduce FPV-NeRF, addressing these challenges through three key facets: (1) Temporal consistency. Leveraging spatio-temporal continuity ensures seamless coherence between frames; (2) Global structure. Incorporating various global features during point sampling preserves space integrity; (3) Local granularity. Employing a comprehensive framework and multi-resolution supervision for multi-scale scene feature representation tackles the intricacies of UAV video spatial scales. Additionally, due to the scarcity of publicly available FPV videos, we introduce an innovative view synthesis method using NeRF to generate FPV perspectives from UAV footage, enhancing spatial perception for drones. Our novel dataset spans diverse trajectories, from outdoor to indoor environments, in the UAV domain, differing significantly from traditional NeRF scenarios. Through extensive experiments encompassing both interior and exterior building structures, FPV-NeRF demonstrates a superior understanding of the UAV flying space, outperforming state-of-the-art methods in our curated UAV dataset. Explore our project page for further insights: https://fpv-nerf.github.io/.
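The multi-resolution supervision mentioned above can be pictured with a short sketch: the rendered frame is compared with the ground-truth frame at several spatial scales, so coarse scene structure and fine detail both contribute to the training signal. This is a minimal PyTorch illustration, not the FPV-NeRF implementation; the `multi_resolution_loss` helper, scales, and weights are assumptions made for the example.

```python
# Minimal sketch (not the authors' code) of multi-resolution supervision:
# the rendering is penalized against the target at several scales.
import torch
import torch.nn.functional as F

def multi_resolution_loss(rendered, target, scales=(1, 2, 4), weights=(1.0, 0.5, 0.25)):
    """rendered, target: (B, 3, H, W) images in [0, 1]."""
    loss = 0.0
    for s, w in zip(scales, weights):
        if s == 1:
            r, t = rendered, target
        else:
            r = F.avg_pool2d(rendered, kernel_size=s)   # downsampled rendering
            t = F.avg_pool2d(target, kernel_size=s)     # downsampled ground truth
        loss = loss + w * F.mse_loss(r, t)
    return loss

# usage with random tensors standing in for rendered / ground-truth frames
pred = torch.rand(2, 3, 128, 128)
gt = torch.rand(2, 3, 128, 128)
print(multi_resolution_loss(pred, gt).item())
```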
Abstract:Instance segmentation in videos, which aims to segment and track multiple objects in video frames, has garnered a flurry of research attention in recent years. In this paper, we present a novel weakly supervised framework with \textbf{S}patio-\textbf{T}emporal \textbf{C}ollaboration for instance \textbf{Seg}mentation in videos, namely \textbf{STC-Seg}. Concretely, STC-Seg demonstrates four contributions. First, we leverage the complementary representations from unsupervised depth estimation and optical flow to produce effective pseudo-labels for training deep networks and predicting high-quality instance masks. Second, to enhance the mask generation, we devise a puzzle loss, which enables end-to-end training using box-level annotations. Third, our tracking module jointly utilizes bounding-box diagonal points with spatio-temporal discrepancy to model movements, which largely improves the robustness to different object appearances. Finally, our framework is flexible and enables image-level instance segmentation methods to operate on the video-level task. We conduct an extensive set of experiments on the KITTI MOTS and YT-VIS datasets. Experimental results demonstrate that our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN. We believe that STC-Seg can be a valuable addition to the community, as it reveals only the tip of the iceberg of innovative opportunities in the weakly supervised paradigm for instance segmentation in videos.
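To give a concrete feel for box-level supervision of masks (the abstract's puzzle loss itself is not reproduced here), the sketch below shows one common illustrative scheme: the predicted soft mask, projected onto the x and y axes, is matched against the box's projections. The `box_projection_loss` function and its inputs are assumptions for this example, not STC-Seg's actual loss.

```python
# Illustrative box-supervised mask loss (not STC-Seg's exact puzzle loss):
# the mask's axis projections should agree with the box's axis projections.
import torch
import torch.nn.functional as F

def box_projection_loss(mask_logits, box_mask):
    """mask_logits: (B, H, W) raw scores; box_mask: (B, H, W) binary box region."""
    prob = mask_logits.sigmoid()
    # max-projection of the soft mask and of the box onto each axis
    loss_x = F.binary_cross_entropy(prob.max(dim=1).values, box_mask.max(dim=1).values)
    loss_y = F.binary_cross_entropy(prob.max(dim=2).values, box_mask.max(dim=2).values)
    return loss_x + loss_y

B, H, W = 2, 64, 64
logits = torch.randn(B, H, W)
boxes = torch.zeros(B, H, W)
boxes[:, 16:48, 20:44] = 1.0          # ground-truth box region
print(box_projection_loss(logits, boxes).item())
```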
Abstract:Video captioning is a challenging task as it needs to accurately transform visual understanding into natural language description. To date, state-of-the-art methods inadequately model global-local representation across video frames for caption generation, leaving plenty of room for improvement. In this work, we approach the video captioning task from a new perspective and propose a GL-RG framework for video captioning, namely a \textbf{G}lobal-\textbf{L}ocal \textbf{R}epresentation \textbf{G}ranularity. Our GL-RG demonstrates three advantages over the prior efforts: 1) we explicitly exploit extensive visual representations from different video ranges to improve linguistic expression; 2) we devise a novel global-local encoder to produce a rich semantic vocabulary and obtain a descriptive granularity of video contents across frames; 3) we develop an incremental training strategy which organizes model learning in an incremental fashion to attain optimal captioning behavior. Experimental results on the challenging MSR-VTT and MSVD datasets show that our GL-RG outperforms recent state-of-the-art methods by a significant margin. Code is available at \url{https://github.com/ylqi/GL-RG}.
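The global-local idea above can be sketched as follows: per-frame features are pooled over the whole clip (global range) and over a short temporal window around each frame (local range), then fused into a single representation per frame. This is a hedged, minimal PyTorch sketch; the `GlobalLocalEncoder` class, window size, and fusion layer are assumptions, not the GL-RG architecture.

```python
# Minimal global-local fusion sketch: global clip pooling + local window pooling.
import torch
import torch.nn as nn

class GlobalLocalEncoder(nn.Module):
    def __init__(self, feat_dim=512, out_dim=512, window=3):
        super().__init__()
        self.window = window
        self.fuse = nn.Linear(3 * feat_dim, out_dim)

    def forward(self, frame_feats):                  # (B, T, D) per-frame features
        B, T, D = frame_feats.shape
        global_feat = frame_feats.mean(dim=1, keepdim=True).expand(B, T, D)
        # local range: average over a temporal window centered at each frame
        pad = self.window // 2
        padded = nn.functional.pad(frame_feats.transpose(1, 2), (pad, pad), mode="replicate")
        local_feat = nn.functional.avg_pool1d(padded, self.window, stride=1).transpose(1, 2)
        fused = torch.cat([frame_feats, local_feat, global_feat], dim=-1)
        return torch.relu(self.fuse(fused))          # (B, T, out_dim)

enc = GlobalLocalEncoder()
print(enc(torch.randn(2, 16, 512)).shape)            # torch.Size([2, 16, 512])
```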
Abstract:Video object detection is a challenging task because isolated video frames may encounter appearance deterioration, which introduces great confusion for detection. One of the popular solutions is to exploit the temporal information and enhance per-frame representation through aggregating features from neighboring frames. Despite achieving improvements in detection, existing methods focus on the selection of higher-level video frames for aggregation rather than modeling lower-level temporal relations to increase the feature representation. To address this limitation, we propose a novel solution named TF-Blender, which includes three modules: 1) Temporal relation models the relations between the current frame and its neighboring frames to preserve spatial information; 2) Feature adjustment enriches the representation of every neighboring feature map; 3) Feature blender combines outputs from the first two modules and produces stronger features for the later detection tasks. Thanks to its simplicity, TF-Blender can be effortlessly plugged into any detection network to improve detection behavior. Extensive evaluations on ImageNet VID and YouTube-VIS benchmarks indicate the performance gains of using TF-Blender with recent state-of-the-art methods.
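The neighbor-feature blending described above can be illustrated with a short sketch: neighboring feature maps are weighted by their similarity to the current frame and blended into an enhanced current-frame representation. This is not the official TF-Blender code; the `blend_features` helper, the cosine-similarity weighting, and the 0.5/0.5 mixing are assumptions made for the example.

```python
# Illustrative neighbor-feature blending for video object detection.
import torch
import torch.nn.functional as F

def blend_features(current, neighbors):
    """current: (C, H, W); neighbors: (N, C, H, W)."""
    cur = current.flatten(1).mean(dim=1)                 # (C,) global descriptor
    nbr = neighbors.flatten(2).mean(dim=2)               # (N, C) neighbor descriptors
    # temporal relation: cosine similarity between current and each neighbor
    weights = F.softmax(F.cosine_similarity(nbr, cur.unsqueeze(0), dim=1), dim=0)
    blended = (weights.view(-1, 1, 1, 1) * neighbors).sum(dim=0)
    return 0.5 * current + 0.5 * blended                 # enhanced current-frame feature

cur = torch.randn(256, 32, 32)
nbrs = torch.randn(4, 256, 32, 32)
print(blend_features(cur, nbrs).shape)                   # torch.Size([256, 32, 32])
```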
Abstract:Geo-localization is a critical task in computer vision. In this work, we cast geo-localization as a 2D image retrieval task. Current state-of-the-art methods for 2D geo-localization are not robust when localizing scenes with drastic scale variations because they only exploit features from one semantic level for image representations. To address this limitation, we introduce a hierarchical attention fusion network using multi-scale features for geo-localization. We extract hierarchical feature maps from a convolutional neural network (CNN) and organically fuse the extracted features for image representations. Our training is self-supervised, using adaptive weights to control the attention of feature emphasis from each hierarchical level. Evaluation results on the image retrieval and the large-scale geo-localization benchmarks indicate that our method outperforms the existing state-of-the-art methods. Code is available here: \url{https://github.com/YanLiqi/HAF}.
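One way to picture the adaptive weighting over hierarchical levels is the sketch below: feature maps from several CNN stages are projected to a common dimension, pooled, and fused with learned per-level attention weights into one image descriptor. The `HierarchicalFusion` module, the channel sizes, and the softmax weighting are assumptions for illustration; the exact HAF architecture may differ.

```python
# Sketch of attention-weighted fusion of multi-scale CNN features.
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    def __init__(self, channels=(256, 512, 1024), out_dim=512):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, out_dim, 1) for c in channels)
        self.attn = nn.Parameter(torch.zeros(len(channels)))   # adaptive level weights

    def forward(self, feats):                    # list of (B, C_i, H_i, W_i) maps
        pooled = [p(f).mean(dim=(2, 3)) for p, f in zip(self.proj, feats)]  # (B, out_dim) each
        w = torch.softmax(self.attn, dim=0)
        desc = sum(wi * pi for wi, pi in zip(w, pooled))
        return nn.functional.normalize(desc, dim=1)             # L2-normalized descriptor

model = HierarchicalFusion()
feats = [torch.randn(2, 256, 56, 56), torch.randn(2, 512, 28, 28), torch.randn(2, 1024, 14, 14)]
print(model(feats).shape)                        # torch.Size([2, 512])
```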
Abstract:In this work, we introduce a Denser Feature Network (DenserNet) for visual localization. Our work provides three principal contributions. First, we develop a convolutional neural network (CNN) architecture which aggregates feature maps at different semantic levels for image representations. Using denser feature maps, our method can produce more keypoint features and increase image retrieval accuracy. Second, our model is trained end-to-end without pixel-level annotation other than positive and negative GPS-tagged image pairs. We use a weakly supervised triplet ranking loss to learn discriminative features and encourage keypoint feature repeatability for image representation. Finally, our method is computationally efficient, as our architecture shares features and parameters during computation. Our method can perform accurate large-scale localization under challenging conditions while remaining within computational constraints. Extensive experimental results indicate that our method sets a new state-of-the-art on four challenging large-scale localization benchmarks and three image retrieval benchmarks.
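The weakly supervised triplet ranking objective can be sketched as follows: a query descriptor should be closer to a GPS-nearby (positive) image than to any GPS-distant (negative) image by a margin. This is a minimal sketch under assumed inputs (L2-normalized descriptors, a single positive, a pool of negatives, and an illustrative margin); descriptor extraction and hard-negative mining are not shown.

```python
# Sketch of a weakly supervised triplet ranking loss over image descriptors.
import torch
import torch.nn.functional as F

def triplet_ranking_loss(query, positive, negatives, margin=0.1):
    """query, positive: (D,); negatives: (N, D). All L2-normalized descriptors."""
    d_pos = torch.sum((query - positive) ** 2)                          # squared distance to positive
    d_neg = torch.sum((query.unsqueeze(0) - negatives) ** 2, dim=1)     # (N,) distances to negatives
    return F.relu(margin + d_pos - d_neg).mean()                        # hinge over all negatives

q = F.normalize(torch.randn(512), dim=0)
p = F.normalize(torch.randn(512), dim=0)
n = F.normalize(torch.randn(10, 512), dim=1)
print(triplet_ranking_loss(q, p, n).item())
```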
Abstract:Vision and voice are two vital keys for agents' interaction and learning. In this paper, we present a novel indoor navigation model called Memory Vision-Voice Indoor Navigation (MVV-IN), which receives voice commands and analyzes multimodal information from visual observation in order to enhance robots' environment understanding. We make use of single RGB images taken by a first-person-view monocular camera. We also apply a self-attention mechanism to keep the agent focused on key areas. Memory is important for the agent to avoid unnecessarily repeating certain tasks and to adapt adequately to new scenes; we therefore make use of meta-learning. We have experimented with various functional features extracted from visual observation. Comparative experiments show that our method outperforms state-of-the-art baselines.
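The self-attention mechanism over the visual observation can be illustrated with a short sketch: spatial positions of a first-person-view feature map attend to each other, letting the agent emphasize key regions. The `SpatialSelfAttention` module below is an assumption made for the example, not the MVV-IN implementation.

```python
# Minimal spatial self-attention over a visual feature map, with a residual connection.
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.qkv = nn.Conv2d(channels, 3 * channels, kernel_size=1)

    def forward(self, x):                         # (B, C, H, W) visual features
        B, C, H, W = x.shape
        q, k, v = self.qkv(x).flatten(2).chunk(3, dim=1)                 # each (B, C, H*W)
        attn = torch.softmax(q.transpose(1, 2) @ k / C ** 0.5, dim=-1)   # (B, HW, HW)
        out = (v @ attn.transpose(1, 2)).view(B, C, H, W)                # attended features
        return x + out                             # residual connection

att = SpatialSelfAttention(256)
print(att(torch.randn(1, 256, 7, 7)).shape)        # torch.Size([1, 256, 7, 7])
```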
Abstract:Describing a video automatically with natural language is a challenging task in the area of computer vision. In most cases, the on-site situation of major events is reported in the news, while the off-site spectators at the entrances and exits, who also attract public interest, are neglected. Since deploying reporters at entrances and exits requires considerable manpower, automatically describing the behavior of a crowd of off-site spectators is significant yet remains an open problem. To tackle this problem, we propose a new task called crowd video captioning (CVC), which aims to describe the crowd of spectators. We also provide baseline methods for this task and evaluate them on the WorldExpo'10 dataset. Our experimental results show that captioning models achieve a fairly deep understanding of the crowd in the video and perform satisfactorily on the CVC task.