Abstract: We propose a video feature representation learning framework called STAR-GNN, which applies a pluggable graph neural network component to a multi-scale lattice feature graph. The essence of STAR-GNN is to exploit the temporal dynamics and spatial contents of a video, as well as the visual connections between regions at different scales within frames. It models a video as a lattice feature graph whose nodes represent regions of different granularity and whose weighted edges represent spatial and temporal links. Contextual information from neighboring nodes is aggregated simultaneously by graph neural networks whose parameters are trained with a retrieval triplet loss. In the experiments, we show that STAR-GNN effectively implements a dynamic attention mechanism on video frame sequences, emphasizing dynamic and semantically rich content in the video while remaining robust to noise and redundancy. Empirical results show that STAR-GNN achieves state-of-the-art performance for content-based video retrieval.
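To make the lattice-graph aggregation concrete, here is a minimal sketch of the idea in PyTorch. This is not the authors' implementation: region features are assumed precomputed, the lattice adjacency (spatial containment plus temporal adjacency, with edge weights) is taken as given, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAggregation(nn.Module):
    """One round of weighted neighborhood aggregation (GCN-style)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (num_nodes, dim); adj: (num_nodes, num_nodes) edge weights
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
        h = (adj @ x) / deg              # weighted mean over neighbors
        return F.relu(self.linear(h))

class StarGNNSketch(nn.Module):
    """Aggregates lattice-graph nodes into a single video descriptor."""
    def __init__(self, dim=256, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList([GraphAggregation(dim) for _ in range(num_layers)])

    def forward(self, node_feats, adj):
        h = node_feats
        for layer in self.layers:        # simultaneous neighborhood updates
            h = layer(h, adj)
        return F.normalize(h.mean(dim=0), dim=0)  # pooled video descriptor

# Parameters are trained with a retrieval triplet loss over descriptors:
triplet_loss = nn.TripletMarginLoss(margin=0.2)
# loss = triplet_loss(anchor_desc, positive_desc, negative_desc)
```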
Abstract: Following a survey of the privacy concerns raised by person-tracking systems, we propose InvisibiliTee, a black-box adversarial attack method against state-of-the-art human detection models. The method learns printable adversarial patterns for T-shirts that cloak wearers from person-tracking systems in the physical world. We design an angle-agnostic learning scheme that leverages segmentation masks from a fashion dataset together with a geometric warping process, so that the generated adversarial patterns fool person detectors from all camera angles and transfer to unseen black-box detection models. Empirical results in both digital and physical environments show that, with the InvisibiliTee on, the ability of person-tracking systems to detect the wearer drops significantly.
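A schematic of the pattern-optimization loop is sketched below in PyTorch. It is not the authors' implementation: `detector` stands in for a differentiable surrogate person detector (the black-box setting relies on transfer to unseen models), and `warp_to_shirt` is a hypothetical placeholder for the geometric warping that maps the flat pattern onto the shirt region.

```python
import torch

def attack_step(pattern, images, shirt_masks, detector, warp_to_shirt, optimizer):
    """One optimization step for a printable adversarial T-shirt pattern."""
    # Paste the warped pattern onto the shirt region of each image.
    warped = warp_to_shirt(pattern, shirt_masks)          # hypothetical helper
    patched = images * (1 - shirt_masks) + warped * shirt_masks
    scores = detector(patched)            # person-class confidence scores
    loss = scores.mean()                  # minimizing this suppresses detections
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        pattern.clamp_(0, 1)              # keep the pattern printable
    return loss.item()
```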
Abstract: Both images and music can convey rich semantics, and both are widely used to induce specific emotions. Matching images and music with similar emotions may make emotion perception more vivid and stronger. Existing emotion-based image and music matching methods either employ limited categorical emotion states, which cannot well reflect the complexity and subtlety of emotions, or train the matching model with an impractical multi-stage pipeline. In this paper, we study end-to-end matching between images and music based on emotions in the continuous valence-arousal (VA) space. First, we construct a large-scale dataset, termed Image-Music-Emotion-Matching-Net (IMEMNet), with over 140K image-music pairs. Second, we propose cross-modal deep continuous metric learning (CDCML) to learn a shared latent embedding space that preserves the cross-modal similarity relationship in the continuous matching space. Finally, we refine the embedding space by further preserving the single-modal emotion relationship in the VA spaces of both images and music. The metric learning in the embedding space and the regression task in the label space are jointly optimized for both cross-modal matching and single-modal VA prediction. Extensive experiments conducted on IMEMNet demonstrate the superiority of CDCML for emotion-based image and music matching over state-of-the-art approaches.
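As a rough illustration of the joint objective, here is a minimal sketch in PyTorch (not the authors' code): `img_emb`/`mus_emb` are encoder outputs in the shared embedding space, `img_va`/`mus_va` are ground-truth valence-arousal labels, `*_va_pred` come from regression heads, and the lambda weights are illustrative.

```python
import torch
import torch.nn.functional as F

def cdcml_loss(img_emb, mus_emb, img_va, mus_va, img_va_pred, mus_va_pred,
               lam_cross=1.0, lam_single=0.5, lam_reg=0.5):
    """Joint objective: cross-modal metric + single-modal metric + VA regression."""
    # Cross-modal metric term: embedding distances should follow
    # emotion similarity in the continuous VA space.
    cross = F.mse_loss(torch.cdist(img_emb, mus_emb), torch.cdist(img_va, mus_va))

    # Single-modal terms: preserve the VA structure within each modality.
    single = (F.mse_loss(torch.cdist(img_emb, img_emb), torch.cdist(img_va, img_va))
              + F.mse_loss(torch.cdist(mus_emb, mus_emb), torch.cdist(mus_va, mus_va)))

    # Regression in the label space (VA prediction for each modality).
    reg = F.mse_loss(img_va_pred, img_va) + F.mse_loss(mus_va_pred, mus_va)

    return lam_cross * cross + lam_single * single + lam_reg * reg
```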
Abstract: Deep neural networks suffer from performance decay when there is a domain shift between the labeled source domain and the unlabeled target domain, which motivates research on domain adaptation (DA). Conventional DA methods usually assume that the labeled data is sampled from a single source distribution. In practice, however, labeled data may be collected from multiple sources, and naively applying single-source DA algorithms to them may lead to suboptimal solutions. In this paper, we propose a novel multi-source distilling domain adaptation (MDDA) network, which not only considers the different distances between the multiple sources and the target, but also investigates how similar individual source samples are to target ones. Specifically, MDDA proceeds in four stages: (1) pre-train a classifier for each source separately on that source's training data; (2) adversarially map the target into the feature space of each source by minimizing the empirical Wasserstein distance between source and target; (3) select the source training samples that are closer to the target and use them to fine-tune the source classifiers; and (4) classify each encoded target feature with the corresponding source classifier and aggregate the predictions using per-source domain weights that reflect the discrepancy between each source and the target. Extensive experiments are conducted on public DA benchmarks, and the results demonstrate that MDDA significantly outperforms state-of-the-art approaches. Our source code is released at: https://github.com/daoyuan98/MDDA.
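The weighted aggregation of stage (4) can be sketched as follows in PyTorch; this is a schematic, not the released code (see the repository above). Here `encoders`/`classifiers` are the per-source models from stages (1)-(3), and `wasserstein_dists` holds the stage-(2) source-target discrepancy estimates.

```python
import torch
import torch.nn.functional as F

def mdda_predict(x, encoders, classifiers, wasserstein_dists):
    """Stage (4): aggregate per-source predictions with domain weights."""
    # Smaller source-target discrepancy -> larger domain weight.
    weights = F.softmax(-torch.as_tensor(wasserstein_dists, dtype=torch.float32), dim=0)
    preds = [F.softmax(clf(enc(x)), dim=-1)              # per-source prediction
             for enc, clf in zip(encoders, classifiers)]
    return sum(w * p for w, p in zip(weights, preds))    # weighted aggregation
```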