Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Guangting Wang

Visual Perception by Large Language Model's Weights

May 30, 2024

Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun

Figure 1 for Visual Perception by Large Language Model's Weights

Figure 2 for Visual Perception by Large Language Model's Weights

Figure 3 for Visual Perception by Large Language Model's Weights

Figure 4 for Visual Perception by Large Language Model's Weights

Abstract:Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unified sequence input for LLMs. These methods demonstrate promising results on various vision-language tasks but are limited by the high computational effort due to the extended input sequence resulting from the involvement of visual tokens. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert features into perceptual weights, and merge the perceptual weights with LLM's weights. In this way, the input of LLM does not require visual tokens, which reduces the length of the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA with the perceptual weights generator. The perceptual weights generator is designed to convert visual features to perceptual weights with low-rank property, exhibiting a form similar to LoRA. The experimental results show that our VLoRA achieves comparable performance on various benchmarks for MLLMs, while significantly reducing the computational costs for both training and inference. The code and models will be made open-source.

Via

Access Paper or Ask Questions

Multi-Modal Generative Embedding Model

May 29, 2024

Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun

Abstract:Most multi-modal tasks can be formulated into problems of either generation or embedding. Existing models usually tackle these two types of problems by decoupling language modules into a text decoder for generation, and a text encoder for embedding. To explore the minimalism of multi-modal paradigms, we attempt to achieve only one model per modality in this work. We propose a Multi-Modal Generative Embedding Model (MM-GEM), whereby the generative and embedding objectives are encapsulated in one Large Language Model. We also propose a PoolAggregator to boost efficiency and enable the ability of fine-grained embedding and generation. A surprising finding is that these two objectives do not significantly conflict with each other. For example, MM-GEM instantiated from ViT-Large and TinyLlama shows competitive performance on benchmarks for multimodal embedding models such as cross-modal retrieval and zero-shot classification, while has good ability of image captioning. Additionally, MM-GEM can seamlessly execute region-level image caption generation and retrieval tasks. Besides, the advanced text model in MM-GEM brings over 5% improvement in Recall@1 for long text and image retrieval.

Via

Access Paper or Ask Questions

Correlation-Aware Deep Tracking

Mar 03, 2022

Fei Xie, Chunyu Wang, Guangting Wang, Yue Cao, Wankou Yang, Wenjun Zeng

Figure 1 for Correlation-Aware Deep Tracking

Figure 2 for Correlation-Aware Deep Tracking

Figure 3 for Correlation-Aware Deep Tracking

Figure 4 for Correlation-Aware Deep Tracking

Abstract:Robustness and discrimination power are two fundamental requirements in visual object tracking. In most tracking paradigms, we find that the features extracted by the popular Siamese-like networks cannot fully discriminatively model the tracked targets and distractor objects, hindering them from simultaneously meeting these two requirements. While most methods focus on designing robust correlation operations, we propose a novel target-dependent feature network inspired by the self-/cross-attention scheme. In contrast to the Siamese-like feature extraction, our network deeply embeds cross-image feature correlation in multiple layers of the feature network. By extensively matching the features of the two images through multiple layers, it is able to suppress non-target features, resulting in instance-varying feature extraction. The output features of the search image can be directly used for predicting target locations without extra correlation step. Moreover, our model can be flexibly pre-trained on abundant unpaired images, leading to notably faster convergence than the existing methods. Extensive experiments show our method achieves the state-of-the-art results while running at real-time. Our feature networks also can be applied to existing tracking pipelines seamlessly to raise the tracking performance. Code will be available.

* accepted by CVPR2022

Via

Access Paper or Ask Questions

When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism

Jan 26, 2022

Guangting Wang, Yucheng Zhao, Chuanxin Tang, Chong Luo, Wenjun Zeng

Figure 1 for When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism

Figure 2 for When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism

Figure 3 for When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism

Figure 4 for When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism

Abstract:Attention mechanism has been widely believed as the key to success of vision transformers (ViTs), since it provides a flexible and powerful way to model spatial relationships. However, is the attention mechanism truly an indispensable part of ViT? Can it be replaced by some other alternatives? To demystify the role of attention mechanism, we simplify it into an extremely simple case: ZERO FLOP and ZERO parameter. Concretely, we revisit the shift operation. It does not contain any parameter or arithmetic calculation. The only operation is to exchange a small portion of the channels between neighboring features. Based on this simple operation, we construct a new backbone network, namely ShiftViT, where the attention layers in ViT are substituted by shift operations. Surprisingly, ShiftViT works quite well in several mainstream tasks, e.g., classification, detection, and segmentation. The performance is on par with or even better than the strong baseline Swin Transformer. These results suggest that the attention mechanism might not be the vital factor that makes ViT successful. It can be even replaced by a zero-parameter operation. We should pay more attentions to the remaining parts of ViT in the future work. Code is available at github.com/microsoft/SPACH.

* accepted by AAAI-22

Via

Access Paper or Ask Questions

Learning Tracking Representations via Dual-Branch Fully Transformer Networks

Dec 05, 2021

Fei Xie, Chunyu Wang, Guangting Wang, Wankou Yang, Wenjun Zeng

Figure 1 for Learning Tracking Representations via Dual-Branch Fully Transformer Networks

Figure 2 for Learning Tracking Representations via Dual-Branch Fully Transformer Networks

Figure 3 for Learning Tracking Representations via Dual-Branch Fully Transformer Networks

Figure 4 for Learning Tracking Representations via Dual-Branch Fully Transformer Networks

Abstract:We present a Siamese-like Dual-branch network based on solely Transformers for tracking. Given a template and a search image, we divide them into non-overlapping patches and extract a feature vector for each patch based on its matching results with others within an attention window. For each token, we estimate whether it contains the target object and the corresponding size. The advantage of the approach is that the features are learned from matching, and ultimately, for matching. So the features are aligned with the object tracking task. The method achieves better or comparable results as the best-performing methods which first use CNN to extract features and then use Transformer to fuse them. It outperforms the state-of-the-art methods on the GOT-10k and VOT2020 benchmarks. In addition, the method achieves real-time inference speed (about $40$ fps) on one GPU. The code and models will be released.

* ICCV21 Workshops

Via

Access Paper or Ask Questions

Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

Sep 12, 2021

Chuanxin Tang, Yucheng Zhao, Guangting Wang, Chong Luo, Wenxuan Xie, Wenjun Zeng

Figure 1 for Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

Figure 2 for Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

Figure 3 for Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

Figure 4 for Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

Abstract:Transformers have sprung up in the field of computer vision. In this work, we explore whether the core self-attention module in Transformer is the key to achieving excellent performance in image recognition. To this end, we build an attention-free network called sMLPNet based on the existing MLP-based vision models. Specifically, we replace the MLP module in the token-mixing step with a novel sparse MLP (sMLP) module. For 2D image tokens, sMLP applies 1D MLP along the axial directions and the parameters are shared among rows or columns. By sparse connection and weight sharing, sMLP module significantly reduces the number of model parameters and computational complexity, avoiding the common over-fitting problem that plagues the performance of MLP-like models. When only trained on the ImageNet-1K dataset, the proposed sMLPNet achieves 81.9% top-1 accuracy with only 24M parameters, which is much better than most CNNs and vision Transformers under the same model size constraint. When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer. The success of sMLPNet suggests that the self-attention mechanism is not necessarily a silver bullet in computer vision. Code will be made publicly available.

Via

Access Paper or Ask Questions

A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

Aug 30, 2021

Yucheng Zhao, Guangting Wang, Chuanxin Tang, Chong Luo, Wenjun Zeng, Zheng-Jun Zha

Figure 1 for A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

Figure 2 for A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

Figure 3 for A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

Figure 4 for A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

Abstract:Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends as they showed promising results in the ImageNet classification task. In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH which adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up. Based on our findings, we propose two hybrid models using convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPS. It is already on par with the SOTA models with sophisticated designs. The code and models will be made publicly available.

Via

Access Paper or Ask Questions

Self-Supervised Visual Representations Learning by Contrastive Mask Prediction

Aug 18, 2021

Yucheng Zhao, Guangting Wang, Chong Luo, Wenjun Zeng, Zheng-Jun Zha

Figure 1 for Self-Supervised Visual Representations Learning by Contrastive Mask Prediction

Figure 2 for Self-Supervised Visual Representations Learning by Contrastive Mask Prediction

Figure 3 for Self-Supervised Visual Representations Learning by Contrastive Mask Prediction

Figure 4 for Self-Supervised Visual Representations Learning by Contrastive Mask Prediction

Abstract:Advanced self-supervised visual representation learning methods rely on the instance discrimination (ID) pretext task. We point out that the ID task has an implicit semantic consistency (SC) assumption, which may not hold in unconstrained datasets. In this paper, we propose a novel contrastive mask prediction (CMP) task for visual representation learning and design a mask contrast (MaskCo) framework to implement the idea. MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify the positive sample without any assumptions. To solve the domain gap between masked and unmasked features, we design a dedicated mask prediction head in MaskCo. This module is shown to be the key to the success of the CMP. We evaluated MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2. Results show that MaskCo achieves comparable performance with MoCo V2 using ImageNet training dataset, but demonstrates a stronger performance across a range of downstream tasks when COCO or Conceptual Captions are used for training. MaskCo provides a promising alternative to the ID-based methods for self-supervised learning in the wild.

* Accepted to ICCV 2021

Via

Access Paper or Ask Questions

Unsupervised Visual Representation Learning by Tracking Patches in Video

May 06, 2021

Guangting Wang, Yizhou Zhou, Chong Luo, Wenxuan Xie, Wenjun Zeng, Zhiwei Xiong

Figure 1 for Unsupervised Visual Representation Learning by Tracking Patches in Video

Figure 2 for Unsupervised Visual Representation Learning by Tracking Patches in Video

Figure 3 for Unsupervised Visual Representation Learning by Tracking Patches in Video

Figure 4 for Unsupervised Visual Representation Learning by Tracking Patches in Video

Abstract:Inspired by the fact that human eyes continue to develop tracking ability in early and middle childhood, we propose to use tracking as a proxy task for a computer vision system to learn the visual representations. Modelled on the Catch game played by the children, we design a Catch-the-Patch (CtP) game for a 3D-CNN model to learn visual representations that would help with video-related tasks. In the proposed pretraining framework, we cut an image patch from a given video and let it scale and move according to a pre-set trajectory. The proxy task is to estimate the position and size of the image patch in a sequence of video frames, given only the target bounding box in the first frame. We discover that using multiple image patches simultaneously brings clear benefits. We further increase the difficulty of the game by randomly making patches invisible. Extensive experiments on mainstream benchmarks demonstrate the superior performance of CtP against other video pretraining methods. In addition, CtP-pretrained features are less sensitive to domain gaps than those trained by a supervised action recognition task. When both trained on Kinetics-400, we are pleasantly surprised to find that CtP-pretrained representation achieves much higher action classification accuracy than its fully supervised counterpart on Something-Something dataset. Code is available online: github.com/microsoft/CtP.

* To appear in CVPR'21. Code available at github.com/microsoft/CtP

Via

Access Paper or Ask Questions

Tracking by Instance Detection: A Meta-Learning Approach

Apr 02, 2020

Guangting Wang, Chong Luo, Xiaoyan Sun, Zhiwei Xiong, Wenjun Zeng

Figure 1 for Tracking by Instance Detection: A Meta-Learning Approach

Figure 2 for Tracking by Instance Detection: A Meta-Learning Approach

Figure 3 for Tracking by Instance Detection: A Meta-Learning Approach

Figure 4 for Tracking by Instance Detection: A Meta-Learning Approach

Abstract:We consider the tracking problem as a special type of object detection problem, which we call instance detection. With proper initialization, a detector can be quickly converted into a tracker by learning the new instance from a single image. We find that model-agnostic meta-learning (MAML) offers a strategy to initialize the detector that satisfies our needs. We propose a principled three-step approach to build a high-performance tracker. First, pick any modern object detector trained with gradient descent. Second, conduct offline training (or initialization) with MAML. Third, perform domain adaptation using the initial frame. We follow this procedure to build two trackers, named Retina-MAML and FCOS-MAML, based on two modern detectors RetinaNet and FCOS. Evaluations on four benchmarks show that both trackers are competitive against state-of-the-art trackers. On OTB-100, Retina-MAML achieves the highest ever AUC of 0.712. On TrackingNet, FCOS-MAML ranks the first on the leader board with an AUC of 0.757 and the normalized precision of 0.822. Both trackers run in real-time at 40 FPS.

* This paper has been accepted by CVPR'20 as an oral

Via

Access Paper or Ask Questions