Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Cheng-Yang Fu

FACET: Fairness in Computer Vision Evaluation Benchmark

Aug 31, 2023

Laura Gustafson, Chloe Rolland, Nikhila Ravi, Quentin Duval, Aaron Adcock, Cheng-Yang Fu, Melissa Hall, Candace Ross

Figure 1 for FACET: Fairness in Computer Vision Evaluation Benchmark

Figure 2 for FACET: Fairness in Computer Vision Evaluation Benchmark

Figure 3 for FACET: Fairness in Computer Vision Evaluation Benchmark

Figure 4 for FACET: Fairness in Computer Vision Evaluation Benchmark

Abstract:Computer vision models have known performance disparities across attributes such as gender and skin tone. This means during tasks such as classification and detection, model performance differs for certain classes based on the demographics of the people in the image. These disparities have been shown to exist, but until now there has not been a unified approach to measure these differences for common use-cases of computer vision models. We present a new benchmark named FACET (FAirness in Computer Vision EvaluaTion), a large, publicly available evaluation set of 32k images for some of the most common vision tasks - image classification, object detection and segmentation. For every image in FACET, we hired expert reviewers to manually annotate person-related attributes such as perceived skin tone and hair type, manually draw bounding boxes and label fine-grained person-related classes such as disk jockey or guitarist. In addition, we use FACET to benchmark state-of-the-art vision models and present a deeper understanding of potential performance disparities and challenges across sensitive demographic attributes. With the exhaustive annotations collected, we probe models using single demographics attributes as well as multiple attributes using an intersectional approach (e.g. hair color and perceived skin tone). Our results show that classification, detection, segmentation, and visual grounding models exhibit performance disparities across demographic attributes and intersections of attributes. These harms suggest that not all people represented in datasets receive fair and equitable treatment in these vision tasks. We hope current and future results using our benchmark will contribute to fairer, more robust vision models. FACET is available publicly at https://facet.metademolab.com/

Via

Access Paper or Ask Questions

Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

Nov 23, 2022

Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su, William Yang Wang, Sean Bell

Abstract:Generating a video given the first several static frames is challenging as it anticipates reasonable future frames with temporal coherence. Besides video prediction, the ability to rewind from the last frame or infilling between the head and tail is also crucial, but they have rarely been explored for video completion. Since there could be different outcomes from the hints of just a few frames, a system that can follow natural language to perform video completion may significantly improve controllability. Inspired by this, we introduce a novel task, text-guided video completion (TVC), which requests the model to generate a video from partial frames guided by an instruction. We then propose Multimodal Masked Video Generation (MMVG) to address this TVC task. During training, MMVG discretizes the video frames into visual tokens and masks most of them to perform video completion from any time point. At inference time, a single MMVG model can address all 3 cases of TVC, including video prediction, rewind, and infilling, by applying corresponding masking conditions. We evaluate MMVG in various video scenarios, including egocentric, animation, and gaming. Extensive experimental results indicate that MMVG is effective in generating high-quality visual appearances with text guidance for TVC.

Via

Access Paper or Ask Questions

Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization

Nov 18, 2022

Mengmeng Xu, Yanghao Li, Cheng-Yang Fu, Bernard Ghanem, Tao Xiang, Juan-Manuel Perez-Rua

Abstract:This paper deals with the problem of localizing objects in image and video datasets from visual exemplars. In particular, we focus on the challenging problem of egocentric visual query localization. We first identify grave implicit biases in current query-conditioned model design and visual query datasets. Then, we directly tackle such biases at both frame and object set levels. Concretely, our method solves these issues by expanding limited annotations and dynamically dropping object proposals during training. Additionally, we propose a novel transformer-based module that allows for object-proposal set context to be considered while incorporating query information. We name our module Conditioned Contextual Transformer or CocoFormer. Our experiments show the proposed adaptations improve egocentric query detection, leading to a better visual query localization system in both 2D and 3D configurations. Thus, we are able to improve frame-level detection performance from 26.28% to 31.26 in AP, which correspondingly improves the VQ2D and VQ3D localization scores by significant margins. Our improved context-aware query object detector ranked first and second in the VQ2D and VQ3D tasks in the 2nd Ego4D challenge. In addition to this, we showcase the relevance of our proposed model in the Few-Shot Detection (FSD) task, where we also achieve SOTA results. Our code is available at https://github.com/facebookresearch/vq2d_cvpr.

* We ranked first and second in the VQ2D and VQ3D tasks in the 2nd Ego4D challenge

Via

Access Paper or Ask Questions

Token Merging: Your ViT But Faster

Oct 17, 2022

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman

Figure 1 for Token Merging: Your ViT But Faster

Figure 2 for Token Merging: Your ViT But Faster

Figure 3 for Token Merging: Your ViT But Faster

Figure 4 for Token Merging: Your ViT But Faster

Abstract:We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving in practice training speed up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with state-of-the-art on images, video, and audio.

* Preprint. Code will be available here: https://github.com/facebookresearch/ToMe

Via

Access Paper or Ask Questions

Hydra Attention: Efficient Attention with Many Heads

Sep 15, 2022

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Judy Hoffman

Figure 1 for Hydra Attention: Efficient Attention with Many Heads

Figure 2 for Hydra Attention: Efficient Attention with Many Heads

Figure 3 for Hydra Attention: Efficient Attention with Many Heads

Figure 4 for Hydra Attention: Efficient Attention with Many Heads

Abstract:While transformers have begun to dominate many tasks in vision, applying them to large images is still computationally difficult. A large reason for this is that self-attention scales quadratically with the number of tokens, which in turn, scales quadratically with the image size. On larger images (e.g., 1080p), over 60% of the total computation in the network is spent solely on creating and applying attention matrices. We take a step toward solving this issue by introducing Hydra Attention, an extremely efficient attention operation for Vision Transformers (ViTs). Paradoxically, this efficiency comes from taking multi-head attention to its extreme: by using as many attention heads as there are features, Hydra Attention is computationally linear in both tokens and features with no hidden constants, making it significantly faster than standard self-attention in an off-the-shelf ViT-B/16 by a factor of the token count. Moreover, Hydra Attention retains high accuracy on ImageNet and, in some cases, actually improves it.

* Accepted CADL 2022 (ECCV Workshop)

Via

Access Paper or Ask Questions

Negative Frames Matter in Egocentric Visual Query 2D Localization

Aug 03, 2022

Mengmeng Xu, Cheng-Yang Fu, Yanghao Li, Bernard Ghanem, Juan-Manuel Perez-Rua, Tao Xiang

Figure 1 for Negative Frames Matter in Egocentric Visual Query 2D Localization

Figure 2 for Negative Frames Matter in Egocentric Visual Query 2D Localization

Figure 3 for Negative Frames Matter in Egocentric Visual Query 2D Localization

Figure 4 for Negative Frames Matter in Egocentric Visual Query 2D Localization

Abstract:The recently released Ego4D dataset and benchmark significantly scales and diversifies the first-person visual perception data. In Ego4D, the Visual Queries 2D Localization task aims to retrieve objects appeared in the past from the recording in the first-person view. This task requires a system to spatially and temporally localize the most recent appearance of a given object query, where query is registered by a single tight visual crop of the object in a different scene. Our study is based on the three-stage baseline introduced in the Episodic Memory benchmark. The baseline solves the problem by detection and tracking: detect the similar objects in all the frames, then run a tracker from the most confident detection result. In the VQ2D challenge, we identified two limitations of the current baseline. (1) The training configuration has redundant computation. Although the training set has millions of instances, most of them are repetitive and the number of unique object is only around 14.6k. The repeated gradient computation of the same object lead to an inefficient training; (2) The false positive rate is high on background frames. This is due to the distribution gap between training and evaluation. During training, the model is only able to see the clean, stable, and labeled frames, but the egocentric videos also have noisy, blurry, or unlabeled background frames. To this end, we developed a more efficient and effective solution. Concretely, we bring the training loop from ~15 days to less than 24 hours, and we achieve 0.17% spatial-temporal AP, which is 31% higher than the baseline. Our solution got the first ranking on the public leaderboard. Our code is publicly available at https://github.com/facebookresearch/vq2d_cvpr.

* First place winning solution for VQ2D task in CVPR-2022 Ego4D Challenge. Our code is publicly available at https://github.com/facebookresearch/vq2d_cvpr

Via

Access Paper or Ask Questions

End-to-End Visual Editing with a Generatively Pre-Trained Artist

May 03, 2022

Andrew Brown, Cheng-Yang Fu, Omkar Parkhi, Tamara L. Berg, Andrea Vedaldi

Figure 1 for End-to-End Visual Editing with a Generatively Pre-Trained Artist

Figure 2 for End-to-End Visual Editing with a Generatively Pre-Trained Artist

Figure 3 for End-to-End Visual Editing with a Generatively Pre-Trained Artist

Figure 4 for End-to-End Visual Editing with a Generatively Pre-Trained Artist

Abstract:We consider the targeted image editing problem: blending a region in a source image with a driver image that specifies the desired change. Differently from prior works, we solve this problem by learning a conditional probability distribution of the edits, end-to-end. Training such a model requires addressing a fundamental technical challenge: the lack of example edits for training. To this end, we propose a self-supervised approach that simulates edits by augmenting off-the-shelf images in a target domain. The benefits are remarkable: implemented as a state-of-the-art auto-regressive transformer, our approach is simple, sidesteps difficulties with previous methods based on GAN-like priors, obtains significantly better edits, and is efficient. Furthermore, we show that different blending effects can be learned by an intuitive control of the augmentation process, with no other changes required to the model architecture. We demonstrate the superiority of this approach across several datasets in extensive quantitative and qualitative experiments, including human studies, significantly outperforming prior work.

Via

Access Paper or Ask Questions

IMP: Instance Mask Projection for High Accuracy Semantic Segmentation of Things

Jun 15, 2019

Cheng-Yang Fu, Tamara L. Berg, Alexander C. Berg

Figure 1 for IMP: Instance Mask Projection for High Accuracy Semantic Segmentation of Things

Figure 2 for IMP: Instance Mask Projection for High Accuracy Semantic Segmentation of Things

Figure 3 for IMP: Instance Mask Projection for High Accuracy Semantic Segmentation of Things

Figure 4 for IMP: Instance Mask Projection for High Accuracy Semantic Segmentation of Things

Abstract:In this work, we present a new operator, called Instance Mask Projection (IMP), which projects a predicted Instance Segmentation as a new feature for semantic segmentation. It also supports back propagation so is trainable end-to-end. Our experiments show the effectiveness of IMP on both Clothing Parsing (with complex layering, large deformations, and non-convex objects), and on Street Scene Segmentation (with many overlapping instances and small objects). On the Varied Clothing Parsing dataset (VCP), we show instance mask projection can improve 3 points on mIOU from a state-of-the-art Panoptic FPN segmentation approach. On the ModaNet clothing parsing dataset, we show a dramatic improvement of 20.4% absolutely compared to existing baseline semantic segmentation results. In addition, the instance mask projection operator works well on other (non-clothing) datasets, providing an improvement of 3 points in mIOU on Thing classes of Cityscapes, a self-driving dataset, on top of a state-of-the-art approach.

Via

Access Paper or Ask Questions

RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free

Jan 10, 2019

Cheng-Yang Fu, Mykhailo Shvets, Alexander C. Berg

Figure 1 for RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free

Figure 2 for RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free

Figure 3 for RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free

Figure 4 for RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free

Abstract:Recently two-stage detectors have surged ahead of single-shot detectors in the accuracy-vs-speed trade-off. Nevertheless single-shot detectors are immensely popular in embedded vision applications. This paper brings single-shot detectors up to the same level as current two-stage techniques. We do this by improving training for the state-of-the-art single-shot detector, RetinaNet, in three ways: integrating instance mask prediction for the first time, making the loss function adaptive and more stable, and including additional hard examples in training. We call the resulting augmented network RetinaMask. The detection component of RetinaMask has the same computational cost as the original RetinaNet, but is more accurate. COCO test-dev results are up to 41.4 mAP for RetinaMask-101 vs 39.1mAP for RetinaNet-101, while the runtime is the same during evaluation. Adding Group Normalization increases the performance of RetinaMask-101 to 41.7 mAP. Code is at:https://github.com/chengyangfu/retinamask

Via

Access Paper or Ask Questions

Target Driven Instance Detection

Jul 17, 2018

Phil Ammirato, Cheng-Yang Fu, Mykhailo Shvets, Jana Kosecka, Alexander C. Berg

Figure 1 for Target Driven Instance Detection

Figure 2 for Target Driven Instance Detection

Figure 3 for Target Driven Instance Detection

Figure 4 for Target Driven Instance Detection

Abstract:While state-of-the-art general object detectors are getting better and better, there are not many systems specifically designed to take advantage of the instance detection problem. For many applications, such as household robotics, a system may need to recognize a few very specific instances at a time. Speed can be critical in these applications, as can the need to recognize previously unseen instances. We introduce a Target Driven Instance Detector(TDID), which modifies existing general object detectors for the instance recognition setting. TDID not only improves performance on instances seen during training, with a fast runtime, but is also able to generalize to detect novel instances.

Via

Access Paper or Ask Questions