Zhizhong Li

THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models

May 08, 2024

Prannay Kaul, Zhizhong Li, Hao Yang, Yonatan Dukler, Ashwin Swaminathan, C. J. Taylor, Stefano Soatto

Abstract: Mitigating hallucinations in large vision-language models (LVLMs) remains an open problem. Recent benchmarks do not address hallucinations in open-ended free-form responses, which we term "Type I hallucinations". Instead, they focus on hallucinations in responses to very specific question formats -- typically a multiple-choice response regarding a particular object or attribute -- which we term "Type II hallucinations". Additionally, such benchmarks often require external API calls to models which are subject to change. In practice, we observe that a reduction in Type II hallucinations does not lead to a reduction in Type I hallucinations but rather that the two forms of hallucinations are often anti-correlated. To address this, we propose THRONE, a novel object-based automatic framework for quantitatively evaluating Type I hallucinations in LVLM free-form outputs. We use public language models (LMs) to identify hallucinations in LVLM responses and compute informative metrics. By evaluating a large selection of recent LVLMs using public datasets, we show that improvements in existing metrics do not lead to a reduction in Type I hallucinations, and that established benchmarks for measuring Type I hallucinations are incomplete. Finally, we provide a simple and effective data augmentation method to reduce Type I and Type II hallucinations as a strong baseline.

* In CVPR 2024

Get the Best of Both Worlds: Improving Accuracy and Transferability by Grassmann Class Representation

Aug 03, 2023

Haoqi Wang, Zhizhong Li, Wayne Zhang

Abstract: We generalize the class vectors found in neural networks to linear subspaces (i.e., points in the Grassmann manifold) and show that the Grassmann Class Representation (GCR) enables the simultaneous improvement in accuracy and feature transferability. In GCR, each class is a subspace and the logit is defined as the norm of the projection of a feature onto the class subspace. We integrate Riemannian SGD into deep learning frameworks such that class subspaces in a Grassmannian are jointly optimized with the rest of the model parameters. Compared to the vector form, subspaces have greater representational capacity. We show that on ImageNet-1K, the top-1 errors of ResNet50-D, ResNeXt50, Swin-T and Deit3-S are reduced by 5.6%, 4.5%, 3.0% and 3.5%, respectively. Subspaces also provide freedom for features to vary, and we observed that the intra-class feature variability grows when the subspace dimension increases. Consequently, we found that the quality of GCR features is better for downstream tasks. For ResNet50-D, the average linear transfer accuracy across 6 datasets improves from 77.98% to 79.70% compared to the strong baseline of vanilla softmax. For Swin-T, it improves from 81.5% to 83.4% and for Deit3, it improves from 73.8% to 81.4%. With these encouraging results, we believe that more applications could benefit from the Grassmann class representation. Code is released at https://github.com/innerlee/GCR.

* ICCV 2023
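
The logit definition in the abstract above (the norm of the projection of a feature onto a class subspace) can be illustrated with a minimal NumPy sketch. This is not the released GCR code: the function names, the random subspaces, and the dimensions are assumptions chosen for illustration, and the Riemannian SGD training step is not shown.

```python
import numpy as np

def orthonormal_basis(rng, dim, k):
    """Random orthonormal basis (dim x k) of a k-dimensional subspace, via QR.
    Hypothetical helper: in practice the class subspaces would be learned."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, k)))
    return q

def subspace_logits(feature, bases):
    """One logit per class: the norm of the projection of `feature` onto
    that class's subspace.

    bases has shape (num_classes, dim, k) with orthonormal columns. For an
    orthonormal basis S the projection is S S^T x, and its norm equals
    the norm of the coordinates S^T x.
    """
    coords = np.einsum("cdk,d->ck", bases, feature)  # S_c^T x for every class c
    return np.linalg.norm(coords, axis=1)

rng = np.random.default_rng(0)
dim, k, num_classes = 16, 4, 3  # illustrative sizes, not from the paper
bases = np.stack([orthonormal_basis(rng, dim, k) for _ in range(num_classes)])

# A feature lying inside class 1's subspace keeps its full norm under that
# projection, so class 1 receives the largest logit.
feature = bases[1] @ rng.standard_normal(k)
logits = subspace_logits(feature, bases)
predicted_class = int(logits.argmax())
```

With k = 1 each logit reduces to the absolute value of the inner product between the feature and a unit class vector, which is one way to see the subspace form as a generalization of class vectors.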

Collaborative Anomaly Detection

Sep 20, 2022

Ke Bai, Aonan Zhang, Zhizhong Li, Ricardo Henao, Chong Wang, Lawrence Carin

Abstract: In recommendation systems, items are likely to be exposed to various users and we would like to learn about the familiarity of a new user with an existing item. This can be formulated as an anomaly detection (AD) problem distinguishing between "common users" (nominal) and "fresh users" (anomalous). Considering the sheer volume of items and the sparsity of user-item paired data, independently applying conventional single-task detection methods on each item quickly becomes difficult, while correlations between items are ignored. To address this multi-task anomaly detection problem, we propose collaborative anomaly detection (CAD) to jointly learn all tasks with an embedding encoding correlations among tasks. We explore CAD with conditional density estimation and conditional likelihood ratio estimation. We found that: (i) estimating a likelihood ratio enjoys more efficient learning and yields better results than density estimation; (ii) it is beneficial to select a small number of tasks in advance to learn a task embedding model, and then use it to warm-start all task embeddings. Consequently, these embeddings can capture correlations between tasks and generalize to new correlated tasks.

Class-Incremental Learning with Strong Pre-trained Models

Apr 07, 2022

Tz-Ying Wu, Gurumurthy Swaminathan, Zhizhong Li, Avinash Ravichandran, Nuno Vasconcelos, Rahul Bhotika, Stefano Soatto

Abstract: Class-incremental learning (CIL) has been widely studied under the setting of starting from a small number of classes (base classes). Instead, we explore an understudied real-world setting of CIL that starts with a strong model pre-trained on a large number of base classes. We hypothesize that a strong base model can provide a good representation for novel classes and incremental learning can be done with small adaptations. We propose a 2-stage training scheme: i) feature augmentation -- cloning part of the backbone and fine-tuning it on the novel data, and ii) fusion -- combining the base and novel classifiers into a unified classifier. Experiments show that the proposed method significantly outperforms state-of-the-art CIL methods on the large-scale ImageNet dataset (e.g., +10% overall accuracy over the best). We also propose and analyze understudied practical CIL scenarios, such as base-novel overlap with distribution shift. Our proposed method is robust and generalizes to all analyzed CIL settings.

* Accepted at CVPR 2022, code to be released soon

ViM: Out-Of-Distribution with Virtual-logit Matching

Mar 21, 2022

Haoqi Wang, Zhizhong Li, Litong Feng, Wayne Zhang

Abstract: Most of the existing Out-Of-Distribution (OOD) detection algorithms depend on a single input source: the feature, the logit, or the softmax probability. However, the immense diversity of the OOD examples makes such methods fragile. There are OOD samples that are easy to identify in the feature space while hard to distinguish in the logit space and vice versa. Motivated by this observation, we propose a novel OOD scoring method named Virtual-logit Matching (ViM), which combines the class-agnostic score from feature space and the In-Distribution (ID) class-dependent logits. Specifically, an additional logit representing the virtual OOD class is generated from the residual of the feature against the principal space, and then matched with the original logits by a constant scaling. The probability of this virtual logit after softmax is the indicator of OOD-ness. To facilitate the evaluation of large-scale OOD detection in academia, we create a new OOD dataset for ImageNet-1K, which is human-annotated and is 8.8x the size of existing datasets. We conducted extensive experiments, including CNNs and vision transformers, to demonstrate the effectiveness of the proposed ViM score. In particular, using the BiT-S model, our method achieves an average AUROC of 90.91% on four difficult OOD benchmarks, which is 4% ahead of the best baseline. Code and dataset are available at https://github.com/haoqiwang/vim.

* CVPR 2022
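
The scoring recipe in the abstract above (a virtual logit built from the feature residual against a principal space, rescaled by a constant, then read off after softmax) might look roughly like the following NumPy sketch. It is not the code from the linked repository. Several details are assumptions made for illustration: the principal space is taken from an SVD of mean-centered training features, the scaling constant is the ratio of the mean maximum training logit to the mean training residual norm, and all function names and the synthetic data are hypothetical.

```python
import numpy as np

def residual_norm(feats, origin, principal):
    """Norm of the part of each centered feature outside the principal subspace."""
    centered = feats - origin
    projected = centered @ principal @ principal.T
    return np.linalg.norm(centered - projected, axis=1)

def fit_virtual_logit(train_feats, train_logits, principal_dim):
    """Estimate the principal subspace and the scaling constant from ID data.
    Assumptions: mean-centering, and alpha = mean max logit / mean residual norm."""
    origin = train_feats.mean(axis=0)
    _, _, vt = np.linalg.svd(train_feats - origin, full_matrices=False)
    principal = vt[:principal_dim].T  # (dim, principal_dim), orthonormal columns
    resid = residual_norm(train_feats, origin, principal)
    alpha = train_logits.max(axis=1).mean() / resid.mean()
    return origin, principal, alpha

def ood_score(feats, logits, origin, principal, alpha):
    """Softmax probability of the appended virtual logit (higher = more OOD)."""
    virtual = alpha * residual_norm(feats, origin, principal)
    all_logits = np.concatenate([logits, virtual[:, None]], axis=1)
    all_logits = all_logits - all_logits.max(axis=1, keepdims=True)  # stability
    probs = np.exp(all_logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, -1]

# Synthetic demo: ID features lie near a 3-dimensional subspace of R^8.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((8, 3)))
weights = rng.standard_normal((8, 5))  # hypothetical linear classifier head

def sample_id(n):
    return rng.standard_normal((n, 3)) @ basis.T + 0.01 * rng.standard_normal((n, 8))

train_feats = sample_id(200)
origin, principal, alpha = fit_virtual_logit(train_feats, train_feats @ weights, 3)

id_feats = sample_id(50)
off = rng.standard_normal(8)
off -= basis @ (basis.T @ off)  # keep only the off-subspace component
ood_feats = (3.0 * off / np.linalg.norm(off))[None, :]

id_scores = ood_score(id_feats, id_feats @ weights, origin, principal, alpha)
ood_scores = ood_score(ood_feats, ood_feats @ weights, origin, principal, alpha)
```

In this toy setup the off-subspace sample receives a score close to 1, while ID samples score lower because their residuals are small relative to their class logits.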

MMOCR: A Comprehensive Toolbox for Text Detection, Recognition and Understanding

Aug 14, 2021

Zhanghui Kuang, Hongbin Sun, Zhizhong Li, Xiaoyu Yue, Tsui Hin Lin, Jianyong Chen, Huaqiang Wei, Yiqin Zhu, Tong Gao, Wenwei Zhang (+3 more)

Abstract: We present MMOCR, an open-source toolbox that provides a comprehensive pipeline for text detection and recognition, as well as their downstream tasks such as named entity recognition and key information extraction. MMOCR implements 14 state-of-the-art algorithms, which is significantly more than all the existing open-source OCR projects we are aware of to date. To facilitate future research and industrial applications of text recognition-related problems, we also provide a large number of trained models and detailed benchmarks to give insights into the performance of text detection, recognition and understanding. MMOCR is publicly released at https://github.com/open-mmlab/mmocr.

* Accepted to ACM MM (Open Source Competition Track)

Representation Consolidation for Training Expert Students

Jul 16, 2021

Zhizhong Li, Avinash Ravichandran, Charless Fowlkes, Marzia Polito, Rahul Bhotika, Stefano Soatto

Abstract: Traditionally, distillation has been used to train a student model to emulate the input/output functionality of a teacher. A more useful goal than emulation, yet under-explored, is for the student to learn feature representations that transfer well to future tasks. However, we observe that standard distillation of task-specific teachers actually *reduces* the transferability of student representations to downstream tasks. We show that a multi-head, multi-task distillation method using an unlabeled proxy dataset and a generalist teacher is sufficient to consolidate representations from task-specific teacher(s) and improve downstream performance, outperforming the teacher(s) and the strong baseline of ImageNet pretrained features. Our method can also combine the representational knowledge of multiple teachers trained on one or multiple domains into a single model, whose representation is improved on all teachers' domain(s).

Learning Curves for Analysis of Deep Networks

Oct 21, 2020

Derek Hoiem, Tanmay Gupta, Zhizhong Li, Michal M. Shlapentokh-Rothman

Abstract: A learning curve models a classifier's test error as a function of the number of training samples. Prior works show that learning curves can be used to select model parameters and extrapolate performance. We investigate how to use learning curves to analyze the impact of design choices, such as pre-training, architecture, and data augmentation. We propose a method to robustly estimate learning curves, abstract their parameters into error and data-reliance, and evaluate the effectiveness of different parameterizations. We also provide several interesting observations based on learning curves for a variety of image classification models.
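
As a rough illustration of fitting and extrapolating a learning curve of the kind described above, the sketch below fits test error as a function of training-set size by ordinary least squares. The abstract does not state a functional form, so the power law e(n) = alpha + eta * n^gamma with a fixed exponent gamma = -0.5 is an assumption made here, as are the function names and the synthetic measurements. It is not the robust estimation method proposed in the paper.

```python
import numpy as np

def fit_learning_curve(num_samples, errors, gamma=-0.5):
    """Least-squares fit of e(n) = alpha + eta * n**gamma for a fixed gamma.
    The power-law form and the default gamma are illustrative assumptions."""
    n = np.asarray(num_samples, dtype=float)
    design = np.column_stack([np.ones_like(n), n ** gamma])
    (alpha, eta), *_ = np.linalg.lstsq(design, np.asarray(errors, dtype=float), rcond=None)
    return alpha, eta

def predict_error(num_samples, alpha, eta, gamma=-0.5):
    """Evaluate the fitted curve, e.g. to extrapolate to larger training sets."""
    return alpha + eta * np.asarray(num_samples, dtype=float) ** gamma

# Synthetic measurements generated from a known curve (alpha=0.10, eta=2.0).
sizes = np.array([100, 200, 400, 800, 1600])
observed = 0.10 + 2.0 * sizes ** -0.5

alpha, eta = fit_learning_curve(sizes, observed)
extrapolated = predict_error(6400, alpha, eta)
```

Under this parameterization alpha plays the role of the error approached with unlimited data, and eta controls how strongly the error depends on the amount of training data.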

Regularizing Reasons for Outfit Evaluation with Gradient Penalty

Feb 02, 2020

Xingxing Zou, Zhizhong Li, Ke Bai, Dahua Lin, Waikeung Wong

Abstract: In this paper, we build an outfit evaluation system which provides feedback consisting of a judgment with a convincing explanation. The system is trained in a supervised manner which faithfully follows the domain knowledge in fashion. We create the EVALUATION3 dataset which is annotated with judgment, the decisive reason for the judgment, and all corresponding attributes (e.g., print, silhouette, and material). In the training process, features of all attributes in an outfit are first extracted and then concatenated as the input for the intra-factor compatibility net. Then, the inter-factor compatibility net is used to compute the loss for judgment. We penalize the gradient of the judgment loss so that our Grad-CAM-like reason is regularized to be consistent with the labeled reason. In inference, according to the obtained information of judgment, reason, and attributes, a user-friendly explanation sentence is generated by the pre-defined templates. The experimental results show that the obtained network combines the advantages of high precision and good interpretation.

* 10 pages

Dreaming to Distill: Data-free Knowledge Transfer via DeepInversion

Dec 18, 2019

Hongxu Yin, Pavlo Molchanov, Zhizhong Li, Jose M. Alvarez, Arun Mallya, Derek Hoiem, Niraj K. Jha, Jan Kautz

Abstract: We introduce DeepInversion, a new method for synthesizing images from the image distribution used to train a deep neural network. We 'invert' a trained network (teacher) to synthesize class-conditional input images starting from random noise, without using any additional information about the training dataset. Keeping the teacher fixed, our method optimizes the input while regularizing the distribution of intermediate feature maps using information stored in the batch normalization layers of the teacher. Further, we improve the diversity of synthesized images using Adaptive DeepInversion, which maximizes the Jensen-Shannon divergence between the teacher and student network logits. The resulting synthesized images from networks trained on the CIFAR-10 and ImageNet datasets demonstrate high fidelity and degree of realism, and help enable a new breed of data-free applications - ones that do not require any real images or labeled data. We demonstrate the applicability of our proposed method to three tasks of immense practical importance -- (i) data-free network pruning, (ii) data-free knowledge transfer, and (iii) data-free continual learning.
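
Two quantities named in the abstract above can be sketched in NumPy: a penalty that compares the statistics of intermediate feature maps with the statistics stored in a batch normalization layer, and the Jensen-Shannon divergence between teacher and student predictions. The exact form of the penalty (L2 distance between per-channel means plus L2 distance between per-channel variances, for a single layer) is an assumption for illustration, as are the function names and the toy data; the input optimization loop itself is not shown.

```python
import numpy as np

def bn_statistics_penalty(feature_maps, running_mean, running_var):
    """Distance between batch statistics of feature maps with shape
    (batch, channels, height, width) and the stored BN statistics.
    Assumed form: L2 gap of means plus L2 gap of variances, one layer only."""
    batch_mean = feature_maps.mean(axis=(0, 2, 3))
    batch_var = feature_maps.var(axis=(0, 2, 3))
    return (np.linalg.norm(batch_mean - running_mean)
            + np.linalg.norm(batch_var - running_var))

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def js_divergence(teacher_logits, student_logits, eps=1e-12):
    """Jensen-Shannon divergence between the two softmax distributions (in nats)."""
    p = np.clip(softmax(teacher_logits), eps, None)
    q = np.clip(softmax(student_logits), eps, None)
    m = 0.5 * (p + q)

    def kl(a, b):
        return np.sum(a * (np.log(a) - np.log(b)), axis=-1)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(0)
running_mean = np.array([0.5, -1.0, 2.0])  # hypothetical stored BN statistics
running_var = np.array([1.0, 0.25, 4.0])

# Feature maps drawn to match the stored statistics give a small penalty;
# shifting every activation moves the batch means and raises it.
matched = (rng.standard_normal((64, 3, 8, 8)) * np.sqrt(running_var)[None, :, None, None]
           + running_mean[None, :, None, None])
penalty_matched = bn_statistics_penalty(matched, running_mean, running_var)
penalty_shifted = bn_statistics_penalty(matched + 5.0, running_mean, running_var)

teacher = np.array([[2.0, 0.0, -1.0]])
student = np.array([[-1.0, 0.0, 2.0]])
agreement = js_divergence(teacher, teacher)[0]
disagreement = js_divergence(teacher, student)[0]
```

In a synthesis loop of this kind, the penalty would be minimized with respect to the input image, while the divergence term would be maximized to favor inputs on which teacher and student disagree.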
