Asim Kadav

Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory

Mar 17, 2025

Saket Gurukar, Asim Kadav

Abstract: Long-form video understanding is essential for various applications such as video retrieval, summarization, and question answering. Yet, traditional approaches demand substantial computing power and are often bottlenecked by GPU memory. To tackle this issue, we present Long-Video Memory Network, Long-VMNet, a novel video understanding method that employs a fixed-size memory representation to store discriminative patches sampled from the input video. Long-VMNet achieves improved efficiency by leveraging a neural sampler that identifies discriminative tokens. Additionally, Long-VMNet only needs one scan through the video, greatly boosting efficiency. Our results on the Rest-ADL dataset demonstrate an 18x to 75x improvement in inference times for long-form video retrieval and question answering, with competitive predictive performance.
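The idea of a fixed-size memory filled in a single scan can be illustrated with a small sketch. This is not the Long-VMNet implementation: the `score` callable is a hypothetical stand-in for the neural sampler, and the name `scan_with_fixed_memory` and the min-heap eviction policy are assumptions made purely for illustration.

```python
import heapq
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def scan_with_fixed_memory(
    patches: Iterable[T],
    score: Callable[[T], float],  # hypothetical stand-in for a learned sampler
    capacity: int,
) -> List[T]:
    """Keep the `capacity` highest-scoring patches seen in one pass."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")

    # Min-heap of (score, arrival_index, patch); the lowest score sits at heap[0].
    # The unique arrival index breaks ties so patches are never compared directly.
    heap: List[Tuple[float, int, T]] = []
    for idx, patch in enumerate(patches):
        item = (score(patch), idx, patch)
        if len(heap) < capacity:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            # Memory is full: evict the current lowest-scoring patch.
            heapq.heapreplace(heap, item)

    # Return the retained patches in their original temporal order.
    return [patch for _, _, patch in sorted(heap, key=lambda t: t[1])]
```

Memory use is bounded by `capacity` regardless of video length, and each patch is visited exactly once, which is the property the abstract highlights.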

Automated Clinical Data Extraction with Knowledge Conditioned LLMs

Jun 26, 2024

Diya Li, Asim Kadav, Aijing Gao, Rui Li, Richard Bourgon

Abstract: The extraction of lung lesion information from clinical and medical imaging reports is crucial for research on and clinical care of lung-related diseases. Large language models (LLMs) can be effective at interpreting unstructured text in reports, but they often hallucinate due to a lack of domain-specific knowledge, leading to reduced accuracy and posing challenges for use in clinical settings. To address this, we propose a novel framework that aligns generated internal knowledge with external knowledge through in-context learning (ICL). Our framework employs a retriever to identify relevant units of internal or external knowledge and a grader to evaluate the truthfulness and helpfulness of the retrieved internal-knowledge rules, to align and update the knowledge bases. Our knowledge-conditioned approach also improves the accuracy and reliability of LLM outputs by addressing the extraction task in two stages: (i) lung lesion finding detection and primary structured field parsing, followed by (ii) further parsing of lesion description text into additional structured fields. Experiments with expert-curated test datasets demonstrate that this ICL approach can increase the F1 score for key fields (lesion size, margin and solidity) by an average of 12.9% over existing ICL methods.

Learning Higher-order Object Interactions for Keypoint-based Video Understanding

May 16, 2023

Yi Huang, Asim Kadav, Farley Lai, Deep Patel, Hans Peter Graf

Abstract: Action recognition is an important problem that requires identifying actions in video by learning complex interactions across scene actors and objects. However, modern deep-learning-based networks often require significant computation, and may capture scene context using various modalities that further increase compute costs. Efficient methods such as those used for AR/VR often only use human-keypoint information but suffer from a loss of scene context that hurts accuracy. In this paper, we describe an action-localization method, KeyNet, that uses only the keypoint data for tracking and action recognition. Specifically, KeyNet introduces the use of object-based keypoint information to capture context in the scene. Our method illustrates how to build a structured intermediate representation that allows modeling higher-order interactions in the scene from object and human keypoints without using any RGB information. We find that KeyNet is able to track and classify human actions at just 5 FPS. More importantly, we demonstrate that object keypoints can be modeled to recover any loss in context from using keypoint information on the AVA action and Kinetics datasets.

* SRVU - ICCV 2021 workshop

Self-supervised Video Representation Learning with Cascade Positive Retrieval

Jan 21, 2022

Cheng-En Wu, Farley Lai, Yu Hen Hu, Asim Kadav

Abstract: Self-supervised video representation learning has been shown to effectively improve downstream tasks such as video retrieval and action recognition. In this paper, we present Cascade Positive Retrieval (CPR), which successively mines positive examples w.r.t. the query for contrastive learning in a cascade of stages. Specifically, CPR exploits multiple views of a query example in different modalities, where an alternative view may help find another positive example dissimilar in the query view. We explore the effects of possible CPR configurations in ablations including the number of mining stages, the top similar example selection ratio in each stage, and progressive training with an incremental number of the final Top-k selection. The overall mining quality is measured to reflect the recall across training set classes. CPR reaches a median class mining recall of 83.3%, outperforming previous work by 5.5%. Implementation-wise, CPR is complementary to pretext tasks and can be easily applied to previous work. In the evaluation of pretraining on UCF101, CPR consistently improves existing work and even achieves state-of-the-art R@1 of 56.7% and 24.4% in video retrieval as well as 83.8% and 54.8% in action recognition on UCF101 and HMDB51. For transfer from the large video dataset Kinetics400 to UCF101 and HMDB51, CPR benefits existing work, showing competitive Top-1 accuracies of 85.1% and 57.4% despite pretraining at a lower resolution and frame sampling rate. The code is available at https://github.com/necla-ml/CPR.
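The cascade of mining stages described in the abstract can be pictured with a toy sketch. It is not taken from the CPR repository: the function name `cascade_mine`, the per-stage similarity dictionaries, and the rounding rule are all hypothetical, and each stage simply stands for "one view's similarity to the query".

```python
import math
from typing import Dict, List, Sequence


def cascade_mine(
    stage_similarities: Sequence[Dict[str, float]],  # one dict per stage/view
    ratios: Sequence[float],                          # fraction kept per stage
    top_k: int,
) -> List[str]:
    """Filter candidates stage by stage, then return the final Top-k."""
    if len(stage_similarities) != len(ratios):
        raise ValueError("need exactly one ratio per stage")
    if not stage_similarities:
        return []

    # Start from every candidate known to the first stage.
    candidates = list(stage_similarities[0])
    for sims, ratio in zip(stage_similarities, ratios):
        # Keep the top `ratio` fraction of survivors (at least one), ranked
        # by similarity to the query in this stage's view.
        keep = max(1, math.ceil(len(candidates) * ratio))
        candidates = sorted(candidates, key=lambda c: sims[c], reverse=True)[:keep]

    # Survivors are ordered by the last stage's similarity.
    return candidates[:top_k]


# Hypothetical similarities of four clips to a query in two views.
view_a = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.1}
view_b = {"a": 0.3, "b": 0.7, "c": 0.9, "d": 0.1}
positives = cascade_mine([view_a, view_b], ratios=[0.5, 0.5], top_k=1)
```

In this toy run, clip "c" scores highest in the second view but was already discarded by the first stage, so the cascade returns "b", the clip that ranks well in both views.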

SplitBrain: Hybrid Data and Model Parallel Deep Learning

Dec 31, 2021

Farley Lai, Asim Kadav, Erik Kruus

Abstract: The recent success of deep learning applications has coincided with the wide availability of powerful computational resources for training sophisticated machine learning models with huge datasets. Nonetheless, training large models such as convolutional neural networks using model parallelism (as opposed to data parallelism) is challenging because the complex nature of communication between model shards makes it difficult to partition the computation efficiently across multiple machines with an acceptable trade-off. This paper presents SplitBrain, a high-performance distributed deep learning framework supporting hybrid data and model parallelism. Specifically, SplitBrain provides layer-specific partitioning that co-locates compute-intensive convolutional layers while sharding memory-demanding layers. A novel scalable group communication is proposed to further improve the training throughput with reduced communication overhead. The results show that SplitBrain can achieve nearly linear speedup while saving up to 67% of memory consumption for data and model parallel VGG over CIFAR-10.

COMPOSER: Compositional Learning of Group Activity in Videos

Dec 11, 2021

Honglu Zhou, Asim Kadav, Aviv Shamsian, Shijie Geng, Farley Lai, Long Zhao, Ting Liu, Mubbasir Kapadia, Hans Peter Graf

Abstract: Group Activity Recognition (GAR) detects the activity performed by a group of actors in a short video clip. The task requires the compositional understanding of scene entities and relational reasoning between them. We approach GAR by modeling the video as a series of tokens that represent the multi-scale semantic concepts in the video. We propose COMPOSER, a Multiscale Transformer-based architecture that performs attention-based reasoning over tokens at each scale and learns group activity compositionally. In addition, we only use the keypoint modality, which reduces scene biases and improves the generalization ability of the model. We improve the multi-scale representations in COMPOSER by clustering the intermediate scale representations, while maintaining consistent cluster assignments between scales. Finally, we use techniques such as auxiliary prediction and novel data augmentations (e.g., Actor Dropout) to aid model training. We demonstrate the model's strength and interpretability on the challenging Volleyball dataset. COMPOSER achieves a new state-of-the-art 94.5% accuracy with the keypoint-only modality. COMPOSER outperforms the latest GAR methods that rely on RGB signals, and performs favorably compared against methods that exploit multiple modalities. Our code will be available.
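The abstract names Actor Dropout as a data augmentation but does not spell out its procedure. One plausible reading, sketched below, is to blank out the keypoints of randomly chosen actors during training. The function name, the zero-filling choice, and the per-actor drop probability are assumptions, not details confirmed by the paper.

```python
import random
from typing import List, Tuple

Keypoint = Tuple[float, float]  # (x, y) image coordinates


def actor_dropout(
    actors: List[List[Keypoint]],
    drop_prob: float,      # assumed per-actor probability of being dropped
    rng: random.Random,
) -> List[List[Keypoint]]:
    """Return a copy of `actors` where some actors' keypoints are zeroed."""
    augmented: List[List[Keypoint]] = []
    for joints in actors:
        if rng.random() < drop_prob:
            # Assumption: a dropped actor keeps its slot but loses all signal.
            augmented.append([(0.0, 0.0)] * len(joints))
        else:
            augmented.append(list(joints))
    return augmented


# Two hypothetical actors with two keypoints each.
clip = [[(10.0, 20.0), (12.0, 24.0)], [(50.0, 60.0), (52.0, 66.0)]]
augmented_clip = actor_dropout(clip, drop_prob=0.5, rng=random.Random(0))
```

Keeping the number of actors and joints unchanged means the downstream model sees the same input shape whether or not an actor was dropped.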

Dual Projection Generative Adversarial Networks for Conditional Image Generation

Aug 20, 2021

Ligong Han, Martin Renqiang Min, Anastasis Stathopoulos, Yu Tian, Ruijiang Gao, Asim Kadav, Dimitris Metaxas

Abstract: Conditional Generative Adversarial Networks (cGANs) extend the standard unconditional GAN framework to learning joint data-label distributions from samples, and have been established as powerful generative models capable of generating high-fidelity imagery. A challenge of training such a model lies in properly infusing class information into its generator and discriminator. For the discriminator, class conditioning can be achieved by either (1) directly incorporating labels as input or (2) involving labels in an auxiliary classification loss. In this paper, we show that the former directly aligns the class-conditioned fake-and-real data distributions $P(\text{image}|\text{class})$ (data matching), while the latter aligns data-conditioned class distributions $P(\text{class}|\text{image})$ (label matching). Although class separability does not directly translate to sample quality and becomes a burden if classification itself is intrinsically difficult, the discriminator cannot provide useful guidance for the generator if features of distinct classes are mapped to the same point and thus become inseparable. Motivated by this intuition, we propose a Dual Projection GAN (P2GAN) model that learns to balance between data matching and label matching. We then propose an improved cGAN model with Auxiliary Classification that directly aligns the fake and real conditionals $P(\text{class}|\text{image})$ by minimizing their $f$-divergence. Experiments on a synthetic Mixture of Gaussian (MoG) dataset and a variety of real-world datasets including CIFAR100, ImageNet, and VGGFace2 demonstrate the efficacy of our proposed models.

* Accepted at ICCV-21

Hopper: Multi-hop Transformer for Spatiotemporal Reasoning

Mar 22, 2021

Honglu Zhou, Asim Kadav, Farley Lai, Alexandru Niculescu-Mizil, Martin Renqiang Min, Mubbasir Kapadia, Hans Peter Graf

Abstract: This paper considers the problem of spatiotemporal object-centric reasoning in videos. Central to our approach is the notion of object permanence, i.e., the ability to reason about the location of objects as they move through the video while being occluded, contained or carried by other objects. Existing deep-learning-based approaches often suffer from spatiotemporal biases when applied to video reasoning problems. We propose Hopper, which uses a Multi-hop Transformer for reasoning about object permanence in videos. Given a video and a localization query, Hopper reasons over image and object tracks to automatically hop over critical frames in an iterative fashion to predict the final position of the object of interest. We demonstrate the effectiveness of using a contrastive loss to reduce spatiotemporal biases. We evaluate on the CATER dataset and find that Hopper achieves 73.2% Top-1 accuracy using just 1 FPS by hopping through just a few critical frames. We also demonstrate that Hopper can perform long-term reasoning by building a CATER-h dataset that requires multi-step reasoning to localize objects of interest correctly.

S3VAE: Self-Supervised Sequential VAE for Representation Disentanglement and Data Generation

May 23, 2020

Yizhe Zhu, Martin Renqiang Min, Asim Kadav, Hans Peter Graf

Abstract: We propose a sequential variational autoencoder to learn disentangled representations of sequential data (e.g., videos and audios) under self-supervision. Specifically, we exploit the benefits of some readily accessible supervisory signals from the input data itself or some off-the-shelf functional models and accordingly design auxiliary tasks for our model to utilize these signals. With the supervision of the signals, our model can easily disentangle the representation of an input sequence into static factors and dynamic factors (i.e., time-invariant and time-varying parts). Comprehensive experiments across videos and audios verify the effectiveness of our model on representation disentanglement and generation of sequential data, and demonstrate that our model with self-supervision performs comparably to, if not better than, the fully-supervised model with ground truth labels, and outperforms state-of-the-art unsupervised models by a large margin.

* To appear in CVPR 2020

15 Keypoints Is All You Need

Dec 05, 2019

Michael Snower, Asim Kadav, Farley Lai, Hans Peter Graf

Abstract: Pose tracking is an important problem that requires identifying unique human pose-instances and matching them temporally across different frames of a video. However, existing pose tracking methods are unable to accurately model temporal relationships and require significant computation, often computing the tracks offline. We present an efficient Multi-person Pose Tracking method, KeyTrack, that only relies on keypoint information without using any RGB or optical flow information to track human keypoints in real time. Keypoints are tracked using our Pose Entailment method, in which, first, a pair of pose estimates is sampled from different frames in a video and tokenized. Then, a Transformer-based network makes a binary classification as to whether one pose temporally follows another. Furthermore, we improve our top-down pose estimation method with a novel, parameter-free, keypoint refinement technique that improves the keypoint estimates used during the Pose Entailment step. We achieve state-of-the-art results on the PoseTrack'17 and the PoseTrack'18 benchmarks while using only a fraction of the computation required by most other methods for computing the tracking information.
