Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Nakul Agarwal

Pose-Aware Weakly-Supervised Action Segmentation

Apr 08, 2025

Seth Z. Zhao, Reza Ghoddoosian, Isht Dwivedi, Nakul Agarwal, Behzad Dariush

Figure 1 for Pose-Aware Weakly-Supervised Action Segmentation

Figure 2 for Pose-Aware Weakly-Supervised Action Segmentation

Figure 3 for Pose-Aware Weakly-Supervised Action Segmentation

Figure 4 for Pose-Aware Weakly-Supervised Action Segmentation

Abstract:Understanding human behavior is an important problem in the pursuit of visual intelligence. A challenge in this endeavor is the extensive and costly effort required to accurately label action segments. To address this issue, we consider learning methods that demand minimal supervision for segmentation of human actions in long instructional videos. Specifically, we introduce a weakly-supervised framework that uniquely incorporates pose knowledge during training while omitting its use during inference, thereby distilling pose knowledge pertinent to each action component. We propose a pose-inspired contrastive loss as a part of the whole weakly-supervised framework which is trained to distinguish action boundaries more effectively. Our approach, validated through extensive experiments on representative datasets, outperforms previous state-of-the-art (SOTA) in segmenting long instructional videos under both online and offline settings. Additionally, we demonstrate the framework's adaptability to various segmentation backbones and pose extractors across different datasets.

Via

Access Paper or Ask Questions

ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos

Nov 23, 2024

Reza Ghoddoosian, Nakul Agarwal, Isht Dwivedi, Behzad Darisuh

Figure 1 for ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos

Figure 2 for ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos

Figure 3 for ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos

Figure 4 for ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos

Abstract:Vision-language models (VLMs) are capable of recognizing unseen actions. However, existing VLMs lack intrinsic understanding of procedural action concepts. Hence, they overfit to fixed labels and are not invariant to unseen action synonyms. To address this, we propose a simple fine-tuning technique, Action Concept Enhancement (ACE), to improve the robustness and concept understanding of VLMs in procedural action classification. ACE continually incorporates augmented action synonyms and negatives in an auxiliary classification loss by stochastically replacing fixed labels during training. This creates new combinations of action labels over the course of fine-tuning and prevents overfitting to fixed action representations. We show the enhanced concept understanding of our VLM, by visualizing the alignment of encoded embeddings of unseen action synonyms in the embedding space. Our experiments on the ATA, IKEA and GTEA datasets demonstrate the efficacy of ACE in domains of cooking and assembly leading to significant improvements in zero-shot action classification while maintaining competitive performance on seen actions.

* Accepted at WACV 2025

Via

Access Paper or Ask Questions

IIT Bombay Racing Driverless: Autonomous Driving Stack for Formula Student AI

Aug 12, 2024

Yash Rampuria, Deep Boliya, Shreyash Gupta, Gopalan Iyengar, Ayush Rohilla, Mohak Vyas, Chaitanya Langde, Mehul Vijay Chanda, Ronak Gautam Matai, Kothapalli Namitha(+11 more)

Abstract:This work presents the design and development of IIT Bombay Racing's Formula Student style autonomous racecar algorithm capable of running at the racing events of Formula Student-AI, held in the UK. The car employs a cutting-edge sensor suite of the compute unit NVIDIA Jetson Orin AGX, 2 ZED2i stereo cameras, 1 Velodyne Puck VLP16 LiDAR and SBG Systems Ellipse N GNSS/INS IMU. It features deep learning algorithms and control systems to navigate complex tracks and execute maneuvers without any human intervention. The design process involved extensive simulations and testing to optimize the vehicle's performance and ensure its safety. The algorithms have been tested on a small scale, in-house manufactured 4-wheeled robot and on simulation software. The results obtained for testing various algorithms in perception, simultaneous localization and mapping, path planning and controls have been detailed.

* 8 pages, 19 figures

Via

Access Paper or Ask Questions

M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Jul 19, 2024

Seunggeun Chi, Hyung-gun Chi, Hengbo Ma, Nakul Agarwal, Faizan Siddiqui, Karthik Ramani, Kwonjoon Lee

Figure 1 for M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Figure 2 for M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Figure 3 for M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Figure 4 for M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Abstract:We introduce the Multi-Motion Discrete Diffusion Models (M2D2M), a novel approach for human motion generation from textual descriptions of multiple actions, utilizing the strengths of discrete diffusion models. This approach adeptly addresses the challenge of generating multi-motion sequences, ensuring seamless transitions of motions and coherence across a series of actions. The strength of M2D2M lies in its dynamic transition probability within the discrete diffusion model, which adapts transition probabilities based on the proximity between motion tokens, encouraging mixing between different modes. Complemented by a two-phase sampling strategy that includes independent and joint denoising steps, M2D2M effectively generates long-term, smooth, and contextually coherent human motion sequences, utilizing a model trained for single-motion generation. Extensive experiments demonstrate that M2D2M surpasses current state-of-the-art benchmarks for motion generation from text descriptions, showcasing its efficacy in interpreting language semantics and generating dynamic, realistic motions.

Via

Access Paper or Ask Questions

Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models

May 30, 2024

Himangi Mittal, Nakul Agarwal, Shao-Yuan Lo, Kwonjoon Lee

Figure 1 for Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models

Figure 2 for Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models

Figure 3 for Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models

Figure 4 for Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models

Abstract:We introduce PlausiVL, a large video-language model for anticipating action sequences that are plausible in the real-world. While significant efforts have been made towards anticipating future actions, prior approaches do not take into account the aspect of plausibility in an action sequence. To address this limitation, we explore the generative capability of a large video-language model in our work and further, develop the understanding of plausibility in an action sequence by introducing two objective functions, a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss. We utilize temporal logical constraints as well as verb-noun action pair logical constraints to create implausible/counterfactual action sequences and use them to train the model with plausible action sequence learning loss. This loss helps the model to differentiate between plausible and not plausible action sequences and also helps the model to learn implicit temporal cues crucial for the task of action anticipation. The long-horizon action repetition loss puts a higher penalty on the actions that are more prone to repetition over a longer temporal window. With this penalization, the model is able to generate diverse, plausible action sequences. We evaluate our approach on two large-scale datasets, Ego4D and EPIC-Kitchens-100, and show improvements on the task of action anticipation.

* CVPR 2024

Via

Access Paper or Ask Questions

Multi-Objective Recommendation via Multivariate Policy Learning

May 03, 2024

Olivier Jeunen, Jatin Mandav, Ivan Potapov, Nakul Agarwal, Sourabh Vaid, Wenzhe Shi, Aleksei Ustimenko

Figure 1 for Multi-Objective Recommendation via Multivariate Policy Learning

Figure 2 for Multi-Objective Recommendation via Multivariate Policy Learning

Figure 3 for Multi-Objective Recommendation via Multivariate Policy Learning

Figure 4 for Multi-Objective Recommendation via Multivariate Policy Learning

Abstract:Real-world recommender systems often need to balance multiple objectives when deciding which recommendations to present to users. These include behavioural signals (e.g. clicks, shares, dwell time), as well as broader objectives (e.g. diversity, fairness). Scalarisation methods are commonly used to handle this balancing task, where a weighted average of per-objective reward signals determines the final score used for ranking. Naturally, how these weights are computed exactly, is key to success for any online platform. We frame this as a decision-making task, where the scalarisation weights are actions taken to maximise an overall North Star reward (e.g. long-term user retention or growth). We extend existing policy learning methods to the continuous multivariate action domain, proposing to maximise a pessimistic lower bound on the North Star reward that the learnt policy will yield. Typical lower bounds based on normal approximations suffer from insufficient coverage, and we propose an efficient and effective policy-dependent correction for this. We provide guidance to design stochastic data collection policies, as well as highly sensitive reward signals. Empirical observations from simulations, offline and online experiments highlight the efficacy of our deployed approach.

Via

Access Paper or Ask Questions

Disentangled Neural Relational Inference for Interpretable Motion Prediction

Jan 07, 2024

Victoria M. Dax, Jiachen Li, Enna Sachdeva, Nakul Agarwal, Mykel J. Kochenderfer

Abstract:Effective interaction modeling and behavior prediction of dynamic agents play a significant role in interactive motion planning for autonomous robots. Although existing methods have improved prediction accuracy, few research efforts have been devoted to enhancing prediction model interpretability and out-of-distribution (OOD) generalizability. This work addresses these two challenging aspects by designing a variational auto-encoder framework that integrates graph-based representations and time-sequence models to efficiently capture spatio-temporal relations between interactive agents and predict their dynamics. Our model infers dynamic interaction graphs in a latent space augmented with interpretable edge features that characterize the interactions. Moreover, we aim to enhance model interpretability and performance in OOD scenarios by disentangling the latent space of edge features, thereby strengthening model versatility and robustness. We validate our approach through extensive experiments on both simulated and real-world datasets. The results show superior performance compared to existing methods in modeling spatio-temporal relations, motion prediction, and identifying time-invariant latent features.

* IEEE Robotics and Automation Letters, Date: FEBRUARY 2024 , Volume: 9, Issue: 2, ISSN: 2377-3766, pp1452-1459

Via

Access Paper or Ask Questions

Vamos: Versatile Action Models for Video Understanding

Nov 22, 2023

Shijie Wang, Qi Zhao, Minh Quan Do, Nakul Agarwal, Kwonjoon Lee, Chen Sun

Figure 1 for Vamos: Versatile Action Models for Video Understanding

Figure 2 for Vamos: Versatile Action Models for Video Understanding

Figure 3 for Vamos: Versatile Action Models for Video Understanding

Figure 4 for Vamos: Versatile Action Models for Video Understanding

Abstract:What makes good video representations for video understanding, such as anticipating future activities, or answering video-conditioned questions? While earlier approaches focus on end-to-end learning directly from video pixels, we propose to revisit text-based representations, such as discrete action labels, or free-form video captions, which are interpretable and can be directly consumed by large language models (LLMs). Intuitively, different video understanding tasks may require representations that are complementary and at different granularities. To this end, we propose versatile action models (Vamos), a learning framework powered by a large language model as the "reasoner", and can flexibly leverage visual embeddings, action labels, and free-form descriptions extracted from videos as its input. We evaluate Vamos on four complementary video understanding benchmarks, Ego4D, Next-QA, IntentQA, and EgoSchema, on its capability to model temporal dynamics, encode visual history, and perform reasoning. Surprisingly, we observe that text-based representations consistently achieve competitive performance on all benchmarks, and that visual embeddings provide marginal or no performance improvement, demonstrating the effectiveness of text-based video representation in the LLM era. We perform extensive ablation study and qualitative analysis to support our observations, and achieve state-of-the-art performance on three benchmarks.

* Under submission. Code and models will be released at https://brown-palm.github.io/Vamos/

Via

Access Paper or Ask Questions

Object-centric Video Representation for Long-term Action Anticipation

Oct 31, 2023

Ce Zhang, Changcheng Fu, Shijie Wang, Nakul Agarwal, Kwonjoon Lee, Chiho Choi, Chen Sun

Figure 1 for Object-centric Video Representation for Long-term Action Anticipation

Figure 2 for Object-centric Video Representation for Long-term Action Anticipation

Figure 3 for Object-centric Video Representation for Long-term Action Anticipation

Figure 4 for Object-centric Video Representation for Long-term Action Anticipation

Abstract:This paper focuses on building object-centric representations for long-term action anticipation in videos. Our key motivation is that objects provide important cues to recognize and predict human-object interactions, especially when the predictions are longer term, as an observed "background" object could be used by the human actor in the future. We observe that existing object-based video recognition frameworks either assume the existence of in-domain supervised object detectors or follow a fully weakly-supervised pipeline to infer object locations from action labels. We propose to build object-centric video representations by leveraging visual-language pretrained models. This is achieved by "object prompts", an approach to extract task-specific object-centric representations from general-purpose pretrained models without finetuning. To recognize and predict human-object interactions, we use a Transformer-based neural architecture which allows the "retrieval" of relevant objects for action anticipation at various time scales. We conduct extensive evaluations on the Ego4D, 50Salads, and EGTEA Gaze+ benchmarks. Both quantitative and qualitative results confirm the effectiveness of our proposed method.

* This is an accepted WACV 2024 paper. Our code is available at https://github.com/brown-palm/ObjectPrompt

Via

Access Paper or Ask Questions

Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning

Sep 12, 2023

Enna Sachdeva, Nakul Agarwal, Suhas Chundi, Sean Roelofs, Jiachen Li, Behzad Dariush, Chiho Choi, Mykel Kochenderfer

Abstract:The widespread adoption of commercial autonomous vehicles (AVs) and advanced driver assistance systems (ADAS) may largely depend on their acceptance by society, for which their perceived trustworthiness and interpretability to riders are crucial. In general, this task is challenging because modern autonomous systems software relies heavily on black-box artificial intelligence models. Towards this goal, this paper introduces a novel dataset, Rank2Tell, a multi-modal ego-centric dataset for Ranking the importance level and Telling the reason for the importance. Using various close and open-ended visual question answering, the dataset provides dense annotations of various semantic, spatial, temporal, and relational attributes of various important objects in complex traffic scenarios. The dense annotations and unique attributes of the dataset make it a valuable resource for researchers working on visual scene understanding and related fields. Further, we introduce a joint model for joint importance level ranking and natural language captions generation to benchmark our dataset and demonstrate performance with quantitative evaluations.

Via

Access Paper or Ask Questions