Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Akshita Gupta

A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level

Jul 09, 2025

Johanna Orsholm, John Quinto, Hannu Autto, Gaia Banelyte, Nicolas Chazot, Jeremy deWaard, Stephanie deWaard, Arielle Farrell, Brendan Furneaux, Bess Hardwick(+19 more)

Abstract:Insects comprise millions of species, many experiencing severe population declines under environmental and habitat changes. High-throughput approaches are crucial for accelerating our understanding of insect diversity, with DNA barcoding and high-resolution imaging showing strong potential for automatic taxonomic classification. However, most image-based approaches rely on individual specimen data, unlike the unsorted bulk samples collected in large-scale ecological surveys. We present the Mixed Arthropod Sample Segmentation and Identification (MassID45) dataset for training automatic classifiers of bulk insect samples. It uniquely combines molecular and imaging data at both the unsorted sample level and the full set of individual specimens. Human annotators, supported by an AI-assisted tool, performed two tasks on bulk images: creating segmentation masks around each individual arthropod and assigning taxonomic labels to over 17 000 specimens. Combining the taxonomic resolution of DNA barcodes with precise abundance estimates of bulk images holds great potential for rapid, large-scale characterization of insect communities. This dataset pushes the boundaries of tiny object detection and instance segmentation, fostering innovation in both ecological and machine learning research.

* 13 pages, 6 figures, submitted to Scientific Data

Via

Access Paper or Ask Questions

Global Task-aware Fault Detection, Identification For On-Orbit Multi-Spacecraft Collaborative Inspection

May 06, 2025

Akshita Gupta, Yashwanth Kumar Nakka, Changrak Choi, Amir Rahmani

Abstract:In this paper, we present a global-to-local task-aware fault detection and identification algorithm to detect failures in a multi-spacecraft system performing a collaborative inspection (referred to as global) task. The inspection task is encoded as a cost functional $\costH$ that informs global (task allocation and assignment) and local (agent-level) decision-making. The metric $\costH$ is a function of the inspection sensor model, and the agent full-pose. We use the cost functional $\costH$ to design a metric that compares the expected and actual performance to detect the faulty agent using a threshold. We use higher-order cost gradients $\costH$ to derive a new metric to identify the type of fault, including task-specific sensor fault, an agent-level actuator, and sensor faults. Furthermore, we propose an approach to design adaptive thresholds for each fault mentioned above to incorporate the time dependence of the inspection task. We demonstrate the efficacy of the proposed method empirically, by simulating and detecting faults (such as inspection sensor faults, actuators, and sensor faults) in a low-Earth orbit collaborative spacecraft inspection task using the metrics and the threshold designed using the global task cost $\costH$.

* published. 33rd AAS AIAA Conference 2023

Via

Access Paper or Ask Questions

Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

Nov 26, 2024

Akshita Gupta, Tatiana Likhomanenko, Karren Dai Yang, Richard He Bai, Zakaria Aldeneh, Navdeep Jaitly

Figure 1 for Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

Figure 2 for Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

Figure 3 for Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

Figure 4 for Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

Abstract:In this paper, we propose a new task -- generating speech from videos of people and their transcripts (VTTS) -- to motivate new techniques for multimodal speech generation. This task generalizes the task of generating speech from cropped lip videos, and is also more complicated than the task of generating generic audio clips (e.g., dog barking) from videos and text. Multilingual versions of the task could lead to new techniques for cross-lingual dubbing. We also present a decoder-only multimodal model for this task, which we call Visatronic. This model embeds vision, text and speech directly into the common subspace of a transformer model and uses an autoregressive loss to learn a generative model of discretized mel-spectrograms conditioned on speaker videos and transcripts of their speech. By embedding all modalities into a common subspace, Visatronic can achieve improved results over models that use only text or video as input. Further, it presents a much simpler approach for multimodal speech generation compared to prevailing approaches which rely on lip-detectors and complicated architectures to fuse modalities while producing better results. Since the model is flexible enough to accommodate different ways of ordering inputs as a sequence, we carefully explore different strategies to better understand the best way to propagate information to the generative steps. To facilitate further research on VTTS, we will release (i) our code, (ii) clean transcriptions for the large-scale VoxCeleb2 dataset, and (iii) a standardized evaluation protocol for VTTS incorporating both objective and subjective metrics.

Via

Access Paper or Ask Questions

Open-Vocabulary Temporal Action Localization using Multimodal Guidance

Jun 21, 2024

Akshita Gupta, Aditya Arora, Sanath Narayan, Salman Khan, Fahad Shahbaz Khan, Graham W. Taylor

Abstract:Open-Vocabulary Temporal Action Localization (OVTAL) enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories. However, this flexibility poses significant challenges, as the model must recognize not only the action categories seen during training but also novel categories specified at inference. Unlike standard temporal action localization, where training and test categories are predetermined, OVTAL requires understanding contextual cues that reveal the semantics of novel categories. To address these challenges, we introduce OVFormer, a novel open-vocabulary framework extending ActionFormer with three key contributions. First, we employ task-specific prompts as input to a large language model to obtain rich class-specific descriptions for action categories. Second, we introduce a cross-attention mechanism to learn the alignment between class representations and frame-level video features, facilitating the multimodal guided features. Third, we propose a two-stage training strategy which includes training with a larger vocabulary dataset and finetuning to downstream data to generalize to novel categories. OVFormer extends existing TAL methods to open-vocabulary settings. Comprehensive evaluations on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of our method. Code and pretrained models will be publicly released.

Via

Access Paper or Ask Questions

LoSA: Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization

Apr 01, 2024

Akshita Gupta, Gaurav Mittal, Ahmed Magooda, Ye Yu, Graham W. Taylor, Mei Chen

Figure 1 for LoSA: Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization

Figure 2 for LoSA: Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization

Figure 3 for LoSA: Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization

Figure 4 for LoSA: Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization

Abstract:Temporal Action Localization (TAL) involves localizing and classifying action snippets in an untrimmed video. The emergence of large video foundation models has led RGB-only video backbones to outperform previous methods needing both RGB and optical flow modalities. Leveraging these large models is often limited to training only the TAL head due to the prohibitively large GPU memory required to adapt the video backbone for TAL. To overcome this limitation, we introduce LoSA, the first memory-and-parameter-efficient backbone adapter designed specifically for TAL to handle untrimmed videos. LoSA specializes for TAL by introducing Long-Short-range Adapters that adapt the intermediate layers of the video backbone over different temporal ranges. These adapters run parallel to the video backbone to significantly reduce memory footprint. LoSA also includes Long-Short-range Fusion that strategically combines the output of these adapters from the video backbone layers to enhance the video features provided to the TAL head. Experiments show that LoSA significantly outperforms all existing methods on standard TAL benchmarks, THUMOS-14 and ActivityNet-v1.3, by scaling end-to-end backbone adaptation to billion-parameter-plus models like VideoMAEv2~(ViT-g) and leveraging them beyond head-only transfer learning.

Via

Access Paper or Ask Questions

Multimodality in Online Education: A Comparative Study

Dec 17, 2023

Praneeta Immadisetty, Pooja Rajesh, Akshita Gupta, Anala M R, Soumya A, K. N. Subramanya

Figure 1 for Multimodality in Online Education: A Comparative Study

Figure 2 for Multimodality in Online Education: A Comparative Study

Figure 3 for Multimodality in Online Education: A Comparative Study

Figure 4 for Multimodality in Online Education: A Comparative Study

Abstract:The commencement of the decade brought along with it a grave pandemic and in response the movement of education forums predominantly into the online world. With a surge in the usage of online video conferencing platforms and tools to better gauge student understanding, there needs to be a mechanism to assess whether instructors can grasp the extent to which students understand the subject and their response to the educational stimuli. The current systems consider only a single cue with a lack of focus in the educational domain. Thus, there is a necessity for the measurement of an all-encompassing holistic overview of the students' reaction to the subject matter. This paper highlights the need for a multimodal approach to affect recognition and its deployment in the online classroom while considering four cues, posture and gesture, facial, eye tracking and verbal recognition. It compares the various machine learning models available for each cue and provides the most suitable approach given the available dataset and parameters of classroom footage. A multimodal approach derived from weighted majority voting is proposed by combining the most fitting models from this analysis of individual cues based on accuracy, ease of procuring data corpus, sensitivity and any major drawbacks.

Via

Access Paper or Ask Questions

OW-DETR: Open-world Detection Transformer

Dec 09, 2021

Akshita Gupta, Sanath Narayan, K J Joseph, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah

Figure 1 for OW-DETR: Open-world Detection Transformer

Figure 2 for OW-DETR: Open-world Detection Transformer

Figure 3 for OW-DETR: Open-world Detection Transformer

Figure 4 for OW-DETR: Open-world Detection Transformer

Abstract:Open-world object detection (OWOD) is a challenging computer vision problem, where the task is to detect a known set of object categories while simultaneously identifying unknown objects. Additionally, the model must incrementally learn new classes that become known in the next training episodes. Distinct from standard object detection, the OWOD setting poses significant challenges for generating quality candidate proposals on potentially unknown objects, separating the unknown objects from the background and detecting diverse unknown objects. Here, we introduce a novel end-to-end transformer-based framework, OW-DETR, for open-world object detection. The proposed OW-DETR comprises three dedicated components namely, attention-driven pseudo-labeling, novelty classification and objectness scoring to explicitly address the aforementioned OWOD challenges. Our OW-DETR explicitly encodes multi-scale contextual information, possesses less inductive bias, enables knowledge transfer from known classes to the unknown class and can better discriminate between unknown objects and background. Comprehensive experiments are performed on two benchmarks: MS-COCO and PASCAL VOC. The extensive ablations reveal the merits of our proposed contributions. Further, our model outperforms the recently introduced OWOD approach, ORE, with absolute gains ranging from 1.8% to 3.3% in terms of unknown recall on the MS-COCO benchmark. In the case of incremental object detection, OW-DETR outperforms the state-of-the-art for all settings on the PASCAL VOC benchmark. Our codes and models will be publicly released.

* 15 pages

Via

Access Paper or Ask Questions

Discriminative Region-based Multi-Label Zero-Shot Learning

Aug 20, 2021

Sanath Narayan, Akshita Gupta, Salman Khan, Fahad Shahbaz Khan, Ling Shao, Mubarak Shah

Figure 1 for Discriminative Region-based Multi-Label Zero-Shot Learning

Figure 2 for Discriminative Region-based Multi-Label Zero-Shot Learning

Figure 3 for Discriminative Region-based Multi-Label Zero-Shot Learning

Figure 4 for Discriminative Region-based Multi-Label Zero-Shot Learning

Abstract:Multi-label zero-shot learning (ZSL) is a more realistic counter-part of standard single-label ZSL since several objects can co-exist in a natural image. However, the occurrence of multiple objects complicates the reasoning and requires region-specific processing of visual features to preserve their contextual cues. We note that the best existing multi-label ZSL method takes a shared approach towards attending to region features with a common set of attention maps for all the classes. Such shared maps lead to diffused attention, which does not discriminatively focus on relevant locations when the number of classes are large. Moreover, mapping spatially-pooled visual features to the class semantics leads to inter-class feature entanglement, thus hampering the classification. Here, we propose an alternate approach towards region-based discriminability-preserving multi-label zero-shot classification. Our approach maintains the spatial resolution to preserve region-level characteristics and utilizes a bi-level attention module (BiAM) to enrich the features by incorporating both region and scene context information. The enriched region-level features are then mapped to the class semantics and only their class predictions are spatially pooled to obtain image-level predictions, thereby keeping the multi-class features disentangled. Our approach sets a new state of the art on two large-scale multi-label zero-shot benchmarks: NUS-WIDE and Open Images. On NUS-WIDE, our approach achieves an absolute gain of 6.9% mAP for ZSL, compared to the best published results.

* Accepted to ICCV 2021. Source code is available at https://github.com/akshitac8/BiAM

Via

Access Paper or Ask Questions

Generative Multi-Label Zero-Shot Learning

Jan 28, 2021

Akshita Gupta, Sanath Narayan, Salman Khan, Fahad Shahbaz Khan, Ling Shao, Joost van de Weijer

Figure 1 for Generative Multi-Label Zero-Shot Learning

Figure 2 for Generative Multi-Label Zero-Shot Learning

Figure 3 for Generative Multi-Label Zero-Shot Learning

Figure 4 for Generative Multi-Label Zero-Shot Learning

Abstract:Multi-label zero-shot learning strives to classify images into multiple unseen categories for which no data is available during training. The test samples can additionally contain seen categories in the generalized variant. Existing approaches rely on learning either shared or label-specific attention from the seen classes. Nevertheless, computing reliable attention maps for unseen classes during inference in a multi-label setting is still a challenge. In contrast, state-of-the-art single-label generative adversarial network (GAN) based approaches learn to directly synthesize the class-specific visual features from the corresponding class attribute embeddings. However, synthesizing multi-label features from GANs is still unexplored in the context of zero-shot setting. In this work, we introduce different fusion approaches at the attribute-level, feature-level and cross-level (across attribute and feature-levels) for synthesizing multi-label features from their corresponding multi-label class embedding. To the best of our knowledge, our work is the first to tackle the problem of multi-label feature synthesis in the (generalized) zero-shot setting. Comprehensive experiments are performed on three zero-shot image classification benchmarks: NUS-WIDE, Open Images and MS COCO. Our cross-level fusion-based generative approach outperforms the state-of-the-art on all three datasets. Furthermore, we show the generalization capabilities of our fusion approach in the zero-shot detection task on MS COCO, achieving favorable performance against existing methods. The source code is available at https://github.com/akshitac8/Generative_MLZSL.

* 10 pages, source code is available at https://github.com/akshitac8/Generative_MLZSL

Via

Access Paper or Ask Questions

Safety Verification of Model Based Reinforcement Learning Controllers

Oct 21, 2020

Akshita Gupta, Inseok Hwang

Figure 1 for Safety Verification of Model Based Reinforcement Learning Controllers

Figure 2 for Safety Verification of Model Based Reinforcement Learning Controllers

Figure 3 for Safety Verification of Model Based Reinforcement Learning Controllers

Figure 4 for Safety Verification of Model Based Reinforcement Learning Controllers

Abstract:Model-based reinforcement learning (RL) has emerged as a promising tool for developing controllers for real world systems (e.g., robotics, autonomous driving, etc.). However, real systems often have constraints imposed on their state space which must be satisfied to ensure the safety of the system and its environment. Developing a verification tool for RL algorithms is challenging because the non-linear structure of neural networks impedes analytical verification of such models or controllers. To this end, we present a novel safety verification framework for model-based RL controllers using reachable set analysis. The proposed frame-work can efficiently handle models and controllers which are represented using neural networks. Additionally, if a controller fails to satisfy the safety constraints in general, the proposed framework can also be used to identify the subset of initial states from which the controller can be safely executed.

Via

Access Paper or Ask Questions