Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rajat Saini

WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding

Jul 22, 2024

Quan Kong, Yuki Kawana, Rajat Saini, Ashutosh Kumar, Jingjing Pan, Ta Gu, Yohei Ozao, Balazs Opra, David C. Anastasiu, Yoichi Sato(+1 more)

Figure 1 for WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding

Figure 2 for WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding

Figure 3 for WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding

Figure 4 for WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding

Abstract:In this paper, we address the challenge of fine-grained video event understanding in traffic scenarios, vital for autonomous driving and safety. Traditional datasets focus on driver or vehicle behavior, often neglecting pedestrian perspectives. To fill this gap, we introduce the WTS dataset, highlighting detailed behaviors of both vehicles and pedestrians across over 1.2k video events in hundreds of traffic scenarios. WTS integrates diverse perspectives from vehicle ego and fixed overhead cameras in a vehicle-infrastructure cooperative environment, enriched with comprehensive textual descriptions and unique 3D Gaze data for a synchronized 2D/3D view, focusing on pedestrian analysis. We also pro-vide annotations for 5k publicly sourced pedestrian-related traffic videos. Additionally, we introduce LLMScorer, an LLM-based evaluation metric to align inference captions with ground truth. Using WTS, we establish a benchmark for dense video-to-text tasks, exploring state-of-the-art Vision-Language Models with an instance-aware VideoLLM method as a baseline. WTS aims to advance fine-grained video event understanding, enhancing traffic safety and autonomous driving development.

* ECCV24. Website: https://woven-visionai.github.io/wts-dataset-homepage/

Via

Access Paper or Ask Questions

On the Demystification of Knowledge Distillation: A Residual Network Perspective

Jun 30, 2020

Nandan Kumar Jha, Rajat Saini, Sparsh Mittal

Figure 1 for On the Demystification of Knowledge Distillation: A Residual Network Perspective

Figure 2 for On the Demystification of Knowledge Distillation: A Residual Network Perspective

Figure 3 for On the Demystification of Knowledge Distillation: A Residual Network Perspective

Figure 4 for On the Demystification of Knowledge Distillation: A Residual Network Perspective

Abstract:Knowledge distillation (KD) is generally considered as a technique for performing model compression and learned-label smoothing. However, in this paper, we study and investigate the KD approach from a new perspective: we study its efficacy in training a deeper network without any residual connections. We find that in most of the cases, non-residual student networks perform equally or better than their residual versions trained on raw data without KD (baseline network). Surprisingly, in some cases, they surpass the accuracy of baseline networks even with the inferior teachers. After a certain depth of non-residual student network, the accuracy drop, coming from the removal of residual connections, is substantial, and training with KD boosts the accuracy of the student up to a great extent; however, it does not fully recover the accuracy drop. Furthermore, we observe that the conventional teacher-student view of KD is incomplete and does not adequately explain our findings. We propose a novel interpretation of KD with the Trainee-Mentor hypothesis, which provides a holistic view of KD. We also present two viewpoints, loss landscape, and feature reuse, to explain the interplay between residual connections and KD. We substantiate our claims through extensive experiments on residual networks.

Via

Access Paper or Ask Questions

ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks

Jun 26, 2020

Rajat Saini, Nandan Kumar Jha, Bedanta Das, Sparsh Mittal, C. Krishna Mohan

Figure 1 for ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks

Figure 2 for ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks

Figure 3 for ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks

Figure 4 for ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks

Abstract:The capability of the self-attention mechanism to model the long-range dependencies has catapulted its deployment in vision models. Unlike convolution operators, self-attention offers infinite receptive field and enables compute-efficient modeling of global dependencies. However, the existing state-of-the-art attention mechanisms incur high compute and/or parameter overheads, and hence unfit for compact convolutional neural networks (CNNs). In this work, we propose a simple yet effective "Ultra-Lightweight Subspace Attention Mechanism" (ULSAM), which infers different attention maps for each feature map subspace. We argue that leaning separate attention maps for each feature subspace enables multi-scale and multi-frequency feature representation, which is more desirable for fine-grained image classification. Our method of subspace attention is orthogonal and complementary to the existing state-of-the-arts attention mechanisms used in vision models. ULSAM is end-to-end trainable and can be deployed as a plug-and-play module in the pre-existing compact CNNs. Notably, our work is the first attempt that uses a subspace attention mechanism to increase the efficiency of compact CNNs. To show the efficacy of ULSAM, we perform experiments with MobileNet-V1 and MobileNet-V2 as backbone architectures on ImageNet-1K and three fine-grained image classification datasets. We achieve $\approx$13% and $\approx$25% reduction in both the FLOPs and parameter counts of MobileNet-V2 with a 0.27% and more than 1% improvement in top-1 accuracy on the ImageNet-1K and fine-grained image classification datasets (respectively). Code and trained models are available at https://github.com/Nandan91/ULSAM.

* WACV (2020) 1627-1636
* Accepted as a conference paper in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV)

Via

Access Paper or Ask Questions

E2GC: Energy-efficient Group Convolution in Deep Neural Networks

Jun 26, 2020

Nandan Kumar Jha, Rajat Saini, Subhrajit Nag, Sparsh Mittal

Figure 1 for E2GC: Energy-efficient Group Convolution in Deep Neural Networks

Figure 2 for E2GC: Energy-efficient Group Convolution in Deep Neural Networks

Figure 3 for E2GC: Energy-efficient Group Convolution in Deep Neural Networks

Figure 4 for E2GC: Energy-efficient Group Convolution in Deep Neural Networks

Abstract:The number of groups ($g$) in group convolution (GConv) is selected to boost the predictive performance of deep neural networks (DNNs) in a compute and parameter efficient manner. However, we show that naive selection of $g$ in GConv creates an imbalance between the computational complexity and degree of data reuse, which leads to suboptimal energy efficiency in DNNs. We devise an optimum group size model, which enables a balance between computational cost and data movement cost, thus, optimize the energy-efficiency of DNNs. Based on the insights from this model, we propose an "energy-efficient group convolution" (E2GC) module where, unlike the previous implementations of GConv, the group size ($G$) remains constant. Further, to demonstrate the efficacy of the E2GC module, we incorporate this module in the design of MobileNet-V1 and ResNeXt-50 and perform experiments on two GPUs, P100 and P4000. We show that, at comparable computational complexity, DNNs with constant group size (E2GC) are more energy-efficient than DNNs with a fixed number of groups (F$g$GC). For example, on P100 GPU, the energy-efficiency of MobileNet-V1 and ResNeXt-50 is increased by 10.8% and 4.73% (respectively) when E2GC modules substitute the F$g$GC modules in both the DNNs. Furthermore, through our extensive experimentation with ImageNet-1K and Food-101 image classification datasets, we show that the E2GC module enables a trade-off between generalization ability and representational power of DNN. Thus, the predictive performance of DNNs can be optimized by selecting an appropriate $G$. The code and trained models are available at https://github.com/iithcandle/E2GC-release.

* VLSID (2020) 155-160
* Accepted as a conference paper in 2020 33rd International Conference on VLSI Design and 2020 19th International Conference on Embedded Systems (VLSID)

Via

Access Paper or Ask Questions