Aditya Vora

Articulate That Object Part (ATOP): 3D Part Articulation from Text and Motion Personalization

Feb 11, 2025

Aditya Vora, Sauradip Nag, Hao Zhang

Abstract: We present ATOP (Articulate That Object Part), a novel method based on motion personalization to articulate a 3D object with respect to a part and its motion as prescribed in a text prompt. Specifically, the text input allows us to tap into the power of modern-day video diffusion to generate plausible motion samples for the right object category and part. In turn, the input 3D object provides image prompting to personalize the generated video to the very object we wish to articulate. Our method starts with few-shot finetuning for category-specific motion generation, a key first step to compensate for the lack of articulation awareness in current video diffusion models. For this, we finetune a pre-trained multi-view image generation model for controllable multi-view video generation, using a small collection of video samples obtained for the target object category. This is followed by motion video personalization, which is realized with multi-view rendered images of the target 3D object. Finally, we transfer the personalized video motion to the target 3D object via differentiable rendering, optimizing part motion parameters with a score distillation sampling loss. We show that our method is capable of generating realistic motion videos and predicting 3D motion parameters in a more accurate and generalizable way than prior work.

* Technical Report, 16 pages
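
The abstract does not spell out what the "part motion parameters" are. As a rough illustration only, the sketch below assumes a revolute (hinge) joint described by a pivot point, an axis direction, and an angle, and applies it to a single 3D point with Rodrigues' rotation formula. The function name and the parameterization are assumptions made for this example, not the authors' implementation.

```python
import math

def rotate_about_axis(point, pivot, axis, angle_rad):
    """Rotate a 3D point about the line through `pivot` along `axis`.

    Hypothetical helper: a hinge joint is only one possible
    parameterization of part motion.
    """
    # Normalize the axis so the formula below is valid.
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]
    # Express the point relative to the pivot.
    v = [p - c for p, c in zip(point, pivot)]
    cos_t, sin_t = math.cos(angle_rad), math.sin(angle_rad)
    # Cross product k x v and dot product k . v.
    cross = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    dot = sum(ki * vi for ki, vi in zip(k, v))
    # Rodrigues' formula: v cos(t) + (k x v) sin(t) + k (k . v)(1 - cos(t)).
    rotated = [
        v[i] * cos_t + cross[i] * sin_t + k[i] * dot * (1.0 - cos_t)
        for i in range(3)
    ]
    # Move back to world coordinates.
    return [r + c for r, c in zip(rotated, pivot)]
```

In an optimization setting such as the one the abstract describes, the angle (and possibly the axis and pivot) would be the quantities being fitted; here they are plain inputs.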

DiViNeT: 3D Reconstruction from Disparate Views via Neural Template Regularization

Jun 15, 2023

Aditya Vora, Akshay Gadi Patil, Hao Zhang

Abstract: We present a volume rendering-based neural surface reconstruction method that takes as few as three disparate RGB images as input. Our key idea is to regularize the reconstruction, which is severely ill-posed and leaves significant gaps between the sparse views, by learning a set of neural templates that act as surface priors. Our method, coined DiViNeT, operates in two stages. The first stage learns the templates, in the form of 3D Gaussian functions, across different scenes, without 3D supervision. In the reconstruction stage, our predicted templates serve as anchors to help "stitch" the surfaces over sparse regions. We demonstrate that our approach is not only able to complete the surface geometry but also reconstructs surface details to a reasonable extent from a few disparate input views. On the DTU and BlendedMVS datasets, our approach achieves the best reconstruction quality among existing methods in the presence of such sparse views, and performs on par with, if not better than, competing methods when dense views are employed as inputs.
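
To make "templates in the form of 3D Gaussian functions" concrete, here is a minimal sketch that evaluates a set of axis-aligned Gaussians at a query point. The abstract does not give the parameterization, so the per-template center and per-axis scale used here, along with both function names, are assumptions for illustration.

```python
import math

def gaussian_template(query, center, scales):
    """Unnormalized axis-aligned 3D Gaussian evaluated at `query`.

    `center` and `scales` (per-axis standard deviations) are
    hypothetical template parameters.
    """
    d2 = sum(((q - c) / s) ** 2 for q, c, s in zip(query, center, scales))
    return math.exp(-0.5 * d2)

def template_field(query, templates):
    """Sum the responses of all templates at a single query point."""
    return sum(gaussian_template(query, c, s) for c, s in templates)
```

A field like this peaks near template centers, which matches the intuition of templates acting as anchors for the surface in regions that the sparse views do not cover.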

FCHD: A fast and accurate head detector

Sep 26, 2018

Aditya Vora

Abstract: In this paper, we propose FCHD (Fully Convolutional Head Detector), an end-to-end trainable head detection model that runs at 5 fps and achieves 0.70 average precision (AP) on a very modest GPU. Recent head detection techniques have avoided using anchors as a starting point for detection, especially in cases where the detection has to happen in the wild. The reason is the poor performance of anchor-based techniques in scenarios where the object size is small. We argue that a good AP can be obtained with carefully designed anchors, where the anchor design choices are made based on the receptive field size of the hidden layers. Our contribution is twofold: 1) a simple fully convolutional anchor-based model that is end-to-end trainable and has a very low inference time; 2) carefully chosen anchor sizes, which play a key role in getting good average precision. Our model achieves results comparable to many other baselines on challenging head detection datasets such as BRAINWASH. Along with accuracy, our model has the lowest runtime among all the baselines and modest hardware requirements, which makes it suitable for edge deployments in surveillance applications. The code is made open-source at https://github.com/aditya-vora/FCHD-Fully-Convolutional-Head-Detector.

* 5 pages, 3 figures, under consideration at Computer Vision and Image Understanding
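
As a minimal sketch of what anchor-based detection involves, the function below tiles square anchors of a few sizes over a feature map. The stride and sizes are placeholders chosen for the example; the abstract says only that the sizes are picked according to the receptive field of the hidden layers and does not list the values used.

```python
def generate_anchors(feat_h, feat_w, stride, sizes):
    """Return square anchors as (x1, y1, x2, y2) in image coordinates.

    One anchor per size is centered on every feature-map cell.
    `stride` and `sizes` are hypothetical inputs, not the paper's settings.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature-map cell in input-image pixels.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for size in sizes:
                half = size / 2.0
                anchors.append((cx - half, cy - half, cx + half, cy + half))
    return anchors
```

For example, `generate_anchors(2, 3, 16, [32, 64])` yields 12 anchors, two per cell of a 2x3 feature map. A detector would then score each anchor and regress offsets from it.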

A Classification approach towards Unsupervised Learning of Visual Representations

Jun 01, 2018

Aditya Vora

Abstract: In this paper, we present a technique for unsupervised learning of visual representations. Specifically, we train a model on a foreground and background classification task, in the process of which it learns visual representations. Foreground and background patches for training are mined from hundreds and thousands of unlabelled videos available on the web, from which we extract them using a proposed patch extraction algorithm. Without using any supervision, using just 150,000 unlabelled videos and the PASCAL VOC 2007 dataset, we train an object recognition model that achieves 45.3 mAP, which is close to the best-performing unsupervised feature learning technique and better than many other proposed algorithms. The code for patch extraction is implemented in Matlab and available open source at the following link.

Iterative Spectral Clustering for Unsupervised Object Localization

Jun 29, 2017

Aditya Vora, Shanmuganathan Raman

Abstract: This paper addresses the problem of unsupervised object localization in an image. Unlike previous supervised and weakly supervised algorithms, which require bounding box or image-level annotations for training classifiers in order to learn features representing the object, we propose a simple yet effective technique for localization using iterative spectral clustering. This iterative spectral clustering approach, along with an appropriate cluster selection strategy in each iteration, naturally helps in searching for the object region in the image. In order to estimate the final localization window, we group the proposals obtained from the iterative spectral clustering step based on perceptual similarity, and average the coordinates of the proposals from the top-scoring groups. We benchmark our algorithm on challenging datasets such as Object Discovery and PASCAL VOC 2007, achieving an average CorLoc percentage of 51% and 35%, respectively, which is comparable to various other weakly supervised algorithms despite being completely unsupervised.
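
The abstract reports results as CorLoc without defining it. A minimal sketch follows, assuming the commonly used definition: the percentage of images whose predicted box overlaps some ground-truth box with intersection-over-union above 0.5. The function names and the one-prediction-per-image layout are choices made for this example.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def corloc(predictions, ground_truths, threshold=0.5):
    """Percentage of images whose predicted box matches any ground-truth box.

    `predictions` holds one box per image; `ground_truths` holds a list
    of boxes per image. The 0.5 threshold is the assumed convention.
    """
    hits = sum(
        1
        for pred, gts in zip(predictions, ground_truths)
        if any(iou(pred, gt) > threshold for gt in gts)
    )
    return 100.0 * hits / len(predictions)
```

Under this definition, an image counts as correctly localized only if the single averaged window described in the abstract overlaps an annotated object sufficiently.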

Flow-free Video Object Segmentation

Jun 29, 2017

Aditya Vora, Shanmuganathan Raman

Abstract: Segmenting a foreground object from a video is a challenging task because of large object deformations, occlusions, and background clutter. In this paper, we propose a frame-by-frame but computationally efficient approach for video object segmentation by clustering visually similar generic object segments throughout the video. Our algorithm segments the various object instances appearing in the video and then performs clustering in order to group visually similar segments into one cluster. Since the object that needs to be segmented appears in most of the video, we can retrieve the foreground segments from the cluster with the largest number of segments, thus filtering out noisy segments that do not represent any object. We then apply a track-and-fill approach in order to localize the object in the frames where the object segmentation framework fails to segment any object. Our algorithm performs comparably to recent automatic methods for video object segmentation when benchmarked on the DAVIS dataset, while being computationally much faster.
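
The cluster selection step described above can be sketched as follows, assuming the segments have already been extracted and assigned cluster labels by some clustering routine. The function name and the input layout are hypothetical; how segments are described and clustered is not stated in the abstract.

```python
from collections import Counter

def foreground_segments(segment_ids, cluster_labels):
    """Keep the segments that belong to the most populated cluster.

    `segment_ids` and `cluster_labels` are parallel lists; the labels
    are assumed to come from an earlier clustering step.
    """
    counts = Counter(cluster_labels)
    # The foreground object is assumed to appear in most of the video,
    # so its cluster should contain the most segments.
    largest_label, _ = counts.most_common(1)[0]
    return [
        seg for seg, label in zip(segment_ids, cluster_labels)
        if label == largest_label
    ]
```

Frames that contribute no segment to the selected cluster are the ones the track-and-fill step would need to handle.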
