Abstract:We present a volume rendering-based neural surface reconstruction method that takes as few as three disparate RGB images as input. Our key idea is to regularize the reconstruction, which is severely ill-posed and leaving significant gaps between the sparse views, by learning a set of neural templates that act as surface priors. Our method coined DiViNet, operates in two stages. The first stage learns the templates, in the form of 3D Gaussian functions, across different scenes, without 3D supervision. In the reconstruction stage, our predicted templates serve as anchors to help "stitch" the surfaces over sparse regions. We demonstrate that our approach is not only able to complete the surface geometry but also reconstructs surface details to a reasonable extent from few disparate input views. On the DTU and BlendedMVS datasets, our approach achieves the best reconstruction quality among existing methods in the presence of such sparse views, and performs on par, if not better, with competing methods when dense views are employed as inputs.
Abstract:In this paper, we propose FCHD-Fully Convolutional Head Detector, which is an end-to-end trainable head detection model, which runs at 5 fps and with 0.70 average precision (AP), on a very modest GPU. Recent head detection techniques have avoided using anchors as a starting point for detection especially in the cases where the detection has to happen in the wild. The reason is poor performance of anchor-based techniques under scenarios where the object size is small. We argue that a good AP can be obtained with carefully designed anchors, where the anchor design choices are made based on the receptive field size of the hidden layers. Our contribution is two folds. 1) A simple fully convolutional anchor based model which is end-to-end trainable and has a very low inference time. 2) Carefully chosen anchor sizes which play a key role in getting good average precision. Our model achieves comparable results than many other baselines on challenging head detection dataset like BRAINWASH. Along with accuracy, our model has least runtime among all the baselines along with modest hardware requirements which makes it suitable for edge deployments in surveillance applications. The code is made open-source at https://github.com/aditya-vora/FCHD-Fully-Convolutional-Head-Detector.
Abstract:In this paper, we present a technique for unsupervised learning of visual representations. Specifically, we train a model for foreground and background classification task, in the process of which it learns visual representations. Foreground and background patches for training come af- ter mining for such patches from hundreds and thousands of unlabelled videos available on the web which we ex- tract using a proposed patch extraction algorithm. With- out using any supervision, with just using 150, 000 unla- belled videos and the PASCAL VOC 2007 dataset, we train a object recognition model that achieves 45.3 mAP which is close to the best performing unsupervised feature learn- ing technique whereas better than many other proposed al- gorithms. The code for patch extraction is implemented in Matlab and available open source at the following link .
Abstract:This paper addresses the problem of unsupervised object localization in an image. Unlike previous supervised and weakly supervised algorithms that require bounding box or image level annotations for training classifiers in order to learn features representing the object, we propose a simple yet effective technique for localization using iterative spectral clustering. This iterative spectral clustering approach along with appropriate cluster selection strategy in each iteration naturally helps in searching of object region in the image. In order to estimate the final localization window, we group the proposals obtained from the iterative spectral clustering step based on the perceptual similarity, and average the coordinates of the proposals from the top scoring groups. We benchmark our algorithm on challenging datasets like Object Discovery and PASCAL VOC 2007, achieving an average CorLoc percentage of 51% and 35% respectively which is comparable to various other weakly supervised algorithms despite being completely unsupervised.
Abstract:Segmenting foreground object from a video is a challenging task because of the large deformations of the objects, occlusions, and background clutter. In this paper, we propose a frame-by-frame but computationally efficient approach for video object segmentation by clustering visually similar generic object segments throughout the video. Our algorithm segments various object instances appearing in the video and then perform clustering in order to group visually similar segments into one cluster. Since the object that needs to be segmented appears in most part of the video, we can retrieve the foreground segments from the cluster having maximum number of segments, thus filtering out noisy segments that do not represent any object. We then apply a track and fill approach in order to localize the objects in the frames where the object segmentation framework fails to segment any object. Our algorithm performs comparably to the recent automatic methods for video object segmentation when benchmarked on DAVIS dataset while being computationally much faster.