Abstract:Precise estimation of global orientation and location is critical to ensure a compelling outdoor Augmented Reality (AR) experience. We address the problem of geo-pose estimation by cross-view matching of query ground images to a geo-referenced aerial satellite image database. Recently, neural network-based methods have shown state-of-the-art performance in cross-view matching. However, most prior works focus only on location estimation and ignore orientation, which does not meet the requirements of outdoor AR applications. We propose a new transformer neural network-based model and a modified triplet ranking loss for joint location and orientation estimation. Experiments on several benchmark cross-view geo-localization datasets show that our model achieves state-of-the-art performance. Furthermore, we extend the single-image query-based geo-localization approach by utilizing temporal information from a navigation pipeline for robust, continuous geo-localization. Experiments on several large-scale real-world video sequences demonstrate that our approach enables high-precision and stable AR insertion.
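The modified loss is not spelled out in the abstract; the following is a minimal PyTorch sketch of how a soft-margin triplet ranking loss could be extended with an orientation term for joint location and orientation estimation. The cosine-based heading penalty and the weight lambda_rot are illustrative assumptions, not the paper's exact formulation.

import torch

def joint_triplet_loss(q, pos, neg, pred_theta, gt_theta, lambda_rot=1.0):
    """Soft-margin triplet ranking on location plus a heading term (a sketch).

    q, pos, neg : L2-normalized embeddings of the query ground image, the
                  matching aerial patch, and a non-matching patch, shape (B, D).
    pred_theta  : predicted query heading in radians, shape (B,).
    gt_theta    : ground-truth heading in radians, shape (B,).
    """
    d_pos = (q - pos).pow(2).sum(dim=1)                         # distance to positive
    d_neg = (q - neg).pow(2).sum(dim=1)                         # distance to negative
    loc_loss = torch.log1p(torch.exp(d_pos - d_neg)).mean()     # soft-margin ranking
    rot_loss = (1.0 - torch.cos(pred_theta - gt_theta)).mean()  # wrap-around aware
    return loc_loss + lambda_rot * rot_loss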
Abstract:Rendering and inverse-rendering algorithms that drive conventional computer graphics have recently been superseded by neural representations (NRs). NRs have recently been used to learn the geometric and material properties of scenes and to synthesize photorealistic imagery from that information, thereby promising a replacement for traditional rendering algorithms with scalable quality and predictable performance. In this work we ask the question: does neural graphics (NG) need hardware support? We studied representative NG applications and found that, to render 4k resolution at 60 FPS, current GPUs fall short of the desired performance by 1.5X-55X. For AR/VR applications, the gap between the desired performance and the required system power is even larger, at 2-4 orders of magnitude. We identify the input encoding and MLP kernels as the performance bottlenecks, consuming 72%, 60%, and 59% of application time for multi-resolution hashgrid, multi-resolution densegrid, and low-resolution densegrid encodings, respectively. We propose the neural graphics processing cluster (NGPC), a scalable and flexible hardware architecture that directly accelerates the input encoding and MLP kernels through dedicated engines and supports a wide range of NG applications. We also accelerate the remaining kernels by fusing them in Vulkan, which yields a 9.94X kernel-level performance improvement over an unfused implementation of the pre-processing and post-processing kernels. Our results show that NGPC delivers up to 58X end-to-end application-level performance improvement; for multi-resolution hashgrid encoding, the average benefit across the four NG applications is 12X, 20X, 33X, and 39X for scaling factors of 8, 16, 32, and 64, respectively. With multi-resolution hashgrid encoding, NGPC enables rendering at 4k resolution and 30 FPS for NeRF and at 8k resolution and 120 FPS for all our other NG applications.
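As a concrete reference for the encoding that dominates runtime, below is a minimal PyTorch sketch of a 2-D multi-resolution hash grid encoding feeding a small MLP (Instant-NGP style). The nearest-vertex lookup (no interpolation), hash prime, table size, and layer sizes are simplifying assumptions and do not reflect NGPC's configuration.

import torch

class HashGridEncoding(torch.nn.Module):
    """Simplified 2-D multi-resolution hash grid encoding (illustrative only)."""
    def __init__(self, levels=8, features=2, table_size=2**14,
                 base_res=16, growth=1.5):
        super().__init__()
        self.levels, self.table_size = levels, table_size
        self.res = [int(base_res * growth**l) for l in range(levels)]
        self.tables = torch.nn.Parameter(
            torch.randn(levels, table_size, features) * 1e-4)

    def _hash(self, ij):                            # ij: (N, 2) integer grid coords
        # XOR of per-dimension coordinate * prime, as in Instant-NGP's spatial hash
        return (ij[:, 0] * 1 ^ ij[:, 1] * 2654435761) % self.table_size

    def forward(self, x):                           # x: (N, 2) coords in [0, 1]
        feats = []
        for l, r in enumerate(self.res):
            ij = torch.floor(x * r).long()          # nearest-vertex lookup (no interp.)
            feats.append(self.tables[l, self._hash(ij)])   # (N, features)
        return torch.cat(feats, dim=-1)             # (N, levels * features)

# Example: encode random 2-D points and feed a tiny MLP (16 = levels * features)
enc = HashGridEncoding()
mlp = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 3))
out = mlp(enc(torch.rand(1024, 2)))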
Abstract:Understanding the geometric relationships between objects in a scene is a core capability that enables both humans and autonomous agents to navigate in new environments. A sparse, unified representation of the scene topology allows agents to move efficiently through their environment, communicate the environment state with others, and utilize the representation for diverse downstream tasks. To this end, we propose a method to train an autonomous agent to accumulate a 3D scene graph representation of its environment while simultaneously learning to navigate through that environment. We demonstrate that our approach, GraphMapper, enables the learning of effective navigation policies through fewer interactions with the environment than vision-based systems alone. Further, we show that GraphMapper can act as a modular scene encoder alongside existing learning-based solutions to not only increase navigational efficiency but also generate intermediate scene representations that are useful for other downstream tasks.
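The abstract does not detail GraphMapper's graph construction; the sketch below only illustrates the general idea of accumulating a sparse scene graph online from per-step object observations, using networkx. The merge and edge distance thresholds and the (label, xyz) detection format are assumptions.

import networkx as nx
import numpy as np

def update_scene_graph(graph, detections, merge_dist=0.5, edge_dist=2.0):
    """Accumulate detections (label, xyz) into a topological scene graph.

    Detections closer than `merge_dist` to an existing node of the same label
    are merged; nodes within `edge_dist` of each other are connected.
    Thresholds are illustrative assumptions.
    """
    for label, xyz in detections:
        xyz = np.asarray(xyz, dtype=float)
        # Merge with an existing node of the same label if one is nearby.
        match = next((n for n, d in graph.nodes(data=True)
                      if d["label"] == label
                      and np.linalg.norm(d["xyz"] - xyz) < merge_dist), None)
        if match is None:
            match = graph.number_of_nodes()
            graph.add_node(match, label=label, xyz=xyz)
        # Connect spatially close nodes to form the topology.
        for n, d in list(graph.nodes(data=True)):
            if n != match and np.linalg.norm(d["xyz"] - xyz) < edge_dist:
                graph.add_edge(match, n)
    return graph

g = update_scene_graph(nx.Graph(), [("chair", (1.0, 0.0, 0.2)),
                                    ("table", (1.4, 0.1, 0.0))])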
Abstract:This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments, which requires an autonomous agent to follow natural language instructions in unseen environments. Existing end-to-end learning-based VLN methods struggle at this task as they focus mostly on utilizing raw visual observations and lack the semantic spatio-temporal reasoning capabilities that are crucial for generalizing to new environments. In this regard, we present a hybrid transformer-recurrence model that combines classical semantic mapping techniques with a learning-based method. Our method creates a temporal semantic memory by building a top-down, local, egocentric semantic map and performs cross-modal grounding to align the map and language modalities, enabling effective learning of the VLN policy. Empirical results in a photo-realistic, long-horizon simulation environment show that the proposed approach outperforms a variety of state-of-the-art methods and baselines, with over 22% relative improvement in SPL in previously unseen environments.
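A minimal sketch of the cross-modal grounding step, assuming flattened top-down map features attend over encoded instruction tokens with a single cross-attention layer; the dimensions and the use of torch.nn.MultiheadAttention are illustrative, not the paper's exact architecture.

import torch
import torch.nn as nn

class MapLanguageGrounding(nn.Module):
    """Cross-modal grounding: map cells attend over instruction tokens (a sketch)."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, map_feats, lang_feats):
        # map_feats : (B, H*W, D) flattened top-down semantic map features
        # lang_feats: (B, T, D)   encoded instruction tokens
        grounded, _ = self.attn(query=map_feats, key=lang_feats, value=lang_feats)
        return self.norm(map_feats + grounded)   # residual, language-conditioned map

m = MapLanguageGrounding()
out = m(torch.randn(2, 14 * 14, 256), torch.randn(2, 20, 256))   # (2, 196, 256)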
Abstract:Visual navigation for autonomous agents is a core task in the fields of computer vision and robotics. Learning-based methods, such as deep reinforcement learning, have the potential to outperform the classical solutions developed for this task; however, they come at a significantly increased computational load. Through this work, we design a novel approach that focuses on performing better than, or comparably to, existing learning-based solutions under a clear time/computation budget. To this end, we propose a method to encode vital scene semantics such as traversable paths, unexplored areas, and observed scene objects -- alongside raw visual streams such as RGB, depth, and semantic segmentation masks -- into a semantically informed, top-down egocentric map representation. Further, to enable the effective use of this information, we introduce a novel 2-D map attention mechanism, based on the successful multi-layer Transformer networks. We conduct experiments on PointGoal visual navigation in 3-D reconstructed indoor environments and demonstrate the effectiveness of our approach. We show that by using our novel attention schema and auxiliary rewards to better utilize scene semantics, we outperform multiple baselines trained with only raw inputs or implicit semantic information while operating with an 80% decrease in the agent's experience.
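A minimal sketch of a 2-D map attention block under the stated assumptions: the egocentric map is projected to one token per cell, given learned 2-D positional embeddings, and passed through a small Transformer encoder. Channel counts, map size, and depth are illustrative, not the paper's configuration.

import torch
import torch.nn as nn

class MapAttention(nn.Module):
    """Self-attention over a flattened top-down egocentric map (a sketch)."""
    def __init__(self, in_ch=16, d_model=128, n_heads=4, n_layers=2, map_size=32):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=1)
        self.pos = nn.Parameter(torch.zeros(1, map_size * map_size, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, map_tensor):                 # (B, in_ch, H, W) semantic map
        x = self.proj(map_tensor)                  # (B, D, H, W)
        x = x.flatten(2).transpose(1, 2)           # (B, H*W, D) map cells as tokens
        return self.encoder(x + self.pos)          # (B, H*W, D) attended map features

attn = MapAttention()
feats = attn(torch.randn(2, 16, 32, 32))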
Abstract:The structure of retinal blood vessels contains information about diseases such as obesity, diabetes, hypertension, and glaucoma, and this information is very useful in the identification and treatment of these diseases. Obtaining it requires segmenting the retinal vessels. Many kernel-based methods have been proposed for retinal vessel segmentation, but their kernels do not match the vessel intensity profile well, which causes poor performance. To overcome this, we propose a new and efficient kernel-based matched filter approach. The new matched filter is used to generate the matched filter response (MFR) image, and Otsu thresholding is then applied to the MFR image to extract the vessels. We conducted extensive experiments to choose the best parameter values for the proposed matched filter kernel. The proposed approach is evaluated and validated on two publicly available datasets, DRIVE and STARE, achieving specificity of 98.50% and 98.23% and accuracy of 95.77% and 95.13%, respectively. The results confirm that the proposed method outperforms existing approaches; the improvement comes from the proposed kernel, which matches the retinal blood vessel profile more accurately.
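A minimal sketch of the matched-filter-plus-Otsu pipeline: a bank of rotated, zero-mean Gaussian-profile kernels produces the MFR image as the maximum response over orientations, which is then binarized with Otsu's method. The sigma, kernel length, and number of orientations are placeholder values, not the tuned parameters reported in the paper.

import numpy as np
from scipy.ndimage import convolve
from skimage.filters import threshold_otsu

def gaussian_matched_filter_bank(sigma=1.5, length=9, n_angles=12):
    """Rotated, zero-mean Gaussian-profile kernels (illustrative parameters)."""
    half = length // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    kernels = []
    for theta in np.linspace(0, np.pi, n_angles, endpoint=False):
        # Gaussian profile across the vessel direction, flat along it; vessels
        # are darker than background, hence the negative profile.
        u = x * np.cos(theta) + y * np.sin(theta)
        k = -np.exp(-(u ** 2) / (2 * sigma ** 2))
        kernels.append(k - k.mean())               # zero-mean kernel
    return kernels

def segment_vessels(green_channel):
    """Max response over orientations (MFR image), then Otsu thresholding."""
    responses = [convolve(green_channel.astype(float), k)
                 for k in gaussian_matched_filter_bank()]
    mfr = np.max(responses, axis=0)
    return mfr > threshold_otsu(mfr)

vessels = segment_vessels(np.random.rand(64, 64))   # binary vessel mask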
Abstract:We study an important, yet largely unexplored, problem of large-scale cross-modal visual localization by matching ground RGB images to a geo-referenced aerial LIDAR 3D point cloud (rendered as depth images). Prior works were demonstrated on small datasets and did not lend themselves to scaling up for large-scale applications. To enable large-scale evaluation, we introduce a new dataset containing over 550K pairs (covering a 143 km^2 area) of RGB and aerial LIDAR depth images. We propose a novel joint-embedding-based method that effectively combines appearance and semantic cues from both modalities to handle drastic cross-modal variations. Experiments on the proposed dataset show that our model achieves a strong result of a median rank of 5 in matching across a large test set of 50K location pairs collected from a 14 km^2 area. This represents a significant advancement over prior works in performance and scale. We conclude with qualitative results that highlight the challenging nature of this task and the benefits of the proposed model. Our work provides a foundation for further research in cross-modal visual localization.
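A minimal sketch of a two-branch joint embedding for this cross-modal retrieval setting: a ground-RGB encoder and an aerial-LIDAR-depth encoder map both modalities into one shared metric space, and matching is done by cosine similarity. The ResNet-18 backbones and embedding size are assumptions; the paper's model additionally fuses semantic cues, which are omitted here.

import torch
import torch.nn as nn
import torchvision.models as models

class JointEmbedding(nn.Module):
    """Two-branch joint embedding into a shared metric space (a sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        self.rgb = models.resnet18(weights=None)
        self.rgb.fc = nn.Linear(512, dim)
        self.depth = models.resnet18(weights=None)
        self.depth.conv1 = nn.Conv2d(1, 64, 7, stride=2, padding=3, bias=False)
        self.depth.fc = nn.Linear(512, dim)

    def forward(self, rgb_img, depth_img):
        a = nn.functional.normalize(self.rgb(rgb_img), dim=1)
        b = nn.functional.normalize(self.depth(depth_img), dim=1)
        return a, b                                   # cosine similarity = a @ b.T

model = JointEmbedding()
g, d = model(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 224, 224))
ranks = (g @ d.T).argsort(dim=1, descending=True)     # retrieval by similarity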
Abstract:Videos take a long time to transport over the network; hence, running analytics on live video at the edge device, right where it was captured, has become an important system driver. However, these edge devices (e.g., IoT devices, surveillance cameras, AR/VR gadgets) are resource-constrained, which makes it impossible to run state-of-the-art heavy Deep Neural Networks (DNNs) on them while still providing low and stable latency under varying circumstances, such as changes in resource availability on the device, the content characteristics, or user requirements. In this paper we introduce ApproxNet, a video analytics system for the edge. It enables novel dynamic approximation techniques to achieve the desired inference latency-accuracy trade-off under different system conditions and resource contention, variations in the complexity of the video content, and user requirements. It achieves this by exposing two approximation knobs within a single DNN model, rather than creating and maintaining an ensemble of models (as in MCDNN [MobiSys-16]); ensemble models run into memory issues on lightweight devices and incur large switching penalties among models in response to runtime changes. We show that ApproxNet can adapt seamlessly at runtime to changes in video content and system dynamics to provide low and stable latency for object detection on a video stream, and we compare its accuracy and latency to ResNet [2015], MCDNN, and MobileNets [Google-2017].
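A minimal sketch of the idea behind runtime approximation knobs: pick the (input resolution, exit depth) setting whose profiled latency, scaled by current contention, fits the latency budget while maximizing profiled accuracy. The profile table values and knob names are made up for illustration and are not ApproxNet's actual scheduler.

# Offline-profiled latency/accuracy per knob setting (illustrative numbers).
PROFILE = {
    # (input_px, exit_layer): (latency_ms, accuracy)
    (224, 24): (95.0, 0.78),
    (160, 24): (62.0, 0.74),
    (160, 16): (41.0, 0.69),
    (112, 16): (27.0, 0.63),
}

def pick_knobs(latency_budget_ms, contention_factor=1.0):
    """Return the most accurate knob setting whose estimated latency,
    scaled by current resource contention, stays within the budget."""
    feasible = [(acc, knobs) for knobs, (lat, acc) in PROFILE.items()
                if lat * contention_factor <= latency_budget_ms]
    if not feasible:
        return min(PROFILE, key=lambda k: PROFILE[k][0])   # fall back to cheapest
    return max(feasible)[1]

print(pick_knobs(latency_budget_ms=50, contention_factor=1.2))   # -> (160, 16)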
Abstract:We present a novel method for fusing appearance and semantic information using visual attention for 2D image-based localization (2D-VL) across extreme changes in viewing conditions. Our deep-learning-based method is motivated by the intuition that specific scene regions remain stable in the semantic modality even in the presence of vast differences in the appearance modality. The proposed attention-based module learns to focus not only on discriminative visual regions for place recognition but also on consistently stable semantic regions to perform 2D-VL. We show the effectiveness of this model by comparing it against state-of-the-art (SOTA) methods on several challenging localization datasets, reporting an average absolute improvement of 19% over current SOTA 2D-VL methods. Furthermore, we present an extensive study demonstrating the effectiveness and contribution of each component of our model, showing an 8%-15% absolute improvement from adding semantic information, and an additional 4% from our proposed attention module, over both prior methods and a competitive baseline.
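A minimal sketch of attention-based fusion in this spirit: a spatial attention map predicted from concatenated appearance and semantic features re-weights the appearance features before pooling them into a place descriptor. Channel sizes and the pooling choice are assumptions, not the paper's exact module.

import torch
import torch.nn as nn

class SemanticGuidedAttention(nn.Module):
    """Spatial attention from semantic cues re-weights appearance features (a sketch)."""
    def __init__(self, app_ch=512, sem_ch=64):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(app_ch + sem_ch, 128, kernel_size=1), nn.ReLU(),
            nn.Conv2d(128, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, app_feat, sem_feat):          # (B, C_a, H, W), (B, C_s, H, W)
        w = self.attn(torch.cat([app_feat, sem_feat], dim=1))   # (B, 1, H, W) weights
        weighted = app_feat * w                                 # focus on stable regions
        return weighted.flatten(2).mean(dim=2)                  # (B, C_a) place descriptor

fuse = SemanticGuidedAttention()
desc = fuse(torch.randn(2, 512, 14, 14), torch.randn(2, 64, 14, 14))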