Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Karthik Ramani

CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image

Dec 17, 2024

Wonseok Roh, Hwanhee Jung, Jong Wook Kim, Seunggwan Lee, Innfarn Yoo, Andreas Lugmayr, Seunggeun Chi, Karthik Ramani, Sangpil Kim

Figure 1 for CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image

Figure 2 for CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image

Figure 3 for CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image

Figure 4 for CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image

Abstract:Recently, generalizable feed-forward methods based on 3D Gaussian Splatting have gained significant attention for their potential to reconstruct 3D scenes using finite resources. These approaches create a 3D radiance field, parameterized by per-pixel 3D Gaussian primitives, from just a few images in a single forward pass. However, unlike multi-view methods that benefit from cross-view correspondences, 3D scene reconstruction with a single-view image remains an underexplored area. In this work, we introduce CATSplat, a novel generalizable transformer-based framework designed to break through the inherent constraints in monocular settings. First, we propose leveraging textual guidance from a visual-language model to complement insufficient information from a single image. By incorporating scene-specific contextual details from text embeddings through cross-attention, we pave the way for context-aware 3D scene reconstruction beyond relying solely on visual cues. Moreover, we advocate utilizing spatial guidance from 3D point features toward comprehensive geometric understanding under single-view settings. With 3D priors, image features can capture rich structural insights for predicting 3D Gaussians without multi-view techniques. Extensive experiments on large-scale datasets demonstrate the state-of-the-art performance of CATSplat in single-view 3D scene reconstruction with high-quality novel view synthesis.

Via

Access Paper or Ask Questions

Estimating Ego-Body Pose from Doubly Sparse Egocentric Video Data

Nov 05, 2024

Seunggeun Chi, Pin-Hao Huang, Enna Sachdeva, Hengbo Ma, Karthik Ramani, Kwonjoon Lee

Figure 1 for Estimating Ego-Body Pose from Doubly Sparse Egocentric Video Data

Figure 2 for Estimating Ego-Body Pose from Doubly Sparse Egocentric Video Data

Figure 3 for Estimating Ego-Body Pose from Doubly Sparse Egocentric Video Data

Figure 4 for Estimating Ego-Body Pose from Doubly Sparse Egocentric Video Data

Abstract:We study the problem of estimating the body movements of a camera wearer from egocentric videos. Current methods for ego-body pose estimation rely on temporally dense sensor data, such as IMU measurements from spatially sparse body parts like the head and hands. However, we propose that even temporally sparse observations, such as hand poses captured intermittently from egocentric videos during natural or periodic hand movements, can effectively constrain overall body motion. Naively applying diffusion models to generate full-body pose from head pose and sparse hand pose leads to suboptimal results. To overcome this, we develop a two-stage approach that decomposes the problem into temporal completion and spatial completion. First, our method employs masked autoencoders to impute hand trajectories by leveraging the spatiotemporal correlations between the head pose sequence and intermittent hand poses, providing uncertainty estimates. Subsequently, we employ conditional diffusion models to generate plausible full-body motions based on these temporally dense trajectories of the head and hands, guided by the uncertainty estimates from the imputation. The effectiveness of our method was rigorously tested and validated through comprehensive experiments conducted on various HMD setup with AMASS and Ego-Exo4D datasets.

* Accepted at NeurIPS 2024

Via

Access Paper or Ask Questions

M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Jul 19, 2024

Seunggeun Chi, Hyung-gun Chi, Hengbo Ma, Nakul Agarwal, Faizan Siddiqui, Karthik Ramani, Kwonjoon Lee

Figure 1 for M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Figure 2 for M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Figure 3 for M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Figure 4 for M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Abstract:We introduce the Multi-Motion Discrete Diffusion Models (M2D2M), a novel approach for human motion generation from textual descriptions of multiple actions, utilizing the strengths of discrete diffusion models. This approach adeptly addresses the challenge of generating multi-motion sequences, ensuring seamless transitions of motions and coherence across a series of actions. The strength of M2D2M lies in its dynamic transition probability within the discrete diffusion model, which adapts transition probabilities based on the proximity between motion tokens, encouraging mixing between different modes. Complemented by a two-phase sampling strategy that includes independent and joint denoising steps, M2D2M effectively generates long-term, smooth, and contextually coherent human motion sequences, utilizing a model trained for single-motion generation. Extensive experiments demonstrate that M2D2M surpasses current state-of-the-art benchmarks for motion generation from text descriptions, showcasing its efficacy in interpreting language semantics and generating dynamic, realistic motions.

Via

Access Paper or Ask Questions

InfoGCN++: Learning Representation by Predicting the Future for Online Human Skeleton-based Action Recognition

Oct 16, 2023

Seunggeun Chi, Hyung-gun Chi, Qixing Huang, Karthik Ramani

Abstract:Skeleton-based action recognition has made significant advancements recently, with models like InfoGCN showcasing remarkable accuracy. However, these models exhibit a key limitation: they necessitate complete action observation prior to classification, which constrains their applicability in real-time situations such as surveillance and robotic systems. To overcome this barrier, we introduce InfoGCN++, an innovative extension of InfoGCN, explicitly developed for online skeleton-based action recognition. InfoGCN++ augments the abilities of the original InfoGCN model by allowing real-time categorization of action types, independent of the observation sequence's length. It transcends conventional approaches by learning from current and anticipated future movements, thereby creating a more thorough representation of the entire sequence. Our approach to prediction is managed as an extrapolation issue, grounded on observed actions. To enable this, InfoGCN++ incorporates Neural Ordinary Differential Equations, a concept that lets it effectively model the continuous evolution of hidden states. Following rigorous evaluations on three skeleton-based action recognition benchmarks, InfoGCN++ demonstrates exceptional performance in online action recognition. It consistently equals or exceeds existing techniques, highlighting its significant potential to reshape the landscape of real-time action recognition applications. Consequently, this work represents a major leap forward from InfoGCN, pushing the limits of what's possible in online, skeleton-based action recognition. The code for InfoGCN++ is publicly available at https://github.com/stnoah1/infogcn2 for further exploration and validation.

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Via

Access Paper or Ask Questions

AircraftVerse: A Large-Scale Multimodal Dataset of Aerial Vehicle Designs

Jun 08, 2023

Adam D. Cobb, Anirban Roy, Daniel Elenius, F. Michael Heim, Brian Swenson, Sydney Whittington, James D. Walker, Theodore Bapty, Joseph Hite, Karthik Ramani(+2 more)

Figure 1 for AircraftVerse: A Large-Scale Multimodal Dataset of Aerial Vehicle Designs

Figure 2 for AircraftVerse: A Large-Scale Multimodal Dataset of Aerial Vehicle Designs

Figure 3 for AircraftVerse: A Large-Scale Multimodal Dataset of Aerial Vehicle Designs

Figure 4 for AircraftVerse: A Large-Scale Multimodal Dataset of Aerial Vehicle Designs

Abstract:We present AircraftVerse, a publicly available aerial vehicle design dataset. Aircraft design encompasses different physics domains and, hence, multiple modalities of representation. The evaluation of these cyber-physical system (CPS) designs requires the use of scientific analytical and simulation models ranging from computer-aided design tools for structural and manufacturing analysis, computational fluid dynamics tools for drag and lift computation, battery models for energy estimation, and simulation models for flight control and dynamics. AircraftVerse contains 27,714 diverse air vehicle designs - the largest corpus of engineering designs with this level of complexity. Each design comprises the following artifacts: a symbolic design tree describing topology, propulsion subsystem, battery subsystem, and other design details; a STandard for the Exchange of Product (STEP) model data; a 3D CAD design using a stereolithography (STL) file format; a 3D point cloud for the shape of the design; and evaluation results from high fidelity state-of-the-art physics models that characterize performance metrics such as maximum flight distance and hover-time. We also present baseline surrogate models that use different modalities of design representation to predict design performance metrics, which we provide as part of our dataset release. Finally, we discuss the potential impact of this dataset on the use of learning in aircraft design and, more generally, in CPS. AircraftVerse is accompanied by a data card, and it is released under Creative Commons Attribution-ShareAlike (CC BY-SA) license. The dataset is hosted at https://zenodo.org/record/6525446, baseline models and code at https://github.com/SRI-CSL/AircraftVerse, and the dataset description at https://aircraftverse.onrender.com/.

* The dataset is hosted at https://zenodo.org/record/6525446, baseline models and code at https://github.com/SRI-CSL/AircraftVerse, and the dataset description at https://aircraftverse.onrender.com/

Via

Access Paper or Ask Questions

Egocentric View Hand Action Recognition by Leveraging Hand Surface and Hand Grasp Type

Sep 08, 2021

Sangpil Kim, Jihyun Bae, Hyunggun Chi, Sunghee Hong, Byoung Soo Koh, Karthik Ramani

Figure 1 for Egocentric View Hand Action Recognition by Leveraging Hand Surface and Hand Grasp Type

Figure 2 for Egocentric View Hand Action Recognition by Leveraging Hand Surface and Hand Grasp Type

Figure 3 for Egocentric View Hand Action Recognition by Leveraging Hand Surface and Hand Grasp Type

Figure 4 for Egocentric View Hand Action Recognition by Leveraging Hand Surface and Hand Grasp Type

Abstract:We introduce a multi-stage framework that uses mean curvature on a hand surface and focuses on learning interaction between hand and object by analyzing hand grasp type for hand action recognition in egocentric videos. The proposed method does not require 3D information of objects including 6D object poses which are difficult to annotate for learning an object's behavior while it interacts with hands. Instead, the framework synthesizes the mean curvature of the hand mesh model to encode the hand surface geometry in 3D space. Additionally, our method learns the hand grasp type which is highly correlated with the hand action. From our experiment, we notice that using hand grasp type and mean curvature of hand increases the performance of the hand action recognition.

Via

Access Paper or Ask Questions

DAR-Net: Dynamic Aggregation Network for Semantic Scene Segmentation

Jul 28, 2019

Zongyue Zhao, Min Liu, Karthik Ramani

Figure 1 for DAR-Net: Dynamic Aggregation Network for Semantic Scene Segmentation

Figure 2 for DAR-Net: Dynamic Aggregation Network for Semantic Scene Segmentation

Figure 3 for DAR-Net: Dynamic Aggregation Network for Semantic Scene Segmentation

Figure 4 for DAR-Net: Dynamic Aggregation Network for Semantic Scene Segmentation

Abstract:Traditional grid/neighbor-based static pooling has become a constraint for point cloud geometry analysis. In this paper, we propose DAR-Net, a novel network architecture that focuses on dynamic feature aggregation. The central idea of DAR-Net is generating a self-adaptive pooling skeleton that considers both scene complexity and local geometry features. Providing variable semi-local receptive fields and weights, the skeleton serves as a bridge that connect local convolutional feature extractors and a global recurrent feature integrator. Experimental results on indoor scene datasets show advantages of the proposed approach compared to state-of-the-art architectures that adopt static pooling methods.

Via

Access Paper or Ask Questions

CT-GAN: Conditional Transformation Generative Adversarial Network for Image Attribute Modification

Aug 31, 2018

Sangpil Kim, Nick Winovich, Guang Lin, Karthik Ramani

Figure 1 for CT-GAN: Conditional Transformation Generative Adversarial Network for Image Attribute Modification

Figure 2 for CT-GAN: Conditional Transformation Generative Adversarial Network for Image Attribute Modification

Figure 3 for CT-GAN: Conditional Transformation Generative Adversarial Network for Image Attribute Modification

Figure 4 for CT-GAN: Conditional Transformation Generative Adversarial Network for Image Attribute Modification

Abstract:We propose a novel, fully-convolutional conditional generative model capable of learning image transformations using a light-weight network suited for real-time applications. We introduce the conditional transformation unit (CTU) designed to produce specified attribute modifications and an adaptive discriminator used to stabilize the learning procedure. We show that the network is capable of accurately modeling several discrete modifications simultaneously and can produce seamless continuous attribute modification via piece-wise interpolation. We also propose a task-divided decoder that incorporates a refinement map, designed to improve the network's coarse pixel estimation, along with RGB color balance parameters. We exceed state-of-the-art results on synthetic face and chair datasets and demonstrate the model's robustness using real hand pose datasets. Moreover, the proposed fully-convolutional model requires significantly fewer weights than conventional alternatives and is shown to provide an effective framework for producing a diverse range of real-time image attribute modifications.

Via

Access Paper or Ask Questions

3D Object Classification via Spherical Projections

Dec 12, 2017

Zhangjie Cao, Qixing Huang, Karthik Ramani

Figure 1 for 3D Object Classification via Spherical Projections

Figure 2 for 3D Object Classification via Spherical Projections

Figure 3 for 3D Object Classification via Spherical Projections

Figure 4 for 3D Object Classification via Spherical Projections

Abstract:In this paper, we introduce a new method for classifying 3D objects. Our main idea is to project a 3D object onto a spherical domain centered around its barycenter and develop neural network to classify the spherical projection. We introduce two complementary projections. The first captures depth variations of a 3D object, and the second captures contour-information viewed from different angles. Spherical projections combine key advantages of two main-stream 3D classification methods: image-based and 3D-based. Specifically, spherical projections are locally planar, allowing us to use massive image datasets (e.g, ImageNet) for pre-training. Also spherical projections are similar to voxel-based methods, as they encode complete information of a 3D object in a single neural network capturing dependencies across different views. Our novel network design can fully utilize these advantages. Experimental results on ModelNet40 and ShapeNetCore show that our method is superior to prior methods.

Via

Access Paper or Ask Questions

SurfNet: Generating 3D shape surfaces using deep residual networks

Mar 12, 2017

Ayan Sinha, Asim Unmesh, Qixing Huang, Karthik Ramani

Figure 1 for SurfNet: Generating 3D shape surfaces using deep residual networks

Figure 2 for SurfNet: Generating 3D shape surfaces using deep residual networks

Figure 3 for SurfNet: Generating 3D shape surfaces using deep residual networks

Figure 4 for SurfNet: Generating 3D shape surfaces using deep residual networks

Abstract:3D shape models are naturally parameterized using vertices and faces, \ie, composed of polygons forming a surface. However, current 3D learning paradigms for predictive and generative tasks using convolutional neural networks focus on a voxelized representation of the object. Lifting convolution operators from the traditional 2D to 3D results in high computational overhead with little additional benefit as most of the geometry information is contained on the surface boundary. Here we study the problem of directly generating the 3D shape surface of rigid and non-rigid shapes using deep convolutional neural networks. We develop a procedure to create consistent `geometry images' representing the shape surface of a category of 3D objects. We then use this consistent representation for category-specific shape surface generation from a parametric representation or an image by developing novel extensions of deep residual networks for the task of geometry image generation. Our experiments indicate that our network learns a meaningful representation of shape surfaces allowing it to interpolate between shape orientations and poses, invent new shape surfaces and reconstruct 3D shape surfaces from previously unseen images.

* CVPR 2017 paper

Via

Access Paper or Ask Questions