Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Prabhdeep Singh Sethi

Carnegie Mellon University

StyleSplat: 3D Object Style Transfer with Gaussian Splatting

Jul 12, 2024

Sahil Jain, Avik Kuthiala, Prabhdeep Singh Sethi, Prakanshul Saxena

Abstract:Recent advancements in radiance fields have opened new avenues for creating high-quality 3D assets and scenes. Style transfer can enhance these 3D assets with diverse artistic styles, transforming creative expression. However, existing techniques are often slow or unable to localize style transfer to specific objects. We introduce StyleSplat, a lightweight method for stylizing 3D objects in scenes represented by 3D Gaussians from reference style images. Our approach first learns a photorealistic representation of the scene using 3D Gaussian splatting while jointly segmenting individual 3D objects. We then use a nearest-neighbor feature matching loss to finetune the Gaussians of the selected objects, aligning their spherical harmonic coefficients with the style image to ensure consistency and visual appeal. StyleSplat allows for quick, customizable style transfer and localized stylization of multiple objects within a scene, each with a different style. We demonstrate its effectiveness across various 3D scenes and styles, showcasing enhanced control and customization in 3D creation.

* for code and results, see http://bernard0047.github.io/stylesplat

Via

Access Paper or Ask Questions

Listen Then See: Video Alignment with Speaker Attention

Apr 21, 2024

Aviral Agrawal, Carlos Mateo Samudio Lezcano, Iqui Balam Heredia-Marin, Prabhdeep Singh Sethi

Figure 1 for Listen Then See: Video Alignment with Speaker Attention

Figure 2 for Listen Then See: Video Alignment with Speaker Attention

Figure 3 for Listen Then See: Video Alignment with Speaker Attention

Figure 4 for Listen Then See: Video Alignment with Speaker Attention

Abstract:Video-based Question Answering (Video QA) is a challenging task and becomes even more intricate when addressing Socially Intelligent Question Answering (SIQA). SIQA requires context understanding, temporal reasoning, and the integration of multimodal information, but in addition, it requires processing nuanced human behavior. Furthermore, the complexities involved are exacerbated by the dominance of the primary modality (text) over the others. Thus, there is a need to help the task's secondary modalities to work in tandem with the primary modality. In this work, we introduce a cross-modal alignment and subsequent representation fusion approach that achieves state-of-the-art results (82.06\% accuracy) on the Social IQ 2.0 dataset for SIQA. Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge with the language modality. This leads to enhanced performance by reducing the prevalent issue of language overfitting and resultant video modality bypassing encountered by current existing techniques. Our code and models are publicly available at https://github.com/sts-vlcc/sts-vlcc

Via

Access Paper or Ask Questions