Divya Kothandaraman

CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models

Apr 13, 2025

Pooja Guhan, Divya Kothandaraman, Tsung-Wei Huang, Guan-Ming Su, Dinesh Manocha

Abstract: We introduce CamMimic, an innovative algorithm tailored for dynamic video editing needs. It is designed to seamlessly transfer the camera motion observed in a given reference video onto any scene of the user's choice in a zero-shot manner without requiring any additional data. Our algorithm achieves this using a two-phase strategy by leveraging a text-to-video diffusion model. In the first phase, we develop a multi-concept learning method using a combination of LoRA layers and an orthogonality loss to capture and understand the underlying spatial-temporal characteristics of the reference video as well as the spatial features of the user's desired scene. The second phase proposes a unique homography-based refinement strategy to enhance the temporal and spatial alignment of the generated video. We demonstrate the efficacy of our method through experiments conducted on a dataset containing combinations of diverse scenes and reference videos containing a variety of camera motions. In the absence of an established metric for assessing camera motion transfer between unrelated scenes, we propose CameraScore, a novel metric that utilizes homography representations to measure camera motion similarity between the reference and generated videos. Extensive quantitative and qualitative evaluations demonstrate that our approach generates high-quality, motion-enhanced videos. Additionally, a user study reveals that 70.31% of participants preferred our method for scene preservation, while 90.45% favored it for motion transfer. We hope this work lays the foundation for future advancements in camera motion transfer across different scenes.
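The abstract mentions an orthogonality loss but does not give its form. As a rough illustration only, one common way to penalize overlap between two learned factors is the squared Frobenius norm of their cross-product. The function name, the matrix shapes, and the choice of penalty below are assumptions for illustration, not the CamMimic formulation.

```python
import numpy as np

def orthogonality_penalty(a, b):
    """Squared Frobenius norm of a^T b.

    Zero exactly when every column of a is orthogonal to every column of b.
    This is a generic penalty, assumed here for illustration.
    """
    return float(np.sum((a.T @ b) ** 2))

# Two hypothetical low-rank factors that share the same input dimension (4).
motion_factor = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
scene_factor = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

penalty_disjoint = orthogonality_penalty(motion_factor, scene_factor)
penalty_identical = orthogonality_penalty(motion_factor, motion_factor)
```

Factors that occupy disjoint subspaces incur no penalty, while a factor compared against itself does, which is the behavior one would want when encouraging two concepts to be stored separately.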

Text2Story: Advancing Video Storytelling with Text Guidance

Mar 08, 2025

Taewon Kang, Divya Kothandaraman, Ming C. Lin

Abstract: Generating coherent long-form video sequences from discrete input using only text prompts is a critical task in content creation. While diffusion-based models excel at short video synthesis, long-form storytelling from text remains largely unexplored and challenging because of the difficulty of maintaining temporal coherence, semantic meaning, and action continuity across the video. We introduce a novel storytelling approach to enable seamless video generation with natural action transitions and structured narratives. We present a bidirectional time-weighted latent blending strategy to ensure temporal consistency between segments of the long-form video being generated. Further, our method extends the Black-Scholes algorithm from prompt mixing for image generation to video generation, enabling controlled motion evolution through structured text conditioning. To further enhance motion continuity, we propose a semantic action representation framework to encode high-level action semantics into the blending process, dynamically adjusting transitions based on action similarity, ensuring smooth yet adaptable motion changes. Latent space blending maintains spatial coherence between objects in a scene, while time-weighted blending enforces bidirectional constraints for temporal consistency. This integrative approach prevents abrupt transitions while ensuring fluid storytelling. Extensive experiments demonstrate significant improvements over baselines, achieving temporally consistent and visually compelling video narratives without any additional training. Our approach bridges the gap between short clips and extended video to establish a new paradigm in GenAI-driven video synthesis from text.
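To make the idea of time-weighted latent blending concrete, here is a minimal sketch that crossfades the overlapping frames of two adjacent latent segments with linearly varying weights. The function name, the linear weight schedule, and the array layout (frames first) are assumptions; the abstract does not specify the actual weighting or how the bidirectional constraint is enforced.

```python
import numpy as np

def blend_overlap(prev_tail, next_head):
    """Crossfade two latent clips of identical shape (frames, ...).

    Early overlap frames stay close to the outgoing segment and late
    frames stay close to the incoming one (assumed linear schedule).
    """
    if prev_tail.shape != next_head.shape:
        raise ValueError("overlapping clips must have the same shape")
    n = prev_tail.shape[0]
    # Weight on the incoming segment rises linearly across the overlap.
    w = np.arange(1, n + 1) / (n + 1)
    # Reshape so the per-frame weight broadcasts over the latent dimensions.
    w = w.reshape((n,) + (1,) * (prev_tail.ndim - 1))
    return (1.0 - w) * prev_tail + w * next_head

# Hypothetical latents: 3 overlapping frames, 2 latent values per frame.
blended = blend_overlap(np.zeros((3, 2)), np.ones((3, 2)))
```

With three overlapping frames, the incoming segment receives weights 0.25, 0.5, and 0.75, so neither segment dominates the transition.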

* 15 pages, 6 figures

Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance

Aug 12, 2024

Taewon Kang, Divya Kothandaraman, Dinesh Manocha, Ming C. Lin

Abstract: Recent 3D novel view synthesis (NVS) methods are limited to single-object-centric scenes generated from new viewpoints and struggle with complex environments. They often require extensive 3D data for training and generalize poorly beyond the training distribution. Conversely, 3D-free methods can generate text-controlled views of complex, in-the-wild scenes using a pretrained stable diffusion model without tedious fine-tuning, but lack camera control. In this paper, we introduce HawkI++, a method capable of generating camera-controlled viewpoints from a single input image. HawkI++ excels in handling complex and diverse scenes without additional 3D data or extensive training. It leverages widely available pretrained NVS models for weak guidance, integrating this knowledge into a 3D-free view synthesis approach to achieve the desired results efficiently. Our experimental results demonstrate that HawkI++ outperforms existing models in both qualitative and quantitative evaluations, providing high-fidelity and consistent novel view synthesis at desired camera angles across a wide variety of scenes.

* 6 pages, 7 figures

Text Prompting for Multi-Concept Video Customization by Autoregressive Generation

May 22, 2024

Divya Kothandaraman, Kihyuk Sohn, Ruben Villegas, Paul Voigtlaender, Dinesh Manocha, Mohammad Babaeizadeh

Abstract: We present a method for multi-concept customization of pretrained text-to-video (T2V) models. Intuitively, the multi-concept customized video can be derived from the (non-linear) intersection of the video manifolds of the individual concepts, which is not straightforward to find. We hypothesize that sequential and controlled walking towards the intersection of the video manifolds, directed by text prompting, leads to the solution. To do so, we generate the various concepts and their corresponding interactions, sequentially, in an autoregressive manner. Our method can generate videos of multiple custom concepts (subjects, action and background) such as a teddy bear running towards a brown teapot, a dog playing violin and a teddy bear swimming in the ocean. We quantitatively evaluate our method using videoCLIP and DINO scores, in addition to human evaluation. Videos for results presented in this paper can be found at https://github.com/divyakraman/MultiConceptVideo2024.

* Paper accepted to AI4CC Workshop at CVPR 2024

Prompt Mixing in Diffusion Models using the Black Scholes Algorithm

May 22, 2024

Divya Kothandaraman, Ming Lin, Dinesh Manocha

Abstract: We introduce a novel approach for prompt mixing, aiming to generate images at the intersection of multiple text prompts using pre-trained text-to-image diffusion models. At each time step during diffusion denoising, our algorithm forecasts predictions w.r.t. the generated image and makes informed text conditioning decisions. To do so, we leverage the connection between diffusion models (rooted in non-equilibrium thermodynamics) and the Black-Scholes model for pricing options in finance, and draw analogies between the variables in both contexts to derive an appropriate algorithm for prompt mixing using the Black-Scholes model. Specifically, the parallels between diffusion models and the Black-Scholes model enable us to leverage properties related to the dynamics of the Markovian model derived in the Black-Scholes algorithm. Our prompt-mixing algorithm is data-efficient, meaning it does not need additional training. Furthermore, it operates without human intervention or hyperparameter tuning. We highlight the benefits of our approach by comparing it qualitatively and quantitatively to other prompt mixing techniques, including linear interpolation, alternating prompts, step-wise prompt switching, and CLIP-guided prompt selection across various scenarios such as single object per text prompt, multiple objects per text prompt and objects against backgrounds. Code is available at https://github.com/divyakraman/BlackScholesDiffusion2024.
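For readers unfamiliar with the finance side of the analogy, the sketch below computes the standard Black-Scholes price of a European call option. It illustrates only the classical pricing formula; how the paper maps its variables onto diffusion time steps and prompts is not described in the abstract and is not reproduced here. The function and parameter names are illustrative.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(spot, strike, rate, sigma, maturity):
    """Classical Black-Scholes price of a European call option.

    spot: current asset price, strike: exercise price,
    rate: risk-free rate, sigma: volatility, maturity: time in years.
    """
    vol_sqrt_t = sigma * math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma ** 2) * maturity) / vol_sqrt_t
    d2 = d1 - vol_sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)

# At-the-money call, 5% rate, 20% volatility, one year to maturity.
price = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
```

The price depends on the current value, a target value, the volatility, and the time remaining, which is the kind of forecast-from-current-state quantity the abstract draws an analogy to.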

AerialBooth: Mutual Information Guidance for Text Controlled Aerial View Synthesis from a Single Image

Nov 27, 2023

Divya Kothandaraman, Tianyi Zhou, Ming Lin, Dinesh Manocha

Abstract: We present a novel method, AerialBooth, for synthesizing the aerial view from a single input image using its text description. We leverage the pretrained text-to-2D image stable diffusion model as prior knowledge of the 3D world. The model is finetuned in two steps to optimize for the text embedding and the UNet that reconstruct the input image and its inverse perspective mapping, respectively. The inverse perspective mapping creates variance within the text-image space of the diffusion model, while providing weak guidance for aerial view synthesis. At inference, we steer the contents of the generated image towards the input image using novel mutual information guidance that maximizes the information content between the probability distributions of the two images. We evaluate our approach on a wide spectrum of real and synthetic data, including natural scenes, indoor scenes, human action, etc. Through extensive experiments and ablation studies, we demonstrate the effectiveness of AerialBooth and also its generalizability to other text-controlled views. We also show that AerialBooth achieves the best viewpoint-fidelity trade-off through quantitative evaluation on 7 metrics analyzing viewpoint and fidelity w.r.t. input image. Code and data are available at https://github.com/divyakraman/AerialBooth2023.
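As a rough illustration of what mutual information between two images measures, the sketch below estimates it from a joint intensity histogram. This is a textbook histogram estimator, assumed here for illustration; the abstract does not state how AerialBooth computes or differentiates its guidance term, and the bin count is an arbitrary choice.

```python
import numpy as np

def mutual_information(img_a, img_b, bins=32):
    """Histogram estimate of mutual information (in nats) between two images."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()            # joint distribution over intensity bins
    px = pxy.sum(axis=1, keepdims=True)  # marginal of img_a
    py = pxy.sum(axis=0, keepdims=True)  # marginal of img_b
    nonzero = pxy > 0                    # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

rng = np.random.default_rng(0)
image = rng.random((64, 64))
unrelated = rng.random((64, 64))

mi_self = mutual_information(image, image)
mi_unrelated = mutual_information(image, unrelated)
```

An image shares far more information with itself than with an unrelated random image, so maximizing a quantity of this kind pulls the generated content toward the input.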

PMI Sampler: Patch similarity guided frame selection for Aerial Action Recognition

Apr 14, 2023

Ruiqi Xian, Xijun Wang, Divya Kothandaraman, Dinesh Manocha

Abstract: We present a new algorithm for selection of informative frames in video action recognition. Our approach is designed for aerial videos captured using a moving camera where human actors occupy a small spatial resolution of video frames. Our algorithm utilizes the motion bias within aerial videos, which enables the selection of motion-salient frames. We introduce the concept of patch mutual information (PMI) score to quantify the motion bias between adjacent frames, by measuring the similarity of patches. We use this score to assess the amount of discriminative motion information contained in one frame relative to another. We present an adaptive frame selection strategy using a shifted leaky ReLU and a cumulative distribution function, which ensures that the sampled frames comprehensively cover all the essential segments with high motion salience. Our approach can be integrated with any action recognition model to enhance its accuracy. In practice, our method achieves a relative improvement of 2.2% to 13.8% in top-1 accuracy on UAV-Human, 6.8% on NEC Drone, and 9.0% on Diving48 datasets.
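The cumulative-distribution part of the selection strategy can be sketched as follows: given a per-frame motion score, build the normalized cumulative sum and pick frames at evenly spaced quantiles, so that high-motion stretches receive more samples. The score values, the function name, and the quantile placement are assumptions for illustration; the PMI score itself and the shifted leaky ReLU step are not reproduced here.

```python
import numpy as np

def select_frames(scores, k):
    """Pick k frame indices at evenly spaced quantiles of the score CDF."""
    scores = np.asarray(scores, dtype=float)
    cdf = np.cumsum(scores) / scores.sum()
    # Quantile midpoints: 1/(2k), 3/(2k), ..., (2k-1)/(2k).
    targets = (np.arange(k) + 0.5) / k
    idx = np.searchsorted(cdf, targets)
    # Guard against an index one past the end from rounding in the CDF.
    return np.minimum(idx, len(scores) - 1)

# Hypothetical motion scores: only frames 2 and 3 contain motion.
chosen = select_frames([0, 0, 1, 1, 0, 0], k=2).tolist()
```

Frames with zero score contribute nothing to the cumulative sum, so both samples land on the two frames that actually contain motion.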

Aerial Diffusion: Text Guided Ground-to-Aerial View Translation from a Single Image using Diffusion Models

Mar 15, 2023

Divya Kothandaraman, Tianyi Zhou, Ming Lin, Dinesh Manocha

Abstract: We present a novel method, Aerial Diffusion, for generating aerial views from a single ground-view image using text guidance. Aerial Diffusion leverages a pretrained text-image diffusion model for prior knowledge. We address two main challenges: the domain gap between the ground view and the aerial view, and the two views being far apart in the text-image embedding manifold. Our approach uses a homography inspired by inverse perspective mapping prior to finetuning the pretrained diffusion model. Additionally, using the text corresponding to the ground-view to finetune the model helps us capture the details in the ground-view image at a relatively low bias towards the ground-view image. Aerial Diffusion uses an alternating sampling strategy to compute the optimal solution on a complex high-dimensional manifold and generate a high-fidelity (w.r.t. ground view) aerial image. We demonstrate the quality and versatility of Aerial Diffusion on a plethora of images from various domains including nature, human actions, indoor scenes, etc. We qualitatively demonstrate the effectiveness of our method with extensive ablations and comparisons. To the best of our knowledge, Aerial Diffusion is the first approach that performs ground-to-aerial translation in an unsupervised manner.
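A homography is a 3x3 matrix that maps points between two views of a plane, acting on homogeneous coordinates. The sketch below applies one to a set of 2D points. The matrix values are made up for illustration and are not the mapping used in the paper.

```python
import numpy as np

def apply_homography(h, points):
    """Map an (N, 2) array of points through a 3x3 homography."""
    pts = np.asarray(points, dtype=float)
    # Lift to homogeneous coordinates by appending a 1 to each point.
    homogeneous = np.hstack([pts, np.ones((pts.shape[0], 1))])
    mapped = homogeneous @ np.asarray(h, dtype=float).T
    # Divide by the third coordinate to return to the image plane.
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical perspective warp: points further down the image (larger y)
# are scaled down, as happens when a ground plane is re-projected.
warp = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.5, 1.0]])
warped = apply_homography(warp, [[2.0, 2.0], [2.0, 0.0]])
```

The division by the third coordinate is what makes the mapping projective rather than affine: the point at y = 2 is halved, while the point at y = 0 is unchanged.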

* Code: https://github.com/divyakraman/AerialDiffusion

Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition

Sep 15, 2022

Divya Kothandaraman, Ming Lin, Dinesh Manocha

Abstract: We present a learning algorithm for human activity recognition in videos. Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras and contain a human actor along with background motion. Typically, the human actors occupy less than one-tenth of the spatial resolution. Our approach simultaneously harnesses the benefits of frequency domain representations, a classical analysis tool in signal processing, and data-driven neural networks. We build a differentiable static-dynamic frequency mask prior to model the salient static and dynamic pixels in the video, crucial for the underlying task of action recognition. We use this differentiable mask prior to enable the neural network to intrinsically learn disentangled feature representations via an identity loss function. Our formulation empowers the network to inherently compute disentangled salient features within its layers. Further, we propose a cost-function encapsulating temporal relevance and spatial content to sample the most important frame within uniformly spaced video segments. We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset and demonstrate relative improvements of 5.72% - 13.00% over the state-of-the-art and 14.28% - 38.05% over the corresponding baseline model.
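One simple way to see how a frequency-domain view separates static from dynamic pixels is to take a Fourier transform along the time axis: the zero-frequency term captures what stays constant, and the remaining terms capture change. The sketch below is an assumed, non-learned illustration of that idea, not the differentiable mask described in the abstract.

```python
import numpy as np

def temporal_energy_split(video):
    """Split per-pixel spectral energy into static and dynamic parts.

    video: array of shape (frames, height, width).
    Returns (static_energy, dynamic_energy), each of shape (height, width).
    """
    spectrum = np.fft.rfft(video, axis=0)
    power = np.abs(spectrum) ** 2
    # Bin 0 is the temporal mean (static); higher bins are temporal change.
    return power[0], power[1:].sum(axis=0)

# Hypothetical 8-frame clip with two pixels: one constant, one oscillating.
t = np.arange(8)
clip = np.zeros((8, 1, 2))
clip[:, 0, 0] = 1.0
clip[:, 0, 1] = np.cos(np.pi * t / 2)

static_energy, dynamic_energy = temporal_energy_split(clip)
```

The constant pixel has all of its energy in the static term, and the oscillating pixel has essentially all of its energy in the dynamic term, which is the distinction a static-dynamic mask would build on.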

Placing Human Animations into 3D Scenes by Learning Interaction- and Geometry-Driven Keyframes

Sep 13, 2022

James F. Mullen Jr, Divya Kothandaraman, Aniket Bera, Dinesh Manocha

Abstract: We present a novel method for placing a 3D human animation into a 3D scene while maintaining any human-scene interactions in the animation. We use the notion of computing the most important meshes in the animation for the interaction with the scene, which we call "keyframes." These keyframes allow us to better optimize the placement of the animation into the scene such that interactions in the animations (standing, lying, sitting, etc.) match the affordances of the scene (e.g., standing on the floor or lying in a bed). We compare our method, which we call PAAK, with prior approaches, including POSA, PROX ground truth, and a motion synthesis method, and highlight the benefits of our method with a perceptual study. Human raters preferred our PAAK method over the PROX ground truth data 64.6% of the time. Additionally, in direct comparisons, the raters preferred PAAK over competing methods, including 61.5% of the time against POSA.

* WACV 2023. Our project website is available at https://gamma.umd.edu/paak/
