Kuldeep Kulkarni

FloAt: Flow Warping of Self-Attention for Clothing Animation Generation

Nov 22, 2024

Swasti Shreya Mishra, Kuldeep Kulkarni, Duygu Ceylan, Balaji Vasan Srinivasan

Abstract:We propose a diffusion model-based approach, FloAtControlNet, to generate cinemagraphs composed of animations of human clothing. We focus on human clothing like dresses, skirts and pants. The input to our model is a text prompt depicting the type of clothing and its texture, such as leopard, striped, or plain, and a sequence of normal maps that capture the underlying animation that we desire in the output. The backbone of our method is a normal-map conditioned ControlNet which is operated in a training-free regime. The key observation is that the underlying animation is embedded in the flow of the normal maps. We utilize the flow thus obtained to manipulate the self-attention maps of appropriate layers. Specifically, the self-attention map of a particular layer and frame is recomputed as a linear combination of itself and the self-attention map of the same layer and the previous frame, warped by the flow computed on the normal maps of the two frames. We show that manipulating the self-attention maps greatly enhances the quality of the clothing animation, making it look more natural as well as suppressing the background artifacts. Through extensive experiments, we show that the proposed method beats all baselines qualitatively, in terms of both visual results and a user study. Specifically, our method is able to alleviate the background flickering that exists in the other diffusion model-based baselines that we consider. In addition, we show that our method beats all baselines in terms of RMSE and PSNR computed using the input normal map sequences and the normal map sequences obtained from the output RGB frames. Further, we show that well-established evaluation metrics like LPIPS, SSIM, and CLIP scores, which are generally used for visual quality, are not necessarily suitable for capturing the subtle motions in human clothing animations.
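
The recomputation described in the abstract (current map blended with the flow-warped map of the previous frame) can be sketched roughly as below. This is only an illustration under assumptions that the abstract does not state: the map is treated as a small 2D grid, the flow is an integer per-pixel displacement `(dx, dy)` from the previous frame to the current one, warping uses nearest-neighbour lookup with clamped borders, and the blend weight `alpha` is a hypothetical parameter. The function names are made up for this sketch.

```python
def warp_by_flow(prev_map, flow):
    """Backward-warp a 2D map so that output[y][x] = prev_map[y - dy][x - dx].

    Assumption: flow[y][x] is an integer (dx, dy) displacement from the
    previous frame to the current frame; out-of-range lookups are clamped.
    """
    height, width = len(prev_map), len(prev_map[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            src_y = min(max(y - dy, 0), height - 1)
            src_x = min(max(x - dx, 0), width - 1)
            warped[y][x] = prev_map[src_y][src_x]
    return warped


def blend_attention(curr_map, prev_map, flow, alpha=0.5):
    """Linear combination of the current map and the warped previous map.

    alpha is a hypothetical blend weight; alpha=1.0 keeps the current map as is.
    """
    warped = warp_by_flow(prev_map, flow)
    return [
        [alpha * c + (1.0 - alpha) * p for c, p in zip(curr_row, warped_row)]
        for curr_row, warped_row in zip(curr_map, warped)
    ]


# Toy example: content in the previous frame moves one pixel to the right.
prev_map = [[1.0, 0.0], [0.0, 0.0]]
curr_map = [[0.0, 0.0], [0.0, 0.0]]
flow = [[(1, 0), (1, 0)], [(1, 0), (1, 0)]]
blended = blend_attention(curr_map, prev_map, flow, alpha=0.5)
```

In the toy example the value that sat at the top-left of the previous frame contributes to the top-right of the blended map, which is the temporal carry-over the abstract attributes to the warping step.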

Crafting Parts for Expressive Object Composition

Jun 14, 2024

Harsh Rangwani, Aishwarya Agarwal, Kuldeep Kulkarni, R. Venkatesh Babu, Srikrishna Karanam

Abstract:Text-to-image generation from large generative models like Stable Diffusion, DALLE-2, etc., has become a common base for various tasks due to their superior quality and extensive knowledge bases. As image composition and generation are creative processes, artists need control over various parts of the images being generated. We find that just adding details about parts in the base text prompt either leads to an entirely different image (e.g., missing/incorrect identity) or results in the extra part details simply being ignored. To mitigate these issues, we introduce PartCraft, which enables image generation based on fine-grained part-level details specified for objects in the base text prompt. This allows more control for artists and enables novel object compositions by combining distinctive object parts. PartCraft first localizes object parts by denoising the object region from a specific diffusion process. This enables each part token to be localized to the right object region. After obtaining part masks, we run a localized diffusion process in each of the part regions based on fine-grained part descriptions and combine them to produce the final image. All the stages of PartCraft are based on repurposing a pre-trained diffusion model, which enables it to generalize across various domains without training. We demonstrate the effectiveness of part-level control provided by PartCraft qualitatively through visual examples and quantitatively in comparison to contemporary baselines.

* Project Page Will Be Here: https://rangwani-harsh.github.io/PartCraft

Blowing in the Wind: CycleNet for Human Cinemagraphs from Still Images

Mar 15, 2023

Hugo Bertiche, Niloy J. Mitra, Kuldeep Kulkarni, Chun-Hao Paul Huang, Tuanfeng Y. Wang, Meysam Madadi, Sergio Escalera, Duygu Ceylan

Abstract:Cinemagraphs are short looping videos created by adding subtle motions to a static image. This kind of media is popular and engaging. However, automatic generation of cinemagraphs is an underexplored area and current solutions require tedious low-level manual authoring by artists. In this paper, we present an automatic method that allows generating human cinemagraphs from single RGB images. We investigate the problem in the context of dressed humans under the wind. At the core of our method is a novel cyclic neural network that produces looping cinemagraphs for the target loop duration. To circumvent the problem of collecting real data, we demonstrate that it is possible, by working in the image normal space, to learn garment motion dynamics on synthetic data and generalize to real data. We evaluate our method on both synthetic and real data and demonstrate that it is possible to create compelling and plausible cinemagraphs from single RGB images.

Self-supervised Multi-view Disentanglement for Expansion of Visual Collections

Feb 04, 2023

Nihal Jain, Praneetha Vaddamanu, Paridhi Maheshwari, Vishwa Vinay, Kuldeep Kulkarni

Abstract:Image search engines enable the retrieval of images relevant to a query image. In this work, we consider the setting where a query for similar images is derived from a collection of images. For visual search, the similarity measurements may be made along multiple axes, or views, such as style and color. We assume access to a set of feature extractors, each of which computes representations for a specific view. Our objective is to design a retrieval algorithm that effectively combines similarities computed over representations from multiple views. To this end, we propose a self-supervised learning method for extracting disentangled view-specific representations for images such that the inter-view overlap is minimized. We show how this allows us to compute the intent of a collection as a distribution over views. We show how effective retrieval can be performed by prioritizing candidate expansion images that match the intent of a query collection. Finally, we present a new querying mechanism for image search enabled by composing multiple collections and perform retrieval under this setting using the techniques presented in this paper.

* A version of this paper has been accepted at WSDM 2023

GEMS: Scene Expansion using Generative Models of Graphs

Jul 08, 2022

Rishi Agarwal, Tirupati Saketh Chandra, Vaidehi Patil, Aniruddha Mahapatra, Kuldeep Kulkarni, Vishwa Vinay

Abstract:Applications based on image retrieval require editing and associating in intermediate spaces that are representative of high-level concepts like objects and their relationships rather than dense, pixel-level representations like RGB images or semantic-label maps. We focus on one such representation, scene graphs, and propose a novel scene expansion task where we enrich an input seed graph by adding new nodes (objects) and the corresponding relationships. To this end, we formulate scene graph expansion as a sequential prediction task involving multiple steps of first predicting a new node and then predicting the set of relationships between the newly predicted node and the previous nodes in the graph. We propose a sequencing strategy for observed graphs that retains the clustering patterns amongst nodes. In addition, we leverage external knowledge to train our graph generation model, enabling greater generalization of node predictions. Because existing maximum mean discrepancy (MMD) based metrics for graph generation problems are inefficient at evaluating predicted relationships between nodes (objects), we design novel metrics that comprehensively evaluate different aspects of predicted relations. We conduct extensive experiments on the Visual Genome and VRD datasets to evaluate the expanded scene graphs using the standard MMD-based metrics and our proposed metrics. We observe that the graphs generated by our method, GEMS, better represent the real distribution of scene graphs than baseline methods like GraphRNN.
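
The sequential formulation in the abstract (predict a new node, then predict its relationships to the nodes already in the graph, and repeat) can be sketched as a simple loop. The predictors below are hypothetical rule-based stand-ins, not the learned model from the paper, and the edge layout `(new_index, relation, previous_index)` is an assumption made for this sketch.

```python
def expand_scene_graph(nodes, edges, predict_node, predict_relations, steps):
    """Grow a seed graph by alternating node and relationship prediction.

    predict_node and predict_relations are caller-supplied callables
    (hypothetical interfaces for this sketch).
    """
    nodes = list(nodes)
    edges = list(edges)
    for _ in range(steps):
        # Step 1: predict the next object given the graph so far.
        new_node = predict_node(nodes, edges)
        new_index = len(nodes)
        # Step 2: predict relationships between the new node and previous nodes.
        new_edges = predict_relations(nodes, edges, new_node)
        nodes.append(new_node)
        for prev_index, relation in new_edges:
            edges.append((new_index, relation, prev_index))
    return nodes, edges


def toy_predict_node(nodes, edges):
    """Stand-in predictor: return the first vocabulary label not yet present."""
    for label in ["person", "horse", "grass", "sky"]:
        if label not in nodes:
            return label
    return "object"


def toy_predict_relations(nodes, edges, new_node):
    """Stand-in predictor: relate the new node to the most recent node only."""
    return [(len(nodes) - 1, "near")] if nodes else []


expanded_nodes, expanded_edges = expand_scene_graph(
    ["person"], [], toy_predict_node, toy_predict_relations, steps=2
)
```

Passing the predictors in as arguments keeps the expansion loop separate from whatever model produces the predictions, which is the only design point this sketch is meant to show.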

Controllable Animation of Fluid Elements in Still Images

Dec 06, 2021

Aniruddha Mahapatra, Kuldeep Kulkarni

Abstract:We propose a method to interactively control the animation of fluid elements in still images to generate cinemagraphs. Specifically, we focus on the animation of fluid elements like water, smoke, and fire, which have the properties of repeating textures and continuous fluid motion. Taking inspiration from prior works, we represent the motion of such fluid elements in the image in the form of a constant 2D optical flow map. To this end, we allow the user to provide any number of arrow directions and their associated speeds along with a mask of the regions the user wants to animate. The user-provided input arrow directions, their corresponding speed values, and the mask are then converted into a dense flow map representing a constant optical flow map (FD). We observe that FD, obtained using simple exponential operations, can closely approximate the plausible motion of elements in the image. We further refine the computed dense optical flow map FD using a generative adversarial network (GAN) to obtain a more realistic flow map. We devise a novel UNet-based architecture to autoregressively generate future frames using the refined optical flow map by forward-warping the input image features at different resolutions. We conduct extensive experiments on a publicly available dataset and show that our method is superior to the baselines in terms of qualitative and quantitative metrics. In addition, we show qualitative animations of objects in directions that did not exist in the training set and provide a way to synthesize videos that otherwise would not exist in the real world.
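
As a rough illustration of turning sparse arrows, speeds, and a mask into a dense constant flow map, the sketch below averages the arrow velocities at each masked pixel with weights that decay exponentially with squared distance. The abstract only says that "simple exponential operations" are used, so the Gaussian-style weighting, the normalisation, and the `sigma` parameter are assumptions made here, not the paper's formula.

```python
import math


def dense_flow_from_arrows(height, width, arrows, mask, sigma=10.0):
    """Spread sparse arrow velocities into a dense flow map inside a mask.

    arrows: list of ((x, y), (vx, vy)) pairs, i.e. arrow position and velocity.
    mask:   mask[y][x] is truthy where the user wants motion.
    sigma:  hypothetical falloff scale for the exponential weights.
    """
    flow = [[(0.0, 0.0)] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue  # pixels outside the mask stay static
            total_w = vx_sum = vy_sum = 0.0
            for (ax, ay), (vx, vy) in arrows:
                dist_sq = (x - ax) ** 2 + (y - ay) ** 2
                weight = math.exp(-dist_sq / (sigma ** 2))
                total_w += weight
                vx_sum += weight * vx
                vy_sum += weight * vy
            if total_w > 0.0:  # guard against underflow far from every arrow
                flow[y][x] = (vx_sum / total_w, vy_sum / total_w)
    return flow


# Toy example: one arrow pointing right, with the bottom-right pixel masked out.
toy_mask = [[True, True], [True, False]]
toy_flow = dense_flow_from_arrows(2, 2, [((0, 0), (1.0, 0.0))], toy_mask)
```

With a single arrow every masked pixel simply inherits that arrow's velocity; with several arrows, nearer arrows dominate the weighted average.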

SemIE: Semantically-aware Image Extrapolation

Aug 31, 2021

Bholeshwar Khurana, Soumya Ranjan Dash, Abhishek Bhatia, Aniruddha Mahapatra, Hrituraj Singh, Kuldeep Kulkarni

Abstract:We propose a novel semantically-aware paradigm to perform image extrapolation that enables the addition of new object instances. All previous methods are limited in their capability of extrapolation to merely extending the already existing objects in the image. However, our proposed approach focuses not only on (i) extending the already present objects but also on (ii) adding new objects in the extended region based on the context. To this end, for a given image, we first obtain an object segmentation map using a state-of-the-art semantic segmentation method. The segmentation map thus obtained is fed into a network to compute the extrapolated semantic segmentation and the corresponding panoptic segmentation maps. The input image and the obtained segmentation maps are further utilized to generate the final extrapolated image. We conduct experiments on the Cityscapes and ADE20K-bedroom datasets and show that our method outperforms all baselines in terms of FID and similarity in object co-occurrence statistics.

* To appear in International Conference on Computer Vision (ICCV) 2021. Project URL: https://semie-iccv.github.io

Halluci-Net: Scene Completion by Exploiting Object Co-occurrence Relationships

Apr 18, 2020

Kuldeep Kulkarni, Tejas Gokhale, Rajhans Singh, Pavan Turaga, Aswin Sankaranarayanan

Abstract:We address the new problem of complex scene completion from sparse label maps. We use a two-stage deep-network-based method, called 'Halluci-Net', that uses object co-occurrence relationships to produce a dense and complete label map. The generated dense label map is fed into a state-of-the-art image synthesis method to obtain the final image. The proposed method is evaluated on the Cityscapes dataset and it outperforms a single-stage baseline method on various performance metrics like Fréchet Inception Distance (FID), semantic segmentation accuracy, and similarity in object co-occurrences. In addition to this, we show qualitative results on a subset of the ADE20K dataset containing bedroom images.

* Image synthesis, GAN, Scene completion, Label maps

Rate-Adaptive Neural Networks for Spatial Multiplexers

Sep 08, 2018

Suhas Lohit, Rajhans Singh, Kuldeep Kulkarni, Pavan Turaga

Abstract:In resource-constrained environments, one can employ spatial multiplexing cameras to acquire a small number of measurements of a scene, and perform effective reconstruction or high-level inference using purely data-driven neural networks. However, once trained, the measurement matrix and the network are valid only for a single measurement rate (MR) chosen at training time. To overcome this drawback, we answer the following question: How can we jointly design the measurement operator and the reconstruction/inference network so that the system can operate over a range of MRs? To this end, we present a novel training algorithm for learning rate-adaptive networks. Using standard datasets, we demonstrate that, when tested over a range of MRs, a rate-adaptive network can provide high-quality reconstruction over the entire range, resulting in up to about 15 dB improvement over previous methods, where the network is valid for only one MR. We demonstrate the effectiveness of our approach for sample-efficient object tracking where video frames are acquired at dynamically varying MRs. We also extend this algorithm to learn the measurement operator in conjunction with image recognition networks. Experiments on MNIST and CIFAR-10 confirm the applicability of our algorithm to different tasks.
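
To make the notion of a measurement rate concrete, the sketch below takes m = round(MR x n) linear measurements of a length-n signal. Reusing the first m rows of one shared matrix across all rates is an assumption made here to show how a single operator could serve a range of MRs; the abstract does not describe how the paper's operator is structured or trained, and the function names are hypothetical.

```python
def num_measurements(signal_length, measurement_rate):
    """Number of measurements m for a given MR (at least one)."""
    return max(1, round(signal_length * measurement_rate))


def measure(signal, matrix, measurement_rate):
    """Compute y = Phi_m x, where Phi_m is the first m rows of the matrix.

    Assumption: lower rates reuse a prefix of the rows used at higher rates.
    """
    m = num_measurements(len(signal), measurement_rate)
    return [sum(a * b for a, b in zip(row, signal)) for row in matrix[:m]]


# Toy example with an identity matrix so the measurements are easy to read.
toy_signal = [1, 2, 3, 4]
toy_matrix = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
half_rate = measure(toy_signal, toy_matrix, 0.5)
quarter_rate = measure(toy_signal, toy_matrix, 0.25)
```

Under this prefix assumption, the measurements taken at a lower rate are always a subset of those taken at a higher rate, so changing the rate does not require a different matrix.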

CS-VQA: Visual Question Answering with Compressively Sensed Images

Jun 08, 2018

Li-Chi Huang, Kuldeep Kulkarni, Anik Jha, Suhas Lohit, Suren Jayasuriya, Pavan Turaga

Abstract:Visual Question Answering (VQA) is a complex semantic task requiring both natural language processing and visual recognition. In this paper, we explore whether VQA is solvable when images are captured in a sub-Nyquist compressive paradigm. We develop a series of deep-network architectures that exploit available compressive data to increasing degrees of accuracy, and show that VQA is indeed solvable in the compressed domain. Our results show that there is nominal degradation in VQA performance when using compressive measurements, but that accuracy can be recovered when VQA pipelines are used in conjunction with state-of-the-art deep neural networks for CS reconstruction. The results presented yield important implications for resource-constrained VQA applications.

* 5 pages, 2 figures, accepted to ICIP 2018
