Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Anant Khandelwal

FlexiClip: Locality-Preserving Free-Form Character Animation

Jan 15, 2025

Anant Khandelwal

Abstract:Animating clipart images with seamless motion while maintaining visual fidelity and temporal coherence presents significant challenges. Existing methods, such as AniClipart, effectively model spatial deformations but often fail to ensure smooth temporal transitions, resulting in artifacts like abrupt motions and geometric distortions. Similarly, text-to-video (T2V) and image-to-video (I2V) models struggle to handle clipart due to the mismatch in statistical properties between natural video and clipart styles. This paper introduces FlexiClip, a novel approach designed to overcome these limitations by addressing the intertwined challenges of temporal consistency and geometric integrity. FlexiClip extends traditional B\'ezier curve-based trajectory modeling with key innovations: temporal Jacobians to correct motion dynamics incrementally, continuous-time modeling via probability flow ODEs (pfODEs) to mitigate temporal noise, and a flow matching loss inspired by GFlowNet principles to optimize smooth motion transitions. These enhancements ensure coherent animations across complex scenarios involving rapid movements and non-rigid deformations. Extensive experiments validate the effectiveness of FlexiClip in generating animations that are not only smooth and natural but also structurally consistent across diverse clipart types, including humans and animals. By integrating spatial and temporal modeling with pre-trained video diffusion models, FlexiClip sets a new standard for high-quality clipart animation, offering robust performance across a wide range of visual content. Project Page: https://creative-gen.github.io/flexiclip.github.io/

* 13 pages, 4 figures, 7 tables

Via

Access Paper or Ask Questions

PromptSync: Bridging Domain Gaps in Vision-Language Models through Class-Aware Prototype Alignment and Discrimination

Apr 12, 2024

Anant Khandelwal

Abstract:The potential for zero-shot generalization in vision-language (V-L) models such as CLIP has spurred their widespread adoption in addressing numerous downstream tasks. Previous methods have employed test-time prompt tuning to adapt the model to unseen domains, but they overlooked the issue of imbalanced class distributions. In this study, we explicitly address this problem by employing class-aware prototype alignment weighted by mean class probabilities obtained for the test sample and filtered augmented views. Additionally, we ensure that the class probabilities are as accurate as possible by performing prototype discrimination using contrastive learning. The combination of alignment and discriminative loss serves as a geometric regularizer, preventing the prompt representation from collapsing onto a single class and effectively bridging the distribution gap between the source and test domains. Our method, named PromptSync, synchronizes the prompts for each test sample on both the text and vision branches of the V-L model. In empirical evaluations on the domain generalization benchmark, our method outperforms previous best methods by 2.33% in overall performance, by 1% in base-to-novel generalization, and by 2.84% in cross-dataset transfer tasks.

* Accepted at CVPR 2024 LIMIT, 12 pages, 8 Tables, 2 Figures

Via

Access Paper or Ask Questions

RePoseDM: Recurrent Pose Alignment and Gradient Guidance for Pose Guided Image Synthesis

Oct 24, 2023

Anant Khandelwal

Abstract:Pose-guided person image synthesis task requires re-rendering a reference image, which should have a photorealistic appearance and flawless pose transfer. Since person images are highly structured, existing approaches require dense connections for complex deformations and occlusions because these are generally handled through multi-level warping and masking in latent space. But the feature maps generated by convolutional neural networks do not have equivariance, and hence even the multi-level warping does not have a perfect pose alignment. Inspired by the ability of the diffusion model to generate photorealistic images from the given conditional guidance, we propose recurrent pose alignment to provide pose-aligned texture features as conditional guidance. Moreover, we propose gradient guidance from pose interaction fields, which output the distance from the valid pose manifold given a target pose as input. This helps in learning plausible pose transfer trajectories that result in photorealism and undistorted texture details. Extensive results on two large-scale benchmarks and a user study demonstrate the ability of our proposed approach to generate photorealistic pose transfer under challenging scenarios. Additionally, we prove the efficiency of gradient guidance in pose-guided image generation on the HumanArt dataset with fine-tuned stable diffusion.

* 10 pages, 4 tables, 7 figures

Via

Access Paper or Ask Questions

Correlation Debiasing for Unbiased Scene Graph Generation in Videos

Oct 24, 2023

Anant Khandelwal

Abstract:Dynamic scene graph generation (SGG) from videos requires not only comprehensive understanding of objects across the scenes that are prone to temporal fluctuations but also a model the temporal motions and interactions with different objects. Moreover, the long-tailed distribution of visual relationships is the crucial bottleneck of most dynamic SGG methods, since most of them focus on capturing spatio-temporal context using complex architectures, which leads to the generation of biased scene graphs. To address these challenges, we propose FloCoDe: Flow-aware temporal consistency and Correlation Debiasing with uncertainty attenuation for unbiased dynamic scene graphs. FloCoDe employs feature warping using flow to detect temporally consistent objects across the frames. In addition, it uses correlation debiasing to learn the unbiased relation representation for long-tailed classes. Moreover, to attenuate the predictive uncertainties, it uses a mixture of sigmoidal cross-entropy loss and contrastive loss to incorporate label correlations to identify the commonly co-occurring relations and help debias the long-tailed ones. Extensive experimental evaluation shows a performance gain as high as 4.1% showing the superiority of generating more unbiased scene graphs.

* 11 pages, 5 tables, 4 figures

Via

Access Paper or Ask Questions

InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot Text-based Video Editing

Aug 10, 2023

Anant Khandelwal

Abstract:Large text-to-image diffusion models have achieved remarkable success in generating diverse, high-quality images. Additionally, these models have been successfully leveraged to edit input images by just changing the text prompt. But when these models are applied to videos, the main challenge is to ensure temporal consistency and coherence across frames. In this paper, we propose InFusion, a framework for zero-shot text-based video editing leveraging large pre-trained image diffusion models. Our framework specifically supports editing of multiple concepts with pixel-level control over diverse concepts mentioned in the editing prompt. Specifically, we inject the difference in features obtained with source and edit prompts from U-Net residual blocks of decoder layers. When these are combined with injected attention features, it becomes feasible to query the source contents and scale edited concepts along with the injection of unedited parts. The editing is further controlled in a fine-grained manner with mask extraction and attention fusion, which cut the edited part from the source and paste it into the denoising pipeline for the editing prompt. Our framework is a low-cost alternative to one-shot tuned models for editing since it does not require training. We demonstrated complex concept editing with a generalised image model (Stable Diffusion v1.5) using LoRA. Adaptation is compatible with all the existing image diffusion techniques. Extensive experimental results demonstrate the effectiveness of existing methods in rendering high-quality and temporally consistent videos.

* 10 pages, 8 figures, 1 Table, accepted at ICCVW 2023 (ICCV 2023 Workshop on AI for Creative Video Editing and Understanding)

Via

Access Paper or Ask Questions

SegDA: Maximum Separable Segment Mask with Pseudo Labels for Domain Adaptive Semantic Segmentation

Aug 10, 2023

Anant Khandelwal

Abstract:Unsupervised Domain Adaptation (UDA) aims to solve the problem of label scarcity of the target domain by transferring the knowledge from the label rich source domain. Usually, the source domain consists of synthetic images for which the annotation is easily obtained using the well known computer graphics techniques. However, obtaining annotation for real world images (target domain) require lot of manual annotation effort and is very time consuming because it requires per pixel annotation. To address this problem we propose SegDA module to enhance transfer performance of UDA methods by learning the maximum separable segment representation. This resolves the problem of identifying visually similar classes like pedestrian/rider, sidewalk/road etc. We leveraged Equiangular Tight Frame (ETF) classifier inspired from Neural Collapse for maximal separation between segment classes. This causes the source domain pixel representation to collapse to a single vector forming a simplex vertices which are aligned to the maximal separable ETF classifier. We use this phenomenon to propose the novel architecture for domain adaptation of segment representation for target domain. Additionally, we proposed to estimate the noise in labelling the target domain images and update the decoder for noise correction which encourages the discovery of pixels for classes not identified in pseudo labels. We have used four UDA benchmarks simulating synthetic-to-real, daytime-to-nighttime, clear-to-adverse weather scenarios. Our proposed approach outperforms +2.2 mIoU on GTA -> Cityscapes, +2.0 mIoU on Synthia -> Cityscapes, +5.9 mIoU on Cityscapes -> DarkZurich, +2.6 mIoU on Cityscapes -> ACDC.

* 11 pages, 4 Tables, 3 Figures, accepted at ICCVW 2023 (ICCV 2023: 4th Workshop on Visual Perception for Navigation in Human Environments)

Via

Access Paper or Ask Questions

Large Scale Generative Multimodal Attribute Extraction for E-commerce Attributes

Jun 01, 2023

Anant Khandelwal, Happy Mittal, Shreyas Sunil Kulkarni, Deepak Gupta

Abstract:E-commerce websites (e.g. Amazon) have a plethora of structured and unstructured information (text and images) present on the product pages. Sellers often either don't label or mislabel values of the attributes (e.g. color, size etc.) for their products. Automatically identifying these attribute values from an eCommerce product page that contains both text and images is a challenging task, especially when the attribute value is not explicitly mentioned in the catalog. In this paper, we present a scalable solution for this problem where we pose attribute extraction problem as a question-answering task, which we solve using \textbf{MXT}, consisting of three key components: (i) \textbf{M}AG (Multimodal Adaptation Gate), (ii) \textbf{X}ception network, and (iii) \textbf{T}5 encoder-decoder. Our system consists of a generative model that \emph{generates} attribute-values for a given product by using both textual and visual characteristics (e.g. images) of the product. We show that our system is capable of handling zero-shot attribute prediction (when attribute value is not seen in training data) and value-absent prediction (when attribute value is not mentioned in the text) which are missing in traditional classification-based and NER-based models respectively. We have trained our models using distant supervision, removing dependency on human labeling, thus making them practical for real-world applications. With this framework, we are able to train a single model for 1000s of (product-type, attribute) pairs, thus reducing the overhead of training and maintaining separate models. Extensive experiments on two real world datasets show that our framework improves the absolute recall@90P by 10.16\% and 6.9\% from the existing state of the art models. In a popular e-commerce store, we have deployed our models for 1000s of (product-type, attribute) pairs.

* ACL 2023 Industry Track, 8 Pages

Via

Access Paper or Ask Questions

DomainInv: Domain Invariant Fine Tuning and Adversarial Label Correction For QA Domain Adaptation

May 04, 2023

Anant Khandelwal

Abstract:Existing Question Answering (QA) systems limited by the capability of answering questions from unseen domain or any out-of-domain distributions making them less reliable for deployment to real scenarios. Most importantly all the existing QA domain adaptation methods are either based on generating synthetic data or pseudo labeling the target domain data. The domain adaptation methods based on synthetic data and pseudo labeling suffers either from the requirement of computational resources or an extra overhead of carefully selecting the confidence threshold to separate the noisy examples from being in the training dataset. In this paper, we propose the unsupervised domain adaptation for unlabeled target domain by transferring the target representation near to source domain while still using the supervision from source domain. Towards that we proposed the idea of domain invariant fine tuning along with adversarial label correction to identify the target instances which lie far apart from the source domain, so that the feature encoder can be learnt to minimize the distance between such target instances and source instances class wisely, removing the possibility of learning the features of target domain which are still near to source support but are ambiguous. Evaluation of our QA domain adaptation method namely, DomainInv on multiple target QA dataset reveal the performance improvement over the strongest baseline.

* 13 pages, 2 Tables, 2 Figures

Via

Access Paper or Ask Questions

MASIL: Towards Maximum Separable Class Representation for Few Shot Class Incremental Learning

Apr 08, 2023

Anant Khandelwal

Abstract:Few Shot Class Incremental Learning (FSCIL) with few examples per class for each incremental session is the realistic setting of continual learning since obtaining large number of annotated samples is not feasible and cost effective. We present the framework MASIL as a step towards learning the maximal separable classifier. It addresses the common problem i.e forgetting of old classes and over-fitting to novel classes by learning the classifier weights to be maximally separable between classes forming a simplex Equiangular Tight Frame. We propose the idea of concept factorization explaining the collapsed features for base session classes in terms of concept basis and use these to induce classifier simplex for few shot classes. We further adds fine tuning to reduce any error occurred during factorization and train the classifier jointly on base and novel classes without retaining any base class samples in memory. Experimental results on miniImageNet, CIFAR-100 and CUB-200 demonstrate that MASIL outperforms all the benchmarks.

* 13 pages, 2 figures, 6 tables

Via

Access Paper or Ask Questions

WeaSuL: Weakly Supervised Dialogue Policy Learning: Reward Estimation for Multi-turn Dialogue

Aug 04, 2021

Anant Khandelwal

Figure 1 for WeaSuL: Weakly Supervised Dialogue Policy Learning: Reward Estimation for Multi-turn Dialogue

Figure 2 for WeaSuL: Weakly Supervised Dialogue Policy Learning: Reward Estimation for Multi-turn Dialogue

Figure 3 for WeaSuL: Weakly Supervised Dialogue Policy Learning: Reward Estimation for Multi-turn Dialogue

Abstract:An intelligent dialogue system in a multi-turn setting should not only generate the responses which are of good quality, but it should also generate the responses which can lead to long-term success of the dialogue. Although, the current approaches improved the response quality, but they over-look the training signals present in the dialogue data. We can leverage these signals to generate the weakly supervised training data for learning dialog policy and reward estimator, and make the policy take actions (generates responses) which can foresee the future direction for a successful (rewarding) conversation. We simulate the dialogue between an agent and a user (modelled similar to an agent with supervised learning objective) to interact with each other. The agent uses dynamic blocking to generate ranked diverse responses and exploration-exploitation to select among the Top-K responses. Each simulated state-action pair is evaluated (works as a weak annotation) with three quality modules: Semantic Relevant, Semantic Coherence and Consistent Flow. Empirical studies with two benchmarks indicate that our model can significantly out-perform the response quality and lead to a successful conversation on both automatic evaluation and human judgement.

* 12 pages, 1 figures, 2 tables, Accepted at DialDoc'21 at ACL-IJCNLP 2021

Via

Access Paper or Ask Questions