Arun Mallya

Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models

Nov 11, 2024

NVIDIA: Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge, Siddharth Gururani (+22 more)

Abstract: We introduce Edify Image, a family of diffusion models capable of generating photorealistic image content with pixel-perfect accuracy. Edify Image utilizes cascaded pixel-space diffusion models trained using a novel Laplacian diffusion process, in which image signals at different frequency bands are attenuated at varying rates. Edify Image supports a wide range of applications, including text-to-image synthesis, 4K upsampling, ControlNets, 360 HDR panorama generation, and finetuning for image customization.
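
The idea of attenuating frequency bands at different rates can be illustrated with a small sketch. This is not the Edify Image implementation: it splits a 1D signal into two bands with a simple pool-and-upsample Laplacian decomposition and scales each band by an assumed exponential schedule. The names `laplacian_bands`, `attenuate`, and `rates` are hypothetical, and the noise that a real diffusion process adds is omitted.

```python
import math

def laplacian_bands(x):
    """Split an even-length 1D signal into low and high frequency bands."""
    # Average-pool neighbouring pairs, then upsample by repeating each value.
    pooled = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    low = [v for v in pooled for _ in range(2)]
    # The residual carries the fine detail that pooling removed.
    high = [a - b for a, b in zip(x, low)]
    return low, high

def attenuate(x, t, rates=(1.0, 4.0)):
    """Scale each band by exp(-rate * t); the high band decays faster (assumed schedule)."""
    low, high = laplacian_bands(x)
    s_low, s_high = math.exp(-rates[0] * t), math.exp(-rates[1] * t)
    return [s_low * l + s_high * h for l, h in zip(low, high)]

signal = [1.0, 3.0, 2.0, 6.0]
low, high = laplacian_bands(signal)
attenuated = attenuate(signal, t=0.5)
```

With these assumed rates, the high band of `attenuated` has shrunk by a factor of `exp(-2.0)` at `t=0.5`, while the low band has only shrunk by `exp(-0.5)`.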

Movie Gen: A Cast of Media Foundation Models

Oct 17, 2024

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang (+78 more)

Abstract: We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models. All videos from this paper are available at https://go.fb.me/MovieGenResearchVideos.
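
The context-length figures in the abstract can be checked with a few lines of arithmetic. The sketch below assumes "73K" means 73,000 tokens and reports only an average per generated frame; the abstract does not say how tokens are distributed across frames, so the per-frame number is illustrative rather than a property of the model.

```python
SECONDS = 16
FPS = 16
CONTEXT_TOKENS = 73_000  # assumption: "73K" read as 73,000

# Number of frames in the longest generated video.
frames = SECONDS * FPS

# Average token budget per generated frame (illustrative only).
tokens_per_frame = CONTEXT_TOKENS / frames

print(frames, round(tokens_per_frame))
```

Under that assumption, a 16-second clip at 16 frames per second has 256 frames, which works out to roughly 285 tokens per frame on average.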

SPACE: Speech-driven Portrait Animation with Controllable Expression

Dec 07, 2022

Siddharth Gururani, Arun Mallya, Ting-Chun Wang, Rafael Valle, Ming-Yu Liu

Abstract: Animating portraits using speech has received growing attention in recent years, with various creative and practical use cases. An ideal generated video should have good lip sync with the audio, natural facial expressions and head motions, and high frame quality. In this work, we present SPACE, which uses speech and a single image to generate high-resolution, expressive videos with realistic head pose, without requiring a driving video. It uses a multi-stage approach, combining the controllability of facial landmarks with the high-quality synthesis power of a pretrained face generator. SPACE also allows for the control of emotions and their intensities. Our method outperforms prior methods in objective metrics for image quality and facial motions and is strongly preferred by users in pair-wise comparisons. The project website is available at https://deepimagination.cc/SPACE/

Implicit Warping for Animation with Image Sets

Oct 04, 2022

Arun Mallya, Ting-Chun Wang, Ming-Yu Liu

Abstract: We present a new implicit warping framework for image animation using sets of source images through the transfer of the motion of a driving video. A single cross-modal attention layer is used to find correspondences between the source images and the driving image, choose the most appropriate features from different source images, and warp the selected features. This is in contrast to the existing methods that use explicit flow-based warping, which is designed for animation using a single source and does not extend well to multiple sources. The pick-and-choose capability of our framework helps it achieve state-of-the-art results on multiple datasets for image animation using both single and multiple source images. The project website is available at https://deepimagination.cc/implicit_warping/

* To be published at NeurIPS 2022

AdaViT: Adaptive Tokens for Efficient Vision Transformer

Dec 14, 2021

Hongxu Yin, Arash Vahdat, Jose Alvarez, Arun Mallya, Jan Kautz, Pavlo Molchanov

Abstract: We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity. AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds. We reformulate Adaptive Computation Time (ACT) for this task, extending halting to discard redundant spatial tokens. The appealing architectural properties of vision transformers enable our adaptive token reduction mechanism to speed up inference without modifying the network architecture or inference hardware. We demonstrate that AdaViT requires no extra parameters or sub-network for halting, as we base the learning of adaptive halting on the original network parameters. We further introduce distributional prior regularization that stabilizes training compared to prior ACT approaches. On the image classification task (ImageNet1K), we show that our proposed AdaViT yields high efficacy in filtering informative spatial features and cutting down on the overall compute. The proposed method improves the throughput of DeiT-Tiny by 62% and DeiT-Small by 38% with only 0.3% accuracy drop, outperforming prior art by a large margin.
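
A minimal sketch of ACT-style token halting may help make the mechanism concrete. It is not the AdaViT code: the per-layer halting scores are passed in as plain numbers (in the paper they are derived from the network's own parameters), and the function name `active_tokens_per_layer` and the threshold `eps` are assumptions made for illustration.

```python
def active_tokens_per_layer(halting_scores, eps=0.01):
    """Count how many tokens each layer processes under ACT-style halting.

    halting_scores[l][k] is the halting score of token k at layer l, in [0, 1].
    A token is dropped once its accumulated score reaches 1 - eps.
    """
    num_tokens = len(halting_scores[0])
    cumulative = [0.0] * num_tokens
    active = [True] * num_tokens
    counts = []
    for layer_scores in halting_scores:
        counts.append(sum(active))  # tokens still processed by this layer
        for k, h in enumerate(layer_scores):
            if active[k]:
                cumulative[k] += h
                if cumulative[k] >= 1.0 - eps:
                    active[k] = False  # token is skipped in all later layers
    return counts

scores = [
    [0.6, 0.1, 0.3],
    [0.5, 0.2, 0.3],
    [0.1, 0.3, 0.5],
    [0.9, 0.5, 0.9],
]
print(active_tokens_per_layer(scores))
```

Tokens whose scores accumulate quickly leave the computation early, so later layers see fewer tokens and cost less.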

Multimodal Conditional Image Synthesis with Product-of-Experts GANs

Dec 09, 2021

Xun Huang, Arun Mallya, Ting-Chun Wang, Ming-Yu Liu

Abstract: Existing conditional image synthesis frameworks generate images based on user inputs in a single modality, such as text, segmentation, sketch, or style reference. They are often unable to leverage multimodal user inputs when available, which reduces their practicality. To address this limitation, we propose the Product-of-Experts Generative Adversarial Networks (PoE-GAN) framework, which can synthesize images conditioned on multiple input modalities or any subset of them, even the empty set. PoE-GAN consists of a product-of-experts generator and a multimodal multiscale projection discriminator. Through our carefully designed training scheme, PoE-GAN learns to synthesize images with high quality and diversity. Besides advancing the state of the art in multimodal conditional image synthesis, PoE-GAN also outperforms the best existing unimodal conditional image synthesis approaches when tested in the unimodal setting. The project website is available at https://deepimagination.github.io/PoE-GAN.
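
The way a product of experts handles any subset of modalities, including none, can be sketched with one-dimensional Gaussians. The abstract does not say that the experts are Gaussian, so that choice is an assumption here, as are the function name `product_of_gaussians` and the unit-variance prior. Each expert is a `(mean, variance)` pair, and the product of Gaussian densities is again Gaussian, with precisions that add.

```python
def product_of_gaussians(experts, prior=(0.0, 1.0)):
    """Combine (mean, variance) experts with a prior into one Gaussian.

    Precisions (inverse variances) add, and the combined mean is the
    precision-weighted average. With no experts, the prior is returned.
    """
    precision = 1.0 / prior[1]
    weighted_mean = prior[0] / prior[1]
    for mean, var in experts:
        precision += 1.0 / var
        weighted_mean += mean / var
    return weighted_mean / precision, 1.0 / precision

no_input = product_of_gaussians([])                      # empty set of modalities
one_input = product_of_gaussians([(2.0, 1.0)])           # e.g. text only
two_inputs = product_of_gaussians([(2.0, 1.0), (4.0, 1.0)])  # e.g. text and sketch
```

Each added expert narrows the combined distribution, and leaving every expert out falls back to the prior, which is what makes the empty set a valid condition in this formulation.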

GANcraft: Unsupervised 3D Neural Rendering of Minecraft Worlds

Apr 15, 2021

Zekun Hao, Arun Mallya, Serge Belongie, Ming-Yu Liu

Abstract: We present GANcraft, an unsupervised neural rendering framework for generating photorealistic images of large 3D block worlds such as those created in Minecraft. Our method takes a semantic block world as input, where each block is assigned a semantic label such as dirt, grass, or water. We represent the world as a continuous volumetric function and train our model to render view-consistent photorealistic images for a user-controlled camera. In the absence of paired ground truth real images for the block world, we devise a training technique based on pseudo-ground truth and adversarial training. This stands in contrast to prior work on neural rendering for view synthesis, which requires ground truth images to estimate scene geometry and view-dependent appearance. In addition to camera trajectory, GANcraft allows user control over both scene semantics and output style. Experimental results with comparison to strong baselines show the effectiveness of GANcraft on this novel task of photorealistic 3D block world synthesis. The project website is available at https://nvlabs.github.io/GANcraft/.

See through Gradients: Image Batch Recovery via GradInversion

Apr 15, 2021

Hongxu Yin, Arun Mallya, Arash Vahdat, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov

Abstract: Training deep neural networks requires gradient estimation from data batches to update parameters. Gradients per parameter are averaged over a set of data and this has been presumed to be safe for privacy-preserving training in joint, collaborative, and federated learning applications. Prior work only showed the possibility of recovering input data given gradients under very restrictive conditions - a single input point, or a network with no non-linearities, or a small 32x32 px input batch. Therefore, averaging gradients over larger batches was thought to be safe. In this work, we introduce GradInversion, using which input images from a larger batch (8 - 48 images) can also be recovered for large networks such as ResNets (50 layers), on complex datasets such as ImageNet (1000 classes, 224x224 px). We formulate an optimization task that converts random noise into natural images, matching gradients while regularizing image fidelity. We also propose an algorithm for target class label recovery given gradients. We further propose a group consistency regularization framework, where multiple agents starting from different random seeds work together to find an enhanced reconstruction of original data batch. We show that gradients encode a surprisingly large amount of information, such that all the individual images can be recovered with high fidelity via GradInversion, even for complex datasets, deep networks, and large batch sizes.
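
The restrictive single-input case mentioned in the abstract can be shown in a few lines. This sketch is not GradInversion, which recovers whole batches through optimization; it only illustrates why gradients leak data for one input passing through one linear layer with a squared-error loss. The function names and the toy weights are assumptions made for the example.

```python
def layer_gradients(W, b, x, y):
    """Gradients of 0.5 * ||W x + b - y||^2 with respect to W and b."""
    out = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    delta = [o - t for o, t in zip(out, y)]          # dLoss/d(output)
    grad_W = [[d * xj for xj in x] for d in delta]   # outer product delta x^T
    grad_b = delta
    return grad_W, grad_b

def recover_input(grad_W, grad_b):
    """Row i of grad_W equals grad_b[i] * x, so dividing recovers x."""
    # Pick the row with the largest bias gradient; it must be nonzero.
    i = max(range(len(grad_b)), key=lambda k: abs(grad_b[k]))
    return [g / grad_b[i] for g in grad_W[i]]

W = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
b = [0.0, 1.0]
private_x = [0.5, -2.0, 3.0]
target = [1.0, 0.0]

grad_W, grad_b = layer_gradients(W, b, private_x, target)
recovered = recover_input(grad_W, grad_b)
```

Here `recovered` matches `private_x` up to floating-point error, using nothing but the shared gradients. Averaging over a batch removes this closed-form shortcut, which is why larger batches were thought to be safe.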

* CVPR 2021 accepted paper

One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing

Nov 30, 2020

Ting-Chun Wang, Arun Mallya, Ming-Yu Liu

Abstract: We propose a neural talking-head video synthesis model and demonstrate its application to video conferencing. Our model learns to synthesize a talking-head video using a source image containing the target person's appearance and a driving video that dictates the motion in the output. Our motion is encoded based on a novel keypoint representation, where the identity-specific and motion-related information is decomposed without supervision. Extensive experimental validation shows that our model outperforms competing methods on benchmark datasets. Moreover, our compact keypoint representation enables a video conferencing system that achieves the same visual quality as the commercial H.264 standard while only using one-tenth of the bandwidth. Besides, we show our keypoint representation allows the user to rotate the head during synthesis, which is useful for simulating a face-to-face video conferencing experience.

* Our project page can be found at https://nvlabs.github.io/face-vid2vid

Generative Adversarial Networks for Image and Video Synthesis: Algorithms and Applications

Aug 06, 2020

Ming-Yu Liu, Xun Huang, Jiahui Yu, Ting-Chun Wang, Arun Mallya

Abstract: The generative adversarial network (GAN) framework has emerged as a powerful tool for various image and video synthesis tasks, allowing the synthesis of visual content in an unconditional or input-conditional manner. It has enabled the generation of high-resolution photorealistic images and videos, a task that was challenging or impossible with prior methods. It has also led to the creation of many new applications in content creation. In this paper, we provide an overview of GANs with a special focus on algorithms and applications for visual synthesis. We cover several important techniques to stabilize GAN training, which has a reputation for being notoriously difficult. We also discuss its applications to image translation, image processing, video synthesis, and neural rendering.
