Patsorn Sangkloy

Cost-Aware Routing for Efficient Text-To-Image Generation

Jun 17, 2025

Qinchan Li, Kenneth Chen, Changyue Su, Wittawat Jitkrittum, Qi Sun, Patsorn Sangkloy

Abstract:Diffusion models are well known for their ability to generate a high-fidelity image for an input prompt through an iterative denoising process. Unfortunately, the high fidelity also comes at a high computational cost due to the inherently sequential generative process. In this work, we seek to optimally balance quality and computational cost, and propose a framework to allow the amount of computation to vary for each prompt, depending on its complexity. Each prompt is automatically routed to the most appropriate text-to-image generation function, which may correspond to a distinct number of denoising steps of a diffusion model, or a disparate, independent text-to-image model. Unlike uniform cost reduction techniques (e.g., distillation, model quantization), our approach achieves the optimal trade-off by learning to reserve expensive choices (e.g., 100+ denoising steps) only for a few complex prompts, and employ more economical choices (e.g., small distilled model) for less sophisticated prompts. We empirically demonstrate on COCO and DiffusionDB that by learning to route to nine already-trained text-to-image models, our approach is able to deliver an average quality that is higher than that achievable by any of these models alone.
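The routing idea in this abstract can be illustrated with a small sketch. It assumes that some quality predictor (not shown) has already scored each candidate generator for a given prompt, and it picks the candidate with the best quality-minus-weighted-cost score. The scoring rule and the names `route`, `pred_quality`, `costs`, and `lam` are illustrative assumptions, not the authors' published method.

```python
from typing import Sequence


def route(pred_quality: Sequence[float], costs: Sequence[float], lam: float) -> int:
    """Pick one candidate generator for a single prompt.

    pred_quality[m]: predicted quality if candidate m handles the prompt
                     (assumed to come from a learned predictor, not shown).
    costs[m]:        compute cost of candidate m (e.g. relative runtime).
    lam:             hypothetical trade-off weight; larger values favor cheaper candidates.
    """
    if len(pred_quality) != len(costs):
        raise ValueError("pred_quality and costs must have the same length")
    if not costs:
        raise ValueError("at least one candidate is required")
    # Score each candidate by predicted quality minus weighted cost.
    scores = [q - lam * c for q, c in zip(pred_quality, costs)]
    # Return the index of the highest-scoring candidate.
    return max(range(len(scores)), key=scores.__getitem__)


# Toy numbers: three candidates, from cheap to expensive.
costs = [1.0, 4.0, 10.0]
simple_prompt = [0.80, 0.82, 0.83]   # quality barely improves with more compute
complex_prompt = [0.40, 0.70, 0.90]  # quality improves a lot with more compute

print(route(simple_prompt, costs, lam=0.03))
print(route(complex_prompt, costs, lam=0.03))
```

With these toy numbers the simple prompt goes to the cheapest candidate and the complex prompt to the most expensive one, which is the behavior the abstract describes qualitatively.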

StyleGAN Salon: Multi-View Latent Optimization for Pose-Invariant Hairstyle Transfer

Apr 05, 2023

Sasikarn Khwanmuang, Pakkapon Phongthawee, Patsorn Sangkloy, Supasorn Suwajanakorn

Abstract:Our paper seeks to transfer the hairstyle of a reference image to an input photo for virtual hair try-on. We target a variety of challenging scenarios, such as transforming a long hairstyle with bangs to a pixie cut, which requires removing the existing hair and inferring how the forehead would look, or transferring partially visible hair from a hat-wearing person in a different pose. Past solutions leverage StyleGAN for hallucinating any missing parts and producing a seamless face-hair composite through so-called GAN inversion or projection. However, there remains a challenge in controlling the hallucinations to accurately transfer hairstyle and preserve the face shape and identity of the input. To overcome this, we propose a multi-view optimization framework that uses "two different views" of reference composites to semantically guide occluded or ambiguous regions. Our optimization shares information between two poses, which allows us to produce high fidelity and realistic results from incomplete references. Our framework produces high-quality results and outperforms prior work in a user study that consists of significantly more challenging hair transfer scenarios than previously studied. Project page: https://stylegan-salon.github.io/.

* Accepted to CVPR2023

A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch

Aug 05, 2022

Patsorn Sangkloy, Wittawat Jitkrittum, Diyi Yang, James Hays

Abstract:We address the problem of retrieving images with both a sketch and a text query. We present TASK-former (Text And SKetch transformer), an end-to-end trainable model for image retrieval using a text description and a sketch as input. We argue that both input modalities complement each other in a manner that cannot be achieved easily by either one alone. TASK-former follows the late-fusion dual-encoder approach, similar to CLIP, which allows efficient and scalable retrieval since the retrieval set can be indexed independently of the queries. We empirically demonstrate that using an input sketch (even a poorly drawn one) in addition to text considerably increases retrieval recall compared to traditional text-based image retrieval. To evaluate our approach, we collect 5,000 hand-drawn sketches for images in the test set of the COCO dataset. The collected sketches are available at https://janesjanes.github.io/tsbir/.

* ECCV 2022
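The abstract's point that a dual-encoder retrieval set "can be indexed independently of the queries" can be sketched as follows. The encoders themselves are out of scope here: the image embeddings and the query embedding are assumed to be given as plain vectors, and the function names `build_index` and `retrieve` are hypothetical, not part of TASK-former's released code.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def build_index(image_embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Precompute unit-length image embeddings once, before any query arrives."""
    return [normalize(v) for v in image_embeddings]


def retrieve(index: Sequence[Sequence[float]], query_embedding: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k indexed images most similar to the query."""
    q = normalize(query_embedding)
    sims = [sum(a * b for a, b in zip(v, q)) for v in index]
    ranked = sorted(range(len(index)), key=lambda i: sims[i], reverse=True)
    return ranked[:k]


# Toy 2-D embeddings standing in for image-encoder outputs.
index = build_index([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Toy query vector standing in for the joint text-and-sketch query embedding.
print(retrieve(index, [2.0, 0.1], k=2))
```

Because the index is built without looking at any query, each new query costs only one encoder pass plus the similarity computations against the stored vectors.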

Argoverse: 3D Tracking and Forecasting with Rich Maps

Nov 06, 2019

Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan (+1 more)

Abstract:We present Argoverse -- two datasets designed to support autonomous vehicle machine learning tasks such as 3D tracking and motion forecasting. Argoverse was collected by a fleet of autonomous vehicles in Pittsburgh and Miami. The Argoverse 3D Tracking dataset includes 360-degree images from 7 cameras with overlapping fields of view, 3D point clouds from long-range LiDAR, 6-DOF pose, and 3D track annotations. Notably, it is the only modern AV dataset that provides forward-facing stereo imagery. The Argoverse Motion Forecasting dataset includes more than 300,000 5-second tracked scenarios with a particular vehicle identified for trajectory forecasting. Argoverse is the first autonomous vehicle dataset to include "HD maps" with 290 km of mapped lanes with geometric and semantic metadata. All data is released under a Creative Commons license at www.argoverse.org. In our baseline experiments, we illustrate how detailed map information such as lane direction, driveable area, and ground height improves the accuracy of 3D object tracking and motion forecasting. Our tracking and forecasting experiments represent only an initial exploration of the use of rich maps in robotic perception. We hope that Argoverse will enable the research community to explore these problems in greater depth.

* CVPR 2019

Kernel Mean Matching for Content Addressability of GANs

May 14, 2019

Wittawat Jitkrittum, Patsorn Sangkloy, Muhammad Waleed Gondal, Amit Raj, James Hays, Bernhard Schölkopf

Abstract:We propose a novel procedure which adds "content-addressability" to any given unconditional implicit model, e.g., a generative adversarial network (GAN). The procedure allows users to control the generative process by specifying a set (of arbitrary size) of desired examples based on which similar samples are generated from the model. The proposed approach, based on kernel mean matching, is applicable to any generative models which transform latent vectors to samples, and does not require retraining of the model. Experiments on various high-dimensional image generation problems (CelebA-HQ, LSUN bedroom, bridge, tower) show that our approach is able to generate images which are consistent with the input set, while retaining the image quality of the original model. To our knowledge, this is the first work that attempts to construct, at test time, a content-addressable generative model from a trained marginal model.

* Wittawat Jitkrittum and Patsorn Sangkloy contributed equally to this work
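As a rough illustration of what "kernel mean matching" measures, the sketch below computes a squared distance between the kernel mean embeddings of two sample sets (the biased estimate of the squared maximum mean discrepancy with a Gaussian kernel). In a setup like the one the abstract describes, a quantity of this kind would presumably be minimized over latent vectors so that generated samples match the user's input set; the optimization loop, the generator, and any feature extractor are omitted, and the kernel choice and function names are assumptions made for this example.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def gaussian_kernel(x: Vector, y: Vector, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two vectors of equal length."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector], bandwidth: float) -> float:
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(generated: Sequence[Vector], targets: Sequence[Vector],
                bandwidth: float = 1.0) -> float:
    """Biased estimate of the squared distance between the two kernel mean embeddings."""
    if not generated or not targets:
        raise ValueError("both sample sets must be non-empty")
    return (mean_kernel(generated, generated, bandwidth)
            + mean_kernel(targets, targets, bandwidth)
            - 2.0 * mean_kernel(generated, targets, bandwidth))


# Toy 2-D points standing in for (features of) images.
targets = [[0.0, 0.0], [1.0, 0.0]]
near = [[0.1, 0.0], [0.9, 0.0]]
far = [[5.0, 5.0], [6.0, 5.0]]
print(mmd_squared(near, targets) < mmd_squared(far, targets))
```

A set of generated samples that resembles the target set yields a smaller value than a dissimilar one, which is what makes the quantity usable as a matching objective.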

Informative Features for Model Comparison

Oct 27, 2018

Wittawat Jitkrittum, Heishiro Kanagawa, Patsorn Sangkloy, James Hays, Bernhard Schölkopf, Arthur Gretton

Abstract:Given two candidate models, and a set of target observations, we address the problem of measuring the relative goodness of fit of the two models. We propose two new statistical tests which are nonparametric, computationally efficient (runtime complexity is linear in the sample size), and interpretable. As a unique advantage, our tests can produce a set of examples (informative features) indicating the regions in the data domain where one model fits significantly better than the other. In a real-world problem of comparing GAN models, the test power of our new test matches that of the state-of-the-art test of relative goodness of fit, while being one order of magnitude faster.

* Accepted to NIPS 2018

TextureGAN: Controlling Deep Image Synthesis with Texture Patches

Apr 14, 2018

Wenqi Xian, Patsorn Sangkloy, Varun Agrawal, Amit Raj, Jingwan Lu, Chen Fang, Fisher Yu, James Hays

Abstract:In this paper, we investigate deep image synthesis guided by sketch, color, and texture. Previous image synthesis methods can be controlled by sketch and color strokes, but we are the first to examine texture control. We allow a user to place a texture patch on a sketch at arbitrary locations and scales to control the desired output texture. Our generative network learns to synthesize objects consistent with these texture suggestions. To achieve this, we develop a local texture loss in addition to adversarial and content loss to train the generative network. We conduct experiments using sketches generated from real images and textures sampled from a separate texture database, and the results show that our proposed algorithm is able to generate plausible images that are faithful to user controls. Ablation studies show that our proposed pipeline can generate more realistic images than adapting existing methods directly.

* CVPR 2018 spotlight

Let's Dance: Learning From Online Dance Videos

Jan 23, 2018

Daniel Castro, Steven Hickson, Patsorn Sangkloy, Bhavishya Mittal, Sean Dai, James Hays, Irfan Essa

Abstract:In recent years, deep neural network approaches have naturally extended to the video domain, in their simplest case by aggregating per-frame classifications as a baseline for action recognition. A majority of the work in this area extends from the imaging domain, leading to visual-feature heavy approaches on temporal data. To address this issue we introduce "Let's Dance", a 1000-video dataset (and growing) comprised of 10 visually overlapping dance categories that require motion for their classification. We stress the importance of human motion as a key distinguisher in our work given that, as we show in this work, visual information is not sufficient to classify motion-heavy categories. We compare our dataset's performance using imaging techniques with UCF-101 and demonstrate this inherent difficulty. We present a comparison of numerous state-of-the-art techniques on our dataset using three different representations (video, optical flow and multi-person pose data) in order to analyze these approaches. We discuss the motion parameterization of each of them and their value in learning to categorize online dance videos. Lastly, we release this dataset (and its three representations) for the research community to use.

* first submitted November 2016
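The per-frame aggregation baseline mentioned in the abstract can be sketched in a few lines: class probabilities are predicted for each frame independently, averaged over the video, and the highest-scoring class is taken as the video label. Averaging is only one possible aggregation rule and is an assumption here, as is the function name `aggregate_frames`; the per-frame classifier is not shown.

```python
from typing import List, Sequence, Tuple


def aggregate_frames(frame_probs: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Average per-frame class probabilities and return (video label, mean probabilities)."""
    if not frame_probs:
        raise ValueError("at least one frame is required")
    n_classes = len(frame_probs[0])
    if any(len(p) != n_classes for p in frame_probs):
        raise ValueError("all frames must score the same number of classes")
    n_frames = len(frame_probs)
    # Mean probability of each class across all frames of the video.
    mean_probs = [sum(p[c] for p in frame_probs) / n_frames for c in range(n_classes)]
    # The video label is the class with the highest mean probability.
    label = max(range(n_classes), key=mean_probs.__getitem__)
    return label, mean_probs


# Toy per-frame outputs for a 3-frame video and 2 classes.
label, mean_probs = aggregate_frames([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
print(label, mean_probs)
```

Note that this baseline ignores frame order entirely, which is why the abstract argues it falls short on categories that can only be told apart by motion.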

Scribbler: Controlling Deep Image Synthesis with Sketch and Color

Dec 05, 2016

Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, James Hays

Abstract:Recently, there have been several promising methods to generate realistic imagery from deep convolutional networks. These methods sidestep the traditional computer graphics rendering pipeline and instead generate imagery at the pixel level by learning from large collections of photos (e.g. faces or bedrooms). However, these methods are of limited utility because it is difficult for a user to control what the network produces. In this paper, we propose a deep adversarial image synthesis architecture that is conditioned on sketched boundaries and sparse color strokes to generate realistic cars, bedrooms, or faces. We demonstrate a sketch-based image synthesis system which allows users to 'scribble' over the sketch to indicate preferred color for objects. Our network can then generate convincing images that satisfy both the color and the sketch constraints of the user. The network is feed-forward which allows users to see the effect of their edits in real time. We compare to recent work on sketch to image synthesis and show that our approach can generate more realistic, more diverse, and more controllable outputs. The architecture is also effective at user-guided colorization of grayscale images.

* 13 pages, 14 figures
