Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kaiyue Pang

VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis

Dec 17, 2024

Zhipeng Chen, Lan Yang, Yonggang Qi, Honggang Zhang, Kaiyue Pang, Ke Li, Yi-Zhe Song

Figure 1 for VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis

Figure 2 for VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis

Figure 3 for VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis

Figure 4 for VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis

Abstract:Despite the rapid advancements in text-to-image (T2I) synthesis, enabling precise visual control remains a significant challenge. Existing works attempted to incorporate multi-facet controls (text and sketch), aiming to enhance the creative control over generated images. However, our pilot study reveals that the expressive power of humans far surpasses the capabilities of current methods. Users desire a more versatile approach that can accommodate their diverse creative intents, ranging from controlling individual subjects to manipulating the entire scene composition. We present VersaGen, a generative AI agent that enables versatile visual control in T2I synthesis. VersaGen admits four types of visual controls: i) single visual subject; ii) multiple visual subjects; iii) scene background; iv) any combination of the three above or merely no control at all. We train an adaptor upon a frozen T2I model to accommodate the visual information into the text-dominated diffusion process. We introduce three optimization strategies during the inference phase of VersaGen to improve generation results and enhance user experience. Comprehensive experiments on COCO and Sketchy validate the effectiveness and flexibility of VersaGen, as evidenced by both qualitative and quantitative results.

* The paper has been accepted by AAAI 2025. Paper code: https://github.com/FelixChan9527/VersaGen_official

Via

Access Paper or Ask Questions

Wired Perspectives: Multi-View Wire Art Embraces Generative AI

Nov 26, 2023

Zhiyu Qu, Lan Yang, Honggang Zhang, Tao Xiang, Kaiyue Pang, Yi-Zhe Song

Abstract:Creating multi-view wire art (MVWA), a static 3D sculpture with diverse interpretations from different viewpoints, is a complex task even for skilled artists. In response, we present DreamWire, an AI system enabling everyone to craft MVWA easily. Users express their vision through text prompts or scribbles, freeing them from intricate 3D wire organisation. Our approach synergises 3D B\'ezier curves, Prim's algorithm, and knowledge distillation from diffusion models or their variants (e.g., ControlNet). This blend enables the system to represent 3D wire art, ensuring spatial continuity and overcoming data scarcity. Extensive evaluation and analysis are conducted to shed insight on the inner workings of the proposed system, including the trade-off between connectivity and visual aesthetics.

* Project page: https://dreamwireart.github.io

Via

Access Paper or Ask Questions

SketchXAI: A First Look at Explainability for Human Sketches

Apr 23, 2023

Zhiyu Qu, Yulia Gryaditskaya, Ke Li, Kaiyue Pang, Tao Xiang, Yi-Zhe Song

Abstract:This paper, for the very first time, introduces human sketches to the landscape of XAI (Explainable Artificial Intelligence). We argue that sketch as a ``human-centred'' data form, represents a natural interface to study explainability. We focus on cultivating sketch-specific explainability designs. This starts by identifying strokes as a unique building block that offers a degree of flexibility in object construction and manipulation impossible in photos. Following this, we design a simple explainability-friendly sketch encoder that accommodates the intrinsic properties of strokes: shape, location, and order. We then move on to define the first ever XAI task for sketch, that of stroke location inversion SLI. Just as we have heat maps for photos, and correlation matrices for text, SLI offers an explainability angle to sketch in terms of asking a network how well it can recover stroke locations of an unseen sketch. We offer qualitative results for readers to interpret as snapshots of the SLI process in the paper, and as GIFs on the project page. A minor but interesting note is that thanks to its sketch-specific design, our sketch encoder also yields the best sketch recognition accuracy to date while having the smallest number of parameters. The code is available at \url{https://sketchxai.github.io}.

* CVPR 2023

Via

Access Paper or Ask Questions

Making a Bird AI Expert Work for You and Me

Dec 06, 2021

Dongliang Chang, Kaiyue Pang, Ruoyi Du, Zhanyu Ma, Yi-Zhe Song, Jun Guo

Figure 1 for Making a Bird AI Expert Work for You and Me

Figure 2 for Making a Bird AI Expert Work for You and Me

Figure 3 for Making a Bird AI Expert Work for You and Me

Figure 4 for Making a Bird AI Expert Work for You and Me

Abstract:As powerful as fine-grained visual classification (FGVC) is, responding your query with a bird name of "Whip-poor-will" or "Mallard" probably does not make much sense. This however commonly accepted in the literature, underlines a fundamental question interfacing AI and human -- what constitutes transferable knowledge for human to learn from AI? This paper sets out to answer this very question using FGVC as a test bed. Specifically, we envisage a scenario where a trained FGVC model (the AI expert) functions as a knowledge provider in enabling average people (you and me) to become better domain experts ourselves, i.e. those capable in distinguishing between "Whip-poor-will" and "Mallard". Fig. 1 lays out our approach in answering this question. Assuming an AI expert trained using expert human labels, we ask (i) what is the best transferable knowledge we can extract from AI, and (ii) what is the most practical means to measure the gains in expertise given that knowledge? On the former, we propose to represent knowledge as highly discriminative visual regions that are expert-exclusive. For that, we devise a multi-stage learning framework, which starts with modelling visual attention of domain experts and novices before discriminatively distilling their differences to acquire the expert exclusive knowledge. For the latter, we simulate the evaluation process as book guide to best accommodate the learning practice of what is accustomed to humans. A comprehensive human study of 15,000 trials shows our method is able to consistently improve people of divergent bird expertise to recognise once unrecognisable birds. Interestingly, our approach also leads to improved conventional FGVC performance when the extracted knowledge defined is utilised as means to achieve discriminative localisation. Codes are available at: https://github.com/PRIS-CV/Making-a-Bird-AI-Expert-Work-for-You-and-Me

Via

Access Paper or Ask Questions

Your "Labrador" is My "Dog": Fine-Grained, or Not

Nov 18, 2020

Dongliang Chang, Kaiyue Pang, Yixiao Zheng, Zhanyu Ma, Yi-Zhe Song, Jun Guo

Figure 1 for Your "Labrador" is My "Dog": Fine-Grained, or Not

Figure 2 for Your "Labrador" is My "Dog": Fine-Grained, or Not

Figure 3 for Your "Labrador" is My "Dog": Fine-Grained, or Not

Figure 4 for Your "Labrador" is My "Dog": Fine-Grained, or Not

Abstract:Whether what you see in Figure 1 is a "labrador" or a "dog", is the question we ask in this paper. While fine-grained visual classification (FGVC) strives to arrive at the former, for the majority of us non-experts just "dog" would probably suffice. The real question is therefore -- how can we tailor for different fine-grained definitions under divergent levels of expertise. For that, we re-envisage the traditional setting of FGVC, from single-label classification, to that of top-down traversal of a pre-defined coarse-to-fine label hierarchy -- so that our answer becomes "dog"-->"gun dog"-->"retriever"-->"labrador". To approach this new problem, we first conduct a comprehensive human study where we confirm that most participants prefer multi-granularity labels, regardless whether they consider themselves experts. We then discover the key intuition that: coarse-level label prediction exacerbates fine-grained feature learning, yet fine-level feature betters the learning of coarse-level classifier. This discovery enables us to design a very simple albeit surprisingly effective solution to our new problem, where we (i) leverage level-specific classification heads to disentangle coarse-level features with fine-grained ones, and (ii) allow finer-grained features to participate in coarser-grained label predictions, which in turn helps with better disentanglement. Experiments show that our method achieves superior performance in the new FGVC setting, and performs better than state-of-the-art on traditional single-label FGVC problem as well. Thanks to its simplicity, our method can be easily implemented on top of any existing FGVC frameworks and is parameter-free.

Via

Access Paper or Ask Questions

Deep Factorised Inverse-Sketching

Aug 07, 2018

Kaiyue Pang, Da Li, Jifei Song, Yi-Zhe Song, Tao Xiang, Timothy M. Hospedales

Figure 1 for Deep Factorised Inverse-Sketching

Figure 2 for Deep Factorised Inverse-Sketching

Figure 3 for Deep Factorised Inverse-Sketching

Figure 4 for Deep Factorised Inverse-Sketching

Abstract:Modelling human free-hand sketches has become topical recently, driven by practical applications such as fine-grained sketch based image retrieval (FG-SBIR). Sketches are clearly related to photo edge-maps, but a human free-hand sketch of a photo is not simply a clean rendering of that photo's edge map. Instead there is a fundamental process of abstraction and iconic rendering, where overall geometry is warped and salient details are selectively included. In this paper we study this sketching process and attempt to invert it. We model this inversion by translating iconic free-hand sketches to contours that resemble more geometrically realistic projections of object boundaries, and separately factorise out the salient added details. This factorised re-representation makes it easier to match a free-hand sketch to a photo instance of an object. Specifically, we propose a novel unsupervised image style transfer model based on enforcing a cyclic embedding consistency constraint. A deep FG-SBIR model is then formulated to accommodate complementary discriminative detail from each factorised sketch for better matching with the corresponding photo. Our method is evaluated both qualitatively and quantitatively to demonstrate its superiority over a number of state-of-the-art alternatives for style transfer and FG-SBIR.

* Accepted to ECCV 2018

Via

Access Paper or Ask Questions

Universal Perceptual Grouping

Aug 07, 2018

Ke Li, Kaiyue Pang, Jifei Song, Yi-Zhe Song, Tao Xiang, Timothy M. Hospedales, Honggang Zhang

Figure 1 for Universal Perceptual Grouping

Figure 2 for Universal Perceptual Grouping

Figure 3 for Universal Perceptual Grouping

Figure 4 for Universal Perceptual Grouping

Abstract:In this work we aim to develop a universal sketch grouper. That is, a grouper that can be applied to sketches of any category in any domain to group constituent strokes/segments into semantically meaningful object parts. The first obstacle to this goal is the lack of large-scale datasets with grouping annotation. To overcome this, we contribute the largest sketch perceptual grouping (SPG) dataset to date, consisting of 20,000 unique sketches evenly distributed over 25 object categories. Furthermore, we propose a novel deep universal perceptual grouping model. The model is learned with both generative and discriminative losses. The generative losses improve the generalisation ability of the model to unseen object categories and datasets. The discriminative losses include a local grouping loss and a novel global grouping loss to enforce global grouping consistency. We show that the proposed model significantly outperforms the state-of-the-art groupers. Further, we show that our grouper is useful for a number of sketch analysis tasks including sketch synthesis and fine-grained sketch-based image retrieval (FG-SBIR).

* Accepted ECCV 2018

Via

Access Paper or Ask Questions

Learning to Sketch with Shortcut Cycle Consistency

May 01, 2018

Jifei Song, Kaiyue Pang, Yi-Zhe Song, Tao Xiang, Timothy Hospedales

Figure 1 for Learning to Sketch with Shortcut Cycle Consistency

Figure 2 for Learning to Sketch with Shortcut Cycle Consistency

Figure 3 for Learning to Sketch with Shortcut Cycle Consistency

Figure 4 for Learning to Sketch with Shortcut Cycle Consistency

Abstract:To see is to sketch -- free-hand sketching naturally builds ties between human and machine vision. In this paper, we present a novel approach for translating an object photo to a sketch, mimicking the human sketching process. This is an extremely challenging task because the photo and sketch domains differ significantly. Furthermore, human sketches exhibit various levels of sophistication and abstraction even when depicting the same object instance in a reference photo. This means that even if photo-sketch pairs are available, they only provide weak supervision signal to learn a translation model. Compared with existing supervised approaches that solve the problem of D(E(photo)) -> sketch, where E($\cdot$) and D($\cdot$) denote encoder and decoder respectively, we take advantage of the inverse problem (e.g., D(E(sketch)) -> photo), and combine with the unsupervised learning tasks of within-domain reconstruction, all within a multi-task learning framework. Compared with existing unsupervised approaches based on cycle consistency (i.e., D(E(D(E(photo)))) -> photo), we introduce a shortcut consistency enforced at the encoder bottleneck (e.g., D(E(photo)) -> photo) to exploit the additional self-supervision. Both qualitative and quantitative results show that the proposed model is superior to a number of state-of-the-art alternatives. We also show that the synthetic sketches can be used to train a better fine-grained sketch-based image retrieval (FG-SBIR) model, effectively alleviating the problem of sketch data scarcity.

* To appear in CVPR2018

Via

Access Paper or Ask Questions

SketchMate: Deep Hashing for Million-Scale Human Sketch Retrieval

Apr 04, 2018

Peng Xu, Yongye Huang, Tongtong Yuan, Kaiyue Pang, Yi-Zhe Song, Tao Xiang, Timothy M. Hospedales, Zhanyu Ma, Jun Guo

Figure 1 for SketchMate: Deep Hashing for Million-Scale Human Sketch Retrieval

Figure 2 for SketchMate: Deep Hashing for Million-Scale Human Sketch Retrieval

Figure 3 for SketchMate: Deep Hashing for Million-Scale Human Sketch Retrieval

Figure 4 for SketchMate: Deep Hashing for Million-Scale Human Sketch Retrieval

Abstract:We propose a deep hashing framework for sketch retrieval that, for the first time, works on a multi-million scale human sketch dataset. Leveraging on this large dataset, we explore a few sketch-specific traits that were otherwise under-studied in prior literature. Instead of following the conventional sketch recognition task, we introduce the novel problem of sketch hashing retrieval which is not only more challenging, but also offers a better testbed for large-scale sketch analysis, since: (i) more fine-grained sketch feature learning is required to accommodate the large variations in style and abstraction, and (ii) a compact binary code needs to be learned at the same time to enable efficient retrieval. Key to our network design is the embedding of unique characteristics of human sketch, where (i) a two-branch CNN-RNN architecture is adapted to explore the temporal ordering of strokes, and (ii) a novel hashing loss is specifically designed to accommodate both the temporal and abstract traits of sketches. By working with a 3.8M sketch dataset, we show that state-of-the-art hashing models specifically engineered for static images fail to perform well on temporal sketch data. Our network on the other hand not only offers the best retrieval performance on various code sizes, but also yields the best generalization performance under a zero-shot setting and when re-purposed for sketch recognition. Such superior performances effectively demonstrate the benefit of our sketch-specific design.

* Accepted by CVPR2018

Via

Access Paper or Ask Questions