Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shishi Xiao

ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control

Mar 15, 2026

Shishi Xiao, Tongyu Zhou, David Laidlaw, Gromit Yeuk-Yin Chan

Abstract:A pictorial chart is an effective medium for visual storytelling, seamlessly integrating visual elements with data charts. However, creating such images is challenging because the flexibility of visual elements often conflicts with the rigidity of chart structures. This process thus requires a creative deformation that maintains both data faithfulness and visual aesthetics. Current methods that extract dense structural cues from natural images (e.g., edge or depth maps) are ill-suited as conditioning signals for pictorial chart generation. We present ChArtist, a domain-specific diffusion model for generating pictorial charts automatically, offering two distinct types of control: 1) spatial control that aligns well with the chart structure, and 2) subject-driven control that respects the visual characteristics of a reference image. To achieve this, we introduce a skeleton-based spatial control representation. This representation encodes only the data-encoding information of the chart, allowing for the easy incorporation of reference visuals without a rigid outline constraint. We implement our method based on the Diffusion Transformer (DiT) and leverage an adaptive position encoding mechanism to manage these two controls. We further introduce Spatially Gated Attention to modulate the interaction between spatial control and subject control. To support the fine-tuning of pre-trained models for this task, we created a large-scale dataset of 30,000 triplets (skeleton, reference image, pictorial chart). We also propose a unified data accuracy metric to evaluate the data faithfulness of the generated charts. We believe this work demonstrates that current generative models can achieve data-driven visual storytelling by moving beyond general-purpose conditions to task-specific representations. Project page: https://chartist-ai.github.io/.

* Project page: https://chartist-ai.github.io/

Via

Access Paper or Ask Questions

BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation

Mar 26, 2025

Yuyang Peng, Shishi Xiao, Keming Wu, Qisheng Liao, Bohan Chen, Kevin Lin, Danqing Huang, Ji Li, Yuhui Yuan

Abstract:Recently, state-of-the-art text-to-image generation models, such as Flux and Ideogram 2.0, have made significant progress in sentence-level visual text rendering. In this paper, we focus on the more challenging scenarios of article-level visual text rendering and address a novel task of generating high-quality business content, including infographics and slides, based on user provided article-level descriptive prompts and ultra-dense layouts. The fundamental challenges are twofold: significantly longer context lengths and the scarcity of high-quality business content data. In contrast to most previous works that focus on a limited number of sub-regions and sentence-level prompts, ensuring precise adherence to ultra-dense layouts with tens or even hundreds of sub-regions in business content is far more challenging. We make two key technical contributions: (i) the construction of scalable, high-quality business content dataset, i.e., Infographics-650K, equipped with ultra-dense layouts and prompts by implementing a layer-wise retrieval-augmented infographic generation scheme; and (ii) a layout-guided cross attention scheme, which injects tens of region-wise prompts into a set of cropped region latent space according to the ultra-dense layouts, and refine each sub-regions flexibly during inference using a layout conditional CFG. We demonstrate the strong results of our system compared to previous SOTA systems such as Flux and SD3 on our BizEval prompt set. Additionally, we conduct thorough ablation experiments to verify the effectiveness of each component. We hope our constructed Infographics-650K and BizEval can encourage the broader community to advance the progress of business content generation.

* Accepted by CVPR 2025. Project Page: https://bizgen-msra.github.io

Via

Access Paper or Ask Questions

CinePreGen: Camera Controllable Video Previsualization via Engine-powered Diffusion

Aug 30, 2024

Yiran Chen, Anyi Rao, Xuekun Jiang, Shishi Xiao, Ruiqing Ma, Zeyu Wang, Hui Xiong, Bo Dai

Abstract:With advancements in video generative AI models (e.g., SORA), creators are increasingly using these techniques to enhance video previsualization. However, they face challenges with incomplete and mismatched AI workflows. Existing methods mainly rely on text descriptions and struggle with camera placement, a key component of previsualization. To address these issues, we introduce CinePreGen, a visual previsualization system enhanced with engine-powered diffusion. It features a novel camera and storyboard interface that offers dynamic control, from global to local camera adjustments. This is combined with a user-friendly AI rendering workflow, which aims to achieve consistent results through multi-masked IP-Adapter and engine simulation guidelines. In our comprehensive evaluation study, we demonstrate that our system reduces development viscosity (i.e., the complexity and challenges in the development process), meets users' needs for extensive control and iteration in the design process, and outperforms other AI video production workflows in cinematic camera movement, as shown by our experiments and a within-subjects user study. With its intuitive camera controls and realistic rendering of camera motion, CinePreGen shows great potential for improving video production for both individual creators and industry professionals.

Via

Access Paper or Ask Questions

ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map

Jul 17, 2024

Yilin Ye, Shishi Xiao, Xingchen Zeng, Wei Zeng

Abstract:Multi-modal embeddings form the foundation for vision-language models, such as CLIP embeddings, the most widely used text-image embeddings. However, these embeddings are vulnerable to subtle misalignment of cross-modal features, resulting in decreased model performance and diminished generalization. To address this problem, we design ModalChorus, an interactive system for visual probing and alignment of multi-modal embeddings. ModalChorus primarily offers a two-stage process: 1) embedding probing with Modal Fusion Map (MFM), a novel parametric dimensionality reduction method that integrates both metric and nonmetric objectives to enhance modality fusion; and 2) embedding alignment that allows users to interactively articulate intentions for both point-set and set-set alignments. Quantitative and qualitative comparisons for CLIP embeddings with existing dimensionality reduction (e.g., t-SNE and MDS) and data fusion (e.g., data context map) methods demonstrate the advantages of MFM in showcasing cross-modal features over common vision-language datasets. Case studies reveal that ModalChorus can facilitate intuitive discovery of misalignment and efficient re-alignment in scenarios ranging from zero-shot classification to cross-modal retrieval and generation.

* Accepted by VIS 2024

Via

Access Paper or Ask Questions

Generative AI for Visualization: State of the Art and Future Directions

Apr 28, 2024

Yilin Ye, Jianing Hao, Yihan Hou, Zhan Wang, Shishi Xiao, Yuyu Luo, Wei Zeng

Figure 1 for Generative AI for Visualization: State of the Art and Future Directions

Figure 2 for Generative AI for Visualization: State of the Art and Future Directions

Figure 3 for Generative AI for Visualization: State of the Art and Future Directions

Figure 4 for Generative AI for Visualization: State of the Art and Future Directions

Abstract:Generative AI (GenAI) has witnessed remarkable progress in recent years and demonstrated impressive performance in various generation tasks in different domains such as computer vision and computational design. Many researchers have attempted to integrate GenAI into visualization framework, leveraging the superior generative capacity for different operations. Concurrently, recent major breakthroughs in GenAI like diffusion model and large language model have also drastically increase the potential of GenAI4VIS. From a technical perspective, this paper looks back on previous visualization studies leveraging GenAI and discusses the challenges and opportunities for future research. Specifically, we cover the applications of different types of GenAI methods including sequence, tabular, spatial and graph generation techniques for different tasks of visualization which we summarize into four major stages: data enhancement, visual mapping generation, stylization and interaction. For each specific visualization sub-task, we illustrate the typical data and concrete GenAI algorithms, aiming to provide in-depth understanding of the state-of-the-art GenAI4VIS techniques and their limitations. Furthermore, based on the survey, we discuss three major aspects of challenges and research opportunities including evaluation, dataset, and the gap between end-to-end GenAI and generative algorithms. By summarizing different generation algorithms, their current applications and limitations, this paper endeavors to provide useful insights for future GenAI4VIS research.

Via

Access Paper or Ask Questions

TypeDance: Creating Semantic Typographic Logos from Image through Personalized Generation

Jan 20, 2024

Shishi Xiao, Liangwei Wang, Xiaojuan Ma, Wei Zeng

Figure 1 for TypeDance: Creating Semantic Typographic Logos from Image through Personalized Generation

Figure 2 for TypeDance: Creating Semantic Typographic Logos from Image through Personalized Generation

Figure 3 for TypeDance: Creating Semantic Typographic Logos from Image through Personalized Generation

Figure 4 for TypeDance: Creating Semantic Typographic Logos from Image through Personalized Generation

Abstract:Semantic typographic logos harmoniously blend typeface and imagery to represent semantic concepts while maintaining legibility. Conventional methods using spatial composition and shape substitution are hindered by the conflicting requirement for achieving seamless spatial fusion between geometrically dissimilar typefaces and semantics. While recent advances made AI generation of semantic typography possible, the end-to-end approaches exclude designer involvement and disregard personalized design. This paper presents TypeDance, an AI-assisted tool incorporating design rationales with the generative model for personalized semantic typographic logo design. It leverages combinable design priors extracted from uploaded image exemplars and supports type-imagery mapping at various structural granularity, achieving diverse aesthetic designs with flexible control. Additionally, we instantiate a comprehensive design workflow in TypeDance, including ideation, selection, generation, evaluation, and iteration. A two-task user evaluation, including imitation and creation, confirmed the usability of TypeDance in design across different usage scenarios

* 24 pages, 9 figures

Via

Access Paper or Ask Questions

The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model

Dec 05, 2023

Yilin Ye, Qian Zhu, Shishi Xiao, Kang Zhang, Wei Zeng

Figure 1 for The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model

Figure 2 for The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model

Figure 3 for The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model

Figure 4 for The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model

Abstract:Image search is an essential and user-friendly method to explore vast galleries of digital images. However, existing image search methods heavily rely on proximity measurements like tag matching or image similarity, requiring precise user inputs for satisfactory results. To meet the growing demand for a contemporary image search engine that enables accurate comprehension of users' search intentions, we introduce an innovative user intent expansion framework. Our framework leverages visual-language models to parse and compose multi-modal user inputs to provide more accurate and satisfying results. It comprises two-stage processes: 1) a parsing stage that incorporates a language parsing module with large language models to enhance the comprehension of textual inputs, along with a visual parsing module that integrates an interactive segmentation module to swiftly identify detailed visual elements within images; and 2) a logic composition stage that combines multiple user search intents into a unified logic expression for more sophisticated operations in complex searching scenarios. Moreover, the intent expansion framework enables users to perform flexible contextualized interactions with the search results to further specify or adjust their detailed search intents iteratively. We implemented the framework into an image search system for NFT (non-fungible token) search and conducted a user study to evaluate its usability and novel properties. The results indicate that the proposed framework significantly improves users' image search experience. Particularly the parsing and contextualized interactions prove useful in allowing users to express their search intents more accurately and engage in a more enjoyable iterative search experience.

* Accepted by The 2024 ACM SIGCHI Conference on Computer-Supported Cooperative Work & Social Computing (CSCW) (Proc. CSCW 2024)

Via

Access Paper or Ask Questions

Let the Chart Spark: Embedding Semantic Context into Chart with Text-to-Image Generative Model

Apr 28, 2023

Shishi Xiao, Suizi Huang, Yue Lin, Yilin Ye, Wei Zeng

Figure 1 for Let the Chart Spark: Embedding Semantic Context into Chart with Text-to-Image Generative Model

Figure 2 for Let the Chart Spark: Embedding Semantic Context into Chart with Text-to-Image Generative Model

Figure 3 for Let the Chart Spark: Embedding Semantic Context into Chart with Text-to-Image Generative Model

Figure 4 for Let the Chart Spark: Embedding Semantic Context into Chart with Text-to-Image Generative Model

Abstract:Pictorial visualization seamlessly integrates data and semantic context into visual representation, conveying complex information in a manner that is both engaging and informative. Extensive studies have been devoted to developing authoring tools to simplify the creation of pictorial visualizations. However, mainstream works mostly follow a retrieving-and-editing pipeline that heavily relies on retrieved visual elements from a dedicated corpus, which often compromise the data integrity. Text-guided generation methods are emerging, but may have limited applicability due to its predefined recognized entities. In this work, we propose ChartSpark, a novel system that embeds semantic context into chart based on text-to-image generative model. ChartSpark generates pictorial visualizations conditioned on both semantic context conveyed in textual inputs and data information embedded in plain charts. The method is generic for both foreground and background pictorial generation, satisfying the design practices identified from an empirical research into existing pictorial visualizations. We further develop an interactive visual interface that integrates a text analyzer, editing module, and evaluation module to enable users to generate, modify, and assess pictorial visualizations. We experimentally demonstrate the usability of our tool, and conclude with a discussion of the potential of using text-to-image generative model combined with interactive interface for visualization design.

Via

Access Paper or Ask Questions

WYTIWYR: A User Intent-Aware Framework with Multi-modal Inputs for Visualization Retrieval

Apr 14, 2023

Shishi Xiao, Yihan Hou, Cheng Jin, Wei Zeng

Abstract:Retrieving charts from a large corpus is a fundamental task that can benefit numerous applications such as visualization recommendations.The retrieved results are expected to conform to both explicit visual attributes (e.g., chart type, colormap) and implicit user intents (e.g., design style, context information) that vary upon application scenarios. However, existing example-based chart retrieval methods are built upon non-decoupled and low-level visual features that are hard to interpret, while definition-based ones are constrained to pre-defined attributes that are hard to extend. In this work, we propose a new framework, namely WYTIWYR (What-You-Think-Is-What-You-Retrieve), that integrates user intents into the chart retrieval process. The framework consists of two stages: first, the Annotation stage disentangles the visual attributes within the bitmap query chart; and second, the Retrieval stage embeds the user's intent with customized text prompt as well as query chart, to recall targeted retrieval result. We develop a prototype WYTIWYR system leveraging a contrastive language-image pre-training (CLIP) model to achieve zero-shot classification, and test the prototype on a large corpus with charts crawled from the Internet. Quantitative experiments, case studies, and qualitative interviews are conducted. The results demonstrate the usability and effectiveness of our proposed framework.

Via

Access Paper or Ask Questions

Label Name is Mantra: Unifying Point Cloud Segmentation across Heterogeneous Datasets

Mar 19, 2023

Yixun Liang, Hao He, Shishi Xiao, Hao Lu, Yingcong Chen

Figure 1 for Label Name is Mantra: Unifying Point Cloud Segmentation across Heterogeneous Datasets

Figure 2 for Label Name is Mantra: Unifying Point Cloud Segmentation across Heterogeneous Datasets

Figure 3 for Label Name is Mantra: Unifying Point Cloud Segmentation across Heterogeneous Datasets

Figure 4 for Label Name is Mantra: Unifying Point Cloud Segmentation across Heterogeneous Datasets

Abstract:Point cloud segmentation is a fundamental task in 3D vision that serves a wide range of applications. Although great progresses have been made these years, its practical usability is still limited by the availability of training data. Existing approaches cannot make full use of multiple datasets on hand due to the label mismatch among different datasets. In this paper, we propose a principled approach that supports learning from heterogeneous datasets with different label sets. Our idea is to utilize a pre-trained language model to embed discrete labels to a continuous latent space with the help of their label names. This unifies all labels of different datasets, so that joint training is doable. Meanwhile, classifying points in the continuous 3D space by their vocabulary tokens significantly increase the generalization ability of the model in comparison with existing approaches that have fixed decoder architecture. Besides, we also integrate prompt learning in our framework to alleviate data shifts among different data sources. Extensive experiments demonstrate that our model outperforms the state-of-the-art by a large margin.

* 11 pages, 5 figures

Via

Access Paper or Ask Questions