Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xinyan Liu

CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers

Apr 21, 2026

Weidong Chen, Dexiang Hong, Zhendong Mao, Yutao Cheng, Xinyan Liu, Lei Zhang, Yongdong Zhang

Abstract:Graphic design images consist of multiple editable layers, such as text, background, and decorative elements, while most generative models produce rasterized outputs without explicit layer structures, limiting downstream editing. Existing graphic design parsing methods typically rely on multi-stage pipelines combining layout prediction, matting, and inpainting, which suffer from error accumulation and limited controllability. We propose a hybrid generative framework for raster-to-layer graphic design parsing that decomposes a design image into editable text, background, and sticker layers. Text regions are parsed using a vision-language model into a text rendering protocol, enabling faithful reconstruction and flexible re-editing, while background and sticker layers are generated using a multi-branch diffusion architecture with RGBA support. We further introduce ParserReward and integrate it with Group Relative Policy Optimization to align generation quality with human design preferences. Extensive experiments on two challenging datasets, \emph{i.e.,} the Parser-40K and Crello datasets, demonstrate superior performance over existing methods, \emph{eg.,} achieving an overall average improvement of 23.7\% across all metrics.

Via

Access Paper or Ask Questions

GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

Apr 05, 2026

Xinyu Geng, Yanjing Xiao, Yuyang Zhang, Hanwen Wang, Xinyan Liu, Rui Min, Tianqing Fang, Yi R. Fung

Abstract:Deep research agents integrate fragmented evidence through multi-step tool use. BrowseComp offers a text-only testbed for such agents, but existing multimodal benchmarks rarely require both weak visual cues composition and BrowseComp-style multi-hop verification. Geolocation is a natural testbed because answers depend on combining multiple ambiguous visual cues and validating them with open-web evidence. Thus, we introduce GeoBrowse, a geolocation benchmark that combines visual reasoning with knowledge-intensive multi-hop queries. Level 1 tests extracting and composing fragmented visual cues, and Level 2 increases query difficulty by injecting long-tail knowledge and obfuscating key entities. To support evaluation, we provide an agentic workflow GATE with five think-with-image tools and four knowledge-intensive tools, and release expert-annotated stepwise traces grounded in verifiable evidence for trajectory-level analysis. Experiments show that GATE outperforms direct inference and open-source agents, indicating that no-tool, search-only or image-only setups are insufficient. Gains come from coherent, level-specific tool-use plans rather than more tool calls, as they more reliably reach annotated key evidence steps and make fewer errors when integrating into the final decision. The GeoBrowse bernchmark and codes are provided in https://github.com/ornamentt/GeoBrowse

Via

Access Paper or Ask Questions

FACE-net: Factual Calibration and Emotion Augmentation for Retrieval-enhanced Emotional Video Captioning

Mar 18, 2026

Weidong Chen, Cheng Ye, Zhendong Mao, Peipei Song, Xinyan Liu, Lei Zhang, Xiaojun Chang, Yongdong Zhang

Abstract:Emotional Video Captioning (EVC) is an emerging task, which aims to describe factual content with the intrinsic emotions expressed in videos. Existing works perceive global emotional cues and then combine with video content to generate descriptions. However, insufficient factual and emotional cues mining and coordination during generation make their methods difficult to deal with the factual-emotional bias, which refers to the factual and emotional requirements being different in different samples on generation. To this end, we propose a retrieval-enhanced framework with FActual Calibration and Emotion augmentation (FACE-net), which through a unified architecture collaboratively mines factual-emotional semantics and provides adaptive and accurate guidance for generation, breaking through the compromising tendency of factual-emotional descriptions in all sample learning. Technically, we firstly introduces an external repository and retrieves the most relevant sentences with the video content to augment the semantic information. Subsequently, our factual calibration via uncertainty estimation module splits the retrieved information into subject-predicate-object triplets, and self-refines and cross-refines different components through video content to effectively mine the factual semantics; while our progressive visual emotion augmentation module leverages the calibrated factual semantics as experts, interacts with the video content and emotion dictionary to generate visual queries and candidate emotions, and then aggregates them to adaptively augment emotions to each factual semantics. Moreover, to alleviate the factual-emotional bias, we design a dynamic bias adjustment routing module to predict and adjust the degree of bias of a sample.

* Submitted to TPAMI. 16 pages, 9 figures

Via

Access Paper or Ask Questions

HumanStudy-Bench: Towards AI Agent Design for Participant Simulation

Jan 31, 2026

Xuan Liu, Haoyang Shang, Zizhang Liu, Xinyan Liu, Yunze Xiao, Yiwen Tu, Haojian Jin

Abstract:Large language models (LLMs) are increasingly used as simulated participants in social science experiments, but their behavior is often unstable and highly sensitive to design choices. Prior evaluations frequently conflate base-model capabilities with experimental instantiation, obscuring whether outcomes reflect the model itself or the agent setup. We instead frame participant simulation as an agent-design problem over full experimental protocols, where an agent is defined by a base model and a specification (e.g., participant attributes) that encodes behavioral assumptions. We introduce HUMANSTUDY-BENCH, a benchmark and execution engine that orchestrates LLM-based agents to reconstruct published human-subject experiments via a Filter--Extract--Execute--Evaluate pipeline, replaying trial sequences and running the original analysis pipeline in a shared runtime that preserves the original statistical procedures end to end. To evaluate fidelity at the level of scientific inference, we propose new metrics to quantify how much human and agent behaviors agree. We instantiate 12 foundational studies as an initial suite in this dynamic benchmark, spanning individual cognition, strategic interaction, and social psychology, and covering more than 6,000 trials with human samples ranging from tens to over 2,100 participants.

Via

Access Paper or Ask Questions

Generative assimilation and prediction for weather and climate

Mar 04, 2025

Shangshang Yang, Congyi Nai, Xinyan Liu, Weidong Li, Jie Chao, Jingnan Wang, Leyi Wang, Xichen Li, Xi Chen, Bo Lu(+4 more)

Abstract:Machine learning models have shown great success in predicting weather up to two weeks ahead, outperforming process-based benchmarks. However, existing approaches mostly focus on the prediction task, and do not incorporate the necessary data assimilation. Moreover, these models suffer from error accumulation in long roll-outs, limiting their applicability to seasonal predictions or climate projections. Here, we introduce Generative Assimilation and Prediction (GAP), a unified deep generative framework for assimilation and prediction of both weather and climate. By learning to quantify the probabilistic distribution of atmospheric states under observational, predictive, and external forcing constraints, GAP excels in a broad range of weather-climate related tasks, including data assimilation, seamless prediction, and climate simulation. In particular, GAP is competitive with state-of-the-art ensemble assimilation, probabilistic weather forecast and seasonal prediction, yields stable millennial simulations, and reproduces climate variability from daily to decadal time scales.

Via

Access Paper or Ask Questions

TaQ-DiT: Time-aware Quantization for Diffusion Transformers

Nov 21, 2024

Xinyan Liu, Huihong Shi, Yang Xu, Zhongfeng Wang

Abstract:Transformer-based diffusion models, dubbed Diffusion Transformers (DiTs), have achieved state-of-the-art performance in image and video generation tasks. However, their large model size and slow inference speed limit their practical applications, calling for model compression methods such as quantization. Unfortunately, existing DiT quantization methods overlook (1) the impact of reconstruction and (2) the varying quantization sensitivities across different layers, which hinder their achievable performance. To tackle these issues, we propose innovative time-aware quantization for DiTs (TaQ-DiT). Specifically, (1) we observe a non-convergence issue when reconstructing weights and activations separately during quantization and introduce a joint reconstruction method to resolve this problem. (2) We discover that Post-GELU activations are particularly sensitive to quantization due to their significant variability across different denoising steps as well as extreme asymmetries and variations within each step. To address this, we propose time-variance-aware transformations to facilitate more effective quantization. Experimental results show that when quantizing DiTs' weights to 4-bit and activations to 8-bit (W4A8), our method significantly surpasses previous quantization methods.

Via

Access Paper or Ask Questions

Weakly Supervised Video Individual CountingWeakly Supervised Video Individual Counting

Dec 10, 2023

Xinyan Liu, Guorong Li, Yuankai Qi, Ziheng Yan, Zhenjun Han, Anton van den Hengel, Ming-Hsuan Yang, Qingming Huang

Figure 1 for Weakly Supervised Video Individual CountingWeakly Supervised Video Individual Counting

Figure 2 for Weakly Supervised Video Individual CountingWeakly Supervised Video Individual Counting

Figure 3 for Weakly Supervised Video Individual CountingWeakly Supervised Video Individual Counting

Figure 4 for Weakly Supervised Video Individual CountingWeakly Supervised Video Individual Counting

Abstract:Video Individual Counting (VIC) aims to predict the number of unique individuals in a single video. % Existing methods learn representations based on trajectory labels for individuals, which are annotation-expensive. % To provide a more realistic reflection of the underlying practical challenge, we introduce a weakly supervised VIC task, wherein trajectory labels are not provided. Instead, two types of labels are provided to indicate traffic entering the field of view (inflow) and leaving the field view (outflow). % We also propose the first solution as a baseline that formulates the task as a weakly supervised contrastive learning problem under group-level matching. In doing so, we devise an end-to-end trainable soft contrastive loss to drive the network to distinguish inflow, outflow, and the remaining. % To facilitate future study in this direction, we generate annotations from the existing VIC datasets SenseCrowd and CroHD and also build a new dataset, UAVVIC. % Extensive results show that our baseline weakly supervised method outperforms supervised methods, and thus, little information is lost in the transition to the more practically relevant weakly supervised task. The code and trained model will be public at \href{https://github.com/streamer-AP/CGNet}{CGNet}

Via

Access Paper or Ask Questions

Consistency-Aware Anchor Pyramid Network for Crowd Localization

Dec 08, 2022

Xinyan Liu, Guorong Li, Yuankai Qi, Zhenjun Han, Qingming Huang, Ming-Hsuan Yang, Nicu Sebe

Figure 1 for Consistency-Aware Anchor Pyramid Network for Crowd Localization

Figure 2 for Consistency-Aware Anchor Pyramid Network for Crowd Localization

Figure 3 for Consistency-Aware Anchor Pyramid Network for Crowd Localization

Figure 4 for Consistency-Aware Anchor Pyramid Network for Crowd Localization

Abstract:Crowd localization aims to predict the spatial position of humans in a crowd scenario. We observe that the performance of existing methods is challenged from two aspects: (i) ranking inconsistency between test and training phases; and (ii) fixed anchor resolution may underfit or overfit crowd densities of local regions. To address these problems, we design a supervision target reassignment strategy for training to reduce ranking inconsistency and propose an anchor pyramid scheme to adaptively determine the anchor density in each image region. Extensive experimental results on three widely adopted datasets (ShanghaiTech A\&B, JHU-CROWD++, UCF-QNRF) demonstrate the favorable performance against several state-of-the-art methods.

Via

Access Paper or Ask Questions

Progressive Multi-resolution Loss for Crowd Counting

Dec 08, 2022

Ziheng Yan, Yuankai Qi, Guorong Li, Xinyan Liu, Weigang Zhang, Qingming Huang, Ming-Hsuan Yang

Figure 1 for Progressive Multi-resolution Loss for Crowd Counting

Figure 2 for Progressive Multi-resolution Loss for Crowd Counting

Figure 3 for Progressive Multi-resolution Loss for Crowd Counting

Figure 4 for Progressive Multi-resolution Loss for Crowd Counting

Abstract:Crowd counting is usually handled in a density map regression fashion, which is supervised via a L2 loss between the predicted density map and ground truth. To effectively regulate models, various improved L2 loss functions have been proposed to find a better correspondence between predicted density and annotation positions. In this paper, we propose to predict the density map at one resolution but measure the density map at multiple resolutions. By maximizing the posterior probability in such a setting, we obtain a log-formed multi-resolution L2-difference loss, where the traditional single-resolution L2 loss is its particular case. We mathematically prove it is superior to a single-resolution L2 loss. Without bells and whistles, the proposed loss substantially improves several baselines and performs favorably compared to state-of-the-art methods on four crowd counting datasets, ShanghaiTech A & B, UCF-QNRF, and JHU-Crowd++.

Via

Access Paper or Ask Questions