Zihang Liu

Semantic-Guided Diffusion Model for Single-Step Image Super-Resolution

May 11, 2025

Zihang Liu, Zhenyu Zhang, Hao Tang

Abstract: Diffusion-based image super-resolution (SR) methods have demonstrated remarkable performance. Recent advancements have introduced deterministic sampling processes that reduce inference from 15 iterative steps to a single step, thereby significantly improving the inference speed of existing diffusion models. However, their efficiency remains limited when handling complex semantic regions due to the single-step inference. To address this limitation, we propose SAMSR, a semantic-guided diffusion framework that incorporates semantic segmentation masks into the sampling process. Specifically, we introduce the SAM-Noise Module, which refines Gaussian noise using segmentation masks to preserve spatial and semantic features. Furthermore, we develop a pixel-wise sampling strategy that dynamically adjusts the residual transfer rate and noise strength based on pixel-level semantic weights, prioritizing semantically rich regions during the diffusion process. To enhance model training, we also propose a semantic consistency loss, which aligns pixel-wise semantic weights between predictions and ground truth. Extensive experiments on both real-world and synthetic datasets demonstrate that SAMSR significantly improves perceptual quality and detail recovery, particularly in semantically complex images. Our code is released at https://github.com/Liu-Zihang/SAMSR.

Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models

Mar 04, 2025

Zhifei Xie, Mingbao Lin, Zihang Liu, Pengcheng Wu, Shuicheng Yan, Chunyan Miao

Abstract: Recent advancements in multimodal reasoning have largely overlooked the audio modality. We introduce Audio-Reasoner, a large-scale audio language model for deep reasoning in audio tasks. We meticulously curated a large-scale and diverse multi-task audio dataset with simple annotations. Then, we leverage closed-source models to conduct secondary labeling and QA generation, along with a structured CoT process. These datasets together form a high-quality reasoning dataset with 1.2 million reasoning-rich samples, which we name CoTA. Following inference scaling principles, we train Audio-Reasoner on CoTA, enabling it to achieve strong logical capabilities in audio reasoning. Experiments show state-of-the-art performance across key benchmarks, including MMAU-mini (+25.42%), AIR-Bench chat/foundation (+14.57%/+10.13%), and MELD (+8.01%). Our findings highlight the central role of structured CoT training in advancing audio reasoning.

* Technical report, in process

Model Balancing Helps Low-data Training and Fine-tuning

Oct 16, 2024

Zihang Liu, Yuanzhe Hu, Tianyu Pang, Yefan Zhou, Pu Ren, Yaoqing Yang

Abstract: Recent advances in foundation models have emphasized the need to align pre-trained models with specialized domains using small, curated datasets. Studies on these foundation models underscore the importance of low-data training and fine-tuning. This topic, well-known in natural language processing (NLP), has also gained increasing attention in the emerging field of scientific machine learning (SciML). To address the limitations of low-data training and fine-tuning, we draw inspiration from Heavy-Tailed Self-Regularization (HT-SR) theory, analyzing the shape of empirical spectral densities (ESDs) and revealing an imbalance in training quality across different model layers. To mitigate this issue, we adapt a recently proposed layer-wise learning rate scheduler, TempBalance, which effectively balances training quality across layers and enhances low-data training and fine-tuning for both NLP and SciML tasks. Notably, TempBalance demonstrates increasing performance gains as the amount of available tuning data decreases. Comparative analyses further highlight the effectiveness of TempBalance and its adaptability as an "add-on" method for improving model performance.

* EMNLP 2024 Oral. First two authors contributed equally
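
The idea of a layer-wise learning rate scheduler mentioned in the abstract above can be illustrated with a minimal sketch. This is not the TempBalance implementation: the function name `layerwise_learning_rates`, the per-layer `scores`, the `low`/`high` scaling bounds, and the linear mapping are all assumptions made for illustration. The sketch assumes each layer already has a scalar quality score in which a larger value means the layer is less well trained and should therefore receive a larger learning rate.

```python
def layerwise_learning_rates(scores, base_lr, low=0.5, high=1.5):
    """Map per-layer scores linearly onto [low * base_lr, high * base_lr].

    scores: dict of layer name -> score (hypothetical; larger means
            "less well trained" under the assumption stated above).
    """
    smallest, largest = min(scores.values()), max(scores.values())
    span = largest - smallest
    rates = {}
    for name, score in scores.items():
        # If every layer has the same score, place all of them mid-range.
        fraction = 0.5 if span == 0 else (score - smallest) / span
        rates[name] = base_lr * (low + fraction * (high - low))
    return rates


# Hypothetical scores for three layers.
example = layerwise_learning_rates(
    {"layer0": 2.0, "layer1": 4.0, "layer2": 6.0}, base_lr=1e-3
)
for layer, lr in example.items():
    print(layer, lr)
```

In a training loop, rates like these would typically be assigned to per-layer optimizer parameter groups and recomputed periodically as the scores change.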

A Simple Framework for Multi-mode Spatial-Temporal Data Modeling

Aug 22, 2023

Zihang Liu, Le Yu, Tongyu Zhu, Leilei Sun

Abstract: Spatial-temporal data modeling aims to mine the underlying spatial relationships and temporal dependencies of objects in a system. However, most existing methods focus on modeling spatial-temporal data in a single mode and lack an understanding of multiple modes. Although a few methods have recently been proposed to learn multi-mode relationships, they are built on complicated components with high model complexity. In this paper, we propose a simple framework for multi-mode spatial-temporal data modeling that brings effectiveness and efficiency together. Specifically, we design a general cross-mode spatial relationships learning component to adaptively establish connections between multiple modes and propagate information along the learned connections. Moreover, we employ multi-layer perceptrons to capture the temporal dependencies and channel correlations, which are conceptually and technically succinct. Experiments on three real-world datasets show that our model can consistently outperform the baselines with lower space and time complexity, opening up a promising direction for modeling spatial-temporal data. The generalizability of the cross-mode spatial relationships learning module is also validated.

EnsembleMOT: A Step towards Ensemble Learning of Multiple Object Tracking

Oct 11, 2022

Yunhao Du, Zihang Liu, Fei Su

Abstract: Multiple Object Tracking (MOT) has progressed rapidly in recent years. Existing works tend to design a single tracking algorithm to perform both detection and association. Though ensemble learning has been exploited in many tasks, e.g., classification and object detection, it has not been studied in the MOT task, mainly because of the task's complexity and evaluation metrics. In this paper, we propose a simple but effective ensemble method for MOT, called EnsembleMOT, which merges multiple tracking results from various trackers with spatio-temporal constraints. Meanwhile, several post-processing procedures are applied to filter out abnormal results. Our method is model-independent and requires no learning procedure. Moreover, it can easily work in conjunction with other algorithms, e.g., tracklet interpolation. Experiments on the MOT17 dataset demonstrate the effectiveness of the proposed method. Code is available at https://github.com/dyhBUPT/EnsembleMOT.

* 5 pages, 1 figure
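
As a rough illustration of merging results from two trackers, the sketch below combines the boxes reported for a single frame using an intersection-over-union (IoU) test. It is not the EnsembleMOT algorithm: it applies only a spatial constraint on one frame, and the function names, the greedy matching, the box averaging, and the 0.5 threshold are assumptions made for illustration. The paper's spatio-temporal constraints and post-processing steps are not reproduced here.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_frame(boxes_a, boxes_b, iou_threshold=0.5):
    """Greedily pair boxes from two trackers; average pairs, keep the rest."""
    merged, used_b = [], set()
    for a in boxes_a:
        best_j, best_iou = None, iou_threshold
        for j, b in enumerate(boxes_b):
            if j in used_b:
                continue
            iou = box_iou(a, b)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is None:
            merged.append(tuple(a))  # no overlapping partner: keep as-is
        else:
            used_b.add(best_j)
            partner = boxes_b[best_j]
            merged.append(tuple((x + y) / 2 for x, y in zip(a, partner)))
    # Boxes seen only by the second tracker are kept unchanged.
    merged.extend(tuple(b) for j, b in enumerate(boxes_b) if j not in used_b)
    return merged


tracker_a = [(0, 0, 10, 10)]
tracker_b = [(1, 1, 11, 11), (50, 50, 60, 60)]
print(merge_frame(tracker_a, tracker_b))
```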

Modelling Evolutionary and Stationary User Preferences for Temporal Sets Prediction

Apr 14, 2022

Le Yu, Zihang Liu, Tongyu Zhu, Leilei Sun, Bowen Du, Weifeng Lv

Abstract: Given a sequence of sets, where each set is associated with a timestamp and contains an arbitrary number of elements, the task of temporal sets prediction aims to predict the elements in the subsequent set. Previous studies for temporal sets prediction mainly capture each user's evolutionary preference by learning from his/her own sequence. Although insightful, we argue that: 1) the collaborative signals latent in different users' sequences are essential but have not been exploited; 2) users also tend to show stationary preferences, which existing methods fail to consider. To this end, we propose an integrated learning framework to model both the evolutionary and the stationary preferences of users for temporal sets prediction, which first constructs a universal sequence by chronologically arranging all the user-set interactions, and then learns on each user-set interaction. In particular, for each user-set interaction, we first design an evolutionary user preference modelling component to track the user's time-evolving preference and exploit the latent collaborative signals among different users. This component maintains a memory bank to store memories of the related user and elements, and continuously updates their memories based on the currently encoded messages and the past memories. Then, we devise a stationary user preference modelling module to discover each user's personalized characteristics according to the historical sequence, which adaptively aggregates the previously interacted elements from dual perspectives with the guidance of the user's and elements' embeddings. Finally, we develop a set-batch algorithm to improve the model efficiency, which can create time-consistent batches in advance and achieve 3.5x training speedups on average. Experiments on real-world datasets demonstrate the effectiveness and good interpretability of our approach.

* 25 pages, 9 figures

Semi-supervised Medical Image Segmentation via Geometry-aware Consistency Training

Feb 12, 2022

Zihang Liu, Chunhui Zhao

Abstract: The performance of supervised deep learning methods for medical image segmentation is often limited by the scarcity of labeled data. As a promising research direction, semi-supervised learning addresses this dilemma by leveraging unlabeled data to assist the learning process. In this paper, a novel geometry-aware, consistency-based semi-supervised learning framework is proposed for medical image segmentation. Considering that the hard-to-segment regions are mainly located around the object boundary, we introduce an auxiliary prediction task to learn the global geometric information. Based on the geometric constraint, the ambiguous boundary regions are emphasized through an exponentially weighted strategy for the model training to better exploit both labeled and unlabeled data. In addition, a dual-view network is designed to perform segmentation from different perspectives and reduce the prediction uncertainty. The proposed method is evaluated on the public left atrium benchmark dataset and improves on the fully supervised method by 8.7% in Dice with 10% labeled images and by 4.3% with 20% labeled images. Meanwhile, our framework outperforms six state-of-the-art semi-supervised segmentation methods.

* 9 pages, 4 figures
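
For readers unfamiliar with the Dice score reported in the abstract above, here is a minimal sketch of the standard Dice coefficient for two binary masks. The function name and the flat 0/1 list representation are choices made for illustration, and treating two empty masks as a perfect match is an assumed convention; the paper's exact evaluation code is not shown on this page.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
# One overlapping pixel, 2 + 1 foreground pixels in total: 2 * 1 / 3.
print(dice_coefficient(predicted, reference))
```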
