Soochow University
Abstract:Although domain generalization (DG) has significantly addressed the performance degradation of pre-trained models caused by domain shifts, it often falls short in real-world deployment. Test-time adaptation (TTA), which adjusts a learned model using unlabeled test data, presents a promising solution. However, most existing TTA methods struggle to deliver strong performance in medical image segmentation, primarily because they overlook the crucial prior knowledge inherent to medical images. To address this challenge, we incorporate morphological information and propose a framework based on multi-graph matching. Specifically, we introduce learnable universe embeddings that integrate morphological priors during multi-source training, along with novel unsupervised test-time paradigms for domain adaptation. This approach guarantees cycle-consistency in multi-graph matching while enabling the model to more effectively capture the invariant priors of unseen data, significantly mitigating the effects of domain shifts. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches on two medical image segmentation benchmarks for both multi-source and single-source domain generalization tasks. The source code is available at https://github.com/Yore0/TTDG-MGM.
Abstract:Artistic style transfer aims to transfer a learned style onto an arbitrary content image. However, most existing style transfer methods can only render consistent stylized images with little variation, making it difficult for users to obtain a diverse set of results. To solve this issue, we propose a novel artistic style transfer framework called DyArtbank, which can generate diverse and highly realistic artistic stylized images. Specifically, we introduce a Dynamic Style Prompt ArtBank (DSPA), a set of learnable parameters. It learns and stores the style information from a collection of artworks, dynamically guiding pre-trained Stable Diffusion to generate diverse and highly realistic artistic stylized images. DSPA can also generate random artistic image samples with the learned style information, offering a new approach to data augmentation. In addition, a Key Content Feature Prompt (KCFP) module is proposed to provide sufficient content prompts for pre-trained Stable Diffusion to preserve the detailed structure of the input content image. Extensive qualitative and quantitative experiments verify the effectiveness of our proposed method. Code is available at https://github.com/Jamie-Cheung/DyArtbank
Abstract:Guided image filtering (GIF) is a popular smoothing technique in which an additional image is used as structure guidance for noise removal with edge preservation. The original GIF and some of its subsequent improvements are derived from a two-parameter local affine model (LAM), where the filtering output is a local affine transformation of the guidance image, but the input image is not taken into account in the LAM formulation. In this paper, we first introduce a single-parameter Prior Model based on Gaussian (highpass/lowpass) Filtering (PM-GF), in which the filtering output is the sum of a weighted Gaussian highpass-filtered guidance image and a Gaussian-smoothed input image. In the PM-GF, the guidance structure determined by Gaussian highpass filtering is explicitly transferred to the filtering output, thereby better revealing the structure transfer mechanism of guided filtering. We then propose several Gaussian highpass GIFs (GH-GIFs) based on the PM-GF by emulating the original GIF and some of its improvements, i.e., by using the PM-GF instead of the LAM in these GIFs. Experimental results illustrate that the proposed GIFs outperform their counterparts in several image processing applications.
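As a rough illustration of the PM-GF idea stated in the abstract above (output = weighted Gaussian highpass of the guidance + Gaussian lowpass of the input), the following is a minimal 1-D sketch in pure Python. The function names, the fixed scalar weight `alpha`, the kernel width `sigma`, and the replicate-border handling are all assumptions made for illustration; the abstract does not say how the weight is chosen, and this is not the authors' implementation.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about 3 sigma (an assumed choice)."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def gaussian_smooth(signal, sigma):
    """Gaussian lowpass filter with replicated borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index at the borders
            acc += w * signal[j]
        out.append(acc)
    return out

def pm_gf(guidance, inp, alpha, sigma):
    """Hypothetical PM-GF sketch: alpha * highpass(guidance) + lowpass(inp)."""
    low_g = gaussian_smooth(guidance, sigma)
    high_g = [g - l for g, l in zip(guidance, low_g)]  # Gaussian highpass of the guidance
    low_p = gaussian_smooth(inp, sigma)                # Gaussian smoothing of the input
    return [alpha * h + l for h, l in zip(high_g, low_p)]

# A clean step edge as guidance and a noisy version of it as input.
guidance = [0.0] * 8 + [1.0] * 8
noisy = [g + (0.05 if i % 2 == 0 else -0.05) for i, g in enumerate(guidance)]
filtered = pm_gf(guidance, noisy, alpha=1.0, sigma=1.5)
```

With `alpha = 0` the sketch reduces to plain Gaussian smoothing of the input, and when the guidance equals the input with `alpha = 1` it returns the input unchanged, which makes the structure-transfer role of the highpass term easy to see.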
Abstract:Rapid bone scintigraphy is an essential tool for diagnosing skeletal diseases and tumor metastasis in pediatric patients, as it reduces scan time and minimizes patient discomfort. However, rapid scans often result in poor image quality, potentially affecting diagnosis due to reduced resolution and detail, which makes it challenging to identify and evaluate finer anatomical structures. To address this issue, we propose the first application of SAM-based semantic priors for medical image restoration, leveraging the Segment Anything Model (SAM) to enhance rapid bone scintigraphy images in pediatric populations. Our method comprises two cascaded networks, $f^{IR1}$ and $f^{IR2}$, augmented by three key modules: a Semantic Prior Integration (SPI) module, a Semantic Knowledge Distillation (SKD) module, and a Semantic Consistency Module (SCM). The SPI and SKD modules incorporate domain-specific semantic information from a fine-tuned SAM, while the SCM maintains consistent semantic feature representation throughout the cascaded networks. In addition, we will release a novel Rapid Bone Scintigraphy dataset called RBS, the first dataset dedicated to rapid bone scintigraphy image restoration in pediatric patients. RBS comprises data from 137 pediatric patients aged between 0.5 and 16 years who underwent both standard and rapid bone scans. The dataset includes scans performed at 20 cm/min (standard) and 40 cm/min (rapid), representing a $2\times$ acceleration. We conducted extensive experiments on both a publicly available endoscopic dataset and RBS. The results demonstrate that our method outperforms all existing methods across various metrics, including PSNR, SSIM, FID, and LPIPS.
Abstract:Text-to-audio (TTA), which generates audio signals from textual descriptions, has received considerable attention in recent years. However, recent work has focused only on generating monaural audio from text, whereas spatial audio provides a more immersive auditory experience, e.g., in virtual reality. To address this issue, we propose a text-to-spatial-audio (TTSA) generation framework named DualSpec. Specifically, it first trains variational autoencoders (VAEs) to extract latent acoustic representations from sound event audio. Then, given text that describes sound events and event directions, the proposed method uses the encoder of a pretrained large language model to transform the text into text features. Finally, it trains a diffusion model on the latent acoustic representations and text features for spatial audio generation. In the inference stage, only the text description is needed to generate spatial audio. In particular, to improve the synthesis quality and azimuth accuracy of the spatial sound events simultaneously, we propose to use two kinds of acoustic features: Mel spectrograms, which improve synthesis quality, and short-time Fourier transform (STFT) spectrograms, which improve azimuth accuracy. We provide a pipeline for constructing a spatial audio dataset with text prompts for training the VAEs and the diffusion model. We also introduce new spatial-aware evaluation metrics to quantify the azimuth errors of the generated spatial audio recordings. Experimental results demonstrate that the proposed method can generate spatial audio with high directional and event consistency.
Abstract:In the field of financial derivatives trading, managing volatility risk is crucial for protecting investment portfolios from market changes. Traditional Vega hedging strategies, which often rely on basic rule-based models, struggle to adapt to rapidly changing market conditions. We introduce a new framework for dynamic Vega hedging, the Adaptive Nesterov Accelerated Distributional Deep Hedging (ANADDH), which combines distributional reinforcement learning with a tailored design based on adaptive Nesterov acceleration. This approach improves the learning process in complex financial environments by modeling the hedging efficiency distribution, providing a more accurate and responsive hedging strategy. The design of adaptive Nesterov acceleration refines gradient momentum adjustments, significantly enhancing the stability and speed of convergence of the model. Through empirical analysis and comparisons, our method demonstrates substantial performance gains over existing hedging techniques. Our results confirm that this innovative combination of distributional reinforcement learning with the proposed optimization techniques improves financial risk management and highlights the practical benefits of implementing advanced neural network architectures in the finance sector.
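For readers unfamiliar with the base technique named in the abstract above, the following is a minimal sketch of standard (non-adaptive) Nesterov accelerated gradient on a toy quadratic. The adaptive variant used in ANADDH is not described in the abstract, so nothing here should be read as the authors' method; the function name, learning rate, and momentum value are illustrative assumptions.

```python
def nesterov_minimize(grad, theta0, lr=0.1, momentum=0.9, steps=200):
    """Standard Nesterov accelerated gradient for a scalar parameter.

    The gradient is evaluated at the look-ahead point theta + momentum * v,
    which is what distinguishes Nesterov momentum from classical momentum.
    """
    theta, velocity = theta0, 0.0
    for _ in range(steps):
        lookahead = theta + momentum * velocity
        velocity = momentum * velocity - lr * grad(lookahead)
        theta = theta + velocity
    return theta

# Toy objective f(x) = 0.5 * x**2, whose gradient is x and whose minimum is at 0.
theta_final = nesterov_minimize(lambda x: x, theta0=5.0)
```

An adaptive scheme would presumably adjust `lr` or `momentum` during training, but how that is done is left unspecified here.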
Abstract:Deep hedging represents a cutting-edge approach to risk management for financial derivatives by leveraging the power of deep learning. However, existing methods often face challenges related to computational inefficiency, sensitivity to noisy data, and optimization complexity, limiting their practical applicability in dynamic and volatile markets. To address these limitations, we propose Deep Hedging with Linearized-objective Neural Network (DHLNN), a robust and generalizable framework that enhances the training procedure of deep learning models. By integrating a periodic fixed-gradient optimization method with linearized training dynamics, DHLNN stabilizes the training process, accelerates convergence, and improves robustness to noisy financial data. The framework incorporates trajectory-wide optimization and Black-Scholes Delta anchoring, ensuring alignment with established financial theory while maintaining flexibility to adapt to real-world market conditions. Extensive experiments on synthetic and real market data validate the effectiveness of DHLNN, demonstrating its ability to achieve faster convergence, improved stability, and superior hedging performance across diverse market scenarios.
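The Black-Scholes Delta mentioned in the abstract above has a closed form for a European call on a non-dividend-paying asset, $\Delta = N(d_1)$ with $d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)T}{\sigma\sqrt{T}}$. A minimal sketch of that quantity follows; how DHLNN uses it as an anchor is not specified in the abstract, and the function names are illustrative assumptions.

```python
import math

def normal_cdf(x):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call_delta(spot, strike, rate, vol, maturity):
    """Delta of a European call on a non-dividend-paying asset under Black-Scholes."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * maturity) / (
        vol * math.sqrt(maturity)
    )
    return normal_cdf(d1)

# At-the-money call, zero rate, 20% volatility, one year to maturity: d1 = 0.1.
delta_atm = black_scholes_call_delta(spot=100.0, strike=100.0, rate=0.0, vol=0.2, maturity=1.0)
```

A learned hedging policy could, for example, be expressed as a correction relative to this value, though that is only one possible reading of "anchoring".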
Abstract:In decentralized financial systems, robust and efficient Federated Learning (FL) is a promising way to handle diverse client environments and to ensure resilience to systemic risks. We propose Federated Risk-Aware Learning with Central Sensitivity Estimation (FRAL-CSE), an innovative FL framework designed to enhance scalability, stability, and robustness in collaborative financial decision-making. The framework's core innovation lies in a central acceleration mechanism, guided by a quadratic sensitivity-based approximation of global model dynamics. By leveraging local sensitivity information derived from robust risk measurements, FRAL-CSE performs a curvature-informed global update that efficiently incorporates second-order information without requiring repeated local re-evaluations, thereby enhancing training efficiency and improving optimization stability. Additionally, distortion risk measures are embedded into the training objectives to capture tail risks and ensure robustness against extreme scenarios. Extensive experiments validate the effectiveness of FRAL-CSE in accelerating convergence and improving resilience across heterogeneous datasets compared to state-of-the-art baselines.
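To make the notion of a distortion risk measure in the abstract above concrete, the following sketch computes an empirical distortion risk measure on a sample of losses: the losses are sorted from worst to best and weighted by increments of a distortion function $g$. The abstract does not say which distortion FRAL-CSE uses; the CVaR distortion $g(u) = \min(u/(1-\alpha), 1)$ shown here is a common choice picked as an assumption, and all function names are illustrative.

```python
def distortion_risk(losses, distortion):
    """Empirical distortion risk measure of a loss sample.

    Losses are sorted in descending order and the i-th worst loss receives
    weight g(i/n) - g((i-1)/n), where g is a distortion function with
    g(0) = 0 and g(1) = 1.
    """
    n = len(losses)
    ordered = sorted(losses, reverse=True)
    total = 0.0
    for i, loss in enumerate(ordered, start=1):
        weight = distortion(i / n) - distortion((i - 1) / n)
        total += weight * loss
    return total

def cvar_distortion(alpha):
    """Distortion function that yields CVaR at confidence level alpha."""
    return lambda u: min(u / (1.0 - alpha), 1.0)

losses = [1.0, 2.0, 3.0, 4.0]
mean_risk = distortion_risk(losses, lambda u: u)           # identity distortion gives the mean
tail_risk = distortion_risk(losses, cvar_distortion(0.5))  # average of the worst half
```

The identity distortion recovers the ordinary expected loss, while concave distortions such as the CVaR one shift weight toward the worst outcomes, which is how tail risk enters a training objective.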
Abstract:Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected (or dispreferred) responses. However, prior research has identified that the probability of chosen responses often decreases during training, a phenomenon known as likelihood displacement. To tackle this challenge, in this work we introduce DPO-Shift to controllably shift the distribution of the chosen probability. We then show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at https://github.com/Meaquadddd/DPO-Shift.
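The likelihood displacement described in the abstract above can be seen directly from the standard DPO objective, $-\log \sigma\big(\beta[(\log \pi(y_w|x) - \log \pi_{\mathrm{ref}}(y_w|x)) - (\log \pi(y_l|x) - \log \pi_{\mathrm{ref}}(y_l|x))]\big)$, which depends only on the margin between chosen and rejected responses. The sketch below implements that standard loss for a single pair (it does not implement the modification proposed in the paper); the numeric log-probabilities are made-up values for illustration.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) written in a numerically stable softplus form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities under a fixed reference model.
ref_chosen, ref_rejected = -10.0, -12.0

# Early in training: the policy matches the reference, so the margin is zero.
loss_before = dpo_loss(-10.0, -12.0, ref_chosen, ref_rejected)

# Later: the chosen log-probability has dropped (-10 -> -11), but the rejected one
# has dropped further (-12 -> -15), so the margin grows and the loss still falls.
loss_after = dpo_loss(-11.0, -15.0, ref_chosen, ref_rejected)
```

Because the loss rewards only the gap, it can decrease while the chosen response becomes less likely, which is the behavior the abstract calls likelihood displacement.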
Abstract:Embodied intelligence integrates multiple modalities, enabling agents to understand images, language, and actions simultaneously. However, existing models often depend on additional datasets or extensive pre-training to maximize performance, which consumes substantial training time and incurs high hardware costs. To tackle this issue, we present RoboBERT, a novel end-to-end robotic manipulation model integrated with a unique training strategy. The model uses a CNN-based diffusion policy and improves and stabilizes its effectiveness by separating the training processes for different modalities. It also underscores the importance of data augmentation, verifying various techniques that significantly boost performance. Unlike models that depend on extra data or large foundation models, RoboBERT achieves a highly competitive success rate while using only language-labeled expert demonstrations and maintaining a relatively small model size. Specifically, RoboBERT achieves an average length of 4.52 on the CALVIN benchmark for the \(ABCD \rightarrow D\) task, setting a new state-of-the-art (SOTA) record. Furthermore, when tested on a real robot, the model demonstrates superior performance, achieving a higher success rate than other methods trained with the same data. We believe that the concepts and methodologies of RoboBERT demonstrate extensive versatility and compatibility, contributing significantly to the development of lightweight multimodal robotic models. The code can be accessed at https://github.com/PeterWangsicheng/RoboBERT