Abstract:Transformer-based Spiking Neural Networks (SNNs) introduce a novel event-driven self-attention paradigm that combines the high performance of Transformers with the energy efficiency of SNNs. However, the larger model size and increased computational demands of the Transformer structure limit their practicality in resource-constrained scenarios. In this paper, we integrate binarization techniques into Transformer-based SNNs and propose the Binary Event-Driven Spiking Transformer, i.e. BESTformer. The proposed BESTformer can significantly reduce storage and computational demands by representing weights and attention maps with a mere 1-bit. However, BESTformer suffers from a severe performance drop from its full-precision counterpart due to the limited representation capability of binarization. To address this issue, we propose a Coupled Information Enhancement (CIE) method, which consists of a reversible framework and information enhancement distillation. By maximizing the mutual information between the binary model and its full-precision counterpart, the CIE method effectively mitigates the performance degradation of the BESTformer. Extensive experiments on static and neuromorphic datasets demonstrate that our method achieves superior performance to other binary SNNs, showcasing its potential as a compact yet high-performance model for resource-limited edge devices.
Abstract:Integrated sensing and communication (ISAC) is emerging as a pivotal technology for next-generation wireless networks. However, existing ISAC systems are based on fixed-position antennas (FPAs), which inevitably incur a loss in performance when balancing the trade-off between sensing and communication. Movable antenna (MA) technology offers promising potential to enhance ISAC performance by enabling flexible antenna movement. Nevertheless, exploiting more spatial channel variations requires larger antenna moving regions, which may invalidate the conventional far-field assumption for channels between transceivers. Therefore, this paper utilizes the MA to enhance sensing and communication capabilities in near-field ISAC systems, where a full-duplex base station (BS) is equipped with multiple transmit and receive MAs movable in large-size regions to simultaneously sense multiple targets and serve multiple uplink (UL) and downlink (DL) users for communication. We aim to maximize the weighted sum of sensing and communication rates (WSR) by jointly designing the transmit beamformers, sensing signal covariance matrices, receive beamformers, and MA positions at the BS, as well as the UL power allocation. The resulting optimization problem is challenging to solve, while we propose an efficient two-layer random position (RP) algorithm to tackle it. In addition, to reduce movement delay and cost, we design an antenna position matching (APM) algorithm based on the greedy strategy to minimize the total MA movement distance. Extensive simulation results demonstrate the substantial performance improvement achieved by deploying MAs in near-field ISAC systems. Moreover, the results show the effectiveness of the proposed APM algorithm in reducing the antenna movement distance, which is helpful for energy saving and time overhead reduction for MA-aided near-field ISAC systems with large moving regions.
Abstract:For gradient-based machine learning (ML) methods commonly adopted in practice such as stochastic gradient descent, the de facto differential privacy (DP) technique is perturbing the gradients with random Gaussian noise. Data valuation attributes the ML performance to the training data and is widely used in privacy-aware applications that require enforcing DP such as data pricing, collaborative ML, and federated learning (FL). Can existing data valuation methods still be used when DP is enforced via gradient perturbations? We show that the answer is no with the default approach of injecting i.i.d.~random noise to the gradients because the estimation uncertainty of the data value estimation paradoxically linearly scales with more estimation budget, producing estimates almost like random guesses. To address this issue, we propose to instead inject carefully correlated noise to provably remove the linear scaling of estimation uncertainty w.r.t.~the budget. We also empirically demonstrate that our method gives better data value estimates on various ML tasks and is applicable to use cases including dataset valuation and~FL.
Abstract:Generalized few-shot semantic segmentation (GFSS) aims to segment objects of both base and novel classes, using sufficient samples of base classes and few samples of novel classes. Representative GFSS approaches typically employ a two-phase training scheme, involving base class pre-training followed by novel class fine-tuning, to learn the classifiers for base and novel classes respectively. Nevertheless, distribution gap exists between base and novel classes in this process. To narrow this gap, we exploit effective knowledge transfer from base to novel classes. First, a novel prototype modulation module is designed to modulate novel class prototypes by exploiting the correlations between base and novel classes. Second, a novel classifier calibration module is proposed to calibrate the weight distribution of the novel classifier according to that of the base classifier. Furthermore, existing GFSS approaches suffer from a lack of contextual information for novel classes due to their limited samples, we thereby introduce a context consistency learning scheme to transfer the contextual knowledge from base to novel classes. Extensive experiments on PASCAL-5$^i$ and COCO-20$^i$ demonstrate that our approach significantly enhances the state of the art in the GFSS setting. The code is available at: https://github.com/HHHHedy/GFSS-EKT.
Abstract:Recently, developing unified medical image segmentation models gains increasing attention, especially with the advent of the Segment Anything Model (SAM). SAM has shown promising binary segmentation performance in natural domains, however, transferring it to the medical domain remains challenging, as medical images often possess substantial inter-category overlaps. To address this, we propose the SEmantic-Guided SAM (SEG-SAM), a unified medical segmentation model that incorporates semantic medical knowledge to enhance medical segmentation performance. First, to avoid the potential conflict between binary and semantic predictions, we introduce a semantic-aware decoder independent of SAM's original decoder, specialized for both semantic segmentation on the prompted object and classification on unprompted objects in images. To further enhance the model's semantic understanding, we solicit key characteristics of medical categories from large language models and incorporate them into SEG-SAM through a text-to-vision semantic module, adaptively transferring the language information into the visual segmentation task. In the end, we introduce the cross-mask spatial alignment strategy to encourage greater overlap between the predicted masks from SEG-SAM's two decoders, thereby benefiting both predictions. Extensive experiments demonstrate that SEG-SAM outperforms state-of-the-art SAM-based methods in unified binary medical segmentation and task-specific methods in semantic medical segmentation, showcasing promising results and potential for broader medical applications.
Abstract:Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to corresponding regions in the reference image. To address this, we thereby propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, it is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Our extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.
Abstract:In this paper, we study the neural tangent kernel (NTK) for general partial differential equations (PDEs) based on physics-informed neural networks (PINNs). As we all know, the training of an artificial neural network can be converted to the evolution of NTK. We analyze the initialization of NTK and the convergence conditions of NTK during training for general PDEs. The theoretical results show that the homogeneity of differential operators plays a crucial role for the convergence of NTK. Moreover, based on the PINNs, we validate the convergence conditions of NTK using the initial value problems of the sine-Gordon equation and the initial-boundary value problem of the KdV equation.
Abstract:We introduce MarDini, a new family of video diffusion models that integrate the advantages of masked auto-regression (MAR) into a unified diffusion model (DM) framework. Here, MAR handles temporal planning, while DM focuses on spatial generation in an asymmetric network design: i) a MAR-based planning model containing most of the parameters generates planning signals for each masked frame using low-resolution input; ii) a lightweight generation model uses these signals to produce high-resolution frames via diffusion de-noising. MarDini's MAR enables video generation conditioned on any number of masked frames at any frame positions: a single model can handle video interpolation (e.g., masking middle frames), image-to-video generation (e.g., masking from the second frame onward), and video expansion (e.g., masking half the frames). The efficient design allocates most of the computational resources to the low-resolution planning model, making computationally expensive but important spatio-temporal attention feasible at scale. MarDini sets a new state-of-the-art for video interpolation; meanwhile, within few inference steps, it efficiently generates videos on par with those of much more expensive advanced image-to-video models.
Abstract:This paper presents, for the first time, the concept of \textit{polarforming} for wireless communications. Polarforming refers to a novel technique that enables dynamic adjustment of antenna polarization using reconfigurable polarized antennas (RPAs). It can fully leverage polarization diversity to improve the performance of wireless communication systems by aligning the effective polarization state of the incoming electromagnetic (EM) wave with the antenna polarization. To better demonstrate the benefits of polarforming, we propose a general RPA-aided system that allows for tunable antenna polarization. A wavefront-based channel model is developed to properly capture depolarization behaviors in both line-of-sight (LoS) and non-line-of-sight (NLoS) channels. Based on this model, we provide a detailed description of transmit and receive polarforming on planes of polarization (PoPs). We also evaluate the performance gains provided by polarforming under stochastic channel conditions. Specifically, we derive a closed-form expression for the relative signal-to-noise ratio (SNR) gain compared to conventional fixed-polarization antenna (FPA) systems and approximate the cumulative distribution function (CDF) for the RPA system. Our analysis reveals that polarforming offers a diversity gain of two, indicating full utilization of polarization diversity for dual-polarized antennas. Furthermore, extensive simulation results validate the effectiveness of polarforming and exhibit substantial improvements over conventional FPA systems. The results also indicate that polarforming not only can combat depolarization effects caused by wireless channels but also can overcome channel correlation when scattering is insufficient.
Abstract:This letter investigates a movable antenna (MA)-aided full-duplex (FD) satellite communication system, where the satellite, equipped with both transmit and receive MAs, serves multiple uplink (UL) and downlink (DL) user terminals (UTs) in FD mode. Specifically, we formulate a multiobjective optimization problem to minimize the UL and DL transmit powers under imperfect channel state information (CSI) conditions. To jointly optimize the MA positions and transmit powers, we propose a two-loop particle swarm optimization (PSO) algorithm based on a multiobjective optimization framework. Simulation results demonstrate that flexible adjustments of MA positions can effectively reduce the total UL and DL transmit powers, while also alleviating the burden on self-interference (SI) cancellation modules.