Abstract:The in-context learning (ICL) capability of pre-trained models based on the transformer architecture has received growing interest in recent years. While theoretical understanding has been obtained for ICL in reinforcement learning (RL), the previous results are largely confined to the single-agent setting. This work proposes to further explore the in-context learning capabilities of pre-trained transformer models in competitive multi-agent games, i.e., in-context game-playing (ICGP). Focusing on the classical two-player zero-sum games, theoretical guarantees are provided to demonstrate that pre-trained transformers can provably learn to approximate Nash equilibrium in an in-context manner for both decentralized and centralized learning settings. As a key part of the proof, constructional results are established to demonstrate that the transformer architecture is sufficiently rich to realize celebrated multi-agent game-playing algorithms, in particular, decentralized V-learning and centralized VI-ULCB.
Abstract:Federated learning (FL) is a commonly distributed algorithm for mobile users (MUs) training artificial intelligence (AI) models, however, several challenges arise when applying FL to real-world scenarios, such as label scarcity, non-IID data, and unexplainability. As a result, we propose an explainable personalized FL framework, called XPFL. First, we introduce a generative AI (GAI) assisted personalized federated semi-supervised learning, called GFed. Particularly, in local training, we utilize a GAI model to learn from large unlabeled data and apply knowledge distillation-based semi-supervised learning to train the local FL model using the knowledge acquired from the GAI model. In global aggregation, we obtain the new local FL model by fusing the local and global FL models in specific proportions, allowing each local model to incorporate knowledge from others while preserving its personalized characteristics. Second, we propose an explainable AI mechanism for FL, named XFed. Specifically, in local training, we apply a decision tree to match the input and output of the local FL model. In global aggregation, we utilize t-distributed stochastic neighbor embedding (t-SNE) to visualize the local models before and after aggregation. Finally, simulation results validate the effectiveness of the proposed XPFL framework.
Abstract:Interception of low-altitude intruding targets with low-cost drones equipped strapdown camera presents a competitive option. However, the malicious maneuvers by the non-cooperative target and the coupling of the camera make the task challenging. To solve this problem, an Image-Based Visual Servoing (IBVS) control algorithm based on proportional navigation guidance with field-of-view holding capability is designed. The proposed controller reduces the miss distance while improving the stability of the visual servo system during interception. Software-in-the-loop (SITL) simulation experiments show a 72.8% reduction in the circular error probability (CEP) compared to the most recent study. This improvement enhances interception accuracy from the decimeter to the centimeter level. Real-world experiments further validate the effectiveness of the proposed algorithm.
Abstract:The conventional reconfigurable intelligent surface (RIS) assisted far-field communication systems can only implement angle beamforming, which actually limits the capability for reconfiguring the wireless propagation environment. To overcome this limitation, this paper proposes a newly designed frequency diverse RIS (FD-RIS), which can achieve joint distance-angle beamforming with the assistance of the time modulation technology. The signal processing model for FD-RIS-aided wireless communications is first derived. Then, an optimization problem aimed at maximizing the achievable rate is formulated where the frequency-time modulations are jointly optimized to achieve distance-angle beamforming. Furthermore, a novel iterative algorithm based on the cross-entropy optimization (CEO) framework is proposed to effectively handle the non-convex optimization problem. The numerical results validate that the proposed FD-RIS assisted communication scheme can achieve a notable performance improvement compared with the baseline scheme utilizing traditional RIS. In addition, the effectiveness of the proposed CEO algorithm is further verified by comparing with the benchmark using the genetic algorithm (GA).
Abstract:In this letter, we investigate the design of multiple reconfigurable intelligent sensing surfaces (RISSs) that enhance both communication and sensing tasks. An RISS incorporates additional active elements tailored to improve sensing accuracy. Our initial task involves optimizing placement of RISSs to mitigate signal interference. Subsequently, we establish power allocation schemes for sensing and communication within the system. Our final consideration involves examining how sensing results can be utilized to enhance communication, alongside an evaluation of communication performance under the impact of sensing inaccuracies. Numerical results reveal that the sensing task reaches its optimal performance with a finite number of RISSs, while the communication task exhibits enhanced performance with an increasing number of RISSs. Additionally, we identify an optimal communication spot under user movement.
Abstract:Hannan Limitation successfully links the directivity characteristics of 2D arrays with the aperture gain limit, providing the radiation efficiency upper limit for large 2D planar antenna arrays. This demonstrates the inevitable radiation efficiency degradation caused by mutual coupling effects between array elements. However, this limitation is derived based on the assumption of infinitely large 2D arrays, which means that it is not an accurate law for small-size arrays. In this paper, we extend this theory and propose an estimation formula for the radiation efficiency upper limit of finite-sized 2D arrays. Furthermore, we analyze a 3D array structure consisting of two parallel 2D arrays. Specifically, we provide evaluation formulas for the mutual coupling strengths for both infinite and finite size arrays and derive the fundamental efficiency limit of 3D arrays. Moreover, based on the established gain limit of antenna arrays with fixed aperture sizes, we derive the achievable gain limit of finite size 3D arrays. Besides the performance analyses, we also investigate the spatial radiation characteristics of the considered 3D array structure, offering a feasible region for 2D phase settings under a given energy attenuation threshold. Through simulations, we demonstrate the effectiveness of our proposed theories and gain advantages of 3D arrays for better spatial coverage under various scenarios.
Abstract:The recent rise of EEG-based end-to-end deep learning models presents a significant challenge in elucidating how these models process raw EEG signals and generate predictions in the frequency domain. This challenge limits the transparency and credibility of EEG-based end-to-end models, hindering their application in security-sensitive areas. To address this issue, we propose a mask perturbation method to explain the behavior of end-to-end models in the frequency domain. Considering the characteristics of EEG data, we introduce a target alignment loss to mitigate the out-of-distribution problem associated with perturbation operations. Additionally, we develop a perturbation generator to define perturbation generation in the frequency domain. Our explanation method is validated through experiments on multiple representative end-to-end deep learning models in the EEG decoding field, using an established EEG benchmark dataset. The results demonstrate the effectiveness and superiority of our method, and highlight its potential to advance research in EEG-based end-to-end models.
Abstract:Understanding emotions from diverse contexts has received widespread attention in computer vision communities. The core philosophy of Context-Aware Emotion Recognition (CAER) is to provide valuable semantic cues for recognizing the emotions of target persons by leveraging rich contextual information. Current approaches invariably focus on designing sophisticated structures to extract perceptually critical representations from contexts. Nevertheless, a long-neglected dilemma is that a severe context bias in existing datasets results in an unbalanced distribution of emotional states among different contexts, causing biased visual representation learning. From a causal demystification perspective, the harmful bias is identified as a confounder that misleads existing models to learn spurious correlations based on likelihood estimation, limiting the models' performance. To address the issue, we embrace causal inference to disentangle the models from the impact of such bias, and formulate the causalities among variables in the CAER task via a customized causal graph. Subsequently, we present a Contextual Causal Intervention Module (CCIM) to de-confound the confounder, which is built upon backdoor adjustment theory to facilitate seeking approximate causal effects during model training. As a plug-and-play component, CCIM can easily integrate with existing approaches and bring significant improvements. Systematic experiments on three datasets demonstrate the effectiveness of our CCIM.
Abstract:Understanding human intentions (e.g., emotions) from videos has received considerable attention recently. Video streams generally constitute a blend of temporal data stemming from distinct modalities, including natural language, facial expressions, and auditory clues. Despite the impressive advancements of previous works via attention-based paradigms, the inherent temporal asynchrony and modality heterogeneity challenges remain in multimodal sequence fusion, causing adverse performance bottlenecks. To tackle these issues, we propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations (MEA) to refine multimodal features and leverage the complementarity across distinct modalities. On the one hand, MEA introduces a predictive self-attention module to capture reliable context dynamics within modalities and reinforce unique features over the modality-exclusive spaces. On the other hand, a hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities over the modality-agnostic space. Meanwhile, a double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner. Eventually, we propose a decoupled graph fusion mechanism to enhance knowledge exchange across heterogeneous modalities and learn robust multimodal representations for downstream tasks. Numerous experiments are implemented on three multimodal datasets with asynchronous sequences. Systematic analyses show the necessity of our approach.
Abstract:With the rising demand for high-rate and timely communications, fog radio access networks (F-RANs) offer a promising solution. This work investigates age of information (AoI) performance in F-RANs, consisting of multiple content users (CUs), enhanced remote radio heads (eRRHs), and content providers (CPs). Time-critical CUs need rapid content updates from CPs but cannot communicate directly with them; instead, eRRHs act as intermediaries. CUs decide whether to request content from a CP and which eRRH to send the request to, while eRRHs decide whether to command CPs to update content or use cached content. We study two general classes of policies: (i) oblivious policies, where decision-making is independent of historical information, and (ii) non-oblivious policies, where decisions are influenced by historical information. First, we obtain closed-form expressions for the average AoI of eRRHs under both policy types. Due to the complexity of calculating closed-form expressions for CUs, we then derive general upper bounds for their average AoI. Next, we identify optimal policies for both types. Under both optimal policies, each CU requests content from each CP at an equal rate, consolidating all requests to a single eRRH when demand is low or resources are limited, and distributing requests evenly among eRRHs when demand is high and resources are ample. eRRHs command content from each CP at an equal rate under an optimal oblivious policy, while prioritize the CP with the highest age under an optimal non-oblivious policy. Our numerical results validate these theoretical findings.