Abstract:Existing end-to-end autonomous driving (AD) algorithms typically follow the Imitation Learning (IL) paradigm, which faces challenges such as causal confusion and the open-loop gap. In this work, we establish a 3DGS-based closed-loop Reinforcement Learning (RL) training paradigm. By leveraging 3DGS techniques, we construct a photorealistic digital replica of the real physical world, enabling the AD policy to extensively explore the state space and learn to handle out-of-distribution scenarios through large-scale trial and error. To enhance safety, we design specialized rewards that guide the policy to effectively respond to safety-critical events and understand real-world causal relationships. For better alignment with human driving behavior, IL is incorporated into RL training as a regularization term. We introduce a closed-loop evaluation benchmark consisting of diverse, previously unseen 3DGS environments. Compared to IL-based methods, RAD achieves stronger performance in most closed-loop metrics, especially 3x lower collision rate. Abundant closed-loop results are presented at https://hgao-cv.github.io/RAD.
Abstract:As automated trading gains traction in the financial market, algorithmic investment strategies are increasingly prominent. While Large Language Models (LLMs) and Agent-based models exhibit promising potential in real-time market analysis and trading decisions, they still experience a significant -20% loss when confronted with rapid declines or frequent fluctuations, impeding their practical application. Hence, there is an imperative to explore a more robust and resilient framework. This paper introduces an innovative multi-agent system, HedgeAgents, aimed at bolstering system robustness via ``hedging'' strategies. In this well-balanced system, an array of hedging agents has been tailored, where HedgeAgents consist of a central fund manager and multiple hedging experts specializing in various financial asset classes. These agents leverage LLMs' cognitive capabilities to make decisions and coordinate through three types of conferences. Benefiting from the powerful understanding of LLMs, our HedgeAgents attained a 70% annualized return and a 400% total return over a period of 3 years. Moreover, we have observed with delight that HedgeAgents can even formulate investment experience comparable to those of human experts (https://hedgeagents.github.io/).
Abstract:Medical vision-language pretraining (VLP) that leverages naturally-paired medical image-report data is crucial for medical image analysis. However, existing methods struggle to accurately characterize associations between images and diseases, leading to inaccurate or incomplete diagnostic results. In this work, we propose MedFILIP, a fine-grained VLP model, introduces medical image-specific knowledge through contrastive learning, specifically: 1) An information extractor based on a large language model is proposed to decouple comprehensive disease details from reports, which excels in extracting disease deals through flexible prompt engineering, thereby effectively reducing text complexity while retaining rich information at a tiny cost. 2) A knowledge injector is proposed to construct relationships between categories and visual attributes, which help the model to make judgments based on image features, and fosters knowledge extrapolation to unfamiliar disease categories. 3) A semantic similarity matrix based on fine-grained annotations is proposed, providing smoother, information-richer labels, thus allowing fine-grained image-text alignment. 4) We validate MedFILIP on numerous datasets, e.g., RSNA-Pneumonia, NIH ChestX-ray14, VinBigData, and COVID-19. For single-label, multi-label, and fine-grained classification, our model achieves state-of-the-art performance, the classification accuracy has increased by a maximum of 6.69\%. The code is available in https://github.com/PerceptionComputingLab/MedFILIP.
Abstract:Low-earth orbit (LEO) satellite communication (SatCom) has emerged as a promising technology for improving wireless connectivity in global areas. Cell-free massive multiple-input multiple-output (CF-mMIMO), an architecture recently proposed for next-generation networks, has yet to be fully explored for LEO satellites. In this paper, we investigate the downlink performance of a CF-mMIMO LEO SatCom network, where many satellite access points (SAPs) simultaneously serve the corresponding ground user terminals (UTs). Using tools from stochastic geometry, we model the locations of SAPs and UTs on surfaces of concentric spheres using Poisson point processes (PPPs) and present expressions based on linear minimum-mean-square-error (LMMSE) channel estimation and conjugate beamforming. Then, we derive the coverage probabilities in both fading and non-fading scenarios, with significant system parameters such as the Nakagami fading parameter, number of UTs, number of SAPs, orbital altitude, and service range brought by the dome angle. Finally, the analytical model is verified by extensive Monte Carlo simulations. Simulation results show that stronger line-of-sight (LoS) effects and a more comprehensive service range of the UT bring higher coverage probability despite existing multi-user interference. Moreover, we found that there exist optimal numbers of UTs for different orbital altitudes and dome angles, which provides valuable system design insights.
Abstract:This paper investigates an analytical model for low-earth orbit (LEO) multi-satellite downlink non-orthogonal multiple access (NOMA) networks. The satellites transmit data to multiple NOMA user terminals (UTs), each employing successive interference cancellation (SIC) for decoding. Two ordering schemes are adopted for NOMA-enabled LEO satellite networks, i.e., mean signal power (MSP)-based ordering and instantaneous-signal-to-inter-satellite-interference-plus-noise ratio (ISINR)-based ordering. For each ordering scheme, we derive the coverage probabilities of UTs under different channel conditions. Moreover, we discuss how coverage is influenced by SIC, main-lobe gain, and tradeoffs between the number of satellites and their altitudes. Additionally, two user fairness-based power allocation (PA) schemes are considered, and PA coefficients with the optimal number of UTs that maximize their sum spectral efficiency (SE) are studied. Simulation results show that there exists a maximum signal-to-inter-satellite-interference-plus-noise ratio (SINR) threshold for each PA scheme that ensures the operation of NOMA in LEO satellite networks, and the benefit of NOMA only exists when the target SINR is below a certain threshold. Compared with orthogonal multiple access (OMA), NOMA increases UTs' sum SE by as much as 35\%. Furthermore, for most SINR thresholds, the sum SE increases with the number of UTs to the highest value, whilst the maximum sum SE is obtained when there are two UTs.
Abstract:Space-ground integrated network (SGIN) has been envisioned as a competitive solution for large scale and wide coverage of future wireless networks. By integrating both the non-terrestrial network (NTN) and the terrestrial network (TN), SGIN can provide high speed and omnipresent wireless network access for the users using the predefined licensed spectrums. Considering the scarcity of the spectrum resource and the low spectrum efficiency of the SGIN, we enable the NTN and TN to share the spectrum to improve overall system performance, i.e., weighted-sum area data rate (WS-ADR). However, mutual interference between NTN and TN is often inevitable and thus causes SGIN performance degradation. In this work, we consider a ground protection zone for the TN base stations, in which the NTN users are only allowed to use the NTN reserved spectrum to mitigate the NTN and TN mutual interference. We analytically derive the coverage probability and area data rate (ADR) of the typical users and study the performance under various protection zone sizes and spectrum allocation parameter settings. Simulation and numerical results demonstrate that the WS-ADR could be maximized by selecting the appropriate radius of protection zone and bandwidth allocation factor in the SGIN.
Abstract:To accommodate the increasing communication needs in non-terrestrial networks (NTNs), wireless users in remote areas may require access to more spectrum than is currently allocated. Terrestrial networks (TNs), such as cellular networks, are deployed in specific areas, but many underused licensed spectrum bands remain in remote areas. Therefore, bringing NTNs to a shared spectrum with TNs can improve network capacity under reasonable interference management. However, in satellite-terrestrial integrated networks (STINs), the comprehensive coverage of a satellite and the unbalanced communication resources of STINs make it challenging to effectively manage mutual interference between NTN and TN. This article presents the fundamentals and prospects of spectrum sharing (SS) in STINs by introducing four SS frameworks, their potential application scenarios, and technical challenges. Furthermore, advanced SS approaches related to interference management in STINs and performance metrics of SS in STINs are introduced. Moreover, a preliminary performance evaluation showcases the potential for sharing the spectrum between NTN and TN. Finally, future research opportunities for SS in STINs are discussed.
Abstract:Compositional relational reasoning (CRR) is a hallmark of human intelligence, but we lack a clear understanding of whether and how existing transformer large language models (LLMs) can solve CRR tasks. To enable systematic exploration of the CRR capability of LLMs, we first propose a new synthetic benchmark called Generalized Associative Recall (GAR) by integrating and generalizing the essence of several tasks in mechanistic interpretability (MI) study in a unified framework. Evaluation shows that GAR is challenging enough for existing LLMs, revealing their fundamental deficiency in CRR. Meanwhile, it is easy enough for systematic MI study. Then, to understand how LLMs solve GAR tasks, we use attribution patching to discover the core circuits reused by Vicuna-33B across different tasks and a set of vital attention heads. Intervention experiments show that the correct functioning of these heads significantly impacts task performance. Especially, we identify two classes of heads whose activations represent the abstract notion of true and false in GAR tasks respectively. They play a fundamental role in CRR across various models and tasks. The dataset and code are available at https://github.com/Caiyun-AI/GAR.
Abstract:Many applications demand context sensing to offer personalized and timely services. Yet, developing sensing programs can be challenging for developers and using them is privacy-concerning for end-users. In this paper, we propose to use natural language as the unified interface to process personal data and sense user context, which can effectively ease app development and make the data pipeline more transparent. Our work is inspired by large language models (LLMs) and other generative models, while directly applying them does not solve the problem - letting the model directly process the data cannot handle complex sensing requests and letting the model write the data processing program suffers error-prone code generation. We address the problem with 1) a unified data processing framework that makes context-sensing programs simpler and 2) a feedback-guided query optimizer that makes data query more informative. To evaluate the performance of natural language-based context sensing, we create a benchmark that contains 133 context sensing tasks. Extensive evaluation has shown that our approach is able to automatically solve the context-sensing tasks efficiently and precisely. The code is opensourced at https://github.com/MobileLLM/ChainStream.
Abstract:Recently, the diffusion model has emerged as a powerful generative technique for robotic policy learning, capable of modeling multi-mode action distributions. Leveraging its capability for end-to-end autonomous driving is a promising direction. However, the numerous denoising steps in the robotic diffusion policy and the more dynamic, open-world nature of traffic scenes pose substantial challenges for generating diverse driving actions at a real-time speed. To address these challenges, we propose a novel truncated diffusion policy that incorporates prior multi-mode anchors and truncates the diffusion schedule, enabling the model to learn denoising from anchored Gaussian distribution to the multi-mode driving action distribution. Additionally, we design an efficient cascade diffusion decoder for enhanced interaction with conditional scene context. The proposed model, DiffusionDrive, demonstrates 10$\times$ reduction in denoising steps compared to vanilla diffusion policy, delivering superior diversity and quality in just 2 steps. On the planning-oriented NAVSIM dataset, with the aligned ResNet-34 backbone, DiffusionDrive achieves 88.1 PDMS without bells and whistles, setting a new record, while running at a real-time speed of 45 FPS on an NVIDIA 4090. Qualitative results on challenging scenarios further confirm that DiffusionDrive can robustly generate diverse plausible driving actions. Code and model will be available at https://github.com/hustvl/DiffusionDrive.