Abstract:Many researchers collect data from the internet through crowd-sourcing or web crawling to alleviate the data-hungry challenge associated with cross-modal matching. Although such practice does not require expensive annotations, it inevitably introduces mismatched pairs and results in a noisy correspondence problem. Current approaches leverage the memorization effect of deep neural networks to distinguish noise and perform re-weighting. However, briefly lowering the weight of noisy pairs cannot eliminate the negative impact of noisy correspondence in the training process. In this paper, we propose a novel self-drop and dual-weight approach, which achieves elaborate data processing by qua-partitioning the data. Specifically, our approach partitions all data into four types: clean and significant, clean yet insignificant, vague, and noisy. We analyze the effect of noisy and clean data pairs and find that for vision-language pre-training models, a small number of clean samples is more valuable than a majority of noisy ones. Based on this observation, we employ self-drop to discard noisy samples to effectively mitigate the impact of noise. In addition, we adopt a dual-weight strategy to ensure that the model focuses more on significant samples while appropriately leveraging vague samples. Compared to the prior works, our approach is more robust and demonstrates relatively more stable performance on noisy datasets, especially under a high noise ratio. Extensive experiments on three widely used datasets, including Flickr30K, MS-COCO, and Conceptual Captions, validate the effectiveness of our approach. The source code is available at https://github.com/DongChenwei2000/SDD.
Abstract:This paper presents an overview on intelligent reflecting surface (IRS)-enabled sensing and communication for the forthcoming sixth-generation (6G) wireless networks, in which IRSs are strategically deployed to proactively reconfigure wireless environments to improve both sensing and communication (S&C) performance. First, we exploit a single IRS to enable wireless sensing in the base station's (BS's) non-line-of-sight (NLoS) area. In particular, we present three IRS-enabled NLoS target sensing architectures with fully-passive, semi-passive, and active IRSs, respectively. We compare their pros and cons by analyzing the fundamental sensing performance limits for target detection and parameter estimation. Next, we consider a single IRS to facilitate integrated sensing and communication (ISAC), in which the transmit signals at the BS are used for achieving both S&C functionalities, aided by the IRS through reflective beamforming. We present joint transmit signal and receiver processing designs for realizing efficient ISAC, and jointly optimize the transmit beamforming at the BS and reflective beamforming at the IRS to balance the fundamental performance tradeoff between S&C. Furthermore, we discuss multi-IRS networked ISAC, by particularly focusing on multi-IRS-enabled multi-link ISAC, multi-region ISAC, and ISAC signal routing, respectively. Finally, we highlight various promising research topics in this area to motivate future work.
Abstract:Orthogonal time frequency space (OTFS) modulation is anticipated to be a promising candidate for supporting integrated sensing and communications (ISAC) systems, which is considered as a pivotal technique for realizing next generation wireless networks. In this paper, we develop a minimum bit error rate (BER) precoder design for an OTFS-based ISAC system. In particular, the BER minimization problem takes into account the maximum available transmission power budget and the required sensing performance. Different from prior studies that considered ISAC in the time-frequency (TF) domain, we devise the precoder from the perspective of the delay-Doppler (DD) domain by exploiting the equivalent DD domain channel due to the fact that the DD domain channel generally tends to be sparse and quasi-static, which can facilitate a low-overhead ISAC system design. To address the non-convex optimization design problem, we resort to optimizing the lower bound of the derived average BER by adopting Jensen's inequality. Subsequently, the formulated problem is decoupled into two independent sub-problems via singular value decomposition (SVD) methodology. We then theoretically analyze the feasibility conditions of the proposed problem and present a low-complexity iterative solution via leveraging the Lagrangian duality approach. Simulation results verify the effectiveness of our proposed precoder compared to the benchmark schemes and reveal the interplay between sensing and communication for dual-functional precoder design, indicating a trade-off where transmission efficiency is sacrificed for increasing transmission reliability and sensing accuracy.
Abstract:Impulse radio ultra-wideband (IR-UWB) signals stand out for their high temporal resolution, low cost, and large bandwidth, making them a highly promising option for integrated sensing and communication (ISAC) systems. In this paper, we design an ISAC system for a bi-static passive sensing scenario that accommodates multiple targets. Specifically, we introduce two typical modulation schemes, PPM and BPSK, for data transmission. The essential coupling between sensing and communication is examined through the Fisher information matrix (FIM). Accordingly, we introduce a pilot-based decoupling approach that relies on known time-delays, as well as a differential decoupling strategy that uses a known starting symbol position. Finally, we assess the sensing and communication performance under various modulation and demodulation schemes under the constraints of current UWB standards. This assessment utilizes the Cramer-Rao Lower Bound (CRLB) for sensing and the Shannon capacity limit for communication, offering theoretical insights into choosing suitable data signal processing methods in real-world applications.
Abstract:Despite advancements in enhancing LLM safety against jailbreak attacks, evaluating LLM defenses remains a challenge, with current methods often lacking explainability and generalization to complex scenarios, leading to incomplete assessments (e.g., direct judgment without reasoning, low F1 score of GPT-4 in complex cases, bias in multilingual scenarios). To address this, we present JAILJUDGE, a comprehensive benchmark featuring diverse risk scenarios, including synthetic, adversarial, in-the-wild, and multilingual prompts, along with high-quality human-annotated datasets. The JAILJUDGE dataset includes over 35k+ instruction-tune data with reasoning explainability and JAILJUDGETEST, a 4.5k+ labeled set for risk scenarios, and a 6k+ multilingual set across ten languages. To enhance evaluation with explicit reasoning, we propose the JailJudge MultiAgent framework, which enables explainable, fine-grained scoring (1 to 10). This framework supports the construction of instruction-tuning ground truth and facilitates the development of JAILJUDGE Guard, an end-to-end judge model that provides reasoning and eliminates API costs. Additionally, we introduce JailBoost, an attacker-agnostic attack enhancer, and GuardShield, a moderation defense, both leveraging JAILJUDGE Guard. Our experiments demonstrate the state-of-the-art performance of JailJudge methods (JailJudge MultiAgent, JAILJUDGE Guard) across diverse models (e.g., GPT-4, Llama-Guard) and zero-shot scenarios. JailBoost and GuardShield significantly improve jailbreak attack and defense tasks under zero-shot settings, with JailBoost enhancing performance by 29.24% and GuardShield reducing defense ASR from 40.46% to 0.15%.
Abstract:The Direct Segment Anything Model (DirectSAM) excels in class-agnostic contour extraction. In this paper, we explore its use by applying it to optical remote sensing imagery, where semantic contour extraction-such as identifying buildings, road networks, and coastlines-holds significant practical value. Those applications are currently handled via training specialized small models separately on small datasets in each domain. We introduce a foundation model derived from DirectSAM, termed DirectSAM-RS, which not only inherits the strong segmentation capability acquired from natural images, but also benefits from a large-scale dataset we created for remote sensing semantic contour extraction. This dataset comprises over 34k image-text-contour triplets, making it at least 30 times larger than individual dataset. DirectSAM-RS integrates a prompter module: a text encoder and cross-attention layers attached to the DirectSAM architecture, which allows flexible conditioning on target class labels or referring expressions. We evaluate the DirectSAM-RS in both zero-shot and fine-tuning setting, and demonstrate that it achieves state-of-the-art performance across several downstream benchmarks.
Abstract:This paper introduces an integrated sensing, computing, and semantic communication (ISCSC) framework tailored for smart healthcare systems. The framework is evaluated in the context of smart healthcare, optimising the transmit beamforming matrix and semantic extraction ratio for improved data rates, sensing accuracy, and general data protection regulation (GDPR) compliance, while considering IoRT device computing capabilities. Semantic metrics such as semantic transmission rate and semantic secrecy rate are derived to evaluate data rate performance and GDPR risk, respectively, while the Cram\'er-Rao Bound (CRB) assesses sensing performance. Simulation results demonstrate the framework's effectiveness in ensuring reliable sensing, high data rates, and secure communication.
Abstract:As various types of crime continue to threaten public safety and economic development, predicting the occurrence of multiple types of crimes becomes increasingly vital for effective prevention measures. Although extensive efforts have been made, most of them overlook the heterogeneity of different crime categories and fail to address the issue of imbalanced spatial distribution. In this work, we propose a Spatial-Temporal Mixture-of-Graph-Experts (ST-MoGE) framework for collective multiple-type crime prediction. To enhance the model's ability to identify diverse spatial-temporal dependencies and mitigate potential conflicts caused by spatial-temporal heterogeneity of different crime categories, we introduce an attentive-gated Mixture-of-Graph-Experts (MGEs) module to capture the distinctive and shared crime patterns of each crime category. Then, we propose Cross-Expert Contrastive Learning(CECL) to update the MGEs and force each expert to focus on specific pattern modeling, thereby reducing blending and redundancy. Furthermore, to address the issue of imbalanced spatial distribution, we propose a Hierarchical Adaptive Loss Re-weighting (HALR) approach to eliminate biases and insufficient learning of data-scarce regions. To evaluate the effectiveness of our methods, we conduct comprehensive experiments on two real-world crime datasets and compare our results with twelve advanced baselines. The experimental results demonstrate the superiority of our methods.
Abstract:Power Line Autonomous Inspection (PLAI) plays a crucial role in the construction of smart grids due to its great advantages of low cost, high efficiency, and safe operation. PLAI is completed by accurately detecting the electrical components and defects in the aerial images captured by Unmanned Aerial Vehicles (UAVs). However, the visible quality of aerial images is inevitably degraded by adverse weather like haze, rain, or snow, which are found to drastically decrease the detection accuracy in our research. To circumvent this problem, we propose a new task of Power Line Aerial Image Restoration under Adverse Weather (PLAIR-AW), which aims to recover clean and high-quality images from degraded images with bad weather thus improving detection performance for PLAI. In this context, we are the first to release numerous corresponding datasets, namely, HazeCPLID, HazeTTPLA, HazeInsPLAD for power line aerial image dehazing, RainCPLID, RainTTPLA, RainInsPLAD for power line aerial image deraining, SnowCPLID, SnowInsPLAD for power line aerial image desnowing, which are synthesized upon the public power line aerial image datasets of CPLID, TTPLA, InsPLAD following the mathematical models. Meanwhile, we select numerous state-of-the-art methods from image restoration community as the baseline methods for PLAIR-AW. At last, we conduct large-scale empirical experiments to evaluate the performance of baseline methods on the proposed datasets. The proposed datasets and trained models are available at https://github.com/ntuhubin/PLAIR-AW.
Abstract:Generating high-fidelity, temporally consistent videos in autonomous driving scenarios faces a significant challenge, e.g. problematic maneuvers in corner cases. Despite recent video generation works are proposed to tackcle the mentioned problem, i.e. models built on top of Diffusion Transformers (DiT), works are still missing which are targeted on exploring the potential for multi-view videos generation scenarios. Noticeably, we propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos which precisely match the given bird's-eye view layouts control. Specifically, the proposed framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee the cross-view consistency, where joint cross-attention modules and ControlNet-Transformer are integrated to further improve the precision of control. To demonstrate our advantages, we extensively investigate the qualitative comparisons on nuScenes dataset, particularly in some most challenging corner cases. In summary, the effectiveness of our proposed method in producing long, controllable, and highly consistent videos under difficult conditions is proven to be effective.