Abstract: Convolutional Neural Networks (CNNs) have advanced image super-resolution (SR), but most CNN-based methods rely solely on pixel-level transformations, often leading to artifacts and blurring, particularly under severe downsampling (e.g., 8x or 16x). Recent text-guided SR methods attempt to leverage textual information for enhanced detail, but they frequently struggle with effective alignment, resulting in inconsistent semantic coherence. To address these limitations, we introduce a multi-modal semantic enhancement approach that combines textual semantics with visual features, effectively tackling semantic mismatches and detail loss in highly degraded low-resolution (LR) images. Our multi-modal collaborative framework produces realistic, high-quality SR images at large up-scaling factors. The framework integrates text and image inputs, employing a prompt predictor, a Text-Image Fusion Block (TIFBlock), and an Iterative Refinement Module, alongside CLIP (Contrastive Language-Image Pretraining) features, to guide a progressive enhancement process with fine-grained alignment. This alignment yields high-resolution outputs with crisp details and semantic coherence, even at large scaling factors. Extensive comparative experiments and ablation studies validate the effectiveness of our approach. Additionally, by incorporating textual semantic guidance, our technique enables a degree of super-resolution editability while maintaining semantic coherence.
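As a concrete illustration of how textual and visual features can be fused in such a framework, the following is a minimal, hedged sketch of a TIFBlock-style cross-attention layer conditioned on CLIP text features; the module names, dimensions, and exact fusion form are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): cross-attention fusion between
# CLIP text embeddings and visual feature maps, in the spirit of the TIFBlock.
import torch
import torch.nn as nn

class TextImageFusionBlock(nn.Module):
    def __init__(self, vis_dim=64, txt_dim=512, heads=4):
        super().__init__()
        self.proj_txt = nn.Linear(txt_dim, vis_dim)   # map CLIP text features to visual width
        self.attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_feat, txt_feat):
        # vis_feat: (B, C, H, W) image features; txt_feat: (B, T, txt_dim) CLIP token features
        b, c, h, w = vis_feat.shape
        q = vis_feat.flatten(2).transpose(1, 2)       # (B, H*W, C) queries from pixel locations
        kv = self.proj_txt(txt_feat)                  # (B, T, C) keys/values from text
        fused, _ = self.attn(q, kv, kv)               # inject textual semantics per location
        out = self.norm(q + fused)                    # residual fusion
        return out.transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(2, 64, 32, 32)
t = torch.randn(2, 77, 512)
print(TextImageFusionBlock()(x, t).shape)             # torch.Size([2, 64, 32, 32])
```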
Abstract: Large Language Models (LLMs) have achieved remarkable progress in recent years; however, their excellent performance remains largely limited to major world languages, primarily English. Many LLMs continue to struggle with multilingual tasks, especially those involving low-resource languages. To address this issue, we introduce Marco-LLM: Massive multilingual training for cross-lingual enhancement LLM. We collected a substantial amount of multilingual data for several low-resource languages and conducted extensive continual pre-training on the Qwen2 models, resulting in the multilingual LLM Marco-LLM. Through comprehensive evaluations on various multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA, and many others, Marco-LLM demonstrates substantial improvements over state-of-the-art LLMs. Furthermore, Marco-LLM achieves substantial gains on any-to-any machine translation tasks, showing the effectiveness of our multilingual training. Marco-LLM is a pioneering multilingual LLM designed not only to perform exceptionally well on multilingual tasks, including those in low-resource languages, but also to maintain strong performance in English and other major languages, closing the performance gap between high- and low-resource language capabilities. By bridging languages, this effort demonstrates our dedication to ensuring that LLMs work accurately across diverse languages.
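To make the recipe concrete, here is a minimal sketch of continual pre-training with the Hugging Face Trainer on a multilingual corpus; the data file, batch sizes, and learning rate are placeholders, not the actual Marco-LLM training configuration.

```python
# Hedged sketch of continual pre-training a Qwen2 checkpoint on multilingual text.
# "multilingual_corpus.jsonl" and all hyperparameters are illustrative assumptions.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
if tok.pad_token is None:
    tok.pad_token = tok.eos_token                    # ensure padding is defined
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B")

# Hypothetical multilingual corpus with a "text" column.
data = load_dataset("json", data_files="multilingual_corpus.jsonl")["train"]
data = data.map(lambda x: tok(x["text"], truncation=True, max_length=2048),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="marco-llm-cpt", per_device_train_batch_size=1,
                           gradient_accumulation_steps=64, learning_rate=1e-5,
                           num_train_epochs=1, bf16=True),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal LM objective
)
trainer.train()
```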
Abstract: Integrated sensing and communication (ISAC) has been envisioned as a prospective technology for enabling ubiquitous sensing and communications in next-generation wireless networks. In contrast to existing works on reconfigurable intelligent surface (RIS)-aided ISAC systems using conventional phased arrays (PAs), this paper investigates a frequency diverse array (FDA)-enabled RIS-aided ISAC system, where the FDA provides a distance-angle-dependent beampattern to effectively suppress clutter, and the RIS establishes high-quality links between the base station (BS) and the users/target. We aim to maximize the sum rate by jointly optimizing the BS transmit beamforming vectors, the covariance matrix of the dedicated radar signal, the RIS phase-shift matrix, the FDA frequency offsets, and the radar receive equalizer, while guaranteeing the required signal-to-clutter-plus-noise ratio (SCNR) of the radar echo signal. To tackle this challenging problem, we first prove theoretically that the dedicated radar signal is unnecessary for enhancing target-sensing performance, which considerably simplifies the original problem. We then turn to the single-user single-target (SUST) scenario to demonstrate that the FDA-RIS-aided ISAC system always achieves a higher SCNR than its PA-RIS-aided counterpart. Moreover, we reveal that the SCNR increment grows linearly with the BS transmit power and the number of BS receive antennas. To solve the simplified problem effectively, we leverage fractional programming (FP) theory and develop an efficient alternating optimization (AO) algorithm based on the symmetric alternating direction method of multipliers (SADMM) and successive convex approximation (SCA) techniques. Numerical results demonstrate the superior performance of the proposed algorithm in terms of sum rate and radar SCNR.
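For concreteness, the joint design described above has the following general shape; this is a hedged sketch with assumed notation (W_k for the BS precoders, R_d for the radar covariance, Phi for the RIS phases, Delta f for the FDA offsets, w_r for the receive equalizer, gamma for the SCNR threshold), not the paper's exact formulation.

```latex
% Illustrative formulation only; all notation is assumed for this sketch.
\begin{aligned}
\max_{\{\mathbf{w}_k\},\,\mathbf{R}_d,\,\boldsymbol{\Phi},\,\Delta f,\,\mathbf{w}_r}\quad
  & \sum_{k=1}^{K} \log_2\!\bigl(1+\mathrm{SINR}_k(\{\mathbf{w}_k\},\mathbf{R}_d,\boldsymbol{\Phi})\bigr) \\
\text{s.t.}\quad
  & \mathrm{SCNR}(\{\mathbf{w}_k\},\mathbf{R}_d,\boldsymbol{\Phi},\Delta f,\mathbf{w}_r) \ge \gamma, \\
  & \sum_{k=1}^{K}\|\mathbf{w}_k\|^2 + \mathrm{tr}(\mathbf{R}_d) \le P_{\max},\qquad
    \bigl|[\boldsymbol{\Phi}]_{n,n}\bigr| = 1,\ \forall n .
\end{aligned}
```

The paper's first result then amounts to showing the constraint set loses nothing by fixing R_d = 0, which removes one block of variables before the AO iterations.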
Abstract: This paper investigates joint device identification, channel estimation, and symbol detection for cooperative multi-satellite-enhanced random access, where orthogonal time-frequency space (OTFS) modulation with a large antenna array is utilized to combat the dynamics of the terrestrial-satellite links (TSLs). We introduce the generalized complex exponential basis expansion model to parameterize the TSLs, thereby reducing the pilot overhead. By exploiting the block sparsity of the TSLs in the angular domain, a message passing algorithm is designed for initial channel estimation. Subsequently, we examine two cooperative modes that leverage the spatial diversity within satellite constellations: a centralized mode, where computations are performed at a high-power central server, and a distributed mode, where computations are offloaded to edge satellites with minimal signaling overhead. Specifically, in the centralized mode, device identification is achieved by aggregating backhaul information from edge satellites, and channel estimation and symbol detection are jointly enhanced through a structured approximate expectation propagation (AEP) algorithm. In the distributed mode, edge satellites share channel information and exchange soft information about data symbols, leading to a distributed version of AEP. The introduced basis expansion model for the TSLs enables efficient implementation of both the centralized and distributed algorithms via the fast Fourier transform. Simulation results demonstrate that the proposed schemes significantly outperform conventional algorithms in terms of activity error rate, normalized mean squared error, and symbol error rate. Notably, the distributed mode achieves performance comparable to the centralized mode with only two exchanges of soft information about data symbols within the constellation.
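The basis expansion idea can be illustrated as follows: a rapidly time-varying channel tap is approximated by a handful of complex exponential basis coefficients, which is what reduces the pilot overhead. This numpy sketch uses an assumed toy tap model and a generic complex exponential dictionary, not the generalized model of the paper.

```python
# Hedged sketch of a complex exponential basis expansion model (CE-BEM) fit to a
# time-varying channel tap; the tap model, block length, and Q are assumptions.
import numpy as np

N, Q = 128, 5                                   # block length, number of basis functions
n = np.arange(N)
rng = np.random.default_rng(0)
# Hypothetical time-varying tap: a sum of Doppler-shifted paths.
nu = rng.uniform(-0.02, 0.02, size=3)           # normalized Doppler shifts
h = sum(rng.standard_normal() * np.exp(2j * np.pi * f * n) for f in nu)

# CE-BEM dictionary: complex exponentials at uniformly spaced frequencies.
q = np.arange(-(Q // 2), Q // 2 + 1)
B = np.exp(2j * np.pi * np.outer(n, q) / N)     # (N, Q) basis matrix

c, *_ = np.linalg.lstsq(B, h, rcond=None)       # Q coefficients replace N channel samples
print("relative fit error:", np.linalg.norm(B @ c - h) / np.linalg.norm(h))
```

Because the dictionary columns are uniformly spaced exponentials, operations with B reduce to (partial) DFTs, which is why both cooperative modes admit FFT-based implementations.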
Abstract: The Segment Anything Model (SAM) has attracted widespread attention for its superior interactive segmentation with visual prompts, while text prompts remain underexplored. In this paper, we empirically investigate which text prompt encoders (e.g., CLIP or LLMs) are effective for adapting SAM to referring expression segmentation and introduce the Early Vision-language Fusion-based SAM (EVF-SAM). EVF-SAM is a simple yet effective referring segmentation method that exploits multimodal prompts (i.e., image and text) and comprises a pre-trained vision-language model to generate referring prompts and a SAM model for segmentation. Surprisingly, we observe that (1) multimodal prompts and (2) vision-language models with early fusion (e.g., BEIT-3) are beneficial for prompting SAM toward accurate referring segmentation. Our experiments show that the proposed EVF-SAM based on BEIT-3 obtains state-of-the-art performance on RefCOCO/+/g for referring expression segmentation, demonstrating the superiority of prompting SAM with early vision-language fusion. In addition, the proposed EVF-SAM with 1.32B parameters achieves remarkably higher performance while reducing parameters by nearly 82% compared to previous SAM methods based on large multimodal models.
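A schematic sketch of the early-fusion idea follows, with stand-in modules rather than the released EVF-SAM code: a BEIT-3-like encoder jointly attends over text and image tokens, and its [CLS]-like output serves as the referring prompt for a SAM-style decoder. All dimensions and modules are illustrative assumptions.

```python
# Stand-in sketch of early vision-language fusion producing a prompt embedding.
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):             # stand-in for a BEIT-3-like model
    def __init__(self, dim=256):
        super().__init__()
        self.img_patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.txt_embed = nn.Embedding(1000, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image, token_ids):
        img_tok = self.img_patch(image).flatten(2).transpose(1, 2)  # (B, N, D) patches
        txt_tok = self.txt_embed(token_ids)                         # (B, T, D) text tokens
        fused = self.encoder(torch.cat([txt_tok, img_tok], dim=1))  # joint early fusion
        return fused[:, 0]                        # [CLS]-like referring prompt embedding

image = torch.randn(1, 3, 224, 224)
tokens = torch.randint(0, 1000, (1, 12))
prompt = EarlyFusionEncoder()(image, tokens)      # would condition SAM's mask decoder
print(prompt.shape)                               # torch.Size([1, 256])
```

The contrast with late fusion is that text and image interact in every encoder layer here, rather than being encoded separately and merged once at the end.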
Abstract: Deep learning has significantly advanced wireless sensing technology by leveraging substantial amounts of high-quality training data. However, collecting wireless sensing data faces diverse challenges, including unavoidable data noise, limited data scale due to significant collection overhead, and the need to reacquire data in new environments. Inspired by the achievements of AI-generated content, this paper introduces a signal generation method that achieves data denoising, augmentation, and synthesis by disentangling distinct attributes within the signal, such as the individual and the environment. The approach comprises two pivotal modules: structured signal selection and signal disentanglement generation. Structured signal selection establishes a minimal signal set with the target attributes for subsequent attribute disentanglement; signal disentanglement generation then disentangles the target attributes and reassembles them to generate novel signals. Extensive experimental results demonstrate that the proposed method generates data closely resembling real-world data on two wireless sensing datasets, exhibiting state-of-the-art performance. Our approach presents a robust framework for understanding and manipulating attribute-specific information in wireless sensing.
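The disentangle-and-reassemble step can be sketched as follows, under the assumption of separate encoders for the individual and environment attributes; the architecture, signal length, and latent sizes are illustrative, not the paper's design.

```python
# Conceptual sketch: encode two attributes separately, then recombine codes from
# different signals to synthesize a novel signal. All sizes are assumptions.
import torch
import torch.nn as nn

class Disentangler(nn.Module):
    def __init__(self, sig_len=256, z=32):
        super().__init__()
        self.enc_ind = nn.Sequential(nn.Linear(sig_len, 128), nn.ReLU(), nn.Linear(128, z))
        self.enc_env = nn.Sequential(nn.Linear(sig_len, 128), nn.ReLU(), nn.Linear(128, z))
        self.dec = nn.Sequential(nn.Linear(2 * z, 128), nn.ReLU(), nn.Linear(128, sig_len))

    def forward(self, x):
        return self.enc_ind(x), self.enc_env(x)

    def generate(self, z_ind, z_env):
        return self.dec(torch.cat([z_ind, z_env], dim=-1))

m = Disentangler()
sig_a, sig_b = torch.randn(1, 256), torch.randn(1, 256)
ind_a, _ = m(sig_a)                      # individual attribute taken from signal A
_, env_b = m(sig_b)                      # environment attribute taken from signal B
novel = m.generate(ind_a, env_b)         # reassemble attributes into a new signal
print(novel.shape)                       # torch.Size([1, 256])
```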
Abstract: Referring video object segmentation (RVOS), as a supervised learning task, relies on sufficient annotated data for a given scene. However, in more realistic scenarios, only minimal annotations are available for a new scene, which poses significant challenges to existing RVOS methods. With this in mind, we propose a simple yet effective model with a newly designed cross-modal affinity (CMA) module based on a Transformer architecture. The CMA module builds multimodal affinity from a few samples, quickly learning new semantic information and enabling the model to adapt to different scenarios. Since the proposed method targets limited samples in new scenes, we generalize the problem as few-shot referring video object segmentation (FS-RVOS). To foster research in this direction, we build a new FS-RVOS benchmark based on currently available datasets. The benchmark covers a wide range of situations, maximally simulating real-world scenarios. Extensive experiments show that our model adapts well to different scenarios with only a few samples, reaching state-of-the-art performance on the benchmark. On Mini-Ref-YouTube-VOS, our model achieves an average performance of 53.1 J and 54.8 F, which is 10% better than the baselines. Furthermore, we obtain impressive results of 77.7 J and 74.8 F on Mini-Ref-SAIL-VOS, significantly better than the baselines. Code is publicly available at https://github.com/hengliusky/Few_shot_RVOS.
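A hedged sketch of what a cross-modal affinity computation might look like follows: visual and textual tokens attend to each other, and a joint affinity map is formed from the enhanced features. This is an illustration under assumed dimensions, not the repository implementation linked above.

```python
# Illustrative cross-modal affinity between frame features and expression features.
import torch
import torch.nn as nn

class CrossModalAffinity(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.vis2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        # vis: (B, N, D) frame features; txt: (B, T, D) expression features
        vis_enh, _ = self.vis2txt(vis, txt, txt)            # visual tokens attend to language
        txt_enh, _ = self.txt2vis(txt, vis, vis)            # language tokens attend to vision
        affinity = torch.bmm(vis_enh, txt_enh.transpose(1, 2))  # (B, N, T) joint affinity
        return affinity.softmax(dim=-1)

vis = torch.randn(2, 196, 256)
txt = torch.randn(2, 10, 256)
print(CrossModalAffinity()(vis, txt).shape)                  # torch.Size([2, 196, 10])
```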
Abstract: Temporal characteristics are prominently evident in a substantial volume of knowledge, which underscores the pivotal role of Temporal Knowledge Graphs (TKGs) in both academia and industry. However, TKGs often suffer from incompleteness for three main reasons: the continuous emergence of new knowledge, the weakness of algorithms for extracting structured information from unstructured data, and the lack of information in source datasets. Thus, the task of Temporal Knowledge Graph Completion (TKGC), which aims to predict missing items based on the available information, has attracted increasing attention. In this paper, we provide a comprehensive review of TKGC methods and their details. Specifically, the paper consists of three components: 1) Background, which covers the preliminaries of TKGC methods, the loss functions required for training, and the datasets and evaluation protocols; 2) Interpolation, which estimates and predicts missing elements or sets of elements from the relevant available information, and further categorizes related TKGC methods by how they process temporal information; 3) Extrapolation, which typically focuses on continuous TKGs and predicts future events, and classifies all extrapolation methods by the algorithms they utilize. We further pinpoint the challenges and discuss future research directions of TKGC.
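As a toy example of the interpolation setting, the following sketch scores (subject, relation, object, timestamp) quadruples with a TTransE-like translational model, one representative family among the surveyed methods; the embedding sizes and scoring form are illustrative, not tied to a specific paper.

```python
# Toy interpolation-style TKGC scorer: translate subject by relation and time,
# then measure distance to the object. Higher score = more plausible quadruple.
import torch
import torch.nn as nn

class TTransELike(nn.Module):
    def __init__(self, n_ent, n_rel, n_time, dim=64):
        super().__init__()
        self.ent = nn.Embedding(n_ent, dim)
        self.rel = nn.Embedding(n_rel, dim)
        self.time = nn.Embedding(n_time, dim)    # timestamps embedded like relations

    def score(self, s, r, o, t):
        return -torch.norm(self.ent(s) + self.rel(r) + self.time(t) - self.ent(o), dim=-1)

m = TTransELike(n_ent=100, n_rel=20, n_time=365)
s, r, t = torch.tensor([3]), torch.tensor([5]), torch.tensor([120])
scores = torch.stack([m.score(s, r, torch.tensor([o]), t) for o in range(100)])
print("predicted object:", scores.argmax().item())  # rank candidates for the missing object
```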
Abstract: Aiming to endow wireless communication systems with environment-perception capability, emerging integrated sensing and communication (ISAC) technologies face multiple difficulties, especially in balancing the performance trade-off between the communication and radar functions. In this paper, we introduce a reconfigurable intelligent surface (RIS) to assist both data transmission and target detection in a dual-functional ISAC system. To formulate a general optimization framework, diverse communication performance metrics are taken into account, including the well-known capacity maximization and mean-squared error (MSE) minimization criteria. The target detection process is modeled as a generalized likelihood ratio test (GLRT) owing to practical limitations, and the monotonicity of the corresponding detection probability is proved. For the single-user single-target (SUST) scenario, the minimum transmit power of the ISAC transceiver is derived. By exploiting the optimality conditions of the base station (BS) design, we show that the BS can realize the maximum power allocation scheme and derive the optimal BS precoder in semi-closed form. Moreover, an alternating direction method of multipliers (ADMM)-based RIS design is proposed to optimize the unit-modulus RIS phase shifts. To further enhance computational efficiency, we also develop a low-complexity RIS design based on Riemannian gradient descent. Furthermore, the ISAC transceiver design for the multi-user multi-target (MUMT) scenario is investigated, where a zero-forcing (ZF) radar receiver is adopted to cancel the interference. The optimal BS precoder is then derived under the maximum power allocation scheme, and the RIS phase shifts can be optimized by extending the proposed ADMM-based RIS design. Numerical simulation results verify the performance of the proposed transceiver designs.
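The low-complexity RIS design named above can be illustrated by a generic Riemannian gradient step on the unit-modulus manifold: project the Euclidean gradient onto the tangent space, take a step, then retract by normalization. The quadratic objective below is a stand-in for illustration, not the paper's communication or detection metric.

```python
# Hedged numpy sketch: Riemannian gradient ascent of f(phi) = phi^H R phi over
# unit-modulus phases |phi_n| = 1 (the complex circle manifold). R is a stand-in.
import numpy as np

rng = np.random.default_rng(1)
N = 32
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
R = A @ A.conj().T                                  # Hermitian PSD stand-in matrix
phi = np.exp(1j * rng.uniform(0, 2 * np.pi, N))     # initial unit-modulus RIS phases

step = 0.01 / np.linalg.norm(R, 2)
for _ in range(200):
    egrad = 2 * R @ phi                             # Euclidean gradient of phi^H R phi
    rgrad = egrad - np.real(egrad * phi.conj()) * phi   # project onto tangent space
    phi = phi + step * rgrad                        # ascent step (maximization)
    phi = phi / np.abs(phi)                         # retraction back to |phi_n| = 1

print("objective:", np.real(phi.conj() @ R @ phi))
```

Each iteration costs only a matrix-vector product and elementwise operations, which is the source of the complexity advantage over the ADMM-based design.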
Abstract: Due to the limitations of sensors, the transmission medium, and the intrinsic properties of ultrasound, the quality of ultrasound imaging is often far from ideal, especially in terms of spatial resolution. To remedy this, deep learning networks have recently been developed for ultrasound image super-resolution (SR) because of their powerful approximation capability. However, most current supervised SR methods are not suitable for ultrasound medical images, because medical image samples are scarce and paired low-resolution (LR) and high-resolution (HR) training data are usually unavailable in practice. In this work, based on self-supervision and the cycle generative adversarial network (CycleGAN), we propose a new perception-consistency ultrasound image SR method, which requires only LR ultrasound data and ensures that the re-degraded version of a generated SR image is consistent with the original LR image, and vice versa. We first generate the HR fathers and LR sons of the test LR ultrasound image through image enhancement, and then make full use of the LR-SR-LR and HR-LR-SR cycle losses and the adversarial characteristics of the discriminator to encourage the generator to produce perceptually consistent SR results. Evaluations of PSNR/IFC/SSIM, inference efficiency, and visual quality on the benchmark CCA-US and CCA-US datasets show that the proposed approach is effective and superior to other state-of-the-art methods.
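The LR-SR-LR cycle term can be sketched as follows with toy stand-in generators: an SR generator upscales, a re-degradation generator maps back down, and an L1 cycle loss ties the reconstruction to the input. The scale factor, network depths, and loss choice are assumptions for illustration, not the paper's configuration.

```python
# Simplified sketch of the LR -> SR -> LR cycle-consistency idea with toy generators.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpGen(nn.Module):                       # toy LR -> SR generator (2x upscaling)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 4, 3, padding=1), nn.PixelShuffle(2))

    def forward(self, x):
        return self.net(x)

class DownGen(nn.Module):                     # toy SR -> LR re-degradation generator
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, x):
        return self.net(x)

g_up, g_down = UpGen(), DownGen()
lr = torch.randn(1, 1, 32, 32)
sr = g_up(lr)                                 # (1, 1, 64, 64) super-resolved image
cycle_lr = g_down(sr)                         # re-degraded image, back to (1, 1, 32, 32)
loss_cycle = F.l1_loss(cycle_lr, lr)          # LR-SR-LR cycle-consistency term
loss_cycle.backward()
print(sr.shape, float(loss_cycle))
```

In the full method this term is combined with the HR-LR-SR cycle and adversarial losses, so the generator is pushed toward SR outputs that remain faithful to the only available LR data.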