Abstract:This study explores how to enhance the reasoning capabilities of large language models (LLMs) in knowledge base question answering (KBQA) by leveraging Monte Carlo Tree Search (MCTS). Semantic parsing-based KBQA methods are particularly challenging as these approaches require locating elements from knowledge bases and generating logical forms, demanding not only extensive annotated data but also strong reasoning capabilities. Although recent approaches leveraging LLMs as agents have demonstrated considerable potential, these studies are inherently constrained by their linear decision-making processes. To address this limitation, we propose an MCTS-based framework that enhances LLMs' reasoning capabilities through tree search methodology. We design a step-wise reward mechanism that requires only direct prompting of open-source instruction LLMs, without additional fine-tuning. Experimental results demonstrate that our approach significantly outperforms linear decision-making methods, particularly in low-resource scenarios. Additionally, we contribute new data resources to the KBQA community by annotating intermediate reasoning processes for existing question-SPARQL datasets using distant supervision. Experimental results on the extended dataset demonstrate that our method achieves performance comparable to fully supervised models while using significantly less training data.
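As a rough illustration of the tree-search idea described above, the sketch below shows a generic MCTS loop in which each reasoning step is scored by directly prompting an instruction LLM. The callables `propose_actions`, `llm_step_reward`, and `is_terminal` are hypothetical stand-ins for the agent's action generation, the step-wise reward prompt, and the termination check; none of this is taken from the paper's actual implementation.

```python
import math
import random

class Node:
    """One partial logical form / reasoning state in the search tree."""
    def __init__(self, state, parent=None):
        self.state = state          # e.g. a list of reasoning/parsing steps so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # accumulated step-wise reward

def ucb(node, c=1.4):
    """Upper confidence bound used for selection."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root, propose_actions, llm_step_reward, is_terminal, n_iter=50):
    """propose_actions(state) -> candidate next steps from the LLM agent;
    llm_step_reward(state) -> score in [0, 1] obtained by direct prompting of
    an instruction LLM (hypothetical interface, no fine-tuning assumed)."""
    for _ in range(n_iter):
        # 1. Selection: descend by UCB until a leaf is reached.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: grow the leaf with LLM-proposed next steps.
        if not is_terminal(node.state):
            for action in propose_actions(node.state):
                node.children.append(Node(node.state + [action], parent=node))
            node = random.choice(node.children)
        # 3. Evaluation: step-wise reward from the prompted LLM.
        reward = llm_step_reward(node.state)
        # 4. Backpropagation.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited continuation as the next decision.
    return max(root.children, key=lambda n: n.visits).state
```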
Abstract:Large Language Models (LLMs) have shown remarkable capabilities as AI agents. However, existing methods for enhancing LLM-agent abilities often lack a focus on data quality, leading to inefficiencies and suboptimal results in both fine-tuning and prompt engineering. To address this issue, we introduce EDGE, a novel approach for identifying informative samples without needing golden answers. We propose the Guideline Effectiveness (GE) metric, which selects challenging samples by measuring the impact of human-provided guidelines in multi-turn interaction tasks. A low GE score indicates that the human expertise required for a sample is missing from the guideline, making the sample more informative. By selecting samples with low GE scores, we can improve the efficiency and outcomes of both prompt engineering and fine-tuning processes for LLMs. Extensive experiments validate the performance of our method. Our method achieves competitive results on the HotpotQA and WebShop datasets, requiring 75\% and 50\% less data, respectively, while outperforming existing methods. We also provide a fresh perspective on the data quality of LLM-agent fine-tuning.
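Since the abstract does not give the exact formula, the following sketch is only one plausible reading of the GE-based selection: it scores how much a human-written guideline improves an agent's reference-free quality estimate on a sample, and keeps the samples where that improvement is smallest. The `agent_score` callable is a hypothetical stand-in for such a golden-answer-free evaluation.

```python
def guideline_effectiveness(sample, agent_score, guideline):
    """Hypothetical GE score: how much the human-written guideline improves
    the agent's score on a multi-turn sample. agent_score(sample, guideline)
    is assumed to return a scalar quality estimate that does not require a
    golden answer (e.g. a self-evaluation score)."""
    with_guide = agent_score(sample, guideline)
    without_guide = agent_score(sample, None)
    return with_guide - without_guide

def select_informative(samples, agent_score, guideline, k):
    """Keep the k samples with the lowest GE: the guideline covers them
    poorly, so they carry expertise still missing from the prompt."""
    scored = [(guideline_effectiveness(s, agent_score, guideline), s)
              for s in samples]
    scored.sort(key=lambda pair: pair[0])
    return [s for _, s in scored[:k]]
```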
Abstract:Training Large Language Models (LLMs) with synthetic data is a prevalent practice in code generation. A key approach is self-training, where LLMs are iteratively trained on self-generated correct code snippets. In this case, the self-generated codes are drawn from a conditional distribution, conditioned on a specific seed description. However, the seed description is not the only valid representation that aligns with its intended meaning. With all valid descriptions and codes forming a joint space, codes drawn from the conditional distribution would lead to an underrepresentation of the full description-code space. As such, we propose Gibbs Fine-Tuning (GiFT), a novel self-training method inspired by Gibbs sampling. GiFT allows self-generated data to be drawn from the marginal distribution of the joint space, thereby mitigating the biases inherent in conditional sampling. We provide a theoretical analysis demonstrating the potential benefits of fine-tuning LLMs with code derived from the marginal distribution. Furthermore, we propose a perplexity-based code selection method to mitigate the imbalanced long-tail distribution of the self-generated codes. Empirical evaluation of two LLMs across four datasets demonstrates that GiFT achieves superior performance, particularly on more challenging benchmarks.
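A schematic of the Gibbs-style alternation that GiFT is described as performing, under several assumptions: `gen_code`, `gen_description`, `passes_tests`, and `perplexity` are hypothetical wrappers around the LLM and a test harness, and the choice of keeping higher-perplexity snippets during selection is an illustrative guess rather than the paper's stated rule.

```python
def gibbs_self_training_round(seed_desc, gen_code, gen_description,
                              passes_tests, perplexity, steps=4, keep=2):
    """Alternate description -> code -> description so that the collected code
    approximates a draw from the marginal over the description-code space,
    rather than being conditioned only on the seed description.
    All callables are hypothetical interfaces to the LLM / test harness."""
    desc, pool = seed_desc, []
    for _ in range(steps):
        code = gen_code(desc)                 # sample code given description
        if passes_tests(code):
            pool.append(code)
        desc = gen_description(code)          # sample description given code
    # Perplexity-based selection (direction assumed here): keep rarer,
    # higher-perplexity snippets to counter the long-tail imbalance of
    # self-generated code.
    pool.sort(key=perplexity, reverse=True)
    return pool[:keep]
```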
Abstract:Deferred neural rendering (DNR) is an emerging computer graphics pipeline designed for high-fidelity rendering and robotic perception. However, DNR heavily relies on datasets composed of numerous ray-traced images and demands substantial computational resources. It remains under-explored how to reduce the reliance on high-quality ray-traced images while maintaining the rendering fidelity. In this paper, we propose DNRSelect, which integrates a reinforcement learning-based view selector and a 3D texture aggregator for deferred neural rendering. We first propose a novel view selector for deferred neural rendering based on reinforcement learning, which is trained on easily obtained rasterized images to identify the optimal views. By acquiring only a few ray-traced images for these selected views, the selector enables DNR to achieve high-quality rendering. To further enhance spatial awareness and geometric consistency in DNR, we introduce a 3D texture aggregator that fuses pyramid features from depth maps and normal maps with UV maps. Given that acquiring ray-traced images is more time-consuming than generating rasterized images, DNRSelect minimizes the need for ray-traced data by using only a few selected views while still achieving high-fidelity rendering results. We conduct detailed experiments and ablation studies on the NeRF-Synthetic dataset to demonstrate the effectiveness of DNRSelect. The code will be released.
Abstract:A simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) aided communication system is investigated. A robust joint beamforming design problem under imperfect channel state information (CSI) is formulated to maximize the weighted sum of Jain's fairness index and the normalized system sum rate. To solve this non-convex problem, an alternating optimization (AO) algorithm is proposed, which leverages the S-Procedure, successive convex approximation (SCA), and semidefinite relaxation (SDR). Simulation results demonstrate that with the proposed algorithm: 1) various trade-offs between sum rate and user fairness can be achieved; 2) a larger trade-off region can be achieved by adopting STAR-RIS compared to conventional RIS; and 3) the performance degradation caused by imperfect CSI is less than 7% with our proposed robust beamforming approach.
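For readers unfamiliar with the objective, one plausible way to write the weighted sum of Jain's fairness index and the normalized sum rate over $K$ users with rates $R_k$ is as follows; the weighting factor $\lambda$ and the normalization constant $R_{\max}$ are assumptions for illustration, not taken from the paper:
\[
J(\mathbf{R}) = \frac{\left(\sum_{k=1}^{K} R_k\right)^{2}}{K \sum_{k=1}^{K} R_k^{2}},
\qquad
\max \;\; \lambda\, J(\mathbf{R}) + (1-\lambda)\,\frac{\sum_{k=1}^{K} R_k}{R_{\max}},
\qquad \lambda \in [0,1],
\]
where $\lambda$ trades user fairness against the sum rate, which matches the "various trade-offs" reported in the simulation results.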
Abstract:A simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) aided integrated sensing, computing, and communication (ISCC) Internet of Robotic Things (IoRT) framework is proposed. Specifically, the full-duplex (FD) base station (BS) simultaneously receives the offloading signals from decision robots (DRs) and carries out target robot (TR) sensing. A computation rate maximization problem is formulated to optimize the sensing and receive beamformers at the BS and the STAR-RIS coefficients under the BS power constraint, the sensing signal-to-noise ratio constraint, and the STAR-RIS coefficient constraints. The alternating optimization (AO) method is adopted to solve the proposed optimization problem. With fixed STAR-RIS coefficients, the sub-problem with respect to the sensing and receive beamformers at the BS is tackled with the weighted minimum mean-square error method. Given the beamformers at the BS, the sub-problem with respect to the STAR-RIS coefficients is tackled with the penalty method and the successive convex approximation method. The overall algorithm is guaranteed to converge to at least a stationary point of the computation rate maximization problem. Our simulation results validate that the proposed STAR-RIS aided ISCC IoRT system can enhance the sum computation rate compared with the benchmark schemes.
Abstract:The development of sixth-generation (6G) communication technologies is confronted with the significant challenge of spectrum resource shortage. To alleviate this issue, we propose a novel simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) aided multiple-input multiple-output (MIMO) cognitive radio (CR) system. Specifically, the underlying secondary network in the proposed CR system reuses the same frequency resources occupied by the primary network with the help of the STAR-RIS. The secondary network sum rate maximization problem is first formulated for the STAR-RIS aided MIMO CR system. The adoption of STAR-RIS necessitates an intricate beamforming design for the considered system due to its large number of coupled coefficients. The block coordinate descent method is employed to address the formulated optimization problem. In each iteration, the beamformers at the secondary base station (SBS) are optimized by solving a quadratically constrained quadratic program (QCQP) problem. Concurrently, the STAR-RIS passive beamforming problem is resolved using tailored algorithms designed for the two phase-shift models: 1) For the independent phase-shift model, a successive convex approximation-based algorithm is proposed. 2) For the coupled phase-shift model, a penalty dual decomposition-based algorithm is conceived, in which the phase shifts and amplitudes of the STAR-RIS elements are optimized using closed-form solutions. Simulation results show that: 1) The proposed STAR-RIS aided CR communication framework can significantly enhance the sum rate of the secondary system. 2) The coupled phase-shift model results in limited performance degradation compared to the independent phase-shift model.
Abstract:Domain adaptive object detection (DAOD) aims to generalize detectors trained on an annotated source domain to an unlabelled target domain. As visual-language models (VLMs) can provide essential general knowledge on unseen images, freezing the visual encoder and inserting a domain-agnostic adapter can learn domain-invariant knowledge for DAOD. However, the domain-agnostic adapter is inevitably biased toward the source domain and discards some knowledge that is discriminative on the unlabelled target domain, i.e., domain-specific knowledge of the target domain. To solve this issue, we propose a novel Domain-Aware Adapter (DA-Ada) tailored for the DAOD task. The key idea is to exploit the domain-specific knowledge that lies between the essential general knowledge and the domain-invariant knowledge. DA-Ada consists of the Domain-Invariant Adapter (DIA) for learning domain-invariant knowledge and the Domain-Specific Adapter (DSA) for injecting the domain-specific knowledge recovered from the information discarded by the visual encoder. Comprehensive experiments over multiple DAOD tasks show that DA-Ada can efficiently infer a domain-aware visual encoder that boosts domain adaptive object detection. Our code is available at https://github.com/Therock90421/DA-Ada.
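Purely as an illustrative sketch of the two-branch adapter idea (not the authors' architecture), a frozen encoder with a domain-invariant and a domain-specific bottleneck branch could look as follows; treating a shallow-feature residual as the "discarded information" fed to the DSA is an assumption made only for this example.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Standard down-project / up-project adapter block."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

class DomainAwareAdapter(nn.Module):
    """Illustrative sketch only: a frozen visual encoder with a
    domain-invariant branch (DIA) on its output and a domain-specific
    branch (DSA) fed by a residual signal the encoder does not keep."""
    def __init__(self, encoder, dim):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad_(False)          # keep the general VLM knowledge frozen
        self.dia = Bottleneck(dim)           # domain-invariant adapter
        self.dsa = Bottleneck(dim)           # domain-specific adapter

    def forward(self, images, shallow_feats):
        # shallow_feats: features from an earlier layer, same shape as feats
        # (an assumption used to approximate the "discarded" information).
        feats = self.encoder(images)
        invariant = feats + self.dia(feats)
        specific = self.dsa(shallow_feats - feats)
        return invariant + specific
```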
Abstract:Recent text-to-image models have achieved remarkable success in generating high-quality images. However, when tasked with multi-concept generation, which creates images containing multiple characters or objects, existing methods often suffer from attribute confusion, resulting in severe text-image inconsistency. We found that attribute confusion occurs when a certain region of the latent features attends to multiple or incorrect prompt tokens. In this work, we propose a novel Semantic Protection Diffusion (SPDiffusion) framework to protect the semantics of regions from the influence of irrelevant tokens, eliminating the confusion of non-corresponding attributes. In the SPDiffusion framework, we design a Semantic Protection Mask (SP-Mask) to represent the relevance of regions and tokens, and propose a Semantic Protection Cross-Attention (SP-Attn) to shield specific regions from the influence of irrelevant tokens during generation. To evaluate our method, we created a diverse multi-concept benchmark, on which SPDiffusion achieves state-of-the-art results, proving its effectiveness. Our method can be combined with many other application methods or backbones, such as ControlNet, Story Diffusion, PhotoMaker and PixArt-alpha, to enhance their multi-concept capabilities, demonstrating strong compatibility and scalability.
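The masking mechanism can be pictured with the short sketch below: a standard cross-attention step in which scores for tokens marked irrelevant to a region are suppressed before the softmax. The tensor shapes and the name `sp_mask` are illustrative assumptions; the actual SP-Mask construction is not reproduced here.

```python
import torch

def semantic_protected_cross_attention(q, k, v, sp_mask):
    """Cross-attention with a Semantic Protection Mask (illustrative sketch).
    q: (B, N_pixels, d) latent queries; k, v: (B, N_tokens, d) text keys/values;
    sp_mask: (B, N_pixels, N_tokens) boolean, True where a token is relevant to
    a region. Irrelevant token scores are suppressed before the softmax, so
    each region only attends to its own prompt tokens. Every region is assumed
    to keep at least one relevant token, so no row is fully masked."""
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5
    scores = scores.masked_fill(~sp_mask, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, v)
```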
Abstract:Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub, filling the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.