Abstract:Software defect prediction (SDP) aims to identify high-risk defect modules in software development, optimizing resource allocation. While previous studies show that dependency network metrics improve defect prediction, most methods focus on code-based dependency graphs, overlooking developer factors. Current metrics, based on handcrafted features like ego and global network metrics, fail to fully capture defect-related information. To address this, we propose DeMuVGN, a defect prediction model that learns multi-view software dependency via graph neural networks. We introduce a Multi-view Software Dependency Graph (MSDG) that integrates data, call, and developer dependencies. DeMuVGN also leverages the Synthetic Minority Oversampling Technique (SMOTE) to address class imbalance and enhance defect module identification. In a case study of eight open-source projects across 20 versions, DeMuVGN demonstrates significant improvements: i) models based on multi-view graphs improve F1 scores by 11.1% to 12.1% over single-view models; ii) DeMuVGN improves F1 scores by 17.4% to 45.8% in within-project contexts and by 17.9% to 41.0% in cross-project contexts. Additionally, DeMuVGN excels in software evolution, showing more improvement in later-stage software versions. Its strong performance across different projects highlights its generalizability. We recommend future research focus on multi-view dependency graphs for defect prediction in both mature and newly developed projects.
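The oversampling step named above can be illustrated with a minimal sketch of the generic SMOTE idea: pick a minority sample, pick one of its nearest minority neighbors, and interpolate between them. This is not DeMuVGN's implementation; the function name `smote`, its parameters, and the toy feature vectors are assumptions made for illustration, and at least two minority samples are required.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbor interpolation.

    `minority` is a list of feature vectors (lists of floats) with at least
    two entries. All names here are hypothetical, not from the paper.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base sample, excluding itself
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # place the new point at a random position on the connecting segment
        gap = rng.random()
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbor)])
    return synthetic

# Toy defective-module feature vectors (made-up values)
defective = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(defective, n_new=10, k=2)
```

Because every synthetic point lies on a segment between two real minority samples, the new samples stay inside the region already occupied by the minority class.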
Abstract:Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context. Such conflicts can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. In this work, we investigate whether LLMs can identify knowledge conflicts and whether it is possible to know which source of knowledge the model will rely on by analysing the residual stream of the LLM. Through probing tasks, we find that LLMs can internally register the signal of knowledge conflict in the residual stream, which can be accurately detected by probing the intermediate model activations. This allows us to detect conflicts within the residual stream before generating the answers without modifying the input or model parameters. Moreover, we find that the residual stream shows significantly different patterns when the model relies on contextual knowledge versus parametric knowledge to resolve conflicts. This pattern can be employed to estimate the behaviour of LLMs when conflict happens and prevent unexpected answers before producing the answers. Our analysis offers insights into how LLMs internally manage knowledge conflicts and provides a foundation for developing methods to control the knowledge selection processes.
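As a rough illustration of the probing idea described above, the sketch below fits a logistic-regression probe to vectors labeled "conflict" or "no conflict". It is only a sketch under stated assumptions: the abstract does not specify the probe type, the vectors here are synthetic stand-ins rather than real residual-stream activations, and the names `train_probe`, `predict`, and `fake_activation` are hypothetical.

```python
import math
import random

def train_probe(activations, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim = len(activations[0])
    n = len(activations)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(activations, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted conflict probability
            err = p - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (conflict) if the probe's logit is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

rng = random.Random(0)

def fake_activation(conflict):
    # Assumption: a conflict shifts one direction of the activation vector.
    center = 1.0 if conflict else -1.0
    return [rng.gauss(center, 0.3)] + [rng.gauss(0.0, 0.3) for _ in range(3)]

labels = [i % 2 for i in range(200)]
acts = [fake_activation(y) for y in labels]
w, b = train_probe(acts[:150], labels[:150])
accuracy = sum(predict(w, b, x) == y
               for x, y in zip(acts[150:], labels[150:])) / 50
```

With real models, `acts` would be replaced by activations collected at a chosen layer before the answer is generated; the probe itself leaves the input and the model parameters untouched.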
Abstract:Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context -- this phenomenon, known as \emph{context-memory knowledge conflicts}, can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. Analysing the internal activations of LLMs, we find that they can internally register the signals of knowledge conflict at mid-layers. Such signals allow us to detect whether a knowledge conflict occurs and use \emph{inference-time} intervention strategies to resolve it. In this work, we propose \textsc{SpARE}, a \emph{training-free} representation engineering method that uses pre-trained sparse auto-encoders (SAEs) to control the knowledge selection behaviour of LLMs. \textsc{SpARE} identifies the functional features that control the knowledge selection behaviours and applies them to edit the internal activations of LLMs at inference time. Our experimental results show that \textsc{SpARE} can effectively control the usage of either knowledge source to resolve knowledge conflict in open-domain question-answering tasks, surpassing existing representation engineering methods ($+10\%$) as well as contrastive decoding methods ($+15\%$).
Abstract:In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial understanding because 3D-wise spatial features are difficult to model. In this work, we model SI2T and ST2I together under a dual learning framework. Within this framework, we represent the 3D spatial scene features with a novel 3D scene graph (3DSG) representation that is shared by, and benefits, both tasks. Further, inspired by the intuition that the easier 3D$\to$image and 3D$\to$text processes also exist symmetrically in ST2I and SI2T, respectively, we propose the Spatial Dual Discrete Diffusion (SD$^3$) framework, which uses the intermediate features of the 3D$\to$X processes to guide the harder X$\to$3D processes, so that ST2I and SI2T benefit each other. On the visual spatial understanding dataset VSD, our system significantly outperforms mainstream T2I and I2T methods. Further in-depth analysis reveals how our dual learning strategy contributes to these gains.
Abstract:Wire-harnessing tasks are challenging for robots to automate because of the complex dynamics and unpredictable behavior of deformable wires. Traditional methods, often reliant on dual-robot arms or tactile sensing, face limitations in adaptability, cost, and scalability. This paper introduces a novel single-robot wire-harnessing pipeline that leverages a robot's twisting motion to generate the wire tension needed for precise insertion into clamps, using only one robot arm with an integrated force/torque (F/T) sensor. With this design, the single robot arm can efficiently apply tension for wire routing and insertion into clamps in a narrow space. Our approach is structured around four principal components: a Model Predictive Control (MPC) scheme based on the Koopman operator for tension tracking and wire following, a motion planner for sequencing harnessing waypoints, a suite of insertion primitives for clamp engagement, and a fix-point switching mechanism for updating wire constraints. Evaluated on an industrial-level wire-harnessing task, our method demonstrated superior performance and reliability over conventional approaches, efficiently handling both single- and multiple-wire configurations with high success rates.
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in generating high-quality texts across diverse domains. However, the potential misuse of LLMs has raised significant concerns, underscoring the urgent need for reliable detection of LLM-generated texts. Conventional training-based detectors often struggle with generalization, particularly in cross-domain and cross-model scenarios. In contrast, training-free methods, which focus on inherent discrepancies through carefully designed statistical features, offer improved generalization and interpretability. Despite this, existing training-free detection methods typically rely on global text sequence statistics, neglecting the modeling of local discriminative features, thereby limiting their detection efficacy. In this work, we introduce a novel training-free detector, termed \textbf{Lastde} that synergizes local and global statistics for enhanced detection. For the first time, we introduce time series analysis to LLM-generated text detection, capturing the temporal dynamics of token probability sequences. By integrating these local statistics with global ones, our detector reveals significant disparities between human and LLM-generated texts. We also propose an efficient alternative, \textbf{Lastde++} to enable real-time detection. Extensive experiments on six datasets involving cross-domain, cross-model, and cross-lingual detection scenarios, under both white-box and black-box settings, demonstrated that our method consistently achieves state-of-the-art performance. Furthermore, our approach exhibits greater robustness against paraphrasing attacks compared to existing baseline methods.
Abstract:We present the Melody Guided Music Generation (MMGen) model, the first approach that uses melody to guide music generation and that, despite a simple method and extremely limited resources, achieves excellent performance. Specifically, we first align the melody with audio waveforms and their associated descriptions using a multimodal alignment module. Subsequently, we condition the diffusion module on the learned melody representations. This allows MMGen to generate music that matches the style of the provided audio while also reflecting the content of the given text description. To address the scarcity of high-quality data, we construct a multi-modal dataset, MusicSet, which includes melody, text, and audio, and which will be made publicly available. Extensive experiments demonstrate the superiority of the proposed model in terms of both experimental metrics and actual performance quality.
Abstract:Underwater images often suffer from quality degradation due to absorption and scattering effects. Most existing underwater image enhancement algorithms produce a single, fixed-color image, limiting user flexibility and application. To address this limitation, we propose a method called \textit{ColorCode}, which enhances underwater images while offering a range of controllable color outputs. Our approach involves recovering an underwater image to a reference enhanced image through supervised training and decomposing it into color and content codes via self-reconstruction and cross-reconstruction. The color code is explicitly constrained to follow a Gaussian distribution, allowing for efficient sampling and interpolation during inference. ColorCode offers three key features: 1) color enhancement, producing an enhanced image with a fixed color; 2) color adaptation, enabling controllable adjustments of long-wavelength color components using guidance images; and 3) color interpolation, allowing for the smooth generation of multiple colors through continuous sampling of the color code. Quantitative and visual evaluations on popular and challenging benchmark datasets demonstrate the superiority of ColorCode over existing methods in providing diverse, controllable, and color-realistic enhancement results. The source code is available at https://github.com/Xiaofeng-life/ColorCode.
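The sampling and interpolation of a Gaussian-constrained code can be pictured with a small sketch. It assumes a standard normal prior and plain linear interpolation between two sampled codes; the code dimension, the helper names `sample_code` and `interpolate`, and the interpolation scheme are illustrative assumptions, not details taken from ColorCode.

```python
import random

def sample_code(dim, rng):
    """Draw one color code from a standard normal prior (assumed)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def interpolate(code_a, code_b, steps):
    """Return `steps` codes evenly spaced from code_a to code_b, inclusive.

    Requires steps >= 2.
    """
    path = []
    for s in range(steps):
        t = s / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(code_a, code_b)])
    return path

rng = random.Random(0)
start, end = sample_code(8, rng), sample_code(8, rng)
path = interpolate(start, end, steps=5)
```

In a full pipeline, each code along `path` would be passed to a decoder together with a fixed content code, giving a smooth sequence of color variants of the same image.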
Abstract:Imitation-based robot learning has recently gained significant attention in the robotics field due to its theoretical potential for transferability and generalizability. However, it remains notoriously costly, both in terms of hardware and data collection, and deploying it in real-world environments demands meticulous setup of robots and precise experimental conditions. In this paper, we present a low-cost robot learning framework that is both easily reproducible and transferable to various robots and environments. We demonstrate that deployable imitation learning can be successfully applied even to industrial-grade robots, not just expensive collaborative robotic arms. Furthermore, our results show that multi-task robot learning is achievable with simple network architectures and fewer demonstrations than previously thought necessary. Because current evaluation methods for real-world manipulation tasks are largely subjective, we propose Voting Positive Rate (VPR), a novel evaluation strategy that provides a more objective assessment of performance. We conduct an extensive comparison of success rates across various self-designed tasks to validate our approach. To foster collaboration and support the robot learning community, we have open-sourced all relevant datasets and model checkpoints, available at huggingface.co/ZhiChengAI.
Abstract:Multimodal sentiment analysis aims to effectively integrate information from various sources to infer sentiment, where in many cases there are no annotations for unimodal labels. Therefore, most works rely on multimodal labels for training. However, learning unimodal signals this way suffers from a noisy label problem, because multimodal annotations are not always ideal substitutes for unimodal ones, which prevents finer optimization of individual modalities. In this paper, we explore the learning of unimodal labels under weak supervision from the annotated multimodal labels. Specifically, we propose a novel meta uni-label generation (MUG) framework to address this problem, which leverages the available multimodal labels to learn the corresponding unimodal labels through a meta uni-label correction network (MUCN). We first design a contrastive-based projection module to bridge the gap between unimodal and multimodal representations, so that multimodal annotations can guide the learning of MUCN. Afterwards, we propose unimodal and multimodal denoising tasks to train MUCN with explicit supervision via a bi-level optimization strategy. We then jointly train unimodal and multimodal learning tasks to extract discriminative unimodal features for multimodal inference. Experimental results suggest that MUG outperforms competitive baselines and can learn accurate unimodal labels.