Abstract:This paper addresses the phase retrieval problem, which aims to recover a signal vector $x$ from $m$ measurements $y_i=|\langle a_i,x^{\natural}\rangle|^2$, $i=1,\ldots,m$. A standard approach is to solve a nonconvex least squares problem using gradient descent with random initialization, which is known to work efficiently given a sufficient number of measurements. However, whether $O(n)$ measurements suffice for gradient descent to recover the ground truth efficiently has remained an open question. Prior work has established that $O(n\,{\rm poly}(\log n))$ measurements are sufficient. In this paper, we resolve this open problem by proving that $m=O(n)$ Gaussian random measurements are sufficient to guarantee, with high probability, that the objective function has a benign global landscape. This sample complexity is optimal because at least $\Omega(n)$ measurements are required for exact recovery. The landscape result allows us to further show that gradient descent with a constant step size converges to the ground truth from almost any initial point.
Abstract:Large Language Models (LLMs) have recently demonstrated remarkable coding capabilities. However, assessing code generation based on well-formed properties and aligning it with developer preferences remains challenging. In this paper, we explore two key questions under the new challenge of code preference learning: (i) How do we train models to predict meaningful preferences for code? and (ii) How do human and LLM preferences align with verifiable code properties and developer code tastes? To this end, we propose CodeFavor, a framework for training pairwise code preference models from synthetic evolution data, including code commits and code critiques. To evaluate code preferences, we introduce CodePrefBench, a benchmark comprising 1364 rigorously curated code preference tasks to cover three verifiable properties-correctness, efficiency, and security-along with human preference. Our evaluation shows that CodeFavor holistically improves the accuracy of model-based code preferences by up to 28.8%. Meanwhile, CodeFavor models can match the performance of models with 6-9x more parameters while being 34x more cost-effective. We also rigorously validate the design choices in CodeFavor via a comprehensive set of controlled experiments. Furthermore, we discover the prohibitive costs and limitations of human-based code preference: despite spending 23.4 person-minutes on each task, 15.1-40.3% of tasks remain unsolved. Compared to model-based preference, human preference tends to be more accurate under the objective of code correctness, while being sub-optimal for non-functional objectives.
Abstract:Knowledge editing has emerged as an efficient approach for updating the knowledge of large language models (LLMs), attracting increasing attention in recent research. However, there is a notable lack of effective measures to prevent the malicious misuse of this technology, which could lead to harmful edits in LLMs. These malicious modifications have the potential to cause LLMs to generate toxic content, misleading users into inappropriate actions. To address this issue, we introduce a novel task, \textbf{K}nowledge \textbf{E}diting \textbf{T}ype \textbf{I}dentification (KETI), aimed at identifying malicious edits in LLMs. As part of this task, we present KETIBench, a benchmark that includes five types of malicious updates and one type of benign update. Furthermore, we develop four classical classification models and three BERT-based models as baseline identifiers for both open-source and closed-source LLMs. Our experimental results, spanning 42 trials involving two models and three knowledge editing methods, demonstrate that all seven baseline identifiers achieve decent identification performance, highlighting the feasibility of identifying malicious edits in LLMs. Additional analyses reveal that the performance of the identifiers is independent of the efficacy of the knowledge editing methods and exhibits cross-domain generalization, enabling the identification of edits from unknown sources. All data and code are available in https://github.com/xpq-tech/KETI. Warning: This paper contains examples of toxic text.
Abstract:The automated vehicle (AV) equipped with the Adaptive Cruise Control (ACC) system is expected to reduce the fuel consumption for the intelligent transportation system. This paper presents the Advanced ACC-Micro (AA-Micro) model, a new energy consumption model based on micro trajectory data, calibrated and verified by empirical data. Utilizing a commercial AV equipped with the ACC system as the test platform, experiments were conducted at the Columbus 151 Speedway, capturing data from multiple ACC and Human-Driven (HV) test runs. The calibrated AA-Micro model integrates features from traditional energy consumption models and demonstrates superior goodness of fit, achieving an impressive 90% accuracy in predicting ACC system energy consumption without overfitting. A comprehensive statistical evaluation of the AA-Micro model's applicability and adaptability in predicting energy consumption and vehicle trajectories indicated strong model consistency and reliability for ACC vehicles, evidenced by minimal variance in RMSE values and uniform RSS distributions. Conversely, significant discrepancies were observed when applying the model to HV data, underscoring the necessity for specialized models to accurately predict energy consumption for HV and ACC systems, potentially due to their distinct energy consumption characteristics.
Abstract:Advancements in autonomous driving have increasingly focused on end-to-end (E2E) systems that manage the full spectrum of driving tasks, from environmental perception to vehicle navigation and control. This paper introduces V2X-VLM, an innovative E2E vehicle-infrastructure cooperative autonomous driving (VICAD) framework with large vision-language models (VLMs). V2X-VLM is designed to enhance situational awareness, decision-making, and ultimate trajectory planning by integrating data from vehicle-mounted cameras, infrastructure sensors, and textual information. The strength of the comprehensive multimodel data fusion of the VLM enables precise and safe E2E trajectory planning in complex and dynamic driving scenarios. Validation on the DAIR-V2X dataset demonstrates that V2X-VLM outperforms existing state-of-the-art methods in cooperative autonomous driving.
Abstract:Motivated by the emergent reasoning capabilities of Vision Language Models (VLMs) and its potential to improve the comprehensibility of autonomous driving systems, this paper introduces a closed-loop autonomous driving controller called VLM-MPC, which combines a VLM for high-level decision-making and a Model Predictive Controller (MPC) for low-level vehicle control. The proposed VLM-MPC system is structurally divided into two asynchronous components: an upper-level VLM and a lower-level MPC. The upper layer VLM generates driving parameters for lower-level control based on front camera images, ego vehicle state, traffic environment conditions, and reference memory. The lower-level MPC controls the vehicle in real-time using these parameters, considering engine lag and providing state feedback to the entire system. Experiments based on the nuScenes dataset validated the effectiveness of the proposed VLM-MPC system across various scenarios (e.g., night, rain, intersections). Results showed that the VLM-MPC system consistently outperformed baseline models in terms of safety and driving comfort. By comparing behaviors under different weather conditions and scenarios, we demonstrated the VLM's ability to understand the environment and make reasonable inferences.
Abstract:In the domain of code generation, self-debugging is crucial. It allows LLMs to refine their generated code based on execution feedback. This is particularly important because generating correct solutions in one attempt proves challenging for complex tasks. Prior works on self-debugging mostly focus on prompting methods by providing LLMs with few-shot examples, which work poorly on small open-sourced LLMs. In this work, we propose a training framework that significantly improves self-debugging capability of LLMs. Intuitively, we observe that a chain of explanations on the wrong code followed by code refinement helps LLMs better analyze the wrong code and do refinement. We thus propose an automated pipeline to collect a high-quality dataset for code explanation and refinement by generating a number of explanations and refinement trajectories and filtering via execution verification. We perform supervised fine-tuning (SFT) and further reinforcement learning (RL) on both success and failure trajectories with a novel reward design considering code explanation and refinement quality. SFT improves the pass@1 by up to 15.92% and pass@10 by 9.30% over four benchmarks. RL training brings additional up to 3.54% improvement on pass@1 and 2.55% improvement on pass@10. The trained LLMs show iterative refinement ability, and can keep refining code continuously. Lastly, our human evaluation shows that the LLMs trained with our framework generate more useful code explanations and help developers better understand bugs in source code.
Abstract:Multimodal aspect-based sentiment analysis (MABSA) aims to understand opinions in a granular manner, advancing human-computer interaction and other fields. Traditionally, MABSA methods use a joint prediction approach to identify aspects and sentiments simultaneously. However, we argue that joint models are not always superior. Our analysis shows that joint models struggle to align relevant text tokens with image patches, leading to misalignment and ineffective image utilization. In contrast, a pipeline framework first identifies aspects through MATE (Multimodal Aspect Term Extraction) and then aligns these aspects with image patches for sentiment classification (MASC: Multimodal Aspect-Oriented Sentiment Classification). This method is better suited for multimodal scenarios where effective image use is crucial. We present three key observations: (a) MATE and MASC have different feature requirements, with MATE focusing on token-level features and MASC on sequence-level features; (b) the aspect identified by MATE is crucial for effective image utilization; and (c) images play a trivial role in previous MABSA methods due to high noise. Based on these observations, we propose a pipeline framework that first predicts the aspect and then uses translation-based alignment (TBA) to enhance multimodal semantic consistency for better image utilization. Our method achieves state-of-the-art (SOTA) performance on widely used MABSA datasets Twitter-15 and Twitter-17. This demonstrates the effectiveness of the pipeline approach and its potential to provide valuable insights for future MABSA research. For reproducibility, the code and checkpoint will be released.
Abstract:Worldwide geolocalization aims to locate the precise location at the coordinate level of photos taken anywhere on the Earth. It is very challenging due to 1) the difficulty of capturing subtle location-aware visual semantics, and 2) the heterogeneous geographical distribution of image data. As a result, existing studies have clear limitations when scaled to a worldwide context. They may easily confuse distant images with similar visual contents, or cannot adapt to various locations worldwide with different amounts of relevant data. To resolve these limitations, we propose G3, a novel framework based on Retrieval-Augmented Generation (RAG). In particular, G3 consists of three steps, i.e., Geo-alignment, Geo-diversification, and Geo-verification to optimize both retrieval and generation phases of worldwide geolocalization. During Geo-alignment, our solution jointly learns expressive multi-modal representations for images, GPS and textual descriptions, which allows us to capture location-aware semantics for retrieving nearby images for a given query. During Geo-diversification, we leverage a prompt ensembling method that is robust to inconsistent retrieval performance for different image queries. Finally, we combine both retrieved and generated GPS candidates in Geo-verification for location prediction. Experiments on two well-established datasets IM2GPS3k and YFCC4k verify the superiority of G3 compared to other state-of-the-art methods.
Abstract:This paper reviews the NTIRE 2024 RAW Image Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines, however, this problem is not as explored as in the RGB domain. Th goal of this challenge is to upscale RAW Bayer images by 2x, considering unknown degradations such as noise and blur. In the challenge, a total of 230 participants registered, and 45 submitted results during thee challenge period. The performance of the top-5 submissions is reviewed and provided here as a gauge for the current state-of-the-art in RAW Image Super-Resolution.