Abstract:Self-critic has become an important mechanism for enhancing the reasoning performance of LLMs. However, current approaches mainly involve basic prompts without further training, which tend to be over-simplified, leading to limited accuracy.Moreover, there is a lack of in-depth investigation of the relationship between LLM's ability to criticism and its task-solving performance.To address these issues, we propose Critic-CoT, a novel framework that pushes LLMs toward System-2-like critic capability, via step-wise CoT reasoning format and distant-supervision data construction, without the need for human annotation. Experiments on GSM8K and MATH show that via filtering out invalid solutions or iterative refinement, our enhanced model boosts task-solving performance, which demonstrates the effectiveness of our method. Further, we find that training on critique and refinement alone improves the generation. We hope our work could shed light on future research on improving the reasoning and critic ability of LLMs.
Abstract:We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.
Abstract:Large Language Models (LLMs) have emerged as powerful tools for tackling complex Operations Research (OR) problem by providing the capacity in automating optimization modeling. However, current methodologies heavily rely on prompt engineering (e.g., multi-agent cooperation) with proprietary LLMs, raising data privacy concerns that could be prohibitive in industry applications. To tackle this issue, we propose training open-source LLMs for optimization modeling. We identify four critical requirements for the training dataset of OR LLMs, design and implement OR-Instruct, a semi-automated process for creating synthetic data tailored to specific requirements. We also introduce the IndustryOR benchmark, the first industrial benchmark for testing LLMs on solving real-world OR problems. We apply the data from OR-Instruct to various open-source LLMs of 7b size (termed as ORLMs), resulting in a significantly improved capability for optimization modeling. Our best-performing ORLM achieves state-of-the-art performance on the NL4OPT, MAMO, and IndustryOR benchmarks. Our code and data are available at \url{https://github.com/Cardinal-Operations/ORLM}.
Abstract:Evaluating the performance of a well-trained GNN model on real-world graphs is a pivotal step for reliable GNN online deployment and serving. Due to a lack of test node labels and unknown potential training-test graph data distribution shifts, conventional model evaluation encounters limitations in calculating performance metrics (e.g., test error) and measuring graph data-level discrepancies, particularly when the training graph used for developing GNNs remains unobserved during test time. In this paper, we study a new research problem, online GNN evaluation, which aims to provide valuable insights into the well-trained GNNs's ability to effectively generalize to real-world unlabeled graphs under the test-time graph distribution shifts. Concretely, we develop an effective learning behavior discrepancy score, dubbed LeBeD, to estimate the test-time generalization errors of well-trained GNN models. Through a novel GNN re-training strategy with a parameter-free optimality criterion, the proposed LeBeD comprehensively integrates learning behavior discrepancies from both node prediction and structure reconstruction perspectives. This enables the effective evaluation of the well-trained GNNs' ability to capture test node semantics and structural representations, making it an expressive metric for estimating the generalization error in online GNN evaluation. Extensive experiments on real-world test graphs under diverse graph distribution shifts could verify the effectiveness of the proposed method, revealing its strong correlation with ground-truth test errors on various well-trained GNN models.
Abstract:Graph learning plays a pivotal role and has gained significant attention in various application scenarios, from social network analysis to recommendation systems, for its effectiveness in modeling complex data relations represented by graph structural data. In reality, the real-world graph data typically show dynamics over time, with changing node attributes and edge structure, leading to the severe graph data distribution shift issue. This issue is compounded by the diverse and complex nature of distribution shifts, which can significantly impact the performance of graph learning methods in degraded generalization and adaptation capabilities, posing a substantial challenge to their effectiveness. In this survey, we provide a comprehensive review and summary of the latest approaches, strategies, and insights that address distribution shifts within the context of graph learning. Concretely, according to the observability of distributions in the inference stage and the availability of sufficient supervision information in the training stage, we categorize existing graph learning methods into several essential scenarios, including graph domain adaptation learning, graph out-of-distribution learning, and graph continual learning. For each scenario, a detailed taxonomy is proposed, with specific descriptions and discussions of existing progress made in distribution-shifted graph learning. Additionally, we discuss the potential applications and future directions for graph learning under distribution shifts with a systematic analysis of the current state in this field. The survey is positioned to provide general guidance for the development of effective graph learning algorithms in handling graph distribution shifts, and to stimulate future research and advancements in this area.
Abstract:In recent years, multi-modal entity linking (MEL) has garnered increasing attention in the research community due to its significance in numerous multi-modal applications. Video, as a popular means of information transmission, has become prevalent in people's daily lives. However, most existing MEL methods primarily focus on linking textual and visual mentions or offline videos's mentions to entities in multi-modal knowledge bases, with limited efforts devoted to linking mentions within online video content. In this paper, we propose a task called Online Video Entity Linking OVEL, aiming to establish connections between mentions in online videos and a knowledge base with high accuracy and timeliness. To facilitate the research works of OVEL, we specifically concentrate on live delivery scenarios and construct a live delivery entity linking dataset called LIVE. Besides, we propose an evaluation metric that considers timelessness, robustness, and accuracy. Furthermore, to effectively handle OVEL task, we leverage a memory block managed by a Large Language Model and retrieve entity candidates from the knowledge base to augment LLM performance on memory management. The experimental results prove the effectiveness and efficiency of our method.
Abstract:Executing computer programs described in natural language has long been a pursuit of computer science. With the advent of enhanced natural language understanding capabilities exhibited by large language models (LLMs), the path toward this goal has been illuminated. In this paper, we seek to examine the capacity of present-day LLMs to comprehend and execute algorithms outlined in natural language. We established an algorithm test set sourced from Introduction to Algorithm, a well-known textbook that contains many representative widely-used algorithms. To systematically assess LLMs' code execution abilities, we selected 30 algorithms, generated 300 random-sampled instances in total, and evaluated whether popular LLMs can understand and execute these algorithms. Our findings reveal that LLMs, notably GPT-4, can effectively execute programs described in natural language, as long as no heavy numeric computation is involved. We believe our findings contribute to evaluating LLMs' code execution abilities and would encourage further investigation and application for the computation power of LLMs.
Abstract:Nowadays, sensor suits have been equipped with redundant LiDARs and IMUs to mitigate the risks associated with sensor failure. It is challenging for the previous discrete-time and IMU-driven kinematic systems to incorporate multiple asynchronized sensors, which are susceptible to abnormal IMU data. To address these limitations, we introduce a multi-LiDAR multi-IMU state estimator by taking advantage of Gaussian Process (GP) that predicts a non-parametric continuous-time trajectory to capture sensors' spatial-temporal movement with limited control states. Since the kinematic model driven by three types of linear time-invariant stochastic differential equations are independent of external sensor measurements, our proposed approach is capable of handling different sensor configurations and resilient to sensor failures. Moreover, we replace the conventional $\mathrm{SE}(3)$ state representation with the combination of $\mathrm{SO}(3)$ and vector space, which enables GP-based LiDAR-inertial system to fulfill the real-time requirement. Extensive experiments on the public datasets demonstrate the versatility and resilience of our proposed multi-LiDAR multi-IMU state estimator. To contribute to the community, we will make our source code publicly available.
Abstract:Despite having achieved real-time performance in mesh construction, most of the current LiDAR odometry and meshing methods may struggle to deal with complex scenes due to relying on explicit meshing schemes. They are usually sensitive to noise. To overcome these limitations, we propose a real-time mesh-based LiDAR odometry and mapping approach for large-scale scenes via implicit reconstruction and a parallel spatial-hashing scheme. To efficiently reconstruct triangular meshes, we suggest an incremental voxel meshing method that updates every scan by traversing each point once and compresses space via a scalable partition module. By taking advantage of rapid accessing triangular meshes at any time, we design point-to-mesh odometry with location and feature-based data association to estimate the poses between the incoming point clouds and the recovered triangular meshes. The experimental results on four datasets demonstrate the effectiveness of our proposed approach in generating accurate motion trajectories and environmental mesh maps.
Abstract:Video prediction aims to predict future frames from a video's previous content. Existing methods mainly process video data where the time dimension mingles with the space and channel dimensions from three distinct angles: as a sequence of individual frames, as a 3D volume in spatiotemporal coordinates, or as a stacked image where frames are treated as separate channels. Most of them generally focus on one of these perspectives and may fail to fully exploit the relationships across different dimensions. To address this issue, this paper introduces a convolutional mixer for video prediction, termed ViP-Mixer, to model the spatiotemporal evolution in the latent space of an autoencoder. The ViP-Mixers are stacked sequentially and interleave feature mixing at three levels: frames, channels, and locations. Extensive experiments demonstrate that our proposed method achieves new state-of-the-art prediction performance on three benchmark video datasets covering both synthetic and real-world scenarios.