Abstract:The ability to adapt beliefs or behaviors in response to unexpected outcomes, reflection, is fundamental to intelligent systems' interaction with the world. From a cognitive science perspective, this serves as a core principle of intelligence applicable to both human and AI systems. To address the debate on the intelligence of large language models (LLMs), we propose Reflection-Bench, a comprehensive benchmark comprising 7 tasks spanning core cognitive functions crucial for reflection, including perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection. We evaluate the performances of 13 prominent LLMs such as OpenAI o1, GPT-4, Claude 3.5 Sonnet, etc. The results indicate that current LLMs still lack satisfactory reflection ability. We discuss the underlying causes of these results and suggest potential avenues for future research. In conclusion, Reflection-Bench offers both evaluation tools and inspiration for developing AI capable of reliably interacting with the environment. Our data and code are available at https://github.com/YabYum/ReflectionBench.
Abstract:Considering the problem of nonlinear and non-gaussian filtering of the graph signal, in this paper, a robust square root unscented Kalman filter based on graph signal processing is proposed. The algorithm uses a graph topology to generate measurements and an unscented transformation is used to obtain the priori state estimates. In addition, in order to enhance the numerical stability of the unscented Kalman filter, the algorithm combines the double square root decomposition method to update the covariance matrix in the graph frequency domain. Furthermore, to handle the non-Gaussian noise problem in the state estimation process, an error augmentation model is constructed in the graph frequency domain by unifying the measurement error and state error, which utilizes the Laplace matrix of the graph to effectively reduce the cumulative error at each vertex. Then the general robust cost function is adopted as the optimal criterion to deal with the error, which has more parameter options so that effectively suppresses the problems of random outliers and abnormal measurement values in the state estimation process. Finally, the convergence of the error of the proposed algorithm is firstly verified theoretically, and then the robustness of the proposed algorithm is verified by experimental simulation.
Abstract:Emotion Support Conversation (ESC) is a crucial application, which aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as the ESC models. However, the evaluation of these LLM-based ESCs remains uncertain. Inspired by the awesome development of role-playing agents, we propose an ESC Evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by a manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a specific role-playing model called ESC-Role which behaves more like a confused person than GPT-4. Third, through ESC-Role and organized role cards, we systematically conduct experiments using 14 LLMs as the ESC models, including general AI-assistant LLMs (ChatGPT) and ESC-oriented LLMs (ExTES-Llama). We conduct comprehensive human annotations on interactive multi-turn dialogues of different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but there is still a gap behind human performance. Moreover, to automate the scoring process for future ESC models, we developed ESC-RANK, which trained on the annotated data, achieving a scoring performance surpassing 35 points of GPT-4. Our data and code are available at https://github.com/haidequanbu/ESC-Eval.
Abstract:Powered by remarkable advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities in manifold tasks. However, the practical application scenarios of MLLMs are intricate, exposing them to potential malicious instructions and thereby posing safety risks. While current benchmarks do incorporate certain safety considerations, they often lack comprehensive coverage and fail to exhibit the necessary rigor and robustness. For instance, the common practice of employing GPT-4V as both the evaluator and a model to be evaluated lacks credibility, as it tends to exhibit a bias toward its own responses. In this paper, we present MLLMGuard, a multidimensional safety evaluation suite for MLLMs, including a bilingual image-text evaluation dataset, inference utilities, and a lightweight evaluator. MLLMGuard's assessment comprehensively covers two languages (English and Chinese) and five important safety dimensions (Privacy, Bias, Toxicity, Truthfulness, and Legality), each with corresponding rich subtasks. Focusing on these dimensions, our evaluation dataset is primarily sourced from platforms such as social media, and it integrates text-based and image-based red teaming techniques with meticulous annotation by human experts. This can prevent inaccurate evaluation caused by data leakage when using open-source datasets and ensures the quality and challenging nature of our benchmark. Additionally, a fully automated lightweight evaluator termed GuardRank is developed, which achieves significantly higher evaluation accuracy than GPT-4. Our evaluation results across 13 advanced models indicate that MLLMs still have a substantial journey ahead before they can be considered safe and responsible.
Abstract:Although the known maximum total generalized correntropy (MTGC) and generalized maximum blakezisserman total correntropy (GMBZTC) algorithms can maintain good performance under the errors-in-variables (EIV) model disrupted by generalized Gaussian noise, their requirement for manual ad-justment of parameters is excessive, greatly increasing the practical difficulty of use. To solve this problem, the total arctangent based on logical distance metric (TACLDM) algo-rithm is proposed by utilizing the advantage of few parameters in logical distance metric (LDM) theory and the convergence behavior is improved by the arctangent function. Compared with other competing algorithms, the TACLDM algorithm not only has fewer parameters, but also has better robustness to generalized Gaussian noise and significantly reduces the steady-state error. Furthermore, the analysis of the algorithm in the generalized Gaussian noise environment is analyzed in detail in this paper. Finally, computer simulations demonstrate the outstanding performance of the TACLDM algorithm and the rigorous theoretical deduction in this paper.
Abstract:In recent years, multi-modal entity linking (MEL) has garnered increasing attention in the research community due to its significance in numerous multi-modal applications. Video, as a popular means of information transmission, has become prevalent in people's daily lives. However, most existing MEL methods primarily focus on linking textual and visual mentions or offline videos's mentions to entities in multi-modal knowledge bases, with limited efforts devoted to linking mentions within online video content. In this paper, we propose a task called Online Video Entity Linking OVEL, aiming to establish connections between mentions in online videos and a knowledge base with high accuracy and timeliness. To facilitate the research works of OVEL, we specifically concentrate on live delivery scenarios and construct a live delivery entity linking dataset called LIVE. Besides, we propose an evaluation metric that considers timelessness, robustness, and accuracy. Furthermore, to effectively handle OVEL task, we leverage a memory block managed by a Large Language Model and retrieve entity candidates from the knowledge base to augment LLM performance on memory management. The experimental results prove the effectiveness and efficiency of our method.
Abstract:The minimum error entropy (MEE) has been extensively used in unscented Kalman filter (UKF) to handle impulsive noises or abnormal measurement data in non-Gaussian systems. However, the MEE-UKF has poor numerical stability due to the inverse operation of singular matrix. In this paper, a novel UKF based on minimum error entropy with fiducial points (MEEF) is proposed \textcolor{black}{to improve the problem of non-positive definite key matrix. By adding the correntropy to the error entropy, the proposed algorithm further enhances the ability of suppressing impulse noise and outliers. At the same time, considering the uncertainty of noise distribution, the modified Sage-Husa estimator of noise statistics is introduced to adaptively update the noise covariance matrix. In addition, the convergence analysis of the proposed algorithm provides a guidance for the selection of kernel width. The robustness and estimation accuracy of the proposed algorithm are manifested by the state tracking examples under complex non-Gaussian noises.
Abstract:The conventional Minimum Error Entropy criterion (MEE) has its limitations, showing reduced sensitivity to error mean values and uncertainty regarding error probability density function locations. To overcome this, a MEE with fiducial points criterion (MEEF), was presented. However, the efficacy of the MEEF is not consistent due to its reliance on a fixed Gaussian kernel. In this paper, a generalized minimum error with fiducial points criterion (GMEEF) is presented by adopting the Generalized Gaussian Density (GGD) function as kernel. The GGD extends the Gaussian distribution by introducing a shape parameter that provides more control over the tail behavior and peakedness. In addition, due to the high computational complexity of GMEEF criterion, the quantized idea is introduced to notably lower the computational load of the GMEEF-type algorithm. Finally, the proposed criterions are introduced to the domains of adaptive filter, kernel recursive algorithm, and multilayer perceptron. Several numerical simulations, which contain system identification, acoustic echo cancellation, times series prediction, and supervised classification, indicate that the novel algorithms' performance performs excellently.
Abstract:When the input signal is correlated input signals, and the input and output signal is contaminated by Gaussian noise, the total least squares normalized subband adaptive filter (TLS-NSAF) algorithm shows good performance. However, when it is disturbed by impulse noise, the TLS-NSAF algorithm shows the rapidly deteriorating convergence performance. To solve this problem, this paper proposed the robust total minimum mean M-estimator normalized subband filter (TLMM-NSAF) algorithm. In addition, this paper also conducts a detailed theoretical performance analysis of the TLMM-NSAF algorithm and obtains the stable step size range and theoretical steady-state mean squared deviation (MSD) of the algorithm. To further improve the performance of the algorithm, we also propose a new variable step size (VSS) method of the algorithm. Finally, the robustness of our proposed algorithm and the consistency of theoretical and simulated values are verified by computer simulations of system identification and echo cancellation under different noise models.
Abstract:In recent years, the multitask diffusion least mean square (MD-LMS) algorithm has been extensively applied in the distributed parameter estimation and target tracking of multitask network. However, its performance is mainly limited by two aspects, i.e, the correlated input signal and impulsive noise interference. To overcome these two limitations simultaneously, this paper firstly introduces the subband adaptive filter (SAF) into the multitask network. Then, a robust multitask diffusion normalized M-estimate subband adaptive filtering (MD-NMSAF) algorithm is proposed by solving the modified Huber function based global network optimization problem in a distributed manner, which endows the multitask network strong decorrelation ability for correlated inputs and robustness to impulsive noise interference, and accelerates the convergence of the algorithm significantly. Compared with the robust multitask diffusion affine projection M-estimate (MD-APM) algorithm, the computational complexity of the proposed MD-NMSAF is greatly reduced. In addition, the stability condition, the analytical expressions of the theoretical transient and steady-state network mean square deviation (MSD) of the MD-NMSAF are also provided and verified through computer simulations. Simulation results under different input signals and impulsive noise environment fully demonstrate the performance advantages of the MD-NMSAF algorithm over some other competitors in terms of steady-state accuracy and tracking speed.