Abstract:With the increasing intelligence and autonomy of LLM agents, their potential applications in the legal domain are becoming increasingly apparent. However, existing general-domain benchmarks cannot fully capture the complexity and subtle nuances of real-world judicial cognition and decision-making. Therefore, we propose LegalAgentBench, a comprehensive benchmark specifically designed to evaluate LLM Agents in the Chinese legal domain. LegalAgentBench includes 17 corpora from real-world legal scenarios and provides 37 tools for interacting with external knowledge. We designed a scalable task construction framework and carefully annotated 300 tasks. These tasks span various types, including multi-hop reasoning and writing, and range across different difficulty levels, effectively reflecting the complexity of real-world legal scenarios. Moreover, beyond evaluating final success, LegalAgentBench incorporates keyword analysis during intermediate processes to calculate progress rates, enabling more fine-grained evaluation. We evaluated eight popular LLMs, highlighting the strengths, limitations, and potential areas for improvement of existing models and methods. LegalAgentBench sets a new benchmark for the practical application of LLMs in the legal domain, with its code and data available at \url{https://github.com/CSHaitao/LegalAgentBench}.
Abstract:Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a flow-matching (FM) backbone that aims to enhance the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder to capture initial phrase break locations, followed by a Duration Predictor for the flexible adjustment of break durations; and a Terminal Intonation Encoder which integrates a set of intonation shape tokens combined with a novel Pitch Processor for more robust modeling of human-perceived intonation change. ProsodyFM is trained with no explicit prosodic labels and yet can uncover a broad spectrum of break durations and intonation patterns. Experimental results demonstrate that ProsodyFM can effectively improve the phrasing and intonation aspects of prosody, thereby enhancing the overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement can further bring ProsodyFM superior generalizability for unseen complex sentences and speakers. Our case study intuitively illustrates the powerful and fine-grained controllability of ProsodyFM over phrasing and intonation.
Abstract:The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as ''LLMs-as-judges''. This framework has attracted growing attention from both academia and industry due to their excellent effectiveness, ability to generalize across tasks, and interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-Judges and introduce their functionality (Why use LLM judges?). Then we address methodology to construct an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim aims to provide insights on the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list at https://github.com/CSHaitao/Awesome-LLMs-as-Judges.
Abstract:The use of large language models (LLMs) as automated evaluation tools to assess the quality of generated natural language, known as LLMs-as-Judges, has demonstrated promising capabilities and is rapidly gaining widespread attention. However, when applied to pairwise comparisons of candidate responses, LLM-based evaluators often exhibit selection bias. Specifically, their judgments may become inconsistent when the option positions or ID tokens are swapped, compromising the effectiveness and fairness of the evaluation result. To address this challenge, we introduce CalibraEval, a novel label-free method for mitigating selection bias during inference. Specifically, CalibraEval reformulates debiasing as an optimization task aimed at adjusting observed prediction distributions to align with unbiased prediction distributions. To solve this optimization problem, we propose a non-parametric order-preserving algorithm (NOA). This algorithm leverages the partial order relationships between model prediction distributions, thereby eliminating the need for explicit labels and precise mathematical function modeling.Empirical evaluations of LLMs in multiple representative benchmarks demonstrate that CalibraEval effectively mitigates selection bias and improves performance compared to existing debiasing methods. This work marks a step toward building more robust and unbiased automated evaluation frameworks, paving the way for improved reliability in AI-driven assessments
Abstract:With the rapid development of large language models (LLMs), how to efficiently evaluate them has become an important research question. Existing evaluation methods often suffer from high costs, limited test formats, the need of human references, and systematic evaluation biases. To address these limitations, our study introduces the Auto-PRE, an automatic LLM evaluation framework based on peer review. In contrast to previous studies that rely on human annotations, Auto-PRE selects evaluator LLMs automatically based on their inherent traits including consistency, self-confidence, and pertinence. We conduct extensive experiments on three tasks: summary generation, non-factoid question-answering, and dialogue generation. Experimental results indicate our Auto-PRE achieves state-of-the-art performance at a lower cost. Moreover, our study highlights the impact of prompt strategies and evaluation formats on evaluation performance, offering guidance for method optimization in the future.
Abstract:Machine unlearning (MU) is becoming a promising paradigm to achieve the "right to be forgotten", where the training trace of any chosen data points could be eliminated, while maintaining the model utility on general testing samples after unlearning. With the advancement of forgetting research, many fundamental open questions remain unanswered: do different samples exhibit varying levels of difficulty in being forgotten? Further, does the sequence in which samples are forgotten, determined by their respective difficulty levels, influence the performance of forgetting algorithms? In this paper, we identify key factor affecting unlearning difficulty and the performance of unlearning algorithms. We find that samples with higher privacy risks are more likely to be unlearning, indicating that the unlearning difficulty varies among different samples which motives a more precise unlearning mode. Built upon this insight, we propose a general unlearning framework, dubbed RSU, which consists of Ranking module and SeqUnlearn module.
Abstract:Unsupervised constituency parsers organize phrases within a sentence into a tree-shaped syntactic constituent structure that reflects the organization of sentence semantics. However, the traditional objective of maximizing sentence log-likelihood (LL) does not explicitly account for the close relationship between the constituent structure and the semantics, resulting in a weak correlation between LL values and parsing accuracy. In this paper, we introduce a novel objective for training unsupervised parsers: maximizing the information between constituent structures and sentence semantics (SemInfo). We introduce a bag-of-substrings model to represent the semantics and apply the probability-weighted information metric to estimate the SemInfo. Additionally, we develop a Tree Conditional Random Field (TreeCRF)-based model to apply the SemInfo maximization objective to Probabilistic Context-Free Grammar (PCFG) induction, the state-of-the-art method for unsupervised constituency parsing. Experiments demonstrate that SemInfo correlates more strongly with parsing accuracy than LL. Our algorithm significantly enhances parsing accuracy by an average of 7.85 points across five PCFG variants and in four languages, achieving new state-of-the-art results in three of the four languages.
Abstract:Solving the 3D refractive index (RI) from fluorescence images provides both fluorescence and phase information about biological samples. However, accurately retrieving the phase of partially coherent light to reconstruct the unknown RI of label-free phase objects over a large volume, at high resolution, and in reflection mode remains challenging. To tackle this challenge, we developed fluorescence diffraction tomography (FDT) with explicit neural fields that can reconstruct 3D RI from defocused fluorescence speckle images. The successful reconstruction of 3D RI using FDT relies on four key components: coarse-to-fine modeling, self-calibration, a differential multi-slice rendering model, and partial coherent masks. Specifically, the explicit representation efficiently integrates with the coarse-to-fine modeling to achieve high-speed, high-resolution reconstruction. Moreover, we advance the multi-slice equation to differential multi-slice rendering model, which enables the self-calibration method for the extrinsic and intrinsic parameters of the system. The self-calibration facilitates high accuracy forward image prediction and RI reconstruction. Partial coherent masks are digital masks to resolve the discrepancies between the coherent light model and the partial coherent light data accurately and efficiently. FDT successfully reconstructed the RI of 3D cultured label-free 3D MuSCs tube in a 530 $\times$ 530 $\times$ 300 $\mu m^3$ volume at 1024$\times$1024 pixels across 24 $z$-layers from fluorescence images, demonstrating high fidelity 3D RI reconstruction of bulky and heterogeneous biological samples in vitro.
Abstract:The continual evolution of autonomous driving technology requires car-following models that can adapt to diverse and dynamic traffic environments. Traditional learning-based models often suffer from performance degradation when encountering unseen traffic patterns due to a lack of continual learning capabilities. This paper proposes a novel car-following model based on continual learning that addresses this limitation. Our framework incorporates Elastic Weight Consolidation (EWC) and Memory Aware Synapses (MAS) techniques to mitigate catastrophic forgetting and enable the model to learn incrementally from new traffic data streams. We evaluate the performance of the proposed model on the Waymo and Lyft datasets which encompass various traffic scenarios. The results demonstrate that the continual learning techniques significantly outperform the baseline model, achieving 0\% collision rates across all traffic conditions. This research contributes to the advancement of autonomous driving technology by fostering the development of more robust and adaptable car-following models.
Abstract:Accurate modeling of car-following behaviors is essential for various applications in traffic management and autonomous driving systems. However, current approaches often suffer from limitations like high sensitivity to data quality and lack of interpretability. In this study, we propose GenFollower, a novel zero-shot prompting approach that leverages large language models (LLMs) to address these challenges. We reframe car-following behavior as a language modeling problem and integrate heterogeneous inputs into structured prompts for LLMs. This approach achieves improved prediction performance and interpretability compared to traditional baseline models. Experiments on the Waymo Open datasets demonstrate GenFollower's superior performance and ability to provide interpretable insights into factors influencing car-following behavior. This work contributes to advancing the understanding and prediction of car-following behaviors, paving the way for enhanced traffic management and autonomous driving systems.