Abstract:While Large Language Model-based agents have demonstrated substantial progress in task completion, existing evaluation benchmarks tend to overemphasize single-task performance, with insufficient attention given to the crucial aspects of multitask planning and execution efficiency required in real-world scenarios. To bridge this gap, we present Recipe2Plan, a novel benchmark framework based on real-world cooking scenarios. Unlike conventional benchmarks, Recipe2Plan challenges agents to optimize cooking time through parallel task execution while respecting temporal constraints i.e. specific actions need to be performed within a particular time intervals following the preceding steps. Overly aggressive local parallelization may disrupt this constraint, potentially compromising the entire cooking process. This strict time constraint between actions raises a unique challenge for agents to balance between maximizing concurrent operations and adhering to critical timing constraints. Extensive experiments with state-of-the-art models reveal challenges in maintaining this balance between efficiency and feasibility. The results highlight the need for improved temporal awareness and global multitasking capabilities in large language models. We open-source our benchmark and code at https://github.com/WilliamZR/Recipe2Plan.
Abstract:Large language models (LLMs) excel in high-resource languages but struggle with low-resource languages (LRLs), particularly those spoken by minority communities in China, such as Tibetan, Uyghur, Kazakh, and Mongolian. To systematically track the progress in these languages, we introduce MiLiC-Eval, a benchmark designed for minority languages in China, featuring 24K instances across 9 tasks. MiLiC-Eval focuses on underrepresented writing systems and provides a fine-grained assessment of linguistic and problem-solving skills. Our evaluation reveals that LLMs perform poorly on syntax-intensive tasks and multi-script languages. We further demonstrate how MiLiC-Eval can help advance LRL research in handling diverse writing systems and understanding the process of language adaptation.
Abstract:Longitudinal network data are essential for analyzing political, economic, and social systems and processes. In political science, these datasets are often generated through human annotation or supervised machine learning applied to evolving corpora. However, as semantic contexts shift over time, inferring dynamic interaction types on emerging issues among a diverse set of entities poses significant challenges, particularly in maintaining timely and consistent annotations. This paper presents the Expert-Augmented LLM Annotation (EALA) approach, which leverages Large Language Models (LLMs) in combination with historically annotated data and expert-constructed codebooks to extrapolate and extend datasets into future periods. We evaluate the performance and reliability of EALA using a dataset of climate negotiations. Our findings demonstrate that EALA effectively predicts nuanced interactions between negotiation parties and captures the evolution of topics over time. At the same time, we identify several limitations inherent to LLM-based annotation, highlighting areas for further improvement. Given the wide availability of codebooks and annotated datasets, EALA holds substantial promise for advancing research in political science and beyond.
Abstract:The Four-Element Theory is a fundamental framework in criminal law, defining the constitution of crime through four dimensions: Subject, Object, Subjective aspect, and Objective aspect. This theory is widely referenced in legal reasoning, and many Large Language Models (LLMs) attempt to incorporate it when handling legal tasks. However, current approaches rely on LLMs' internal knowledge to incorporate this theory, often lacking completeness and representativeness. To address this limitation, we introduce JUREX-4E, an expert-annotated knowledge base covering 155 criminal charges. It is structured through a progressive hierarchical annotation framework that prioritizes legal source validity and employs diverse legal interpretation methods to ensure comprehensiveness and authority. We evaluate JUREX-4E on the Similar Charge Distinction task and apply it to Legal Case Retrieval, demonstrating its effectiveness in improving LLM performance. Experimental results validate the high quality of JUREX-4E and its substantial impact on downstream legal tasks, underscoring its potential for advancing legal AI applications. Code: https://github.com/THUlawtech/JUREX
Abstract:Legal articles often include vague concepts to adapt to the ever-changing society. Providing detailed interpretations of these concepts is a critical task for legal practitioners, which requires meticulous and professional annotations by legal experts, admittedly time-consuming and expensive to collect at scale. In this paper, we introduce a novel retrieval-augmented generation framework, ATRI, for AuTomatically Retrieving relevant information from past judicial precedents and Interpreting vague legal concepts. We further propose a new benchmark, Legal Concept Entailment, to automate the evaluation of generated concept interpretations without expert involvement. Automatic evaluations indicate that our generated interpretations can effectively assist large language models (LLMs) in understanding vague legal concepts. Multi-faceted evaluations by legal experts indicate that the quality of our concept interpretations is comparable to those written by human experts. Our work has strong implications for leveraging LLMs to support legal practitioners in interpreting vague legal concepts and beyond.
Abstract:Event Temporal Relation Extraction (ETRE) aims to identify the temporal relationship between two events, which plays an important role in natural language understanding. Most previous works follow a single-label classification style, classifying an event pair into either a specific temporal relation (e.g., \textit{Before}, \textit{After}), or a special label \textit{Vague} when there may be multiple possible temporal relations between the pair. In our work, instead of directly making predictions on \textit{Vague}, we propose a multi-label classification solution for ETRE (METRE) to infer the possibility of each temporal relation independently, where we treat \textit{Vague} as the cases when there is more than one possible relation between two events. We design a speculation mechanism to explore the possible relations hidden behind \textit{Vague}, which enables the latent information to be used efficiently. Experiments on TB-Dense, MATRES and UDS-T show that our method can effectively utilize the \textit{Vague} instances to improve the recognition for specific temporal relations and outperforms most state-of-the-art methods.
Abstract:Despite remarkable performance in legal consultation exhibited by legal Large Language Models(LLMs) combined with legal article retrieval components, there are still cases when the advice given is incorrect or baseless. To alleviate these problems, we propose {\bf ELLA}, a tool for {\bf E}mpowering {\bf L}LMs for interpretable, accurate, and informative {\bf L}egal {\bf A}dvice. ELLA visually presents the correlation between legal articles and LLM's response by calculating their similarities, providing users with an intuitive legal basis for the responses. Besides, based on the users' queries, ELLA retrieves relevant legal articles and displays them to users. Users can interactively select legal articles for LLM to generate more accurate responses. ELLA also retrieves relevant legal cases for user reference. Our user study shows that presenting the legal basis for the response helps users understand better. The accuracy of LLM's responses also improves when users intervene in selecting legal articles for LLM. Providing relevant legal cases also aids individuals in obtaining comprehensive information.
Abstract:Conditional question answering (CQA) is an important task that aims to find probable answers and identify conditions that need to be satisfied to support the answer. Existing approaches struggle with CQA due to two main challenges: (1) precisely identifying conditions and their logical relationship, and (2) verifying and solving the conditions. To address these challenges, we propose Chain of Condition, a novel prompting approach by firstly identifying all conditions and constructing their logical relationships explicitly according to the document, then verifying whether these conditions are satisfied, finally solving the logical expression by tools to indicate any missing conditions and generating the answer based on the resolved conditions. The experiments on two benchmark conditional question answering datasets shows chain of condition outperforms existing prompting baselines, establishing a new state-of-the-art. Furthermore, with backbone models like GPT-3.5-Turbo or GPT-4, it surpasses all supervised baselines with only few-shot settings.
Abstract:Adapting large language models (LLMs) to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT). However, this CT-then-SFT approach struggles with limited data in the context of low-resource languages, failing to balance language modeling and task-solving capabilities. We thus propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training. We use model merging to develop task-solving LLMs for low-resource languages without SFT data in the target languages. Our experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data. Observing performance saturation in model merging with more training tokens, we further analyze the merging process and introduce a slack variable to the model merging algorithm to mitigate the loss of important parameters, thereby enhancing performance. We hope that model merging can benefit more human languages suffering from data scarcity with its higher data efficiency.
Abstract:We give a detailed overview of the CAIL 2023 Argument Mining Track, one of the Chinese AI and Law Challenge (CAIL) 2023 tracks. The main goal of the track is to identify and extract interacting argument pairs in trial dialogs. It mainly uses summarized judgment documents but can also refer to trial recordings. The track consists of two stages, and we introduce the tasks designed for each stage; we also extend the data from previous events into a new dataset -- CAIL2023-ArgMine -- with annotated new cases from various causes of action. We outline several submissions that achieve the best results, including their methods for different stages. While all submissions rely on language models, they have incorporated strategies that may benefit future work in this field.