Nils Holzenberger

Can AI expose tax loopholes? Towards a new generation of legal policy assistants

Mar 21, 2025

Peter Fratrič, Nils Holzenberger, David Restrepo Amariles

Abstract: The legislative process is the backbone of a state built on solid institutions. Yet, due to the complexity of laws -- particularly tax law -- policies may lead to inequality and social tensions. In this study, we introduce a novel prototype system designed to address the issues of tax loopholes and tax avoidance. Our hybrid solution integrates a natural language interface with a domain-specific language tailored for planning. We demonstrate on a case study how tax loopholes and avoidance schemes can be exposed. We conclude that our prototype can help enhance social welfare by systematically identifying and addressing tax gaps stemming from loopholes.

* 13 pages, 6 figures

The Factuality of Large Language Models in the Legal Domain

Sep 18, 2024

Rajaa El Hamdani, Thomas Bonald, Fragkiskos Malliaros, Nils Holzenberger, Fabian Suchanek

Abstract: This paper investigates the factuality of large language models (LLMs) as knowledge bases in the legal domain, in a realistic usage scenario: we allow for acceptable variations in the answer, and let the model abstain from answering when uncertain. First, we design a dataset of diverse factual questions about case law and legislation. We then use the dataset to evaluate several LLMs under different evaluation methods, including exact, alias, and fuzzy matching. Our results show that the performance improves significantly under the alias and fuzzy matching methods. Further, we explore the impact of abstaining and in-context examples, finding that both strategies enhance precision. Finally, we demonstrate that additional pre-training on legal documents, as seen with SaulLM, further improves factual precision from 63% to 81%.

* CIKM 2024, short paper
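
The three evaluation methods named in this abstract (exact, alias, and fuzzy matching, with abstention) can be illustrated with a minimal sketch. The normalization rule, the 0.8 similarity threshold, the use of `difflib.SequenceMatcher`, and the toy records are all assumptions made for illustration; the abstract does not say how the paper implements its matching.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def is_correct(prediction: str, gold: str, aliases: list[str],
               mode: str = "exact", threshold: float = 0.8) -> bool:
    """Score one answer under a hypothetical exact, alias, or fuzzy rule."""
    pred = normalize(prediction)
    # Exact matching accepts only the gold answer; the other modes also accept aliases.
    candidates = [gold] if mode == "exact" else [gold, *aliases]
    candidates = [normalize(c) for c in candidates]
    if pred in candidates:
        return True
    if mode == "fuzzy":
        # Assumed rule: accept near-matches above a character-similarity threshold.
        return any(SequenceMatcher(None, pred, c).ratio() >= threshold
                   for c in candidates)
    return False


def precision(records: list[dict], mode: str) -> float:
    """Precision over answered questions only; a None prediction is an abstention."""
    answered = [r for r in records if r["prediction"] is not None]
    if not answered:
        return 0.0
    hits = sum(is_correct(r["prediction"], r["gold"], r["aliases"], mode)
               for r in answered)
    return hits / len(answered)


# Toy records, invented for illustration only.
records = [
    {"prediction": "Roe v. Wade", "gold": "Roe v. Wade", "aliases": []},
    {"prediction": "GDPR", "gold": "General Data Protection Regulation",
     "aliases": ["GDPR"]},
    {"prediction": "Marbury v Madison", "gold": "Marbury v. Madison", "aliases": []},
    {"prediction": None, "gold": "Brown v. Board of Education", "aliases": []},
]

for mode in ("exact", "alias", "fuzzy"):
    print(mode, round(precision(records, mode), 3))
```

On these toy records the score rises as the matching rule loosens, and the abstained question is left out of the denominator, which is why abstaining can raise precision.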

Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations

Sep 16, 2024

Abe Bohan Hou, William Jurayj, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme

Abstract: Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs can often hallucinate in this setting, in ways difficult to recognize by non-professionals and existing text evaluation metrics. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps, as opposed to hallucinations in a strict erroneous sense, to refer to the difference between human-written and machine-generated legal analysis. Gaps do not always equate to invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), leading to a taxonomy, a fine-grained detector for predicting gap categories, and an annotated dataset for automatic evaluation. Our best detector achieves 67% F1 score and 80% precision on the test set. Employing this detector as an automated metric on legal analysis generated by SOTA LLMs, we find around 80% contain hallucinations of different kinds.

CLERC: A Dataset for Legal Case Retrieval and Retrieval-Augmented Analysis Generation

Jun 24, 2024

Abe Bohan Hou, Orion Weller, Guanghui Qin, Eugene Yang, Dawn Lawrie, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme

Abstract: Legal professionals need to write analyses that rely on citations to relevant precedents, i.e., previous case decisions. Intelligent systems assisting legal professionals in writing such documents provide great benefits but are challenging to design. Such systems need to help locate, summarize, and reason over salient precedents in order to be useful. To enable systems for such tasks, we work with legal professionals to transform a large open-source legal corpus into a dataset supporting two important backbone tasks: information retrieval (IR) and retrieval-augmented generation (RAG). This dataset, CLERC (Case Law Evaluation Retrieval Corpus), is constructed for training and evaluating models on their ability to (1) find corresponding citations for a given piece of legal analysis and to (2) compile the text of these citations (as well as previous context) into a cogent analysis that supports a reasoning goal. We benchmark state-of-the-art models on CLERC, showing that current approaches still struggle: GPT-4o generates analyses with the highest ROUGE F-scores but hallucinates the most, while zero-shot IR models only achieve 48.3% recall@1000.
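
For readers unfamiliar with the recall@1000 figure quoted above, the metric can be sketched as follows. This is the standard definition of recall@k, not code from the paper; the document identifiers and the way queries are paired with relevant cases are hypothetical.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k ranked results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (ranking, relevant set) pairs, one pair per query."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Toy query: the retriever returns four cases, two cases are actually cited.
ranked = ["c3", "c7", "c1", "c9"]
relevant = {"c1", "c4"}

print(recall_at_k(ranked, relevant, k=2))  # no cited case in the top 2
print(recall_at_k(ranked, relevant, k=3))  # one of the two cited cases found
```

A recall@1000 of 48.3% therefore means that, on average, fewer than half of the relevant passages show up even among the first 1000 retrieved results.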

Reframing Tax Law Entailment as Analogical Reasoning

Jan 12, 2024

Xinrui Zou, Ming Zhang, Nathaniel Weir, Benjamin Van Durme, Nils Holzenberger

Abstract: Statutory reasoning refers to the application of legislative provisions to a series of case facts described in natural language. We re-frame statutory reasoning as an analogy task, where each instance of the analogy task involves a combination of two instances of statutory reasoning. This increases the dataset size by two orders of magnitude, and introduces an element of interpretability. We show that this task is roughly as difficult to Natural Language Processing models as the original task. Finally, we come back to statutory reasoning, solving it with a combination of a retrieval mechanism and analogy models, and showing some progress on prior comparable work.
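
Why combining two statutory-reasoning instances enlarges a dataset so quickly can be seen in a small sketch: n cases yield n(n-1)/2 unordered pairs. The `Case` type, the pairing over all unordered pairs, and the "same entailment label" target are assumptions for illustration; the abstract does not specify how the paper forms or labels its pairs.

```python
from itertools import combinations
from typing import NamedTuple


class Case(NamedTuple):
    """One statutory-reasoning instance (hypothetical, simplified)."""
    case_id: str
    entailed: bool  # whether the statute applies to the described facts


def build_analogy_pairs(cases: list[Case]) -> list[tuple[str, str, bool]]:
    """Pair every two cases; the assumed label is whether their outcomes agree."""
    return [(a.case_id, b.case_id, a.entailed == b.entailed)
            for a, b in combinations(cases, 2)]


cases = [Case("c1", True), Case("c2", False), Case("c3", True), Case("c4", False)]
pairs = build_analogy_pairs(cases)

print(len(cases), "cases ->", len(pairs), "pairs")
print(pairs[:2])
```

Because the pair count grows quadratically, a dataset of a few hundred cases already produces tens of thousands of pairs under this assumed scheme.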

BLT: Can Large Language Models Handle Basic Legal Text?

Nov 16, 2023

Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme

Abstract: We find that the best publicly available LLMs like GPT-4 and PaLM 2 currently perform poorly at basic text handling required of lawyers or paralegals, such as looking up the text at a line of a witness deposition or at a subsection of a contract. We introduce a benchmark to quantify this poor performance, which casts into doubt LLMs' current reliability as-is for legal practice. Finetuning for these tasks brings an older LLM to near-perfect performance on our test set and also raises performance on a related legal task. This stark result highlights the need for more domain expertise in LLM training.

OpenAI Cribbed Our Tax Example, But Can GPT-4 Really Do Tax?

Sep 15, 2023

Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme

Abstract: The authors explain where OpenAI got the tax law example in its livestream demonstration of GPT-4, why GPT-4 got the wrong answer, and how it fails to reliably calculate taxes.

* 180 TAX NOTES FEDERAL 1101 (AUG. 14, 2023)
* 5 pages

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Aug 20, 2023

Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore (+30 more)

Abstract: The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning -- which distinguish between its many forms -- correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.

* 143 pages, 79 tables, 4 figures

Can GPT-3 Perform Statutory Reasoning?

Feb 13, 2023

Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme

Abstract: Statutory reasoning is the task of reasoning with facts and statutes, which are rules written in natural language by a legislature. It is a basic legal skill. In this paper we explore the capabilities of the most capable GPT-3 model, text-davinci-003, on an established statutory-reasoning dataset called SARA. We consider a variety of approaches, including dynamic few-shot prompting, chain-of-thought prompting, and zero-shot prompting. While we achieve results with GPT-3 that are better than the previous best published results, we also identify several types of clear errors it makes. In investigating why these happen, we discover that GPT-3 has imperfect prior knowledge of the actual U.S. statutes on which SARA is based. More importantly, GPT-3 performs poorly at answering straightforward questions about simple synthetic statutes. By also posing the same questions when the synthetic statutes are written in sentence form, we find that some of GPT-3's poor performance results from difficulty in parsing the typical structure of statutes, containing subsections and paragraphs.

* 9 pages

Asking the Right Questions in Low Resource Template Extraction

May 25, 2022

Nils Holzenberger, Yunmo Chen, Benjamin Van Durme

Abstract: Information Extraction (IE) researchers are mapping tasks to Question Answering (QA) in order to leverage existing large QA resources, and thereby improve data efficiency. Especially in template extraction (TE), mapping an ontology to a set of questions can be more time-efficient than collecting labeled examples. We ask whether end users of TE systems can design these questions, and whether it is beneficial to involve an NLP practitioner in the process. We compare questions to other ways of phrasing natural language prompts for TE. We propose a novel model to perform TE with prompts, and find it benefits from questions over other styles of prompts, and that they do not require an NLP background to author.
