Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rajiv Jain

LegalCore: A Dataset for Legal Documents Event Coreference Resolution

Feb 18, 2025

Kangda Wei, Xi Shi, Jonathan Tong, Sai Ramana Reddy, Anandhavelu Natarajan, Rajiv Jain, Aparna Garimella, Ruihong Huang

Abstract:Recognizing events and their coreferential mentions in a document is essential for understanding semantic meanings of text. The existing research on event coreference resolution is mostly limited to news articles. In this paper, we present the first dataset for the legal domain, LegalCore, which has been annotated with comprehensive event and event coreference information. The legal contract documents we annotated in this dataset are several times longer than news articles, with an average length of around 25k tokens per document. The annotations show that legal documents have dense event mentions and feature both short-distance and super long-distance coreference links between event mentions. We further benchmark mainstream Large Language Models (LLMs) on this dataset for both event detection and event coreference resolution tasks, and find that this dataset poses significant challenges for state-of-the-art open-source and proprietary LLMs, which perform significantly worse than a supervised baseline. We will publish the dataset as well as the code.

Via

Access Paper or Ask Questions

DocEdit-v2: Document Structure Editing Via Multimodal LLM Grounding

Oct 21, 2024

Manan Suri, Puneet Mathur, Franck Dernoncourt, Rajiv Jain, Vlad I Morariu, Ramit Sawhney, Preslav Nakov, Dinesh Manocha

Figure 1 for DocEdit-v2: Document Structure Editing Via Multimodal LLM Grounding

Figure 2 for DocEdit-v2: Document Structure Editing Via Multimodal LLM Grounding

Figure 3 for DocEdit-v2: Document Structure Editing Via Multimodal LLM Grounding

Figure 4 for DocEdit-v2: Document Structure Editing Via Multimodal LLM Grounding

Abstract:Document structure editing involves manipulating localized textual, visual, and layout components in document images based on the user's requests. Past works have shown that multimodal grounding of user requests in the document image and identifying the accurate structural components and their associated attributes remain key challenges for this task. To address these, we introduce the DocEdit-v2, a novel framework that performs end-to-end document editing by leveraging Large Multimodal Models (LMMs). It consists of three novel components: (1) Doc2Command, which simultaneously localizes edit regions of interest (RoI) and disambiguates user edit requests into edit commands; (2) LLM-based Command Reformulation prompting to tailor edit commands originally intended for specialized software into edit instructions suitable for generalist LMMs. (3) Moreover, DocEdit-v2 processes these outputs via Large Multimodal Models like GPT-4V and Gemini, to parse the document layout, execute edits on grounded Region of Interest (RoI), and generate the edited document image. Extensive experiments on the DocEdit dataset show that DocEdit-v2 significantly outperforms strong baselines on edit command generation (2-33%), RoI bounding box detection (12-31%), and overall document editing (1-12\%) tasks.

* EMNLP 2024 (Main)

Via

Access Paper or Ask Questions

DocSynthv2: A Practical Autoregressive Modeling for Document Generation

Jun 12, 2024

Sanket Biswas, Rajiv Jain, Vlad I. Morariu, Jiuxiang Gu, Puneet Mathur, Curtis Wigington, Tong Sun, Josep Lladós

Figure 1 for DocSynthv2: A Practical Autoregressive Modeling for Document Generation

Figure 2 for DocSynthv2: A Practical Autoregressive Modeling for Document Generation

Figure 3 for DocSynthv2: A Practical Autoregressive Modeling for Document Generation

Figure 4 for DocSynthv2: A Practical Autoregressive Modeling for Document Generation

Abstract:While the generation of document layouts has been extensively explored, comprehensive document generation encompassing both layout and content presents a more complex challenge. This paper delves into this advanced domain, proposing a novel approach called DocSynthv2 through the development of a simple yet effective autoregressive structured model. Our model, distinct in its integration of both layout and textual cues, marks a step beyond existing layout-generation approaches. By focusing on the relationship between the structural elements and the textual content within documents, we aim to generate cohesive and contextually relevant documents without any reliance on visual components. Through experimental studies on our curated benchmark for the new task, we demonstrate the ability of our model combining layout and textual information in enhancing the generation quality and relevance of documents, opening new pathways for research in document creation and automated design. Our findings emphasize the effectiveness of autoregressive models in handling complex document generation tasks.

* Spotlight (Oral) Acceptance to CVPR 2024 Workshop for Graphic Design Understanding and Generation (GDUG)

Via

Access Paper or Ask Questions

Chain of Logic: Rule-Based Reasoning with Large Language Models

Feb 23, 2024

Sergio Servantez, Joe Barrow, Kristian Hammond, Rajiv Jain

Figure 1 for Chain of Logic: Rule-Based Reasoning with Large Language Models

Figure 2 for Chain of Logic: Rule-Based Reasoning with Large Language Models

Figure 3 for Chain of Logic: Rule-Based Reasoning with Large Language Models

Figure 4 for Chain of Logic: Rule-Based Reasoning with Large Language Models

Abstract:Rule-based reasoning, a fundamental type of legal reasoning, enables us to draw conclusions by accurately applying a rule to a set of facts. We explore causal language models as rule-based reasoners, specifically with respect to compositional rules - rules consisting of multiple elements which form a complex logical expression. Reasoning about compositional rules is challenging because it requires multiple reasoning steps, and attending to the logical relationships between elements. We introduce a new prompting method, Chain of Logic, which elicits rule-based reasoning through decomposition (solving elements as independent threads of logic), and recomposition (recombining these sub-answers to resolve the underlying logical expression). This method was inspired by the IRAC (Issue, Rule, Application, Conclusion) framework, a sequential reasoning approach used by lawyers. We evaluate chain of logic across eight rule-based reasoning tasks involving three distinct compositional rules from the LegalBench benchmark and demonstrate it consistently outperforms other prompting methods, including chain of thought and self-ask, using open-source and commercial language models.

Via

Access Paper or Ask Questions

Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances

Oct 25, 2023

Zhendong Chu, Ruiyi Zhang, Tong Yu, Rajiv Jain, Vlad I Morariu, Jiuxiang Gu, Ani Nenkova

Figure 1 for Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances

Figure 2 for Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances

Figure 3 for Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances

Figure 4 for Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances

Abstract:To achieve state-of-the-art performance, one still needs to train NER models on large-scale, high-quality annotated data, an asset that is both costly and time-intensive to accumulate. In contrast, real-world applications often resort to massive low-quality labeled data through non-expert annotators via crowdsourcing and external knowledge bases via distant supervision as a cost-effective alternative. However, these annotation methods result in noisy labels, which in turn lead to a notable decline in performance. Hence, we propose to denoise the noisy NER data with guidance from a small set of clean instances. Along with the main NER model we train a discriminator model and use its outputs to recalibrate the sample weights. The discriminator is capable of detecting both span and category errors with different discriminative prompts. Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set.

* 14 pages

Via

Access Paper or Ask Questions

User-Entity Differential Privacy in Learning Natural Language Models

Nov 09, 2022

Phung Lai, NhatHai Phan, Tong Sun, Rajiv Jain, Franck Dernoncourt, Jiuxiang Gu, Nikolaos Barmpalios

Figure 1 for User-Entity Differential Privacy in Learning Natural Language Models

Figure 2 for User-Entity Differential Privacy in Learning Natural Language Models

Figure 3 for User-Entity Differential Privacy in Learning Natural Language Models

Figure 4 for User-Entity Differential Privacy in Learning Natural Language Models

Abstract:In this paper, we introduce a novel concept of user-entity differential privacy (UeDP) to provide formal privacy protection simultaneously to both sensitive entities in textual data and data owners in learning natural language models (NLMs). To preserve UeDP, we developed a novel algorithm, called UeDP-Alg, optimizing the trade-off between privacy loss and model utility with a tight sensitivity bound derived from seamlessly combining user and sensitive entity sampling processes. An extensive theoretical analysis and evaluation show that our UeDP-Alg outperforms baseline approaches in model utility under the same privacy budget consumption on several NLM tasks, using benchmark datasets.

* Accepted at IEEE BigData 2022

Via

Access Paper or Ask Questions

Certified Neural Network Watermarks with Randomized Smoothing

Jul 16, 2022

Arpit Bansal, Ping-yeh Chiang, Michael Curry, Rajiv Jain, Curtis Wigington, Varun Manjunatha, John P Dickerson, Tom Goldstein

Figure 1 for Certified Neural Network Watermarks with Randomized Smoothing

Figure 2 for Certified Neural Network Watermarks with Randomized Smoothing

Figure 3 for Certified Neural Network Watermarks with Randomized Smoothing

Figure 4 for Certified Neural Network Watermarks with Randomized Smoothing

Abstract:Watermarking is a commonly used strategy to protect creators' rights to digital images, videos and audio. Recently, watermarking methods have been extended to deep learning models -- in principle, the watermark should be preserved when an adversary tries to copy the model. However, in practice, watermarks can often be removed by an intelligent adversary. Several papers have proposed watermarking methods that claim to be empirically resistant to different types of removal attacks, but these new techniques often fail in the face of new or better-tuned adversaries. In this paper, we propose a certifiable watermarking method. Using the randomized smoothing technique proposed in Chiang et al., we show that our watermark is guaranteed to be unremovable unless the model parameters are changed by more than a certain l2 threshold. In addition to being certifiable, our watermark is also empirically more robust compared to previous watermarking methods. Our experiments can be reproduced with code at https://github.com/arpitbansal297/Certified_Watermarks

* ICML 2022
* ICML 2022

Via

Access Paper or Ask Questions

Unified Pretraining Framework for Document Understanding

Apr 28, 2022

Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Nikolaos Barmpalios, Rajiv Jain, Ani Nenkova, Tong Sun

Figure 1 for Unified Pretraining Framework for Document Understanding

Figure 2 for Unified Pretraining Framework for Document Understanding

Figure 3 for Unified Pretraining Framework for Document Understanding

Figure 4 for Unified Pretraining Framework for Document Understanding

Abstract:Document intelligence automates the extraction of information from documents and supports many business applications. Recent self-supervised learning methods on large-scale unlabeled document datasets have opened up promising directions towards reducing annotation efforts by training models with self-supervised objectives. However, most of the existing document pretraining methods are still language-dominated. We present UDoc, a new unified pretraining framework for document understanding. UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input. Each input element is composed of words and visual features from a semantic region of the input document image. An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses, encouraging the representation to model sentences, learn similarities, and align modalities. Extensive empirical analysis demonstrates that the pretraining procedure learns better joint representations and leads to improvements in downstream tasks.

* 12 pages, 4 figures, NeurIPS 2021 (Updated Camera Ready)

Via

Access Paper or Ask Questions

MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction

Feb 19, 2022

Amir Pouran Ben Veyseh, Nicole Meister, Seunghyun Yoon, Rajiv Jain, Franck Dernoncourt, Thien Huu Nguyen

Figure 1 for MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction

Figure 2 for MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction

Abstract:Acronym extraction is the task of identifying acronyms and their expanded forms in texts that is necessary for various NLP applications. Despite major progress for this task in recent years, one limitation of existing AE research is that they are limited to the English language and certain domains (i.e., scientific and biomedical). As such, challenges of AE in other languages and domains is mainly unexplored. Lacking annotated datasets in multiple languages and domains has been a major issue to hinder research in this area. To address this limitation, we propose a new dataset for multilingual multi-domain AE. Specifically, 27,200 sentences in 6 typologically different languages and 2 domains, i.e., Legal and Scientific, is manually annotated for AE. Our extensive experiments on the proposed dataset show that AE in different languages and different learning settings has unique challenges, emphasizing the necessity of further research on multilingual and multi-domain AE.

Via

Access Paper or Ask Questions

CLAUSEREC: A Clause Recommendation Framework for AI-aided Contract Authoring

Oct 26, 2021

Vinay Aggarwal, Aparna Garimella, Balaji Vasan Srinivasan, Anandhavelu N, Rajiv Jain

Figure 1 for CLAUSEREC: A Clause Recommendation Framework for AI-aided Contract Authoring

Figure 2 for CLAUSEREC: A Clause Recommendation Framework for AI-aided Contract Authoring

Figure 3 for CLAUSEREC: A Clause Recommendation Framework for AI-aided Contract Authoring

Figure 4 for CLAUSEREC: A Clause Recommendation Framework for AI-aided Contract Authoring

Abstract:Contracts are a common type of legal document that frequent in several day-to-day business workflows. However, there has been very limited NLP research in processing such documents, and even lesser in generating them. These contracts are made up of clauses, and the unique nature of these clauses calls for specific methods to understand and generate such documents. In this paper, we introduce the task of clause recommendation, asa first step to aid and accelerate the author-ing of contract documents. We propose a two-staged pipeline to first predict if a specific clause type is relevant to be added in a contract, and then recommend the top clauses for the given type based on the contract context. We pretrain BERT on an existing library of clauses with two additional tasks and use it for our prediction and recommendation. We experiment with classification methods and similarity-based heuristics for clause relevance prediction, and generation-based methods for clause recommendation, and evaluate the results from various methods on several clause types. We provide analyses on the results, and further outline the advantages and limitations of the various methods for this line of research.

Via

Access Paper or Ask Questions