Jan-Micha Bodensohn

Document Structure in Long Document Transformers

Jan 31, 2024

Jan Buchmann, Max Eichler, Jan-Micha Bodensohn, Ilia Kuznetsov, Iryna Gurevych

Abstract: Long documents often exhibit structure with hierarchically organized elements of different functions, such as section headers and paragraphs. Despite the omnipresence of document structure, its role in natural language processing (NLP) remains opaque. Do long-document Transformer models acquire an internal representation of document structure during pre-training? How can structural information be communicated to a model after pre-training, and how does it influence downstream performance? To answer these questions, we develop a novel suite of probing tasks to assess structure-awareness of long-document Transformers, propose general-purpose structure infusion methods, and evaluate the effects of structure infusion on QASPER and Evidence Inference, two challenging long-document NLP tasks. Results on LED and LongT5 suggest that they acquire implicit understanding of document structure during pre-training, which can be further enhanced by structure infusion, leading to improved end-task performance. To foster research on the role of document structure in NLP modeling, we make our data and code publicly available.

* Accepted at EACL 2024. Code and data: http://github.com/UKPLab/eacl2024-doc-structure

ASET: Ad-hoc Structured Exploration of Text Collections

Mar 09, 2022

Benjamin Hättasch, Jan-Micha Bodensohn, Carsten Binnig

Abstract: In this paper, we propose ASET, a new system that allows users to perform structured explorations of text collections in an ad-hoc manner. The main idea of ASET is a new two-phase approach: it first extracts a superset of information nuggets from the texts using existing extractors such as named entity recognizers, and then uses embeddings to match the extractions to a structured table definition requested by the user. Our evaluation shows that ASET can thus extract high-quality structured data from real-world text collections without the need to design extraction pipelines upfront.

* Accepted at the 3rd International Workshop on Applied AI for Database Systems and Applications (AIDB'21), August 20, 2021, Copenhagen, Denmark
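
The two-phase idea in the ASET abstract (extract a superset of nuggets, then match them to requested table columns via embeddings) can be illustrated with a toy sketch. This is not ASET's implementation: the regex extractors stand in for real named entity recognizers, the character-trigram `embed` function stands in for learned embeddings, and all names here (`EXTRACTORS`, `extract_nuggets`, `embed`, `fill_row`) are hypothetical.

```python
import math
import re
from collections import Counter

# Phase 1 stand-in (assumption): tiny regex "extractors" instead of real NER models.
EXTRACTORS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "money": re.compile(r"\$\d+(?:\.\d+)?"),
}


def extract_nuggets(text):
    """Return every (label, value) pair that any extractor finds in the text."""
    nuggets = []
    for label, pattern in EXTRACTORS.items():
        for match in pattern.finditer(text):
            nuggets.append((label, match.group()))
    return nuggets


def embed(text):
    """Toy embedding (assumption): counts of character trigrams of the text."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def fill_row(text, columns):
    """Phase 2: per requested column, pick the nugget whose label is most similar."""
    nuggets = extract_nuggets(text)
    row = {}
    for column in columns:
        col_vec = embed(column)
        best = max(nuggets, key=lambda n: cosine(embed(n[0]), col_vec), default=None)
        # Leave the cell empty when no nugget has any similarity to the column.
        if best is not None and cosine(embed(best[0]), col_vec) > 0:
            row[column] = best[1]
        else:
            row[column] = None
    return row


if __name__ == "__main__":
    doc = "Invoice issued on 2021-08-20 for $45.50."
    print(fill_row(doc, ["invoice date", "money paid", "author"]))
```

The point of the split is that phase 1 runs once without knowing the target table, while phase 2 can be repeated for any column set the user requests afterwards.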
