Alexandre Moucher

Detecting automatically the layout of clinical documents to enhance the performances of downstream natural language processing

May 23, 2023

Christel Gérardin, Perceval Wajsbürt, Basile Dura, Alice Calliger, Alexandre Moucher, Xavier Tannier, Romain Bey

Abstract: Objective: Develop and validate an algorithm for analyzing the layout of PDF clinical documents to improve the performance of downstream natural language processing tasks. Materials and Methods: We designed an algorithm to process clinical PDF documents and extract only clinically relevant text. The algorithm consists of several steps: initial text extraction using a PDF parser, followed by classification of the extracted lines into categories such as body text, left notes, and footers using a Transformer deep neural network architecture, and finally an aggregation step that compiles the lines sharing a given label into a single text. We evaluated the technical performance of the body text extraction algorithm by applying it to a random sample of annotated documents. Medical performance was evaluated by examining the extraction of medical concepts of interest from the text in their respective sections. Finally, we tested an end-to-end system on a medical use case: the automatic detection of acute infection described in the hospital report. Results: Our algorithm achieved per-line precision, recall, and F1 score of 98.4, 97.0, and 97.7, respectively, for body line extraction. When using the output of the advanced body extraction algorithm, the per-document precision, recall, and F1 score of the acute infection detection algorithm were 82.54 (95% CI 72.86-91.60), 85.24 (95% CI 76.61-93.70), and 83.87 (95% CI 76.92-90.08), respectively. Conclusion: We have developed and validated a system for extracting body text from clinical documents in PDF format by identifying their layout. We demonstrated that this preprocessing yields better performance on a common downstream task, i.e., the extraction of medical concepts in their respective sections, showing the value of this method on a clinical use case.
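
The aggregation step and the per-line metrics described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Line` class, the function names, and the toy labels are hypothetical, and the sketch assumes the line labels have already been produced by some classifier. The helper `f1_from` applies the standard harmonic-mean definition of F1, which can be used to check that the reported F1 scores agree with the reported precision and recall.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One extracted PDF line with its predicted layout label (hypothetical type)."""
    text: str
    label: str  # e.g. "body", "left_note", "footer"


def aggregate(lines, label="body"):
    """Join, in their original order, the lines that carry the given label."""
    return "\n".join(line.text for line in lines if line.label == label)


def precision_recall_f1(gold, pred, positive="body"):
    """Per-line precision, recall, and F1 for a single target label."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_from(precision, recall) if precision + recall else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy document: labels are assumed to come from an upstream classifier.
doc = [
    Line("Hospital X, Department Y", "footer"),
    Line("The patient was admitted for fever.", "body"),
    Line("Tel: 00 00 00 00", "left_note"),
    Line("Antibiotics were started on day 1.", "body"),
]
body_text = aggregate(doc)

# Consistency check on the figures reported in the abstract.
body_f1 = round(f1_from(98.4, 97.0), 1)
infection_f1 = round(f1_from(82.54, 85.24), 2)
```

With the reported precision and recall, `body_f1` and `infection_f1` round to 97.7 and 83.87, matching the F1 scores given in the abstract.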

* 22 pages, 5 figures
