Hugo JWL. Aerts

LongHealth: A Question Answering Benchmark with Long Clinical Documents

Jan 25, 2024

Lisa Adams, Felix Busch, Tianyu Han, Jean-Baptiste Excoffier, Matthieu Ortala, Alexander Löser, Hugo JWL. Aerts, Jakob Nikolas Kather, Daniel Truhn, Keno Bressem

Abstract: Background: Recent advancements in large language models (LLMs) offer potential benefits in healthcare, particularly in processing extensive patient records. However, existing benchmarks do not fully assess LLMs' capability in handling real-world, lengthy clinical data. Methods: We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases, with each case containing 5,090 to 6,754 words. The benchmark comprises 400 multiple-choice questions in three categories (information extraction, negation, and sorting) that challenge LLMs to extract and interpret information from large clinical documents. Results: We evaluated nine open-source LLMs with a minimum context length of 16,000 tokens and also included OpenAI's proprietary and cost-efficient GPT-3.5 Turbo for comparison. The highest accuracy was observed for Mixtral-8x7B-Instruct-v0.1, particularly in tasks focused on information retrieval from single and multiple patient documents. However, all models struggled significantly in tasks requiring the identification of missing information, highlighting a critical area for improvement in clinical data interpretation. Conclusion: While LLMs show considerable potential for processing long clinical documents, their current accuracy levels are insufficient for reliable clinical use, especially in scenarios requiring the identification of missing information. The LongHealth benchmark provides a more realistic assessment of LLMs in a healthcare setting and highlights the need for further model refinement for safe and effective clinical application. We make the benchmark and evaluation code publicly available.

* 11 pages, 3 figures, 5 tables

MEDBERT.de: A Comprehensive German BERT Model for the Medical Domain

Mar 24, 2023

Keno K. Bressem, Jens-Michalis Papaioannou, Paul Grundmann, Florian Borchert, Lisa C. Adams, Leonhard Liu, Felix Busch, Lina Xu, Jan P. Loyen, Stefan M. Niehues(+5 more)

Abstract: This paper presents medBERTde, a pre-trained German BERT model specifically designed for the German medical domain. The model has been trained on a large corpus of 4.7 million German medical documents and has been shown to achieve new state-of-the-art performance on eight different medical benchmarks covering a wide range of disciplines and medical document types. In addition to evaluating the overall performance of the model, this paper also conducts a more in-depth analysis of its capabilities. We investigate the impact of data deduplication on the model's performance, as well as the potential benefits of using more efficient tokenization methods. Our results indicate that domain-specific models such as medBERTde are particularly useful for longer texts, and that deduplication of training data does not necessarily lead to improved performance. Furthermore, we found that efficient tokenization plays only a minor role in improving model performance, and attribute most of the improved performance to the large amount of training data. To encourage further research, the pre-trained model weights and new benchmarks based on radiological data are made publicly available for use by the scientific community.

* Keno K. Bressem and Jens-Michalis Papaioannou and Paul Grundmann contributed equally

What Does DALL-E 2 Know About Radiology?

Sep 27, 2022

Lisa C. Adams, Felix Busch, Daniel Truhn, Marcus R. Makowski, Hugo JWL. Aerts, Keno K. Bressem

Abstract: Generative models such as DALL-E 2 could represent a promising future tool for image generation, augmentation, and manipulation for artificial intelligence research in radiology, provided that these models have sufficient medical domain knowledge. Here we show that DALL-E 2 has learned relevant representations of X-ray images, with promising capabilities in zero-shot text-to-image generation of new images, continuation of an image beyond its original boundaries, and removal of elements, while its ability to generate pathology or CT, MRI, and ultrasound images is still limited. The use of generative models for augmenting and generating radiological data thus seems feasible, even if further fine-tuning and adaptation of these models to the respective domain is required beforehand.

* 4 figures
