Barry Devereux

SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction

Dec 05, 2024

Ethan Bradley, Muhammad Roman, Karen Rafferty, Barry Devereux

Abstract: Table extraction from document images is a challenging AI problem, and labelled data for many content domains is difficult to come by. Existing table extraction datasets often focus on scientific tables due to the vast amount of academic articles that are readily available, along with their source code. However, there are significant layout and typographical differences between tables found across scientific, financial, and other domains. Current datasets often lack the words, and their positions, contained within the tables, instead relying on unreliable OCR to extract these features for training modern machine learning models on natural language processing tasks. Therefore, there is a need for a more general method of obtaining labelled data. We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our hope is that our method of generating these synthetic tables is transferable to other domains. To demonstrate the effectiveness of our dataset in training models to extract information from table images, we create FinTabQA, a layout large language model trained on an extractive question-answering task. We test our model using real-world financial tables and compare it to a state-of-the-art generative model and discuss the results. We make the dataset, model, and dataset generation code publicly available.

* 12 pages, 8 figures

QUB-Cirdan at "Discharge Me!": Zero shot discharge letter generation by open-source LLM

May 27, 2024

Rui Guo, Greg Farnan, Niall McLaughlin, Barry Devereux

Abstract: The BioNLP ACL'24 Shared Task on Streamlining Discharge Documentation aims to reduce the administrative burden on clinicians by automating the creation of critical sections of patient discharge letters. This paper presents our approach using the Llama3 8B quantized model to generate the "Brief Hospital Course" and "Discharge Instructions" sections. We employ a zero-shot method combined with Retrieval-Augmented Generation (RAG) to produce concise, contextually accurate summaries. Our contributions include the development of a curated template-based approach to ensure reliability and consistency, as well as the integration of RAG for word count prediction. We also describe several unsuccessful experiments to provide insights into our pathway for the competition. Our results demonstrate the effectiveness and efficiency of our approach, achieving high scores across multiple evaluation metrics.

Feature2Vec: Distributional semantic modelling of human property knowledge

Aug 29, 2019

Steven Derby, Paul Miller, Barry Devereux

Abstract: Feature norm datasets of human conceptual knowledge, collected in surveys of human volunteers, yield highly interpretable models of word meaning and play an important role in neurolinguistic research on semantic cognition. However, these datasets are limited in size due to practical obstacles associated with exhaustively listing properties for a large number of words. In contrast, the development of distributional modelling techniques and the availability of vast text corpora have allowed researchers to construct effective vector space models of word meaning over large lexicons. However, this comes at the cost of interpretable, human-like information about word meaning. We propose a method for mapping human property knowledge onto a distributional semantic space, which adapts the word2vec architecture to the task of modelling concept features. Our approach gives a measure of concept and feature affinity in a single semantic space, which makes for easy and efficient ranking of candidate human-derived semantic properties for arbitrary words. We compare our model with a previous approach, and show that it performs better on several evaluation tasks. Finally, we discuss how our method could be used to develop efficient sampling techniques to extend existing feature norm datasets in a reliable way.

* 7 pages, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019)

Using Sparse Semantic Embeddings Learned from Multimodal Text and Image Data to Model Human Conceptual Knowledge

Sep 18, 2018

Steven Derby, Paul Miller, Brian Murphy, Barry Devereux

Abstract: Distributional models provide a convenient way to model semantics using dense embedding spaces derived from unsupervised learning algorithms. However, the dimensions of dense embedding spaces are not designed to resemble human semantic knowledge. Moreover, embeddings are often built from a single source of information (typically text data), even though neurocognitive research suggests that semantics is deeply linked to both language and perception. In this paper, we combine multimodal information from both text and image-based representations derived from state-of-the-art distributional models to produce sparse, interpretable vectors using Joint Non-Negative Sparse Embedding. Through in-depth analyses comparing these sparse models to human-derived behavioural and neuroimaging data, we demonstrate their ability to predict interpretable linguistic descriptions of human ground-truth semantic knowledge.

* To appear in Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), Brussels, Belgium, October 31 - November 1, 2018
