Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Derek Greene

PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy

May 28, 2025

Shuhao Guan, Moule Lin, Cheng Xu, Xinyi Liu, Jinman Zhao, Jiexin Fan, Qi Xu, Derek Greene

Abstract:This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to enhance both visual clarity and textual consistency, thereby improving text extraction from degraded historical documents. First, we synthesize document-image pairs from plaintext, rendering them with diverse fonts and layouts and then applying a randomly ordered set of degradation operations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-OCR model, fine-tuned on synthetic historical text pairs, addresses remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.

* ACL 2025 main

Via

Access Paper or Ask Questions

Combining Query Performance Predictors: A Reproducibility Study

Mar 31, 2025

Sourav Saha, Suchana Datta, Dwaipayan Roy, Mandar Mitra, Derek Greene

Abstract:A large number of approaches to Query Performance Prediction (QPP) have been proposed over the last two decades. As early as 2009, Hauff et al. [28] explored whether different QPP methods may be combined to improve prediction quality. Since then, significant research has been done both on QPP approaches, as well as their evaluation. This study revisits Hauff et al.s work to assess the reproducibility of their findings in the light of new prediction methods, evaluation metrics, and datasets. We expand the scope of the earlier investigation by: (i) considering post-retrieval methods, including supervised neural techniques (only pre-retrieval techniques were studied in [28]); (ii) using sMARE for evaluation, in addition to the traditional correlation coefficients and RMSE; and (iii) experimenting with additional datasets (Clueweb09B and TREC DL). Our results largely support previous claims, but we also present several interesting findings. We interpret these findings by taking a more nuanced look at the correlation between QPP methods, examining whether they capture diverse information or rely on overlapping factors.

Via

Access Paper or Ask Questions

Unveiling Temporal Trends in 19th Century Literature: An Information Retrieval Approach

Jan 12, 2025

Suchana Datta, Dwaipayan Roy, Derek Greene, Gerardine Meaney

Abstract:In English literature, the 19th century witnessed a significant transition in styles, themes, and genres. Consequently, the novels from this period display remarkable diversity. This paper explores these variations by examining the evolution of term usage in 19th century English novels through the lens of information retrieval. By applying a query expansion-based approach to a decade-segmented collection of fiction from the British Library, we examine how related terms vary over time. Our analysis employs multiple standard metrics including Kendall's tau, Jaccard similarity, and Jensen-Shannon divergence to assess overlaps and shifts in expanded query term sets. Our results indicate a significant degree of divergence in the related terms across decades as selected by the query expansion technique, suggesting substantial linguistic and conceptual changes throughout the 19th century novels.

* Accepted at JCDL 2024

Via

Access Paper or Ask Questions

Transformers4NewsRec: A Transformer-based News Recommendation Framework

Oct 17, 2024

Dairui Liu, Honghui Du, Boming Yang, Neil Hurley, Aonghus Lawlor, Irene Li, Derek Greene, Ruihai Dong

Figure 1 for Transformers4NewsRec: A Transformer-based News Recommendation Framework

Figure 2 for Transformers4NewsRec: A Transformer-based News Recommendation Framework

Figure 3 for Transformers4NewsRec: A Transformer-based News Recommendation Framework

Figure 4 for Transformers4NewsRec: A Transformer-based News Recommendation Framework

Abstract:Pre-trained transformer models have shown great promise in various natural language processing tasks, including personalized news recommendations. To harness the power of these models, we introduce Transformers4NewsRec, a new Python framework built on the \textbf{Transformers} library. This framework is designed to unify and compare the performance of various news recommendation models, including deep neural networks and graph-based models. Transformers4NewsRec offers flexibility in terms of model selection, data preprocessing, and evaluation, allowing both quantitative and qualitative analysis.

Via

Access Paper or Ask Questions

Advancing Post-OCR Correction: A Comparative Study of Synthetic Data

Aug 05, 2024

Shuhao Guan, Derek Greene

Abstract:This paper explores the application of synthetic data in the post-OCR domain on multiple fronts by conducting experiments to assess the impact of data volume, augmentation, and synthetic data generation methods on model performance. Furthermore, we introduce a novel algorithm that leverages computer vision feature detection algorithms to calculate glyph similarity for constructing post-OCR synthetic data. Through experiments conducted across a variety of languages, including several low-resource ones, we demonstrate that models like ByT5 can significantly reduce Character Error Rates (CER) without the need for manually annotated data, and our proposed synthetic data generation method shows advantages over traditional methods, particularly in low-resource languages.

* ACL 2024 findings

Via

Access Paper or Ask Questions

Benchmark Data Contamination of Large Language Models: A Survey

Jun 06, 2024

Cheng Xu, Shuhao Guan, Derek Greene, M-Tahar Kechadi

Abstract:The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed the field of natural language processing. However, it has also resulted in a significant issue known as Benchmark Data Contamination (BDC). This occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during the evaluation phase of the process. This paper reviews the complex challenge of BDC in LLM evaluation and explores alternative assessment methods to mitigate the risks associated with traditional benchmarks. The paper also examines challenges and future directions in mitigating BDC risks, highlighting the complexity of the issue and the need for innovative solutions to ensure the reliability of LLM evaluation in real-world applications.

* 31 pages, 7 figures, 3 tables

Via

Access Paper or Ask Questions

A Deep Learning Approach for Selective Relevance Feedback

Jan 20, 2024

Suchana Datta, Debasis Ganguly, Sean MacAvaney, Derek Greene

Figure 1 for A Deep Learning Approach for Selective Relevance Feedback

Figure 2 for A Deep Learning Approach for Selective Relevance Feedback

Figure 3 for A Deep Learning Approach for Selective Relevance Feedback

Figure 4 for A Deep Learning Approach for Selective Relevance Feedback

Abstract:Pseudo-relevance feedback (PRF) can enhance average retrieval effectiveness over a sufficiently large number of queries. However, PRF often introduces a drift into the original information need, thus hurting the retrieval effectiveness of several queries. While a selective application of PRF can potentially alleviate this issue, previous approaches have largely relied on unsupervised or feature-based learning to determine whether a query should be expanded. In contrast, we revisit the problem of selective PRF from a deep learning perspective, presenting a model that is entirely data-driven and trained in an end-to-end manner. The proposed model leverages a transformer-based bi-encoder architecture. Additionally, to further improve retrieval effectiveness with this selective PRF approach, we make use of the model's confidence estimates to combine the information from the original and expanded queries. In our experiments, we apply this selective feedback on a number of different combinations of ranking and feedback models, and show that our proposed approach consistently improves retrieval effectiveness for both sparse and dense ranking models, with the feedback models being either sparse, dense or generative.

Via

Access Paper or Ask Questions

RecPrompt: A Prompt Tuning Framework for News Recommendation Using Large Language Models

Dec 16, 2023

Dairui Liu, Boming Yang, Honghui Du, Derek Greene, Aonghus Lawlor, Ruihai Dong, Irene Li

Figure 1 for RecPrompt: A Prompt Tuning Framework for News Recommendation Using Large Language Models

Figure 2 for RecPrompt: A Prompt Tuning Framework for News Recommendation Using Large Language Models

Figure 3 for RecPrompt: A Prompt Tuning Framework for News Recommendation Using Large Language Models

Figure 4 for RecPrompt: A Prompt Tuning Framework for News Recommendation Using Large Language Models

Abstract:In the evolving field of personalized news recommendation, understanding the semantics of the underlying data is crucial. Large Language Models (LLMs) like GPT-4 have shown promising performance in understanding natural language. However, the extent of their applicability in news recommendation systems remains to be validated. This paper introduces RecPrompt, the first framework for news recommendation that leverages the capabilities of LLMs through prompt engineering. This system incorporates a prompt optimizer that applies an iterative bootstrapping process, enhancing the LLM-based recommender's ability to align news content with user preferences and interests more effectively. Moreover, this study offers insights into the effective use of LLMs in news recommendation, emphasizing both the advantages and the challenges of incorporating LLMs into recommendation systems.

* 8 pages, 3 figures, and 8 tables

Via

Access Paper or Ask Questions

The Role of Document Embedding in Research Paper Recommender Systems: To Breakdown or to Bolster Disciplinary Borders?

Sep 26, 2023

Eoghan Cunningham, Derek Greene, Barry Smyth

Figure 1 for The Role of Document Embedding in Research Paper Recommender Systems: To Breakdown or to Bolster Disciplinary Borders?

Figure 2 for The Role of Document Embedding in Research Paper Recommender Systems: To Breakdown or to Bolster Disciplinary Borders?

Figure 3 for The Role of Document Embedding in Research Paper Recommender Systems: To Breakdown or to Bolster Disciplinary Borders?

Figure 4 for The Role of Document Embedding in Research Paper Recommender Systems: To Breakdown or to Bolster Disciplinary Borders?

Abstract:In the extensive recommender systems literature, novelty and diversity have been identified as key properties of useful recommendations. However, these properties have received limited attention in the specific sub-field of research paper recommender systems. In this work, we argue for the importance of offering novel and diverse research paper recommendations to scientists. This approach aims to reduce siloed reading, break down filter bubbles, and promote interdisciplinary research. We propose a novel framework for evaluating the novelty and diversity of research paper recommendations that leverages methods from network analysis and natural language processing. Using this framework, we show that the choice of representational method within a larger research paper recommendation system can have a measurable impact on the nature of downstream recommendations, specifically on their novelty and diversity. We introduce a novel paper embedding method, which we demonstrate offers more innovative and diverse recommendations without sacrificing precision, compared to other state-of-the-art baselines.

* Under Review at Scientometrics

Via

Access Paper or Ask Questions

Curatr: A Platform for Semantic Analysis and Curation of Historical Literary Texts

Jun 13, 2023

Susan Leavy, Gerardine Meaney, Karen Wade, Derek Greene

Abstract:The increasing availability of digital collections of historical and contemporary literature presents a wealth of possibilities for new research in the humanities. The scale and diversity of such collections however, presents particular challenges in identifying and extracting relevant content. This paper presents Curatr, an online platform for the exploration and curation of literature with machine learning-supported semantic search, designed within the context of digital humanities scholarship. The platform provides a text mining workflow that combines neural word embeddings with expert domain knowledge to enable the generation of thematic lexicons, allowing researches to curate relevant sub-corpora from a large corpus of 18th and 19th century digitised texts.

* Metadata and Semantic Research (MTSR 2019), Communications in Computer and Information Science, vol 1057. Springer, Cham
* 12 pages

Via

Access Paper or Ask Questions