Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jamshid Mozafari

It's High Time: A Survey of Temporal Information Retrieval and Question Answering

May 26, 2025

Bhawna Piryani, Abdelrahman Abdullah, Jamshid Mozafari, Avishek Anand, Adam Jatowt

Abstract:Time plays a critical role in how information is generated, retrieved, and interpreted. In this survey, we provide a comprehensive overview of Temporal Information Retrieval and Temporal Question Answering, two research areas aimed at handling and understanding time-sensitive information. As the amount of time-stamped content from sources like news articles, web archives, and knowledge bases increases, systems must address challenges such as detecting temporal intent, normalizing time expressions, ordering events, and reasoning over evolving or ambiguous facts. These challenges are critical across many dynamic and time-sensitive domains, from news and encyclopedias to science, history, and social media. We review both traditional approaches and modern neural methods, including those that use transformer models and Large Language Models (LLMs). We also review recent advances in temporal language modeling, multi-hop reasoning, and retrieval-augmented generation (RAG), alongside benchmark datasets and evaluation strategies that test temporal robustness, recency awareness, and generalization.

Via

Access Paper or Ask Questions

From Retrieval to Generation: Comparing Different Approaches

Feb 27, 2025

Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Mohammed Ali, Adam Jatowt

Abstract:Knowledge-intensive tasks, particularly open-domain question answering (ODQA), document reranking, and retrieval-augmented language modeling, require a balance between retrieval accuracy and generative flexibility. Traditional retrieval models such as BM25 and Dense Passage Retrieval (DPR), efficiently retrieve from large corpora but often lack semantic depth. Generative models like GPT-4-o provide richer contextual understanding but face challenges in maintaining factual consistency. In this work, we conduct a systematic evaluation of retrieval-based, generation-based, and hybrid models, with a primary focus on their performance in ODQA and related retrieval-augmented tasks. Our results show that dense retrievers, particularly DPR, achieve strong performance in ODQA with a top-1 accuracy of 50.17\% on NQ, while hybrid models improve nDCG@10 scores on BEIR from 43.42 (BM25) to 52.59, demonstrating their strength in document reranking. Additionally, we analyze language modeling tasks using WikiText-103, showing that retrieval-based approaches like BM25 achieve lower perplexity compared to generative and hybrid methods, highlighting their utility in retrieval-augmented generation. By providing detailed comparisons and practical insights into the conditions where each approach excels, we aim to facilitate future optimizations in retrieval, reranking, and generative models for ODQA and related knowledge-intensive applications.

* work on progress

Via

Access Paper or Ask Questions

MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts

Feb 24, 2025

Bhawna Piryani, Jamshid Mozafari, Abdelrahman Abdallah, Antoine Doucet, Adam Jatowt

Abstract:Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors -- imperfect extraction of the text, including character insertion, deletion and permutation -- can significantly impact downstream tasks like question-answering (QA). In this work, we introduce a multilingual QA dataset MultiOCR-QA, designed to analyze the effects of OCR noise on QA systems' performance. The MultiOCR-QA dataset comprises 60K question-answer pairs covering three languages, English, French, and German. The dataset is curated from OCR-ed old documents, allowing for the evaluation of OCR-induced challenges on question answering. We evaluate MultiOCR-QA on various levels and types of OCR errors to access the robustness of LLMs in handling real-world digitization errors. Our findings show that QA systems are highly prone to OCR induced errors and exhibit performance degradation on noisy OCR text.

Via

Access Paper or Ask Questions

Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores

Feb 22, 2025

Jamshid Mozafari, Abdelrahman Abdallah, Bhawna Piryani, Adam Jatowt

Abstract:Large Language Models (LLMs) are revolutionizing information retrieval, with chatbots becoming an important source for answering user queries. As by their design, LLMs prioritize generating correct answers, the value of highly plausible yet incorrect answers (candidate answers) tends to be overlooked. However, such answers can still prove useful, for example, they can play a crucial role in tasks like Multiple-Choice Question Answering (MCQA) and QA Robustness Assessment (QARA). Existing QA datasets primarily focus on correct answers without explicit consideration of the plausibility of other candidate answers, limiting opportunity for more nuanced evaluations of models. To address this gap, we introduce PlausibleQA, a large-scale dataset comprising 10,000 questions and 100,000 candidate answers, each annotated with plausibility scores and justifications for their selection. Additionally, the dataset includes 900,000 justifications for pairwise comparisons between candidate answers, further refining plausibility assessments. We evaluate PlausibleQA through human assessments and empirical experiments, demonstrating its utility in MCQA and QARA analysis. Our findings show that plausibility-aware approaches are effective for MCQA distractor generation and QARA. We release PlausibleQA as a resource for advancing QA research and enhancing LLM performance in distinguishing plausible distractors from correct answers.

* Submitted to SIGIR 2025

Via

Access Paper or Ask Questions

Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation

Feb 04, 2025

Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Mohammed Ali, Adam Jatowt

Abstract:Retrieval, re-ranking, and retrieval-augmented generation (RAG) are critical components of modern natural language processing (NLP) applications in information retrieval, question answering, and knowledge-based text generation. However, existing solutions are often fragmented, lacking a unified framework that easily integrates these essential processes. The absence of a standardized implementation, coupled with the complexity of retrieval and re-ranking workflows, makes it challenging for researchers to compare and evaluate different approaches in a consistent environment. While existing toolkits such as Rerankers and RankLLM provide general-purpose reranking pipelines, they often lack the flexibility required for fine-grained experimentation and benchmarking. In response to these challenges, we introduce \textbf{Rankify}, a powerful and modular open-source toolkit designed to unify retrieval, re-ranking, and RAG within a cohesive framework. Rankify supports a wide range of retrieval techniques, including dense and sparse retrievers, while incorporating state-of-the-art re-ranking models to enhance retrieval quality. Additionally, Rankify includes a collection of pre-retrieved datasets to facilitate benchmarking, available at Huggingface (https://huggingface.co/datasets/abdoelsayed/reranking-datasets). To encourage adoption and ease of integration, we provide comprehensive documentation (http://rankify.readthedocs.io/), an open-source implementation on GitHub(https://github.com/DataScienceUIBK/rankify), and a PyPI package for effortless installation(https://pypi.org/project/rankify/). By providing a unified and lightweight framework, Rankify allows researchers and practitioners to advance retrieval and re-ranking methodologies while ensuring consistency, scalability, and ease of use.

* Work in Progress

Via

Access Paper or Ask Questions

HintEval: A Comprehensive Framework for Hint Generation and Evaluation for Questions

Feb 02, 2025

Jamshid Mozafari, Bhawna Piryani, Abdelrahman Abdallah, Adam Jatowt

Abstract:Large Language Models (LLMs) are transforming how people find information, and many users turn nowadays to chatbots to obtain answers to their questions. Despite the instant access to abundant information that LLMs offer, it is still important to promote critical thinking and problem-solving skills. Automatic hint generation is a new task that aims to support humans in answering questions by themselves by creating hints that guide users toward answers without directly revealing them. In this context, hint evaluation focuses on measuring the quality of hints, helping to improve the hint generation approaches. However, resources for hint research are currently spanning different formats and datasets, while the evaluation tools are missing or incompatible, making it hard for researchers to compare and test their models. To overcome these challenges, we introduce HintEval, a Python library that makes it easy to access diverse datasets and provides multiple approaches to generate and evaluate hints. HintEval aggregates the scattered resources into a single toolkit that supports a range of research goals and enables a clear, multi-faceted, and reliable evaluation. The proposed library also includes detailed online documentation, helping users quickly explore its features and get started. By reducing barriers to entry and encouraging consistent evaluation practices, HintEval offers a major step forward for facilitating hint generation and analysis research within the NLP/IR community.

* Submitted to SIGIR 2025

Via

Access Paper or Ask Questions

ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval

Jan 25, 2025

Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Adam Jatowt

Figure 1 for ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval

Figure 2 for ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval

Figure 3 for ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval

Figure 4 for ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval

Abstract:Retrieval-Augmented Generation (RAG) models have drawn considerable attention in modern open-domain question answering. The effectiveness of RAG depends on the quality of the top retrieved documents. However, conventional retrieval methods sometimes fail to rank the most relevant documents at the top. In this paper, we introduce ASRank, a new re-ranking method based on scoring retrieved documents using zero-shot answer scent which relies on a pre-trained large language model to compute the likelihood of the document-derived answers aligning with the answer scent. Our approach demonstrates marked improvements across several datasets, including NQ, TriviaQA, WebQA, ArchivalQA, HotpotQA, and Entity Questions. Notably, ASRank increases Top-1 retrieval accuracy on NQ from $19.2\%$ to $46.5\%$ for MSS and $22.1\%$ to $47.3\%$ for BM25. It also shows strong retrieval performance on several datasets compared to state-of-the-art methods (47.3 Top-1 by ASRank vs 35.4 by UPR by BM25).

* Accepted At NAACL 2025

Via

Access Paper or Ask Questions

Using Large Language Models in Automatic Hint Ranking and Generation Tasks

Dec 02, 2024

Jamshid Mozafari, Florian Gerhold, Adam Jatowt

Abstract:The use of Large Language Models (LLMs) has increased significantly recently, with individuals frequently interacting with chatbots to receive answers to a wide range of questions. In an era where information is readily accessible, it is crucial to stimulate and preserve human cognitive abilities and maintain strong reasoning skills. This paper addresses such challenges by promoting the use of hints as an alternative or a supplement to direct answers. We first introduce a manually constructed hint dataset, WIKIHINT, which includes 5,000 hints created for 1,000 questions. We then finetune open-source LLMs such as LLaMA-3.1 for hint generation in answer-aware and answer-agnostic contexts. We assess the effectiveness of the hints with human participants who try to answer questions with and without the aid of hints. Additionally, we introduce a lightweight evaluation method, HINTRANK, to evaluate and rank hints in both answer-aware and answer-agnostic settings. Our findings show that (a) the dataset helps generate more effective hints, (b) including answer information along with questions generally improves hint quality, and (c) encoder-based models perform better than decoder-based models in hint ranking.

Via

Access Paper or Ask Questions

DynRank: Improving Passage Retrieval with Dynamic Zero-Shot Prompting Based on Question Classification

Nov 30, 2024

Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Mohammed M. Abdelgwad, Adam Jatowt

Figure 1 for DynRank: Improving Passage Retrieval with Dynamic Zero-Shot Prompting Based on Question Classification

Figure 2 for DynRank: Improving Passage Retrieval with Dynamic Zero-Shot Prompting Based on Question Classification

Figure 3 for DynRank: Improving Passage Retrieval with Dynamic Zero-Shot Prompting Based on Question Classification

Figure 4 for DynRank: Improving Passage Retrieval with Dynamic Zero-Shot Prompting Based on Question Classification

Abstract:This paper presents DynRank, a novel framework for enhancing passage retrieval in open-domain question-answering systems through dynamic zero-shot question classification. Traditional approaches rely on static prompts and pre-defined templates, which may limit model adaptability across different questions and contexts. In contrast, DynRank introduces a dynamic prompting mechanism, leveraging a pre-trained question classification model that categorizes questions into fine-grained types. Based on these classifications, contextually relevant prompts are generated, enabling more effective passage retrieval. We integrate DynRank into existing retrieval frameworks and conduct extensive experiments on multiple QA benchmark datasets.

* Accepted at Coling2025

Via

Access Paper or Ask Questions

Detecting Temporal Ambiguity in Questions

Sep 25, 2024

Bhawna Piryani, Abdelrahman Abdallah, Jamshid Mozafari, Adam Jatowt

Figure 1 for Detecting Temporal Ambiguity in Questions

Figure 2 for Detecting Temporal Ambiguity in Questions

Figure 3 for Detecting Temporal Ambiguity in Questions

Figure 4 for Detecting Temporal Ambiguity in Questions

Abstract:Detecting and answering ambiguous questions has been a challenging task in open-domain question answering. Ambiguous questions have different answers depending on their interpretation and can take diverse forms. Temporally ambiguous questions are one of the most common types of such questions. In this paper, we introduce TEMPAMBIQA, a manually annotated temporally ambiguous QA dataset consisting of 8,162 open-domain questions derived from existing datasets. Our annotations focus on capturing temporal ambiguity to study the task of detecting temporally ambiguous questions. We propose a novel approach by using diverse search strategies based on disambiguated versions of the questions. We also introduce and test non-search, competitive baselines for detecting temporal ambiguity using zero-shot and few-shot approaches.

* Accepted at EMNLP 2024 Findings

Via

Access Paper or Ask Questions