Mian Zhong

The Effect of Document Selection on Query-focused Text Analysis

Apr 13, 2026

Sandesh S Rangreji, Mian Zhong, Anjalie Field

Abstract: Analyses of document collections often require selecting what data to analyze, as not all documents are relevant to a particular research question and computational constraints preclude analyzing all documents. Yet little work has examined the effects of selection strategy choices. We systematically evaluate seven selection methods (from random selection to hybrid retrieval) on outputs from four text analysis methods (LDA, BERTopic, TopicGPT, HiCode) over two datasets with 26 open-ended queries. Our evaluation yields practical guidance: semantic and hybrid retrieval offer strong go-to approaches that avoid the pitfalls of weaker selection strategies and the unnecessary compute overhead of more complicated ones. Overall, our evaluation framework establishes data selection as a methodological decision, rather than a practical necessity, inviting the development of new strategies.


Extracting Victim Counts from Text

Feb 23, 2023

Mian Zhong, Shehzaad Dhuliawala, Niklas Stoehr

Abstract: Decision-makers in the humanitarian sector rely on timely and exact information during crisis events. Knowing how many civilians were injured during an earthquake is vital to allocating aid properly. Information about such victim counts is often only available within full-text event descriptions from newspapers and other reports. Extracting numbers from text is challenging: numbers have different formats and may require numeric reasoning. This renders approaches based purely on string matching insufficient. As a consequence, fine-grained counts of injured, displaced, or abused victims beyond fatalities are often not extracted and remain unseen. We cast victim count extraction as a question answering (QA) task with a regression or classification objective. We compare regex, dependency parsing, semantic role labeling-based approaches, and advanced text-to-text models. Beyond model accuracy, we analyze extraction reliability and robustness, which are key for this sensitive task. In particular, we discuss model calibration and investigate few-shot and out-of-distribution performance. Ultimately, we make a comprehensive recommendation on which model to select for different desiderata and data domains. Our work is among the first to apply numeracy-focused large language models in a real-world use case with a positive impact.

* Long paper accepted at EACL 2023 main conference


Towards Automatic Bias Detection in Knowledge Graphs

Sep 19, 2021

Daphna Keidar, Mian Zhong, Ce Zhang, Yash Raj Shrestha, Bibek Paudel

Abstract: With the recent surge in social applications relying on knowledge graphs (KGs), the need for techniques to ensure fairness in KG-based methods is becoming increasingly evident. Previous works have demonstrated that KGs are prone to various social biases, and have proposed multiple methods for debiasing them. However, in such studies, the focus has been on debiasing techniques, while the relations to be debiased are specified manually by the user. As manual specification is itself susceptible to human cognitive bias, there is a need for a system capable of quantifying and exposing biases that can support more informed decisions on what to debias. To address this gap in the literature, we describe a framework for identifying biases present in knowledge graph embeddings, based on numerical bias metrics. We illustrate the framework with three different bias measures on the task of profession prediction, and it can be flexibly extended to further bias definitions and applications. The relations flagged as biased can then be handed to decision makers for judgement on subsequent debiasing.

* Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: Findings (EMNLP 2021), Nov 7-11, 2021
