Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sara Lafia

DataChat: Prototyping a Conversational Agent for Dataset Search and Visualization

May 26, 2023

Lizhou Fan, Sara Lafia, Lingyao Li, Fangyuan Yang, Libby Hemphill

Abstract:Data users need relevant context and research expertise to effectively search for and identify relevant datasets. Leading data providers, such as the Inter-university Consortium for Political and Social Research (ICPSR), offer standardized metadata and search tools to support data search. Metadata standards emphasize the machine-readability of data and its documentation. There are opportunities to enhance dataset search by improving users' ability to learn about, and make sense of, information about data. Prior research has shown that context and expertise are two main barriers users face in effectively searching for, evaluating, and deciding whether to reuse data. In this paper, we propose a novel chatbot-based search system, DataChat, that leverages a graph database and a large language model to provide novel ways for users to interact with and search for research data. DataChat complements data archives' and institutional repositories' ongoing efforts to curate, preserve, and share research data for reuse by making it easier for users to explore and learn about available research data.

* 6 pages, 2 figures, and 1 table. Accepted to the 86th Annual Meeting of the Association for Information Science & Technology

Via

Access Paper or Ask Questions

A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature

May 23, 2022

Sara Lafia, Lizhou Fan, Libby Hemphill

Figure 1 for A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature

Figure 2 for A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature

Figure 3 for A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature

Figure 4 for A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature

Abstract:Discovering authoritative links between publications and the datasets that they use can be a labor-intensive process. We introduce a natural language processing pipeline that retrieves and reviews publications for informal references to research datasets, which complements the work of data librarians. We first describe the components of the pipeline and then apply it to expand an authoritative bibliography linking thousands of social science studies to the data-related publications in which they are used. The pipeline increases recall for literature to review for inclusion in data-related collections of publications and makes it possible to detect informal data references at scale. We contribute (1) a novel Named Entity Recognition (NER) model that reliably detects informal data references and (2) a dataset connecting items from social science literature with datasets they reference. Together, these contributions enable future work on data reference, data citation networks, and data reuse.

* 13 pages, 7 figures, 3 tables

Via

Access Paper or Ask Questions

Librarian-in-the-Loop: A Natural Language Processing Paradigm for Detecting Informal Mentions of Research Data in Academic Literature

Mar 10, 2022

Lizhou Fan, Sara Lafia, David Bleckley, Elizabeth Moss, Andrea Thomer, Libby Hemphill

Figure 1 for Librarian-in-the-Loop: A Natural Language Processing Paradigm for Detecting Informal Mentions of Research Data in Academic Literature

Abstract:Data citations provide a foundation for studying research data impact. Collecting and managing data citations is a new frontier in archival science and scholarly communication. However, the discovery and curation of research data citations is labor intensive. Data citations that reference unique identifiers (i.e. DOIs) are readily findable; however, informal mentions made to research data are more challenging to infer. We propose a natural language processing (NLP) paradigm to support the human task of identifying informal mentions made to research datasets. The work of discovering informal data mentions is currently performed by librarians and their staff in the Inter-university Consortium for Political and Social Research (ICPSR), a large social science data archive that maintains a large bibliography of data-related literature. The NLP model is bootstrapped from data citations actively collected by librarians at ICPSR. The model combines pattern matching with multiple iterations of human annotations to learn additional rules for detecting informal data mentions. These examples are then used to train an NLP pipeline. The librarian-in-the-loop paradigm is centered in the data work performed by ICPSR librarians, supporting broader efforts to build a more comprehensive bibliography of data-related literature that reflects the scholarly communities of research data users.

Via

Access Paper or Ask Questions

Leveraging Machine Learning to Detect Data Curation Activities

Apr 30, 2021

Sara Lafia, Andrea Thomer, David Bleckley, Dharma Akmon, Libby Hemphill

Figure 1 for Leveraging Machine Learning to Detect Data Curation Activities

Figure 2 for Leveraging Machine Learning to Detect Data Curation Activities

Figure 3 for Leveraging Machine Learning to Detect Data Curation Activities

Figure 4 for Leveraging Machine Learning to Detect Data Curation Activities

Abstract:This paper describes a machine learning approach for annotating and analyzing data curation work logs at ICPSR, a large social sciences data archive. The systems we studied track curation work and coordinate team decision-making at ICPSR. Repository staff use these systems to organize, prioritize, and document curation work done on datasets, making them promising resources for studying curation work and its impact on data reuse, especially in combination with data usage analytics. A key challenge, however, is classifying similar activities so that they can be measured and associated with impact metrics. This paper contributes: 1) a schema of data curation activities; 2) a computational model for identifying curation actions in work log descriptions; and 3) an analysis of frequent data curation activities at ICPSR over time. We first propose a schema of data curation actions to help us analyze the impact of curation work. We then use this schema to annotate a set of data curation logs, which contain records of data transformations and project management decisions completed by repository staff. Finally, we train a text classifier to detect the frequency of curation actions in a large set of work logs. Our approach supports the analysis of curation work documented in work log systems as an important step toward studying the relationship between research data curation and data reuse.

* 10 pages, 4 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Via

Access Paper or Ask Questions