Abstract:Searching for information on the internet and digital platforms to satisfy an information need requires effective retrieval solutions. However, such solutions are not yet available for Tetun, making it challenging to find relevant documents for text-based search queries in this language. To address this challenge, this study investigates Tetun text retrieval with a focus on the ad-hoc retrieval task. It begins by developing essential language resources -- including a list of stopwords, a stemmer, and a test collection -- which serve as foundational components for solutions tailored to Tetun text retrieval. Various strategies are then explored, using both document titles and content, to evaluate retrieval effectiveness. The results show that searching over document titles, after removing hyphens and apostrophes and without applying stemming, significantly improves retrieval performance compared to the baseline. Efficiency increases by 31.37%, while effectiveness achieves an average gain of 9.40% in MAP@10 and 30.35% in nDCG@10 with DFR BM25. Beyond the top-10 cutoff point, Hiemstra LM demonstrates strong performance across various retrieval strategies and evaluation metrics. Contributions of this work include the development of Labadain-Stopwords (a list of 160 Tetun stopwords), Labadain-Stemmer (a Tetun stemmer with three variants), and Labadain-Avaliadór (a Tetun test collection containing 59 topics, 33,550 documents, and 5,900 qrels).
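As a minimal sketch of the best-performing preprocessing strategy described above (the exact normalization rules are an assumption; the paper may delete the characters or replace them with spaces), the title cleanup could look like this in Python:

    import re

    def normalize_title(title: str) -> str:
        # Drop hyphens and apostrophes (Tetun uses both, e.g. "loro-sa'e"),
        # lowercase, and collapse whitespace; no stemming is applied.
        title = re.sub(r"[-'\u2019]", "", title)
        return re.sub(r"\s+", " ", title).lower().strip()

    print(normalize_title("Timor loro-sa'e"))  # -> "timor lorosae"

The normalized titles would then be indexed as usual with a ranking model such as DFR BM25.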
Abstract:The Cranfield paradigm has served as a foundational approach for developing test collections, with relevance judgments typically conducted by human assessors. However, the emergence of large language models (LLMs) has introduced new possibilities for automating these tasks. This paper explores the feasibility of using LLMs to automate relevance assessments, particularly within the context of low-resource languages. In our study, LLMs are employed to automate relevance judgment tasks by providing a series of query-document pairs in Tetun as the input text. The models are tasked with assigning a relevance score to each pair; these scores are then compared to those from human annotators to evaluate inter-annotator agreement levels. Our investigation reveals results that align closely with those reported in studies of high-resource languages.
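To make the evaluation setup concrete, here is a hedged Python sketch of the agreement computation; llm_judge is a hypothetical stand-in for the actual LLM call, and the 0-2 grading scale is an assumption rather than the paper's exact protocol:

    from sklearn.metrics import cohen_kappa_score

    def llm_judge(query: str, document: str) -> int:
        # Hypothetical placeholder for an LLM API call that returns a
        # relevance grade (e.g. 0 = not relevant, 1 = partial, 2 = relevant).
        raise NotImplementedError

    def agreement(pairs, human_labels):
        # Compare LLM-assigned grades with human grades; quadratic-weighted
        # kappa is a common choice for ordinal relevance judgments.
        llm_labels = [llm_judge(q, d) for q, d in pairs]
        return cohen_kappa_score(human_labels, llm_labels, weights="quadratic")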
Abstract:Recent advances in natural language processing (NLP) are linked to training processes that require vast amounts of corpora. Access to this data is commonly non-trivial due to resource dispersion and the need to maintain these infrastructures online and up-to-date. New developments in NLP are often compromised by the scarcity of data or the lack of a shared repository that serves as an entry point for the community. This is especially true for low- and mid-resource languages, such as Portuguese, which lack data and proper resource management infrastructures. In this work, we propose PT-Pump-Up, a set of tools that aims to reduce resource dispersion and improve the accessibility of Portuguese NLP resources. Our proposal is divided into four software components: a) a web platform to list the available resources; b) a client-side Python package to simplify the loading of Portuguese NLP resources; c) an administrative Python package to manage the platform; and d) a public GitHub repository to foster future collaboration and contributions. All four components are accessible at https://linktr.ee/pt_pump_up
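Purely as an illustration of the intended workflow (these names are hypothetical and do not reflect the actual PT-Pump-Up client API), a client-side loader for such a resource catalogue might look like:

    import json
    import urllib.request

    # Hypothetical catalogue endpoint; the real platform's API is not
    # described in the abstract.
    CATALOGUE_URL = "https://example.org/pt-nlp/catalogue.json"

    def list_resources() -> list:
        # Fetch the list of registered Portuguese NLP resources.
        with urllib.request.urlopen(CATALOGUE_URL) as response:
            return json.load(response)

    def load_resource(name: str) -> str:
        # Resolve a resource by name and return its download location.
        for entry in list_resources():
            if entry["name"] == name:
                return entry["url"]
        raise KeyError(f"unknown resource: {name}")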
Abstract:The hypergraph-of-entity was conceptually proposed as a general model for entity-oriented search. However, only its performance for ad hoc document retrieval had been assessed. We continue this line of research by also evaluating ad hoc entity retrieval and entity list completion. We also attempt to scale the model so that it can support the complete INEX 2009 Wikipedia collection. We do this by indexing only the top keywords for each document, reducing complexity by lowering the number of nodes and, indirectly, the number of hyperedges linking terms to entities. This enables us to compare the effectiveness of the hypergraph-of-entity with the results obtained by the participants of the INEX tracks for the considered tasks. We find it to be a viable model that is, to our knowledge, the first attempt at such a generalization in information retrieval, in particular through its support for a universal ranking function over multiple entity-oriented search tasks.
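The keyword-based pruning step can be sketched as follows; the abstract does not state which ranking function selects the top keywords, so TF-IDF is an assumption here:

    import math
    from collections import Counter

    def top_keywords(docs: list[list[str]], k: int = 10) -> list[list[str]]:
        # Keep only the k highest-scoring terms per tokenized document,
        # shrinking the set of term nodes before building the hypergraph.
        df = Counter(term for doc in docs for term in set(doc))
        n = len(docs)
        pruned = []
        for doc in docs:
            tf = Counter(doc)
            scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
            pruned.append(sorted(scores, key=scores.get, reverse=True)[:k])
        return pruned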
Abstract:Connections among entities are everywhere. From social media interactions to web page hyperlinks, networks are frequently used to represent such complex systems. Node ranking is a fundamental task that provides a strategy for identifying central entities according to multiple criteria. Popular node ranking metrics include degree, closeness, and betweenness centralities, as well as HITS authority and PageRank. In this work, we propose a novel node ranking metric that combines PageRank with the idea of node fatigue, in order to model a random explorer who wants to optimize coverage: it becomes fatigued and avoids previously visited nodes. We formalize and exemplify the computation of Fatigued PageRank, evaluating it both as a node ranking metric and as query-independent evidence in ad hoc document retrieval. Based on the Simple English Wikipedia link graph with clickstream transitions from the English Wikipedia, we find that Fatigued PageRank is able to surpass both indegree and HITS authority, but only for the top-ranking nodes. On the other hand, based on the TREC Washington Post Corpus, we were unable to outperform the BM25 baseline, obtaining similar performance for all graph-based metrics except indegree, which lowered GMAP and MAP but increased NDCG@10 and P@10.
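One plausible reading of the metric, sketched as a Monte Carlo simulation (the paper gives a formal definition, which may differ from this interpretation; all parameter values here are illustrative):

    import random
    from collections import Counter, deque

    def fatigued_pagerank(graph, walks=10_000, walk_len=20,
                          damping=0.85, fatigue=3):
        # graph: dict mapping each node to a list of its out-neighbours.
        # A surfer follows links but skips nodes visited within the last
        # `fatigue` steps; when no fresh neighbour exists, it teleports.
        visits = Counter()
        nodes = list(graph)
        for _ in range(walks):
            node = random.choice(nodes)
            recent = deque(maxlen=fatigue)
            for _ in range(walk_len):
                visits[node] += 1
                recent.append(node)
                fresh = [n for n in graph[node] if n not in recent]
                if random.random() > damping or not fresh:
                    node = random.choice(nodes)  # teleport, as in PageRank
                else:
                    node = random.choice(fresh)
        total = sum(visits.values())
        return {n: visits[n] / total for n in nodes}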
Abstract:Hypergraphs are data structures capable of capturing supra-dyadic relations. We can use them to model binary relations, but also groups of entities, as well as the intersections between these groups and the subgroups they contain. In previous work, we explored the usage of hypergraphs as an indexing data structure, in particular one capable of seamlessly integrating text, entities, and their relations to support entity-oriented search tasks. As more information is added to the hypergraph, however, it not only increases in size but also becomes denser, making the task of efficiently ranking nodes or hyperedges more complex. Random walks can effectively capture network structure without compromising performance, or at least while providing a tunable balance between efficiency and effectiveness, within a nondeterministic universe. Higher effectiveness usually requires a higher number of random walks, which often results in lower efficiency. Inspired by von Neumann and the neuron in the brain, we propose and study the usage of node and hyperedge fatigue as a way to temporarily constrict random walks during keyword-based ad hoc retrieval. We found that we were able to improve search time by a factor of 32, at the cost of worsening MAP by a factor of 8. Moreover, by distinguishing between fatigue in nodes and in hyperedges, we found that, for hyperedge ranking tasks, increasing node fatigue consistently lowered MAP. On the other hand, the overall impact of hyperedge fatigue was slightly positive, although it also slightly worsened efficiency.
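A sketch of a fatigue-constrained walk over a hypergraph index (the fatigue schedule, a fixed per-element cooldown, is an assumption; the paper's mechanism may differ):

    import random
    from collections import Counter

    def fatigued_walk(node_edges, edge_nodes, start, length=100,
                      node_fatigue=2, edge_fatigue=2):
        # node_edges: node -> list of incident hyperedges
        # edge_nodes: hyperedge -> list of member nodes
        # After a visit, a node or hyperedge is skipped until its
        # cooldown expires, pushing the walk towards unvisited regions.
        visits = Counter()
        cooldown = {}  # ("n"/"e", element) -> remaining blocked steps
        node = start
        for _ in range(length):
            visits[node] += 1
            cooldown[("n", node)] = node_fatigue
            edges = [e for e in node_edges[node]
                     if cooldown.get(("e", e), 0) == 0]
            if not edges:
                break
            edge = random.choice(edges)
            cooldown[("e", edge)] = edge_fatigue
            members = [n for n in edge_nodes[edge]
                       if cooldown.get(("n", n), 0) == 0]
            if not members:
                break
            node = random.choice(members)
            # Tick every cooldown down by one step, freeing expired entries.
            cooldown = {k: v - 1 for k, v in cooldown.items() if v > 1}
        return visits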