Abstract:In recent years, we have witnessed the proliferation of large amounts of online content generated directly by users with virtually no form of external control, leading to the possible spread of misinformation. The search for effective solutions to this problem is still ongoing, and covers different areas of application, from opinion spam to fake news detection. A more recently investigated scenario, despite the serious risks that incurring disinformation could entail, is that of the online dissemination of health information. Early approaches in this area focused primarily on user-based studies applied to Web page content. More recently, automated approaches have been developed for both Web pages and social media content, particularly with the advent of the COVID-19 pandemic. These approaches are primarily based on handcrafted features extracted from online content in association with Machine Learning. In this scenario, we focus on Web page content, where there is still room for research to study structural-, content- and context-based features to assess the credibility of Web pages. Therefore, this work aims to study the effectiveness of such features in association with a deep learning model, starting from an embedded representation of Web pages that has been recently proposed in the context of phishing Web page detection, i.e., Web2Vec.
Abstract:In this paper, we propose a novel approach to consider multiple dimensions of relevance beyond topicality in cross-encoder re-ranking. On the one hand, current multidimensional retrieval models often use na\"ive solutions at the re-ranking stage to aggregate multiple relevance scores into an overall one. On the other hand, cross-encoder re-rankers are effective in considering topicality but are not designed to straightforwardly account for other relevance dimensions. To overcome these issues, we envisage enhancing the candidate documents -- which are retrieved by a first-stage lexical retrieval model -- with "relevance statements" related to additional dimensions of relevance and then performing a re-ranking on them with cross-encoders. In particular, here we consider an additional relevance dimension beyond topicality, which is credibility. We test the effectiveness of our solution in the context of the Consumer Health Search task, considering publicly available datasets. Our results show that the proposed approach statistically outperforms both aggregation-based and cross-encoder re-rankers.