Abstract: In Conversational Recommendation Systems (CRS), a user can provide feedback on the recommended items at each interaction turn, steering the CRS towards more desirable recommendations. Currently, different types of CRS offer different possibilities for feedback, e.g., natural language feedback or answering clarifying questions. In most cases, a user simulator is employed both for training and for evaluating the CRS. Such user simulators typically critique the currently retrieved items based on knowledge of a single target item. However, evaluating systems in offline settings with such simulators suffers from problems, such as focusing entirely on a single target item (thereby ignoring the exploratory nature of a recommender system) and exhibiting extreme patience (providing consistent feedback over a large number of turns). To overcome these limitations, we obtain extra judgements for a selection of alternative items in common CRS datasets, namely Shoes and Fashion IQ Dresses. Going further, we propose improved user simulators that allow simulated users not only to express their preferences about alternative items to their original target, but also to change their mind and their level of patience. In our experiments using the relative image captioning CRS setting and different CRS models, we find that giving the simulator knowledge of alternatives can have a considerable impact on the evaluation of existing CRS models; in particular, the existing single-target evaluation underestimates their effectiveness, and when simulated users are allowed to consider alternatives, the system can adapt to satisfy the user more quickly.
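To make the two simulator extensions concrete, below is a minimal, hypothetical Python sketch of a simulated user that accepts judged alternatives to the target, can change its mind, and abandons the session once a patience budget is exhausted. All class, method, and parameter names here are illustrative assumptions, not the paper's actual implementation (which, in the relative image captioning setting, generates a caption comparing the top-ranked item to the preferred item).

```python
import random

class AlternativeAwareSimulator:
    """Minimal sketch: a simulated user who accepts any judged alternative
    to the original target, may change their mind about which item to
    critique against, and gives up once their patience runs out.
    Illustrative only; names and protocol are assumptions."""

    def __init__(self, target, alternatives, patience=10, switch_prob=0.1, seed=0):
        self.acceptable = [target] + list(alternatives)  # items judged relevant
        self.preferred = target          # item the critiques are aimed at
        self.patience = patience         # max number of feedback turns
        self.turn = 0
        self.rng = random.Random(seed)

    def feedback(self, ranked_items):
        """Return ('accept', item), ('critique', item), or ('quit', None)."""
        self.turn += 1
        if self.turn > self.patience:
            return "quit", None          # impatient user abandons the session
        for item in ranked_items:
            if item in self.acceptable:  # the target OR an alternative suffices
                return "accept", item
        if self.rng.random() < self.switch_prob:
            self.preferred = self.rng.choice(self.acceptable)  # change of mind
        # in the paper's setting this step would emit a relative caption
        # comparing the current top-ranked item to self.preferred
        return "critique", self.preferred
```

Under this sketch, the classic single-target simulator is the special case with no alternatives, switch_prob=0, and unbounded patience.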
Abstract: Query Performance Prediction (QPP) estimates the effectiveness of a search engine's results in response to a query without relevance judgements. Traditionally, post-retrieval predictors have focused on either the distribution of the retrieval scores or the coherence of the top-ranked documents, using traditional bag-of-words index representations. More recently, BERT-based models using dense embedded document representations have been used to create new predictors, but these have mostly been applied to predict the performance of rankings created by BM25. Instead, we aim to predict the effectiveness of rankings created by single-representation dense retrieval models (ANCE & TCT-ColBERT). To this end, we propose a number of variants of existing unsupervised coherence-based predictors that employ neural embedding representations. In our experiments on the TREC Deep Learning Track datasets, we demonstrate improved accuracy for dense retrieval (up to 92% compared to the sparse variants for TCT-ColBERT, and 188% for ANCE). Going deeper, we select the most representative and best-performing predictors to study how differences among predictors and query types affect query performance. Using existing distribution-based QPP evaluation measures and a particular type of linear mixed model, we find that query types significantly influence query performance (and account for up to 35% of the unstable performance of QPP predictors), and that this sensitivity is unique to dense retrieval models. Our approach introduces a new setting for obtaining richer information on query differences in dense QPP, which can explain the potentially unstable performance of existing predictors and outlines the unique characteristics of different query types on dense retrieval models.
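As a concrete illustration of an embedding-based coherence predictor, the following is a minimal Python/NumPy sketch that scores a ranking by the mean pairwise cosine similarity of the dense embeddings of its top-k documents. The function name and the exact aggregation are illustrative assumptions; the paper's actual predictor variants may define the top-k set and the similarity aggregation differently.

```python
import numpy as np

def embedding_coherence(doc_embs, k=10):
    """Sketch of an unsupervised coherence-based QPP predictor: score a
    ranking by the mean pairwise cosine similarity of the dense embeddings
    of its top-k documents. A more coherent top of the ranking is taken
    as a signal of a more effective retrieval. Illustrative only."""
    top = np.asarray(doc_embs[:k], dtype=float)
    # L2-normalise rows so that dot products become cosine similarities
    top /= np.linalg.norm(top, axis=1, keepdims=True)
    sims = top @ top.T                    # k x k cosine similarity matrix
    iu = np.triu_indices(len(top), k=1)   # upper triangle, excluding diagonal
    return float(sims[iu].mean())         # average pairwise similarity

# usage: embeddings of the top-ranked documents from a dense retriever
# (e.g. ANCE or TCT-ColBERT), one row per document
rng = np.random.default_rng(0)
print(embedding_coherence(rng.normal(size=(100, 768)), k=10))
```

The sketch only captures the shared intuition behind the coherence family of predictors; it says nothing about the supervised or distribution-based predictors the abstract contrasts them with.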