Abstract:Personalization generally improves query performance, but in a few cases it may also harm it. If we can predict those situations and disable personalization for them, overall performance will be higher and users will be more satisfied with personalized systems. For this purpose, we use some state-of-the-art pre-retrieval query performance predictors and propose others that include user profile information. We study the correlations between these predictors and the difference between the personalized and the original queries. We also use classification and regression techniques to improve the results, finally reaching slightly more than one third of the maximum ideal performance. We think this is a good starting point for this research line, which certainly needs more effort and improvement.
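As an illustration of what a pre-retrieval query performance predictor can look like, the following minimal sketch computes the average inverse document frequency (IDF) of the query terms, a commonly used predictor of this kind. The abstract does not say which predictors were used, so the choice of average IDF, the function name `average_idf`, and the collection statistics are assumptions made for illustration only.

```python
import math


def average_idf(query_terms, doc_freq, num_docs):
    """Mean smoothed IDF of the query terms.

    Computed from collection statistics alone, before any retrieval is run.
    Higher values suggest a more specific query.
    """
    if not query_terms:
        return 0.0
    idfs = [
        math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))
        for term in query_terms
    ]
    return sum(idfs) / len(idfs)


# Hypothetical collection statistics (not taken from the paper).
doc_freq = {"parliament": 50, "budget": 10, "the": 1000}
num_docs = 1000

specific = average_idf(["budget", "parliament"], doc_freq, num_docs)
generic = average_idf(["the"], doc_freq, num_docs)
```

A score of this kind could serve as one input feature for a classifier that decides whether to apply personalization to a given query.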
Abstract:In the context of content-based recommender systems, the aim of this paper is to determine how better profiles can be built and how these affect the recommendation process based on the incorporation of temporality, i.e. the inclusion of time in the recommendation process, and topicality, i.e. the representation of texts associated with users and items using topics and their combination. The main contribution of the paper is to present two different ways of hybridising these two dimensions and to evaluate and compare them with other alternatives.
Abstract:A common task in many political institutions (e.g. Parliament) is to find politicians who are experts in a particular field. In order to tackle this problem, the first step is to obtain politician profiles which include their interests, and these can be automatically learned from their speeches. As a politician may have various areas of expertise, one alternative is to use a set of subprofiles, each of which covers a different subject. In this study, we propose a novel approach for this task by using latent Dirichlet allocation (LDA) to determine the main underlying topics of each political speech, and to distribute the related terms among the different topic-based subprofiles. With this objective, we propose the use of fifteen distance and similarity measures to automatically determine the optimal number of topics discussed in a document, and demonstrate that these measures converge into five strategies: Euclidean, Dice, Sorensen, Cosine and Overlap. Our experimental results showed that the scores of the different accuracy metrics of the proposed strategies tended to be higher than those of the baselines for expert recommendation tasks, and that the use of an appropriate number of topics proved relevant.
Abstract:In this paper, we examine the problem of building a user profile from a set of documents. This profile will consist of a subset of the most representative terms in the documents that best represent user preferences or interests. Inspired by discrete concentration theory, we have conducted an axiomatic study of seven properties that a selection function should fulfill: the minimum and maximum uncertainty principles, invariance to adding zeros, invariance to scale transformations, the principle of nominal increase, the transfer principle and the richest-get-richer inequality. We also present a novel selection function based on the use of similarity metrics, and more specifically the cosine measure which is commonly used in information retrieval, and demonstrate that it verifies six of the properties in addition to a weaker variant of the transfer principle, thereby representing a good selection approach. The theoretical study was complemented with an empirical study to compare the performance of different selection criteria (weight- and unweight-based) using real data in a parliamentary setting. In this study, we analyze the performance of the different functions, focusing on the two main factors affecting the selection process: profile size (number of terms) and weight distribution. These profiles are then used in a document filtering task to show that our similarity-based approach performs well in terms not only of recommendation accuracy but also of efficiency (we obtain smaller profiles and consequently faster recommendations).
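The abstract does not give the exact form of the cosine-based selection function. The following minimal sketch shows one plausible reading, stated here as an assumption rather than as the authors' definition: choose the number k of top-weighted terms that maximizes the cosine between the full weight vector and the 0/1 indicator vector of the top-k terms. The function name `cosine_select` is hypothetical.

```python
import math


def cosine_select(weights):
    """Return the profile size k chosen by a cosine criterion.

    Assumed criterion (not confirmed by the source): pick the k that
    maximizes cos(w, 1_k), where 1_k marks the k largest weights.
    For weights sorted in descending order this equals
    sum(w[:k]) / (sqrt(k) * ||w||).
    """
    w = sorted((x for x in weights if x > 0), reverse=True)
    if not w:
        return 0
    norm = math.sqrt(sum(x * x for x in w))
    best_k, best_cos, prefix = 0, -1.0, 0.0
    for k, x in enumerate(w, start=1):
        prefix += x
        cos = prefix / (math.sqrt(k) * norm)
        # Small tolerance so floating-point ties keep the smaller profile.
        if cos > best_cos + 1e-12:
            best_k, best_cos = k, cos
    return best_k
```

Under this assumed criterion, a uniform weight vector selects every term, a vector with a single non-zero weight selects only that term, and the result is unchanged by appending zeros or rescaling the weights, which is in the spirit of the properties listed in the abstract.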
Abstract:Our goal is to learn about the political interests and preferences of the Members of Parliament by mining their parliamentary activity, in order to develop a recommendation/filtering system that, given a stream of documents to be distributed among them, is able to decide which documents each Member of Parliament should receive. We propose to use positive unlabeled learning to tackle this problem, because we only have information about relevant documents (each Member of Parliament's own interventions in the debates) but not about irrelevant documents, so we cannot use standard binary classifiers trained with positive and negative examples. We have also developed a new algorithm of this type, which compares favourably with: a) the baseline approach assuming that all the interventions of other Members of Parliament are irrelevant, b) another well-known positive unlabeled learning method and c) an approach based on information retrieval methods that matches documents and legislators' representations. The experiments have been carried out with data from the regional Andalusian Parliament in Spain.
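Baseline (a) above can be sketched as a labelling step: for a given Member of Parliament, their own interventions are treated as positive examples and every other intervention as negative. The sketch below illustrates only that baseline, not the new algorithm, which the abstract does not detail. The function name, the `(mp_id, text)` data layout, and the sample records are assumptions.

```python
def baseline_training_set(interventions, target_mp):
    """Build a binary training set for one Member of Parliament.

    interventions: list of (mp_id, text) pairs.
    Label 1 marks the target's own interventions. Label 0 marks everything
    else, which this baseline assumes to be irrelevant even though such
    documents are really unlabeled.
    """
    return [(text, 1 if mp_id == target_mp else 0)
            for mp_id, text in interventions]


# Hypothetical records, for illustration only.
interventions = [
    ("mp_a", "regional health budget"),
    ("mp_b", "rural road maintenance"),
    ("mp_a", "hospital waiting lists"),
]
training = baseline_training_set(interventions, "mp_a")
```

The weakness that motivates positive unlabeled learning is visible here: some of the documents labelled 0 may in fact be relevant to the target Member of Parliament.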
Abstract:In the information age we are living in today, we are interested not only in accessing multimedia objects such as documents, videos, etc. but also in searching for professional experts, people or celebrities, possibly for professional needs or just for fun. Information access systems need to be able to extract and exploit various sources of information (usually in text format) about such individuals, and to represent them in a suitable way, usually in the form of a profile. In this article, we tackle the problems of profile-based expert recommendation and document filtering from a machine learning perspective by clustering expert textual sources to build profiles and capture the different hidden topics in which the experts are interested. The experts will then be represented by means of multi-faceted profiles. Our experiments show that this is a valid technique to improve the performance of expert finding and document filtering.
Abstract:In this paper we study the venue recommendation problem in order to help researchers to identify a journal or conference to submit a given paper. A common approach to tackle this problem is to build profiles defining the scope of each venue. Then, these profiles are compared against the target paper. In our approach we will study how clustering techniques can be used to construct topic-based profiles and use an Information Retrieval based approach to obtain the final recommendations. Additionally, we will explore how the use of authorship, representing a complementary piece of information, helps to improve the recommendations.
Abstract:Decomposable dependency models and their graphical counterparts, i.e., chordal graphs, possess a number of interesting and useful properties. On the basis of two characterizations of decomposable models in terms of independence relationships, we develop an exact algorithm for recovering the chordal graphical representation of any given decomposable model. We also propose an algorithm for learning chordal approximations of dependency models isomorphic to general undirected graphs.
Abstract:Information Retrieval (IR) is concerned with the identification of documents in a collection that are relevant to a given information need, usually represented as a query containing terms or keywords, which are supposed to be a good description of what the user is looking for. IR systems may improve their effectiveness (i.e., increase the number of relevant documents retrieved) by using a process of query expansion, which automatically adds new terms to the original query posed by a user. In this paper we develop a method of query expansion based on Bayesian networks. Using a learning algorithm, we construct a Bayesian network that represents some of the relationships among the terms appearing in a given document collection; this network is then used as a thesaurus (specific for that collection). We also report the results obtained by our method on three standard test collections.
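The general idea of expanding a query with collection-specific related terms can be sketched as follows. This is a deliberately simplified stand-in: instead of learning a Bayesian network, it estimates P(term | query term) directly from document co-occurrence and adds terms above a threshold. The function names, the threshold value, and the toy collection are assumptions, and the sketch does not reproduce the paper's learning algorithm.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, per term and per unordered term pair, the documents containing them."""
    term_df, pair_df = Counter(), Counter()
    for doc in docs:
        terms = set(doc)
        term_df.update(terms)
        for a, b in combinations(sorted(terms), 2):
            pair_df[(a, b)] += 1
    return term_df, pair_df


def expand_query(query, docs, threshold=0.5):
    """Append terms t whose estimated P(t | q) reaches the threshold for some query term q."""
    term_df, pair_df = cooccurrence_counts(docs)
    expanded = list(query)
    for q in query:
        if term_df[q] == 0:
            continue  # query term unseen in the collection: nothing to estimate
        for t in sorted(term_df):
            if t in expanded:
                continue
            pair = (q, t) if q < t else (t, q)
            if pair_df[pair] / term_df[q] >= threshold:
                expanded.append(t)
    return expanded


# Toy collection of tokenized documents, for illustration only.
docs = [
    ["bayesian", "network", "learning"],
    ["bayesian", "network"],
    ["query", "expansion"],
    ["bayesian", "inference"],
]
result = expand_query(["bayesian"], docs, threshold=0.6)
```

A learned Bayesian network can capture indirect dependencies between terms that pairwise co-occurrence misses, which is one reason to prefer it over this simplification.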