Abstract: It has been established in the literature that the number of ratings and the scores restaurants obtain on online rating systems (ORS) significantly impact their revenue. However, when a restaurant has a limited number of ratings, it may be challenging to predict its future performance. It may well be that ratings reveal more about the user who provided them than about the quality of the restaurant. This motivates us to segment users into "inflating raters", who tend to give unusually high ratings, and "deflating raters", who tend to give unusually low ratings, and to compare the rankings generated by these two populations. Using a public dataset provided by Yelp, we find that deflating raters are better at predicting which restaurants will achieve a top rating (4.5 and above) in the future. As such, deflating raters may have an important role in restaurant discovery.
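A minimal sketch of the rater segmentation described above, assuming the Yelp review schema (user_id, business_id, stars); the ±0.5-star cutoff for labeling a user as inflating or deflating is an illustrative assumption, not a threshold taken from the paper:

```python
import pandas as pd

def segment_raters(reviews: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    # Deviation of each rating from the restaurant's overall mean rating.
    restaurant_mean = reviews.groupby("business_id")["stars"].transform("mean")
    reviews = reviews.assign(deviation=reviews["stars"] - restaurant_mean)

    # A user's bias is their mean deviation across the restaurants they rated.
    user_bias = reviews.groupby("user_id")["deviation"].mean().rename("bias").reset_index()

    # Users with a strongly positive (negative) bias are labeled inflating (deflating).
    user_bias["segment"] = "neutral"
    user_bias.loc[user_bias["bias"] >= threshold, "segment"] = "inflating"
    user_bias.loc[user_bias["bias"] <= -threshold, "segment"] = "deflating"
    return user_bias
```

The rankings produced by the two segments can then be compared by averaging ratings separately within each segment and checking which segment's top-ranked restaurants go on to reach 4.5 stars.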
Abstract: Recommendation systems rely on user-provided data to learn about item quality and provide personalized recommendations. An implicit assumption made when aggregating ratings into item quality is that ratings are strong indicators of item quality. In this work, we test this assumption using data collected from a music discovery application. Our study focuses on two factors that cause rating inflation: heterogeneous user rating behavior and the dynamics of personalized recommendations. We show that rating behavior varies substantially across users, leading to item quality estimates that reflect the users who rated an item more than the quality of the item itself. Additionally, items that are more likely to be shown via personalized recommendations can experience a substantial increase in exposure, introducing a potential bias in their favor. To mitigate these effects, we analyze the results of a randomized controlled trial in which the rating interface was modified. The trial resulted in a substantial improvement in user rating behavior and a reduction in item quality inflation. These findings highlight the importance of carefully examining the assumptions underlying recommendation systems and of designing interfaces that encourage accurate rating behavior.
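To make the inflation effect concrete, a toy sketch (not the paper's analysis) contrasting a raw mean rating with a user-debiased mean that removes each rater's global offset; the column names user_id, item_id, and rating are hypothetical:

```python
import pandas as pd

def item_quality_estimates(ratings: pd.DataFrame) -> pd.DataFrame:
    global_mean = ratings["rating"].mean()

    # Per-user offset: how generous each user is relative to the global mean.
    user_offset = ratings.groupby("user_id")["rating"].mean() - global_mean

    # Remove the per-user offset from each rating before averaging per item.
    ratings = ratings.assign(debiased=ratings["rating"] - ratings["user_id"].map(user_offset))

    # The raw mean partly reflects who happened to rate the item; the debiased mean less so.
    return ratings.groupby("item_id").agg(
        raw_mean=("rating", "mean"),
        debiased_mean=("debiased", "mean"),
    )
```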
Abstract: High-quality user feedback data is essential to training and evaluating a successful music recommendation system, particularly one that has to balance the needs of multiple stakeholders. Most existing music datasets suffer from noisy feedback and the self-selection biases inherent in data collected by music platforms. Using the Piki Music dataset of 500k ratings collected over a two-year period, we evaluate the performance of classic recommendation algorithms for three important stakeholders: consumers, well-known artists, and lesser-known artists. We show that a matrix factorization algorithm trained on both likes and dislikes performs significantly better than one trained only on likes, for all three stakeholders.
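A generic matrix-factorization sketch of the likes-versus-dislikes comparison, not the paper's exact model: ratings are encoded as +1 (like) and -1 (dislike), and the likes-only variant simply drops the negative examples before training; all hyperparameters are illustrative:

```python
import numpy as np

def train_mf(interactions, n_users, n_items, k=32, lr=0.05, reg=0.01, epochs=20, seed=0):
    """interactions: list of (user, item, rating) tuples with rating in {+1, -1}."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))   # user latent factors
    V = 0.1 * rng.standard_normal((n_items, k))   # item latent factors
    for _ in range(epochs):
        for u, i, r in interactions:
            err = r - U[u] @ V[i]        # prediction error on this interaction
            u_row = U[u].copy()          # keep the old user vector so both updates use it
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_row - reg * V[i])
    return U, V

# Likes-and-dislikes model vs. a likes-only model trained on the same data:
# U_all, V_all = train_mf(data, n_users, n_items)
# U_pos, V_pos = train_mf([t for t in data if t[2] > 0], n_users, n_items)
```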