Abstract:Recommendation systems rely on user-provided data to learn about item quality and provide personalized recommendations. An implicit assumption when aggregating ratings into item quality is that ratings are strong indicators of item quality. In this work, we test this assumption using data collected from a music discovery application. Our study focuses on two factors that cause rating inflation: heterogeneous user rating behavior and the dynamics of personalized recommendations. We show that user rating behavior substantially varies by user, leading to item quality estimates that reflect the users who rated an item more than the item quality itself. Additionally, items that are more likely to be shown via personalized recommendations can experience a substantial increase in their exposure and potential bias toward them. To mitigate these effects, we analyze the results of a randomized controlled trial in which the rating interface was modified. The test resulted in a substantial improvement in user rating behavior and a reduction in item quality inflation. These findings highlight the importance of carefully considering the assumptions underlying recommendation systems and designing interfaces that encourage accurate rating behavior.