Abstract: The development of tools and techniques to analyze and extract organizations' data habits from privacy policies is critical for scalable regulatory compliance audits. Unfortunately, these tools are becoming increasingly limited in their ability to identify compliance issues and fixes. After all, most were developed using regulation-agnostic datasets of annotated privacy policies obtained before the introduction of landmark privacy regulations such as the EU's GDPR and California's CCPA. In this paper, we describe C3PA (CCPA Privacy Policy Provision Annotations), the first open regulation-aware dataset of expert-annotated privacy policies, aimed at addressing this challenge. C3PA contains over 48K expert-labeled privacy policy text segments, associated with responses to CCPA-specific disclosure mandates, from 411 unique organizations. We demonstrate that the C3PA dataset is uniquely suited to aiding automated audits of compliance with CCPA-related disclosure mandates.
Abstract: The evolution of information-seeking processes, driven by search engines like Google, has transformed how people access information. This paper investigates how individuals' preexisting attitudes influence the modern information-seeking process, specifically the results presented by Google Search. Through a comprehensive study involving surveys and information-seeking tasks focused on the topic of abortion, the paper provides four crucial insights: 1) Individuals with opposing attitudes on abortion receive different search results. 2) Individuals express their beliefs through the vocabulary they use to formulate search queries, shaping the outcome of the search. 3) A user's search history also contributes to divergent results among those with opposing attitudes. 4) The Google Search engine reinforces preexisting beliefs in its search results. Overall, this study provides insights into the interplay between human biases and algorithmic processes, highlighting the potential for information polarization in modern information-seeking processes.
Abstract: Data generated by audits of social media websites have formed the basis of our understanding of the biases present in algorithmic content recommendation systems. As legislators around the world begin to consider regulating the algorithmic systems that drive online platforms, it is critical to ensure the correctness of these inferred biases. However, as we show in this paper, doing so is challenging for a variety of reasons related to the complexity of the configuration parameters associated with the audits that gather data from a specific platform. Focusing specifically on YouTube, we show that conducting audits to make inferences about YouTube's recommendation systems is more methodologically challenging than one might expect. Many methodological decisions must be considered in order to obtain scientifically valid results, and each of these decisions incurs costs. For example, should an auditor use (expensive-to-obtain) logged-in YouTube accounts while gathering recommendations from the algorithm to obtain more accurate inferences? We explore the impact of this and many other decisions and make some startling discoveries about the methodological choices that impact YouTube's recommendations. Taken together, our research suggests auditing configuration compromises that YouTube auditors and researchers can use to reduce audit overhead, both economically and computationally, without sacrificing the accuracy of their inferences. We also identify several configuration parameters that have a significant impact on the accuracy of measured inferences and should be chosen carefully.
Abstract: The Internet and online forums such as Reddit have become an increasingly popular medium for citizens to engage in political conversations. However, the online disinhibition effect resulting from the ability to use pseudonymous identities may manifest as offensive speech, consequently making political discussions more aggressive and polarizing than they already are. Such environments may result in harassment of, and self-censorship by, their targets. In this paper, we present preliminary results from a large-scale temporal measurement aimed at quantifying offensiveness in online political discussions. To enable our measurements, we develop and evaluate an offensive speech classifier. We then use this classifier to quantify and compare offensiveness in political and general contexts. We perform our study using a database of over 168M Reddit comments made by over 7M pseudonyms between January 2015 and January 2017 -- a period covering several divisive political events, including the 2016 US presidential elections.