Abstract: The geolocation of online information is an essential component of any geospatial application. While most previous work on geolocation has focused on Twitter, in this paper we quantify and compare the performance of text-based geolocation methods on social media data drawn from both Blogger and Twitter. We introduce a novel set of location-specific features that are both highly informative and easily interpretable, and show that they achieve error rate reductions of up to 12.5% relative to the best previously proposed geolocation features. We also show that, despite posting longer text, Blogger users are significantly harder to geolocate than Twitter users. Additionally, we investigate the effect of training and testing on different media (cross-media predictions) and of combining multiple social media sources (multi-media predictions). Finally, we explore the geolocability of social media in relation to three user dimensions: state, gender, and industry.
Abstract: Automatic profiling of social media users is an important task that supports a multitude of downstream applications. While a number of studies have used social media content to extract and study collective social attributes, little substantial research has addressed the detection of a user's industry. We frame this task as a classification problem and approach it with both feature engineering and ensemble learning. Our industry-detection system uses both posted content and profile information to detect a user's industry with 64.3% accuracy, significantly outperforming the majority baseline in a taxonomy of fourteen industry classes. Our qualitative analysis suggests that a person's industry affects not only the words used and their perceived meanings, but also the number and type of emotions expressed.
Abstract: People's personality and motivations are manifest in their everyday language usage. With the emergence of social media, ample examples of such usage are readily available. In this paper, we analyze the vocabulary used by close to 200,000 Blogger users in the U.S. in order to portray various demographic, linguistic, and psychological dimensions geographically at the state level. We describe a web-based tool for viewing maps that depict various characteristics of these social media users, as derived from this large blog dataset of over two billion words.