Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Marian-Andrei Rizoiu

Estimating Online Influence Needs Causal Modeling! Counterfactual Analysis of Social Media Engagement

May 25, 2025

Lin Tian, Marian-Andrei Rizoiu

Abstract:Understanding true influence in social media requires distinguishing correlation from causation--particularly when analyzing misinformation spread. While existing approaches focus on exposure metrics and network structures, they often fail to capture the causal mechanisms by which external temporal signals trigger engagement. We introduce a novel joint treatment-outcome framework that leverages existing sequential models to simultaneously adapt to both policy timing and engagement effects. Our approach adapts causal inference techniques from healthcare to estimate Average Treatment Effects (ATE) within the sequential nature of social media interactions, tackling challenges from external confounding signals. Through our experiments on real-world misinformation and disinformation datasets, we show that our models outperform existing benchmarks by 15--22% in predicting engagement across diverse counterfactual scenarios, including exposure adjustment, timing shifts, and varied intervention durations. Case studies on 492 social media users show our causal effect measure aligns strongly with the gold standard in influence estimation, the expert-based empirical influence.

Via

Access Paper or Ask Questions

Before It's Too Late: A State Space Model for the Early Prediction of Misinformation and Disinformation Engagement

Feb 07, 2025

Lin Tian, Emily Booth, Francesco Bailo, Julian Droogan, Marian-Andrei Rizoiu

Abstract:In today's digital age, conspiracies and information campaigns can emerge rapidly and erode social and democratic cohesion. While recent deep learning approaches have made progress in modeling engagement through language and propagation models, they struggle with irregularly sampled data and early trajectory assessment. We present IC-Mamba, a novel state space model that forecasts social media engagement by modeling interval-censored data with integrated temporal embeddings. Our model excels at predicting engagement patterns within the crucial first 15-30 minutes of posting (RMSE 0.118-0.143), enabling rapid assessment of content reach. By incorporating interval-censored modeling into the state space framework, IC-Mamba captures fine-grained temporal dynamics of engagement growth, achieving a 4.72% improvement over state-of-the-art across multiple engagement metrics (likes, shares, comments, and emojis). Our experiments demonstrate IC-Mamba's effectiveness in forecasting both post-level dynamics and broader narrative patterns (F1 0.508-0.751 for narrative-level predictions). The model maintains strong predictive performance across extended time horizons, successfully forecasting opinion-level engagement up to 28 days ahead using observation windows of 3-10 days. These capabilities enable earlier identification of potentially problematic content, providing crucial lead time for designing and implementing countermeasures. Code is available at: https://github.com/ltian678/ic-mamba. An interactive dashboard demonstrating our results is available at: https://ic-mamba.behavioral-ds.science.

* 11 pages, 5 figures, 10 tables, Accepted by the Web Conference 2025 (WWW2025)

Via

Access Paper or Ask Questions

Behavioral Homophily in Social Media via Inverse Reinforcement Learning: A Reddit Case Study

Feb 05, 2025

Lanqin Yuan, Philipp J. Schneider, Marian-Andrei Rizoiu

Abstract:Online communities play a critical role in shaping societal discourse and influencing collective behavior in the real world. The tendency for people to connect with others who share similar characteristics and views, known as homophily, plays a key role in the formation of echo chambers which further amplify polarization and division. Existing works examining homophily in online communities traditionally infer it using content- or adjacency-based approaches, such as constructing explicit interaction networks or performing topic analysis. These methods fall short for platforms where interaction networks cannot be easily constructed and fail to capture the complex nature of user interactions across the platform. This work introduces a novel approach for quantifying user homophily. We first use an Inverse Reinforcement Learning (IRL) framework to infer users' policies, then use these policies as a measure of behavioral homophily. We apply our method to Reddit, conducting a case study across 5.9 million interactions over six years, demonstrating how this approach uncovers distinct behavioral patterns and user roles that vary across different communities. We further validate our behavioral homophily measure against traditional content-based homophily, offering a powerful method for analyzing social media dynamics and their broader societal implications. We find, among others, that users can behave very similarly (high behavioral homophily) when discussing entirely different topics like soccer vs e-sports (low topical homophily), and that there is an entire class of users on Reddit whose purpose seems to be to disagree with others.

Via

Access Paper or Ask Questions

What Drives Online Popularity: Author, Content or Sharers? Estimating Spread Dynamics with Bayesian Mixture Hawkes

Jun 12, 2024

Pio Calderon, Marian-Andrei Rizoiu

Abstract:The spread of content on social media is shaped by intertwining factors on three levels: the source, the content itself, and the pathways of content spread. At the lowest level, the popularity of the sharing user determines its eventual reach. However, higher-level factors such as the nature of the online item and the credibility of its source also play crucial roles in determining how widely and rapidly the online item spreads. In this work, we propose the Bayesian Mixture Hawkes (BMH) model to jointly learn the influence of source, content and spread. We formulate the BMH model as a hierarchical mixture model of separable Hawkes processes, accommodating different classes of Hawkes dynamics and the influence of feature sets on these classes. We test the BMH model on two learning tasks, cold-start popularity prediction and temporal profile generalization performance, applying to two real-world retweet cascade datasets referencing articles from controversial and traditional media publishers. The BMH model outperforms the state-of-the-art models and predictive baselines on both datasets and utilizes cascade- and item-level information better than the alternatives. Lastly, we perform a counter-factual analysis where we apply the trained publisher-level BMH models to a set of article headlines and show that effectiveness of headline writing style (neutral, clickbait, inflammatory) varies across publishers. The BMH model unveils differences in style effectiveness between controversial and reputable publishers, where we find clickbait to be notably more effective for reputable publishers as opposed to controversial ones, which links to the latter's overuse of clickbait.

* accepted as a full paper in the Research Track at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD) 2024

Via

Access Paper or Ask Questions

Author, Content or Sharers? Estimating Spread Dynamics with Bayesian Mixture Hawkes

Jun 05, 2024

Pio Calderon, Marian-Andrei Rizoiu

* accepted in the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD) 2024

Via

Access Paper or Ask Questions

Misinformation is not about Bad Facts: An Analysis of the Production and Consumption of Fringe Content

Mar 13, 2024

JooYoung Lee, Emily Booth, Hany Farid, Marian-Andrei Rizoiu

Abstract:What if misinformation is not an information problem at all? Our findings suggest that online fringe ideologies spread through the use of content that is consensus-based and "factually correct". We found that Australian news publishers with both moderate and far-right political leanings contain comparable levels of information completeness and quality; and furthermore, that far-right Twitter users often share from moderate sources. However, a stark difference emerges when we consider two additional factors: 1) the narrow topic selection of articles by far-right users, suggesting that they cherrypick only news articles that engage with specific topics of their concern, and 2) the difference between moderate and far-right publishers when we examine the writing style of their articles. Furthermore, we can even identify users prone to sharing misinformation based on their communication style. These findings have important implications for countering online misinformation, as they highlight the powerful role that users' personal bias towards specific topics, and publishers' writing styles, have in amplifying fringe ideologies online.

* 11 pages, 2 figures

Via

Access Paper or Ask Questions

The Innovation-to-Occupations Ontology: Linking Business Transformation Initiatives to Occupations and Skills

Oct 27, 2023

Daniela Elia, Fang Chen, Didar Zowghi, Marian-Andrei Rizoiu

Figure 1 for The Innovation-to-Occupations Ontology: Linking Business Transformation Initiatives to Occupations and Skills

Figure 2 for The Innovation-to-Occupations Ontology: Linking Business Transformation Initiatives to Occupations and Skills

Abstract:The fast adoption of new technologies forces companies to continuously adapt their operations making it harder to predict workforce requirements. Several recent studies have attempted to predict the emergence of new roles and skills in the labour market from online job ads. This paper aims to present a novel ontology linking business transformation initiatives to occupations and an approach to automatically populating it by leveraging embeddings extracted from job ads and Wikipedia pages on business transformation and emerging technologies topics. To our knowledge, no previous research explicitly links business transformation initiatives, like the adoption of new technologies or the entry into new markets, to the roles needed. Our approach successfully matches occupations to transformation initiatives under ten different scenarios, five linked to technology adoption and five related to business. This framework presents an innovative approach to guide enterprises and educational institutions on the workforce requirements for specific business transformation initiatives.

* 14 pages, 3 figures, Camera-ready version in ACIS 2023

Via

Access Paper or Ask Questions

Enhancing Biomedical Text Summarization and Question-Answering: On the Utility of Domain-Specific Pre-Training

Jul 10, 2023

Dima Galat, Marian-Andrei Rizoiu

Abstract:Biomedical summarization requires large datasets to train for text generation. We show that while transfer learning offers a viable option for addressing this challenge, an in-domain pre-training does not always offer advantages in a BioASQ summarization task. We identify a suitable model architecture and use it to show a benefit of a general-domain pre-training followed by a task-specific fine-tuning in the context of a BioASQ summarization task, leading to a novel three-step fine-tuning approach that works with only a thousand in-domain examples. Our results indicate that a Large Language Model without domain-specific pre-training can have a significant edge in some domain-specific biomedical text generation tasks.

Via

Access Paper or Ask Questions

Being Automated or Not? Risk Identification of Occupations with Graph Neural Networks

Sep 06, 2022

Dawei Xu, Haoran Yang, Marian-Andrei Rizoiu, Guandong Xu

Figure 1 for Being Automated or Not? Risk Identification of Occupations with Graph Neural Networks

Figure 2 for Being Automated or Not? Risk Identification of Occupations with Graph Neural Networks

Figure 3 for Being Automated or Not? Risk Identification of Occupations with Graph Neural Networks

Figure 4 for Being Automated or Not? Risk Identification of Occupations with Graph Neural Networks

Abstract:The rapid advances in automation technologies, such as artificial intelligence (AI) and robotics, pose an increasing risk of automation for occupations, with a likely significant impact on the labour market. Recent social-economic studies suggest that nearly 50\% of occupations are at high risk of being automated in the next decade. However, the lack of granular data and empirically informed models have limited the accuracy of these studies and made it challenging to predict which jobs will be automated. In this paper, we study the automation risk of occupations by performing a classification task between automated and non-automated occupations. The available information is 910 occupations' task statements, skills and interactions categorised by Standard Occupational Classification (SOC). To fully utilize this information, we propose a graph-based semi-supervised classification method named \textbf{A}utomated \textbf{O}ccupation \textbf{C}lassification based on \textbf{G}raph \textbf{C}onvolutional \textbf{N}etworks (\textbf{AOC-GCN}) to identify the automated risk for occupations. This model integrates a heterogeneous graph to capture occupations' local and global contexts. The results show that our proposed method outperforms the baseline models by considering the information of both internal features of occupations and their external interactions. This study could help policymakers identify potential automated occupations and support individuals' decision-making before entering the job market.

* 15 pages, 6 figures

Via

Access Paper or Ask Questions

Detect Hate Speech in Unseen Domains using Multi-Task Learning: A Case Study of Political Public Figures

Aug 22, 2022

Lanqin Yuan, Marian-Andrei Rizoiu

Figure 1 for Detect Hate Speech in Unseen Domains using Multi-Task Learning: A Case Study of Political Public Figures

Figure 2 for Detect Hate Speech in Unseen Domains using Multi-Task Learning: A Case Study of Political Public Figures

Figure 3 for Detect Hate Speech in Unseen Domains using Multi-Task Learning: A Case Study of Political Public Figures

Figure 4 for Detect Hate Speech in Unseen Domains using Multi-Task Learning: A Case Study of Political Public Figures

Abstract:Automatic identification of hateful and abusive content is vital in combating the spread of harmful online content and its damaging effects. Most existing works evaluate models by examining the generalization error on train-test splits on hate speech datasets. These datasets often differ in their definitions and labeling criteria, leading to poor model performance when predicting across new domains and datasets. In this work, we propose a new Multi-task Learning (MTL) pipeline that utilizes MTL to train simultaneously across multiple hate speech datasets to construct a more encompassing classification model. We simulate evaluation on new previously unseen datasets by adopting a leave-one-out scheme in which we omit a target dataset from training and jointly train on the other datasets. Our results consistently outperform a large sample of existing work. We show strong results when examining generalization error in train-test splits and substantial improvements when predicting on previously unseen datasets. Furthermore, we assemble a novel dataset, dubbed PubFigs, focusing on the problematic speech of American Public Political Figures. We automatically detect problematic speech in the $305,235$ tweets in PubFigs, and we uncover insights into the posting behaviors of public figures.

Via

Access Paper or Ask Questions