Abstract: The rapid proliferation of fake news on social media threatens social stability, creating an urgent demand for more effective detection methods. While many promising approaches have emerged, most rely on content analysis with limited semantic depth, leading to suboptimal comprehension of news content. To address this limitation, capturing broader-range semantics is essential yet challenging, as it introduces two primary types of noise: fully connecting sentences in news graphs often adds unnecessary structural noise, while highly similar but authenticity-irrelevant sentences introduce feature noise, complicating the detection process. To tackle these issues, we propose BREAK, a broad-range semantics model for fake news detection that leverages a fully connected graph to capture comprehensive semantics while employing dual denoising modules to minimize both structural and feature noise. The semantic structure denoising module balances the graph's connectivity by iteratively refining it between two bounds: a sequence-based structure as the lower bound and a fully connected graph as the upper bound. This refinement uncovers label-relevant semantic interrelation structures. Meanwhile, the semantic feature denoising module reduces noise from similar semantics by diversifying representations, aligning distinct outputs from the denoised graph and sequence encoders using KL-divergence to achieve feature diversification in high-dimensional space. The two modules are jointly optimized in a bi-level framework, enhancing the integration of denoised semantics into a comprehensive representation for detection. Extensive experiments across four datasets demonstrate that BREAK significantly outperforms existing methods in identifying fake news. Code is available at https://anonymous.4open.science/r/BREAK.
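A minimal sketch of the feature-diversification step described above, assuming PyTorch; the symmetric-KL form, the softmax normalization, and all names are our assumptions rather than BREAK's released code:

```python
# Hedged sketch: push the graph-encoder and sequence-encoder views of
# each sentence apart in distribution space via a KL-divergence term,
# so the two views carry complementary (diversified) semantics.
import torch
import torch.nn.functional as F

def diversification_loss(graph_feats: torch.Tensor,
                         seq_feats: torch.Tensor) -> torch.Tensor:
    """Negative symmetric KL between the two encoder outputs.

    Minimizing this loss *maximizes* the divergence between the views.
    """
    p = F.log_softmax(graph_feats, dim=-1)   # graph view, log-probs
    q = F.log_softmax(seq_feats, dim=-1)     # sequence view, log-probs
    kl_pq = F.kl_div(q, p.exp(), reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(p, q.exp(), reduction="batchmean")  # KL(q || p)
    return -(kl_pq + kl_qp)                  # negate to diversify

# toy usage: two [batch, dim] encoder outputs
g = torch.randn(8, 64, requires_grad=True)
s = torch.randn(8, 64, requires_grad=True)
loss = diversification_loss(g, s)
loss.backward()
```

Presumably this term is weighted against the detection loss inside the bi-level framework rather than minimized on its own.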
Abstract: Large Language Models (LLMs) have become pervasive due to their knowledge absorption and text-generation capabilities. Concurrently, copyright issues around pretraining datasets have become a pressing concern, particularly when generation reproduces specific styles. Previous methods either focus on defending against verbatim copyrighted outputs or pursue interpretability at the level of individual tokens, incurring heavy computational burdens. A gap remains between the two: direct assessments of how dataset contributions shape LLM outputs are missing. Once model providers can ensure copyright protection for data holders, a more mature LLM community can be established. To address these limitations, we introduce CopyLens, a new framework for analyzing how copyrighted datasets may influence LLM responses. Specifically, a two-stage approach is employed: first, based on the uniqueness of pretraining data in the embedding space, token representations are fused for potentially copyrighted texts, followed by a lightweight LSTM-based network that analyzes dataset contributions. With this prior, a contrastive-learning-based non-copyright OOD detector is designed. Our framework can adapt to different scenarios and bridges the gap between current copyright detection methods. Experiments show that CopyLens improves efficiency and accuracy by 15.2% over our proposed baseline, 58.7% over prompt-engineering methods, and 0.21 AUC over OOD detection baselines.
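A hedged sketch of the kind of lightweight LSTM head the abstract describes; the dimensions, the fusion step, and the number of candidate datasets are illustrative assumptions, not CopyLens's actual architecture:

```python
# Fused token embeddings go through an LSTM, and a linear layer scores
# the contribution of each candidate pretraining dataset.
import torch
import torch.nn as nn

class DatasetContributionHead(nn.Module):
    def __init__(self, embed_dim=768, hidden=128, n_datasets=4):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_datasets)

    def forward(self, fused_tokens):           # [batch, seq_len, embed_dim]
        _, (h_n, _) = self.lstm(fused_tokens)  # final hidden state
        return self.classifier(h_n[-1])        # per-dataset contribution logits

head = DatasetContributionHead()
logits = head(torch.randn(2, 32, 768))         # toy fused token embeddings
contrib = logits.softmax(dim=-1)               # normalized contribution scores
```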
Abstract: The ID-free recommendation paradigm has been proposed to address the limitation that traditional recommender systems struggle to model cold-start users or items with new IDs. Despite its effectiveness, this study uncovers that ID-free recommender systems are vulnerable to the proposed Text Simulation attack (TextSimu), which aims to promote specific target items. As a novel type of text poisoning attack, TextSimu exploits large language models (LLMs) to alter the textual information of target items by simulating the characteristics of popular items. It operates effectively in both black-box and white-box settings, utilizing two key components: a unified popularity extraction module, which captures the essential characteristics of popular items, and an N-persona consistency simulation strategy, which creates multiple personas that collaboratively synthesize refined promotional descriptions for target items by imitating popular items. To withstand TextSimu-like attacks, we further explore a detection approach for identifying LLM-generated promotional text. Extensive experiments conducted on three datasets demonstrate that TextSimu poses a more significant threat than existing poisoning attacks, while our defense method can detect malicious text of target items generated by TextSimu. By identifying this vulnerability, we aim to advance the development of more robust ID-free recommender systems.
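To make the N-persona idea concrete, here is a hypothetical sketch; `call_llm`, the persona list, and the prompt wording are stand-ins we invented, not TextSimu's actual components:

```python
# Build N persona-conditioned prompts that each rewrite a target item's
# text toward traits extracted from popular items, then collect the
# drafts for a later consistency/reconciliation step.
from typing import Callable, List

PERSONAS = ["marketing copywriter", "loyal customer", "product reviewer"]

def simulate_personas(item_text: str, popular_traits: List[str],
                      call_llm: Callable[[str], str]) -> List[str]:
    drafts = []
    for persona in PERSONAS:
        prompt = (
            f"You are a {persona}. Rewrite the item description below so it "
            f"shares these traits of popular items: {', '.join(popular_traits)}.\n"
            f"Description: {item_text}"
        )
        drafts.append(call_llm(prompt))
    return drafts  # a consistency step would reconcile these drafts

# toy usage with a stub in place of a real LLM client
drafts = simulate_personas("A basic phone case.",
                           ["durable", "bestseller"],
                           call_llm=lambda p: p[:60])
```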
Abstract: Modern recommender systems (RS) have profoundly enhanced user experience across digital platforms, yet they face significant threats from poisoning attacks. These attacks, aimed at manipulating recommendation outputs for unethical gains, exploit vulnerabilities in RS by injecting malicious data or interfering with model training. This survey presents a unique perspective by examining these threats through the lens of an attacker, offering fresh insights into their mechanics and impacts. Concretely, we detail a systematic pipeline that encompasses the four stages of a poisoning attack: setting attack goals, assessing attacker capabilities, analyzing the victim architecture, and implementing poisoning strategies. The pipeline not only aligns with various attack tactics but also serves as a comprehensive taxonomy for pinpointing the focus of distinct poisoning attacks. Correspondingly, we classify defensive strategies into two main categories from the defender's perspective: poisoning data filtering and robust training. Finally, we highlight existing limitations and suggest innovative directions for further exploration in this field.
Abstract: Modern recommender systems (RS) have seen substantial success, yet they remain vulnerable to malicious activities, notably poisoning attacks. These attacks involve injecting malicious data into the training datasets of RS, thereby compromising their integrity and manipulating recommendation outcomes to gain illicit profit. This survey paper provides a systematic and up-to-date review of the research landscape on Poisoning Attacks against Recommendation (PAR). A novel and comprehensive taxonomy is proposed, categorizing existing PAR methodologies into three distinct categories: Component-Specific, Goal-Driven, and Capability Probing. For each category, we discuss its mechanism in detail, along with associated methods. Furthermore, this paper highlights potential future research avenues in this domain. Additionally, to facilitate and benchmark the empirical comparison of PAR, we introduce an open-source library, ARLib, which encompasses a comprehensive collection of PAR models and common datasets. The library is released at https://github.com/CoderWZW/ARLib.
Abstract: Contrastive learning (CL) has recently gained significant popularity in the field of recommendation. Its ability to learn without heavy reliance on labeled data is a natural antidote to the data sparsity issue. Previous research has found that CL can not only enhance recommendation accuracy but also inadvertently exhibit remarkable robustness against noise. However, this paper identifies a vulnerability of CL-based recommender systems: compared with their non-CL counterparts, they are even more susceptible to poisoning attacks that aim to promote target items. Our analysis points to the uniform dispersion of representations induced by the CL loss as the very factor that accounts for this vulnerability. We further demonstrate, both theoretically and empirically, that optimizing the CL loss leads to smooth spectral values of representations. Based on these insights, we reveal a potential poisoning attack against CL-based recommender systems. The proposed attack encompasses a dual-objective framework: one objective induces a smoother spectral value distribution to amplify the CL loss's inherent dispersion effect, named dispersion promotion; the other directly elevates the visibility of target items, named rank promotion. We validate the destructiveness of our attack model through extensive experiments on four datasets. By shedding light on these vulnerabilities, we aim to facilitate the development of more robust CL-based recommender systems.
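A rough sketch of what the dual objectives could look like, assuming PyTorch; the variance-based spectral penalty and the margin-based rank term are our guesses at "dispersion promotion" and "rank promotion", not the paper's exact losses:

```python
# Dispersion promotion: penalize the spread of the singular values of
# the embedding matrix so the spectrum becomes smoother, amplifying
# CL's dispersion effect. Rank promotion: push the target item's score
# above the rest.
import torch

def dispersion_promotion(embeddings: torch.Tensor) -> torch.Tensor:
    """Encourage a flat singular-value spectrum of the embedding matrix."""
    s = torch.linalg.svdvals(embeddings)   # singular values
    s = s / s.sum()                        # normalize to a distribution
    return s.var()                         # smaller variance = smoother spectrum

def rank_promotion(scores: torch.Tensor, target: int) -> torch.Tensor:
    """Push the target item's predicted score above all others (toy form)."""
    margin = scores - scores[:, target:target + 1]
    return torch.relu(margin).mean()

emb = torch.randn(100, 64, requires_grad=True)   # toy user embeddings
scores = emb @ torch.randn(64, 500)              # toy item scores
loss = dispersion_promotion(emb) + rank_promotion(scores, target=42)
loss.backward()
```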
Abstract: Implicit feedback plays a central role in recommender systems, but its inherently noisy nature severely limits its effectiveness. To denoise implicit feedback, considerable effort has been devoted to graph data augmentation (GDA) methods. Although the bi-level optimization underlying GDA theoretically guarantees better recommendation performance, it also incurs expensive time costs and severe space explosion problems. Specifically, bi-level optimization involves repeatedly traversing all positive and negative instances after each optimization step of the recommendation model. In this paper, we propose a new denoising paradigm, Quick Graph Conversion (QGrace), which transforms the original interaction graph into a purified (for positive instances) and densified (for negative instances) interest graph during recommendation model training. In QGrace, we leverage a gradient matching scheme built on carefully designed generative models to perform the conversion and generation of the interest graph, elegantly sidestepping the high time and space costs. To enable recommendation models to run on interest graphs that lack implicit feedback data, we provide a fine-grained objective function from the perspective of alignment and uniformity. Experimental results on three benchmark datasets demonstrate that QGrace outperforms state-of-the-art GDA methods and recommendation models in both effectiveness and robustness.
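For reference, an alignment-and-uniformity objective in the standard Wang-and-Isola style that such a fine-grained objective builds on; QGrace's exact variant may differ:

```python
# Alignment pulls matched user-item pairs together on the unit
# hypersphere; uniformity spreads all embeddings apart via the
# log-mean Gaussian potential.
import torch
import torch.nn.functional as F

def alignment(user_emb, item_emb):
    """Mean squared distance between L2-normalized matched pairs."""
    u, i = F.normalize(user_emb, dim=-1), F.normalize(item_emb, dim=-1)
    return (u - i).pow(2).sum(dim=-1).mean()

def uniformity(emb, t=2.0):
    """Log of the mean Gaussian kernel over all pairwise distances."""
    e = F.normalize(emb, dim=-1)
    return torch.pdist(e, p=2).pow(2).mul(-t).exp().mean().log()

u = torch.randn(64, 32, requires_grad=True)   # toy user embeddings
i = torch.randn(64, 32, requires_grad=True)   # toy item embeddings
loss = alignment(u, i) + 0.5 * (uniformity(u) + uniformity(i))
loss.backward()
```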
Abstract: Self-supervised learning (SSL) has recently achieved outstanding success in recommendation. By setting up an auxiliary task (either predictive or contrastive), SSL can discover supervisory signals in the raw data without human annotation, which greatly mitigates the problem of sparse user-item interactions. However, most SSL-based recommendation models rely on general-purpose auxiliary tasks, e.g., maximizing the correspondence between node representations learned from the original and perturbed interaction graphs, which are explicitly irrelevant to the recommendation task. Consequently, the rich semantics reflected by social relationships and item categories, which reside in the heterogeneous graphs built from recommendation data, are not fully exploited. To explore recommendation-specific auxiliary tasks, we first quantitatively analyze the heterogeneous interaction data and find a strong positive correlation between the interactions and the number of user-item paths induced by meta-paths. Based on this finding, we design two auxiliary tasks that are tightly coupled with the target task (one predictive, the other contrastive) to connect recommendation with the self-supervision signals hidden in the positive correlation. Finally, we develop a model-agnostic DUal-Auxiliary Learning (DUAL) framework that unifies the SSL and recommendation tasks. Extensive experiments conducted on three real-world datasets demonstrate that DUAL can significantly improve recommendation, reaching state-of-the-art performance.
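The meta-path correlation analysis can be illustrated with adjacency-matrix products on toy data; the matrices and the User-User-Item meta-path below are illustrative assumptions, not DUAL's actual setup:

```python
# Count user-item paths induced by a meta-path (here U-U-I via social
# links) with an adjacency-matrix product, then correlate the counts
# with observed interactions.
import numpy as np

rng = np.random.default_rng(0)
UU = (rng.random((50, 50)) < 0.10).astype(int)   # user-user social links
UI = (rng.random((50, 80)) < 0.05).astype(int)   # user-item interactions

# Path count for meta-path U-U-I: entry (u, i) is the number of u's
# friends who interacted with item i.
path_counts = UU @ UI

# Correlation between path counts and interactions over all (u, i) pairs
corr = np.corrcoef(path_counts.ravel(), UI.ravel())[0, 1]
print(f"Pearson correlation (toy data): {corr:.3f}")
```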
Abstract: With increasingly fierce market competition, offering a free trial has become a potent strategy for promoting products and attracting users. By providing users with opportunities to experience goods free of charge, a free trial lets adopters learn more about products and thus increases their willingness to buy. However, finding the right adopters, the critical step in the promotion process, is rarely explored. Winnowing users by their static demographic attributes is feasible but less effective, as it neglects their personalized preferences. To dynamically match products with the best adopters, we propose a novel free trial user selection model named SMILE, based on reinforcement learning (RL), in which an agent actively selects specific adopters to maximize the profit after free trials. Specifically, we design a tree structure to reformulate the action space, which allows us to select adopters efficiently from a massive user space. Experimental analysis on three datasets demonstrates the proposed model's superiority and elucidates why reinforcement learning and the tree structure improve performance. Our study demonstrates the technical feasibility of constructing a more robust and intelligent user selection model and offers guidance for investigating further marketing promotion strategies.
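A toy sketch of a tree-structured action space of the kind SMILE describes, assuming PyTorch; the binary branching and the shared per-level policies are our simplifications, not the paper's architecture:

```python
# Instead of a flat softmax over a huge user space, the agent walks a
# binary tree, choosing left/right at each level, so selecting one
# adopter costs O(log N) decisions.
import torch
import torch.nn as nn

class TreePolicy(nn.Module):
    def __init__(self, state_dim=16, depth=10):   # 2**10 = 1024 leaf users
        super().__init__()
        self.depth = depth
        self.nodes = nn.ModuleList(
            [nn.Linear(state_dim, 2) for _ in range(depth)])

    def forward(self, state):
        """Walk root-to-leaf; return leaf index (user id) and log-prob."""
        index, log_prob = 0, 0.0
        for level in range(self.depth):
            probs = self.nodes[level](state).softmax(dim=-1)
            branch = torch.multinomial(probs, 1).item()
            log_prob = log_prob + probs[branch].log()
            index = index * 2 + branch
        return index, log_prob   # log_prob feeds a policy-gradient update

policy = TreePolicy()
user_id, logp = policy(torch.randn(16))
```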
Abstract: To explore the robustness of recommender systems, researchers have proposed various shilling attack models and analyzed their adverse effects. Primitive attacks are highly feasible but less effective due to their simplistic handcrafted rules, while upgraded attacks are more powerful but costly and difficult to deploy because they require more knowledge of the recommender system. In this paper, we explore a novel shilling attack called Graph cOnvolution-based generative shilling ATtack (GOAT) to balance attack feasibility and effectiveness. GOAT adopts the primitive attacks' paradigm of assigning items to fake users by sampling and the upgraded attacks' paradigm of generating fake ratings with a deep learning-based model. It deploys a generative adversarial network (GAN) that learns the real rating distribution to generate fake ratings. Additionally, the generator incorporates a tailored graph convolution structure that leverages the correlations between co-rated items to smooth the fake ratings and enhance their authenticity. Extensive experiments on two public datasets evaluate GOAT's performance from multiple perspectives. Our study of GOAT demonstrates the technical feasibility of building a more powerful and intelligent attack model at a much-reduced cost, enables analysis of the threat posed by such attacks, and offers guidance for investigating necessary countermeasures.
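A hedged sketch of a generator in GOAT's spirit; the layer sizes, the identity-placeholder adjacency, and the 0-5 rating scale are illustrative assumptions, not GOAT's actual implementation:

```python
# A GAN generator whose output ratings are smoothed by a graph
# convolution over the item co-rating graph, so correlated items
# receive consistent fake ratings.
import torch
import torch.nn as nn

class GraphSmoothedGenerator(nn.Module):
    def __init__(self, noise_dim=32, n_items=200, adj=None):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(),
                                nn.Linear(128, n_items))
        # row-normalized item co-rating adjacency (identity as placeholder)
        self.register_buffer("adj", adj if adj is not None
                             else torch.eye(n_items))

    def forward(self, z):
        raw = self.fc(z)                    # raw fake rating profile
        smoothed = raw @ self.adj           # propagate over co-rated items
        return torch.sigmoid(smoothed) * 5  # map to a 0-5 rating scale

gen = GraphSmoothedGenerator()
fake_ratings = gen(torch.randn(4, 32))      # 4 fake user profiles
```

A discriminator trained on real rating profiles would then provide the adversarial signal that makes these profiles hard to distinguish from genuine users.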