Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yunpeng Zhao

Does Training with Synthetic Data Truly Protect Privacy?

Feb 18, 2025

Yunpeng Zhao, Jie Zhang

Abstract:As synthetic data becomes increasingly popular in machine learning tasks, numerous methods--without formal differential privacy guarantees--use synthetic data for training. These methods often claim, either explicitly or implicitly, to protect the privacy of the original training data. In this work, we explore four different training paradigms: coreset selection, dataset distillation, data-free knowledge distillation, and synthetic data generated from diffusion models. While all these methods utilize synthetic data for training, they lead to vastly different conclusions regarding privacy preservation. We caution that empirical approaches to preserving data privacy require careful and rigorous evaluation; otherwise, they risk providing a false sense of privacy.

* Accepted to ICLR 2025

Via

Access Paper or Ask Questions

Community Detection with Heterogeneous Block Covariance Model

Dec 04, 2024

Xiang Li, Yunpeng Zhao, Qing Pan, Ning Hao

Abstract:Community detection is the task of clustering objects based on their pairwise relationships. Most of the model-based community detection methods, such as the stochastic block model and its variants, are designed for networks with binary (yes/no) edges. In many practical scenarios, edges often possess continuous weights, spanning positive and negative values, which reflect varying levels of connectivity. To address this challenge, we introduce the heterogeneous block covariance model (HBCM) that defines a community structure within the covariance matrix, where edges have signed and continuous weights. Furthermore, it takes into account the heterogeneity of objects when forming connections with other objects within a community. A novel variational expectation-maximization algorithm is proposed to estimate the group membership. The HBCM provides provable consistent estimates of memberships, and its promising performance is observed in numerical simulations with different setups. The model is applied to a single-cell RNA-seq dataset of a mouse embryo and a stock price dataset. Supplementary materials for this article are available online.

Via

Access Paper or Ask Questions

Dealing with All-stage Missing Modality: Towards A Universal Model with Robust Reconstruction and Personalization

Jun 04, 2024

Yunpeng Zhao, Cheng Chen, Qing You Pang, Quanzheng Li, Carol Tang, Beng-Ti Ang, Yueming Jin

Figure 1 for Dealing with All-stage Missing Modality: Towards A Universal Model with Robust Reconstruction and Personalization

Figure 2 for Dealing with All-stage Missing Modality: Towards A Universal Model with Robust Reconstruction and Personalization

Figure 3 for Dealing with All-stage Missing Modality: Towards A Universal Model with Robust Reconstruction and Personalization

Figure 4 for Dealing with All-stage Missing Modality: Towards A Universal Model with Robust Reconstruction and Personalization

Abstract:Addressing missing modalities presents a critical challenge in multimodal learning. Current approaches focus on developing models that can handle modality-incomplete inputs during inference, assuming that the full set of modalities are available for all the data during training. This reliance on full-modality data for training limits the use of abundant modality-incomplete samples that are often encountered in practical settings. In this paper, we propose a robust universal model with modality reconstruction and model personalization, which can effectively tackle the missing modality at both training and testing stages. Our method leverages a multimodal masked autoencoder to reconstruct the missing modality and masked patches simultaneously, incorporating an innovative distribution approximation mechanism to fully utilize both modality-complete and modality-incomplete data. The reconstructed modalities then contributes to our designed data-model co-distillation scheme to guide the model learning in the presence of missing modalities. Moreover, we propose a CLIP-driven hyper-network to personalize partial model parameters, enabling the model to adapt to each distinct missing modality scenario. Our method has been extensively validated on two brain tumor segmentation benchmarks. Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches under the all-stage missing modality settings with different missing ratios. Code will be available.

Via

Access Paper or Ask Questions

Spatial Temporal Graph Convolution with Graph Structure Self-learning for Early MCI Detection

Nov 11, 2022

Yunpeng Zhao, Fugen Zhou, Bin Guo, Bo Liu

Abstract:Graph neural networks (GNNs) have been successfully applied to early mild cognitive impairment (EMCI) detection, with the usage of elaborately designed features constructed from blood oxygen level-dependent (BOLD) time series. However, few works explored the feasibility of using BOLD signals directly as features. Meanwhile, existing GNN-based methods primarily rely on hand-crafted explicit brain topology as the adjacency matrix, which is not optimal and ignores the implicit topological organization of the brain. In this paper, we propose a spatial temporal graph convolutional network with a novel graph structure self-learning mechanism for EMCI detection. The proposed spatial temporal graph convolution block directly exploits BOLD time series as input features, which provides an interesting view for rsfMRI-based preclinical AD diagnosis. Moreover, our model can adaptively learn the optimal topological structure and refine edge weights with the graph structure self-learning mechanism. Results on the Alzheimer's Disease Neuroimaging Initiative (ADNI) database show that our method outperforms state-of-the-art approaches. Biomarkers consistent with previous studies can be extracted from the model, proving the reliable interpretability of our method.

* 5 pages, 3 figures

Via

Access Paper or Ask Questions

Variational Estimators of the Degree-corrected Latent Block Model for Bipartite Networks

Jun 16, 2022

Yunpeng Zhao, Ning Hao, Ji Zhu

Figure 1 for Variational Estimators of the Degree-corrected Latent Block Model for Bipartite Networks

Figure 2 for Variational Estimators of the Degree-corrected Latent Block Model for Bipartite Networks

Figure 3 for Variational Estimators of the Degree-corrected Latent Block Model for Bipartite Networks

Figure 4 for Variational Estimators of the Degree-corrected Latent Block Model for Bipartite Networks

Abstract:Biclustering on bipartite graphs is an unsupervised learning task that simultaneously clusters the two types of objects in the graph, for example, users and movies in a movie review dataset. The latent block model (LBM) has been proposed as a model-based tool for biclustering. Biclustering results by the LBM are, however, usually dominated by the row and column sums of the data matrix, i.e., degrees. We propose a degree-corrected latent block model (DC-LBM) to accommodate degree heterogeneity in row and column clusters, which greatly outperforms the classical LBM in the MovieLens dataset and simulated data. We develop an efficient variational expectation-maximization algorithm by observing that the row and column degrees maximize the objective function in the M step given any probability assignment on the cluster labels. We prove the label consistency of the variational estimator under the DC-LBM, which allows the expected graph density goes to zero as long as the average expected degrees of rows and columns go to infinity.

Via

Access Paper or Ask Questions

Deep Learning Models in Detection of Dietary Supplement Adverse Event Signals from Twitter

Jun 21, 2021

Yefeng Wang, Yunpeng Zhao, Jiang Bian, Rui Zhang

Figure 1 for Deep Learning Models in Detection of Dietary Supplement Adverse Event Signals from Twitter

Figure 2 for Deep Learning Models in Detection of Dietary Supplement Adverse Event Signals from Twitter

Figure 3 for Deep Learning Models in Detection of Dietary Supplement Adverse Event Signals from Twitter

Figure 4 for Deep Learning Models in Detection of Dietary Supplement Adverse Event Signals from Twitter

Abstract:Objective: The objective of this study is to develop a deep learning pipeline to detect signals on dietary supplement-related adverse events (DS AEs) from Twitter. Material and Methods: We obtained 247,807 tweets ranging from 2012 to 2018 that mentioned both DS and AE. We annotated biomedical entities and relations on 2,000 randomly selected tweets. For the concept extraction task, we compared the performance of traditional word embeddings with SVM, CRF and LSTM-CRF classifiers to BERT models. For the relation extraction task, we compared GloVe vectors with CNN classifiers to BERT models. We chose the best performing models in each task to assemble an end-to-end deep learning pipeline to detect DS AE signals and compared the results to the known DS AEs from a DS knowledge base (i.e., iDISK). Results: In both tasks, the BERT-based models outperformed traditional word embeddings. The best performing concept extraction model is the BioBERT model that can identify supplement, symptom, and body organ entities with F1-scores of 0.8646, 0.8497, and 0.7104, respectively. The best performing relation extraction model is the BERT model that can identify purpose and AE relations with F1-scores of 0.8335 and 0.7538, respectively. The end-to-end pipeline was able to extract DS indication and DS AEs with an F1-score of 0.7459 and 0,7414, respectively. Comparing to the iDISK, we could find both known and novel DS-AEs. Conclusion: We have demonstrated the feasibility of detecting DS AE signals from Twitter with a BioBERT-based deep learning pipeline.

* 1 Figure, 6 Tables

Via

Access Paper or Ask Questions

Identifiability and consistency of network inference using the hub model and variants: a restricted class of Bernoulli mixture models

Apr 22, 2020

Yunpeng Zhao, Peter Bickel, Charles Weko

Figure 1 for Identifiability and consistency of network inference using the hub model and variants: a restricted class of Bernoulli mixture models

Figure 2 for Identifiability and consistency of network inference using the hub model and variants: a restricted class of Bernoulli mixture models

Abstract:Statistical network analysis primarily focuses on inferring the parameters of an observed network. In many applications, especially in the social sciences, the observed data is the groups formed by individual subjects. In these applications, the network is itself a parameter of a statistical model. Zhao and Weko (2019) propose a model-based approach, called the hub model, to infer implicit networks from grouping behavior. The hub model assumes that each member of the group is brought together by a member of the group called the hub. The hub model belongs to the family of Bernoulli mixture models. Identifiability of parameters is a notoriously difficult problem for Bernoulli mixture models. This paper proves identifiability of the hub model parameters and estimation consistency under mild conditions. Furthermore, this paper generalizes the hub model by introducing a model component that allows hubless groups in which individual nodes spontaneously appear independent of any other individual. We refer to this additional component as the null component. The new model bridges the gap between the hub model and the degenerate case of the mixture model -- the Bernoulli product. Identifiability and consistency are also proved for the new model. Numerical studies are provided to demonstrate the theoretical results.

Via

Access Paper or Ask Questions

Integrating Crowdsourcing and Active Learning for Classification of Work-Life Events from Tweets

Apr 02, 2020

Yunpeng Zhao, Mattia Prosperi, Tianchen Lyu, Yi Guo, Jiang Bian

Figure 1 for Integrating Crowdsourcing and Active Learning for Classification of Work-Life Events from Tweets

Figure 2 for Integrating Crowdsourcing and Active Learning for Classification of Work-Life Events from Tweets

Figure 3 for Integrating Crowdsourcing and Active Learning for Classification of Work-Life Events from Tweets

Figure 4 for Integrating Crowdsourcing and Active Learning for Classification of Work-Life Events from Tweets

Abstract:Social media, especially Twitter, is being increasingly used for research with predictive analytics. In social media studies, natural language processing (NLP) techniques are used in conjunction with expert-based, manual and qualitative analyses. However, social media data are unstructured and must undergo complex manipulation for research use. The manual annotation is the most resource and time-consuming process that multiple expert raters have to reach consensus on every item, but is essential to create gold-standard datasets for training NLP-based machine learning classifiers. To reduce the burden of the manual annotation, yet maintaining its reliability, we devised a crowdsourcing pipeline combined with active learning strategies. We demonstrated its effectiveness through a case study that identifies job loss events from individual tweets. We used Amazon Mechanical Turk platform to recruit annotators from the Internet and designed a number of quality control measures to assure annotation accuracy. We evaluated 4 different active learning strategies (i.e., least confident, entropy, vote entropy, and Kullback-Leibler divergence). The active learning strategies aim at reducing the number of tweets needed to reach a desired performance of automated classification. Results show that crowdsourcing is useful to create high-quality annotations and active learning helps in reducing the number of required tweets, although there was no substantial difference among the strategies tested.

Via

Access Paper or Ask Questions

Understanding Perceptions and Attitudes in Breast Cancer Discussions on Twitter

May 22, 2019

Francois Modave, Yunpeng Zhao, Janice Krieger, Zhe He, Yi Guo, Jinhai Huo, Mattia Prosperi, Jiang Bian

Figure 1 for Understanding Perceptions and Attitudes in Breast Cancer Discussions on Twitter

Figure 2 for Understanding Perceptions and Attitudes in Breast Cancer Discussions on Twitter

Figure 3 for Understanding Perceptions and Attitudes in Breast Cancer Discussions on Twitter

Figure 4 for Understanding Perceptions and Attitudes in Breast Cancer Discussions on Twitter

Abstract:Among American women, the rate of breast cancer is only second to lung cancer. An estimated 12.4% women will develop breast cancer over the course of their lifetime. The widespread use of social media across the socio-economic spectrum offers unparalleled ways to facilitate information sharing, in particular as it pertains to health. Social media is also used by many healthcare stakeholders, ranging from government agencies to healthcare industry, to disseminate health information and to engage patients. The purpose of this study is to investigate people's perceptions and attitudes relate to breast cancer, especially those that are related to physical activities, on Twitter. To achieve this, we first identified and collected tweets related to breast cancer; and then used topic modeling and sentiment analysis techniques to understanding discussion themes and quantify Twitter users' perceptions and emotions w.r.t breast cancer to answer 5 research questions.

* 5 pages, 10 figures, The 17th World Congress of Medical and Health Informatics

Via

Access Paper or Ask Questions

Logistic Regression Augmented Community Detection for Network Data with Application in Identifying Autism-Related Gene Pathways

Sep 07, 2018

Yunpeng Zhao, Qing Pan, Chengan Du

Figure 1 for Logistic Regression Augmented Community Detection for Network Data with Application in Identifying Autism-Related Gene Pathways

Figure 2 for Logistic Regression Augmented Community Detection for Network Data with Application in Identifying Autism-Related Gene Pathways

Figure 3 for Logistic Regression Augmented Community Detection for Network Data with Application in Identifying Autism-Related Gene Pathways

Figure 4 for Logistic Regression Augmented Community Detection for Network Data with Application in Identifying Autism-Related Gene Pathways

Abstract:When searching for gene pathways leading to specific disease outcomes, additional information on gene characteristics is often available that may facilitate to differentiate genes related to the disease from irrelevant background when connections involving both types of genes are observed and their relationships to the disease are unknown. We propose method to single out irrelevant background genes with the help of auxiliary information through a logistic regression, and cluster relevant genes into cohesive groups using the adjacency matrix. Expectation-maximization algorithm is modified to maximize a joint pseudo-likelihood assuming latent indicators for relevance to the disease and latent group memberships as well as Poisson or multinomial distributed link numbers within and between groups. A robust version allowing arbitrary linkage patterns within the background is further derived. Asymptotic consistency of label assignments under the stochastic blockmodel is proven. Superior performance and robustness in finite samples are observed in simulation studies. The proposed robust method identifies previously missed gene sets underlying autism related neurological diseases using diverse data sources including de novo mutations, gene expressions and protein-protein interactions.

* Biometrics (2019)

Via

Access Paper or Ask Questions