Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Haoyue Dai

Type Information-Assisted Self-Supervised Knowledge Graph Denoising

Mar 13, 2025

Jiaqi Sun, Yujia Zheng, Xinshuai Dong, Haoyue Dai, Kun Zhang

Abstract:Knowledge graphs serve as critical resources supporting intelligent systems, but they can be noisy due to imperfect automatic generation processes. Existing approaches to noise detection often rely on external facts, logical rule constraints, or structural embeddings. These methods are often challenged by imperfect entity alignment, flexible knowledge graph construction, and overfitting on structures. In this paper, we propose to exploit the consistency between entity and relation type information for noise detection, resulting a novel self-supervised knowledge graph denoising method that avoids those problems. We formalize type inconsistency noise as triples that deviate from the majority with respect to type-dependent reasoning along the topological structure. Specifically, we first extract a compact representation of a given knowledge graph via an encoder that models the type dependencies of triples. Then, the decoder reconstructs the original input knowledge graph based on the compact representation. It is worth noting that, our proposal has the potential to address the problems of knowledge graph compression and completion, although this is not our focus. For the specific task of noise detection, the discrepancy between the reconstruction results and the input knowledge graph provides an opportunity for denoising, which is facilitated by the type consistency embedded in our method. Experimental validation demonstrates the effectiveness of our approach in detecting potential noise in real-world data.

* Accepted by AISTATS 2025

Via

Access Paper or Ask Questions

When Selection Meets Intervention: Additional Complexities in Causal Discovery

Mar 10, 2025

Haoyue Dai, Ignavier Ng, Jianle Sun, Zeyu Tang, Gongxu Luo, Xinshuai Dong, Peter Spirtes, Kun Zhang

Abstract:We address the common yet often-overlooked selection bias in interventional studies, where subjects are selectively enrolled into experiments. For instance, participants in a drug trial are usually patients of the relevant disease; A/B tests on mobile applications target existing users only, and gene perturbation studies typically focus on specific cell types, such as cancer cells. Ignoring this bias leads to incorrect causal discovery results. Even when recognized, the existing paradigm for interventional causal discovery still fails to address it. This is because subtle differences in when and where interventions happen can lead to significantly different statistical patterns. We capture this dynamic by introducing a graphical model that explicitly accounts for both the observed world (where interventions are applied) and the counterfactual world (where selection occurs while interventions have not been applied). We characterize the Markov property of the model, and propose a provably sound algorithm to identify causal relations as well as selection mechanisms up to the equivalence class, from data with soft interventions and unknown targets. Through synthetic and real-world experiments, we demonstrate that our algorithm effectively identifies true causal relations despite the presence of selection bias.

* Proceedings of the International Conference on Learning Representations (ICLR), 2025
* Appears at ICLR 2025 (oral)

Via

Access Paper or Ask Questions

Permutation-Based Rank Test in the Presence of Discretization and Application in Causal Discovery with Mixed Data

Jan 31, 2025

Xinshuai Dong, Ignavier Ng, Boyang Sun, Haoyue Dai, Guang-Yuan Hao, Shunxing Fan, Peter Spirtes, Yumou Qiu, Kun Zhang

Figure 1 for Permutation-Based Rank Test in the Presence of Discretization and Application in Causal Discovery with Mixed Data

Figure 2 for Permutation-Based Rank Test in the Presence of Discretization and Application in Causal Discovery with Mixed Data

Figure 3 for Permutation-Based Rank Test in the Presence of Discretization and Application in Causal Discovery with Mixed Data

Figure 4 for Permutation-Based Rank Test in the Presence of Discretization and Application in Causal Discovery with Mixed Data

Abstract:Recent advances have shown that statistical tests for the rank of cross-covariance matrices play an important role in causal discovery. These rank tests include partial correlation tests as special cases and provide further graphical information about latent variables. Existing rank tests typically assume that all the continuous variables can be perfectly measured, and yet, in practice many variables can only be measured after discretization. For example, in psychometric studies, the continuous level of certain personality dimensions of a person can only be measured after being discretized into order-preserving options such as disagree, neutral, and agree. Motivated by this, we propose Mixed data Permutation-based Rank Test (MPRT), which properly controls the statistical errors even when some or all variables are discretized. Theoretically, we establish the exchangeability and estimate the asymptotic null distribution by permutations; as a consequence, MPRT can effectively control the Type I error in the presence of discretization while previous methods cannot. Empirically, our method is validated by extensive experiments on synthetic data and real-world data to demonstrate its effectiveness as well as applicability in causal discovery.

Via

Access Paper or Ask Questions

Local Causal Discovery with Linear non-Gaussian Cyclic Models

Mar 21, 2024

Haoyue Dai, Ignavier Ng, Yujia Zheng, Zhengqing Gao, Kun Zhang

Figure 1 for Local Causal Discovery with Linear non-Gaussian Cyclic Models

Figure 2 for Local Causal Discovery with Linear non-Gaussian Cyclic Models

Figure 3 for Local Causal Discovery with Linear non-Gaussian Cyclic Models

Figure 4 for Local Causal Discovery with Linear non-Gaussian Cyclic Models

Abstract:Local causal discovery is of great practical significance, as there are often situations where the discovery of the global causal structure is unnecessary, and the interest lies solely on a single target variable. Most existing local methods utilize conditional independence relations, providing only a partially directed graph, and assume acyclicity for the ground-truth structure, even though real-world scenarios often involve cycles like feedback mechanisms. In this work, we present a general, unified local causal discovery method with linear non-Gaussian models, whether they are cyclic or acyclic. We extend the application of independent component analysis from the global context to independent subspace analysis, enabling the exact identification of the equivalent local directed structures and causal strengths from the Markov blanket of the target variable. We also propose an alternative regression-based method in the particular acyclic scenarios. Our identifiability results are empirically validated using both synthetic and real-world datasets.

* Appears at AISTATS 2024

Via

Access Paper or Ask Questions

Gene Regulatory Network Inference in the Presence of Dropouts: a Causal View

Mar 21, 2024

Haoyue Dai, Ignavier Ng, Gongxu Luo, Peter Spirtes, Petar Stojanov, Kun Zhang

Abstract:Gene regulatory network inference (GRNI) is a challenging problem, particularly owing to the presence of zeros in single-cell RNA sequencing data: some are biological zeros representing no gene expression, while some others are technical zeros arising from the sequencing procedure (aka dropouts), which may bias GRNI by distorting the joint distribution of the measured gene expressions. Existing approaches typically handle dropout error via imputation, which may introduce spurious relations as the true joint distribution is generally unidentifiable. To tackle this issue, we introduce a causal graphical model to characterize the dropout mechanism, namely, Causal Dropout Model. We provide a simple yet effective theoretical result: interestingly, the conditional independence (CI) relations in the data with dropouts, after deleting the samples with zero values (regardless if technical or not) for the conditioned variables, are asymptotically identical to the CI relations in the original data without dropouts. This particular test-wise deletion procedure, in which we perform CI tests on the samples without zeros for the conditioned variables, can be seamlessly integrated with existing structure learning approaches including constraint-based and greedy score-based methods, thus giving rise to a principled framework for GRNI in the presence of dropouts. We further show that the causal dropout model can be validated from data, and many existing statistical models to handle dropouts fit into our model as specific parametric instances. Empirical evaluation on synthetic, curated, and real-world experimental transcriptomic data comprehensively demonstrate the efficacy of our method.

* Appears at ICLR 2024 (oral)

Via

Access Paper or Ask Questions

On the Three Demons in Causality in Finance: Time Resolution, Nonstationarity, and Latent Factors

Jan 12, 2024

Xinshuai Dong, Haoyue Dai, Yewen Fan, Songyao Jin, Sathyamoorthy Rajendran, Kun Zhang

Abstract:Financial data is generally time series in essence and thus suffers from three fundamental issues: the mismatch in time resolution, the time-varying property of the distribution - nonstationarity, and causal factors that are important but unknown/unobserved. In this paper, we follow a causal perspective to systematically look into these three demons in finance. Specifically, we reexamine these issues in the context of causality, which gives rise to a novel and inspiring understanding of how the issues can be addressed. Following this perspective, we provide systematic solutions to these problems, which hopefully would serve as a foundation for future research in the area.

Via

Access Paper or Ask Questions

ML4C: Seeing Causality Through Latent Vicinity

Oct 01, 2021

Haoyue Dai, Rui Ding, Yuanyuan Jiang, Shi Han, Dongmei Zhang

Figure 1 for ML4C: Seeing Causality Through Latent Vicinity

Figure 2 for ML4C: Seeing Causality Through Latent Vicinity

Figure 3 for ML4C: Seeing Causality Through Latent Vicinity

Figure 4 for ML4C: Seeing Causality Through Latent Vicinity

Abstract:Supervised Causal Learning (SCL) aims to learn causal relations from observational data by accessing previously seen datasets associated with ground truth causal relations. This paper presents a first attempt at addressing a fundamental question: What are the benefits from supervision and how does it benefit? Starting from seeing that SCL is not better than random guessing if the learning target is non-identifiable a priori, we propose a two-phase paradigm for SCL by explicitly considering structure identifiability. Following this paradigm, we tackle the problem of SCL on discrete data and propose ML4C. The core of ML4C is a binary classifier with a novel learning target: it classifies whether an Unshielded Triple (UT) is a v-structure or not. Starting from an input dataset with the corresponding skeleton provided, ML4C orients each UT once it is classified as a v-structure. These v-structures are together used to construct the final output. To address the fundamental question of SCL, we propose a principled method for ML4C featurization: we exploit the vicinity of a given UT (i.e., the neighbors of UT in skeleton), and derive features by considering the conditional dependencies and structural entanglement within the vicinity. We further prove that ML4C is asymptotically perfect. Last but foremost, thorough experiments conducted on benchmark datasets demonstrate that ML4C remarkably outperforms other state-of-the-art algorithms in terms of accuracy, robustness, tolerance and transferability. In summary, ML4C shows promising results on validating the effectiveness of supervision for causal learning.

* causal discovery, supervised causal learning, vicinity, conditional dependency, entanglement, learnability

Via

Access Paper or Ask Questions

What do CNN neurons learn: Visualization & Clustering

Oct 18, 2020

Haoyue Dai

Figure 1 for What do CNN neurons learn: Visualization & Clustering

Figure 2 for What do CNN neurons learn: Visualization & Clustering

Figure 3 for What do CNN neurons learn: Visualization & Clustering

Figure 4 for What do CNN neurons learn: Visualization & Clustering

Abstract:In recent years convolutional neural networks (CNN) have shown striking progress in various tasks. However, despite the high performance, the training and prediction process remains to be a black box, leaving it a mystery to extract what neurons learn in CNN. In this paper, we address the problem of interpreting a CNN from the aspects of the input image's focus and preference, and the neurons' domination, activation and contribution to a concrete final prediction. Specifically, we use two techniques - visualization and clustering - to tackle the problems above. Visualization means the method of gradient descent on image pixel, and in clustering section two algorithms are proposed to cluster respectively over image categories and network neurons. Experiments and quantitative analyses have demonstrated the effectiveness of the two methods in explaining the question: what do neurons learn.

* 9 pages, 10 figures, tech report

Via

Access Paper or Ask Questions