Shanghai Jiao Tong University
Abstract:Causality inference is prone to spurious causal interactions, due to the substantial confounders in a complex system. While many existing methods based on the statistical methods or dynamical methods attempt to address misidentification challenges, there remains a notable lack of effective methods to infer causality, in particular in the presence of invisible/unobservable confounders. As a result, accurately inferring causation with invisible confounders remains a largely unexplored and outstanding issue in data science and AI fields. In this work, we propose a method to overcome such challenges to infer dynamical causality under invisible confounders (CIC method) and further reconstruct the invisible confounders from time-series data by developing an orthogonal decomposition theorem in a delay embedding space. The core of our CIC method lies in its ability to decompose the observed variables not in their original space but in their delay embedding space into the common and private subspaces respectively, thereby quantifying causality between those variables both theoretically and computationally. This theoretical foundation ensures the causal detection for any high-dimensional system even with only two observed variables under many invisible confounders, which is actually a long-standing problem in the field. In addition to the invisible confounder problem, such a decomposition actually makes the intertwined variables separable in the embedding space, thus also solving the non-separability problem of causal inference. Extensive validation of the CIC method is carried out using various real datasets, and the experimental results demonstrates its effectiveness to reconstruct real biological networks even with unobserved confounders.
Abstract:We study the stochastic multi-armed bandit problem in the $P$-pass streaming model. In this problem, the $n$ arms are present in a stream and at most $m<n$ arms and their statistics can be stored in the memory. We give a complete characterization of the optimal regret in terms of $m, n$ and $P$. Specifically, we design an algorithm with $\tilde O\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$ regret and complement it with an $\tilde \Omega\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$ lower bound when the number of rounds $T$ is sufficiently large. Our results are tight up to a logarithmic factor in $n$ and $P$.
Abstract:Learning with expert advice and multi-armed bandit are two classic online decision problems which differ on how the information is observed in each round of the game. We study a family of problems interpolating the two. For a vector $\mathbf{m}=(m_1,\dots,m_K)\in \mathbb{N}^K$, an instance of $\mathbf{m}$-MAB indicates that the arms are partitioned into $K$ groups and the $i$-th group contains $m_i$ arms. Once an arm is pulled, the losses of all arms in the same group are observed. We prove tight minimax regret bounds for $\mathbf{m}$-MAB and design an optimal PAC algorithm for its pure exploration version, $\mathbf{m}$-BAI, where the goal is to identify the arm with minimum loss with as few rounds as possible. We show that the minimax regret of $\mathbf{m}$-MAB is $\Theta\left(\sqrt{T\sum_{k=1}^K\log (m_k+1)}\right)$ and the minimum number of pulls for an $(\epsilon,0.05)$-PAC algorithm of $\mathbf{m}$-BAI is $\Theta\left(\frac{1}{\epsilon^2}\cdot \sum_{k=1}^K\log (m_k+1)\right)$. Both our upper bounds and lower bounds for $\mathbf{m}$-MAB can be extended to a more general setting, namely the bandit with graph feedback, in terms of the clique cover and related graph parameters. As consequences, we obtained tight minimax regret bounds for several families of feedback graphs.
Abstract:The problem of bandit with graph feedback generalizes both the multi-armed bandit (MAB) problem and the learning with expert advice problem by encoding in a directed graph how the loss vector can be observed in each round of the game. The mini-max regret is closely related to the structure of the feedback graph and their connection is far from being fully understood. We propose a new algorithmic framework for the problem based on a partition of the feedback graph. Our analysis reveals the interplay between various parts of the graph by decomposing the regret to the sum of the regret caused by small parts and the regret caused by their interaction. As a result, our algorithm can be viewed as an interpolation and generalization of the optimal algorithms for MAB and learning with expert advice. Our framework unifies previous algorithms for both strongly observable graphs and weakly observable graphs, resulting in improved and optimal regret bounds on a wide range of graph families including graphs of bounded degree and strongly observable graphs with a few corrupted arms.
Abstract:Outcome labeling ambiguity and subjectivity are ubiquitous in real-world datasets. While practitioners commonly combine ambiguous outcome labels in an ad hoc way to improve the accuracy of multi-class classification, there lacks a principled approach to guide label combination by any optimality criterion. To address this problem, we propose the information-theoretic classification accuracy (ITCA), a criterion of outcome "information" conditional on outcome prediction, to guide practitioners on how to combine ambiguous outcome labels. ITCA indicates a balance in the trade-off between prediction accuracy (how well do predicted labels agree with actual labels) and prediction resolution (how many labels are predictable). To find the optimal label combination indicated by ITCA, we develop two search strategies: greedy search and breadth-first search. Notably, ITCA and the two search strategies are adaptive to all machine-learning classification algorithms. Coupled with a classification algorithm and a search strategy, ITCA has two uses: to improve prediction accuracy and to identify ambiguous labels. We first verify that ITCA achieves high accuracy with both search strategies in finding the correct label combinations on synthetic and real data. Then we demonstrate the effectiveness of ITCA in diverse applications including medical prognosis, cancer survival prediction, user demographics prediction, and cell type classification.
Abstract:The bandit problem with graph feedback, proposed in [Mannor and Shamir, NeurIPS 2011], is modeled by a directed graph $G=(V,E)$ where $V$ is the collection of bandit arms, and once an arm is triggered, all its incident arms are observed. A fundamental question is how the structure of the graph affects the min-max regret. We propose the notions of the fractional weak domination number $\delta^*$ and the $k$-packing independence number capturing upper bound and lower bound for the regret respectively. We show that the two notions are inherently connected via aligning them with the linear program of the weakly dominating set and its dual -- the fractional vertex packing set respectively. Based on this connection, we utilize the strong duality theorem to prove a general regret upper bound $O\left(\left( \delta^*\log |V|\right)^{\frac{1}{3}}T^{\frac{2}{3}}\right)$ and a lower bound $\Omega\left(\left(\delta^*/\alpha\right)^{\frac{1}{3}}T^{\frac{2}{3}}\right)$ where $\alpha$ is the integrality gap of the dual linear program. Therefore, our bounds are tight up to a $\left(\log |V|\right)^{\frac{1}{3}}$ factor on graphs with bounded integrality gap for the vertex packing problem including trees and graphs with bounded degree. Moreover, we show that for several special families of graphs, we can get rid of the $\left(\log |V|\right)^{\frac{1}{3}}$ factor and establish optimal regret.
Abstract:Matrix decomposition is one of the fundamental tools to discover knowledge from big data generated by modern applications. However, it is still inefficient or infeasible to process very big data using such a method in a single machine. Moreover, big data are often distributedly collected and stored on different machines. Thus, such data generally bear strong heterogeneous noise. It is essential and useful to develop distributed matrix decomposition for big data analytics. Such a method should scale up well, model the heterogeneous noise, and address the communication issue in a distributed system. To this end, we propose a distributed Bayesian matrix decomposition model (DBMD) for big data mining and clustering. Specifically, we adopt three strategies to implement the distributed computing including 1) the accelerated gradient descent, 2) the alternating direction method of multipliers (ADMM), and 3) the statistical inference. We investigate the theoretical convergence behaviors of these algorithms. To address the heterogeneity of the noise, we propose an optimal plug-in weighted average that reduces the variance of the estimation. Synthetic experiments validate our theoretical results, and real-world experiments show that our algorithms scale up well to big data and achieves superior or competing performance compared to other distributed methods.
Abstract:Principal component analysis (PCA) is one of the most widely used dimension reduction and multivariate statistical techniques. From a probabilistic perspective, PCA seeks a low-dimensional representation of data in the presence of independent identical Gaussian noise. Probabilistic PCA (PPCA) and its variants have been extensively studied for decades. Most of them assume the underlying noise follows a certain independent identical distribution. However, the noise in the real world is usually complicated and structured. To address this challenge, some non-linear variants of PPCA have been proposed. But those methods are generally difficult to interpret. To this end, we propose a powerful and intuitive PCA method (MN-PCA) through modeling the graphical noise by the matrix normal distribution, which enables us to explore the structure of noise in both the feature space and the sample space. MN-PCA obtains a low-rank representation of data and the structure of noise simultaneously. And it can be explained as approximating data over the generalized Mahalanobis distance. We develop two algorithms to solve this model: one maximizes the regularized likelihood, the other exploits the Wasserstein distance, which is more robust. Extensive experiments on various data demonstrate their effectiveness.
Abstract:Matrix decomposition is a popular and fundamental approach in machine learning and data mining. It has been successfully applied into various fields. Most matrix decomposition methods focus on decomposing a data matrix from one single source. However, it is common that data are from different sources with heterogeneous noise. A few of matrix decomposition methods have been extended for such multi-view data integration and pattern discovery. While only few methods were designed to consider the heterogeneity of noise in such multi-view data for data integration explicitly. To this end, we propose a joint matrix decomposition framework (BJMD), which models the heterogeneity of noise by Gaussian distribution in a Bayesian framework. We develop two algorithms to solve this model: one is a variational Bayesian inference algorithm, which makes full use of the posterior distribution; and another is a maximum a posterior algorithm, which is more scalable and can be easily paralleled. Extensive experiments on synthetic and real-world datasets demonstrate that BJMD considering the heterogeneity of noise is superior or competitive to the state-of-the-art methods.