Abstract:An important problem in the field of bioinformatics is to identify interactive effects among profiled variables for outcome prediction. In this paper, a logistic regression model with pairwise interactions among a set of binary covariates is considered. Modeling the structure of the interactions by a graph, our goal is to recover the interaction graph from independently identically distributed (i.i.d.) samples of the covariates and the outcome. When viewed as a feature selection problem, a simple quantity called influence is proposed as a measure of the marginal effects of the interaction terms on the outcome. For the case when the underlying interaction graph is known to be acyclic, it is shown that a simple algorithm that is based on a maximum-weight spanning tree with respect to the plug-in estimates of the influences not only has strong theoretical performance guarantees, but can also outperform generic feature selection algorithms for recovering the interaction graph from i.i.d. samples of the covariates and the outcome. Our results can also be extended to the model that includes both individual effects and pairwise interactions via the help of an auxiliary covariate.
Abstract:Submodular maximization problems belong to the family of combinatorial optimization problems and enjoy wide applications. In this paper, we focus on the problem of maximizing a monotone submodular function subject to a $d$-knapsack constraint, for which we propose a streaming algorithm that achieves a $\left(\frac{1}{1+2d}-\epsilon\right)$-approximation of the optimal value, while it only needs one single pass through the dataset without storing all the data in the memory. In our experiments, we extensively evaluate the effectiveness of our proposed algorithm via two applications: news recommendation and scientific literature recommendation. It is observed that the proposed streaming algorithm achieves both execution speedup and memory saving by several orders of magnitude, compared with existing approaches.