Abstract: We propose scalable methods to execute counting queries in machine learning applications. To achieve memory and computational efficiency, we abstract counting queries and their context such that the counts can be aggregated as a stream. We demonstrate the performance and scalability of the resulting approach on random queries, and through extensive experiments with Bayesian network learning and association rule mining. Our methods significantly outperform the commonly used ADtrees and hash tables, and are a practical alternative for processing large-scale data.
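To make the streaming abstraction concrete, here is a minimal sketch in which a counting query is abstracted as a tuple of column indices and its counts are aggregated in a single pass over the data, so no index structure has to be materialized up front. The abstract does not specify an interface, so `stream_counts`, its signature, and the row representation are hypothetical, not the paper's API.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

def stream_counts(rows: Iterable[Sequence[int]],
                  query_vars: Tuple[int, ...]) -> Counter:
    """Aggregate counts for one counting query in a single pass.

    The query is just a tuple of column indices; counts for every joint
    assignment of those columns accumulate as the data streams by.
    """
    counts: Counter = Counter()
    for row in rows:
        key = tuple(row[v] for v in query_vars)  # joint assignment of queried columns
        counts[key] += 1
    return counts

# Example: count joint assignments of variables 0 and 2.
data = [(0, 1, 1), (0, 1, 0), (1, 0, 1), (0, 0, 1)]
print(stream_counts(data, (0, 2)))
# Counter({(0, 1): 2, (0, 0): 1, (1, 1): 1})
```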
Abstract: In machine learning, the parent set identification problem is to find a set of random variables that best explains a selected variable, given the data and a predefined scoring function. This problem is a critical component of Bayesian network structure learning and Markov blanket discovery, and thus has many practical applications, ranging from fraud detection to clinical decision support. In this paper, we introduce a new distributed-memory approach to the exact parent set identification problem. To achieve scalability, we derive theoretical bounds to constrain the search space when the MDL scoring function is used, and we reorganize the underlying dynamic programming such that computational density is increased and fine-grained synchronization is eliminated. We then design an efficient realization of our approach on the Apache Spark platform. Through experimental results, we demonstrate that the method maintains strong scalability on a 500-core standalone Spark cluster, and that it can efficiently process data sets with 70 variables, far beyond the reach of currently available solutions.
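For context, below is a hedged sketch of the MDL scoring function that the derived bounds apply to: the score of a candidate parent set is the negative log-likelihood of the data plus a complexity penalty of (log2 N)/2 per free parameter. The function `mdl_score` and its arguments are assumptions for illustration, not the paper's implementation; the paper's pruning bounds and Spark realization are not shown.

```python
import math
from collections import Counter
from typing import Sequence, Tuple

def mdl_score(data: Sequence[Sequence[int]], child: int,
              parents: Tuple[int, ...], arity: Sequence[int]) -> float:
    """MDL score of `parents` as the parent set of `child` (lower is better).

    Score = negative log-likelihood + (log2 N / 2) * number of free
    parameters, the standard MDL decomposition.
    """
    n = len(data)
    # Counts of joint (parent assignment, child value) and of parent assignments alone.
    joint = Counter(tuple(row[p] for p in parents) + (row[child],) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    # Log-likelihood: sum of count * log2 P(child | parents).
    ll = sum(c * math.log2(c / marg[key[:-1]]) for key, c in joint.items())
    free_params = (arity[child] - 1) * math.prod(arity[p] for p in parents)
    return -ll + 0.5 * math.log2(n) * free_params
```

An exact search would evaluate this score over candidate parent sets; bounds of the kind the abstract mentions prune candidates whose penalty term alone already exceeds the best score found so far.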
Abstract: Bayesian networks are probabilistic graphical models often used in big data analytics. The problem of exact structure learning is to find a network structure that is optimal under a given scoring criterion. The problem is known to be NP-hard, and the existing methods are both computationally and memory intensive. In this paper, we introduce a new approach to exact structure learning. Our strategy is to leverage the relationship between a partial network structure and the remaining variables to constrain the number of ways in which the partial network can be optimally extended. Through experimental results, we show that the method provides up to a three-fold improvement in runtime, and orders of magnitude reduction in memory consumption, over the current best algorithms.
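For readers unfamiliar with the baseline being improved, the standard dynamic program for exact structure learning (in the style of Silander and Myllymäki) extends each subset of variables by a "sink" variable whose parents are drawn from the rest of the subset; the contribution described in the abstract constrains which extensions must be enumerated. The sketch below shows only the unconstrained baseline, under assumed conventions; `best_parents` is a hypothetical callback returning the best local score of a variable given a candidate parent pool.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet

def exact_structure_dp(n: int,
                       best_parents: Callable[[int, FrozenSet[int]], float]
                       ) -> Dict[FrozenSet[int], float]:
    """Subset dynamic program for exact structure learning.

    dp[S] is the score of the best network over variable subset S.
    Each subset is scored by choosing a sink v whose parents come from
    S \\ {v}; this sketch enumerates every possible sink.
    """
    dp: Dict[FrozenSet[int], float] = {frozenset(): 0.0}
    for size in range(1, n + 1):
        for subset in map(frozenset, combinations(range(n), size)):
            dp[subset] = min(
                dp[subset - {v}] + best_parents(v, subset - {v})
                for v in subset
            )
    return dp
```

The memory pressure the abstract refers to comes from holding dp entries (and best parent sets) for all 2^n subsets, which is why constraining the admissible extensions pays off in both runtime and memory.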