Xiao-Qin Cao

Fast Online EM for Big Topic Modeling

Dec 07, 2015

Jia Zeng, Zhi-Qiang Liu, Xiao-Qin Cao

Abstract: The expectation-maximization (EM) algorithm can compute the maximum-likelihood (ML) or maximum a posteriori (MAP) point estimate of mixture models or latent variable models such as latent Dirichlet allocation (LDA), which has been one of the most popular probabilistic topic modeling methods in the past decade. However, batch EM has high time and space complexities when learning big LDA models from big data streams. In this paper, we present a fast online EM (FOEM) algorithm that infers the topic distribution from previously unseen documents incrementally with constant memory requirements. Within the stochastic approximation framework, we show that FOEM can converge to a local stationary point of the LDA likelihood function. Using dynamic scheduling for speed and parameter streaming for low memory usage, FOEM handles both big data and big models (aka big topic modeling) on a single PC, and on some lifelong topic modeling tasks it is more efficient than state-of-the-art online LDA algorithms.

* 14 pages, 12 figures in IEEE Transactions on Knowledge and Data Engineering, 2016
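
The abstract above places FOEM within the stochastic approximation framework. As a rough illustration of that general framework only (not the authors' FOEM implementation), the sketch below blends running sufficient statistics with the statistics of a new minibatch using a decaying step size. The names `step_size`, `blend`, `tau0`, and `kappa` are hypothetical, and the schedule `(tau0 + t) ** (-kappa)` is a commonly used choice assumed here rather than taken from the paper.

```python
from typing import List


def step_size(t: int, tau0: float = 1.0, kappa: float = 0.7) -> float:
    """Decaying learning rate rho_t = (tau0 + t) ** (-kappa).

    Assumption: kappa in (0.5, 1] so that the rates sum to infinity
    while their squares sum to a finite value (Robbins-Monro conditions).
    """
    if not 0.5 < kappa <= 1.0:
        raise ValueError("kappa must lie in (0.5, 1]")
    return (tau0 + t) ** (-kappa)


def blend(running: List[List[float]],
          minibatch: List[List[float]],
          rho: float) -> List[List[float]]:
    """Return (1 - rho) * running + rho * minibatch, element by element.

    Both arguments are topic-by-word tables of sufficient statistics
    with identical shapes (hypothetical layout for this sketch).
    """
    return [
        [(1.0 - rho) * r + rho * m for r, m in zip(r_row, m_row)]
        for r_row, m_row in zip(running, minibatch)
    ]


if __name__ == "__main__":
    stats = [[1.0, 3.0], [2.0, 2.0]]          # 2 topics x 2 words
    new_batch = [[3.0, 1.0], [0.0, 4.0]]      # statistics from one minibatch
    for t in range(1, 4):
        stats = blend(stats, new_batch, step_size(t))
    print(stats)
```

Because only the running table and the current minibatch are held at any time, memory in this sketch does not grow with the number of documents seen, which is the property the abstract refers to as constant memory requirements.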

A New Approach to Speeding Up Topic Modeling

Apr 08, 2014

Jia Zeng, Zhi-Qiang Liu, Xiao-Qin Cao

Abstract: Latent Dirichlet allocation (LDA) is a widely used probabilistic topic modeling paradigm that has recently found many applications in computer vision and computational biology. In this paper, we propose a fast and accurate batch algorithm, active belief propagation (ABP), for training LDA. Batch LDA algorithms usually require repeated scanning of the entire corpus and searching of the complete topic space. When processing massive corpora with a large number of topics, each training iteration of batch LDA algorithms is therefore often inefficient and time-consuming. To accelerate training, ABP actively scans a subset of the corpus and searches a subset of the topic space, thereby saving substantial training time in each iteration. To ensure accuracy, ABP selects only those documents and topics that contribute the largest residuals within the residual belief propagation (RBP) framework. On four real-world corpora, ABP runs around 10 to 100 times faster than state-of-the-art batch LDA algorithms with comparable topic modeling accuracy.

* 14 pages, 12 figures
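
The residual-based selection described in the ABP abstract can be illustrated with a small sketch. It assumes, for illustration only, that a residual is the summed absolute change of a message between two iterations and that a fixed fraction of documents is kept per iteration. The function names `message_residual` and `select_active` and the `fraction` parameter are hypothetical and are not taken from the paper.

```python
import heapq
from typing import List, Sequence


def message_residual(old: Sequence[float], new: Sequence[float]) -> float:
    """Summed absolute change between two versions of a message."""
    return sum(abs(n - o) for o, n in zip(old, new))


def select_active(residuals: Sequence[float], fraction: float) -> List[int]:
    """Return indices of the items with the largest residuals.

    At least one index is returned for a non-empty input; indices are
    ordered from largest to smallest residual.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must lie in (0, 1]")
    k = max(1, int(len(residuals) * fraction))
    return heapq.nlargest(k, range(len(residuals)), key=residuals.__getitem__)


if __name__ == "__main__":
    # Hypothetical per-document residuals after one sweep.
    doc_residuals = [0.1, 0.9, 0.5, 0.05]
    print(select_active(doc_residuals, fraction=0.5))
```

The same helper could be applied to per-topic residuals to pick a subset of the topic space. Items with small residuals have changed little since the last sweep, so skipping them in the next iteration is the source of the time saving in this sketch.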

Memory-Efficient Topic Modeling

Jun 08, 2012

Jia Zeng, Zhi-Qiang Liu, Xiao-Qin Cao

Abstract: As one of the simplest probabilistic topic modeling techniques, latent Dirichlet allocation (LDA) has found many important applications in text mining, computer vision and computational biology. Recent training algorithms for LDA can be interpreted within a unified message passing framework. However, message passing requires storing previous messages, which takes a large amount of memory that increases linearly with the number of documents or the number of topics. High memory usage is therefore often a major problem for topic modeling of massive corpora containing a large number of topics. To reduce the space complexity, we propose a novel algorithm for training LDA that does not store previous messages: tiny belief propagation (TBP). The basic idea of TBP is to relate message passing algorithms to non-negative matrix factorization (NMF) algorithms, which absorb the message updating into the message passing process and thus avoid storing previous messages. Experimental results on four large data sets confirm that TBP performs comparably to or even better than current state-of-the-art training algorithms for LDA, with much lower memory consumption. TBP can do topic modeling when massive corpora cannot fit in computer memory, for example, extracting thematic topics from the 7 GB PUBMED corpus on a common desktop computer with 2 GB of memory.

* 20 pages, 7 figures
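
The linear growth in message storage mentioned in the abstract can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes one stored value per topic for every nonzero document-word entry, at 8 bytes per value. That storage layout, the function name `message_memory_bytes`, and the example sizes are illustrative assumptions, not figures from the paper.

```python
def message_memory_bytes(nonzero_entries: int,
                         num_topics: int,
                         bytes_per_value: int = 8) -> int:
    """Estimate the bytes needed to keep one K-dimensional message
    per nonzero document-word entry (assumed layout)."""
    return nonzero_entries * num_topics * bytes_per_value


if __name__ == "__main__":
    # Hypothetical corpus: 1,000,000 nonzero entries, 100 topics.
    small = message_memory_bytes(1_000_000, 100)
    # Doubling the number of topics doubles the estimate.
    large = message_memory_bytes(1_000_000, 200)
    print(small, large)
```

Under these assumptions the estimate scales in proportion to both the corpus size and the number of topics, which is why avoiding stored messages matters most for large corpora with many topics.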

Residual Belief Propagation for Topic Modeling

Apr 30, 2012

Jia Zeng, Xiao-Qin Cao, Zhi-Qiang Liu

Abstract: Fast convergence is a desirable property for training latent Dirichlet allocation (LDA), especially in online and parallel topic modeling for massive data sets. This paper presents a novel residual belief propagation (RBP) algorithm that accelerates convergence when training LDA. The proposed RBP uses an informed scheduling scheme for asynchronous message passing, which passes fast-convergent messages with a higher priority so that they influence slow-convergent messages at each learning iteration. Extensive empirical studies confirm that RBP significantly reduces the training time until convergence while achieving a much lower predictive perplexity than other state-of-the-art training algorithms for LDA, including variational Bayes (VB), collapsed Gibbs sampling (GS), loopy belief propagation (BP), and residual VB (RVB).

* Advanced Data Mining and Applications Lecture Notes in Computer Science Volume 7713, 739-752, 2012
* 6 pages, 8 figures
