Title: Sequential Embedding Induced Text Clustering, a Non-parametric Bayesian Approach

Nov 29, 2018

Tiehang Duan, Qi Lou, Sargur N. Srihari, Xiaohui Xie

Abstract: Current state-of-the-art nonparametric Bayesian text clustering methods model documents with a multinomial distribution over bags of words. Although these methods effectively capture word burstiness and achieve decent performance, they exploit neither the sequential information in text nor the relationships among synonyms. In this paper, documents are modeled jointly through bags of words, sequential features, and word embeddings. We propose the Sequential Embedding induced Dirichlet Process Mixture Model (SiDPMM) to exploit this joint document representation in text clustering. The sequential features are extracted by an encoder-decoder component, and word embeddings produced by the continuous-bag-of-words (CBOW) model are introduced to handle synonyms. Experimental results demonstrate two main benefits of our model: 1) improved performance, measured by normalized mutual information (NMI), across multiple diverse text datasets; 2) more accurate inference of the ground-truth number of clusters, with a regularization effect on tiny outlier clusters.
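For readers unfamiliar with the evaluation metric named in the abstract, the following is a minimal sketch of how NMI between a ground-truth labeling and a predicted clustering can be computed. It is not taken from the paper: the function name is hypothetical, and the geometric-mean normalization, NMI = I(T; P) / sqrt(H(T) H(P)), is an assumption, since the abstract does not state which NMI variant the authors use.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_true, labels_pred):
    """Illustrative NMI with geometric-mean normalization (an assumption,
    not necessarily the variant used in the paper)."""
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label lists must be non-empty and of equal length")

    # Cluster sizes in each partition and sizes of their intersections.
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        # Shannon entropy (natural log) of the empirical cluster distribution.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    h_true = entropy(count_true)
    h_pred = entropy(count_pred)

    # I(T; P) = sum over non-empty cells of p(t, p) * log(p(t, p) / (p(t) p(p))).
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )

    if h_true == 0.0 or h_pred == 0.0:
        # Degenerate case: at least one partition is a single cluster.
        # Convention chosen here: 1.0 if both are single clusters, else 0.0.
        return 1.0 if h_true == h_pred else 0.0

    return mutual_info / math.sqrt(h_true * h_pred)
```

Because NMI depends only on how documents are grouped, relabeling the clusters does not change the score, and the number of predicted clusters need not match the number of ground-truth classes. This makes it usable for a nonparametric model that infers the cluster count from data.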