Abstract: Topic modelling is fundamentally a soft clustering problem: known objects (documents) are assigned to unknown clusters (topics). That is, the task is ill-posed. In particular, topic models are unstable and incomplete. As a result, the process of finding a good topic model (repeated hyperparameter selection, model training, and topic quality assessment) can be particularly long and labor-intensive. We aim to simplify this process and make it more deterministic and provable. To this end, we present a method for the iterative training of a topic model. The essence of the method is to train a series of related topic models so that each subsequent model is at least as good as the previous one, i.e., it retains all the good topics found earlier. The connection between the models is enforced by additive regularization. The result of this iterative training is the last topic model in the series, which we call the iteratively updated additively regularized topic model (ITAR). Experiments conducted on several collections of natural language texts show that the proposed ITAR model performs better than other popular topic models (LDA, ARTM, BERTopic): its topics are diverse, and its perplexity (its ability to "explain" the underlying data) is moderate.
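A minimal Python sketch of the iterative scheme described in this abstract is given below. The EM routine implements a simplified PLSA update with an ARTM-style additive smoothing regularizer that anchors the previous model's topics; the quality proxy (topic diversity of top words), the anchoring of all previous topics, and the acceptance rule are illustrative assumptions rather than the authors' actual ITAR procedure.

```python
import numpy as np

def em_artm(counts, num_topics, anchors=None, tau=0.0, n_iter=50, seed=0):
    """Toy PLSA-style EM with an ARTM-like additive regularizer.

    counts  : (D, W) document-word count matrix.
    anchors : optional (W, T) matrix of previously found topics (columns p(w|t));
              the M-step is smoothed toward them with strength tau.
    """
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    phi = rng.random((W, num_topics)); phi /= phi.sum(0, keepdims=True)        # p(w|t)
    theta = rng.random((D, num_topics)); theta /= theta.sum(1, keepdims=True)  # p(t|d)
    for _ in range(n_iter):
        p_dw = theta @ phi.T + 1e-12          # model p(w|d), shape (D, W)
        weight = counts / p_dw                # n_dw / p(w|d)
        n_wt = phi * (weight.T @ theta)       # expected topic-word counts (W, T)
        n_td = theta * (weight @ phi)         # expected doc-topic counts (D, T)
        if anchors is not None:               # additive regularization of the M-step
            n_wt = np.maximum(n_wt + tau * anchors, 0.0)
        phi = n_wt / (n_wt.sum(0, keepdims=True) + 1e-12)
        theta = n_td / (n_td.sum(1, keepdims=True) + 1e-12)
    return phi, theta

def topic_diversity(phi, topn=10):
    """Crude quality proxy: fraction of unique words among each topic's top-n words."""
    top = np.argsort(phi, axis=0)[-topn:]
    return len(np.unique(top)) / top.size

def iterative_training(counts, num_topics, rounds=5, tau=100.0):
    """Train a series of related models; a candidate is accepted only if the
    proxy quality does not drop, so the series never gets worse."""
    phi, _ = em_artm(counts, num_topics)
    best = topic_diversity(phi)
    for r in range(1, rounds):
        cand, _ = em_artm(counts, num_topics, anchors=phi, tau=tau, seed=r)
        if topic_diversity(cand) >= best:
            phi, best = cand, topic_diversity(cand)
    return phi   # the last accepted model plays the role of the "ITAR" model
```

The acceptance test is what makes the series monotone: a retrained model that loses previously found topics scores lower on the proxy and is simply discarded.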
Abstract: The number of topics might be the most important parameter of a topic model. The topic modelling community has developed a set of various procedures to estimate the number of topics in a dataset, but there has not yet been a sufficiently complete comparison of existing practices. This study attempts to partially fill this gap by investigating the performance of various methods applied to several topic models on a number of publicly available corpora. Further analysis demonstrates that intrinsic methods are far from being reliable and accurate tools. The number of topics is shown to be a method- and a model-dependent quantity, as opposed to being an absolute property of a particular corpus. We conclude that other methods for dealing with this problem should be developed and suggest some promising directions for further research.
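Below is a sketch of one widely used intrinsic procedure for estimating the number of topics: a coherence sweep over candidate values of K, illustrated with gensim. The candidate grid, the c_v coherence measure, and the training settings are illustrative assumptions; as the abstract notes, the location of the resulting maximum can vary with both the scoring method and the underlying model.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

def estimate_num_topics(texts, candidate_ks=(5, 10, 20, 40, 80)):
    """Train an LDA model per candidate K and return the K with the best
    c_v coherence, together with the full score curve."""
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    scores = {}
    for k in candidate_ks:
        lda = LdaModel(corpus=corpus, id2word=dictionary,
                       num_topics=k, passes=5, random_state=0)
        scores[k] = CoherenceModel(model=lda, texts=texts,
                                   dictionary=dictionary,
                                   coherence='c_v').get_coherence()
    return max(scores, key=scores.get), scores

# Usage: texts is a list of tokenized documents, e.g. [["topic", "model"], ...]
# best_k, curve = estimate_num_topics(texts)
```

Swapping the coherence measure, the model family, or the random seed can move the peak of the curve, which is the instability the study documents.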
Abstract: We consider probabilistic topic models and more recent word embedding techniques from the perspective of learning hidden semantic representations. Inspired by a striking similarity between the two approaches, we merge them and learn probabilistic embeddings with an online EM-algorithm on word co-occurrence data. The resulting embeddings perform on par with Skip-Gram Negative Sampling (SGNS) on word similarity tasks while offering more interpretable components. Next, we learn probabilistic document embeddings that outperform paragraph2vec on a document similarity task and require less memory and time to train. Finally, we employ multimodal Additive Regularization of Topic Models (ARTM) to obtain high sparsity and to learn embeddings for other modalities, such as timestamps and categories. We observe a further improvement in word similarity performance and meaningful inter-modality similarities.
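The snippet below sketches the probabilistic-embedding view this abstract builds on: a word's embedding is its topic distribution p(t|w), derived here from a topic model's Phi matrix via Bayes' rule. The uniform topic prior and the cosine similarity are illustrative assumptions, not necessarily the choices made in the paper.

```python
import numpy as np

def probabilistic_embeddings(phi, topic_prior=None):
    """phi: (V, T) matrix with columns p(w|t). Returns rows p(t|w) as embeddings."""
    V, T = phi.shape
    prior = np.full(T, 1.0 / T) if topic_prior is None else topic_prior
    joint = phi * prior                                  # proportional to p(w, t)
    return joint / (joint.sum(axis=1, keepdims=True) + 1e-12)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def most_similar(word, vocab, emb, topn=5):
    """Nearest neighbours of `word` under the probabilistic embeddings."""
    i = vocab.index(word)
    sims = [(vocab[j], cosine(emb[i], emb[j]))
            for j in range(len(vocab)) if j != i]
    return sorted(sims, key=lambda s: -s[1])[:topn]
```

Because each dimension of such an embedding corresponds to a topic, nearest-neighbour lists remain interpretable, which is the benefit in component interpretability the abstract highlights.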