Jose-Luis Redondo García

Efficient Clustering from Distributions over Topics

Dec 15, 2020

Carlos Badenes-Olmedo, Jose-Luis Redondo García, Oscar Corcho

Abstract: There are many scenarios where we may want to find pairs of textually similar documents in a large corpus (e.g. a researcher doing a literature review, or an R&D project manager analyzing project proposals). Discovering those connections programmatically can help experts achieve those goals, but brute-force pairwise comparisons are not computationally feasible when the document corpus is too large. Some algorithms in the literature divide the search space into regions containing potentially similar documents, which are later processed separately from the rest in order to reduce the number of pairs compared. However, unsupervised methods of this kind still incur high time costs. In this paper, we present an approach that relies on the results of a topic modeling algorithm over the documents in a collection, as a means to identify smaller subsets of documents where the similarity function can then be computed. This approach has obtained promising results when identifying similar documents in the domain of scientific publications. We have compared our approach against state-of-the-art clustering techniques and with different configurations of the topic modeling algorithm. Results suggest that our approach outperforms (> 0.5) the other analyzed techniques in terms of efficiency.

* ACM Proceedings of the Knowledge Capture Conference, article 17, K-CAP 2017
* Accepted at the 9th International Conference on Knowledge Capture (K-CAP 2017)
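
The idea in the abstract above, using topic distributions to restrict where the similarity function is computed, can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the bucketing rule (grouping documents by their `k` most probable topics), the use of Jensen-Shannon divergence as the similarity function, and all names (`top_topics_key`, `candidate_pairs`, `similar_pairs`, `threshold`) are assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def top_topics_key(theta, k=2):
    """Cluster key (assumed rule): the k most probable topic indices, sorted."""
    ranked = sorted(range(len(theta)), key=lambda t: theta[t], reverse=True)
    return tuple(sorted(ranked[:k]))


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def candidate_pairs(thetas, k=2):
    """Pair up only the documents that fall into the same topic bucket."""
    buckets = defaultdict(list)
    for doc_id, theta in thetas.items():
        buckets[top_topics_key(theta, k)].append(doc_id)
    pairs = []
    for ids in buckets.values():
        pairs.extend(combinations(sorted(ids), 2))
    return pairs


def similar_pairs(thetas, k=2, threshold=0.8):
    """Score candidate pairs and keep those at or above the threshold."""
    result = []
    for a, b in candidate_pairs(thetas, k):
        sim = 1.0 - js_divergence(thetas[a], thetas[b])
        if sim >= threshold:
            result.append((a, b, sim))
    return result


# Hypothetical topic distributions for four documents over three topics.
thetas = {
    "d1": [0.70, 0.20, 0.10],
    "d2": [0.60, 0.30, 0.10],
    "d3": [0.10, 0.20, 0.70],
    "d4": [0.05, 0.25, 0.70],
}
pairs = similar_pairs(thetas)
```

With these toy inputs only the pairs inside each bucket are scored, instead of all six pairs a brute-force comparison would examine. How much work is saved in practice depends on how evenly documents spread across buckets.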

Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies

Dec 15, 2020

Carlos Badenes-Olmedo, Jose-Luis Redondo García, Oscar Corcho

Abstract: With the ongoing growth in the number of digital articles in a wider set of languages and the expanding use of different languages, we need annotation methods that enable browsing multi-lingual corpora. Multilingual probabilistic topic models have recently emerged as a group of semi-supervised machine learning models that can be used to perform thematic explorations on collections of texts in multiple languages. However, these approaches require theme-aligned training data to create a language-independent space. This constraint limits the number of scenarios in which this technique can be applied and makes it difficult to scale to situations where a huge collection of multi-lingual documents is required during the training phase. This paper presents an unsupervised document similarity algorithm that does not require parallel or comparable corpora, or any other type of translation resource. The algorithm annotates topics automatically created from documents in a single language with cross-lingual labels and describes documents by hierarchies of multi-lingual concepts from independently-trained models. Experiments performed on the English, Spanish and French editions of the JRC-Acquis corpus reveal promising results on classifying and sorting documents by similar content.

* ACM Proceedings of the 10th International Conference on Knowledge Capture, pages 147-153, K-CAP 19 (2020)
* Accepted at the 10th International Conference on Knowledge Capture (K-CAP 2019)
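
The abstract above describes documents as hierarchies of cross-lingual concepts derived from independently trained topic models. The Python sketch below is one possible reading of that idea, not the authors' implementation: the probability bounds that define the hierarchy levels, the per-level Jaccard comparison, the concept identifiers, and the names `concept_hierarchy` and `hierarchy_similarity` are all assumptions made for illustration.

```python
def concept_hierarchy(theta, topic_concepts, bounds=(0.5, 0.1)):
    """Group the concepts of a document's topics into relevance levels.

    theta: topic distribution of one document (list of probabilities).
    topic_concepts: topic index -> set of cross-lingual concept labels.
    bounds: assumed lower probability bounds, most relevant level first.
    Topics below the last bound are ignored.
    """
    levels = [set() for _ in bounds]
    for topic, prob in enumerate(theta):
        for level, lower in enumerate(bounds):
            if prob >= lower:
                levels[level] |= topic_concepts[topic]
                break
    return levels


def hierarchy_similarity(h1, h2):
    """Mean Jaccard overlap across levels where at least one side has concepts."""
    scores = [len(a & b) / len(a | b) for a, b in zip(h1, h2) if a | b]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical concept labels for two models trained separately
# (one on English text, one on Spanish); topic indices are not aligned.
en_concepts = {0: {"fishery", "sea"}, 1: {"tax", "budget"}, 2: {"transport"}}
es_concepts = {0: {"tax", "budget"}, 1: {"sea", "fishery"}, 2: {"health"}}

en_doc = concept_hierarchy([0.70, 0.20, 0.10], en_concepts)
es_doc = concept_hierarchy([0.25, 0.70, 0.05], es_concepts)
es_other = concept_hierarchy([0.10, 0.05, 0.85], es_concepts)

close = hierarchy_similarity(en_doc, es_doc)
far = hierarchy_similarity(en_doc, es_other)
```

Because the comparison runs over shared concept labels rather than topic indices, documents from the two models can be compared without aligning the models themselves. In this toy data `close` is larger than `far`, since the first Spanish document shares its dominant concepts with the English one.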
