Title: Compass-aligned Distributional Embeddings for Studying Semantic Differences across Corpora

Apr 13, 2020

Federico Bianchi, Valerio Di Carlo, Paolo Nicoli, Matteo Palmonari

Abstract: Word2vec is one of the most used algorithms to generate word embeddings because of a good mix of efficiency, quality of the generated representations, and cognitive grounding. However, word meaning is not static and depends on the context in which words are used. Differences in word meaning that depend on time, location, topic, and other factors can be studied by analyzing embeddings generated from different corpora in collections that are representative of these factors. For example, language evolution can be studied using a collection of news articles published in different time periods. In this paper, we present a general framework to support cross-corpora language studies with word embeddings, where embeddings generated from different corpora can be compared to find correspondences and differences in meaning across the corpora. CADE is the core component of our framework and solves the key problem of aligning the embeddings generated from different corpora. In particular, we focus on providing solid evidence about the effectiveness, generality, and robustness of CADE. To this end, we conduct quantitative and qualitative experiments in different domains, from temporal word embeddings to language localization and topical analysis. The results of our experiments suggest that CADE achieves state-of-the-art or superior performance on tasks where several competing approaches are available, while providing a general method that can be used in a variety of domains. Finally, our experiments shed light on the conditions under which the alignment is reliable, which substantially depends on the degree of cross-corpora vocabulary overlap.

* arXiv admin note: text overlap with arXiv:1906.02376
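
The abstract describes comparing embeddings from different corpora to find correspondences and differences in meaning. The snippet below is a minimal sketch of that comparison step only, assuming the two embedding spaces have already been aligned (it does not implement CADE's alignment). The function names `cosine` and `semantic_shift`, the dictionary layout, and the toy vectors are all hypothetical and chosen for illustration; they are not taken from the paper or its code.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_shift(word, emb_a, emb_b):
    """Return 1 - cosine similarity of a word's vectors in two aligned spaces.

    Returns None when the word is missing from either vocabulary, since a
    word outside the shared vocabulary cannot be compared directly.
    """
    if word not in emb_a or word not in emb_b:
        return None
    return 1.0 - cosine(emb_a[word], emb_b[word])

# Toy, hand-made vectors standing in for aligned embeddings (not real data).
corpus_a = {"apple": np.array([1.0, 0.0]), "bank": np.array([0.0, 1.0])}
corpus_b = {"apple": np.array([0.0, 1.0]), "bank": np.array([0.0, 2.0])}

# A larger value suggests the word is used more differently across the corpora.
shifts = {word: semantic_shift(word, corpus_a, corpus_b) for word in corpus_a}
print(shifts)
```

In this toy setup "apple" points in orthogonal directions in the two spaces while "bank" keeps the same direction, so only "apple" would be flagged as shifted. Scores like these are only meaningful if the alignment itself is reliable, which the abstract ties to the degree of vocabulary overlap between corpora.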
