Claudia Henry

Soft Uncoupling of Markov Chains for Permeable Language Distinction: A New Algorithm

Oct 07, 2008

Richard Nock, Pascal Vaillant, Frank Nielsen, Claudia Henry

Abstract: Without prior knowledge, distinguishing different languages can be a hard task, especially when their borders are permeable. We develop an extension of spectral clustering, a powerful unsupervised classification toolbox, that is shown to accurately resolve the task of soft language distinction. At the heart of our approach, we replace the usual hard membership assignment of spectral clustering with a soft, probabilistic assignment, which also has the advantage of bypassing a well-known complexity bottleneck of the method. Furthermore, our approach relies on a novel, convenient construction of a Markov chain out of a corpus. Extensive experiments with a readily available system clearly display the potential of the method, which yields a visually appealing soft distinction of the languages that together make up a whole corpus.
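To make the contrast between hard and soft membership concrete, here is a minimal Python sketch on a toy five-state Markov chain. It is not the authors' algorithm: the transition matrix `P`, the helper `second_eigenvector`, and the min-max rescaling used to turn eigenvector coordinates into a membership score are all assumptions made for illustration. States 0-1 stand for words of one language, states 3-4 for words of another, and state 2 for a word shared by both.

```python
# Toy symmetric, row-stochastic transition matrix (hypothetical data).
P = [
    [0.5, 0.3, 0.2, 0.0, 0.0],
    [0.3, 0.5, 0.2, 0.0, 0.0],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.0, 0.0, 0.2, 0.5, 0.3],
    [0.0, 0.0, 0.2, 0.3, 0.5],
]


def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(p * x for p, x in zip(row, vector)) for row in matrix]


def second_eigenvector(matrix, start, iterations=100):
    """Power iteration restricted to vectors orthogonal to the constant vector.

    Valid here because the toy matrix is symmetric, so its leading
    eigenvector is constant and the others are orthogonal to it.
    """
    v = list(start)
    for _ in range(iterations):
        mean = sum(v) / len(v)
        v = [x - mean for x in v]          # remove the constant component
        v = matvec(matrix, v)
        scale = max(abs(x) for x in v)
        v = [x / scale for x in v]         # keep the largest entry at magnitude 1
    return v


v2 = second_eigenvector(P, [1.0, 0.5, 0.1, -0.4, -1.0])

# Hard assignment: only the sign is kept, so the shared word is forced
# into one side (which side depends on rounding noise).
hard = [0 if x >= 0 else 1 for x in v2]

# Soft assignment (assumed rescaling): map coordinates to [0, 1].
lo, hi = min(v2), max(v2)
soft = [(x - lo) / (hi - lo) for x in v2]

print("hard:", hard)
print("soft:", [round(s, 3) for s in soft])
```

In this toy chain the shared state lands midway between the two groups under the soft score, whereas the hard assignment has to pick a side.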

* ECAI 2006: 17th European Conference on Artificial Intelligence. Riva del Garda, Italy, 29 August - 1st September 2006
* 6 pages, 7 embedded figures, LaTeX 2e using the ecai2006.cls document class and the algorithm2e.sty style file (+ standard packages like epsfig, amsmath, amssymb, amsfonts...). Extends the short version contained in the ECAI 2006 proceedings

Analyse spectrale des textes: détection automatique des frontières de langue et de discours

Oct 07, 2008

Pascal Vaillant, Richard Nock, Claudia Henry

Abstract: We propose a theoretical framework within which information on the vocabulary of a given corpus can be inferred on the basis of statistical information gathered on that corpus. Inferences can be made on the categories of the words in the vocabulary, and on their syntactic properties within particular languages. Based on the same statistical data, it is possible to build matrices of syntagmatic similarity (bigram transition matrices) or paradigmatic similarity (the probability for any pair of words to share common contexts). When clustered with respect to their syntagmatic similarity, words tend to group into sublanguage vocabularies, and when clustered with respect to their paradigmatic similarity, into syntactic or semantic classes. Experiments have explored the first of these two possibilities. Their results are interpreted within the framework of a Markov chain model of the corpus' generative process(es): we show that the results of a spectral analysis of the transition matrix can be interpreted as probability distributions of words within clusters. This method yields a soft clustering of the vocabulary into sublanguages which contribute to the generation of heterogeneous corpora. As an application, we show how multilingual texts can be visually segmented into linguistically homogeneous segments. Our method is particularly useful in the case of related languages which happen to be mixed in corpora.
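The bigram transition matrix mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bigram_transition_matrix`, the toy mixed English/French corpus, and the choice to give a uniform row to a word with no observed successor are all hypothetical and are not taken from the paper.

```python
def bigram_transition_matrix(tokens):
    """Return (vocab, matrix), where matrix[i][j] estimates
    P(next word = vocab[j] | current word = vocab[i]) from bigram counts."""
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)

    # Count how often each word is immediately followed by each other word.
    counts = [[0] * size for _ in range(size)]
    for current, following in zip(tokens, tokens[1:]):
        counts[index[current]][index[following]] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Assumption: a word with no observed successor gets a uniform row
            # so that every row is still a probability distribution.
            matrix.append([1.0 / size] * size)
        else:
            matrix.append([count / total for count in row])
    return vocab, matrix


# Hypothetical toy corpus mixing two languages.
corpus = "the cat sleeps le chat dort the cat eats le chat mange".split()
vocab, P = bigram_transition_matrix(corpus)
index = {word: i for i, word in enumerate(vocab)}

print(vocab)
print("P(cat | the)    =", P[index["the"]][index["cat"]])
print("P(sleeps | cat) =", P[index["cat"]][index["sleeps"]])
```

Each row of `P` sums to 1, so the matrix can be read as the transition matrix of a Markov chain over the vocabulary, which is the object the abstract proposes to analyse spectrally.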

* Verbum ex machina: Actes de la 13ème conférence annuelle sur le Traitement Automatique des Langues Naturelles (TALN 2006), pp. 619-629. Louvain (Leuven), Belgium, 10-13 April 2006
* In French. 10 pages, 5 figures, LaTeX 2e using EPSF and custom package taln2006.sty (designed by Pierre Zweigenbaum, ATALA). Proceedings of the 13th annual French-speaking conference on Natural Language Processing: `Traitement Automatique des Langues Naturelles' (TALN 2006), Louvain (Leuven), Belgium, 10-13 April 2006
