Kumiko Tanaka-Ishii

Information-Theoretic Generative Clustering of Documents

Dec 18, 2024

Xin Du, Kumiko Tanaka-Ishii

Abstract: We present generative clustering (GC) for clustering a set of documents, $\mathrm{X}$, by using texts $\mathrm{Y}$ generated by large language models (LLMs) instead of by clustering the original documents $\mathrm{X}$. Because LLMs provide probability distributions, the similarity between two documents can be rigorously defined in an information-theoretic manner by the KL divergence. We also propose a natural, novel clustering algorithm by using importance sampling. We show that GC achieves state-of-the-art performance, outperforming previous clustering methods, often by a large margin. Furthermore, we show an application to generative document retrieval in which documents are indexed via hierarchical clustering and our method improves the retrieval accuracy.

* Accepted to AAAI 2025
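
As a rough illustration of the KL divergence mentioned in the abstract, the sketch below computes it for two small discrete distributions. It is not the paper's clustering algorithm: the function name `kl_divergence` and the toy distributions are assumptions made here for illustration, and the importance-sampling step is not shown.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as dicts
    mapping outcome -> probability (hypothetical helper)."""
    total = 0.0
    for outcome, p_val in p.items():
        if p_val == 0.0:
            continue  # terms with p = 0 contribute nothing
        q_val = q.get(outcome, 0.0)
        if q_val == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_val * math.log(p_val / q_val)
    return total

# Toy distributions over two outcomes (illustrative values only).
doc_a = {"a": 0.5, "b": 0.5}
doc_b = {"a": 0.9, "b": 0.1}
print(kl_divergence(doc_a, doc_b))
print(kl_divergence(doc_b, doc_a))
```

The two printed values differ because the KL divergence is not symmetric.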

Correlation Dimension of Natural Language in a Statistical Manifold

May 10, 2024

Xin Du, Kumiko Tanaka-Ishii

Abstract: The correlation dimension of natural language is measured by applying the Grassberger-Procaccia algorithm to high-dimensional sequences produced by a large-scale language model. This method, previously studied only in a Euclidean space, is reformulated in a statistical manifold via the Fisher-Rao distance. Language exhibits a multifractal, with global self-similarity and a universal dimension around 6.5, which is smaller than those of simple discrete random sequences and larger than that of a Barabási-Albert process. Long memory is the key to producing self-similarity. Our method is applicable to any probabilistic model of real-world discrete sequences, and we show an application to music data.

* Du, X., & Tanaka-Ishii, K. (2024). Correlation dimension of natural language in a statistical manifold. Physical Review Research, 6(2), L022028
* Published at Physical Review Research
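
The two ingredients named in the abstract can be sketched as follows, assuming each point is a categorical probability distribution given as a list. The Fisher-Rao distance between categorical distributions is 2·arccos(Σ √(pᵢqᵢ)), and the Grassberger-Procaccia correlation integral C(r) is the fraction of point pairs closer than r. The function names and the three toy points are hypothetical, and the paper's actual procedure (for example, how distributions are obtained from the language model) is not reproduced here.

```python
import math
from itertools import combinations

def fisher_rao(p, q):
    """Fisher-Rao distance between two categorical distributions."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp to [0, 1] to guard acos against floating-point overshoot.
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))

def correlation_integral(points, r):
    """Fraction of distinct point pairs whose distance is below r."""
    pairs = list(combinations(points, 2))
    close = sum(1 for p, q in pairs if fisher_rao(p, q) < r)
    return close / len(pairs)

# Three toy distributions over two outcomes (illustrative only).
points = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
for r in (1.0, 2.0, 4.0):
    print(r, correlation_integral(points, r))
```

The correlation dimension is then estimated as the slope of log C(r) against log r over the range of small r where that relation is approximately linear.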

Statistical Mechanics of Strahler Number via Random and Natural Language Sentences

Jul 06, 2023

Kumiko Tanaka-Ishii, Akira Tanaka

Abstract: The Strahler number was originally proposed to characterize the complexity of river bifurcation and has found various applications. This article proposes computation of the Strahler number's upper and lower limits for natural language sentence tree structures, which are available in a large dataset allowing for statistical mechanics analysis. Through empirical measurements across grammatically annotated data, the Strahler number of natural language sentences is shown to be almost always 3 or 4, similar to the case of river bifurcation as reported by Strahler (1957) and Horton (1945). From the theory behind the number, we show that it is the lower limit of the amount of memory required to process sentences under a particular model. A mathematical analysis of random trees provides a further conjecture on the nature of the Strahler number, revealing that it is not a constant but grows logarithmically. This finding uncovers the statistical basics behind the Strahler number as a characteristic of a general tree structure target.
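
For readers unfamiliar with the Strahler number, a minimal sketch of the standard definition follows: a leaf has value 1, and an internal node takes the largest child value, plus one if that largest value occurs in at least two children. The nested-tuple tree encoding and the function name `strahler` are assumptions made here; the paper's computation of upper and lower limits for sentence trees is not reproduced.

```python
def strahler(tree):
    """Strahler number of a tree encoded as nested tuples.

    Any non-tuple value (or an empty tuple) is treated as a leaf.
    """
    if not isinstance(tree, tuple) or len(tree) == 0:
        return 1
    child_values = sorted((strahler(child) for child in tree), reverse=True)
    top = child_values[0]
    # The order increases only when the maximum is attained twice.
    if len(child_values) > 1 and child_values[1] == top:
        return top + 1
    return top

chain = ("a", ("b", ("c", "d")))     # right-branching tree
balanced = (("a", "b"), ("c", "d"))  # balanced binary tree
print(strahler(chain), strahler(balanced))
```

The right-branching tree stays at a low value however deep it grows, while the balanced tree gains one level each time its depth doubles the number of leaves, which is consistent with the logarithmic growth mentioned in the abstract.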

A Comparison of Two Fluctuation Analyses for Natural Language Clustering Phenomena: Taylor and Ebeling & Neiman Methods

Sep 14, 2020

Kumiko Tanaka-Ishii, Shuntaro Takahashi

Abstract: This article considers the fluctuation analysis methods of Taylor and Ebeling & Neiman. While both have been applied to various phenomena in the statistical mechanics domain, their similarities and differences have not been clarified. After considering their analytical aspects, this article presents a large-scale application of these methods to text. It is found that both methods can distinguish real text from independently and identically distributed (i.i.d.) sequences. Furthermore, it is found that the Taylor exponents acquired from words can roughly distinguish text categories; this is also the case for Ebeling and Neiman exponents, but to a lesser extent. Additionally, both methods show some possibility of capturing script kinds.

* Fractals, in 2021, No.2. https://www.worldscientific.com/toc/fractals/0/ja

Extraction of Templates from Phrases Using Sequence Binary Decision Diagrams

Jan 28, 2020

Daiki Hirano, Kumiko Tanaka-Ishii, Andrew Finch

Abstract: The extraction of templates such as "regard X as Y" from a set of related phrases requires the identification of their internal structures. This paper presents an unsupervised approach for extracting templates on-the-fly from only tagged text by using a novel relaxed variant of the Sequence Binary Decision Diagram (SeqBDD). A SeqBDD can compress a set of sequences into a graphical structure equivalent to a minimal DFA, but more compact and better suited to the task of template extraction. The main contribution of this paper is a relaxed form of the SeqBDD construction algorithm that enables it to form general representations from a small amount of data. The process of compression of shared structures in the text during Relaxed SeqBDD construction naturally induces the templates we wish to extract. Experiments show that the method is capable of high-quality extraction on tasks based on verb+preposition templates from corpora and phrasal templates from short messages from social media.

* Natural Language Engineering, 2018

Evaluating Computational Language Models with Scaling Properties of Natural Language

Jun 22, 2019

Shuntaro Takahashi, Kumiko Tanaka-Ishii

Abstract: In this article, we evaluate computational models of natural language with respect to the universal statistical behaviors of natural language. Statistical mechanical analyses have revealed that natural language text is characterized by scaling properties, which quantify the global structure in the vocabulary population and the long memory of a text. We study whether five scaling properties (given by Zipf's law, Heaps' law, Ebeling's method, Taylor's law, and long-range correlation analysis) can serve for evaluation of computational models. Specifically, we test n-gram language models, a probabilistic context-free grammar (PCFG), language models based on Simon/Pitman-Yor processes, neural language models, and generative adversarial networks (GANs) for text generation. Our analysis reveals that language models based on recurrent neural networks (RNNs) with a gating mechanism (i.e., long short-term memory, LSTM; a gated recurrent unit, GRU; and quasi-recurrent neural networks, QRNNs) are the only computational models that can reproduce the long memory behavior of natural language. Furthermore, through comparison with recently proposed model-based evaluation methods, we find that the exponent of Taylor's law is a good indicator of model quality.

* 32 pages, accepted by Computational Linguistics

Word Familiarity and Frequency

Jun 09, 2018

Kumiko Tanaka-Ishii, Hiroshi Terada

Abstract: Word frequency is assumed to correlate with word familiarity, but the strength of this correlation has not been thoroughly investigated. In this paper, we report on our analysis of the correlation between a word familiarity rating list obtained through a psycholinguistic experiment and the log-frequency obtained from various corpora of different kinds and sizes (up to the terabyte scale) for English and Japanese. Major findings are threefold: First, for a given corpus, familiarity is necessary for a word to achieve high frequency, but familiar words are not necessarily frequent. Second, correlation increases with the corpus data size. Third, a corpus of spoken language correlates better than one of written language. These findings suggest that cognitive familiarity ratings are correlated to frequency, but more highly to that of spoken rather than written language.

* 17 pages, 8 figures, Published in Studia Linguistica in 2011. Available also from Wiley Online Library

Taylor's law for Human Linguistic Sequences

Jun 07, 2018

Tatsuru Kobayashi, Kumiko Tanaka-Ishii

Abstract: Taylor's law describes the fluctuation characteristics underlying a system in which the variance of an event within a time span grows by a power law with respect to the mean. Although Taylor's law has been applied in many natural and social systems, its application for language has been scarce. This article describes a new quantification of Taylor's law in natural language and reports an analysis of over 1100 texts across 14 languages. The Taylor exponents of written natural language texts were found to exhibit almost the same value. The exponent was also compared for other language-related data, such as child-directed speech, music, and programming language code. The results show how the Taylor exponent serves to quantify the fundamental structural complexity underlying linguistic time series. The article also shows the applicability of these findings in evaluating language models.

* 11 pages, 16 figures, Accepted as ACL 2018 long paper
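
One way to estimate a Taylor exponent from a token sequence can be sketched as follows, under stated assumptions: the text is cut into fixed-size windows, each word type's per-window count gives a mean and a standard deviation, and the exponent is the least-squares slope of log(standard deviation) against log(mean). The function name `taylor_exponent`, the window size, and the synthetic i.i.d. sample are all choices made here for illustration and may differ from the paper's procedure; fitting the variance instead of the standard deviation would double the exponent.

```python
import math
import random
from collections import Counter

def taylor_exponent(tokens, window=100):
    """Least-squares slope of log(std) vs. log(mean) over word types."""
    n_windows = len(tokens) // window
    used = tokens[: n_windows * window]
    counts = [Counter(used[i * window:(i + 1) * window]) for i in range(n_windows)]
    log_mu, log_sigma = [], []
    for word in set(used):
        per_window = [c[word] for c in counts]
        mu = sum(per_window) / n_windows
        var = sum((x - mu) ** 2 for x in per_window) / n_windows
        if var > 0:  # skip types with no fluctuation (log undefined)
            log_mu.append(math.log(mu))
            log_sigma.append(0.5 * math.log(var))
    mean_x = sum(log_mu) / len(log_mu)
    mean_y = sum(log_sigma) / len(log_sigma)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_mu, log_sigma))
    sxx = sum((x - mean_x) ** 2 for x in log_mu)
    return sxy / sxx

# Synthetic i.i.d. sequence with Zipf-like word weights (illustrative only).
rng = random.Random(0)
vocab = [f"w{i}" for i in range(200)]
weights = [1.0 / (i + 1) for i in range(200)]
iid_tokens = rng.choices(vocab, weights=weights, k=20000)
alpha = taylor_exponent(iid_tokens)
print(alpha)
```

For an i.i.d. sequence the per-window counts are approximately binomial, so the fitted exponent should come out near 0.5; an exponent above that indicates clustered, non-independent word occurrences.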

Assessing Language Models with Scaling Properties

Apr 24, 2018

Shuntaro Takahashi, Kumiko Tanaka-Ishii

Abstract: Language models have primarily been evaluated with perplexity. While perplexity quantifies the most comprehensible prediction performance, it does not provide qualitative information on the success or failure of models. Another approach for evaluating language models is thus proposed, using the scaling properties of natural language. Five such tests are considered, with the first two accounting for the vocabulary population and the other three for the long memory of natural language. The following models were evaluated with these tests: n-grams, probabilistic context-free grammar (PCFG), Simon and Pitman-Yor (PY) processes, hierarchical PY, and neural language models. Only the neural language models exhibit the long memory properties of natural language, but to a limited degree. The effectiveness of every test of these models is also discussed.

* 14 pages, 16 figures

Long-Range Correlation Underlying Childhood Language and Generative Models

Dec 11, 2017

Kumiko Tanaka-Ishii

Abstract: Long-range correlation, a property of time series exhibiting long-term memory, is mainly studied in the statistical physics domain and has been reported to exist in natural language. Using a state-of-the-art method for such analysis, long-range correlation is first shown to occur in long CHILDES data sets. To understand why, Bayesian generative models of language, originally proposed in the cognitive scientific domain, are investigated. Among representative models, the Simon model was found to exhibit surprisingly good long-range correlation, but not the Pitman-Yor model. Since the Simon model is known not to correctly reflect the vocabulary growth of natural language, a simple new model is devised as a conjunct of the Simon and Pitman-Yor models, such that long-range correlation holds with a correct vocabulary growth rate. The investigation overall suggests that uniform sampling is one cause of long-range correlation and could thus have a relation with actual linguistic processes.
