Tomasz Stanisz

Average shortest-path length in word-adjacency networks: Chinese versus English

Jan 10, 2026

Jakub Dec, Michał Dolina, Stanisław Drożdż, Jarosław Kwapień, Jin Liu, Tomasz Stanisz

Abstract: Complex networks provide powerful tools for analyzing and understanding the intricate structures present in various systems, including natural language. Here, we analyze the topology of growing word-adjacency networks constructed from Chinese and English literary works written in different periods. Unconventionally, instead of considering dictionary words only, we also include punctuation marks as if they were ordinary words. Our approach is based on two arguments: (1) punctuation carries genuine information related to emotional state, allows for logical grouping of content, provides a pause in reading, and facilitates understanding by avoiding ambiguity, and (2) our previous works have shown that punctuation marks behave like words in a Zipfian analysis and, if considered together with regular words, can improve authorship attribution in stylometric studies. We focus on the functional dependence of the average shortest path length $L(N)$ on the network size $N$ for different epochs and individual novels in their original language as well as for translations of selected novels into the other language. We approximate the empirical results with a growing network model and obtain satisfactory agreement between the two. We also observe that $L(N)$ behaves asymptotically similarly for both languages if punctuation marks are included but becomes sizably larger for Chinese if punctuation marks are neglected.

* Physical Review E 112, 064318 (2025)
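
A minimal sketch of the quantity studied in the abstract above: build a word-adjacency network from a token sequence, treat punctuation marks as ordinary tokens, and track the average shortest-path length $L(N)$ as the network grows. The tokenizer regex, the function names, and the choice of an unweighted, undirected graph are assumptions made for illustration and are not taken from the paper; Chinese text would additionally need a word-segmentation step that is not shown here.

```python
import re
from collections import deque

def tokenize(text, keep_punctuation=True):
    """Lowercase tokens; punctuation marks are kept as tokens if requested."""
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    return [t.lower() for t in re.findall(pattern, text)]

def adjacency_network(tokens):
    """Undirected, unweighted graph linking tokens that occur next to each other."""
    graph = {t: set() for t in tokens}
    for a, b in zip(tokens, tokens[1:]):
        if a != b:  # ignore self-loops from immediate repetitions
            graph[a].add(b)
            graph[b].add(a)
    return graph

def average_shortest_path_length(graph):
    """Mean BFS distance over ordered pairs of distinct, mutually reachable nodes."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

def growth_curve(tokens, step):
    """(N, L) pairs for networks built from growing prefixes of the text."""
    curve = []
    for end in range(step, len(tokens) + 1, step):
        graph = adjacency_network(tokens[:end])
        curve.append((len(graph), average_shortest_path_length(graph)))
    return curve

# Hypothetical toy input; real use would pass a full novel.
tokens = tokenize("The cat sat. The dog sat, too.")
print(growth_curve(tokens, step=5))
```

Calling `tokenize` with `keep_punctuation=False` and rebuilding the curve gives the punctuation-free variant, so the two $L(N)$ curves can be compared on the same text.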

Quantifying patterns of punctuation in modern Chinese prose

Mar 06, 2025

Michał Dolina, Jakub Dec, Stanisław Drożdż, Jarosław Kwapień, Jin Liu, Tomasz Stanisz

Abstract: Recent research shows that punctuation patterns in texts exhibit universal features across languages. Analysis of Western classical literature reveals that the distribution of spaces between punctuation marks aligns with a discrete Weibull distribution, typically used in survival analysis. By extending this analysis to Chinese literature represented here by three notable contemporary works, it is shown that Zipf's law applies to Chinese texts similarly to Western texts, where punctuation patterns also improve adherence to the law. Additionally, the distance distribution between punctuation marks in Chinese texts follows the Weibull model, though larger spacing is less frequent than in English translations. Sentence-ending punctuation, representing sentence length, diverges more from this pattern, reflecting greater flexibility in sentence length. This variability supports the formation of complex, multifractal sentence structures, particularly evident in Gao Xingjian's "Soul Mountain". These findings demonstrate that both Chinese and Western texts share universal punctuation and word distribution patterns, underscoring their broad applicability across languages.

* Chaos 35, 023155 (2025)
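
As a rough illustration of the analysis described in the abstract above, the sketch below measures the distances between consecutive punctuation marks (in words) and fits a discrete Weibull distribution to them. It assumes one common parameterization, $P(k) = (1-p)^{(k-1)^\beta} - (1-p)^{k^\beta}$ for $k = 1, 2, \dots$; the abstract does not state the exact convention. The set of marks, the tokenizer, and the simple linearized least-squares fit are likewise assumptions, not the authors' procedure.

```python
import math
import re
from collections import Counter

# Assumed set of marks (Western and Chinese); adjust for the corpus at hand.
PUNCTUATION = set(".,;:!?，。；：！？")

def punctuation_distances(text):
    """Number of words between consecutive punctuation marks (only k >= 1 kept)."""
    distances, words_since_mark = [], 0
    for token in re.findall(r"\w+|[^\w\s]", text):
        if token in PUNCTUATION:
            if words_since_mark > 0:
                distances.append(words_since_mark)
            words_since_mark = 0
        elif re.match(r"\w", token):
            words_since_mark += 1
    return distances

def discrete_weibull_pmf(k, p, beta):
    """P(X = k) for k = 1, 2, ... with 0 < p < 1 and beta > 0."""
    q = 1.0 - p
    return q ** ((k - 1) ** beta) - q ** (k ** beta)

def fit_discrete_weibull(distances):
    """Estimate (p, beta) from the line ln(-ln S(k)) = beta*ln(k) + ln(-ln(1-p)),
    where S(k) is the empirical fraction of distances larger than k."""
    n = len(distances)
    counts = Counter(distances)
    xs, ys, remaining = [], [], n
    for k in sorted(counts):
        remaining -= counts[k]  # samples with distance > k
        survival = remaining / n
        if 0.0 < survival < 1.0:
            xs.append(math.log(k))
            ys.append(math.log(-math.log(survival)))
    if len(xs) < 2:
        raise ValueError("need at least two distinct distances to fit")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    p = 1.0 - math.exp(-math.exp(intercept))
    return p, beta
```

With $\beta = 1$ the distribution reduces to the geometric distribution, which gives a convenient sanity check for the fit; a maximum-likelihood fit would be a natural refinement of this linearized estimate.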

Statistics of punctuation in experimental literature -- the remarkable case of "Finnegans Wake" by James Joyce

Aug 31, 2024

Tomasz Stanisz, Stanisław Drożdż, Jarosław Kwapień

Abstract: As recent studies indicate, the structure imposed onto written texts by the presence of punctuation develops patterns which reveal certain characteristics of universality. In particular, based on a large collection of classic literary works, it has been evidenced that the distances between consecutive punctuation marks, measured in terms of the number of words, obey the discrete Weibull distribution - a discrete variant of a distribution often used in survival analysis. The present work extends the analysis of punctuation usage patterns to more experimental pieces of world literature. It turns out that the compliance of the distances between punctuation marks with the discrete Weibull distribution typically applies here as well. However, some of the works by James Joyce are distinct in this regard - in the sense that the tails of the relevant distributions are significantly thicker and, consequently, the corresponding hazard functions are decreasing functions not observed in typical literary texts in prose. "Finnegans Wake" - the same one to which science owes the word "quarks" for the most fundamental constituents of matter - is particularly striking in this context. At the same time, in all the studied texts, the sentence lengths - representing the distances between sentence-ending punctuation marks - reveal more freedom and are not constrained by the discrete Weibull distribution. This freedom in some cases translates into long-range nonlinear correlations, which manifest themselves in multifractality. Again, a text particularly spectacular in terms of multifractality is "Finnegans Wake".

* Chaos 34, 083124 (2024)

Complex systems approach to natural language

Jan 05, 2024

Tomasz Stanisz, Stanisław Drożdż, Jarosław Kwapień

Abstract: The review summarizes the main methodological concepts used in studying natural language from the perspective of complexity science and documents their applicability in identifying both universal and system-specific features of language in its written representation. Three main complexity-related research trends in quantitative linguistics are covered. The first part addresses the issue of word frequencies in texts and demonstrates that taking punctuation into consideration restores the scaling whose violation in Zipf's law is often observed for the most frequent words. The second part introduces methods inspired by time series analysis, used in studying various kinds of correlations in written texts. The related time series are generated on the basis of text partition into sentences or into phrases between consecutive punctuation marks. It turns out that these series develop features often found in signals generated by complex systems, like long-range correlations or (multi)fractal structures. Moreover, it appears that the distances between punctuation marks comply with the discrete variant of the Weibull distribution. In the third part, the application of the network formalism to natural language is reviewed, particularly in the context of the so-called word-adjacency networks. Parameters characterizing the topology of such networks can be used for classification of texts, for example, from a stylometric perspective. The network approach can also be applied to represent the organization of word associations. The structure of word-association networks turns out to be significantly different from that observed in random networks, revealing genuine properties of language. Finally, punctuation seems to have a significant impact not only on the language's information-carrying ability but also on its key statistical properties, hence it is recommended to consider punctuation marks on a par with words.

* Physics Reports 1053 (2024) 1-84
* 113 pages, 49 figures

Universal versus system-specific features of punctuation usage patterns in major Western languages

Dec 21, 2022

Tomasz Stanisz, Stanislaw Drozdz, Jaroslaw Kwapien

Abstract: The celebrated proverb that "speech is silver, silence is golden" has a long multinational history and multiple specific meanings. In written texts punctuation can in fact be considered one of its manifestations. Indeed, the virtue of effectively speaking and writing involves - often decisively - the capacity to apply the properly placed breaks. In the present study, based on a large corpus of world-famous and representative literary texts in seven major Western languages, it is shown that the distribution of intervals between consecutive punctuation marks in almost all texts can universally be characterised by only two parameters of the discrete Weibull distribution, which can be given an intuitive interpretation in terms of the so-called hazard function. The values of these two parameters tend to be language-specific, however, and even appear to navigate translations. The properties of the computed hazard functions indicate that among the studied languages, English turns out to be the least constrained by the necessity to place a consecutive punctuation mark to partition a sequence of words. This may suggest that when compared to other studied languages, English is more flexible, in the sense of allowing longer uninterrupted sequences of words. Spanish reveals a similar tendency, to only a slightly lesser extent.
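
To make the hazard-function interpretation concrete, here is a small sketch under the assumption that the discrete Weibull distribution is parameterized as $P(k) = (1-p)^{(k-1)^\beta} - (1-p)^{k^\beta}$ for $k \ge 1$ (a common form; the abstract does not spell it out). The hazard is then the probability that a punctuation mark follows the $k$-th word, given that none has appeared after the previous $k-1$ words. The function name is hypothetical.

```python
def discrete_weibull_hazard(k, p, beta):
    """h(k) = P(X = k | X >= k) = 1 - (1 - p) ** (k**beta - (k - 1)**beta), k >= 1.

    Assumes 0 < p < 1 and beta > 0.
    """
    return 1.0 - (1.0 - p) ** (k ** beta - (k - 1) ** beta)

# Illustrative parameter values only; they are not taken from the paper.
for beta in (0.7, 1.0, 1.5):
    values = [round(discrete_weibull_hazard(k, 0.2, beta), 3) for k in range(1, 6)]
    print(beta, values)
```

In this form $h(1) = p$, so $p$ is the chance of a mark right after the first word, while $\beta$ controls the trend: $\beta > 1$ gives a hazard that rises with $k$ (a mark becomes ever more likely as the word sequence lengthens), $\beta = 1$ gives a constant hazard, and $\beta < 1$ gives a decreasing one.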

Linguistic data mining with complex networks: a stylometric-oriented approach

Aug 16, 2018

Tomasz Stanisz, Jaroslaw Kwapien, Stanislaw Drozdz

Abstract: By representing a text by a set of words and their co-occurrences, one obtains a word-adjacency network - a network being in a way a reduced representation of the given language sample. In this paper, the possibility of using network representation in order to extract information about individual language styles of literary texts is studied. By determining selected quantitative characteristics of the networks and applying machine learning algorithms, it is made possible to distinguish between texts of different authors. It turns out that within the studied set of texts in English and Polish, the properly rescaled weighted clustering coefficients and weighted degrees of only a few nodes in the word-adjacency networks are sufficient to obtain an accuracy of authorship attribution over 90%. A correspondence between the authorship of texts and the structure of word-adjacency networks can therefore clearly be found; it may be stated that the network representation allows one to distinguish individual language styles by comparing the way the authors use particular words and punctuation marks. The presented approach can be viewed as a generalization of the authorship attribution methods based on the simplest lexical features. Apart from the characteristics given above, other network parameters are studied, both local and global ones, for both the unweighted and weighted networks. Their potential to capture the diversity of writing styles is discussed; some differences between languages are also observed.

In narrative texts punctuation marks obey the same statistics as words

Sep 27, 2016

Andrzej Kulig, Jaroslaw Kwapien, Tomasz Stanisz, Stanislaw Drozdz

Abstract: From a grammar point of view, the role of punctuation marks in a sentence is formally defined and well understood. In semantic analysis punctuation also plays a crucial role as a method of avoiding ambiguity of the meaning. A different situation can be observed in the statistical analyses of language samples, where the decision on whether the punctuation marks should be considered or should be neglected is seen rather as arbitrary and at present it belongs to a researcher's preference. An objective of this work is to shed some light onto this problem by providing us with an answer to the question whether the punctuation marks may be treated as ordinary words and whether they should be included in any analysis of the word co-occurrences. We already know from our previous study (S. Drożdż et al., Inf. Sci. 331 (2016) 32-44) that full stops that determine the length of sentences are the main carrier of long-range correlations. Now we extend that study and analyze statistical properties of the most common punctuation marks in a few Indo-European languages, investigate their frequencies, and locate them accordingly in the Zipf rank-frequency plots as well as study their role in the word-adjacency networks. We show that, from a statistical viewpoint, the punctuation marks reveal properties that are qualitatively similar to the properties of the most frequent words like articles, conjunctions, pronouns, and prepositions. This refers to both the Zipfian analysis and the network analysis. By adding the punctuation marks to the Zipf plots, we also show that these plots that are normally described by the Zipf-Mandelbrot distribution largely restore the power-law Zipfian behaviour for the most frequent items.

* Information Sciences 375 (2017) 98-113
* Information Sciences (in print)
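
The Zipfian part of the abstract above can be sketched as follows: count punctuation marks alongside words, rank all tokens by frequency, and estimate the slope of the rank-frequency relation on a log-log scale. The tokenizer and the plain least-squares estimate are assumptions chosen for brevity; they are not the procedure used in the paper, and the helper names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase words plus single punctuation marks, in text order."""
    return [t.lower() for t in re.findall(r"\w+|[^\w\s]", text)]

def rank_frequency(tokens):
    """List of (rank, token, count), most frequent first; ranks start at 1."""
    ordered = Counter(tokens).most_common()
    return [(rank, tok, n) for rank, (tok, n) in enumerate(ordered, start=1)]

def zipf_exponent(table, max_rank=None):
    """Least-squares slope of log(count) against log(rank), as a positive exponent."""
    rows = table[:max_rank] if max_rank else table
    if len(rows) < 2:
        raise ValueError("need at least two ranks to estimate a slope")
    xs = [math.log(rank) for rank, _, _ in rows]
    ys = [math.log(count) for _, _, count in rows]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Hypothetical toy input; a real analysis needs a full-length text.
table = rank_frequency(tokenize("The cat sat, and the dog sat. The end."))
print(table[:3])
```

Running `rank_frequency` once on the full token list and once on the words alone shows where the marks land among the top ranks, which is the comparison the abstract describes.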
