Abstract: We propose a general evolutionary mechanism to explain the diversity of genes and languages. To quantify their common features and reveal hidden structures, we examine several statistical properties and patterns using a new method called rank-rank analysis. We find that the classical correspondence, "domain plays the role of word in gene language", is not rigorous, and propose replacing domain with protein. In addition, we devise a new evolutionary unit, the syllgram, that captures the characteristics of both spoken and written language. Based on the correspondence between (protein, domain) and (word, syllgram), we find that genes and languages share a common scaling structure and a scale-free network. Like the Rosetta Stone, this work may help decipher the secrets behind non-coding DNA and unknown languages.
Abstract: One of the ultimate goals of linguists is to find universal properties of human languages. Although words are generally regarded as arbitrary mappings between linguistic forms and meanings, we propose a new universal law, complementary to Zipf's law, that highlights the equally important role of syllables. When the rank-rank frequency distribution of words and syllables is plotted for English and Chinese corpora, visible lines appear that can be fit to a master curve. We find that the multi-layer network of words and syllables derived from this analysis exhibits self-organization, which relies heavily on the inclusion of syllables and their connections. We derive an analytic form for the scaling structure and use it to quantify how Internet slang becomes fashionable, demonstrating its usefulness as a new tool for evolutionary linguistics.
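
The abstracts do not spell out how a rank-rank plot is built, so the following is only a minimal sketch of one plausible reading: rank words and syllables separately by frequency, then pair each word's rank with the rank of every syllable it contains. The function names, the pre-syllabified input format (each token as a tuple of syllables), and the toy corpus are assumptions made for illustration, not the authors' procedure or data.

```python
from collections import Counter

def rank_by_frequency(counts):
    # Rank 1 = most frequent; ties are broken by sort order for determinism.
    ordered = sorted(counts, key=lambda k: (-counts[k], k))
    return {item: r for r, item in enumerate(ordered, start=1)}

def rank_rank_pairs(tokens):
    # tokens: list of words, each given as a tuple of syllables
    # (hypothetical pre-syllabified input).
    word_counts = Counter(tokens)
    syll_counts = Counter(s for w in tokens for s in w)
    word_rank = rank_by_frequency(word_counts)
    syll_rank = rank_by_frequency(syll_counts)
    # One (word rank, syllable rank) point per distinct syllable of each word type.
    return sorted((word_rank[w], syll_rank[s]) for w in word_counts for s in set(w))

# Toy corpus, invented for illustration only.
corpus = [("in", "ter", "net"), ("net",), ("net",), ("in",), ("in", "ter", "net")]
pairs = rank_rank_pairs(corpus)
print(pairs)
```

Scatter-plotting such pairs on log-log axes for a real corpus would be the natural next step; whether the visible lines described above emerge depends on the corpus and on the syllabification used.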