Researchers use a word co-occurrence network (WCN) to find features of language that cannot be analyzed using existing WCNs
Every day the world generates a vast amount of data in a variety of languages. Semantic networks, such as word co-occurrence networks (WCNs), can help overcome language barriers and analyze these data. Studies have shown that WCNs can accurately capture syntactic features of language by analyzing consecutive words in sentences, but thus far no one has explored the relationships between distant words. Recently, researchers used an enhanced WCN to investigate just that.
There are nearly 7,000 different languages in the world, and several quintillion bytes of data are generated in nearly all of them every day. This poses a serious problem for data analysis. Scholars have proposed complex network theory as a solution to this issue. One of the main types of semantic network is the word co-occurrence network (WCN). In a WCN, words (morphemes) form the vertices of the network, and the edges between these vertices connect words on the basis of a string of words called an ‘n-gram.’ Here, n refers to the number of consecutive words in a sentence that are analysed at a time. Previous research has been limited to WCNs with a maximum n of two and has found that these WCNs can capture the characteristic features of multiple languages fairly well. But what is the relationship between distant words in sentences? Or, phrased differently, what happens when you increase n beyond two?
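The construction described above can be illustrated with a minimal Python sketch. It is not the research team's implementation. It assumes that words are obtained by lowercasing and splitting on whitespace, and that every pair of distinct words inside a window of n consecutive words is linked; the article does not specify either detail. The function name build_wcn is hypothetical.

```python
from itertools import combinations


def build_wcn(sentences, n=2):
    """Build an undirected word co-occurrence network.

    Returns a dict mapping each word (vertex) to the set of words it
    is linked to. Assumption: all distinct words that share a window
    of n consecutive words in a sentence are connected pairwise.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    graph = {}
    for sentence in sentences:
        # Assumed tokenisation: lowercase, split on whitespace.
        words = sentence.lower().split()
        for word in words:
            graph.setdefault(word, set())
        # Slide a window of n consecutive words along the sentence.
        # Sentences shorter than n contribute vertices but no edges.
        for start in range(len(words) - n + 1):
            window = words[start:start + n]
            for a, b in combinations(window, 2):
                if a != b:
                    graph[a].add(b)
                    graph[b].add(a)
    return graph
```

With n = 2, only neighbouring words are linked. In the sentence "the cat sat on the mat", raising n to 3 additionally links "the" to "sat", a pair of words that are not adjacent.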
To answer this question, a research team led by Prof. Tohru Ikeguchi from Tokyo University of Science investigated the syntactic dependency relations in languages by using WCNs with increasing n. “We transformed well-known documents in eight languages into WCNs with n greater than or equal to two and found important features of each language in the WCNs,” says Professor Ikeguchi. The team also consisted of Mr. Kihei Magishi and Prof. Tomoko Matsumoto of Tokyo University of Science and Prof. Yutaka Shimada of Saitama University. The study was published in Nonlinear Theory and Its Applications, IEICE on April 1, 2022.
For their study, the research team transformed well-known works in eight different languages into WCNs. These works covered a wide range of text data: the New Testament of the Christian Bible, the United Nations proceedings, the Paris Agreement, and novels by different authors. These documents were chosen because they have been accurately translated into multiple languages, which makes them suitable for faithful cross-language analysis. The team then analysed the WCNs for a variety of n, up to n = 16.
“We found that the important features of each language appear in the networks with more than three co-occurrences, i.e., with n greater than or equal to three. We also saw that some of the network indices used to evaluate the structural features of the networks depend on the text data,” explains Prof. Ikeguchi. The network indices that are dependent on the text data include the number of words and vertices, the density of the network, the triangle clustering coefficient and the square clustering coefficient. However, the research team also observed that some indices remained independent of the text data, such as the triangle clustering coefficient and the average shortest-path length, thereby enabling the description of the similarities and differences between languages.
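For readers unfamiliar with these indices, the following Python sketch computes three of them using their standard graph-theoretic definitions. Whether the study uses exactly these variants is an assumption, as is the choice to ignore unreachable pairs of words when averaging path lengths. The network is assumed to be stored as a dict that maps each word to the set of its neighbours, and the function names are illustrative. The square clustering coefficient is omitted for brevity.

```python
from collections import deque


def density(graph):
    """Fraction of possible vertex pairs that are joined by an edge."""
    v = len(graph)
    edges = sum(len(nbrs) for nbrs in graph.values()) // 2
    return 0.0 if v < 2 else 2 * edges / (v * (v - 1))


def triangle_clustering(graph):
    """Average over vertices of the share of neighbour pairs that are linked."""
    if not graph:
        return 0.0
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue  # vertices with fewer than two neighbours count as 0
        nb = list(nbrs)
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nb[j] in graph[nb[i]]
        )
        total += 2 * links / (k * (k - 1))
    return total / len(graph)


def average_shortest_path_length(graph):
    """Mean breadth-first-search distance over all reachable ordered pairs."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0
```

On a toy network in which the words a, b and c form a triangle and a fourth word d is attached only to c, these functions give a density of 2/3, a triangle clustering coefficient of 7/12 and an average shortest-path length of 4/3.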
Speaking of the long-term applications of the study, Prof. Ikeguchi says, “We are working towards the foundation of a new field of linguistics, mathematical linguistics. By deriving meta-grammar rules from mathematical commonalities and universality that appear in the grammatical functions of various languages, we will be able to establish a foundation for this field.”
The clarification of meta-grammar rules that do not depend on language will help realise the quantitative classification of language and help establish the factors that cause languages to diverge. The findings of this study constitute a major first step and make significant contributions to the understanding of the similarities and differences between languages.