Statistical Analysis of Word Frequency Distribution in Lithuanian Texts of Different Genres
Neringa Bružaitė
Vilnius Gediminas Technical University, Lithuania
Tomas Rekašius
Vilnius Gediminas Technical University, Lithuania
Published 2016-12-20
https://doi.org/10.15388/LJS.2016.13868

Keywords

word frequencies
structural distribution
Zipf’s law
hierarchical clustering
Jaccard distance
Ward method

How to Cite

Bružaitė N. and Rekašius T. (2016) “Statistical Analysis of Word Frequency Distribution in Lithuanian Texts of Different Genres”, Lithuanian Journal of Statistics, 55(1), pp. 61-69. doi: 10.15388/LJS.2016.13868.

Abstract

The paper examines Lithuanian texts by different authors and in different genres. The main points of interest are the number of words, the number of different words, and word frequencies. The structural type distribution and Zipf’s law are applied to describe the frequency distribution of words in a text. The lexical diversity of a text can be characterized by the set of different words used in it, also called its vocabulary. It is shown that the information contained in a reduced vocabulary is enough to divide the texts analyzed in this article into groups by genre and author using a hierarchical clustering method. In this case, distances between clusters are measured using the Jaccard distance, and clusters are aggregated using the Ward method.
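
The clustering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the function names reduced_vocabulary, jaccard_distance and ward_merges are hypothetical, the reduced vocabulary is taken to be the top_k most frequent word forms (the abstract does not say how the vocabulary is reduced), and Ward aggregation is done with the Lance-Williams update applied directly to Jaccard distances, which is a heuristic because Ward's criterion is formally defined for Euclidean distances.

```python
import re
from collections import Counter
from itertools import combinations


def reduced_vocabulary(text, top_k=50):
    """Set of the top_k most frequent word forms (assumed reduction rule)."""
    # \w+ matches Unicode letters (including Lithuanian ones), digits and "_".
    words = re.findall(r"\w+", text.lower())
    return {word for word, _ in Counter(words).most_common(top_k)}


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; defined here as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def ward_merges(vocabularies):
    """Agglomerate vocabularies; return (members_a, members_b, height) per merge."""
    n = len(vocabularies)
    members = {i: (i,) for i in range(n)}  # cluster id -> indices of its texts
    dist = {
        frozenset((i, j)): jaccard_distance(vocabularies[i], vocabularies[j])
        for i, j in combinations(range(n), 2)
    }
    merges = []
    next_id = n
    while len(members) > 1:
        # Pick the closest pair of current clusters.
        i, j = min(combinations(sorted(members), 2),
                   key=lambda pair: dist[frozenset(pair)])
        d_ij = dist[frozenset((i, j))]
        n_i, n_j = len(members[i]), len(members[j])
        # Lance-Williams update for Ward's method, applied to squared distances.
        for k in members:
            if k in (i, j):
                continue
            n_k = len(members[k])
            d_ik = dist[frozenset((i, k))]
            d_jk = dist[frozenset((j, k))]
            squared = ((n_i + n_k) * d_ik ** 2 + (n_j + n_k) * d_jk ** 2
                       - n_k * d_ij ** 2) / (n_i + n_j + n_k)
            dist[frozenset((next_id, k))] = squared ** 0.5
        merges.append((members[i], members[j], d_ij))
        members[next_id] = members[i] + members[j]
        del members[i], members[j]
        next_id += 1
    return merges


# Toy usage with made-up vocabularies (placeholder tokens, not data from the paper).
toy = [{"a", "b", "c", "d"}, {"a", "b", "c", "e"},
       {"x", "y", "z", "w"}, {"x", "y", "z", "v"}]
for left, right, height in ward_merges(toy):
    print(left, right, round(height, 3))
```

In practice, each text would first be passed through reduced_vocabulary, and the resulting sets handed to ward_merges; cutting the merge history at a chosen height then yields the groups. A library routine for hierarchical clustering could replace the hand-written loop; the pure-Python version is used here only to keep the sketch self-contained.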

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
