Statistical Analysis of Word Frequency Distribution in Lithuanian Texts of Different Genres

Neringa Bružaitė; Tomas Rekašius

doi:10.15388/LJS.2016.13868


Neringa Bružaitė

Vilnius Gediminas Technical University, Lithuania

Tomas Rekašius

Vilnius Gediminas Technical University, Lithuania

Published 2016-12-20

https://doi.org/10.15388/LJS.2016.13868


Keywords

word frequencies
structural distribution
Zipf’s law
hierarchical clustering
Jaccard distance
Ward method

How to Cite

Bružaitė, N. and Rekašius, T. (2016) “Statistical Analysis of Word Frequency Distribution in Lithuanian Texts of Different Genres”, Lithuanian Journal of Statistics, 55(1), pp. 61–69. doi:10.15388/LJS.2016.13868.


Abstract

The paper examines Lithuanian texts by different authors and in different genres. The main quantities of interest are the number of words, the number of different words, and the word frequencies. The structural type distribution and Zipf’s law are applied to describe the frequency distribution of words in a text. The lexical diversity of a text can be characterized by the set of different words used in it, also called its vocabulary. It is shown that the information contained in a reduced vocabulary is enough to divide the texts analyzed in this article into groups by genre and author using a hierarchical clustering method. In this case, distances between clusters are measured using the Jaccard distance, and clusters are aggregated using the Ward method.
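The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The toy texts, the function names, and the rule used to reduce the vocabulary (keeping the `top_n` most frequent words) are all assumptions made for illustration, since the abstract does not say how the vocabulary was reduced. Ward's criterion is formally defined for Euclidean distances, so applying it to Jaccard distances, as done here through the Lance-Williams update, should be read as a heuristic.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase word tokens; the pattern keeps Unicode letters such as ž or ė."""
    return re.findall(r"[^\W\d_]+", text.lower())


def reduced_vocabulary(text, top_n=50):
    """Set of the top_n most frequent distinct words (an assumed reduction rule)."""
    return {word for word, _ in Counter(tokenize(text)).most_common(top_n)}


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two vocabularies given as sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def ward_clustering(dist, n):
    """Agglomerative clustering using the Lance-Williams update for Ward's method.

    dist maps frozenset({i, j}) to the distance between items i and j.
    Returns the merges as (cluster_i, cluster_j, merge_distance) tuples;
    merged clusters receive new ids n, n + 1, ...
    """
    d = dict(dist)
    size = {i: 1 for i in range(n)}
    merges = []
    next_id = n
    while len(size) > 1:
        # Pick the closest pair of currently active clusters.
        i, j = min(combinations(sorted(size), 2), key=lambda p: d[frozenset(p)])
        dij = d[frozenset((i, j))]
        ni, nj = size[i], size[j]
        # Distance from the merged cluster to every other active cluster.
        for k in size:
            if k in (i, j):
                continue
            nk = size[k]
            squared = (
                (ni + nk) * d[frozenset((i, k))] ** 2
                + (nj + nk) * d[frozenset((j, k))] ** 2
                - nk * dij ** 2
            ) / (ni + nj + nk)
            # Guard against negative values, possible for non-Euclidean input.
            d[frozenset((next_id, k))] = max(squared, 0.0) ** 0.5
        del size[i], size[j]
        size[next_id] = ni + nj
        merges.append((i, j, dij))
        next_id += 1
    return merges


# Hypothetical toy texts: two tale-like and two statistics-like sentences.
texts = [
    "the king rode to the castle and the king slept",
    "the king left the castle and the queen slept",
    "the sample mean and the sample variance converge",
    "the sample variance and the estimator converge",
]
vocabularies = [reduced_vocabulary(t) for t in texts]
distances = {
    frozenset((i, j)): jaccard_distance(vocabularies[i], vocabularies[j])
    for i, j in combinations(range(len(texts)), 2)
}
merges = ward_clustering(distances, len(texts))
for left, right, height in merges:
    print(left, right, round(height, 3))
```

Each printed row gives the ids of the two clusters merged and the distance at which they were joined. Texts sharing more of their vocabulary are merged earlier, which is the property the abstract relies on when it groups texts by genre and author.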

