The Normalization of Old Lithuanian Orthography for Usage in a Search Engine (Senųjų raštų rašybos keitimas paieškos sistemai)

Mindaugas Šinkūnas

doi:10.15388/Proceedings.2018.16

Lietuvių kalbos institutas

Published 2018-12-20

How to Cite

Šinkūnas, M. (2018) “The Normalization of Old Lithuanian Orthography for Usage in a Search Engine”, Vilnius University Open Series, (1), pp. 389–407. doi:10.15388/Proceedings.2018.16.

Abstract

[full article and abstract in Lithuanian; abstract in English]

The Lithuanian historical corpus consists of machine-readable texts transcribed according to the principles of documentary edition: the original spelling and the language features it encodes are preserved. Several orthographic systems were used at different stages in the history of the Lithuanian language, and some of them differ considerably from the modern one. Language analysis tools developed for modern spelling therefore cannot be applied to texts in the historical orthography. A link is needed that connects the historical orthography to the modern one.

Normalizing spelling poses several challenges: the same grapheme must be realized differently in different cases, in some by keeping the original spelling and in others by rewriting the form in the modern Lithuanian alphabet. At the same time, the phonetics has to be normalized, which includes eliminating dialectal phonetic features and representing phonemes in assimilated positions. These principles can be used to build a universal search engine in which queries are processed across different orthographic systems (http://sr.lki.lt).

The size of the corpus and the limited resources available motivate the search for an automated way of normalizing orthography. A set of rules was developed on the basis of empirical research on the history of orthography. The rules were arranged hierarchically by the length of the character sequence they process, and their application was restricted, by means of metadata, to the spelling features of a particular source. An accuracy of 82–97% correct normalization was achieved.
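The rule ordering described above can be illustrated with a minimal sketch. It is not the author's implementation: the rule table, the source identifiers `source_a` and `source_b`, and the function name `normalize` are all hypothetical, and the few grapheme mappings are chosen only for illustration. The sketch assumes that "hierarchical by length" means longer character sequences are tried before shorter ones, and that metadata selects one rule set per source.

```python
# Hypothetical per-source rule sets (illustrative assumptions, not the
# article's actual rules). Metadata about a source selects its rule set.
RULES_BY_SOURCE = {
    "source_a": {"ſz": "š", "cz": "č", "ſ": "s", "w": "v"},
    "source_b": {"sz": "š", "cz": "č"},
}


def normalize(word: str, source_id: str) -> str:
    """Rewrite a historical spelling, trying longer rules before shorter ones."""
    rules = RULES_BY_SOURCE[source_id]
    # Longest patterns first, so "ſz" is matched before the single "ſ".
    patterns = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(word):
        for pat in patterns:
            if word.startswith(pat, i):
                out.append(rules[pat])
                i += len(pat)
                break
        else:
            # No rule applies at this position: copy the character unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


print(normalize("ſzwieſa", "source_a"))  # prints "šviesa"
```

If the single-character rule for "ſ" were applied first, "ſz" would become "sz" instead of "š", which is why ordering by sequence length matters in this sketch.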

The advantage of rule-based transliteration is the consistency of the changes. The disadvantage is that it generates not one but several equivalents of a word, and in certain cases ambiguous rules generate many tokens that do not exist in the natural language. The number of generated forms fed to the search engine was reduced by discarding forms that contain non-existent letter sequences and by narrowing the query alphabet. A further selection of the correct forms could be made using dictionaries or tools for analyzing the morphology and syntax of modern Lithuanian.
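The trade-off described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the article's method: the ambiguous rule table `AMBIGUOUS_RULES`, the set `FORBIDDEN_SEQUENCES`, the sample word, and the function names are hypothetical. Only the filter on non-existent letter sequences is shown; the narrowing of the query alphabet is not modeled here.

```python
from itertools import product

# Hypothetical ambiguous rules: one historical grapheme, several modern options.
AMBIGUOUS_RULES = {
    "z": ["z", "ž"],
    "i": ["i", "y", "j"],
    "w": ["v"],
}

# Hypothetical letter sequences assumed not to occur in the modern language.
FORBIDDEN_SEQUENCES = {"jv"}


def generate_candidates(word: str) -> list[str]:
    """Expand every ambiguous grapheme into all of its modern equivalents."""
    patterns = sorted(AMBIGUOUS_RULES, key=len, reverse=True)
    options = []
    i = 0
    while i < len(word):
        for pat in patterns:
            if word.startswith(pat, i):
                options.append(AMBIGUOUS_RULES[pat])
                i += len(pat)
                break
        else:
            # Characters without a rule have exactly one option: themselves.
            options.append([word[i]])
            i += 1
    return ["".join(parts) for parts in product(*options)]


def filter_candidates(candidates: list[str]) -> list[str]:
    """Drop candidates that contain a forbidden letter sequence."""
    return [
        c for c in candidates
        if not any(seq in c for seq in FORBIDDEN_SEQUENCES)
    ]


candidates = generate_candidates("ziwas")
print(len(candidates))                # prints 6
print(filter_candidates(candidates))  # prints ['zivas', 'zyvas', 'živas', 'žyvas']
```

The number of candidates grows multiplicatively with each ambiguous grapheme, so even a small filter of this kind reduces the number of forms passed on to the search engine. A dictionary lookup, as the abstract suggests, would be a natural next selection step.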
