First steps towards the Lithuanian word association database

This paper introduces the first version of the Lithuanian database of free association norms. This is an attempt to provide an open-access resource, which would be helpful for psycholinguists, linguists, computational linguists, and students. This version of the database includes 277 cue word forms. The responses were collected from 304 participants. In total 15,612 association pairs were recorded. The paper presents the procedure of collecting free associations and additional data available for researchers. It also provides a list of all cue words with their five most frequent associates and some summary statistics.


Introduction
Eliciting lexical associations by giving participants a free association test is a commonly used experimental technique. The respondents are asked to read (or listen to) a cue word and to produce the first response that comes to their minds. Depending on the purpose of the study and the expected outcome, the participants are asked to record either only their first response or more than one word. These associations can then be summarized in databases or dictionaries.
Responses to the cue words have been analysed extensively, and various potential classifications have been provided. Traditionally, the responses were classified into paradigmatic (when a cue and a response are related by their meaning, e.g. black -> white), syntagmatic (when a cue and a response form a phrase, e.g. black -> coffee), and clang associations (when the words are related by their phonetic form, e.g. black -> lack) (e.g. Meara, 2009). This classification has been criticised and updated (see Fitzpatrick, 2007), but the general idea of words being associated for different reasons (semantics, co-occurrence in language, or phonetic form) remains.
Vilkaitė-Lozdienė, L. 2019. First steps towards the Lithuanian word association database. Taikomoji kalbotyra, 12: 226-258, www.taikomojikalbotyra.lt.
[...] open science initiatives taking place at the moment, sharing such data could become a common practice.

Association norms in Lithuanian
For the Lithuanian language, there is one valuable resource: a dictionary of associations (Steponavičienė, 1986). Steponavičienė used 140 cue words, most of them taken and translated from an association dictionary of English, and administered a written form of association test to 1,000 respondents (all students in Lithuanian universities). One limitation of this dictionary, though, is that there is no digital version of the data available, which makes the use of it rather daunting.
Also, it has only 140 cue words, so it could definitely be expanded. A larger dataset of association norms for the Lithuanian language could have more potential applications for various experimental research projects or practical applications.
Apart from this dictionary, there are no published databases with a larger number of association norms. Researchers who have looked at associations analysed them on a smaller scale and mostly qualitatively, e.g. focusing on a single word and its associations, such as the word medis 'tree' (Papaurėlytė-Klovienė, 2011), on associations of animal names (Akelaitienė, 2007), or on associations produced by students of different age groups (Daukšytė, 2005).
This paper presents the first version of a Lithuanian database of association norms. The full database with all the additional information described in this paper is an open-access resource available to download from the platform Zenodo (doi: 10.5281/zenodo.3451880). The available data contain the database in the SQL format, as well as a summary spreadsheet for easier use. The details of its collection and design are presented in the following sections.

Stages of data collection
The data for the database were collected in 2018 in two stages. In both stages, the participants received a written list of cues and had to write down the first word that came to their minds after reading the cue.
The first data administration stage was intended to test whether the morphological form of the cue word would affect the responses in Lithuanian (see next section for more detail on the selection of the cue words). During the first administration, the participants (n = 64) completed a paper and pencil association test. They were also asked to give information about their age, gender, and their first language. The whole procedure took about 10-15 minutes. Three different populations were targeted: students of English Philology, students of Physics, and students of Life Sciences.
During the second administration, with a second set of stimuli, the data collection took place online in order to maximize the number of responses and to make the data collection more efficient. PsyToolkit (Stoet, 2017), an online platform designed for running psycholinguistic experiments, was used for creating this task. The questionnaire was sent to a group of students and also posted on a faculty's social network page, encouraging people to participate and to invite their friends.

Cue words in the database
Different sets of stimuli were used for the two administrations. The first administration was a paper and pencil test. For it, 24 frequent verbs and 24 frequent common nouns were selected from the dictionary of word frequencies (Utka, 2009).
As for the verbs, half of them were transitive (e.g. daryti 'to do', pradėti 'to start') and half were intransitive (e.g. gyventi 'to live', augti 'to grow'). They were also presented in three different forms: infinitive, third person present singular and third person past singular (e.g. daryti.INF, daro.PRS.3SG, darė.PST.3SG). Those 144 word forms were divided across three experimental lists so that each participant would see only one form of each target word. All experimental lists had three versions, each arranged in random order, to minimize any effect of presentation order on the responses.
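The division of word forms across the three experimental lists can be illustrated with a Latin-square rotation. This is only a sketch of one way to satisfy the stated constraint (each participant sees one form of each target word); the paper does not say how the lists were actually built. The word names and form labels below are placeholders, and the use of three seeded shuffles per list is an assumption.

```python
import random

# Placeholder labels: the verbs used infinitive, present and past forms;
# the exact forms of every item are not listed here, so generic labels are used.
FORMS = ("form_1", "form_2", "form_3")


def latin_square_lists(words, forms=FORMS):
    """Rotate forms across lists so every list holds exactly one form of each word."""
    n = len(forms)
    lists = [[] for _ in range(n)]
    for i, word in enumerate(words):
        for k in range(n):
            lists[k].append((word, forms[(i + k) % n]))
    return lists


def shuffled_versions(items, n_versions=3, seed=0):
    """Return independently shuffled copies of one experimental list."""
    rng = random.Random(seed)
    versions = []
    for _ in range(n_versions):
        version = list(items)
        rng.shuffle(version)
        versions.append(version)
    return versions


# Hypothetical stand-ins for the 24 verbs and 24 nouns.
words = [f"word_{i:02d}" for i in range(48)]
lists = latin_square_lists(words)
versions = [shuffled_versions(lst, seed=k) for k, lst in enumerate(lists)]
```

With 48 target words and three forms, each list contains 48 items, and the three lists together cover all 144 word forms exactly once.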
The second stage of data collection was based on stimuli that were later to be used in a language processing experiment. The stimuli included 152 frequent word forms in total: 70 verbs, 70 nouns and 12 adjectives. These stimuli were divided into three lists and presented in an individually randomized order for each participant.
All the cue words were limited to frequent words in Lithuanian. Previous research shows that word frequency affects responses to that word, with more frequent words leading to more paradigmatic responses (e.g. Cronin, 2002). However, starting with frequent words seemed to be a reasonable first step, since responses to frequent cue words also seem to be more homogeneous, at least for native speakers (Fitzpatrick, 2007).

Data cleaning
The paper and pencil data did not require much cleaning: the responses were simply recorded digitally, and a few spelling mistakes were corrected. The data collected online were much messier and required more changes. To start with, while incomplete surveys with just a few responses missing were retained, the surveys with only one or two responses provided were discarded, on the assumption that these participants might not have taken the task seriously. Spelling mistakes were also corrected.
English words written in English (such as the response flashdrive to the cue atminties 'memory') were kept as they were. Afterwards, the responses that were provided without Lithuanian diacritics were corrected. This was sometimes not a straightforward procedure, as in some cases there was no way to decide which form of the word the participant had in mind (e.g. problema.NOM.SG or problemą.ACC.SG). Only minimal changes were introduced in order for the word to be an existing grammatical form in Lithuanian. That is, *sprendima would be changed to sprendimą.ACC.SG, as the provided form sprendima can only be interpreted as the accusative singular form written without the diacritical mark. Conversely, problema would never be changed to problemą.ACC.SG, even if many other participants provided an accusative form to that specific cue, because both forms problema.NOM.SG and problemą.ACC.SG are possible in Lithuanian, and the researcher cannot predict which one the participant had in mind. This procedure allowed straightforward decisions to be made without any interpretation by the researcher. It has to be noted, though, that only very few participants provided answers without Lithuanian diacritics, and even for them, most of the answers were not ambiguous, so these problems were rather exceptional and did not affect the final database much. Data cleaning also included deleting longer comments provided by the participants (e.g. Nežinau, kodėl tai pirmas žodis, apie kurį pagalvojau 'I don't know why this was the first word I thought about'), though these were very rare. However, if more than one word was provided as an answer (usually a phrase, e.g. namų darbai 'homework'), all the items were kept in the dataset.
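The rule for restoring diacritics described above can be written down as a small decision procedure. The sketch below is an illustration only, not the procedure actually used for the database (the corrections appear to have been made by the researcher by hand). The lexicon is a hypothetical toy set of valid word forms, and the function handles lowercase input only.

```python
# Map each Lithuanian letter with a diacritic to its plain counterpart.
STRIP = str.maketrans("ąčęėįšųūž", "aceeisuuz")

# Hypothetical toy lexicon of existing Lithuanian word forms.
LEXICON = {"problema", "problemą", "sprendimas", "sprendimą"}


def strip_diacritics(word):
    """Return the word as it would look when typed without diacritics."""
    return word.translate(STRIP)


def restore(response, lexicon=LEXICON):
    """Restore diacritics only when the result is unambiguous."""
    # The typed form already exists, so it may be what was intended: keep it.
    if response in lexicon:
        return response
    # Collect every valid form that collapses to the typed string.
    candidates = {form for form in lexicon if strip_diacritics(form) == response}
    if len(candidates) == 1:
        return candidates.pop()
    # No candidate or several candidates: leave the response untouched.
    return response


print(restore("sprendima"))  # sprendimą
print(restore("problema"))   # problema
```

The first check is what keeps problema unchanged: because the typed form is itself a valid word form, the procedure never guesses that an accusative was intended.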
The answers provided by the participants were not lemmatized or grouped in any way: forms like darbą.ACC.SG, darbus.ACC.PL, or lavinti.INF, lavino.PST.3SG are presented as separate words. This is both potentially problematic and more informative. While, to the best of my knowledge, there has so far been no published research on the effect of the cue's morphological form on the response, some initial analysis seems to show that different morphological forms of the cue elicit different responses (Vilkaitė-Lozdienė, 2019). It seems that, especially for nouns, the morphological form of the cue influences the response, with accusative and genitive cases eliciting more syntagmatic responses than the nominative case. Considering that the morphology of the cue matters, the morphology of the answers provided can also be worth further research.
The final dataset presented in the database contains 277 cue word forms and the responses from 304 participants. The minimum number of participants who provided a response to a cue word was 18, while the maximum was 208. In total, this makes up 15,612 associations.

Participants
The characteristics of the respondents who took part in the different stages of the study are presented in Table 1. As Table 1 shows, the populations of the participants in the two administrations of the experiment were somewhat different. This is not unexpected: once the data are collected online, the researcher has little control over who answers the questionnaire. However, given that the data about the participants are available, everyone using the database can make their own decisions. Researchers can filter and select the data that are of interest to them, for example, associations provided by students of a particular course or by only one gender, or associations provided by bilingual or monolingual speakers.

The database
The whole database is created in the SQL format using the HeidiSQL interface and can be freely downloaded. It includes tables of Participants, Cues, and Responses. These tables can be linked and queried together in order to access the data of interest to any researcher.
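As an illustration of linking and querying the tables, the sketch below joins tables of Participants, Cues, and Responses (the three tables named in the paper's concluding summary) and counts responses per cue for native speakers only. All column names, the schema, and the rows are hypothetical placeholders, not the actual structure or contents of the published database, and an in-memory SQLite database stands in for the SQL dump.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: the real column names may differ.
conn.executescript("""
CREATE TABLE participants (id INTEGER PRIMARY KEY, gender TEXT, first_language TEXT);
CREATE TABLE cues (id INTEGER PRIMARY KEY, cue TEXT);
CREATE TABLE responses (
    id INTEGER PRIMARY KEY,
    participant_id INTEGER REFERENCES participants(id),
    cue_id INTEGER REFERENCES cues(id),
    response TEXT
);
""")

# Placeholder rows, invented purely to make the query runnable.
conn.executemany("INSERT INTO participants VALUES (?, ?, ?)",
                 [(1, "F", "Lithuanian"), (2, "M", "Lithuanian"), (3, "F", "Russian")])
conn.executemany("INSERT INTO cues VALUES (?, ?)", [(1, "cue_a"), (2, "cue_b")])
conn.executemany("INSERT INTO responses VALUES (?, ?, ?, ?)", [
    (1, 1, 1, "resp_x"), (2, 2, 1, "resp_x"), (3, 3, 1, "resp_y"),
    (4, 1, 2, "resp_z"), (5, 3, 2, "resp_z"),
])

# Link the three tables and keep only responses from one first-language group.
QUERY = """
SELECT c.cue, r.response, COUNT(*) AS freq
FROM responses AS r
JOIN participants AS p ON p.id = r.participant_id
JOIN cues AS c ON c.id = r.cue_id
WHERE p.first_language = ?
GROUP BY c.cue, r.response
ORDER BY c.cue, freq DESC, r.response
"""
rows = conn.execute(QUERY, ("Lithuanian",)).fetchall()
for cue, response, freq in rows:
    print(cue, response, freq)
```

Changing the WHERE clause (for example to filter on gender) gives the kind of participant-based selection the database is meant to support.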
However, the SQL format is not very user friendly, especially for novice researchers, so the downloadable material also includes a summary spreadsheet. It summarizes all the responses given to cue words by the native speakers of Lithuanian only. As such, it is easily accessible to potential users, including students who want to use association norms for their experiments.

Summary data
While the full dataset is available online in the SQL format, this paper presents a brief summary of the association norms in the Appendix. This summary only includes the responses of the participants who indicated Lithuanian as their first language (as these data can be used as a reference for native speaker norms). This means it includes 278 respondents and, in total, 14,336 association pairs. The Appendix presents all the cue words of the database together with some characteristics of their associative behaviour (following Barrón-Martínez and Arias-Trejo, 2014; Comesaña et al., 2014) and the five most frequent associations provided for each of the cue words.
First five most frequent associates (First 5): the first five most frequent associates together with their frequencies. If one or more of these words had a frequency of 1, they were taken in alphabetical order from all the associations with a frequency of 1.
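The selection of the five most frequent associates for a cue can be sketched as a sort by descending frequency with an alphabetical tie-break. The paper states the alphabetical rule only for associates with a frequency of 1; applying it to all ties is an assumption made here. The response strings are placeholders, and Python's default string ordering is by code point, so a real implementation would need Lithuanian collation for letters with diacritics.

```python
from collections import Counter


def top_five(responses):
    """Return up to five (associate, frequency) pairs for one cue.

    Sorted by descending frequency; ties are broken alphabetically
    (an assumption beyond the frequency-1 case stated in the paper).
    """
    counts = Counter(responses)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:5]


# Placeholder responses to a single hypothetical cue.
responses = ["b", "a", "c", "a", "e", "d", "f", "a", "b"]
print(top_five(responses))
# [('a', 3), ('b', 2), ('c', 1), ('d', 1), ('e', 1)]
```

Here four associates share a frequency of 1, and only the alphabetically first three of them make it into the list.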

Discussion
While this database is for now rather limited, it is still the largest set of association norms freely available online for the Lithuanian language. It has a number of potential applications for research. First of all, in psycholinguistics, the priming effect has been established for years, but in order to explore it, lexical association lists are needed. In Lithuanian, the priming effect has not been studied at all so far. Running simple reaction time experiments to test for priming effects could be an easy and attractive task for MA or even BA projects. It could help students learn the basic techniques of psycholinguistics and extend our knowledge about priming effects.
Further studies could also look at word associations more generally by examining, for instance, the differences between word associations in L1 and L2 (these have been researched extensively in English: Fitzpatrick, 2006, 2007; Meara, 1983; in Lithuanian there has been only a BA thesis by Župerkaitė (2018)), the effect of the morphological form of the cue (Vilkaitė-Lozdienė, 2019), and the effect of the gender or age of the participant on the word associations provided. Word association is a great way to gain insight into one's mental lexicon, and this leads to numerous research questions.
While using association norms that are a couple of decades old is a common practice, as at least some of the associations tend to remain rather stable, arguably, associations do change over time.
For example, the already mentioned response flashdrive to the stimulus atminties 'memory' would not have been given 20 years ago. Also, the associations of the cue word prezidentas 'president' are obviously affected by the current political situation. Because of the potential changes in associations, it could be interesting to look at associations diachronically as well. Thus, in the future, it would be worthwhile to collect new associations for the same cue words used by Steponavičienė (1986) and add them to the database. Research on English data seems to suggest that word associations do change over time, but the most frequent words are the ones that are the most resistant to change (Jenkins and Palermo, 1965).
However, the main aim of this database is not the study of word associations in themselves, but rather the possibility to control for priming effects in any other psycholinguistic experiment one might want to run. Admittedly, for this purpose, the larger the database, the more useful it is, and for now its use is limited. While the number of responses to each cue word is adequate, the number of cue words could definitely be enlarged. The intention is to add cue words to the present version of the database over time to reach a comprehensive set of association norms, comparable to the ones existing for English. Another aim is to enlarge this database while keeping track of the various stages of data collection and basic information about the participants contributing the associations, so that these data could be used for various research projects. To this end, researchers who have association data available and want to make them public by contributing to this database are welcome to contact the author.
Word associations for this database were collected using a simple free association experiment technique: the participants were given cue words and asked to write down the first word that came to their minds. The data were collected in two stages: 64 participants took a paper and pencil test, while the others (n = 240) completed a survey online. The responses were cleaned for mistakes, longer comments, and non-Lithuanian spelling, but they were not lemmatized and are presented in their original form in the database. The data available to download include tables of Participants, Cues, and Responses needed to use the database in the SQL format, as well as a summary spreadsheet for easier use by researchers not familiar with databases. The Appendix of the paper provides a summary of the main association pairs in the database together with some descriptive statistics.

This is a first step towards a more comprehensive association norms database for the Lithuanian language. In the future, this database will be enlarged with more responses and more cue words.
Any collaborations from researchers who are willing to share their data are very welcome.

Appendix: List of Abbreviations