The article aims to introduce to the academic community the corpus of broadcast media (radio and television) compiled within the framework of the project Lithuanian Language: Ideals, Ideologies and Identity Shifts. The corpus includes about 63 hours of transcribed recordings from 1960 to 2010. The article discusses the theoretical principles of corpus sampling, which were based on the criteria of time period and genre, and the methodological issues encountered when constructing the sampling scheme. It also shares the practical experience of selecting and gathering recordings for inclusion in the corpus and presents the actual structure of the corpus.
Two of the main requirements for any corpus are representativeness and balance. One possible way to achieve them is to distinguish objectively defined text types and to build the corpus along these lines. The composition of the broadcast media corpus was therefore designed on the basis of two criteria: the period of broadcast media development and the genre. Three periods are distinguished: Soviet (1960–1987), transitional (1988–1992) and contemporary (1993 onwards). With respect to genre, three groups of programs are included: talk programs (further subdivided into interview, debate and talk-show types); documentaries, features and journal programs; and information programs. The article discusses the problems encountered when trying to build the corpus along these lines: the limited availability of materials, caused by the technological peculiarities of different periods and by organisational factors in archive institutions; the issue of balance between the periods; and the problems of genre comparability, of the differing degree of genre diversity in different periods, and of genre continuity.
Finally, the composition and the size of the corpus are presented (63 hours of recordings, about 350 thousand words). The paper concludes that, although the limited availability of materials and the other problems discussed above mean that the corpus cannot be regarded as perfectly representative and balanced, it is sufficient for research into public language change. This was confirmed by tentative research studies carried out on its basis. The corpus meets the usual technical requirements: the transcriptions have been made in the CLAN software developed within the CHILDES project; the recordings have been transcribed, coded and morphologically annotated following the conventions of the CHILDES project; the speakers have been assigned individual codes; and the transcriptions have been linked to the sound/image files.
This work is licensed under a Creative Commons Attribution 4.0 International License.