A comparative analysis of mathematical methods for homogeneity estimation of the Lithuanian population

2 Department of Information Systems, Faculty of Fundamentals Sciences, Vilnius Gediminas Technical University, Vilnius, Lithuania
Background. Population genetic structure is one of the most important population genetic parameters, revealing the demographic features of a population. The aim of this study was to evaluate the homogeneity of the Lithuanian population on the basis of genome-wide genotyping data. A comparative analysis of three methods for visualizing multidimensional genetic data (multidimensional scaling, principal components analysis, and principal coordinates analysis) was performed. The results of visualization (mapping images) are also presented.
Materials and methods. The data set consisted of 425 samples from six ethnolinguistic groups of the Lithuanian population. Genomic DNA was extracted from whole venous blood using either the phenol-chloroform extraction method or the automated DNA extraction platform TECAN Freedom EVO. Genotyping was performed at the Department of Human and Medical Genetics, Institute of Biomedical Sciences, Faculty of Medicine, Vilnius University, Lithuania, with the Illumina HumanOmniExpress-12 v1.1 and the Infinium OmniExpress-24 arrays. For the estimation of homogeneity of the Lithuanian population, a PLINK data file was obtained using the PLINK v1.07 program. The Past3 software was used to visualize the genotype data with the multidimensional scaling and principal coordinates methods. SmartPCA from the EIGENSOFT 7.2.1 package was used in the principal component analysis to determine the population structure.
Conclusions. The multidimensional scaling, principal coordinate, and principal component methods were investigated and compared for the genetic structure of the Lithuanian population. The principal coordinate and principal component methods can be used for genotyping data visualization, since no essential differences were observed between the results obtained with these two methods, in contrast to multidimensional scaling. The Lithuanian population is homogeneous, as the points lie very close together when the principal coordinates or principal component methods are used.


INTRODUCTION
Population genetic structure is one of the most important parameters in population research. Different genetic models based on genetic markers are used to evaluate the population structure. Appropriate mathematical methods have been developed for different genetic models to extract information from genetic marker data and to explore the population structure.
A large class of methods has been developed for multidimensional data visualization (1,2). Visual presentation makes it possible to see the data structure, clusters, outliers, and other properties of multidimensional data. Direct data visualization is a graphical presentation of a data set that provides a qualitative understanding of its information content in a natural and direct way.
There exist numerous methods that can be used for reducing the dimensionality and particularly for visualizing the n-dimensional data: principal component analysis (PCA) (3), multidimensional scaling (MDS) (4), locally linear embedding (LLE) (5), etc. These methods can be used to visualize the data set provided that a sufficiently small output dimensionality (d = 2, d = 3) is chosen.
In this study, methods of multidimensional scaling, principal coordinates, and principal components were used for detecting population genetic structure and potential outliers of the Lithuanian population. Our core task was to determine the accuracy of each method for genome-wide genotyping data visualization.

Samples and genotyping
The data set consisted of 425 samples from unrelated Lithuanian individuals. The samples were collected randomly from six ethnolinguistic groups of Lithuania: three groups of Aukštaičiai (from the region in the northeastern part of the country) and three groups of Žemaičiai (from the ethnic region in the northwestern part of Lithuania) (Table 1).
Genomic DNA was extracted from whole venous blood using either the phenol-chloroform extraction method or the automated DNA extraction platform TECAN Freedom EVO (TECAN Group Ltd., Männedorf, Switzerland), based on the paramagnetic particle method. DNA concentration and quality were measured with a NanoDrop® ND-1000 spectrophotometer (NanoDrop Technologies Inc., US).
Genotyping was performed at the Department of Human and Medical Genetics, Institute of Biomedical Sciences, Faculty of Medicine, Vilnius University, Lithuania, with the Illumina HumanOmniExpress-12 v1.1 (296 samples) and the Infinium OmniExpress-24 (129 samples) arrays (Illumina, San Diego, CA, USA), with an overlap of 707,138 SNPs distributed genome-wide. Quality control of the genotyping data was performed according to the manufacturer's standard recommendations. Individuals with a call rate <98% or a standard deviation (SD) of Log R ratio >0.3 were excluded from further analysis. The GenomeStudio v2011.1 program (Illumina, USA) was used to call the genotypes of the samples and to export the data in PED/MAP format.
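The sample-level exclusion rule can be illustrated with a short sketch. It assumes that either criterion alone is enough to exclude an individual; the record type `SampleQC`, the function `passes_qc`, and the three example samples are hypothetical and are not taken from the study data.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    sample_id: str    # hypothetical sample identifier
    call_rate: float  # fraction of SNPs successfully called (0..1)
    lrr_sd: float     # standard deviation of the Log R ratio


def passes_qc(sample, min_call_rate=0.98, max_lrr_sd=0.3):
    """Keep a sample unless its call rate is <98% or its Log R ratio SD is >0.3."""
    return sample.call_rate >= min_call_rate and sample.lrr_sd <= max_lrr_sd


# Illustrative values only, not study data.
samples = [
    SampleQC("S1", 0.995, 0.12),
    SampleQC("S2", 0.970, 0.10),  # excluded: call rate below 98%
    SampleQC("S3", 0.990, 0.35),  # excluded: Log R ratio SD above 0.3
]
kept = [s.sample_id for s in samples if passes_qc(s)]
```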
For the estimation of homogeneity of the Lithuanian population, a PLINK data file (binary format) was obtained using the PLINK v1.07 program (3). Individuals or SNPs with >10% missing data, minor allele frequency (MAF) <0.01, and Hardy-Weinberg equilibrium (HWE) test P-value of less than $10^{-4}$ were excluded. SNPs in linkage disequilibrium were removed with the --indep-pairwise option of PLINK v1.07 using a window size of 50 SNPs, a step size of 5, and an $r^2$ threshold of 0.5.
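As a rough illustration, the filtering thresholds given above could be passed to PLINK as in the following sketch, which only builds the argument lists and does not run PLINK. The file prefixes and the helper name `plink_qc_commands` are hypothetical, and applying all the filters in one PLINK run is an assumption; the text does not say how the steps were actually sequenced.

```python
def plink_qc_commands(in_prefix, out_prefix):
    """Build PLINK argument lists for the QC and LD-pruning steps (sketch only)."""
    # Step 1: filter individuals/SNPs and write a binary (BED/BIM/FAM) file set.
    qc = [
        "plink", "--file", in_prefix,
        "--mind", "0.1",     # drop individuals with >10% missing genotypes
        "--geno", "0.1",     # drop SNPs with >10% missing genotypes
        "--maf", "0.01",     # drop SNPs with minor allele frequency <0.01
        "--hwe", "0.0001",   # drop SNPs with HWE test P-value <1e-4
        "--make-bed", "--out", out_prefix,
    ]
    # Step 2: list SNPs to keep after LD pruning (window 50, step 5, r^2 0.5).
    prune = [
        "plink", "--bfile", out_prefix,
        "--indep-pairwise", "50", "5", "0.5",
        "--out", out_prefix + "_prune",
    ]
    # Step 3: keep only the SNPs listed in the .prune.in file from step 2.
    extract = [
        "plink", "--bfile", out_prefix,
        "--extract", out_prefix + "_prune.prune.in",
        "--make-bed", "--out", out_prefix + "_pruned",
    ]
    return [qc, prune, extract]


# Hypothetical prefixes; each list could be passed to subprocess.run.
commands = plink_qc_commands("lithuania", "lithuania_qc")
```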

VISUALIZATION METHODS
In this paper, we performed an analytic investigation of the multidimensional scaling, principal components, and principal coordinates methods, which are used for multidimensional data visualization. Suppose we have a data set of $n$-dimensional vectors $X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, $i = 1, \ldots, m$, which the visualization methods map to $d$-dimensional vectors $Y_i$, $d < n$. If a sufficiently small output dimensionality d = 2 or d = 3 is chosen, the two- or three-dimensional vectors obtained may be presented in a scatter plot.
The Past3 program was used to visualize the genotype data with multidimensional scaling and principal coordinates methods. The SmartPCA from EIGENSOFT 7.2.1 is one of the basic programs used in the principal component analysis to determine homogeneous or heterogeneous population structure. The method was developed for the samples not related to the population structure.

Multidimensional scaling
Multidimensional scaling (MDS) refers to a group of methods that are widely used for dimensionality reduction and visualization of multidimensional data (4). The starting point of MDS is a matrix consisting of pairwise proximities of the data. The proximities are similarities or dissimilarities. The main goal of multidimensional scaling is to find lower-dimensional data $Y_i$, $i = 1, \ldots, m$, such that the distances between the data in the lower-dimensional space are as close to the original distances (or other proximities) as possible (4). The stress function $E_{MDS}$ must be minimized.
The multidimensional scaling method is based on a distance matrix computed with distance measures. The results of MDS depend on the initial values of the two-dimensional vectors if the MDS stress is minimized in an iterative way.
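The iterative minimization can be sketched as follows. Because the exact form of $E_{MDS}$ is not reproduced here, the sketch assumes the raw stress, i.e. the sum over all pairs of squared differences between the original and the low-dimensional Euclidean distances, and minimizes it by plain gradient descent. The function names, step size, and iteration count are illustrative assumptions; this is not the Past3 implementation.

```python
import numpy as np


def pairwise_euclidean(P):
    """Matrix of Euclidean distances between the rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def raw_stress(D, Y):
    """Assumed stress: sum over pairs i<j of (D_ij - d(Y_i, Y_j))^2."""
    d = pairwise_euclidean(Y)
    iu = np.triu_indices(len(Y), k=1)
    return float(((D[iu] - d[iu]) ** 2).sum())


def mds_gradient_descent(D, Y0, n_iter=300, lr=0.01):
    """Minimize the raw stress iteratively, starting from the configuration Y0."""
    Y = Y0.copy()
    for _ in range(n_iter):
        d = pairwise_euclidean(Y)
        np.fill_diagonal(d, 1.0)        # avoid division by zero on the diagonal
        coef = (d - D) / d
        np.fill_diagonal(coef, 0.0)     # a point exerts no force on itself
        diff = Y[:, None, :] - Y[None, :, :]
        grad = 2.0 * (coef[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y
```

Because the result is obtained from an explicit starting configuration `Y0`, different starting values may lead to different local minima, which is the dependence on initial values noted above.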

Principal coordinates analysis
Principal coordinates analysis (PCO), also known as metric multidimensional scaling, is another method. The algorithm is taken from Davis (1986). The main idea of this method is to find the eigenvalues and eigenvectors of a matrix containing the distances or similarities between all data points. The eigenvalues, which measure the variance accounted for by the corresponding eigenvectors (coordinates), are given for the four most important coordinates (or fewer if there are fewer than four data points). The percentages of variance accounted for by these coordinates are also given (6).
Before eigenanalysis, the similarity or distance index values can be raised to the power c, the "transformation exponent", which can be 1, 2, 4, or 6 (7). We used principal coordinates analysis with the standard value c = 2.
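A minimal sketch of these steps is given below. It assumes the classical formulation, in which the distances are raised to the power c, multiplied by -1/2, and double-centred before the eigenanalysis, and it assumes that the percentages are computed relative to the sum of the positive eigenvalues. Neither detail is confirmed for the software used here, and the function name `principal_coordinates` is hypothetical.

```python
import numpy as np


def principal_coordinates(D, c=2, n_axes=2):
    """Principal coordinates from a distance matrix D (classical formulation)."""
    m = D.shape[0]
    A = -0.5 * D ** c                       # transformation exponent c
    J = np.eye(m) - np.ones((m, m)) / m     # centring matrix
    B = J @ A @ J                           # double-centred matrix
    evals, evecs = np.linalg.eigh(B)        # B is symmetric
    order = np.argsort(evals)[::-1]         # largest eigenvalue first
    evals, evecs = evals[order], evecs[:, order]
    scale = np.sqrt(np.clip(evals[:n_axes], 0.0, None))
    coords = evecs[:, :n_axes] * scale      # coordinates of the data points
    percent = 100.0 * evals[:n_axes] / evals[evals > 0].sum()
    return coords, evals[:n_axes], percent
```

With Euclidean input distances and c = 2, the coordinates returned by this sketch reproduce the original pairwise distances whenever enough axes are kept.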

Principal components
Principal components analysis (PCA) is one of the most powerful and popular statistical linear projection methods. Linear transformation is widely used for dimensionality reduction, feature extraction, and visualization of multidimensional data. The main goal of this method is to find the direction with the largest variance. The input data is a matrix of multivariate data, with items in rows and variates in columns. The eigenvectors and their eigenvalues were calculated by the singular value decomposition (SVD) algorithm.
Principal component analysis is a tool widely used in genomics and statistical genetics, employed to infer cryptic population structure from genome-wide data such as single nucleotide polymorphisms (SNPs) (8), and/or to identify outlier individuals which may need to be removed prior to further analyses, such as genome-wide association studies (GWAS) (9).
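The SVD-based computation can be sketched as follows for a generic data matrix with items in rows and variates in columns. This is a plain textbook PCA on column-centred data, given as an assumption-laden illustration; it does not attempt to reproduce the genotype normalization or outlier handling of SmartPCA, and the function name `pca_svd` is hypothetical.

```python
import numpy as np


def pca_svd(X, n_components=2):
    """PCA of X (items in rows, variates in columns) via singular value decomposition."""
    Xc = X - X.mean(axis=0)                          # centre each variate
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # item coordinates on the components
    eigenvalues = S ** 2 / (X.shape[0] - 1)          # variance along each component
    percent = 100.0 * eigenvalues / eigenvalues.sum()
    return scores, Vt[:n_components], percent[:n_components]
```

The first two columns of `scores` are what would be drawn in a two-dimensional scatter plot, and `percent` gives the share of the total variance explained by each plotted component.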
In this paper, we analysed only the Euclidean and Gower similarity and distance indices in greater detail, since these gave the best results. The basic Euclidean distance is the straight-line distance between two points. Gower is a distance measure between two vectors $X_k = (x_{k1}, x_{k2}, \ldots, x_{kn})$ and $X_l = (x_{l1}, x_{l2}, \ldots, x_{ln})$ that averages the difference over all variables, each term normalized by the range of that variable. The Gower measure is similar to the Manhattan distance but with range normalization (6).
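The two indices can be written out as a short sketch. The Gower form used here, the mean over variables of the absolute difference divided by the range of that variable across the whole data set, follows the verbal description above and is an assumption about the exact formula. The function names and the small example matrix are hypothetical.

```python
import numpy as np


def euclidean_distance(xk, xl):
    """Straight-line distance between two vectors."""
    return float(np.sqrt(((xk - xl) ** 2).sum()))


def gower_distance_matrix(X):
    """Pairwise Gower distances between the rows of X (assumed form)."""
    value_range = X.max(axis=0) - X.min(axis=0)
    # A constant variable contributes zero difference; avoid dividing by zero.
    value_range = np.where(value_range == 0, 1.0, value_range)
    diff = np.abs(X[:, None, :] - X[None, :, :]) / value_range
    return diff.mean(axis=-1)               # average over all variables


# Illustrative data: three items, two variables with ranges 2 and 4.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
G = gower_distance_matrix(X)
```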

RESULTS
The aim of this study was to identify the most suitable method for inferring the Lithuanian population structure from genotype data: multidimensional scaling, principal coordinates, or principal components.
The genotype data of the six ethnolinguistic groups of Lithuania visualized by multidimensional scaling are presented in Fig. 1, by principal coordinates analysis in Fig. 2, and by principal components in Fig. 3.
In order to estimate the quality of mapping, the stress function was calculated.
The results show that, for the multidimensional scaling method, the stress function values were smallest and approximately equal when the Euclidean distance (Stress = 1.371) and Gower (Stress = 1.372) similarity indices were used (Fig. 1). All other similarity indices gave larger stress function values.
The results of visualization obtained by the principal coordinates are presented in Fig. 2. Coordinate 1 explained 0.57% and Coordinate 2 0.55% of the genetic variation among the studied samples (424) of the Lithuanian population when the similarity index was the Euclidean distance, and Coordinate 1 explained 0.77% and Coordinate 2 0.76% of the genetic variation when the similarity index was Gower.
The results of the data analysis show that Coordinate 1 explained 11.25% and Coordinate 2 11.11% when the principal components method was used with the Euclidean distance similarity index (Fig. 3).
The investigation results show that the principal components method is more suitable for analyzing the population genetic structure than the methods of multidimensional scaling or principal coordinates. On the other hand, the principal coordinates method is more suitable than multidimensional scaling (Fig. 1 and Fig. 2).
It can be seen that, when the similarity index is the Euclidean distance, the points obtained by the principal coordinate and principal component methods are clustered very strongly, whereas the points obtained by the multidimensional scaling method are dispersed, which makes it difficult to evaluate the population structure. It is evident that the outliers are more visible when we use the principal coordinates and principal components methods, provided that the similarity index remains the same, i.e., the Euclidean distance. Figure 3 shows that all six ethnolinguistic groups form one general cluster, and therefore we can conclude that the Lithuanian population is homogeneous.

CONCLUSIONS
In this paper, the multidimensional scaling, principal coordinate, and principal component methods have been investigated and compared for the genetic structure of the Lithuanian population. We conclude that the principal coordinate and principal component methods can be used for genotyping data visualization, since no essential differences were observed between the results obtained with these two methods, in contrast to multidimensional scaling. The results show that the Lithuanian population is homogeneous, as the points are clustered very strongly when the principal coordinates or principal component methods are used.