KERNEL DENSITY ESTIMATORS FOR GAUSSIAN MIXTURE MODELS

The problem of nonparametric estimation of a probability density function is considered. The performance of kernel estimators based on several common kernels and a new kernel K (see (14)), with both fixed and adaptive smoothing bandwidths, is compared in terms of the symmetric mean absolute percentage error using the Monte Carlo method. The kernel K is everywhere positive but has lighter tails than the Gaussian density. Gaussian mixture models from the collection introduced by Marron and Wand (1992) are taken for the Monte Carlo simulations. The adaptive kernel method outperforms smoothing with a fixed bandwidth in the majority of models. The kernel K shows better performance for Gaussian mixtures with considerably overlapping components and multiple peaks (the double claw distribution).


Introduction
Nowadays there are many methods for estimating a probability density function; however, one may face problems when choosing an effective estimation procedure and applying it in practice if the density of the sampled data is multimodal and the sample size is small or moderate. Kernel smoothing is the most frequently used nonparametric estimation method (see [18], [20], [24]). Thus far, there is no generally accepted kernel estimation method which outperforms the others in all cases. Although many adaptive selection procedures have been proposed [2], [17], [28], their efficiency has not been well established yet, especially for samples of moderate size.
Applications of the Gaussian mixture model (GMM) are encountered in various fields. For example, in medical research, to classify limb movements from myoelectric signals, Yonghong, Englehart, Hudgins, and Chan [27] compared a GMM with three other classifiers: linear discriminant analysis, a linear perceptron network, and a multilayer perceptron neural network; in their paper, the GMM-based classification showed very high accuracy. Henthorn, Benitez, Avram, Martinez, Llerena, Cobaleda, Krejcie, and Gibbons [14] analysed the ability to hydroxylate debrisoquine and to O-demethylate dextromethorphan (the degree of urinary metabolism) with the assistance of GMMs. Zimmerman, Povinelli, Johnson, and Ropella [29] applied the Gaussian mixture model to ECG analysis, using ST segments and T waves to detect ischemic disease. In the geological research of Stachniss, Plagemann, and Lilienthal [25], the Gaussian mixture model was used for the prediction of gas concentration based on drillings. Moreover, GMM approximation was used in ecological investigations to study soil contamination; this idea was implemented by Gruszczyński [10]. In an applied sociology paper, Chernova and Veloso [4] described interactive policy learning results using Gaussian mixture models as well.
In this paper, the Monte Carlo method was used to compare the performance of various kernel estimators with both fixed and adaptive smoothing parameters. The following four kernel functions were applied: 1) the Gaussian distribution density, 2) the Epanechnikov kernel (12), 3) the triweight kernel (the centred beta(4,4) distribution density, see (13)), and 4) a new kernel K, obtained as a simple transformation of the Gaussian distribution density (14). The kernel K is everywhere positive but has lighter tails than the Gaussian (see (14)). Gaussian mixture models from the collection introduced by Marron and Wand (1992) are taken for the Monte Carlo simulations. The accuracy of the estimators is measured by the symmetric mean absolute percentage error.
The paper consists of 5 sections. Section 2 reviews the kernel density estimators and kernel functions; Section 3 describes the simulation experiment; Section 4 contains the simulation results. The detailed results of the study can be found in the appendix and supplementary material. Concluding remarks are presented in Section 5.

Kernel estimator
A random variable X ∈ R satisfies a mixture model if its distribution density function f(x) is given by the equality

f(x) = Σ_{k=1}^{q} p_k f_k(x).   (1)

The parameter q is the number of components in the mixture. The component weights p_k are called a priori probabilities and satisfy the conditions

p_k > 0, k = 1, …, q,   Σ_{k=1}^{q} p_k = 1.   (2)

The function f_k(x) is the distribution density function of the k-th component, and θ is the vector of parameters of the mixture model (1). Suppose a simple sample X = (X(1), …, X(n)) of size n from X is given. The estimation of the distribution density of an observed random variable is one of the main statistical tasks.
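As a sketch of the mixture model (1), the density and a sampling routine can be written in Python with NumPy (function names are ours, chosen for illustration):

```python
import numpy as np

def mixture_pdf(x, weights, means, sds):
    """Gaussian mixture density (1): f(x) = sum_k p_k * N(x; m_k, s_k^2)."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for p, m, s in zip(weights, means, sds):
        total += p * np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return total

def mixture_sample(n, weights, means, sds, seed=None):
    """Draw n observations: pick a component by its a priori probability p_k,
    then draw from that Gaussian component."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(np.asarray(means)[ks], np.asarray(sds)[ks])
```

The weights must satisfy the conditions below (1): positive and summing to one, which `rng.choice` also requires of its `p` argument.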
The histogram is one of the simplest and oldest density estimators. As far as is known, data were first shown as a bar chart (without graphic visualization) in the mortality histogram constructed by John Graunt (1662) from the Bills of Mortality during the plague years [26]. The circumstantial evidence that Graunt actually invented the histogram while looking at these mortality data seems quite strong, although there is reason to infer that Galileo had used histogram-like diagrams earlier: Hald [11] recounts a portion of Galileo's Dialogo, published in 1632, in which Galileo summarized his observations on the star that appeared in 1572. The term "histogram" itself was first introduced by Karl Pearson [23] in 1891. For the approximation of the density f(x), the number of observations X(t) falling within a region Ω is counted and divided by n and the volume of Ω.
Shaping the histogram, every X(t) can be imagined as a separate bar with area 1/n. Then, logically, the midpoint of the bar can be moved to X(t), which yields the function

f̂(x) = (1/(nh)) Σ_{t=1}^{n} I(x ∈ C_h(X(t))),   (3)

where C_h(X(t)) is a bin with midpoint X(t) and width h, and I(·) is the indicator function. This statistic is often called the naive estimator. In fact, a smooth "bump", known as a kernel function, can be used instead of the indicator function. Thus, the fixed-width kernel density estimator (FKDE) f̂(x) with a kernel function K and a fixed (global) bandwidth h is derived:

f̂(x) = (1/(nh)) Σ_{t=1}^{n} K((x − X(t))/h).   (4)

It is one of the most widespread non-parametric density estimators in data analysis (see e.g. [5], [16], [22]). The kernel function K(u) should satisfy the condition

∫ K(u) du = 1.   (5)

Usually, but not always, K(u) is a symmetric probability density function [24].
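A minimal sketch of the fixed-width estimator just described, in Python with NumPy (names are ours; the Gaussian kernel is used for illustration):

```python
import numpy as np

def gauss(u):
    """Gaussian kernel: the standard normal density."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def fkde(x, data, h, kernel=gauss):
    """Fixed-width kernel density estimator: (1/(n h)) * sum_t K((x - X_t)/h)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    u = (x[:, None] - data[None, :]) / h   # pairwise scaled distances
    return kernel(u).sum(axis=1) / (len(data) * h)
```

Since each kernel term integrates to h and is weighted by 1/(nh), the estimate itself integrates to one whenever the kernel does.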
At first, the data are usually prescaled in order to avoid large differences in data spread. A natural approach [8] is first to standardize the data by a linear transformation yielding data with zero mean and unit variance; equation (4) is then applied to the standardized data. Let Z denote the standardized values of X:

Z(t) = (X(t) − X̄)/S, t = 1, …, n,   (6)

where X̄ is the empirical mean and S is the empirical standard deviation. Applying the FKDE method to the standardized data Z = (Z(1), …, Z(n)) yields the following estimator of the density function f(x):

f̂(x) = (1/(nhS)) Σ_{t=1}^{n} K( ((x − X̄)/S − Z(t)) / h ).   (7)

The optimal FKDE bandwidth h* is determined by minimizing the mean integrated squared error (MISE) [24]. For example, when both the kernel and the distribution of the observations are Gaussian with unit variance, h* can be found by Silverman's [24] equation:

h* = (4/(3n))^{1/5} ≈ 1.06 n^{−1/5}.   (8)

More advanced methods for determining the kernel bandwidth, such as the least-squares cross-validation method, are also available, with a corresponding increase in complexity and computational resources [3], [7], [13]. In practice, the bandwidth is often selected by trial. If the value of h is too small, the density estimator is irregular and has many modes; in the extreme case, the modes are located at each of the observations. Greater values of h produce a smoother estimator.
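The standardization step and Silverman's reference bandwidth can be sketched as follows (Python/NumPy; function names are ours, and the Gaussian kernel is assumed):

```python
import numpy as np

def gauss(u):
    """Gaussian kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def silverman_h(n):
    """Silverman's reference bandwidth for unit-variance data: (4/(3n))^(1/5)."""
    return (4.0 / (3.0 * n)) ** 0.2  # approximately 1.06 * n**(-1/5)

def fkde_standardized(x, data, h, kernel=gauss):
    """FKDE applied to standardized data; the 1/S factor maps the estimated
    density of Z back to the scale of the original X."""
    mean, sd = data.mean(), data.std(ddof=1)
    z = (data - mean) / sd
    u = ((np.atleast_1d(x) - mean) / sd)[:, None] - z[None, :]
    return kernel(u / h).sum(axis=1) / (len(data) * h * sd)
```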
Although FKDE is a widely used method of nonparametric density estimation, it has some practical flaws [24]; for example, it cannot deal satisfactorily with the tails of distributions without oversmoothing the main part of the density. A significant improvement over FKDE is the adaptive kernel density estimator (AKDE) [24]. The construction of AKDE resembles that of FKDE; however, the kernel bandwidth now varies with the argument x. The method includes two stages: computing an adaptive bandwidth and estimating the density by the kernel method. The algorithm is as follows:

Step 1. The elements of the sample X = (X(1), …, X(n)) are standardized, producing data Z = (Z(1), …, Z(n)) with sample mean 0 and sample variance 1.

Step 2. Calculate the pilot FKDE estimator f̃ by (7).

Step 3. Find the values of the local bandwidth factors

λ(t) = ( f̃(Z(t)) / g )^{−γ}, t = 1, …, n,   (9)

where g is the geometric mean of f̃(Z(1)), …, f̃(Z(n)), and γ is the sensitivity parameter, 0 ≤ γ ≤ 1.

Step 4. Obtain the adaptive kernel estimator

f̂(x) = (1/(nS)) Σ_{t=1}^{n} (1/(hλ(t))) K( ((x − X̄)/S − Z(t)) / (hλ(t)) ),   (10)

where h is the same global smoothing parameter as used in equation (4). The larger the γ, the more sensitive the local bandwidths are to the pilot estimate. The typical value is γ = 1/2 [24]; this recommended value is used in the further analysis.
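The four steps above can be sketched compactly (Python/NumPy; names are ours, and a Gaussian pilot kernel is assumed for illustration):

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def akde(x, data, h, kernel=gauss, gamma=0.5):
    """Adaptive KDE: standardize, pilot FKDE, local factors, final estimate."""
    n = len(data)
    mean, sd = data.mean(), data.std(ddof=1)
    z = (data - mean) / sd                                               # Step 1
    pilot = kernel((z[:, None] - z[None, :]) / h).sum(axis=1) / (n * h)  # Step 2
    g = np.exp(np.mean(np.log(pilot)))                # geometric mean of pilot
    lam = (pilot / g) ** (-gamma)                                        # Step 3
    xz = (np.atleast_1d(x) - mean) / sd
    u = (xz[:, None] - z[None, :]) / (h * lam[None, :])                  # Step 4
    return (kernel(u) / (h * lam[None, :])).sum(axis=1) / (n * sd)
```

Each summand still integrates to 1/n regardless of its local width, so the adaptive estimate remains a proper density.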
The comparative analysis of estimation accuracy was made for four different types of kernels.The first three kernels are classical, while the last one is new.
The Gaussian kernel coincides with the standard normal density φ(x) [9], [19]:

K(u) = φ(u) = (1/√(2π)) exp(−u²/2).   (11)

The Epanechnikov kernel is a second-order polynomial, corrected to satisfy the properties of a density function [6]:

K(u) = (3/4)(1 − u²) for |u| ≤ 1, and K(u) = 0 otherwise.   (12)

The triweight kernel, proposed by Tapia and Thompson [26], has better smoothness properties and finite support; its detailed investigation was done by Hall [12]:

K(u) = (35/32)(1 − u²)³ for |u| ≤ 1, and K(u) = 0 otherwise.   (13)

The kernel K, introduced by the authors of this article, is everywhere positive but has lighter tails than the Gaussian distribution density; it is obtained as a simple transformation of the Gaussian density and depends on parameters c and q (14). In the simulations, the chosen values of the parameters c and q were, respectively, 0.01, 0.1, 1 and 1/3, 1/2, 1. The first two values in the first collection and the last two values in the second produce worse accuracy results; therefore, only the results for c = 1 and q = 1/3 are reported here. Detailed results are presented in the supplementary material [21].
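The three classical kernels (11)-(13) can be coded directly (Python/NumPy; the authors' kernel K of (14) is not reproduced here, since its explicit form is given only in the original equation):

```python
import numpy as np

def gaussian(u):
    """Gaussian kernel (11): standard normal density."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def epanechnikov(u):
    """Epanechnikov kernel (12): quadratic on [-1, 1], zero outside."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def triweight(u):
    """Triweight kernel (13): centred beta(4,4) density on [-1, 1]."""
    return np.where(np.abs(u) <= 1, (35.0 / 32.0) * (1 - u**2) ** 3, 0.0)
```

All three integrate to one, as condition (5) requires; only the Gaussian has unbounded support.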

Monte Carlo experiment
A comprehensive simulation study was conducted to compare the kernel methods of statistical density estimation and the kernel functions described above. In the comparative analysis, Gaussian mixture models from a collection suggested by J. S. Marron and M. P. Wand [20] were used as the target densities. These fifteen densities have been carefully chosen because they thoroughly represent many different types of challenges to curve estimators [20]. The first five represent different types of problems that can arise for unimodal densities; the rest are multimodal. Densities 6 to 9 are mildly multimodal, and one might hope to estimate them fairly well with a data set of moderate size. The remaining densities are strongly multimodal and, for moderate sample sizes, it is difficult to recover even their shape; yet they are well worth studying, because the question of just how much of them can be recovered is an important one.
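As an illustration of a strongly multimodal target from this collection, the double claw density can be coded as below (Python/NumPy, using the parametrization given by Marron and Wand: two broad Gaussians plus seven narrow "claw" components; names are ours):

```python
import numpy as np

# Double claw: 0.49*N(-1,(2/3)^2) + 0.49*N(1,(2/3)^2)
#              + sum_{l=0}^{6} (1/350)*N((l-3)/2, (1/100)^2)
WEIGHTS = [0.49, 0.49] + [1.0 / 350.0] * 7
MEANS = [-1.0, 1.0] + [(l - 3.0) / 2.0 for l in range(7)]
SDS = [2.0 / 3.0] * 2 + [0.01] * 7

def double_claw_pdf(x):
    """Evaluate the double claw mixture density pointwise."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for p, m, s in zip(WEIGHTS, MEANS, SDS):
        out += p * np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return out
```

The seven needle components carry only 2% of the total mass, which is why their peaks are so hard to recover at small sample sizes.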
In the simulation study, samples of small and moderate sizes (16, 32, 64, 128, 256, 512, 1024, 2048) were used, with 500 replications generated in each case. The estimation accuracy is measured by the symmetric mean absolute percentage error:

SMAPE = (1/n) Σ_{t=1}^{n} |f̂(X(t)) − f(X(t))| / ( (f(X(t)) + f̂(X(t)))/2 ).

An additional bias is introduced when the kernel estimator is calculated at the sample values, i.e. for x = X(t). Therefore, to eliminate this bias, estimator (4) in the SMAPE formula is replaced by its leave-one-out version:

f̂_{−t}(X(t)) = (1/((n − 1)h)) Σ_{s ≠ t} K((X(t) − X(s))/h).   (15)
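A sketch of this accuracy measure (Python/NumPy; names are ours, the Gaussian kernel and the common half-sum form of the SMAPE denominator are assumptions, since the paper's exact formula is not reproduced here):

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def smape_loo(data, h, true_pdf, kernel=gauss):
    """SMAPE at the sample points, with X(t) left out of its own estimate."""
    n = len(data)
    u = (data[:, None] - data[None, :]) / h
    k = kernel(u)
    np.fill_diagonal(k, 0.0)               # exclude X(t) from f_hat(X(t))
    f_hat = k.sum(axis=1) / ((n - 1) * h)  # leave-one-out FKDE
    f = true_pdf(data)
    return np.mean(np.abs(f_hat - f) / ((f + f_hat) / 2.0))
```

With the half-sum denominator the SMAPE is bounded above by 2, which makes averages over replications comparable across models.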

Results of Monte Carlo simulations
In the diagrams, the errors obtained by the fixed-width kernel density estimator are represented by a solid line, and the errors obtained by the adaptive-width kernel density estimator by a dashed line. The symbols G, E, T, and K mark, respectively, the Gaussian, the Epanechnikov, the triweight, and the proposed kernel functions. Full comparative analysis results are given in the tables of the supplementary material [21], where the fixed kernel density estimator is denoted FK and the adaptive kernel density estimator AK. The results (mean and standard deviation) of the SMAPE for 500 Monte Carlo replications are presented in the supplementary material. All the kernel functions applied to the fixed-width and adaptive kernel density estimators are described in Section 2. In addition to the kernel functions mentioned, other functions were investigated as well: the Cauchy and Laplace distribution densities and the cosine kernel. These kernel functions yielded worse accuracy in this computer experiment; therefore, the respective results are not reported.
The outcomes of this research confirmed the advantage of AK over FK, which had been previously shown by other researchers [15]. The Epanechnikov kernel function suits symmetric distributions with one or several clearly expressed peaks. The triweight kernel function is most suitable for densities with several or many sharp, needle-like peaks. However, for the complex mixture models investigated in this study, these peaks show up only for samples of size n ≥ 128; therefore, the use of this kernel is not recommended for small sample sizes. The proposed kernel function K is statistically significantly better for distributions where several components have large variances and the components are strongly overlapping. Of the fifteen distributions offered by J. S. Marron and M. P. Wand [20] (see Section 3), better results using the K kernel were obtained for two of them for samples of size n > 16, and for four distributions when n < 256. As an example, the results for the double claw distribution are presented in Table 1. Decreasing the parameter c usually increases the SMAPE; increasing it, however, improves the SMAPE on average for skewed distributions and small sample sizes. Increasing the parameter q has a positive effect when estimating densities with sharp peaks, i.e. when the triweight kernel function is most suitable. The Gaussian kernel was the best choice only for very smooth skewed mixtures when the sample size was less than 32 observations.

Conclusion and future work
The comparative simulation analysis has shown that the new kernel K is preferable to the well-known classical kernels when the components of the estimated mixture are considerably overlapping. The research was conducted using Gaussian mixture models commonly used for data simulations by other researchers, as they typify many different types of challenges for density estimators. Future work can address the problem of how the local behaviour of the kernel in the neighbourhood of zero and in its tails influences the accuracy of kernel estimators for data distributions of various shapes at moderate sample sizes.

Fig. 1. Form of the kernel K function depending on its parameters

Table 1. Performance of density estimators for the double claw distribution