Efficient algorithm for testing goodness-of-fit for classification of high-dimensional data

Let us have a sample satisfying a d-dimensional Gaussian mixture model (d is supposed to be large). The problem of classification of the sample is considered. Because of the large dimension it is natural to project the sample to k-dimensional (k = 1, 2, . . .) linear subspaces using the projection pursuit method, which gives the best selection of these subspaces. Having an estimate of the discriminant subspace, we can perform classification using the projected sample, thus avoiding the 'curse of dimensionality'. An essential step in this method is testing goodness-of-fit of the estimated d-dimensional model under the assumption that the distribution on the complementary space is standard Gaussian. We present a simple, data-driven and computationally efficient procedure for testing goodness-of-fit. The procedure is based on the well-known interpretation of testing goodness-of-fit as a classification problem, a special sequential data partition procedure, randomization and resampling, and elements of sequential testing. Monte Carlo simulations are used to assess the performance of the procedure.


Introduction
Let X = X_N be a sample of size N satisfying a d-dimensional Gaussian mixture model (we assume that d is large) with distribution function (d.f.) F.
Because of the high dimension of the space considered, it is natural to project the sample X to linear subspaces of dimension k (k = 1, 2, . . .) using the projection pursuit method. If the distribution of the standardized projected sample on the complementary space is standard Gaussian, this linear subspace H is called the discriminant subspace. For example, if we have q Gaussian mixture components with equal covariance matrices, then the dimension of the discriminant subspace equals q − 1.
Having an estimate of the discriminant subspace, it is easier to perform the classification using the projected sample.
The step-by-step procedure applied to the standardized sample is the following (here k = 1, 2, . . . , d, until the hypothesis of standard Gaussian distribution on the complementary space holds for some k):
1. Find the best linear subspace of dimension k using the projection pursuit method (see, e.g., [4]).
2. Estimate the parameters of the Gaussian mixture (see, e.g., [3]) from the sample projected to the linear subspace of dimension k.
3. Test goodness-of-fit of the estimated model in the d-dimensional space assuming that the distribution on the complementary space is standard Gaussian. If the test fails, increase k and go to Step 1.
The problems related to Steps 1 and 2 are considered in the above-mentioned papers and in their references. If we use common methods in Step 3, the problem is the comparison of a non-parametric density estimate with a parametric density estimate in a high-dimensional space. Problems related to high-dimensional data are often referred to as the 'curse of dimensionality' (see, e.g., [1]). As an alternative approach we use the Monte Carlo method and a special sequential data partition procedure. More precisely, we resample the given sample assuming that the distribution on the complementary space is standard Gaussian. For the test statistic we use the joint sample and count the number of data points from the initial and resampled samples in each partition element. The test statistic is selected in such a way that if the hypothesis holds, its distribution depends only weakly on the dimension d and on the distribution in the linear subspace. The test criterion is obtained by simulating a sufficiently large number (e.g., 100 or 1000) of independent resampled samples for which the hypothesis holds and comparing the test criterion value with a predefined level.
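The resampling step above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `resample_under_null` and the argument `B` (an orthonormal basis of the estimated discriminant subspace) are assumptions. The projection of each data point onto the discriminant subspace is kept, while the component in the complementary space is replaced by fresh standard Gaussian noise, so the resampled sample satisfies the null hypothesis by construction.

```python
import numpy as np

def resample_under_null(X, B, rng):
    """Draw one resampled sample under the null hypothesis.

    X   : (N, d) standardized sample.
    B   : (d, k) orthonormal basis of the estimated discriminant
          subspace (hypothetical name for this sketch).
    rng : numpy random Generator.
    """
    N, d = X.shape
    proj = X @ B @ B.T                  # component inside the discriminant subspace
    noise = rng.standard_normal((N, d))
    compl = noise - noise @ B @ B.T     # standard Gaussian on the complementary space
    return proj + compl
```

Since a standard Gaussian vector projected onto the complementary subspace is again standard Gaussian there, the resampled points agree with X on the discriminant subspace and satisfy the null on its complement.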
The efficiency of the algorithm is based on the weak dependence of the test criterion on the dimension d and on the distribution in the linear subspace. Computational efficiency is based on the very efficient dyadic data partition procedure and the very simple computation of the test statistic.
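A sequential dyadic coordinate-wise partition of the kind used here can be sketched as below. The sketch is an assumption about the details the text leaves open: at each step every current cell is an axis-aligned box, and we perform the midpoint split (over all cells and coordinates) that most reduces the within-cell sum of squared deviations of the population indicators Z, a mean-square-error criterion.

```python
import numpy as np

def dyadic_partition(Y, Z, K):
    """Sequential dyadic coordinate-wise partition (illustrative sketch).

    Y : (n, d) joint sample;  Z : (n,) 0/1 population indicators.
    Returns partitions[0..K], where partitions[k] is a list of index
    arrays giving the cells of P_k.
    """
    lo = Y.min(axis=0) - 1e-9
    hi = Y.max(axis=0) + 1e-9
    cells = [(np.arange(len(Y)), lo.copy(), hi.copy())]   # (indices, box lo, box hi)
    partitions = [[c[0] for c in cells]]

    def sse(idx):
        # within-cell sum of squared deviations of Z
        if len(idx) == 0:
            return 0.0
        z = Z[idx]
        return float(((z - z.mean()) ** 2).sum())

    for _ in range(K):
        best = None
        for ci, (idx, clo, chi) in enumerate(cells):
            if len(idx) < 2:
                continue
            for j in range(Y.shape[1]):
                mid = 0.5 * (clo[j] + chi[j])             # dyadic midpoint split
                left = idx[Y[idx, j] <= mid]
                right = idx[Y[idx, j] > mid]
                gain = sse(idx) - sse(left) - sse(right)  # MSE reduction
                if best is None or gain > best[0]:
                    best = (gain, ci, j, mid, left, right)
        if best is None:
            break
        _, ci, j, mid, left, right = best
        idx, clo, chi = cells.pop(ci)
        hi_l = chi.copy(); hi_l[j] = mid
        lo_r = clo.copy(); lo_r[j] = mid
        cells.append((left, clo, hi_l))
        cells.append((right, lo_r, chi))
        partitions.append([c[0] for c in cells])
    return partitions
```

Because each split is at a box midpoint along one coordinate, cell membership is decided by simple comparisons, which is what makes the partition (and the resulting cell counts) cheap to compute.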
We will present some computer simulation results. This approach can be used in other situations, e.g., for testing independence of high-dimensional random vectors (see [2]).

Test criterion
We test the hypothesis that the d.f. of the sample X is F_H, the d.f. of the estimated model with standard Gaussian distribution on the complementary space.
Here f and f_H denote the distribution densities of F and F_H, respectively. Let X_H be a sample of size M of i.i.d. vectors with d.f. F_H, independent of X. The joint sample is denoted by Y, and Z_j, j = 1, 2, . . . , N + M, is the corresponding sequence of indicators of the population of origin. Let P = {P_k, k = 0, 1, . . . , K}, P_0 = R^d, be a sequence of partitions of R^d, possibly dependent on Y, and let A_k, k = 0, 1, . . . , K, be the corresponding sequence of σ-algebras generated by these partitions. A computationally efficient choice of P is the sequential dyadic coordinate-wise partition minimizing at each step the mean square error in the partition sets. The natural choice of the test statistic would be a χ²-type statistic, where Ê stands for the expectation with respect to the empirical distribution F̂ of Y and Ẑ_k = Ê(Z | A_k), k ∈ {1, 2, . . . , K}.
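The conditional expectation Ẑ_k = Ê(Z | A_k) has a concrete form: within each cell of partition P_k it is the empirical mean of Z, i.e., the fraction of points of the joint sample that come from X. A minimal sketch (the function name and the cell representation as index arrays are assumptions):

```python
import numpy as np

def z_hat(cells, Z):
    """Empirical conditional expectation E(Z | A_k).

    cells : list of index arrays, the cells of partition P_k.
    Z     : (n,) 0/1 indicators of the population of origin.
    Within each cell, every point gets the cell's mean of Z.
    """
    out = np.empty(len(Z))
    for idx in cells:
        out[idx] = Z[idx].mean()
    return out
```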

Computer simulation results
For the computer simulation we selected M = N, and the test statistic in explicit form is given by the formula

T_k = Σ_j (n_{j,k} − m_{j,k})², k = 1, 2, . . . , K,

where n_{j,k} and m_{j,k} are the numbers of elements of sample X and sample X_H, respectively, in the j-th element of partition P_k. We assumed that the discriminant subspace is known exactly (no errors in finding the best linear subspace). We performed simulations with 100 independent realizations. We obtained the maximum and minimum values of the test statistics over the corresponding joint realizations. We also obtained the minimum and maximum values of the test statistics excluding the 5 per cent highest and 5 per cent lowest values. Dimensions up to 100, typically 10, were considered. The dimension of the discriminant subspace was chosen in the range 1–4 (this dimension depends on the number of mixture components and their parameters), and the corresponding range of dimensions of the linear subspaces was considered.
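Given the partition cells as index arrays over the joint sample, the explicit statistic reduces to per-cell counting. A sketch under that assumption (the name `chi2_type_statistics` is ours):

```python
import numpy as np

def chi2_type_statistics(partitions, Z):
    """T_k = sum_j (n_{j,k} - m_{j,k})^2 for each partition P_k, k >= 1.

    partitions[k] : list of index arrays, the cells of P_k over the
                    joint sample Y.
    Z[i] = 1 if Y_i comes from X, 0 if it comes from X_H.
    """
    stats = []
    for cells in partitions[1:]:      # skip the trivial partition P_0 = R^d
        t = 0.0
        for idx in cells:
            n = int(Z[idx].sum())     # points from X in this cell
            m = len(idx) - n          # points from X_H in this cell
            t += (n - m) ** 2
        stats.append(t)
    return stats
```

In the Monte Carlo procedure these values, computed for the original sample, would then be compared with the values obtained from the independent resampled samples for which the hypothesis holds.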
The results showed a very weak dependence on the selected mixture model and the dimension. The maximum of the test statistics excluding the 5 per cent highest values appeared to be a suitable criterion for accepting or rejecting the considered hypothesis.