Goodness-of-fit tests for sparse nominal data based on grouping

Currently the amount of accessible information is very extensive; therefore, problems related to high dimensionality of data arise rather frequently. For quantitative (continuous) variables, (generalized) linear models are usually applied. They describe relationships between the means of these variables or their covariance structures, and hence the number of model parameters grows at most as O(k²) with respect to the dimensionality k of the data. The problem of high dimensionality is especially topical for qualitative (categorical) variables. In this case, the number of model parameters generally increases exponentially with k. Consequently, even for a moderate number of categorical variables, the corresponding contingency table can be sparse, i.e. many cells in the table are empty or have small counts. In fact, for categorical data, the number of cells in the corresponding contingency table is an even more important characteristic of sparsity than the dimensionality k itself. Sometimes the number of cells (the number of unknown parameters) is even greater than the sample size (very sparse categorical data).


Introduction
Currently the amount of accessible information is very extensive; therefore, problems related to high dimensionality of data arise rather frequently. For quantitative (continuous) variables, (generalized) linear models are usually applied. They describe relationships between the means of these variables or their covariance structures, and hence the number of model parameters grows at most as O(k²) with respect to the dimensionality k of the data. The problem of high dimensionality is especially topical for qualitative (categorical) variables. In this case, the number of model parameters generally increases exponentially with k. Consequently, even for a moderate number of categorical variables, the corresponding contingency table can be sparse, i.e. many cells in the table are empty or have small counts. In fact, for categorical data, the number of cells in the corresponding contingency table is an even more important characteristic of sparsity than the dimensionality k itself. Sometimes the number of cells (the number of unknown parameters) is even greater than the sample size (very sparse categorical data).
© Vilnius University, 2012

Example. (Cf. [1, p. 16, Case 3].) Suppose a questionnaire consists of k = 10 questions, each with 2 possible answers. Then the total number of cells in a contingency table of the answers is 2^k = 2^10 > 10^3. Thus, for a sample with 10^3 respondents, the average of the expected frequencies in the contingency table is less than 1.
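The arithmetic of this example can be reproduced in a few lines; the sketch below (variable names are illustrative, not from the paper) computes the average expected frequency for the questionnaire table.

```python
# Average expected frequency for the questionnaire example:
# k binary questions give 2^k cells; N respondents are spread over them.
k = 10
n_cells = 2 ** k        # number of cells in the contingency table
N = 1000                # sample size (10^3 respondents)
avg_expected = N / n_cells
print(n_cells, avg_expected)   # the average frequency is below 1
```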
According to the rule of thumb, the expected (under the null hypothesis) frequencies in a contingency table are required to exceed 5 in the majority of cells. If this condition is violated, the χ² approximations of goodness-of-fit statistics may be inaccurate, and the table is said to be sparse [2].
Actually, there are three main problems caused by sparsity in the statistical analysis of contingency tables:
1. The standard χ² approximation for the distributions of the classical tests is not sufficiently accurate (see, e.g., [2, 4]). Several techniques have been proposed to tackle this problem: exact tests [2], alternative approximations [5, 6], parametric and nonparametric bootstrap [7], the Bayes approach [8, 9], and other methods.
2. The classical tests are no longer (asymptotically) distribution free [1]. The latter property implies that the test performance is independent of the null hypothesis to be tested and thus leads to universal decision rules. The lack of this property means that finding a critical value becomes a separate problem specific to each testing problem.
3. For (very) sparse data, the classical tests become noninformative: they no longer measure the goodness-of-fit of a null hypothesis to the data. For instance, the classical tests are inconsistent even in cases where a simple consistent test does exist ([10, 11]; see also [1, 12]).
The paper is devoted to the third problem. It suggests that there may be no point in solving the former two problems until the third one is addressed. The goal of the paper is to propose nonparametric criteria, alternative to the classical ones, which are consistent for sparse categorical (nominal) data as well.
In the next section, we present a brief overview of different approaches to sparsity. We propose the extended empirical Bayes model of sparse asymptotics. This model contains the latent distribution and the structural distribution models as special cases. In Section 3, the testing problem is formulated without any assumptions about the convergence of distributions. The consistency of tests based on φ-divergences and grouping is proved. Finite-sample performance of these tests is studied using Monte Carlo simulations in Section 4. The proposed tests are compared with the classical criteria.

Definitions of sparsity
Let y := (y_1, ..., y_n) be a contingency table, i.e. a vector of observed frequencies. Set µ := Ey. Assume that the components of y are independent Poisson random variables, y ∼ Poisson(µ).
An alternative assumption might be the multinomial sampling scheme

  y ∼ Multinomial(N, µ/N), N := µ_+.   (1)

Consider a simple hypothesis testing problem

  H_0: µ = µ•,   (2)

where µ• = (µ•_1, ..., µ•_n) is a given vector of positive values. We are interested in the case where contingency tables are sparse. Informally, this means that the number of cells n is large and the expected frequencies of a significant part of the cells are small.
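As a minimal sketch of the Poisson sampling model above (the function names and the choice µ_j ≡ 0.5 are illustrative, not from the paper), a very sparse table can be simulated as follows.

```python
import math
import random

def sample_poisson_table(mu, rng):
    """Draw a table y with independent components y_j ~ Poisson(mu_j)."""
    def poisson(lam):
        # Inversion by sequential search; adequate for the small
        # means typical of sparse tables.
        u, k = rng.random(), 0
        p = math.exp(-lam)
        s = p
        while u > s:
            k += 1
            p *= lam / k
            s += p
        return k
    return [poisson(m) for m in mu]

mu = [0.5] * 200                        # average expected frequency 0.5
y = sample_poisson_table(mu, random.Random(0))
print(sum(y))                           # y_+ fluctuates around mu_+ = 100
```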
There are different ways to define sparsity formally, as well as to represent the sparsity scale by introducing corresponding parameters. The definition of sparsity is based on the sparse asymptotics (cf. [13, 14]). Denote µ_+ := Ey_+, y_+ := Σ_{j=1}^n y_j. Let M → ∞ be some asymptotic parameter. The sparse asymptotics assumes that n = n(M) → ∞ and µ_+ = µ_+(M) → ∞. In what follows, we usually hide the dependence on the asymptotic parameter M, though we indicate it when introducing new objects and in cases where we need to stress this dependence.

Latent distribution model
One of the simplest ways to deal with sparsity is to suppose that the expected frequencies µ = (µ_1, ..., µ_n) of an ordered variable are determined by a latent distribution function F on [0, 1] via the representation

  µ_i = µ_+ (F(t_i) − F(t_{i−1})),   (3)

where t_0 = 0, t_i := i/n, i = 1, ..., n (cf. [13, 15]). In this setting, it is usually assumed that there exists a rather smooth latent distribution density f, f(u) = dF(u)/du. This assumption implies µ_i ≈ ρ f(t_i), where ρ = ρ(M) := µ_+/n. Thus, in this case the sparsity is expressed by the average expected frequency ρ. For the multinomial sampling scheme (1) we have µ_+ = N, where N is the sample size of the contingency table y. Hence ρ(M) = N/n. A typical assumption for the sparse asymptotics is ρ = O(1). In this case, the number of unknown parameters n − 1 is proportional to N, and hence a consistent estimator of the parameters, in general, does not exist (see, e.g., [16]). A consistent estimator can be constructed under additional requirements on the smoothness of the latent distribution density f. Then the standard (kernel) smoothing technique can be applied (see, e.g., [13, 15]).
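The latent distribution representation can be sketched numerically; the Beta-type density f(u) = 6u(1 − u) below is an arbitrary illustrative choice, not one used in the paper.

```python
# mu_i = mu_plus * (F(t_i) - F(t_{i-1})), with t_i = i/n, under a smooth
# latent density f; here F is the CDF of f(u) = 6u(1 - u) on [0, 1].
def latent_mu(n, mu_plus, F):
    t = [i / n for i in range(n + 1)]
    return [mu_plus * (F(t[i]) - F(t[i - 1])) for i in range(1, n + 1)]

F = lambda u: u * u * (3 - 2 * u)      # antiderivative of 6u(1 - u)
mu = latent_mu(n=200, mu_plus=200.0, F=F)
print(sum(mu) / len(mu))               # average expected frequency rho = 1
```

The increments telescope, so the cell expectations always sum back to µ_+ regardless of the chosen F.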
The latent distribution model (3), with restrictions on the smoothness of the latent density f that are uniform with respect to M, is inappropriate for nominal data. In this case, the expected frequencies µ and their sparsity can be described by the structural distribution function introduced by Khmaladze [1] to characterize data with a large number of rare events (LNRE for short; see also [12, 17]). Thus, LNRE is Khmaladze's definition of sparse categorical data.

Structural distribution
When dealing with testing problem (2), one can suppose that the cell numbering order is irrelevant. This means that the statement µ = µ• is replaced by the statement {µ_1, ..., µ_n} = {µ•_1, ..., µ•_n}. Actually, this is the same as requiring the tests to be invariant with respect to permutations of the cell numbers. Then only permutation invariant hypotheses can be tested. This leads to the testing problem

  H_0: F^(M) = F•^(M),   (4)

where F^(M) is the empirical distribution function of {µ_1, ..., µ_n}, F^(M)(t) := |{i: µ_i ≤ t}|/n, F•^(M) is defined analogously for {µ•_1, ..., µ•_n}, and |A| denotes the number of elements (cardinality) of the set A.
Here we explicitly indicate the dependence of the statements on M, the key parameter in the sparse asymptotics.
In fact, testing problem (4), as well as (2), is a sequence of statements, and some uncertainty remains as to how they should be combined. While it is quite natural to take "H_0: µ^(M) = (µ•)^(M) for all (sufficiently large) M", a reasonable definition of H_1 is not so clear. Using ideas of the contiguous alternative approach, the testing problem is expressed through asymptotic characteristics (parameters) of the sample distributions.

Definition 1. (Cf. [17].) Suppose that F_ρ(t) := F^(M)(ρt) with some scaling factor ρ = ρ(M) converges weakly to some distribution function F as M → ∞. Then F is called a structural distribution of the expected cell frequencies µ (or simply of the table y) with the scaling factor ρ.
In terms of the structural distribution, the testing problem states

  H_0: F = F• versus H_1: F ≠ F•,   (5)

where F• is a given distribution function with supp(F•) ⊂ R_+. Again, the sparsity scale is determined by ρ.
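Definition 1 can be sketched in code (the two-level table below is hypothetical): the scaled empirical distribution function F_ρ(t) = F^(M)(ρt) of the expected frequencies, with ρ taken as the average expected frequency.

```python
# Empirical df of {mu_1, ..., mu_n} and its scaled version F_rho(t) = F(rho * t).
def empirical_df(mu):
    s = sorted(mu)
    n = len(s)
    return lambda t: sum(1 for x in s if x <= t) / n

mu = [0.2] * 150 + [2.6] * 50          # many rare cells, a few common ones
rho = sum(mu) / len(mu)                # scaling factor: rho = mu_plus / n = 0.8
F_rho = lambda t: empirical_df(mu)(rho * t)
print(F_rho(1.0))                      # fraction of cells with mu_j <= rho
```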
Khmaladze [1] pointed out that the structural distribution can be treated as a latent mixing distribution in the empirical Bayes approach.Below we extend this approach to include the null hypothesis in the Bayes model as well.
Fix M or set M = ∞. Now the testing problem for the structural distribution (5) takes the following form: H_0: P• = P_γ versus H_1: P• ≠ P_γ. Thus, in this case only the marginal distributions of P are involved.
Let P_{γ•} denote the conditional distribution of γ given γ•. Then problem (2) can be extended in terms of P as follows. Here δ_a is the Dirac measure with support {a}, a ∈ R_+, and Ω and A are some measurable sets satisfying, respectively, P•(Ω) = 1 and P•(A) > 0.
Note that this extension of (2) can be tested neither via the latent distribution model nor via the structural distribution approach. Both suggest some convergence of distributions as M → ∞, i.e. some regularity in the sparse asymptotics of frequency tables. In the next section, the testing problem is formulated without any assumptions about the convergence of distributions, thus providing more flexibility in applications.

Hypotheses testing under the sparsity
Here we use the extended empirical Bayes framework described in Section 2.3.
Let P = P^(M) be a class of probability distributions. Suppose that a discrepancy measure d(P, Q) = d^(M)(P, Q) between probability distributions P ∈ P and Q ∈ P satisfies certain natural conditions. Given Q^(M) ∈ P^(M) and δ = δ(M) > 0, consider the testing problem H_0: P = Q^(M) versus H_1: d(Q^(M); P) ≥ δ. Our proofs of the consistency of the testing criteria are based on a general result given below.

Main lemma
Given P^(M) ∈ P^(M) for all M, let P_P = P^(M)_P denote the probability distribution of the observed data D^(M) generated by making use of P^(M). Let Q^(M) ∈ P^(M) be a hypothetical distribution generating D^(M).
In order to apply Lemma 1, we need to specify the discrepancy measure d, the class P^(M) of distributions, the estimator of d(Q; P), and the critical value (1 − τ(M))δ(M) for the sparse asymptotics M → ∞.

Discrepancy measures
The φ-divergence between two vectors u, v ∈ R^n_+ is defined by (cf. [18])

  d_φ(v; u) := Σ_{j=1}^n u_j φ(v_j/u_j),

where the function φ: R_+ → R is convex, strictly convex at 1, and φ(1) = 0. Most of the φ-divergences widely used to measure distribution discrepancy belong to the power-divergence family (cf. [4]) with φ = φ_α. For φ = φ_α, denote d_α(v; u) := d_φ(v; u). Taking α = 1 and α = 2 produces the classical log-likelihood ratio and Pearson χ² statistics, respectively. However, the classical test statistics are usually not appropriate for testing goodness-of-fit in the case of sparse contingency tables or LNRE data [1, 10, 11]. A special grouping procedure is applied to increase the power of the classical criteria for such data.
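For concreteness, here is a sketch of a φ-divergence computation under one standard convention, d_φ(v; u) = Σ_j u_j φ(v_j/u_j); the paper's exact normalization of the power-divergence family is not reproduced, so the two φ's below should be read as likelihood-ratio-type and Pearson-type examples only.

```python
import math

def phi_divergence(v, u, phi):
    """d_phi(v; u) = sum_j u_j * phi(v_j / u_j) for positive u."""
    return sum(uj * phi(vj / uj) for vj, uj in zip(v, u))

# phi convex, strictly convex at 1, phi(1) = 0:
phi_lr = lambda x: x * math.log(x) - x + 1 if x > 0 else 1.0  # LR type
phi_pearson = lambda x: (x - 1) ** 2                          # Pearson type

u = [4.0, 6.0, 10.0]    # hypothetical expected frequencies
v = [5.0, 5.0, 10.0]    # hypothetical observed frequencies
d2 = phi_divergence(v, u, phi_pearson)
print(d2)               # equals sum (v_j - u_j)^2 / u_j
```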
Grouping. The observed data is {(µ•_i, y_i), i = 1, ..., n}, where the conditional distribution of y_i given the random pair (γ_i, γ•_i) is Poisson. Without loss of generality, one can assume that the sequence (µ•_i, i = 1, ..., n) is nondecreasing. Define the cumulative empirical sequences: the sequence for the initial data, and the sequences determined by the partition ∆ of the cells into K groups. Suppose that Q^(M) and P^(M) are the empirical distributions based on the hypothetical and the observed grouped data, respectively. The discrepancy between Q^(M) and P^(M) is measured by the φ-divergence for the grouped data. The straightforward plug-in estimator of d(Q^(M); P^(M)) is given by (15). Let η_u ∼ Poisson(u) and suppose that condition (16) is fulfilled.

Lemma 2. Suppose (16) is fulfilled. The proof is presented in the Appendix.

Nonlinear Anal. Model. Control, 2012, Vol. 17, No. 4, 489-501
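The grouping idea can be sketched as follows (a simplified Pearson-type version with hypothetical data; the paper's estimator (15) is not reproduced): sort the cells by the hypothetical frequencies µ•, split them into K consecutive equal-size groups, and compare grouped observed totals with grouped expected totals.

```python
def grouped_pearson(mu0, y, K):
    """Pearson-type statistic on K equal-size groups of cells sorted by mu0.
    Assumes K divides n; a sketch, not the paper's estimator (15)."""
    order = sorted(range(len(mu0)), key=lambda i: mu0[i])  # nondecreasing mu0
    size = len(mu0) // K
    stat = 0.0
    for k in range(K):
        g = order[k * size:(k + 1) * size]
        m = sum(mu0[i] for i in g)     # grouped expected total
        s = sum(y[i] for i in g)       # grouped observed total
        stat += (s - m) ** 2 / m       # Pearson-type group contribution
    return stat

mu0 = [0.5] * 100 + [1.5] * 100                  # hypothetical null frequencies
y = [1] * 50 + [0] * 50 + [2] * 50 + [1] * 50    # hypothetical observed counts
print(grouped_pearson(mu0, y, K=10))
```

Grouping raises the per-group expected totals, which is exactly what makes the statistic informative for sparse cells.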

Consistency
From Lemma 2 it is easy to derive the following result.
The proof is presented in the Appendix.
Remark 1. If the partition ∆ = ∆^(M) with K = K(M) → ∞ is such that the minimal expected group frequency tends to infinity, then the statistic d(Q; P) defined in (15) is asymptotically normal as M → ∞. This fact can be established by the arguments of Györfi and Vajda [19] used in the case of the multinomial sampling scheme. In the case of sparse asymptotics, however, the power of the test based on this statistic heavily depends on the grouping. Thus, even the weaker requirement min_k µ•_k+ ≥ κ_0 with a pre-specified constant κ_0 > 0 may be rather restrictive.
In Section 4, we present some computer simulation results to illustrate the performance of the proposed criterion.

Computer experiment
In this section, the finite-sample (n = 200, µ_+ ≈ 200) behavior of the goodness-of-fit tests based on two different methods of grouping (K = 10) is compared with that of the classical criteria. The results of a Monte Carlo study with R = 1000 replications for two extended Bayes models are presented.
In the first model, named "Bottom split", µ differs from µ• in the region of low values of µ• ("Bottom"), while in the second, named "Top split", µ differs from µ• in the region of high values of µ• ("Top"). The average values of µ in both regions are kept close to those of µ•. The Poisson distribution parameters, i.e. the expected frequencies µ and the true expected frequencies µ•, are generated as independent Gamma random variables. Here Gamma(a, v) denotes the Gamma distribution with mean a and variance v, in the "Bottom split" model and in the "Top split" model, respectively (see (a) in Fig. 1 and Fig. 2).

Two simple methods of grouping are applied. In the first method, the groups have equal sizes, i.e. equal numbers of elements. In the second method, the groups have equal expected frequencies µ•_k+, k = 1, ..., 10. In the "Bottom" model, the first grouping method is much better than the second one (Fig. 1(c) and (d)); however, it is slightly worse in the "Top" model (Fig. 2(c) and (d)). Note that the expected frequencies of the first grouping in the "Bottom" region are equal to 8, and thus the normal approximation for these frequencies fails. Consequently, the performance of the test heavily depends on the grouping, and hence an adaptive grouping rule can significantly increase the power of the tests. Obviously, the grouping does not help if the average values of µ• and µ in each group are close.
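The two grouping rules from the study can be sketched as follows (function names and the example table are hypothetical): equal-size groups versus groups with approximately equal expected totals µ•_k+.

```python
def equal_size_groups(mu0_sorted, K):
    """Split n sorted cells into K groups with equal numbers of elements."""
    size = len(mu0_sorted) // K
    return [list(range(k * size, (k + 1) * size)) for k in range(K)]

def equal_mass_groups(mu0_sorted, K):
    """Greedy split into at most K groups with roughly equal totals mu0_{k+}."""
    target = sum(mu0_sorted) / K
    groups, cur, acc = [], [], 0.0
    for i, m in enumerate(mu0_sorted):
        cur.append(i)
        acc += m
        if acc >= target and len(groups) < K - 1:
            groups.append(cur)
            cur, acc = [], 0.0
    groups.append(cur)
    return groups

mu0 = sorted([0.1] * 160 + [6.0] * 40)   # hypothetical sparse table, mu_+ = 256
gs = equal_size_groups(mu0, 10)
gm = equal_mass_groups(mu0, 10)
print([len(g) for g in gs], [len(g) for g in gm])
```

On such a table the equal-mass rule lumps all rare cells into one large group, while the equal-size rule keeps several small-µ• groups; this is the kind of difference that drives the power gap between Fig. 1(c) and (d).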
The classical criteria based on the same φ-divergences (but without grouping) have very low power, see (b) in Fig. 1 and Fig. 2.