Stratification of populations with skewed distribution *

The problem of efficient stratification of skewed populations is considered. Four stratification methods are examined. A new adjusted geometric stratification method is introduced and compared by simulation with the Dalenius-Hodges cumulative root frequency method, the geometric method proposed by Gunning and Horgan [2], and the power method proposed by Plikusas [6]. The simulation results show that, in most of the cases considered, the power method is the most efficient one.


Introduction
Survey statisticians are always concerned with selecting the sample design that gives the most accurate estimates of the population parameters of interest. One of the classical and still efficient sample designs is the stratified sample design: the survey population is divided into several non-overlapping parts (strata), a sample is drawn from each part independently, and the population parameters are then estimated from the samples drawn, taking the selection method into account. In survey practice, the most popular stratification method is the cumulative root frequency method considered by Dalenius and Hodges [1,4]. It remains useful today. Different stratification methods for skewed populations are reviewed in [3]. In the following section, we formulate the stratification problem in general. The new stratification method is presented in Section 3.

Stratification problem
Consider a finite population $U = \{u_1, u_2, \ldots, u_N\}$ of $N$ elements. Let $y$ be a study variable defined on the population $U$ and taking values $\{y_1, y_2, \ldots, y_N\}$. Let us consider a stratified simple random sample obtained by partitioning the population into non-overlapping groups, called strata, and then selecting a simple random sample from each stratum. Suppose that the number of strata $H$ is fixed and known. Denote by $U_h$ the stratum $h$, by $s \subset U$ a stratified random sample drawn from the population $U$, and by $s_h$ the simple random sample selected from the stratum $h$.
Using a proper stratification strategy, we can obtain estimators of the population parameters of interest that are more precise at a lower survey cost. The aim of a survey statistician is to select the stratification algorithm that maximizes the precision of the estimators considered, i.e., minimizes their variance, MSE, or coefficient of variation (cv).
* The research is supported by the Grant of Lithuanian science foundation, T-07149.
The classical stratification problem is formulated by choosing the population mean as the parameter of interest and minimizing the variance of its estimator
$$\hat\mu = \frac{1}{N} \sum_{h=1}^{H} N_h \bar{y}_h.$$
Here $\bar{y}_h$ is the sample mean in stratum $h$, $N_h$ is the number of elements in stratum $h$, and the product $N_h \bar{y}_h$ is the well-known Horvitz-Thompson estimator of the stratum $h$ total. A stratification procedure deals with several issues. How should the stratification variable be chosen? How can the strata boundaries be determined? How many strata should there be? How large a sample should be selected? How should the sample be allocated to the strata defined?
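As a minimal sketch of the estimator described above, the following Python functions compute the stratified mean $\frac{1}{N}\sum_h N_h \bar{y}_h$ from per-stratum samples. The function names are hypothetical, and the accompanying variance estimator is the standard textbook formula for simple random sampling without replacement within strata; it is an assumption added for illustration and is not taken from the text.

```python
from statistics import mean, variance


def stratified_mean(samples, stratum_sizes):
    """Estimate the population mean from per-stratum simple random samples.

    samples[h] holds the sampled y-values of stratum h,
    stratum_sizes[h] is the stratum population size N_h.
    """
    N = sum(stratum_sizes)
    return sum(N_h * mean(s_h) for s_h, N_h in zip(samples, stratum_sizes)) / N


def stratified_mean_variance(samples, stratum_sizes):
    """Estimate the variance of the stratified mean (assumed textbook formula).

    Each stratum needs at least two sampled values so that the sample
    variance is defined.
    """
    N = sum(stratum_sizes)
    total = 0.0
    for s_h, N_h in zip(samples, stratum_sizes):
        n_h = len(s_h)
        # finite population correction times s_h^2 / n_h, weighted by N_h^2
        total += N_h ** 2 * (1 - n_h / N_h) * variance(s_h) / n_h
    return total / N ** 2
```

For example, with samples `[[1, 2, 3], [10, 20]]` drawn from strata of sizes 6 and 4, the stratum totals are estimated as 12 and 60, so the estimated mean is 72 / 10.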
We suppose that the number of strata $H$ and the sample size $n$ have been chosen, and consider the second issue, assuming that the sample is allocated according to the Neyman optimal allocation [5].
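Neyman allocation assigns stratum sample sizes proportional to $N_h S_h$, where $S_h$ is the stratum standard deviation. The sketch below is one possible implementation; the function name and the largest-remainder rounding are assumptions made for illustration, and the sketch does not cap $n_h$ at $N_h$ or enforce a minimum sample size per stratum.

```python
def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Split a total sample size n across strata proportionally to N_h * S_h.

    Assumes at least one stratum has a non-zero standard deviation.
    Returns integer sample sizes that sum to n.
    """
    weights = [N_h * S_h for N_h, S_h in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    exact = [n * w / total for w in weights]

    # Round down first, then hand out the remaining units to the strata
    # with the largest fractional parts (largest-remainder rounding).
    alloc = [int(e) for e in exact]
    leftover = n - sum(alloc)
    order = sorted(range(len(exact)), key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in order[:leftover]:
        alloc[h] += 1
    return alloc
```

With `n = 50`, three strata of 100 elements each, and standard deviations 1, 2, 2, the weights are 100, 200, 200, so the allocation is 10, 20, 20.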
Let the variable $y$ be known and its values be arranged in ascending order. Denote by $k_0$ and $k_H$ the smallest and largest values of $y$, respectively. The problem is to find intermediate stratum boundaries $k_1, k_2, \ldots, k_{H-1}$ such that $\mathrm{var}(\hat\mu)$ is minimal. The assumption that the variable $y$ is known is unrealistic; therefore, we use an auxiliary variable $x$ for stratification. This auxiliary variable should be well correlated with the study variable $y$. The principle remains the same: the values of $x$ are arranged in ascending order, and we look for the stratum boundaries that minimize the variance $\mathrm{var}(\hat\mu_x)$ of the mean estimator for the variable $x$.
Tore Dalenius has shown that stratum boundaries with the above-mentioned property exist and satisfy the following equations:
$$\frac{S_h^2 + (k_h - \mu_h)^2}{S_h} = \frac{S_{h+1}^2 + (k_h - \mu_{h+1})^2}{S_{h+1}}, \quad h = 1, \ldots, H-1,$$
where $S_h$ and $\mu_h$ are the standard deviation and mean of the stratum $h$. There are $H - 1$ equations; moreover, both $S_h$ and $\mu_h$ depend on $k_h$. Thus, we have complicated equations that must be solved iteratively. Some additional problems arise: a) how to select the first approximation of the solution $k_h$, $h = 1, \ldots, H - 1$; b) whether the iteration procedure converges.
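The equations usually attributed to Dalenius for Neyman allocation read $(S_h^2 + (k_h - \mu_h)^2)/S_h = (S_{h+1}^2 + (k_h - \mu_{h+1})^2)/S_{h+1}$. Assuming that form, the following sketch evaluates how far a candidate set of boundaries is from satisfying them on a finite population. The function name is hypothetical, and the sketch only checks the equations; it does not solve them.

```python
from statistics import mean, pstdev


def dalenius_residuals(x, boundaries):
    """Return left side minus right side of the Dalenius equations.

    Stratum h holds the x-values in (k_{h-1}, k_h]. Assumes every stratum
    is non-empty and has a non-zero standard deviation.
    """
    H = len(boundaries) + 1
    strata = [[] for _ in range(H)]
    for v in x:
        strata[sum(v > b for b in boundaries)].append(v)

    mu = [mean(s) for s in strata]
    sd = [pstdev(s) for s in strata]

    residuals = []
    for h, k in enumerate(boundaries):
        left = (sd[h] ** 2 + (k - mu[h]) ** 2) / sd[h]
        right = (sd[h + 1] ** 2 + (k - mu[h + 1]) ** 2) / sd[h + 1]
        residuals.append(left - right)
    return residuals
```

Residuals close to zero indicate boundaries that approximately satisfy the equations. Because the stratum means and standard deviations change whenever a boundary moves, any solver built on this check has to recompute them at every step.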

Some stratification methods
I. The cumulative root frequency method. Denote by $f(x)$ a continuous density of the auxiliary variable $x$. Assuming that the distribution of $x$ in each stratum is approximately uniform, Dalenius and Hodges [1,4] have shown that the minimum variance of the population mean estimator is approximately achieved when the strata boundaries $k_h^{(f)}$ are chosen so that
$$\int_{k_{h-1}^{(f)}}^{k_h^{(f)}} \sqrt{f(x)}\,dx = \frac{1}{H} \int_{k_0}^{k_H} \sqrt{f(x)}\,dx, \quad h = 1, \ldots, H.$$
If the distribution of the variable $x$ is discrete, then $f(x)$ is the frequency function of $x$. So, the rule is to choose stratum boundaries $k_h^{(f)}$ so that the totals
$$\sum_{x \in (k_{h-1}^{(f)},\, k_h^{(f)}]} \sqrt{f(x)}, \quad h = 1, \ldots, H,$$
are approximately the same.
II. Geometric method. An interesting method is presented by Gunning and Horgan [2]. They have proposed a new algorithm for the construction of stratum boundaries, based on the observation that, with near-optimum boundaries, the coefficient of variation of the stratification variable $x$ is the same in all strata:
$$\frac{S_1}{\mu_1} = \frac{S_2}{\mu_2} = \cdots = \frac{S_H}{\mu_H}.$$
Assuming that the distribution of the variable $x$ within each stratum is uniform, the following expression for the approximately optimum stratum boundaries has been obtained:
$$k_h = k_0 \left( \frac{k_H}{k_0} \right)^{h/H}, \quad h = 1, \ldots, H-1.$$
So, the stratum boundaries are terms of a geometric progression. This method is called the geometric method, and it is proposed for skewed populations.
III. Power method. A simple and efficient method is proposed in Plikusas [6]. The boundaries $k_h^{(p)}$ are chosen so that the totals [...] show that the parameter $\alpha$ should be in the range from 0.5 to 0.7. There is a hypothesis that the parameter $\alpha$ depends on the exponential distribution parameter $\lambda$.
IV. Adjusted geometric method. Using the same idea of Gunning and Horgan [2], namely equalizing the coefficients of variation across strata, and assuming that the distribution within each stratum is exponential, we get iterative equations for defining the strata boundaries: [...] where [...]
Let us compare the described stratification methods by simulation.
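Methods I and II have simple constructive rules, which can be sketched as follows. The first function bins $x$ into equal-width classes, cumulates the square roots of the class frequencies, and cuts the cumulative total into $H$ roughly equal parts. The second uses the geometric rule in the form commonly stated for Gunning and Horgan's method, $k_h = k_0 (k_H / k_0)^{h/H}$. The function names and the `n_bins` parameter are hypothetical, and the binning scheme is an assumption; this is an illustration, not the implementation used in the study.

```python
import math


def cum_root_f_boundaries(x, H, n_bins=50):
    """Approximate cumulative root frequency boundaries (H - 1 values).

    Assumes x is not constant. Boundaries fall on class edges, so a coarse
    binning of very skewed data may yield repeated boundaries.
    """
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins

    # Frequency of x in each equal-width class.
    freq = [0] * n_bins
    for v in x:
        j = min(int((v - lo) / width), n_bins - 1)
        freq[j] += 1

    # Cumulative sums of sqrt(frequency).
    cum, running = [], 0.0
    for f in freq:
        running += math.sqrt(f)
        cum.append(running)

    # Boundary h is the upper edge of the first class where the cumulative
    # total reaches h/H of the overall total.
    step = running / H
    boundaries, j = [], 0
    for h in range(1, H):
        while cum[j] < h * step:
            j += 1
        boundaries.append(lo + (j + 1) * width)
    return boundaries


def geometric_boundaries(x, H):
    """Geometric-progression boundaries k_0 * r**h with r = (k_H / k_0)**(1/H)."""
    k0, kH = min(x), max(x)
    if k0 <= 0:
        raise ValueError("the geometric rule needs a strictly positive minimum")
    r = (kH / k0) ** (1.0 / H)
    return [k0 * r ** h for h in range(1, H)]
```

For instance, for values ranging from 1 to 16 and four strata, the common ratio is 2, so the geometric boundaries are 2, 4, and 8. Note that the geometric rule depends only on the smallest and largest values of $x$, whereas the cumulative root frequency rule depends on the whole frequency distribution.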

Simulation study
We compare all the stratification methods mentioned on four real populations of size 300 with skewed distributions close to exponential. The sample of size $n = 50$ is allocated to five strata using Neyman's optimal allocation. The known variable $x$ is used for the stratification, and the results are presented for the study variable $y$, which is highly correlated ($\rho \approx 0.9$) with the variable $x$. For each method, $m = 1000$ samples $s_j$ are drawn, and the strata boundaries and the coefficient of variation of the estimate of $\mu_y$ are calculated. The simulation results for some skewed populations are presented in Table 1.
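A simulation of this kind can be sketched as follows. Given the populations $x$ and $y$ and a set of stratum boundaries, the code allocates the sample by Neyman's rule computed on $x$, draws $m$ stratified samples, and returns the empirical coefficient of variation of the estimated mean of $y$. All names are hypothetical, and the rounding of the allocation and the use of the empirical cv over replications are assumptions; the study's actual procedure may differ in detail.

```python
import random
from statistics import mean, pstdev


def split_into_strata(x, y, boundaries):
    """Group x- and y-values by stratum; stratum h holds x in (k_{h-1}, k_h]."""
    H = len(boundaries) + 1
    strata_x = [[] for _ in range(H)]
    strata_y = [[] for _ in range(H)]
    for xv, yv in zip(x, y):
        h = sum(xv > b for b in boundaries)  # number of boundaries below xv
        strata_x[h].append(xv)
        strata_y[h].append(yv)
    return strata_x, strata_y


def simulate_cv(x, y, boundaries, n, m=1000, seed=0):
    """Empirical cv of the stratified mean of y over m repeated samples.

    Assumes every stratum is non-empty and at least one stratum has a
    non-zero spread in x.
    """
    rng = random.Random(seed)
    strata_x, strata_y = split_into_strata(x, y, boundaries)
    N = len(x)

    # Neyman allocation based on the known variable x. Sizes are rounded
    # and clipped to [1, N_h], so they may not sum exactly to n.
    weights = [len(sx) * pstdev(sx) for sx in strata_x]
    total = sum(weights)
    sizes = [max(1, min(len(sx), round(n * w / total)))
             for sx, w in zip(strata_x, weights)]

    estimates = []
    for _ in range(m):
        # Simple random sample without replacement within each stratum.
        est = sum(len(sy) * mean(rng.sample(sy, n_h))
                  for sy, n_h in zip(strata_y, sizes)) / N
        estimates.append(est)
    return pstdev(estimates) / mean(estimates)
```

Calling `simulate_cv` with the same population and different boundary sets gives directly comparable cv values, since the fixed seed makes the replications reproducible.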
For most of the skewed populations, the power method is the best one. The geometric method is simple, but its precision is the lowest in most of the cases considered. This can be observed, for example, in the case of the first population.
The coefficient of skewness of the second and third populations is higher, but the efficiency of all methods remains almost the same. Moreover, more significant differences appear between the power method and the others.
It should be mentioned that for very skewed populations the power method is not the best. The fourth population, which has the highest coefficient of skewness, illustrates this situation. The adjusted geometric method is preferable in this case.
The simulation was also performed for populations with a normal distribution. In that case the cumulative root frequency method is the most suitable; however, the differences in efficiency among the stratification methods are minimal.