APPLICATION OF BALANCED SAMPLING, NON-RESPONSE AND CALIBRATED ESTIMATOR

The aim of this paper is to study the interplay between balanced sampling, non-response and calibrated estimator by simulation. The results of seven strategies, embracing a combination of balanced sampling via the cube method, simple random cluster sampling, adjustment for non-response, Horvitz–Thompson estimator of the total and calibration of design weights, are compared. Auxiliary information is used for all strategies at least at one of the stages (sampling or estimation). This auxiliary information consists of indicator variables for sex, age groups and urban/rural living area, and their totals. Real Labour Force Survey data of Statistics Lithuania are used for simulation. Bias, variance and relative mean squared error are measures of accuracy for the comparison of results.


Introduction
The idea of balanced sampling is very old and goes back to the beginning of survey sampling. In a way, it was already used in the works of Kiaer [7]. Although the concept evolved and was touched on by many survey statisticians, it became known to a wide community of survey statisticians only after the book [6] by Y. Tillé was published and its author gave many conference talks introducing the sampling design that he called "balanced sampling design".
The values of auxiliary variables are used in this method at the stage of sample selection. Another method, which uses the values of auxiliary variables at the estimation stage, is calibration of design weights. It was introduced by J.-C. Deville and C.-E. Särndal in 1992 [1]. This method became very popular at the statistical offices of many countries and is often used, especially for estimation in social surveys.
The aim of the current paper is to study by simulation the use of balanced sampling and calibration together and the effect introduced by non-response into the process of sample selection and estimation.

Probability sampling
Let us study a finite population $U = \{1, 2, \ldots, N\}$, with the study variable $y$ defined for the elements of the population: $y_1, y_2, \ldots, y_N$. The parameter of interest is the total $t_y = \sum_{k \in U} y_k$.
Any subset $s = \{i_1, i_2, \ldots, i_n\} \subset U$ is called a sample selected from a finite population. A random sample $S$ from the finite population is called a probabilistic sample if
a) the elements of the set of all possible samples $\mathcal{S} = \{s_1, s_2, \ldots, s_V\}$ (realizations of $S$) can be enumerated, and to any possible sample a probability of its selection $p(s_v) = P(S = s_v)$, $v = 1, 2, \ldots, V$, is attached, so that $\sum_{v=1}^{V} p(s_v) = 1$;
b) any element of the population belongs to at least one possible sample;
c) a technical possibility is available for the selection of the indicated samples with the indicated probabilities.
The distribution $p(\cdot)$ is called a sampling design. The population $U$ may be a set of clusters (primary sampling units) consisting of population elements (secondary sampling units). The probability $\pi_k = P(k \in S)$, $\pi_k > 0$, is called the inclusion probability of element $k$, and $d_k = 1/\pi_k$ is called a design weight.
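The population total may be estimated from the probability sample by the Horvitz-Thompson estimator [4]:
$$\hat{t}_y = \sum_{k \in S} \frac{y_k}{\pi_k} = \sum_{k \in S} d_k y_k,$$
which is unbiased, with the variance
$$Var(\hat{t}_y) = \sum_{k \in U} \sum_{l \in U} (\pi_{kl} - \pi_k \pi_l) \frac{y_k}{\pi_k} \frac{y_l}{\pi_l},$$
where $\pi_{kl} = P(k \in S, l \in S)$ are the joint inclusion probabilities and $\pi_{kk} = \pi_k$. The estimator of variance is
$$\widehat{Var}(\hat{t}_y) = \sum_{k \in S} \sum_{l \in S} \frac{\pi_{kl} - \pi_k \pi_l}{\pi_{kl}} \frac{y_k}{\pi_k} \frac{y_l}{\pi_l}.$$
A fixed-size sampling design that assigns the selection probability $p(s) = 1/C_N^n$ to any collection $s$ of $n$ different elements and $p(s) = 0$ to any other collection of elements is called simple random sampling. It is a sampling design without replacement with inclusion probabilities $\pi_k = n/N$, $k \in U$.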
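As a minimal sketch of the Horvitz-Thompson estimator under simple random sampling, the following Python snippet enumerates all samples of a small made-up population and checks that the estimator averages to the true total. The population values and the function name `ht_total` are illustrative assumptions and are not taken from the study data.

```python
from fractions import Fraction
from itertools import combinations

def ht_total(sample, y, pi):
    """Horvitz-Thompson estimate: sum of y_k / pi_k over the sampled units."""
    return sum(Fraction(y[k]) / pi[k] for k in sample)

# Hypothetical toy population (not the Labour Force Survey data).
y = [3, 0, 5, 2, 4]
N, n = len(y), 2
pi = [Fraction(n, N)] * N            # simple random sampling: pi_k = n / N

# Every possible sample has the same probability 1 / C(N, n).
samples = list(combinations(range(N), n))
estimates = [ht_total(s, y, pi) for s in samples]
expected_value = sum(estimates) / len(samples)

true_total = sum(y)
print(expected_value, true_total)    # the two values coincide (unbiasedness)
```

Exact fractions are used so that the equality of the expectation and the true total can be checked without rounding error.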

Balanced sampling
Let us take a vector of auxiliary variables $x = (x^{(1)}, x^{(2)}, \ldots, x^{(J)})'$ with the values $x_k = (x_k^{(1)}, x_k^{(2)}, \ldots, x_k^{(J)})'$, $k \in U$, known for the whole population before the sample selection. If this vector characterizes the study variable in the population, it is natural to seek a random sample $S$ for which the Horvitz-Thompson estimator of the population total of the auxiliary variables equals the true total:
$$\sum_{k \in S} \frac{x_k}{\pi_k} = \sum_{k \in U} x_k. \qquad (2)$$
Let us suppose the inclusion probabilities $\pi_1, \pi_2, \ldots, \pi_N$ are given. According to [6], the sampling design $p(\cdot)$ is said to be balanced with respect to the auxiliary vector $x = (x^{(1)}, x^{(2)}, \ldots, x^{(J)})'$ if it satisfies (2).
In the case of a social survey, such a design can imply, for example, that the Horvitz-Thompson estimates of the population size for some groups are equal to the true values if the components of the auxiliary vectors are chosen as indicators of these groups.
A question arises: is it always possible to select balanced samples satisfying (2) for given variables $x^{(1)}, x^{(2)}, \ldots, x^{(J)}$? An exhaustive answer to this question is given in [6], [7] and many other papers by Deville and Tillé.
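The meaning of balance on group indicators can be illustrated by a short sketch that compares the Horvitz-Thompson estimates of group sizes with the true group sizes for two samples. The population, the two samples and the helper names `ht_totals` and `relative_imbalance` are hypothetical and serve only as an illustration.

```python
def ht_totals(sample, x, pi):
    """Horvitz-Thompson estimates of the totals of each auxiliary variable."""
    J = len(x[0])
    return [sum(x[k][j] / pi[k] for k in sample) for j in range(J)]

def relative_imbalance(sample, x, pi):
    """(estimated total - true total) / true total, per auxiliary variable."""
    true_totals = [sum(row[j] for row in x) for j in range(len(x[0]))]
    estimates = ht_totals(sample, x, pi)
    return [(e - t) / t for e, t in zip(estimates, true_totals)]

# Hypothetical population of 8 units with indicators (female, urban).
x = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1), (1, 0), (0, 1), (0, 0)]
pi = [0.5] * 8                       # equal inclusion probabilities, n = 4

balanced = [0, 1, 2, 3]              # 2 females and 2 urban units
unbalanced = [0, 1, 4, 5]            # 4 females and 2 urban units

print(relative_imbalance(balanced, x, pi))    # [0.0, 0.0]
print(relative_imbalance(unbalanced, x, pi))  # [1.0, 0.0]
```

The first sample reproduces both group sizes exactly, while the second overestimates the number of females by 100%.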
Tillé suggested an algorithm for the selection of a balanced sample, which is called the cube method because of the geometric representation of a balanced sampling design by a random walk on an $N$-dimensional cube. First of all, let us note that for a fixed sample size $n$ and only one auxiliary variable $x$ with the values $x_k = \pi_k$, $k \in U$, any probability sample will be balanced with respect to this variable, because the balancing equation is satisfied for any sample:
$$\sum_{k \in S} \frac{\pi_k}{\pi_k} = n = \sum_{k \in U} \pi_k.$$
Now, let us express a sample $S$ as a vector of indicators $s = (I_1, I_2, \ldots, I_N)'$ with
$$I_k = \begin{cases} 1, & k \in S, \\ 0, & k \notin S. \end{cases}$$
Such a sample can be represented as a vertex of an $N$-dimensional unit cube. Then the balancing equations (2) may be rewritten as
$$\sum_{k \in U} \frac{x_k}{\pi_k} I_k = \sum_{k \in U} x_k.$$
These balancing equations should be satisfied.
In order to construct a balanced sample, we come from the opposite direction and look for a vector $a = (a_1, a_2, \ldots, a_N)'$ which satisfies the system of equations
$$\sum_{k \in U} \frac{x_k}{\pi_k} a_k = \sum_{k \in U} x_k. \qquad (4)$$
This is a system of $J$ linear equations with $N$ unknowns $a_1, a_2, \ldots, a_N$. It follows from linear algebra that, without loss of generality, the equation system can be rewritten as
$$A (a_1, \ldots, a_J)' = \sum_{k \in U} x_k - \sum_{k=J+1}^{N} \frac{x_k}{\pi_k} a_k,$$
and for any choice of the $N - J$ components $a_{J+1}, \ldots, a_N$, a unique solution $a_1, \ldots, a_J$ of (4) exists if $\det(A) \neq 0$, with the $J \times J$ matrix
$$A = \left( \frac{x_1}{\pi_1}, \ldots, \frac{x_J}{\pi_J} \right).$$
Generally, a solution $a$ of the equation system (4) may consist of any numbers. If we succeed in finding a solution $a$ whose components are all equal to 0 or 1, this solution gives us a balanced sample: $s = a$. Otherwise, a balanced sample is not selected, and a vector $I = (I_1, I_2, \ldots, I_N)'$, $I_k = 0$ or $I_k = 1$, $k = 1, 2, \ldots, N$, $I_1 + \ldots + I_N = n$, which is in some sense close to the solution $a = (a_1, \ldots, a_N)'$, should be found. The vector $I$ indicates a sample $s = I$ which is approximately balanced, and finding it is called the rounding problem. Unfortunately, the rounding problem is often faced in practice.
The cube method is one of the solutions to this problem. The method consists of two phases: the flight phase and the landing phase. The flight phase solves the linear equation system (4) by a random walk starting at the point $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_N)'$ and stopping at a point $a = (a_1, a_2, \ldots, a_N)'$ which satisfies the equation system (4) and lies on an edge of the $N$-dimensional cube. The landing phase rounds the solution $a$ obtained in the flight phase to the closest vertex of the cube $I = (I_1, I_2, \ldots, I_N)'$ if the flight phase did not end in a vertex. A balanced sample cannot be reached exactly if one of the constraints in (4) is a fixed sample size and the sum of the inclusion probabilities is not an integer: $\sum_{k \in U} \pi_k \neq n$.
A solution of the rounding problem by the cube method is presented theoretically in [6], [7] and implemented in practice in the R package sampling [8]. It has to be mentioned that the elements of a balanced sample have predefined inclusion probabilities; therefore, the Horvitz-Thompson estimator of the total can be used. A variance estimator for this estimator that does not use joint inclusion probabilities is also given in [7].
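The flight phase described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions and not the implementation of the R package sampling: the function name `flight_phase`, the use of a singular value decomposition to find a direction in the kernel of the constraint matrix, the tolerance and the toy population are all choices made here, and the landing phase is omitted.

```python
import numpy as np

def flight_phase(X, pi, rng, tol=1e-9):
    """Random walk from pi to a point a that still satisfies system (4);
    at most J components of a remain strictly between 0 and 1."""
    a = np.asarray(pi, dtype=float).copy()
    A = (np.asarray(X, dtype=float) / a[:, None]).T    # J x N, columns x_k / pi_k
    while True:
        free = np.flatnonzero((a > tol) & (a < 1 - tol))
        if free.size == 0:
            break
        # Look for a direction u with A u = 0 that moves only free components.
        _, sv, vt = np.linalg.svd(A[:, free], full_matrices=True)
        rank = int(np.sum(sv > 1e-8))
        if rank >= free.size:
            break                                       # kernel is trivial: stop
        u = vt[-1]                                      # unit vector in the kernel
        pos, neg = u > tol, u < -tol
        af = a[free]
        # Largest steps keeping 0 <= a + l1*u <= 1 and 0 <= a - l2*u <= 1.
        l1 = min(np.min((1 - af[pos]) / u[pos], initial=np.inf),
                 np.min(af[neg] / -u[neg], initial=np.inf))
        l2 = min(np.min(af[pos] / u[pos], initial=np.inf),
                 np.min((1 - af[neg]) / -u[neg], initial=np.inf))
        # Step probabilities are chosen so that the expectation of a stays at pi.
        if rng.random() < l2 / (l1 + l2):
            a[free] = af + l1 * u
        else:
            a[free] = af - l2 * u
        a[np.abs(a) < tol] = 0.0                        # snap to the cube boundary
        a[np.abs(a - 1) < tol] = 1.0
    return a

rng = np.random.default_rng(1)
N = 40
pi = np.full(N, 0.25)                        # expected sample size n = 10
group = (np.arange(N) < 20).astype(float)    # hypothetical group indicator
X = np.column_stack([pi, group])             # fixed size and group size constraints
a = flight_phase(X, pi, rng)

A = (X / pi[:, None]).T
print(np.allclose(A @ a, A @ pi))            # checks the balancing equations
print(int(np.sum((a > 0) & (a < 1))))        # components left for the landing phase
```

Each step fixes at least one further component at 0 or 1, so the loop ends after at most $N$ steps. Components that are still fractional at the end would have to be rounded in the landing phase.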

Calibrated estimator
Let us suppose a probability sample $s$ is selected and data from its elements are collected. Let us suppose we have a vector of auxiliary variables $x = (x^{(1)}, x^{(2)}, \ldots, x^{(J)})'$ with values $x_k$ known for the sampled elements and with known population totals $t_x = (t_{x1}, \ldots, t_{xJ})'$. Let us fix a sample $s$ from the set of all possible samples. A calibrated estimator of the total $t_y$ is an estimator $\hat{t}_{yw} = \sum_{k \in s} w_k y_k$ whose weights $w_k$, $k \in s$, satisfy the requirements:
a) the weights $w_k$ differ as little as possible from the design weights $d_k = 1/\pi_k$ in the sense of the distance function
$$\sum_{k \in s} \frac{(w_k - d_k)^2}{d_k q_k},$$
where $q_k$ are freely chosen constants;
b) the calibration equations are valid:
$$\sum_{k \in s} w_k x_k = t_x. \qquad (5)$$
We see that the calibration equation (5) is similar to the balancing equation (2). The difference is in the weights. In addition, all values of $x$ are needed for balancing before sample selection, whereas only the values of $x$ for the selected elements and the totals $t_x$ are needed for the calibration of the design weights at the estimation stage. The connection between these two methods is widely discussed in [7].
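For the quadratic distance function, the calibrated weights have the known closed form $w_k = d_k (1 + q_k x_k' \lambda)$ with $\lambda = (\sum_{k \in s} d_k q_k x_k x_k')^{-1} (t_x - \sum_{k \in s} d_k x_k)$. The sketch below applies this formula with NumPy. The sample, the design weights and the population totals are hypothetical, and the function name `linear_calibration` is an assumption of this illustration.

```python
import numpy as np

def linear_calibration(d, X, t_x, q=None):
    """Weights minimizing sum (w - d)^2 / (d q) subject to sum w_k x_k = t_x."""
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    q = np.ones_like(d) if q is None else np.asarray(q, dtype=float)
    M = (X * (d * q)[:, None]).T @ X                  # sum of d_k q_k x_k x_k'
    lam = np.linalg.solve(M, np.asarray(t_x, dtype=float) - X.T @ d)
    return d * (1.0 + q * (X @ lam))

# Hypothetical sample of 6 persons; columns: indicator female, indicator male.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [1, 0]])
d = np.full(6, 10.0)                  # design weights 1 / pi_k
t_x = np.array([35.0, 25.0])          # assumed known population totals

w = linear_calibration(d, X, t_x)
print(w)                              # calibrated weights
print(X.T @ w)                        # reproduces t_x, the calibration equations (5)
```

With disjoint group indicators, as here, the calibrated weights are the design weights rescaled within each group, which is the post-stratified estimator.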

Dealing with non-response
Non-response is unavoidable in any real survey. The probabilities for the population elements to respond to the survey questionnaire may be equal or unequal. When non-response occurs, bias of the estimator of a population parameter is almost unavoidable. The aim of the statistician is then to select an estimator of the parameter whose bias is not too large and whose variance is not too high. Many estimators are known for the estimation of parameters in the case of non-response. Some of them are applied here.
Reweighting estimator. Let $s^{(r)}$ be the subsample of respondents, $s^{(r)} \subset s$. The probability of obtaining data from the population element $k$ can be expressed as
$$\pi_k^{(r)} = P(k \in s^{(r)}) = P(k \in s^{(r)} \mid k \in s) P(k \in s) = \kappa_k \pi_k,$$
where $\kappa_k$ is the response probability of the element $k$, $k \in U$ [4]. The response probability is considered known in our study; therefore, the Horvitz-Thompson estimator with the probabilities $\pi_k^{(r)}$ is used to estimate the total from the sample of respondents $s^{(r)}$.
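A small simulation can illustrate that the reweighting estimator with known response probabilities is approximately unbiased. The population, the sample size, the number of repetitions and the function name `reweighted_ht` below are hypothetical choices made for this sketch.

```python
import random

def reweighted_ht(respondents, y, pi, kappa):
    """Horvitz-Thompson estimator on respondents with probabilities pi_k * kappa_k."""
    return sum(y[k] / (pi[k] * kappa[k]) for k in respondents)

random.seed(0)
N, n = 1000, 100
y = [1 if k % 10 == 0 else 0 for k in range(N)]   # hypothetical binary variable, total 100
pi = [n / N] * N                                  # simple random sampling
kappa = [0.9] * N                                 # known response probabilities

estimates = []
for _ in range(2000):
    sample = random.sample(range(N), n)
    # Each sampled unit responds independently with probability kappa_k.
    respondents = [k for k in sample if random.random() < kappa[k]]
    estimates.append(reweighted_ht(respondents, y, pi, kappa))

average = sum(estimates) / len(estimates)
print(average, sum(y))                            # the average is close to the true total
```

Ignoring the response probabilities, that is, dividing by $\pi_k$ only, would underestimate the total by about 10% in this setting.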
Imputation by logistic regression. The probability that a population element takes the value 1 of a binary study variable $y$ is modelled by the logistic regression model [2]:
$$P(y_k = 1) = \frac{\exp(x_k' \beta)}{1 + \exp(x_k' \beta)},$$
where $x_k = (1, x_k^{(1)}, \ldots, x_k^{(J)})'$ is the vector of auxiliary variables of element $k$, with a leading 1 for the intercept, and $\beta = (\beta_0, \beta_1, \ldots, \beta_J)'$ is a vector of coefficients. The model coefficients are estimated from the available observations by the maximum likelihood method, and the estimates $\hat{P}(y_k = 1)$ are obtained. After that, the missing values $y_k$ are simulated as values of Bernoulli random variables with the success probabilities $\hat{P}(y_k = 1)$ and are denoted by $\hat{y}_k$.
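The imputation step can be sketched as follows. The maximum likelihood fit is done here by a plain Newton-Raphson iteration written with NumPy, which is a choice made for this illustration; the study itself uses the R function glm. The simulated respondents, the covariate, the assumed unemployment rates and the name `fit_logistic` are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum likelihood by Newton-Raphson; X must contain a column of ones."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                          # score vector
        hess = (X * (p * (1 - p))[:, None]).T @ X     # information matrix
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
# Hypothetical respondents: intercept plus one binary covariate (urban = 1).
urban = rng.integers(0, 2, size=500)
X_resp = np.column_stack([np.ones(500), urban])
true_p = np.where(urban == 1, 0.05, 0.15)             # assumed unemployment rates
y_resp = rng.binomial(1, true_p)

beta_hat = fit_logistic(X_resp, y_resp)

# Non-respondents: covariates known, y missing; impute by Bernoulli draws.
X_miss = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
p_miss = 1.0 / (1.0 + np.exp(-X_miss @ beta_hat))
y_imputed = rng.binomial(1, p_miss)
print(p_miss, y_imputed)
```

Because the model with one binary covariate is saturated, the fitted probabilities equal the observed proportions of unemployed in the two groups, which gives a simple check of the fit.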
Multiple imputation. The data of the sampled elements which would be available in the case of full response are called real data. If some elements do not respond, then their real data are not known. If the statistician makes certain assumptions about the non-response distribution and imputes the values of the variable for the missing observations, all values of the sampled elements become available for that variable, but some of them are not real. The imputation of missing values introduces additional uncertainty into the data set, in comparison with the real data of the sampled elements. The variance of an estimator of a population parameter based on data with some imputed values cannot be treated as the variance of the estimator obtained from real data, because the variability of the imputed values should also be taken into account.
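Using the method of logistic regression, all values of the study variable for the sampled elements are obtained, and the parameter of study $\theta = t_y = \sum_{k=1}^{N} y_k$ is estimated. Let us denote the estimator by $\hat{\theta} = \hat{t}_y$. The method used for imputation is random. It is repeated $C$ times, $C$ complete data sets are obtained, and $C$ estimates $\hat{\theta}_1, \ldots, \hat{\theta}_C$ become available for $\theta$. The estimator obtained by multiple imputation is
$$\bar{\theta}_C = \frac{1}{C} \sum_{c=1}^{C} \hat{\theta}_c,$$
and its variance is estimated by
$$\widehat{Var}(\bar{\theta}_C) = W_C + \left(1 + \frac{1}{C}\right) B_C,$$
with the component of variance within the complete samples
$$W_C = \frac{1}{C} \sum_{c=1}^{C} \widehat{Var}(\hat{\theta}_c)$$
and the component of variance between the estimates for the complete data sets
$$B_C = \frac{1}{C - 1} \sum_{c=1}^{C} (\hat{\theta}_c - \bar{\theta}_C)^2.$$
The term $B_C$ estimates the increase in the variance $Var(\hat{\theta})$ due to imputation [3].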
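The combination of the results from several imputed data sets can be sketched with the standard combining rules of multiple imputation: the mean of the estimates, the within-imputation component $W_C$, the between-imputation component $B_C$, and the total $W_C + (1 + 1/C) B_C$. The numbers below are hypothetical, and the function name `combine_imputations` is an assumption of this illustration.

```python
from statistics import mean, variance

def combine_imputations(estimates, variances):
    """Combine C point estimates and their variance estimates."""
    C = len(estimates)
    theta_bar = mean(estimates)
    W = mean(variances)                  # within-imputation component
    B = variance(estimates)              # between-imputation component, divisor C - 1
    total_var = W + (1 + 1 / C) * B
    return theta_bar, W, B, total_var

# Hypothetical results from C = 5 imputed data sets.
est = [1700.0, 1750.0, 1720.0, 1690.0, 1740.0]
var = [9000.0, 9500.0, 9200.0, 8800.0, 9100.0]

theta_bar, W, B, total = combine_imputations(est, var)
print(theta_bar, W, B, total)
print((total - W) / W)                   # relative increase in variance due to imputation
```

In this made-up example the between-imputation term adds about 8.6% to the within-imputation variance.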

Problem formulation
Let us suppose a balanced sample is available for a social survey. Auxiliary information is used at the stage of sample selection. Unfortunately, non-response occurs, and the set of respondents becomes unbalanced. If only the data of the respondents are used to estimate the parameter of the finite population, treating the set of respondents as the selected sample, then a bias of the estimator may arise, and the variance of the estimator increases because the set of respondents is smaller than the selected sample. The sample should be adjusted for non-response. If imputation for missing values of a study variable is used, then the sample balance is preserved. If reweighting of the respondent set is used, then the sample balance is destroyed, and the respondent set is no longer balanced.
The method using auxiliary information at the estimation stage is calibration of the design weights. As mentioned in [7], the combination of balancing and calibration is a good strategy. Our aim is to study this strategy by simulation when non-response is introduced.
Sample balancing and calibration with the same auxiliary variables means that the auxiliary information is used twice. Our aim is to find out whether this is worth doing.

Study population
Labour Force Survey data of Statistics Lithuania [5] are used for a simulation study. A fictitious population consists of M = 21 318 individuals aged 16-69. Of them, 19 586 are employed and 1 732 are unemployed (inactive individuals are not included in the population). The parameter of interest is the number of the unemployed in the population, and it will be estimated in the study. The study variable y is binary, with the value 1 if a person is unemployed and 0 otherwise.
The population consists of N = 11 236 households, with an average size of 1.9 persons. These households are considered as clusters in our study. The cluster size is equal to the number of its members.
The same auxiliary variables will be used at the sampling design and estimation stages. The variables that influence the unemployment of a person are selected as auxiliary. From the data analysis of previous surveys, these are taken to be the indicator variables for sex, age group and urban/rural living area. Simulation is carried out with sample sizes of n = 100, 1 000 and 5 000 clusters in order to see how the accuracy of the results depends on the sample size. Each strategy is repeated K = 10 times, and the simulation results are averaged. The number of repetitions K is small, which weakens the validity of the conclusions; however, the available computer resources do not allow more repetitions.

Simulation strategies
A strategy is a pair consisting of a sampling design and an estimator. The following seven strategies are studied:
1. Balanced cluster sampling and Horvitz-Thompson estimator. The cube method and auxiliary information are used at the sampling stage. Inclusion probabilities are proportional to the household size: $\pi_k = n m_k / M$, $k = 1, 2, \ldots, N$, where $m_k$ is the household size, $m_1 + \ldots + m_N = M$, and $n$ is the cluster sample size. We denote this strategy by BC+HT.
2. Simple random cluster sampling and calibrated estimator. The same auxiliary information as for the first strategy is used here at the estimation stage only (SRCS+CAL).
3. Balanced cluster sampling, non-response, and calibrated estimator of the total. Reweighting is used for non-response adjustment (BC+NR+Rew+CAL).
4. Balanced cluster sampling and calibrated estimator.
5. Simple random cluster sampling, non-response, and calibrated estimator. Reweighting is used for non-response adjustment.
6. Balanced cluster sampling, non-response, and Horvitz-Thompson estimator. Reweighting is used for non-response adjustment.
7. Balanced cluster sampling, non-response, logistic regression model for imputation of missing values of the study variable, and Horvitz-Thompson estimator (BC+NR+Imp+HT).
As auxiliary information, the same indicator vectors for sex, age and urban/rural living area are used for balanced sampling, calibration and the logistic regression model. Non-response has to be simulated. It is assumed that either all members of a household respond to the survey questionnaire or none of them do, and the response probabilities of the households (and their members) are assumed to be equal: $\kappa_k = 0.9$. The inclusion probability for a responding individual [4] is
$$\pi_k^{(r)} = P(k \in s^{(r)}) = \kappa_k \pi_k = 0.9\,\pi_k.$$
Here $s^{(r)}$ denotes the subsample of respondents. The inclusion probability $\pi_k^{(r)}$ is used in the third, fifth and sixth strategies. Because the response probabilities $\kappa_k$ are equal and known, the Horvitz-Thompson estimator becomes a reweighting estimator.
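The inclusion probabilities proportional to household size and their adjustment for the response probability 0.9 can be computed as in the following sketch. The household sizes and the function name `inclusion_probabilities` are hypothetical; only the formulas $\pi_k = n m_k / M$ and $\pi_k^{(r)} = \kappa_k \pi_k$ come from the text.

```python
def inclusion_probabilities(sizes, n):
    """pi_k = n * m_k / M for clusters (households) with sizes m_k."""
    M = sum(sizes)
    return [n * m / M for m in sizes]

sizes = [1, 2, 2, 3, 1, 4, 2, 5]       # hypothetical household sizes, M = 20
n = 2                                  # number of clusters to select
kappa = 0.9                            # response probability used in the study

pi = inclusion_probabilities(sizes, n)
pi_resp = [kappa * p for p in pi]      # pi_k^(r) = kappa_k * pi_k
weights = [1 / p for p in pi_resp]     # weights applied to responding households

print(sum(pi))                         # equals n up to rounding error
print(max(pi) <= 1)                    # the formula is valid only if no pi_k exceeds 1
```

The last check matters because $n m_k / M$ is a probability only when it does not exceed 1 for every household.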
For the seventh strategy, the probability of a household member being unemployed is modelled using the logistic regression model. Firstly, the logarithm of the odds is estimated by the maximum likelihood method with the function glm of the R package stats:
$$\log \frac{\hat{P}(y_k = 1)}{1 - \hat{P}(y_k = 1)} = \hat{\beta}_0 + \hat{\beta}_1 x_k^{(1)} + \ldots + \hat{\beta}_J x_k^{(J)}.$$
The values to be imputed in place of the missing values of the binary study variable $y$ are simulated according to the Bernoulli distribution with the success probability $\hat{P}(y_k = 1)$, and the simulated values are taken as the estimates $\hat{y}_k$. Consequently, the total of the study variable $y$ is estimated as follows:
$$\hat{t}_y = \sum_{k \in s^{(r)}} d_k y_k + \sum_{k \in s \setminus s^{(r)}} d_k \hat{y}_k;$$
here $d_k = 1/\pi_k$ are the design weights, $s \setminus s^{(r)}$ is the subsample of non-respondents, and $\hat{y}_k$ are the values of the study variable $y$ for individuals, simulated by the Bernoulli distribution using the logistic regression model.
The R package sampling [8] is used to estimate parameters and their variances. In the case of a balanced sampling design, the variance estimates are computed by the function varest using Deville's method, for which only first-order inclusion probabilities are needed. In the case of simple random cluster sampling, the function calibev is used for variance estimation of both the unbiased and the calibrated estimator of the total. To use it, the second-order inclusion probabilities for individuals $\pi_{kl} = P(k \in s, l \in s)$ have to be supplied. For simple random cluster sampling, let us take two clusters and their elements: $u_k = \{e_{ki}, i = 1, \ldots, m_k\}$ and $u_l = \{e_{lj}, j = 1, \ldots, m_l\}$. For two elements, we have
$$\pi_{ki,lj} = P(e_{ki} \in s, e_{lj} \in s) = P(e_{ki} \in s \mid e_{lj} \in s) P(e_{lj} \in s).$$
From here, the joint inclusion probabilities for two elements are
$$\pi_{ki,lj} = \begin{cases} \dfrac{n}{N}, & k = l, \\ \dfrac{n(n-1)}{N(N-1)}, & k \neq l, \end{cases}$$
with $i$ and $j$ indexing the elements of the clusters, $i = 1, \ldots, m_k$, $j = 1, 2, \ldots, m_l$.
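Under simple random cluster sampling, two elements of the same cluster are selected together with probability $n/N$, and two elements of different clusters with probability $n(n-1)/(N(N-1))$. The sketch below checks both values by enumerating all cluster samples of a small hypothetical population; the function names `joint_inclusion` and `enumerated` are assumptions of this illustration.

```python
from fractions import Fraction
from itertools import combinations

def joint_inclusion(k, l, N, n):
    """Joint inclusion probability of two elements lying in clusters k and l."""
    if k == l:
        return Fraction(n, N)                    # same cluster: selected together
    return Fraction(n * (n - 1), N * (N - 1))    # different clusters

# Brute-force check on a hypothetical population of N = 5 clusters, n = 3.
N, n = 5, 3
samples = list(combinations(range(N), n))

def enumerated(k, l):
    """Share of equally likely cluster samples containing both clusters."""
    hits = sum(1 for s in samples if k in s and l in s)
    return Fraction(hits, len(samples))

print(joint_inclusion(0, 0, N, n), enumerated(0, 0))   # 3/5 3/5
print(joint_inclusion(0, 1, N, n), enumerated(0, 1))   # 3/10 3/10
```

A matrix of such probabilities, expanded from clusters to individuals, is the kind of second-order input that a variance estimator based on joint inclusion probabilities requires.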

Main results
For each strategy, a sample was drawn K = 10 times, and the parameter $\theta = t_y$ of the study variable $y$ was estimated by $\hat{\theta}_k$, $k = 1, 2, \ldots, K$. The following accuracy measures of the estimates are used: the empirical mean of the estimates
$$\bar{\theta} = \frac{1}{K} \sum_{k=1}^{K} \hat{\theta}_k,$$
the empirical bias $Bias(\hat{\theta}) = \bar{\theta} - \theta$ and the relative empirical bias $RBias(\hat{\theta}) = Bias(\hat{\theta})/\theta$, the average of the variance estimates
$$\overline{Var}(\hat{\theta}) = \frac{1}{K} \sum_{k=1}^{K} \widehat{Var}(\hat{\theta}_k),$$
and the relative mean squared error (RMSE).
The relative biases of the estimates of the totals of the balancing variables are presented in Table 1. It shows to what extent complete balance of the auxiliary variables can be achieved for small, medium and large sample sizes. Table 1 shows that for a small sample size (n = 100), the estimates of the totals of the balancing variables are far from the true values because of the rounding problem. Large samples (n = 5 000) do not encounter this problem. This means that for a small sample size it is difficult to achieve balance of the auxiliary variables. Therefore, if the study variable is correlated with the auxiliary variables, the estimates of its total from small, not well-balanced samples cannot be expected to be very precise. Table 1 also shows that balance for indicator variables characterizing smaller groups, for example age groups, is worse than balance for indicators characterizing large groups: sex and living area. In other words, in the balanced sample the relative empirical bias of the estimate of the total of an auxiliary variable is higher for an indicator characterizing a small group than for an indicator characterizing a large group (as measured by the value of the total $t_{x_j}$). Means of the ten estimates of the totals of the auxiliary variables and their empirical biases are presented in Table 1.
The results in Table 2 demonstrate an increase in the variance estimate of the estimator of the total due to imputation using logistic regression and the Bernoulli distribution. They show that the variance increases by about 6-15% due to imputation. The results of estimation of the population total $t_y$ for the seven strategies are presented in Tables 3, 4 and 5. When comparing the results of Tables 3-5, one should keep in mind that balanced sampling is applied with probabilities proportional to the cluster size, whereas the cluster size is not taken into account in simple random cluster sampling. These differences may slightly influence the accuracy of the estimates.
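The accuracy measures can be computed from the K repetitions as in the following sketch. The estimates and variance estimates are hypothetical. The formula used for the relative root mean squared error, the square root of the average variance estimate plus the squared empirical bias, divided by $\theta$, is an assumption of this illustration, because the exact definition is not reproduced in the text.

```python
from math import sqrt
from statistics import mean

def accuracy_measures(estimates, variance_estimates, theta):
    """Empirical mean, relative bias, average variance and a relative RMSE."""
    theta_bar = mean(estimates)
    bias = theta_bar - theta
    avg_var = mean(variance_estimates)
    # Assumption: relative RMSE = sqrt(average variance + bias^2) / theta.
    rel_rmse = sqrt(avg_var + bias ** 2) / theta
    return {"mean": theta_bar, "rbias": bias / theta,
            "avg_var": avg_var, "rel_rmse": rel_rmse}

theta = 1732                                    # number of unemployed in the study population
est = [1700.0, 1800.0, 1650.0, 1760.0]          # hypothetical estimates
var = [10000.0, 12000.0, 9000.0, 11000.0]       # hypothetical variance estimates

result = accuracy_measures(est, var, theta)
print(result)
```

With only a few repetitions, as in this example, the empirical bias itself is subject to considerable random variation.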

Discussion
Strategy 1 (balanced sampling and the Horvitz-Thompson estimator of the total) shows that the estimator has an empirical bias, which decreases with increasing sample size. Apart from this common regularity, the balance of the samples is inadequate for small sample sizes, and this affects the empirical biases of the estimates. Strategy 6 is obtained by adding non-response to Strategy 1. Empirical bias is observed. It decreases with increasing sample size but remains significant. The variance of the estimator also increases due to non-response, and this affects the RMSE for large samples.
Strategy 4 means balanced sampling, as for Strategy 1, but the Horvitz-Thompson estimator is replaced by a calibrated estimator.Empirical bias is approximately the same as for Strategy 1, but variance increases and relative measures of accuracy are also higher than for Strategy 1.
Strategy 3 consists of the conditions of Strategy 4 with non-response added. The estimates became closer to the estimates for Strategy 1. The relative measures of accuracy decreased for small sample sizes but remained unchanged for large sample sizes.
Strategy 7. Balanced sampling and non-response. A logistic regression model is used for the imputation of the study variable values for non-responding elements, and the Horvitz-Thompson estimator of the total is used. It is reasonable to compare this strategy with Strategy 6 because the sampling design and the estimator are the same and only the adjustment for non-response differs. The biases are smaller for Strategy 7 than for Strategy 6, and they decrease with increasing sample size.
In comparison with Strategy 3, the variance estimates for Strategy 7 are higher. With increasing sample size, the estimates of Strategy 7 approach those obtained for Strategy 3, remaining slightly higher, and the variances are higher.
Strategy 2. This strategy gives the best accuracy for the estimator of the total. The variance estimates are lower than for Strategy 1 for any sample size. In the case of small sample sizes, there is no bias for Strategy 2, whereas the bias is significant for Strategy 1. In the simulation carried out, calibration does not improve the accuracy in the case of balanced sampling without non-response (Strategy 4). When there is no non-response, the classical Strategy 2 is the best.
Strategy 5. Simple random sampling, non-response and calibration.We compare the estimates with the results of Strategy 3. Unfortunately, in the case of Strategy 5, the empirical biases are more significant, and variance estimates are higher.
The calibration estimator in Strategy 4, in comparison with the Horvitz-Thompson estimator in Strategy 1 for balanced sampling design, does not improve accuracy; all accuracy measures are higher for the former.But if non-response occurs for balanced sampling design (Strategy 3) the calibrated estimator shows more accurate results, in comparison with Strategy 4 without non-response.It should be mentioned that the calibrated weights are random, but their variability is not taken into account in the estimator of variance for the calibrated estimator.
There is no monotonicity in the change of bias due to an increasing sample size.It may occur because of a small number of repetitions K = 10.

Conclusion. Simulation results for Labour Force Survey data show that if there is no non-response, a simple random sample of clusters and a calibrated estimator of the total (Strategy 2) give the highest accuracy; if non-response occurs, then balanced sampling, adjustment for non-response by reweighting and calibration (Strategy 3) give the highest accuracy.
$\sum_{v=1}^{V} p(s_v) = 1$; b) any element of the population belongs to at least one possible sample; c) a technical possibility is available for the selection of the indicated samples with the indicated probabilities.

Lithuanian Statistical Association, Statistics Lithuania (Lietuvos statistikų sąjunga, Lietuvos statistikos departamentas). ISSN 2029-7262 online.

Table 1. Empirical biases of the estimates of totals for balancing variables in balanced sampling design, sample size n = 100, 1 000, 5 000

Table 3. Estimates of accuracy measures for estimators of the total of a study variable for seven strategies, n = 100

Table 4. Estimates of accuracy measures for estimators of the total of a study variable for seven strategies, n = 1 000

Table 5. Estimates of accuracy measures for estimators of the total of a study variable for seven strategies, n = 5 000