Weighting and imputation comparison in small area estimation

In this paper, different methods of nonresponse adjustment for the totals of small area domains are examined. To improve quality of estimations linear model with random parameters at domain level is used. The empirical results are based on Monte Carlo simulations with repeated samples drawn from a finite population constructed from the Lithuanian survey on short-term statistics on service.


Introduction
In true surveys, we are faced with nonresponse. Nonresponse not only mean less efficient estimates because of reduced sample size, but also standard-complete methods cannot be immediately used to analyze data. There are several methods to correct the consequences of unit nonresponse [1,6] and they are examined in this paper for the small areas. The focus on small area is made because there is relatively little of an impartial comparison between nonresponse treatment [4]. Also for the small area estimation an important aspect is to choose the right estimator and model [2,3]. During the previous research [4] it was showed, that the linear model with random parameters at domain level is a good choice, but which estimator to use is still an open question. That is why, two different estimators (generalized regression estimator [8] and empirical best linear unbias predictor [5]) are investigated in this paper.

Population
Let U = {1, 2, . . . , k, . . . , N } denote a finite population with N units. This population is divided into D nonoverlaping domains U d , d = 1, . . . , D, consisting of N d units. A sample s of the size n units is selected from the population U , s = {s 1 , s 2 , . . . , s n } ⊂ U . Each unit k has an inclusion probability π k = P(k ∈ s) or a sampling weight w k = π −1 k . For different reasons there are missing units in the sample s. Let a response probability for each unit be κ k = P(k ∈ s (r) , k ∈ s), where s (r) ⊂ s is a responded sample.
Let us denote y as a study variable, which values y k are known just for the elements of a response sample s (r) and x = (x 1 , x 2 , . . . , x J ) ′ as a vector of auxiliary variables, which values x k are known for all units in U . Let t d = k∈U d y k be a domain total -parameter of interest. It is assumed that the number of the elements in each domain U d , d = 1, . . . , D, is known, but the domains are not used in the sample design. This means that the sample part in each domain, s ∩ U d , has a random size.

Model and estimators
Let values y 1 , . . . , y N of a study variable y be realizations of independent random variables Y 1 , . . ., Y N , which satisfy the following general linear model [7]: Here β 0 and β j , j = 1, . . . , J, are regression coefficients, u d , d = 1, . . . , D, are random parameters that are related to the corresponding domain and I dk , d = 1, . . . , D, k = 1, . . . , N , are domain indicators (I dk = 1, if k ∈ U d and I dk = 0 otherwise). The errors ε k and random parameters u d are assumed to be independent and identically distributed Gaussian random variables with mean 0 and variances σ 2 and σ 2 1 respectively. Using restricted maximum likelihood (REML) method with incorporated weights [7] model's parameters are estimated and predicted valuesŷ k = β 0 + D d=1 I dûd + J j=1β j x jk are computed for all k ∈ U . These predicted values are used to estimate two different domain total estimators: generalized regression (GREG) estimator [8] and empirical best linear unbiased predictor (EBLUP estimator) [5] These estimators in the case of significant nonresponse rate should be corrected.

Methods for nonresponse adjustment
Weighting and imputation are the two main methods used to correct for bias due to nonresponse and to make efficient use of data.

Weighting
Using weighting method the original inclusion probabilities π k are deflated by the response probabilities κ k and new sampling weights w k = (π k κ k ) −1 , k ∈ U , are obtained. The original response probability is never known in practice, so there are several methods to estimate it. One of them is called a weighting-class, where the response sample s (r) and sample s are divided into G mutually exclusive and homogenous (with respect of the response rate) groups s (r) g and s g , g = 1, . . . , G, with the same response probability κ k for the unit in the same group: Another method for estimating the response probability is to apply a logistic regression model [1]:κ HereB is the maximum likelihood estimator of the parameters of the logistic regression model based on the data , and z k = 0 otherwise. When weighting methods for nonresponse adjustment are applied in the estimation of the domain total, the correction of estimators (2) and (3) should be made by replacing sampling weights w k withŵ k = (π kκk ) −1 , k ∈ s, not only in equations (2) and (3), but also in calculation of the parameters of the model (1).

Imputation
Another method to adjust nonresponse is to impute a value for the missing unit. There are many types of imputation methods, which can be divided into three main groups: 1) Logical imputation (deductive). It is a part of the editing process and is used when reliable, explicit solution exists given appropriate assumptions.
2) Real donor imputation. Here the imputed observation value is borrowed from another respondent. The most common real donor imputations are the nearest neighbors and the random donor imputation. For the nearest neighbor imputation, a missing value y k is imputed by choosing that value y l which corresponds to the value x l closest to x k . The closest value is determined by the distance between any two response values (d kl = J j=1 (x kj − x lj ) 2 , k ∈ s\s (r) , l ∈ s (r) ). For the random donor imputation the data are divided into homogenous groups by a suitable method and the donors are chosen randomly within these groups.
3) Model based imputation. Here the imputed observation value is calculated using the model with the coefficients estimated from the response sample s (r) . The most common method is a regression imputation.
Imputation methods can also be classified as a single imputation (when one value is imputed instead of missing one) or multiple imputation. Multiple imputation produces several imputed datasets and instead of the missing value a mean of imputed datasets is used.
Let us denote a new variable y * which values y * k are equal to y k , if k ∈ s (r) , or y imp k , if k ∈ U \s (r) . Here y imp k can be a single value, if single imputation is used, or the mean of imputed datasets, if multiple imputation is used. Then estimators (2) and (3) can be written as follows:

Simulation study
For the simulation experiment, a real population from Statistics Lithuania is used. The quarterly survey on short-term statistics on service has been taken. The population includes N = 1660 enterprisers, which filled questionnaire in the first quarter of 2008 year. Every record consists of such variables: region of residence, income for the first quarter of 2008 year, number of employees in the same quarter, value-added tax (VAT), classification of economic activities in the European community (NACE). The income is chosen as the study variable y. Let y k denote the value of y for the kth enterpriser, k = 1, . . . , N . The parameter of interest is total income in each region (domain totalt d ). There are D = 14 regions of interest. To improve the quality of the estimators 7 auxiliary variables were used: number of employees (x 1 ), VAT (x 2 ) and indicator of the NACE (x 3 − x 7 ). 1000 independent samples of 80 elements are drawn from the population by simple random sampling without replacement (SRS).
The GREG and the EBLUP estimators are used to estimate the domain total. Each estimate is calculated several times using different methods of nonresponse adjustment. These differences are denoted by adding two letters, LL ∈ {WC , LR, RD, NN , CR, DR}, and number, R ∈ {0, 1, 2}, at the end of the estimate's name (GREG-LLR or EBLUP-LLR). The meaning of these abbreviations is described below. The weighting-class method (WC) and the logistic regression model (LR) are applied to estimate response probability. Also, the performance of different imputation methods (random donor (RD), nearest neighbors (NN), regression imputation using the common model (CR) [2] and regression imputation using the model with domain-intercepts (DR) [2]) is investigated. For weighting-class, random donor and nearest neighbors methods units are grouped by the number of employees and indicator of the NACE. For the logistic regression model auxiliary vector x with values x k = (1, x 1k , x 2k , x 3k , x 5k ) ′ is used. For the regression imputation the mean of imputed datasets with 5 values is applied. For the common model auxiliary vector x with values x k = (1, x 1k , x 2k , x 3k , x 4k , x 5k , x 6k ) ′ is used. For the model with domain-intercepts auxiliary vector x with values x k = (I 1k , . . . , I Dk , x 1k , x 2k , x 3k , x 4k , x 5k , x 6k ) ′ is applied. Here I dk = 1 if unit k belongs to d domain and I dk = 0 otherwise, d = 1, . . . , D.
Each of these methods of nonresponse adjustment is applied for two populations constructed from the real population (indicated by the number R = 0) by generating different response rate. The response rates of 89% and 79% are generated for the first (R = 1) and the second (R = 2) populations, respectively. These rates represents the response rate in the survey (actually the response rate depends on region, number of employees and NACE). To

Conclusions
The results in the tables show that the real donor imputation (nearest neighbor and random donor) are the worst methods, since they increase bias and MRRMSE more than the others methods. The weighting methods (weighting-class and logistics regression) yield the best results in small area estimation with nonresponse. Alike the other nonresponse adjustment methods, they do not so much depend on the nonresponse rate.