Adaptive composite estimation in small domains

Small area estimation techniques are used in sample surveys, where direct estimates for small domains are not reliable due to small sample sizes in the domains. We estimate the domain means by generalized linear compositions of the weighted sample means and the synthetic estimators that are obtained from the regression-synthetic model of fixed effects, based on the domain level auxiliary information. In the proposed method, the number of parameters of optimal compositions is reduced to a single unknown parameter, which is further evaluated by minimizing an empirical risk function. We apply various composite and related estimators to estimate proportions of the unemployed in a simulation study, based on the Lithuanian Labor Force Survey data. Conclusions on advantages and disadvantages of the proposed compositions are obtained from this empirical comparison. 


Introduction
The overall sample size of a survey is usually designed to produce reliable estimates of parameters for the whole survey population and for some of its domains, where the reliability is understood as the acceptability of errors of the estimators. However, if there are additional needs to get estimates under more detailed subdivisions of the population, traditional direct estimation methods are not suitable because the direct estimators are based only on the domain sample and the obtained sample sizes can be too small in some of the new domains. For example, in the Lithuanian Labor Force Survey (LFS), the estimated errors of the domain-specific direct estimates of proportions of the unemployed are acceptable at the county level, but the accuracy of the direct estimates is poor at a lower municipality level. It seems that this problem can be solved by allocating the sample according to the subdivision by municipalities, but then the total sample size is too large for the budget of the survey. c 2020 Author. Published by Vilnius University Press This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Domains or areas of the population are called small if the unbiased or approximately unbiased direct estimators of parameters have unacceptably large variances in these domains. In the small area estimation (SAE) theory, see [24], alternative indirect estimators, based on implicit or explicit linking models, are considered. Through auxiliary information available from administrative sources or the previous surveys, these estimators borrow sample information from neighbor domains that increase an effective sample size and hence reduce the variances of estimators in the small domain. To efficiently link the additional sample data to the domain of interest, good auxiliary information at the domain or unit level is essential for the underlying models.
Two important classes of the indirect estimators can be distinguished: synthetic and composite estimators. The traditional synthetic estimators are based on the implicit linking models, which rely on the assumptions that small domains resemble the whole population or larger domains. That is, as defined in [10,24], an estimator is called a synthetic estimator if a reliable direct estimator for a large domain, covering several small domains, is used to derive an estimator for a small domain under the assumption that the small domains have the same characteristics as the large domain. For example, assuming that the true proportions of the unemployed are the same for all municipalities covered by a single county, the direct estimator of that county is a synthetic estimator for each of these municipalities. We consider a different example of the synthetic estimator in Section 2.2, called as a regression-synthetic estimator. The synthetic estimation techniques have been used historically since about the 1970s [10,11], and they are still popular because of their simplicity and wide applicability [24]. Unfortunately, such an estimation is biased if the implicit model is not approximately true. Therefore, in order to find a trade-off between a bias of the synthetic estimator and a large variance of the direct estimator, the synthetic estimator is linearly combined with the direct estimator as started to be considered in [8,22] and also in [9], among others. So the specificities of estimation domains are reflected by the direct components of the composite estimators. Many of the estimators, proposed in the later literature, are also linear combinations [24]. There are also new developments as well as generalizations in the synthetic estimation, for instance, [23,28]. More complex composite estimators are constructed in [15].
Assuming that the auxiliary information is available only at the domain level, we focus ourselves on the estimation of coefficients in composite estimators of domain means. In Section 2.3, we discuss the known empirical approximations to the optimal linear combination, where the direct component is represented by the weighted sample mean from Section 2.1 and the synthetic part is the regression-synthetic estimator, and we offer an adaptive approach to the approximation method of [8]. In [12], estimators are defined using the values of parameters, minimizing the estimated mean square errors (MSEs) that are called the empirical risk functions. Following that idea, our adaptation is based on the minimization of an empirical risk function. We also consider generalized linear compositions and propose analogous adaptive estimators of their coefficients. That is, in Section 2.4, we extend the estimation method of [4]. The regression-synthetic model of fixed effects, which we incorporate in all the compositions, is a special case of the famous Fay-Herriot (FH) domain level model [9]. The FH model is an example of the explicit linking models, and its developments are reviewed in [5]. Estimating additional domainhttp://www.journals.vu.lt/nonlinear-analysis specific parameters, the authors of [9] derived the empirical best linear unbiased predictors (EBLUPs) of the domain means, see Section 2.5. Since these indirect estimators are expressed as composite estimators, we include them in further comparisons.
To compare the presented synthetic and composite estimators, the Lithuanian LFS data are used in Section 3. It is clear from a survey of literature that quite various SAE techniques have been analyzed and applied in LFSs of different countries. We shall mention only several instances, related to our context of the study. Earlier examples of [11] and [8] on the estimation of unemployment rates are based on the United States and Canada data, respectively. In the more recent study in [1] of the Italian LFS data, the traditional synthetic estimation is applied. Using the Spanish data, the authors of [19] applied much more complex SAE models, and the paper [26] compared various methods, also including the composite estimation of [8]. The synthetic and composite estimators have been analyzed using the Iranian data in [13]. Among other estimators, EBLUP, based on the FH model, is applied to the Dutch [2], Lithuanian [14], and Swiss [21] unemployment data. The conclusions of our comparative simulation study are discussed in Section 4.
2 Synthetic and composite estimation 2.1 Basic notation The set U = {1, . . . , N } consists of the labels of elements of the finite survey population. Let y be a continuous or discrete study variable with the fixed values y 1 , . . . , y N , assigned to the corresponding elements. In order to estimate the means or proportions in the population and its subsets, the sample s = {k 1 , . . . , k n } ⊂ U of size n < N is drawn according to the sampling design, denoted by the function p(·), with the values p(s) specifying the probabilities of drawing all possible samples. The inclusion into the sample probabilities π k = P p {k ∈ s} = s: k∈s p(s), k ∈ U, are assumed strictly positive. Here the symbol P p , and hereafter E p , var p , and MSE p denote probability, expectation, variance, and MSE according to the sampling design, respectively. That is, the probability distribution of any estimator, calculated from the sample s, is induced by the given sampling design p(·). Therefore, the characteristic var p (·) is called the sampling variance or design variance, and the like.
Let U = U 1 ∪ · · · ∪ U M be a partition of the population into M domains, where U i ∩ U j = ∅ as i = j. There are N i elements in the domain U i , and M i=1 N i = N . In general, if p(·) generates samples without replacement, the sample s i = s ∩ U i is of the random size 0 n i N i . Here s = s 1 ∪ · · · ∪ s M and M i=1 n i = n. We consider the means as target parameters, where N i are assumed to be known. In the cases where the sampling design p(·) was not constructed in order to ensure samples of fixed sizes in the domains, small (or even zero) n i can be obtained. Then the accuracy of direct estimators of (1), such as the weighted sample meanŝ is questionable because their variances are large.

Synthetic estimator
We assume that, for the domain U i , the auxiliary information is available as the vector of characteristics x i = (x i1 , . . . , x iP ) . This assumption narrows a choice of direct estimators to that similar to (2) where β = (β 1 , . . . , β P ) are unknown parameters. Here, assuming the deterministic relation µ i ≈ x i β, the random variable ε i is treated as a sampling error in the domain. Thus, we write E p (ε i ) = 0 and var p (ε i ) = var p (μ H i ) = v 2 i . Despite that the sampling errors may not be independent if the domains cut across the sampling design, the model is usually simplified further assuming that the errors are independent and normally distributed, that is, ε i ∼ N (0, v 2 i ), and we fix these assumptions, too. Similar assumptions are used in the more general domain level model [24,Sect. 4.2], and we consider its special case including only the fixed effects. Having consistent estimatorsv 2 i of the sampling variances v 2 i , denote bŷ the feasible generalized least squares estimator of the parameter β, obtained from the data Second, the synthetic estimators of (1) are defined aŝ These estimators rely on the synthetic assumption that the parameter β is the same across all domains. Therefore, though their variances are generally small, compared to that of direct estimators (2), contributions of biases to MSEs of (4) can be substantial if the synthetic assumption is not realistic. Moreover, the biases not necessarily decrease as the sample sizes n i increase.

Composite estimators
To find a compromise between larger variances of approximately unbiased estimators (2) and biases of synthetic estimators (4), their linear compositions with M unknown coefficients 0 b i 1, are a very common approach considered in the literature. Further, following [12], we use the notion "risk function" when dealing with MSE functions or their combinations. Minimizing the risk function MSE p (μ C i ) with respect to b i , the optimal weight is expressed as the population characteristic [24] b where the approximation is based on the assumptions that and thatμ H i is nearly unbiased. However, even using the last standard approximation, evaluation of (6) from the sample data is seldom efficient because it is difficult to construct stable estimators of MSE p (μ S i ) without additional assumptions on similarities among domains as originally proposed in [20], see [24], or an additional modeling like in [16]. Straightforward estimators of MSEs of (4) are [24, Sect. 3.2.5] whereσ 2 (·) stands for an estimator of the design variance var p (·), and their approxima- derived using the assumption var p (μ S i ) v 2 i . However, these estimators are unstable for small sample sizes because they include direct estimators (2) of the domain means. Therefore, applying the approximation from (6) and simplified formula (8), the resulting estimatorsb * of optimal coefficients (6) are unstable as well, and the efficiency of compositions may be lost.
To get robust estimators of compositions (5), convenient for automatic applications, one can reduce the number M of unknown parameters. One of the known methods, proposed in [22], is to assume a common weight b i = φ in (5) and minimize the risk Replacing unknown population characteristics by (8) in the expression of minimizer of the approximated risk [24, should be more stable as compared to (9). The second method is sample size dependent (SSD) estimation. According to the proposal of [8], the weights b i = φ i (δ) in (5) also depend on the single parameter δ, that is, In this case, the problem of explicit minimization of the risk function with respect to δ is complicated. Therefore, subjective choices of the parameter, such as general δ = 1, are proposed in the literature. On the other hand, the empirical risk function based on estimators (7) of MSEs, where the compositionsμ C i =μ C i (δ) are treated as synthetic estimators, can be minimized numerically. The use of simplification (8) is not suitable here because then the empirical risk would be minimized for values of δ close to zero. The solutionδ * = arg min δ>0 r 2 (δ) should give more efficient weightsφ * i = φ i (δ * ) than that, based on a predefined value of δ. We examine the latter estimation method in the simulation study.

Adaptive generalized composite estimator
Since parameters (11) of compositions (5) are estimated, as well as (9) and (10) are based on the given sample, the resulting estimated compositions are viewed as adaptive estimators. Consider the adaptive estimation of generalized compositions of (2) and (4) of the form with the coefficients 0 Similar combinations are discussed in [18], pointing that simple compositions are improved in this way, due to the larger flexibility when combining estimators. However, straightforward evaluations of optimal coefficients of general compositions lead to instabilities, similar to that in the case of simple combinations (5), and a large number of parameters adds more complexity. Indeed, there are two matrices of unknown weights in (13). Our idea is to construct estimators of the weights a ij and b ij , dependent on a single unknown parameter for all i, j = 1, . . . , M , similarly as in SSD estimation of [8], and to evaluate the optimal value of this parameter by minimizing an empirical risk function.
Given the sample s, let us create two sets of size 2M that contain the known numbers.
to the values of set R, where θ 0 is a new parameter. Using the fixed estimatesv 2 i , taken from (4), write the stochastic model are independent random components, which simulates the dependence between the transformed valueμ HS i , modeled as the random variable Y i , and the corresponding valuê τ i , taken from the set P. Here N (0, 0) denotes the degenerate distribution with zero mean. Model (14) is an estimated and extended regression (3), which generates synthetic estimates (4) and their randomized analogues. For the value θ = 1 and the set of indices I H , this model imitates direct estimates (2). Introduce the characteristics of random variables (14): A similar interpretation is that the probabilitỹ p i;j describes the distance from the domain U i to the domain U j of interest by means of the difference in their sizesτ i andτ j , taking into account the distances among all domains. A relatively large value of this probability suggests that both these domains are similar, that is, relatively more information can be borrowed from the domain U i to increase the effective sample size in the domain U j . Therefore, we use characteristics (15) as the weights in the new family of estimators of means (1), defined as which depends on the parameter θ. Here k 1 , . . . , k M are the original labels of domains. Since these estimators are alternatively expressed as j = 1, . . . , M , they are partially estimated generalized compositions (13). That is, there is a single unknown parameter θ in estimators (16) instead of 2M 2 unknown coefficients in the general combinations. Probabilities (15) also depend on the estimated sampling variances (variabilities)v 2 i , i ∈ I H , of the direct estimators. A domain with a relatively large variability tends to borrow sample information from a larger number of neighbor domains. The tuning constant θ controls this number of neighbors through model (14). The probabilitiesp i;j (θ) tend to I(i = j) as the smoothing parameter θ → 0, and then (16) reduces to regression-synthetic estimators (4). Here I(·) is the indicator function. In the latter case, all the domains have no neighbors according to our interpretation.
Due to the complexity of weights (15), dependent on the sample, it is difficult to get expressions of the risk function R 3 (θ) = M j=1 MSE p (μ HR kj ). The minimizer θ * of this function can be estimated by minimizing the sample-based risk function just like we proposed for SSD estimation of [8], or its approximation Given the sample, function (18) is minimized numerically with respect to θ. Here, in each iteration, for the fixed value of θ, the matrix of probabilities (15) is evaluated by independently simulating a large number of samples of size 2M from (14). That is, for each of these samplesỹ 2M , and then we apply the approximatioñ Denote the solution byθ * = arg min θ 0 r 3 (θ). The use of (19) instead of (18) lets us avoid repetitive calculations of possibly complex (for instance, bootstrap) estimatesσ 2 (·).
The obtained estimateθ * of the smoothing parameter θ is plugged into (16) to define the applicable estimatorsμ  (17). It is easy to check that the identity M i=1 (â kikj +b kikj ) = 1 holds.
Let us return back to the previous numeration of domains. To estimate MSEs of the proposed adaptive estimatorsμ HR j , the general method of [20] can be applied, where squared design-biases of estimators are assumed approximately equal in the domains. The application leads to the estimators of MSE p (μ HR j ), having the expressions j = 1, . . . , M . In the case of minimizerθ * obtained from (19), the formulas  (20) and (21).

EBLUP under the Fay-Herriot model
Recall model (3) and its assumptions. Including domain-specific random effects a i , i = 1, . . . , M , assumed independent of the sampling errors, into that model, FH model [9] is obtained, where the random variable Y * i models the direct estimateμ H i . Here the random effects are supposed independent and identically distributed with E(a i ) = 0 and var(a i ) = τ 2 in respect of a distribution, different from that generated by the sampling design.
Then, treating the estimatesv 2 i as given numbers, EBLUPs of means (1), according to [9], are the compositionŝ whereτ 2 is an estimator of the variance τ 2 of the domain-specific random effects.

Applications in Labor Force Survey
To compare efficiencies of the discussed direct, synthetic, and composite estimators, also including EBLUPs (23), we carry out the simulation experiments using the Lithuanian A.Čiginas LFS data of the first quarter of 2018. A part of these sample data is public [17], and Statistics Lithuania granted the permission to use some other variables according to the agreement No. STAT-132(2019). We have narrowed the sample to individuals from age interval 15-70, and we also removed a part of municipalities from the data because of too small fractions of the unemployed. The fixed artificial population, created by replicating the remaining sample data 6 times, consists of N = 51226 persons. The study variable y is binary: y k = 1 if the kth individual is unemployed according to the survey definition, and y k = 0 otherwise, k ∈ U. Therefore, the target means µ i , i = 1, . . . , M , are proportions of the unemployed in M = 36 municipalities. To create model (3), auxiliary variables were selected using a stepwise regression, similarly as in [14], where the true values µ i were taken as the values of a dependent variable. The obtained model has the coefficient of determination equal to 0.52, and it is based on the known auxiliary vectors , where x i2 is the proportion of unemployed (registered unemployed individuals) derived from the administrative Lithuanian Labor Exchange data, x i3 is the proportion of urban population, and x i4 and x i5 are the proportions of individuals from age intervals 15-25 and 41-55, respectively. Figure 1 shows the relation between the true proportion of the unemployed µ i and the auxiliary proportion x i2 , which is the best regressor in model (3), and it might be used alone for the prediction. It is seen that both proportions of the unemployed tend to be larger for smaller municipalities.
In the real survey, the sample of households is drawn with probabilities proportional to the number of their members, and then entire households are surveyed. For simplicity, we consider the simple random sampling without replacement p(·), which is also sufficient to get reliable direct estimates of the proportions for the subdivision of U by counties according to the primary aim of the survey, but the expected sample sizes are too low for a part of municipalities. For the given sampling design, weighted sample means (2) reduce to the unbiased estimatorŝ of the parameters v 2 i are consistent as the population and sample sizes approach the infinity in some way. However, they are unstable for small sample sizes as well as their components. We apply a generalized variance function method [27] in order to smooth estimators (24). More specifically, since the direct estimatorsμ H i of domains with smaller sizes N i have larger sampling variances v 2 i because of smaller sample sizes, and due to the mentioned allocation of the domains by their sizes N i in Fig. 1, we choose the assumption v 2 i ≈ KN γ i , such as in [7]. We estimate the parameters K > 0 and γ ∈ R using the regression equation assumed as independent errors, where the random variables V 2 i model estimates (24). In practice, from the pairs of data (v are smoothed estimates of v 2 i , which we apply to derive regression-synthetic estimators (4). The estimatorsμ H i andμ S i are components of the variants of compositions (5) and (16), compared in the numerical survey: , based on optimal weights (6); (ii)μ uns where φ * is a minimizer of the true risk R 1 (φ); (iv)μ PK i =μ C i (φ * ) with sample-based minimizer (10); 5)) at the fixed value δ = 1.5; with the minimizerδ * of empirical risk (12); (viii)μ HRopt with the minimizerθ * of simplified empirical risk (19).
Here the optimal values b * i , φ * , δ * , and θ * of theoretical interest are obtained by evaluating the corresponding estimators for a large number of samples drawn from the population, and searching for the minimum of the respective true risk function. To compute the variancesσ 2 (·) in the empirical risk of case (vii), we apply a resampling procedure from [3]. That is, the number of artificial populations of size N is reconstructed from the given sample, and a larger number of bootstrap samples is drawn from each of them using original p(·). Then the bootstrap variance is the variance of values of the required estimator, that are calculated from all the samples, multiplied by the finite population correction factor (N −n)/(N −1). If risk (18) were used in (ix), the bootstrap is applicable as well.
In formula (23) of the estimatorsμ FH i , we use the iterative FH moment estimatorτ 2 of the variance τ 2 as proposed in [9]. The choice is based on the analysis made in [6].
We compare the efficiencies of estimators by evaluating them for each of R = 10 4 independent samples, selected from the survey population, and then estimating their root mean squared errors (RMSEs) and absolute biases (ABs). That is, we use the accuracy measures i is a realization of the specific estimatorμ i of µ i , based on the rth Monte Carlo sample.
The estimatorsμ H i ,μ S i , (i)-(ix), andμ FH i of interest are denoted hereafter by their respective superscripts. We compute their characteristics (25) for two different overall sample sizes: n = 2000 and n = 5000. In each of these two situations, the domains are classified by the expected domain sample size into three classes of equal size, and the average of RMSEs as well as ABs is calculated over domains of each class. These results for different n are reported in Tables 1 and 2, respectively, where the averages of (25) over all domains are also presented as common measures of accuracy.
In the tables, estimator H acts as a reference. It is seen that such a direct estimation is not reliable at the municipality level of the survey. In contrast, regression-synthetic estimator S is much more efficient in both tables, despite that its ABs are comparatively large. Treating RMSE as a principal measure of accuracy, optimal composition opt would be the best choice in all the classes of domains. Then the respective errors of other theoretical estimators PKopt, DSCopt, and HRopt inform us what we lose due to reduction of the number of parameters to the single unknown parameter. If to compare the errors of empirical analogues PK, DSC, and HR with respect to that of synthetic estimator S, composition HR improves S in two classes of smaller domains in both tables, while compositions PK and DSC are worse than S in almost all the classes of domains. Adaptive composition DSC is a significantly better choice than simple case DSCfix, but with alternative PK more stable results are obtained, despite that all the accuracy characteristics of corresponding DSCopt and PKopt are very similar. On the other hand, ABs of DSC are lower than that of PK in both tables. Naive composition uns improves direct estimator H, but yields similar results to that of composition DSCfix. Comparing more reliable and applicable estimators S, PK, DSC, HR, and FH, compositions PK and DSC appear relatively good only in the largest domains, and estimator FH is better than S in smaller domains. Generalized composition HR outperforms FH as n = 2000, and the accuracy of both these compositions is similar for a larger overall sample size ( Table 2). In the case of smaller n, a worse performance of compositions PK and DSC may be explained by a relatively small contribution of the direct components as seen from    Table 3. In this case, errors of the estimated optimal parameters are large compared to the sizes of the true values. The values of δ * are larger than the blindly chosen value δ = 1.5 in DSCfix.
To get a deeper insight into differences among reliable estimators S, PK, DSC, HR, and FH, we also present the distributions of their RMSEs in the classes of domains. The situations n = 2000 and n = 5000 are shown in Fig. 2 and 3, respectively. One can conclude from both figures that RMSEs of estimators PK, DSC, and FH are more homogeneous across the domains than that in the case of estimators S and HR. On the other hand, the latter two estimators tend to yield good estimation results for a part of domains in their different classes. That is, the prediction by regression-synthetic estimator S is efficient for large domains, and it is well explained by the scattering in Fig. 1, and adaptive generalized composition HR is effective for a large part of domains from the middle class because there is a larger number of similarly small domains (Fig. 1), and therefore sample information is flexibly shared among them according to the construction of the estimator.

Conclusions
The adaptive version of the composite estimation of [8] may be useful for routine applications, where there is a larger number of survey variables. For repeated surveys, nonadaptive alternatives to the proposed composite estimator could be constructed. In that case, the unknown parameter would be estimated treating a sample of the previous period as an independent data source. Comparing our simulation results with that of [8,26], the present conclusion is that there is no universal prior choice of the parameter value even for similar surveys.
The new adaptive composite estimators, introduced in Section 2.4, strongly depend on the underlying regression model. Therefore, the accuracy of these generalized compositions should be similar to that of the regression-synthetic estimators. Indeed, considering the simulation results, we observe the improvement of the synthetic estimators, but a common gain does not become significant if the synthetic estimators are relatively inefficient.
From additional simulation tests we conclude that both empirical risk functions give similar results in the case of the generalized composition, and thus, the use of the simplified risk is sufficient, while the evaluation of the empirical risk function is computationally intensive in the adaptive SSD estimation. On the other hand, the construction of weights of the generalized compositions is complex itself.
The choices of (empirical) risk functions are not unique. The functions should be consistent with the accuracy characteristics of estimators we are interested in. For instance, if inferences are based on the relative RMSEs, then the considered empirical risk functions should be properly normalized by the corresponding estimators.
Applying the composite estimators in Lithuanian LFS, their success depends on the overall sample size. The results of the proposed generalized composite estimator are very good for smaller domains and where other composite estimators, including EBLUP, are less efficient due to the small overall sample size. In some situations, EBLUP of [9] yields quite reliable estimates for smaller domains, while the synthetic estimation appears efficient for large domains.
An efficient estimation of MSEs of the considered synthetic and composite estimators is complicated in general. There exist only several universal proposals and it seems that there are no new ones in recent literature, except the case of popular EBLUP. The simulation of LFS shows that the biases of estimators are neither negligible nor similar across the domains. Therefore, the empirical MSEs like (20) and (21) may be insufficient for applications. We believe that the estimation of errors can be improved by applying data-specific modeling, but it is outside the scope of the paper.