Estimation of the current population total for a four-phase sampling design

Abstract. The combined ratio-type estimators of the finite population total and their variances in the case of sample rotation for two-phase and four-phase sampling schemes are constructed in the paper. Combined estimators of the finite population total without and with the use of auxiliary information known from the previous survey are built. Two types of sampling design are used for sample selection in each of the phases: simple random sampling without replacement and successive sampling without replacement with probabilities proportional to size. A simulation study, based on the real data, is performed, and the accuracy of the estimators proposed is compared.


Introduction
We study a sample survey in which information is collected regularly on the same population in subsequent time periods, with partial replacement of the sample. Such repeated surveys (consecutive measurements of the same population) are used in social studies, official statistics, forestry, medicine, etc. The Labour Force Survey (LFS) provides estimates of the number of employed and unemployed individuals for each quarter of the year. Repeated sampling from a finite population (or sample rotation) is the sampling procedure usually used for this survey.
Let us denote a finite household population by $U = \{1, \ldots, i, \ldots, N\}$ of size $N$. For each household, the number of its members is denoted by $m_i$, $i = 1, 2, \ldots, N$. The total number of household members is $M = \sum_{i=1}^{N} m_i$. Let us suppose that the survey variable $y$ is the number of employed (or unemployed) individuals in each household. The values $y_i$ of the variable belong to the set of integers $\{0, 1, \ldots, m_i\}$. The parameter of interest is the total number of employed (or unemployed) individuals

$$t_y = \sum_{i \in U} y_i. \tag{1}$$

© Vilnius University, 2016

The previous survey data can be used as auxiliary information for the estimation of the population total in order to reduce the variance of the estimator. The efficiency of ratio estimators in the case of any sampling design is discussed in Särndal et al. [21]. Combined estimators of the finite population total without and with the use of auxiliary information known from the previous survey are constructed. Sample rotation and two-phase sampling (double sampling) are similar procedures. Sample rotation means that the sample of the current occasion is a union of subsamples: one of them is matched with the elements of previous occasions, and the other is new and unmatched with the previously studied elements. A sample under a two-phase sampling design is matched with the first-phase sample.
In this paper, the construction of the combined estimator of the finite population total and its variance in the case of sample rotation is analyzed for two-phase and four-phase sampling schemes.
If the auxiliary variable is well correlated with the study variable, then more accurate estimates of the parameter can be obtained. A two-phase sampling design and estimators of the total with the use of auxiliary information are given in Särndal et al. [21]. They are applied here to the Lithuanian LFS data in the case of a simple random sample of individuals, and the current paper is a further development of [7].
The estimators for a total in the case of the three-phase sampling design are presented in Fuller [9] and Singh [22]. Jeyaratnam et al. [13] studied multiphase sampling for the stratification and efficient allocation of the sample size. Various problems are solved for sample data from two occasions. One of them is the optimal choice of the second-occasion sampling design in order to minimize the variance of the estimator (Arnab [3]). Hamad et al. [10] use two auxiliary variables for the estimator of the total in a two-phase sampling design. Subsampling of nonrespondents with the corresponding estimators is also a case of estimation under a two-phase sampling design (Okafor and Lee [16]) or under a three-phase sampling design (Hidiroglou and Estevao [11]). Artes and Garcia [4] studied the estimators of the ratio under sampling on two occasions with partial replacement of the elements. Close attention is paid to variance estimation for the estimator of change in the finite population parameter in repeated surveys. There are many studies on this topic, for example, Berger [5], Andersson [2], and Qualité [17]. Fattorini et al. [8] studied a special three-phase sampling strategy for the estimation of forest biomass.
Combined estimators of the population total are obtained by taking a linear combination of ratio estimators, using the idea of $\pi^*$ estimators (Särndal et al. [21]) for a multi-phase sampling design and the Horvitz-Thompson estimator (Horvitz and Thompson [12]) or the ratio estimator. Two types of sampling design are used here for sample selection in each of the phases: simple random sampling without replacement and a successive sampling (unequal probability sampling without replacement) procedure proposed by Rosén [18]. The second-order inclusion probabilities for a successive sampling design are approximated by the corresponding probabilities for conditional Poisson sampling. The results of Aires [1] and Bondesson et al. [6] are used for this. The inclusion probabilities obtained are then used to calculate the estimates of the totals and the estimates of their variances. A simulation study, based on real population data, is performed, and the estimators proposed are compared.

http://www.mii.lt/NA

Sample rotation and sample selection
The LFS at Statistics Lithuania is conducted continuously with a quarterly selected sample. All members of a household are included in the sample for two subsequent quarters, excluded from the sample for the next two quarters, and included once more in the sample for two other quarters. It means that one-fourth of the sample of the previous quarter is replaced by a new one each quarter of the year, as shown in Fig. 1.
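The in-for-two, out-for-two, in-for-two pattern described above can be sketched as follows. This is only an illustration under the assumption (not stated explicitly in the text) that exactly one rotation group of equal size enters the survey each quarter; the function names are hypothetical. Under this assumption, of the four groups surveyed in a quarter, one is new, one returns after the two-quarter break, and two were also surveyed in the previous quarter.

```python
def quarters_in_sample(entry_quarter):
    """Quarters in which a group entering at `entry_quarter` is surveyed
    (two quarters in, two quarters out, two quarters in)."""
    return [entry_quarter + k for k in (0, 1, 4, 5)]


def groups_in_sample(quarter):
    """Entry quarters of the rotation groups surveyed in `quarter`."""
    return sorted(quarter - k for k in (0, 1, 4, 5))


def rotation_summary(quarter):
    """Split the groups surveyed in `quarter` by their survey history."""
    current = set(groups_in_sample(quarter))
    previous = set(groups_in_sample(quarter - 1))
    # Groups surveyed in any of the six preceding quarters.
    seen_before = {g for q in range(quarter - 6, quarter)
                   for g in groups_in_sample(q)}
    return {
        "matched_with_previous_quarter": sorted(current & previous),
        "returning_after_break": sorted((current - previous) & seen_before),
        "new": sorted(current - seen_before),
    }


print(rotation_summary(10))
```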
The sample selection procedure is performed as shown in Fig. 2.
It is seen in the sample selection scheme presented in Fig. 2 that the whole sample $s$ is a union of four subsamples: $s_1$, $s_2$, $s_3$ and $s_4$. The selection of the subsamples at each of the phases is described in Steps 1-4 below. We are interested in the estimation of the population total $t_y = \sum_{i \in U} y_i$ for a study variable $y$. Firstly, we construct four separate design-based estimators of the total using data of the samples $s_1$, $s_2$, $s_3$ and $s_4$, respectively. Secondly, we propose a combined estimator of the total using sample rotation schemes in Section 4.
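Step 1. The sample $s_1$ is selected from the finite population: $U \to s_1$. The corresponding first- and second-order inclusion probabilities for elements of the sample $s_1$ are denoted, respectively, by

$$\pi_{1i} = P(i \in s_1), \qquad \pi_{1ij} = P(i \in s_1, j \in s_1), \qquad i, j \in U. \tag{2}$$

An unbiased design-based Narain [15] and Horvitz-Thompson [12] estimator of the population total is used:

$$\hat{t}^{HT}_{1y} = \sum_{i \in s_1} \frac{y_i}{\pi_{1i}}. \tag{3}$$

The variance of the estimator $\hat{t}^{HT}_{1y}$ and its unbiased estimator are

$$\mathrm{Var}(\hat{t}^{HT}_{1y}) = \sum_{i \in U} \sum_{j \in U} (\pi_{1ij} - \pi_{1i}\pi_{1j}) \frac{y_i}{\pi_{1i}} \frac{y_j}{\pi_{1j}}, \qquad \widehat{\mathrm{Var}}(\hat{t}^{HT}_{1y}) = \sum_{i \in s_1} \sum_{j \in s_1} \Bigl(1 - \frac{\pi_{1i}\pi_{1j}}{\pi_{1ij}}\Bigr) \frac{y_i}{\pi_{1i}} \frac{y_j}{\pi_{1j}}, \tag{4}$$

where $\pi_{1ii} = \pi_{1i}$.

The values of the study variable $y$ in the previous survey can be used as auxiliary information. Let us denote the study variable of the previous survey (−1 wave) by $x$ with the values $x_i$ and the same variable on the current 0 wave by $y$ with the values $y_i$, $i \in s_1$. We can form the ratio estimator $\hat{t}^{rat}_{1y}$ of the population total $t_y$ by

$$\hat{t}^{rat}_{1y} = t^{(-1)}_{1x}\, \hat{r}, \qquad \hat{r} = \frac{\hat{t}_{1y}}{\hat{t}_{1x}}. \tag{5}$$

We use here $\hat{t}_{1y} = \hat{t}^{HT}_{1y}$, $\hat{t}_{1x} = \hat{t}^{HT}_{1x}$, $t^{(-1)}_{1x} = \sum_{i \in U} x_i$. Some other approximately unbiased estimators $\hat{t}_{1y}$, $\hat{t}_{1x}$ will also be used in this situation below. The estimator $\hat{t}^{rat}_{1y}$ is nonlinear. Its approximate variance based on a Taylor linearization of the estimator is expressed as

$$\mathrm{AVar}(\hat{t}^{rat}_{1y}) = \sum_{i \in U} \sum_{j \in U} (\pi_{1ij} - \pi_{1i}\pi_{1j}) \frac{y_i - r x_i}{\pi_{1i}} \frac{y_j - r x_j}{\pi_{1j}}, \tag{6}$$

with $r = \sum_{i \in U} y_i / \sum_{i \in U} x_i$. The variance $\mathrm{AVar}(\hat{t}^{rat}_{1y})$ is estimated by using $\hat{r}$ given in (5).

Step 2. The sample $s_2$ is obtained in two-phase sampling: $U \to \bar{s}_2 = U \setminus s_1 \to s_2$. The corresponding first- and second-order unconditional and conditional element inclusion probabilities for samples $\bar{s}_2$ (first phase) and $s_2$ (second phase) are, respectively,

$$\pi_{\bar{2}i} = P(i \in \bar{s}_2) = 1 - \pi_{1i}, \qquad \pi_{2i|\bar{s}_2} = P(i \in s_2 \mid \bar{s}_2), \tag{7}$$

$$\pi_{\bar{2}ij} = P(i \in \bar{s}_2, j \in \bar{s}_2) = 1 - \pi_{1i} - \pi_{1j} + P(i \in s_1, j \in s_1), \qquad \pi_{2ij|\bar{s}_2} = P(i \in s_2, j \in s_2 \mid \bar{s}_2). \tag{8}$$

Under a two-phase sampling design, using the $\pi^*$ estimator defined in [21, Sect. 9.2], the population total $t_y$ is unbiasedly estimated by

$$\hat{t}^{(2)}_{2y} = \sum_{i \in s_2} \frac{y_i}{\pi_{\bar{2}i}\, \pi_{2i|\bar{s}_2}}. \tag{9}$$

In the case of two-phase sampling, the variance of the estimator $\hat{t}^{(2)}_{2y}$ may be expressed by conditional and unconditional variances and expectations:

$$\mathrm{Var}(\hat{t}^{(2)}_{2y}) = \mathrm{Var}\bigl(E(\hat{t}^{(2)}_{2y} \mid \bar{s}_2)\bigr) + E\bigl(\mathrm{Var}(\hat{t}^{(2)}_{2y} \mid \bar{s}_2)\bigr). \tag{10}$$

The variance $\mathrm{Var}(\hat{t}^{(2)}_{2y})$ can be estimated unbiasedly from the sample $s_2$ using the inclusion probabilities (7) and (8).

The values of the study variable $y$ in the previous survey (−3 wave) can be used as auxiliary information. Let us denote the study variable of the previous survey by $x$ with the values $x_i$ and the same variable on the current wave by $y$ with the values $y_i$, $i \in s_2$. We can form a ratio estimator $\hat{t}^{rat}_{2y}$ of the population total $t_y$ by

$$\hat{t}^{rat}_{2y} = t^{(-3)}_{2x}\, \hat{r}_2, \qquad \hat{r}_2 = \frac{\hat{t}^{(2)}_{2y}}{\hat{t}^{(2)}_{2x}}. \tag{13}$$

Here $t^{(-3)}_{2x} = \sum_{i \in U} x_i$ is the total of the variable $x$, which was the study variable $y$ in the previous survey (−3 wave). The estimator $\hat{t}^{(2)}_{2y}$ is given by (9). In the case of two-phase sampling, the variance $\mathrm{Var}(\hat{t}^{rat}_{2y})$ of the estimator $\hat{t}^{rat}_{2y}$ may also be expressed by conditional and unconditional variances and expectations, replacing the estimator $\hat{t}^{(2)}_{2y}$ by the estimator $\hat{t}^{rat}_{2y}$ in (10). Because $\hat{t}^{rat}_{2y}$ is a nonlinear estimator, the approximate variance $\mathrm{AVar}(\hat{t}^{rat}_{2y})$ is derived using the linear term of its Taylor expansion: it is the variance of the two-phase estimator of the total of the residuals $y_i - r x_i$, with the ratio $r$ given in (6). The variance is estimated in the same way from the residuals $y_i - \hat{r}_2 x_i$, with the estimator of the ratio $\hat{r}_2$ given in (13).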
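The Horvitz-Thompson and ratio estimators of Step 1 can be illustrated for the special case of simple random sampling without replacement, where every first-order inclusion probability equals $n_1/N$. The following is a minimal sketch, not the authors' code: the function names and the toy numbers are hypothetical, and the variance estimates use the standard closed form $N^2 (1 - n/N) s^2 / n$, to which the Horvitz-Thompson variance estimator reduces under this design.

```python
def ht_total(values, pi):
    """Horvitz-Thompson estimate: sum of y_i / pi_i over the sample."""
    return sum(v / p for v, p in zip(values, pi))


def ratio_total(y_sample, x_sample, pi, x_total):
    """Ratio estimate: known total of x times (HT total of y) / (HT total of x)."""
    return x_total * ht_total(y_sample, pi) / ht_total(x_sample, pi)


def srs_ht_variance_estimate(values, N):
    """Variance estimate of the HT total under simple random sampling
    without replacement: N^2 * (1 - n/N) * s^2 / n."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((v - mean) ** 2 for v in values) / (n - 1)
    return N ** 2 * (1 - n / N) * s2 / n


def srs_ratio_variance_estimate(y_sample, x_sample, N):
    """Linearized variance estimate of the ratio estimator:
    the SRS formula applied to the residuals y_i - r_hat * x_i."""
    r_hat = sum(y_sample) / sum(x_sample)
    residuals = [y - r_hat * x for y, x in zip(y_sample, x_sample)]
    return srs_ht_variance_estimate(residuals, N)


# Toy example (hypothetical numbers): N = 10 households, n = 3 sampled.
N = 10
y_sample = [1, 2, 3]      # current-wave values
x_sample = [1, 2, 2]      # previous-wave values for the same households
pi = [len(y_sample) / N] * len(y_sample)
print(ht_total(y_sample, pi))
print(ratio_total(y_sample, x_sample, pi, x_total=18))
print(srs_ratio_variance_estimate(y_sample, x_sample, N))
```

When $x$ and $y$ are strongly correlated, the residuals are small, which is why the ratio estimator can have a much smaller variance than the Horvitz-Thompson estimator.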
Step 3. The sample $s_3$ is obtained in three-phase sampling, with $\bar{s}_2$ as the first-phase sample, $\bar{s}_3$ as the second-phase sample and $s_3$ as the third-phase sample. The first- and second-order inclusion probabilities for the sample $\bar{s}_2$ (first phase) were given in Step 2; for the samples $\bar{s}_3$ (second phase) and $s_3$ (third phase), the inclusion probabilities are conditional on the sample of the preceding phase. Under a three-phase sampling design, using the $\pi^*$ estimator, the population total $t_y$ is unbiasedly estimated by the estimator $\hat{t}^{(3)}_{3y}$ (16), in which each $y_i$, $i \in s_3$, is divided by the product of its inclusion probabilities over the three phases.

In this step, the values of the study variable $y$ in the previous survey (−1 wave) can be used as auxiliary information. Let us denote the study variable in the previous survey by $x$ with the values $x_i$, and the same variable in the current wave by $y$ with the values $y_i$, $i \in s_3$. We can form a ratio estimator $\hat{t}^{rat}_{3y}$ of the population total $t_y$ as the product of the total $\sum_{k \in U} x_k$ of the variable $y$ in the −1 wave and the ratio of the three-phase estimators of the totals of $y$ and $x$. The estimator $\hat{t}^{(3)}_{3y}$ is given in (16).
Step 4. The sample $s_4$ is obtained in four-phase sampling. The first- and second-order inclusion probabilities for the samples $\bar{s}_2$ (first phase) and $\bar{s}_3$ (second phase) were described in Step 2 and Step 3; for the samples $\bar{s}_4$ (third phase) and $s_4$ (fourth phase), they are again conditional on the sample of the preceding phase. Under four-phase sampling, using the $\pi^*$ estimator, the population total $t_y$ is unbiasedly estimated by $\hat{t}^{(4)}_{4y}$, in which each $y_i$, $i \in s_4$, is divided by the product of its inclusion probabilities over the four phases. More complex estimators are presented below.
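The $\pi^*$ estimators of Steps 2-4 share one form: each sampled value is divided by the product of its phase-wise (conditional) inclusion probabilities. The sketch below illustrates this for the two-phase scheme $U \to U \setminus s_1 \to s_2$ with simple random sampling in both phases and checks unbiasedness exactly by enumerating all possible samples of a tiny population. The function names and the toy data are hypothetical, and the enumeration is only feasible for very small $N$.

```python
from fractions import Fraction
from itertools import combinations


def pi_star_total(sample, y, phase_probs):
    """pi* estimate: each sampled y_i is divided by the product of its
    inclusion probabilities, one (conditional) probability per phase.
    `phase_probs` is a list of dicts mapping unit -> probability."""
    total = Fraction(0)
    for i in sample:
        weight = Fraction(1)
        for probs in phase_probs:
            weight *= probs[i]
        total += Fraction(y[i]) / weight
    return total


def expected_two_phase_estimate(y, n1, n2):
    """Average of the pi* estimate over all equally likely pairs (s1, s2)
    when s1 and s2 are simple random samples without replacement."""
    N = len(y)
    units = range(N)
    estimates = []
    for s1 in combinations(units, n1):
        complement = [i for i in units if i not in s1]   # first-phase sample for s2
        first = {i: Fraction(N - n1, N) for i in complement}
        second = {i: Fraction(n2, N - n1) for i in complement}
        for s2 in combinations(complement, n2):
            estimates.append(pi_star_total(s2, y, [first, second]))
    return sum(estimates) / len(estimates)


y = [0, 1, 2, 1, 3]   # toy population with total 7
print(expected_two_phase_estimate(y, n1=2, n2=2))
```

For more phases, further dictionaries of conditional inclusion probabilities are appended to `phase_probs`; the estimator itself does not change.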

Combined estimators of the population total
The construction of the combined estimators of the finite population total (1) and of their variances in the case of sample rotation for two-phase and four-phase sampling schemes is presented in this section.
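1. By a linear combination of $\hat{t}_{1y}$ and $\hat{t}^{(2)}_{2y}$, we obtain the estimator $\hat{t}_2$ (19) of the total without the use of auxiliary information. The expression for the variance of estimator (19) combines the variances of $\hat{t}_{1y}$ and $\hat{t}^{(2)}_{2y}$ with their covariance $\mathrm{Cov}(\hat{t}_{1y}, \hat{t}^{(2)}_{2y})$.

2. By a linear combination of $\hat{t}^{rat}_{1y}$ and $\hat{t}^{(2)}_{2y}$, we obtain the estimator $\hat{t}^{rat}_2$ (23) of the total with the use of auxiliary information. The expression for the variance of estimator (23) has the same structure.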
The variance $\mathrm{Var}(\hat{t}^{rat}_2)$ is estimated by replacing the variances and the covariance in this expression with their estimators.

Further, we are interested in the estimation of the finite population total $t_y$ using two-phase and four-phase sampling schemes, when simple random samples of households without replacement and samples with probabilities proportional to household size without replacement are drawn in each of the phases.

5 Special cases of sampling design

5.1 Simple random sampling of households without replacement

5.1.1 Two-phase sampling scheme

Data for two quarters are used for the estimation of the population total $t_y$. Assume that $s_1$ of size $n_1$ is a simple random sample from the population $U$, so that its complement $\bar{s}_2 = U \setminus s_1$ of size $N - n_1$ is also a simple random sample from the population $U$, and that $s_2$ of size $n_2$ is a simple random sample from $\bar{s}_2$. Then the first- and second-order inclusion probabilities to be used for (19) and (23) are calculated as follows:

$$\pi_{1i} = \frac{n_1}{N}, \qquad \pi_{1ij} = \frac{n_1(n_1 - 1)}{N(N - 1)}, \qquad P(i \in \bar{s}_2) = \frac{N - n_1}{N}, \qquad P(i \in \bar{s}_2, j \in \bar{s}_2) = \frac{(N - n_1)(N - n_1 - 1)}{N(N - 1)},$$

$$\pi_{2i|\bar{s}_2} = \frac{n_2}{N - n_1}, \qquad \pi_{2ij|\bar{s}_2} = \frac{n_2(n_2 - 1)}{(N - n_1)(N - n_1 - 1)}, \qquad i \ne j.$$

In the case of simple random sampling in each of the phases, estimator (19) of the total without the use of auxiliary information and the variance of the estimator $\hat{t}_2$ of the total $t_y$ can be written in explicit form. Similarly, in the case of simple random sampling in each of the phases, the four-phase estimators (26) without and (27) with the use of auxiliary information can be rewritten in explicit form.

Remark 1. The population totals of $x$ in (31), (35) cannot be known in a real survey, because the values of the study variable in all waves are known only for sample data. Therefore, we suggest replacing them by the estimates of $t_y$ obtained for the corresponding wave using the whole sample consisting of four rotation parts, and considering them as fixed in what follows.

Unequal probability sampling of households without replacement with probabilities proportional to their size (successive sampling)
The selection of households through individuals is taken as an example of an unequal probability sampling design, which is used for the Lithuanian LFS. An individual is selected from a list with equal selection probabilities, and his/her household is included in the sample. Individual selection is repeated. If the household selected is already in the sample, this selection step is ignored. Otherwise, the household is included in the sample. The process is continued until the predetermined number $n$ of different households is selected. This household selection scheme is studied by Rosén [20] and is called the successive sampling design. The larger the household, the higher its probability of being selected for the sample, because any of its members can be selected from the list of individuals.
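The selection procedure just described can be sketched in a few lines. This is an illustration only: the function name, the representation of the list of individuals and the toy household sizes are hypothetical.

```python
import random


def select_households_via_individuals(household_sizes, n, rng=None):
    """Draw individuals with equal probabilities and collect their households
    until n different households have been selected."""
    if n > sum(1 for m in household_sizes if m > 0):
        raise ValueError("n exceeds the number of non-empty households")
    rng = rng or random.Random()
    # List of individuals: one entry per person, labelled by household index.
    individuals = [h for h, m in enumerate(household_sizes) for _ in range(m)]
    selected = []      # households in order of first selection
    seen = set()
    while len(selected) < n:
        h = rng.choice(individuals)   # every individual is equally likely
        if h in seen:
            continue                  # household already sampled: ignore this draw
        seen.add(h)
        selected.append(h)
    return selected


sizes = [1, 2, 3, 4, 5]               # toy household sizes m_i
print(select_households_via_individuals(sizes, n=3, rng=random.Random(1)))
```

Because a household with $m_i$ members appears $m_i$ times in the list of individuals, larger households are more likely to enter the sample early.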

Order sampling designs
The order sampling design [18] is defined as follows. To each population element $i \in U$, a probability distribution $F_i$ is assigned, $i = 1, 2, \ldots, N$. Independent ranking random variables $Q_1, Q_2, \ldots, Q_N$ with distributions $F_1, F_2, \ldots, F_N$ are realized. The elements with the $n$ smallest $Q$ values constitute a sample. The distributions $F_1, F_2, \ldots, F_N$ are called ranking distributions.
Let us consider ranking distributions $F_i(u) = H(u; \lambda_i)$ with a shape distribution function $H(u)$, concentrated on the positive half-line, and real constants $\lambda_i > 0$, called intensities, desired or target inclusion probabilities, $i = 1, 2, \ldots, N$. The class of order sampling designs includes successive and Pareto sampling designs.
The successive sampling design has an exponential shape distribution function $H(u) = 1 - e^{-u}$, $0 \le u < \infty$. The ranking variables are

$$Q_i = \frac{\ln(1 - U_i)}{\ln(1 - \lambda_i)},$$

with the values of the random variables $U_i$ distributed uniformly on $[0, 1]$.
The Pareto sampling design has a Pareto shape distribution function $H(u) = u/(1 + u)$, $0 \le u < \infty$. The ranking variables are

$$Q_i = \frac{U_i (1 - \lambda_i)}{\lambda_i (1 - U_i)},$$

with the values of the random variables $U_i$ distributed uniformly on $[0, 1]$. For these sampling designs, the exact inclusion probabilities are approximately equal to the desired inclusion probabilities $\lambda_i$. According to Rosén [19], the inclusion probabilities for order sampling designs are asymptotically equal to the desired inclusion probabilities $\lambda_i$.
In our case, $m_i$ is the number of household members, $M = \sum_{i=1}^{N} m_i$ is the total number of individuals, and we choose $\lambda_i = n m_i / M$. With such a choice of desired inclusion probabilities, the class of order sampling designs intersects with the class of sampling designs with inclusion probabilities approximately proportional to the size measure (PPS).
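An order sample with target inclusion probabilities $\lambda_i = n m_i / M$ can be drawn as sketched below. The ranking variables are assumed here to take the standard forms of Rosén's order sampling, $Q_i = \ln(1 - U_i)/\ln(1 - \lambda_i)$ for the successive design and $Q_i = U_i(1 - \lambda_i)/(\lambda_i(1 - U_i))$ for the Pareto design; the function names and the toy household sizes are hypothetical.

```python
import math
import random


def target_inclusion_probabilities(sizes, n):
    """lambda_i = n * m_i / M, with M the total of the size measure."""
    M = sum(sizes)
    lam = [n * m / M for m in sizes]
    if max(lam) >= 1 or min(lam) <= 0:
        raise ValueError("target inclusion probabilities must lie in (0, 1)")
    return lam


def order_sample(lam, n, design="successive", rng=None):
    """Realize the ranking variables Q_i and keep the n smallest."""
    rng = rng or random.Random()
    ranked = []
    for i, l in enumerate(lam):
        u = rng.random()                      # uniform on [0, 1)
        if design == "successive":            # exponential shape (assumed form)
            q = math.log(1 - u) / math.log(1 - l)
        elif design == "pareto":              # Pareto shape (assumed form)
            q = (u / (1 - u)) / (l / (1 - l))
        else:
            raise ValueError("unknown design: " + design)
        ranked.append((q, i))
    ranked.sort()
    return sorted(i for _, i in ranked[:n])


sizes = [1, 2, 3, 4, 5, 5]                    # toy household sizes m_i
lam = target_inclusion_probabilities(sizes, n=2)
print(lam)
print(order_sample(lam, 2, "successive", random.Random(3)))
print(order_sample(lam, 2, "pareto", random.Random(3)))
```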
The conditional Poisson sampling scheme (CP) is defined as follows: each unit in the population is selected with a prescribed probability $p_i$, $p_i > 0$, with $\sum_{i=1}^{N} p_i = n$, but only the samples of the desired size $n$ are accepted. The inclusion probabilities for the CP sample will not be exactly equal to $p_i$, but only approximately.
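The definition translates directly into a rejection scheme: draw a Poisson sample and accept it only if it has the desired size. The sketch below is one straightforward way to do this, not necessarily an efficient one; the function name, the `max_tries` safeguard and the toy probabilities are hypothetical.

```python
import random


def conditional_poisson_sample(p, n, rng=None, max_tries=100000):
    """Select each unit independently with probability p_i and
    accept the sample only if exactly n units were selected."""
    rng = rng or random.Random()
    for _ in range(max_tries):
        sample = [i for i, p_i in enumerate(p) if rng.random() < p_i]
        if len(sample) == n:
            return sample
    raise RuntimeError("no sample of the desired size was obtained")


p = [0.2, 0.4, 0.6, 0.8]      # toy selection probabilities, summing to n = 2
print(conditional_poisson_sample(p, n=2, rng=random.Random(0)))
```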
Expressions for the second-order inclusion probabilities $\pi_{1ij}$ in the case of an order sampling design are presented in Aires [1], but they are too complex, and their computation is time-consuming.
According to Bondesson et al. [6, p. 700], the Pareto and CP sampling designs are close for $p_i = \lambda_i$, and inclusion probabilities of all orders for the Pareto sampling design may be approximated by the corresponding ones for the CP design. Relying on the approximate equality of the first-order inclusion probabilities, we will approximate the second-order inclusion probabilities for a successive design with the second-order inclusion probabilities for the CP design.

Two-phase sampling scheme
Data for two quarters are used for the estimation of the population total $t_y$. Assume that a sample $s_1$ of size $n_1$, a sample $\bar{s}_2 = U \setminus s_1$ of size $N - n_1$, and a sample $s_2$ of size $n_2$ are drawn according to the successive sampling design introduced by Rosén [18]. The first-phase household target inclusion probability in the sample $s_1$ is defined as $\lambda_i = n_1 m_i / M$, $i \in U$. The second-phase household target inclusion probability in the sample $s_2$ is defined analogously for the selection of $n_2$ households from $\bar{s}_2$. The first-order inclusion probabilities to be used for the estimation of the total $t_y$ in (19) and (23) are approximated by the corresponding target inclusion probabilities. The estimators $\hat{t}_{1y} = \hat{t}^{\lambda}_{1y} = \sum_{i \in s_1} y_i / \lambda_i$, $\hat{t}_{1x} = \hat{t}^{\lambda}_{1x} = \sum_{i \in s_1} x_i / \lambda_i$ are used in (19), (5) and (23).
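For the estimators of variances (21), (25), the second-order inclusion probabilities for a successive sampling design are needed. Second-order inclusion probabilities for CP sampling $\tilde{\pi}_{1ij}$, presented by Aires [1], are

$$\tilde{\pi}_{1ij} = \begin{cases} \dfrac{\gamma_{1i}\tilde{\pi}_{1j} - \gamma_{1j}\tilde{\pi}_{1i}}{\gamma_{1i} - \gamma_{1j}}, & \gamma_{1i} \ne \gamma_{1j}, \\[2ex] \dfrac{1}{k_{1i}} \Bigl( (n_1 - 1)\tilde{\pi}_{1i} - \sum_{l:\, \gamma_{1l} \ne \gamma_{1i}} \tilde{\pi}_{1il} \Bigr), & \gamma_{1i} = \gamma_{1j}, \end{cases} \tag{37}$$

$i, j \in U$, $i \ne j$. Here $\gamma_{1i} = p_{1i}/(1 - p_{1i})$, and $k_{1i}$ is the number of elements $j \in U$, $j \ne i$, such that $\gamma_{1i} = \gamma_{1j}$; the probability $p_{1i}$ is the selection probability of the element $i$ for a CP sampling design, $i \in U$. Suppose that $\tilde{\pi}_{1i}$ are given inclusion probabilities of the CP design, equal to $\lambda_i$. Keeping them as known, we use the approximation result of Bondesson et al. [6, p. 705] to express $p_{1i}$ through these inclusion probabilities. Inserting the $\gamma_{1i}$ obtained into (37), we find the second-order inclusion probabilities $\tilde{\pi}_{1ij}$ of the CP design. Approximating $P(i \in s_1, j \in s_1)$ in (8) by $\tilde{\pi}_{1ij}$, we obtain the second-order inclusion probabilities for the sample $\bar{s}_2$, and keeping $\pi_{2i|\bar{s}_2}$ as known, we obtain the second-order inclusion probabilities for $s_2$ ([1]) in the same way, with $\gamma_{2i}$, $k_{2i}$ and $n_2$ in place of $\gamma_{1i}$, $k_{1i}$ and $n_1$. Here $k_{2i}$ is the number of elements $j \ne i$ such that $\gamma_{2i} = \gamma_{2j}$. We have $\tilde{\pi}_{2ii|\bar{s}_2} = \tilde{\pi}_{2i|\bar{s}_2}$ in the case of $i = j$.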
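For a CP design, the second-order inclusion probabilities of two units with different odds $\gamma_i = p_i/(1 - p_i)$ satisfy the known identity $\pi_{ij} = (\gamma_i \pi_j - \gamma_j \pi_i)/(\gamma_i - \gamma_j)$. The sketch below checks this identity numerically on a tiny population by enumerating all samples of size $n$, using the fact that a CP sample $s$ has probability proportional to $\prod_{i \in s} \gamma_i$. The function names and the toy probabilities are hypothetical, and the enumeration is not a practical algorithm for realistic population sizes.

```python
from itertools import combinations
from math import prod


def cp_inclusion_probabilities(p, n):
    """Exact first- and second-order inclusion probabilities of a
    conditional Poisson design, by enumerating all samples of size n."""
    N = len(p)
    gamma = [p_i / (1 - p_i) for p_i in p]
    # P(s) is proportional to the product of gamma_i over i in s.
    weights = {s: prod(gamma[i] for i in s) for s in combinations(range(N), n)}
    total = sum(weights.values())
    pi1 = [sum(w for s, w in weights.items() if i in s) / total
           for i in range(N)]
    pi2 = {(i, j): sum(w for s, w in weights.items() if i in s and j in s) / total
           for i in range(N) for j in range(N) if i != j}
    return gamma, pi1, pi2


def pairwise_formula(gamma, pi1, i, j):
    """Second-order inclusion probability for gamma_i != gamma_j."""
    return (gamma[i] * pi1[j] - gamma[j] * pi1[i]) / (gamma[i] - gamma[j])


p = [0.2, 0.3, 0.5, 0.6, 0.4]     # toy selection probabilities, all distinct
gamma, pi1, pi2 = cp_inclusion_probabilities(p, n=2)
max_gap = max(abs(pairwise_formula(gamma, pi1, i, j) - pi2[(i, j)])
              for (i, j) in pi2)
print(max_gap)
```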
Recall that the second-order inclusion probabilities for a successive sampling design are approximated by the corresponding probabilities for the CP design.
After replacing the $\pi$ values in (4), (7) and (12) with the corresponding approximate values presented in this section, we estimate the variances of the estimators $\hat{t}_2$ and $\hat{t}^{rat}_2$ of the total $t_y$ in (21) and (25), respectively, together with the covariances $\mathrm{Cov}(\hat{t}_{1y}, \hat{t}^{(2)}_{2y})$ and $\mathrm{Cov}(\hat{t}^{rat}_{1y}, \hat{t}^{(2)}_{2y})$. Next, the number of unequal probability sampling phases is increased.

Four-phase sampling scheme
Data for four quarters are used for the estimation of the population total $t_y$. Assume that all samples in the selection procedure shown in Fig. 2 are drawn according to a successive sampling design. The first- and second-order inclusion probabilities for samples $s_1$, $\bar{s}_2$ and $s_2$ are introduced in Subsection 5.2.2. The corresponding first-order inclusion probabilities for samples $\bar{s}_3$, $s_3$, $\bar{s}_4$, $s_4$ for the sampling design under study, to be used for (26) and (27), are approximated in the same way as for the two-phase scheme. Second-order inclusion probabilities for a four-phase sampling design are not presented here. Only the empirical variance of the estimates in the case of a four-phase sampling scheme is used in the simulation study.

Simulation study
In this section, we present a simulation study for the comparison of the performance of several estimators of the total using data of two and four quarters, with simple random sampling (SRS) and sampling with probability proportional to size (PPS) (successive order sampling) of households without replacement in each of the phases.
We study the real LFS data of Statistics Lithuania. The study population consists of $N = 500$ households. The variables of interest, $y$ and $x$, are the number of employed (or unemployed) individuals in the population of households in the current and previous waves. The population totals of $x$ are available and are used for estimators (23), (27). The correlation coefficient between the variables $x$ and $y$ in the household population for the number of employed individuals is $\rho(x, y) = 0.95$, which indicates a strong linear relationship. For the number of unemployed individuals, the correlation coefficient is $\rho(x, y) = 0.80$.
For the two-phase sampling scheme, $B = 500$ samples $s_1$ and $s_2$ of size $n_1 = n_2 = 100$ ($n = n_1 + n_2 = 200$) are selected by simple random sampling and successive sampling.
For the four-phase sampling scheme, samples s 1 , s 2 , s 3 and s 4 of size ) are selected by simple random sampling and successive sampling.
For each of the estimators $\hat{t}_2$, $\hat{t}^{rat}_2$, $\hat{t}_4$ and $\hat{t}^{rat}_4$, we have calculated the estimates of the population total $t_y$ of the study variable $y$. For all estimators $\hat{\theta} = \hat{t}_2, \hat{t}^{rat}_2, \hat{t}_4, \hat{t}^{rat}_4$, the averages of the estimates, the averages of the variance estimates and the empirical variances of the estimates are calculated. The results of the simulation are presented in Table 1. They show the averages of the estimates of variances and the empirical variances for each part of the combined estimators $\hat{t}_2$ and $\hat{t}^{rat}_2$, when samples are drawn according to the successive sampling design in each phase.
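The three summary quantities computed in the simulation (average of the estimates, average of the variance estimates, empirical variance of the estimates) can be sketched for the simplest case, the Horvitz-Thompson estimator under simple random sampling without replacement. The population below is synthetic and stands in for the LFS data, which are not reproduced here; the function name is hypothetical, and the empirical variance is assumed to use the divisor $B - 1$.

```python
import random


def simulate_ht_under_srs(y, n, B, seed=0):
    """Draw B simple random samples of size n and summarize the
    Horvitz-Thompson estimates of the total and their variance estimates."""
    rng = random.Random(seed)
    N = len(y)
    estimates, variance_estimates = [], []
    for _ in range(B):
        sample = rng.sample(y, n)
        mean = sum(sample) / n
        s2 = sum((v - mean) ** 2 for v in sample) / (n - 1)
        estimates.append(N * mean)
        variance_estimates.append(N ** 2 * (1 - n / N) * s2 / n)
    average_estimate = sum(estimates) / B
    empirical_variance = sum((e - average_estimate) ** 2 for e in estimates) / (B - 1)
    return {
        "average_estimate": average_estimate,
        "average_variance_estimate": sum(variance_estimates) / B,
        "empirical_variance": empirical_variance,
    }


# Synthetic population of 500 households (not the LFS data).
population_rng = random.Random(1)
population = [population_rng.randint(0, 3) for _ in range(500)]
summary = simulate_ht_under_srs(population, n=100, B=500)
print(sum(population), summary)
```

If the variance estimator is adequate, the average of the variance estimates should be close to the empirical variance, which is the comparison made in Table 1.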
The combined estimator has a smaller variance than any of its parts, primarily because a larger sample is used to calculate it. Using the ratio estimator decreases the empirical variance of the estimator of the total for the number of employed individuals and has little effect on the estimates of the number of unemployed individuals.
In almost all cases, the calculated averages of the estimates of variances are close to the empirical variances of the estimates for all parts of the combined estimators $\hat{t}_2$ and $\hat{t}^{rat}_2$ using a two-phase sampling design. This shows that the expression of the variance of a combined estimator of the total obtained for two-phase successive sampling is accurate enough, and that the inclusion probabilities for a successive sampling design can be approximated by the corresponding probabilities of the conditional Poisson sampling design (unequal probability sampling without replacement) to calculate the estimates of the totals and the estimates of their variances.
The box-plot diagrams of the estimates of the number of employed and unemployed individuals in the household population using the proposed estimators $\hat{t}_2$, $\hat{t}^{rat}_2$, $\hat{t}_4$ and $\hat{t}^{rat}_4$ are presented in Figs. 3 and 4. Simple random samples (SRS) and successive samples with probabilities proportional to the household size (PPS) are drawn in each of the phases using two-phase and four-phase sampling schemes.
The estimates of the number of employed individuals have a lower variance using the PPS sampling design than using the SRS sampling design. The estimates calculated with the combined ratio estimator of the total for the four-phase sampling scheme have a lower variance than the estimates calculated with the combined estimator of the total without the use of auxiliary information. The estimates with the lowest variance are obtained by the combined ratio estimator of the total for four-phase sampling with PPS sampling in each of the phases. The variances of the estimates of the number of unemployed individuals do not differ much.
The box-plot diagrams of the variance estimates of the number of employed and unemployed persons in the household population using all the estimators obtained, $\hat{t}_2$, $\hat{t}^{rat}_2$, $\hat{t}_4$ and $\hat{t}^{rat}_4$, are presented in Fig. 5. The estimates of the variances of estimators of the number of employed individuals using the two-phase sampling scheme with PPS in each of the phases are lower than those obtained with the SRS sampling design. The combined ratio estimator of the total with the use of auxiliary information has the lowest variance. The variation of the estimates of the variances of estimators of the number of unemployed individuals differs little, because the correlation coefficient between the variables $x$ and $y$ in the household population for the number of unemployed individuals is lower than for the number of employed individuals. Table 2 illustrates the true variances, averages of the estimates of variances, empirical variances and relative empirical biases.

Discussion
A two-phase sampling design with second-phase stratification by the household size has been used to estimate the number of employed and unemployed individuals in [14]. The simulation results there show that the variance of the estimates of the number of employed individuals decreases significantly in comparison with the one-phase sampling design of the same size, and that it does not decrease for the estimates of the number of the unemployed. The study in [14] thus leads to a conclusion similar to that of the current paper.
The combined ratio-type estimator may be effectively used in practice for the estimation of the number of employed individuals. Ratio-type estimators require the previous-wave data for the elements that belong both to the current sample and to the sample of the previous wave. If the data of the previous wave are not available for some elements, the values of the variables needed have to be imputed.
The ratio estimator used here is the simplest way to use auxiliary information at the estimation stage. A regression estimator of the total with the study variable of the previous wave as an auxiliary variable may also be used. Using a larger number of auxiliary variables from the previous waves and a calibrated estimator of the total instead of a ratio estimator is a possible generalization of the problem.

Figure 1. Sample rotation scheme of the Labour Force Survey.

Figure 3. Estimates of the number of employed individuals.

Figure 4. Estimates of the number of unemployed individuals.

Figure 5. Estimates of the variances of estimators for the number of employed (left) and unemployed (right) individuals.

The first- and second-order inclusion probabilities for samples $s_1$, $\bar{s}_2$ and $s_2$ are introduced in Subsection 5.1.1. The corresponding first-order inclusion probabilities for samples $\bar{s}_3$, $s_3$, $\bar{s}_4$, $s_4$ to be used for (26) and (27) are calculated analogously.

Table 1. Average of variance estimates and empirical variances for each part of the combined estimator in two-phase successive sampling.