# Statistics Indian Statistical Service Part 3 Question-1 Solved Solution

**6.(a) (i) Define a Horvitz-Thompson estimator.**

The Horvitz-Thompson estimator is a method of estimating a finite-population total (or mean) from a probability sample. It is a weighted estimator that weights each sampled unit by the inverse of its probability of being included in the sample.

The estimator is named after Daniel G. Horvitz and Donovan J. Thompson, who introduced it in their 1952 paper "A Generalization of Sampling Without Replacement from a Finite Universe". The Horvitz-Thompson estimator is widely used in survey sampling and is particularly useful when the sample is selected using a complex design.

The Horvitz-Thompson estimator can be defined as follows:

Let $Y = \sum_{i=1}^{N} y_i$ be the population total of a variable of interest, where $y_i$ is the value of that variable for unit $i$ in a population of $N$ units. Let $\pi_i > 0$ be the inclusion probability of unit $i$, that is, the probability that unit $i$ appears in the sample $s$. Then the Horvitz-Thompson estimator of $Y$ is given by:

$$\hat{Y}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i}$$

where the summation is taken over all sampled units. Each sampled value is weighted by the inverse of its inclusion probability, so a unit selected with probability $\pi_i$ represents $1/\pi_i$ units of the population. The corresponding estimator of the population mean is $\hat{Y}_{HT}/N$. (The ratio form $\sum_{i \in s} (y_i/\pi_i) \big/ \sum_{i \in s} (1/\pi_i)$ is the Hájek estimator, a different and slightly biased estimator of the mean.)

The Horvitz-Thompson estimator is unbiased for the population total whenever every $\pi_i > 0$, and its variance is small when the $y_i$ are roughly proportional to the $\pi_i$. It is often used in practice, especially where the sampling design involves stratification, clustering or unequal probabilities of selection.
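The estimator of a total in its standard form, the sum of $y_i/\pi_i$ over the sampled units with $\pi_i$ the inclusion probability of unit $i$, can be sketched in a few lines of Python. The function name `horvitz_thompson_total` and the three-unit sample are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def horvitz_thompson_total(y: Sequence[float], pi: Sequence[float]) -> float:
    """Horvitz-Thompson estimate of a population total.

    y  : observed values for the sampled units
    pi : inclusion probabilities of the same units (each in (0, 1])
    """
    if len(y) != len(pi):
        raise ValueError("y and pi must have the same length")
    if any(p <= 0 or p > 1 for p in pi):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    # Each sampled unit is weighted by the inverse of its inclusion probability.
    return sum(yi / p for yi, p in zip(y, pi))


# Hypothetical sample of three units with unequal inclusion probabilities.
sample_y = [10.0, 20.0, 30.0]
sample_pi = [0.1, 0.2, 0.5]
ht_total = horvitz_thompson_total(sample_y, sample_pi)
print(ht_total)  # 10/0.1 + 20/0.2 + 30/0.5
```

Dividing the result by the population size $N$ gives the corresponding estimate of the population mean.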

**(ii) For simple random sampling without replacement, show that the HT estimator is unbiased for the population mean and hence deduce its variance.**

In simple random sampling without replacement (SRSWOR), every possible sample $s$ of size $n$ has the same probability of being selected:

$$p(s) = \binom{N}{n}^{-1}$$

where $N$ is the population size and $\binom{N}{n}$ is the number of possible samples of size $n$.

Unit $i$ belongs to $\binom{N-1}{n-1}$ of these samples, so its inclusion probability is

$$\pi_i = \binom{N-1}{n-1} \Big/ \binom{N}{n} = \frac{n}{N}$$

The Horvitz-Thompson estimator of the population mean $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i$ is therefore

$$\hat{\bar{Y}}_{HT} = \frac{1}{N}\sum_{i \in s} \frac{y_i}{\pi_i} = \frac{1}{N}\sum_{i \in s} \frac{N}{n}\, y_i = \frac{1}{n}\sum_{i \in s} y_i = \bar{y}$$

which is simply the sample mean.

To show that the HT estimator is unbiased for the population mean, write it with indicator variables $I_i$, where $I_i = 1$ if unit $i$ is in the sample and $I_i = 0$ otherwise, so that $E(I_i) = \pi_i = n/N$. The $y_i$ are fixed; only the $I_i$ are random. Then

$$E(\bar{y}) = E\left[\frac{1}{n}\sum_{i=1}^{N} I_i\, y_i\right] = \frac{1}{n}\sum_{i=1}^{N} y_i\, E(I_i) = \frac{1}{n}\cdot\frac{n}{N}\sum_{i=1}^{N} y_i = \bar{Y}$$

Therefore, the Horvitz-Thompson estimator is unbiased for the population mean.

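To compute the variance of the Horvitz-Thompson estimator, we can use the fact that the variance of a weighted sum of random variables is given by:

$$\mathrm{Var}\left(\sum_i w_i Z_i\right) = \sum_i \sum_j w_i w_j\, \mathrm{Cov}(Z_i, Z_j)$$

where the $w_i$ are fixed weights and the $Z_i$ are random variables.

Under SRSWOR the HT estimator of the mean is the sample mean, $\bar{y} = \frac{1}{n}\sum_{i=1}^{N} I_i\, y_i$, where $I_i = 1$ if unit $i$ is in the sample and $0$ otherwise. In this case the weights are $w_i = y_i/n$ and the random variables are $Z_i = I_i$. With $\pi_i = n/N$ and joint inclusion probability $\pi_{ij} = \frac{n(n-1)}{N(N-1)}$ for $i \neq j$:

$$\mathrm{Var}(I_i) = \frac{n}{N}\left(1-\frac{n}{N}\right) = \frac{n(N-n)}{N^2}, \qquad \mathrm{Cov}(I_i, I_j) = \pi_{ij} - \pi_i\pi_j = -\frac{n(N-n)}{N^2(N-1)}$$

The indicators are not independent: because the sample size is fixed, including one unit makes it less likely that another is included, so the covariance is negative. Substituting:

$$\mathrm{Var}(\bar{y}) = \frac{1}{n^2}\left[\frac{n(N-n)}{N^2}\sum_{i=1}^{N} y_i^2 - \frac{n(N-n)}{N^2(N-1)}\sum_{i \neq j} y_i y_j\right] = \frac{N-n}{nN^2(N-1)}\left[(N-1)\sum_{i=1}^{N} y_i^2 - \sum_{i \neq j} y_i y_j\right]$$

Using $\sum_{i \neq j} y_i y_j = N^2\bar{Y}^2 - \sum_i y_i^2$, the bracket equals $N\left(\sum_i y_i^2 - N\bar{Y}^2\right) = N(N-1)S^2$, so

$$\mathrm{Var}(\bar{y}) = \frac{N-n}{Nn}\, S^2 = \left(1-\frac{n}{N}\right)\frac{S^2}{n}$$

where $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i-\bar{Y})^2$ is the population mean square. Equivalently, $\mathrm{Var}(\bar{y}) = \frac{N-n}{N-1}\cdot\frac{\sigma^2}{n}$ with $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i-\bar{Y})^2$, where $(N-n)/(N-1)$ is the finite population correction factor. An unbiased estimate of this variance is obtained by replacing $S^2$ with the sample variance $s^2$.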
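Both results for simple random sampling without replacement can be checked by brute force on a small population: enumerate every possible sample, compute the sample mean for each, and compare the mean and variance of those estimates with $\bar{Y}$ and $\frac{N-n}{Nn}S^2$. The five-unit population below is hypothetical and only serves the check.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population of N = 5 values; samples of size n = 3.
population = [3.0, 7.0, 8.0, 12.0, 15.0]
N, n = len(population), 3

# Under SRSWOR every unit has inclusion probability n/N, so the
# Horvitz-Thompson estimator of the mean is the ordinary sample mean.
estimates = [mean(sample) for sample in combinations(population, n)]

pop_mean = mean(population)
# Population mean square S^2 (divisor N - 1).
S2 = sum((y - pop_mean) ** 2 for y in population) / (N - 1)

# All samples are equally likely, so plain averages give the exact
# expectation and variance of the estimator over the sampling design.
expected_value = mean(estimates)
design_variance = mean((e - expected_value) ** 2 for e in estimates)
formula_variance = (N - n) / (N * n) * S2

print(expected_value, pop_mean)
print(design_variance, formula_variance)
```

The two printed pairs agree up to floating-point rounding, which illustrates the unbiasedness and the variance formula for this one population; it is a check, not a proof.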

**6.(b) (i) Describe the purpose of double sampling in ratio method of estimation.**

Double sampling (also called two-phase sampling) is used in the ratio method of estimation when the population mean or total of the auxiliary variable is not known. The ratio estimator of the population mean, $\bar{y}_R = (\bar{y}/\bar{x})\,\bar{X}$, uses an auxiliary variable $x$ that is correlated with the study variable $y$, and it requires the population mean $\bar{X}$ of $x$.

However, in many surveys $\bar{X}$ is not available from a census or from records. If $x$ is much cheaper to observe than $y$, double sampling can be used to estimate $\bar{X}$ from a large sample and so keep the ratio method usable.

Double sampling involves taking a large first-phase sample of size $n'$ from the population and observing only $x$ on it, which gives the mean $\bar{x}'$ as an estimate of $\bar{X}$. A smaller second-phase sample of size $n$ is then taken, usually as a subsample of the first, and both $x$ and $y$ are observed on it. The double-sampling ratio estimator of the population mean is $\bar{y}_{Rd} = (\bar{y}/\bar{x})\,\bar{x}'$, where $\bar{y}$ and $\bar{x}$ are the second-phase sample means.

The purpose of double sampling is therefore to retain the gain in precision of the ratio estimator over the simple mean $\bar{y}$ without prior knowledge of $\bar{X}$. It is worthwhile when $x$ and $y$ are highly correlated and observing $x$ costs much less than observing $y$, so that spending part of the budget on a large first-phase sample reduces the variance more than spending it on additional measurements of $y$.

**(ii) A simple random sample of 15 orchards was selected from 200 orchards in the region to estimate the average yield of apples in the region. In each of the selected 15 orchards, the average yield during the previous session was obtained. For a subsample of 8 orchards from this 15 orchards, the average yield of apples in this season was recorded. Estimate the average yield of apples from all 200 orchards using all the information given in the table below :**

| Orchard No. | Yield : previous session | Yield : present session |
|---|---|---|
| 1 | 140 | 128 |
| 2 | 119 | 108 |
| 3 | 109 | 101 |
| 4 | 139 | 125 |
| 5 | 127 | 118 |
| 6 | 145 | 135 |
| 7 | 146 | 133 |
| 8 | 188 | 175 |
| 9 | 121 | - |
| 10 | 144 | - |
| 11 | 105 | - |
| 12 | 155 | - |
| 13 | 115 | - |
| 14 | 142 | - |
| 15 | 171 | - |

To estimate the average yield of apples from all 200 orchards using all the given information, we can use double (two-phase) sampling with the ratio estimator, taking the previous session's yield as the auxiliary variable $x$ and the present session's yield as the study variable $y$.

The first-phase sample is the simple random sample of $n' = 15$ orchards from the $N = 200$ orchards in the region, on which only the previous session's yield is known. Let's denote the mean of these 15 values by $\bar{x}'$.

The second-phase sample is the subsample of $n = 8$ orchards (orchards 1 to 8), on which both yields are known. Let's denote the subsample mean of the previous session yields by $\bar{x}$ and the subsample mean of the present session yields by $\bar{y}$.

Now, the estimate of the average yield of apples from all 200 orchards can be obtained from the double-sampling ratio estimator:

$$\hat{\mu}=\frac{\bar{y}}{\bar{x}}\,\bar{x}'$$

To compute the estimate, we first need the three sample means:

$$\bar{x}'=\frac{1}{15}\sum_{i=1}^{15}x_{i}=\frac{140+119+109+139+127+145+146+188+121+144+105+155+115+142+171}{15}=\frac{2066}{15}=137.73$$

$$\bar{x}=\frac{1}{8}\sum_{i=1}^{8}x_{i}=\frac{140+119+109+139+127+145+146+188}{8}=\frac{1113}{8}=139.125$$

$$\bar{y}=\frac{1}{8}\sum_{i=1}^{8}y_{i}=\frac{128+108+101+125+118+135+133+175}{8}=\frac{1023}{8}=127.875$$

Now we can substitute the values into the equation:

$$\hat{\mu}=\frac{127.875}{139.125}\times\frac{2066}{15}=\frac{1023}{1113}\times\frac{2066}{15}\approx 126.60$$

Therefore, the estimate of the average yield of apples from all 200 orchards using the given information is approximately 126.60.
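As a check on the arithmetic, the following Python sketch evaluates the double-sampling ratio estimator $(\bar{y}/\bar{x})\,\bar{x}'$ on the tabulated data, where $\bar{x}'$ is the mean previous-session yield over all 15 orchards and $\bar{x}$, $\bar{y}$ are the previous- and present-session means over the 8 subsampled orchards. The variable names are illustrative, and the sketch assumes the subsample is orchards 1 to 8, as the table suggests.

```python
# Previous-session yields for all 15 first-phase orchards (orchards 1..15).
previous_all = [140, 119, 109, 139, 127, 145, 146, 188,
                121, 144, 105, 155, 115, 142, 171]
# Present-session yields for the 8 subsampled orchards (orchards 1..8).
present_sub = [128, 108, 101, 125, 118, 135, 133, 175]
# Previous-session yields for the same 8 orchards.
previous_sub = previous_all[:len(present_sub)]

x_bar_first = sum(previous_all) / len(previous_all)   # first-phase mean of x
x_bar_sub = sum(previous_sub) / len(previous_sub)     # second-phase mean of x
y_bar_sub = sum(present_sub) / len(present_sub)       # second-phase mean of y

# Double-sampling ratio estimate of the mean present-session yield.
ratio = y_bar_sub / x_bar_sub
estimate = ratio * x_bar_first

print(f"x_bar' = {x_bar_first:.3f}, x_bar = {x_bar_sub:.3f}, y_bar = {y_bar_sub:.3f}")
print(f"ratio estimate of mean yield = {estimate:.2f}")
```

The estimate is lower than the first-phase mean of the previous yields because, on the subsample, present yields run at roughly 92% of previous yields.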

**6. (c) (i) Describe cluster sampling. State the sampling variance of the unbiased estimator of the population mean in terms of the intra-cluster correlation.**

Cluster sampling is a sampling technique where the population is first divided into smaller groups or clusters, and then a random sample of these clusters is selected. All members of the selected clusters are then included in the sample.

For example, in a study of students' test scores in a school district, a cluster sample might be created by randomly selecting several schools, and then selecting all the students in those schools for the study.

The sampling variance of the unbiased estimator of the population mean in cluster sampling depends on how much the cluster means differ from one another, which can be expressed through the element variance and the intra-cluster correlation.

The intra-cluster correlation $\rho$ measures the degree of similarity or homogeneity among the observations within each cluster. Suppose $n$ clusters are drawn by simple random sampling without replacement from $N$ clusters, each containing $M$ elements, and let $\bar{y}$ be the mean of all $nM$ sampled elements, which is unbiased for the population mean per element. For large $N$, its sampling variance in terms of $\rho$ is:

$$\mathrm{Var}(\bar{y}) \approx \frac{1-f}{nM}\,S^2\,[1+(M-1)\rho]$$

where $f = n/N$ is the sampling fraction and $S^2$ is the population variance per element. The factor $(1-f)S^2/(nM)$ is the variance of the mean of a simple random sample of $nM$ elements, and the remaining factor is the design effect:

$$\text{deff} = 1 + (M-1)\rho$$

In general, as $\rho$ increases, the design effect increases, which leads to a larger variance of the estimator. This means that the precision of the estimator decreases as the within-cluster similarity increases. Therefore, it is important to account for the intra-cluster correlation in the design and analysis of cluster samples.
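To see how quickly within-cluster similarity erodes precision, the sketch below tabulates the standard design effect $1 + (M-1)\rho$ for clusters of $M$ elements and intra-cluster correlation $\rho$. The helper name `design_effect` and the chosen values of $M$ and $\rho$ are hypothetical.

```python
def design_effect(cluster_size: int, rho: float) -> float:
    """Design effect 1 + (M - 1) * rho for equal clusters of size M."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return 1.0 + (cluster_size - 1) * rho


# Hypothetical cluster sizes and intra-cluster correlations.
for M in (5, 20, 40):
    row = [round(design_effect(M, rho), 2) for rho in (0.0, 0.05, 0.2)]
    print(f"M = {M:2d}: deff for rho = 0, 0.05, 0.2 -> {row}")
```

A design effect of 2 means the cluster sample needs twice as many elements as a simple random sample of elements to reach the same variance, so even a small $\rho$ matters when clusters are large.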

**(ii) A survey is carried out from a simple random sample of 90 clusters each of 40 households. The clusters are selected using simple random sampling at the rate $f = 1/300$. To improve accuracy of results, a statistician proposes to reduce by half the size of the clusters by selecting twice as many of them. Show that the relative efficiency is**

$$\frac{1+19\rho}{1+39\rho}$$

**assuming that the intra-cluster correlation $\rho$ and other quantities do not change.**

**(7+8)**

Let's first define the terms used in the problem:

Original design: $n_1 = 90$ clusters of $M_1 = 40$ households. Proposed design: $n_2 = 2n_1 = 180$ clusters of $M_2 = M_1/2 = 20$ households. Both designs contain $n_1 M_1 = n_2 M_2 = 3600$ households, and the sampling fraction $f = 1/300$ is the same in both.

We assume that the variance per household $S^2$ and the intra-cluster correlation $\rho$ are the same under both designs.

In each design the estimator of the population mean is the mean of the sampled cluster means, which equals the mean of all sampled households because the clusters are of equal size:

$$\bar{y}_k = \frac{1}{n_k}\sum_{i=1}^{n_k} \bar{y}_{ki}, \qquad k = 1, 2$$

where $\bar{y}_{ki}$ is the mean of the $M_k$ households in the $i$-th sampled cluster of design $k$.

For $n$ clusters of $M$ households each, the variance of this estimator is approximately $\frac{1-f}{nM}\,S^2\,[1+(M-1)\rho]$.

The variance under the original design is:

$$\mathrm{Var}(\bar{y}_1) = \frac{1-f}{90 \times 40}\,S^2\,[1+(40-1)\rho] = \frac{(1-f)\,S^2}{3600}\,(1+39\rho)$$

The variance under the proposed design is:

$$\mathrm{Var}(\bar{y}_2) = \frac{1-f}{180 \times 20}\,S^2\,[1+(20-1)\rho] = \frac{(1-f)\,S^2}{3600}\,(1+19\rho)$$

The relative efficiency, taken here as the ratio of the variance under the proposed design to the variance under the original design, is:

$$RE = \frac{\mathrm{Var}(\bar{y}_2)}{\mathrm{Var}(\bar{y}_1)} = \frac{1+19\rho}{1+39\rho}$$

since the common factor $(1-f)\,S^2/3600$ cancels.

Therefore, the relative efficiency of reducing the cluster size by half and doubling the number of clusters is $(1+19\rho)/(1+39\rho)$. For $\rho > 0$ this ratio is less than 1, so the proposed design has the smaller variance; for $\rho = 0$ the two designs are equally precise.
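The ratio $(1+19\rho)/(1+39\rho)$ asked for in the question can be checked numerically. The sketch below assumes the usual large-population approximation $\mathrm{Var}(\bar{y}) \approx \frac{1-f}{nM} S^2 [1+(M-1)\rho]$ for $n$ clusters of $M$ households, sets $S^2 = 1$ arbitrarily (it cancels in the ratio), and compares the two designs for a few hypothetical values of $\rho$. The function names are illustrative.

```python
def cluster_mean_variance(n_clusters: int, cluster_size: int, rho: float,
                          f: float = 1 / 300, s2: float = 1.0) -> float:
    """Approximate Var(y_bar) = (1 - f) * S^2 / (n * M) * [1 + (M - 1) * rho]."""
    deff = 1.0 + (cluster_size - 1) * rho
    return (1.0 - f) * s2 / (n_clusters * cluster_size) * deff


def variance_ratio(rho: float) -> float:
    """Variance of the proposed design divided by that of the original."""
    original = cluster_mean_variance(n_clusters=90, cluster_size=40, rho=rho)
    proposed = cluster_mean_variance(n_clusters=180, cluster_size=20, rho=rho)
    return proposed / original


# Hypothetical values of the intra-cluster correlation.
for rho in (0.0, 0.05, 0.2):
    closed_form = (1 + 19 * rho) / (1 + 39 * rho)
    print(f"rho = {rho:.2f}: ratio = {variance_ratio(rho):.4f}, "
          f"closed form = {closed_form:.4f}")
```

Because both designs contain 3600 households and share the same $f$ and $S^2$, only the two design effects survive in the ratio.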
