Natural Resource Sampling

CLUSTER SELECTION


Using Natural Clusters

When individuals in a population regularly or naturally combine to form groups or close geographical clusters, significant savings in data gathering costs can be had by the use of cluster selection. In cluster selection, we select at random a subset of clusters to represent the whole population, with every individual in the selected clusters being measured.

For example, suppose we were interested in estimating the average biomass of pond cypress in the Ocala National Forest of North Central Florida. Using aerial photography, we identify each of the cypress domes (clusters) in the study region. A random sample of domes is selected and measurement teams are sent to each selected dome to measure the biomass of each tree in the dome. Using these data, an overall biomass estimate is computed along with associated confidence intervals.

In cluster sampling, a simple random sample of clusters is taken, and all individuals in the selected clusters are included in the sample. If only a sample of individuals is taken from each of the selected clusters, the sampling method is known as two-stage selection. Often a hierarchy of clusters is used: first some large clusters are selected, next some smaller clusters are drawn from within the selected large clusters, and so on until finally individuals are selected within the final-stage clusters. This general method is known as multi-stage selection.
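The difference between one-stage cluster selection and two-stage selection can be sketched in a few lines of Python. This is only an illustration: the function names, the dome labels and the biomass values are hypothetical and are not taken from any real survey.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """One-stage cluster selection: draw clusters at random, keep every individual."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {cid: list(clusters[cid]) for cid in chosen}

def two_stage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Two-stage selection: draw clusters at random, then subsample individuals."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        cid: rng.sample(clusters[cid], min(n_per_cluster, len(clusters[cid])))
        for cid in chosen
    }

# Hypothetical cypress domes: dome id -> tree biomass measurements (made-up values).
domes = {
    "A": [12.1, 9.8, 14.3],
    "B": [7.5, 8.2],
    "C": [10.0, 11.4, 9.1, 13.2],
    "D": [6.9, 15.0, 12.7],
}

rng = random.Random(1)  # fixed seed so the sketch is repeatable
one_stage = cluster_sample(domes, 2, rng)
two_stage = two_stage_sample(domes, 2, 2, rng)
```

In `one_stage` every tree of each selected dome appears, while `two_stage` holds only two trees per selected dome.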

Comparisons to Stratified Selection

At first, one might think that cluster selection is just another form of stratified selection. There are, however, major differences between strata and clusters.

  1. In stratified selection, strata are chosen such that individuals in a particular stratum have approximately the same expected value for the parameter of interest, i.e. stratum individuals are relatively homogeneous. In cluster selection, each cluster is expected to contain the complete range of possible measurements from the whole population, i.e. cluster individuals are relatively heterogeneous.
  2. In stratified selection, a random sample of individuals from each stratum is used to estimate the population parameters of interest. In cluster selection, a sample of clusters is used to estimate the population parameters of interest.
  3. In stratified selection, it is the variability within strata which defines the precision of the parameter estimate. In cluster selection, it is the variability among cluster means which is used; the variability of measurements within a cluster is ignored.

Although strata and clusters are both groupings of units, they serve entirely different sampling purposes. Since strata are all represented in the sample, it is advantageous if they are internally homogeneous in the variables of interest. On the other hand, with only a sample of clusters being examined, the ones selected need to represent the ones not selected. This is best done when the clusters are as internally heterogeneous in the survey variables as possible.

Whereas stratified selection generally leads to more precise estimates of the population mean than simple random selection, cluster selection, except in special circumstances, leads to a loss in precision compared with simple random selection. Unless the economy in measurement and data collection created by cluster selection permits a sufficient increase in sample size to offset the associated loss of precision, cluster selection will be inappropriate.
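The effect of internal heterogeneity on precision can be checked by brute force on a toy population. The sketch below assumes equal-sized clusters and uses the plain sample mean of all measured individuals as the estimator; the eight values and both groupings are hypothetical. It enumerates every possible sample and computes the exact sampling variance of the estimator under each design.

```python
from itertools import combinations
from statistics import mean, pvariance

def estimator_variance(clusters, n):
    """Exact sampling variance of the sample mean over all choices of n clusters."""
    estimates = [
        mean(x for cluster in combo for x in cluster)
        for combo in combinations(clusters, n)
    ]
    return pvariance(estimates)

# Hypothetical population of eight individuals, grouped in two different ways.
homogeneous = [[1, 2], [3, 4], [5, 6], [7, 8]]    # similar values clustered together
heterogeneous = [[1, 8], [2, 7], [3, 6], [4, 5]]  # each cluster spans the full range
individuals = [[v] for v in range(1, 9)]          # singletons: simple random selection

var_homogeneous = estimator_variance(homogeneous, 2)      # 4 individuals measured
var_heterogeneous = estimator_variance(heterogeneous, 2)  # 4 individuals measured
var_srs = estimator_variance(individuals, 4)              # 4 individuals measured
```

With the same number of individuals measured in each design, the heterogeneous clusters give a smaller variance than simple random selection and the homogeneous clusters give a larger one, which is the pattern described above.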

Comparison to Systematic Selection

Systematic selection can be viewed as a type of cluster selection. For example, in systematic selection from a list, given a starting count k and an intersample count K, the sample consists of the k-th, (k+K)-th, (k+2K)-th, etc. units. The K possible starting counts define K different clusters, and in systematic sampling we use only one of them. With only one cluster, it is impossible to obtain an estimate of variance. This is why we say that there is no acceptable estimate for the variance of the parameter estimate when only one starting point is used. We need at least two starting points to be able to get a true variance estimate.
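To see the K possible systematic samples as K clusters, the following minimal sketch partitions a hypothetical list of 12 units with K = 4. The function name and the unit labels are illustrative only.

```python
def systematic_clusters(units, K):
    """Split a list into the K possible systematic samples, one per starting count."""
    return [units[start::K] for start in range(K)]

units = list(range(1, 13))  # hypothetical list of 12 units, labelled 1..12
clusters = systematic_clusters(units, K=4)
# clusters[0] is the sample for starting count k = 1: units 1, 5 and 9.
```

Every unit falls in exactly one of the K clusters, and a single systematic sample observes only one of them.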

In addition, systematic and cluster selection share the following properties.

  1. Both methods divide the population into groups, referred to as primary units.
  2. Each primary unit is subdivided into secondary units. In systematic selection each primary unit has the same number of secondary units. In cluster selection, each cluster may be of different size.
  3. If one primary unit is selected, all the secondary units of that primary unit are included in the sample.
  4. Measurements are made on the secondary unit.

Variable Cluster Size

When clusters are not all of the same size, there are a number of techniques available which take the cluster size distribution into account in the final estimates. We may be able to initially stratify clusters by size and hence reduce the variability in cluster size. Another approach is to select clusters with probability proportional to size (PPS). With PPS selection we can have a cluster selection design which produces parameter estimates having properties very similar to those obtained from cluster selection with equal cluster sizes. Since, in many cases, the true number of individuals in each cluster is not known, selection may need to be performed using probability proportional to estimated size.
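A minimal sketch of PPS selection follows. It assumes clusters are drawn with replacement, which is one common way to implement PPS but is not stated in the text, and the dome names and estimated sizes are hypothetical.

```python
import random

def pps_select(sizes, n, rng):
    """Draw n cluster ids with replacement, with probability proportional to size."""
    ids = list(sizes)
    weights = [sizes[cid] for cid in ids]
    return rng.choices(ids, weights=weights, k=n)

# Hypothetical estimated sizes (e.g. tree counts read from aerial photographs).
estimated_sizes = {"dome_1": 40, "dome_2": 10, "dome_3": 150, "dome_4": 0}

rng = random.Random(42)  # fixed seed so the sketch is repeatable
selected = pps_select(estimated_sizes, 3, rng)
```

A cluster with estimated size zero can never be drawn, and because the draws are with replacement a large cluster may appear more than once in `selected`.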

Other Properties of Cluster Selection

  1. Cluster size may be equal or unequal. The latter is suitable for selecting clusters with probabilities proportional to size.
  2. The size and shape of clusters affect sampling efficiency.
  3. The within-cluster variability is not taken into consideration in computing the variances of the estimators. Thus, to obtain precise estimates we want the variability within primary units to be large and the variability among cluster means to be as small as possible.
  4. In the majority of cases, cluster sampling is used mainly for convenience in the field rather than to obtain precise estimates.
  5. The estimators of the mean and total are unbiased if primary units are selected at random in cluster selection.
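The unbiasedness property in item 5 can be illustrated by enumeration. The sketch below assumes the usual expansion estimator of the total (N times the mean of the sampled cluster totals) and a hypothetical population of four cluster totals. It averages the estimate over every possible sample of two clusters.

```python
from itertools import combinations

def total_estimate(sample_totals, N):
    """Expansion estimator: N times the mean of the sampled cluster totals."""
    return N * sum(sample_totals) / len(sample_totals)

cluster_totals = [10.0, 4.0, 7.0, 3.0]  # hypothetical totals for N = 4 clusters
N = len(cluster_totals)

estimates = [total_estimate(s, N) for s in combinations(cluster_totals, 2)]
mean_estimate = sum(estimates) / len(estimates)
true_total = sum(cluster_totals)
```

Individual estimates differ from the true total, but their average over all equally likely samples equals it.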

Estimation

Assuming primary units are selected at random, define

N
- the number of cluster (primary) units in the population.
n
- the number of cluster (primary) units selected for measurement.
$m_i$
- the number of individuals (secondary units) in the i-th cluster unit. All of these units are measured/observed.
$y_{ij}$
- measurement on the j-th individual in the i-th cluster from the sample, i = 1, 2, ..., n; j = 1, 2, ..., $m_i$.

Compute the sum of all measurements in the i-th cluster as:

\[ y_i = \sum_{j=1}^{m_i} y_{ij} \]

The mean for the i-th cluster is:

\[ \bar{y}_i = \frac{y_i}{m_i} \]

The estimate of the average cluster unit mean is:

\[ \bar{\bar{y}} = \frac{1}{n} \sum_{i=1}^{n} \bar{y}_i \]

The estimated total (amount) is:

\[ \hat{Y} = \frac{N}{n} \sum_{i=1}^{n} y_i = N \bar{y}, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \]

The sample variance of the cluster unit totals is:

\[ s_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 \]

From this, the sample variance of the total estimate is:

\[ s_{\hat{Y}}^2 = N^2 \left( \frac{N-n}{N} \right) \frac{s_y^2}{n} \]

The variance of the overall mean estimator will depend on the variability of the individual cluster sizes. See the section on estimation in multi-stage selection for information on how to compute this variance.
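The computations of this section can be sketched as follows. The function name, the dictionary keys and the sample data are hypothetical. The code assumes the standard formulas for a simple random sample of clusters: the total is estimated by N times the mean sampled cluster total, and its variance by N^2 (1 - n/N) s^2 / n, with s^2 the sample variance of the cluster totals.

```python
def cluster_estimates(sampled_clusters, N):
    """Estimates from a simple random sample of clusters.

    sampled_clusters: one list of measurements per sampled cluster.
    N: number of clusters in the population.
    """
    n = len(sampled_clusters)
    totals = [sum(c) for c in sampled_clusters]                     # cluster totals
    means = [t / len(c) for t, c in zip(totals, sampled_clusters)]  # cluster means
    mean_of_means = sum(means) / n          # average cluster unit mean
    mean_total = sum(totals) / n            # mean cluster total
    total_hat = N * mean_total              # estimated population total
    s2_totals = sum((y - mean_total) ** 2 for y in totals) / (n - 1)
    var_total = N ** 2 * ((N - n) / N) * s2_totals / n
    return {
        "mean_of_means": mean_of_means,
        "total": total_hat,
        "s2_totals": s2_totals,
        "var_total": var_total,
    }

# Hypothetical sample of n = 3 clusters from a population of N = 10 clusters.
sample = [[2.0, 4.0], [1.0, 3.0, 5.0], [6.0]]
result = cluster_estimates(sample, N=10)
```

At least two sampled clusters are needed, since the variance of the cluster totals divides by n - 1.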

Ratio Estimation in Cluster Selection

If cluster sizes vary considerably, a ratio estimator may be used. This estimator is similar to that discussed for unequal length strip quadrats. Define the ratio r of the sampled cluster totals $y_i$ to the sampled cluster sizes $m_i$ as:

\[ r = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i} \]

Then the estimate of the population total is given by:

\[ \hat{Y}_R = M r \]

where M is the total number of individual units in the population. Often it may be difficult to get the exact value of M, which limits the usefulness of this estimator. If the cluster total ($y_i$) is highly correlated with the cluster size ($m_i$), the ratio estimator $\hat{Y}_R$ is a more efficient estimator than $\hat{Y}$, defined in the previous section.

Note that $\hat{Y}_R$ is a biased estimate, but the bias usually decreases as the sample size, n, increases.

Variance estimates and confidence intervals can be obtained as for variable strip sampling.
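A short sketch of the ratio estimator of the total, using hypothetical cluster totals, cluster sizes and population size M:

```python
def ratio_total_estimate(totals, sizes, M):
    """Ratio estimate of the population total.

    totals: sampled cluster totals.
    sizes: sizes of the same sampled clusters.
    M: total number of individual units in the population.
    """
    r = sum(totals) / sum(sizes)  # estimated amount per individual
    return r, M * r

# Hypothetical sample of three clusters from a population with M = 25 individuals.
r, total_hat_ratio = ratio_total_estimate([6.0, 9.0, 6.0], [2, 3, 1], M=25)
```

Only the totals and sizes of the sampled clusters are needed, together with M.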

Cluster Selection for a Proportion

Cluster sampling may also be used to estimate a population proportion. In this case, the cluster total, $y_i$, measures the number of individuals in the cluster having the characteristic of interest. The overall population proportion is estimated by:

\[ \hat{p} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i} \]

with associated variance:

\[ \hat{V}(\hat{p}) = \frac{N-n}{N n \bar{M}^2} \cdot \frac{\sum_{i=1}^{n} (y_i - \hat{p} m_i)^2}{n-1} \]

where $m_i$ is the number of individuals in the i-th cluster and $\bar{M}$ is the average cluster size for the population.

For example, suppose we were to examine planting beds in a pine tree nursery for evidence of fusiform rust. There are N = 415 beds, of which we choose n = 25 to sample. Let $\sum m_i$ be the number of plants observed in the 25 beds and $\sum y_i$ the number of those found to be infected with disease. The overall proportion estimate is $\hat{p} = \sum y_i / \sum m_i$. With the remaining quantities in the variance formula computed from the bed counts, the variance of the estimate is $\hat{V}(\hat{p})$ and a 95% confidence interval for the proportion is:

\[ \hat{p} \pm 1.96 \sqrt{\hat{V}(\hat{p})} \]
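The proportion calculation can be sketched as below. The bed counts are hypothetical and use only five beds for brevity. They are not the nursery data of the example. The code assumes the usual ratio-type variance formula for a cluster proportion, estimates the average cluster size by the sample mean bed size when the population value is not supplied, and uses a normal-approximation multiplier of 1.96 for the 95% interval.

```python
import math

def cluster_proportion(infected, sizes, N, M_bar=None):
    """Proportion estimate, variance estimate and approximate 95% interval.

    infected: number of infected plants in each sampled bed.
    sizes: number of plants in each sampled bed.
    N: number of beds in the population.
    M_bar: average bed size in the population (assumption: estimated by the
           sample mean bed size when not supplied).
    """
    n = len(sizes)
    p = sum(infected) / sum(sizes)
    if M_bar is None:
        M_bar = sum(sizes) / n
    ss = sum((y - p * m) ** 2 for y, m in zip(infected, sizes))
    var_p = (N - n) / (N * n * M_bar ** 2) * ss / (n - 1)
    half_width = 1.96 * math.sqrt(var_p)  # normal approximation (assumption)
    return p, var_p, (p - half_width, p + half_width)

# Hypothetical counts for five sampled beds out of N = 415.
sizes = [100, 120, 80, 100, 100]
infected = [10, 18, 4, 9, 9]
p_hat, var_p, ci = cluster_proportion(infected, sizes, N=415)
```

With so few beds the interval is only a rough guide. The 25-bed sample in the example would follow the same steps.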


Copyright ©,1997 L. C. Arvanitis and K. M. Portier, University of Florida