Natural Resource Sampling

When individuals in a population regularly or naturally combine to form
groups or close geographical clusters, significant savings in data gathering
costs can be had by the use of **cluster selection**. In cluster selection,
we select at random a subset of clusters to represent the whole population,
with every individual in the selected clusters being measured.

For example, suppose we were interested in estimating the average biomass of pond cypress in the Ocala National Forest of North Central Florida. Using aerial photography, we identify each of the cypress domes (clusters) in the study region. A random sample of domes is selected and measurement teams are sent to each selected dome to measure the biomass of each tree in the dome. Using these data, an overall biomass estimate is computed along with associated confidence intervals.

In cluster sampling, a simple random sample of clusters is taken, all
individuals in the selected cluster being included in the sample. If only
a sample of individuals is taken from each of the selected clusters, the
sampling method is known as **two-stage selection**. Often a hierarchy
of clusters is used: First some large clusters are selected, next some
smaller clusters are drawn from elements within the selected large clusters;
and so on until finally individuals are selected within the final-stage
cluster. This general method is known as **multi-stage selection**.
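The difference between cluster selection and two-stage selection can be sketched in a few lines of Python. This is only an illustration: the dome and tree identifiers, the function names, and the sample sizes are all hypothetical, and the second stage is assumed to be a simple random sample of at most `m` individuals per selected cluster.

```python
import random

def one_stage_cluster_sample(clusters, n, rng):
    """Select n clusters at random and keep every individual in each of them."""
    chosen = rng.sample(sorted(clusters), n)
    return {cid: list(clusters[cid]) for cid in chosen}

def two_stage_sample(clusters, n, m, rng):
    """Select n clusters at random, then at most m individuals within each."""
    chosen = rng.sample(sorted(clusters), n)
    return {
        cid: rng.sample(clusters[cid], min(m, len(clusters[cid])))
        for cid in chosen
    }

# Hypothetical population: six domes holding 4 to 9 made-up tree identifiers.
domes = {f"dome{d}": [f"d{d}_tree{t}" for t in range(1, 4 + d)] for d in range(1, 7)}

rng = random.Random(42)                          # fixed seed so the sketch is repeatable
one_stage = one_stage_cluster_sample(domes, 3, rng)
two_stage = two_stage_sample(domes, 3, 2, rng)
```

In `one_stage` every tree of each selected dome appears, whereas `two_stage` holds only two trees per selected dome.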

At first, one might think that cluster selection is just another form of stratified selection. However, there are major differences between strata and clusters.

- In stratified selection, strata are chosen such that individuals in a particular stratum have approximately the same expected value for the parameter of interest, i.e. stratum individuals are relatively homogeneous. In cluster selection, each cluster is expected to contain the complete range of possible measurements from the whole population, i.e. cluster individuals are relatively heterogeneous.
- In stratified selection, a random sample of individuals from each stratum is used to estimate the population parameters of interest. In cluster selection, a sample of clusters is used to estimate the population parameters of interest.
- In stratified selection, it is the variability within strata which defines the precision of the parameter estimate. In cluster selection, it is the variability among cluster means which is used; the variability of measurements within a cluster is ignored.

Although strata and clusters are both groupings of units, they serve entirely different sampling purposes. Since strata are all represented in the sample, it is advantageous if they are internally homogeneous in the variables of interest. On the other hand, with only a sample of clusters being examined, the ones selected need to represent the ones not selected. This is best done when the clusters are as internally heterogeneous in the survey variables as possible.

Whereas stratified selection generally leads to more precise estimates of the population mean, cluster selection, except in special circumstances, leads to a loss in precision compared with simple random selection. Unless the economy in measurement and data collection created by cluster selection permits a sufficient increase in sample size to offset the associated loss of precision, cluster selection will be inappropriate.
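The effect of within-cluster heterogeneity on precision can be checked by brute force on a toy population. The sketch below is an illustration under stated assumptions: the eight values, the two ways of grouping them into four equal-sized clusters, and the function names are all hypothetical. It enumerates every possible sample and computes the variance of the sample mean directly.

```python
from itertools import combinations
from statistics import mean, pvariance

def var_of_cluster_mean(clusters, n):
    """Variance of the sample mean over all possible samples of n whole clusters."""
    means = [
        mean([v for cluster in chosen for v in cluster])
        for chosen in combinations(clusters, n)
    ]
    return pvariance(means)

def var_of_srs_mean(values, size):
    """Variance of the sample mean over all simple random samples of `size` individuals."""
    return pvariance([mean(s) for s in combinations(values, size)])

values = [1, 2, 3, 4, 5, 6, 7, 8]                     # hypothetical population
homogeneous = [[1, 2], [3, 4], [5, 6], [7, 8]]        # similar values share a cluster
heterogeneous = [[1, 8], [2, 7], [3, 6], [4, 5]]      # each cluster spans the range

# Every design below measures four individuals.
v_srs = var_of_srs_mean(values, 4)
v_homog = var_of_cluster_mean(homogeneous, 2)
v_hetero = var_of_cluster_mean(heterogeneous, 2)
```

With these made-up values the internally homogeneous clusters give a larger variance than simple random selection, while the internally heterogeneous clusters (all with the same mean) give a variance of zero.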

Systematic selection can be viewed as a type of cluster selection. For
example, in systematic selection from a list, given a starting count, *k*,
and an intersample count, *K*, we have a sample consisting of the
*k*-th, (*k* + *K*)-th, (*k* + 2*K*)-th, etc. units. The *K* possible
starting counts define *K* different clusters.
In systematic sampling we use only one of the *K* possible clusters.
With only one cluster, it is impossible to obtain an estimate of variance.
This is why we say that there is no acceptable estimate for variance of
the parameter estimate using only one starting point. We need at least
two starting points to be able to get a true variance estimate.
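A short sketch makes the correspondence concrete: taking every *K*-th unit from each possible starting position splits a list into *K* clusters. The list of 12 units, the interval `K = 4`, and the function name are hypothetical, and positions are counted from zero as is usual in Python.

```python
import random
import statistics

def systematic_samples(units, K):
    """Return the K possible systematic samples, one per starting position."""
    return [units[k::K] for k in range(K)]

units = list(range(1, 13))          # hypothetical list of 12 units
K = 4
clusters = systematic_samples(units, K)

# A single starting point yields one cluster mean, so no variance among
# cluster means can be computed. Two random starts give two means.
rng = random.Random(1)
starts = rng.sample(range(K), 2)
means = [statistics.mean(clusters[k]) for k in starts]
var_between = statistics.variance(means)
```

Here `clusters` contains four non-overlapping samples of three units each that together cover the whole list, and `var_between` can be computed only because two starting points were drawn.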

In addition, systematic and cluster selection share the following properties.

- Both methods divide the population into groups, referred to as **primary** units.
- Each primary unit is subdivided into **secondary** units. In systematic selection each primary unit has the same number of secondary units. In cluster selection, clusters may differ in size.
- If a primary unit is selected, all the secondary units of that primary unit are included in the sample.
- Measurements are made on the secondary units.

When clusters are not all of the same size, there are a number of techniques
available which take cluster size distribution into account in the final
estimates. We may be able to initially stratify clusters by size and hence
reduce the variability in cluster size. Another approach is to select clusters
with **probability proportional to size** (PPS). With PPS selection
we can have a cluster selection design which produces parameter estimates
having properties very similar to what is obtained from cluster selection
with equal cluster sizes. Since, in many cases, the true number of individuals
in each cluster is not known, selection may need to be performed using
**probability proportional to estimated size**.
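One simple way to carry out PPS selection is to draw clusters with replacement, with weights equal to the (possibly estimated) cluster sizes. The sketch below assumes this with-replacement scheme and pairs it with the standard Hansen-Hurwitz estimator of the total, which is not described in this text and is used here only for illustration. The sizes, totals, and function names are hypothetical.

```python
import random

def pps_draw(sizes, n, rng):
    """Draw n cluster indices with replacement, probability proportional to size."""
    return rng.choices(range(len(sizes)), weights=sizes, k=n)

def hansen_hurwitz_total(totals, sizes, draws):
    """Average of y_i / p_i over the draws, where p_i = M_i / M."""
    M = sum(sizes)
    return sum(totals[i] * M / sizes[i] for i in draws) / len(draws)

sizes = [10, 40, 25, 5, 20]                    # hypothetical cluster sizes
totals = [30.0, 120.0, 75.0, 15.0, 60.0]       # made-up totals: exactly 3 per individual

rng = random.Random(7)
draws = pps_draw(sizes, 3, rng)
estimate = hansen_hurwitz_total(totals, sizes, draws)
```

Because each term equals the cluster mean multiplied by the population size, the estimate varies only with the cluster means and not with the cluster sizes. In this made-up example every cluster has the same mean, so `estimate` equals the true total of 300 whichever clusters are drawn.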

- Cluster size may be equal or unequal. The latter is suitable for selecting clusters with probabilities proportional to size.
- The size and shape of clusters affect sampling efficiency.
- The *within cluster variability* is not taken into consideration in computing the variances of the estimators. Thus, to obtain precise estimates we want the within primary unit variability to be large and the variability among cluster means to be as small as possible.
- In the majority of cases, cluster sampling is used mainly for convenience in the field rather than to obtain precise estimates.
- The estimators of the mean and total are unbiased if primary units are selected at random in cluster selection.

Assuming primary units are selected at random, define

- **N** - the number of cluster (primary) units in the population.
- **n** - the number of cluster (primary) units selected for measurement.
- $M_i$ - the number of individuals (secondary units) in the i-th cluster unit. All of these units are measured/observed.
- $y_{ij}$ - measurement on the j-th individual in the i-th cluster from the sample, i = 1, 2, ... n; j = 1, 2, ... $M_i$.

Compute the sum of all measurements in the i-th cluster as:

$$y_i = \sum_{j=1}^{M_i} y_{ij}$$

The mean for the cluster is:

$$\bar{y}_i = \frac{y_i}{M_i}$$

The estimate of the mean total per cluster unit is:

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

The estimated total (amount) is:

$$\hat{Y} = N \bar{y}$$

The sample variance of the cluster unit totals is:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$$

From this, the sample variance of the total estimate is:

$$\hat{V}(\hat{Y}) = N (N - n) \frac{s^2}{n}$$

The variance of the overall mean estimator will depend on the variability of the individual cluster sizes. See the section on estimation in multi-stage selection for information on how to compute this variance.
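The computation of the total estimate can be sketched as follows, assuming the usual unbiased form: the estimated total is N times the mean of the sampled cluster totals, and its variance estimate is N(N - n) times the sample variance of the cluster totals divided by n. The function name and the data are hypothetical.

```python
import math

def cluster_total_estimate(cluster_totals, N):
    """Return (estimated total, estimated variance) from sampled cluster totals."""
    n = len(cluster_totals)
    ybar = sum(cluster_totals) / n                               # mean cluster total
    s2 = sum((y - ybar) ** 2 for y in cluster_totals) / (n - 1)  # variance of totals
    total = N * ybar
    var_total = N * (N - n) * s2 / n
    return total, var_total

y = [12.0, 15.0, 9.0, 14.0, 10.0]     # made-up totals for n = 5 sampled clusters
N = 40                                # assumed number of clusters in the population
total, var_total = cluster_total_estimate(y, N)
se = math.sqrt(var_total)             # standard error of the estimated total
```

For these made-up numbers the mean cluster total is 12, so the estimated total is 480 with an estimated variance of 1820.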

If cluster sizes vary considerably, a ratio estimator may be used. This
estimator is similar to that discussed for unequal length **strip quadrats**.
With $y_i$ the total of the measurements in the i-th sampled cluster and
$M_i$ the number of individuals in that cluster, define the ratio, **r**, as:

$$r = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} M_i}$$

Then the estimate of the population total is given by:

$$\hat{Y}_r = r M$$

where **M** is the total number of individual units in the population.
Often it may be difficult to get the exact value of **M**, which limits
the usefulness of this estimator. If the cluster total ($y_i$) is highly
correlated with the cluster size ($M_i$), the ratio estimator, $\hat{Y}_r$,
is a more efficient estimator than is $\hat{Y} = N \bar{y}$, the estimate
based on the mean cluster total, defined in the previous section.

Note that $\hat{Y}_r$ is a biased estimate, but the bias usually decreases
as the sample size, **n**, increases.

Variance estimates and confidence intervals can be obtained as for variable strip sampling.
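A minimal sketch of the ratio estimator follows. The ratio is the sum of the sampled cluster totals divided by the sum of the sampled cluster sizes, and the total estimate is that ratio times **M**. The variance line uses the standard textbook form based on the residuals of the cluster totals about `r` times the cluster size. That form is an assumption here, since this text defers the variance to the strip sampling section. The data and function name are hypothetical.

```python
def ratio_total_estimate(cluster_totals, cluster_sizes, M, N):
    """Return (r, estimated total, estimated variance) for the ratio estimator."""
    n = len(cluster_totals)
    r = sum(cluster_totals) / sum(cluster_sizes)
    total_r = r * M
    # Assumed standard form: residual variance of totals about r * size.
    s2_r = sum(
        (y - r * m) ** 2 for y, m in zip(cluster_totals, cluster_sizes)
    ) / (n - 1)
    var_total_r = N * (N - n) * s2_r / n
    return r, total_r, var_total_r

y = [12.0, 15.0, 9.0, 14.0, 10.0]     # made-up cluster totals
sizes = [6, 8, 4, 7, 5]               # made-up cluster sizes
M = 250                               # assumed number of individuals in the population
N = 40                                # assumed number of clusters in the population
r, total_r, var_total_r = ratio_total_estimate(y, sizes, M, N)
```

In this made-up example the totals track the sizes closely (r = 2), so the residuals are small and the variance estimate (140) is far below what the same totals would give without using the size information.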

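Cluster sampling may also be used to estimate a population proportion. In this case, the cluster total, $y_i$, measures the number of individuals in the i-th cluster having the characteristic of interest. With $M_i$ the number of individuals in the i-th cluster, the overall population proportion is estimated by:

$$\hat{p} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} M_i}$$

with associated variance:

$$\hat{V}(\hat{p}) = \left(1 - \frac{n}{N}\right) \frac{1}{n \bar{M}^2} \, \frac{\sum_{i=1}^{n} (y_i - \hat{p} M_i)^2}{n - 1}$$

where $\bar{M} = M/N$ is the average cluster size for the population.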
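The proportion estimate and its variance can be sketched as below. The estimate is the number of individuals with the characteristic divided by the number of individuals observed, and the variance is built from the squared residuals of each cluster count about the estimate times the cluster size. The counts are hypothetical and are not the nursery data. Falling back on the sample average cluster size when the population value is not supplied, and using the normal multiplier 1.96 for the interval, are both assumptions of this sketch.

```python
import math

def cluster_proportion(counts, sizes, N, M_bar=None):
    """Return (p_hat, estimated variance) from per-cluster counts and sizes."""
    n = len(sizes)
    p = sum(counts) / sum(sizes)
    if M_bar is None:
        M_bar = sum(sizes) / n        # assumption: use the sample average size
    ss = sum((a - p * m) ** 2 for a, m in zip(counts, sizes))
    var_p = (1 - n / N) * ss / (n * M_bar ** 2 * (n - 1))
    return p, var_p

sizes = [50, 60, 40, 50]              # made-up plants per sampled bed
counts = [5, 9, 2, 4]                 # made-up infected plants per sampled bed
N = 100                               # assumed number of beds in the population
p, var_p = cluster_proportion(counts, sizes, N)

half_width = 1.96 * math.sqrt(var_p)  # normal approximation (assumption)
interval = (p - half_width, p + half_width)
```

For these made-up counts, 20 of 200 plants are infected, so the estimated proportion is 0.1.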

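For example, suppose we were to examine planting beds in a pine tree nursery for evidence of fusiform rust. There are N = 415 beds of which we choose n = 25 to sample. We observe $\sum M_i$ plants in the 25 beds, of which $\sum y_i$ are found to be infected with disease. The overall proportion estimate is $\hat{p} = \sum y_i / \sum M_i$. Given the average bed size $\bar{M}$ and the sum of squared deviations $\sum (y_i - \hat{p} M_i)^2$, the variance of the estimate is $\hat{V}(\hat{p})$ and an approximate 95% confidence interval for the proportion, using the normal approximation, is:

$$\hat{p} \pm 1.96 \sqrt{\hat{V}(\hat{p})}$$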

Copyright ©,1997 L. C. Arvanitis and K. M. Portier, University of Florida