|
1. When is auxiliary information best used: for
sample design or for estimation purposes?
Answer:
To answer this question, we first need
some structure. Suppose we are interested in estimating the population totals
for a set of variables based on a random sample of population units drawn from
some sampling frame.
Aiding this endeavor is a second set of auxiliary variables for which we
already know the population totals. This auxiliary information can be used at
the estimation stage to compensate for units selected for the sample that fail
to provide adequate responses and for population units missing from the
sampling frame. There are no equivalent potential uses of such information at
the design stage.
Even in a textbook environment where the
sampling frame is complete and every sampled unit provides a usable response,
the combination of a simple random sample and an estimator constructed to make
prudent use of the auxiliary information will usually produce better results
than a cleverly constructed sample design employing the same information
coupled with an expansion estimator (a sample-weighted sum where the weights
are the reciprocals of the probabilities of selection).
That is not always the case. In one of the
simplest examples of the use of auxiliary information, there is a single
variable of interest and a single auxiliary. The value of the auxiliary
variable is known and positive for every unit in the population. If the
unit-by-unit ratios of the variable of interest and the auxiliary variable
behave like independent random variables with a common mean and variance, then
for a given sample size, the combination of a sample drawn with probabilities
proportional to the auxiliary variable and an expansion estimator will tend to
be more efficient (have less variance) than a simple random sample combined
with a ratio estimator (an expansion estimator for the variable of interest
multiplied by the ratio of the population total for the auxiliary variable and
the expansion estimator for the auxiliary variable).
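The comparison above can be explored with a small simulation. What follows is a minimal sketch rather than a definitive study: the population size, the distribution of the auxiliary variable, the model parameters, and all function names are illustrative assumptions, and the probability-proportional-to-size sample is drawn with replacement, which is a simplification of the fixed-size designs used in practice.

```python
import random


def make_population(size, rng):
    """Hypothetical population: y = x * r, with the ratios r drawn i.i.d."""
    xs = [1.0 + rng.expovariate(1 / 50) for _ in range(size)]
    ys = [x * rng.gauss(2.0, 0.5) for x in xs]
    return xs, ys


def srs_ratio_estimate(xs, ys, n, rng):
    """Simple random sample without replacement plus a ratio estimator."""
    idx = rng.sample(range(len(xs)), n)
    # The common weight N/n cancels in the ratio of the two expansion estimators.
    return sum(xs) * sum(ys[i] for i in idx) / sum(xs[i] for i in idx)


def pps_expansion_estimate(xs, ys, n, rng):
    """Draws proportional to x (with replacement) plus an expansion estimator."""
    total_x = sum(xs)
    idx = rng.choices(range(len(xs)), weights=xs, k=n)
    # Each draw is weighted by 1 / (n * p_i), where p_i = x_i / total_x.
    return sum(ys[i] * total_x / (n * xs[i]) for i in idx)


def compare_designs(reps=2000, size=500, n=30, seed=1):
    """Return empirical mean squared errors (SRS + ratio, PPS + expansion)."""
    rng = random.Random(seed)
    xs, ys = make_population(size, rng)
    true_total = sum(ys)

    def mse(estimator):
        errors = [estimator(xs, ys, n, rng) - true_total for _ in range(reps)]
        return sum(e * e for e in errors) / reps

    return mse(srs_ratio_estimate), mse(pps_expansion_estimate)


if __name__ == "__main__":
    mse_ratio, mse_pps = compare_designs()
    print(f"SRS + ratio estimator: MSE = {mse_ratio:.3e}")
    print(f"PPS + expansion:       MSE = {mse_pps:.3e}")
```

Under these assumed settings the second figure should come out smaller, in line with the argument above. Populations that depart from the ratio model may behave differently.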
Few surveys are conducted to estimate a single
variable total, however. If we have two or more variables of interest, each with
its own auxiliary, then we can construct a different ratio estimator for each
variable total, but we can only draw the sample once. The design may be
relatively efficient for one of those variables, but not for all of them.
Moreover, with only a modest loss of efficiency, we can use the same
calibration estimator for each survey variable. A calibration estimator looks
like an expansion estimator but the original sampling weights are replaced by
calibration weights. Calibration weights are modifications of the original
sampling weights constructed so that the calibration-weighted sum across the
sample of each auxiliary variable equals its population total. (Note: The ratio
estimator is an example of a calibration estimator in which the calibration
weight for a unit is its original sampling weight multiplied by the ratio of
the population total for a lone auxiliary variable to the expansion estimator
for that auxiliary variable.) There is no equivalent way to design a sample to
be as efficient for all variables of interest at once.
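To make the note on the ratio estimator concrete, here is a minimal sketch of ratio calibration with a single auxiliary variable. The sample values, the known total, and the function names are hypothetical. Calibrating on several auxiliary variables at once requires solving for weights that meet every constraint simultaneously, which this sketch does not attempt.

```python
def weighted_total(weights, values):
    """Weighted sum across the sample (an expansion-type estimator)."""
    return sum(w * v for w, v in zip(weights, values))


def ratio_calibration_weights(design_weights, aux_values, aux_total):
    """Scale every design weight by (known total) / (expansion estimate)."""
    factor = aux_total / weighted_total(design_weights, aux_values)
    return [w * factor for w in design_weights]


# Hypothetical sample of 4 units, each with design weight 25.
design_weights = [25.0, 25.0, 25.0, 25.0]
aux = [10.0, 20.0, 30.0, 40.0]       # auxiliary variable observed in the sample
aux_total = 2600.0                   # assumed known population total
crop_a = [3.0, 7.0, 8.0, 13.0]       # two illustrative survey variables
crop_b = [0.0, 4.0, 5.0, 9.0]

cal_weights = ratio_calibration_weights(design_weights, aux, aux_total)

# The calibration-weighted sum of the auxiliary reproduces its known total.
print(weighted_total(cal_weights, aux))
# The same calibration weights serve every survey variable.
print(weighted_total(cal_weights, crop_a), weighted_total(cal_weights, crop_b))
```

The point of the design is that one set of weights is computed once and then reused for each variable of interest, whereas a separate ratio estimator would have to be built for each.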
This is not to say that auxiliary information
should be used in estimation exclusively. Far from it. The National
Agricultural Statistics Service calibrates its quarterly crops surveys on as
many as 20 auxiliary variables in a state. The agency uses the same set of
auxiliary variables in its sample design to assure adequate sample sizes for
each variable of interest (if the auxiliary value is zero, the corresponding
survey value will likely be zero as well). Moreover, the sample design can be
used to increase efficiency of the estimator.
As an example of this, recall the
single-variable-of-interest-single-auxiliary example discussed above. Suppose
again that the unit-by-unit ratios of the variable of interest and the
auxiliary variable can be treated as independent random variables with a common
mean and variance (formally, this is a model assumption). For a fixed sample
size, the design under which the anticipated (model-expected) variance of the
ratio estimator is minimized selects units with probability proportional to the
auxiliary variable. It turns out that under that design, the
original-sample-weighted sum of the auxiliary variable equals its population
total, and the ratio estimator collapses into the expansion estimator.
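The collapse described above follows from simple arithmetic, sketched below for a tiny hypothetical population. The sketch assumes a fixed sample size n and inclusion probabilities n * x_i / X. It does not implement a mechanism for actually drawing such a sample; it only checks the identity for every possible sample of size n. Function names are illustrative.

```python
from itertools import combinations


def pps_inclusion_probabilities(aux_values, n):
    """Inclusion probabilities proportional to the auxiliary: n * x_i / X."""
    total = sum(aux_values)
    probs = [n * x / total for x in aux_values]
    if any(p > 1 for p in probs):
        raise ValueError("some units are too large for this sample size")
    return probs


def expansion_total(values, probs, sample):
    """Sum of values weighted by the reciprocals of the inclusion probabilities."""
    return sum(values[i] / probs[i] for i in sample)


aux = [5.0, 10.0, 15.0, 20.0, 25.0, 25.0]   # hypothetical auxiliary values
probs = pps_inclusion_probabilities(aux, n=2)

for sample in combinations(range(len(aux)), 2):
    x_hat = expansion_total(aux, probs, sample)
    # Each sampled unit contributes x_i / (n * x_i / X) = X / n, so x_hat = X
    # and the ratio adjustment X / x_hat is 1 for every sample.
    print(sample, x_hat, sum(aux) / x_hat)
```

Because the ratio adjustment is 1 whatever sample is drawn, multiplying the expansion estimator for the variable of interest by it changes nothing.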
Perhaps the best answer to the question is a Zen
one. Mu. Unask it. Auxiliary variables can most profitably be used in sample
design and estimation simultaneously.
Kott, Phillip S. and Bailey, Jeffrey T. (2000),
"The Theory and Practice of Maximal Brewer Selection with Poisson PRN
Sampling," Paper presented at the International Conference on Establishment
Surveys, II, Buffalo, New York. Also at
http://www.nass.usda.gov/research/reports/icespap2c8.pdf.
Phil Kott
USDA/NASS/RDD
703-877-8000 x102
The Survey Statistician, no. 53, page 13-14, January 2006

2. Why collect new data when data are readily available in
registers and data banks?
William E. Winkler (william.e.winkler@census.gov)
Answer:
There are three reasons that may limit the suitability of register and other
data. The first reason is that certain data in a register may not be accurate
if the quality of values of individual fields is not high. Some information
such as the value of a sex code or age may be missing or inaccurate because its
accuracy is not needed for the day-to-day needs of the database. For instance,
a tax file may not need sex code or age to be accurate. The tax file may not
accurately track children in those countries where tax breaks are given for
dependent children. If the main tax file needs to connect into other files with
supplementary tax information, then the quality of the tax id field in each
file needs to be high. Any error in the tax id field usually causes the main
tax record to not be connected with the correct corresponding supplementary
record. The second reason is that a set of files may not have unique, verified
identifiers. If the analyst needs to use joint (x, y) data where x comes from
one file and y from another file, then x- and y-data can usually only be easily
and accurately linked using the unique identifiers. The third reason is that an
analyst may need an extra z-variable to combine with (x, y) data where z is not
in any known file. For instance, z might be the amount of tax savings realized by
some individuals due to a specific tax break, or the result of treating a
particular disease with a new drug.
If sex code is in error (or missing), then it
might be corrected using the first name. If age or date-of-birth is missing or
in error, then it might be corrected using an auxiliary data source. If
matching is on name and address, then name would need to be accurate and
address would need to be current. If the file contains 30,000 records with the
name ‘John Smith,’ then an erroneous or out-of-date address would not allow
linkage of records across two files. In some situations, it is virtually
impossible to match a record ‘Karen Jones 1964Apr10’ with ‘Susan K. Smith
1985Jan07’ because Karen Jones now has the last name Smith, she usually uses
her middle name Karen instead of Susan, and one of the dates-of-birth is
completely wrong. With businesses, it may be extremely difficult to match ‘John
L. Smith and Sons, Inc 1234 Main Street’ with ‘JLS Co. PO Box 657.’ Business
names are often represented in a number of difficult-to-compare variations. An
address associated with a business may be associated with a location, a PO Box,
or the address of an accountant.
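The difficulty with business names can be illustrated with a naive string comparator. This is only a sketch using Python's general-purpose difflib module, not the comparison method used in any production record linkage system; the misspelling 'Jon Smith' is a hypothetical example added here for contrast.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity score between 0 (no overlap) and 1 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# A small typographical variation still scores close to 1.
print(similarity("John Smith", "Jon Smith"))

# The two business-name variants from the text share very few characters,
# so a character-level comparison gives little evidence that they match.
print(similarity("John L. Smith and Sons, Inc", "JLS Co."))
```

A comparator of this kind handles typographical variation reasonably well but cannot recognize abbreviations, name changes, or alternative addresses, which is why such pairs are so hard to link.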
False match and false nonmatch rates are needed
to evaluate the quality of matching. If the false match rate is moderate or
high, then the resultant merged file may yield substantial analytic errors.
Figure 1 illustrates the situation. The line represents the true regression
line. Figure 1a shows original (x, y) regression data without matching error.
Figure 1b shows the regression data with 10% matching error and Figure 1c shows
the regression data with 50% matching error. With a 10% false match rate, the
regression coefficient is more than 10% in error and the R² statistic is low by
25%. With a 50% false match rate, the regression coefficient is more than 50%
in error and the R² statistic is low by 75%. If the false match rate is very low
and the false nonmatch rate is moderate, then the intersection A ∩ B may not be
a representative subset of either file A or file B. For instance, if the records
for low-income individuals in either file A or file B contain disproportionately
more typographical variation than those for other individuals, then low-income
individuals will not be well represented in the intersection.
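The effect of false matches on a regression can be sketched with simulated data. The example below does not reproduce Figure 1 or its numbers; the data-generating model (y = 3x plus noise), the sample size, and the function names are all assumptions chosen for illustration. A false match is imitated by giving a record the y value of another randomly chosen record.

```python
import random


def slope_and_r2(xs, ys):
    """Ordinary least squares slope and R-squared for a simple regression."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy * sxy / (sxx * syy)


def inject_false_matches(ys, rate, rng):
    """Shuffle the y values among a randomly chosen fraction of the records."""
    chosen = rng.sample(range(len(ys)), round(rate * len(ys)))
    donors = chosen[:]
    rng.shuffle(donors)  # a few records may by chance keep their own y value
    linked = list(ys)
    for target, donor in zip(chosen, donors):
        linked[target] = ys[donor]
    return linked


def simulate(rate, n=2000, seed=7):
    """Slope and R-squared after linking x to y with the given false match rate."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0, 1) for x in xs]
    return slope_and_r2(xs, inject_false_matches(ys, rate, rng))


for rate in (0.0, 0.1, 0.5):
    slope, r2 = simulate(rate)
    print(f"false match rate {rate:.0%}: slope = {slope:.2f}, R-squared = {r2:.2f}")
```

In this setup both the estimated slope and R-squared shrink as the false match rate rises, because falsely linked pairs carry essentially no relationship between x and y.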

The Survey Statistician, no. 53, page 12-13, January 2006

3. We made a survey last year in the housing sector, and we are going to repeat it next year in order to measure changes. What do we need to think of? Should we make
an independent sample?
An independent sample is probably not a good idea. The reason
is that you could benefit greatly from the positive correlations that occur for
many housing survey variables between the two years being compared, by using
(a large part of) last year's sample again and basing your inference on the
differences for individual households. Such a sampling procedure can yield
drastic reductions in sampling variance compared with
one based on two independent samples, because positive correlations are
very favorable when the same sample is used on both occasions with an estimator
based on differences. In Wallis & Roberts (1965) an example is given showing
that 25 units measured twice were comparable in precision to two samples of
2,222 units measured once each! As the authors put it: "This illustrates the
potential importance of proper statistical planning before collecting
data".
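The gain from reusing the sample comes from the covariance term in the variance of a difference. The sketch below uses the standard formulas for the variance of a difference of two means, ignoring finite population corrections and assuming full overlap between the two occasions. The function names and the numerical inputs are illustrative and are not taken from Wallis & Roberts.

```python
def var_change_same_sample(sigma1, sigma2, rho, n):
    """Variance of the mean difference when the same n units are measured twice."""
    return (sigma1 ** 2 + sigma2 ** 2 - 2 * rho * sigma1 * sigma2) / n


def var_change_independent(sigma1, sigma2, n1, n2):
    """Variance of the difference between means of two independent samples."""
    return sigma1 ** 2 / n1 + sigma2 ** 2 / n2


def equivalent_independent_size(n, rho):
    """Size of each independent sample matching n repeated units.

    Assumes equal variances on both occasions and a correlation rho < 1.
    """
    return n / (1 - rho)


# With a correlation of 0.9 between occasions, 25 repeated households are
# as precise for estimating change as two independent samples of about 250.
n, rho = 25, 0.9
m = equivalent_independent_size(n, rho)
print(m)
print(var_change_same_sample(1.0, 1.0, rho, n))
print(var_change_independent(1.0, 1.0, m, m))
```

The closer the correlation is to 1, the larger the independent samples would have to be to match the repeated sample.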
Another important issue is making sure that the
estimators really do estimate the true change. If you make changes to the
methodology of the survey, you will have to rule those changes out as an
explanation of the difference that is obtained. So, using the same mode of data
collection and the same questionnaire on both occasions would probably be a wise
design decision.
Wallis W. A. & Roberts H. V.:
Statistics: A New Approach, Twelfth printing 1965.
The Survey Statistician, no. 52, page 16, July 2005
|