International Association of Survey Statisticians (IASS)

Ask the Experts: Design

1. When is auxiliary information best used; for sample design or for estimation purposes?
2. Why collect new data when data are readily available in registers and data banks?
3. We made a survey last year in the housing sector, and we are going to repeat it next year in order to measure changes. What do we need to think of? Should we make an independent sample?
 
1. When is auxiliary information best used; for sample design or for estimation purposes?

Answer:
To answer this question, we first need some structure. Suppose we are interested in estimating the population totals for a set of variables based on a random sample of population units drawn from some sampling frame.
Aiding this endeavor is a second set of auxiliary variables for which we already know the population totals. This auxiliary information can be used at the estimation stage to compensate for units selected for the sample that fail to provide adequate responses and for population units missing from the sampling frame. There are no equivalent potential uses of such information at the design stage.

Even in a textbook environment where the sampling frame is complete and every sampled unit provides a usable response, the combination of a simple random sample and an estimator constructed to make prudent use of the auxiliary information will usually produce better results than a cleverly constructed sample design employing the same information coupled with an expansion estimator (a sample-weighted sum where the weights are the reciprocals of the probabilities of selection).

That is not always the case. In one of the simplest examples of the use of auxiliary information, there is a single variable of interest and a single auxiliary. The value of the auxiliary variable is known and positive for every unit in the population. If the unit-by-unit ratios of the variable of interest and the auxiliary variable behave like independent random variables with a common mean and variance, then for a given sample size, the combination of a sample drawn with probabilities proportional to the auxiliary variable and an expansion estimator will tend to be more efficient (have less variance) than a simple random sample combined with a ratio estimator (an expansion estimator for the variable of interest multiplied by the ratio of the population total for the auxiliary variable and the expansion estimator for the auxiliary variable).
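The expansion estimator and the ratio estimator described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population, its size, the seed and the function names are all hypothetical, and the sample is a simple random sample, so every selection probability is n/N.

```python
import random

def expansion_estimate(values, probs):
    """Expansion estimator: sampled values weighted by 1 / selection probability."""
    return sum(v / p for v, p in zip(values, probs))

def ratio_estimate(y_sample, x_sample, probs, x_total):
    """Ratio estimator: expansion estimate of y, multiplied by the known
    population total of x over the expansion estimate of x."""
    y_hat = expansion_estimate(y_sample, probs)
    x_hat = expansion_estimate(x_sample, probs)
    return y_hat * x_total / x_hat

# Hypothetical population in which y is roughly proportional to x.
rng = random.Random(1)
N, n = 1000, 50
x = [rng.uniform(1, 100) for _ in range(N)]
y = [2.0 * xi * rng.uniform(0.8, 1.2) for xi in x]
x_total = sum(x)

# Simple random sample without replacement: each unit has probability n / N.
idx = rng.sample(range(N), n)
probs = [n / N] * n
y_s = [y[i] for i in idx]
x_s = [x[i] for i in idx]

print("true total of y:    ", sum(y))
print("expansion estimate: ", expansion_estimate(y_s, probs))
print("ratio estimate:     ", ratio_estimate(y_s, x_s, probs, x_total))
```

Applied to the auxiliary variable itself, the ratio estimator returns the known total exactly, which is the sense in which it "uses" the auxiliary information.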

Few surveys are conducted to estimate a single variable total, however. If we have two or more variables of interest each with their own auxiliary, then we can construct a different ratio estimator for each variable total, but we can only draw the sample once. The design may be relatively efficient for one of those variables, but not for all of them. Moreover, with only a modest loss of efficiency, we can use the same calibration estimator for each survey variable. A calibration estimator looks like an expansion estimator but the original sampling weights are replaced by calibration weights. Calibration weights are modifications of the original sampling weights constructed so that the calibration-weighted sum across the sample of each auxiliary variable equals its population total. (Note: The ratio estimator is an example of a calibration estimator in which the calibration weight for a unit is its original sampling weight multiplied by the ratio of the population total for a lone auxiliary variable to the expansion estimator for that auxiliary variable.) There is no equivalent way to design a sample to be as efficient for all variables of interest at once.
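A minimal sketch of calibration weighting follows. It assumes linear calibration, which is one common way to build such weights (the text does not say which method is intended), and it handles only two auxiliary variables so that the 2x2 system can be solved by hand. The function name and all numbers are hypothetical.

```python
def linear_calibration_weights(d, x1, x2, total1, total2):
    """Return weights w_i = d_i * (1 + l1*x1_i + l2*x2_i), with (l1, l2)
    chosen so that sum(w*x1) == total1 and sum(w*x2) == total2."""
    # Design-weighted cross-products of the auxiliaries.
    t11 = sum(di * a * a for di, a in zip(d, x1))
    t12 = sum(di * a * b for di, a, b in zip(d, x1, x2))
    t22 = sum(di * b * b for di, b in zip(d, x2))
    # Gaps between the known totals and the expansion estimates.
    r1 = total1 - sum(di * a for di, a in zip(d, x1))
    r2 = total2 - sum(di * b for di, b in zip(d, x2))
    # Solve the 2x2 linear system for the adjustment coefficients.
    det = t11 * t22 - t12 * t12
    l1 = (t22 * r1 - t12 * r2) / det
    l2 = (t11 * r2 - t12 * r1) / det
    return [di * (1 + l1 * a + l2 * b) for di, a, b in zip(d, x1, x2)]

# Hypothetical sample of five units, each with original sampling weight 10.
d = [10.0] * 5
x1 = [1.0] * 5                    # constant 1: calibrates to the population count
x2 = [2.0, 4.0, 6.0, 8.0, 10.0]   # a size measure
w = linear_calibration_weights(d, x1, x2, total1=52.0, total2=330.0)

# The same weights are then applied to every survey variable.
y = [3.0, 5.0, 8.0, 9.0, 14.0]    # hypothetical variable of interest
print("calibration weights:", w)
print("calibrated estimate of the y total:", sum(wi * yi for wi, yi in zip(w, y)))
```

The point of the design is visible in the last two lines: one set of weights serves every variable of interest, whereas a ratio estimator would need a separate adjustment for each.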

This is not to say that auxiliary information should be used in estimation exclusively. Far from it. The National Agricultural Statistics Service calibrates its quarterly crops surveys on as many as 20 auxiliary variables in a state. The agency uses the same set of auxiliary variables in its sample design to assure adequate sample sizes for each variable of interest (if the auxiliary value is zero, the corresponding survey value will likely be zero as well). Moreover, the sample design can be used to increase efficiency of the estimator.

As an example of this, recall the single-variable-of-interest-single-auxiliary example discussed above. Suppose again that the unit-by-unit ratios of the variable of interest and the auxiliary variable can be treated as independent random variables with a common mean and variance (formally, this is a model assumption). For a fixed sample size, the design under which the anticipated (model-expected) variance of the ratio estimator is minimized selects units with probability proportional to the auxiliary variable. It turns out that under that design, the original-sample-weighted sum of the auxiliary variable equals its population total, and the ratio estimator collapses into the expansion estimator.
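The collapse can be checked numerically. In the sketch below, the population, the seed and the variable names are hypothetical, and the sampled units are simply an arbitrary set of n units rather than a genuine probability-proportional-to-size draw, because the identity holds for any sample of fixed size n once the selection probabilities are n * x_i / X.

```python
import random

rng = random.Random(7)
x = [rng.uniform(1, 50) for _ in range(200)]        # hypothetical auxiliary values
y = [3.0 * xi * rng.uniform(0.9, 1.1) for xi in x]  # hypothetical variable of interest
X = sum(x)
n = 20

# Selection probabilities proportional to x; each must not exceed 1.
pi = [n * xi / X for xi in x]
assert max(pi) <= 1

# Any fixed-size set of n units will do for checking the identity.
sample = rng.sample(range(len(x)), n)

x_hat = sum(x[i] / pi[i] for i in sample)   # each term equals X / n
y_hat = sum(y[i] / pi[i] for i in sample)   # expansion estimator for y
ratio_est = y_hat * X / x_hat               # ratio estimator for y

print("expansion estimate of x:", x_hat, "known total:", X)
print("expansion estimate of y:", y_hat)
print("ratio estimate of y:    ", ratio_est)
```

Each sampled unit contributes x_i / (n * x_i / X) = X / n to the weighted sum of the auxiliary variable, so the n terms add up to X and the ratio adjustment is exactly 1. With a design whose sample size is random, the identity would hold only approximately.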

Perhaps the best answer to the question is a Zen one. Mu. Unask it. Auxiliary variables can most profitably be used in sample design and estimation simultaneously.

Kott, Phillip S. and Bailey, Jeffrey T. (2000), "The Theory and Practice of Maximal Brewer Selection with Poisson PRN Sampling," Paper presented at the International Conference on Establishment Surveys, II, Buffalo, New York. Also at http://www.nass.usda.gov/research/reports/icespap2c8.pdf.

Phil Kott
USDA/NASS/RDD
703-877-8000 x102

The Survey Statistician, no. 53, pages 13-14, January 2006


 

2. Why collect new data when data are readily available in registers and data banks?
William E. Winkler (william.e.winkler@census.gov)

Answer:
Three reasons may limit the suitability of register and other data. The first reason is that certain data in a register may not be accurate if the quality of values of individual fields is not high. Some information such as the value of a sex code or age may be missing or inaccurate because its accuracy is not needed for the day-to-day needs of the database. For instance, a tax file may not need sex code or age to be accurate. The tax file may not accurately track children in those countries where tax breaks are given for dependent children. If the main tax file needs to connect into other files with supplementary tax information, then the quality of the tax id field in each file needs to be high. Any error in the tax id field usually prevents the main tax record from being connected with the correct corresponding supplementary record. The second reason is that a set of files may not have unique, verified identifiers. If the analyst needs to use joint (x, y) data where x comes from one file and y from another file, then x- and y-data can usually only be easily and accurately linked using the unique identifiers. The third reason is that an analyst may need an extra z-variable to combine with (x, y) data where z is not in any known file. For instance, z might be the amount of tax savings by some individuals due to a specific tax break, or the result of treating a particular disease with a new drug.

If sex code is in error (or missing), then it might be corrected using the first name. If age or date-of-birth is missing or in error, then it might be corrected using an auxiliary data source. If matching is on name and address, then name would need to be accurate and address would need to be current. If the file contains 30,000 records with the name ‘John Smith,’ then an erroneous or out-of-date address would not allow linkage of records across two files. In some situations, it is virtually impossible to match a record ‘Karen Jones 1964Apr10’ with ‘Susan K. Smith 1985Jan07’ because Karen Jones now has the last name Smith, she usually uses her middle name Karen instead of Susan, and one of the dates-of-birth is completely wrong. With businesses, it may be extremely difficult to match ‘John L. Smith and Sons, Inc 1234 Main Street’ with ‘JLS Co. PO Box 657.’ Business names are often represented in a number of difficult-to-compare variations. An address associated with a business may be associated with a location, a PO Box, or the address of an accountant.
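To give a feel for why such pairs are hard to match automatically, the sketch below scores the examples above with a crude character-level similarity from Python's standard library. This is an illustrative assumption, not a record-linkage method: real matching systems use far more refined comparisons, and the helper name `similarity` is hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude character-level similarity between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("John Smith", "Jon Smith"),                    # simple typographical variation
    ("Karen Jones", "Susan K. Smith"),              # same person, changed name
    ("John L. Smith and Sons, Inc 1234 Main Street",
     "JLS Co. PO Box 657"),                         # same business, different form
]

for a, b in pairs:
    print(f"{similarity(a, b):.2f}  {a!r} vs {b!r}")
```

A one-letter typo still scores close to 1, while the two true matches from the text score much lower, so a purely string-based comparison would tend to treat them as different entities.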

False match and false nonmatch rates are needed to evaluate the quality of matching. If the false match rate is moderate or high, then the resultant merged file may yield substantial analytic errors. Figure 1 illustrates the situation. The line represents the true regression line. Figure 1a shows original (x, y) regression data without matching error. Figure 1b shows the regression data with 10% matching error and Figure 1c shows the regression data with 50% matching error. With a 10% false match rate, the regression coefficient is more than 10% in error and the R² statistic is low by 25%. With a 50% false match rate, the regression coefficient is more than 50% in error and the R² statistic is low by 75%. If the false match rate is very low and the false nonmatch rate is moderate, then the intersection A ∩ B may not be a representative subset of either file A or file B. For instance, if the records of low-income individuals in either file A or file B have disproportionately more typographical variation than those of other individuals, then the low-income individuals will not be well represented in the intersection.
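The effect of false matches on a regression can be imitated with a small simulation. The sketch below uses synthetic data and does not try to reproduce the figures quoted above; the model y = 1 + 2x + noise, the seed, the sample size and the function names are all assumptions made for illustration. A false match is imitated by giving a record the y-value of another record.

```python
import random

def slope_and_r2(x, y):
    """Least-squares slope and R-squared for a simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx, sxy ** 2 / (sxx * syy)

def with_false_matches(y, rate, rng):
    """Copy of y in which a fraction `rate` of records receive the y-value
    of another record (values are rotated among randomly chosen records)."""
    y = list(y)
    k = round(rate * len(y))
    idx = rng.sample(range(len(y)), k)
    vals = [y[i] for i in idx]
    vals = vals[1:] + vals[:1]      # shift by one so each record gets another's value
    for i, v in zip(idx, vals):
        y[i] = v
    return y

rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(2000)]
y = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]

results = {}
for rate in (0.0, 0.1, 0.5):
    results[rate] = slope_and_r2(x, with_false_matches(y, rate, rng))
    slope, r2 = results[rate]
    print(f"false match rate {rate:.0%}: slope = {slope:.2f}, R^2 = {r2:.2f}")
```

Because a falsely matched y-value is unrelated to its x-value, both the estimated slope and R² shrink as the false match rate grows.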

 

The Survey Statistician, no. 53, pages 12-13, January 2006

 


 

3. We made a survey last year in the housing sector, and we are going to repeat it next year in order to measure changes. What do we need to think of? Should we make an independent sample?
 

An independent sample is probably not a good idea. Many housing survey variables are positively correlated between the two years being compared, and you can benefit greatly from those correlations by using (a large part of) last year's sample again and basing your inference on the differences for individual households. Such a sampling procedure can reduce sampling variances drastically compared with one based on two independent samples, because the positive correlations are very favorable when the same sample is used on both occasions with an estimator based on differences. Wallis & Roberts (1965) give an example in which 25 units, each measured twice, were comparable in precision to two samples of 2,222 units, each measured once! As the authors put it: "This illustrates the potential importance of proper statistical planning before collecting data".
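The size of the gain can be worked out from the standard variance formulas for a difference of two means. The sketch below assumes simple random sampling, ignores finite population corrections, and uses hypothetical function names and illustrative values for the standard deviations and the correlation.

```python
def var_change_paired(sigma1, sigma2, rho, n):
    """Variance of the mean difference when the same n units are measured twice."""
    return (sigma1 ** 2 + sigma2 ** 2 - 2 * rho * sigma1 * sigma2) / n

def var_change_independent(sigma1, sigma2, m):
    """Variance of the difference of means from two independent samples of size m."""
    return (sigma1 ** 2 + sigma2 ** 2) / m

def equivalent_independent_size(sigma1, sigma2, rho, n):
    """Size of each of two independent samples that gives the same precision
    as n units measured on both occasions."""
    total = sigma1 ** 2 + sigma2 ** 2
    return n * total / (total - 2 * rho * sigma1 * sigma2)

# Illustrative values: equal spread on both occasions, correlation 0.9.
n, rho = 25, 0.9
m = equivalent_independent_size(1.0, 1.0, rho, n)
print("paired variance:            ", var_change_paired(1.0, 1.0, rho, n))
print("equivalent independent size:", m)
print("independent variance at m:  ", var_change_independent(1.0, 1.0, m))
```

With equal variances the equivalent size reduces to n / (1 - rho), so the stronger the year-to-year correlation, the more an independent design costs in sample size.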

Another important issue is whether the estimators really will estimate the true change. If you change the methodology of the survey between the two occasions, you will have to rule those changes out as an explanation of the difference that is obtained. So using the same mode of data collection and the same questionnaire on both occasions would probably be a wise design decision.

Wallis, W. A. & Roberts, H. V. (1965): Statistics: A New Approach, twelfth printing.

The Survey Statistician, no. 52, page 16, July 2005

 


 
