1. How can imputation be trusted since it creates artificial values?
Jean-François Beaumont and Eric Rancourt, Statistics Canada
Answer:
In one way or another, surveys have to deal with the problem of missing values.
Missing values arise for various reasons, such as a refusal to provide the
desired information for at least one question or the inability to contact a
given unit. Missing values can also be created at
the editing stage of the survey in an attempt to resolve problems of
inconsistent or suspect responses. To deal with missing values, many estimation
techniques such as maximum likelihood estimation, nonresponse weight adjustment
and imputation can be used. Choosing to use imputation is often based on
practical considerations. For instance, imputation is convenient for ultimate
users since it creates a complete rectangular file, which can be used to obtain
estimates of population parameters of interest as if there were no missing
values. This property is particularly useful when dealing with item nonresponse,
where missing values occur for some but not all variables in the survey. Also,
imputation ensures some consistency between estimates produced by different
users.
Although imputation is usually a very convenient
method of compensating for missing values, it is well known that imputed values
cannot be treated as true values when making inferences about unknown
population parameters. In fact, the real goal of imputation is to help support
estimation in order to make appropriate inferences rather than simply predict
values of micro data. However, to achieve this goal, imputation does consist
in predicting each individual missing value, but of course, this does not
necessarily mean that the imputed value for a given unit is a high-quality
estimate for the true unknown value. Imputation methods must be developed in
such a way as to lead to reasonably high-quality estimates, at least at certain
aggregate levels.
In order to make inferences in the presence of
missing values, assumptions about the unknown mechanism that generates missing
values, i.e. the nonresponse mechanism, are needed. These assumptions are
called the nonresponse model. This is to be contrasted with sampling theory, where
the mechanism that generates samples is completely known to the statistician.
Often, the nonresponse model only requires that the nonresponse mechanism be
independent of the variables of interest after conditioning on some auxiliary
variables observed for all sample units. In such a case, a model for the
variables of interest, i.e. an imputation model, is needed. The imputation
model is usually the key to obtaining efficient predictions, or efficient
imputations, for the missing values. In particular, the use of auxiliary
variables well correlated with the variables of interest is important to reduce
the error in the estimates due to missing values. Therefore, to the extent
possible, it is crucial to validate all model assumptions underlying the
imputation strategy in order to make valid inferences in the presence of
missing values. If a careful modeling effort is performed, then imputation can
be trusted as a method for treating missing values.
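As a rough illustration of how a nonresponse model and an imputation model
work together, the following Python sketch simulates a sample in which the
response probability depends only on an auxiliary variable x, and then imputes
the missing values of y from a linear regression on x. Everything in it is an
assumption made for illustration and is not taken from the article: the
function names, the linear model, the response probabilities and the sample
size are all hypothetical. Regression imputation is only one of many possible
imputation methods.

```python
import random
import statistics


def simulate_sample(n=5000, seed=1):
    """Hypothetical sample: x is observed for every unit, y may be missing."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # Assumed imputation model: y is linear in x plus random noise.
    y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
    # Assumed nonresponse model: the response probability depends on x only,
    # so nonresponse is independent of y after conditioning on x.
    responded = [rng.random() < 0.9 - 0.06 * xi for xi in x]
    return x, y, responded


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def regression_impute(x, y, responded):
    """Keep observed y values and predict the missing ones from x."""
    x_resp = [xi for xi, r in zip(x, responded) if r]
    y_resp = [yi for yi, r in zip(y, responded) if r]
    intercept, slope = fit_line(x_resp, y_resp)
    return [yi if r else intercept + slope * xi
            for xi, yi, r in zip(x, y, responded)]


x, y, responded = simulate_sample()
completed = regression_impute(x, y, responded)

# The full-sample mean is known here only because the data are simulated.
full_mean = statistics.fmean(y)
respondent_mean = statistics.fmean(yi for yi, r in zip(y, responded) if r)
imputed_mean = statistics.fmean(completed)

print(f"full-sample mean:      {full_mean:.2f}")
print(f"respondents-only mean: {respondent_mean:.2f}")
print(f"mean after imputation: {imputed_mean:.2f}")
```

In this simulated setting, units with large x respond less often, so the
respondents-only mean is pulled away from the full-sample mean, while the mean
of the completed file stays close to it. The individual imputed values are
still predictions rather than true values, and the result depends on both
assumed models being adequate.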
Finally, it is important to note that missing
values lead to estimates that are more variable than those that would be
obtained if the entire sample could be observed. As a result, variance
estimates derived under the assumption of full response are not valid in the
presence of missing values. Therefore, the imputation strategy and/or the
variance estimation approach must take imputation into account in order to make
valid inferences.
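The effect on variance estimation can be sketched with a small Monte Carlo
experiment in Python. The setting is deliberately simple and entirely
hypothetical: independent normal observations, uniform nonresponse at a 60%
response rate, and mean imputation. The sketch compares the actual variability
of the imputed estimator across replicates with the average of the naive
variance estimate that treats imputed values as if they had been observed.

```python
import random
import statistics


def one_replicate(rng, n=200, response_rate=0.6):
    """Draw a sample, delete values at random, mean-impute, and estimate."""
    y = [rng.gauss(50.0, 10.0) for _ in range(n)]
    observed = [yi for yi in y if rng.random() < response_rate]
    if len(observed) < 2:
        return None  # practically never happens with these settings
    respondent_mean = statistics.fmean(observed)
    # Mean imputation: every missing value is replaced by the respondent mean.
    completed = observed + [respondent_mean] * (n - len(observed))
    estimate = statistics.fmean(completed)
    # Naive variance estimate: the usual full-response formula applied to the
    # completed file, as if the imputed values were true values.
    naive_variance = statistics.variance(completed) / n
    return estimate, naive_variance


def compare_variances(replicates=1000, seed=7):
    """Return (Monte Carlo variance of the estimator, mean naive estimate)."""
    rng = random.Random(seed)
    results = [one_replicate(rng) for _ in range(replicates)]
    results = [r for r in results if r is not None]
    estimates = [e for e, _ in results]
    naive = [v for _, v in results]
    return statistics.variance(estimates), statistics.fmean(naive)


actual_variance, average_naive_variance = compare_variances()
print(f"variance of the imputed estimator: {actual_variance:.3f}")
print(f"average naive variance estimate:   {average_naive_variance:.3f}")
```

Under these assumptions the naive estimate falls well below the actual
variance, because the imputed values add no new information and also shrink
the apparent spread of the completed file. The sketch does not show how to
correct the variance estimate; it only illustrates why a correction is needed.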
The Survey Statistician, no. 51, pages 18-19, January 2005
