International Association of Survey Statisticians (IASS)

Ask the Experts: Processing

1. How can imputation be trusted since it creates artificial values?
Jean-François Beaumont and Eric Rancourt, Statistics Canada

Answer: 
In one way or another, surveys have to deal with the problem of missing values. Missing values arise for different reasons, such as a refusal to provide the desired information for at least one question or the inability to contact a given unit. Missing values can also be created at the editing stage of the survey in an attempt to resolve problems of inconsistent or suspect responses. To deal with missing values, many estimation techniques can be used, such as maximum likelihood estimation, nonresponse weight adjustment and imputation. Choosing to use imputation is often based on practical considerations. For instance, imputation is convenient for end users since it creates a complete rectangular file, which can be used to obtain estimates of population parameters of interest as if there were no missing values. This property is particularly useful when dealing with item nonresponse, where missing values occur for some but not all variables in the survey. Also, imputation ensures some consistency between estimates produced by different users.

Although imputation is usually a very convenient method of compensating for missing values, it is well known that imputed values cannot be treated as true values when making inferences about unknown population parameters. In fact, the real goal of imputation is to support estimation so that appropriate inferences can be made, rather than simply to predict values of microdata. To achieve this goal, imputation does consist of predicting each individual missing value, but this does not necessarily mean that the imputed value for a given unit is a high-quality estimate of the true unknown value. Imputation methods must be developed in such a way as to lead to reasonably high-quality estimates, at least at certain aggregate levels.

In order to make inferences in the presence of missing values, assumptions about the unknown mechanism that generates missing values, i.e. the nonresponse mechanism, are needed. These assumptions are called the nonresponse model. This is to be contrasted with sampling theory, where the mechanism that generates samples is completely known to the statistician. Often, the nonresponse model only requires that the nonresponse mechanism be independent of the variables of interest after conditioning on some auxiliary variables observed for all sample units. In such a case, a model for the variables of interest, i.e. an imputation model, is needed. The imputation model is usually the key to obtaining efficient predictions, or efficient imputations, for the missing values. In particular, the use of auxiliary variables well correlated with the variables of interest is important to reduce the error in the estimates due to missing values. Therefore, to the extent possible, it is crucial to validate all model assumptions underlying the imputation strategy in order to make valid inferences in the presence of missing values. If a careful modeling effort is performed, then imputation can be trusted as a method of treatment of missing values.
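The role of the auxiliary variables can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not a method prescribed by the authors: the linear relationship between y and x, the response probabilities, the sample size and all function names are hypothetical. Nonresponse depends on the auxiliary variable x alone, so the nonresponse model described above holds, and imputing from a regression on x is compared with imputing the respondent mean.

```python
import random
import statistics


def simulate_sample(n=2000, seed=1):
    """Return (x, y, responded) triples; x is observed for every sample unit."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = rng.uniform(1.0, 10.0)         # auxiliary variable (assumed)
        y = 2.0 * x + rng.gauss(0.0, 1.0)  # variable of interest (assumed model)
        p_respond = 0.9 - 0.06 * x         # response depends on x only
        sample.append((x, y, rng.random() < p_respond))
    return sample


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_mean(sample):
    """Replace each missing y by the respondent mean (ignores x)."""
    resp_mean = statistics.fmean([y for _, y, r in sample if r])
    return [y if r else resp_mean for _, y, r in sample]


def impute_regression(sample):
    """Replace each missing y by its prediction from a line fitted on respondents."""
    xs = [x for x, _, r in sample if r]
    ys = [y for _, y, r in sample if r]
    intercept, slope = fit_line(xs, ys)
    return [y if r else intercept + slope * x for x, y, r in sample]


def compare_estimates(sample):
    """Estimated mean of y from the full sample and from each completed file."""
    return {
        "full_sample": statistics.fmean([y for _, y, _ in sample]),
        "mean_imputed": statistics.fmean(impute_mean(sample)),
        "regression_imputed": statistics.fmean(impute_regression(sample)),
    }


if __name__ == "__main__":
    for name, value in compare_estimates(simulate_sample()).items():
        print(f"{name}: {value:.3f}")
```

In this setup units with large x respond less often, so the mean-imputed file should underestimate the mean of y, while the regression-imputed file should stay close to the full-sample estimate. The comparison is only as good as the assumed imputation model, which is why validating that model matters.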

Finally, it is important to note that missing values lead to estimates that are more variable than those that would be obtained if the entire sample could be observed. As a result, variance estimates derived under the assumption of full response are not valid in the presence of missing values. Therefore, the imputation strategy and/or the variance estimation approach must take imputation into account in order to make valid inferences.
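A short Monte Carlo sketch can make the variance point concrete. It is an illustration under simplifying assumptions that are not in the original text: values are drawn independently from a normal distribution, nonresponse is uniform with a 60% response rate, missing values are replaced by the respondent mean, and no finite population correction is applied. The function names are hypothetical. The code compares the naive variance estimate, computed from the completed file as if all values had been observed, with the actual variability of the imputed estimator across replicates.

```python
import random
import statistics


def one_replicate(rng, n=200, response_rate=0.6):
    """Draw one sample, impute the respondent mean, return (estimate, naive variance)."""
    y = [rng.gauss(50.0, 10.0) for _ in range(n)]
    responded = [rng.random() < response_rate for _ in range(n)]
    resp_mean = statistics.fmean([v for v, r in zip(y, responded) if r])
    completed = [v if r else resp_mean for v, r in zip(y, responded)]
    estimate = statistics.fmean(completed)
    # Naive variance estimate: treats imputed values as if they were observed.
    naive_variance = statistics.variance(completed) / n
    return estimate, naive_variance


def compare_variances(reps=500, seed=7):
    """Return (Monte Carlo variance of the estimator, average naive variance estimate)."""
    rng = random.Random(seed)
    results = [one_replicate(rng) for _ in range(reps)]
    estimates = [est for est, _ in results]
    naive = [var for _, var in results]
    return statistics.variance(estimates), statistics.fmean(naive)


if __name__ == "__main__":
    actual, naive = compare_variances()
    print(f"Monte Carlo variance of imputed mean: {actual:.4f}")
    print(f"Average naive variance estimate:      {naive:.4f}")
```

Under these assumptions the naive estimate should come out well below the Monte Carlo variance, because mean imputation shrinks the spread of the completed file and the effective number of observations is the number of respondents, not the sample size.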

The Survey Statistician, no. 51, pages 18-19, January 2005

 


 
