International Association of Survey Statisticians (IASS)

Ask the Experts: Dissemination

1. What are the pros and cons of letting users have access to data micro files?
2. What is meant by blurring in the context of statistical disclosure control?

 

 

1. What are the pros and cons of letting users have access to data micro files?
Kari Djerf, Statistics Finland

Answer:
A short reply to such a broad question is not easy, because one must consider at least the ethical, confidentiality, and usability aspects of the issue.

In terms of information content, it is evident that the basic micro data are the only source from which all dependencies can be investigated. Every time the data are aggregated to a higher level, say from enterprises to industries, or from individuals or households to geographical or other domain tables, the researcher loses some information. The ecological fallacy is the term for this aggregation bias: one cannot draw firm conclusions about the basic level from a model fitted at the aggregated level, whereas the opposite is normally possible, i.e. one can draw conclusions at higher levels when basic data were used.
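The aggregation bias described above can be illustrated with a minimal sketch. The data are entirely hypothetical and constructed for the purpose: five domains of five units each, built so that the two variables move in opposite directions across units while the domain means rise together. The names pearson, micro and domain_means are illustrative and not taken from any survey or library.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical micro data: (domain, x, y) for 5 domains with 5 units each.
# Within a domain, x rises exactly as y falls; the domain centres (g, g) rise together.
offsets = [-10, -5, 0, 5, 10]
micro = [(g, g + d, g - d) for g in range(5) for d in offsets]

# Correlation computed on the basic (unit-level) data.
micro_r = pearson([x for _, x, _ in micro], [y for _, _, y in micro])

# Aggregate to one row of means per domain, as in a published table.
domain_means = []
for g in range(5):
    rows = [(x, y) for dom, x, y in micro if dom == g]
    domain_means.append(
        (sum(x for x, _ in rows) / len(rows), sum(y for _, y in rows) / len(rows))
    )

# Correlation computed on the aggregated data.
agg_r = pearson([mx for mx, _ in domain_means], [my for _, my in domain_means])

print(f"unit-level correlation:   {micro_r:.3f}")
print(f"domain-level correlation: {agg_r:.3f}")
```

In this constructed case the unit-level correlation is strongly negative while the domain-level correlation is strongly positive, so a conclusion about units drawn from the aggregated table alone would have the wrong sign.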

In social surveys it is traditional to analyse the basic data. Even when there is a choice between basic data and the same data aggregated to multiway tables, the natural choice is the basic data, although nearly or exactly the same model can be applied in either case. In business surveys and in studies using, e.g., national accounts data, the situation is slightly different. Early econometric methods were developed for aggregated data, and the modifications needed to fit micro data appeared much later. This was natural, because micro data on enterprises or local units were not available. Now such micro data exist, and much of the recent econometric analysis is based on them.

Recently much effort has been devoted to merging basic data sets with each other and/or with administrative and statistical registers by record linkage or statistical matching. The resulting data sets provide users (whether statistical agencies or researchers) with much richer data for analysis. But there is a clear drawback: the probability of disclosure increases.

Ethical, legal and confidentiality issues are the major cons of access to basic data, and they are linked together. The ISI declaration of professional ethics says: “Statisticians are frequently furnished with information by the funder or employer who may legitimately require it to be kept confidential. Statistical methods and procedures that have been utilised to produce published data should not, however, be kept confidential.” That declaration is due to be updated, and the coming version will probably contain much clearer ethical guidelines on confidentiality, as many national and international rules already do.

Many statistical offices can provide researchers with basic data, but they set conditions on their use. For example, the data may be used only for research purposes (possibly limited by subject and time), researchers are not allowed to try to identify the respondents by any means, and the data provider may require that results and publications be submitted for review. The data providers also often check the basic data sets before release to ensure that the confidentiality rules are fulfilled. Data may also be perturbed or synthesized in order to avoid confidentiality violations. But even with those methods there are data sets for which disclosure cannot be totally avoided. This applies especially to very skewed distributions, which are typical of business data, and to rare events in general.

There is currently a lot of research on confidentiality control methods, which will hopefully give data providers new tools. Data access via the internet and other electronic networks is developing so rapidly that new and strong methods are really needed.

The Survey Statistician, no. 50, pages 18-19, July 2004


 

2. What is meant by blurring in the context of statistical disclosure control?

Eric Schulte Nordholt, The Netherlands

Answer:
In the context of statistical disclosure control, blurring does not carry the negative meaning it has in everyday usage. It means that a reported value is replaced by an average. There are many possible ways to implement blurring. Groups of records for averaging may be formed by matching on other variables or by sorting on the variable of interest. The number of records in a group (whose data will be averaged) may be fixed or random. The average associated with a particular group may be assigned to all members of the group, or to the "middle" member (as in a moving average). Blurring may be performed on more than one variable, with different groupings for each variable. More information on statistical disclosure control can be found in the glossary published under the GLOSSARY button on the home page of the CASC project on the internet: http://neon.vb.cbs.nl/casc.
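One of the variants described above (groups formed by sorting on the variable of interest, a fixed group size, and the group average assigned to all members) can be sketched as follows. This is a minimal illustration, not a production disclosure-control routine. The function name blur, the turnover figures, and the decision to merge a short final group into the preceding one are assumptions made for the example.

```python
def blur(values, group_size):
    """Replace each value by the average of its group.

    Groups are formed by sorting on the variable itself and taking
    consecutive runs of `group_size` records. A final group that is too
    small is merged into the preceding one (an assumption of this sketch),
    so that no record is averaged over fewer than `group_size` values
    whenever at least `group_size` records exist.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")

    # Record positions ordered by the value of the variable of interest.
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[s:s + group_size] for s in range(0, len(order), group_size)]
    if len(groups) > 1 and len(groups[-1]) < group_size:
        groups[-2].extend(groups.pop())

    blurred = [0.0] * len(values)
    for group in groups:
        average = sum(values[i] for i in group) / len(group)
        for i in group:
            blurred[i] = average  # every member receives the group average
    return blurred


# Hypothetical turnover figures with one dominant enterprise.
turnover = [120, 15, 18, 950, 22, 17, 130]
blurred = blur(turnover, group_size=3)
print(blurred)
```

Records keep their original positions, so the blurred variable can be stored alongside the other variables of each record. Because every value is replaced by its group mean, the total of the variable is preserved, while the dominant value of 950 is no longer visible for the individual record.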

The Survey Statistician, no. 54, page 16, July 2006


 
