1. What are the pros and cons of letting users have access to
micro data files?
Kari Djerf, Statistics Finland
Answer:
It is not very easy to give a
short reply to such a broad question because one should consider at least
ethical, confidentiality, and usability aspects of the issue.
In terms of information content, it is quite
evident that the basic micro data are the only source for investigating all
dependencies. Every time the data are aggregated to some higher level, say from
enterprises to industries, or from individuals/households to geographical or
other domain tables, the researcher loses some information. The ecological
fallacy is the term that describes this aggregation bias: one cannot draw firm
conclusions at the basic level when the model was fitted at the aggregated
level, whereas the opposite is normally admissible, i.e. one can draw
conclusions at higher levels when basic data were used.
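The aggregation bias can be illustrated with a minimal sketch. The figures below are invented for illustration only (they do not come from any survey), and the helper `ols_slope` is a hypothetical name. Within each of three domains the relationship between x and y is negative, yet a model fitted to the domain means alone suggests a positive relationship.

```python
from statistics import mean

def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical micro data: (x, y) pairs for the units in three domains.
# Within every domain y falls as x rises, but the domain means rise together.
domains = {
    "A": [(1, 4), (2, 3), (3, 2)],
    "B": [(4, 7), (5, 6), (6, 5)],
    "C": [(7, 10), (8, 9), (9, 8)],
}

# Slope fitted on the basic (micro) data, separately within each domain.
within_slopes = {
    name: ols_slope([x for x, _ in rows], [y for _, y in rows])
    for name, rows in domains.items()
}

# Slope fitted on the aggregated data, i.e. one point per domain.
mean_x = [mean(x for x, _ in rows) for rows in domains.values()]
mean_y = [mean(y for _, y in rows) for rows in domains.values()]
aggregate_slope = ols_slope(mean_x, mean_y)

print(within_slopes)
print(aggregate_slope)
```

In this constructed example the sign of the slope flips between the two levels, so a conclusion about individual units drawn from the aggregated fit would be wrong.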
In the context of social surveys it is a
tradition to analyse the basic data. Even when there is a choice between
basic data and the same data aggregated into multiway tables, the natural choice is
in favour of basic data, despite the fact that one can apply nearly or exactly the
same model in the analysis. However, in business surveys and in studies using, e.g.,
national accounts data, the situation is slightly different. Early econometric
methods were developed for aggregated data, and the modifications needed to fit
micro data appeared much later. This was natural because enterprise or local unit
micro data were not available. Now such micro data exist, and much of the
recent econometric analysis is based on them.
Recently a lot of effort has been devoted to
merging basic data sets with each other and/or with administrative and statistical
registers by record linkage or statistical matching. The resulting data
sets provide users (whether statistical agencies or researchers) with much
richer data for analysis. But there is a clear drawback: the probability of
disclosure increases.
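A rough sketch of why linkage raises the probability of disclosure follows. It uses the share of records that are unique on a set of identifying variables as a crude proxy for risk; this measure, the variable names and all records are assumptions made for illustration, not a method described in the text above.

```python
from collections import Counter

def unique_share(records, keys):
    """Share of records whose combination of `keys` values occurs only once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    uniques = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return uniques / len(records)

# Hypothetical survey file with two identifying variables.
survey = [
    {"id": 1, "region": "N", "age_band": "30-39"},
    {"id": 2, "region": "N", "age_band": "30-39"},
    {"id": 3, "region": "S", "age_band": "40-49"},
    {"id": 4, "region": "S", "age_band": "40-49"},
]

# Hypothetical register that adds one more variable per unit.
register = {1: "farming", 2: "retail", 3: "retail", 4: "retail"}

# Exact record linkage on the common identifier.
linked = [{**r, "industry": register[r["id"]]} for r in survey]

before = unique_share(survey, ["region", "age_band"])
after = unique_share(linked, ["region", "age_band", "industry"])
print(before, after)
```

With these invented records no unit is unique before linkage, while half of them become unique once the register variable is attached. Richer data therefore tend to mean more identifiable units.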
Ethical, legal and confidentiality issues are
the major cons of access to basic data, and they are linked together. The ISI
declaration of professional ethics says: “Statisticians are frequently furnished
with information by the funder or employer who may legitimately require it to
be kept confidential. Statistical methods and procedures that have been
utilised to produce published data should not, however, be kept confidential.”
That declaration is to be updated, and the coming version will probably contain
much clearer ethical guidelines on confidentiality, as many national
and international rules already do.
Many statistical offices can provide researchers
with basic data, but they set conditions on its use. For example, the
data may be used only for research purposes (possibly limited by subject and
time), researchers are not allowed to try to identify the informants by any
means, and the results and publications may have to be reviewed by the
data provider. The basic data sets are also often checked by the
data providers to ensure that the confidentiality rules are fulfilled. Data may also
be perturbed or synthesized in order to avoid confidentiality violations.
But even with those methods there are data sets for which disclosure cannot
be totally avoided. This applies especially to very skewed distributions,
typical of business data, and to rare events in general.
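The difficulty with skewed distributions can be sketched as follows. The turnover figures and the noise factors are invented, and multiplicative noise is only one assumed form of perturbation among many. The point is that a dominant enterprise remains recognisable even after its value has been perturbed.

```python
# Hypothetical turnover figures for five enterprises in one industry;
# the last one dominates, as is typical of skewed business data.
turnover = [12.0, 15.0, 9.0, 11.0, 900.0]

# Assumed multiplicative noise factors within +/-10 % (fixed here so the
# example is reproducible; a real scheme would draw them at random).
noise = [1.08, 0.93, 1.05, 0.97, 0.91]

perturbed = [t * f for t, f in zip(turnover, noise)]

def dominant(values):
    """Return (index, share of total) of the largest value."""
    i = max(range(len(values)), key=values.__getitem__)
    return i, values[i] / sum(values)

idx_before, share_before = dominant(turnover)
idx_after, share_after = dominant(perturbed)
print(idx_before, round(share_before, 3))
print(idx_after, round(share_after, 3))
```

In this sketch the same enterprise is the largest before and after perturbation and still accounts for well over 90 per cent of the total, so anyone who knows the industry could single it out and approximate its value.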
Currently there is a lot of research on
confidentiality control methods, which will hopefully give new tools to the data
providers. New methods of data access via the internet and other electronic
networks are developing so rapidly that new and strong protection methods are really needed.
The Survey Statistician, no. 50, pages 18-19, July 2004