
Proceedings of the ninth seminar

Innovations in provision and production of statistics:

the importance of new technologies

Helsinki, Finland, 20–21 January 2000


CONTENTS

1st day: Theme 1: New Technologies

Data Warehouse NSI – G. Zettl

The Application of Data Warehousing Techniques in a Statistical Environment – M. Vucsan

SISSIEI – The Statistical Information System on Enterprises and Institutions – E. Giovannini

New technologies in statistics and requirements for central institutions from a user's perspective – Ch. Androvitsaneas

2nd day: Theme 2: State of the Art

State of the Art: Problems/Solutions/Technologies – D. Burget

Data Collection, Respondent Burden, Business Registers – J. Grandjean

Experiences with WWW based data dissemination – the StatFin on-line service – S. Björkqvist

Data Access Versus Privacy: an analytical user's perspective – U. Trivellato

The obligation to provide information and the use of statistics in enterprises – R. Suominen

Statistics: XML-EDI for Data Collection, Exchange and Dissemination – W. Knüppel

Theme 3: Available Software

IT policy for the ESS – Reply from Eurostat – D. Defays

Summing up by the chairman of the subcommittee – P. Geary

List of participants


1st day: THEME 1: NEW TECHNOLOGIES


DATA WAREHOUSE NSI

Günther Zettl*

Statistics Austria
Hintere Zollamtsstr. 2b
A-1033 Wien
[email protected]

Introduction

Statistical offices (National Statistical Institutes or NSIs) all over the world have faced many new demands, expectations and problems for some years now:

- The tasks they have to perform are increasing in complexity and scope. At the same time, manpower resources and funding are being frozen or cut back.

- "Data suppliers" would like to provide their information more simply and cheaply than hitherto.

- "Data clients" have a completely new means of seeking, collecting and using information: the main factor is no longer provision by others (e.g. by the staff of a statistical office's information service); instead, they are now used to collecting information (inter)actively, online and on demand, using appropriate search functions, and to processing it further on their own PCs. This constantly growing group of customers expects information providers to adapt to their way of handling information.

- The politicians', administrators' and the economy's needs for up-to-date, high-quality and internationally comparable statistics to support decision-making are continually increasing.

- Technical improvements in information technology (ever shorter innovation cycles) lead to major uncertainty with regard to long-term investments: a product or technology opted for today may already be obsolete tomorrow.

Managing to deal with these needs and problems and being capable of reacting flexibly to future, completely unforeseeable developments, are herculean tasks. Bo Sundgren of Statistics Sweden wrote the following on this subject:

"It is a challenge for a modern statistical office to be responsive to expectations, demands and requirements from an ever more dynamic environment. Society itself, which is to be reflected by statistical data, is changing at an ever faster rate. This leads to needs for more variability, more flexibility, on the input side as well as on the output side of statistical information systems managed by statistical offices. In order to manage requirements for greater variability in the exchange of data with the external world, and in order to do this with the same or even less financial resources, a statistical office must consider system level actions. It is not enough just to do ‚more of the same thing‘ or to ‚run faster‘. It is necessary to undertake more drastic redesign actions." [SUNDGREN 1996]

One-off activities are inadequate as system level actions – what is needed instead is a package of interrelated organisational, statistical and technical measures. Since data are at the heart of the statistical production process and the computer is now the statistician's main tool, working out an overall strategic concept for the use of the computer (with special emphasis on (meta)data management; cf. [FROESCHL 1999a]) is an important matter.

* Günther Zettl (born in 1961) studied business economics, specialising in informatics, at the Vienna University of Economics. In 1988, he started to work in the data processing division of the Central Statistical Office (now called Statistics Austria). Since 1995, he has headed the "Central Data Management" unit, in which statistical meta information systems and modern IT concepts and technologies (COM/DCOM, XML, data warehouse) form important areas of activity. He is also a member of several ÖSTAT internal project teams (specialist team 5 of the "Diebold re-organisation project", the "SDSE – system for carrying out statistical surveys" project) and the Eurostat working group STNE ("Statistics, Telematic Networks and EDI").

There are various angles from which statistical production can be viewed. One of the simplest models defines a statistical office as a data processing system with two interfaces (fig. 1):

Fig. 1: The NSI as a data processing system with an INPUT and an OUTPUT interface.

1. At the input end, raw data are entered into the "NSI system" by the "data suppliers" (respondents, existing data registers).

2. At the output end, statistical results (object data and meta data at different levels of breakdown and in different forms of presentation) are passed on to "data users".

Many statistical offices are working on data processing projects aimed at modernising the "NSI system" including its interfaces and adapting it to new requirements:

At the input end, the main focus is on reducing the burden for respondents (especially enterprises). One of the best known projects of this type is TELER which is being run by Statistics Netherlands. Statistics Austria (ÖSTAT), in close cooperation with an external software development firm, has started the SDSE project ("System zur Durchführung statistischer Erhebungen" – system for carrying out statistical surveys), the central element of which is an "Electronic Questionnaire Management System" ("Elektronisches Fragebogen Management System" EFBMS).

Inside NSIs, the systematic collection, administration and use of meta data is a basic challenge. A number of statistical offices have already started to build up integrated statistical meta information systems (METIS).

At the output end, printed publications are regarded as no longer adequate by "statistics clients". Here, efforts are concentrated on providing statistical results in electronic form. This includes projects involving the use of the internet for disseminating data, as well as the accelerated and standardised transfer of data to Eurostat using the STADIUM/STATEL software and the GESMES format. In Austria, the new Federal Statistics Law 2000 explicitly requires statistical results to be retrievable free of charge via the internet.

In discussions about the technical infrastructure of statistical offices and in the context of specific data processing projects (mainly in the output sector, but also within NSIs), there has been increasing mention of the concept of the "data warehouse" recently. However, this term is sometimes used with a meaning which goes well beyond that of the original concept and can therefore lead to misunderstandings.

Therefore what is meant by a data warehouse (and related terms such as "data mart" and "OLAP") is described below. An attempt is also made to relate this concept to the statistical production process and to provide details of how data warehouse concepts and technologies can contribute towards meeting the challenges set for NSIs.

What is a data warehouse?

In the last few years, the term "data warehouse" has become fashionable in the computer industry:

- Hardware manufacturers love it because they can supply their customers with powerful computer systems for running a data warehouse.

- Software developers love it because they can sell expensive tools and applications (often costing millions of Austrian schillings) and do not have to compete with Microsoft (a situation which, incidentally, has in the meantime changed in some sectors – data storage and OLAP – following the introduction of MS SQL Server 7.0 and the accompanying OLAP services).

- Consulting companies love it because their services are used by many companies which want to build a data warehouse.

- And authors of technical books love it because it is a wonderful subject for writing books and articles about. "The Data Warehousing Information Center" (http://pwp.starnetinc.com/larryg) currently (in December 1999) lists over 130 books, 70 white papers and 100 articles accessible on the internet – most probably only a fraction of the range actually on offer.

Of course there are also a number of handy definitions. Here is a small sample:

- According to W.H. Inmon (often called the "Father of Data Warehousing"), a data warehouse is "a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision making process" [INMON 1995].

- Ralph Kimball, alongside Inmon probably the most famous "guru" in the data warehouse field, defines a data warehouse as "a copy of transaction data specifically structured for query and analysis" [KIMBALL 1996].

- For Sean Kelly, a data warehouse is "an enterprise architecture for pan-corporate data exploitation comprising standards, policies and an infrastructure which provide the basis for all decision-support applications" [KELLY 1997].

- Sam Anahory and Dennis Murray write: "A data warehouse is the data (meta/fact/dimension/aggregation) and the process managers (load/warehouse/query) that make information available, enabling people to make informed decisions" [ANAHORY/MURRAY 1997].

- Barry Devlin describes a data warehouse as "a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use in a business context" [DEVLIN 1997].

- According to Stanford University, a data warehouse is "a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated.... This makes it much easier and more efficient to run queries over data that originally came from different sources" [STANFORD].

Taken individually, none of these short definitions suffices to explain what is meant by the term "data warehouse", but taken together they contain the basic characteristics, which are described in slightly more detail below. However, it should be pointed out straightaway that, even if theoreticians agree on many features, there can still be differences in how the concept is understood in detail. For example, a data warehouse as understood by Inmon does not correspond in all aspects to a Kimball warehouse.

Originally, in the commercial environment, computers were primarily used for supporting and automating business processes such as order-processing, invoicing, book-keeping, stock management etc. The aim was for these functions to run faster and more cheaply and for the company to be able to react more quickly to customers' demands. The main purpose, of course, was to derive advantages over competitors.

Computer systems in these areas of application are called OLTP (Online Transaction Processing) systems in the data warehouse literature. They are optimised for fast response times for simple, pre-defined transactions which often consist of changes, additions or deletions of individual data records. By using normalised data modelling (preferably the third normal form, unless compromises have to be accepted for performance reasons), the aim is to ensure that the modification of a fact only has to be carried out on a single table row.
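To illustrate (a minimal sketch; the tables and names are invented for this example): in a normalised order-processing model, a product's name is stored exactly once, so correcting it touches a single row and no transaction data need to change.

    -- Hypothetical 3NF OLTP tables: each fact is stored exactly once.
    CREATE TABLE product (
        product_id   INTEGER PRIMARY KEY,
        product_name VARCHAR(100) NOT NULL
    );

    CREATE TABLE order_line (
        order_id   INTEGER,
        product_id INTEGER REFERENCES product (product_id),
        quantity   INTEGER,
        PRIMARY KEY (order_id, product_id)
    );

    -- Renaming a product is a single-row update; no order_line row changes.
    UPDATE product SET product_name = 'New name' WHERE product_id = 42;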

However, OLTP programs are not very suitable for providing information for analysis. Normally they allow certain pre-defined reports to be issued, but when further data are required, individual programming by the IT division is necessary – if the data are still available at all (in a stock management system, for example, the current stock level can be determined, but the stock level of several months or a year ago is no longer known).

Therefore, in view of their functionality and design, OLTP systems can hardly be used for analysis. To make up for this drawback, in the 1980s, it was proposed to extract data from them at regular intervals, provide them with a time stamp and store them in a system of their own: the data warehouse.

Since data mostly stem from several individually independent upstream systems, they may show a number of inconsistencies: e.g. different product numbers and descriptions in the programs for order-processing and stock management, non-uniform attributes for the same customers, when a firm is active in different business areas and uses more than one order-processing program, etc. Before the data are loaded in the data warehouse, therefore, they must undergo comprehensive integration, as well as structural and format standardisation (which sometimes represents up to 80 % of the total cost of establishing a data warehouse).

Unlike the more functional organisation of the OLTP systems, the placing of data in the data warehouse is oriented towards the main subjects for analysis (customers, products, supply companies etc.). Inmon calls this "subject-orientation".

The users of the warehouse should be able to find precisely the data they need for their work and carry out queries and analyses without the assistance of data processing experts. This requirement calls for a special kind of data modelling, called "dimensional". To illustrate this data model, a cube (fig. 2) is often used, whose edges carry the dimensions with their individual members (in the case of a warehouse for a chain of supermarkets, for example, product, outlet and time). Inside the cube, at the intersection of the different dimension members, there are numerical facts, e.g. the turnover achieved on a particular day in a particular outlet for a particular product.

Fig. 2: A data cube with the dimensions product, outlet and time.

The members of a dimension can be hierarchically broken down at several levels (e.g. product → product group; outlet → city → district → federal state; day → month → quarter → year). They can also have attributes which might be of interest for analyses (the colour of a product, the selling space of an outlet etc.).

Queries and analyses of this type of "cube" (which of course may also have more than three dimensions) are called Online Analytical Processing or OLAP. OLAP client programs specialise in presenting the user with any required section through the cube in different arrangements of the dimensions (slice and dice). It is also possible to switch over from one hierarchy level to the elements below it and to navigate in the opposite direction (drill down, drill up).
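As a rough sketch of what these operations look like in practice (anticipating the relational storage described below; all table and column names are hypothetical), a slice and a drill-up can be expressed as ordinary SQL queries over a cube stored as a table:

    -- The cube as a table: one row per (date, outlet, product) cell.
    -- Slice: fix one member of the time dimension.
    SELECT outlet_id, product_id, turnover
    FROM   sales_cube
    WHERE  sales_date = DATE '2000-01-20';

    -- Drill-up: aggregate from day to month along the time hierarchy;
    -- drill-down is the reverse, re-introducing the finer level.
    SELECT outlet_id, product_id,
           EXTRACT(YEAR FROM sales_date)  AS sales_year,
           EXTRACT(MONTH FROM sales_date) AS sales_month,
           SUM(turnover)                  AS turnover
    FROM   sales_cube
    GROUP  BY outlet_id, product_id,
              EXTRACT(YEAR FROM sales_date),
              EXTRACT(MONTH FROM sales_date);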

OLAP cubes can be stored in a proprietary format in a multidimensional database: an OLAP server (MOLAP = multidimensional OLAP). Frequently the data are also located in a relational database (ROLAP = relational OLAP), in which case the so-called star scheme often comes into use. In a star scheme, each dimension with all its attributes and hierarchy levels is stored in a dimension table. The numerical values from inside the cube are stored in the central fact table together with the foreign keys of the dimension tables (fig. 3).

An important characteristic of the star scheme is denormalisation. For performance reasons (avoiding table joins), the attributes of objects which, in a normalised data model, would be stored in separate tables and referenced by primary/foreign key relationships are entered directly in the dimension tables (in fig. 3, for example, the names of the city, district and federal state in the outlet dimension). This redundant data storage makes update operations difficult, which is why a data warehouse is normally read-only – or, in Inmon's terminology, "non-volatile" – for online users.


Fig. 3: Star scheme with a central fact table and three dimension tables:

Fact Table: Time ID, Product ID, Outlet ID, Turnover
Product Dimension: Product ID, Product name, Product group, Colour, ...
Outlet Dimension: Outlet ID, Street, City, District, Federal state, Selling space, ...
Time Dimension: Time ID, Date, Day, Month, Quarter, Year, Weekday, Day-before-holiday flag, ...
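In SQL, the star scheme of fig. 3 could be declared roughly as follows (a minimal sketch; the column types, and the rendering of the hierarchy levels as plain columns, are assumptions made for illustration):

    CREATE TABLE time_dim (
        time_id            INTEGER PRIMARY KEY,
        calendar_date      DATE,
        day_of_month       INTEGER,
        month              INTEGER,
        quarter            INTEGER,
        year               INTEGER,
        weekday            VARCHAR(10),
        day_before_holiday CHAR(1)        -- 'Y'/'N' flag
    );

    CREATE TABLE product_dim (
        product_id    INTEGER PRIMARY KEY,
        product_name  VARCHAR(100),
        product_group VARCHAR(50),        -- hierarchy level, denormalised
        colour        VARCHAR(30)
    );

    CREATE TABLE outlet_dim (
        outlet_id     INTEGER PRIMARY KEY,
        street        VARCHAR(100),
        city          VARCHAR(50),        -- denormalised: repeated per outlet
        district      VARCHAR(50),
        federal_state VARCHAR(50),
        selling_space INTEGER
    );

    CREATE TABLE sales_fact (
        time_id    INTEGER REFERENCES time_dim (time_id),
        product_id INTEGER REFERENCES product_dim (product_id),
        outlet_id  INTEGER REFERENCES outlet_dim (outlet_id),
        turnover   DECIMAL(12,2),
        PRIMARY KEY (time_id, product_id, outlet_id)
    );

The composite primary key of the fact table reflects the fact that one row corresponds to exactly one cell of the cube.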

A data warehouse can contain massive quantities of data – on the one hand because of the implicit redundancy of the star scheme, and on the other hand because of the long periods for which data are stored. The level of detail (granularity) should also be as fine as possible, since otherwise potential analyses are lost. In fig. 3, if we work on the basis of daily extraction and assume that in 500 outlets about half of 2 000 products are sold at least once each day, the fact table will grow to roughly half a billion records within 3 years (500 × 1 000 × 1 095 ≈ 550 million rows)!

To avoid accessing the detailed data for every query, frequently required aggregates are calculated in advance along the hierarchies of the dimension members and stored in tables of their own. These advance aggregations speed up retrievals, but lead to an explosion in the amount of storage space required. Under these circumstances it is hardly surprising that some warehouses reach sizes in the terabyte range.
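Such an advance aggregation might be materialised as follows (a sketch reusing the hypothetical tables above):

    -- Monthly turnover per product group and district, stored once so that
    -- common queries need not scan the detailed fact table.
    CREATE TABLE agg_month_group_district AS
    SELECT t.year, t.month, p.product_group, o.district,
           SUM(f.turnover) AS turnover
    FROM   sales_fact  f
           JOIN time_dim    t ON t.time_id    = f.time_id
           JOIN product_dim p ON p.product_id = f.product_id
           JOIN outlet_dim  o ON o.outlet_id  = f.outlet_id
    GROUP  BY t.year, t.month, p.product_group, o.district;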

Ralph Kimball is a keen defender of dimensional modelling. In his view, a data warehouse should consist of a number of star schemes, with thematically linked data cubes forming a data mart. Cross-links between different marts develop through the use of uniform ("conformed") dimensions such as "customer" or "product", whereby the consolidation and integration of dimension data stemming from different upstream systems take place in a staging area (which does not have to be relational, but can also consist of flat files).

Other authors, such as W.H. Inmon, on the other hand, define a data warehouse as a company-wide, normalised repository to which end users have direct access only in exceptional cases. From this central store, subsets of the data flow into divisional and functional data marts, which have a multidimensional structure. This multi-layer architecture requires the development of a company-wide data model – a task whose complexity is, in practice, often blamed for the failure of data warehouse projects.

In this connection, it should also be pointed out that the term "data mart" has not been clearly defined. Apart from the meanings already mentioned, it is also sometimes used to mean simply a "small warehouse".

A data warehouse comprises not just data: all the processes and programs required to extract data from upstream systems, to clean, transform and load them into the warehouse, to perform aggregations and to carry out queries and analyses are also part of the warehouse (fig. 4). Basically, a distinction can be made between three subsystems:

1. The input system in which the extraction and processing of source data and the loading of "cleaned" data in the warehouse takes place.

2. The data-holding system which is responsible for storing and managing data (including aggregations and backup/archiving).

3. The output system, via which users access the data stored in the warehouse with various tools (e.g. report generators, OLAP client programs). This subsystem partly overlaps with the "information factory" (applications for further processing of data from the warehouse).

Fig. 4: The three subsystems of a data warehouse. Input: extraction, cleaning, loading and scheduling of data from the data sources; data holding: data, meta data, backup/archiving; output: queries, reports, OLAP and data mining, serving management information and the information factory.

In all three subsystems there is also a need for meta data describing the stored and processed data. Ideally, there should be a central meta data basis used by all programs belonging to the warehouse. In practice, however, precisely the opposite is the case: the meta data of the tools in use are incompatible with each other and have to be defined and administered independently of each other, which can result in a considerable amount of work in real operation. The Object Management Group (OMG) is currently attempting to standardise meta data exchange (CWMI – Common Warehouse Metadata Interchange; information available at http://www.omg.org/techprocess/meetings/schedule/CSMI-RFP.html). Time will tell whether the proposals which have been presented are adopted and actually implemented by software manufacturers.1

1 By the way: Eurostat consultants (Chris Nelson, Anders Tornqvist) from the company Dimension EDI are participating in the Common Warehouse Metamodel Specification. Their proposal for an "InformationSet" involves an extension of the warehouse concept that is important for statistical offices, namely the collection of raw data using electronic questionnaires.


In short, the following can be said:

- A data warehouse is a concept.
- A data warehouse is a process.
- A data warehouse must be constructed in line with individual requirements.
- A data warehouse, however, is not an individual product or off-the-shelf software.

Of course there are a number of tools which cover some of the functions of a warehouse, and of course their manufacturers promise that all problems can be solved very quickly with these programs ("the 90-day warehouse"). In practice, however, it is much more important to deal with conceptual, organisational, architectural and data-modelling questions. Whether any tools should be used, and if so which ones, does not become important until a relatively late stage in the construction of a data warehouse.

The statistical output database as data warehouse

The characteristics of a data warehouse mentioned in the previous section must seem familiar to any member of staff of a statistical office. For example, in statistics the multidimensional approach to facts has been around for some time now: in the form of cross-classified tables, which provide a two-dimensional depiction of data with several dimensions.

Other characteristics such as:

- very large quantities of data
- data stretching back over a very long period
- the need to validate, transform and integrate data
- hierarchical links between classification members
- the aggregation of detailed data
- the storage of these aggregates
- meta data which describe other data

are nothing new for an NSI – only the terminology used is different (classification criteria instead of dimensions, micro data instead of detailed data, time series instead of "time variant data") but not the underlying concepts.

These correspondences between data warehousing and the statistical production process have so far been completely ignored in the data warehouse literature, however. Historical articles locate the first data warehouses in the 1980s:

"Data Warehousing first emerged in this period between 1984 and 1988." [DEVLIN 1997] "The very first data warehouses were built in the USA in the mid 1980s by large corporations in

the retail, banking and telecommunications industries. By and large, these early innovators were intent on integrating data that had become hopelessly fragmented across these complex organisations and the most common applications were (and still are) in the domain of marketing and sales." [KELLY 1997]

But if we do not strictly limit the concept of the "data warehouse" to commercial enterprises and their business data, the first manifestations can be identified as early as in the 1970s – in the form of statistical output databases.

The statistical production of NSIs often has a structure which is termed "stovepipe" (cf. [PRIEST 1996]). Individual surveys – from the design of the questionnaires and the selection of respondents, through the collection and processing of data, up to the production of results tables and publications – are implemented by different organisational units largely independently of each other.


There is hardly any overall integration covering the entire "universe of surveys". Each "stovepipe" can be regarded as an independent statistical information system, which leads to a lot of problems (lack of overview over the entire system; unplanned redundancy with resulting higher maintenance costs and the danger of inconsistencies; disharmonies and discrepancies with regard to statistical concepts, definitions, variables, classifications, results etc.; no standardisation of data holding and processing; little reuse of software).

A statistical output database (fig. 5) combines object and meta data from separate upstream systems – here these are not OLTP programs, as in a typical warehouse, but surveys – in a central application which aims at simple retrievability of data by users. This does not eliminate the problems mentioned above in retrospect, but for the "statistics client" the existence of an output database brings considerable advantages compared with a pure "stovepipe" organisation. Ideally, all information published by an NSI in publications, press bulletins, WWW pages etc. should also be contained in the output database in greater detail, or be derivable from it.

Fig. 5: The output database. Surveys 1 to n each proceed from raw data through micro data to macro data and results; the output DB combines these results centrally.

At ÖSTAT, the output database ISIS ("Integrated Statistical Information System", also known internationally as LASD – "Large Scale Statistical Data System") was developed as early as 1972/73 – many years before the expressions "data warehouse" and "OLAP" were invented. Nonetheless, ISIS can be described as a MOLAP server in current terminology:

- It consists of over 4 000 multidimensional cubes in a proprietary format, with a maximum of 7 dimensions possible per cube.

- In all there are several hundred dimensions, which can be structured hierarchically.

- Some aggregations of hierarchical dimensions are calculated in advance and stored when the data are loaded, whereas others are calculated "on the fly" when they are retrieved.

- With the help of a powerful query language, the user can retrieve results with any dimensions and at different hierarchical levels (slice, dice, drill down, drill up). There are also numerous mathematical/statistical functions available.


In addition to the object data, ISIS also contains meta data: in order to find a specific cube (e.g. via a hierarchically structured list of subjects or via full text search) and to find information about the data contained in a cube (data source, breaks in time series, etc.).

ISIS runs on ÖSTAT's IBM mainframe and consists of about 800 assembler and PL/1 modules (including various administrative programs for the database administrator). In view of its early origin (E.F. Codd had only just published the theoretical deliberations which were to serve as a basis for relational database systems), ISIS had to be developed at ÖSTAT down to the finest details. If we began the implementation today, we would probably use object-oriented programming languages, design an n-tier architecture using DCOM or CORBA, and use a relational database – and possibly also a commercial OLAP server such as Microsoft OLAP Services or Hyperion Essbase – for data storage. But even if ISIS is no longer at the latest level of IT, it is still state of the art in its concepts and ideas!1

1 And there were no Y2K problems!

There is no denying that ISIS has attracted three points of criticism in the past few years:

1. The content is not always completely up-to-date.
2. The user interface no longer meets modern expectations.
3. The query language is difficult to learn and is quickly forgotten again if it is not used regularly.

The first point of criticism is connected with the ranking of the output database within the production of statistics. For some divisions, the provision of data for ISIS would appear to be a necessary evil that is not dealt with until other work (production of publications in different formats) has been completed. With appropriate management decisions at Statistics Austria, which has a new structure as from 1.1.2000, and with organisational arrangements, this problem should be solved relatively easily.

As far as the second point of criticism is concerned, this is a challenge for the informatics division. At the moment, work is being done on a graphical user interface which allows ISIS queries from any Java-compatible WWW-browser, but can also be started as an independent application. Since a knowledge of the proprietary ISIS query language is no longer necessary when using this new client software, the third point of criticism is dealt with at the same time.

Apart from the fact that a statistical output database contains no detailed data, it has all the main characteristics of a data warehouse. However, if one mentions to data warehouse experts working in a commercial environment shaped by the publications from Inmon to Kimball that large statistical databases contain thousands of multidimensional cubes and hundreds of different dimensions, one is normally confronted with reactions such as:

"I would also be wondering who in the world could possibly mentally manage 113 dimensions in one multidimensional model. People have difficulty conceptually managing much more than seven or so dimensions in one place, even though the tools allow more."

"I would suggest re-visiting your design, especially if it has 100+ dimensions. I would also re-consider your fact-table design, especially if you have 100+ of those."

"I too spent twenty years working at a National Statistical Agency (Statistics Canada) and have followed this and other threads discussing the perfect dimensional model of 6-12 dimensions and a couple of fact tables with a great deal of interest. Over the past few years I have presented the Canadian Census model at various local and international venues, and have been told by the 'experts' that the model, composed of hundreds of dimensions, was poor design and planning, and either could not work or would be impossible to manage."


(All these quotations are taken from contributions to the "Data Warehouse Mailing List"2 from the end of November/beginning of December 1999, in response to a mail written by a staff member of a statistical office – which office it was, unfortunately, could not be identified from the e-mail address.)

2 Subscription possible at http://www.datawarehousing.com

Where do these many dimensions in a statistical output database come from?

The reason for the "dimensional explosion" lies in the function to be performed by such a database: it is supposed to provide the results of statistical production processes in multi-dimensional form for queries by end users. Statistical information is information not about individuals but about collectives, in other words an NSI can only publish aggregated data – for legal reasons for a start. This is why an output database contains no data at the most detailed level.

If we take for example the Census (including the accompanying full survey of housing) which is held once every ten years, it becomes immediately clear that the star scheme is not particularly appropriate for it. If advanced concepts of dimensional modelling such as "demographic mini-dimensions" (cf. [KIMBALL 1996]) are not applied, a star with only two dimensions is probably obtained: namely "Person" (with numerous attributes such as "sex", "age", "marital status", "nationality", "number of children" etc.) and "housing" (with a regional hierarchy and also a number of attributes). A time dimension does not exist because all person-related data have to be rendered anonymous so that it is not possible to retrieve the Census data of e.g. 1981 and 1991 for the same individual.

On the basis of the number of data records, it would also be a very unbalanced star scheme. Normally, dimension tables contain relatively few records and fact tables very many (cf. the earlier supermarket example: 500 entries in the outlet dimension, 2 000 in the product dimension and just over 1 000 – three years of daily extraction – in the time dimension; when an average of 50 % of the products per day and outlet land in the customers' shopping baskets, the fact table shows roughly half a billion rows after three years). By comparison, the "Census star" in Austria would have about 8 million records in the person dimension (in the USA even over 200 million!) and 3.5 million in the housing dimension; the fact table, on the other hand, would probably not exceed ten million records.

A statistical output database, as mentioned, contains no detailed data but many relatively small data cubes resulting from summations over individual variables of the survey. For example, one cube could depict the fact "Number of Persons" with the dimensions "Time" (which is available for aggregates, but not at the finest level of detail), "Region" (a hierarchy with "Municipality", "District" and "Federal state" levels), "Sex", "Age" (with hierarchical age categories) and "Marital Status"; another cube could depict the dimensions "Time", "Region", "Nationality", "Number of Children" and "Age", etc.
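In relational terms, such an aggregated cube is simply a small table in which the former micro-level attributes act as dimensions (a sketch; all names are invented):

    -- One cube of the output database: "Number of Persons" by time, region,
    -- sex, age category and marital status. Only aggregates are stored.
    CREATE TABLE cube_persons (
        census_year    INTEGER,        -- "Time", available at aggregate level only
        municipality   VARCHAR(50),    -- finest level of the "Region" hierarchy
        sex            CHAR(1),
        age_category   VARCHAR(20),    -- hierarchical age classes
        marital_status VARCHAR(20),
        persons        INTEGER,        -- the fact: number of persons
        PRIMARY KEY (census_year, municipality, sex, age_category, marital_status)
    );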

This means that attributes at the level of the detailed data become dimensions of one or more aggregated data cubes. This is why a "dimensional explosion" occurs in an extensive statistical output database – a phenomenon which I have not yet seen described anywhere in the data warehouse literature.

The NSI as data warehouse


As shown in the previous section, we are fully justified in calling a statistical output database a data warehouse. We could now go one step further and call the entire statistical office a warehouse. For this, we should look at the "NSI system" from fig. 1 a bit more closely.

To solve the problems resulting from the "stovepipe" organisation, many statistical offices are working on concepts and projects for managing object and meta data and on developing integrated statistical meta information systems. The aim of these activities is the horizontal (i.e. multi-survey) and vertical (i.e. spanning the stages of statistical production) integration of statistical information systems into a universal information infrastructure.

The general aims are:

- to create as extensive, flexible, open, simple and user-friendly access as possible, for both NSI-internal and external "statistics users", to the object and meta data of relevance to them;

- to plan redundancy and avoid inconsistencies;

- to achieve planned collection, storage and (multiple) use of meta data (which means that they have to be standardised and harmonised);

- to establish norms for data holding in general and for interfaces between the software products used for producing statistics;

- to provide support to users, in performing their tasks in the production and use of statistics, with general-use tools, in other words tools which are not tailored to a single survey only;

- to enforce global solutions, in other words solutions covering the entire statistical office, instead of insular solutions or double and multiple developments;

- and finally, always to take into account the diverging, and in some cases unknown or unpredictable, needs of different user groups.

These demands require the setting up of an NSI-wide information system covering all surveys and supporting the entire statistical production process – from preparations for a survey to the dissemination of results – with appropriate tools.

Fig. 6: The BASIS2000+ concept: an object data component (raw data, micro data, macro data, results) and a meta data component, accessed via tools/interfaces, with links and data flows to the "Input", "Production" and "Administration" systems.

Fig. 6 shows the BASIS2000+ (meta data based statistical information system) concept developed by ÖSTAT together with the links and data flows to other systems. BASIS2000+ consists of three components:


The object data component contains statistical data with different levels of aggregation. The first stratum is formed by the checked and corrected micro data of all the surveys conducted by ÖSTAT. In the course of statistical production, micro data are aggregated into macro data, which also form part of the object data component and in many cases serve as a basis for further evaluations. The highest level of aggregation is statistical information in the form of tables and graphs; the aim must be to have all the "information objects" produced by ÖSTAT accessible in BASIS2000+.

The meta data component is at the heart of BASIS2000+. Here the content of the object data component must be documented in order to allow both physical access to object data and the interpretation of their content. With its information about classifications, surveys, statistical concepts, variables, data holdings, publications etc., the meta data component forms an extensive reference database which opens up ÖSTAT’s statistical information to both internal and external users.

The information stored in the object and meta data components is accessed via user and program interfaces. These should be regarded as parts of the third BASIS2000+-component together with standardised data formats and generalised tools which are used at different stages of the statistical production process.

The life cycle of a statistical survey begins with the "observation" stage which covers all activities connected with preparing and planning the survey and collecting data. These take place in the "Input" system in which meta data already stored in BASIS2000+ are accessed and new meta data are added. As soon as this information is available in the meta data component, tools from the tools/interface stratum of BASIS2000+ can make use of them.

In the "Input" system also the second sub-part of a survey ("preparation", in other words, collection, checking and correction of raw data) takes place. Here, too, apart from individually developed software, general-use tools can be employed and the content of the central meta data component is accessed. Finally, the micro data regarded as correct are loaded into BASIS2000+ in a standardised format.

The micro data are aggregated into macro data at the next stage of statistical production. This is done partly by an automatic process within BASIS2000+ and partly in the "Production" system, whereby the macro data holdings produced there are again loaded in the object data component and supplements/updates are carried out in the meta data component.

This division of tasks also applies to the "use" process: the micro and macro data provided in BASIS2000+ (including the accompanying descriptive information) serve as a basis for all analyses. These are either processed with general-use tools or exported into the "Production" system (where, for example, analyses are carried out with SAS or individual software). Information objects produced in the "Production" system – finished tables, graphs, documents – are stored in BASIS2000+ in standardised formats and documented in the meta data component. The "use" process also includes the search for and retrieval of statistical information by external "clients"; this is done exclusively in BASIS2000+.

Fig. 6 also shows the "Administration" system which comprises all applications which are not or only indirectly connected with statistical production (e.g. a staff information system). Data flows occur between "Administration" and the meta data component of BASIS2000+, for example when the current telephone number of the person responsible for a survey is the subject of an inquiry.

It should be emphasised that BASIS2000+ is not a single, monolithic application. Instead, it consists of smaller sub-systems which are integrated on the basis of jointly used, standardised object and meta data. BASIS2000+ is primarily a general concept – a vision. It provides a framework which allows implementation to start in sub-sectors, prototypes to be developed whose results can be used in practical applications, and experience to be collected as quickly as possible – which in turn allows the overall concept to be refined and adapted in a feedback process.

This was a short review of BASIS2000+. Are we now justified in speaking of a data warehouse?

I don’t think so. Of course there are a number of parallels to the data warehouse concepts, processes and features (especially if we define a warehouse not as a collection of star schemes but as a company-wide, multi-tiered integrated data repository), but the scope and extent of an architecture such as BASIS2000+ goes far beyond the meaning that is associated with the term "data warehouse" by 95 % of IT experts.

The concept of "meta data" alone has to be interpreted on a very much broader basis in a statistical environment. Statistical data are always a combination of object and meta data, whereby the latter are both produced in the statistical production process and re-enter the process as input in other work stages.

Statistical classifications, for example, can be both meta data (texts being allocated to codes) and independent complex "information objects" available in different versions, whose elements may have links with elements of other versions and classifications and to which meta data (e.g. a list of technical terms allocated to classification members) belong as well. Accordingly, a classification database for administering classifications and the accompanying meta data is a central feature of the meta data component of BASIS2000+, whereas in a data warehouse in a commercial environment, such an application is not required.
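As a rough illustration of why this goes beyond ordinary warehouse meta data (a sketch; the tables are hypothetical and do not depict the actual BASIS2000+ design):

    -- Classifications as versioned information objects.
    CREATE TABLE classification_version (
        version_id     INTEGER PRIMARY KEY,
        classification VARCHAR(30),        -- e.g. an activity classification
        valid_from     DATE
    );

    CREATE TABLE classification_member (
        version_id  INTEGER REFERENCES classification_version (version_id),
        code        VARCHAR(10),
        title       VARCHAR(100),
        parent_code VARCHAR(10),           -- hierarchy within one version
        PRIMARY KEY (version_id, code)
    );

    -- Correspondences between members of different versions/classifications.
    CREATE TABLE member_link (
        from_version INTEGER,
        from_code    VARCHAR(10),
        to_version   INTEGER,
        to_code      VARCHAR(10)
    );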

We get even further away from the typical warehouse when, in considering statistical meta information systems, we place the emphasis not so much on descriptive meta data oriented to human users, but on procedural aspects (active, "embedded" meta data; the meta information system as a "workbench"; cf. for example [BETHLEHEM et al. 1999] and [FROESCHL 1999b]).

Since even describing a statistical output database as data warehouse can lead to misunderstandings in discussions with warehouse experts from the commercial sector, it would appear sensible not to call NSI-wide (meta) information systems "data warehouses" – we can thus save ourselves a lot of explaining.


Literature

[ANAHORY/MURRAY 1997] Sam Anahory, Dennis Murray: Data Warehousing in the Real World. Addison-Wesley, ISBN 0-201-17519-3

[BETHLEHEM et al. 1999] Jelke Bethlehem, Jean-Pierre Kent, Ad Willeboordse, Winfried Ypma: "On the Use of Metadata in Statistical Data Processing". Report for the UN/ECE Work Session on Statistical Metadata, Geneva, 22–24 September 1999

[DEVLIN 1997] Barry Devlin: Data Warehouse: From Architecture to Implementation. Addison-Wesley, ISBN 0-201-96425-2

[FROESCHL 1999a] Karl A. Froeschl: "Metadata Management in Official Statistics – An IT-based Methodology Approach". Austrian Journal of Statistics, Vol. 28 (1999), No. 2

[FROESCHL 1999b] Karl A. Froeschl: "On Standards of Formal Communication in Statistics". Report for the UN/ECE Work Session on Statistical Metadata, Geneva, 22–24 September 1999

[INMON 1995] W.H. Inmon: "What is a Data Warehouse?". Published on the World Wide Web at http://www.cait.wustl.edu/cait/papers/prism/vol1_no1

[KELLY 1997] Sean Kelly: Data Warehousing in Action. John Wiley & Sons, ISBN 0-471-96640-1

[KIMBALL 1996] Ralph Kimball: The Data Warehouse Toolkit. John Wiley & Sons, ISBN 0-471-15337-0

[PRIEST 1996] G. Priest: "Issues of Meta Information and Integration". Report for the UN/ECE Work Session on Registers and Administrative Records in Social and Demographic Statistics, Geneva, 11–13 November 1996

[STANFORD] Quotation ascribed to Stanford University, published on the World Wide Web at http://www.datawarehousing.com

[SUNDGREN 1996] Bo Sundgren: "Making Statistical Data More Available". International Statistical Review (1996)


THE APPLICATION OF DATA WAREHOUSING TECHNIQUES IN A STATISTICAL ENVIRONMENT

M.H.J. Vucsan
Statistics Netherlands
Research and Development Division, IT Department
Postbus 4000
NL-2270 JM Voorburg
[email protected]

Summary: Storing data in a database constructed in accordance with the dimensional model is beneficial for statistical production. The essential feature of the dimensional model is the storage of textual data in what are known as "dimensions" and the storage of numerical data, with keys referring to the dimensions, in a fact table. Such a structure is known as a data mart. A pilot project at Statistics Netherlands' Department of Population Statistics, using Microsoft software on Intel machines, has yielded positive results. A data warehouse with two data marts has entered service for regular statistical output and answers to ad hoc questions.

Keywords: data warehouse, data mart, dimensional model, statistics, OLAP, Microsoft

1. PRINCIPLES OF THE DIMENSIONAL MODEL

The concept of the data warehouse is, of course, nothing new to the CBS. Those who can remember boxes of punch cards know that they were in fact a data warehouse, with the punch cards representing the fact table and the code lists the dimensions. Maybe it's time to resume what we were doing...

1.1 Query versus transaction

1.1.1 What is OLAP?

The abbreviation OLAP stands for On Line Analytical Processing, which means that the database can be expected to give a very fast response to "big" questions.

To this end, the database not only possesses special middle-tier applications, but is also specially modelled, with redundancy preferred to normalisation.

1.1.2 Is the production and analysis of statistics OLAP?

Yes, basically, although a distinction can be made between two types of statistical production. Firstly, there is standard, regular production, i.e. pre-determined overviews which will eventually form part of a CBS publication. This type of output is certainly OLAP, but it does not provide the necessary and sufficient conditions for a data warehouse. Later on, we shall see why we still think a data warehouse is the right solution.

The second type of statistical production, analysis, is an activity for which a data warehouse can be very valuable. After all, next to machine-based methods of data mining, human analysis is still the best method of obtaining the most interesting facts from raw data. You can think of this as a refining process: the raw data in the data warehouse have to be correlated in various ways by humans, so that information can be extracted from them.

Lastly, the future of the statistical production process is one in which "ready-made" statistics presented in paper form will increasingly be displaced by automated means which, within certain limits, will allow ad hoc statistics to be compiled electronically for remote and other users. One of the CBS' traditional activities, the compilation of aggregates, will become a threatened species.

1.2 Consistency

1.2.1 OLAP is consistent at global level

A data warehouse works completely differently from an OLTP database. Not only do a number of the data have a "calculated accuracy", but not all the data may be present.

"Calculated accuracy" denotes the fact that, when an event or "fact" such as a store sale has a "margin" column, that margin is of course a number which originated in the accounts and which has worked its way down from the global level. At this level, the figure may be incorrect. Is the margin on a bar of chocolate really 0.001 cents? If, however, we compile aggregates from the data warehouse and end up with the totals for this column, then we are back where the concept of "margin" was defined.

The fact that not all the data may be present is not, of course, due to a structural lack of input, but to the arbitrary removal of obvious inaccuracies or of minor errors in data handling. As long as these removals are arbitrary (i.e. random), no harm is done and correct grids can still be compiled.

1.2.2 OLAP is consistent over time

Once data have been entered in a data warehouse, they remain there for good. The result for a given period stays the same. A data warehouse is essentially a time series. But more about that later.

1.3 The dimensional model

1.3.1 The data cube

This concept is of particular importance in statistics. The idea is that you can always distribute data over several axes. The cube metaphor is appropriate because two variables are viewed over time, thus creating a kind of three-dimensional space, i.e. a cube, possibly with marginal values. This term is also used for thematic areas in a data mart.

StatLine makes use of this technique.

1.3.2 The star join

Storing cubes in a relational database is not too difficult as long as one realises that the cubes tend to be ‘sparse’, i.e. the data are distributed sparsely among the available cells.

Storing these cells, of course, entails no more than creating a table with fields for all the dimension keys of the cube and for the cell values, and starting the loading process.

The dimensions are therefore the column headers, no more and no less. Of course, we put these dimensions into tables (the dimension tables). Think, for example, of municipal codes.

Querying the cube therefore always entails joining the table containing the data and the dimension tables. The 'WHERE' clause is the limiting condition applied to the data in the dimension tables, for example 'where municipality name = AMSTERDAM'.
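Spelled out as SQL (a sketch; the table and column names are hypothetical):

    -- Star join: constrain the fact table via a dimension table.
    SELECT t.year, SUM(f.persons) AS persons
    FROM   population_fact f
           JOIN municipality_dim m ON m.municipality_id = f.municipality_id
           JOIN time_dim         t ON t.time_id         = f.time_id
    WHERE  m.municipality_name = 'AMSTERDAM'
    GROUP  BY t.year;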


A star join therefore always involves a large table (the fact table) and a number of small tables (the dimensions). The fact table is measured in gigabytes, whereas the dimension tables are of the order of 100 megabytes. Database optimizers have special solutions for this pattern.

So, what's new? We've always been doing this at the CBS. What is new is the idea that we are dealing with a new discipline, rather than a one-off solution to a software problem.

1.3.3 The data mart

A star-join configuration with a fact table and several dimension tables is known as a data mart. A data mart contains information about a specific subject. That subject is almost always a process. The following are some nice examples:

- Store sale of product to customer at a particular time
- Customer's journey in vehicle to and from destination
- Delivery of goods from loading platform to customer on a particular date
- Treatment of patient in hospital at a particular time
- Balance of customer account at a particular time
- Presence of bird in breeding grounds at a particular time.

The nice thing about a data mart is that the data are not free standing, but are (or should be) part of a data warehouse.

1.3.4 The data warehouse

The strength of a data warehouse is that it consists of a number of data marts which dovetail with each other. This is done by harmonising the dimensions: the municipality code table, or the dimension 'LOCATION', must of course be identical for all the data marts, or one version must be a subset or superset of the other.

A data warehouse which comprises data marts in this manner has a very high added value. Of course, one can relate any number of processes to each other, for example, the store sales data mart and the purchases data mart, to see how much has been stolen or broken.

Viewed from this perspective, the CBS could become the Netherlands' data warehouse.

It is important to realise that dovetailing the various data marts and their dimensions is a logical concept. It is not at all necessary for two data marts to be located in the same physical database or on the same computer. We will never approach two data marts with a single SQL statement, because that takes much too long. There are much better solutions, such as running two separate queries and letting the client application merge the results.

1.4 Fact and dimension tables

1.4.1 What is a dimension table?

We have already looked at an example of a municipality code table as a (fairly simple) dimension. Basically, a dimension table is just a column header expanded into a table. Things become more complicated if we look more closely. Let us take another look at the municipal codes. The dimension "municipality" is unlikely to be of much use in many data marts: something along the lines of a generalised location dimension is probably more suitable.


What does a location dimension look like? Firstly, of course, we will want to enter the municipality codes. But we will also want to enter the hamlets and the provinces. We now see that the municipal code may not be sufficient for our needs, and we move up to a 4- or 5-digit integer. This is just a number like any other and has no intrinsic significance. We give each record a new number, which then becomes the primary key.

This becomes clear if we look at a fragment of this dimension:

key    province       municipality  hamlet
00234  south holland  alkmaar       oude-pekela
00233  south holland  alkmaar       nieuwe-pekela
00232  south holland  beverwijk     beverwijk

This dimension allows us to join not only at municipal level, but also at hamlet and provincial levels.
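A hedged sketch of such a join, reusing the hypothetical tables from the earlier sketch:

    -- Province, municipality and hamlet are attributes of the same dimension
    -- row, so a single join supports aggregation at any of the three levels.
    SELECT l.province, SUM(f.amount) AS total_amount
    FROM fact_sales f
    JOIN dim_location l ON l.location_key = f.location_key
    GROUP BY l.province;

    -- Substituting l.municipality or l.hamlet changes the level of the join.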

1.4.2 What is a fact table?

A fact table contains the process variables which we are interested in. If we look at the process 'store sale', we want to know not only when what was sold to whom, but also how much was sold and what it cost. It does not make much sense to create a dimension containing a vast array of amounts and quantities; we may also want to add them up some time. What we do instead is create a record whose primary key consists of "foreign" keys to the dimensions, plus one or more attributes which denote quantities or other variables that can be expressed numerically.

This becomes apparent if we look at a fragment of the fact table:

loc.   time   prod   customer  no.  amount
00234  00011  88234  211154    2    400
00233  00003  78986  329809    1    3400

Using the combination of the dimension keys location, time, product and customer, it is now possible to record when "something" was sold "somewhere" to "someone". The only information not yet known is the quantity and cost, and that information can be found elsewhere in the record. The four foreign keys together point to a combination of textual attributes which describes the event perfectly. The fact table itself contains nothing more than numerical data. The two main reasons for this are compactness and the desire for the data in the fact table to be countable. The desire for the attributes in the fact table to be countable means that the amount is stored as a total; otherwise number*price would constantly recur in the SQL, which would seriously impact performance. If we want to know the price per item, we can find it either in the product dimension (if the price does not change too often) or by calculating amount/number.
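The point about countability can be made concrete with two queries against the hypothetical fact table sketched earlier: because the stored amount is already a total, aggregation needs no multiplication, while a unit price can still be derived on demand.

    -- Cheap aggregation: amounts are additive, so a plain SUM suffices
    SELECT SUM(amount) AS total_turnover
    FROM fact_sales;

    -- Unit price derived only when needed, as amount divided by quantity
    SELECT product_key, amount / quantity AS unit_price
    FROM fact_sales
    WHERE quantity > 0;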

Another important feature is that the keys have no intrinsic meaning and consist only of numbers which have been assigned to them. By the way, there is nothing unusual about that at the CBS.


1.5 Time series

The statistical process could be described in abstract (and highly simplified) terms as the observation of variables at a particular moment in time. With a little effort, it is possible to see any set of statistics as a time series.

1.5.1 A data warehouse is a time series

Data which are stored in a data warehouse have, to put it mildly, a static character. The aim is not for data already stored in the data warehouse to be amended, although that is an option. A data warehouse is only ever topped up; it is never refilled.

The data in a data warehouse describe processes and the course which they follow over time. This makes data warehouses extremely well suited for compiling statistics. Today's data warehouses are almost all used for compiling statistics, although aggregation is, of course, done by the client himself.

1.5.2 Changing dimensions

One of the knottiest problems associated with statistics is that of changing codes. This problem also occurs in data warehouses and has given rise to a number of solutions which are very similar to ones already adopted by the CBS.

The main solution for changing codes is to use a denatured key, i.e. an arbitrary integer. If one of the attributes changes, a new record with a new key is made.

As soon as new events occur, the new combination is used. The nice thing about this solution is that code changes do not present the slightest problem.

Example:

In the dimension location, the hamlet of Baarsland is transferred from the Municipality of Rijnsburg to the Municipality of Rijnswoude (both of which are in the Province of South Holland) as of 1 November 1996. Queries concerning South Holland are not affected by the change. Nor, of course, are queries using the underlying grid. Queries which aggregate up to municipal level should reveal a shift affecting both municipalities. The question now is whether this is desirable.
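As a sketch, with hypothetical key values and the location dimension from the earlier example, the change amounts to nothing more than one insert; the technique is what the data warehousing literature calls a slowly changing dimension.

    -- The old row for Baarsland (key 567, say) is left untouched; a new row
    -- with a fresh arbitrary key records the new combination of attributes.
    INSERT INTO dim_location (location_key, province, municipality, hamlet)
    VALUES (568, 'south holland', 'rijnswoude', 'baarsland');

    -- Facts dated before 1 November 1996 keep pointing at key 567; events
    -- loaded after that date use key 568. Provincial totals are unaffected,
    -- while municipal totals show the shift described above.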

A table in the statistical yearbook cannot simply record a shift between two municipalities without further explanation, since the user would have no way of telling what had caused the shift!

With a data warehouse, the user is expected to interpret the results himself. We can therefore expect a user who discovers anomalies in the data for the two municipalities to run a few simple queries and compare the data by year, municipality, etc. in order to check that the shift in the first query was not caused by relatively minor things like changes to municipal borders.

My conclusion is that entering this sort of change in the dimension table does not give rise to problems. (Time will tell)

1.5.3 Time is a separate dimension

Time is, of course, a separate dimension. I don't think that anyone would dispute that. And yet, the manner in which we treat it in a dimensional model is not what you might expect.


At first glance it would seem reasonable to use the date in the internal data base format in the records of the fact table, or a number which represents yymmdd, which is familiar from a wide range of statistics. This approach is not very practical, however.

Let us take a look at some of the problems associated with entering dates in fact tables:

- tiresome conversions to week numbers, etc.;
- calculating dates means that queries take a long time to answer;
- not until AFTER the fact table has been accessed do we know whether Q1 76 is in the database.

All these problems can be avoided by using a separate dimension table for time. The table stores all possible time representations using denatured numerical keys. The key (an arbitrary number) is then used as a link to the fact table.

The following is an example of what a time dimension table can look like:

key    day        month  quarter  year  leave status
00093  Wednesday  12     4        1977  0
00094  Thursday   12     4        1977  1
00095  Friday     01     1        1978  1

It is clear what is going on here: all possible times and dates are given in completely denormalized form.

If I want to know what happened on Thursdays in the course of the years, the join is:

blabla WHERE day = 'Thursday' etc.

Not very difficult and quite fast. Other tasks, like compiling monthly aggregates, can also be accomplished more easily.
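Fleshed out with hypothetical names, the time dimension of the fragment above and the two queries might look as follows.

    CREATE TABLE dim_time (
        time_key      INTEGER PRIMARY KEY,   -- arbitrary, "denatured" key
        day           VARCHAR(9),            -- 'Monday' .. 'Sunday'
        month         INTEGER,
        quarter       INTEGER,
        year          INTEGER,
        leave_status  INTEGER
    );

    -- Everything sold on Thursdays, over all years
    SELECT SUM(f.amount)
    FROM fact_sales f
    JOIN dim_time t ON t.time_key = f.time_key
    WHERE t.day = 'Thursday';

    -- Monthly aggregates are equally direct
    SELECT t.year, t.month, SUM(f.amount) AS monthly_total
    FROM fact_sales f
    JOIN dim_time t ON t.time_key = f.time_key
    GROUP BY t.year, t.month;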

1.6 Aggregates

1.6.1 The need for aggregates

With fact tables of more than several gigabytes, it makes sense to compile a few aggregates for some of the more frequently requested data.

Generally speaking, the use of precompiled aggregates is the most important means of enhancing data warehouse performance. The reasons are obvious.
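As a hedged illustration, a precompiled aggregate can be nothing more than a smaller table computed once at load time; the sketch below reuses the hypothetical tables from the earlier examples and generic SQL (SQL Server 7 itself would use SELECT ... INTO).

    -- Precompute a year x province summary once ...
    CREATE TABLE agg_sales_year_province AS
    SELECT t.year, l.province, SUM(f.amount) AS amount
    FROM fact_sales f
    JOIN dim_time t ON t.time_key = f.time_key
    JOIN dim_location l ON l.location_key = f.location_key
    GROUP BY t.year, l.province;

    -- ... so that a frequent query scans a few hundred rows, not gigabytes
    SELECT year, amount
    FROM agg_sales_year_province
    WHERE province = 'south holland';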

1.6.2 Automatic navigation

Where aggregates are present in a data warehouse, using them to answer queries used to be a major problem, because it required fast access to a sort of data dictionary and software capable of automatic navigation.

The compilation of separate fact tables containing the aggregate and smaller tables for the aggregated dimensions ultimately proved the best solution for storing aggregates in the database.

Incidentally, modern OLAP engines which provide the end user with data cubes have made aggregate management and navigation completely automatic.


1.6.3 Management

Thanks to automatic navigation, management of the aggregates has become a task for the database administrator alone. He is the one who uses the performance data gathered in the course of the day to decide whether aggregates are required and, if so, which ones. Compiling aggregates is therefore a dynamic process which belongs in the management domain.

Happily, this makes designing a data warehouse much easier.

2. COMPLICATIONS IN THE DESIGN OF STATISTICAL DATA MARTS

The supermarket model is not appropriate for the design of a data mart which is to be used in a statistical environment. Things start to go wrong as soon as we try to define the process under observation: what process should be observed and recorded when examining population data?

Nor is the situation much better as regards the design of the dimensions. Not only is it unclear what dimensions there are; their attributes are usually difficult to identify.

In statistical environments, allowance also has to be made for the fact that a number of codes which have traditionally played an important role in statistics have to be included in the dimensions. The reason is that, in analyses requiring aids other than standard query tools, these codes are often needed to avoid complex operations when extracting a data subset, namely translating the texts back into codes.

Also, the design is usually influenced not only by data needs, but also by data availability.

3. APPLICATION OPTIONS IN THE STATISTICAL PRODUCTION PROCESS

If we take a critical look at the conceptual framework of data warehousing, we see that we are dealing with a simple method of compiling statistics. If we can adapt the technology, it should be possible for us to put it to good use.

3.1 Checks and corrections

If we bear in mind that statisticians have a considerable need for insight into the masses of data with which they are confronted, it becomes clear that a data warehouse can be a powerful aid. After all, as long as data are stored in flat files, we need the help of a program and must ask the right question in order to get an answer, and that answer is all we get. A data warehouse allows us to check the data visually, as it were. Not only can practically any aggregate be produced within seconds; one cannot avoid noticing anomalies in the subtotals, since they appear on screen!

We have found that statisticians want to load data at ever earlier stages of correction so as to gain a firmer grip on the correction process. This would not be the first time that checking and correction software "corrected" genuine phenomena because they had previously seemed implausible. A likely use is the repeated loading of a data mart during the checking and correction cycle, so as to steer the cycle. A first step in this direction was taken with the current population statistics project, by working with a provisional load before all the processing operations were completed and all the secondary data were known. The risk of publishing incorrect figures is very small, because the data warehouse manager ensures that users know what they are working with, by naming the cubes clearly before making the material available to them.

3.2 Analysis


In the statistical analysis phase, a data warehouse is important not only as a replacement for ad hoc query systems, but also as a means of obtaining insights into the cleaned population. For further analysis, it is important to be able to distinguish data subsets. Data warehouses are not the most suitable tools for model-based estimates, special tables with SPSS, etc. We also expect this type of aid to make checking, correction and analysis activities increasingly interwoven.

3.3 Output

Data warehouses make very suitable output media, but it is too early to use them or access them outside the CBS: security procedures are not yet in place, particularly procedures protecting against repeat consultation and recombination of data. The CBS is using the StatLine program for this activity. StatLine is constructed on data warehousing principles and will serve for a few more years to come.

4. TECHNICAL IMPLEMENTATION AT THE CBS

We decided to use Microsoft software for the first data mart. The CBS had recently decided that Microsoft software would be the standard. Microsoft has taken a conscious decision to make data warehousing available for mass use, in the form of SQLserver 7. Not only does SQLserver contain the Plato cube engine as standard, but tools such as EXCEL2000 dovetail neatly with the back-end software.

The diagram shows how the server components interact and how the link to the client application (EXCEL2000) is created. The link between the workstation and the server goes via OLE-DB.

4.1 The data warehouse consists of data marts

Given the CBS' decision to decentralise and downsize much of the processing work, it was decided not to put the CBS data warehouse on a large computer. On the contrary, we started from the premise that a data warehouse is a logical unit consisting of numerous data marts, and that it lends itself to distribution among a large number of machines. It was decided, for practical reasons, that a data mart would be indivisible between computers. This has a number of advantages.

4.2 Star configurations in a relational SQLserver database

Although the Plato engine is excellently suited to making cubes from an OLTP database with a normalised model, it was decided at the outset not to take that avenue. There were two reasons for this decision. Firstly, articles were beginning to appear in the specialist press suggesting that perhaps this wasn't such a good idea after all, because it would be extremely difficult to store historical events etc. in this way; and secondly, our aim (in addition to consulting cubes using interactive tools) was to extract data subsets from the data warehouse. We therefore opted to create genuine star configurations. These star layouts were then stored in the relational database with an SQLserver 7 kernel.

4.3 Aggregates and cubes in PLATO

The Plato engine is an add-on to the SQLserver software. It can run either on the database server itself, or on its own server. The software also operates with databases produced by manufacturers other than Microsoft. Its official name is Microsoft DSS services.

If we really wanted to do things by the book, the aggregates would be made in the database and maintained during loading. For our purposes, however, it was sufficient that we could use the Plato engine to compile aggregates automatically as a means of creating more space or improving efficiency.

Constructing the various cubes is the task of the (decentralised) data warehouse manager and is fairly straightforward. With a graphic tool, it is easy to indicate which fields from which dimensions, and which numerical data, are to be made into a cube. The relatively simple tasks of definition and processing can then begin.

By opting for a DSS server (Plato, pivot table services, etc.: the beast has many different names) as a means of providing the interactive end user with data, we also opted to allow access via cubes. In this context, cubes are thematic areas within a data mart (star configuration). What happens is that the data warehouse manager chooses a number of dimensions from among all the dimensions in the mart. He then selects a numerical item from the fact table and has it pre-processed to form a cube. This ensures a better performance than is achievable via direct queries of the star in the relational database.

4.4 End user tools

EXCEL2000

For statistical applications, spreadsheets are an excellent aid for browsing through a data warehouse via a DSS engine. EXCEL2000 can contact the Plato OLAP provider (Microsoft DSS services) via OLE-DB, and so make cubes available to the user relatively easily.

Spreadsheets dovetail neatly with statisticians' skills and have the immediate effect of raising productivity.

Our extraction tool

Yes, we had to get our hands dirty and build our own tool. It should be possible to extract not only ad hoc aggregates but also data subsets from a data warehouse in a statistical environment, and the Microsoft pivot table services are not the most appropriate for this. That is why we developed a simple program for this project which can write selections to a file and which makes it easy to formulate a query without the user having to understand SQL. The user can take a look at the SQL statement, however, and use it as a kind of semi-finished product.

5. THE ANNUAL STRUCTURAL SURVEY OF NETHERLANDS MUNICIPALITIES

(Annual enumeration of the whole population)


The Department of Population Statistics carries out an annual census based on the population data available to the municipalities. Hitherto, the census consisted of a large number of large sequential files, for which special software was used for calculations. Not only is the management of these files a difficult task, because of their size and number, but consulting them also had to be done on a planned basis. Throughput times (including de-archiving) of up to 60 hours were not exceptional.

Two years ago, the Department decided, in consultation with the Automation Division, to undertake a pilot study of whether a data warehouse could solve these problems. We began a joint project during which it gradually became clear how a data warehouse would have to function in a statistical environment, and that the model could not be constructed entirely along the lines of the familiar supermarket model.

5.1 Structure of the model

Initially, it looked as if just one data mart would be sufficient to meet our information needs, but it soon became clear that the ADDRESSPERSON (ADRESPERSOON) data mart could not provide information on the relationship between persons living at the same address. It was therefore decided to create a second data mart, called ADDRESSFAMILYRELATIONSHIP (ADRESGEZINRELATIE), which we now refer to as simply FAMILY (GEZIN). The overall structure of the two data marts is as follows.

5.1.1 PERSON data mart

The PERSON data mart is dedicated to the analysis of a person's residence at a particular address. All the data about this person, insofar as they are available from the Population Register, are stored in the data mart.

5.1.2 The ADDRESSRELATIONSHIP data mart

In this data mart, the relationship between two persons is the most important fact. The keys of both persons and those of their youngest and oldest children are also included in the fact table. Numerous data were also taken from PERSON so as to avoid having to constantly combine two marts.

5.2 Physical implementation

As I have already said, we decided on Microsoft products for the database, DSS services and query tool. The main reason was that the software was already available to the CBS. We do not believe in departing from existing standards without a very good reason.

5.2.1 Custom software for loading data

Loading the data warehouse was more problematic. Firstly, no suitable Microsoft tool was available. Secondly, the search for commercially available loading software was frustrated because the many tools that claimed to support data marts did nothing of the sort. As our timetable was under threat, we decided to write some custom software using Microsoft's Visual Basic. After the pilot study, the choice was easy: the loading software was ready, and it seemed reasonable to develop this software further for production purposes.

5.3 Management

The entire project has been transferred to its owner, the Population Division. This means that, although we (IT and Applications Development) continue to do research and provide support, it is their data warehouse and they are responsible for it.

5.3.1 Contents management

A locally based statistician is responsible for managing the contents. Although he has acquired the knowledge necessary for making cubes and defining and adding users and roles, he remains first and foremost a statistician. This ensures that the data warehouse does not degenerate into a "mere" technical tour de force, but remains a working statistical tool.

5.3.2 Technical management

Management of the software and data model has been transferred to the local computer experts, who are answerable to the contents manager.


6. CONCLUSION

Data marts can offer statistical offices a solution for the production of statistics, and may replace existing methods as more and more data come to form an integrated data warehouse. It has been firmly established that significant gains are to be had in checking, correction and analysis.


SISSIEI - THE STATISTICAL INFORMATION SYSTEM ON ENTERPRISES AND INSTITUTIONS

Enrico Giovannini
Central Director for statistics on enterprises and institutions
ISTAT
via C. Balbo, 16
IT-00184 Roma
[email protected]

and

Alberto Sorce
Responsible for the co-ordination of information systems of the Central Directorate for statistics on enterprises and institutions
ISTAT
via C. Balbo, 16
IT-00184 Roma
[email protected]

1. Introduction

The organisation and activities of European National Statistical Institutes (NSIs) are being deeply changed by several factors, such as the evolution of European rules in the field of statistics, the growing information needs of users, the need to reduce the statistical burden on respondents and the continuous innovation in information technology. Nowadays, more and more often, the services of NSIs are supplied following the "information system" pattern, which implies a radical change in the approach of statisticians and affects the organisation and methodology adopted.

Owing to this "system" approach, which deeply affects the operation and services of statistical institutes, there are no "ready-made" solutions to meet the different needs and situations. The efforts of well-known international experts to impose a rational order are useful from a theoretical point of view, but are difficult to implement. Moreover, several works are based on the viewpoint of information technology experts, whereas statistical information systems are more complex, not least because statistics follow more specific "protocols" (e.g. the availability of meta-data to ensure better quality statistics).

Thus, it can be useful to analyse the actual experiences of NSIs which are re-arranging their activities along these lines. In particular, this paper discusses SISSIEI, the statistical information system on enterprises and institutions developed by the Central Directorate for Institutions and Enterprises Statistics (DCII) of the Italian Statistical Institute (ISTAT). It is a multidimensional structure covering the statistics produced by the DCII on agricultural units, private enterprises, and public and private institutions.

This System is being constructed as an element unifying the activities carried out for statistics on enterprises and institutions, and it is a tool for producing statistics more efficiently and for rationalising the information flows collected from statistical units. It implies a real "cultural" change in how surveys are carried out. In fact, as for national accounts, this System constructs a framework in which all the surveying and processing activities (ex-ante and ex-post) should be set, coded and adequately supported1.

The first paragraph of this paper discusses the role of Community regulations in the field of statistics in implementing information systems; the guidelines of the ISTAT statistical information system on enterprises and institutions are then examined. The third paragraph discusses the general approach to the construction of a statistical data warehouse, paragraphs four and five illustrate ISTAT's data warehouses for structural statistics on enterprises and for external trade statistics, and concluding remarks close the document.

1 See Egidi and Giovannini (1998) on the characteristics of statistical information systems.


2. Community regulations and implementation of information systems

Over the past years, Community regulations in the field of statistics, in particular economic statistics, have laid the foundations of an information system, and Member States have been asked to introduce remarkable changes in their existing organisations and adopted methodologies. The regulations on statistical units, classification of economic activities, registers of units, structural statistics, short-term statistics, national accounts and specific sectors (tourism, transport, etc.) are establishing harmonised concepts, definitions and classifications, which are the building blocks of a complex system of surveys and statistical processing of different data.

At the same time, national statistical systems must achieve the highest possible level of efficiency, given the need to increase the supply of statistical information while reducing the statistical burden on respondents; meanwhile, Community quality requirements are leading to the implementation of more advanced statistical techniques.

In Italy, the introduction of Community regulations was welcomed as an opportunity to develop the statistical information system, rather than as a limitation. In addition, ISTAT benefited from larger government transfers, which supported the innovation of statistics on enterprises, involving: extensive use of administrative information for statistical purposes; new methodologies in the different stages of surveys; optimisation of the general organisation; new IT systems and related concepts, from mainframe architecture to distributed systems; and user participation (usually through trade associations) in the definition of products and of strategies to arouse the interest of respondents.

The plan for the development of the statistical information system on enterprises and institutions has been partly implemented (60%) and will be completed within two years, when all regulations will have been introduced.

Re-structuring statistics along "system" lines brings several advantages. The methodological design was followed by a re-definition of the organisation of the different "Services" producing statistics. In particular, in 1997, the Central Directorate for Institutions and Enterprises Statistics (see fig. 1) was sub-divided into three macro areas: structural statistics on enterprises and institutions; short-term statistics on enterprises; and economic censuses and statistical registers.

The area of structural statistics has been further divided into three "Services": agricultural statistics, structural statistics on industrial and service enterprises, and statistics on private and public institutions. Short-term statistics have been divided into four "Services": statistics on prices, statistics on external trade, short-term statistics on enterprises' activity, and short-term statistics on employment and labour cost. The last area deals with the creation and updating of statistical registers (on agricultural holdings, enterprises, public and private institutions) and with censuses. These three macro-areas are flanked by units co-ordinating the implementation of information systems, carrying out research in the fields of economics and methodology, and taking care of publishing and organisation.

The adoption of the "process" approach (rather than a "sector" approach) in the Central Directorate organisation focussed attention on improving the different phases of data surveying and processing, producing a more effective use of human resources and more timely data releases. Moreover, the current organisation into areas follows the pattern suggested by Community regulations, making it easier for the specific Services to meet Community requirements and to co-ordinate relations with Eurostat2.

A "process" based organisation does not show the diverse dimensions of statistical data. Users are more frequently asking for integrated information overviews (such as labour market) or data on a specific economic sectors and these data can only be provided through a global reading of short-term and structural data. Thus a "process" based organisation should be supported by specific tools allowing cross-sectional reading of data, from a "matrix" viewpoint for processes/products.

The Statistical Information System on Enterprises and Institutions (SISSIEI), discussed in the following paragraph, meets this need: from each data production process it integrates the processed information at micro-data level. It should be underlined that the "system" approach described here with reference to "basic" statistics follows the same pattern as national accounts, which is the best-developed system of statistical information available. National accounts integrate the available sources, in compliance with harmonised definitions and classifications, to provide information whose informational value added is greater than that of each single source.

For example, the regulation concerning structural statistics on enterprises establishes that estimates should be based on the integration of different sources and that Member States should transmit consistent evaluations, based on definitions and classifications compliant with national accounts. In other words, data integration, which used to be performed only within national accounts, is now a duty of the suppliers of "basic statistical survey data", who should construct an "intermediate system" different both from single surveys and from national accounts.

On the other hand, implementing a statistical information system upstream of the national accounts "level" affects the relations between the latter and "basic" statistics and the possibility of conducting certain economic analyses. With a "basic" statistical information system, national accounts can draw on information harmonised and integrated with the definitions and classifications of the European System of Accounts (ESA 95), and differences between statistics on enterprises and national accounts can be more easily detected and interpreted3, so that the micro and macro connections characterising economic phenomena can be better analysed.

3. An outline of the statistical information system on enterprises and institutions

Thus, the reasons for such a radical change should be clear. Moreover, it should be pointed out that the Italian economic system is characterised by a large number of small enterprises, which increases the statistical burden both on respondents and on the National Statistical Institute: a large number of units is needed to produce the data required by Community regulations, and data processing is very complex. For these reasons, the development of an integrated information system is part of a larger strategy aiming at the co-ordination of the production of statistical data.

2 Eurostat Directorate for business statistics (and few statistical institutes) has been re-structured adopting a “regulation” viewpoint and relations between national and European experts have greatly improved, as well as the effectiveness of working-groups.

3 Underground economy is an example; it represents one of the main differences between survey findings and national accounts.


Following UN (1999), "The system approach is a general human approach to describe, analyse, and control complex phenomena. Some basic propositions of the system approach are:

- a complex phenomenon can be conceptualised as a system, a sort of imperceivable system, since it cannot be fully understood by a single mental act;
- a system consists of parts; a part of a system is in itself another system, a subsystem of the former system;
- any system, even the whole phenomenon first considered, is a part of a wider system, a super-system or environment, of the former system;
- the parts of a system are related to each other, and to the system as a whole, and the system is related to its parts as well as to other systems in its environment".

In particular, the Statistical Information System on Enterprises and Institutions (SISSIEI) is a super-system which makes it possible to integrate all available statistical information on a single agricultural holding, enterprise or institution, regardless of the characteristics of the specific source (statistical survey, administrative data, etc.). The System is designed to permit the comparison of data on the same variable from different surveys (monthly, annual, etc.) for each statistical unit; it also makes it possible to analyse the evolution of a unit in terms of mergers and acquisitions, to check data quality, to prepare products for dissemination, to conduct micro-econometric analysis, etc., all within the same information system, with a clear reduction in costs and a large improvement in the efficiency of statistical production. In addition, SISSIEI gives internal users (statisticians) access to generalised tools for conducting surveys, such as software for sample analysis, for managing questionnaires and contacts with enterprises by post, fax or e-mail, for checking data, etc.

From an architectural point of view, the System consists of a few Relational Data Base Management Systems (RDBMS) distributed over a metropolitan network and has been developed mainly using SAS and Oracle. It can be accessed over the Intranet from all ISTAT offices, under strict, centralised user control to maintain the confidentiality of the data.

Chart 1 illustrates the structure of the System, with a description of the individual databases on which SISSIEI is based. The System is divided into two areas: the first for agricultural holdings and industrial and service enterprises, the second for public and private institutions. Each subsystem is based on the related legal-economic units, in compliance with Community regulation n. 2186/93. In particular, the system on enterprises (SISSI) has ASAIA (Statistical Register of Agricultural Holdings and Enterprises) and ASIA (Statistical Register of Active Enterprises in the industrial and service sectors); for institutions, ASIP1 (Statistical Register of Public Institutions) and ASIP2 (Statistical Register of Private Institutions) are available.

The creation and updating of legal-economic unit registers has been largely supported by the development of administrative sources, statistical studies and information systems. In the past, registers were updated in the interval between general censuses using the few available data on the surveyed units. In the second half of the 1990s, the statistical register of active enterprises (ASIA) could be implemented because administrative registers had been developed. ASIA is based on the continuous updating of data1; thus, a low-cost "yearly census" can be carried out, as required by Community regulation 2186/93, and the quality of data is definitely higher than in "traditional censuses".

The approach adopted to construct ASIA was also used to implement the registers of agricultural holdings and of public and private institutions. The following registers are currently available:

- ASIA, including about 3,500,000 industrial and service enterprises;
- ASIP1, including about 13,000 public institutions;
- the first version of ASIP2, including about 400,000 units that should be private institutions; ASIP2 will be checked after January 2000, through a specific survey;
- the first version of ASAIA, including 3,000,000 agricultural holdings, which is currently being checked as part of the preliminary activities for the year 2000 agricultural census.

The registers provide the codes (taxpayer's code, chamber of commerce code, social protection code, etc.) used to link all other statistical information from different surveys. Each part of the System integrates the information present in the register with information from other sources (surveys or administrative data) referring to specific economic phenomena; each sub-system can be related to the other systems using specific navigation tools linking information held in physically different databases (though not logically different ones).
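As a minimal sketch of such a link, with hypothetical table and column names (the real registers and surveys are of course far richer):

    -- A slice of the register and of one survey, joined on a shared code
    CREATE TABLE asia_register (
        taxpayer_code  VARCHAR(16) PRIMARY KEY,
        nace_code      VARCHAR(5),     -- economic activity classification
        size_class     INTEGER
    );

    CREATE TABLE survey_balance_sheets (
        taxpayer_code  VARCHAR(16),
        ref_year       INTEGER,
        turnover       DECIMAL(14,2)
    );

    -- The common code links survey data to register attributes
    SELECT r.nace_code, r.size_class, s.turnover
    FROM survey_balance_sheets s
    JOIN asia_register r ON r.taxpayer_code = s.taxpayer_code
    WHERE s.ref_year = 1998;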

In particular, the sub-system on industrial and service enterprises, based on ASIA, integrates the following surveys over the period 1989-992:

- annual survey on balance sheets of large enterprises (70,000 units per year);
- annual survey on balance sheets of small enterprises (50,000 units per year);
- annual survey on preliminary estimates of balance sheets of very large enterprises (8,000 units per year);
- occasional surveys on technological innovation of industrial enterprises (5,000 units per wave);
- occasional survey on technological innovation in service enterprises (6,000 units per wave);
- occasional survey on labour cost (12,000 units);
- annual survey on scientific research (2,000 units per year);
- occasional multipurpose survey (300,000 units);
- monthly external trade statistics (Intrastat and Extrastat, about 300,000 units per month);
- monthly survey on orders and turnover of industrial enterprises (14,000 units per month);
- monthly survey on retail sales (6,000 units per month);
- monthly survey on employment and labour cost in large enterprises (1,000 units per month).

1 The register was constructed by integrating data from the archives of the Ministry of Finance, INPS (National Institute of Social Security), INAIL (National Institute for Industrial Accidents), the Chambers of Commerce, the National Electricity Board and telephone service providers. These data are then integrated with information from ISTAT surveys.

2 For external trade statistics the period examined is 1991-99; for other monthly statistics, 1996-99; for PRODCOM, 1996-99. The system has been designed to accept data expressed either in Lire or in Euro and is also capable of converting data between currencies.


Figure 1 - The Statistical information system on enterprises and institutions

[Diagram legend: solid line = implemented area; dashed line = area being implemented; dotted line = area being designed]

Within a few months the System will be enlarged to cover other surveys and other sectors (e.g. agriculture). In addition, the statistical system on the Public Administration, based on the recently developed register, will be implemented. The extreme flexibility of the System permits the integration of information of different kinds and origins; in addition, the performance of the System is very good in terms of efficiency: users can run very complex searches on the database very quickly.

SISSIEI is not only a network of databases. It is an instrument to support many phases of statistical production: planning new surveys, in terms of variables and sample structure; checking and correcting collected data; quality analysis; dissemination of results, etc.

As shown in chart 2, before a new survey begins, general software for survey design makes it possible to simulate different strategies in terms of costs and statistical burden. For example, by checking whether a specific enterprise is already covered by other surveys, it is possible to define the sample of the new survey so as to exclude that enterprise, thereby reducing the statistical burden (co-ordinated sample strategy)3.

3 In particular, to select enterprises for a sample structure, a Data Warehouse (DW) was implemented with information on active enterprises (about 3,500,000) and their local units. This approach is more efficient and timely than the previous one, based on standard SAS procedures applied to sequential files.
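A sketch of a co-ordinated sample selection along these lines, reusing the hypothetical register table above and assuming a (hypothetical) table of units already sampled by current surveys:

    -- Units already burdened by a running survey
    CREATE TABLE current_samples (
        survey_id      VARCHAR(10),
        taxpayer_code  VARCHAR(16)
    );

    -- Candidate units for the new survey: in scope, but not yet sampled
    SELECT r.taxpayer_code
    FROM asia_register r
    WHERE r.size_class <= 3            -- hypothetical scope condition
      AND NOT EXISTS (
          SELECT 1
          FROM current_samples c
          WHERE c.taxpayer_code = r.taxpayer_code
      );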


Figure 2 – General software and tools used in the Statistical information system on enterprises and institutions

Once the survey is designed, questionnaires have to be posted to the enterprises. Because the register is usually based on information referring to the previous 18 months, the addresses of enterprises may not be up to date. For this reason, SISSIEI offers users a preliminary version of the ASIA register, in which the identification details are updated using all the information collected by current surveys, in particular short-term surveys, and by administrative sources related to those details. Finally, questionnaires are sent using general procedures for contacting enterprises by post, fax or e-mail.

During processing, SISSIEI receives raw, corrected and final data. Every time a certain phase of the quality checks is completed, the data are transmitted to the System, which stores them with a code denoting their "quality level". In addition, during the quality checks for a certain survey it is possible to use data from other surveys: for example, when checking the turnover reported by a certain enterprise in the annual survey on balance sheets, it is possible to compare the raw value with the annual aggregation of the monthly data collected for the same statistical unit.
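A sketch of such a cross-survey check, reusing the hypothetical survey table from the earlier sketch and assuming, purely for illustration, a 10% tolerance:

    CREATE TABLE survey_monthly_turnover (
        taxpayer_code  VARCHAR(16),
        ref_year       INTEGER,
        ref_month      INTEGER,
        turnover       DECIMAL(14,2)
    );

    -- Flag enterprises whose annual figure strays more than 10% from the
    -- sum of their twelve monthly figures
    SELECT a.taxpayer_code,
           a.turnover      AS annual_value,
           SUM(m.turnover) AS monthly_sum
    FROM survey_balance_sheets a
    JOIN survey_monthly_turnover m
      ON m.taxpayer_code = a.taxpayer_code
     AND m.ref_year = a.ref_year
    GROUP BY a.taxpayer_code, a.turnover
    HAVING ABS(a.turnover - SUM(m.turnover)) > 0.1 * a.turnover;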


For these purposes, SISSIEI has navigators for statistical data and a general data dictionary. The System permits on-line access to micro data using embedded SQL commands. Through an OLAP4 (On-Line Analytical Processing) protocol it is also possible to display in a screen window all the index numbers and figures relating to groups of enterprises from the same economic sector, of the same size and in the same geographical location.

The System also calculates quality measures, sending this information to SIDI, the Information System for the Documentation of Surveys implemented by ISTAT, and to other meta-data information systems.

Eventually, when the final data have been calculated, SISSIEI serves as the base for the construction of Data Warehouses (DW) for dissemination purposes. As described in the next section, during 1998 ISTAT produced its first DW for the dissemination of the results of the 1996 intermediate economic census. The success of this new instrument has been widely recognised by users: the database was made available on the Internet (address: http://cens.istat.it), free of charge, and in nine months about 2,500 users visited the site, carrying out about 200,000 data extractions. From that database, in a few months, about 20 CD-ROMs and 120 books were prepared, without any additional editorial intervention.

The general structure of this System is an implementation of the chart in figure 3, taken from UN (1999). Several of the functions discussed there in the paragraph "a vision for the future" have already been implemented by SISSIEI. There are a number of survey processing systems, a corporate data warehouse (or a set of them) and a number of analytical processing systems; the DW contains "areas" for raw data, meta data and final observation registers; and all the different steps in conducting surveys (survey planning, survey operation, survey evaluation) can be managed by the System.

4 OLAP offers a fast way of looking at and analysing data from any perspective without having to specify in advance the perspective and level of detail required. As a result, OLAP is a significant advance over the tools and techniques developed at earlier stages of business intelligence systems. OLAP is particularly well suited to serving the general inquiry needs of knowledge workers. However, the level of analysis required may demand that more powerful modelling tools be applied to investigate the strength and likely causes of relationships. In those cases, the OLAP tools should be integrated with other business intelligence systems and with analytic and scientific tools. To get the best results, OLAP applications should be based on the solid foundations of data warehousing. In applications where data volumes are high and users have specialised areas of interest, efficiencies can be gained through the use of data marts for the most commonly used applications and through facilities that enable knowledge workers to reach through to warehouse data.


Figure 3. An information systems architecture for statistical organisations.

4. The data warehouse environment

As described above, the development of Data Warehouses (DW) is a key element of the new "vision" of statistical production. The DW is an information system in which data are organised and structured for easy user access and to support decision-making processes. Two kinds of system are enabled by the DW: DSS (Decision Support System) and EIS (Executive/Enterprise Information System).

The former is used to solve specific problems, while the latter supplies a continuous data flow not tied to specific problems.

The DW is an OLAP system, differing from OLTP (On-Line Transaction Processing) systems even though its data come from the latter. OLAP systems are subject-oriented; they are integrated, historical and permanent. A DW does not contain static analytical data, as OLTP systems do, but flexible data; moreover, OLAP data are historical rather than current, as they are used in analyses and are not affected by current transactions.

A DW is always separated from its operational environment, even though it includes all the data from that environment. DW data are not to be modified: they are loaded at the outset and then accessed, but they are never updated as in OLTP systems. Before being loaded into the DW, data are integrated following different strategies: by names, measures of variables, decoding structures, attributes, etc.


The data source for a decision-making system (such as a DW) is an operational system, though the former is not a mere copy of the latter; the two systems thus have reduced redundancy. Data in a decision-making system are filtered and time-stamped, summaries are included, and the data are physically and substantially changed before being loaded into the DW. In particular, besides detailed data, data are summarised at two different aggregation levels: the first level (first-level data mart) fixes the time unit, and at the second level (final data mart) only frequently accessed summarised data are permanently stored. Thus, the more frequently data are accessed, the higher the summary level: fewer data need to be stored, and data access is faster and more efficient.
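A hedged sketch of the two levels, reusing the hypothetical tables introduced earlier:

    -- First-level data mart: micro data summarised to the chosen time unit
    CREATE TABLE mart_turnover_month AS
    SELECT m.ref_year, m.ref_month, r.nace_code,
           SUM(m.turnover) AS turnover
    FROM survey_monthly_turnover m
    JOIN asia_register r ON r.taxpayer_code = m.taxpayer_code
    GROUP BY m.ref_year, m.ref_month, r.nace_code;

    -- Final data mart: only the frequently accessed, higher-level summary
    CREATE TABLE mart_turnover_year AS
    SELECT ref_year, nace_code, SUM(turnover) AS turnover
    FROM mart_turnover_month
    GROUP BY ref_year, nace_code;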

SAS/Data Warehouse is the software used for our applications. The DW includes "subjects", "operational data", "detailed tables", "data marts" and "information marts". A "subject" is a set of data concerning a specific issue. In particular, in SAS/Warehouse Administrator each subject may include several components (such as SAS datasets). Operational data (ODD) are the input used to load the DW and can be extracted from flat files and stored in SAS datasets. The "detailed tables" include data from the ODD at the lowest level of detail. On these tables, aggregations are made to construct "data marts", which are sub-sets of a DW referring to a specific set of information: a data mart is a logical section of the DW, and it includes the user's final aggregations.

Each data mart is assumed to be a complete DW and can be constructed using the results of queries or analyses on users. An “information mart” is a catalogue including (or displaying) information on coded variables.

Two main approaches can be used to develop a DW environment. The first is based on the creation of a central DW, using data from legacy systems and other sources. This central warehouse can then be used to load departmental DWs or local data marts. The second approach is based on the creation of independent subject-area data marts, each loaded directly from the legacy systems and other data sources.

The central DW approach can start with a simple DW, expanded over time to meet growing user demand until it becomes an environment containing connected warehouse systems. In a simple warehouse environment there are three areas which need to be managed: the extraction and transformation of the data from operational systems; the warehouse database; and the data exploitation tools.

It is also necessary to manage the network that provides access to users. Usually there are at least three repositories for meta data and other related information: one covering the data structures and transformation rules for the extraction of data from the legacy systems; one for the warehouse database; and one or more for the exploitation tools, depending on how many different tools are being used. These repositories need to be managed, both individually and as a whole. Data in the DW database environment should be managed as well.

The complexity of this task depends on the chosen database, but it includes backup, recovery, re-organisation, archiving, performance monitoring and tuning. Departmental or local sub-sets of data (data marts) are created to enhance the performance of user queries and to reduce dependency on the DW. This additional level of data increases the complexity of environment management: it adds another level of meta data and possibly another repository; it requires control and management of the distribution of data to the data marts; and, unless administration of the data mart is completely devolved to the local level, it also requires data management of the data mart database. The situation becomes even more difficult if the environment evolves further through the creation of multiple warehouses. In some such cases, the complexities of administration are overwhelming.


The independent data mart route to data warehousing currently appears to be the most popular one, and it is easily understood. The creation of a single subject-oriented data mart to solve a particular problem represents a simple solution. The administration of such an environment is relatively straightforward and can be easily contained. The three areas to be managed are: extraction of data from the sources, and transformation into the correct data structures for the data mart database; the data mart database itself; and the exploitation tools.

Since this environment does not usually contain such large volumes of data, or the interrelationships found in central warehouses, it is easier to manage. If such a simple data mart solution were the only warehouse implementation in the organisation, the task of the administrator would be relatively easy. However, this approach does not usually stop with one data mart and, once additional data marts are added, the situation becomes far more complicated. The task of bringing a number of separate data marts into a single warehouse environment is extremely challenging. Usually, each individual data mart has been developed independently. Such data marts have the potential to become the legacy systems of the data warehouse era: as such, they carry the problems of data currency and of inconsistencies in data definitions that the DW was designed to solve. This unattractive situation is avoided only where development is controlled by a single systems administration architecture.

A DW is likely to contain very large volumes of data, not all of which are relevant to all users. Working through these volumes of unrelated data can be inefficient and time-consuming. To address this situation, the institute-wide data warehouse can be subdivided into specific areas of interest; the data structures that contain information of particular interest are referred to as data marts. Knowledge workers, such as business analysts, generally go through the decision-making process in three phases: discovery, which involves finding key problem areas; analysis and confirmation, which involves proving the findings and describing them in more detail; and presentation, which is concerned with delivering the findings in a suitable manner to other decision-makers.

Each of the three phases requires different software tools, offering different capabilities and focus.

Moreover, many data exploitation tools create their own environments, each with its own repository. Such a repository holds the information required to exploit the data for satisfying the queries generated by the tool. If the DW is to be centrally administered, these environments must be incorporated into the management structure as a whole. Even where responsibility for administration of the data exploitation tools is devolved to the local user level, a link between the central DW administration system and the distributed tool environments is still required. This link is necessary to ensure that the tool environments affected by central changes can be identified, and the changes that affect the tools can be implemented.

Organising all customer data into an integrated warehouse environment is one of the biggest challenges faced by information experts. However, an equally important challenge is to create an integrated environment informing on data availability and exploitation possibilities. Without data about data (meta data), the organisation will, at best, fail to get the full return on its investment in data warehousing. At worst, there is the risk that, as the amount of data in the data warehousing infrastructure rises exponentially, business users may give up because the information retrieval process is excessively time-consuming. Other trends are also contributing to the need for a tighter inventory of information resources. For example: changes in the business environment create constantly evolving business definitions; data marts are proliferating, often without central planning; business units and teams create their own terms for similar data elements; transnational and multinational ventures create language difficulties; increasing staff turnover means a constant outflow of undocumented knowledge; and valuable but unstructured external information (such as Web-based information) is adding to data volumes.

In a DW environment there is a requirement not only to manage the interchange of information between repositories of meta data, but also to manage the meta data itself as part of the warehouse data. Typically the warehouse contains data that are old, and data structures are likely to have changed. Inevitably, when a user begins accessing the information within a warehouse, the following questions arise: what information is available in the warehouse? What do the definitions mean (for example, exactly how is a customer defined)? Are data current and reliable?

At the NSI (or Directorate) level, meta data are designed to answer the above questions; it should be stressed that meta data will be evaluated at this level. Answers to these questions can usually be found early in the life of a DW. However, the role of meta data becomes more complex as the information grows and users' demands increase. Consequently, organisations should structure a complete environment for meta data management and approach the issue methodically from day one, in order to avoid problems.

As noted above, many data exploitation tools create their own environments, each with its own repository. The important point is that these tools are not forgotten in the warehouse administration context. Their needs must be included in the overall warehouse management system, which requires the creation of a register of tools, ideally on a central repository that holds a definition of the entire DW environment.

The NSI definitions and rules that are associated with the DW have to be managed and administered. These rules and definitions must be agreed across the Institutes and then kept up to date with any changes that may occur. These definitions, together with their relationships to the actual data structures of the warehouse, should also be accessible to users.

The starting point should be our expectations of the DW; in other words, the correct construction of this instrument can be carried out only after its purpose has been clearly defined. In this case, data on industry and services are the starting point. These data can be aggregated by geographical division, sales classes or number of employees; moreover, they can be combined to produce tables highlighting specific characteristics. A DW is a "mechanism" to access information: the exact amount of information should be known, together with its relations and links. The selection of variables is made by users. Another parameter to be considered is the possibility of creating specialised data marts for fast queries, though other characteristics may be affected, such as the timeliness of information (shared archives may not be aligned).

5. A data-warehouse for the intermediate economic census


This section discusses the characteristics of the ISTAT DW used to disseminate data from the intermediate census of industrial and service enterprises, with reference date 31 December 1996. The 1998 intermediate census was based on ASIA (the Statistical Register of Active Enterprises), constructed using administrative data from other sources. The field survey was carried out from January to September 1998: data on about 450,000 enterprises were checked, out of the 3,500,000 enterprises included in the census, and the census findings (referred to the whole population) were disseminated in December 1998.

The first ISTAT DW was implemented with the support of the SAS Institute. Data dissemination was mainly performed through the DW, which could be accessed free of charge over the Internet. Users could navigate through the information, selecting only the information best fulfilling their needs. They could construct tables, download them onto their PCs and carry out analyses by sector, geographical area, size and historical series, analyses which for previous censuses could be provided only on request. In 1999, about 2,500 users extracted about 300,000 statistical tables, on average 130 extractions per user and 800 a day.

The DW took about six months to design and implement. The following products were used: SAS/Warehouse Administrator, to construct the data warehouse, and SAS/Desktop Application Builder, to select aggregations and for the navigation software. The DW needs about 25 GB of disk space, as it includes not only the 1996 intermediate census findings, but also data from the 1971, 1981 and 1991 general enterprise censuses. More than 100 software programmes were created, amounting to about 14,000 lines of code.

The SAS/Warehouse Administrator was used to manage the DW construction, following the steps listed below: data loading from ASCII files on Unix platforms; identification of the observation field; selection of the examined enterprises; table loading; harmonisation of data; identification of operational data; creation of subjects; creation of data marts.

The 1971, 1981 and 1991 (enterprises only) census data were stored in three different files, one for each year; the 1996 data were stored in two files, one for enterprises and one for local units. The following tables were loaded: tables to decode data marts; tables to decode ODD; and tables for dimensions.

The classifications of economic activity were related to the reference years using the tables to decode data marts: e.g., the table "ate81_71" was used to relate the 1981 classification to the 1971 classification. Tables to decode ODD were used to relate information to the reference date of the latest available ISTAT classifications (e.g., the 1991 classification was used for economic activities, as it was not changed in 1996).

The following activities were carried out to harmonise data:
- the re-classification of economic activities to relate them to currently used classifications;
- the geographical re-classification:
o for municipalities where changes occurred;
o for specific municipalities (capitals of regions/provinces, major urban areas);
o due to new geographical aggregations not related to the administrative geographical division (local employment systems, industrial districts);
- the re-classification of the previously adopted legal status to relate it to the current classification;
- the harmonisation of coding for handicraft, since the definition changed from one census to the following;
- the harmonisation of coding for the diffusion field (local unit of a single-site enterprise, enterprise with several local units, or non-main site);
- the use of established rules to allocate the following variables for 1996: "geographical distribution of enterprise" (municipality, province, region, country), "distribution of local units" (geographical location of local units with reference to the main site) and "employee classes".

Five operational data sets were constructed (for 1971, 1981, 1991, 1996 local units and 1996 enterprises). Two further groups of operational data were represented by the tables for dimensions, used to show the modes of coded variables, and the decoding tables. The star schema is the logical model used to store data: each table is related to several other tables containing descriptive data, such as the dimensions.
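A minimal sketch of the star schema idea may clarify this; SQLite via Python is an illustrative choice only (the actual system used SAS datasets), and all table and column names are hypothetical. The fact table holds the measures and references each dimension by code; the dimension tables carry the descriptive data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension tables: descriptive data referenced by the fact table.
cur.execute("CREATE TABLE dim_region (region_code TEXT PRIMARY KEY, region_name TEXT)")
cur.execute("CREATE TABLE dim_sector (sector_code TEXT PRIMARY KEY, sector_name TEXT)")
cur.execute("""CREATE TABLE fact_enterprises (
                 year INTEGER, region_code TEXT, sector_code TEXT,
                 enterprises INTEGER, employees INTEGER)""")

cur.executemany("INSERT INTO dim_region VALUES (?, ?)",
                [("01", "North"), ("02", "South")])
cur.executemany("INSERT INTO dim_sector VALUES (?, ?)",
                [("D", "Manufacturing"), ("G", "Trade")])
cur.executemany("INSERT INTO fact_enterprises VALUES (?, ?, ?, ?, ?)",
                [(1996, "01", "D", 500, 12000), (1996, "02", "G", 300, 4000)])

# A typical query joins the fact table to a dimension and aggregates.
for row in cur.execute("""
        SELECT r.region_name, SUM(f.enterprises), SUM(f.employees)
        FROM fact_enterprises f JOIN dim_region r USING (region_code)
        GROUP BY r.region_name"""):
    print(row)
```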

The identified subjects are enterprises and local units. Each subject includes four detailed tables, one for each census year. Detailed tables include the classification and analysis variables defined in the loading and harmonisation phase.

The detailed tables for the two subjects were used to construct the data marts of aggregated data used by the navigation and printing software. Data marts are sets of classification and analysis variables necessary to study an event. Five data marts were created:
1. 1971-81-91-96 comparisons;
2. 1981-91-96 comparisons;
3. 1991-96 comparisons;
4. 1996 data;
5. 1996 statistical-economic indicators.

The following processes were used to create the data marts: identification of the observation field; merging of detailed data (for each year) of each subject (one dataset for enterprises and one for local units); aggregation of data; allocation of formats and descriptions to variables; creation of the first-level data marts; creation of the post-processing for the final data marts used by the navigation module.

Depending on the number and typology of classification variables, some of the results of the aggregations required by the navigation software were stored in advance. Due to the amount of time required to process this large amount of data, aggregations could not be computed on the fly when requested by users. For example, the 1996 data mart would have required more than one million aggregations if all 20 classifications had been linked.
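This order of magnitude is consistent with a simple count, on the assumption that an aggregation type is defined by the subset of classification variables it retains:

```latex
2^{20} = 1\,048\,576 \approx 10^{6} \ \text{possible subsets of 20 classification variables.}
```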

Thus it was necessary to select which aggregations should be stored, deriving the other typologies through the "one which is closer" algorithm. The DAB (Desktop Application Builder) module, developed by SAS Institute using the SAS/EIS software, was used. A "metabase" for each data mart was constructed, including the examined classification and analysis variables, starting from the aggregated data matrix in which all the classification variables were summarised. Then the aggregations to be stored were selected (by type): hierarchical types were selected first (e.g. geographical division, region, province), followed by some constant variables (e.g. handicraft).
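The idea behind such an algorithm can be sketched as follows; this is an assumption about its principle (the actual SAS/EIS implementation is not documented here), in Python with hypothetical names. A request for a combination of classification variables that was not stored is answered from the smallest stored aggregation whose variables include the requested ones, aggregating the extra variables away:

```python
# Stored aggregation types: each is the set of classification variables it keeps.
stored = [
    {"region", "sector", "size_class"},
    {"region", "sector"},
    {"region"},
]

def closest_stored(requested: set) -> set:
    """Return the smallest stored aggregation able to answer the request,
    i.e. the one keeping the fewest extra variables beyond those requested."""
    candidates = [s for s in stored if requested <= s]
    if not candidates:
        raise LookupError("no stored aggregation covers this request")
    return min(candidates, key=len)

# A request for (sector) alone is served from the (region, sector) mart,
# summing over the region variable that the user did not ask for.
print(closest_stored({"sector"}))   # -> {'region', 'sector'}
print(closest_stored({"region"}))   # -> {'region'}
```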


In 1999, the second phase of the intermediate census was carried out, with reference date 31 December 1997. About 300,000 enterprises were interviewed using a questionnaire designed to elicit information on various aspects of enterprise activities and organisation (membership of enterprise groups, shares of domestic and foreign markets, investment in innovative technologies, labour organisation, etc.). The collected data are currently being corrected and will be disseminated early in 2000, together with the updated release of the ASIA register referred to 1997.

The design of the new DW for these data has begun. This project, named DIONISIO (Data warehouse Internet Osservatorio Nazionale Industria e Servizi Informazioni Organizzate - National Information Centre for Information on Industrial and Service Enterprises), aims at disseminating integrated census data, unlike the previous structural statistics on industrial and service enterprises. The project raises several problems: the variables to be disseminated come from different surveys; there is a large number of variables; and different aggregations are used, such as geographical area, size, sector of economic activity, etc. Moreover, unlike the identifying characteristics of the units surveyed in 1996-97, these data are confidential; in other words, specific measures must be taken to avoid breaching confidentiality.

Specific application software was developed on PCs using SAS modules, while data will be loaded and maintained on a Unix platform. The personal computer was chosen to ensure a more efficient interaction with developers, while the adoption of a Unix platform is consistent with the Institute's standards.

The application should take into account the following areas: the statistical problems outlined by the preliminary studies; the security of information; the confidentiality of the information to be disseminated; easy access for users; and the consistency of census-based tables with data from other structural surveys. A test phase at the end of the design, together with user monitoring procedures, is expected to verify the security of information, improve performance and provide suggestions for future applications.

The strategy outlined below will be adopted to disseminate data related to the structural census surveys:
- data in section A of the questionnaire (referring to the basic characteristics of enterprises as of 31 December 1997) will be disseminated using the DW described above;
- data in section B of the questionnaire (referring to structural variables) will be disseminated by constructing a two-version DW: a "large" version for ISTAT in-house users and a "light" version for other users. In this way two goals would be achieved: to disseminate data while safeguarding confidentiality, and to supply complete and exhaustive data to in-house users who can adequately aggregate the basic data.


6. A data warehouse for external trade statistics

The ISTAT DW for external trade was first implemented to construct the about 1,000 tables of the "External trade and international activities of Italian enterprises - 1998" Yearbook, published by ISTAT and the National Institute for Foreign Trade (ICE) in July 1999. The initial focus was on the design and creation of the necessary data marts and info marts and on the implementation of several types of on-line querying and reporting over the Web.

Later on, the project was enhanced to integrate the DW into the production process of monthly data on external trade and to offer easy and efficient access to data over the Internet. In this way, any authorised Intranet user may: access statistical data on the Web server, including the on-line Yearbook tables; access the SAS data marts on the application server to request aggregated data (in specific formats); and access the fact tables and micro data stored in the Oracle DBMS1 for more specific requests.

At present, the DW for external trade statistics has been completed: it contains monthly micro data from 1991 to 1999, referring to 9,000 groups of products, 250 partner countries and 103 provinces of origin (or destination) of goods. From this database a smaller database has been derived, which can be accessed over the Internet, while the full database will be used to prepare automatically all the standard products (CD-ROMs, publications, etc.) and to fulfil specific user requests. Its 300 million records, occupying 80 GB of disk space, rank this DW among the so-called VLDBs (Very Large Databases).

In 1998, the Service for foreign trade statistics, after a previously developed feasibility study, started the project of its new information system. The main target of this project was the migration, by December 1999, of the information system from an old centralised system (MVS, Adabas database, Natural and Cobol procedures) to a new development environment (SAS, with Oracle as DBMS) on a Unix platform. The project originated from the need for "total quality management": it addressed several points, such as staff training and motivation and the thorough verification of the information processing methods, with data quality and user satisfaction as its principal targets.

This DW is fully integrated in SISSIEI through the taxpayer's codes referring to operators on foreign markets (mainly commercial or industrial enterprises). As the surveys (Intrastat and trade with extra-EU countries) have fiscal relevance (the Ministry of Finance is responsible for the data collection), each new operator involved in import/export operations is recognised and entered in the business register (ASIA); in addition, through the taxpayer's code, foreign trade data can be linked with all the other economic data collected by different surveys, in order to check the quality of the data and to conduct economic analyses.

In the external trade DW the main "subjects" are the statistical weight, the invoice amount and the quantity (in kg or other units). The principal "dimensions" are: the flow of transactions, the scheme, the VAT number, the eight-digit combined nomenclature (NC8), the country of origin, the place of origin, the province, the customs office, the transport mode, the delivery terms and the type of transaction. From these principal "dimensions" other, larger classifications can be derived: from the combined nomenclature we obtain aggregated classifications (NACE Rev. 1, groups of commodities, SITC, NST/R, NACE/CLIO, the economic use); from provinces we obtain regions and geographical divisions; from countries we obtain classifications by large geographical and economic areas; and from customs offices we obtain customs areas.

1 This is the database server in the IT framework adopted by ISTAT.


Due to the monthly frequency of surveys, the month, quarter and year are the examined "time dimensions".

The external trade DW is based on a three-tier architecture:
- client level;
- application level;
- database level.

A – Client level

The client level is based on Web technologies, which offer several advantages: the interface is managed through thin clients, with less demanding hardware requirements than desktops, improving reliability and reducing costs and the overhead related to hardware maintenance. Moreover, a large part of the client software based on Web technologies is resident on local servers and loaded only when required; this approach greatly reduces the need to install and manage specific software on the clients, such as DSS (Decision Support System) tools and the middleware related to the database.

The three main advantages of this architecture are:
- independence of platform: the client only needs a browser. It is not necessary to develop clients for different platforms, as the browser interfaces with the other application software;
- a single environment to access applications: because there is no client software, learning and using the system is easier for users. Application software can be accessed through the same environment and the same interface, the browser;
- no permanent connection: in a common client-server architecture, the number of requests to carry out actions and reports could be limited if a high number of users were accessing the system at the same time; in this system communication is performed using the HTTP protocol, which does not require a permanent client-server connection and reduces the typical problems of such applications.

B – Application level

The engine of the querying and reporting application of the DW is embedded in the application level. The principal components are the HTTP server and the application server. The HTTP (or Web) server is the main element of multi-tier architectures based on Web technology. Apache is the HTTP server used: despite being free software, it is robust, high-performance, portable and efficient. It guarantees an acceptable security level, though the development phase is rather "complex"2.

On the application server a querying and reporting application was implemented using two products: SAS/IntrNet and the SAS/SCL library "DWNET". SAS/IntrNet meets different requirements that can be mutually integrated, such as the creation of static and dynamic Web pages. Users may access static HTML pages, created by the webmaster, using the former mode; the latter mode allows users to interact with SAS sessions on the intranet or to navigate through data using the OLAP mode.

The "DWNET" library, implemented by an Italian software house, is used to develop the Web site of the project and allows the SAS/Intrnet developer to change the approach to create dynamic

2 It supports CGI specifications and Server Sit?e Include (SSI) used to include a server-side code in HTML pages: the web server performs a parsing of the document before sending the HTML page requested by a GET order in compliance with the HTTP protocol. In case specific instructions are found, then a suitable interpreter is called and the output from a standard input is replaced by a set of instructions and sent to the browser.

50 9th CEIES Seminar – Innovations in provision and production of statistics

Page 51: GETTING RESEARCH STATISTICS FOR THE MEDICAL Web viewInnovations in provision and production of statistics: the importance of new technologies. Helsinki, Finland, 20–21 January 2000

applications.3 The DWNET library can manage the connection and perform all the client-server applications.

C – Database level

Two approaches could be selected: one based on a Multidimensional Database (MDDB) and one based on a Relational Database. The second solution was adopted, namely a ROLAP (Relational On-Line Analytical Processing) approach. Thus, the basic functions to create statistical hyper-cubes can be implemented; in other words, information can be organised so that the dissemination system is centred on one or more "subjects" (events) and a set of "dimensions", hierarchically structured by level, through which users can navigate.4
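The essence of the ROLAP approach can be illustrated as follows; SQLite via Python and all names are assumptions for illustration (the actual system used Oracle and SAS). The hyper-cube is not materialised: roll-up, drill-down and slicing (see footnote 4) become GROUP BY queries at different levels of a dimension hierarchy over relational tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE trade (year INTEGER, month INTEGER, country TEXT,
                                   area TEXT, invoice_amount REAL)""")
con.executemany("INSERT INTO trade VALUES (?, ?, ?, ?, ?)", [
    (1999, 1, "DE", "EU",     100.0),
    (1999, 1, "FR", "EU",      50.0),
    (1999, 2, "US", "non-EU",  70.0),
])

# Roll-up: aggregate to the highest level of the geographic hierarchy (area).
print(con.execute(
    "SELECT area, SUM(invoice_amount) FROM trade GROUP BY area").fetchall())

# Drill-down: the same measure at a more detailed level (area, then country).
print(con.execute("""SELECT area, country, SUM(invoice_amount)
                     FROM trade GROUP BY area, country""").fetchall())

# Slice: fix one dimension to a single value (January 1999 only).
print(con.execute("""SELECT country, SUM(invoice_amount) FROM trade
                     WHERE year = 1999 AND month = 1
                     GROUP BY country""").fetchall())
```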

All statistics on external trade since 1991 are in the database server: validated micro data, the main aggregates (fact tables), the secondary aggregates (data marts) and metadata. Considering the dimensions of the data for each year and the variables involved, four tables of micro data were created for each year, amounting to about 35 million records per year.

On the application server the following can be found: the secondary aggregates (data marts) and the information marts resulting from SAS processing of the aggregates, needed to perform the querying and reporting of information in the DW. A ROLAP architecture was selected because tests using MOLAP had not been successful. Moreover, the DW meta data are available through the "access to data dictionaries" function and are included in the querying and reporting functions.

7. Concluding remarks

ISTAT is implementing its Statistical Information System on Enterprises and Institutions (SISSIEI). It is a complex System covering all the phases of statistical surveys, adopting the suggestions and recommendations from international organisations to develop a more modern and efficient way to produce statistics.

Despite organisational and technical-scientific difficulties, the System is being implemented quickly, with remarkable positive effects for statistical suppliers and users. During 2000 the System will become fully operational, and this will imply important changes in the dissemination of data.

In this scenario, the strategy based on the construction of a data warehouse is an important element that cannot be left out. This strategy covers both data available in-house and data that can be accessed through the Internet. ISTAT's experience shows that a flexible approach, ready to incorporate hardware and software innovations, is the most suitable one. Between the first DW in 1998 and the one currently being designed, several innovations have been introduced, and field experience plays a central role. Testing on prototype models is the best solution, and very large databases can be implemented using the robust and versatile software currently available.

3 Instead of creating an HTML page within a SAS procedure, the HTML page is created using a common HTML editor (which makes page creation easier) and, using SSI (Server Side Include), the SAS/base code, SAS macros, the SQL interfacing with Oracle and the JavaScript are included in the page. Development time is considerably reduced.

4 The following on-line analytical actions can be performed:
- selection (screening, filtering or dicing), where criteria are imposed to filter data or the levels of a dimension in order to reduce the number of retrieved data;
- pivoting (or rotation), where the orientation of the dimensions of the cube is changed, i.e. exchanging rows and columns or moving one of the row dimensions toward a column;
- roll down (or drill down, or drill through), to navigate between levels of data from the highest aggregation level to the most detailed one;
- roll up (aggregation or consolidation), that is the aggregation of data to the highest levels of the dimension hierarchies;
- slicing, to select the data fulfilling a specific condition on one of the established dimensions.


Constructing an information system such as SISSIEI requires a great training effort, in order to follow the "System"-based approach. A clear and rigid set of rules concerning communications between the different components of the System is required, and changes should be introduced in the current organisation of statistical activities to achieve the expected goal.

References

Egidi V. and Giovannini E. (1998): "Sistemi informativi integrati per l’analisi di fenomeni complessi e multidimensionali", in ISTAT, Proceedings of the IV National Statistical Conference, Roma, 1999.

United Nations (1999) “Information systems architecture for national and international statistical offices: guidelines and recommendations” United Nations Statistical Commission and Economic Commission for Europe: Conference of European statisticians statistical standards and studies – No. 51


New technologies in statistics and requirements for central institutions from a user's perspective

Christos Androvitsaneas
European Central Bank
Directorate General Statistics
Kaiserstr. 29, D-60311 Frankfurt/Main
[email protected]

Despite the number of related initiatives and the enthusiasm observed in the domain of statistics, Internet technologies do not necessarily provide easy answers to existing problems or help to meet all statisticians’ objectives. Data quality, timeliness and efficiency are always essential objectives against which a statistical system has to be assessed. Thus, the use of new technologies is not a sufficient condition for achieving these objectives, as the extent to which they facilitate the provision, production and dissemination of statistics needs to be carefully evaluated, also in terms of integration and efficiency. Moreover, technological advancement and the changing nature of the user community have been affecting the concept of the “end-user”, gradually transforming users of statistics into potential statistical “centres” which would benefit - nowadays more than in any other period in the past - from system integration. These factors have to be taken into account when identifying user requirements or when planning a new framework for data dissemination taking advantage of new technologies.

In this paper these particular aspects concerning the use of new technologies in statistics are highlighted. Related issues which have already been sufficiently discussed in the context of other meetings and conferences are not repeated here and the reader should refer to the corresponding original contributions.1

1. Automation in provision and dissemination of statistics

1.1 New technologies and cost considerations

As technologies converge, as bandwidth opens up, as "anything is possible" moves from hyperbole to reality, governments are stretched in their efforts to be considered nimble followers, let alone leaders, in new ways of doing business. Even if they could overcome the structural, regulatory, and cultural difficulties involved in rapid adoption of new technologies, most governments would still face an even bigger hurdle: their lack of the resources required to acquire and implement complex and sturdy technologies.

Though this paragraph by Tapscott & Agnew (1999) does not refer exclusively to the provision and dissemination of statistics, it depicts in the best possible way the dilemmas which statistical organisations are facing nowadays. Fortunately, statistical organisations do not necessarily have to catch up with state-of-the-art Web technologies or constantly try to fit themselves into the changing technological environment. Room for manoeuvre is given especially by the driving force of users' requirements for statistics, which seem to be much more stable than changes in Internet technology: data quality; timeliness; efficient means of data exchange.

As will be explained below, a mix of traditional EDI methodologies and Web-based dissemination (in the context of a system integration approach) might be the optimal way both of keeping costs relatively low and of responding to the requirements of the traditional and the wider – Web-based – audience.

1 See especially Sarreither & Novak (1999), Podehl (1999) and Statistics Norway (1998).


1.2 The role of EDI

Major considerations in providing and disseminating statistics are timeliness and cost-effective solutions. Automation across all partners involved in a statistical data exchange is a key element in improving timeliness and maximising efficiency.

The central banking community has quite a long history in statistical EDI (Electronic Data Interchange). Nowadays, the exchange of statistical information on paper or by fax between partners is regarded as a non-existent option, even when discussing contingency measures: in the event of a system failure, fall-back solutions foresee the use of alternative secondary systems and processes. Statistical data and metadata have been exchanged between central banks using a unique message format for all economic domains, and full integration has been achieved across several partners’ systems. In recent years especially, an advanced environment has been developed around the use of GESMES/CB.2 It is a common belief among central banks that the advantages of the EDI strategy followed and of the chosen solutions have made it possible to enjoy net benefits from the investments undertaken, already some months after the full introduction of GESMES/CB in statistical data exchanges. High efficiency had been a major concern3 and it has been achieved both in terms of fulfilment of requirements and in terms of the resources required for development and implementation. At the ECB the principle of component reusability has been applied extensively, primarily based on the GESMES/CB layer (a single “loader” from GESMES/CB to the in-house database systems is used). In this context also, when central banks act as client institutions vis-à-vis other major statistical centres, they attribute particular importance to the possibilities for automation and the use of GESMES/CB. The ability to download data and metadata is an essential requirement for importing data from external sources into local systems. Simple “viewing” of data does not present any interest in practice, as “storing” and “further processing” are regarded as the natural next step of data dissemination activities once the data reach the receiving partner.
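The "single loader" principle can be pictured schematically as follows. This Python sketch is not GESMES/CB itself (whose EDIFACT syntax is specified in the BIS/ECB User Guide); the message format and all names below are invented for illustration. The point is that every incoming message, whatever the economic domain, is reduced to one internal time-series representation, so only one loading component has to be written and maintained:

```python
from dataclasses import dataclass

@dataclass
class Series:
    """Internal time-series representation shared by all domains."""
    key: str               # multidimensional series key, e.g. "M.DE.EXP.TOTAL"
    frequency: str         # e.g. "M" for monthly
    observations: dict     # period -> value
    attributes: dict       # qualitative metadata (status, unit, ...)

def parse_message(text: str) -> list:
    """Hypothetical parser from the exchange format to Series objects.
    In a real system this is the only format-specific component."""
    series = []
    for block in text.strip().split("\n\n"):
        header, *obs = block.splitlines()
        key, freq = header.split(";")
        observations = dict((p, float(v))
                            for p, v in (o.split("=") for o in obs))
        series.append(Series(key, freq, observations, attributes={}))
    return series

def load(series_list: list, db: dict) -> None:
    """Single loader: the same routine stores any domain's data in-house."""
    for s in series_list:
        for period, value in s.observations.items():
            db.setdefault(s.key, {})[period] = value

db = {}
load(parse_message("M.DE.EXP.TOTAL;M\n1999-01=100.5\n1999-02=101.2"), db)
print(db)
```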

1.3 Internet vs. EDI?

The use of the Internet can further facilitate the provision and dissemination of statistics. The Internet offers the possibility of allowing access to the general public, and it is especially in this domain that its comparative advantage vis-à-vis other means becomes most visible. Particular aspects of major concern for statisticians include, for example, simultaneous release to all interested parties and “speed of dissemination” (IMF 1997), which can be served in an ideal manner using Web-based dissemination. Also, the introduction of Web-based dissemination requires a data warehouse type of approach4 and contributes to the creation of an internal corporate statistical culture. However, a Web-based data exchange does not necessarily lead to solutions which allow easy integration, especially for the party accessing the data. Integration has also been a major concern over the past years, and standardised EDI interfaces have been seen as essential means for achieving integration-related objectives (e.g. Malmborg & Sundgren 1994, Keller & Bethlehem 1998).

2 GESMES/CB is a relatively simple GESMES profile which is based on a very powerful time series model. It allows the exchange of multidimensional data, attributes and structural metadata. A basic consideration during the design phase had been the possibility of allowing database systems to communicate with each other, without necessarily presuming a visual tabular representation of the exchanged information (Androvitsaneas 1997). GESMES/CB was jointly developed by the European Central Bank (ECB) and the Bank for International Settlements (BIS) for the exchange of statistics concerning any economic domain (see BIS, ECB 1999). It has been used in full statistical production across the 15 EU central banks since early 1998 and it has already been serving both data provision and dissemination activities within the European System of Central Banks (ESCB) in a very efficient manner (Androvitsaneas 1999); institutions using GESMES/CB in full production will be present on almost all continents by mid-2000.

3 For a discussion about efficiency in a statistical information systems context, see esp. Sundgren (1999a), UN (1999) and Linde & Vanags (1999).

4 For problems and questions related to such a transition, see for example Björkqvist (2000).


From this point of view, the introduction of Web-based solutions should not cause potential conflicts with other approaches which might ensure an even higher level of automation. It has to be noted that Web-based solutions should either embed bridges towards advanced automation (e.g. downloading facilities in a standardised format using a minimum number of manual operations) or, otherwise, should not exclude the support of more traditional EDI approaches.

2. Statistics and the user community

2.1 Are there non-central institutions?

Let us consider a case in which a number of institutions (reporting institutions) send statistical data to another one (the central institution), which collects them, performs some processing and possibly calculates some macrodata or aggregates. As reporting institutions we could consider national statistical institutes (NSIs) reporting to EUROSTAT or NCBs reporting to the European Central Bank (ECB). But the role of an NSI (or an NCB) would change into a “central institution” role when firms reporting to NSIs (or banks reporting to NCBs) are considered. Also, to a certain extent, the statistical information systems area of an NSI or an NCB has in any case to play a “central institution” role when it receives data and metadata from another institution. In any case of regular data transfers, partners have an interest in seeing themselves as playing a central role and in targeting a level of automation as advanced as possible. If this is not yet the case in some data reporting circuits, two-way statistical data exchange has to be seen as a very essential characteristic of handling statistics over the next few years.

2.2 Are there still end-users?

In fact, nowadays, there are not so many “end-users” any more. Even for the simplest operation on a statistical table on a Web page, no-one would be content just to view a figure. Even in the most extreme “end-user” case, it would be still desirable for this user to “copy” at least one or more numeric values and to “paste” them into an electronic spreadsheet, a document or an econometric program which is probably simultaneously open (and, if possible, in some cases, with a dynamic link to his local copy of the original database).

2.3 Each user a potential processing centre

As discussed earlier, after performing data retrieval, individual or corporate users generally need to perform further processing using tools which are available locally in their environment. Let us assume a case in which - with considerable effort - an institution has made available on its Web page, apart from the data, also processing tools. However, it would be rather simplistic to assume that the majority of users would prefer to use this predefined range of tools found on-line (and probably not easily usable at first glance), rather than loading the data onto their local infrastructure and performing the processing using database systems and tools they are already familiar with. The only case in which users might prefer to use the provider’s tools would be if these were superior to the local ones. But even in this case, being confronted with different and alternative data providers’ tools would be a source of problems for users. Even an elementary system integration would be quite difficult for the user in this case, unless all dissemination centres were providing and supporting the same tools.

Of course, the solution to this problem is for centres to support data downloading. By doing so, users could load data and metadata onto their local environment and perform processing using locally available database systems and tools in a consistent manner, regardless of who the data provider is.
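What "supporting data downloading" means on the user side can be sketched in a few lines of Python; the URL, file format and column name below are purely hypothetical. The user fetches an extract published by a dissemination centre and continues processing with locally familiar tools:

```python
import csv
import io
import urllib.request

# Hypothetical download URL exposed by a dissemination centre.
URL = "https://example.org/statistics/extract.csv"

with urllib.request.urlopen(URL) as response:
    text = response.read().decode("utf-8")

# Load into a local structure and carry on with locally available tools.
rows = list(csv.DictReader(io.StringIO(text)))
total = sum(float(r["value"]) for r in rows)   # "value" is an assumed column
print(f"{len(rows)} observations downloaded, total value {total}")
```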

3. New technologies in statistics and underlying data model


3.1 A data model for each user?

Each potential user group of statistics on the Web5 may have different requirements. A possible response to such a variety of requirements could be to accommodate them by enlarging the range of metadata offered or, in other words, by offering various forms of presentation according to the various user groups’ expectations. However, this could obviously lead to an enormous rise in maintenance costs for statistical organisations.

Even in some cases of sophisticated systems allowing extensive flexibility (Sundgren 1997), the expectation is that a downloading process (e.g. in GESMES) is essential when processing intentions go beyond elementary calculations and simple visualisation. Also, owing to efficiency considerations, flexibility and adjustment have their limits. A centre disseminating its data on the Web cannot easily serve alternative underlying data model expectations. Ideally, it would be desirable to target agreements as wide as possible on the type of supported data models for Web dissemination. On the other hand, if this is not easily feasible, it would still be preferable for institutions to support a concrete and relatively simple data model, providing the necessary explanations and documentation about this data model on-line, rather than trying to adjust to numerous foreseen and unforeseen user expectations. There have already been many cases of statistical organisations providing very complete information and documentation on-line.6 However, the more complete the supplied information, the more difficult it is to download, unless special mechanisms are foreseen for this purpose. Once again, EDI - GESMES tools are essential in order to interpret, load and automate the local storage of data and metadata.

Ideally, of course, standardisation in the supported data model and the semantics would lead to an optimal solution for both data providers and data users.

3.2 Towards a data model for disseminating statistics on the Web?

Some efforts have been made to produce recommendations concerning standards and a minimum set of metadata for Web-based dissemination (see Statistics Norway 1998 and UN/ECE 1999). The focus has been on analysing various user groups and the conclusion has been that a significant range of user requirements is faced, due to the variety observed across user groups. However, modelling metadata reflects in fact a partial aspect of the more general data modelling. Also, in order to keep efficiency and integration at a satisfactory level (both for the provider and for the user) as discussed earlier, it is obvious that:

(1) downloading facilities have to be supported and

(2) a certain degree of “harmonisation” concerning the data model (and the metadata) is desirable in order to avoid complexity for users.

In fact, these issues are not new for the statistical EDI world, as they have been extensively discussed in the past. The first issue relates to the development of appropriate filtering mechanisms and the second one to the need for convergence in the semantics7. In this context, a significant amount of the work done so far in these areas - at least at the conceptual level - is reusable and should be seriously taken into account.

5 For a discussion on potential user groups and their requirements see Statistics Norway (1998).

6 E.g. Statistics Canada publishes on the Web several underlying code lists for time series, which are accessible on-line (http://www.statcan.ca/english/Subjects/Standard/standard_classifications.htm).

7 Several efforts are under way to harmonise the contents and presentation of metadata (Statistics Norway 1998 and UN/ECE 1999; on the latter, see especially the issues discussed in the Annex). A Work Session on metadata standardisation for dissemination, in the framework of UN/ECE activities, is foreseen for November 2000.


For example, the central banking community has been using a single, quite flexible and at the same time robust data model. This is combined with an advanced level of harmonisation of the qualitative information (attributes) and agreed procedures for the administration of structural metadata.

The ECB has been planning new enhancements for its data loading and dissemination systems in the near future. They are based on new technologies, but without diverging from its integration, full automation and high efficiency objectives. For example, work is being conducted in the areas of “statistics on the Web” with embedded filters to GESMES/CB and in redesigning the basic conversion software between alternative formats (GESMES/CB-EDIFACT, GESMES/CB-XML, FAME and ODBC) using modular object-oriented techniques.
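The modular object-oriented redesign mentioned here can be pictured, speculatively, as a family of interchangeable converter classes behind one common interface; the class names, format choices and toy serialisations below are illustrative assumptions only, not the ECB's actual design:

```python
from abc import ABC, abstractmethod

class Converter(ABC):
    """Common interface: every converter maps between the internal
    representation (here a period -> value dict) and one external format."""
    @abstractmethod
    def export(self, series: dict) -> str: ...
    @abstractmethod
    def import_(self, text: str) -> dict: ...

class XmlConverter(Converter):
    def export(self, series: dict) -> str:
        obs = "".join(f'<obs period="{p}" value="{v}"/>'
                      for p, v in series.items())
        return f"<series>{obs}</series>"
    def import_(self, text: str) -> dict:
        raise NotImplementedError("parsing left out of this sketch")

class CsvConverter(Converter):
    def export(self, series: dict) -> str:
        return "\n".join(f"{p},{v}" for p, v in series.items())
    def import_(self, text: str) -> dict:
        return {p: float(v)
                for p, v in (line.split(",") for line in text.splitlines())}

# New formats plug in without touching existing code:
# callers depend only on the Converter interface.
for conv in (XmlConverter(), CsvConverter()):
    print(conv.export({"1999-01": 100.5}))
```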

4. Conclusion

New technologies in the dissemination of statistics are not a panacea. When implementing solutions based on new technologies, there is a risk of targeting the use of the latest techniques while disregarding some essential aspects of automation. Integration and automation are core means of maximising the efficiency of the dissemination and processing of statistical information for users (individuals and other user institutions). Nowadays the community of “real end-users” has shrunk, and both institutions and individuals generally require, when they access statistical information, to “load” data (and metadata) onto their local environment for further processing. Bridges allowing full automation are essential for an institution supporting and maintaining the dissemination process, and for the broader user community as well. Ideally, Web-based dissemination of statistical information should comprise, apart from the possibility to access numeric data, also the possibility to access qualitative information, underlying structures and code lists. However, this information can be easily manageable by user institutions only when appropriate downloading facilities are provided and only if these facilities conform to certain data modelling standards. In this context, the incorporation of filters allowing downloading in standard statistical formats (i.e. GESMES or GESMES/CB) is an important factor when disseminating on the Web, in order to ensure that users will be in a position to use statistical information and meta information in the best and most efficient possible way. For example, providing “downloading” facilities towards GESMES/CB through a CD-ROM or a Web-based dissemination:

Would require only a low-cost development for the corresponding module;

No technical maintenance would be needed (standards do not change!) and only support for changing structural metadata would be required (which is needed anyway);

Would make statistics really usable (in an automated fashion) by a broad community able to “read” data and metadata in this standard (e.g. central banks world-wide, EUROSTAT, IMF, some NSIs, etc.);

The need to devote resources to building state of the art presentations and tools on the Web would become less justifiable, as long as extraction in an EDI format were supported;

The investment in new technologies would become even more justifiable globally, in terms of meeting automation objectives.

References

Androvitsaneas, C. (1997): Statistical data exchange in Stage Three of Economic and Monetary Union: towards a time series model in an EDI environment, Proceedings of the Fourth International FAME User's Conference, Wiesbaden, June 1997.

Androvitsaneas, C. (1999): GESMES/CB supporting Monetary Union, “Statistics, telematic networks and EDI bulletin”, EUROSTAT, Theme 9 - Research and Development, 1999/2.


Bank for International Settlements (BIS) & European Central Bank (ECB): GESMES/CB User Guide, Release 1.4, March 1999.

Björkqvist, S. (2000): Experiences with WWW based data dissemination – the StatFin on-line service, 9th CEIES, Helsinki, January 2000.

IMF (1997): Data Dissemination Standard (http://dsbb.imf.org/), December 1997.

Keller, W. (1997): EDI, the future, Statistics Netherlands, Vol. 12 – Special Issue, Autumn 1997.

Linde, J. & Vanags, I. (1999): Some technological and economic problems in the implementation of modern information technology, Meeting on the management of Statistical Information Technology, UN/ECE, 15-17 February 1999.

Malmborg, E. & Sundgren, B. (1994): Integration of Statistical Information Systems - Theory and Practice, 7th International Conference on Scientific and Statistical Data Base Management, Charlottesville, Virginia, 1994.

Podehl, M. (1999): Data base publishing on the Internet, Meeting on the management of Statistical Information Technology, UN/ECE, 15-17 February 1999.

Sarreither & Novak (1999): The impact of Internet on statistical organisations, Meeting on the management of Statistical Information Technology, UN/ECE, 4 February 1999.

Statistics Norway (1998): Guidelines for statistical metadata on the Internet, UN/ECE, Conference of European Statisticians, Forty-sixth plenary session, CES/1998/32, 18-20 May 1998.

Sundgren, B. (1997): Sweden’s Statistical Databases: an infrastructure for flexible dissemination of statistics, UN/ECE Conference of European Statisticians, June 1997.

Sundgren, B. (1999a): An information systems architecture for national and international statistical organisations, UN/ECE Meeting on the Management of Statistical Information Technology, February 1999.

Tapscott, D. & Agnew, D. (1999): Governance in the Digital Economy, International Monetary Fund, Finance and Development, Vol. 36, No. 4, December 1999.

UN (1999): An information systems architecture for national and international statistical organisations, Conference of European Statisticians, Statistical Standards and Studies - No. 51, Geneva, 1999.

UN/ECE (1999): Report of the September 1999 Work Session on Statistical Metadata, Note prepared by the Secretariat for the 48th plenary session of the Conference of European Statisticians (planned for 13-15 June 2000), November 1999.


2nd day: THEME 2:

STATE OF THE ART


STATE OF THE ART
PROBLEMS/SOLUTION/TECHNOLOGIES

Dieter Burget
Statistics Austria
Hintere Zollamtsstraße 2b
Postfach 9000
A-1033 Wien
[email protected]


DATA COLLECTION, RESPONDENT BURDEN, BUSINESS REGISTERS

Jean-Pierre Grandjean
French National Institute of Statistics – INSEE1
18 blvd Adolphe Pinard
F-75014 Paris
[email protected]

1. Introduction

It is a common view that the European and national demand for statistics is expanding, particularly in the domain of business statistics, i.e. statistics compiled from data provided by businesses.

Simultaneously, the administrative burden that weighs on businesses has generally been growing over the last 20 years. This is mainly due to increasingly complicated legislation established to cope with a social, economic and technical organisation that is itself more and more complex. Businesses began complaining about this situation a few years ago, and in many countries politicians appear to have reacted to these complaints by launching « simplification » programs. This type of policy has also come to light at the European level (cf. the SLIM initiative).

The surveys carried out by the National Statistical Institutes (NSIs) are clearly part of this burden. So, in some countries, statisticians have been asked to reduce their survey programs.

We observe some sort of contradiction between the increasing demand for statistical figures and the concern for keeping the burden within acceptable margins, or even reducing it.

I will describe the French situation2 and the strategy implemented by French statisticians, and discuss the role new technologies have played, and will play, in this domain. From many years of participation in European fora, mainly in Luxembourg, I have the impression that national situations are extremely diverse from an institutional, organisational, political, sociological and historical viewpoint. It is clear to me that the relevance of what I describe is therefore quite relative. However, some trends may be significant, and some proposals may be relevant from a European perspective.

2. The institutional and political landscape

The question of the statistical burden was raised in France at the beginning of the 90s. A number of reports have been written on this topic, by statisticians, by experts, or by representatives of the enterprises. This activity generally took place under the patronage of the National Council for Statistical Information (CNIS)3, a rather large organisation in which businesses, trade unions, universities, experts, administrations and official statisticians belonging to the INSEE and to the different statistical services of ministries are represented. This committee has a number of subcommittees specialised by domain, one duty of which is to express an opinion on the mid-term and yearly work-programs presented to them by the statisticians. These work-programs include the surveys that are to be carried out. A number of simplification proposals were expressed in the above-mentioned reports, several of which have been put into application.

1 Head of the department « System of Enterprise Statistics », INSEE. The views expressed in this paper are attributable to the author and do not necessarily reflect those of the INSEE.

2 See Grandjean (1997a) for a general description of the French business statistics system. This system is mainly characterised by the coexistence of a powerful central coordinating body, the INSEE, which is part of the Ministry of Economy, and of statistical services located in a number of ministries: Industry, Agriculture, Employment, etc.

3 A Web site is devoted to the CNIS activities: www.cnis.fr.

The present government presented two simplification programs in 1997 and 1998, and should present a new one in the next few months. These programs are mainly directed towards Small and Medium Enterprises (SMEs)4. They consist of a number of measures addressing a variety of topics in all domains: fiscal, social, legal. One of them aimed at reducing the number of statistical questionnaires addressed to SMEs.

We feel that there is no serious reason why « simplification » should not stay on the agenda of politicians in the future.

3. How to manage the burden problem

A number of ways have been explored.

3.1 Monitoring the survey program

In France, the yearly survey program is published as a ministerial order in the Official Journal at the beginning of each year, signed by the Director General of the INSEE on the authority of the Minister of Economy.

As mentioned above, the CNIS is entitled to give an opinion on the appropriateness of any new survey or of the renovation of any existing survey. However, we thought that some progress had to be made concerning the technical quality of the surveys. A new subcommittee of the CNIS was therefore created, the so-called « stamp-committee »5. It is composed of representatives of businesses, of Chambers of Commerce, of an agency in charge of administrative simplification, and of statisticians, and it is chaired by a statistician. Any new survey project, and any project to renovate an existing survey, is presented to this « stamp-committee ». The dossier accompanying the presentation has to follow a required scheme. The committee evaluates the statistical quality of the project according to a number of criteria related to the preparation of the questionnaire6, the size of the sample, the sampling scheme, the dissemination scheme, etc.

There was also some thought given to the response obligation. A few years ago, every survey published in the ministerial order was mandatory. There is now a new category of surveys that are not mandatory but considered of « general interest ». This distinction is also made by the « stamp-committee », which has decided in a number of cases that a new survey should be considered only of general interest, and not mandatory, even though the service that conceived it had wished it to be mandatory.

It was also decided that all existing surveys should be reviewed by the committee within three years, in order to ascertain their usefulness, the necessity of every question, the size of the samples, etc. This is being done.

4 The French definition of SMEs includes all enterprises up to 500 employees. However, the European definition, which is limited to 250, is more and more often referred to.

5 In French, it is called « Comité du label ».

6 A lot of attention is paid to the participation of « users » in the design of the questionnaire and to the quality of the tests that have to be made.

3.2 Optimizing the sample sizes


One efficient way of reducing the burden is to optimize the sampling schemes, to reduce the size of the samples, and to raise the size threshold above which businesses are systematically surveyed. This has been done for a number of major surveys, even though we have been cautious concerning the take-all (exhaustiveness) thresholds, so as not to disrupt the continuity of the time series too much.

Concerning the above-mentioned government measure to reduce the number of statistical questionnaires addressed to SMEs, we were able to present the results of such optimizations.

Of course, these optimizations have limits regarding the quality of the results, for example at the regional level.

3.3 Using administrative data for statistical purposes

We call administrative data those micro-data that are collected from enterprises by administrations in order to manage the regulations they are in charge of. These data are not collected through statistical surveys. The French statistical law entitles the INSEE and the statistical services of ministries to receive administrative data from other administrations in order to produce statistics from them. It should be noted that an administration is not obliged to transmit its administrative data to the INSEE. There is always a negotiation, which may be long; one of the key points in the discussion is the presence or absence of identifiers in the data that will be transmitted to the INSEE.

The INSEE has been using administrative data for several decades. Examples are the use of Value Added Tax (VAT) data to compute a monthly turnover index, and the use of social security data on employment and wages for the computation of quarterly indices. The structural yearly figures on wages are computed from data collected by the agency in charge of pensions. Structural statistics on enterprises, excluding the banking and insurance sector, are produced by a system in which survey data and business income tax data are combined. Structural statistics on the banking and insurance sector are computed directly by the supervisory agencies. The balance of payments statistics are produced by the Banque de France from administrative data. The monthly Intrastat statistics are derived from administrative data collected by the Customs administration, both for fiscal and statistical purposes.

I will not detail the pros and cons of using administrative data for statistical purposes. They have been well documented elsewhere7.

In France, we know that in some situations there is some data collection overlap. Some very similar data are collected from businesses both through a statistical questionnaire and through an administrative form, and all of these data are gathered and used by the INSEE to produce statistics. Our general strategy is, in the end, to eliminate these double data collections. The administrative data will generally be the core data; the statistical surveys will collect supplementary data that are not collected by other administrations.

The major argument against quickening the pace of eliminating these double data collections is one of calendar: the administrative data are available to us very late, after the results of our statistical surveys have been disseminated. But this situation could evolve dramatically in the next few years, mainly because of new technologies. Another argument is that the combined use of administrative and survey data is not easy, and we prefer to take some time to improve our methods.

7 See, for example, the Proceedings of the Seminar on the Use of Administrative Sources for Statistical Purposes – Luxembourg – 15-16 January 1997, published by Eurostat in 1997.

3.4 Using a single identifier for businesses


One key element that makes the strategy described above quite feasible is the fact that the INSEE manages the French business register8. We give every business an identifier that is common to all administrations9. By law, this identifier should be used in all contacts between businesses and administrations. In this respect, the situation is improving regularly. One recent and major event was the decision taken by the Tax administration to use our business identifier in its own registers.
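To make the combined use of survey and administrative data concrete, a minimal sketch in Python follows. It joins a tax extract and a survey extract on the shared business identifier (called SIREN in the French register); all figures, variable names and record layouts are invented for illustration and are not INSEE production structures.

    # Hypothetical extracts, keyed by the single inter-administrative
    # identifier (the SIREN number in the French case).
    tax_data = {
        "331234567": {"turnover": 1850, "employees": 12},
        "339876543": {"turnover": 920, "employees": 5},
    }
    survey_data = {
        "331234567": {"exports_share": 0.30},
    }

    # Administrative data form the core record; survey data supplement it
    # with variables no other administration collects.
    combined = {}
    for siren, core in tax_data.items():
        record = dict(core)
        record.update(survey_data.get(siren, {}))
        combined[siren] = record

    for siren, record in combined.items():
        print(siren, record)

The point of the sketch is simply that, with a single identifier, the join is a dictionary lookup rather than a fragile matching exercise on names and addresses.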

From this practical experience, and bearing in mind that our enterprises will be more and more European, we think that the Commission should pay some attention to the following question: should there be a common European identifier for all Community enterprises?10 We wonder whether the existence of such an identifier would not, in the long term, be a way to lower the administrative burden at the European level.

3.5 Making the burden more acceptable

A number of actions have been undertaken in order to obtain a more positive view from businesses regarding statistical surveys. I briefly describe some of them.

The questionnaires are progressively being modernised in order to have a more attractive look and to be more understandable to the enterprises. The cover letters are being rewritten to insist on the usefulness of the surveys.

Statistical results are systematically fed back to the enterprises that have answered a survey. The nature of these results can differ from one survey to another. In the most sophisticated situation, the enterprise receives results that allow it to compare its own figures with the distribution of the figures of similar enterprises.
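As an illustration of this most sophisticated form of feedback, the sketch below positions one enterprise within the distribution of its peers. The peer figures and the quartile-style presentation are assumptions for the example, not the actual INSEE feedback format.

    import statistics

    def position_in_distribution(own_value, peer_values):
        """Return the share of similar enterprises with a value below one's own."""
        below = sum(1 for v in peer_values if v < own_value)
        return below / len(peer_values)

    # Hypothetical turnover figures for enterprises in the same sector and
    # size class (the comparison group an NSI would construct).
    peers = [820, 950, 1100, 1240, 1390, 1500, 1720, 2050]
    own = 1450

    print(f"Peer median turnover: {statistics.median(peers)}")
    print(f"Your enterprise exceeds {position_in_distribution(own, peers):.0%} of peers")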

The clerks who process the surveys are trained for a better command of telephone exchanges with businesses.

A small network of interviewers (40 people) specialised in business surveys is being created, in order to test new questionnaires, to carry out some particularly difficult or important surveys by interview in the field, and to obtain direct responses from non-respondents when non-response is judged particularly damaging to the quality of the survey results.

The « legal » procedure that may end in imposing a fine on non-respondents is managed in such a way that we try to take advantage of this « crisis » situation to convince the non-respondents of the usefulness of their answer to the survey. It is also much more respectful of the rights of the defence than it was a few years ago.

On a more technical level, we have already invested, and intend to invest more, in sample coordination and sample rotation11. We have also tried to automate editing procedures further, in order to minimise the number of phone calls made to enterprises to check whether an answer is correct.
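Sample coordination of the kind cited here is commonly implemented with permanent random numbers. The sketch below shows only the basic idea: each business keeps one random number for life, and a survey selects the businesses whose number falls in a window. Shifting the window rotates the burden; reusing it coordinates samples. The window parameters are invented, and the actual methodology of the cited references (e.g. microstrata) is considerably more elaborate.

    import random

    # Each business keeps a permanent random number (PRN) for its whole life;
    # drawn here from a seeded generator for reproducibility.
    rng = random.Random(42)
    prn = {business_id: rng.random() for business_id in range(1, 101)}

    def select(sampling_fraction, start=0.0):
        """Select businesses whose PRN falls in a window of the given width.

        Keeping `start` fixed across surveys coordinates the samples
        (same businesses drawn); shifting it rotates the sample and
        spreads the response burden."""
        end = (start + sampling_fraction) % 1.0
        if start <= end:
            return {b for b, u in prn.items() if start <= u < end}
        return {b for b, u in prn.items() if u >= start or u < end}

    survey_a = select(0.10, start=0.0)   # a 10 % sample
    survey_b = select(0.10, start=0.10)  # disjoint window: burden spread
    print(len(survey_a), len(survey_b), len(survey_a & survey_b))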

8 It is well known that a number of countries all over the world have implemented a unique administrative business identifier. Others are working on it. The INSEE is in the rare position of being responsible for the management of this identifier.

9 For a general presentation of the French business register, see Picard (1997) and Bernard (1997).

10 A possible basis for such an identification scheme could be the VAT number used to report the Intra-Community exchanges of goods. The French VAT number is of course based upon our unique business identifier.

11 See, for example, Cotton and Hesse (1992), Rivière (1999).

3.6 Making use of new technologies


We can say that in France the real impact of new technologies on the statistical burden has been rather limited, probably smaller than could have been expected in 1990.

Even though French statisticians have been carefully watching the movement towards electronic data interchange1 since the end of the 80s, they were very cautious during the 90s about experimenting with new technologies for survey data collection. A number of reasons may explain that attitude.

There was mainly a concern about the advantages that could be gained from such projects compared with their costs, for the businesses themselves as well as for the statistical administration. The main problem was expected to be the acceptance by enterprises of electronic means to answer our surveys, whereas it was well known that they generally did not consider answering our surveys a high priority, compared for example with fulfilling other administrative tasks such as reporting to the tax or social security administration. The technologies that could be used always implied some sort of investment from the businesses, for example in training, which we felt would be hard to justify.

Concerning the impact of these technologies on the burden, it was not so obvious. One may argue that the main difficulty faced, and the greatest cost incurred, by an enterprise answering a quantitative survey is to compute and gather the answers to the questions. This is all the more true when the enterprise is large, because filling in our questionnaires very often involves a number of different units within the enterprise. In this situation, automating the computation of the answers is costly; the circulation of a paper questionnaire may well be simpler and less expensive.

Concerning small enterprises, for which the burden of statistical surveys is relatively higher than for the largest ones, there was a difficulty with the low rate of equipment in microcomputers, modems and so on. The situation is better now, but it is not certain that the person in charge of responding to a given survey has the necessary hardware, software and telecommunication link. Moreover, it is not easy to demonstrate a clear gain in answering a questionnaire on a screen rather than on paper2.

On the other hand, the INSEE itself does not carry out many of the monthly or quarterly surveys that were felt to be more amenable to successful pilots. This is partly because we often rely on administrative data collected by other administrations, as stressed before. It is also because many of these surveys are carried out by the ministry statistical services, or even by professional associations on behalf of the statistical administration3. These services and associations are often more flexible and reactive than the INSEE, even though their capacity to invest is often smaller.

In fact, only one significant achievement can be reported. In 1994, the FIEE, the federation of electric and electronic industries, implemented a computerized self-administered questionnaire (CSAQ) for its monthly survey on industrial production4. It was rather sophisticated: it allowed the online transmission of the survey data from the enterprise to the federation, and the reverse online transmission of statistical results and of the updated questionnaire at the beginning of each new year. The modem was given to the enterprises by the federation. They convinced about 200 enterprises out of the 350 in the sample to use this medium to answer the survey regularly. This system has recently been replaced by an Internet data collection; more will be said about that later. From this experiment, and from others reported mainly by the US statistical agencies5, we retain the idea that automated data collection systems must be very carefully designed (which means the investment cost can be rather high for complex surveys) if the enterprises which use them are to be convinced that they are really reducing their administrative tasks.

1 We do not describe the use of other new technologies such as OCR, because these technologies have no real impact on the responding enterprises and do not significantly lower the burden. Some significant French achievements could however be described.

2 One may argue that there is a « tangible » gain when the data are checked as soon as they are entered, so that they may be corrected by the enterprise before being sent to the statistical agency; the enterprise should then not need to be recalled later.

3 See Grandjean (1997a) for some explanations of this situation.

4 This survey collects some of the data used for the computation of the industrial production index.

It is indeed clear that the burst of the Internet has deeply changed the landscape, mainly because Internet technologies allow the large-scale deployment of sophisticated data collection instruments with a much lower budget than was necessary beforehand. This seems to be true for the statistical administration as well as for the surveyed businesses. Moreover, the Internet appears to be the ideal means of disseminating statistical figures and of targeting new users for our statistical products, the first being medium-sized enterprises6. In so doing, we should be in a better position to convince them that our surveys are really useful. In some respects, we may assert that investing in dissemination on the Internet is a higher priority than investing in data collection.

One major problem, however, is the fact that Internet technologies are not stabilised, and it is highly likely that they will not stabilise in the near future. The quick obsolescence of these technologies is a difficulty: businesses do not change their computing equipment as often as some software and hardware producers would like, which always poses transition problems when we try to make our data collection instruments evolve. Moreover, the software tools that would allow the efficient development of Internet data collection instruments are not yet available. One may hope that tools such as Blaise will quickly become available for that purpose, so as to reduce implementation and maintenance costs.

4. The Internet era

4.1 The political and legal context

The present government is the first to attach great importance to new technologies. At the beginning of 1998, it published an ambitious program to foster the country's entry into the information society. Some consequences are already visible, especially within the administrations, which were admittedly not very advanced in this respect.

From a legal viewpoint, the possibility of reporting any questionnaire or any form by electronic means was opened by a law enacted in 1994. In 1998, the use of encryption was very significantly liberalised, whereas it had previously been quite restricted. And in 1999, the Ministry of Justice agreed to propose a bill that would recognise the probative value of digital signatures. The law should be enacted in 2000.1

At the end of 1999, a law article was passed requiring enterprises whose turnover exceeds 100 million French francs (15 million euro) to report monthly by electronic means on their VAT from May 2001, and on their income tax beginning with the fiscal year 2000. They will also have to pay their VAT electronically from May 2001. 13 000 enterprises are concerned. The Tax administration has expressed the view that the threshold of 100 million francs should be quickly lowered in the following years.

5 See for example Kanarek and Sedivi (1999).

6 In general, small and very small enterprises do not directly use our statistical figures.

1 This law will transpose in France the recent European directive n° 1999/93/CE on a « Community framework for electronic signatures ».


4.2 The security problems

Much has been said and written about the security problems associated with the Internet. The working French Internet data collection systems I have heard of, whether administrative or statistical, do not generally use sophisticated authentication techniques1. The respondents are given an identification code and a password. When encryption is used, it is generally based upon the SSL protocol, which is supported both by Netscape Navigator and Internet Explorer, with short keys. The enterprises which accept to use these data collection instruments are duly informed of these security conditions and take their responsibilities accordingly.

1 One exception is mentioned in § 6.
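In practice, such a collection instrument amounts to an authenticated form submission over an encrypted channel. A minimal client-side sketch, using only the Python standard library, follows; the URL, credentials and field names are invented for illustration, and the real systems described here were browser forms rather than scripted clients.

    import base64
    import urllib.request
    from urllib.parse import urlencode

    # Hypothetical endpoint of an SSL-secured collection server.
    URL = "https://collect.example-statistics.fr/monthly-production"

    # Identification code and password, as issued to the respondent.
    credentials = base64.b64encode(b"ENT12345:secret").decode("ascii")

    data = urlencode({"period": "2000-01", "production_units": "1520"}).encode()
    request = urllib.request.Request(
        URL,
        data=data,
        headers={"Authorization": f"Basic {credentials}"},
    )

    # The encryption itself is negotiated at the transport layer (SSL/TLS),
    # exactly as in the systems described in the text; urllib verifies the
    # server certificate by default.
    with urllib.request.urlopen(request) as response:
        print(response.status, response.read().decode())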

In 1998 and 1999, the Ministry of Economy actively studied the conditions in which electronic exchanges with both enterprises and citizens could be reasonably and efficiently secured. The project is based upon the concept of a Public Key Infrastructure (PKI) and on the use of digital signing certificates based upon the X.509 ISO standard. The Ministry will publish the format of the certificates it will accept. The compatibility with this format of certificates produced by private certification operators, such as Verisign and others, will be validated and registered, and the enterprises will be able to choose and buy whatever registered product they want. As the INSEE is part of the Ministry of Economy, it is likely that it will join this scheme in order to secure its Internet data collections and also its electronic commerce.

4.3 The French Internet data collection projects

Two systems are working today. Both of them concern the monthly production survey in the manufacturing industry. Let us recall that this survey produces input for the industrial production index.

The first is managed by a professional association in the field of « tiles and bricks ». About 25 of the 100 enterprises surveyed use it.

The second is managed by the FIEEC1, as mentioned in paragraph 3.6. It is interesting to note that when the FIEEC replaced its previous CSAQ-based system by an Internet data collection, in June 1999, the number of « electronic » respondents fell from 200 to 80; it is now 100. There are two reasons for this. First, a significant number of small enterprises do not yet have access to the Internet. Second, in the biggest enterprises, the security problems associated with the Internet are not yet completely mastered, so it is not easy to deploy Internet-based applications in these enterprises.

1 Since 1995, the FIEE has become the FIEEC, C standing for Communication.

The FIEEC system will in fact stop working very soon. The FIEEC monthly inquiry is to be managed from January 2000 by the statistical service of the Ministry of Industry (SESSI). This service will put into operation a new Internet data collection system for its monthly industrial production survey in February 2000. The total number of surveyed enterprises is 3 500. The responding enterprises will be identified by their business register identifier. To begin with, the transmission will be secured with the SSL protocol.

The last project to be reported concerns road freight transport statistics. The statistical service of the Ministry of Transportation has been discussing an exchange format with 3 of the major businesses in this sector. For the time being, this format is a comma-separated-value (CSV) one. The exchanges should start on an experimental basis in the very near future. The files will be exchanged simply as attachments to e-mails. These businesses want to lower their burden through a complete automation of their response. If the experiment is successful, the technical solutions should be strengthened and secured. The target population of respondents using this EDI solution would be about 30 enterprises.

5. New technologies and the business register

As said before, the INSEE manages the French inter-administrative business register. When an enterprise is created or ceases, or when its activity or legal status changes, a single form has to be filled in and handed over to a so-called « centre of formalities of enterprises » (CFE). There are about 1 500 CFEs all over France, specialised according to the type of activity of the enterprises: the Chambers of Commerce are the CFEs for commercial enterprises, some social security offices are the CFEs for professional people, etc. The forms are transmitted to the INSEE, which registers the new businesses, updates the register in case of changes, or marks « dead » businesses as ceased. The INSEE sends back the updating information in the form of notices to a number of administrations. The register is continuously updated.

Strangely enough, the above-mentioned 1994 law that allows for electronic reporting of any form makes an exception for these registration forms, which cannot be transmitted electronically by the enterprises. So we cannot report any experiment in this field.

However, the INSEE has always been very interested in automating the information exchanges with the CFEs (about 2 million forms a year arrive at the INSEE, which issues 8 million notices). Two EDIFACT messages have been designed to that end. The first experiments took place in 1995. Progress has been slow: at the beginning of 1998, only 0.5 % of the forms sent to the INSEE arrived as EDIFACT messages. There was a significant acceleration in 1998 and 1999, and the percentage of « Edifacted » forms is now 17 %. The transmission protocol is X400.
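For readers unfamiliar with the syntax: an EDIFACT message is plain text built from segments terminated by an apostrophe, with elements separated by + and components by :. A minimal tokenizer sketch follows; the sample segment content is invented for illustration and is not one of the two real register messages mentioned above, and release-character escaping is ignored.

    def parse_edifact(message):
        """Split an EDIFACT message into (segment tag, element components).

        Simplified: ignores the ? release character used for escaping."""
        segments = []
        for raw in message.rstrip("'").split("'"):
            elements = raw.split("+")
            tag, data = elements[0], [e.split(":") for e in elements[1:]]
            segments.append((tag, data))
        return segments

    # Invented example: a creation notice for a newly registered business.
    sample = "UNH+1+REGUPD:D:96A:UN'NAD+BU+331234567'DTM+137:20000114:102'"

    for tag, data in parse_edifact(sample):
        print(tag, data)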

6. Administrative data collection

As mentioned above, the INSEE uses administrative data to produce statistics. So we are interested in the progress the collecting administrations make in the use of new technologies; the result should be higher-quality data available more quickly. Some achievements are noteworthy and will be described briefly.

The Customs administration has been developing electronic data collection for many years. Concerning Intrastat, they collect quite a number of forms; a convenient volume unit is the « line » of a form. They receive about 44 million lines per year, of which 70.3 % by electronic means: 38.2 % on tape or diskette, 30.3 % by teletransmission (several media are allowed: X400, SMTP, XMODEM, the CFT file transfer system), 1.5 % through a Videotex-based data entry system, and 0.3 % through a legacy online data entry system. 4 200 enterprises use the IDEP CSAQ developed by Eurostat, and the Customs administration expects this figure to grow quickly in the near future. Half of the lines received on paper are processed with an OCR system. An Internet data collection system should be in production next summer.

The Tax administration has been investing for a number of years in a system entitled « Transfer of fiscal and accounting data »2 (TDFC) to collect yearly business income tax data3. The system was designed primarily for SMEs, which often do not fill in their income tax forms themselves, but rather remunerate a chartered accountant to that end. So the Tax administration has signed agreements with a number of so-called « relay-centers » (about 450 today), which may be software houses, chartered accountants or IT services companies. From data provided by the chartered accountants who accept to participate in this system, these relay-centers must produce files that conform to a well-defined format. Up to now, the TDFC format has been proprietary. A new one (EDI-TDFC) is being finalised, which makes use of the INFENT EDIFACT message. These files may be sent to the Tax administration on tape, or via X400 or the CFT file transfer system. The relay-centers can digitally sign the forms they send on behalf of the enterprises, using a smart-card-based digital signature hardware system. The Tax administration received about 500 000 forms through this system for the last fiscal year, and this number is growing regularly. A number of accounting software packages are able to produce TDFC files, and a few of the major accounting software houses are already investing in the new EDI-TDFC format22. The recent law mentioned in § 4.1 announces new developments, which will mostly be based upon Internet technologies.

2 See Grandjean (1997b) for a general description of the use of yearly business income tax data for statistical purposes and the associated problems.

3

Concerning the yearly data on wages, the main administrative source arrives at the INSEE through a single-stop-shop system4 that has been managed since 1981 by the social security agency in charge of pensions. This agency collects data from all private businesses and redirects them towards a number of other administrations: the Tax administration, other social security agencies, the Ministry of Employment, the INSEE. When electronically collected, data must be provided by the enterprises according to a proprietary format, the so-called TDS format5, which is supported by most payroll software packages. The businesses have to report one line of form per person salaried during the year. About 50 % of the businesses report electronically (900 000 out of 1 800 000), accounting for about 75 % of the total number of reported lines (30 million out of 40 million). They can report on tape, on diskette, or by telecommunications (X400, CFT, XMODEM). An Internet-based, SSL-secured reporting system was made available last summer.

7. Conclusion

The Internet has opened new perspectives for business survey data collection. But, from past experience, we must not expect miracles. It will take some time to convince the surveyed enterprises, particularly the small and medium ones, to use the new Internet-based data collection instruments. We must of course always consider the viewpoint of these enterprises with the highest priority when designing these automated instruments: they should be very carefully designed, and the targeted enterprises should be as closely associated with their design as possible.

We must not forget either that a large part of the burden of survey response lies in the computation of the answers. The existence of standard charts of accounts and of standard definitions in other domains (employment, wages, environment, etc.) is, in our view, one element likely to simplify the task of the responding enterprises6. This is all the more important within the European Community, where national legislations are so diverse. We should involve ourselves very deliberately in the construction of such future standards.

And finally, we think that our major challenge is to keep trying to convince the enterprises that their participation in our surveys is important, for themselves as well as for the national and European communities. New technologies will play their part in this never-ending work, but we must not forget other forces.

22 The INSEE has been discussing with the French Association of Chartered Accountants and a major accounting software house, within the framework of the European TELER project, the way to automate the production of the figures demanded by the « yearly business survey ». These discussions showed that the problem is not simple, and they did not lead to concrete results.

4 See Faure (1997) for some views on this system.

5 An EDIFACT version is available, called TDS-EDI, based upon the SOCADE EDIFACT message. Its use is very limited.

6 For example, we have read with interest a « Draft Commission Recommendation on the subject of environmental issues in the annual accounts and annual reports of companies », the implementation of which in the accounting software packages, when finalised, could facilitate the task of businesses that have to report on environmental protection expenses and investments.

References

Bernard C. (1997), « The SIRENE directory », Courrier des Statistiques – English series No. 3.

Cotton F. and Hesse C. (1992), « Coordinated selection of stratified samples », Proceedings of Statistics Canada Symposium, November 1992.

Faure J.L. (1997), « A French experience of combining a survey and an administrative file: the annual wage declaration supplementary survey », Proceedings of the Seminar on the Use of Administrative Sources for Statistical Purposes, Luxembourg, 1997.

Grandjean J.P. (1997a), « The system of enterprise statistics », Courrier des Statistiques – English series No. 3.

Grandjean J.P. (1997b), « The use of fiscal sources for statistical purposes: a French case study », Proceedings of the Seminar on the Use of Administrative Sources for Statistical Purposes, Luxembourg, 1997.

Kanarek W. and Sedivi B. (1999), « Internet Data Collection at the U.S. Census Bureau », presented at the FCSM 1999 Research Conference.

Picard H. (1997), « The inter-departmental system and SIRENE », Courrier des Statistiques – English series No. 3.

Rivière P. (1999), « Coordination of Samples: the Microstrata Methodology », presented at the 13th International Roundtable on Business Survey Frames, Paris, September 1999.


EXPERIENCES WITH WWW BASED DATA DISSEMINATION - THE STATFIN ON-LINE SERVICE

Sven I. Björkqvist
Statistics Finland
Information Technology Services
EDP-methods
Työpajakatu 13
FIN-00022
[email protected]

Summary: This paper describes Statistics Finland's experiences in introducing a WWW-based on-line dissemination database - the StatFin on-line service. Statistics Finland has had on-line data dissemination databases for years, but these systems are no longer able to meet the needs of the users, nor are they seen as dissemination databases for the whole office. The change from centrally co-ordinated on-line output to a distributed, but still centrally administered, WWW-based on-line service was not an easy one - especially as it was to be made within one year. Still, the project responsible for this change seems to have succeeded, as the feedback from users is highly positive and the service already contains over 30 million data elements in over 100 tables.

Background

Statistics Finland has a long history of providing on-line output database services. We have used mainframe-based on-line output databases since the 1980s1 and made them accessible through the Internet. This happened long before the WWW environment made its breakthrough in the world of the Internet. These databases were quite advanced and had a lot of functionality, so that the users, often working with "dumb terminals", could process the data using the services of these databases.

The office also had, and still has, metadata systems (the Classification Database and the Unified File System) for describing the data used within the office. The data architecture of Statistics Finland describes the way data should flow from production databases to output databases (Saijets 1999). All seemed to be well in theory, but in practice there was a vast number of separate production systems using their own data sources and their own dissemination channels. This situation has been called the stovepipe model (Keller 1998).

Since those times, a lot has changed, both in the data processing facilities of the users and in the use of the network of networks, the Internet. In the beginning of the 1990s the Internet became a phenomenon known by almost everyone, and the traditional users, universities and the science community, faced new challenges when WWW technology brought millions of new users to the services of the network. Statistics Finland was also faced with the pressure introduced by this change, as its on-line databases, based on mainframe solutions and direct terminal access, were suddenly seen as hard-to-use "dinosaurs".

Definition of policy

In 1998 Statistics Finland started searching for a new data warehouse solution for storing aggregate data in table format for internal use, and for disseminating these data, or parts of them, using WWW technology and CD-ROM discs. The selection process was a long and difficult one. There were lots of options to choose from, but none of them seemed to cover all the needs the office had. The main choices were to:

1 i.e. the regional database (ALTIKA) and the time-series database (ASTIKA)


1. Continue with our current situation (stovepipes), further developing the output side and enhancing co-ordination through strict rules.

2. Select one of our data-streams and tools related to it as an office-wide standard (one of the candidates was PC-Axis-based distribution).

3. Look for other solutions and replace our current production model and tools with them.

4. Look for a tool strong enough to solve the most obvious (output) problems and integrate it with our metadata and production systems, thus increasing the degree of integration.

Because of the diversity of the options, a project was launched to evaluate them and to make a proposal for a production model. This project was called the StatFin2000 policy definition project. The project evaluated the different options and systems and proposed a production model in which a central data warehouse acts as a storage for aggregate data in table format. The project also came to the conclusion that the StatLine output system, made by Statistics Netherlands, was the tool to use in implementing this production model.
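To fix ideas, "aggregate data in table format" means cell values indexed by the categories of each dimension. The following toy sketch is deliberately tool-neutral and entirely invented; the production system used StatLine, whose internal representation is not shown here.

    from itertools import product

    # An aggregate statistical table: dimensions and their categories...
    dimensions = {
        "region": ["Uusimaa", "Lapland"],
        "year": ["1997", "1998"],
    }

    # ...and one cell value per combination of categories (invented figures).
    cells = {
        ("Uusimaa", "1997"): 1_280_000,
        ("Uusimaa", "1998"): 1_302_000,
        ("Lapland", "1997"): 199_000,
        ("Lapland", "1998"): 197_500,
    }

    for key in product(*dimensions.values()):
        print(dict(zip(dimensions, key)), "->", cells[key])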

The decision was not an easy one, nor was it accepted by all parts of the organisation. This imposed heavy pressure on the successful implementation of the selected production model and tools.

The goals of the StatFin2000-implementation project

The StatLine system was made in the Netherlands, so it had to be adapted to the production environment of Statistics Finland. There was a lot of work to be done and the schedule was tight (one year). The main objectives set for the implementation project were to:

1. localise the program suite (the StatLine Suite) for use in Statistics Finland (definition of concepts, translating the programs and manuals)

2. purchase the software and hardware required for using the system

3. adapt the other production tools used in Statistics Finland to integrate seamlessly with the StatLine system

4. plan the best practices (using pilot systems) for using the data warehouse and information service (StatLine system) in different use-cases, and prepare the instructions and manuals needed

5. plan the structure of the internal data warehouse and the WWW service, and help and guide the statistical departments in providing content for the system

6. arrange the necessary technical support, courses and training

7. guide and supervise the office-wide introduction and implementation of the system

8. plan the organisation and find/appoint resources for administering the system after the project has ended

Resources

Due to the importance of success and the tight schedule, the project was resourced very well. For its one-year duration, it got a budget of 1.5 million Finnish marks (approx. 254 000 euros).


This enabled the project to recruit enough personnel with the wide range of expertise that the complexity of the work required. The preceding project (the StatFin2000 policy definition project) also got 300 000 Finnish marks (approx. 50 500 euros) to buy the server machines needed for the implementation.

Besides the financial resources, the project also got the full support of the top-level management of the office, which proved to be even more important than the money.

The project's consumption of resources was quite stable. During the first four months there were five persons in the working group (some of whom worked for the project on a part-time basis only) plus the project manager. After that, one full-time member was hired, which increased the consumption of resources accordingly. In the beginning of October yet another member was added to the working group.

Because the project introduced an office-wide change, it also induced a lot of costs for other organisational units, but exact data on those costs are not yet (December 1999) available. In conclusion, it can be stated that the resources consumed by the project's working group are only a fraction of all the costs it induced for the whole office.

The organisation of the project

The organisation of the project differed from the traditional project organisations of Statistics Finland. In a normal project organisation there is a steering group, a project manager and a working group. This structure was seen as inadequate for the implementation project, so an adapted model of the traditional project organisation was introduced.

Picture 1. The organisation of the StatFin2000-implementation project.

The project had a steering group consisting mostly of directors and very experienced experts. The project manager was chosen from the Information Technology Services/EDP-methods unit. Quality assurance groups were introduced to report on quality issues to the steering group and to the project manager. The members of the working group were experienced metadata and IT experts selected from IT services and the data administration department. The StatFin special interest group was a selection of 40 people from the statistical departments. They played a key role in exchanging information, distributing knowledge and promoting the system within the office - without their contribution the project would most likely have failed at the very beginning.

Schedule

The project began officially on the 1st of January 1999. However, some preparatory work was done before that, which enabled a smooth and rapid start. The deadline for the project was set at the 31st of December 1999.


When the project was set up, a set of milestones formed the skeleton on which the project's work packages were built. The milestones were the following:

1. The internal service must be up and running with usable data in January 1999.

2. The public non-chargeable service must be up and running, with a significant amount of data, by the 15th of May 1999.

3. The public chargeable service must be up and running at the beginning of the year 2000.

Work packages

Because the project was to introduce an office-wide infrastructure change, there were three main work areas:

1. Creating the technical infrastructure
2. Service and designing the service concepts
3. Communication

These areas were divided into work packages. The work in most of these packages was concurrent and sometimes even overlapping. The work packages per work area were:

Creating the technical infrastructure
1. Purchasing and installing server machines (2)
2. Localising (definition of concepts and translation) the software (StatLine Suite) and manuals
3. Developing conversion (import, export) programs to integrate StatLine with existing tools and infrastructure
4. Developing a web site with Statistics Finland's look-and-feel
5. Finding a method to enable charging for the use of the data
6. Developing and automating the process of updating the data
7. Developing tools for usage-reporting and log-analysis (see the sketch after these lists)

Service and designing the service concepts
1. Designing the structure of the service
2. Integrating the service with the web pages of Statistics Finland
3. Developing feedback utilities
4. Following the use of the service and reporting on it
5. Developing best practices for service administration, and administering the service
6. Quality handbook on the service and immediately related processes

Communication
1. In-office communication, presentations
2. Training, consulting, advisories
3. Establishing a discussion forum and actively using it
4. Customer relations, reaction to feedback
5. Marketing the service, participating in seminars and promotion events, presenting papers
6. Co-operation with CBS, participating in the work of the SOS consortium
7. Communicating the best practices and process descriptions
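As a small illustration of the usage-reporting and log-analysis package referenced in the infrastructure list above, the sketch below counts table requests in a web-server access log. The log lines and the convention that statistical tables live under a /statfin/ path are assumptions for the example, not the service's actual configuration.

    from collections import Counter

    # Hypothetical access-log lines (common log format, abbreviated).
    log_lines = [
        '10.0.0.1 - - [15/Nov/1999:10:02:11] "GET /statfin/population/tbl01 HTTP/1.0" 200',
        '10.0.0.2 - - [15/Nov/1999:10:05:42] "GET /statfin/industry/tbl07 HTTP/1.0" 200',
        '10.0.0.1 - - [15/Nov/1999:11:12:03] "GET /statfin/population/tbl01 HTTP/1.0" 200',
    ]

    hits = Counter()
    for line in log_lines:
        path = line.split('"')[1].split()[1]   # the requested URL path
        if path.startswith("/statfin/"):
            hits[path] += 1

    # Most-requested tables first: the raw material for usage reports.
    for table, count in hits.most_common():
        print(count, table)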


The outcome

The project succeeded in almost every task it was given. Statistics Finland now has an internal reference data warehouse, a public statistical WWW service (the StatFin statistics service) and the technical infrastructure to build chargeable services using the StatLine technology. We also have over 100 persons trained to use the system in order to produce material (data and metadata) for the databases.

The databases themselves are quite large and growing fast. The public service contains approximately 100 tables totalling some 30 million cells (15 November 1999). The flow of feedback from clients is constant, positive and growing. All feedback is stored in electronic form for processing and analysis.

Other official statistics producers in Finland have shown a great deal of interest in the system, and negotiations are under way to get them to participate in filling the services with data from all areas of society. This is very positive, because one of the roles of Statistics Finland is to co-ordinate the production of statistics in Finland, and with an attractive output tool shared by many producers of statistics, this co-ordination and integration becomes clearly visible and useful to our clients.

Essential experiences

The support (both financial and in principle) of high-level management is vital to an office-wide project. Without this kind of commitment, an office-wide change is impossible within one year.

In-office communication is essential to overcome resistance to change. People should always be aware of the impact of the change on their workload. It is also important that everyone involved feels that the change brings significant advantages both for themselves and for the whole office.

The StatFin special interest group was very useful in communicating between the project and the statistical departments. It was a way to establish two-way communication within the office, thus enabling fast reactions to feedback and suggestions.

Training a critical mass of in-office users (the content providers for the database) makes it easier to fill a warehouse with data. Training alone is, however, not enough: constant person-to-person support must be available in order to make the process of filling a database as streamlined and secure as possible.

Distributed co-ordination is not always a good thing - at least some kind of editorial board would be needed to co-ordinate the contents and structure of the service. Totally centralised co-ordination is not the answer either, as it lowers the degree of commitment and makes filling and updating the database look like a task for the central co-ordination body alone.

When introducing a new system for dissemination, it is essential to integrate it with the existing tools within the office. This reduces the workload of filling the system with data, thus making it easier for the system to be accepted as a common tool.

Rules and regulations are needed to lower the threshold for accepting the system, but in the long run the system must also prove to be an attractive tool; otherwise it will be rejected.

Hiring people for a one-year period brings a feeling of insecurity, decreasing the effectiveness and motivation of those people as the year draws to its end.


Customers see WWW-based systems as easier to use than mainframe-based ones. This, however, requires the WWW-based user interface to follow the mainstream design rules of similar systems.

A huge collection of data disseminated free of charge is not a threat to a statistical office, but a significant way to build a positive image: content provider in the information society.

On-line dissemination to a target group such as the Finnish people requires very thorough explanations and descriptions of the data, in order to avoid misunderstandings and to make the data suitable for professional usage. The explanations and descriptions should come from centralised metadata systems; otherwise describing the data will be a huge burden for the statistical departments.

The future

The StatFin service, like the other StatLine-based data warehouses in Statistics Finland, will continue to grow after the introduction and implementation project has ended.

The system is in use, in real everyday production, but there is still a need to further develop, enhance and standardise the interfaces and streams for data and metadata, in order to integrate the StatLine system seamlessly into our production processes. This, and reacting to the constant flow of user feedback, will be a significant challenge for the coming years.

References

M. Saijets, P. Toivonen, S.I. Björkqvist, M. Mäkinen, R. Syvänperä, K. Palteisto, J. Kuosmanen (1999): Data architecture at Statistics Finland, Information Technology Services, Statistics Finland. Paper available in English only.

W.J. Keller, J.G. Bethlehem (1998): Between Input and Output, Proceedings of the NTTS'98 seminar, Sorrento, Italy, 1998.


DATA ACCESS VERSUS PRIVACY: AN ANALYTICAL USER’S PERSPECTIVE

Ugo Trivellato
Dipartimento di Scienze Statistiche
Università di Padova
Via San Francesco, 33
IT–35121 Padova
[email protected]

Multae utilitates impedirentur si omnia peccata districte prohiberentur
[If all sins were strictly forbidden, then many useful things would be hampered]

Thomas Aquinas (S. Th. II-II, q. 78, a. 1 ad 3)

1. Introduction

The issue of privacy vs. data access is multi-faceted. In fact, it is at the crossroads where two sets of considerations meet: the first is the traditional concern of official statistical agencies for confidentiality, both for ethical reasons and as a guarantee of maintaining respondents’ collaboration; the second is the more recent and broader concern about the legal protection of individuals with regard to the processing of personal data, that extends to all kinds of personal data and their uses. Moreover, the intensity and focus of these concerns and the ways of handling them vary considerably from one country to another, as they are rooted in each country’s cultural and institutional context1.

The debate on the topic is very much alive and controversial2, especially as a consequence of recent international recommendations or regulations on data protection: Recommendation No. R (97) 18 of the Council of Europe “concerning the protection of personal data collected and processed for statistical purposes” (hereinafter simply referred to as the CE Recommendation)3, and Directive 95/46/EC “on the protection of individuals with regard to the processing of personal data and on the free movement of such data”4. EU Member States have to comply with the Directive, bringing into force national laws and secondary legislation. In several countries the process is still going on, in general or specifically with respect to the processing of micro data5 for statistical and scientific purposes.

1 See Als (1996) for an informative and penetrating review. I give just two examples, relating to France and the Netherlands. In France, data on individuals and families are entitled to strong protection, while this is not the case for “individual information of an economic and financial nature [on firms, which] cannot be used for the purpose of revenue control or for the purpose of economic repression”, but can be used for other purposes (Decree No. 84/628, Article 6: see Buzzigoli, Martelli and Torelli, 1999, pp. 22-27). The option taken by France is not shared by most EU countries: they do not make that distinction or even take the opposite attitude, with higher protection given to data on firms. As for the Netherlands, the vivid sketch provided by Als (1996, p. 11), while commenting on concerns for confidentiality and the comparatively poor survey responses of Dutch people, is illuminating about the role of cultural factors: “Take a walk in the evening in the streets of any small Dutch town. You will be able to observe family life – no shutters, no curtains; yet these same people who literally exhibit themselves will not accept that their sex and their date of birth should form part of an individual identifier! And they reject population censuses.”

2 As an example, an issue of The Economist, 1st-7th May 1999, devoted the cover, the leading article and the special section to “The end of privacy”.

3 See Council of Europe (1997). This is simply a recommendation to Member States, but it is quite interesting because it deals specifically with data protection for statistical and research purposes.

4 See European Parliament and Council of the EU (1995).

I will approach the issue from a particular perspective, which is characterised by two aspects: first, taking the point of view of an analytical user, i.e. of a person wanting to collect and principally process micro data for statistical/research purposes; and secondly, focusing on some basic features of the issue and on their policy implications, rather than on technical aspects and methodological nuances.

The outline of the paper is as follows. The societal role of statistics and research and their intrinsic needs are considered first. Indeed, they are the basis for motivating a ‘credit of confidentiality’ when personal data are (collected and) processed for statistical/research purposes.

Some basic principles and guidelines for regulating this credit of confidentiality, as they are set out in the CE Recommendation and in the EU legislation, are summarised.

Statistical, technological and legal devices, and related practices, for protecting confidentiality while allowing the processing of micro data for statistical/research purposes are briefly reviewed.

The state of affairs at EU level is discussed, as regards both the rules and practices followed by Eurostat and the draft legislation.

The concluding remarks stress the need for additional efforts to make micro databases widely available to researchers.

2. Statistics and research: their societal role and their intrinsic needs

In the context of the legislation on data protection, a balance has to be struck between privacy and the fundamental right to freedom of expression. This right explicitly includes the freedom to receive and impart information (Article 10 of the Convention for the Protection of Human Rights and Fundamental Freedoms, 1950), where the freedom to receive information is considered as implying the freedom to seek information.

The balance is altered when statistics and research come into play, because of their societal function. This function is stated as follows in the CE Recommendation: “The needs in both the public and private sectors for reliable statistics for analysis and understanding of contemporary society, and for defining policies and strategies for making arrangements in practically all aspects of daily life”6. It is precisely because of this function that specific regulations are set out regarding the protection of personal data collected and processed for statistical and research purposes.

Three points deserve attention. First, it has to be stressed that the distinctive characteristic of a statistical purpose is the collective use of micro data. This means that individuals are the necessary medium for the background information, but that they are not regarded as significant in their own right. In fact, starting with the basic material in the form of individual information about many different people, the statistician elaborates results designed to “characterise a collective phenomenon” (CE Recommendation, Article 1). In other words, the statistical result separates the information from the person: personal data are collected and processed with a view to producing consolidated and anonymous information. From this perspective, it is also clear that protecting privacy is in the interest of statisticians, in order to maintain the confidence of respondents and the public and to avoid prejudicing future data supply.

5 Some clarification about terminology is perhaps useful (more in the sequel). ‘Micro’ data are unit record data, i.e. data pertaining to ‘individuals’ (persons, firms, etc.). ‘Personal’ or ‘confidential’ data refers to unit record data relating to an identified or identifiable individual. ‘Anonymous’ data designates micro data pertaining to a non-identifiable individual.

6 See also Jowell (1981) and Reynolds (1993), among many others.

Secondly, from the standpoint of data protection (much) scientific research is similar to, and indistinguishable from, statistics7. This statement holds true both for basic research and for research supporting policies. This point is clearly spelled out in the CE Recommendation (Explanatory memorandum, paragraph 14) with respect to strictly scientific purposes: “Scientific research uses statistics as one of a variety of means of promoting the advance of knowledge. Indeed, scientific knowledge consists in establishing permanent principles, laws of behaviour or patterns of causality which transcend all the individuals to whom they apply. Thus it is aimed at characterising collective phenomena, this being the very definition of statistical results. It could be said, therefore, that research becomes statistical at a certain stage in its development.” Similar arguments apply to research supporting policies: their design, monitoring and evaluation. In this respect, too, the relevant information always relates to collective phenomena and therefore cannot, under any circumstances, entail direct or individualised consequences for individuals.

It is important to add that scientific research calls for an increasing use of micro data. Various factors operate to bring about this tendency8. To be brief, I will simply mention the growing attention paid to individual agents (persons, households, firms), to their heterogeneity, to micro-dynamics and interdependencies; and, for purposes of assistance to policy and decision-making, the focus on distributive features and on programmes targeted to specific groups of agents9.

The third point, and one which is crucial, has to do with access to micro databases. ‘Free’, i.e. reasonably open and equitable, access to micro data is essential to science, as well as to the functioning of a democratic society. Science is an incremental process that relies on open discussion and on competition between alternative explanations. This holds both for basic research and for research supporting policies10. Thus, replication studies are fundamental to science. To this end, it is vital that free access to micro databases be allowed to qualified researchers who are willing to analyse them.

Arguments about the role of official statistics in a democratic society point in the same direction. The principle of impartiality, one of the “fundamental principles of official statistics” adopted by the UN Statistical Commission, as well as of the principles for Community statistics set out in Council Regulation 322/97 (the so-called EU “Statistical law”)11, implies that statistical information should be made accessible to all on a fair basis. Strictly speaking this principle, just like the others, applies to aggregate statistics produced by an official statistical agency, not to micro data. However, it is totally reasonable to extend it to access to micro data for research purposes12, while adopting appropriate measures to protect respondent confidentiality.

7 I do not consider here that part of research, especially in the medical and psychological sciences, that involves personalised feedback. In this area, personalised intervention is basic to research (even though statistical analysis may come into play at a later stage), and this calls for specific ethical and legal rules.

8 Indeed, the impressive development of micro databases, both on households/individuals and on firms/establishments and more recently on linked employer-employee data, from surveys as well as from administrative sources, is also ‘supply’ driven, thanks to the enormous advances in data collection, storage and processing allowed by the computer revolution. I will look only at the ‘demand’ side.

9 I elaborate more on these points in Trivellato (1999). Various contributions to the subject were presented at the Eurostat-Istat Conference on “Economic and social challenges in the 21st century: statistical implications”, Bologna, 5-7 February 1996. See particularly Atkinson (1996) and Malinvaud (1997).

10 Heckman and Smith (1995, p. 93) convincingly argue that “evaluations build on cumulative knowledge”. See also Rettore and Trivellato (1999).

11 See UN Statistical Commission (1994) and Council of the EU (1997).

12 For an extended discussion on the topic, with arguments based on the nature of statistical information as a quasi-public good, the distinction between centralised funding and centralised production, and the theory of bureaucracy and the stimulus of competition, see Behringer, Seufert and Wagner (1998).

One of the implications of this standpoint is that statistical information collected by official statistical agencies has to be treated largely as a public good. Official statistics share some features with public goods and, in addition, collective fixed costs have a dominant role in producing them (Malinvaud, 1987, pp. 197-198). However, official statistical information is not a public good per se: it is clearly possible to discriminate among users, both through pricing and through selective, discriminatory access13. Characterising it essentially as a public good is a normative issue, the result of a choice in a democratic society. The main implication for our case is that access to micro data for all bona fide researchers (possibly subject to registration and appropriate undertakings, where required for confidentiality reasons) should be at no cost or at marginal cost.

3. Where to stop the pendulum? Some basic rules and their interpretation

I have dwelt at some length on these matters, but a sound perception of the guiding principles is important. In my opinion, they convincingly explain why a bona fide analytical user should be given a sort of ‘credit of confidentiality’ in access to micro data.

Clearly, this credit should be reasonable, that is to say sparing in its size, regulated in the light of the technical aspects involved, and accompanied by adequate guarantees. Indeed, as long as data are used to produce statistical results and the results themselves are impersonal, there is no threat of infringement of confidentiality. There are, however, risks of disclosure, both from disseminated statistical information and when processing the micro data. As a consequence, the data might be used for non-statistical purposes, particularly to take decisions or measures in respect of a specific individual. It is in this connection that technical precautions are stipulated and legal guarantees (including penalties) are laid down.

The picture is rather intricate, because various elements of the provisions intersect and overlap: there are international and national regulations; there are aspects pertaining to data protection in general and specifically to the processing of personal data for statistical and research purposes (or just to the processing of such data by official statistical agencies); there are simple recommendations, laws and secondary legislation. In an effort to simplify, I will focus on some basic rules relevant to our specific topic, chiefly as they emerge from the discipline at international level: the CE Recommendation on the one hand and Directive 95/46/EC, complemented by Council Regulation 322/97, on the other14.

13 A remarkable, fortunately unsuccessful, example of such a perspective is provided by the so-called ‘Rayner doctrine’. Some twenty years ago, Sir Derek Rayner was asked to prepare a report to the UK Prime Minister, consisting of a review of the Government Statistical Services with proposals for restructuring, and retrenching, them. A paramount statement of the report was: “[Statistical] information should not be collected primarily for publication. It should be collected primarily because government needs it for its own business” (Rayner, 1980). For a presentation of the Rayner Report and of the severe criticisms expressed within the Royal Statistical Society, see Hoinville and Smith (1982). The UK Government progressively abandoned the Rayner doctrine, up to its drastic reversal with the 1998 “Green Paper” (HM Government, 1998).

First, it is important to clarify the notion of ‘personal (or confidential) data’ and the complementary notion of ‘anonymous data’. According to the CE Recommendation, “‘personal data’ means any information relating to an identified or identifiable individual …. An individual shall not be regarded as ‘identifiable’ if identification requires an unreasonable amount of time and manpower. Where an individual is not identifiable, data are said to be anonymous.” The point is clear and quite important. Essentially, it means that (i) a reasonably low risk of identification in anonymous data sets is accepted, and (ii) provisions for data protection do not apply to anonymous data. However, the notion of identifiability is phrased differently across the various pieces of legislation15; and, what is perhaps more important, the ways of putting these concepts into practice are still largely open. So, implementation is crucial.
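
To give this notion some operational content, a common first step (offered here purely as my own illustrative sketch, not as a procedure prescribed by the Recommendation or the Directive) is to count the records in a file that are rare on a combination of indirect identifiers such as region, age band and occupation. The short Python fragment below does exactly that; the variable names and the threshold k are hypothetical choices.

from collections import Counter

def identifiability_report(records, quasi_identifiers, k=3):
    # Count how often each combination of quasi-identifier values occurs,
    # then flag the records whose combination appears fewer than k times:
    # these are the records most at risk of re-identification.
    keys = [tuple(r[v] for v in quasi_identifiers) for r in records]
    freq = Counter(keys)
    at_risk = sum(1 for key in keys if freq[key] < k)
    return {"records": len(records),
            "distinct_combinations": len(freq),
            "records_below_k": at_risk}

# Toy example: the single South/60-69/surgeon record is unique on the
# chosen key and would be counted as potentially identifiable.
toy = [
    {"region": "North", "age_band": "30-39", "occupation": "teacher"},
    {"region": "North", "age_band": "30-39", "occupation": "teacher"},
    {"region": "North", "age_band": "30-39", "occupation": "teacher"},
    {"region": "South", "age_band": "60-69", "occupation": "surgeon"},
]
print(identifiability_report(toy, ["region", "age_band", "occupation"]))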

As for the provisions governing the processing of personal data for statistical purposes, a crude but useful distinction can be made between positive principles specifically relevant to the domain and derogations from some general norms of data protection. In the first category, it is worth mentioning:

The principle of lawful use of personal data, for statistical purposes only: “Personal data collected and processed for statistical purposes shall serve only those purposes. They shall not be used to take a decision or measure in respect of the data subject, nor to supplement or correct files containing personal data which are processed for non statistical purposes” (CE Recommendation, principle 4.1).

What one might term a ‘principle of parsimony’, which implies that personal data should be rendered anonymous, and identification data (i.e. those personal data that allow direct identification of the individual) should be separated from the data used to produce the statistical results, as soon as it is reasonable (a minimal sketch of such a separation follows this list).

A set of indications about measures to be taken to ensure the security of personal data.

A firm statement about publication: “Statistical results shall be published or made accessible to third parties only if measures are taken to ensure that the data subjects are no longer identifiable on the basis of these results, unless dissemination or publication manifestly presents no risk of infringing the privacy of the data subjects” (CE Recommendation, principle 14.1).
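
As announced above, here is a minimal sketch of the ‘principle of parsimony’ (mine, not mandated by any of the texts discussed): direct identifiers are split off into a separately stored lookup table, keyed by an opaque pseudonym, so that the working file keeps only the variables needed to produce the statistical results. The field names are hypothetical.

import secrets

def separate_identifiers(records, id_fields):
    # Split each record into (i) a working record holding only the analysis
    # variables plus an opaque pseudonym, and (ii) a lookup table mapping
    # pseudonyms back to the direct identifiers, to be stored separately
    # (and destroyed as soon as it is no longer reasonably needed).
    working, lookup = [], {}
    for r in records:
        pseudonym = secrets.token_hex(8)
        lookup[pseudonym] = {f: r[f] for f in id_fields}
        working.append({"id": pseudonym,
                        **{k: v for k, v in r.items() if k not in id_fields}})
    return working, lookup

records = [{"name": "A. Smith", "address": "1 High St", "age": 34, "income": 21000}]
working, lookup = separate_identifiers(records, id_fields=["name", "address"])
print(working)  # analysis file: pseudonym, age and income only
# 'lookup' is kept apart, e.g. for checking, matching or re-contact
# (compare the reasons listed in CE Recommendation principle 11.1).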

Exemptions to general provisions for data protection, when personal data are processed for statistical purposes, include:

The possibility of processing for statistical purposes data that were originally collected for non-statistical purposes (chiefly, for administrative uses), with partial derogations from the obligation to inform the persons involved.

14 I leave aside broader considerations on the overall impact of the EU data protection directive. In passing, I just note that it gives individuals a property right in information about themselves, thus an unprecedented control over such information. Various concerns have been expressed in this respect: the directive has a “frightening … bureaucratic orientation, an orientation which consists in regulating everything down to the slightest detail” (Als, 1996, p. 20); it is doubtful whether the directive can be applied in practice, if too many people try to use it; “broadly enforced, such a property right would be antithetical to an open society” (The Economist, May 1st-7th 1999, p. 13).

15 For instance, the formulation of Council Regulation 322/97 alludes to a somewhat wider notion of identifiability (“To determine whether a statistical unit is identifiable, account should be taken of all the means that might reasonably be used by a third party to identify” it: Article 13; emphasis added). On the contrary, the formulation of the UK Data Protection Act 1998 is decidedly more liberal, because it refers to identification “from those data and other information which is in the possession of, or is likely to come into the possession of, the data controller” (Section 1 (1); emphasis added).

The conservation of personal data, which can be extended “for longer periods [than is necessary for the purposes for which the data were collected or … further processed]” (Directive 95/46/EC, Article 6)16.

Restrictions on the right of any person to access and rectify personal data concerning him/her.

What are the operational implications of these principles for an analytical user? First, one has to consider that they are formulated in fairly general terms. Besides, Directive 95/46/EC, the only act binding on Member States17, essentially confines the discipline on data processing for “historical, statistical or research purposes” to exemptions to general provisions on data protection, to be laid down by national legislation subject to adequate safeguards. Finally, both the CE Recommendation and Directive 95/46/EC are wisely silent on technical and organisational measures for assuring anonymity, confidentiality and secure processing, thus allowing for adaptation to advances in computing and statistics.

The overall consequence is that considerable room is left for different attitudes across the various countries. Harmonisation proceeds, but at a moderate pace. Member States’ provisions remain quite diversified. National legislation and, even more, national practices do matter18. A look at practices is useful.

4. Safe data, safe settings and the Web: devices and practices for preserving confidentiality while allowing access to micro data

Strategies for ensuring disclosure avoidance while enhancing free data access are often classified under the headings of ‘safe data’ and ‘safe settings’. This will be my starting point, but I will argue that due consideration should be given to the most up-to-date technology, including the World Wide Web, and to the overall system adopted to handle data distribution matters19.

16 An interesting specification of reasons for the conservation of identification data is given in the CE Recommendation, principle 11.1: “a) for the collection, checking and matching of the data; or b) to ensure the representativeness of the survey; or c) to repeat the survey with the same people”. The UK Data Protection Act 1998, Section 33 (3), is particularly liberal in this respect: “Personal data which are processed only for research purposes … may be kept indefinitely”.

17 Member States were asked to adopt the legislative and administrative measures necessary to comply with the Directive within three years from its adoption (Article 32). In some countries, however, the process is still under way.

18 I base this statement, and part of the considerations of the subsequent Section 4, on a cursory survey of national legislation and practices in selected countries. For reviews, see Als (1996) on EU countries; Motohashi (1998) on selected OECD countries, especially on longitudinal micro databases on firms/establishments; Buzzigoli, Martelli and Torelli (1999), who survey Australia, Canada, France, Germany, the Netherlands and the US; Biggeri (1999), who deals with the Italian case; Bodin (1999b), who examines Sweden, the Netherlands, the UK and the US. For the US, see also Duncan, Jabine and de Wolf (1993) and Stevens (1998). I directly consulted actual or draft national legislation for Belgium (Draft Law on official statistics, January 2000), Finland (Statistics Act 62/1994, with subsequent amendments up to 1998), France (see CNIS, 1999), Italy (Law No. 675/96 on data protection; Decree No. 281/1999 on provisions for personal data processed for historical, statistical and scientific purposes), Spain (Organic Law No. 15/99 on the protection of personal data) and the UK (Data Protection Act 1998; see also Lloyd, 1998).

4.1. Safe data

The production of ‘safe’ micro-data sets, i.e. data for which factual anonymity is assured, involves a variety of disclosure control measures, applied to the statistical units (sampling and sub-sampling, micro-aggregation, masking, etc.) and/or to the variables (variable suppression, aggregation of modalities and bottom/top-coding, strategies for injecting error, etc.).
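
As a concrete, if deliberately simplified, illustration of two of these measures, the following Python sketch applies top-coding and fixed-size micro-aggregation to a toy income variable. The ceiling and group size are hypothetical choices of mine; production disclosure-control software is, of course, far more sophisticated.

def top_code(values, ceiling):
    # Bottom/top-coding: values above the ceiling are replaced by the
    # ceiling itself, hiding extreme (and hence conspicuous) observations.
    return [min(v, ceiling) for v in values]

def micro_aggregate(values, group_size=3):
    # Fixed-size micro-aggregation: sort the values, partition them into
    # small groups and replace each value by its group mean. This sketch
    # returns values in sorted order and leaves a final undersized group
    # as it is; a real implementation would merge it into the previous one.
    ordered = sorted(values)
    result = []
    for i in range(0, len(ordered), group_size):
        group = ordered[i:i + group_size]
        result.extend([sum(group) / len(group)] * len(group))
    return result

incomes = [18000, 21500, 23000, 40000, 44000, 47500, 310000]
print(top_code(incomes, ceiling=150000))       # the outlier is truncated
print(micro_aggregate(incomes, group_size=3))  # individual values are blurred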

There is considerable experience in several countries of the production and release of anonymised data sets, for a variety of social domains. The most prominent examples within the EU are perhaps the data sets of some general purpose household panel surveys: the German Socio-Economic Panel (GSOEP), the British Household Panel Survey (BHPS), and the European Community Household Panel (ECHP)20.

The release of ‘safe’ data sets deserves some consideration. Not surprisingly, it varies with the risk of respondent re-identification (and with its obvious counterpart: reduction in the information content of the data set). In some cases, when the risk is taken to be practically nil, the data are distributed as ‘public use micro-data files’, released to everybody. In several other cases, when the risk is judged to be small (indeed, the data are considered anonymous) but not negligible, they are released with some restrictions, mainly based on role and licensing. These restrictions vary in degree: at one extreme, “universal access … for all ‘bona fide’ non-commercial users (subject to registration and standard undertakings of non-abuse), at no [or at marginal] cost” (Jenkins, 1999, p. 81), as is the case for BHPS and GSOEP; at the other extreme, quite restricted release procedures based on examination and approval of a research project by an ad hoc board, which are subject to detailed agreements and are expensive.

Overall, my assessment is that variability in these release policies is only moderately related to the risks of disclosure associated with the micro-data sets involved. What it mainly reflects is differences in attitudes – liberal versus restrictive, in short – across countries and statistical agencies.

4.2. Safe settings

The polar approach consists of granting permission for external researchers to have access to micro data held within the ‘safe setting’ of a secure data storage and working area under the control of the official statistical agency (and usually on its premises).

For our purposes it is interesting because it combines a high level of protection, relying on physical and logical restrictions to data access, with the opportunity for the researcher to work on confidential data. On the other hand, it is quite clear that a safe setting implies selective entry (because of restricted access rules, capacity limitations, heavy administrative monitoring, substantial direct and indirect costs to the user), and might impose constraints on the principle of free access for research purposes.

19 The literature on the subject is enormous. A few general references are Eurostat (1996), Willenborg and de Waal (1996), Fienberg and Willenborg (1999). Useful, updated papers were presented at several ad hoc seminars organised by Eurostat: the 3rd International Seminar on Statistical Confidentiality, Bled (Slovenia), 2-4 October 1996 (Statistical Office of the Republic of Slovenia and Eurostat, 1996); the Conference on Statistical Data Protection, Lisbon, 25-27 March 1998 (Eurostat, 1999); the Joint UN ECE/Eurostat Work Session on Statistical Data Confidentiality, Thessaloniki, 8-10 May 1999. The papers presented at this last meeting can be found at http://www.unece.org/stats/documents/1999.03.confidentiality.htm.

20 Extensive information on the first two databases can be found at http://www.diw-berlin.de/soep/e.faltblat.html and http://www.iser.essex.ac.uk/bhps respectively. On the ECHP, see Section 5.

Well-known examples of safe settings for micro databases on firms/establishments include the Center for Economic Studies (CES) of the US Bureau of the Census and the CeReM at Statistics Netherlands21.

It is worth noting that it would be improper to identify safe settings as areas where disclosure avoidance is assured by physical and logical restrictions only. Indeed, ethical and legal guarantees are already used as well. As an example, a potential outside user at the CES must (i) obtain a special sworn status by taking a legal oath not to disclose confidential data, and (ii) process the data at the designated secure site.

4.3. The Web

The scene is changing remarkably, however. New technological developments in communications, computing and statistical software are opening up radically new opportunities. One point has to be stressed. Developments in information technology do not act simply as a threat to confidentiality. Cryptography, database protection, audit systems for statistical databases, statistical data confidentiality methods and software22 are making impressive progress. Their combined use makes it possible to design a system capable of (i) allowing an analytical user to access remote statistical micro databases, over the Internet, and to process them for research purposes, while (ii) providing secure statistical confidentiality control.
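
The following sketch (again mine, not drawn from any of the systems cited below) conveys the flavour of such a design: a remote tabulation service that releases a frequency table only when every cell meets a minimum count, and records every request in an audit trail. The threshold of five and the log format are hypothetical; richer rule sets (dominance rules, secondary suppression, tracking of overlapping queries) would be layered on the same pattern.

from collections import Counter
from datetime import datetime, timezone

MIN_CELL_COUNT = 5   # hypothetical confidentiality threshold
AUDIT_LOG = []       # every query is recorded, whether released or not

def remote_tabulate(records, variable, user):
    # Compute a one-way frequency table and release it only if all cells
    # satisfy the minimum-count rule; otherwise withhold the output.
    table = Counter(r[variable] for r in records)
    released = all(n >= MIN_CELL_COUNT for n in table.values())
    AUDIT_LOG.append({"time": datetime.now(timezone.utc).isoformat(),
                      "user": user, "variable": variable,
                      "released": released})
    return dict(table) if released else None

data = [{"sector": "services"}] * 12 + [{"sector": "mining"}] * 2
print(remote_tabulate(data, "sector", "researcher_42"))  # None: the 'mining' cell is 2
print(AUDIT_LOG)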

Interesting examples of electronic micro-data access/dissemination already exist. See, among others, the Data Liberation Initiative (DLI) developed by Statistics Canada, the Data Access and Dissemination System (DADS) project of the US Census Bureau23, and Nomis in the UK24. But the potential of this approach has still to be fully explored. Indeed, given its capabilities and flexibility, it is likely that in the future it will become the dominant approach.

In this context, a ‘safe setting’ (or a ‘safe network’, as we should perhaps call it) will simply designate a set of rules allowing the processing of (potentially) confidential data within a secure environment, with no reference to a physical location. Rules will consist of technological devices, logical and statistical procedures as well as of ethical and legal guarantees.

4.4. The organisation system for micro-data access for research purposes

There is another, crucial point which deserves attention. In data distribution matters for research purposes, the issue is not confined to raw micro-data access/dissemination. Some other features are essential, or useful:

extensive documentation on the data, with questionnaires, code-books, and other meta-data informing about the data source, data quality, etc.;

21 See McGuckin and Pascoe (1998) and Balk (1998) respectively. Note that the CES has already established several Research Data Centers (this is the name given to its safe settings) scattered across the country.

22 Many papers presented at the Conference on Statistical Data Protection, Lisbon, 25-27 March 1998, dealt with these topics: see Eurostat (1999). See also McClean (1998).

23 See Buzzigoli, Martelli and Torelli (1999), pp. 18-20 and 65-66, and US Census Bureau (1997). As far as I am informed, the micro-data sets distributed through the two initiatives consist only of public use micro-data files.

24 Nomis is an official labour market online database for the Office for National Statistics, which disseminates geo-statistical information to 800 customers. Properly, it is a Geographical Information System (GIS) database, not a micro database. However, it is relevant for our purposes, because the geographical resolution is down to electoral wards (some 10,000 units), with a rather skew distribution of the resident population, employment and unemployment and potential disclosure. See Blakemore (1998).

information and training, especially when new technologies and software are introduced or new micro-data sets are made available;

significant involvement and feedback among analytical users and from them to data producers and handlers (via, e.g., user groups, scientific boards of advisors or ad hoc institutions). This dialogue has positive, cumulative effects in two directions: (i) it provides “significant value added to basic data (new compatible and comparable derived variables and data structures),… with these derived variables deposited and [… further] distributed” along with the basic data (Jenkins, 1999, p. 78), as well as opportunities for exchanges of views about how best to use the data; (ii) it supplies the data producing agency with useful information for improvements in data collection design, production and distribution.

This kind of broader organisation system demands a joint effort by the data-producing agencies and institutions or associations from the scientific community, often using some ad hoc intermediary agency. I do not intend to elaborate any further on this point. It has been persuasively illustrated by Jenkins (1999), with reference to the UK experience with the Data Archive: an agency independent of the various data producers that efficiently handles data distribution matters, coupling this with free access to bona fide analytical users and with a great deal of work on documentation, training, networking, etc.25.

While one aspect – distribution to users allocated to a specialised agency – is to some extent peculiar to the UK, the basic features of the organisational set-up just described are typical of several leading experiences in various countries. Among these are the already mentioned Canadian DLI1 and the Dutch programme based on an agreement between the Central Bureau of Statistics and WSA/NWO (a specialised agency founded by the Netherlands Organisation for Scientific Research with the specific “aim of making the CBS micro data files available for scientific research at reduced cost” and with standard, simplified undertakings on confidentiality2).

To sum up, some of the best practices in free access to micro data for statistical purposes result from a combination of various ingredients: (i) reasonably liberal, flexible legislation, relying more on self-responsibility and codes of conduct than on extensive administrative monitoring; (ii) a sound organisational framework for handling data distribution matters; (iii) an advanced system of secure data access, based on the Web.

25 Extensive documentation on the UK Data Archive can be obtained from http://daww.essex.ac.uk.

1 In the DLI, the partners of Statistics Canada are essentially scientific associations and associations of university libraries; the technical and administrative support is provided directly by Statistics Canada; data dissemination, via the Internet, is mainly to academic institutions, which then make the data available to professors and students for non-commercial uses.

2 WSA/NWO plays an intermediary role, while the release of micro data is handled directly by CBS, by means of a “model contract for external research institutes concerning the multiple use of micro-data sets” (quotations are from this contract). The research institution then asks the individual researcher to sign a confidentiality undertaking. See http://129.125.158.28/wsahomuk.html.

My perception is that, among the EU countries, the most favourable mixture of these ingredients can be found in the UK3. It is also interesting to consider the case of the Netherlands. It exemplifies how considerable efforts on points (ii) and (iii) can partially compensate for rather strict legislation on (and a strong public concern about) confidentiality. Countries relying heavily on detailed legislative provisions and administrative monitoring tend to lag behind. In some of them, however, such as France and Italy, the legislative process is still underway and the issue of access to micro data for statistical and research purposes is currently being given due consideration.

5. The state of the art within the EU

I come now to the state of affairs within the EU and at Eurostat. Council Regulation 322/97 has basically two provisions that are relevant to micro-data access and confidentiality.

(a) Article 17 states that access to confidential data transmitted to Eurostat may be granted by Eurostat itself, provided that it is for scientific purposes and that explicit approval for such access has been given by the Member State which provided the data.

(b) Article 20, while establishing compulsory assistance by the Committee on Statistical Confidentiality4, affirms that the Commission should adopt measures “designed to assure that all the national authorities and [… Eurostat] apply the same principles and minimum standards for avoiding disclosure of confidentiality”.

On the one hand, the Commission and Eurostat are asked to play an active role in the harmonisation of confidentiality rules and “minimum standards”. On the other hand, for the release of harmonised anonymous micro-data sets at EU level, Eurostat has to obtain the approval of every Member State, each for its own national data set. This suggests that Eurostat might be induced by the harmonisation criterion to adopt the standards set up by the strictest State, that is to say to embrace what could rather be called ‘maximum standards’. I see an ambiguity here – a kind of vicious circle – which might hamper positive action by Eurostat in promoting and implementing open-minded harmonised rules about confidentiality. A clear, firm commitment to a policy of liberal access for the scientific community is needed to overcome this risk. This has not been the case, at least until recently.

As for Eurostat’s present data access policy, it is still quite restrictive. As far as I am aware:

In general, access to micro data is allowed only at the safe setting of the Eurostat secure system, mostly for research activities in the interest of the Commission, by Eurostat’s personnel or by consultants contracted by Eurostat.

Dissemination of anonymised micro databases is confined to just one survey, the ECHP (admittedly one of the major advances in the European statistical system).

3 See Jenkins (1999) for an illustration of its positive results with reference to micro-data sets from large-scale household surveys. It is also interesting to mention how the issue of confidentiality is addressed within the online GIS database Nomis (see footnote 24 above). Essentially, it goes through processes of: “1. Formal licensing of users for sensitive data series. 2. Affirmative statements about confidentiality on all outputs where data owners agree the statements. 3. Automating confidentiality rules so that confidential items are clearly flagged. That importantly allows researchers to analyse the data but not to publish or to pass on those cells. 4. Maintaining a full audit trail of all extractions. … 5. Developing a mutually beneficial partnership between data owners and users” (Blakemore, 1998, p. 2).

4 The Committee was set up by Article 7 of Euratom/EEC Council Regulation 1588/90 on the transmission of data subject to statistical confidentiality to the Statistical Office of the European Communities.

The rules for data dissemination are based on research contracts, stipulating strict conditions for data access and use, and on fairly high prices (Marlier, 1999a).

This distribution strategy has been seriously criticised by Jenkins (1999), who vividly contrasts it with the more open and fruitful practices followed in the UK system. I largely share his views. I will just add two comments. First, the data dissemination rules can hardly be justified on confidentiality grounds alone (one should keep in mind that we are dealing with anonymised data sets, to which data protection legislation does not apply). Second, the price system has some unpleasant features: the data are quite expensive, and the very articulated set of prices for various types of users is more like a protectionist tariff system than a means of recovering marginal costs. A propensity for restricted access and extensive administrative control still appears to be at work, with likely adverse consequences for science and policy advice5.

Fortunately, Eurostat has recently taken important steps to reconsider practices and legislation. As far as I know, the topic was put on the agenda of the Statistical Programme Committee in March 1999, in response to the need of research teams selected within the Targeted Socio-Economic Research Programme (TSER)6 to access micro, and possibly confidential, data. A Task Force on Access to Confidential Data for Research Purposes was set up, and a broad, sound approach was adopted. It is well illustrated by one of the key principles developed to guide the work: “The key role of statistical institutes and authorities is to release the maximum amount of information without breaching the confidentiality of the individual respondent. Data have been collected at public expense, often entailing considerable response burdens. It is the duty of official statisticians to ensure that the best possible (safe) use is made of the data”7.

Notable developments have already come about, in two directions: innovations in dissemination practices and the drafting of a new regulation.

As for dissemination practices, the price policy for ECHP micro-data sets is being revised, and substantial reductions are planned. It would be appreciated if the price system were also made simpler, and less segmented by types of users8. Discussion has also begun about the possibility of disseminating an anonymised micro database from the Labour Force Survey (Franco, 1999)9.

5 Patently, this is conjecture, but it is also a legitimate concern. Evidence about the effects of legislation on data protection upon research is scarce. (Indeed, the legislation is recent, and its effects critically depend on the ways it is implemented and enforced.) I am aware of just one study in the medical field, carried out in the US, aimed at determining the effects of state legislation requiring patient informed consent prior to medical record abstraction by external researchers. Its conclusions are the following: “Legislation requiring patient informed consent to gain access to medical records for a specific study was associated with low participation and increased time to complete the observational study. Efforts to protect patient privacy may come into conflict with the ability to produce timely and valid research to safeguard and improve public health” (McCarthy et al., 1999, p. 417).

6 As has been noted, it is paradoxical that research teams selected by EU authorities to carry out TSER projects in the interest of the EU encountered tremendous difficulties in getting authorised access to the relevant data sets.

7 Task Force on Access to Confidential Data for Research Purposes (1999, Explanatory notes), p. 1.

8 Shortly after this paper was written, the 3/99 issue of the EC Household Panel Newsletter appeared, announcing a new pricing policy (Marlier, 1999b). The good news is that the price for the ECHP users’ database has been reduced significantly, roughly by 50%, as of 1 January 2000. Unfortunately, however, the segmented set of prices for various types of analytical users has not been reconsidered.

9 The database comprises micro-data from the Labour Force Surveys conducted in all the Member States of the EU and EFTA, as well as in a number of applicant countries.

The most significant piece of work, however, is the drafting of a Commission Regulation on “access to confidential data for scientific purposes”10. Building on certain key principles, including the one just quoted, the Draft Regulation establishes some important rules:

(a) data access will be granted via two approaches: (a1) release of ‘safe’, anonymised micro-data, and (a2) on-site access to confidential data held within a ‘safe setting’ on Eurostat premises11;

(b) a substantial set of surveys or statistical data sources is listed, which can be accessed under one or the other approach12;

(c) fairly restricted licensing procedures are maintained for both approaches;

(d) an indication is given about costs, which vaguely echoes the criterion of marginal costs13, but stipulates also that “they should not lead to unfair competition with the national authorities”.

First of all, credit is due to Eurostat for the work done and the more open attitude taken on the issue. But the drafting process also offers opportunities for an informed debate and, hopefully, for improvements. My opinion is that these opportunities should be seized, and should involve the wider scientific community. CEIES is a good place to start the discussion.

I offer a few comments to initiate this debate. They are based on two key considerations. It would be convenient to adopt a medium-term perspective, flexible enough to leave room for developments in the procedures that could be implemented. Besides, strategies for preserving confidentiality should rely significantly on self-responsibility and codes of conduct14. Compared with these standpoints, the route taken by the Draft Regulation runs the risk of being too narrow and oriented towards the short term. In other words, the solutions devised are likely to be rigid and might rapidly become obsolete. This calls for some specific remarks and suggestions.

Attention should be given to more liberal licensing procedures. Basically, free access (subject to registration, standard undertakings and codes of conduct) should be granted to anonymised data sets. Legislation on data protection is not an obstacle in this respect. And I do not see any convincing reasons for ruling out this possibility a priori.

The Draft Regulation sets out exactly the approaches that can be used for granting data access: safe data with restricted licensing, and safe settings. Why be so circumstantial? There are two potential shortcomings in that.

10 See Task Force on Access to Confidential Data for Research Purposes (1999).

11 ‘Safe settings’ can also be established on the premises of Member State national statistical institutes, under appropriate conditions.

12 Note that in the case of on-site access to confidential data, data on firms/establishments are subject to stricter conditions than those relating to households and natural persons.

13 “Costs related to the use of the Commission facilities and to the data accessed shall be borne by the researcher” (Article 9). To me, the notion of “costs related to data accessed” is essentially indeterminate.

14 Directive 95/46/EC explicitly deals with codes of conduct (Article 27). Both Council Regulation 322/97 and the Draft Regulation, on the other hand, ignore them. Good examples of general purpose codes of conduct for statisticians are: the Declaration of Professional Ethics, adopted by the International Statistical Institute in 1985 (International Statistical Institute, 1986); the Code de déontologie statistique, adopted in 1984 by the French Association des Administrateurs de l’INSEE; the Code of Practice for Official Statistics, approved in the UK in 1996 (Bodin, 1999a); the Ethical Guidelines for Statistical Practice, recently approved by the Board of Directors of the American Statistical Association (American Statistical Association, 1999). More focused on confidentiality issues are various rules of conduct set up within the UK official statistical system, chiefly the GSS Code of Practice on the Handling of Data Obtained from Statistical Inquiries, issued in 1991, and the statement from the Office for National Statistics on Maintaining the Confidentiality of Data, made on 1 April 1996 (see Office for National Statistics, 1996, and Bodin, 1999b, pp. 5-10).

Rules about the release of anonymised micro-data would tend to be unduly uniform (inter alia, they exclude the release of public use micro-data files). Moreover, the detailed provision of two approaches precludes developments in the most promising direction: secure data access based on the Web.

In the same vein, I have some misgivings about listing by name the databases for which safe release or on-site access is granted. Clearly, it would be better to specify just the types of surveys and the categories of data sources from which micro-data sets could be obtained.

The document is silent about the organisational framework for handling data distribution matters. It would be wise to introduce some guidelines about how to develop a beneficial partnership between Eurostat and the wider scientific community.

A clarification of the price setting policy, definitely geared to the criterion of marginal costs and to uniform prices for analytical users15, is much needed.

6. To conclude

A further, cooperative effort is needed among official statisticians and the wider scientific community. It has to be directed towards the preparatory work for the drafting of liberal laws and regulations, as well as to the design and implementation of sound organisation systems for micro-data access for research purposes.

But it also has to be targeted at more general aims, relating to the perception of these issues by the public. We need to broaden and deepen public understanding of the societal role of statistics and research: for society’s well-being as well as for its democratic life. We need also to extend public understanding of our technical capability and our ethical responsibility to maintain confidentiality.

15 With respect to costs and prices, I find the indication to avoid “unfair competition with national authorities”, given by the Draft Regulation, not quite appropriate. Indeed, a national statistical agency has some sort of monopolistic control over the information it produces, and hence some discretion in setting prices. The risks to be avoided are likely to be rather different from, or at least more diversified than, a threat to competition.

References

Als G. (1996), “Statistical confidentiality in the 15 Member States of the European Union: a critical comparison”, in Statistical Office of the Republic of Slovenia and Eurostat, Third International Seminar on statistical confidentiality, Bled - Slovenia, 2-4 October 1996. Collection of papers, Ljubljana, Statistical Office of the Republic of Slovenia, pp. 9-29 (mimeo).

American Statistical Association (1999), “ASA issues: Ethical guidelines”, prepared by the Committee on Professional Ethics and approved by the Board of Directors, Amstat News, No. 269, November 1999, pp. 9-15.

Atkinson A.B. (1996), “Social and economic change: implications for statistics in the 21st century”, invited paper at the Eurostat-Istat Conference on ‘Economic and social challenges in the 21st century: statistical implications’, Bologna, 5-7 February 1996 (mimeo).

Balk B.M. (1998), “Establishing a Center for research of economic micro data at Statistics Netherlands”, in Proceedings of the International Symposium on linked employer-employee data, 21-22 May 1998, Arlington (VA), U.S. Bureau of the Census (CD-ROM).

Biggeri L. (1999), “Diritto alla privacy e diritto all’informazione statistica”, in Sistan-Istat, Atti della Quarta Conferenza Nazionale di Statistica, Roma, 11-13 novembre 1998, Roma, Istat, Tomo 1, pp. 259-279.

Bodin J.-L. (1999a), Etat des réflexions sur les principes fondamentaux de la statistique publique, Série Etudes n° 2, Paris, AFRISTAT.

Bodin J.-L. (1999b), “Réglementation en vigueur et pratiques en usage: aux Pays Bas, au Royaume Uni, en Suède, aux Etats-Unis”, Groupe de travail entre statisticiens publics et statisticiens privés pour réfléchir à la transposition en droit français de la directive 95/46 du Parlement européen et du Conseil, Paris (mimeo).

Behringer F., W. Seufert and G.G. Wagner (1998), “Problems and examples of dissemination of ‘scientific use of micro data files’ in Germany and elsewhere”, in Proceedings of the International Symposium on linked employer-employee data, 21-22 May 1998, Arlington (VA), U.S. Bureau of the Census (CD-ROM).

Blakemore M. (1998), “Customer-driven solutions in disseminating confidential employer and labour market data”, in Proceedings of the International Symposium on linked employer-employee data, 21-22 May 1998, Arlington (VA), U.S. Bureau of the Census (CD-ROM).

Buzzigoli L., C. Martelli and N. Torelli (1999), “Accesso ai dati statistici individuali: l’esperienza di altri Paesi”, Rapporto di ricerca n. 1999.12, Commissione per la Garanzia dell’Informazione Statistica, Roma (mimeo).

CNIS (Conseil national de l’information statistique) – Groupe de concertation sur la transposition en droit français de la directive européenne 95/46/CE (1999), “Rapport au Ministre de l’économie, des finances et de l’industrie”, Paris, Septembre 1999 (mimeo).

Council of Europe (1997), Recommendation No. R (97) 18 concerning the protection of personal data collected and processed for statistical purposes, adopted by the Committee of Ministers on 30 September 1997 at the 602nd meeting of the Ministers’ Deputies, Strasbourg [with Explanatory memorandum].

Council of the EU (1997), “Council regulation (EC) No. 322/97 of 17 February 1997 on Community Statistics”, Official Journal of the European Communities, 22.02.1997, No. L 52/1-7.

Duncan G.T., T.H. Jabine and V.A. de Wolf (Eds.) (1993), Private lives and public policies: confidentiality and accessibility of Government statistics, National Academy of Sciences, Washington, D.C., National Academy Press.

European Parliament and Council of the EU (1995), “Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data”, Official Journal of the European Communities, 23.11.1995, No. L 281/31-50.

Eurostat (1996), Manual on disclosure control methods, Luxembourg, Office for Official Publications of the European Communities.

Eurostat (1999), Statistical data protection. Proceedings of the conference, Lisbon, 25 to 27 March 1998, Luxembourg, Office for Official Publications of the European Communities.

Fienberg S.E. and L.C.R.J. Willenborg (Eds.) (1999), Disclosure limitation methods for promoting the confidentiality of statistical data, special issue of Journal of Official Statistics, 14 (4).

Franco A. (1999), “Individual data in the Labour Force Survey: dissemination policy”, Working Party on the Dissemination of statistical information, Doc. No. DWP/04/99-III-7-EN (mimeo)

Heckman J.J. and J.A. Smith (1995), “Assessing the case for social experiments”, Journal of Economic Perspectives, 9 (2), pp. 85-110.

HM Government (1998), Statistics: A matter of trust, Presented to Parliament by the Economic Secretary to the Treasure, London, HMSO [known as the “Green Paper”].

Hoinville G. and T.M.F. Smith (1982), “The Rayner review of Government Statistical Services”, Journal of the Royal Statistical Society, A, 145, Part 2, pp. 195-207.

International Statistical Institute (1986), “International Statistical Institute Declaration of Professional Ethics”, International Statistical Review, 54 (2), pp. 227-242.

Jenkins S.P. (1999), “Measurement of the income distribution: an academic user’s view”, in CEIES, Proceedings of the seventh seminar: Income distribution and different sources of income, Cologne-Germany, 10th-11th May 1999, Eurostat, Theme 0 - Miscellaneous, Luxembourg, Office for Official Publications of the European Communities, pp. 75-84.

Jowell R., (1981), “A professional code for statisticians? Some ethical and technical conflicts”, in Bulletin of the International Statistical Institute. Proceedings of the 43rd Session, Buenos Aires, Volume XLIX, Book 1, pp. 165-209 (with discussion).

Lloyd I. (1998), A guide to the Data Protection Act 1998, London, Butterworths.Malinvaud E. (1987), “Production statistique et progrès de la connaissance”, in Atti del Convegno

sull’informazione statistica e i processi decisionali, Roma, 11-12 Dicembre 1986, Annali di Statistica, Serie IX, Vol, 7, Roma, Istat, pp. 193-216.

Malinvaud E. (1997), “Effet des evolutions techniques et des changements de la spécialisation internationale sur les marchés du travail et les systèmes productifs: comment les statisticiens pourraient-ils relever les défis?”, International Statistical Review, 65 (1), pp. 97-109.

Marlier E. (1999a), “The EC Household Panel Newsletter. Editorial”, Statistics in Focus, Theme 3 - 2/1999.

Marlier E. (1999b), “The EC Household Panel Newsletter (3/99). New pricing policy”, Statistics in Focus, Theme 3 - 16/1999.

McCarthy D.B. et al.(1999), “Medical records and privacy: empirical effects of legislation”, Health Services Research, 34 (1), pp. 417-425.

McClean S. (1998), “Statistical microdata, macrodata and metadata on the Web: perspectives”, Proceedings of the 21st SCORUS Conference, Belfast, 8-11 June 1998, Conference Compendium, University of Ulster, pp. 4.5.1-4.5.8.

McGuckin R.H. and G. Pascoe (1998), “The Longitudinal Research Database (LRD): status and usefulness”, Survey of Current Business, November 1998, pp. 30-37.

Motohashi K. (1998), “Institutional arrangements for access to confidential micro-level data in OECD countries”, in Proceedings of the International Symposium on linked employer-employee data, 21-22 May 1988, Arlington (VA), U.S. Bureau of the Census (CD-ROM).

Office for National Statistics (1996), Maintaining the confidentiality of data, London, HMSO.Rayner D. (1980), Review of government statistical services: Report to the Prime Minister, London,

HMSO.

104 9th CEIES Seminar – Innovations in provision and production of statistics

Page 105: GETTING RESEARCH STATISTICS FOR THE MEDICAL Web viewInnovations in provision and production of statistics: the importance of new technologies. Helsinki, Finland, 20–21 January 2000

Rettore E. and U. Trivellato (1999), “Come disegnare e valutare politiche attive del lavoro”, Il Mulino, 48 (385), 1999, pp. 891-904.

Reynolds P.D. (1993), “Privacy and advances in social and political science: balancing present costs and future gains”, Journal of Official Statistics, 9 (2), pp. 275-312.

Statistical Office of the Republic of Slovenia and Eurostat (1996), Third International Seminar on statistical confidentiality, Bled - Slovenia, 2-4 October 1996. Collection of papers, Ljubljana, Statistical Office of the Republic of Slovenia, pp. 9-29 (mimeo).

Stevens D.W. (1998), “Confidentiality revisited: motives and consequences”, in Proceedings of the International Symposium on linked employer-employee data, 21-22 May 1988, Arlington (VA), U.S. Bureau of the Census (CD-ROM).

Task Force on Access to Confidential Data for Research Purposes (1999), “Draft Commission Regulation on Access to Confidential Data for Research Purposes. (i) Draft Commission Regulations. (ii) Explanatory notes”, Working Party on Statistical Confidentiality, Documents Eurostat/A4/SS/21 and Eurostat/A4/SS/22 (mimeo).

------ (1999), “The end of privacy. The surveillance society”, The Economist, May 1st-7th 1999, pp. 13-14 and 19-23.

Trivellato U. (1999), “Progettare un’informazione statistica pertinente”, in Atti della Quarta Conferenza Nazionale di Statistica, Roma 11-12-13 novembre 1998, Roma, Sistan-Istat, 1999, Tomo 1, pp. 49-72.

UN Statistical Commission (1994), Fundamental principles of official statistics, adopted at the Statistical Commission’s 473rd meeting on 14 April 1994, UN Statistical Commission Paper E/CN.3/1993/26, New York, United Nations.

US Bureau of the Census (1997), IT operation plan. Part I. Data Access and Dissemination System, CB-DR-97-02-N, Washington, US Department of Commerce.

Willenborg L. and T. de Wall (1996), Statistical disclosure control in practice, New York, Springer.

9th CEIES Seminar – Innovations in provision and production of statistics 105

Page 106: GETTING RESEARCH STATISTICS FOR THE MEDICAL Web viewInnovations in provision and production of statistics: the importance of new technologies. Helsinki, Finland, 20–21 January 2000

THE OBLIGATION TO PROVIDE INFORMATION AND THE USE OF STATISTICS IN ENTERPRISES

Risto Suominen
Director
Federation of Finnish Enterprises
P.O. Box 999
FIN-00101
[email protected]

Administrative data and administrative registers are widely used in the Scandinavian countries to compile statistics. The aim of using registers is to reduce direct data collection from those obliged to provide information. Statistical law defines this indirect method of collecting data as the primary way of assembling statistical information. Measured by the number of observation units, 93 % of data collection is carried out through statistical registers (Riitta Poukka – Terhi Tuominen: The Burden of Providing Information in Enterprises, Tietoaika 5/1999).

The targets of direct data collection are private persons or households, municipalities and enterprises. The focus here is on the obligation of enterprises to provide information and on their use of statistics.

The obligation to provide information in enterprises

Data are collected from enterprises not only by Statistics Finland but also by other authorities and, for example, by business sector organisations. By far the largest and most important collector is Statistics Finland, which has surveyed its own data collection from enterprises in the years 1988–1994 and, most recently, in 1997.

TABLE 1

Inquiries directed to enterprises and the number of responding enterprises

                                          1997       1994
Inquiries directed to enterprises           73         52
Objects of inquiries                   111 500    178 000

The number of statistical inquiries directed to enterprises has increased remarkably in three years, while the number of enterprises targeted by them has clearly fallen. EU membership, together with growing demand for enterprise statistics, has contributed to the growth in inquiries. According to the latest study, inquiries are more concise and better targeted than before: many are branch-specific and focus on a narrow part of enterprises' activities. The basic sampling principle is that the biggest enterprises in each branch always respond, whereas among SMEs a random sample is drawn, as sketched below. The greatest response burden therefore falls on large commercial enterprises.
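As a purely illustrative sketch of the cut-off design just described (the size threshold, sampling fraction and register entries below are hypothetical, not Statistics Finland's actual parameters):

import random

def draw_sample(register, size_threshold, sme_fraction, seed=1):
    """Cut-off sampling: enterprises at or above the size threshold are
    included with certainty; a simple random sample is drawn from the rest."""
    rng = random.Random(seed)
    large = [e for e in register if e["employees"] >= size_threshold]
    smes = [e for e in register if e["employees"] < size_threshold]
    n = round(sme_fraction * len(smes))
    return large + rng.sample(smes, n)

# Hypothetical register: two large enterprises, three SMEs.
register = [
    {"name": "A", "employees": 1200},
    {"name": "B", "employees": 650},
    {"name": "C", "employees": 45},
    {"name": "D", "employees": 12},
    {"name": "E", "employees": 3},
]
print([e["name"] for e in draw_sample(register, size_threshold=500, sme_fraction=0.34)])
# -> "A" and "B" with certainty, plus one randomly chosen SME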


In 1997 there were 213 000 active enterprises in Finland. While big companies took part in many different statistical inquiries, a small enterprise had only a slight chance of ever being included in Statistics Finland's annual data collection. Every new enterprise, however, provides Statistics Finland with basic information for the company register when it is founded, so every enterprise answers at least one Statistics Finland inquiry in its lifetime.

TABLE 2

Data collections directed to at least 5 000 enterprises

Company register 39 000

of which:

Basic information from new enterprises 20 000

Updating data from sole entrepreneurs 7 500

Income and taxation information directly from agricultural enterprises 9 000

Road transportation of goods, domestic 8 400

Industrial statistics, commodity, fuel and energy x) 5 200

Structural statistics, Service branches 5 000

Structural statistics, Industry and Construction xx) 7 000

x) inquiries made directly to places of work, not to the enterprises

xx) inquiries directed to about 4000 enterprises and to their 3000 places of work

Reference: Riitta Poukka – Terhi Tuominen: The Burden of Providing Information in Enterprises, Tietoaika 5/1999

The collection of basic information for the company register is by far the largest data gathering directed at enterprises. In addition to the 20 000 inquiries sent to new enterprises, a separate register inquiry was directed to 19 000 enterprises in 1997. Of the extensive individual statistics, two were directed to industry and the rest to services, construction, and agriculture and forestry.

The burden of the inquiries

Most of the administrative burden on enterprises arises from taxation and from the administrative procedures connected with the use of labour, and from the costs these entail. Inquiries made for statistical purposes do not account for significant administrative costs. Nevertheless, answering statistical inquiries is felt to be tedious and unpleasant despite the relatively small effort involved, especially by SMEs.

The time and work needed to answer statistical questionnaires vary from one inquiry to another and with the size of the enterprise. The most strenuous were considered to be the financial statistics, the structural statistics, the questions on industrial waste water, and the questions on income and costs in foreign maritime traffic. Estimates of the time used differed greatly between inquiries. Bigger companies were estimated to spend more time answering than smaller ones, but given the resources available to them, the strain was naturally felt more in the smaller companies.

The frustration felt especially by smaller companies may have something to do with the fact that they very seldom use statistics in their own activities. Their market area can be very limited or their products very specific, so that statistics are of little practical support to the business. Data collection is therefore felt solely as a burden, with no corresponding benefit in developing the enterprise.

TABLE 3

The average burden caused by inquiries and the sizes of statistical samples

Inquiry                                         Sample size   Burden per response   Annual burden   Average interval of inclusion in the sample
Structural statistics of Industry and
Construction (annual statistics)                      3 748            5 h 32 min      5 h 32 min    9 yrs 7 mo
Service structures (annual statistics)                3 794            2 h 25 min      2 h 25 min    4 yrs 9 mo
Volume index of industrial production
(monthly statistics)                                  1 157                43 min      8 h 31 min    5 yrs 1 mo
Road transportation of goods
(quarterly statistics)                                1 857                58 min      3 h 52 min    3 yrs 9 mo
Accommodation statistics (monthly statistics)         1 444             2 h 6 min     25 h 13 min    19 yrs

Reference: Riitta Poukka – Terhi Tuominen: The Burden of Providing Information in Enterprises, Tietoaika 5/1999
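The annual figures in Table 3 follow from simple arithmetic: annual burden is roughly the burden per response times the number of responses per year. A minimal check (the small gap for the monthly volume index is presumably rounding in the published per-response figure):

# Annual burden = burden per response (minutes) x responses per year.
RESPONSES_PER_YEAR = {"annual": 1, "quarterly": 4, "monthly": 12}

def annual_burden_minutes(minutes_per_response, periodicity):
    return minutes_per_response * RESPONSES_PER_YEAR[periodicity]

total = annual_burden_minutes(43, "monthly")  # volume index of industrial production
print(f"{total // 60} h {total % 60} min")    # 8 h 36 min; Table 3 reports 8 h 31 min,
                                              # consistent with a rounded 43-minute figure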

The use of statistical information

It is difficult to find objective instruments for measuring to what extent, and what kind of, statistics enterprises use to develop their activities. Statistics Finland's user statistics give some indication of how much statistical output is used, although they naturally cannot give a clear and unambiguous picture.


TABLE 4

Users of statistical library 1997 – 1998

                                          1997      1998
Users                                   38 102    38 400
of which:
  visitors                              15 625    13 560
  telephone clients                     18 698    18 620
  letter, fax and e-mail clients         3 780     6 220

The number of users of Statistics Finland's statistical library has remained about the same over the last two statistical years, but there has been a distinct change in the user structure. The number of visitors has clearly fallen, while the number of customers communicating by letter, fax and e-mail has grown remarkably, apparently driven by a rapid increase in e-mail use. The number of telephone customers remained unchanged over these years, and it can reasonably be assumed that letter and fax contacts have not increased; the shift has been towards the new information techniques, above all e-mail.

In chargeable services, the share of enterprise clients in 1998 was nearly a third. The largest user group was the State, with its various departments and ministries; the third notable user group was the municipalities.

FIGURE 1

Income from the chargeable services provided by Statistics Finland according to groups of clients, 1998


Among users of Statistics Finland's chargeable services, the most satisfied groups were schools and educational establishments; less satisfied were the offices and institutions of municipalities. The use of the internet as an information channel interested most users of the chargeable services.

TABLE 5

Internet-service 1997 – 1998

                                                             1997      1998
Internet searches for text files from the pages
of Statistics Finland (daily)                               8 000    18 000
Visitors to internet pages (weekly)                         4 500     7 500

Statistics Finland opened its internet pages in February 1995, and their use has grown rapidly. In 1998 Statistics Finland already had about 10 000 internet pages in use, and the number of users doubled from 1997 to 1998; last year the pages received 7 500 visitors weekly. The most significant users of the internet service were employees of public organisations; employees of big companies, students, researchers and journalists were also notable user groups. The internet will presumably become clearly the most significant channel for using statistical information in the future. So far, however, small and medium-sized enterprises have hardly discovered the possibilities the internet offers for using statistical information, and their use has remained rather insignificant.

CONCLUSIONS

Burden as information provider

- In Finland, every enterprise has to answer the statistical inquiries of Statistics Finland at least once.
- A large number of statistical inquiries are directed to enterprises every year.
- A smaller enterprise is very seldom picked as part of a sample for the statistical inquiries of Statistics Finland.
- The tediousness of answering differs greatly from one inquiry to another.
- Enterprises find responding to statistical inquiries frustrating.
- Statistics Finland has tried to encourage response by giving feedback on how the enterprise stands in relation to other enterprises.
- The use of electronic information instruments is still rather insignificant.
- Enterprises hope for forms that are easier to fill out, and for more instructions on how to fill them out.

Use of statistical information

- Small enterprises seldom use statistical information in developing their own activities.
- Use of statistical information via the internet is expanding rapidly.
- Employees of the public sector, students, journalists and the employees of large enterprises are the most eager users of the internet for statistical information.
- It pays to support internet-based use of statistical information by increasing the supply and the flexibility of its use. Pricing requires careful consideration, bearing in mind that the use of the internet most probably adds to the use of other forms of statistical information.

Sources:

Riitta Poukka – Terhi Tuominen: The Burden of Providing Information in Enterprises, Tietoaika 5/1999
Statistics Finland: Annual Report 1998


STATISTICS: XML-EDI FOR DATA COLLECTION, EXCHANGE AND DISSEMINATION

Wolfgang Knüppel
Eurostat
Unit A-2: Information and communication technologies for the Community Statistical System
Bâtiment Jean Monnet
Rue Alcide de Gasperi
L-2920
[email protected]
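As a purely illustrative sketch of the idea in the title (encoding statistical observations in XML for collection and exchange between partners), the following uses Python's standard library; all element and attribute names are hypothetical, not an actual Eurostat XML-EDI schema.

import xml.etree.ElementTree as ET

# Hypothetical schema: a data set holding one monthly index series.
dataset = ET.Element("DataSet", id="IND_PROD", frequency="M")
series = ET.SubElement(dataset, "Series", refArea="FI", unit="INDEX")
for period, value in [("1999-11", 104.2), ("1999-12", 105.1)]:
    obs = ET.SubElement(series, "Obs", period=period)
    obs.text = str(value)

# Serialise for transmission; the receiver can parse it back with ET.fromstring.
message = ET.tostring(dataset, encoding="unicode")
print(message)

Because the message is plain, self-describing text, the same file can in principle serve data collection, exchange between institutes and dissemination on the web.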


THEME 3: AVAILABLE SOFTWARE


IT POLICY FOR THE ESS
REPLY FROM EUROSTAT

Daniel Defays
Eurostat
Unit A-1: Computerised management of information systems
Bâtiment Jean Monnet
Rue Alcide de Gasperi
L-2920
[email protected]

Perfect timing

The seminar, organised by the "Innovation in provision and production of statistics" Subcommittee of the CEIES, is particularly welcomed by Eurostat. The importance of new technologies is clearly growing for statistical institutes and calls for a more coordinated approach at European level. Furthermore, the timing of the seminar is highly appropriate. A colloquium on a similar subject was organised last September in Luxembourg and its conclusions will be discussed at the next meeting of the SPC. The contributions to this ninth CEIES seminar will make it possible for Eurostat to complete the picture already formed through discussions with the NSIs. The viewpoints of data-producers, users and scientists expressed in the course of the two days will be included in the reflections; this will make it possible to have a consolidated view of the expectations of all the partners in the construction of the ESS.

Why is an IT policy needed?

Why is an IT policy needed? In the past, IT was mainly considered by NSIs as a private matter - an obvious application of the subsidiarity principle. Has something changed to justify a more co-operative approach in this area? The seminar clearly demonstrated the emergence of forces which can justify a more active role on the part of Eurostat: user demand, the similarity of data-providers' concerns, the necessity to organise the ESS as a network, the common pressure of a dynamic technological environment and, of course, the European integration process and the market. Given these common pressures, the scarcity of resources and the complexity of the problems, an IT policy for the ESS seems desirable.

Common demands of users

Users want higher quality (timeliness, accuracy, relevance, etc.) as pointed out by Mr Androvitsaneas and other speakers. Data must be available on specific dates, to everybody. The web is becoming the most natural way to access information and information needs are getting broader; access to macro data is not sufficient anymore; the research community wants disaggregated information. These requirements have to be dealt with by most NSIs and they will probably evolve in the future. Common solutions could be considered, as illustrated by the interest created by experiences of some Member States (e.g. creation of web sites and implementation of data-warehouses).

Similarity of data-providers' concerns

At the same time, data-providers are increasingly reluctant to deliver information to administrations and there is growing concern about the protection of privacy.

The same path followed by NSIs


NSIs are trying to integrate their information systems, which were mainly production-oriented in the past (as illustrated by Giovannini, for instance), to make intensive use of administrative sources (Finland, France) and to promote the use of standards in data exchanges. They share experiences and items of software (PC Axis, Blaise, StatLine, etc.).

The European integration process

The construction of Europe and the implementation of a statistical work programme also call for a more collegial approach. The ESS needs a technological backbone. A common legislative framework (Giovannini) is creating ideal conditions for co-operative development. The expectations of the market are high: a feeling of European citizenship is emerging, globalisation is blurring national boundaries, and the enlargement process, with the arrival of several new Central and Eastern European countries, is creating new demands for cooperation.

A technological boost

The essential role played by technology in the convergence process cannot be ignored. The e-society is imposing its standards on everybody and it is not a coincidence that many statistical institutes are simultaneously considering the potential benefits of new tools or new concepts like data-warehouses, object-oriented techniques or web interfaces.

The strengths of the ESS

Is the ESS able to meet these new challenges? Are we well enough equipped to use the new technologies to cope with new demands? Our strengths reside mainly in the highly experienced staff of the statistical institutes, in the impetus given by European integration and in the similarity of our production and dissemination objectives.

Where are we?

A lot has already been achieved: common R&D development co-financed by Community funds, a high level of involvement of the ESS in standardisation activities (GESMES, RDRMES, CLASET), development of a common architecture to enable the distribution of statistical services (DSIS), common tools to exchange information (Stadium, Statel, etc.), common developments in sectors where European integration is well advanced (Edicom) and the launching of a programme of exchange of technology (ETK seminar held last year in Prague). At the same time, Member States have organised cooperative action on a bilateral or regional basis and some have put software at the disposal of others. The ingredients for a more ambitious IT policy for the ESS are present.

Difficulties

Enthusiasm should not hide the difficulties. More in-depth cooperation in the field of IT has not been possible in the past for several reasons. The lack of an explicit IT policy is not the only explanation. NSIs are independent organisations, with different traditions and cultures. Their sizes differ and they are part of administrative systems with an existing legacy. This creates inertia and special conditions which mean general solutions are not always appropriate.

The ESS IT policy


An ESS policy has to take this diversity into account. It has to be a mixture of prescriptive action, recommendations and enabling activities.

The first priority is to provide an organisational framework where strategic questions can be discussed, priorities identified and resources allocated. Eurostat has therefore proposed to create an IT Steering Committee which coordinates other sectoral activities in the traditional areas of R&D, metadata and reference environments, data exchange and transfer of technology.

The priority given in the past to R&D activity, based on a competitive approach where NSIs are invited to respond to Eurostat calls for proposals, will remain. The results of the 1999 call are very encouraging and the participation of NSIs has substantially increased.

In the area of standardisation and data exchange, a more prescriptive approach seems to be needed. In order to be efficient, standards have to be applied by everybody. Member States have asked Eurostat to promote the generalised use of GESMES in its data exchanges with NSIs. Efforts will also continue to improve the harmonisation of metadata. This is seen as a prerequisite for better integration of our information systems.
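In practice, applying a standard "by everybody" means agreeing on shared keys and code lists that each partner's system can check mechanically. A minimal sketch, with hypothetical field names and code lists (GESMES itself is an EDIFACT-based message format, not shown here):

from dataclasses import dataclass, asdict

@dataclass
class SeriesKey:
    frequency: str   # e.g. "M" for monthly
    ref_area: str    # e.g. "FI"
    indicator: str   # e.g. "IND_PROD"

def invalid_fields(key, code_lists):
    """Return the fields whose values are not in the agreed code lists."""
    return [field for field, value in asdict(key).items()
            if field in code_lists and value not in code_lists[field]]

CODE_LISTS = {"frequency": {"A", "Q", "M"}, "ref_area": {"FI", "SE", "DE"}}
print(invalid_fields(SeriesKey("M", "FI", "IND_PROD"), CODE_LISTS))  # []
print(invalid_fields(SeriesKey("W", "FI", "IND_PROD"), CODE_LISTS))  # ['frequency']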

Finally, the transfer of technology and know-how between the various partners of the ESS will be organised in a more systematic way. For instance, centres of excellence will be created, seminars organised and best practices documented. The areas of e-commerce, data-warehousing and metadata have been identified as topics of high interest and should be covered as a priority.

These are the first achievements of a more elaborate ESS IT policy. Clearly, this is not enough to meet all the expectations expressed in the seminar.

The way forward

In the medium term, we will together have to consider defining a framework to facilitate the exchange of software components, organising common product management, giving some of our web sites a common European look and, why not, creating some kind of ESS portal.

Clearly, in the area of technology itself, the wise choice is not to force integration. The national institutions have peculiarities which call for specific approaches, and the internal organisation of the IT departments depends on the centralised or decentralised character of the statistical offices. This does not prevent us from exchanging experiences and from organising ourselves so that the efficiency of our different organisations can be compared.

Other issues raised during the seminar

Other issues were raised during the seminar: the pricing policy of Eurostat for the dissemination of data, restriction of access to microdata, and the need to continue the harmonisation of business registers and to consider unique identifiers. These go far beyond IT policy and are not discussed in this reply.

Success

The seminar was a success. It has paved the way for future progress in European integration and for better partnership in the IT sector. Eurostat thanks everyone involved for their active contribution to this interesting event.


SUMMING UP BY THE CHAIRMAN OF THE SUBCOMMITTEE

Patrick Geary
Economics Department
National University of Ireland, Maynooth
Maynooth
Co. Kildare
[email protected]

The importance of new technologies in the provision and production of statistics has been emphasised in many of the contributions to the seminar.

There was general agreement that there are great benefits to both data producers and data users in the area of dissemination and production. As far as production of data is concerned, the benefits to quality and timeliness of data were highlighted. In addition, the beneficial effects of the introduction of new technologies on the internal organisation of National Statistical Institutes (NSIs) were described, but the challenges to NSIs were also stressed.

Among the other issues addressed were the following:

A continuing challenge for the NSIs is to find suitable methods or incentives to overcome the reluctance of some data providers to use new technologies, in order to eliminate the costs of operating multiple systems.

There was an active discussion on data confidentiality and access to micro data. It was stated that access to anonymised micro data is too restricted and that new regulations should be framed to increase access, to support the research community, which in turn would give more public support to initiatives like the ECHP.

But as far as access to micro data of enterprises is concerned, data confidentiality requirements were regarded as effectively prohibiting access.


LIST OF PARTICIPANTS

Eurostat
Daniel Defays
Wolfgang Knüppel
Nicole Lauwerijs
Tapio Leppo
Joseé Nollen

Statistics Finland
Sven I. Björkqvist
Heli Jeskanen-Sundström
Marika Laiho
Eero Paananen
Timo Relander

Austria
Dieter Burget, Östat
Erich Hille, Austrian National Bank
Gerhard Kaltenbeck, Austrian National Bank
Joachim Lamel, Austrian Federal Economic Chamber
Josef Richter, Austrian Federal Economic Chamber
Günther Zettl, Östat

Belgium
Claude Cheruy, Institut National de Statistique
Claude Delannoy, Institut National de Statistique
Frans Desmedt, Institut National de Statistique

Czech Republic
Ebbo Petrikovits, Czech Statistical Office

Denmark
Hermann Pfeifer, European Environment Agency

Estonia
Eda Fros, Statistical Office of Estonia

Finland
Auli Jaakkola, Confederation of Finnish Industry and Employers
Lasse Lakanen, National Board of Customs
Anu Muuri, National Research and Development Centre for Welfare and Health (STAKES)
Risto Suominen, Federation of Finnish Enterprises
Pekka Tanhua, National Board of Customs
Jussi Varjus, SAS

France
J.P. Grandjean, INSEE

Germany
Christos Androvitsaneas, European Central Bank (ECB)
Günter Kopsch, Federal Statistical Office
Steven Smith, European Central Bank (ECB)
Doris Stärk-Rötters, Federal Statistical Office


Hungary
Tamás Koltai, Hungarian Central Statistical Office
Imre Pap, Hungarian Central Statistical Office

Ireland
Patrick T. Geary, NUI Maynooth
Margaret Mcloughlin, Central Statistical Office

Italy
Enrico Giovannini, ISTAT
Ugo Trivellato, University of Padua
Giovanni D'Alessio, Bank of Italy
Augusto De Paolis, Bank of Italy
Giulio Barcaroli, ISTAT
Viviana Egidi, ISTAT
Gerardo Giacummo, ISTAT
Silvio Serbassi, ISTAT
Alberto Sorce, ISTAT

Latvia
Lilita Laganovska, Central Statistical Bureau
Arvids Avotins, Central Statistical Bureau

Lithuania
Rimvydas Ignatavičius, Statistics Lithuania

Luxembourg
Robert Weides, STATEC

The Netherlands
Marton Vucsan, Statistics Netherlands

Norway
Tore Eig, Statistics Norway

Poland
Stanislaw Sieluzycki, Central Statistical Office

Portugal
Daniel Bessa, AURN
Ana Lucas, Instituto Nacional de Estatistica

Romania
Alexandru Brodeala, National Commission for Statistics
Victor Dinculescu, National Commission for Statistics
Gheorge Emanoil Vaida-Muntean, National Commission for Statistics

Slovenia
Julija Kutin, Statistical Office of the Republic of Slovenia
Erna Miklič, Statistical Office of the Republic of Slovenia

Spain
Pedro Tena, State Secretariat for Transport and Infrastructures

Sweden
Kaisa Ben Daher, Statistics Sweden
Pekka Koski, Statistics Sweden
Gunnar Olsson, Statistics Sweden
Anders Törnqvist, Comfact Ab
Björn Walters, Statistics Sweden


Switzerland
Oliver Lorenz, Swiss National Bank

United Kingdom
Derek Miles Andow, Office for National Statistics
Ed Bin, Primark Corporation
Richard Brent, Primark Datastream
James L.T. Denman, Office for National Statistics
Nick Dyson, Dept of Social Security
Clive Jerome, Office for National Statistics
