Internet dissemination . Part I. Storing and retrieving data.
Disseminating statistics: Internet and Publications
Madrid, 3-5 March 2008
Storing data. Structure for dissemination
• Different data storage formats
• Data retrieval and presentation
European Statistical Training Programme
Different data storage formats
Different designs for cities
Chaotic urbanization (old towns)
Madrid (Spain), city centre; Toledo (Spain), city centre
Different designs for cities
Organized development (new districts and towns)
Madrid, a modern district; Manhattan, NY
Different designs for cities
New cities can be designed in a ‘structured’ way (Tres Cantos, Spain, 1970; Brasilia, Brazil, 1956), but ...
... most often, old and new urban districts must coexist (Madrid)
And what about storing statistical data for dissemination?
Is there a best solution?
Introduction
Questions to answer
When the dissemination stage begins
Different storage formats
SDMX-ML Standard
Issues to address
The role of metadata
Document structure normalisation
Example of an application with unstructured data
Example of a tool for structuring data
• The role of metadata
• Experiences of document structure normalisation applied to statistical dissemination. Ordinary files to multidimensional databases
• Example of tools for structuring information to be disseminated: PX-Make
Introduction
• How to structure statistical information in order to disseminate it better
– and therefore metadata,
– different structuring and storage formats,
– and some information technologies for dissemination
• and try to answer some questions:
– Must we always think big?
– Should we use the latest and most powerful dissemination technology?
– Must we try to use one single technology?
– Are the DW systems the best, or the only, dissemination technology?
Questions to answer
• Anticipating the content of the presentation…
• The INE’s answer to those questions is based on our own experience, and on our own restrictions (resources, time, etc.)
• and the answer to all of them is NO, because:
– Some systems may demand a large previous investment of time and resources, and then not be sufficiently dynamic
– Each type of statistical information may require a different dissemination technology
– And because it is possible to group data from very different statistical operations under a single “brand and aspect” (INEbase in our case), applying different dissemination technologies while also aiming for very similar interfaces for end users
Questions to answer
• The day that we finish tabulating a survey or statistics, it seems as though the work is done, but…
• There is still time and work to do before the statistics are disseminated. This is our situation:
– 1 day for the press release (on paper, fax and the Internet)
– 1 day to post all the content from the tables database and the time series database on the Internet
– 10 days to edit a diskette or CD-ROM, including replication
– 1 month for the book
When the dissemination stage begins
• In order to comply with these deadlines and shorten the time taken by editorial operations ...
• It is necessary to begin the dissemination project a long time in advance of the tabulation process
• Therefore,
– it is useful for us to know as much as possible about data, metadata, formats, methods, dissemination techniques and standards
– it is to this that we dedicate part of this presentation
When the dissemination stage begins
• Unstructured formats, not enriched with metadata, not particularly focused on computer processing or statistical dissemination
– easy, quick and cheap to produce
– poor informative content
– very limited computer processing
• Structured formats, programs and standard methods
– less easy or cheap
– quick to produce (…it can be obtained)
– rich dissemination media, secure and stable
– with a guarantee of being able to address new requirements
– easy to automate dissemination processes
Document structure normalisation
Document structure normalisation
Example of a non-structured document or file: a press release (.doc or .pdf)
Possible ‘processing’ of this document or file:
• Reading
• Printing
• Or ‘photocopying’
• Basically, what we will see in this presentation is that, as the degree of structuring chosen increases:
1. The visual and presentation performance of the format we are using will increase
2. The complexity and the human and economic cost of implementing that solution will increase
• We will also see that, on a website (ours in this case), different formats can share the same storage system without problems, each being used for different purposes or types of statistical products
Document structure normalisation
• Unstructured formats:
– Not enriched with metadata (there will be a different session dealing exclusively with metadata)
– Not particularly focused on statistical dissemination
• Adobe Acrobat PDF
• Text and spreadsheets
• Static HTML pages
– We can certainly “get by” with them
Document structure normalisation
• PDF, XLS … (no comment will be made about them)
Use of static HTML pages online and in offline publications
• Advantages
– Documents with a statistical table aspect can be “shown” using the tags <table>, <tr>, <th>, <td>...
– There are many functions available for formatting text, although not so many for organising tables
– Both static and dynamic HTML pages can be created (dynamic pages are usually generated with the help of “CGI” type programs which send logical queries to databases and file servers, or with other online data access technologies such as ASP or PSP)
– And all Office tools offer the possibility of editing static HTML pages.
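As a rough illustration of what a “CGI” type program does when it generates a page, the sketch below builds the HTML text of one table from data held in memory. The function name `render_table` and the sample labels are hypothetical, and the 9.0 cells are placeholders in the style of the examples in this presentation, not real figures.

```python
from html import escape


def render_table(title, column_headers, rows):
    """Return the HTML text of a page showing one statistical table.

    rows is a list of (row_label, values) pairs.
    """
    lines = ["<html>", "<head>", f"<title>{escape(title)}</title>",
             "</head>", "<body>", "<table>"]
    # Title row spanning the label column plus every data column
    lines.append(
        f'<tr><th colspan="{len(column_headers) + 1}">{escape(title)}</th></tr>'
    )
    header_cells = "".join(f"<th>{escape(h)}</th>" for h in column_headers)
    lines.append(f"<tr><th></th>{header_cells}</tr>")
    for label, values in rows:
        cells = "".join(f"<td>{v}</td>" for v in values)
        lines.append(f"<tr><th>{escape(label)}</th>{cells}</tr>")
    lines += ["</table>", "</body>", "</html>"]
    return "\n".join(lines)


# Placeholder data, not real statistics
page = render_table(
    "Population aged 16 and over by activity status",
    ["Active", "Inactive"],
    [("Total", [9.0, 9.0]), ("16 to 19 years", [9.0, 9.0])],
)
print(page)
```

A real dynamic page would obtain `rows` from a database query or a file on the server instead of a literal list.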
Document structure normalisation. Unstructured formats
Use of static HTML pages online and in offline publications
• Disadvantage
– HTML enables us to “show” metadata,
– but it does not manage them as metadata (that is, as conventions regarding the meaning of the different parts of the information, which would be useful for computer presentation processes); it treats them as just another piece of text
Document structure normalisation. Unstructured formats
• Example of HTML source code:
<HTML>
<HEAD>
<TITLE>Población de 16 y más años por sexo, grupos de edad (4) y relación con la actividad económica.</TITLE>
</HEAD>
<BODY>
<TABLE>
<TR ALIGN=LEFT>
<TH COLSPAN=8>Población de 16 y más años por sexo, grupos de edad (4) y relación con<BR>
la actividad económica.</TH>
</TR>
<TR ALIGN=LEFT>
<TH> </TH>
<TH VALIGN COLSPAN=1>Población &gt; 16 años</TH>
<TH VALIGN COLSPAN=1>Activos</TH>
<TH VALIGN COLSPAN=1>Ocupados</TH>
<TH VALIGN COLSPAN=1>Parados</TH>
<TH VALIGN COLSPAN=1>Parados que buscan primer empleo</TH>
<TH VALIGN COLSPAN=1>Inactivos</TH>
<TH VALIGN COLSPAN=1>Población contada aparte</TH>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT VALIGN=TOP>Ambos sexos</TH>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT VALIGN=TOP>Total</TH>
<TD>9.0</TD>
<TD>9.0</TD>
<TD>9.0</TD>
</TR>
....
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT VALIGN=TOP>De 16 a 19 años</TH>
<TD>9.0</TD>
<TD>9.0</TD>
</TR>
</TABLE>
</BODY>
</HTML>
Document structure normalisation. Unstructured formats
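To make the disadvantage concrete, here is a minimal sketch that reads a shortened version of the HTML example with Python's standard `html.parser` module. The class name `CellCollector` and the shortened page are assumptions made for illustration. All the program recovers is a flat list of header and data cells: nothing in the markup says which header is the table title, which is a variable and which is a value.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect (tag, text) for every <th> and <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._tag = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case
        if tag in ("th", "td"):
            self._tag, self._text = tag, []

    def handle_data(self, data):
        if self._tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            self.cells.append((tag, "".join(self._text).strip()))
            self._tag = None


# Shortened, hypothetical variant of the example table
page = """<TABLE>
<TR><TH COLSPAN=3>Población de 16 y más años</TH></TR>
<TR><TH> </TH><TH>Activos</TH><TH>Ocupados</TH></TR>
<TR><TH>Total</TH><TD>9.0</TD><TD>9.0</TD></TR>
</TABLE>"""

collector = CellCollector()
collector.feed(page)
for tag, text in collector.cells:
    print(tag, repr(text))
```

The title, the column headings and the row label all come back as plain `th` text, so any further interpretation has to be guessed by the program that reads the page.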
Structured formats (based on metadata):
– Standards promoted by official statistical institutions, EDIFACT/GESMES, SDMX ...
– ‘De facto’ standards for disseminating statistical data (“pseudo OLAP” readers: PC-Axis, SuperTABLE, EVA, Navidata, Beyond 20/20 ®)
– Conventional databases, with the capacity to store data and metadata, and to dynamically generate the required information
– Multidimensional systems, or OLAP as they are actually called, for storage and dissemination of data: Data Warehouse approach
Document structure normalisation
Is it necessary to spend time and resources structuring statistical data files or creating costly databases for dissemination?
1.- Automating data presentation tasks and achieving productivity gains will only be possible when the structure of the files generated is widely recognised, stable, repetitive … All of which will aid editing tasks, irrespective of the medium (paper or electronic and online publications)
2.- Presentation logic will respond to a metadata model, and metadata may be used to reinforce search functions
3.- Clear metadata documentation simplifies communication between services, producers and the dissemination unit, and enables concurrent working
4.- and it will subsequently facilitate the communication of data between organizations, or to individuals, Web Services, content syndication environments...
Structured formats. The key question: the role of metadata
• One way or another, using metadata will bring us closer to a matrix model
• However, how easy is it to structure tables in multidimensional matrix form reflecting possible variable crosses, based on metadata used to describe them?…
• Not always easy, sometimes “cubist art” (using cubes) is required for complex tables such as this one:
Structured formats.The key question: the role of metadata
• This “cubist art” demands that, besides concerning ourselves with metadata, we focus on clearly identifying matrices resulting from the tabulation process and which are valid for dissemination systems.
• Sometimes it is necessary to manipulate a tabulated matrix
• Dividing it into several matrices
• Combining a classification variable with a counting one
• Concatenating classification variables
• Combining previous actions
• It should be recognised this may entail an added workload
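Two of the manipulations listed above can be sketched on a small matrix held as a Python dictionary keyed by tuples of variable values. The representation, the function names `split_by` and `concatenate`, and the sample variables are all assumptions made for illustration; the 9.0 cells are placeholders.

```python
def split_by(cube, dims, dim_name):
    """Divide one matrix into several, one per value of dim_name."""
    pos = dims.index(dim_name)
    rest = [d for d in dims if d != dim_name]
    parts = {}
    for key, value in cube.items():
        sub_key = key[:pos] + key[pos + 1:]
        parts.setdefault(key[pos], {})[sub_key] = value
    return rest, parts


def concatenate(cube, dims, first, second, sep=" / "):
    """Merge two classification variables into one combined variable."""
    i, j = dims.index(first), dims.index(second)
    new_dims = [d for d in dims if d not in (first, second)]
    new_dims.append(first + sep + second)
    merged = {}
    for key, value in cube.items():
        kept = tuple(k for n, k in enumerate(key) if n not in (i, j))
        merged[kept + (key[i] + sep + key[j],)] = value
    return new_dims, merged


# Placeholder matrix: three classification variables, one counting variable
dims = ["sex", "age group", "activity"]
cube = {
    ("Men", "16-19", "Active"): 9.0,
    ("Men", "20-24", "Active"): 9.0,
    ("Women", "16-19", "Active"): 9.0,
    ("Women", "20-24", "Active"): 9.0,
}

rest_dims, by_sex = split_by(cube, dims, "sex")
new_dims, merged = concatenate(cube, dims, "sex", "age group")
print(rest_dims, sorted(by_sex))
print(new_dims, sorted(merged))
```

Splitting yields one smaller matrix per sex, while concatenation keeps every cell but reduces the number of dimensions by one, which is often what a two-dimensional presentation needs.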
Structured formats.The key question: the role of metadata
• The decision:
– Have we already opted to simply produce structured files or databases with statistical tables ordered as matrices, resulting from systematically crossing variables, and accompanied by all the metadata necessary for their interpretation?...
• If the answer is ‘YES’, it is necessary to talk of available formats and procedures
Structured formats.The key question: the role of metadata
• EDIFACT: Electronic Data Interchange For Administration, Commerce and Transport, electronic document structures promoted by the United Nations for exchanging documents electronically in the field of trade and public administrations
• GESMES = GEneric Statistical MESsage
– adaptation for statistical purposes of the EDIFACT EDI syntax
– designed by a workgroup composed of statistics institutes, customs bodies and central banks
– financed as part of the European Union IDA project (Interchange of Data between Administrations)
– published in 1993
– adapted to multidimensional “data set” structures including their own metadata
– complete, detailed, and somewhat complex
– in use between EUROSTAT and all the national statistical institutes, on a communication system based on the “Stadium - Statel - Testa services” extranet
Different storage formats Standards promoted by official statistics institutions
EDIFACT/GESMES, EDAMIS
• Basic EDIFACT syntax:
– An EDIFACT exchange comprises a sequence of segments
– Each segment has a unique 3-character identifier
– There are rules of order for segments
– “Entity-relation” modelling techniques were used to design message syntax
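The segment structure can be illustrated with a minimal sketch that splits a message on the default EDIFACT separators: `'` ends a segment, `+` separates data elements and `:` separates components. The function name `parse_segments` is hypothetical, the sample is a fragment of the GESMES message shown in this presentation, and the release character `?` is deliberately not handled.

```python
def parse_segments(message):
    """Split an EDIFACT interchange into (tag, data_elements) pairs.

    Simplification: the release (escape) character '?' is not handled.
    """
    segments = []
    for raw in message.split("'"):
        raw = raw.strip()
        if not raw:
            continue
        tag, *elements = raw.split("+")
        # Each data element may itself hold several ':'-separated components
        segments.append((tag, [element.split(":") for element in elements]))
    return segments


sample = (
    "UNH+01001+GESMES:D:95A:E6'"
    "DTM+137:19980813:101'"
    "NAD+MS+ine'"
    "NAD+MR+eurostat'"
)

segments = parse_segments(sample)
for tag, elements in segments:
    print(tag, elements)
```

Every tag that comes back is a 3-character identifier (UNH, DTM, NAD), which is what lets a receiving program check the rules of order before interpreting the content.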
Different storage formats Standards promoted by official statistics institutions
EDIFACT/GESMES
• GESMES implementations
– ECOSER (economic time series)
– BOPSTA (balance of payments)
– PRODCOM (production data)
– CLASET (statistical classifications)
– RDRMES (raw data collection)
– GESMES / CB (central banks short term economic indicators)
Different storage formats Standards promoted by official statistics institutions
EDIFACT/GESMES
• Problems …
– The more popular table presentation programs often do not have the capability of exporting and importing data with GESMES (PC-Axis does this)
– Through intensive use of the internet, new technologies emerged, particularly the standard XML / SDMX
Different storage formats Standards promoted by official statistics institutions : EDIFACT/GESMES
• An example of a message:
UNH+01001+GESMES:D:95A:E6'
BGM+74:::PC-AXIS Win 2.0'
DTM+137:19980813:101'
NAD+MS+ine'
CTA+CC+:INE Difusion e-mail: [email protected] Fax: +34 91 5839158'
NAD+MR+eurostat'
ASI+01001'
SCD+4+sexo++++:1'
SCD+4+grupos de edad (4)++++:2'
SCD+4+relacion con la actividad economica++++:3'
SCD+3+Poblacion de 16 y mas años++++:4'
DSI+epa4t97'
GIR+5+SDB:AB+01:AC+Ejemplo.- Resultados nacionales:AD'
ARR+
+9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9.0:9..0.0:9.0:9.0:9.0:9.0:9.0:9.0:++9.0:9.0:9.0:9.0:9.0'
IDE+5+01001'
Different storage formats Standards promoted by official statistics institutions EDIFACT/GESMES
• SDMX (Statistical Data and Metadata eXchange) http://www.sdmx.org/ is an initiative promoted by the BIS (Bank for International Settlements), OECD, IMF, World Bank, European Central Bank, UN and EUROSTAT in order to:
• Promote the use of standards in exchanging statistical information between institutions
• There are already pilot projects in place, or experiences such as:
– Eurostat SODI (SDMX Open Data Interchange)
– NAWWE (“The primary objective of the NAWWE project is to use a web based mechanism for collecting national accounts data based on already internationally agreed national accounts standards”)
http://stats.oecd.org/nawwe/
– Mexico : http://www2.inegi.gob.mx/estestint/Standards/default.asp
Different storage formats. SDMX
• Not institutionally standardised, although they have come to be “de facto” standards
• Specially designed for holding and presenting data and metadata
• Reader programs mimic the functions of OLAP multidimensional data presentation (showing dimensions and hierarchies, pivoting, drilling down, nesting, showing graphics and statistical maps)
• Full metadata handling capability
• Several programs, several regional markets:
– PC-Axis (Sweden): Nordic countries, UNECE, other EU countries (Spain too), South Africa, Guatemala…
– CUB.X / EVA: Eurostat program
– Beyond 20/20: USA, Canada, UK, France ...
– SuperTABLE: Australia
Different storage formats
De facto standards and “pseudo OLAP” visualisers
• Statistical table management application with a spreadsheet interface
• Windows environment
• Simple to handle and use by non-IT experts
• Simple to generate: Write in ASCII with tags, structured, self-documented, and easy to translate to XML (Adaptation to SDMX ver. 2 format underway)
• Easy to associate with Office applications
Different storage formats
De facto standards and “pseudo OLAP” visualisers: PC-Axis
• The program has been developed by Statistics Sweden
• The GUI shows typical statistical elements: universes or contents, variables, modify variable and value selections, nestings, etc...
• File generation can be fully automated:
– By means of robots or tabulation program macros (SAS)
– From relational or multidimensional databases containing the information (such as our Tempus 2 system)
– Or simply displayed online using “cgi / web gateways” type programs
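As an illustration of automated file generation, the sketch below writes the text of a small PX-style table using keywords that appear in the example file in this presentation (MATRIX, TITLE, UNITS, DECIMALS, STUB, HEADING, VALUES, DATA). The function names are hypothetical, the heading variable "Year" with the value "2005" is an assumption, and the output has not been checked against the full PC-Axis file specification, which defines further mandatory keywords and their order.

```python
def px_string(text):
    """Quote a text value; this sketch simply rejects embedded quotes."""
    if '"' in text:
        raise ValueError("embedded quotes are not handled in this sketch")
    return f'"{text}"'


def write_px(matrix, title, units, stub, heading, values, data):
    """Build the text of a small PX-style table.

    values maps each variable name to its list of value labels;
    data is a list of rows, one row per stub value.
    """
    lines = [
        f"MATRIX={px_string(matrix)};",
        f"TITLE={px_string(title)};",
        f"UNITS={px_string(units)};",
        "DECIMALS=0;",
        f"STUB={px_string(stub)};",
        f"HEADING={px_string(heading)};",
    ]
    for variable in (stub, heading):
        labels = ",".join(px_string(v) for v in values[variable])
        lines.append(f"VALUES({px_string(variable)})={labels};")
    lines.append("DATA=")
    lines += [" ".join(str(cell) for cell in row) for row in data]
    lines[-1] += ";"  # the terminator closes the DATA statement
    return "\n".join(lines) + "\n"


text = write_px(
    matrix="L10026E",
    title="Population of main capital cities.",
    units="population (thousands)",
    stub="Country/Agglomeration",
    heading="Year",  # hypothetical heading variable
    values={
        "Country/Agglomeration": [
            "Afghanistan. Kabul", "Albania. Tirana", "Algeria. Algiers",
        ],
        "Year": ["2005"],
    },
    data=[[2994], [388], [3200]],
)
print(text)
```

A tabulation macro or a database export job would call something like `write_px` once per table, which is what makes unattended generation feasible.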
Different storage formats
De facto standards and “pseudo OLAP” visualisers: PC-Axis
The table file (*.px) contains both metadata and data:
Lines of metadata…
AXIS-VERSION="2000";
CREATION-DATE="20070228 09:44";
SUBJECT-AREA="Demography";
SUBJECT-CODE="l1";
MATRIX="L10026E";
TITLE="Population of main capital cities.";
CONTENTS="Population of the largest urban agglomeration. Year 2005";
DESCRIPTION="Population of the largest urban agglomeration. Year 2005 ";
DECIMALS=0;
SHOWDECIMALS=0;
STUB="Country/Agglomeration";
HEADING="population (thousands)";
UNITS="population (thousands)";
LAST-UPDATED="26/03/07";
CONTACT="INE E-mail:www.ine.es/infoine Internet:www.ine.es Tel:+34 91 5839100 """;
VALUES("Country/Agglomeration")="Afghanistan. Kabul","Albania. Tirana","Algeria. Algiers","American Samoa. Pago Pago", …
Different storage formats
De facto standards and “pseudo OLAP” visualisers: PC-Axis
SOURCE="Statistical Yearbook of Spain ";
COPYRIGHT=YES;
NOTE="Information Source: United Nations Demographic Yearbook. ";
VALUENOTE("Country/Agglomeration","Australia. Sydney ")=" Including Christmas Island, Cocos (Keeling) Island and Norfolk ""Island.";
VALUENOTE("Country/Agglomeration","Channel Islands. ST. Helier")="Including the islands of Guernsey and Jersey. ";
VALUENOTE("Country/Agglomeration","China. Shanghai")="For statistical purposes, the data for China do not include Hong Kong #""and Macao Special Administrative regions (SAR) of China. #"" ";
VALUENOTE("Country/Agglomeration","Comoros (The). Moroni")="Including the island of Mayotte. ";
VALUENOTE("Country/Agglomeration","Finland. Helsinki")="Including Aland Islands. ";
VALUENOTE("Country/Agglomeration","Mauritius. Port Louis")="Including Agalega, Rodrigues and Saint Brandon. ";
VALUENOTE("Country/Agglomeration","Norway. Oslo")="Including Svalbard and Jan Mayen islands. ";
VALUENOTE("Country/Agglomeration","Saint Helena. Half Tree Hollow")="Including Ascension and Tristan da Cunha. ";
Lines of data…
DATA=2994 388 3200 …
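Because every statement has the shape keyword=value followed by a semicolon, a file like this can be read with very little code. The sketch below is a simplified reader, not the PC-Axis parser: the function names are hypothetical, values are kept as raw text, and missing-value symbols in the data are not handled. The sample reuses a few statements from the file shown above.

```python
def split_outside_quotes(text, sep):
    """Split text on sep, ignoring separators inside double quotes."""
    parts, current, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_px(text):
    """Return (metadata, data): keyword -> raw value text, and numeric cells."""
    metadata, data = {}, []
    for statement in split_outside_quotes(text, ";"):
        statement = statement.strip()
        if not statement:
            continue
        pieces = split_outside_quotes(statement, "=")
        keyword, value = pieces[0].strip(), "=".join(pieces[1:]).strip()
        if keyword == "DATA":
            data = [float(cell) for cell in value.split()]
        else:
            metadata[keyword] = value
    return metadata, data


sample = (
    'MATRIX="L10026E";'
    'TITLE="Population of main capital cities.";'
    'DECIMALS=0;'
    'STUB="Country/Agglomeration";'
    'VALUES("Country/Agglomeration")='
    '"Afghanistan. Kabul","Albania. Tirana","Algeria. Algiers";'
    'DATA=2994 388 3200;'
)

metadata, data = parse_px(sample)
raw_labels = metadata['VALUES("Country/Agglomeration")']
labels = [v.strip('"') for v in split_outside_quotes(raw_labels, ",")]
for label, cell in zip(labels, data):
    print(label, cell)
```

The pairing of labels with cells in the last loop is only possible because the metadata travel in the same file as the data, which is the point of a structured format.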
Different storage formats
De facto standards and “pseudo OLAP” visualisers: PC-Axis
The correspondence between tables and the program interface is intuitive: publication / folder or database, subject areas, tables, variables, values, data ...
Different storage formats
De facto standards and “pseudo OLAP” visualisers: PC-Axis
View a table
Different storage formats
De facto standards and “pseudo OLAP” visualisers: PC-Axis
Other functions:
• DDE and OLE communication with other Office programs: Excel, Word...
• Multiple export formats: Excel, Text, Html, Dbase, Gesmes, and soon SDMX...
Different storage formats
De facto standards and “pseudo OLAP” visualisers: PC-Axis
• Easy to combine with browsing structures based on static and dynamic HTML pages (as done by the INE in INEbase and monthly INEbase)
• Statistical Graphs and Maps (with PX-Map)
• In several languages
• Calculation functions, on rows, columns and between tables of equal dimensions
• Customisable views and printing
Different storage formats
De facto standards and “pseudo OLAP” visualisers: PC-Axis
• Site in English maintained by Statistics Sweden with links to all the programs in the PC-Axis suite, to countries where PC-Axis solutions are used, to the forum, download area, etc
• http://www.pc-axis.scb.se/
Different storage formats
De facto standards and “pseudo OLAP” visualisers: PC-Axis
• How the INE uses PC-Axis:
– As a reference for designing the dissemination database, and particularly for its metadata storage structures (INEbase and Tempus 2)
– As a format for the files from all statistical operations not yet uploaded to Tempus 2, or which are not expected to be uploaded (a program, ‘Jaxi’, displays them online)
– As another export format offered by INEbase
– And for building “offline” programs: monthly INEbase, EPA ...
Different storage formats
De facto standards and “pseudo OLAP” visualisers: PC-Axis
• One or more CGI programs provide “pseudo OLAP” browsing, search and presentation functions
• Data is held, by means of PC-Axis files, on the internet file server, in a directory structure which follows the logical subject tree of the organisation – that of the ISO-.
Different storage formats
De facto standards and “pseudo OLAP” visualisers: PC-Axis
• EVA (ex CUB.X) is a very similar program to PC-Axis, which also enables handling of multidimensional tables
• http://epp.eurostat.cec.eu.int/extraction/evajava/evajava/help/en/homepage.htm states that:
“EVA stands for Eurostat's Visual Application, the Eurostat's Common Browser for its statistical databases. EVA is a specialised multidimensional statistical table browser”
Different storage formats
De facto standards and “pseudo OLAP” visualisers: CUB-X / EVA
• Eurostat’s “New Cronos” database and its HTML presentation
Different storage formats
De facto standards and “pseudo OLAP” visualisers: CUB-X / EVA
Shell HTML
Different storage formats
De facto standards and “pseudo OLAP” visualisers: CUB-X / EVA
Retrieving data in
HTML table format
Different storage formats De facto standards and “pseudo OLAP” visualisers: CUB-X / EVA
The visualiser for Windows
• The dimension “blocks” show the groups of values of the variable, and allow for rotating the chosen values
• “Spreadsheet” type interface, drag and drop functions to modify the header row and the header column
• Values and codes
• Multiple export formats: Excel, Dbase, Gesmes...
Different storage formats
De facto standards and “pseudo OLAP” visualisers: CUB-X / EVA
An example of an EVA file:
INFOFri Dec 19 10:16:08 1997 @_# trueValues=5246 on 5544 #_@LASTUPFri, 19 Dec 1997 10:11:04 +0100TYPERVDELIMS(),@DIMLST(soft,theme,domain,collect,table,indic,country,time)DIMUSE(R,R,N,N,N,V,V,V)POSLST(newcronos)(theme1)(eur2)(01-cn)(01-cn-a)(cnpib90a)
(01,22,30,11,34,32,14,28,16,24,18,36,38,40,41,26,42,46)(1999a00,1998a00,1997a00,1996a00,1995a00,1994a00,1993a00,1992a00,1991a00,1990a00,1989a00,1988a00,1987a00,1986a00)FORMATFORMATRNOTAV:VALLST(0)(6309801.90,6119851.10,5942454.30,5792810.20,5691611.00,5554092.30,5394748.60,5424698.30,5373350.50,5198353.80,
Different storage formats
De facto standards and “pseudo OLAP” visualisers: CUB-X / EVA
– Distributed by the Canadian company of the same name, which also produces:
• the (independent) file and metadata preparation system
• and an “internet file server” version of the program
– Data and metadata are stored in binary files which are not directly readable; it is not possible to create Beyond 20/20 files other than with its specific “builder” programs
– Capabilities: spreadsheet interface, drag and drop, exchange and nesting of variables, statistical graphs and maps
– Two main types of file: tables and “extracts”. A distinguishing feature of Beyond 20/20 is its capability for handling microdata and tables with the same program. “Extracts” are indexed microdata files which are specially handled so that online tabulation is quick
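What tabulating from microdata means can be shown with a minimal sketch: individual records are counted for every combination of two classification variables. This only illustrates the idea; it says nothing about how Beyond 20/20 stores or indexes its “extracts”. The function name `tabulate` and the four sample records are hypothetical.

```python
from collections import Counter


def tabulate(records, row_var, col_var):
    """Count records for every combination of two classification variables."""
    counts = Counter((r[row_var], r[col_var]) for r in records)
    rows = sorted({r[row_var] for r in records})
    cols = sorted({r[col_var] for r in records})
    # Counter returns 0 for combinations that never occur
    table = [[counts[(row, col)] for col in cols] for row in rows]
    return rows, cols, table


# Hypothetical microdata: one dictionary per respondent
microdata = [
    {"sex": "Men", "activity": "Active"},
    {"sex": "Men", "activity": "Inactive"},
    {"sex": "Women", "activity": "Active"},
    {"sex": "Women", "activity": "Active"},
]

rows, cols, table = tabulate(microdata, "sex", "activity")
print(cols)
for label, cells in zip(rows, table):
    print(label, cells)
```

With real survey files the counting step is what takes time, which is why pre-indexed microdata matter for quick online tabulation.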
Different storage formats
De facto standards and “pseudo OLAP” visualisers: Beyond 20/20
• Chapters, tables and “extracts”
Different storage formats
De facto standards and “pseudo OLAP” visualisers: Beyond 20/20
• Viewing tables
Different storage formats
De facto standards and “pseudo OLAP” visualisers: Beyond 20/20
• Selecting variables from an “extract” or microdata file
Different storage formats
De facto standards and “pseudo OLAP” visualisers: Beyond 20/20
Structured storage. Conventional databases with the capability to store data and metadata and dynamically generate the information requested
• It is also possible to create Databases for online statistical dissemination, with robust metadata support :
– Adhering in its design to a pre-existing structured format (the Swedish model, Spain –Oracle-,...)
– Or with a model of its own (Holland, the StatLine system)
Different storage formats
Databases
• StatLine, from the Netherlands Central Bureau of Statistics, is an excellent reference for how to combine a database with a look-up system…
• It seems a simple medium, but is the outcome of several years’ work...
Different storage formats
Databases
http://statline.cbs.nl/
• StatLine: powerful presentation of metadata, “pseudo OLAP” look-up functions. (Data supported by a relational-model database on the server, Java Applet internet technology; ample bandwidth is recommended …)
Viewing metadata, cubes, dimensions
Different storage formats
Databases
57
Viewing data: drill-down, nesting, pivoting, and drag-and-drop functions
StatLine: powerful presentation of metadata and “pseudo OLAP” look-up functions. (Data held in a relational database on the server and delivered over the internet with Java applet technology; a fast broadband connection is recommended …)
Different storage formats
Databases
58
The INEbase Tempus II subsystem (Time Series databank)
Relational database systems are also widely used as dissemination tools. The INE uses them:
• 1.- As a more compact store than PC-Axis files, distributing the different metadata components and the data across different tables, and enabling:
– construction of queries on demand
– export to PC-Axis, Excel format …
A growing part of the information is uploaded to the INEbase Tempus II subsystem, a relational database system (Oracle) which reconciles:
• single information storage
• and presentation in two possible forms: tables and time series
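As a rough illustration of "single storage, two presentations", the sketch below keeps metadata and data in separate tables and answers both a table query and a time-series query on demand. The schema, the table and column names, and the figures are all invented for the example; they are not the Tempus II model, and SQLite stands in for Oracle.

```python
import sqlite3

# Hypothetical schema: names and values are illustrative, not INE's.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE series (              -- metadata: one row per series
    series_id INTEGER PRIMARY KEY,
    division  TEXT NOT NULL,       -- statistical operation, e.g. 'IPC'
    title     TEXT NOT NULL
);
CREATE TABLE observation (         -- data: one row per series and period
    series_id INTEGER NOT NULL REFERENCES series(series_id),
    period    TEXT NOT NULL,       -- e.g. '2007-12'
    value     REAL NOT NULL,
    PRIMARY KEY (series_id, period)
);
""")
conn.executemany("INSERT INTO series VALUES (?, ?, ?)", [
    (1, "IPC", "General index"),
    (2, "IPC", "Food index"),
])
conn.executemany("INSERT INTO observation VALUES (?, ?, ?)", [
    (1, "2007-11", 104.9), (1, "2007-12", 105.3),
    (2, "2007-11", 108.2), (2, "2007-12", 109.0),
])


def as_series(series_id):
    """Chronological presentation: one series across all periods."""
    cursor = conn.execute(
        "SELECT period, value FROM observation "
        "WHERE series_id = ? ORDER BY period",
        (series_id,),
    )
    return cursor.fetchall()


def as_table(division, period):
    """Table presentation: every series of a division for one period."""
    cursor = conn.execute(
        "SELECT s.title, o.value "
        "FROM series s JOIN observation o ON o.series_id = s.series_id "
        "WHERE s.division = ? AND o.period = ? ORDER BY s.series_id",
        (division, period),
    )
    return cursor.fetchall()
```

Both functions read the same stored rows; only the query shape differs, which is the point of keeping a single store.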
Different storage formats
Databases
59
Analyze, Define, Create
1.- Relational model
2.- GES_Tempus: tool for managing processes, using the new TP2 format
3.- Gathering data from Tempus, PC-Axis and other sources
4.- Tempus 2 (model + data)
5.- Displaying tables (collections of series) (March 2004)
6.- Accessing series (first version)
7.- Extracting data from T2, e.g. FMI
8.- Program for accessing Tempus 2 series
60
The INEbase Tempus II subsystem (Time Series databank)
Different storage formats
Databases
61
The INEbase Tempus II subsystem
We developed a tool (Ges-Tempus) for managing all operations in Tempus 2
Different storage formats
Databases
62
No. Division (statistical operation) APR-06 SEP-06 APR-07 AUG-07 DEC-07 Change APR-06->DEC-07 % (20 months)
1 CCM 54 54 54 54 54 0
2 CNTR 957 921 1.013 1.011 1.011 54 5,64
3 CTA 1.272 1.272 1.275 1.272 1.272 0
4 DPOD 24.537 24.537 24.537 24.537 24.537 0
5 DPOH 8.201 8.201 8.198 8.201 8.201 0
6 DPOP 24.543 24.543 24.546 24.546 24.546 3 0,01
7 EIE 11.237 11.237 10.757 11.237 11.237 0
8 EOT 9.791 10.043 10.043 10.043 NEW
9 EPA 85.229 126.023 157.196 212.130 212.130 126.901 148,89
10 ETCL 5.029 5.029 5.029 5.029 5.029 0
11 HPT 37.940 37.940 37.940 37.940 37.940 0
12 IAS 234 234 365 354 354 120 51,28
13 ICM 1.128 1.191 818 1.191 1.191 63 5,59
14 ICN 21 21 21 21 NEW
15 IDB 1.764 1.764 1.764 1.814 1.764 0 0,00
16 IEP 21 21 21 21 NEW
17 IPC 82.705 82.705 115.878 115.878 126.765 44.060 53,27
18 IPCA 393 393 393 393 NEW
19 IPI 4.613 4.613 4.616 4.613 4.613 0
20 IPP 5.362 5.362 10.867 10.867 10.867 5.505 102,67
21 IPR 3.902 3.902 3.960 3.958 3.958 56 1,44
22 MNP 93.290 97.582 101.874 101.874 101.874 NEW
23 Reserv. IPC 157.728 157.728 157.728 157.728 NEW
24 EPOB 18.624 18.624 18.624 NEW
25 TV 145 145 145 NEW
26 DIR 660 179.502 179.502 NEW
27 ECPF 3.596 34.617 34.617 NEW
28 ECM 7.920 0 NEW
TOTAL 298.707 432.818 674.349 754.141 978.437 679.730 227,56
Tempus 2. Divisions and series
63
2.- As a system for disseminating statistical data that is closer to the concept of “lists” than of “tables”. One example: the List of place names
Filterable lists, not cross-tabulations of variables
Different storage formats
Databases
64
3.- As a system for disseminating statistical data that is closer to the concept of “lists” than of “tables”. Another example: the Industrial Product Survey
Filterable lists, not cross-tabulations of variables
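The difference between a filterable list and a cross-tabulation can be sketched in a few lines: a list keeps one record per item and simply narrows the set of records, without aggregating anything. The records and field names below are hypothetical stand-ins, not the actual List of place names.

```python
# Hypothetical list records; field names are assumptions for illustration.
places = [
    {"province": "Madrid", "municipality": "Tres Cantos"},
    {"province": "Madrid", "municipality": "Alcobendas"},
    {"province": "Toledo", "municipality": "Toledo"},
]


def filter_list(rows, **criteria):
    """Return the rows whose fields match every given criterion."""
    return [
        row for row in rows
        if all(row.get(field) == wanted for field, wanted in criteria.items())
    ]


madrid_rows = filter_list(places, province="Madrid")
```

No variables are crossed here: the result is still a list of the original records, only shorter.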
Different storage formats
Databases
65
What might the role be of BI/DW systems in a statistical dissemination strategy?
• When, for the company or economic subject being studied …
– The number of variables or dimensions to be analysed is high
– The granularity or level of subject or territorial detail is also high
– It is difficult to predict many of the possible subject and territorial cross-tabulations, as well as the hierarchical presentation levels appropriate for different types of users
• … we will need to model “n-dimensional cubes” whose cell counts are significantly greater than 10 to the power of 5 …
• We can continue to use traditional relational modelling systems, however … it is time to speak to an expert in multidimensional analysis!
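The cell-volume threshold above is easy to check: the number of cells in a full cube is the product of the number of categories in each dimension. The dimension names and sizes below are invented purely to show the arithmetic.

```python
import math

# Hypothetical dimension sizes (number of categories per dimension).
dimensions = {
    "municipality": 8000,
    "age_group": 20,
    "sex": 2,
    "nationality": 10,
}


def cell_count(dims):
    """Number of cells in a dense cube: product of the dimension sizes."""
    return math.prod(dims.values())


cells = cell_count(dimensions)          # 8000 * 20 * 2 * 10
needs_olap_review = cells > 10 ** 5     # the rule of thumb from the slide
```

With these assumed sizes the cube has 3,200,000 cells, well beyond the 10 to the power of 5 mark, even though no single dimension looks especially large.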
Different storage formats
OLAP systems for data dissemination Multidimensional databases
66
• This means moving away from working with multiple “data sets”, or with a set of relational tables, and instead storing (fewer) “cubes” that contain a large amount of data with a high level of subject, territorial or temporal granularity
• Ideal for displaying results of Censuses and other operations enabling small-area statistics
– Censuses
– Large surveys, large company or establishment directories, high level of detail
• Intranet or Internet
• Use of the most advanced OLAP techniques
• One example is the experience of the INE in the 2001 Censuses
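A minimal sketch of two basic OLAP operations on such a cube, roll-up along a territorial hierarchy and drill-down into one branch of it. The cell values are invented (they are not census figures), and a plain dictionary stands in for a real multidimensional engine.

```python
from collections import defaultdict

# Invented cell values keyed by (municipality, sex).
cube = {
    ("Tres Cantos", "F"): 120, ("Tres Cantos", "M"): 110,
    ("Alcobendas", "F"): 300, ("Alcobendas", "M"): 290,
    ("Toledo", "F"): 200, ("Toledo", "M"): 190,
}
# Territorial hierarchy: municipality -> province.
province_of = {
    "Tres Cantos": "Madrid",
    "Alcobendas": "Madrid",
    "Toledo": "Toledo",
}


def roll_up(cells, parent):
    """Aggregate municipality-level cells to the parent (province) level."""
    totals = defaultdict(int)
    for (municipality, sex), count in cells.items():
        totals[(parent[municipality], sex)] += count
    return dict(totals)


def drill_down(cells, parent, province):
    """Return the detailed cells that sit under one province."""
    return {
        key: count for key, count in cells.items()
        if parent[key[0]] == province
    }


by_province = roll_up(cube, province_of)
madrid_detail = drill_down(cube, province_of, "Madrid")
```

The same stored cells serve both views, which is why one fine-grained cube can replace many pre-built tables.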
Different storage formats
OLAP systems for data dissemination Multidimensional databases
67
• It is not yet the most common dissemination technology; however, around the XML format (and the SDMX project) we can increasingly expect to see built, in addition to data exchange standards …
– automated surveying systems for companies via the internet
– data dissemination systems
• The data-structuring systems inherent to XML are ideal for combining with “classic” or “multidimensional (OLAP)” databases
• An interesting initiative currently under way: the pilot project on the foreign debt, based on the SDMX standard
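To give a feel for why XML structuring combines well with a database, the sketch below reads a small message into series keys and observations, ready to be loaded into relational or multidimensional storage. The message is a simplified, namespace-free layout only loosely modelled on SDMX-ML; the element and attribute names are assumptions for illustration and should not be read as the SDMX schema, and the values are invented.

```python
import xml.etree.ElementTree as ET

# Simplified illustrative message; NOT the actual SDMX-ML schema.
MESSAGE = """
<DataSet>
  <Series>
    <SeriesKey>
      <Value concept="FREQ" value="A"/>
      <Value concept="REF_AREA" value="ES"/>
    </SeriesKey>
    <Obs><Time>2006</Time><ObsValue value="10.5"/></Obs>
    <Obs><Time>2007</Time><ObsValue value="11.2"/></Obs>
  </Series>
</DataSet>
"""


def read_series(xml_text):
    """Return a list of (key, observations) pairs found in the message."""
    root = ET.fromstring(xml_text)
    result = []
    for series in root.iter("Series"):
        # The key is the set of dimension values identifying the series.
        key = {
            item.get("concept"): item.get("value")
            for item in series.find("SeriesKey")
        }
        observations = [
            (obs.findtext("Time"), float(obs.find("ObsValue").get("value")))
            for obs in series.findall("Obs")
        ]
        result.append((key, observations))
    return result


parsed = read_series(MESSAGE)
```

Because each series carries its own dimension values, the parsed result maps directly onto dimension columns in a table or onto the axes of a cube.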
Different storage formats
International standardisation initiatives in operation: SDMX-XML
68
Interesting example of the OECD using SDMX: http://stats.oecd.org/nawwe/csp/default.html
Different storage formats
International standardisation initiatives in operation: SDMX-XML
69
• We have said “Structure to disseminate”. However ...
• What if the data is completely unstructured, as is the case with old, paper-based publications?
Example of an application with unstructured data
• The INE does not rule out using the internet to disseminate these valuable historical collections. The INEbase HISTORIA project is currently in the final stages of cataloguing. It combines mass scanning and OCR processing, a relational database management system (RDBMS) and a file server to provide guided access and search systems for viewing and downloading pages from those publications, in PDF and Excel formats
• This will be covered in another presentation
70
Example of an application with unstructured data
71
• Depending on the IT infrastructure dedicated to the storage and dissemination of data and metadata, we can use different tools to structure the information to be disseminated, from greater to lesser complexity …
– A metadata creation environment in a multidimensional database system (the INE uses one in the 2001 Census DW)
– Or one associated with a relational database (the INEbase Tempus II environment …)
– Or something as simple as using PX-Make (or PX-Edit) to produce PC-Axis files ...
Example of a tool for structuring data: PX-Make
72
Example of a tool for structuring data: PX-Make
• Interface designed for working with PX files
• Exchange of data with Excel, Access..., and with PX files already made
• EASY: a day’s training is enough. Used by service promoters
• It is part of the “PC-Axis suite” software
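To show what "structuring" a table for PC-Axis involves, here is a minimal sketch that serialises a two-way table as PC-Axis-style keyword text. It is not how PX-Make works internally, and it writes only a simplified subset of keywords; a real PX file needs further keywords, so treat the output as illustrative rather than as a valid file. The function name `to_px` and the figures are invented.

```python
def to_px(title, stub, heading, values, data):
    """Serialise a two-way table as PC-Axis-style keyword=value text.

    values maps each variable name to its list of categories; data holds
    one row of numbers per stub category. Simplified subset of keywords.
    """
    def quoted(items):
        return ",".join(f'"{item}"' for item in items)

    lines = [
        f'TITLE="{title}";',
        f'STUB="{stub}";',          # variable shown in the rows
        f'HEADING="{heading}";',    # variable shown in the columns
        f'VALUES("{stub}")={quoted(values[stub])};',
        f'VALUES("{heading}")={quoted(values[heading])};',
        "DATA=",
    ]
    lines += [" ".join(str(cell) for cell in row) for row in data]
    lines[-1] += ";"                # the data section ends with a semicolon
    return "\n".join(lines)


px_text = to_px(
    title="Population",
    stub="province",
    heading="year",
    values={"province": ["Madrid", "Toledo"], "year": ["2006", "2007"]},
    data=[[1, 2], [3, 4]],          # invented figures
)
```

The metadata (title, variables, categories) travels in the same file as the data, which is what lets a viewer rebuild the table without outside information.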
73
Example of a tool for structuring data: PX-Make
74
Example of a tool for structuring data: PX-Make
75
Data retrieval and presentation
Are our ‘official’ statistical data naked or boring data?
http://blogstats.wordpress.com/2007/07/20/naked-data/
76
Are ‘official’ statistical data boring,
or ‘naked data’?
Let’s look at some ways to give our users friendlier access to our information.
Statistical data can even be amusing!
Especially if the information is structured ... of course!
77
• “Some say that statistics or data aren’t very sexy, that they have the image of being quite difficult, of being boring or even of being biased and not worth to be studied”. http://blogstats.wordpress.com/category/0223-gapminder/
• “Listening to representatives of Gapminder or Swivel one could think official statistics is just naked data, difficult to access and not considering new technologies. Is this true?” http://blogstats.wordpress.com/2007/07/20/naked-data/
Are ‘official’ statistical data boring, or ‘naked data’?
• “Official statistics are a key “public good” that foster the progress of societies”. OECD World Forum, Istanbul Declaration, June 2007
We recommend watching the interesting video ‘Unveiling the beauty of statistics’, presented by Hans Rosling at the OECD World Forum in Istanbul in June 2007
78
Are ‘official’ statistical data boring, or ‘naked data’?
One of the questions we asked in our survey was:
“Do you think blogs and forums are interesting in statistical dissemination?”
Blogs can be ideal tools for statisticians to learn about new initiatives for improving statistical data dissemination
e.g. http://blogstats.wordpress.com
BLOGS
79
Private initiatives
Using ‘our’ official data, they attempt to make it user-friendly
• Gapminder: www.gapminder.org “Gapminder developed the Trendalyzer software that converts international statistics into moving, interactive and enjoyable graphics.” (http://en.wikipedia.org/wiki/Hans_Rosling)
• Many Eyes: “Our goal is to "democratize" visualization and to enable a new social kind of data analysis” (www.many-eyes.com)
Example: Internet Penetration and Usage in Europe, by Country, Sept. 2007
• Swivel: www.swivel.org “Where Curious People Explore Data”
Example: Average age at death by Age at retirement
• Netvibes: www.netvibes.com “Built-in Netvibes modules include an RSS/Atom feed reader” (http://en.wikipedia.org/wiki/Netvibes)
80
Are ‘official’ statistical data boring, or ‘naked data’?
Graphs “A picture is worth a thousand words”
Different ‘friendly’ visualization styles for similar data
81
Are ‘official’ statistical data boring, or ‘naked data’?
Enabling users to make calculations, even using everyday language to explain the objective of the program
82
Are ‘official’ statistical data boring, or ‘naked data’?
‘Gossiping’ (why not?) about some popular data
The most frequent names (Zurich) ... or surnames (Spain)
‘Friendly’ style
‘Naked’ style
83
Are ‘official’ statistical data boring, or ‘naked data’?
Giving users tools to help them use our sites
Search engines (including suggested links) and sitemap
84
Are ‘official’ statistical data boring, or ‘naked data’?
Giving users the possibility to look up values, configure the results screen, export to different formats ...
Selection of variables (INE Spain) ... or PX-Web model (Finland)
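Offering several export formats is straightforward once the selected result is held as structured rows. The sketch below is a generic illustration using Python's standard library, not the INE or PX-Web implementation; the function name, column names and figures are invented.

```python
import csv
import io
import json


def export(rows, columns, fmt):
    """Render the selected rows as CSV or JSON text."""
    if fmt == "json":
        return json.dumps([dict(zip(columns, row)) for row in rows])
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(columns)    # header line with the variable names
        writer.writerows(rows)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt}")


selection = [("Madrid", 2007, 1), ("Toledo", 2007, 2)]   # invented figures
columns = ["province", "year", "value"]
csv_text = export(selection, columns, "csv")
json_text = export(selection, columns, "json")
```

The same selection feeds every format, so adding another one (Excel, PC-Axis) only means adding another branch.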
85
Special sections for ‘other’ types of users
With children in mind (Brazil) ... or journalists (Spain)
86
Historical information is very popular and much in demand
Evolution of municipalities in Spain
87
Even providing texts in plain language for non-specialist users
The same information is available for specialized users in other sections
88
The icing on the cake
Interesting maps (Switzerland) ... Population clock (Census, USA)
89
Thank you very much for your attention. Any questions, please?