usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one...

25
Introduction CSPro Data Organization Implementation Interactive Windows-only features 2013 UK Stata Users Group meeting Cass Business School, London usecspro: Importing CSPro hierarchical datasets to Stata Sergiy Radyakin [email protected] Research Department (DECRG), The World Bank September 13, 2013 1 / 25

Transcript of usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one...

Page 1: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

2013 UK Stata Users Group meetingCass Business School, London

usecspro: Importing CSPro hierarchical datasets to Stata

Sergiy [email protected]

Research Department (DECRG),The World Bank

September 13, 2013

1 / 25

Page 2: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Introduction

”This is tricky” - William Gould, StataCorp, circa 1996

http://www.stata.com/support/faqs/data-management/reading-hierarchal-dataset-with-infile/

2 / 25

Page 3: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Census and Survey Processing System (CSPro)

Developed and supported by the US Census Bureau, with fundingfrom USAID

CSPro is used for basic data processing: data entry, validation,corrections and recoding, tabulation, etc.

Software is declared to be in public domain and is distributed freelyfrom the website:

http://www.census.gov/population/international/software/cspro

Actively used since about 2000 by hundreds of organizations to collectdata as part of:

CensusesLabour Force SurveysIncome and Expenditure SurveysDemographic and Health Surveys

3 / 25

Page 4: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

CSPro files

CSPro software uses multiple file types (more than 15) with differentextensions and for different purposes. The minimal set of files that isneeded to retrieve the data from this package is the following:

Dictionary file

Dictionary files (*.dcf) define the structure of data. One dictionary may beapplicable to multiple datasets.

Data file

Data files: (usually *.dat or without any extension) contain all datapacked into a single file.

Both files are necessary to import the data.

4 / 25

Page 5: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

CSPro data organization

CSPro allows the database architect to describethe structure of the dataset in terms of levels,records, and data items.

Level is a unit of data corresponding to the surveyinstrument (type of questionnaire). Most of surveysrequire only one level. Some surveys require severalsurvey instruments, and hence corresponding databasesconsist of several levels.

Record is a unit of data corresponding to a topic of thesurvey: such as a group of questions related toemployment or health status. Here a table is a group ofrecords of the same type.

Data item is a unit of data corresponding to attributes(variables).

How does CSPro logically organize data, oftentimes into a single level?

5 / 25

Page 6: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Relational database

Consider a simple relational database (as used by e.g Microsoft Access, or OpenOffice Base):

Here we have two tables: of households and individuals, and a relationship indicating that morethan one individual may reside in one household. The identifier matching the two tables is thefield HHID (household number).

6 / 25

Page 7: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Relational database

It is natural to describe this as two-level dataset with first level beinghouseholds and second level being individuals:

7 / 25

Page 8: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

CSPro approach

In fact CSPro prefers the following single level layout of the same database:

This type of data layout is known in Stata’s world as wide dataset.

Note that all attributes here are attributes of the household. Even though we tend to think ofage as a characteristic of an individual, here it is in fact: age1 - age of the first memberof the household, age2 - age of the second member of the household, etc.

The number of slots allocated for the person-specific information (in this example: 9) has to bepreset at design time and should be sufficiently large to accommodate all realistic cases.

This could be quite wasteful in terms of the occupied space, since most households would have

’less-than-maximum’ number of persons.8 / 25

Page 9: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

CSPro data file layout

Even when CSPro describes data as one level, it partitions it into multiple record types andinterleaves records of all types in a single data file, marking each record type with a special’record-type’ identifier.

CSPro records data as ASCII text (or utf-8 unicode text in CSPro 5.0), with each field ofspecified width (in characters); records do not have to be of the same width: here personalrecord is shorter than household record.

9 / 25

Page 10: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Implementation

Stata itself does not import CSPro files directly, and neitherStat/Transfer (Circle Systems), nor DBMS/Copy (ConceptualSoftware, discontinued) import these files specifically.

The only alternative remained to use CSPro itself to export data toStata (it does support export to a plain text format plus a syntax fileto generate variable and value labels.)

Not all Stata users have CSPro installed. Only Windows users canpotentially install it. There is no alternative for Mac and Linux users.

-usecspro- is a package developed by Sergiy Radyakin for Stata 10 ornewer that implements import of the hierarchical CSPro data files.

-usecspro- provides a simple mouse-click solution to import the data,as well as provides a broad API for the advanced users.

10 / 25

Page 11: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Features

Data label - supported, record name is used as the dataset label in the resulting Statadataset.

Variable names - fully supported (identical naming conventions), converted to lowercase.

Variable labels - fully supported

Value labels - partially supported:

only the first set of labels is used,interval labels (no equivalent in Stata) are converted to discrete labels (whenpossible),non-integer values are not labelled (not possible in Stata).

Missing values - supported. The three special CSPro values reserved to denotemissingness are converted to Stata’s extended missing values:

DEFAULT→.a, NOTAPPL→.b, MISSING→.c

Decimals (including implied decimals) - supported. Even when decimal separator is notsaved into the dataset, the decimals are imputed during conversion using the declared fieldproperties in the dictionary.

Leading zeroes - not supported.

11 / 25

Page 12: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Conversion strategy of -usecspro-

The strategy is (what happens behind the scenes):

1 parse the dictionary file to ’understand’ the data organization

2 filter the dataset to have only records of the single type

3 write the Stata’s dictionary file to read-in the data

4 write the Stata’s do-file to do formatting adjustments, recode missingvalues, etc.

5 read the data from step #2 using the dictionary from step #3

6 execute the do-file from step #4

The user doesn’t need to know nor understand the above. It is sufficientto use the following friendly syntax.

12 / 25

Page 13: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Basic syntax

Interactivelyusecspro

In Stata do-files

usecspro using "data.dat", dictionary("dictionary.dcf")

level("LevelName") record("RecordName")

In Mata functions

cspro_convert("dictionary.dcf", "data.dat", "LevelName",

"RecordName")

13 / 25

Page 14: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Stata syntax

cspro aboutDisplay information about the program

usecspro using "file.dat",

dictionary("file.dcf") level("levelname")

record("recordname") [clear]

Import CSPro data record ’recordname’ of data level ’levelname’using specified dictionary (non-interactive)

cspro dir ["path"]Show the list of CSPro files in the current (or specified) directory

cspro use ["file.dcf"]Show the data levels of the specified dictionary file. If file notspecified, open a dialog to pick a file.

cspro import "file.dcf"Same as above

cspro dump "file.dcf"Dump all records of all data levels as separate datasets into asubfolder. Datasets are nested into subfolders with name of thelevel.

cspro join using "file.dat",

dictionary("file.dcf") level("levelname")

records("R1 R2 ... RN")

Join specified records of the same data level ’levelname’ usingCSPro identifier of this level.

14 / 25

Page 15: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Mata syntax

Basics

cspro_about()Display information about the program

cspro_convert("dictionary.dcf",

"data.dat", "levelname",

"recordname")

Import CSPro data record ’recordname’ of data level ’levelname’using specified dictionary (non-interactive version for program-mers)

Interactivities

cspro_import("folder")Writes to the output a clickable list of the CSPro dictionary filesin the specified directory

cspro_list_levels("dictionary.dcf")Writes to the output a clickable list of data levels contained inthe dictionary file

cspro_list_level_records(

"dictionary.dcf","levelname")

Writes to the output a clickable list of records for the specifieddata level in the dictionary file

15 / 25

Page 16: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Mata syntax (cont.)

Utilities

cspro_give_levels("dictionary.dcf")Returns a list of data levels contained in the dictionary file

cspro_give_level_records(

"dictionary.dcf", "levelname")

Returns a list of records for the specified data level in the dictio-nary file

cspro_dump("dictionary.dcf",

"data.dat", "folder")

Dumps all records of all data levels as separate datasets into asubfolder

cspro_dump_simple("dictionary.dcf")Dumps all records of all data levels into a subfolder, data file isassumed to be same as the dictionary file, with extension .dat,the output folder is in the dictionaryname˙DUMP in the folderwhere dictionary is located (will be created if does not exist)

cspro_join("dictionary.dcf",

"data.dat", "levelname", "R1 R2... RN")

Performs a join of two or more records of the ”levelname” datalevel into a single dataset using CSPro identifier of this level

cspro_guess_joinp("dictionary.dcf",

"data.dat", "levelname", "R1 R2 ...

RN")

Performs a join of two or more records of the same data levelinto a single dataset using an extended (guessed) ID

*cspro_is_utf8_file("filename.ext")Checks if a given file is encoded in utf-8

*cspro_convert_utf8_to_ansi(

"utf8file.txt", "ansifile.txt")

Converts a given file from utf-8 encoding to ANSI

* in Windows only 16 / 25

Page 17: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Interactive use example

17 / 25

Page 18: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Interactive use example

18 / 25

Page 19: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Interactive use example

19 / 25

Page 20: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Interactive use example

20 / 25

Page 21: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Interactive use example

21 / 25

Page 22: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Interactive use example

22 / 25

Page 23: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Windows-only features (dialog)

Dialog interface:

23 / 25

Page 24: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Windows-only features (unicode)

Unicode support - CSPro 5.0 is using utf-8 encoding for dictionary (variableand value labels) and data files (string values).

Files are saved in unicode even if no non-ASCII characters are used. Hence filesneed to be converted to ASCII+ANSI since Stata does not support unicode.

Windows users of -usecspro- can additionally specify a codepage to be used for non-ASCIIcharacters.

Only one codepage can be specified (e.g. 1251: Cyrillic (Windows) ) and it is applied toboth dictionary file and data files. This choice is only available in the dialog if thedictionary is actually saved in utf-8 encoding.

Codepage can be specified as an additional parameter (default=1252 Western Europeancodepage) of both Stata and Mata commands:

usecspro using "file.dat", dictionary("file.dcf") ///

level("L") record("R") codepage(1251)

mata cspro_convert("file.dcf", "file.dat", "L", "R", 1251)

Cyrillic smaller letter ya is now supported in variable and value labels.Click to learn why it is so special?

24 / 25

Page 25: usecspro: Importing CSPro hierarchical datasets to Stata · Even when CSPro describes data as one level, it partitions it into multiple record types and interleaves records of all

Introduction CSPro Data Organization Implementation Interactive Windows-only features

Windows-only features (unicode)

To view the non-ASCII characters correctly in Stata, adjust the fontsettings of the output window to match the encoding that was selectedduring the conversion:

25 / 25