Chris Dibben, University of Edinburgh
Linking historical administrative data
Context
• History of very important contributions:
– Dutch Famine Birth Cohort Study – epigenetics, thrifty phenotype
– Överkalix study – epigenetics, sex differences
– UK Longitudinal Study – health inequalities
Two new developmental projects
• Scottish Mental Surveys 1932 and 1947
• Scottish civil registration data
• New cohorts for people now in old age
The ‘Scottish Mental Survey’
[Diagram: the 1947 Scottish Mental Survey (birth 1936) and the sources it is linked to]
• 1939 register: ED code, address, household members: marital status, occupation
• The Scottish Longitudinal Study
• Scottish morbidity records
• 1939 books recorded the date of death (up to 1980)
• Linkage to the death database (1974 onwards)
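[Figure: life-course timeline for the 1936 birth cohort. Axes: Year and Age. Marked points: 1936 (birth, age 0), 1947 (age 11), 1970 (age 34), 1991 (age 55), 2001 (age 65), 2011 (age 75). Labelled data: early life environment; mental ability (age 11); school achievement (time estimated); education; occupation (estimated); employment; hospitalisation; mortality; detailed household/individual information.]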
Background – Scottish vital events
• Civil registration of births, deaths and marriages in Scotland began on 1 January 1855
• All historical vital events records have been converted into digital image format with a supporting index
• Modern vital events data (from 1974 onwards) are available electronically
Digitising Scotland
• Approximately 50 million occupation strings, 8 million causes of death
• Classify occupations to Historical International Standard Classification of Occupations (HISCO)
• Cause of death to a modified ICD10
• Each with a location
Historical Geocoding
[Diagram: GEOCODING TOOL + GEOMETRY FEATURES, applied to addresses from 1710, 1810, 1910 and 2010]
Year | Historical address
2010 | Ladywell House, Ladywell Road, Edinburgh, EH12 7T
1910 | Ladywell House, Ladywell Street, Edinburgh
1810 | Ladywell House, Ladywell Street, Edinburgh
1710 | Ladywell House, Lady[vv]ell Street, Edinburgh
Annotations on the addresses: postcode change; without postcode; interpretation error
• Change of road networks (new roads replace old) over time
• Change of road names over time
• Interpretation errors from the address digitisation
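One way to cope with renamed roads and digitisation errors is to match each historical address fuzzily against a gazetteer whose entries carry a period of validity. The following is only a minimal sketch of that idea, not the project's geocoding tool: the gazetteer entries, their validity years, the feature identifiers and the names `GAZETTEER`, `normalise` and `geocode` are all hypothetical, and the similarity measure is Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

# Hypothetical gazetteer: (address, valid from, valid to, feature id).
# The validity years and feature ids are invented for illustration only.
GAZETTEER = [
    ("ladywell house, ladywell street, edinburgh", 1700, 1950, "feature-1"),
    ("ladywell house, ladywell road, edinburgh", 1951, 2020, "feature-2"),
]

def normalise(address):
    """Lower-case, collapse whitespace and repair one example transcription artefact."""
    text = " ".join(address.lower().split())
    return text.replace("[vv]", "w")

def geocode(address, year, threshold=0.85):
    """Return (gazetteer address, feature id) for the best match valid in `year`.

    Returns None when no candidate reaches the similarity threshold.
    """
    query = normalise(address)
    best, best_score = None, 0.0
    for name, start, end, feature in GAZETTEER:
        if not (start <= year <= end):
            continue  # entry not valid for this period
        score = SequenceMatcher(None, query, name).ratio()
        if score > best_score:
            best, best_score = (name, feature), score
    return best if best_score >= threshold else None

print(geocode("Ladywell House, Lady[vv]ell Street, Edinburgh", 1710))
print(geocode("Ladywell House, Ladywell Road, Edinburgh, EH12 7T", 2010))
```

Restricting candidates by year keeps an old street name from being matched to the feature that replaced it. Addresses that fall below the threshold return `None` and would need manual review.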
Challenges
• Significant methodological issues:
– How can we consistently code occupational data so that researchers can explore changing patterns and trends?
– How can we automate this process so that the majority of records do not need to be manually coded?
Digitising Scotland
• Records of births, marriages and deaths recorded in Scotland from 1855 to the present day.
Experimental Dataset
• Use a dataset with similar content for experiments
• 60,000 records from the Cambridge Family History Study (records from 1800-1990)
• Occupation descriptions and associated HISCO codes
• HISCO coding done by historians
• Dataset contains 330 different HISCO codes
HISCO Hierarchy Example
Classification Example
String from record | Gold Standard Classification | Automatic Classification Output
Farm horseman | 62460 Horse Worker | 62460 Horse Worker
Shoe maker | 80110 Shoemaker, General | 80110 Shoemaker, General
Fireman (railway) | 98330 Railway Steam-Engine Fireman | 98330 Railway Steam-Engine Fireman
Fireman | 58100 Fire-Fighter | 58100 Fire-Fighter
Stationer | 41000 Working Proprietors (Wholesale and Retail Trade) | 91000 Paper and Paperboard product makers
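A simple baseline for this kind of coding is to assign each new string the code of the most similar already-coded string. The sketch below illustrates that idea with Python's standard `difflib`; it is not the project's implementation. The coded strings come from the example above, while `CODED_EXAMPLES`, `classify_by_similarity` and the 0.75 threshold are hypothetical choices.

```python
from difflib import SequenceMatcher

# Coded strings taken from the classification example (gold standard codes).
CODED_EXAMPLES = {
    "farm horseman": "62460",
    "shoe maker": "80110",
    "fireman (railway)": "98330",
    "fireman": "58100",
    "stationer": "41000",
}

def classify_by_similarity(occupation, min_ratio=0.75):
    """Return (HISCO code, similarity) of the closest coded string.

    The code is None when the best similarity is below `min_ratio`,
    which marks the record for manual coding.
    """
    query = " ".join(occupation.lower().split())
    best_code, best_ratio = None, 0.0
    for text, code in CODED_EXAMPLES.items():
        ratio = SequenceMatcher(None, query, text).ratio()
        if ratio > best_ratio:
            best_code, best_ratio = code, ratio
    if best_ratio < min_ratio:
        return None, best_ratio
    return best_code, best_ratio

print(classify_by_similarity("Shoemaker"))  # spelling variant of "Shoe maker"
print(classify_by_similarity("Painter"))    # nothing similar enough: code is None
```

String similarity handles spelling variants well, but it cannot tell that two differently spelled occupations belong to the same class, which is one reason to combine it with a learned model.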
Approach
• Text analysis
• Supervised machine learning
– Apache Mahout framework
• Combination of these techniques
Supervised Machine Learning
[Diagram: Training Data → Machine Learning → Prediction Model; Unseen Data → Prediction Model → Predicted Classification]
• Training data: Farm horseman 62460; Shoe maker 80110; Fireman 58100; Stationer 41000
• Unseen data: Farm horseman; Boot maker; Fireman; Painter
• Predicted classification: ?
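The train-then-predict flow can be sketched with a small multinomial Naïve Bayes classifier over word tokens. This is plain Python written for illustration, not the Apache Mahout setup used in the project; the function names `tokens`, `train` and `predict` are hypothetical, and the four training strings are those shown on the slide.

```python
import math
from collections import Counter, defaultdict

# Training data from the slide: occupation string and HISCO code.
TRAINING = [
    ("Farm horseman", "62460"),
    ("Shoe maker", "80110"),
    ("Fireman", "58100"),
    ("Stationer", "41000"),
]

def tokens(text):
    return text.lower().split()

def train(examples):
    """Count classes and per-class word frequencies."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, code in examples:
        class_counts[code] += 1
        for tok in tokens(text):
            word_counts[code][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return (most probable code, posterior probability) with add-one smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for code, n in class_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[code].values()) + len(vocab)
        for tok in tokens(text):
            if tok in vocab:  # words never seen in training carry no evidence
                score += math.log((word_counts[code][tok] + 1) / denom)
        log_scores[code] = score
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

model = train(TRAINING)
for unseen in ["Farm horseman", "Boot maker", "Fireman", "Painter"]:
    print(unseen, predict(model, unseen))
```

With this toy training set, "Boot maker" is pulled towards 80110 by the shared word "maker", whereas "Painter" shares no word with any class, so its posterior stays at the prior of 0.25. A low probability like that is a natural signal to send the record for manual coding.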
[Figure: cause-of-death strings "Asthma" and "Miners asthma", with related terms: spasmodic, collier's, miner's, miners, asthma, dropsy, bronchial]
Classification Accuracy
[Bar chart: accuracy (%), axis from 0 to 100, by technique: String Similarity, SGD, Naïve Bayes, Majority Vote, Confidence Weighted 1, Confidence Weighted 2]
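The ensemble techniques named in the chart combine the outputs of several classifiers. The slides do not define the two confidence-weighted variants, so the sketch below shows only one plausible reading of each idea: a strict majority vote, and a vote in which each classifier's prediction is weighted by its confidence. The function names and the example predictions are hypothetical.

```python
from collections import Counter, defaultdict

def majority_vote(predictions):
    """predictions: list of (code, confidence), one per classifier.

    Returns the code chosen by more than half of the classifiers, else None.
    """
    counts = Counter(code for code, _ in predictions)
    code, votes = counts.most_common(1)[0]
    return code if votes > len(predictions) / 2 else None

def confidence_weighted(predictions):
    """Return the code with the largest summed confidence."""
    totals = defaultdict(float)
    for code, confidence in predictions:
        totals[code] += confidence
    return max(totals, key=totals.get)

# Invented outputs of three classifiers for the string "Stationer".
predictions = [("41000", 0.40), ("91000", 0.95), ("41000", 0.45)]
print(majority_vote(predictions))        # 41000: two of three classifiers agree
print(confidence_weighted(predictions))  # 91000: one very confident classifier wins
```

The two rules can disagree, as in this example, which is why their accuracy has to be compared empirically against the gold standard.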
Creation of a fully-linked vital events database for the whole of Scotland back to 1855
[Diagram: timeline 1855, 1974, Present. Vital events (24 million births, deaths and marriages) held as digital images + index are converted into a Vital Events Database, which is combined with the existing Vital Events Database into a Fully-linked Vital Events Database]
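Linking events that belong to the same person means comparing names that were often spelled differently at each registration. The sketch below is a hedged illustration of that step and not the project's linkage method: the records, field names, thresholds and the `link` function are all invented, and names are compared with Python's standard `difflib`.

```python
from difflib import SequenceMatcher

# Invented example records; real vital events records hold many more fields.
births = [
    {"id": "B1", "forename": "Janet", "surname": "McDonald", "birth_year": 1860},
    {"id": "B2", "forename": "John", "surname": "Smith", "birth_year": 1861},
]
deaths = [
    {"id": "D1", "forename": "Jannet", "surname": "Macdonald", "age": 70, "death_year": 1930},
    {"id": "D2", "forename": "John", "surname": "Smyth", "age": 40, "death_year": 1902},
]

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link(births, deaths, threshold=0.8, year_tolerance=1):
    """Pair each death with the most similar birth born in a compatible year."""
    links = []
    for death in deaths:
        implied_birth_year = death["death_year"] - death["age"]
        best, best_score = None, 0.0
        for birth in births:
            # Blocking: skip births outside the tolerated birth-year window.
            if abs(birth["birth_year"] - implied_birth_year) > year_tolerance:
                continue
            score = (similarity(birth["forename"], death["forename"])
                     + similarity(birth["surname"], death["surname"])) / 2
            if score > best_score:
                best, best_score = birth, score
        if best is not None and best_score >= threshold:
            links.append((best["id"], death["id"]))
    return links

print(link(births, deaths))
```

Comparing only records with a compatible birth year keeps the number of comparisons manageable, which matters with tens of millions of records.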
Large scale family reconstruction studies and Pedigrees
Gottfredsson, Magnús, et al. "Lessons from the past: familial aggregation analysis of fatal pandemic influenza (Spanish flu) in Iceland in 1918." Proceedings of the National Academy of Sciences 105.4 (2008): 1303-1308.
Acknowledgments
• The Digitising Scotland project is funded by ESRC.
• The support from National Records of Scotland is also gratefully acknowledged.