Chris Dibben University of Edinburgh Linking historical administrative data.


Transcript of Chris Dibben University of Edinburgh Linking historical administrative data.

Page 1: Chris Dibben University of Edinburgh Linking historical administrative data.

Chris Dibben, University of Edinburgh

Linking historical administrative data

Page 2: Chris Dibben University of Edinburgh Linking historical administrative data.

Context

• History of very important contributions:

– Dutch Famine Birth Cohort Study – epigenetics, thrifty phenotype

– Överkalix study – epigenetics, sex differences

– UK Longitudinal Study – health inequalities

Page 3: Chris Dibben University of Edinburgh Linking historical administrative data.

Two new developmental projects

• Scottish Mental Surveys 1932 and 1947

• Scottish civil registration data

• New cohorts for people now in old age

Page 4: Chris Dibben University of Edinburgh Linking historical administrative data.

The ‘Scottish Mental Survey’

Page 5: Chris Dibben University of Edinburgh Linking historical administrative data.

1947 Scottish Mental Survey

Birth 1936

1939 register

ED code, address, household members: marital status, occupation

The Scottish Longitudinal Study

Scottish morbidity records

1939 books recorded the date of death (up to 1980)

Linkage to the death database (1974 onwards)

Education

Employment

Page 6: Chris Dibben University of Edinburgh Linking historical administrative data.

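[Figure: life-course timeline for the 1936 birth cohort]

Year axis: 1936, 1947, 1970, 1991, 2001, 2011

Age axis: 0, 11, 34, 55, 65, 75

Information shown along the timeline: early life environment; birth (1936); mental ability (age 11, 1947); school achievement (time estimated); occupation (estimated); detailed household/individual information (1991, 2001, 2011); hospitalisation; mortality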

Page 7: Chris Dibben University of Edinburgh Linking historical administrative data.

Background – Scottish vital events

• Civil registration of births, deaths and marriages in Scotland began on 1 January 1855

• All historical vital events records have been converted into digital image format with a supporting index

• Modern vital events data (from 1974 onwards) are available electronically

Page 8: Chris Dibben University of Edinburgh Linking historical administrative data.

Digitising Scotland

• Approximately 50 million occupation strings, 8 million causes of death

• Classify occupations to Historical International Standard Classification of Occupations (HISCO)

• Cause of death to a modified ICD10

• Each with a location

Page 9: Chris Dibben University of Edinburgh Linking historical administrative data.

Historical Geocoding

[Diagram: geocoding tool + geometry features for each period (1710, 1810, 1910, 2010)]

Year | Historical address
2010 | Ladywell House, Ladywell Road, Edinburgh, EH12 7T
1910 | Ladywell House, Ladywell Street, Edinburgh
1810 | Ladywell House, Ladywell Street, Edinburgh
1710 | Ladywell House, Lady[vv]ell Street, Edinburgh

Annotations on the address list: postcode change; without postcode; interpretation error

• Change of road networks (new roads replace old) over time

• Change of road names over time

• Interpretation errors from the address digitisation
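One way to cope with changing road names and transcription errors is to match each historical address against a street list for the nearest period, using approximate string matching. The Python sketch below is a minimal illustration under stated assumptions, not the project's geocoding tool: the gazetteer contents, the placeholder coordinates, the `normalise` and `geocode` functions and the 0.8 cutoff are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical period-specific gazetteers: year -> {street name: (x, y)}.
# The coordinates are placeholders, not real locations.
GAZETTEERS = {
    1910: {"ladywell street": (1.0, 2.0), "high street": (3.0, 4.0)},
    2010: {"ladywell road": (1.0, 2.0), "high street": (3.0, 4.0)},
}


def normalise(street):
    """Lower-case the name and undo one known transcription artefact ('[vv]' for 'w')."""
    return street.lower().replace("[vv]", "w").strip()


def geocode(street, year, cutoff=0.8):
    """Return (matched street, coordinates, score) from the gazetteer nearest in time.

    Returns None when no street in that gazetteer is similar enough.
    """
    period = min(GAZETTEERS, key=lambda y: abs(y - year))
    query = normalise(street)
    best_name, best_score = None, 0.0
    for name in GAZETTEERS[period]:
        score = SequenceMatcher(None, query, name).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score < cutoff:
        return None
    return best_name, GAZETTEERS[period][best_name], best_score


old = geocode("Lady[vv]ell Street", 1710)   # matched via the 1910 street list
new = geocode("Ladywell Street", 2010)      # old name against the modern list
```

In this toy setup the 1710 transcription is repaired and matched, while the old street name fails against the 2010 list because the road has been renamed. That is the case where a table of name changes over time would be needed in addition to fuzzy matching.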


Page 12: Chris Dibben University of Edinburgh Linking historical administrative data.

Challenges

• Significant methodological issues:

– How can we consistently code occupational data so that researchers can explore changing patterns and trends?

– How can we automate this process so that the majority of records do not need to be manually coded?


Page 13: Chris Dibben University of Edinburgh Linking historical administrative data.

Digitising Scotland

• Records of births, marriages and deaths recorded in Scotland from 1855 to present day.



Page 19: Chris Dibben University of Edinburgh Linking historical administrative data.

Experimental Dataset

• Use a dataset with similar content for experiments

• 60,000 records from the Cambridge Family History Study (records from 1800-1990)

• Occupation descriptions and associated HISCO codes

• HISCO coding done by historians

• Dataset contains 330 different HISCO codes


Page 20: Chris Dibben University of Edinburgh Linking historical administrative data.


HISCO Hierarchy Example

Page 21: Chris Dibben University of Edinburgh Linking historical administrative data.

Classification Example

String from record | Gold Standard Classification | Automatic Classification Output
Farm horseman | 62460 Horse Worker | 62460 Horse Worker
Shoe maker | 80110 Shoemaker, General | 80110 Shoemaker, General
Fireman (railway) | 98330 Railway Steam-Engine Fireman | 98330 Railway Steam-Engine Fireman
Fireman | 58100 Fire-Fighter | 58100 Fire-Fighter
Stationer | 41000 Working Proprietors (Wholesale and Retail Trade) | 91000 Paper and Paperboard product makers


Page 23: Chris Dibben University of Edinburgh Linking historical administrative data.

Approach

• Text analysis

• Supervised machine learning

– Apache Mahout framework.

• Combination of these techniques.
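As a rough illustration of the text-analysis route, a new occupation string can be given the HISCO code of the most similar string that has already been coded. The Python sketch below uses only the standard library and is not the project's Mahout-based pipeline. The lookup table reuses strings and codes from the classification example, while `classify_by_similarity` and its 0.8 cutoff are hypothetical.

```python
from difflib import SequenceMatcher

# Already-coded strings (strings and codes taken from the classification example).
CODED = {
    "farm horseman": "62460",
    "shoe maker": "80110",
    "fireman (railway)": "98330",
    "fireman": "58100",
    "stationer": "41000",
}


def classify_by_similarity(text, cutoff=0.8):
    """Return (code, score) for the closest coded string.

    The code is None when even the best match scores below the cutoff.
    """
    query = " ".join(text.lower().split())
    scored = [(SequenceMatcher(None, query, known).ratio(), known) for known in CODED]
    score, known = max(scored)
    if score < cutoff:
        return None, score
    return CODED[known], score


shoemaker_code, _ = classify_by_similarity("Shoemaker")
fireman_code, _ = classify_by_similarity("Fireman")
painter_code, _ = classify_by_similarity("Painter")
```

Spelling variants such as "Shoemaker" are picked up by the similarity score, whereas an occupation with no close neighbour in the table, such as "Painter", is left uncoded for another technique or for manual coding.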

Page 24: Chris Dibben University of Edinburgh Linking historical administrative data.

Supervised Machine Learning

Training Data → Machine Learning → Prediction Model

Unseen Data → Prediction Model → Predicted Classification


Page 27: Chris Dibben University of Edinburgh Linking historical administrative data.

Supervised Machine Learning

Training Data (Farm horseman 62460; Shoe maker 80110; Fireman 58100; Stationer 41000) → Machine Learning → Prediction Model

Unseen Data (Farm horseman; Boot maker; Fireman; Painter) → Prediction Model → Predicted Classification (?)
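The train-then-predict flow can be sketched in a few lines of Python. This is a minimal illustration, not the project's Mahout implementation: it assumes a multinomial naive Bayes classifier over character trigrams with add-one smoothing, and the `features` function and `NaiveBayes` class are hypothetical names. The training pairs are the four shown on the slide.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Character trigrams of the lower-cased, space-padded string."""
    s = f" {' '.join(text.lower().split())} "
    return [s[i:i + 3] for i in range(len(s) - 2)]


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        for text, label in examples:
            self.class_counts[label] += 1
            for f in features(text):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for f in features(text):
                score += math.log((counts[f] + 1) / denom)  # smoothed log likelihood
            scores[label] = score
        return max(scores, key=scores.get)


# Training data from the slide: occupation string and HISCO code.
TRAINING = [
    ("Farm horseman", "62460"),
    ("Shoe maker", "80110"),
    ("Fireman", "58100"),
    ("Stationer", "41000"),
]
model = NaiveBayes().fit(TRAINING)
boot_maker_code = model.predict("Boot maker")
```

With this toy model "Boot maker" lands on the same code as "Shoe maker" because the two strings share the trigrams of "maker". An unseen occupation such as "Painter" would still be forced into one of the four known codes, which is why a large coded training set and some measure of confidence matter in practice.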

Page 28: Chris Dibben University of Edinburgh Linking historical administrative data.

[Figure: cause-of-death text example. Labels shown: Asthma; Miners asthma; terms spasmodic, collier's, miner's, miners, asthma, dropsy, bronchial; two values of 100%]

Page 29: Chris Dibben University of Edinburgh Linking historical administrative data.

Classification Accuracy

[Figure: bar chart of accuracy (%), axis from 10 to 100, by technique: String Similarity, SGD, Naïve Bayes, Majority Vote, Confidence Weighted 1, Confidence Weighted 20]
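The chart compares individual classifiers with ways of combining them. The transcript does not say how "Majority Vote" and "Confidence Weighted" were implemented, nor what the "1" and "20" variants mean, so the Python sketch below shows only one common reading of the two ideas. The function names and the classifier outputs are made up for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """predictions: list of (code, confidence) pairs; return the most frequent code."""
    counts = Counter(code for code, _ in predictions)
    return counts.most_common(1)[0][0]


def confidence_weighted_vote(predictions):
    """Sum the confidences per code and return the code with the largest total."""
    totals = defaultdict(float)
    for code, confidence in predictions:
        totals[code] += confidence
    return max(totals, key=totals.get)


# Made-up outputs of three classifiers for the string "Stationer".
outputs = [("91000", 0.40), ("91000", 0.35), ("41000", 0.95)]

by_majority = majority_vote(outputs)
by_confidence = confidence_weighted_vote(outputs)
```

In this invented case two unsure classifiers outvote one confident classifier under majority voting, while weighting by confidence lets the confident one win.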

Page 30: Chris Dibben University of Edinburgh Linking historical administrative data.

Creation of a fully-linked vital events database for the whole of Scotland back to 1855

[Diagram: timeline 1855 → 1974 → Present. Vital events (24 million births, deaths and marriages) held as digital images + index are converted into a Vital Events Database and combined with the existing Vital Events Database to form a Fully-linked Vital Events Database]

Page 31: Chris Dibben University of Edinburgh Linking historical administrative data.

Large scale family reconstruction studies and Pedigrees

Page 32: Chris Dibben University of Edinburgh Linking historical administrative data.

Gottfredsson, Magnús, et al. "Lessons from the past: familial aggregation analysis of fatal pandemic influenza (Spanish flu) in Iceland in 1918." Proceedings of the National Academy of Sciences 105.4 (2008): 1303-1308.

Page 34: Chris Dibben University of Edinburgh Linking historical administrative data.

Acknowledgments

• The Digitising Scotland project is funded by ESRC;

• The support from National Records of Scotland is also gratefully acknowledged.