Big Data to Knowledge in the Health Sciences: The Application and Value of Cancer Infodemiology
Georgia Tourassi, PhD
SMC13, 2013 Smoky Mountains CSE Conference
Gatlinburg, TN
September 5, 2013
Slide 2
Environmental Cancer Risk and Migration Pattern
PIs: Georgia Tourassi / Songhua Xu
Slide 3
Environmental Cancer Risk and Migration Pattern
Slide 4
Infodemiology
The epidemiology of digital (mis)information
The Internet has made measurable what was previously immeasurable: the distribution of health information in a population, tracking (in real time) health information trends over time, and identifying gaps between information supply and demand.
G Eysenbach, Am J Med 2002
Slide 5
Infodemiology in Action
http://www.google.org/flutrends/video/GoogleFluTrends_USFluActivity.mov
Slide 6
Application Areas
- Detecting and quantifying disparities in information availability
- Monitoring publications on the Internet that are relevant to public health
- Tracking the effectiveness of health marketing campaigns
- Monitoring health-related behaviors
- Syndromic surveillance
- Unknown drug side effects and complications
Slide 7
Social Media Use among Internet Users
Chou, WS et al. 2009. Social Media Use in the US: Implications for health communication, J Med Internet Res, 11(4): e48.
Slide 8
Social Media Use among Internet Users
Chou, WS et al. 2009. Social Media Use in the US: Implications for health communication, J Med Internet Res, 11(4): e48.
Slide 9
Cancer Community
- One in five internet users with cancer
- A growing number of cancer patients share their personal stories online, describing their symptoms, treatments, emotional and physical concerns, and many other issues that arise throughout the cancer diagnosis, treatment, and recovery phases.
- Analyzing user-generated content in online cancer communities holds promising potential for knowledge discovery.
Slide 10
CASE STUDY 1: Parity and Breast Cancer Risk
Slide 11
Case-Control Study
Knowledge Discovery
Population
- Cases with Breast Cancer: Childbirth / No Childbirth
- Controls without Breast Cancer: Childbirth / No Childbirth
Slide 12
Conventional Data Collection
- Institutes
- Hospitals
- Organizations
Slide 13
Proposed Data Collection
- On-Line Obituaries
Slide 14
Web Crawling and Text Parsing
Local Newspaper Websites -> Web Crawler -> Parser -> Age, Gender, Childbirth, Cause of Death
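The parsing step can be illustrated with a minimal rule-based sketch. The slides do not give the actual extraction rules, so every pattern, word list, and name below (`AGE_PATTERNS`, `CHILD_RE`, `CAUSE_RE`, `parse_obituary`) is a hypothetical stand-in, and the sample obituary is invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# All patterns are hypothetical examples, not the rules used in the study.
AGE_PATTERNS = [
    re.compile(r"\baged?\s+(\d{1,3})\b", re.I),        # "age 67", "aged 67"
    re.compile(r"\b(\d{1,3})\s+years\s+old\b", re.I),   # "67 years old"
    re.compile(r",\s*(\d{1,3})\s*,"),                    # "Jane Doe, 67, of ..."
]
FEMALE_WORDS = {"she", "her", "hers"}
MALE_WORDS = {"he", "him", "his"}
CHILD_RE = re.compile(
    r"\bsurvived by[^.]*\b(?:sons?|daughters?|child(?:ren)?)\b"
    r"|\b(?:mother|father) of\b",
    re.I,
)
CAUSE_RE = re.compile(
    r"\b(?:battle|fight|struggle|bout) with ((?:[a-z]+ )?cancer)\b", re.I
)


@dataclass
class ObituaryRecord:
    age: Optional[int]
    gender: Optional[str]
    has_children: bool          # False means "no evidence found", not "no children"
    cause_of_death: Optional[str]


def extract_age(text: str) -> Optional[int]:
    # Return the first age matched by any pattern, in priority order.
    for pattern in AGE_PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    return None


def extract_gender(text: str) -> Optional[str]:
    # Majority vote over gendered pronouns; ties are left undecided.
    words = re.findall(r"[a-z]+", text.lower())
    female = sum(w in FEMALE_WORDS for w in words)
    male = sum(w in MALE_WORDS for w in words)
    if female == male:
        return None
    return "female" if female > male else "male"


def parse_obituary(text: str) -> ObituaryRecord:
    cause = CAUSE_RE.search(text)
    return ObituaryRecord(
        age=extract_age(text),
        gender=extract_gender(text),
        has_children=bool(CHILD_RE.search(text)),
        cause_of_death=cause.group(1).lower() if cause else None,
    )


SAMPLE = ("Jane Doe, 67, of Knoxville, passed away Monday after a long battle "
          "with breast cancer. She is survived by her husband and two daughters.")
record = parse_obituary(SAMPLE)
print(record)
```

A production system would need far more rules (for example, separating the children of the deceased from other relatives), which is why the record keeps missing fields as None rather than guessing.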
Slide 15
Information Retrieval - Age
Slide 16
Information Retrieval - Gender
Slide 17
Information Retrieval - Childbirth
Slide 18
Information Retrieval - Cause of Death
Slide 19
Data Collection
Obituaries published online, 2000-2012
- 59,002 with breast cancer
- 50,927 without breast cancer
After cleaning
- Case group: 20,332 (15,946 with at least one biological child)
- Control group: 15,954 (13,548 with at least one biological child)
Slide 20
Case and Control Groups
- Case group: 20,332 women, 15,946 with children (78.4%)
- Control group: 15,954 women, 13,548 with children (84.9%)
Slide 21
Childbirth Incidence
Slide 22
Odds Ratio
Ages 30-69 years, age-adjusted by the 2010 US Standard Population
- Odds of childbirth incidence in the case group: 13,463 / 4,284 = 3.1
- Odds of childbirth incidence in the control group: 6,556 / 1,545 = 4.2
- Odds ratio (of childbirth incidence) = 0.74, CI: (0.69, 0.79)

Parity        Case             Control
Nulliparous   4,284 (24.1%)    1,545 (19.1%)
Parous        13,463 (75.9%)   6,556 (80.9%)
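The odds ratio and its interval can be recomputed from the parity table counts (parous 13,463 and nulliparous 4,284 cases; parous 6,556 and nulliparous 1,545 controls). The slide does not state how the interval was obtained, so the sketch below assumes a 95% Wald interval on the log odds ratio, which is one common choice; the function name `odds_ratio_ci` is illustrative.

```python
import math


def odds_ratio_ci(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls, z=1.96):
    """Odds ratio with a Wald confidence interval on the log scale.

    z=1.96 corresponds to a 95% interval (an assumption; the slide
    reports a CI without naming the level or the method).
    """
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    ratio = odds_cases / odds_controls
    # Standard error of log(OR) from the four cell counts.
    se = math.sqrt(1 / exposed_cases + 1 / unexposed_cases
                   + 1 / exposed_controls + 1 / unexposed_controls)
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return ratio, lower, upper


# Counts from the parity table: "exposed" means parous.
ratio, lower, upper = odds_ratio_ci(13463, 4284, 6556, 1545)
print(f"OR = {ratio:.2f}, CI: ({lower:.2f}, {upper:.2f})")
# prints: OR = 0.74, CI: (0.69, 0.79)
```

Under these assumptions the result agrees with the reported odds ratio of 0.74 and interval (0.69, 0.79).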
Slide 23
Reliability?

Study                        Odds Ratio   Confidence Interval
Layde et al (USA, 1989)      0.73         (0.65, 0.83)
Lambe et al (Sweden, 1996)   0.82         (0.78, 0.86)
Our study                    0.74         (0.69, 0.79)
Slide 24
Number of Children & Breast Cancer Risk
(OR = odds ratio; Lower and Upper = bounds of the confidence interval)

          Layde                  Lambe                  Our Study
Parity    OR     Lower  Upper    OR     Lower  Upper    OR     Lower  Upper
0         1.00
1         0.92   0.78   1.09     0.92   0.86   0.97     0.89   0.85   0.94
2         0.83   0.73   0.95     0.84   0.80   0.89     0.85   0.81   0.89
3         0.70   0.61   0.81     0.76   0.71   0.81     0.84   0.80   0.89
4         0.60   0.51   0.70     0.67   0.61   0.74     0.62   0.58   0.66
5         0.55   0.44   0.67     0.53   0.45   0.63     0.52   0.47   0.57
6         0.52   0.40   0.67     0.33   0.24   0.46     0.53   0.47   0.61
7+        0.41   0.31   0.53     0.43   0.30   0.64     0.44   0.39   0.49
Parous    0.73   0.65   0.83     0.82   0.78   0.86     0.74   0.69   0.79
Slide 25
Sample Size

Study                        Cases     Controls
Layde et al (USA, 1989)      4,599     4,536
Lambe et al (Sweden, 1996)   12,782    63,888
Our study                    20,332    15,954
Slide 26
Discussion
- Limitation of obituaries: cannot derive the effect of additional factors (e.g., age at first pregnancy, breastfeeding, lifestyle choices)
- Other types of online personal life stories from patients can overcome these limitations
Slide 27
CASE STUDY 2: Geospatial Cancer Mortality Trends in the US
Slide 28
Web Mining for Deriving Geospatial Cancer Mortality Trends in the US
- Collecting, compiling, and reporting cancer mortality surveillance statistics is a time-consuming process that introduces substantial delays in monitoring.
- We propose to study whether general cancer mortality trends can be adequately captured by automated analysis of text content found in online obituaries published in US newspapers.
Slide 29
Method Overview
- We implemented an obituary crawler to collect a large number of obituaries from online local newspapers.
- We implemented a rule-based natural language system to transform the collected obituary documents into a structured format.
- We applied two correction factors to account for anticipated biases in the statistics derived from the collected dataset.
- We compared statistical reports derived from the collected obituary dataset with the cancer mortality statistics published by SEER to show that we can generate more accurate cancer mortality reports from the collected dataset.
Data Collection
Obituary Crawler
- Based on an online obituary search engine, ObitFinder
- Serviced by Legacy.com, one of the largest online obituary providers for the US newspaper industry
- 1,100 newspapers, 2005-2009 (200+ GB)
- Covering 46 US states (AR, ND, WV, HI, WY excluded)
Data
- Random selection
- 3,572,122 online obituary articles
Slide 32
Data Analysis
Anticipated Biases
- The number of cancer-related obituaries could be biased because the prevalence of obituaries differs across age groups and states
- The proportion of obituaries that include a cause of death could bias the number of cancer-related obituaries
Correction Factors
- Referencing the statistics from the CDC Deaths Final Report (2005-2009)
- Incorporating a cultural openness factor for a particular age group or state
Slide 33
Correction Factors
Adjustment Ratio 1 (Age-based Obituary Distribution across States)
- The age-based obituary distribution over states may be different from the age-based death distribution over states
- E.g., in the case of Tennessee, [#Obituary(TN)/#Obituary(US)] is 0.86%, but [#Death(TN)/#Death(US)] is 2.43%
- Adjustment Ratio 1 for TN is 2.43/0.86 = 2.84
- We can compute Adjustment Ratio 1 for each state
Adjustment Ratio 2 (Obituary Content Richness)
- The proportion of obituaries that include a cause of death may differ between states
- http://en.wikipedia.org/wiki/List_of_causes_of_death_by_rate
- E.g., in the case of California, 20.7% of obituaries include cause-of-death related terms; however, only 5.2% of Alabama obituaries include cause-of-death related terms
- Adjustment Ratio 2 for CA is 5.2/20.7 = 0.13
- We can compute Adjustment Ratio 2 for each state
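Adjustment Ratio 1 can be sketched as the ratio of a state's share of US deaths to its share of collected obituaries, following the Tennessee example. The function name and the counts below are hypothetical; the counts are chosen only to reproduce the 0.86% and 2.43% shares quoted for Tennessee. With those rounded percentages the ratio comes out near 2.83, close to the 2.84 shown on the slide.

```python
def adjustment_ratio_1(obits_state, obits_us, deaths_state, deaths_us):
    """Ratio of a state's death share to its obituary share.

    A value above 1 means the state is under-represented among the
    collected obituaries relative to its share of actual deaths.
    """
    obituary_share = obits_state / obits_us
    death_share = deaths_state / deaths_us
    return death_share / obituary_share


# Hypothetical counts that reproduce the Tennessee shares on the slide:
# 86 / 10,000 = 0.86% of obituaries, 243 / 10,000 = 2.43% of deaths.
tn_ratio = adjustment_ratio_1(obits_state=86, obits_us=10_000,
                              deaths_state=243, deaths_us=10_000)
print(f"Adjustment Ratio 1 (TN): {tn_ratio:.2f}")
```

The slides do not spell out how the two adjustment ratios are combined into a corrected count, so that step is left out of this sketch.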
Slide 34
Case Study 1: Breast Cancer
6,935 female subjects
Slide 35
Slide 36
Reliability?
Slide 37
Case Study 1: Lung Cancer
5,312 subjects
Slide 38
Slide 39
Reliability?
Slide 40
Conclusions
- Cancer mortality trends can be captured reliably in a time-efficient, cost-effective, and fully automated way by mining content that is openly available on the Internet.
- Using breast and lung cancer as case studies, we observed that the trends discovered via web mining were very similar to those reported by NCI.
- The proposed correction factors are useful for accounting for anticipated biases in statistics derived from obituary datasets.
Slide 41
Summary
- Web mining is a cost-effective way to carry out epidemiological knowledge discovery
- Well suited as a hypothesis generator
- Monitoring trends in a dynamic way by continuously parsing and analyzing new online content