Transcript of “Big Data to Knowledge” in the Health Sciences: The Application and Value of Cancer...
- Slide 1
- Big Data to Knowledge in the Health Sciences: The Application
and Value of Cancer Infodemiology. Georgia Tourassi, PhD. SMC13,
2013 Smoky Mountains CSE Conference, Gatlinburg, TN, September 5,
2013
- Slide 2
- Environmental Cancer Risk and Migration Pattern. PIs: Georgia
Tourassi / Songhua Xu
- Slide 3
- Environmental Cancer Risk and Migration Pattern
- Slide 4
- Infodemiology: the epidemiology of digital (mis)information. The
Internet has made measurable what was previously immeasurable: the
distribution of health information in a population, tracking (in
real time) health information trends over time, and identifying
gaps between information supply and demand. G Eysenbach, Am J Med
2002
- Slide 5
- Infodemiology in Action
http://www.google.org/flutrends/video/GoogleFluTrends_USFluActivity.mov
- Slide 6
- Application Areas
  o Detecting and quantifying disparities in information
    availability
  o Monitoring public health relevant publications on the Internet
  o Tracking effectiveness of health marketing campaigns
  o Monitoring health related behaviors
  o Syndromic surveillance
  o Unknown drug side effects and complications
- Slide 7
- Social Media Use among Internet Users. Chou, WS et al. 2009.
Social Media Use in the US: Implications for health communication,
J Med Internet Res, 1(4): e48.
- Slide 8
- Social Media Use among Internet Users. Chou, WS et al. 2009.
Social Media Use in the US: Implications for health communication,
J Med Internet Res, 1(4): e48.
- Slide 9
- Cancer Community. One in five internet users with cancer. A
growing number of cancer patients share their personal stories
online, describing their symptoms, treatments, emotional and
physical concerns, and many other issues that arise throughout the
cancer diagnosis, treatment, and recovery phases. Analyzing
user-generated content in online cancer communities holds promise
for knowledge discovery.
- Slide 10
- CASE STUDY 1: Parity and Breast Cancer Risk
- Slide 11
- Case-Control Study, Knowledge Discovery. Population:
  o Cases with breast cancer: childbirth / no childbirth
  o Controls without breast cancer: childbirth / no childbirth
- Slide 12
- Conventional Data Collection: institutes, hospitals,
organizations
- Slide 13
- Proposed Data Collection: online obituaries
- Slide 14
- Web Crawling and Text Parsing: local newspaper websites, web
crawler, parser; extracted fields: age, gender, childbirth, cause
of death
- Slide 15
- Information Retrieval - Age
- Slide 16
- Information Retrieval - Gender
- Slide 17
- Information Retrieval - Childbirth
- Slide 18
- Information Retrieval - Cause of Death
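The four Information Retrieval slides name the fields the parser pulls from each obituary (age, gender, childbirth, cause of death) but the transcript does not preserve the rules themselves. The following is a minimal sketch of how such rule-based extraction might look. Every pattern, the `ObituaryRecord` type, and the `extract_fields` function are hypothetical and chosen only for illustration; they are not the authors' rules.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical patterns, for illustration only (not the rules from the talk).
AGE_PATTERNS = [
    re.compile(r"\baged?\s+(\d{1,3})\b", re.IGNORECASE),        # "age 80", "aged 80"
    re.compile(r"\b(\d{1,3})\s+years\s+old\b", re.IGNORECASE),  # "80 years old"
    re.compile(r",\s*(\d{1,3})\s*,"),                           # "Jane Doe, 67, of ..."
]
FEMALE_PRONOUNS = {"she", "her", "hers"}
MALE_PRONOUNS = {"he", "his", "him"}
CHILD_RE = re.compile(
    r"\b(?:mother of|survived by (?:her|his) (?:\w+ )?(?:sons?|daughters?|children))\b",
    re.IGNORECASE,
)
BREAST_CANCER_RE = re.compile(r"\bbreast cancer\b", re.IGNORECASE)


@dataclass
class ObituaryRecord:
    age: Optional[int]
    gender: Optional[str]
    has_child: bool
    breast_cancer: bool


def extract_fields(text: str) -> ObituaryRecord:
    """Apply the illustrative rules to one obituary and return a structured record."""
    # Age: the first pattern that matches wins.
    age = None
    for pattern in AGE_PATTERNS:
        match = pattern.search(text)
        if match:
            age = int(match.group(1))
            break

    # Gender: majority vote over gendered pronouns; None on a tie.
    tokens = re.findall(r"[a-z]+", text.lower())
    n_female = sum(t in FEMALE_PRONOUNS for t in tokens)
    n_male = sum(t in MALE_PRONOUNS for t in tokens)
    gender = "female" if n_female > n_male else "male" if n_male > n_female else None

    return ObituaryRecord(
        age=age,
        gender=gender,
        has_child=bool(CHILD_RE.search(text)),
        breast_cancer=bool(BREAST_CANCER_RE.search(text)),
    )


sample = ("Jane Doe, 67, of Knoxville, died Tuesday after a long battle with "
          "breast cancer. She is survived by her son Mark and her daughter Ann.")
print(extract_fields(sample))
```

Rules this simple will misfire on real text (for example, a comma-delimited number that is not an age), which is one reason a cleaning step like the one on the Data Collection slide would be needed before analysis.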
- Slide 19
- Data Collection. Obituaries published online, 2000-2012
  o Collected: 59,002 w/ breast cancer; 50,927 w/out breast cancer
  o After cleaning, case group: 20,332 (15,946 w/ at least one
    biological child)
  o After cleaning, control group: 15,954 (13,548 w/ at least one
    biological child)
- Slide 20
- Case and Control Groups
  o Case group: 20,332 women, 15,946 with children (78.4%)
  o Control group: 15,954 women, 13,548 with children (84.9%)
- Slide 21
- Childbirth Incidence
- Slide 22
- Odds Ratio. Ages 30-69 years old, age-adjusted by 2010 US
Standard Population
  Parity         Case             Control
  Nulliparous     4284 (24.1%)    1545 (19.1%)
  Parous         13463 (75.9%)    6556 (80.9%)
  o Odds of childbirth incidence in the case group:
    13463 / 4284 = 3.1
  o Odds of childbirth incidence in the control group:
    6556 / 1545 = 4.2
  o Odds (of childbirth incidence) ratio = 0.74, CI: (0.69, 0.79)
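The odds ratio on this slide can be checked from the 2x2 parity table. The sketch below uses the table counts (parous 13463 and 6556, nulliparous 4284 and 1545) and a standard Wald (Woolf) interval on the log odds ratio. The slide does not say how its interval was computed, so the interval method and the function name `odds_ratio_ci` are assumptions, and the calculation ignores the age adjustment the slide mentions.

```python
import math


def odds_ratio_ci(case_exposed, case_unexposed, ctrl_exposed, ctrl_unexposed, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval on the log scale."""
    odds_case = case_exposed / case_unexposed
    odds_ctrl = ctrl_exposed / ctrl_unexposed
    odds_ratio = odds_case / odds_ctrl
    # Standard error of log(OR) for a 2x2 table (Woolf's method).
    se = math.sqrt(1 / case_exposed + 1 / case_unexposed
                   + 1 / ctrl_exposed + 1 / ctrl_unexposed)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_case, odds_ctrl, odds_ratio, lower, upper


# "Exposed" here means parous (at least one childbirth).
odds_case, odds_ctrl, odds_ratio, lower, upper = odds_ratio_ci(13463, 4284, 6556, 1545)
print(f"odds (case) = {odds_case:.2f}, odds (control) = {odds_ctrl:.2f}")
print(f"OR = {odds_ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

With these counts the unadjusted calculation gives an odds ratio of 0.74 with an interval of (0.69, 0.79), matching the figures reported on the slide.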
- Slide 23
- Reliability?
  STUDY                        ODDS RATIO   CONFIDENCE INTERVAL
  Layde et al (USA, 1989)      0.73         (0.65, 0.83)
  Lambe et al (Sweden, 1996)   0.82         (0.78, 0.86)
  Our study                    0.74         (0.69, 0.79)
- Slide 24
- Number of Children & Breast Cancer Risk
  (OR = odds ratio; Lower / Upper = lower and upper bound)
           Layde               Lambe               Our Study
  Parity   OR    Lower Upper   OR    Lower Upper   OR    Lower Upper
  0        1.00
  1        0.92  0.78  1.09    0.92  0.86  0.97    0.89  0.85  0.94
  2        0.83  0.73  0.95    0.84  0.80  0.89    0.85  0.81  0.89
  3        0.70  0.61  0.81    0.76  0.71  0.81    0.84  0.80  0.89
  4        0.60  0.51  0.70    0.67  0.61  0.74    0.62  0.58  0.66
  5        0.55  0.44  0.67    0.53  0.45  0.63    0.52  0.47  0.57
  6        0.52  0.40  0.67    0.33  0.24  0.46    0.53  0.47  0.61
  7+       0.41  0.31  0.53    0.43  0.30  0.64    0.44  0.39  0.49
  Parous   0.73  0.65  0.83    0.82  0.78  0.86    0.74  0.69  0.79
- Slide 25
- Sample Size
  STUDY                        CASES    CONTROLS
  Layde et al (USA, 1989)       4,599     4,536
  Lambe et al (Sweden, 1996)   12,782    63,888
  Our study                    20,332    15,954
- Slide 26
- Discussion. Limitation of obituaries: they cannot be used to
derive the effect of additional factors (e.g., age at first
pregnancy, breastfeeding, lifestyle choices). Other types of
personal life stories that patients post online can overcome these
limitations.
- Slide 27
- CASE STUDY 2: Geospatial Cancer Mortality Trends in the US
- Slide 28
- Web Mining for Deriving Geospatial Cancer Mortality Trends in
the US. Collecting, compiling, and reporting the related
surveillance statistics is a time-consuming process that introduces
substantial delays into monitoring. We propose to study whether
general cancer mortality trends can be adequately captured by
automated analysis of text content found in online obituaries
published in US newspapers.
- Slide 29
- Method Overview
  o We implemented an obituary crawler to collect a large number of
    obituaries from online local newspapers.
  o We implemented a rule-based natural language system to
    transform the collected obituary documents into a structured
    format.
  o We applied two correction factors to account for anticipated
    biases of the statistics derived from the collected dataset.
  o We compared statistical reports derived from the collected
    obituary dataset with the cancer mortality statistics reports
    published by SEER to show that we can generate more accurate
    cancer mortality reports from the collected dataset.
- Slide 30
- System Architecture (diagram). Stages: Web Crawling,
Pre-processing, Natural Language Processing, Database Processing,
Data Analyzing.
  o Crawlers: Sequential Crawler, Random Crawler (labels:
    breast_cancer/lung_cancer, kwd_bc, kwd_lc, rdm_sm, rdm_lg)
  o Modules: Content Extraction, Context Extraction, Context
    Enriching, Rule-based Context Inference, Exact Dictionary-based
    Chunking, Rule-processing, Context Integration, Database (ID
    Assignment, Data-cleansing), Statistical Analysis
  o Dictionaries: dic_age, dic_year, dic_gender_male,
    dic_gender_female
  o Rules: age_at_death, year_of_death, gender
  o Data: Web, HTML Documents, RDBMS Metadata Reference
    (newspaper_reference), Obituary Content, Extracted Context,
    Enriched Context, Inferred Metadata, Integrated Context, Raw
    Database, Cleansed Database, Statistics Report
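The architecture pairs an Exact Dictionary-based Chunking Module with a Rule-processing Module: dictionaries tag spans of text, and rules turn the tagged spans into metadata. The following is a rough sketch of that two-stage idea. The dictionary names `dic_year`, `kwd_bc`, and `kwd_lc` and the rule name `year_of_death` come from the diagram labels; the dictionary contents, the `cancer_type` output field, and the functions `chunk` and `apply_rules` are hypothetical and only illustrate the pattern.

```python
import re
from typing import Dict, List, Optional

# Dictionary names follow the diagram; the entries are illustrative assumptions.
DICTIONARIES = {
    "dic_year": {str(year) for year in range(2005, 2010)},
    "kwd_bc": {"breast cancer"},
    "kwd_lc": {"lung cancer"},
}


def chunk(text: str) -> Dict[str, List[str]]:
    """Exact, whole-word dictionary matching: list the hits for each dictionary."""
    lowered = text.lower()
    found = {}
    for name, terms in DICTIONARIES.items():
        alternatives = "|".join(re.escape(term) for term in sorted(terms))
        found[name] = re.findall(rf"\b(?:{alternatives})\b", lowered)
    return found


def apply_rules(chunks: Dict[str, List[str]]) -> Dict[str, Optional[object]]:
    """Turn dictionary hits into metadata with simple, hypothetical rules."""
    # Rule year_of_death: take the latest year found in the collection window.
    years = [int(year) for year in chunks["dic_year"]]
    year_of_death = max(years) if years else None

    # Cancer type: breast cancer keywords take precedence over lung cancer ones.
    if chunks["kwd_bc"]:
        cancer_type = "breast_cancer"
    elif chunks["kwd_lc"]:
        cancer_type = "lung_cancer"
    else:
        cancer_type = None

    return {"year_of_death": year_of_death, "cancer_type": cancer_type}


obituary = "Mary Roe (1941-2008) died of lung cancer at her home."
print(apply_rules(chunk(obituary)))
```

Keeping the dictionaries separate from the rules means new keywords can be added without touching the inference logic, which is one plausible reason for the modular split shown in the diagram.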
- Slide 31
- Data Collection
  Obituary Crawler
    o Based on an online obituary search engine, ObitFinder
    o Serviced by Legacy.com, one of the largest online obituary
      providers for the US newspaper industry
    o 1,100 newspapers, 2005-2009 (200+ GB)
    o Covering 46 US states (AR, ND, WV, HI, WY excluded)
  Data
    o Random selection
    o 3,572,122 online obituary articles
- Slide 32
- Data Analysis
  Anticipated Biases
    o The number of cancer-related obituaries could be biased due
      to different prevalence of obituaries for different age
      groups or states
    o The proportion of obituaries including cause of death could
      bias the number of cancer-related obituaries
  Correction Factors
    o Referencing the statistics from the CDC Deaths Final Report
      (2005-2009)
    o Incorporating the cultural openness factor of a particular
      age group or state
- Slide 33
- Correction Factors
  Adjustment Ratio 1 (Age-based Obituary Distribution across
  States)
    o The age-based obituary distribution over states may differ
      from the age-based death distribution over states
    o E.g., in the case of Tennessee,
      [#Obituary(TN)/#Obituary(US)] is 0.86%, but
      [#Death(TN)/#Death(US)] is 2.43%
    o Adjustment Ratio 1 for TN is 2.43/0.86 = 2.84
    o We can compute adjustment ratio 1 for each state
  Adjustment Ratio 2 (Obituary Content Richness)
    o The proportion of obituaries that include cause of death may
      differ by state
      http://en.wikipedia.org/wiki/List_of_causes_of_death_by_rate
    o E.g., in the case of California, 20.7% of obituaries include
      cause-of-death related terms; however, only 5.2% of Alabama
      obituaries include cause-of-death related terms
    o Adjustment Ratio 2 for CA is 5.2/20.7 = 0.13
    o We can compute adjustment ratio 2 for each state
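Adjustment ratio 1 is the state's share of US deaths divided by its share of collected obituaries. The sketch below reproduces the Tennessee example under that reading. How the ratios are then applied to raw counts is not stated on the slide, so the multiplicative `corrected_count` helper, both function names, and the example count of 100 are assumptions made purely for illustration.

```python
def adjustment_ratio_1(obituary_share_pct: float, death_share_pct: float) -> float:
    """State's share of US deaths divided by its share of collected obituaries."""
    return death_share_pct / obituary_share_pct


def corrected_count(raw_count: float, ratio_1: float, ratio_2: float = 1.0) -> float:
    """Assumed multiplicative correction; the slide does not spell out this step."""
    return raw_count * ratio_1 * ratio_2


# Tennessee: 0.86% of collected obituaries vs. 2.43% of US deaths.
tn_ratio = adjustment_ratio_1(obituary_share_pct=0.86, death_share_pct=2.43)
print(f"Adjustment ratio 1 for TN: {tn_ratio:.2f}")

# Hypothetical raw count of cancer-related obituaries for TN.
print(f"Corrected count: {corrected_count(100, tn_ratio):.1f}")
```

With the rounded percentages shown on the slide the quotient comes out near 2.83; the slide's 2.84 presumably reflects unrounded shares. A ratio above 1 means the state is under-represented among the collected obituaries, so its raw counts are scaled up.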
- Slide 34
- Case Study 1: Breast Cancer 6,935 female subjects
- Slide 35
- Slide 36
- Reliability?
- Slide 37
- Case Study 1: Lung Cancer 5,312 subjects
- Slide 38
- Slide 39
- Reliability?
- Slide 40
- Conclusions
  o Cancer mortality trends can be captured reliably in a
    time-efficient, cost-effective, and fully automated way by
    mining content that is openly available on the Internet.
  o Using breast and lung cancer as case studies, we observed that
    the trends discovered via web mining were very similar to those
    reported by NCI.
  o Proposed correction factors are useful to account for
    anticipated biases of statistics from obituary datasets.
- Slide 41
- Summary
  o Web mining is a cost-effective way to perform epidemiological
    knowledge discovery
  o Well suited as a hypothesis generator
  o Monitors trends dynamically by continuously parsing and
    analyzing new online content
- Slide 42
- Conclusion
- Slide 43
- Georgia Tourassi, PhD (tourassig@ornl.gov); Songhua Xu, PhD
(xus1@ornl.gov). Thank you