
"Big Data to Knowledge" in the Health Sciences: The Application and Value of Cancer Infodemiology
Georgia Tourassi, PhD
SMC'13: 2013 Smoky Mountains CSE Conference, Gatlinburg, TN, September 5, 2013

Environmental Cancer Risk and Migration Pattern
PIs: Georgia Tourassi / Songhua Xu

Infodemiology
The epidemiology of digital (mis)information
The Internet has made measurable what was previously immeasurable: the distribution of health information in a population, tracking (in real time) health information trends over time, and identifying gaps between information supply and demand.
G Eysenbach, Am J Med 2002

Infodemiology in Action
http://www.google.org/flutrends/video/GoogleFluTrends_USFluActivity.mov

Application Areas
o Detecting and quantifying disparities in information availability
o Monitoring public-health-relevant publications on the Internet
o Tracking effectiveness of health marketing campaigns
o Monitoring health-related behaviors
o Syndromic surveillance
o Unknown drug side effects and complications

Social Media Use among Internet Users
Chou, WS et al. 2009. Social Media Use in the US: Implications for health communication, J Med Internet Res, 11(4): e48.

Cancer Community
o One in five internet users with cancer
o A growing number of cancer patients share their personal stories online, describing symptoms, treatments, emotional and physical concerns, and many other issues arising throughout the diagnosis, treatment, and recovery phases.
o Analyzing user-generated content in online cancer communities holds promising potential for knowledge discovery.

CASE STUDY 1 Parity and Breast Cancer Risk

Case-Control Study
Knowledge Discovery
Population
o Cases with breast cancer: childbirth / no childbirth
o Controls without breast cancer: childbirth / no childbirth

Conventional Data Collection
o Institutes
o Hospitals
o Organizations

Proposed Data Collection On-Line Obituaries

Web Crawling and Text Parsing
Local newspaper websites -> Web crawler -> Parser -> Age, Gender, Childbirth, Cause of Death

Information Retrieval - Age

Information Retrieval - Gender

Information Retrieval - Childbirth

Information Retrieval - Cause of Death
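The slides name the fields pulled from each obituary but do not show the extraction rules. As a minimal sketch of how three of them (age, childbirth, cause of death) might be retrieved, the following uses a few regular expressions and keyword lookups. Every pattern, keyword, function name, and the sample text is a hypothetical illustration, not the study's actual rule set.

```python
import re

# Hypothetical patterns: the rules actually used in the study are not given.
AGE_PATTERNS = [
    re.compile(r"\bage(?:d)?\s+(\d{1,3})\b", re.IGNORECASE),
    re.compile(r"\b(\d{1,3})\s+years\s+old\b", re.IGNORECASE),
]
# Crude proxy for childbirth: any mention of sons, daughters, or children.
CHILD_PATTERN = re.compile(
    r"\b(?:son|sons|daughter|daughters|children)\b", re.IGNORECASE
)
# Hypothetical keyword-to-label mapping for cause of death.
CAUSE_KEYWORDS = {"breast cancer": "breast_cancer", "lung cancer": "lung_cancer"}


def extract_age(text):
    """Return the first age found by any pattern, or None."""
    for pattern in AGE_PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    return None


def mentions_children(text):
    """Return True if the text mentions sons, daughters, or children."""
    return CHILD_PATTERN.search(text) is not None


def extract_cause(text):
    """Return the label of the first cause-of-death keyword found, or None."""
    lowered = text.lower()
    for phrase, label in CAUSE_KEYWORDS.items():
        if phrase in lowered:
            return label
    return None


def parse_obituary(text):
    """Turn free obituary text into a small structured record."""
    return {
        "age": extract_age(text),
        "childbirth": mentions_children(text),
        "cause_of_death": extract_cause(text),
    }


sample = (
    "Jane Doe, age 62, died after a long battle with breast cancer. "
    "She is survived by her daughter and two sons."
)
record = parse_obituary(sample)
```

A keyword test like `mentions_children` cannot tell biological children from step-children or adopted children, so a real parser would need stricter rules for the "at least one biological child" criterion.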

Data Collection
Obituaries published online, 2000-2012
o 59,002 with breast cancer
o 50,927 without breast cancer
After cleaning
o Case group: 20,332 (15,946 with at least one biological child)
o Control group: 15,954 (13,548 with at least one biological child)

Case and Control Groups
o Case group: 20,332 women, 15,946 with children (78.4%)
o Control group: 15,954 women, 13,548 with children (84.9%)

Childbirth Incidence

Odds Ratio
Ages 30-69 years, age-adjusted by 2010 US Standard Population
o Odds of childbirth incidence in the case group: 13463 / 4284 = 3.1
o Odds of childbirth incidence in the control group: 6556 / 1545 = 4.2
o Odds (of childbirth incidence) ratio = 0.74, CI: (0.69, 0.79)

Parity        Case             Control
Nulliparous   4284 (24.1%)     1545 (19.1%)
Parous        13463 (75.9%)    6556 (80.9%)
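The odds ratio and its interval can be checked from the four counts in the parity table. The sketch below uses the table's parous case count (13463, the value consistent with the 75.9% shown) and assumes a 95% Wald interval on the log odds ratio. The slides do not say how the interval was derived, so that choice and the function name are assumptions.

```python
import math


def odds_ratio_with_ci(case_exposed, case_unexposed,
                       ctrl_exposed, ctrl_unexposed, z=1.96):
    """Odds ratio with a Wald confidence interval on the log scale."""
    odds_case = case_exposed / case_unexposed
    odds_ctrl = ctrl_exposed / ctrl_unexposed
    ratio = odds_case / odds_ctrl
    # Standard error of log(OR) for a 2x2 table.
    se = math.sqrt(
        1 / case_exposed + 1 / case_unexposed
        + 1 / ctrl_exposed + 1 / ctrl_unexposed
    )
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return ratio, lower, upper


# Counts from the parity table: parous and nulliparous, cases then controls.
ratio, lower, upper = odds_ratio_with_ci(13463, 4284, 6556, 1545)
print(f"OR = {ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
# prints: OR = 0.74, 95% CI = (0.69, 0.79)
```

Under these assumptions the result reproduces the reported odds ratio of 0.74 with interval (0.69, 0.79).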

Reliability?

Study                        Odds Ratio   Confidence Interval
Layde et al (USA, 1989)      0.73         (0.65, 0.83)
Lambe et al (Sweden, 1996)   0.82         (0.78, 0.86)
Our study                    0.74         (0.69, 0.79)

Number of Children & Breast Cancer Risk
Each study column lists the odds ratio with its lower and upper bounds.

          Layde                  Lambe                  Our Study
Parity    OR     Lower  Upper    OR     Lower  Upper    OR     Lower  Upper
0         1.00
1         0.92   0.78   1.09     0.92   0.86   0.97     0.89   0.85   0.94
2         0.83   0.73   0.95     0.84   0.80   0.89     0.85   0.81   0.89
3         0.70   0.61   0.81     0.76   0.71   0.81     0.84   0.80   0.89
4         0.60   0.51   0.70     0.67   0.61   0.74     0.62   0.58   0.66
5         0.55   0.44   0.67     0.53   0.45   0.63     0.52   0.47   0.57
6         0.52   0.40   0.67     0.33   0.24   0.46     0.53   0.47   0.61
7+        0.41   0.31   0.53     0.43   0.30   0.64     0.44   0.39   0.49
Parous    0.73   0.65   0.83     0.82   0.78   0.86     0.74   0.69   0.79

Sample Size

Study                        Cases     Controls
Layde et al (USA, 1989)      4,599     4,536
Lambe et al (Sweden, 1996)   12,782    63,888
Our study                    20,332    15,954

Discussion
o Limitation of obituaries: the effect of additional factors (e.g., age at first pregnancy, breastfeeding, lifestyle choices) cannot be derived
o Other types of personal life stories that patients post online could overcome these limitations

CASE STUDY 2 Geospatial Cancer Mortality Trends in the US

Web Mining for Deriving Geospatial Cancer Mortality Trends in the US
o Collecting, compiling, and reporting the related surveillance statistics is a time-consuming process that introduces substantial delays in monitoring.
o We propose to study whether general cancer mortality trends can be adequately captured by automated analysis of text content found in online obituaries published in US newspapers.

Method Overview
o We implemented an obituary crawler to collect a large number of obituaries from online local newspapers.
o We implemented a rule-based natural language system to transform the collected obituary documents into a structured format.
o We applied two correction factors to account for anticipated biases in the statistics derived from the collected dataset.
o We compared statistics derived from the collected obituary dataset with the cancer mortality statistics published by SEER to show that accurate cancer mortality reports can be generated from the collected dataset.

System Architecture
[Figure: system architecture diagram. Recoverable labels are listed below.]
Stages: Web Crawling, Pre-processing, Natural Language Processing, Database Processing, Data Analyzing
Crawlers: Sequential Crawler, Random Crawler
Crawler labels: breast_cancer/lung_cancer, kwd_bc, kwd_lc, rdm_sm, rdm_lg
Modules: Content Extraction Module, Context Extraction Module, Context Enriching Module, Rule-based Context Inference Module, Exact Dictionary-based Chunking Module, Rule-processing Module, Context Integration Module, Database Module, ID Assignment Module, Data-cleansing Module, Statistical Analysis Module
Data: Web, Html Documents, Metadata Reference, newspaper_reference, Obituary Content, Extracted Context, Enriched Context, Inferred Metadata, Integrated Context, Raw Database, Cleansed Database, RDBMS, Statistics Report
Dictionaries: dic_age, dic_year, dic_gender_male, dic_gender_female
Rules: age_at_death, year_of_death, gender
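To make the flow from HTML documents to inferred metadata more concrete, here is a minimal sketch of three steps: content extraction, exact dictionary-based chunking, and a rule that infers gender. The dictionary names `dic_gender_female` and `dic_gender_male` come from the diagram, but their contents, the majority-vote rule, and all function names are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes from an HTML document (content extraction)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Note: text inside <script>/<style> would also land here.
        self.parts.append(data)


def extract_content(html):
    """Strip tags and normalize whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Hypothetical dictionary contents; only the names appear in the diagram.
DICTIONARIES = {
    "dic_gender_female": {"she", "her", "mrs", "mother", "wife"},
    "dic_gender_male": {"he", "his", "mr", "father", "husband"},
}


def chunk(text):
    """Exact dictionary lookup: return (position, token, dictionary) hits."""
    chunks = []
    for position, raw in enumerate(text.split()):
        token = raw.strip(".,;:!?\"'()").lower()
        for name, entries in DICTIONARIES.items():
            if token in entries:
                chunks.append((position, token, name))
    return chunks


def infer_gender(chunks):
    """Assumed rule: majority vote between female and male dictionary hits."""
    female = sum(1 for _, _, name in chunks if name == "dic_gender_female")
    male = sum(1 for _, _, name in chunks if name == "dic_gender_male")
    if female > male:
        return "female"
    if male > female:
        return "male"
    return None


html = ("<html><body><p>Mrs. Jane Doe passed away Monday. "
        "She was a devoted mother.</p></body></html>")
text = extract_content(html)
gender = infer_gender(chunk(text))
```

Keeping chunking separate from the rule mirrors the diagram's split between a chunking module and a rule-processing module, so dictionaries can be extended without touching the inference logic.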

Data Collection
Obituary Crawler
o Based on an online obituary search engine, ObitFinder
o Serviced by Legacy.com, one of the largest online obituary providers for the US newspaper industry
o 1,100 newspapers, 2005-2009 (200+ GB)
o Covering 46 US states (AR, ND, WV, HI, WY excluded)
Data
o Random selection
o 3,572,122 online obituary articles

Data Analysis
Anticipated Biases
o The number of cancer-related obituaries could be biased by the different prevalence of obituaries across age groups or states
o The proportion of obituaries that include a cause of death could bias the number of cancer-related obituaries
Correction Factors
o Referencing the statistics from the CDC Deaths Final Report (2005-2009)
o Incorporating a cultural openness factor for a particular age group or state

Correction Factors
Adjustment Ratio 1 (Age-based Obituary Distribution across States)
o The age-based obituary distribution over states may differ from the age-based death distribution over states
o E.g., for Tennessee, [#Obituary(TN)/#Obituary(US)] is 0.86%, but [#Death(TN)/#Death(US)] is 2.43%
o Adjustment Ratio 1 for TN is 2.43/0.86 = 2.84
o We can compute adjustment ratio 1 for each state
Adjustment Ratio 2 (Obituary Content Richness)
o The proportion of obituaries that include a cause of death may differ between states
o http://en.wikipedia.org/wiki/List_of_causes_of_death_by_rate
o E.g., 20.7% of California obituaries include cause-of-death related terms, but only 5.2% of Alabama obituaries do
o Adjustment Ratio 2 for CA is 5.2/20.7 = 0.25
o We can compute adjustment ratio 2 for each state
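The two adjustment ratios are simple quotients, sketched below with the rounded percentages quoted for Tennessee, California, and Alabama. Because the inputs are rounded, the quotients may not match the slide's printed figures to the last digit. How the two ratios are combined into a corrected count is not stated on the slide; multiplying them in `corrected_count` is an assumption, as are all function names and the raw count of 100.

```python
def adjustment_ratio_1(obituary_share, death_share):
    """Share of US deaths in a state divided by its share of US obituaries."""
    return death_share / obituary_share


def adjustment_ratio_2(reference_rate, state_rate):
    """Reference cause-of-death mention rate divided by the state's rate."""
    return reference_rate / state_rate


def corrected_count(raw_count, ratio_1, ratio_2):
    """Assumed combination: scale the raw count by both ratios."""
    return raw_count * ratio_1 * ratio_2


# Tennessee: 0.86% of US obituaries, 2.43% of US deaths.
tn_ratio_1 = adjustment_ratio_1(obituary_share=0.86, death_share=2.43)

# California: 20.7% of obituaries mention a cause of death; the slide's
# example divides Alabama's 5.2% by that figure.
ca_ratio_2 = adjustment_ratio_2(reference_rate=5.2, state_rate=20.7)

# Hypothetical raw count, only to show how the ratios would be applied.
example = corrected_count(100, tn_ratio_1, ca_ratio_2)
```

A ratio above 1 scales a state's obituary-derived count up (the state is under-represented in the obituary data), and a ratio below 1 scales it down.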

Case Study 1: Breast Cancer 6,935 female subjects

Reliability?

Case Study 1: Lung Cancer 5,312 subjects

Reliability?

Conclusions
o Cancer mortality trends can be captured reliably in a time-efficient, cost-effective, and fully automated way by mining content that is openly available on the Internet.
o Using breast and lung cancer as case studies, we observed that the trends discovered via web mining were very similar to those reported by NCI.
o The proposed correction factors are useful to account for anticipated biases in statistics from obituary datasets.

Summary
o Web mining is a cost-effective way to carry out epidemiological knowledge discovery
o Well suited as a hypothesis generator
o Monitors trends dynamically by continuously parsing and analyzing new online content

Conclusion

Georgia Tourassi, PhD
Songhua Xu, PhD
Thank you