The Importance of Data Science in Cancer and Disease Diagnostics: Bladder Cancer

Dr Michael Cauchi AMRSC, Department of Mathematics & Statistics, University of Limerick, Ireland. CASI 2017, 15 – 17 May 2017.

Page 1

Dr Michael Cauchi AMRSC

The Importance of Data Science in

Cancer and Disease Diagnostics:

Bladder Cancer

Department of Mathematics & Statistics,

University of Limerick

Ireland

15 – 17 May 2017

CASI 2017

Page 2

About Me!

• MChem Chemistry with Studies in France
• MSc Analytical Chemistry
• LabVIEW Programmer
• Analyst & Lab Manager
• PhD in Data Science/Chemometrics
• Research Fellow
• Senior Data Scientist
• Lecturer

Page 3

Data Science

• A growing field which encompasses the areas of mathematics, statistics, operations research, information and computer science
• Topics include exploratory data analysis, pattern recognition, signal processing, experimental design, artificial intelligence, expert systems, machine learning, data processing, databases, data visualisation, predictions…
• Applications in marketing, finance, publishing, social and natural sciences
• In the natural sciences, the biological and chemical sciences employ:
  → Bioinformatics and Chemometrics
• Chemometrics:
  • “The application of mathematical and statistical techniques to chemical data” (M. Otto)
  • “The discipline that uses mathematical, statistical, and other methods employing formal logic (a) to design and select optimal measurement procedures and experiments, and (b) to provide maximum relevant chemical information by analysing chemical data” (D. L. Massart)

Page 4

Pattern Recognition

[Diagram: sample → analytical device → data → pattern recognition system → diagnosis and confidence score (82%)]

Successful Applications:

• Diagnosis of disease/cancer state from volatile fingerprints using GC-MS.

• Identification of food freshness based on HPLC data.

• Cancer staging based on data from FTIR.

• Identification of oils from fluorescence data.

• Medical diagnosis using data from novel sensor arrays.

Page 5

Pattern Recognition via Machine Learning

Is a Model Really Fit for Purpose?

• With such powerful technologies, anyone can produce a classification model that “works”.
• Before clinical application we need to thoroughly check that:
  • Exciting results are not the product of random chance.
  • Models are fit for purpose: will they work with future samples?
  • Our results are genuinely significant.

Page 6

Bladder Cancer

Background

• The most common form of bladder cancer is transitional cell carcinoma (TCC)
  • Fatal if untreated!
• Cystoscopy with biopsy is the “gold standard” for bladder cancer detection
  • Expensive
  • Inconvenient
  • Invasive
• Urine cytology is a “welcome” alternative
  • High specificity: 90 – 98%
  • Low sensitivity: 20 – 50%
• Protein biomarkers (NMP22 and BTAstat)
  • Sensitivity: 50 – 85% and 50 – 70% respectively
  • Specificity: 60 – 70%
• Volatiles in urine headspace
  • GC-MS
  • e-Nose
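The sensitivity and specificity figures quoted for each test can be computed from true and predicted labels. The sketch below is a minimal illustration, not code from the talk: the function name `diagnostic_metrics` and the toy labels are assumptions made for the example.

```python
def diagnostic_metrics(true_labels, predicted_labels, positive="TCC"):
    """Return (sensitivity, specificity, overall accuracy) as percentages.

    Assumes both classes are present in true_labels.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = 100.0 * tp / (tp + fn)  # cancers correctly flagged
    specificity = 100.0 * tn / (tn + fp)  # non-cancers correctly cleared
    overall = 100.0 * (tp + tn) / len(pairs)
    return sensitivity, specificity, overall


# Hypothetical toy labels, not data from the study.
truth = ["TCC"] * 4 + ["control"] * 6
predicted = ["TCC", "TCC", "control", "control",
             "control", "control", "control", "control", "control", "TCC"]
sens, spec, acc = diagnostic_metrics(truth, predicted)
```

With these toy labels two of the four cancers are missed, so sensitivity is 50% even though overall accuracy is 70%.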

Page 7

Electronic Noses (e-Noses)

• Small, portable device for monitoring volatile odour patterns
• Generates a pattern from the response of several component sensors to a (usually complex) odour
• Not an analysis method: gives no real information about VOCs in a system
• Useful for comparing different systems or samples and categorising (typically in QC applications)

Page 8

Electronic Noses (e-Noses)

• Alpha M.O.S. “Fox” (Toulouse, France): 6-24 MOS & QCM sensors
• Scensive Technologies (Leeds, UK): 14 conducting polymer sensors

Page 9

Gas Chromatography Mass Spectrometry (GC-MS)

Mass Spectrometry (MS)

• Structural information can be generated
  • Particularly using tandem mass spectrometers
  • Fragment sample & analyse products
• Useful for peptide & oligonucleotide sequencing
• Plus identification of individual compounds in complex mixtures

Gas Chromatography (GC)

• GC is used to separate and analyse compounds that can be vaporised but which do not decompose.
• Typical uses of GC include testing the purity of a particular substance, or separating the different components of a mixture
  → e.g. a range of metabolites in a urine sample
• In GC, the mobile phase is an inert gas like helium or an unreactive gas like nitrogen.
• The stationary phase is a very thin layer of liquid or polymer coated on an inert solid support like metal or glass (akin to a fractionating column).

Page 10

Gas Chromatography Mass Spectrometry (GC-MS)

[Schematic of a gas chromatograph (GC). Labelled parts: gas cylinder, regulator, traps, carrier gas (mobile phase), autosampler, inlet, column (stationary phase), GC oven, detectors inc. MS, computer (data), chromatogram.]

Page 11

Gas Chromatography Mass Spectrometry (GC-MS)

[Figure: GC-MS data (TOF) shown as intensity against atomic mass number and scan number. Panels: Mass Spectrum; Total Ion Chromatogram (TIC), abundance against scan number, where each TIC value is ∑ all values at that scan; Selected Ion Chromatogram (SIC) at 73 a.m.u., abundance against scan number. Scan number axes run from 0 to 2000.]
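The relationship between the data matrix, the TIC and an SIC can be sketched in a few lines. This is a minimal illustration that assumes the data are held as a list of scans, each a list of intensities over a shared mass axis; the function names and the toy numbers are hypothetical.

```python
def total_ion_chromatogram(scans):
    """Sum the intensities over all mass channels for each scan."""
    return [sum(scan) for scan in scans]


def selected_ion_chromatogram(scans, mass_axis, mass):
    """Pick the intensity of a single mass channel from each scan."""
    column = mass_axis.index(mass)
    return [scan[column] for scan in scans]


# Hypothetical toy data: 4 scans (rows) by 3 mass channels (columns).
mass_axis = [43, 73, 99]
scans = [
    [10, 0, 5],
    [20, 30, 5],
    [15, 80, 5],
    [5, 10, 5],
]

tic = total_ion_chromatogram(scans)                        # one value per scan
sic_73 = selected_ion_chromatogram(scans, mass_axis, 73)   # the 73 a.m.u. trace
spectrum_at_scan_2 = dict(zip(mass_axis, scans[2]))        # mass spectrum of one scan
```

Reading the same matrix along a single row gives the mass spectrum at that scan, which is what `spectrum_at_scan_2` holds.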

Page 12

Preliminary Classification Results via PLS-DA

Electronic nose (e-Nose)

E-Nose           S1    S2    S3    ALL
Overall (%)      70.0  67.0  62.2  64.6
Specificity (%)  70.0  60.0  52.6  66.9
Sensitivity (%)  70.0  71.7  68.3  60.0

Gas Chromatography Mass Spectrometry (GC-MS)

GC-MS            S1    S2    S3    ALL
Overall (%)      74.8  78.6  77.9  73.1
Specificity (%)  76.1  61.9  67.0  79.0
Sensitivity (%)  73.9  89.8  85.2  61.4

S1 = C1 vs TCC;
S2 = C2 vs TCC;
S3 = C3 vs TCC;
ALL = C1 & C2 & C3 vs TCC

Page 13

→ Working with a much larger GC-MS dataset…

• 3 control groups and 3 transitional cell carcinoma (TCC) groups
  • C1: Healthy 18-32 year olds (70)
  • C2: Healthy 18-32 year olds but with occasional menstrual bleeding or transient proteinuria (protein in urine) (71)
  • C3: Individuals (age 24 to 89) with urological diseases (64)
    • Men over 50 tested negative for prostate cancer
  • TCC1: Early stage (16)
  • TCC2: Intermediate (27)
  • TCC3: Late stage (27)

• All samples analysed via “Pipeline” performing the following:
  • Standardisation against internal standard (phenol at m/z 99) when data loaded
  • Exploratory analysis using PCA to remove any outlying samples
  • Alignment of chromatographic peaks via COW
  • Classification via a thorough validation process involving bootstrapping
    → Cross-model validation (“leave-5-out”)
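The first pipeline step, standardisation against the internal standard, can be sketched as follows. The slides do not say how the scaling is done, so this example assumes that every intensity is divided by the largest response in the internal-standard channel (m/z 99); the function name and the toy numbers are hypothetical.

```python
def standardise_to_internal_standard(scans, mass_axis, standard_mass=99):
    """Scale a sample so the internal-standard peak has height 1.

    Assumption: the reference value is the maximum intensity seen in the
    internal-standard mass channel across all scans of the sample.
    """
    column = mass_axis.index(standard_mass)
    reference = max(scan[column] for scan in scans)
    if reference <= 0:
        raise ValueError("internal standard signal not found")
    return [[value / reference for value in scan] for scan in scans]


# Hypothetical toy sample: 3 scans by 3 mass channels.
mass_axis = [43, 73, 99]
raw = [
    [10, 0, 2],
    [20, 30, 4],
    [15, 80, 2],
]
standardised = standardise_to_internal_standard(raw, mass_axis)
```

After this step, samples loaded with different overall signal levels are expressed relative to the same internal-standard response, which makes them comparable before PCA and alignment.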

Page 14

Workflow

[Flowchart:]

• Sample collection and storage
• GC-MS data acquisition
• Import GC-MS chromatograms
• Pre-processing (normalize, align, scale)
• Block A (IN → OUT), repeat 150 times:
  • Randomly split data into training (70%) & testing (30%) sets
  • Optimise PLS-DA model using training data
  • Test optimised PLS-DA model using testing data
  • Calculate combined performance metrics
  → Performance metrics
• Permutation Testing, repeat 300 times:
  • Random assignment of classes
  • Block A
  → Measure of statistical significance
• Interrogation of Optimized Model
  → Putative biomarkers
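The repeated-split block and the permutation test in this workflow can be sketched with the standard library alone. This is an illustration under loud assumptions: a nearest-centroid classifier stands in for PLS-DA, the two-feature data are synthetic, each permutation is summarised by its mean score, and all function names are hypothetical.

```python
import random


def fit_centroids(samples, labels):
    """Stand-in classifier (not PLS-DA): mean feature vector of each class."""
    centroids = {}
    for label in sorted(set(labels)):
        members = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict(centroids, sample):
    """Return the class whose centroid is nearest (squared Euclidean distance)."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(sample, centroids[label]))
    return min(centroids, key=distance)


def repeated_split_scores(samples, labels, repeats=150, train_fraction=0.7, seed=0):
    """Block A: repeat a random train/test split, record % correctly classified."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    n_train = round(train_fraction * len(indices))
    scores = []
    for _ in range(repeats):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        model = fit_centroids([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test)
        scores.append(100.0 * correct / len(test))
    return scores


def permutation_null_means(samples, labels, permutations=300, repeats=150, seed=0):
    """Shuffle the class labels, rerun block A, keep the mean score each time."""
    rng = random.Random(seed)
    null_means = []
    for p in range(permutations):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        scores = repeated_split_scores(samples, shuffled, repeats, seed=seed + p + 1)
        null_means.append(sum(scores) / len(scores))
    return null_means


# Synthetic two-feature data; the study used aligned GC-MS chromatograms.
data_rng = random.Random(1)
samples = ([[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(40)]
           + [[data_rng.gauss(3.0, 1.0), data_rng.gauss(3.0, 1.0)] for _ in range(40)])
labels = ["control"] * 40 + ["TCC"] * 40

analysis_scores = repeated_split_scores(samples, labels)  # 150 splits of 70%/30%
# Reduced from 300 x 150 purely to keep the example quick to run.
null_means = permutation_null_means(samples, labels, permutations=20, repeats=10)
analysis_mean = sum(analysis_scores) / len(analysis_scores)
null_mean = sum(null_means) / len(null_means)
```

With real class structure the analysis mean sits well above the null mean, which hovers near chance for two balanced classes; a model whose score is indistinguishable from the null has learned nothing useful.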

Page 15

COW (Correlation Optimised Warping)

• Most commonly employed (i.e. popular) choice of algorithm
• Alignment works by simultaneously compressing and stretching regions of the sample chromatogram so that they match specified regions of the reference chromatogram
• Slack: how much peaks may be shifted by
• Segments can either be:
  • Set at every interval (e.g. every 30 index values)
  • Specific ranges defined
• Very good but can be time consuming!
• The segment length and slack values can be determined automatically

[Figure: reference and sample chromatograms divided into segments]
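The stretching and compressing step, and the role of the slack, can be illustrated for a single segment. This is a deliberately simplified sketch and not a full COW implementation: COW chooses all segment boundaries jointly by dynamic programming, whereas this example treats one segment in isolation. All function names and the toy peak are hypothetical.

```python
import math


def warp_segment(values, new_length):
    """Stretch or compress a segment to new_length points by linear interpolation."""
    if new_length < 2 or len(values) < 2:
        raise ValueError("segments need at least two points")
    scale = (len(values) - 1) / (new_length - 1)
    warped = []
    for i in range(new_length):
        position = i * scale
        left = min(int(position), len(values) - 2)
        fraction = position - left
        warped.append(values[left] * (1 - fraction) + values[left + 1] * fraction)
    return warped


def correlation(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def best_warp_for_segment(sample, start, segment_length, slack, reference_segment):
    """Try every end point allowed by the slack; keep the best-correlated warp."""
    best = None
    for delta in range(-slack, slack + 1):
        end = start + segment_length + delta
        if end - start < 2 or end > len(sample):
            continue
        warped = warp_segment(sample[start:end], len(reference_segment))
        score = correlation(warped, reference_segment)
        if best is None or score > best[0]:
            best = (score, delta, warped)
    return best


# Toy example: the sample peak is a stretched copy of the reference peak.
reference_segment = [0, 1, 4, 1, 0]
sample = warp_segment(reference_segment, 9) + [0.0, 0.0, 0.0]
score, delta, aligned = best_warp_for_segment(sample, 0, 5, 4, reference_segment)
```

Here the best match is found when the segment end moves by the full slack of 4 points, which compresses the stretched peak back onto the reference. The number of candidate warps grows with both the slack and the number of segments, which is one reason the full algorithm can be time consuming.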

Page 16

Bootstrapping

Model Optimisation

• To avoid a one-off chance result, the process is repeated with different random data splits until stable average performance metrics are obtained.

[Figure: twelve example train/test splits with the percentage correctly classified (% CC) for each: 82, 53, 65, 66, 74, 65, 79, 51, 91, 27, 57 and 68% CC.]

[Plot: mean % correctly classified (y-axis, 55 to 75) against number of bootstrap models (x-axis, 0 to 150).]
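The idea of repeating until the average is stable can be seen by tracking the running mean of the per-split scores. The sketch below uses the twelve example % CC values shown on this slide, in the order they appear in this transcript; the function name is hypothetical.

```python
def running_mean(values):
    """Mean of the first 1, 2, 3, ... values."""
    means, total = [], 0.0
    for count, value in enumerate(values, start=1):
        total += value
        means.append(total / count)
    return means


# Example % CC values for the twelve train/test splits on the slide.
split_scores = [82, 53, 65, 66, 74, 65, 79, 51, 91, 27, 57, 68]
means = running_mean(split_scores)
spread = max(split_scores) - min(split_scores)
```

Individual splits range from 27% to 91%, while the running mean settles near 65% by the twelfth split. Quoting any single split would be misleading, which is why the plot follows the mean over many bootstrap models.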

Page 17

C3 v TCC

Model          Overall  Spec   Sens   LV or Tree  AUROC   z-value (two-tailed)
PLS-DA         76.85    74.41  78.95  15          0.8399  -79.41
SVM-LIN        78.78    74.51  82.46  --          0.8332  -83.48
Random Forest  73.57    66.15  80.52  350         0.7740  -74.25

[Histograms for each model: % CC (x-axis) against frequency (y-axis). BLUE bars = NULL; RED bars = ANALYSIS.]

ZTEST for large samples: the critical z-value (two-tailed) is 1.96. In all three cases the null hypothesis is REJECTED: there IS a significant difference.
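A z-test comparing the NULL and ANALYSIS distributions of % CC can be sketched as below. The slides do not give the formula used, so this assumes the standard two-sample z statistic for large samples, with the null distribution first so that a better-than-chance model gives a negative z, as in the values reported here. The function name and the toy scores are hypothetical.

```python
import math
from statistics import mean, variance


def two_sample_z(null_scores, analysis_scores):
    """z = (mean_null - mean_analysis) / sqrt(var_null/n_null + var_analysis/n_analysis)."""
    standard_error = math.sqrt(
        variance(null_scores) / len(null_scores)
        + variance(analysis_scores) / len(analysis_scores)
    )
    return (mean(null_scores) - mean(analysis_scores)) / standard_error


# Hypothetical % CC values; a real run would have hundreds per distribution,
# which is what justifies the large-sample z-test.
null_scores = [48, 50, 52, 49, 51, 50]
analysis_scores = [75, 77, 79, 76, 78, 77]

z = two_sample_z(null_scores, analysis_scores)
reject_null = abs(z) > 1.96  # two-tailed test at the 5% level
```

Because the standard error shrinks as the number of models grows, very large |z| values arise naturally when two well-separated distributions are each built from hundreds of models.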

Page 18

→ Working with a much larger GC-MS dataset…

Model Comparison

Dataset   Model          %Overall  %Spec  %Sens  LV or Tree  AUROC
C1vTCC    PLS-DA         87.53     87.23  87.82  16          0.9055
C1vTCC    SVM Lin        88.99     88.84  89.13  --          0.9350
C1vTCC    Random Forest  80.91     80.28  81.75  450         0.8923
C2vTCC    PLS-DA         88.35     88.21  88.48  12          0.9276
C2vTCC    SVM Lin        89.18     88.00  90.33  --          0.9220
C2vTCC    Random Forest  82.70     82.93  82.72  450         0.8654
C3vTCC    PLS-DA         76.85     74.41  78.95  15          0.8399
C3vTCC    SVM Lin        78.78     74.51  82.46  --          0.8332
C3vTCC    Random Forest  73.57     66.15  80.52  350         0.7740
C3vTCC1   PLS-DA         75.78     80.91  57.50  20          0.6193
C3vTCC1   SVM Lin        80.47     97.94  18.25  --          0.6079
C3vTCC1   Random Forest  79.25     97.73  12.25  350         0.5476
C3vTCC2   PLS-DA         78.37     79.81  75.21  16          0.8134
C3vTCC2   SVM Lin        78.60     93.16  46.69  --          0.7256
C3vTCC2   Random Forest  75.29     95.32  30.55  350         0.7079
C3vTCC3   PLS-DA         80.75     84.94  71.20  7           0.8592
C3vTCC3   SVM Lin        82.67     94.28  56.19  --          0.8122
C3vTCC3   Random Forest  75.54     95.52  29.68  350         0.6009

• C3 v TCC1 is most difficult (see sensitivities)!
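One way to see why the sensitivities matter for C3 v TCC1 is to average specificity and sensitivity so that both classes count equally. This balanced accuracy is not reported in the slides; it is added here only as an illustration, using the specificity and sensitivity values from the C3vTCC1 rows of the table.

```python
# (specificity %, sensitivity %) for C3 v TCC1, copied from the table.
c3_v_tcc1 = {
    "PLS-DA": (80.91, 57.50),
    "SVM Lin": (97.94, 18.25),
    "Random Forest": (97.73, 12.25),
}


def balanced_accuracy(specificity, sensitivity):
    """Mean of specificity and sensitivity, so both classes count equally."""
    return (specificity + sensitivity) / 2.0


balanced = {model: balanced_accuracy(*values) for model, values in c3_v_tcc1.items()}
best_model = max(balanced, key=balanced.get)
```

SVM Lin has the highest overall accuracy on this dataset (80.47%), yet its balanced accuracy is about 58% against about 69% for PLS-DA. Its overall score comes largely from classifying the larger C3 group correctly while missing most early-stage cancers.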

PLS-DA: Permutations

[Histograms of % CC (x-axis) against frequency (y-axis) for C3 v TCC and C3 v TCC1. BLUE bars = NULL; RED bars = ANALYSIS.]

Page 19

PLS-DA Loadings

• Indication of significant/key variables
  → Biomarker discovery!

[Plot: PLS-DA loadings, with an intensity axis.]
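Picking out key variables from the loadings amounts to ranking them by absolute size. The sketch below is illustrative only: the loading values are invented, the variables are labelled by retention time in minutes (6.43 and 7.078 appear later in the talk, the others are made up), and the function name is hypothetical.

```python
def top_loading_variables(loadings, variable_labels, top_n=3):
    """Return the top_n (label, loading) pairs ranked by absolute loading."""
    ranked = sorted(
        zip(variable_labels, loadings),
        key=lambda item: abs(item[1]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical loadings for one latent variable.
retention_times = [5.10, 6.43, 7.078, 8.20, 9.75]
loadings = [0.02, -0.61, 0.48, 0.05, -0.11]

key_variables = top_loading_variables(loadings, retention_times, top_n=2)
```

The sign of a loading indicates the direction of the contribution, so ranking uses the absolute value to capture strongly negative variables as well as strongly positive ones.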

Page 20

BIOMARKER DISCOVERY

… Selected mass spectra sequentially uploaded to offline and online databases…

Page 21

BIOMARKER DISCOVERY

… searching the literature…

Page 22

BIOMARKER DISCOVERY

C2 v TCC: PLS-DA

Overall  Spec   Sens   LV  AUROC
88.35    88.21  88.48  12  0.9276

Page 23

BIOMARKER DISCOVERY

C2 v TCC: PLS-DA (metrics as on page 22)

[Figure annotated at 6.43 min]

Page 24

BIOMARKER DISCOVERY

C2 v TCC: PLS-DA (metrics as on page 22)

[Figure annotated at 7.078 min]

Page 25

BIOMARKER DISCOVERY

Compound                    Database
2-pentanone                 NIST & MassBank
2,3-butanedione             MassBank
4-heptanone                 MassBank
Dimethyl disulphide         NIST
Hexanal                     NIST
Benzaldehyde                MassBank
Butyrophenone               MassBank
3-hydroxyanthranilic acid   MassBank
Benzoic acid                MassBank
Trans-3-hexenoic acid       MassBank
Cis-3-hexenoic acid         MassBank
2-butanone                  NIST
2-propanol                  NIST
Acetic acid                 NIST
Piperitone                  MassBank
Thujone                     MassBank

Median value of abundance from C3 to TCC (Increase or decrease)
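The increase-or-decrease comparison can be sketched by comparing group medians for each compound. The abundances and compound names below are placeholders rather than values from the study, and the function name is hypothetical.

```python
from statistics import median


def median_change(control_values, cancer_values):
    """Compare group medians and report the direction of change."""
    control_median = median(control_values)
    cancer_median = median(cancer_values)
    if cancer_median > control_median:
        direction = "increase"
    elif cancer_median < control_median:
        direction = "decrease"
    else:
        direction = "no change"
    return control_median, cancer_median, direction


# Placeholder abundances in arbitrary units; not data from the study.
abundances = {
    "compound_A": {"C3": [1.0, 1.2, 0.9, 1.1], "TCC": [1.6, 1.9, 1.4, 2.0]},
    "compound_B": {"C3": [2.0, 2.4, 2.2], "TCC": [1.1, 1.5, 1.3]},
}

directions = {
    name: median_change(groups["C3"], groups["TCC"])[2]
    for name, groups in abundances.items()
}
```

The median is less sensitive than the mean to the occasional very large peak, which makes it a reasonable summary for abundance data of this kind.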

Page 26

BIOMARKER DISCOVERY

Compound
2-pentanone
2,3-butanedione
4-heptanone
Dimethyl disulphide
Hexanal
Benzaldehyde
Butyrophenone
3-hydroxyanthranilic acid
Benzoic acid
Trans-3-hexenoic acid
Cis-3-hexenoic acid
2-butanone
2-propanol
Acetic acid
Piperitone
Thujone

• Ketone series found in urine.
• Also present in food (fruits, dairy, fish…)
• Elevated levels found in some gut infections (C. jejuni; C. difficile) and some gut conditions e.g. coeliac disease, ulcerative colitis and non-alcoholic fatty liver disease.
• Metabolised from acetaldehyde
• Ubiquitous in urine
• Known to be elevated in a number of diseases including kidney disease, diabetes and some infections
• Normal in breath and biofluids
• Elevated in some cancers and gut conditions

Page 27

BIOMARKER DISCOVERY

Compound (same list as on page 26)

• Ubiquitous in urine
• Elevated in gut disease and cancers
• Sometimes found in urine, probably from food additives
• Drug component so unlikely to be naturally occurring in body
• Oxidation product of tryptophan metabolism
• Food additive and drug component
• It is noted that acids will be present in higher concentration if the pH of urine drops, so this may be a result of lower pH due to tumour growth

Page 28

BIOMARKER DISCOVERY

Compound (same list as on page 26)

• Ubiquitous in urine
• Ubiquitous in urine
• Elevated in many conditions including gut disease, diabetes and others
• May be higher in cancer due to likely lower pH
• Flavouring ingredients

Page 29

Conclusions

• PLS-DA-derived models gave a mean accuracy for patients presenting with other non-cancerous urological disease of 88.4%, with 88.5% sensitivity and 88.2% specificity for C2 versus TCC (TCC1, TCC2 and TCC3 combined).
• SVM-derived models gave a mean accuracy of 89.2%, with a sensitivity of 90.3% and specificity of 88.0%.
• Although the specificities achieved were marginally lower than that of conventional urine cytology (typically >90% specificity), sensitivity was very close to the typical range of 80-90% for high-grade tumours and better than the typical range of 20-50% for low-grade tumours. Case in point: the sensitivity attained for C3 v TCC1 was 57.5%, which is marginally better than the “gold-standard” of 20-50%.
• It is currently uncertain why exogenous compounds are present in different concentrations. For TCC it may be down to a changed host response due to the presence of a tumour.
• Data science techniques can very likely be employed to assist in the diagnosis of cancers and diseases, including identifying probable markers.

Page 30

Where Next??

• Ideally to be able to determine which grade of TCC
• Have seen a successful approach for bladder cancer
  → Suggests that it could be a replacement for the current invasive procedures
  → Needs careful validation…
  • Ideally validation across different populations/regions
• Potential for further collaborations?

References

• M. Cauchi, C. M. Weber, B. J. Bolt, P. B. Spratt, C. Bessant, D. C. Turner, C. M. Willis, L. E. Britton, C. Turner, G. Morgan, “Evaluation of gas chromatography mass spectrometry and pattern recognition for the identification of bladder cancer from urine headspace”, Analytical Methods, 2016, 8, 4037-4046.
• C. M. Weber, M. Cauchi, M. Patel, C. Bessant, C. Turner, L. E. Britton, C. M. Willis, “Evaluation of a gas sensor array and pattern recognition for the identification of bladder cancer from urine headspace”, The Analyst, 2011, 136 (2), 359-364.

Page 31

Papers

Page 32

Pattern Recognition Pipeline

[Diagram of the CLASSIFICATION_V1 pipeline. Labels as they appear on the slide:]

• IMPORTDATA, EXPLORATORY (GUI), ALIGNMENT, VSELECT (GUI), CLASSIFY, EXPORT
• NCDFLOAD, CVSLOAD, MATLOAD, TABLOAD, XLSLOAD, SPCLOAD, DBLOAD
• BUILDCLASS, ExtractSIC, CVCLASSIFY, BSCLASSIFY
• DoSCALING, DoMODELLING, DoSTATS, xPVcalc, pretreat2plsdaMC, plsMC
• SVM, ANN, O-PLSDA, KNN, LDA
• Lin, RBFSig, Poly
• SPLIT, BuildBOOT, PROGRESS, Report Generator, NullCLASSIFY
• GetVectorFromStructure, ReplaceWithZeros2, ToROCclass, BCCLASSIFYrocit, histfit3, ReplaceWithBaseline
• Note on slide: “Obsolete since using rocit”

Page 33

Pattern Recognition Pipeline

[Screenshots: Exploratory Data Analysis; Feature Selection; Report Generation]

Page 34

Collaborations (Former and Ongoing)

Page 35

Thank you!

University of Limerick