Dr Michael Cauchi AMRSC
The Importance of Data Science in
Cancer and Disease Diagnostics:
Bladder Cancer
Department of Mathematics & Statistics,
University of Limerick
Ireland
15 – 17 May 2017
CASI 2017
About Me!
• MChem Chemistry with Studies in France
• MSc Analytical Chemistry
• LabVIEW Programmer
• Analyst & Lab Manager
• PhD in Data Science/Chemometrics
• Research Fellow
• Senior Data Scientist
• Lecturer
Data Science
• A growing field which encompasses the areas of mathematics, statistics,
operations research, information and computer science
• Topics include exploratory data analysis, pattern recognition, signal processing,
experimental design, artificial intelligence, expert systems, machine learning,
data processing, databases, data visualisation, predictions…
• Applications in marketing, finance, publishing, social and natural sciences
• In the natural sciences, data science is applied to the biological and chemical sciences:
→ Bioinformatics and Chemometrics
• Chemometrics:
• “The application of mathematical and statistical techniques to chemical
data”
– M. Otto
• “The discipline that uses mathematical, statistical, and other methods
employing formal logic (a) to design and select optimal measurement
procedures and experiments, and (b) to provide maximum relevant chemical
information by analysing chemical data”
– D. L. Massart
Pattern Recognition
[Figure: sample → analytical device → data → pattern recognition system → diagnosis and confidence score (e.g. 82%)]
Successful Applications:
• Diagnosis of disease/cancer state from volatile fingerprints using GC-MS.
• Identification of food freshness based on HPLC data.
• Cancer staging based on data from FTIR.
• Identification of oils from fluorescence data.
• Medical diagnosis using data from novel sensor arrays.
Is a Model Really Fit for Purpose?
• With such powerful technologies, anyone can produce a
classification model that “works”.
• Before clinical application we need to thoroughly check that:
• Exciting results are not the product of random chance.
• Models are fit for purpose – will they work with future samples?
• Are our results genuinely significant?
Pattern Recognition via Machine Learning
Bladder Cancer
Background
• Most common form of bladder cancer is transitional cell carcinoma (TCC)
• Fatal if untreated!
• Cystoscopy with biopsy is the “gold standard” for bladder cancer detection
• Expensive
• Inconvenient
• Invasive
• Urine cytology is a “welcome” alternative
• High specificity: 90 - 98%
• Low sensitivity: 20 - 50%
• Protein biomarkers (NMP22 and BTAstat)
• Sensitivity: 50 – 85% and 50 – 70% respectively
• Specificity: 60 – 70%
• Volatiles in urine headspace
• GC-MS
• eNose
Electronic Noses (e-Noses)
• Small, portable device for monitoring volatile odour patterns
• Generates a pattern from the response of several component sensors to a (usually
complex) odour
• Not an analysis method – gives no real information about VOCs in a system
• Useful for comparing different systems or samples and categorising (typically in QC
applications)
[Figure: two commercial e-Nose instruments]
• Alpha M.O.S. “Fox” (Toulouse, France): 6-24 MOS & QCM sensors
• Scensive Technologies (Leeds, UK): 14 conducting polymer sensors
Gas Chromatography Mass Spectrometry (GC-MS)
Mass Spectrometry (MS)
• Structural information can be generated
• Particularly using tandem mass spectrometers
• Fragment sample & analyse products
• Useful for peptide & oligonucleotide sequencing
• Plus identification of individual compounds in complex mixtures
Gas Chromatography (GC)
• GC is used to separate and analyse compounds that can be vaporised but which do
not decompose.
• Typical uses of GC include testing the purity of a particular substance, or separating
the different components of a mixture
→ e.g. a range of metabolites in a urine sample
• In GC, the mobile phase is an inert gas such as helium or an unreactive gas such as nitrogen.
• The stationary phase is a very thin layer of liquid or polymer coated on an inert solid
support like metal or glass (akin to a fractionating column).
[Figure: schematic of a gas chromatograph. Gas cylinder and regulator → traps → carrier gas (mobile phase) → inlet with autosampler → column (stationary phase) inside the GC oven → detectors inc. MS → computer (data, chromatogram)]
[Figure: GC-MS data. Each scan gives a mass spectrum (intensity against atomic mass number; TOF). Summing all values in a scan gives one point of the Total Ion Chromatogram (TIC: abundance against scan number, 0 to 2000). The abundance at a single mass, e.g. 73 a.m.u., against scan number gives a Selected Ion Chromatogram (SIC).]
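The relationship between the mass spectrum, the TIC and the SIC can be sketched in a few lines of Python. This is a minimal illustration on made-up numbers, assuming the run is stored as one intensity list per scan with a shared list of mass channels; the names `scans`, `masses`, `tic` and `sic` are hypothetical and not part of the pipeline described in these slides.

```python
# Toy GC-MS run: each row is one scan (one mass spectrum),
# each column is the intensity recorded at one mass channel.
masses = [45, 73, 99]            # mass channels (a.m.u.), illustrative only
scans = [
    [10, 200, 5],                # scan 0
    [12, 950, 7],                # scan 1
    [11, 180, 6],                # scan 2
]

def tic(scans):
    """Total ion chromatogram: sum of all intensities in each scan."""
    return [sum(spectrum) for spectrum in scans]

def sic(scans, masses, target_mass):
    """Selected ion chromatogram: intensity of one mass channel per scan."""
    column = masses.index(target_mass)
    return [spectrum[column] for spectrum in scans]

tic_values = tic(scans)
sic_73 = sic(scans, masses, 73)
print(tic_values)
print(sic_73)
```

Plotting `tic_values` or `sic_73` against the scan index would give the two chromatogram views described in the figure.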
Preliminary Classification Results via PLS-DA

Electronic nose (e-Nose)
                 S1    S2    S3    ALL
Overall (%)      70.0  67.0  62.2  64.6
Specificity (%)  70.0  60.0  52.6  66.9
Sensitivity (%)  70.0  71.7  68.3  60.0

Gas Chromatography Mass Spectrometry (GC-MS)
                 S1    S2    S3    ALL
Overall (%)      74.8  78.6  77.9  73.1
Specificity (%)  76.1  61.9  67.0  79.0
Sensitivity (%)  73.9  89.8  85.2  61.4

S1 = C1 vs TCC; S2 = C2 vs TCC; S3 = C3 vs TCC; ALL = C1 & C2 & C3 vs TCC
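For readers less familiar with the three figures reported in these tables, the sketch below shows how overall accuracy, sensitivity and specificity are conventionally derived from a confusion matrix. The counts are hypothetical and are not taken from the study; the function name is an assumption for illustration.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return overall accuracy, sensitivity and specificity as percentages.

    tp/fn: cancer samples classified correctly/incorrectly
    tn/fp: control samples classified correctly/incorrectly
    """
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    overall = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    return overall, sensitivity, specificity

# Hypothetical test set: 20 TCC samples and 20 controls.
overall, sens, spec = classification_metrics(tp=14, fn=6, tn=12, fp=8)
print(f"Overall {overall:.1f}%  Sensitivity {sens:.1f}%  Specificity {spec:.1f}%")
```

Sensitivity only looks at the cancer samples and specificity only at the controls, which is why a model can score well overall while missing most cancers when the groups are unbalanced.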
• 3 control groups and 3 transitional cell carcinoma (TCC) groups
• C1: Healthy 18-32 year olds (70)
• C2: Healthy 18-32 year olds but with occasional menstrual bleeding or transient proteinuria (protein in urine) (71)
• C3: Individuals (age 24 to 89) with urological diseases (64)
• Men over 50 tested negative for prostate cancer
• TCC1: Early stage (16)
• TCC2: Intermediate (27)
• TCC3: Late stage (27)
→ Working with a much larger GC-MS dataset…
• All samples analysed via “Pipeline” performing the following:
• Standardisation against internal standard (phenol at m/z 99) when data loaded
• Exploratory analysis using PCA to remove any outlying samples
• Alignment of chromatographic peaks via COW
• Classification via a thorough validation process involving bootstrapping
→ Cross-model validation (“leave-5-out”)
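The slides do not say exactly how the standardisation step is computed. One common approach, shown here purely as an assumption, is to divide every intensity in a chromatogram by the intensity of the internal standard peak so that samples become comparable. The function name and the toy values are hypothetical.

```python
def standardise(chromatogram, internal_standard_index):
    """Divide every intensity by the intensity of the internal standard peak.

    Assumed approach for illustration; the pipeline's actual method is not stated.
    """
    reference = chromatogram[internal_standard_index]
    if reference == 0:
        raise ValueError("internal standard peak has zero intensity")
    return [value / reference for value in chromatogram]

# Toy chromatogram in which index 2 holds the internal standard peak.
raw = [50.0, 200.0, 100.0]
scaled = standardise(raw, internal_standard_index=2)
print(scaled)
```

After this step the internal standard has intensity 1.0 in every sample, and the other peaks are expressed relative to it.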
Workflow
[Figure: workflow diagram]
• Sample collection and storage
• GC-MS data acquisition
• Import GC-MS chromatograms
• Pre-processing (normalize, align, scale)
• Block A (IN → OUT), repeat 150 times:
• Randomly split data into training (70%) & testing (30%) sets
• Optimise PLS-DA model using training data
• Test optimised PLS-DA model using testing data
• Calculate combined performance metrics
• Performance metrics
• Permutation testing, repeat 300 times: random assignment of classes, then block A → measure of statistical significance
• Interrogation of optimized model → putative biomarkers
COW
• Correlation optimised warping (COW) is the most commonly employed (i.e. popular) choice of alignment algorithm
• Alignment works by simultaneously compressing and stretching regions (segments) of the sample chromatogram to match specified regions of the reference chromatogram
• Slack: how much peaks should be shifted by
• Segments can either be:
• Set at every interval (e.g. every 30 index values)
• Specific ranges defined
• The segment length and slack values can be determined automatically
• Very good but can be time consuming!
[Figure: reference and sample chromatograms divided into segments]
Bootstrapping
• To avoid a one-off chance result, the process is repeated with different random data splits until stable average performance metrics are obtained.
[Figure: twelve random train/test splits giving 82%, 53%, 65%, 66%, 74%, 65%, 79%, 51%, 91%, 27%, 57% and 68% correctly classified (CC); plot of mean % correctly classified (55 to 75) against number of bootstrap models (0 to 150)]
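The repeated random splitting can be sketched as follows. This is a minimal sketch on synthetic one-dimensional data, and it swaps in a simple nearest-class-mean rule as a stand-in classifier; it is not PLS-DA and not the author's code. The function names, the seed and the data are all assumptions.

```python
import random
import statistics

def nearest_centroid_accuracy(train, test):
    """Stand-in classifier (NOT PLS-DA): assign each test sample to the class
    whose training mean is closest, and return % correctly classified."""
    centroids = {}
    for label in {lab for _, lab in train}:
        centroids[label] = statistics.mean(x for x, lab in train if lab == label)
    correct = sum(
        1 for x, lab in test
        if min(centroids, key=lambda c: abs(x - centroids[c])) == lab
    )
    return 100.0 * correct / len(test)

def bootstrap_accuracies(samples, n_models=150, train_fraction=0.7, seed=0):
    """Repeat random 70/30 splits and collect the test accuracy of each model."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * len(samples)))
    accuracies = []
    for _ in range(n_models):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        accuracies.append(nearest_centroid_accuracy(train, test))
    return accuracies

# Synthetic data: 60 "control" and 60 "TCC" samples with shifted means.
data_rng = random.Random(1)
samples = [(data_rng.gauss(0.0, 1.0), "control") for _ in range(60)]
samples += [(data_rng.gauss(1.5, 1.0), "TCC") for _ in range(60)]

accs = bootstrap_accuracies(samples)
# Running mean of % CC: this is the curve that should flatten out.
running_mean = [statistics.mean(accs[:k]) for k in range(1, len(accs) + 1)]
print(f"first split: {accs[0]:.1f}%   mean of {len(accs)}: {running_mean[-1]:.1f}%")
```

Individual splits can differ widely, as in the figure, while `running_mean` settles as more models are added; that settling is the stopping signal the slide describes.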
Model Optimisation (C3 v TCC)
Model    Overall  Spec   Sens   LV  AUROC
PLS-DA   76.85    74.41  78.95  15  0.8399
SVM-LIN  78.78    74.51  82.46  --  0.8332
[Figure: two histograms of frequency against % CC; BLUE bars = NULL; RED bars = ANALYSIS]
ZTEST for large samples (two-tailed; critical z-value 1.96):
• z-value -79.41: Null Hyp is REJECTED. There IS significant difference.
• z-value -83.48: Null Hyp is REJECTED. There IS significant difference.
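The comparison of the NULL and ANALYSIS histograms can be sketched as below. The exact test statistic used in the pipeline is not given on the slides; this sketch assumes a standard two-sample z statistic, z = (mean_null - mean_analysis) / sqrt(var_null/n_null + var_analysis/n_analysis), and again uses a nearest-class-mean rule on synthetic data as a stand-in for the real models. To keep it short, each permutation is scored on a single random split rather than on the full 150-split block.

```python
import math
import random
import statistics

def split_accuracy(values, labels, rng, train_fraction=0.7):
    """One random 70/30 split scored with a nearest-class-mean rule
    (a simple stand-in for the PLS-DA, SVM or Random Forest models)."""
    order = list(range(len(values)))
    rng.shuffle(order)
    n_train = int(round(train_fraction * len(order)))
    train, test = order[:n_train], order[n_train:]
    means = {
        lab: statistics.mean(values[i] for i in train if labels[i] == lab)
        for lab in {labels[i] for i in train}
    }
    hits = sum(
        1 for i in test
        if min(means, key=lambda lab: abs(values[i] - means[lab])) == labels[i]
    )
    return 100.0 * hits / len(test)

def z_statistic(null, analysis):
    """Two-sample z statistic for large samples (unequal variances)."""
    standard_error = math.sqrt(
        statistics.variance(null) / len(null)
        + statistics.variance(analysis) / len(analysis)
    )
    return (statistics.mean(null) - statistics.mean(analysis)) / standard_error

rng = random.Random(0)
values = [rng.gauss(0.0, 1.0) for _ in range(60)]
values += [rng.gauss(1.5, 1.0) for _ in range(60)]
labels = [0] * 60 + [1] * 60

# ANALYSIS distribution: true labels, 150 random splits.
analysis = [split_accuracy(values, labels, rng) for _ in range(150)]

# NULL distribution: classes randomly reassigned before each split, 300 times.
null = []
for _ in range(300):
    shuffled = labels[:]
    rng.shuffle(shuffled)
    null.append(split_accuracy(values, shuffled, rng))

z = z_statistic(null, analysis)
print(f"z = {z:.2f}; null hypothesis rejected: {abs(z) > 1.96}")
```

A large negative z means the permuted-label models score far below the real ones, which is the pattern the histograms show: the two distributions barely overlap.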
C3 v TCC: Random Forest
Overall  Spec   Sens   Tree  AUROC
73.57    66.15  80.52  350   0.7740
[Figure: histogram of frequency against % CC; BLUE bars = NULL; RED bars = ANALYSIS]
ZTEST for large samples (two-tailed; critical z-value 1.96):
• z-value -74.25: Null Hyp is REJECTED. There IS significant difference.
→ Working with a much larger GC-MS dataset…
Model Comparison
Dataset   Model          %Overall  %Spec  %Sens  LV or Tree  AUROC
C1vTCC    PLS-DA         87.53     87.23  87.82  16          0.9055
          SVM Lin        88.99     88.84  89.13  --          0.9350
          Random Forest  80.91     80.28  81.75  450         0.8923
C2vTCC    PLS-DA         88.35     88.21  88.48  12          0.9276
          SVM Lin        89.18     88.00  90.33  --          0.9220
          Random Forest  82.70     82.93  82.72  450         0.8654
C3vTCC    PLS-DA         76.85     74.41  78.95  15          0.8399
          SVM Lin        78.78     74.51  82.46  --          0.8332
          Random Forest  73.57     66.15  80.52  350         0.7740
C3vTCC1   PLS-DA         75.78     80.91  57.50  20          0.6193
          SVM Lin        80.47     97.94  18.25  --          0.6079
          Random Forest  79.25     97.73  12.25  350         0.5476
C3vTCC2   PLS-DA         78.37     79.81  75.21  16          0.8134
          SVM Lin        78.60     93.16  46.69  --          0.7256
          Random Forest  75.29     95.32  30.55  350         0.7079
C3vTCC3   PLS-DA         80.75     84.94  71.20  7           0.8592
          SVM Lin        82.67     94.28  56.19  --          0.8122
          Random Forest  75.54     95.52  29.68  350         0.6009
• C3 v TCC1 is most difficult (see sensitivities)!
PLS-DA: Permutations
[Figure: histograms of frequency against % CC for C3 v TCC and C3 v TCC1; BLUE bars = NULL; RED bars = ANALYSIS]
PLS-DA Loadings
• Indication of significant/key variables
→ Biomarker discovery!
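One simple way to turn loadings into a shortlist of key variables is to rank them by absolute value. The sketch below assumes a single list of loadings for one latent variable; the values and the function name are hypothetical, and the pipeline may select variables differently.

```python
def top_variables(loadings, k=3):
    """Return the indices of the k variables with the largest absolute loading."""
    ranked = sorted(range(len(loadings)), key=lambda i: abs(loadings[i]), reverse=True)
    return ranked[:k]

# Hypothetical loadings for one latent variable, one value per chromatogram index.
loadings = [0.02, -0.41, 0.05, 0.33, -0.01, 0.27]
key_indices = top_variables(loadings)
print(key_indices)
```

The selected indices point back to positions in the chromatogram, whose mass spectra can then be searched against databases.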
BIOMARKER DISCOVERY
[Figure: intensity plot]
• Selected mass spectra sequentially uploaded to offline and online databases…
• … searching the literature…
C2 v TCC: PLS-DA
Overall  Spec   Sens   LV  AUROC
88.35    88.21  88.48  12  0.9276
[Figures: features highlighted at 6.43 min and 7.078 min]
Compound Database
2-pentanone NIST & MassBank
2,3-butanedione MassBank
4-heptanone MassBank
Dimethyl disulphide NIST
Hexanal NIST
Benzaldehyde MassBank
Butyrophenone MassBank
3-hydroxyanthranilic acid MassBank
Benzoic acid MassBank
Trans-3-hexenoic acid MassBank
Cis-3-hexenoic acid MassBank
2-butanone NIST
2-propanol NIST
Acetic acid NIST
Piperitone MassBank
Thujone MassBank
Median value of abundance from C3 to TCC (increase or decrease)
Compound
2-pentanone
2,3-butanedione
4-heptanone
Dimethyl disulphide
Hexanal
Benzaldehyde
Butyrophenone
3-hydroxyanthranilic acid
Benzoic acid
Trans-3-hexenoic acid
Cis-3-hexenoic acid
2-butanone
2-propanol
Acetic acid
Piperitone
Thujone
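The increase-or-decrease comparison for a single compound can be sketched as follows. The abundances are made up for illustration and the function name is an assumption; only the idea of comparing group medians comes from the slide.

```python
import statistics

def median_change(control_values, cancer_values):
    """Compare group medians and report the direction of change from control to cancer."""
    control_median = statistics.median(control_values)
    cancer_median = statistics.median(cancer_values)
    if cancer_median > control_median:
        direction = "increase"
    elif cancer_median < control_median:
        direction = "decrease"
    else:
        direction = "no change"
    return control_median, cancer_median, direction

# Made-up abundances for one compound in the C3 and TCC groups.
c3 = [1.0, 1.4, 0.9, 1.2, 1.1]
tcc = [1.8, 1.3, 2.1, 1.6, 1.9]
result = median_change(c3, tcc)
print(result)
```

The median is less sensitive than the mean to a few extreme samples, which matters with small groups such as TCC1.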
• Ketone series found in urine.
• Also present in food (fruits, dairy, fish…)
• Elevated levels found in some gut
infections (C. jejuni; C. difficile) and some
gut conditions e.g. coeliac disease,
ulcerative colitis and non-alcoholic fatty
liver disease.
• Metabolised from acetaldehyde
• Ubiquitous in urine
• Known to be elevated in a number of
diseases including kidney disease, diabetes
and some infections
• Normal in breath and biofluids
• Elevated in some cancers and gut
conditions
• Ubiquitous in urine
• Elevated in gut disease and cancers
• Sometimes found in urine probably from
food additives
• Drug component so unlikely to be naturally
occurring in body
• Oxidation product of tryptophan metabolism
• Food additive and drug component
• Acids appear at higher concentration if urine pH drops, so this may be
a result of lower pH due to tumour growth
• Ubiquitous in urine
• Ubiquitous in urine
• Elevated in many conditions
including gut disease, diabetes and
others
• May be higher in cancer due to likely
lower pH
• Flavouring ingredients
Conclusions
• For C2 (healthy individuals with occasional menstrual bleeding or transient
proteinuria) versus TCC (TCC1, TCC2 and TCC3 combined), PLS-DA-derived models gave
a mean accuracy of 88.4%, with 88.5% sensitivity and 88.2% specificity.
• SVM-derived models gave a mean accuracy of 89.2%, with a sensitivity of
90.3% and a specificity of 88.0%.
• The specificities achieved were marginally lower than that of conventional
urine cytology (typically >90% specificity). Sensitivity was very close to the typical
range of 80-90% for high-grade tumours and better than the typical range of
20-50% for low-grade tumours. Case in point: the sensitivity attained for C3 v
TCC1 was 57.5%, marginally better than the 20-50% sensitivity of urine cytology.
• It is currently uncertain why exogenous compounds are present in different
concentrations. For TCC it may be down to a changed host response due to the
presence of the tumour.
• Data science techniques can very likely be employed to assist in
the diagnosis of cancers and diseases, including identifying probable markers.
Where Next??
• Ideally to be able to determine which grade of TCC
• Have seen a successful approach for bladder cancer
→ Suggests that it could be a replacement for the current invasive
procedures
→ Needs careful validation…
• Ideally validation across different populations/regions
• Potential for further collaborations?
References
• M. Cauchi, C. M. Weber, B. J. Bolt, P. B. Spratt, C. Bessant, D. C. Turner, C. M. Willis, L. E. Britton, C. Turner, G. Morgan, “Evaluation of gas chromatography mass spectrometry and pattern recognition for the identification of bladder cancer from urine headspace”, Analytical Methods, 2016, 8, 4037-4046.
• C. M. Weber, M. Cauchi, M. Patel, C. Bessant, C. Turner, L. E. Britton, C. M. Willis, “Evaluation of a gas sensor array and pattern recognition for the identification of bladder cancer from urine headspace”, The Analyst, 2011, 136 (2), 359-364.
Pattern Recognition Pipeline
[Figure: module diagram of the pipeline, headed CLASSIFICATION_V1, showing the following modules]
• IMPORTDATA, EXPLORATORY (GUI), ALIGNMENT, VSELECT (GUI), CLASSIFY, EXPORT
• NCDFLOAD, CVSLOAD, MATLOAD, TABLOAD, XLSLOAD, SPCLOAD, DBLOAD, BUILDCLASS, ExtractSIC
• CVCLASSIFY, BSCLASSIFY, DoSCALING, DoMODELLING, DoSTATS, xPVcalc
• pretreat2plsdaMC, plsMC, SVM, ANN, O-PLSDA, KNN, LDA, Lin, RBFSig, Poly
• SPLIT, BuildBOOT, PROGRESS, Report Generator, NullCLASSIFY, GetVectorFromStructure, ReplaceWithZeros2, ToROCclass
• BCCLASSIFYrocit, histfit3, ReplaceWithBaseline
• Diagram note: Obsolete since using rocit
[Figures: Exploratory Data Analysis; Feature Selection; Report Generation]
Collaborations (Former and Ongoing)
Thank you!