Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and...
-
Upload
maximillian-hodges -
Category
Documents
-
view
217 -
download
0
Transcript of Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and...
![Page 1: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/1.jpg)
Copyright 2003 limsoon wong
Data Mining of Gene Expression Profiles for the Diagnosis and
Understanding of Diseases
Limsoon Wong
Institute for Infocomm Research
![Page 2: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/2.jpg)
Copyright 2003 limsoon wong
Plan
• Some accomplishments and challenges in knowledge discovery from biological and clinical data
• Data mining in microarray analysis– diagnosis of disease state and subtype– derivation of treatment plan– understanding of gene interaction network
![Page 3: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/3.jpg)
Copyright 2003 limsoon wong
Knowledge Discovery from Biological and Clinical Data: MOTIVATION
![Page 4: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/4.jpg)
Copyright 2003 limsoon wong
• Complete genomes are now available
• Knowing the genes is not enough to understand how biology functions
• Proteins, not genes, are responsible for many cellular activities
• Proteins function by interacting with other proteins and biomolecules
GENOME PROTEOME
INTERACTOME
Driving Forces: Genes, Proteins, Interactions, Diagnosis, & Cures
![Page 5: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/5.jpg)
Copyright 2003 limsoon wong
If we figure out how these work, we get these Benefits
To the patient:Better drug, better treatment
To the pharma:Save time, save cost, make more $
To the scientist:Better science
![Page 6: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/6.jpg)
Copyright 2003 limsoon wong
To figure these out,we bet on...
“solution” = Data Mgmt + Knowledge Discovery
Data Mgmt =Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases
![Page 7: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/7.jpg)
Copyright 2003 limsoon wong
Knowledge Discovery from Biological and Clinical Data: ACCOMPLISHMENT
![Page 8: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/8.jpg)
Copyright 2003 limsoon wong
IntegrationTechnology(Kleisli)
Cleansing & Warehousing (FIMM)
MHC-PeptideBinding(PREDICT)
Protein InteractionsExtraction (PIES)
Gene Expression & Medical RecordDatamining (PCL)
Gene FeatureRecognition (Dragon)
VenomInformatics
1994 19981996 2000 2002ISS KRDL LIT/I2R
GeneticXchange
MolecularConnections
Biobase
8 years of bioinformatics R&D in Singapore
![Page 9: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/9.jpg)
Copyright 2003 limsoon wong
Predict Epitopes,Find Vaccine Targets
• Vaccines are often the only solution for viral diseases
• Finding & developing effective vaccine targets is slow and expensive process
• Develop systems to recognize protein peptides that bind MHC molecules• Develop systems to recognize hot spots in viral antigens
![Page 10: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/10.jpg)
Copyright 2003 limsoon wong
Recognize Functional Sites,Help Scientists
• Effective recognition of initiation, control, and termination of biological processes is crucial to speeding up and focusing scientific experiments
• Data mining of bio seqs to find rules for recognizing & understanding functional sites
Dragon’s 10x reduction of TSS recognitionfalse positives
![Page 11: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/11.jpg)
Copyright 2003 limsoon wong
Diagnose Leukaemia, Benefit Children
• Childhood leukaemia is a heterogeneous disease
• Treatment is based on subtype• 3 different tests and 4 different
experts are needed for accurate diagnosis
Curable in USA, fatal in Indonesia
• A single platform diagnosis based on gene expression• Data mining to discover rules that are easy for doctors to understand
![Page 12: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/12.jpg)
Copyright 2003 limsoon wong
Understand Proteins,Fight Diseases
• Understanding function and role of protein needs organised info on interaction pathways
• Such info are often reported in scientific paper but are seldom found in structured databases
• Knowledge extraction system to process free text • extract protein names• extract interactions
![Page 13: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/13.jpg)
Copyright 2003 limsoon wong
Data Mining in Microarray Analysis:MICROARRAY BACKGROUND
![Page 14: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/14.jpg)
Copyright 2003 limsoon wong
What’s a Microarray?
• Contain large number of DNA molecules spotted on glass slides, nylon membranes, or silicon wafers
• Measure expression of thousands of genes simultaneously
![Page 15: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/15.jpg)
Copyright 2003 limsoon wong
Affymetrix GeneChip Array
![Page 16: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/16.jpg)
Copyright 2003 limsoon wong
Making Affymetrix GeneChip
quartz is washed to ensure uniform hydroxylation across its surface and to attach linker molecules
exposed linkers become deprotected and are available for nucleotide coupling
![Page 17: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/17.jpg)
Copyright 2003 limsoon wong
Gene Expression Measurement by GeneChip
![Page 18: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/18.jpg)
Copyright 2003 limsoon wong
A Sample Affymetrix GeneChip File (U95A)
![Page 19: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/19.jpg)
Copyright 2003 limsoon wong
Data Mining in Microarray Analysis: DISEASE SUBSTYPE DIAGNOSIS
![Page 20: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/20.jpg)
Copyright 2003 limsoon wong
Pediatric Acute Lymphoblastic Leukemia
• A heterogeneous disease with more than 12 subtypes, e.g., T-ALL, E2A-PBX1, TEL-AML1, BCR-ABL, MLL, and Hyperdip>50.
• Treatment response is subtype dependent
• 80% continuous remission if subtype is correctly diagnosed and the corresponding treatment plan is applied
![Page 21: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/21.jpg)
Copyright 2003 limsoon wong
Subtype Diagnosis
• Require different tests:– immunophenotyping– cytogenetics– molecular diagnostics
• Require different experts:– hematologist– oncologist– pathologist– cytogeneticist
![Page 22: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/22.jpg)
Copyright 2003 limsoon wong
Difficulties and Implications
• The different tests and experts are not commonly available within a single hospital, especially in less advanced countries
An 80%-curable disease in USA can be a fatal disease in Indonesia!
Is there a single diagnostic platform that does not need multiple human specialists?
![Page 23: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/23.jpg)
Copyright 2003 limsoon wong
A Potential Solution by MicroarraysYeoh et al., Cancer Cell 1:133--143, 2002
Diagnostic ALL BM samples (n=327)
3-3 -2 -1 0 1 2 = std deviation from mean
Ge
ne
s f
or
cla
ss
d
isti
nc
tio
n (
n=
27
1)
TEL-AML1BCR-ABL
Hyperdiploid >50E2A-PBX1
MLL T-ALL Novel
TEL-AML1E2A-PBX1
T-ALLHyperdiploid >50BCR-ABL
MLLNovel
![Page 24: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/24.jpg)
Copyright 2003 limsoon wong
Some Caveats
• Study was performed on Americans
• May not be applicable to Singaporeans, Malaysians, Indonesians, etc.
• Large-scale study on local populations currently in the works
![Page 25: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/25.jpg)
Copyright 2003 limsoon wong
Typical Procedure in Analysing Gene Expression for Diagnosis
• Gene expression data collection
• Gene selection
• Classifier training
• Classifier tuning (optional for some machine learning methods)
• Apply classifier for diagnosis of future cases
![Page 26: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/26.jpg)
Copyright 2003 limsoon wong
Feature Selection Methods
A refresher of feature selection methods
![Page 27: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/27.jpg)
Copyright 2003 limsoon wong
Signal Selection (Basic Idea)
• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance
![Page 28: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/28.jpg)
Copyright 2003 limsoon wong
Signal Selection (eg., t-statistics)
![Page 29: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/29.jpg)
Copyright 2003 limsoon wong
Signal Selection (eg., 2)
![Page 30: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/30.jpg)
Copyright 2003 limsoon wong
Signal Selection (eg., CFS)
• Instead of scoring individual signals, how about scoring a group of signals as a whole?
• CFS– Correlation-based Feature Selection– A good group contains signals that are highly
correlated with the class, and yet uncorrelated with each other
![Page 31: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/31.jpg)
Copyright 2003 limsoon wong
Gene Expression Profile Classification
An introduction to gene expression profile classificationby the example on ALL subtype diagnosis
![Page 32: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/32.jpg)
Copyright 2003 limsoon wong
Subtype Classification of ALL
A tree-structureddiagnostic workflow was recommended bythe doctors, as perYeoh et al., Cancer Cell 1:133--143, 2002
![Page 33: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/33.jpg)
Copyright 2003 limsoon wong
Training and Testing Sets
![Page 34: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/34.jpg)
Copyright 2003 limsoon wong
Our procedure for ALL subtype diagnosis
• Gene expression data collection
• Gene selection by entropy
• Classifier training by emerging pattern
• Classifier tuning (optional for some machine learning methods)
• Apply classifier for diagnosis of future cases by PCL
![Page 35: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/35.jpg)
Copyright 2003 limsoon wong
Signal Selection (eg., entropy)
![Page 36: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/36.jpg)
Copyright 2003 limsoon wong
Emerging Patterns (EPs)
• An EP is a set of conditions– usually involving several features– that most members of a class satisfy – but none or few of the other class satisfy
• A jumping EP is an EP that – some members of a class satisfy– but no members of the other class satisfy
• We use only most general jumping EPs
![Page 37: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/37.jpg)
Copyright 2003 limsoon wong
PCL: Prediction by Collective Likelihood
![Page 38: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/38.jpg)
Copyright 2003 limsoon wong
Accuracy (using 20 genes of lowest entropy)
PCL
1:1
0:25
0:21:1
0:04
0:1
1:1
1:614
0:11:1
5
0:12:2
7
![Page 39: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/39.jpg)
Copyright 2003 limsoon wong
Comprehensibility
![Page 40: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/40.jpg)
Copyright 2003 limsoon wong
Gene Expression Profile ClassificationHow about other feature selection and
classification methods?
![Page 41: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/41.jpg)
Copyright 2003 limsoon wong
Some gene selection heuristics
• all-CFS: all features from CFS
• top20-2: 20 features w/ highest 2 stats
• top20-t: 20 features w/ highest t-stats
• top20-mit: 20 features w/ highest MIT stats
• entropy: 20 features w/ lowest entropy
• all-2: all features meeting 5% significance level of 2 stats
![Page 42: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/42.jpg)
Copyright 2003 limsoon wong
Some other classification methods
• k-NN (k=1)– majority votes of the k nearest neighbours
determined by Euclidean distance
• C4.5– widely used decision tree method.
• Naïve Bayes (NB)– probabilistic prediction using Bayes’ rule
• SVM– (linear) discriminant function that maximizes
separation of boundary samples
![Page 43: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/43.jpg)
Copyright 2003 limsoon wong
Accuracy
• Feature selection improves performance• Entropy+PCL has consistent high performance
![Page 44: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/44.jpg)
Copyright 2003 limsoon wong
When 20 genes are selected randomly
Average over 100 experiments
Cf. 7-15 mistakes total with good feature selection
![Page 45: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/45.jpg)
Copyright 2003 limsoon wong
Data Mining in Microarray Analysis: TREATMENT PLAN DERIVATION
A pure speculation!
![Page 46: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/46.jpg)
Copyright 2003 limsoon wong
Can we do more with EPs?
• Detect gene groups that are significantly related to a disease
• Derive coordinated gene expression patterns from these groups
• Derive “treatment plan” based on these patterns
![Page 47: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/47.jpg)
Copyright 2003 limsoon wong
Colon Tumour DatasetAlon et al., PNAS 96:6745--6750, 1999
• We use the colon tumour dataset above to illustrate our ideas– 22 normal samples– 40 colon tumour samples
![Page 48: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/48.jpg)
Copyright 2003 limsoon wong
Detect Gene Groups
• Feature Selection– Use entropy method– 35 genes have cut points
• Generate EPs– 19501 EPs in normals– 2165 EPs in tumours
• EPs with largest support are gene groups significantly co-related to disease
![Page 49: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/49.jpg)
Copyright 2003 limsoon wong
Top 20 EPs
![Page 50: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/50.jpg)
Copyright 2003 limsoon wong
Observation 1
• Some EPs contain large number of genes and still have high freq
• E.g., {2, 3, 6, 7, 13, 17, 33} has freq 90.91% in normal and 0% in cancer samples
Nearly all normal sample’s gene expr. values satisfy all conds. implied by these 7 items
![Page 51: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/51.jpg)
Copyright 2003 limsoon wong
Observation 2
• Freq of singleton EP is not necessarily larger than EP having multiple genes
• E.g., {5} is EP in cancer samples and has freq 32.5%
• E.g., {16, 58, 62} is EP in cancer samples and has freq 75.5%
Groups of genes and their correlation's could be more impt than single genes
![Page 52: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/52.jpg)
Copyright 2003 limsoon wong
Observation 3
• M33680 has lowest entropy of the 35 genes if cutpoint is set at 352
• 18/40 of cancer samples shift expr level of M33680 from its normal range to its abnormal range
![Page 53: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/53.jpg)
Copyright 2003 limsoon wong
Treatment Plan Idea
• Increase/decrease expression level of particular genes in a cancer cell so that– it has the common EPs of normal cells– it has no common EPs of cancer cells
![Page 54: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/54.jpg)
Copyright 2003 limsoon wong
Treatment Plan Example
• From the EP {2,3,6,7,13,17,33}– 91% of normal cells express the 7 genes
(T51560, T49941, M62994, R34701, L02426, U20428, R10707) in the corr. Intervals
– a cancer cell never express all 7 genes in the same way
– if expression level of improperly expressed genes can be adjusted, the cancer cell can have one common EP of normal cells
– a cancer cell can then be iteratively converted into a normal one
![Page 55: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/55.jpg)
Copyright 2003 limsoon wong
Choosing Genes to Adjust
![Page 56: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/56.jpg)
Copyright 2003 limsoon wong
Doing more adjustments...
• Down regulating T49941 leads to 2 more top 10 EPs of normal cells to show up in the adjusted T1
• Down regulating X62153 to below 396 and T72403 to below 296 leads to T1 having 9 top 10 EPs of normal cells
• Ave. no. of EPs in normal cells is 9
• So the adjusted T1 now has impt features of normal cells
![Page 57: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/57.jpg)
Copyright 2003 limsoon wong
Next, eliminate common EPs of cancer cells in T1
• 6 more genes (K03001, T49732, U29171, R76254, D31767,
L40992) are adjusted
• All top 10 EPs of cancer cells now disappear from T1
• Ave. no. of top 10 EPs contained in cancer cells is 6
• The adjusted T1 now holds enough common features of normal cells and no features of cancer cells
T1 is converted to normal cells
![Page 58: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/58.jpg)
Copyright 2003 limsoon wong
“Treatment Plan” Validation
• “Adjustments” were made to the 40 colon tumour samples based on EPs as described
• Classifiers trained on original samples were applied to the adjusted samples
It works!
![Page 59: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/59.jpg)
Copyright 2003 limsoon wong
A Big But...
• Effective means for identifying mechanisms and pathways through which to modulate gene expression of selected genes need to be developed
![Page 60: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/60.jpg)
Copyright 2003 limsoon wong
Data Mining in Microarray Analysis:GENE INTERACTION PREDICTION
![Page 61: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/61.jpg)
Copyright 2003 limsoon wong
Beyond Classification of Gene Expression Profiles
• After identifying the candidate genes by feature selection, do we know which ones are causal genes and which ones are surrogates?
Diagnostic ALL BM samples (n=327)
3-3 -2 -1 0 1 2 = std deviation from mean
Ge
ne
s f
or
cla
ss
d
isti
nc
tio
n (
n=
27
1)
TEL-AML1BCR-ABL
Hyperdiploid >50E2A-PBX1
MLL T-ALL Novel
![Page 62: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/62.jpg)
Copyright 2003 limsoon wong
Gene Regulatory Circuits
• Genes are “connected” in “circuit” or network
• Expression of a gene in a network depends on expression of some other genes in the network
• Can we reconstruct the gene network from gene expression data?
![Page 63: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/63.jpg)
Copyright 2003 limsoon wong
Key Questions
For each gene in the network:
• which genes affect it?
• How they affect it?– Positively?– Negatively?– More complicated ways?
![Page 64: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/64.jpg)
Copyright 2003 limsoon wong
Some Techniques
• Bayesian Networks– Friedman et al., JCB 7:601--620, 2000
• Boolean Networks– Akutsu et al., PSB 2000, pages 293--304
• Differential equations– Chen et al., PSB 1999, pages 29--40
• Classification-based method– Soinov et al., “Towards reconstruction of gene
network from expression data by supervised learning”, Genome Biology 4:R6.1--9, 2003
![Page 65: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/65.jpg)
Copyright 2003 limsoon wong
A Classification-based TechniqueSoinov et al., Genome Biology 4:R6.1-9, Jan 2003
• Given a gene expression matrix X– each row is a gene– each column is a sample
– each element xij is expression of gene i in sample j
• Find the average value ai of each gene i
• Denote sij as state of gene i in sample j,
– sij = up if xij > ai
– sij = down if xij ai
![Page 66: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/66.jpg)
Copyright 2003 limsoon wong
• To see whether the state of gene g is determined by the state of other genes– we see whether sij | i
g can predict sgj
– if can predict with high accuracy, then “yes”
– Any classifier can be used, such as C4.5, PCL, SVM, etc.
• To see how the state of gene g is determined by the state of other genes– apply C4.5 (or PCL or
other “rule-based” classifiers) to predict sgj
from sij | i g – and extract the decision
tree or rules used
A Classification-based TechniqueSoinov et al., Genome Biology 4:R6.1-9, Jan 2003
![Page 67: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/67.jpg)
Copyright 2003 limsoon wong
Advantages of this method
• Can identify genes affecting a target gene
• Don’t need discretization thresholds
• Each data sample is treated as an example
• Explicit rules can be extracted from the classifier (assuming C4.5 or PCL)
• Generalizable to time series
![Page 68: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/68.jpg)
Copyright 2003 limsoon wong
Acknowledgements
Vladimir Bajic
See-Kiong NgVladimir Brusic
Huiqing Liu
Jinyan Li
![Page 69: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/69.jpg)
Copyright 2003 limsoon wong
Data Mining in Microarray Analysis: NOTES
![Page 70: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/70.jpg)
Copyright 2003 limsoon wong
References
• J.Li, L. Wong, “Geography of differences between two classes of data”, Proc. 6th European Conf. on Principles of Data Mining and Knowledge Discovery, pp. 325--337, 2002
• J.Li, L. Wong, “Identifying good diagnostic genes or gene groups from gene expression data by using the concept of emerging patterns”, Bioinformatics, 18:725--734, 2002
• J.Li et al., “A comparative study on feature selection and classification methods using a large set of gene expression profiles”, GIW, 13:51--60, 2002
![Page 71: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/71.jpg)
Copyright 2003 limsoon wong
References
• E.-J. Yeoh et al., “Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling”, Cancer Cell, 1:133--143, 2002
• U.Alon et al., “Broad patterns of gene expression revealed by clustering analysis of tumor colon tissues probed by oligonucleotide arrays”, PNAS 96:6745--6750, 1999
• L.A.Soinov et al., “Towards reconstruction of gene networks from expression data by supervised learning”, Genome Biology 4:R6.1--9, 2003.
![Page 72: Copyright 2003 limsoon wong Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm.](https://reader035.fdocuments.net/reader035/viewer/2022070415/56649e495503460f94b3cd14/html5/thumbnails/72.jpg)
Copyright 2003 limsoon wong
> Data Mining of Gene Expression Profiles for> the Diagnosis and Understanding of Diseases>> This talk is divided into two parts. In Part I, I will provide a> brief overview of some accomplishments and challenges> in Bioinformatics. In Part II, I will discuss the data mining> in the analysis of microarray gene expression profiles for> (a) diagnosis of disease state or subtype, (b) derivation of> disease treatment plan, and (c) understanding of gene> interaction networks.>