Ranjit Ganta, Raj Acharya, Shruthi Prabhakara
Department of Computer Science and Engineering, Penn State University
DATA WAREHOUSE FOR BIO-GEO HEALTH CARE INFORMATICS
Health-Care records: ‘Bio-geo’ Informatics
• Patient identification information
• Geographical information
• Clinical information:
Organ/Cellular level: Tumor, pathology.
Molecular level: DNA sequence, Microarray.
Laboratory data: Blood tests, diagnosis, prognosis.
INTRODUCTION
Integration of Health-care records:
• Privacy Violation
• Distributed integration of health care records.
Integration within Health-care records:
• Information Fusion: Combine multiple disparate sources of information such that the whole is more than the sum of its parts.
For the patient demographic data set this helps answer questions such as:
Which age/race profile(s), if any, define a typical profile of a prostate cancer patient?
Are middle-aged Caucasian males more prone to prostate cancer than Caucasians of other age groups?
Is there a close association between age and race groups?
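As a rough illustration of how the last question could be approached, the sketch below computes the standardized residuals and the total inertia (chi-square statistic divided by the sample size) of an age-by-race contingency table. These are the quantities that correspondence analysis goes on to decompose; the sketch stops short of that decomposition. The counts and the function name `standardized_residuals` are invented for illustration and are not taken from the poster's data.

```python
from math import sqrt

def standardized_residuals(table):
    """Return (residuals, total_inertia) for a contingency table of counts.

    Assumes every row and every column has a non-zero total.
    """
    n = sum(sum(row) for row in table)
    n_cols = len(table[0])
    row_mass = [sum(row) / n for row in table]
    col_mass = [sum(row[j] for row in table) / n for j in range(n_cols)]
    residuals = [
        [
            (table[i][j] / n - row_mass[i] * col_mass[j])
            / sqrt(row_mass[i] * col_mass[j])
            for j in range(n_cols)
        ]
        for i in range(len(table))
    ]
    # Total inertia is the sum of squared standardized residuals (chi-square / n).
    inertia = sum(s * s for row in residuals for s in row)
    return residuals, inertia

# Hypothetical counts: rows are age groups, columns are race groups.
counts = [
    [12, 30, 18],
    [8, 25, 22],
    [5, 10, 9],
]
residuals, inertia = standardized_residuals(counts)
print(f"total inertia = {inertia:.4f}")
```

A total inertia near zero would suggest that age and race groups are close to independent in the table; larger residuals point to the age/race cells that depart most from independence.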
CORRESPONDENCE ANALYSIS
Sample Result:
Example Data: Dhanasekaran et al. "Delineation of prognostic biomarkers in prostate cancer", Letters to Nature, Vol 412, August 2001, pages 822-826. Supplementary data (Fig 1C, pg 823, Commercial Pool).
Gene expression (microarray data) in four clinical states of prostate-derived tissues
CLINICAL STATES
Benign states
BPH : Benign Prostatic Hyperplasia
NAP : Normal Adjacent Prostate
Malignant states
PCA : Localized prostate cancer
MET : Metastatic sample
Sample Result:
KL-CLUSTERING
Genes to co-regulated genes (figure): input profiles g1 to g6 are grouped into clusters:
Down-regulated: {g1}
Up-regulated: {g2, g4}; {g3}; {g5}
No change: {g6}
The Kullback-Leibler (KL) divergence measures the relative dissimilarity of the shapes of two gene profiles.
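A minimal sketch of this idea, assuming each gene profile is a list of non-negative expression values that is first normalized to sum to 1 so that only its shape matters. The helper names `to_distribution` and `kl_divergence`, the small `eps` offset, and the toy profiles are illustrative assumptions, not the authors' implementation.

```python
from math import log

def to_distribution(profile, eps=1e-9):
    """Normalize a non-negative profile to sum to 1 (eps avoids zeros)."""
    shifted = [v + eps for v in profile]
    total = sum(shifted)
    return [v / total for v in shifted]

def kl_divergence(p, q):
    """D(p || q) = sum over x of p(x) * log(p(x) / q(x))."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy profiles (invented): g_b has the same shape as g_a at twice the magnitude,
# g_c has the reversed shape.
g_a = to_distribution([1.0, 2.0, 4.0, 8.0])
g_b = to_distribution([2.0, 4.0, 8.0, 16.0])
g_c = to_distribution([8.0, 4.0, 2.0, 1.0])

print(kl_divergence(g_a, g_b))  # close to zero: same shape
print(kl_divergence(g_a, g_c))  # clearly positive: different shape
```

Note that the divergence is not symmetric in general, so D(p || q) and D(q || p) can differ.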
1-D SOM algorithm + KL
Minimize D(Gene || SOM weight for each node) at each iteration step.
$$D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = E_{p}\!\left[\log \frac{p(x)}{q(x)}\right]$$
[Bioinformatics, Vol. 19, No. 4, 2003, 449-458]
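The 1-D SOM step can be sketched roughly as follows: nodes sit on a line, each holding a weight vector that is itself a distribution, and at every iteration the node minimizing D(gene || weight) is pulled toward the sampled gene together with its neighbours. This is a generic SOM sketch and not the algorithm from the cited paper; the learning-rate schedule, Gaussian neighbourhood, random initialization, and all function and parameter names are assumptions made for illustration.

```python
from math import exp, log
import random

def normalize(v, eps=1e-9):
    """Turn a non-negative vector into a strictly positive distribution."""
    shifted = [x + eps for x in v]
    total = sum(shifted)
    return [x / total for x in shifted]

def kl(p, q):
    """D(p || q) for strictly positive distributions."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

def train_som_1d(profiles, n_nodes=3, n_iter=200, lr0=0.5, radius0=1.0, seed=0):
    """Train a 1-D SOM whose best-matching unit minimizes D(gene || weight)."""
    rng = random.Random(seed)
    data = [normalize(p) for p in profiles]
    dim = len(data[0])
    weights = [normalize([rng.random() for _ in range(dim)]) for _ in range(n_nodes)]
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                      # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 1e-3)   # shrinking neighbourhood
        gene = data[rng.randrange(len(data))]
        bmu = min(range(n_nodes), key=lambda k: kl(gene, weights[k]))
        for k in range(n_nodes):
            h = exp(-((k - bmu) ** 2) / (2.0 * radius ** 2))
            moved = [w + lr * h * (g - w) for w, g in zip(weights[k], gene)]
            weights[k] = normalize(moved)            # keep each weight a distribution
    return weights

def assign(profiles, weights):
    """Label each profile with the index of its minimum-KL node."""
    data = [normalize(p) for p in profiles]
    return [min(range(len(weights)), key=lambda k: kl(p, weights[k])) for p in data]

# Toy profiles (invented): two rising, two falling, one flat.
profiles = [[1, 2, 4, 8], [1, 2, 4, 9], [8, 4, 2, 1], [9, 4, 2, 1], [1, 1, 1, 1]]
weights = train_som_1d(profiles)
print(assign(profiles, weights))
```

Genes that end up on the same node would be read as one cluster of similarly shaped profiles; the number of nodes is a free parameter here.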
Common Motifs
Motif: a short segment of DNA that acts as a binding site for a specific transcription factor
• Typically 6-25 bp in length
• Statistically different in composition compared to the background
• Often repeated within a sequence
Frequency of occurrence of each motif in each gene:
          Motif 1   Motif 2   …   Motif k
Gene 1    0         1         …   0
Gene 2    0         1         …   2
…         …
Gene n    3         0         …   0
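A gene-by-motif frequency matrix of this kind could be built along the following lines. The sketch counts exact, possibly overlapping, occurrences of each motif string in each gene's sequence, which is a simplification since real motifs are often degenerate. The sequences, motif strings, and function names are made up for illustration.

```python
def count_overlapping(sequence, motif):
    """Count occurrences of motif in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)
    return count

def motif_matrix(sequences, motifs):
    """Map each gene name to its vector of motif occurrence counts."""
    return {
        gene: [count_overlapping(seq, m) for m in motifs]
        for gene, seq in sequences.items()
    }

# Toy inputs (invented for illustration).
motifs = ["ACGCGT", "TATA"]
sequences = {
    "gene1": "ACGCGTAAACGCGT",
    "gene2": "TTTATATAGG",
}
matrix = motif_matrix(sequences, motifs)
for gene, counts in matrix.items():
    print(gene, counts)
```

Each row of the resulting matrix is the motif vector for one gene, which is the form of input that clustering on motifs would use.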
COMBINED CLUSTERING
Clustering using more than one data source aims at identifying clusters of genes with similar properties among all data.
Goal of combined clustering is to answer the following:
1. If genes have similar expression profile patterns, do they also share common motifs?
2. If genes have a set of motifs in common, do they also exhibit similar expression profile patterns?
3. Which genes share BOTH - that is, they have similar expression profile patterns AND share a set of common motifs?
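One simple way to explore these three questions, sketched below, is to cluster each data source separately and then intersect the two label assignments. This is an illustrative stand-in and not necessarily the combined clustering method used by the authors; the cluster labels, motif counts, and function names are hypothetical.

```python
from collections import defaultdict

def combined_clusters(expr_labels, motif_labels):
    """Group genes that fall in the same expression cluster AND the same motif cluster."""
    groups = defaultdict(list)
    for gene, expr_cluster in expr_labels.items():
        if gene in motif_labels:
            groups[(expr_cluster, motif_labels[gene])].append(gene)
    # Keep only groups with at least two genes, since a single gene shares nothing.
    return {key: sorted(genes) for key, genes in groups.items() if len(genes) > 1}

def shared_motifs(genes, motif_counts):
    """Return indices of motifs that occur at least once in every listed gene."""
    n_motifs = len(motif_counts[genes[0]])
    return [j for j in range(n_motifs) if all(motif_counts[g][j] > 0 for g in genes)]

# Hypothetical cluster labels from two separate clusterings.
expr_labels = {"g1": 0, "g2": 1, "g3": 1, "g4": 1}
motif_labels = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
# Hypothetical motif frequency vectors (one count per motif).
motif_counts = {"g1": [3, 0, 0], "g2": [1, 0, 0], "g3": [0, 2, 1], "g4": [0, 1, 0]}

both = combined_clusters(expr_labels, motif_labels)
for key, genes in both.items():
    print(key, genes, "shared motif indices:", shared_motifs(genes, motif_counts))
```

Question 1 corresponds to calling `shared_motifs` on the members of an expression cluster, question 2 to comparing expression labels within a motif cluster, and question 3 to the intersected groups themselves.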
Alpha Factor Experiments
Figure panels: Cluster on Motif vectors; Cluster on Gene expression; Combined clustering
All genes in the cluster share the Transcription Factor MCBa
CONCLUSION
Figure: Information Fusion Based Attack
Prototype for Bio-geo Data Warehouse (figure). Components shown: Gene Expression; Clinical and Pathology; Public Data (Literature etc); Patient Information; Geographical Information; Global Statistics; Information Fusion based Clustering; Cancer Research Grid; Cancer Analysis Applications; Result Visualization.
We have demonstrated the significance of information fusion based tools for bio-geo health care informatics.
• As a data warehouse for various data sets involved in bio-geo health care informatics studies.
• To provide and demonstrate a set of information fusion tools for disease research.