Ranjit Ganta, Raj Acharya, Shruthi Prabhakara Department of Computer Science and Engineering, Penn...



Ranjit Ganta, Raj Acharya, Shruthi Prabhakara
Department of Computer Science and Engineering, Penn State University

DATA WAREHOUSE FOR BIO-GEO HEALTH CARE INFORMATICS

Health-Care records: ‘Bio-geo’ Informatics
• Patient identification information
• Geographical information
• Clinical information:
  Organ/Cellular level: Tumor, pathology.
  Molecular level: DNA sequence, Microarray.
  Laboratory data: Blood tests, diagnosis, prognosis.

INTRODUCTION

Integration of Health-care records:
• Privacy Violation
• Distributed integration of health care records.

Integration within Health-care records:
• Information Fusion: Combine multiple disparate sources of information such that the whole is more than the sum of its parts.

For the patient demographic data set this helps answer questions such as:

Which age/race profile(s), if any, define a typical profile of a prostate cancer patient?

Are middle-aged Caucasian males more prone to prostate cancer than Caucasians of other age groups?

Is there a close association between age and race groups?
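Questions of this kind are usually posed on an age-by-race contingency table, which is the input to correspondence analysis. The sketch below covers only the first step of that method: it computes the standardized residuals and the total inertia (chi-square divided by the grand total), which indicates how far the table departs from independence. The counts and the function name `standardized_residuals` are made up for illustration and are not taken from the poster's data.

```python
from math import sqrt

def standardized_residuals(table):
    """Return (S, total_inertia) for a contingency table given as a list of rows.

    S[i][j] = (p_ij - r_i * c_j) / sqrt(r_i * c_j), where p is the table
    divided by its grand total and r, c are the row and column masses.
    Assumes every row and every column has a non-zero total.
    """
    n = sum(sum(row) for row in table)
    p = [[x / n for x in row] for row in table]
    r = [sum(row) for row in p]            # row masses
    c = [sum(col) for col in zip(*p)]      # column masses
    s = [[(p[i][j] - r[i] * c[j]) / sqrt(r[i] * c[j])
          for j in range(len(c))] for i in range(len(r))]
    inertia = sum(x * x for row in s for x in row)  # equals chi-square / n
    return s, inertia

# Made-up counts, for illustration only: rows = age groups, columns = race groups.
counts = [
    [10, 20, 30],
    [20, 40, 60],
    [30, 10, 20],
]
S, inertia = standardized_residuals(counts)
```

An inertia near zero suggests little association between the row and column categories. A full correspondence analysis would go on to take the singular value decomposition of `S` and plot row and column profiles on the leading axes.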

CORRESPONDENCE ANALYSIS

Sample Result:

Example Data: Dhanasekaran et al., "Delineation of prognostic biomarkers in prostate cancer", Letters to Nature, Vol. 412, August 2001, pages 822-826. Supplementary data (Fig. 1C, pg. 823, Commercial Pool).

Gene expression (microarray data) in four clinical states of prostate-derived tissues

CLINICAL STATES

Benign states

BPH : Benign Prostatic Hyperplasia

NAP : Normal Adjacent Prostate

Malignant states

PCA : Localized prostate cancer

MET : Metastatic sample

Sample Result:

KL-CLUSTERING

Figure: Genes to co-regulated genes. Input profiles g1, g2, g3, g4, g5, g6 are grouped into clusters:

Down-regulated: {g1}
Up-regulated: {g2, g4}; {g3}; {g5}
No change: {g6}

The Kullback-Leibler (KL) divergence measures the relative dissimilarity of the shapes of two gene profiles.
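A minimal sketch of why KL divergence compares shape rather than magnitude, assuming each profile is non-negative and is first rescaled to sum to 1. The profiles, the helper names, and the small `eps` offset are illustrative assumptions, not part of the original work.

```python
from math import log

def to_distribution(profile, eps=1e-9):
    """Scale a non-negative expression profile so its values sum to 1."""
    shifted = [v + eps for v in profile]  # eps avoids log(0); an assumption
    total = sum(shifted)
    return [v / total for v in shifted]

def kl_divergence(p, q):
    """D(p || q) = sum over x of p(x) * log(p(x) / q(x))."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

g_a = [1.0, 2.0, 4.0, 2.0]      # hypothetical profile
g_b = [10.0, 20.0, 40.0, 20.0]  # same shape, ten times the magnitude
g_c = [4.0, 2.0, 1.0, 2.0]      # different shape

d_same = kl_divergence(to_distribution(g_a), to_distribution(g_b))
d_diff = kl_divergence(to_distribution(g_a), to_distribution(g_c))
```

Here `d_same` is close to zero because the two profiles differ only by a scale factor, while `d_diff` is clearly positive. Note that KL divergence is not symmetric, so the order of the arguments matters.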

1-D SOM algorithm + KL

Minimize D(Gene || SOM weight for each node) at each iteration step.

$$D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

[Bioinformatics, Vol. 19, No. 4, 2003, 449-458]
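The combination of a 1-D SOM with KL divergence can be sketched as follows. This is a simplified illustration under stated assumptions, not the published algorithm: profiles are non-negative and normalized to sum to 1, the winning node is the one minimizing D(gene || weight), and the winner and its neighbors on the line are moved toward the gene by a convex update so that the weights remain valid distributions. The learning-rate and neighborhood schedules, all parameter names, and the example profiles are hypothetical.

```python
import random
from math import exp, log

def normalize(profile, eps=1e-9):
    """Rescale a non-negative profile to sum to 1 (eps avoids log(0))."""
    shifted = [v + eps for v in profile]
    total = sum(shifted)
    return [v / total for v in shifted]

def kl(p, q):
    """D(p || q) = sum over x of p(x) * log(p(x) / q(x))."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

def best_node(gene, weights):
    """Index of the node whose weight vector minimizes D(gene || weight)."""
    return min(range(len(weights)), key=lambda k: kl(gene, weights[k]))

def train_som_1d(profiles, n_nodes=3, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a 1-D SOM; requires n_nodes <= len(profiles) and lr0 <= 1."""
    rng = random.Random(seed)
    genes = [normalize(p) for p in profiles]
    # Start each node at a randomly chosen gene profile.
    weights = [list(g) for g in rng.sample(genes, n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays linearly from 1 toward 0
        lr = lr0 * frac
        sigma = max(sigma0 * frac, 1e-3)
        for gene in rng.sample(genes, len(genes)):  # shuffled presentation order
            win = best_node(gene, weights)
            for k in range(n_nodes):
                # Gaussian neighborhood along the 1-D line of nodes.
                step = lr * exp(-((k - win) ** 2) / (2.0 * sigma ** 2))
                weights[k] = [(1.0 - step) * w + step * g
                              for w, g in zip(weights[k], gene)]
    return weights, [best_node(g, weights) for g in genes]

# Hypothetical profiles standing in for g1..g6.
profiles = [
    [4, 2, 1, 2], [1, 2, 4, 2], [1, 1, 3, 5],
    [2, 4, 8, 4], [1, 5, 1, 1], [3, 3, 3, 3],
]
weights, labels = train_som_1d(profiles)
```

`labels[i]` is the node assigned to the i-th profile, and genes sharing a node form one candidate cluster of co-regulated genes. The result depends on the seed and the schedules chosen here.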

Common Motifs

Motif: short segments of DNA that act as a binding site for a specific transcription factor

• Typically 6-25 bp in length
• Statistically different in composition compared to the background
• Often repeated within a sequence

Frequency of occurrence of each motif in each gene:

          Motif 1   Motif 2   …   Motif k
Gene 1       0         1      …      0
Gene 2       0         1      …      2
…            …         …      …      …
Gene n       3         0      …      0
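A gene-by-motif frequency table of this kind could be built as in the sketch below. It assumes exact string matching of each motif against a gene's sequence, which is a simplification, since real binding sites are often degenerate. The sequences, motifs, and function names are hypothetical.

```python
def count_occurrences(sequence, motif):
    """Count occurrences of motif in sequence, including overlapping ones."""
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)
    return count

def motif_matrix(sequences, motifs):
    """Map each gene to its row of motif counts (one entry per motif)."""
    return {gene: [count_occurrences(seq, m) for m in motifs]
            for gene, seq in sequences.items()}

# Hypothetical sequences and motifs, for illustration only.
sequences = {
    "Gene 1": "TTACGCGTAA",
    "Gene 2": "ACGCGTACGCGT",
    "Gene n": "GGGGGGGG",
}
motifs = ["ACGCGT", "TATA"]
matrix = motif_matrix(sequences, motifs)
# matrix == {"Gene 1": [1, 0], "Gene 2": [2, 0], "Gene n": [0, 0]}
```

Each row of `matrix` is the motif vector for one gene, which is the representation used when clustering on motifs.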

COMBINED CLUSTERING

Clustering using more than one data source aims at identifying clusters of genes with similar properties among all data. The goal of combined clustering is to answer the following:

1. If genes have similar expression profile patterns, do they also share common motifs?

2. If genes have a set of motifs in common, do they also exhibit similar expression profile patterns?

3. Which genes share BOTH - that is, they have similar expression profile patterns AND share a set of common motifs?
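One simple way to approach the third question is to cross-tabulate two separate clusterings and keep the gene groups that agree in both. The sketch below does only that; it is not the combined clustering method itself, which clusters on both data sources jointly. The cluster labels and the function name `combined_clusters` are hypothetical.

```python
from collections import defaultdict

def combined_clusters(expression_labels, motif_labels):
    """Group genes that share BOTH an expression cluster and a motif cluster."""
    groups = defaultdict(list)
    for gene, expr_cluster in expression_labels.items():
        if gene in motif_labels:
            groups[(expr_cluster, motif_labels[gene])].append(gene)
    # Keep only groups with at least two genes; a single gene shares nothing.
    return {key: sorted(genes) for key, genes in groups.items() if len(genes) > 1}

# Hypothetical cluster assignments from two independent clusterings.
expression_labels = {"g1": 0, "g2": 1, "g3": 1, "g4": 1, "g5": 2}
motif_labels = {"g1": 0, "g2": 0, "g3": 1, "g4": 0, "g5": 1}

shared = combined_clusters(expression_labels, motif_labels)
# shared == {(1, 0): ["g2", "g4"]}
```

In this made-up example, g2 and g4 have similar expression profiles and also share a motif cluster, while g3 matches them in expression only.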

Alpha Factor Experiments

Figure panels: Cluster on motif vectors; Cluster on gene expression; Combined clustering

All genes in the cluster share the Transcription Factor MCBa

CONCLUSION

Figure: Information Fusion Based Attack

Prototype for Bio-geo Data Warehouse

Figure: Prototype architecture. Labeled components: Gene Expression; Clinical and Pathology; Public Data (Literature etc.); Patient Information; Geographical Information; Global Statistics; Information Fusion based Clustering; Cancer Research Grid; Cancer Analysis Applications; Result Visualization.

We have demonstrated the significance of information fusion based tools for bio-geo health care informatics.
• As a data warehouse for various data sets involved in bio-geo health care informatics studies.
• To provide and demonstrate a set of information fusion tools for disease research.