Ranjit Ganta, Raj Acharya, Shruthi Prabhakara
Department of Computer Science and Engineering, Penn State University

DATA WAREHOUSE FOR BIO-GEO HEALTH CARE INFORMATICS

Health-care records: ‘Bio-geo’ Informatics
• Patient identification information
• Geographical information
• Clinical information:

Organ/Cellular level: Tumor, pathology.

Molecular level: DNA sequence, Microarray.

Laboratory data: Blood tests, diagnosis, prognosis.

INTRODUCTION

Integration of health-care records:
• Privacy violation
• Distributed integration of health-care records

Integration within health-care records:
• Information fusion: combine multiple disparate sources of information such that the whole is more than the sum of its parts.

For the patient demographic data set, correspondence analysis helps answer questions such as:

Which age/race profile(s), if any, define a typical profile of a prostate cancer patient?

Are middle-aged Caucasian males more prone to prostate cancer than Caucasians of other age groups?

Is there a close association between age and race groups?
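The association question above can be explored with the first step of correspondence analysis: the standardized residuals of an age-by-race contingency table and their total inertia (chi-square divided by the sample size). The sketch below is a minimal illustration in plain Python; the function name and the counts are hypothetical and do not come from the study data. A full correspondence analysis would go on to take a singular value decomposition of the residual matrix, which is omitted here.

```python
from math import sqrt

def standardized_residuals(table):
    """Return (residuals, total_inertia) for a contingency table of counts.

    residuals[i][j] = (p_ij - r_i * c_j) / sqrt(r_i * c_j), where p_ij is the
    observed proportion and r_i, c_j are the row and column masses.
    """
    n = sum(sum(row) for row in table)
    row_mass = [sum(row) / n for row in table]
    col_mass = [sum(row[j] for row in table) / n for j in range(len(table[0]))]
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, count in enumerate(row):
            expected = row_mass[i] * col_mass[j]
            res_row.append((count / n - expected) / sqrt(expected))
        residuals.append(res_row)
    # Total inertia is the sum of squared residuals (chi-square / n).
    inertia = sum(r * r for row in residuals for r in row)
    return residuals, inertia

# Hypothetical counts: rows are two age groups, columns are two race groups.
counts = [[20, 10], [10, 20]]
residuals, inertia = standardized_residuals(counts)
print(residuals, inertia)
```

An inertia near zero suggests that the row and column categories are close to independent; larger values indicate a stronger association, and the sign of each residual shows which cells are over- or under-represented.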

CORRESPONDENCE ANALYSIS

Sample Result:

Example data: Dhanasekaran et al., "Delineation of prognostic biomarkers in prostate cancer", Letters to Nature, Vol. 412, August 2001, pages 822-826. Supplementary data (Fig. 1C, pg. 823, Commercial Pool).

Gene expression (microarray data) in four clinical states of prostate-derived tissues

CLINICAL STATES

Benign states

BPH : Benign Prostatic Hyperplasia

NAP : Normal Adjacent Prostate

Malignant states

PCA : Localized prostate cancer

MET : Metastatic sample

Sample Result:

KL-CLUSTERING

Genes to co-regulated genes

Input profiles: g1, g2, g3, g4, g5, g6

Clusters:
Down-regulated: {g1}
Up-regulated: {g2, g4}; {g3}; {g5}
No change: {g6}

The Kullback-Leibler (KL) divergence measures the relative dissimilarity of the shapes of two gene profiles.

1-D SOM algorithm + KL

Minimize D(Gene || SOM weight for each node) at each iteration step.

$$D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

[Bioinformatics, Vol. 19, No. 4, 2003, 449-458]
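The KL divergence and the 1-D SOM step described above can be sketched as follows. This is a minimal illustration, not the published implementation: it assumes non-negative expression profiles that are normalized to sum to one, a fixed neighborhood of one node on each side of the winner, and a linearly decaying learning rate. All function names, parameters, and profile values are hypothetical.

```python
import math
import random

def normalize(profile, eps=1e-9):
    """Turn a non-negative profile into a strictly positive distribution."""
    shifted = [v + eps for v in profile]
    total = sum(shifted)
    return [v / total for v in shifted]

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)); q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def best_node(x, weights):
    """Index of the SOM node whose weight vector minimizes D(x || weight)."""
    return min(range(len(weights)), key=lambda k: kl_divergence(x, weights[k]))

def train_som_1d(profiles, n_nodes=3, n_iter=300, lr=0.5, seed=0):
    """Train a 1-D SOM, choosing the winning node by KL divergence."""
    rng = random.Random(seed)
    data = [normalize(p) for p in profiles]
    weights = [list(rng.choice(data)) for _ in range(n_nodes)]
    for step in range(n_iter):
        rate = lr * (1.0 - step / n_iter)  # assumed linear decay
        x = rng.choice(data)
        win = best_node(x, weights)
        for k in range(n_nodes):
            dist = abs(k - win)
            if dist > 1:
                continue  # only the winner and its immediate neighbors move
            h = rate if dist == 0 else rate / 2
            # A convex combination keeps each weight vector a distribution.
            weights[k] = [w + h * (xi - w) for w, xi in zip(weights[k], x)]
    return weights

# Hypothetical profiles over the four clinical states (BPH, NAP, PCA, MET).
profiles = [
    [8.0, 7.5, 2.0, 1.0],
    [1.0, 1.2, 6.0, 9.0],
    [1.1, 0.9, 5.5, 8.5],
    [4.0, 4.1, 3.9, 4.0],
]
weights = train_som_1d(profiles)
labels = [best_node(normalize(p), weights) for p in profiles]
print(labels)
```

Because the profiles are normalized before comparison, two genes with the same shape but different overall magnitude end up close under this measure, which matches the description of KL as comparing profile shapes.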

Common Motifs

Motif: short segments of DNA that act as a binding site for a specific transcription factor

• Typically 6-25 bp in length
• Statistically different in composition compared to the background
• Often repeated within a sequence

Frequency of occurrence of each motif in each gene:

          Motif 1   Motif 2   …   Motif k
Gene 1    0         1         …   0
Gene 2    0         1         …   2
…         …
Gene n    3         0         …   0
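A gene-by-motif frequency table of this kind could be built along the following lines. The sketch assumes exact string matching of each motif against a gene's sequence and counts overlapping occurrences; real motif scanning often uses degenerate patterns or position weight matrices, which are not shown. The function names, sequences, and motifs are hypothetical.

```python
def count_occurrences(sequence, motif):
    """Count occurrences of motif in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)
    return count

def motif_frequency_matrix(sequences, motifs):
    """Map each gene name to its list of motif counts, in motif order."""
    return {
        gene: [count_occurrences(seq, motif) for motif in motifs]
        for gene, seq in sequences.items()
    }

# Hypothetical toy sequences and motifs, for illustration only.
sequences = {"gene1": "ACGCGTTTACGCGT", "gene2": "TTTTGGGG"}
motifs = ["ACGCGT", "TTT"]
matrix = motif_frequency_matrix(sequences, motifs)
print(matrix)  # {'gene1': [2, 1], 'gene2': [0, 2]}
```

Each row of the result is a motif vector for one gene, which can then be clustered in the same way as an expression profile.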

COMBINED CLUSTERING

Clustering using more than one data source aims at identifying clusters of genes with similar properties among all data.

The goal of combined clustering is to answer the following:

1. If genes have similar expression profile patterns, do they also share common motifs?

2. If genes have a set of motifs in common, do they also exhibit similar expression profile patterns?

3. Which genes share BOTH - that is, they have similar expression profile patterns AND share a set of common motifs?
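Question 3 can be illustrated with a deliberately simple stand-in: cluster the genes separately on expression and on motif vectors, then keep the groups of genes that fall in the same cluster under both. The slides do not specify how the fusion is actually performed, so the intersection rule below is an assumption, and the function name, cluster labels, and gene names are hypothetical.

```python
from collections import defaultdict

def combined_clusters(expr_labels, motif_labels):
    """Group genes that share both an expression cluster and a motif cluster.

    Both arguments map gene name -> cluster label. Groups with a single gene
    are dropped, since they show no shared behavior.
    """
    groups = defaultdict(list)
    for gene, expr_label in expr_labels.items():
        if gene in motif_labels:
            groups[(expr_label, motif_labels[gene])].append(gene)
    return {key: sorted(genes) for key, genes in groups.items() if len(genes) > 1}

# Hypothetical cluster assignments from two separate clusterings.
expr_labels = {"g1": 0, "g2": 1, "g3": 1, "g4": 1}
motif_labels = {"g1": 0, "g2": 2, "g3": 0, "g4": 2}
shared = combined_clusters(expr_labels, motif_labels)
print(shared)  # {(1, 2): ['g2', 'g4']}
```

Here g2, g3, and g4 have similar expression patterns, but only g2 and g4 also share a motif cluster, so only that pair survives the intersection.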

Alpha Factor Experiments

Cluster on Motif vectors Cluster on Gene expression

Combined clustering

All genes in the cluster share the Transcription Factor MCBa

CONCLUSION

Figure: Information Fusion Based Attack

Prototype for Bio-geo Data Warehouse

Figure components: Gene Expression; Clinical and Pathology; Public Data (Literature etc.); Patient Information; Geographical Information; Global Statistics; Information Fusion based Clustering; Cancer Research Grid; Cancer Analysis Applications; Result Visualization

We have demonstrated the significance of information fusion based tools for bio-geo health care informatics.
• As a data warehouse for various data sets involved in bio-geo health care informatics studies.
• To provide and demonstrate a set of information fusion tools for disease research.
