Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data...

69
Data mining in genetics Junior Barrera BIOINFO-USP

Transcript of Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data...

Page 1: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Data mining in genetics

Junior Barrera

BIOINFO-USP

Page 2: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Layout

• Introduction

• Data mining

• Mapping of rare genes

• Expression analysis: measure, genes differentially

expressed; clustering expression signals; identification of

gene regulation networks

Page 3: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Introduction

Page 4: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Knowledge evolution in genetics

• Heredity - Mendel (1866)

• The phenotypes of an individual depends on

genes of his parents.

Page 5: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Knowledge evolution in genetics

• Chromosome Theory - Morgan (1910)

• Genes were situated in chromosomes

Page 6: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Knowledge evolution in genetics

• The molecular structure of chromosomes

(Watson and Crick - 1953)

• DNA structure: the double helix

• Four basis: adenine(A), guanine(G),

thymine(T), cytosine(C)

• genes are sequences of nucleotides

Page 7: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Knowledge evolution in genetics

Page 8: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Knowledge evolution in genetics

• DNA manipulation

• cut, replication and decoding

Page 9: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Knowledge evolution in genetics

• Genetic engineering

• species modification, drug production

Page 10: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Knowledge evolution in genetics

• Genes control the metabolism

• Metabolism occurs by sequences of

enzyme-catalyzed reactions.

• Enzymes are specified by one or more

genes

Page 11: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Knowledge evolution in genetics

Page 12: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Knowledge evolution in genetics

• Gene expression

Page 13: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Data acquisition

Page 14: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Data acquisition

Page 15: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Data acquisition

Quantization - {-1,0,1}

Page 16: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Data mining

Page 17: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

P1

P2

Pn

Pi : analytical and mining procedures (kernel parallel)

Objected oriented database

Page 18: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

images Express

Table

Signal clusters

Dynamical

System

Page 19: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Sequence module

microarray module

Genbank

database

P1

P2

P3

Pn

P1

P2

P3

Pnquery

query

operational

operational

analitical

analitical

Access control module

Integrated Environment

query

Another

databases

Clinical data

Page 20: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Data Warehouse

M_database

Analysis tools writer SpreadSheet Dataminig

MOLAP/ROLAP server

IMS OthersRDMS

Wrapper Wrapper Wrapper

Integrator

Metadata Data

Mart

Users

1

2

3

4

Page 21: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

System Architecture

DB/2DB/2

SybaseSybase

OracleOracle

InformixInformix

SliceSlice NNSliceSlice 11

Appl. n

Appl. 3

Appl. 2

Appl. 1

JavaJava

PBPB

DelphiDelphi

MATLAMATLA

N-2 slices

Consulte rules and libraries

Page 22: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Proteome

Transcriptome

Genome

Pathways

Σ

Wet LabWhat genes regulate the

pathway A->B->C->D ?

Page 23: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Knowledge

Medline

History of

discovering

Pi,Ni

Page 24: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

P1

P2

Pi : analytical and mining procedures (kernel parallel)

Objected oriented database

Pn

N1

N2

Nn

Ni : knowledge discovering procedures (kernel parallel)

Page 25: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

molecule

cell

tissue

organism

Protein structure

and dynamics

DNA, Protein,

Gene Expression,

Gene Networks

Population

Page 26: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

E4000

E3500 V880

GRID Computer - DCC-IME-USP

Internet

and

Intranet

...

E4000:Databases and web application services

Processing services

(by SEFAZ)

10 processors

16 G-bytes main memory

1 T-bytes disk

(by CAGE-FAPESP)

4 processors

8 gigabytes

(by Malaria-FAPESP)

16 processors

32 G-bytes

V880?

(by eucalipto-CNPq ???)

16 processors

32 G-bytes

Page 27: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Mapping of rare genes

Page 28: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

ACGAATCTAGAGAATTAATTAACCGAGTTAAGA

ACGAATCTAGAGAATTAATTAACCGAGTTAAGA

Exon

Training from known genes

Page 29: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Expression Analysis

Page 30: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Analysis Phases

MeasureClustering

G. Dif. Exp

Clustering

ClusteringG. R. Net.

Page 31: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Image Analysis

Page 32: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Selection of clones

1. Clusters of the same gene

5’ 3’Gene chosen

2. Choice of a representation for the gene

5’ 3’

Cuidado com sítios de Poliadenilação!!

Page 33: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Captura das Imagens Cy5 Cy3

Hibridization

cDNAs Misturas de cDNAsmarcados CY3 e CY5

Page 34: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Hirata R, Barrera J, Hashimoto R, Dantas D, Esteves G. In press, 2002.

Expression Calculus

Ratio

Page 35: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

signal + noise

Forground

Background

Page 36: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Dispersion of cy3 and cy5 give an idea of the spot quality

Spot bom

Spot ruim

o Pixels do backgroundo Pixels do foreground

Page 37: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

ScanAlyzeScanAlyze

Totalmente baseado em ROIs.

Amostragens de pixels simples

Page 38: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

QuantArray®

Page 39: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

SpotSpotSoftware todo automatizado

Jain, et al. Genome Research, 12: 325-332, 2002

Mas nem sempre funciona…

Então, correção manual!!!

Os pixels são escolhidos combase no histograma.

Page 40: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

BIOINFO BIOINFO -- USPUSP

Page 41: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Example - Segmentation

Page 42: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Exemplo - Segmentação

Page 43: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

cDNAs used

� Q gene - 637pb, 52,3% de C e G

� trpC - 338pb, 45,3% de C e G

� lysA - 303pb, 47,2% de C e G

� ST0280 (ORESTES) - 659pb, 34,6% de C e G

� IL6 - 948pb, 37,7% de C e G

� IRF1 - 2069pb, 52,5% de C e G

Page 44: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Dilution of fixed material

Each cDNA will be fixed in the dilutions 1/1, 1/2, 1/4, 1/8, 1/16

Page 45: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Legenda

B1

Diluição 1

Diluição 2

Diluição 3

Diluição 4

Diluição 5

B2

Legenda

Diluição 1

Diluição 2

Diluição 3

Diluição 4

Diluição 5

B3

Legenda

Diluição 1

Diluição 2

Diluição 3

Diluição 4

Diluição 5

B4

Legenda

Diluição 1

Diluição 2

Diluição 3

Diluição 4

Diluição 5

Page 46: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

B1 B2

B3 B4

B1 B2

B3 B4

B1 B2

B3 B4

B1 B2

B3 B4

B1 B2

B3 B4

B1 B2

B3 B4

B1 B2

B3 B4

B1 B2

B3 B4

Page 47: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

B4B3

B2B1

Page 48: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected
Page 49: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected
Page 50: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

For a good signal:

• the linear regression is a good estimator for

ratio (background estimation is avoided)

• Swap permits to normalize cy3 and cy5

• Confidence intervals increase inversely with

the signal intensity

Page 51: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Genes differentially expressed

Page 52: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Experiment

• Choose a population of men and measure

their physical characteristics

• Ask men to move a 150 kg object

• Separate men that succeed and the ones that

do not

• Find common characteristics between men

that succeed and between men that do not

Page 53: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Genes differentially expressed

• Choose a set of genes

• Measure the expression of these genes on

two different Biological states

• Choose subsets of genes that are enough to

characterize each Biological state

Page 54: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

−3.5 −3 −2.5 −2 −1.5 −1 −0.5 0 0.5 1−1.5

−1

−0.5

0

0.5

1

1.5

2

phosphofructokinase, platelet

v−

erb

−b2 a

via

n e

ryth

robla

stic leukem

ia v

iral oncogene h

om

olo

g 2

LINEAR CLASSIFIER (DISPERSED−GAUSSIAN) w/ σ = 0.600

1−th (0.389593)

7−th (0.416607)

BRCA1 BRCA2/sporadic

the tumor sample from Patient 20

Page 55: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Clustering

Page 56: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

1 2 3 4 5 6 7-6

-4

-2

0

2

4

6

timelo

g2(r

atio)

time course data

C1

C2

C3

-40 -20 0 20 40 60 80 100 120 140-20

-10

0

10

20

30

40

50

60KL plot

C1

C2

C3

Dendrogram

Time course data

KL plotmultidimensional space

different views of different views of different views of

Microarray dataMicroarray dataMicroarray datamis-classified

Original clusters

Clustered by dendrogram

Page 57: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Example

No error!

Tighter clusters due

to small variance

Results from Fuzzy c-means

Page 58: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Gene Regulation Networks

Page 59: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Cellpeptide

other signals

mRNAGENES

NETWORK

Translat.Transcr.Proteins

Pool Pathways

Metabolic

Page 60: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Division Steps

Cell Cycle Modeling

u1

I

u2

u5

v

vfp

w1

w2

w1fp

w2fp

w1f

w2f

x1

x6

y1

y2

z

x1f

x3f

x4f

x6f

x6fp

x1fp

y1fp

y2fp

.

.

.

.

.

.

z

Forward SignalFeedback to pFeedback to previous layer

y1f

y1f

y2f

y2f

p

Page 61: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Modeling Dynamical Systems

Page 62: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Simulator Architecture

Page 63: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected
Page 64: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected
Page 65: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected
Page 66: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Knockout

u1

I

u2

u5

v

vfp

w1

w2

w1fp

w2fp

w1f

w2f

x1

x6

y1

y2

z

x1f

x3f

x4f

x6f

x6fp

x1fp

y1fp

y2fp

.

.

.

.

.

.

z

Forward SignalFeedback to pFeedback to previous layer

y1f

y1f

y2f

y2f

p

Division Steps

Page 67: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected
Page 68: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected
Page 69: Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

Challanges

• Architecture identification

• Dynamics transition function identification