Visualizing microarray data
Laurent Gautier
August 18th, 2009
Loading and installing more packages
> library("<package>")
> # if no such package
> source("http://www.bioconductor.org/biocLite.R")
> biocLite("<package>")
> # (biocLite() has since been replaced by BiocManager::install("<package>"))
> library("<package>")
1 Elements of data visualization
1.1 Shapes and screen resolution
Screen resolution
• only one dot can be displayed per pixel
• there is a relatively small number of different geometric shapes a viewer can distinguish on a plot
Overplotting: plain plot
~300,000 data points are represented on a 500 × 500 image.
Overplotting: alpha blending
Sampling
30,000 points are sampled out of the 300,000.
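Alpha blending and sampling can both be tried on simulated data. The block below is a minimal base R sketch; the matrix `m`, the seed and the alpha value are assumptions chosen for illustration, not the data behind the slides.

```r
# Simulated stand-in for the ~300,000 points (hypothetical data)
set.seed(1)
n <- 300000
m <- cbind(rnorm(n, mean = 12), rnorm(n, mean = 12))

# Plain plot: dots pile up and hide the density of the cloud
plot(m, pch = ".")

# Alpha blending: semi-transparent dots, so dense areas appear darker
plot(m, pch = ".", col = rgb(0, 0, 0, alpha = 0.1))

# Sampling: draw 30,000 rows out of the 300,000
idx <- sample(nrow(m), 30000)
plot(m[idx, ], pch = ".")
```

Sampling is cheap but discards points; alpha blending keeps every point at the cost of a slower drawing.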
Binning
[Figure: binned scatter plot of m[, 1] against m[, 2]; the color key shows the counts per bin, from 1 to 4765]
Smooth scatter plot
[Figure: smooth scatter plot of m[, 1] against m[, 2]]
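Binning replaces the individual points with a count per cell. The sketch below does this with base R only, on simulated data (the matrix `m` and the number of bins are assumptions). The figures above were possibly produced with dedicated tools such as the hexbin package or `smoothScatter()`, which are not used here.

```r
# Simulated stand-in for the data (hypothetical)
set.seed(1)
n <- 300000
m <- cbind(rnorm(n, mean = 12), rnorm(n, mean = 12))

# Split each axis into 50 intervals and count the points per cell
nbins <- 50
bin_x <- cut(m[, 1], breaks = nbins)
bin_y <- cut(m[, 2], breaks = nbins)
counts <- table(bin_x, bin_y)

# Draw the counts; empty cells are set to NA so that they stay blank
counts_m <- matrix(as.numeric(counts), nrow = nbins)
counts_m[counts_m == 0] <- NA
image(counts_m, col = rev(heat.colors(16)),
      xlab = "m[, 1]", ylab = "m[, 2]")
```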
1.2 Colors
• If only two colors are used, avoid pairs that color-blind people cannot distinguish (red and green is one notorious example)
• Most people can only keep track of around a dozen different colors (or tone differences) on one plot
ColorBrewer’s palettes
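[Figure: the ColorBrewer palettes]
Diverging: BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn, Spectral
Qualitative: Accent, Dark2, Paired, Pastel1, Pastel2, Set1, Set2, Set3
Sequential: Blues, BuGn, BuPu, GnBu, Greens, Greys, Oranges, OrRd, PuBu, PuBuGn, PuRd, Purples, RdPu, Reds, YlGn, YlGnBu, YlOrBr, YlOrRd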
• People generally compare lengths better than areas
pie chart vs barplot
[Figure: the same six proportions (Blueberry, Cherry, Apple, Boston Cream, Other, Vanilla Cream) drawn as a pie chart and as a barplot; the barplot axis runs from 0.00 to 0.30]
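The comparison can be reproduced side by side. The category names match the figure; the values are assumed to be those of the example in `?pie`, which uses the same labels.

```r
# Proportions as in the example of ?pie (assumed to be the data in the figure)
pie_sales <- c(Blueberry = 0.12, Cherry = 0.30, Apple = 0.26,
               "Boston Cream" = 0.16, Other = 0.04,
               "Vanilla Cream" = 0.12)

op <- par(mfrow = c(1, 2))
pie(pie_sales)                                # values encoded as angles and areas
barplot(pie_sales, las = 2, cex.names = 0.7)  # values encoded as lengths
par(op)
```

On the barplot the ordering Cherry > Apple > Boston Cream is read directly from the bar heights, which is harder to do from the pie slices.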
1.3 Basic R objects
R objects (some of them)
1.4 Basic R plots
plot
> x <- rnorm(50)
> plot(x)
[Figure: scatter plot of x against Index (0 to 50)]
Histogram
> hist(x)
[Figure: "Histogram of x"; x-axis x, y-axis Frequency]
Density estimate
> plot(density(x))
[Figure: density estimate titled "density.default(x = x)"; N = 50, Bandwidth = 0.367; y-axis Density]
> y <- 2*x + rnorm(50, sd=0.3)
> plot(x, y)
[Figure: scatter plot of y against x]
> mycolors <- rep("black", length(x))
> mycolors[x < 0 | y < 0] <- "red"
> plot(x, y, col = mycolors)
[Figure: scatter plot of y against x; the points for which x < 0 or y < 0 are drawn in red]
1.5 Lattice plots
Formulae
Formulae are used to describe a model, generally for the purpose of fitting or plotting.
y ~ x        Example: weight ~ age
y ~ x + z    Example: weight ~ age + gender
y ~ x | z    Example: weight ~ age | gender
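A formula is an ordinary R object, so it can be stored and inspected before being handed to a fitting or plotting function. The sketch below uses the example formulae above, then the `chickwts` data set shipped with R; the object names `f1`, `f2`, `f3` and `fit` are only illustrative.

```r
f1 <- weight ~ age
f2 <- weight ~ age + gender
f3 <- weight ~ age | gender

class(f1)      # formulae have their own class
all.vars(f3)   # names of the variables appearing in the formula

# The same notation drives both fitting and plotting
data(chickwts)
fit <- lm(weight ~ feed, data = chickwts)   # fitting
boxplot(weight ~ feed, data = chickwts)     # plotting
```

Nothing is evaluated when a formula is created: `age` and `gender` need not exist until the formula is used together with data.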
Storing data in data.frame
> data(chickwts)
> head(chickwts)
  weight      feed
1    179 horsebean
2    160 horsebean
3    136 horsebean
4    227 horsebean
5    217 horsebean
6    168 horsebean
> hist(chickwts$weight)
[Figure: "Histogram of chickwts$weight"; x-axis chickwts$weight (100 to 450), y-axis Frequency]
Using lattice
> library(lattice)
> p <- histogram(~ weight,
+                data = chickwts)
> print(p)
[Figure: lattice histogram of weight; y-axis Percent of Total]
> p <- histogram(~ weight | feed,
+                data = chickwts)
> print(p)
[Figure: lattice histograms of weight, one panel per feed (casein, horsebean, linseed, meatmeal, soybean, sunflower); y-axis Percent of Total]
> p <- densityplot(~ weight, groups = feed,
+                  data = chickwts,
+                  auto.key = TRUE)
> print(p)
[Figure: lattice density plot of weight, one curve per feed (casein, horsebean, linseed, meatmeal, soybean, sunflower); y-axis Density]
1.6 ggplot2
Using ggplot2
> library(ggplot2)
> p <- ggplot(chickwts) +
+   aes(x = weight, col = feed) +
+   geom_density()
> print(p)
[Figure: density of weight, one colored curve per feed (casein, horsebean, linseed, meatmeal, soybean, sunflower)]
Density estimates
> p <- ggplot(chickwts) +
+   aes(x = weight) +
+   geom_density() +
+   facet_wrap(~ feed)
> print(p)
[Figure: density of weight, one panel per feed (casein, horsebean, linseed, meatmeal, soybean, sunflower)]
Boxplot
> data(sleep)
> p <- ggplot(sleep) +
+   aes(x = factor(group), y = extra) +
+   geom_boxplot()
> print(p)
[Figure: boxplots of extra for the two levels of factor(group), 1 and 2]
ExpressionSet objects
eset[i, j] subsets the matrix together with its associated data.frames
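The idea of subsetting the matrix and its annotation in one step can be mimicked in plain R. This is only an illustrative sketch, not how Biobase implements ExpressionSet; `expr`, `pheno` and the helper `subset_both()` are hypothetical names.

```r
# Hypothetical stand-in for an ExpressionSet: a matrix plus the
# data.frame that describes its columns (the samples)
set.seed(1)
expr <- matrix(rnorm(12), nrow = 4,
               dimnames = list(paste0("probe", 1:4), paste0("s", 1:3)))
pheno <- data.frame(group = c("ALL", "AML", "ALL"),
                    row.names = colnames(expr))

# Hypothetical helper: subset rows i and columns j of the matrix and
# keep the sample annotation in step with the selected columns
subset_both <- function(expr, pheno, i, j) {
  list(expr = expr[i, j, drop = FALSE],
       pheno = pheno[j, , drop = FALSE])
}

sub <- subset_both(expr, pheno, 1:2, c("s1", "s3"))
```

Keeping both pieces in one object removes the risk of subsetting the matrix and forgetting the annotation, or the other way round.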
> library(golubEsets)
> data(Golub_Merge)
> eset <- Golub_Merge
> p <- ggplot(pData(eset)) +
+   aes(x = PS) +
+   geom_histogram() +
+   facet_wrap(~ Gender, ncol = 3)
> print(p)
[Figure: histograms of PS (y-axis count), one panel per Gender: F, M, NA]
> exprs(eset)[exprs(eset) <= 0] <-
+   min(exprs(eset)[exprs(eset) > 0])
> library(limma)
> model <- model.matrix(~ pData(eset)$ALL.AML)
> fit <- lmFit(log2(exprs(eset)), model)
> tfit <- treat(fit, lfc=1)
> tt <- topTreat(tfit, coef=2, number=50)
> head(tt)
                   ID    logFC  AveExpr         t      P.Value    adj.P.Val
2288        M84526_at 9.819500 4.245026 11.641353 2.030486e-18 1.447533e-14
1882        M27891_at 7.449327 7.545176  7.692992 3.117432e-11 1.111209e-07
3252        U46499_at 4.664273 7.019749  7.224403 2.283971e-10 5.427476e-07
760         D88422_at 3.998790 7.925139  6.578039 3.467697e-09 5.671084e-06
1834        M23197_at 2.258431 8.058312  6.545160 3.977475e-09 5.671084e-06
6378 M83667_rna1_s_at 6.876969 4.061439  6.463448 5.590043e-09 6.641903e-06
> plot(exprs(eset)[1, ])
[Figure: scatter plot of exprs(eset)[1, ] against Index (0 to 70)]
Having data in data.frame
data.frame as a workhorse
• Keep one variable per column
• Split tell-it-all names into separate variables
• Really, no variable value such as mutant drugA 231209 grumpy (instead, have 4 columns: strain, treatment, experiment date, experimentalist mood)
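Splitting such a tell-it-all name can be sketched as follows. The labels, the "_" separator and the day-month-year reading of the date are assumptions made for the example.

```r
# Hypothetical tell-it-all labels, assuming "_" separates the parts
labels <- c("mutant_drugA_231209_grumpy",
            "wildtype_drugB_241209_cheerful")

# One row per label, one column per part
parts <- do.call(rbind, strsplit(labels, "_", fixed = TRUE))

samples <- data.frame(
  strain = parts[, 1],
  treatment = parts[, 2],
  # the date is assumed to be written as day, month, two-digit year
  experiment_date = as.Date(parts[, 3], format = "%d%m%y"),
  experimentalist_mood = parts[, 4],
  stringsAsFactors = FALSE
)
str(samples)
```

Once the variables are separate columns, each can be used on its own for faceting, coloring or modelling.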
Converting a wide data structure into a long one
> library(reshape)
> myids <- tt$ID[1:10]
> dataf <- melt(exprs(eset)[myids, ],
+               varnames = c("symbol", "sample"))
> dataf <- merge(dataf, pData(Golub_Merge),
+                by.x = "sample", by.y = 0)
> str(dataf)
'data.frame':   720 obs. of  14 variables:
 $ sample   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ symbol   : Factor w/ 10 levels "D88422_at","M11722_at",..: 8 5 10 1 4 7 9 3 2 6 ...
 $ value    : num  1 303 44 161 261 ...
 $ Samples  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ ALL.AML  : Factor w/ 2 levels "ALL","AML": 1 1 1 1 1 1 1 1 1 1 ...
 $ BM.PB    : Factor w/ 2 levels "BM","PB": 1 1 1 1 1 1 1 1 1 1 ...
 $ T.B.cell : Factor w/ 2 levels "B-cell","T-cell": 1 1 1 1 1 1 1 1 1 1 ...
 $ FAB      : Factor w/ 4 levels "M1","M2","M4",..: NA NA NA NA NA NA NA NA NA NA ...
 $ Date     : Factor w/ 27 levels "","1/24/1984",..: 26 26 26 26 26 26 26 26 26 26 ...
 $ Gender   : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
 $ pctBlasts: int  NA NA NA NA NA NA NA NA NA NA ...
 $ Treatment: Factor w/ 2 levels "Failure","Success": NA NA NA NA NA NA NA NA NA NA ...
 $ PS       : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Source   : Factor w/ 4 levels "CALGB","CCG",..: 3 3 3 3 3 3 3 3 3 3 ...
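The same wide-to-long conversion can be followed on a toy matrix with base R only. This sketch swaps `melt()` for `as.data.frame(as.table())`; the objects `wide`, `long` and `pheno` are hypothetical.

```r
# Small hypothetical expression matrix: probes in rows, samples in columns
wide <- matrix(1:6, nrow = 2,
               dimnames = list(symbol = c("probeA", "probeB"),
                               sample = c("1", "2", "3")))

# Wide to long: one row per (symbol, sample) pair
long <- as.data.frame(as.table(wide), responseName = "value",
                      stringsAsFactors = FALSE)

# Hypothetical sample annotation, with the samples as row names
pheno <- data.frame(ALL.AML = c("ALL", "ALL", "AML"),
                    row.names = c("1", "2", "3"))

# by.y = 0 tells merge() to match on the row names of pheno
long <- merge(long, pheno, by.x = "sample", by.y = 0)
long
```

Each measured value now sits on its own row next to the variables describing its sample, which is the layout ggplot2 expects.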
ggplot2
> library(ggplot2)
New school plots
> p <- ggplot(dataf) +
+   aes(x=ALL.AML, y=log2(value)) +
+   geom_point() +
+   facet_wrap(~symbol)
> print(p)
[Figure: log2(value) against ALL.AML (ALL, AML), one panel per probe set: D88422_at, M27891_at, M89957_at, M11722_at, M29474_at, U46499_at, M19507_at, M83667_rna1_s_at, M23197_at, M84526_at]
> p <- ggplot(dataf) +
+   aes(x=ALL.AML, y=log2(value), col=Source) +
+   geom_point() +
+   facet_wrap(~symbol)
> print(p)
[Figure: log2(value) against ALL.AML (ALL, AML), one panel per probe set, with points colored by Source (CALGB, CCG, DFCI, St-Jude)]
Anatomy of a heatmap
• central matrix of array values
• hierarchical clustering on row values
• hierarchical clustering on column values
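The three components can be assembled by hand before relying on `heatmap()`. A minimal sketch on random data; the matrix `m` and its dimensions are hypothetical.

```r
set.seed(1)
# Hypothetical data: 20 rows (features) by 8 columns (samples)
m <- matrix(rnorm(160), nrow = 20,
            dimnames = list(paste0("g", 1:20), paste0("s", 1:8)))

# Hierarchical clustering on the rows and on the columns
row_hc <- hclust(dist(m))
col_hc <- hclust(dist(t(m)))

# Central matrix, reordered according to the two dendrograms
m_ord <- m[row_hc$order, col_hc$order]
image(t(m_ord), axes = FALSE)

# heatmap() bundles these steps and draws the dendrograms as well
heatmap(m)
```

By default `heatmap()` additionally scales the rows and reorders the dendrograms by row and column means, so its layout can differ from the hand-made version.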
Heatmap: iconic graphics for expression data
• A heatmap is not mandatory for each and every analysis of microarray data.
• Not every pattern in a heatmap is gold.
• Staring at heatmap diagrams can cause serious apophenia.
• Assessing the validity of clusters by means other than visual inspection is a good idea.
• Yet heatmaps can be useful.
?heatmap
About distance measures
• Dissimilarities are a generalization of distances
• They measure how dissimilar entities are
• Hierarchical clustering depends on the dissimilarity measure
Two dissimilarity measures, two outcomes
[Figure: three profiles a, b and c (y against x, with x from 1 to 4), and the result obtained for a, b and c with each of two dissimilarity measures]
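A small numerical case shows how the choice of measure changes the outcome. The three profiles below are hypothetical and are not the ones in the figure; Euclidean distance and one minus the Pearson correlation are taken as the two measures.

```r
# Three hypothetical profiles measured at four points
profiles <- rbind(a = c(1, 2, 3, 4),
                  b = c(5, 6, 7, 8),   # same shape as a, shifted up
                  c = c(4, 3, 2, 1))   # same range as a, opposite trend

# Euclidean distance: a is closer to c than to b
d_euc <- dist(profiles)

# Correlation-based dissimilarity (1 - Pearson correlation):
# a is closer to b than to c
d_cor <- as.dist(1 - cor(t(profiles)))

op <- par(mfrow = c(1, 2))
plot(hclust(d_euc), main = "Euclidean")
plot(hclust(d_cor), main = "1 - correlation")
par(op)
```

The first tree groups a with c, the second groups a with b, although the data are identical.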
> library(bioDist)
• heatmap (stats)
• heatmap.2 (gplots)
• heatmap_2, heatmap_plus (Heatplus)
• circularmap, heatmapl (NeatMap)
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7129 features, 72 samples
  element names: exprs
phenoData
  sampleNames: 39, 40, ..., 33 (72 total)
  varLabels and varMetadata description:
    Samples: Sample index
    ALL.AML: Factor, indicating ALL or AML
    ...: ...
    Source: Source of sample
    (11 total)
featureData
  featureNames: AFFX-BioB-5_at, AFFX-BioB-M_at, ..., Z78285_f_at (7129 total)
  fvarLabels and fvarMetadata description: none
experimentData: use 'experimentData(object)'
  pubMedIds: 10521349
Annotation: hu6800
'data.frame':   72 obs. of  11 variables:
 $ Samples  : int  39 40 42 47 48 49 41 43 44 45 ...
 $ ALL.AML  : Factor w/ 2 levels "ALL","AML": 1 1 1 1 1 1 1 1 1 1 ...
 $ BM.PB    : Factor w/ 2 levels "BM","PB": 1 1 1 1 1 1 1 1 1 1 ...
 $ T.B.cell : Factor w/ 2 levels "B-cell","T-cell": 1 1 1 1 1 1 1 1 1 1 ...
 $ FAB      : Factor w/ 4 levels "M1","M2","M4",..: NA NA NA NA NA NA NA NA NA NA ...
 $ Date     : Factor w/ 27 levels "","1/24/1984",..: 1 13 NA 27 9 NA NA NA 4 4 ...
 $ Gender   : Factor w/ 2 levels "F","M": 1 1 1 2 1 2 1 1 1 2 ...
 $ pctBlasts: int  NA NA NA NA NA NA NA NA NA NA ...
 $ Treatment: Factor w/ 2 levels "Failure","Success": NA NA NA NA NA NA NA NA NA NA ...
 $ PS       : num  0.78 0.68 0.42 0.81 0.94 0.84 0.99 0.66 0.97 0.88 ...
 $ Source   : Factor w/ 4 levels "CALGB","CCG",..: 3 3 3 3 3 3 3 3 3 3 ...
[Figure: "Histogram of exprs(Golub_Merge)"; x-axis exprs(Golub_Merge), from -20000 to 60000; y-axis Frequency]
package stats
> m <- exprs(Golub_Merge)
> dim(m)
[1] 7129   72
> # order() sorts in increasing order: these are the 300 rows with the smallest variance
> spli <- order(apply(m, 1, var))[1:300]
> m <- m[spli, ]
> heatmap(m, labRow="")
[Figure: heatmap of the 300 selected rows, with the sample numbers as column labels]
package stats
> library(RColorBrewer)
> exprs_brk <-
+   c(quantile(m[m < 0], seq(0, 1, length=5)),
+     0,
+     quantile(m[m > 0], seq(0, 1, length=5)))
> allaml <- as.integer(pData(Golub_Merge)$ALL.AML)
> # brewer.pal() needs n >= 3; only the first two colors are used
> mycol <- brewer.pal(3, "Set1")[allaml]
> heatmap(m, labRow = "",
+         col = brewer.pal(10, "RdBu"),
+         breaks = exprs_brk,
+         ColSideColors = mycol)
[Figure: heatmap drawn with the RdBu palette and quantile-based breaks; a color bar above the columns marks ALL and AML samples]
package gplots
> library(gplots)
> heatmap.2(m, labRow="",
+           symbreaks = TRUE,
+           col = brewer.pal(10, "RdBu"),
+           trace = "none",
+           ColSideColors = mycol)
[Figure: heatmap with a "Color Key and Histogram" inset (Value from -200 to 200, y-axis Count)]
package Heatplus
> library(Heatplus)
> addvar <- pData(Golub_Merge)[c("Gender", "ALL.AML")]
> heatmap_plus(m,
+              breaks = exprs_brk,
+              col = brewer.pal(10, "RdBu"),
+              addvar = addvar)
[Figure: heatmap with the probe set identifiers as row labels and two annotation tracks, Gender and ALL.AML]
2 Visualizing data on the genome
• contextual information
• measured entities have respective positions on the genome
• the reference genome is an abstraction
2.1 Idiogram
> library(idiogram)
> library(golubEsets)
> data(Golub_Train)
> human_chr <- buildChromLocation("hu6800")
> expr_vector <-
+   assayData(Golub_Train)[["exprs"]][, 1]
> idiogram(expr_vector, human_chr, chr="1")
[Figure: idiogram of chromosome 1 (bands p36 to q44) with the expression values (0 to 20000) plotted on both sides]
chromLocation objects
> buildChromLocation("<package name>")
> prob2chro <- new.env()
> assign("probe_a1", "1", prob2chro)
> assign("probe_a2", "1", prob2chro)
> assign("probe_b", "X", prob2chro)
> assign("probe_c", "X", prob2chro)
> prob2symbol <- new.env()
> assign("probe_a1", "a", prob2symbol)
> # the symbol goes into prob2symbol (the original assigned it into prob2chro)
> assign("probe_a2", "a", prob2symbol)
> assign("probe_b", "b", prob2symbol)
> assign("probe_c", "c", prob2symbol)
> foobar_chrloc <- new("chromLocation",
+   organism = "foobar", # name
+   dataSource = "Dr. Moreau's lab",
+   chromLocs = list(gene_a=c(), gene_b=c()),
+   probesToChrom = c(1000, 1200, 200, 200),
+   chromInfo = c("1"=10000, "X"=500, "Y"=600),
+   geneSymbols = prob2symbol)
> library(RColorBrewer)
> col_index <- cut(expr_vector, 9)
> my_col <- brewer.pal(9, "BuGn")[col_index]
> idiogram(expr_vector, human_chr, chr="1",
+          col = my_col)
[Figure: idiogram of chromosome 1 (bands p36 to q44), points colored with the BuGn palette after cutting the expression values into 9 intervals]
> col_index <- cut(rank(expr_vector), 9)
> my_col <- brewer.pal(9, "BuGn")[col_index]
> idiogram(expr_vector, human_chr, chr="1",
+          col = my_col, pch = 16)
[Figure: idiogram of chromosome 1 (bands p36 to q44), filled points colored with the BuGn palette after cutting the ranks of the expression values into 9 intervals]
2.2 Genome Atlases
Genome Atlas
> library(ecolitk)
http://tinyurl.com/6zqsc5
Using more of the available space
[Figure: cover of Kemisk Månedsblad, no. 4, April 2009]
2.3 HilbertVis
Hilbert’s curve
• fractal curve
• covers as much of the plane as possible
Iteration level = 1
[Figure: Hilbert curve at iteration level 1]
4 data points can be represented on a 2x2 grid.
Iteration level = 2
[Figure: Hilbert curve at iteration level 2]
16 data points can be represented on a 4x4 grid.
Iteration level = 3
[Figure: Hilbert curve at iteration level 3]
64 data points can be represented on an 8x8 grid.
Iteration level = 4
[Figure: Hilbert curve at iteration level 4]
256 data points can be represented on a 16x16 grid.
Iteration level = 5
[Figure: Hilbert curve at iteration level 5]
1024 data points can be represented on a 32x32 grid.
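The curves at each iteration level can be generated with a few lines of base R. The function below is a sketch of the usual index-to-coordinate conversion for the Hilbert curve; the name `hilbert_xy()` is hypothetical and the code is independent of any Bioconductor package.

```r
# Map position d (0-based) along the curve to (x, y) grid coordinates,
# for a curve of the given iteration level (grid side 2^level)
hilbert_xy <- function(d, level) {
  x <- 0; y <- 0; s <- 1
  while (s < 2^level) {
    rx <- (d %/% 2) %% 2
    ry <- (d + rx) %% 2
    if (ry == 0) {                 # rotate or flip the current quadrant
      if (rx == 1) { x <- s - 1 - x; y <- s - 1 - y }
      tmp <- x; x <- y; y <- tmp
    }
    x <- x + s * rx
    y <- y + s * ry
    d <- d %/% 4
    s <- s * 2
  }
  c(x = x, y = y)
}

level <- 3
xy <- t(sapply(0:(4^level - 1), hilbert_xy, level = level))
plot(xy, type = "l", asp = 1, main = paste("Iteration level =", level))
points(xy, pch = 16)
```

At level n the curve visits each of the 4^n cells of the 2^n x 2^n grid exactly once, and consecutive positions along the vector always land in neighboring cells.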
Visualizing large vectors on screens
[Figure: vector size (0 to 1000) against image size (0 to 30)]
A square of 1000x1000 pixels can be used to represent 1,000,000 data points.
Comparing Cartesian plots with the Hilbert curve
[Figure: the same vector x drawn as two Cartesian plots (x against seq, 0 to 400) and as a Hilbert curve plot with a color scale from -4 to 4]
2.4 GenomeGraphs
GenomeGraphs
• Genomic information is traditionally represented in tracks
• Tracks can be:
– ORFs, promoters, genes on the genome
– Sequence characteristics (GC content, functional domains, complexity)
– Experimental signal associated with those sequences (RNA levels, methylation signals, copy numbers)
> library(GenomeGraphs)
> ehs_mart <- useMart("ensembl",
+                     "hsapiens_gene_ensembl")
> minbase <- 180300000 #180292097
> maxbase <- 180500000 #180491933
> genesplus <- makeGeneRegion(start = minbase,
+                             end = maxbase,
+                             strand = "+",
+                             chromosome = "3",
+                             biomart = ehs_mart)
> genesmin <- makeGeneRegion(start = minbase,
+                            end = maxbase,
+                            strand = "-",
+                            chromosome = "3",
+                            biomart = ehs_mart)
> axistrk <- makeGenomeAxis(add53 = TRUE,
+                           add35 = TRUE,
+                           littleTicks = TRUE)
> p <- gdPlot(list(genesplus, axistrk, genesmin),
+             minBase = minbase, maxBase = maxbase)
> print(p)
[Figure: gene tracks for the + and - strands on either side of a genome axis running from 180300000 to 180500000, labelled 5' to 3' and 3' to 5']
> idiog <- makeIdeogram("3")
> p <- gdPlot(list(idiog, genesplus,
+                  axistrk, genesmin),
+             minBase = minbase, maxBase = maxbase)
> print(p)
[Figure: the same tracks with an ideogram of chromosome 3 added on top]
> probepos_begin <- sort(runif(200, min=minbase,
+                              max=maxbase))
> probepos_end <- probepos_begin + 200
> probepos_signal <-
+   sin(seq(0, 6, length=200)) + rnorm(200, sd=0.05)
> probepos_signal <-
+   matrix(probepos_signal,
+          ncol=1)
> expressiontrk <-
+   makeGenericArray(
+     intensity = probepos_signal,
+     probeStart = probepos_begin,
+     probeEnd = probepos_end,
+     dp = DisplayPars(color="darkblue",
+                      type="point"))
> gdPlot(list("+" = genesplus,
+             axistrk, "-" = genesmin,
+             "log-ratio expression" = expressiontrk),
+        minBase = minbase,
+        maxBase = maxbase)
[Figure: gene tracks labelled + and -, the genome axis from 180300000 to 180500000, and a "log-ratio expression" track of points ranging from -1 to 1]