Canadian Bioinformatics Workshops · Module 1: Exploratory Data Analysis (EDA)

2016-06-03 · Canadian Bioinformatics Workshops · www.bioinformatics.ca


Module 1: Exploratory Data Analysis (EDA)

DEPARTMENT OF BIOCHEMISTRY DEPARTMENT OF MOLECULAR GENETICS

Odysseus listening to the song of the sirens. Late Archaic (500–480 BCE)

This workshop includes bits of material originally developed by Raphael Gottardo, FHCRC and by Sohrab Shah, UBC

Boris Steipe

Exploratory Data Analysis of Biological Data using R
Toronto, June 6 and 7, 2016

Module 1: Exploratory Data Analysis (EDA) · bioinformatics.ca

Learning Objectives

•  Be able to load, transform and visualize data;
•  Begin generating ideas about how relationships in data can be explored;
•  Understand some principles of effective visualization;
•  Know where to go for help.


Exploratory Data Analysis (EDA)

Numerical analysis and graphical presentation of data without a particular model or underlying hypothesis in order to ...

•  uncover underlying structure;
•  define important variables;
•  detect outliers and anomalies;
•  detect trends;
•  develop statistical models;
•  test underlying assumptions.

The goal is hypothesis generation, not hypothesis testing.
Therefore EDA is often the first step of data analysis.


Exploratory Data Analysis (EDA)

The objectives of EDA include:
•  uncovering underlying structure and identifying trends and patterns;
•  extracting important variables;
•  detecting outliers and anomalies;
•  testing underlying assumptions;
•  developing statistical models.


Exploratory Data Analysis (EDA)

The practice of EDA emphasizes looking at data in different ways through:
•  computing and tabulating basic descriptors of data properties such as ranges, means, quantiles, and variances;
•  generating graphics, such as boxplots, histograms, scatter plots;
•  applying transformations, such as log or rank;
•  comparing observations to statistical models, such as the QQ-plot, or linear and non-linear regression;
•  simplifying data through dimension reduction;
•  identifying underlying structure through clustering ...

... all with the final goal of defining which statistical model might be appropriate to use for hypothesis testing and prediction.
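The basic descriptors named on this slide map directly onto base R functions. The following is a minimal sketch on synthetic log-normal values; the data and the name `descriptors` are assumptions made for illustration and are not part of the workshop material.

```r
# Synthetic example data (an assumption, not the workshop's data set)
set.seed(1)
x <- rlnorm(200, meanlog = 2, sdlog = 0.5)

range(x)                          # smallest and largest value
mean(x)                           # arithmetic mean
quantile(x, c(0.25, 0.5, 0.75))   # quartiles
var(x)                            # variance
summary(x)                        # several descriptors at once

# Collect a few descriptors in one named vector for tabulation
descriptors <- c(mean = mean(x), median = median(x), var = var(x))
descriptors
```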


Why R for EDA?

Full-featured programming language
"Statistical workbench"

Data manipulation is easy
Easy access to graphics

Sophisticated packages and libraries

Large community


Graphics

Good graphics are immensely valuable. Poor graphics are worse than none.

If you want to learn more about good graphics and information design, find a copy of Edward Tufte's The Visual Display of Quantitative Information. You can also visit his Web site to get a sense of the field (www.edwardtufte.com).

Fundamentally, there is one simple rule:

Use less ink.

The rule has many corollaries.


Use less ink

Make sure that all elements on your graphics are necessary.
Make sure that all elements on your graphics are informative.
Make sure that all information in your data is displayed.

Not all of R's defaults use as little ink as possible...


refined graphics

... for a popular alternative to R's defaults, check out the ggplot2 package (http://www.ggplot2.org).

> install.packages("ggplot2")
...
> library(ggplot2)
...

Example: Bubble chart.
http://ygc.name/2010/12/01/bubble-chart-by-using-ggplot2/

crime <- read.csv("crimeRatesByState2008.csv", header=TRUE, sep="\t")
p <- ggplot(crime, aes(murder, burglary, size=population, label=state))
# scale_area(to=c(1,20)) from the original example is no longer part of ggplot2;
# scale_size_area() is its replacement (it maps a value of zero to zero area)
p <- p + geom_point(colour="red") + scale_size_area(max_size=20) + geom_text(size=3)
p + xlab("Murders per 100,000 population") + ylab("Burglaries per 100,000")


Graphics examples

"... What if you could gussy up a report or pretty up a chart without much additional work? What if, using just one extra line of code, you could create a Microsoft Excel column chart that included a cool gradient fill like this one:"

(From Microsoft TechNet)


Graphics examples: chartjunk

[Figure: the same Microsoft TechNet column chart, annotated:]

•  Only five numbers actually.
•  Meaningless colors, no connection to actual scale. Note that the range differs for each stack!
•  No units.
•  Values can't be easily retrieved from graph due to parallax.
•  What does the third dimension even mean?


Graphics examples

http://visualizing.org/


Graphics examples

http://datavisualization.ch/


Graphics inspiration and advice

https://www.reddit.com/r/visualization


Graphics inspiration and advice

https://www.reddit.com/r/dataisbeautiful


Graphics for Descriptive Statistics

Statistics can be broadly divided into descriptive and inferential domains. Both are supported by graphics. Let us discuss basic summary statistics (mean, median, variance, standard deviation) and how to visualize them...


Descriptive Statistics for EDA

Roughly speaking, exploration of data means finding something in or about the data that surprises us. Something like:

•  the values of the data have changed
•  our data seems to follow some rules
•  our data seems to come from different populations
•  some attribute of our data appears surprisingly often
•  ...

Obviously we need to be able to do two things:
•  describe what's surprising, and
•  quantify how surprised we are.


Preparing Data for EDA

Topics

•  boxplots, normalization
•  transformation
•  handling missing values
•  smoothing, and outliers
•  ...


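Box-plot

Descriptive statistics can be intuitively summarized in a box-plot.

> boxplot(x)

[Figure: annotated boxplot. The box spans the 25% quantile to the 75% quantile (its height is the IQR), with the median marked inside; the whiskers extend up to 1.5 x IQR beyond the box in each direction.]

Every point that lies more than 1.5 x IQR above the 75% quantile or below the 25% quantile is considered an "outlier".

IQR = interquartile range = 75% quantile – 25% quantile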
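The 1.5 x IQR rule can be computed by hand, which makes it clear what the boxplot marks as an outlier. This is a minimal sketch on synthetic data with two planted extreme values; the data and the names `lower`, `upper` and `outliers` are assumptions for illustration. Note that boxplot() itself derives the box from Tukey's hinges, which can differ slightly from quantile() for small samples.

```r
# Synthetic data with two planted extreme values (an assumption for illustration)
set.seed(1)
x <- c(rnorm(100), 6, -7)

q   <- quantile(x, c(0.25, 0.5, 0.75))  # 25% quantile, median, 75% quantile
iqr <- IQR(x)                           # 75% quantile minus 25% quantile

# Fences: 1.5 x IQR beyond the box in each direction
upper <- q[3] + 1.5 * iqr
lower <- q[1] - 1.5 * iqr

# Points outside the fences are flagged as outliers
outliers <- x[x > upper | x < lower]
outliers

boxplot(x)  # the flagged points are drawn individually beyond the whiskers
```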


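Violin plot

The internal structure of a data vector can be made visible in a violin plot. The principle is the same as for a boxplot, but the width is calculated from a smoothed histogram (a density estimate).

# X is a data frame that holds the values in its column x;
# factor(1) puts all values into a single group
p <- ggplot(X, aes(factor(1), x))
p + geom_violin()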


QQ-plot

One of the first things we may ask about data is whether it deviates from an expectation, e.g. to be normally distributed.

The quantile-quantile plot provides a way to visually verify this.
The QQ-plot shows the theoretical quantiles versus the empirical quantiles. If the assumed (theoretical) distribution is indeed the correct one, we should observe a straight line.

R provides qqnorm() and qqplot().
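To see what "a straight line" looks like in practice, it helps to compare a sample that really is normal with one that is not. The sketch below uses synthetic samples; the variable names and the choice of an exponential sample as the counter-example are assumptions for illustration.

```r
set.seed(2)
normal_x <- rnorm(200)   # drawn from a normal distribution: points should follow the line
skewed_x <- rexp(200)    # drawn from an exponential distribution: points should bend away

op <- par(mfrow = c(1, 2))
qqnorm(normal_x, main = "normal sample")
qqline(normal_x, col = 2)   # reference line through the quartiles
qqnorm(skewed_x, main = "exponential sample")
qqline(skewed_x, col = 2)
par(op)

# qqplot() compares the quantiles of two samples against each other
qq <- qqplot(normal_x, skewed_x, plot.it = FALSE)
str(qq)
```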


exploring data

•  When teaching (or learning new procedures) I prefer to work with synthetic data.
•  Synthetic data has the advantage that I know what the outcome of the analysis should be.
•  Typically one would create values according to a function and then add noise.
•  R has several functions to create sequences of values – or you can write your own ...

0:10
seq(0, pi, 5*pi/180)
rep(1:3, each=3, times=2)
for (i in 1:10) { print(i*i) }


synthetic data

Explore functions and noise.

[Figure panels: Function ... Noise ... Noisy Function ...]
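The three panels can be reproduced with a few lines of R. This is only a sketch: the sine function and the noise level are arbitrary choices for illustration and are not taken from the slides.

```r
set.seed(3)
x <- seq(0, 2 * pi, length.out = 100)

f     <- sin(x)                        # the function (hypothetical choice)
noise <- rnorm(length(x), sd = 0.2)    # the noise (hypothetical noise level)
y     <- f + noise                     # the noisy function

op <- par(mfrow = c(1, 3))
plot(x, f, type = "l", main = "Function")
plot(x, noise, type = "h", main = "Noise")
plot(x, y, main = "Noisy function")
lines(x, f, col = 2)                   # overlay the known, noise-free function
par(op)
```

Because the underlying function is known, any analysis applied to y can be checked against f.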


probability distributions

•  Can be either discrete or continuous (uniform, Bernoulli, normal, etc)
•  Defined by a density function, p(x) or f(x)
•  Bernoulli distribution Be(p): flip a coin (T=0, H=1). Assume probability of H is 0.1 ...

x <- 0:1
f <- dbinom(x, size=1, prob=0.1)
plot(x, f, xlab="x", ylab="density", type="h", lwd=5)


probability distributions

Random sampling: Generate 100 observations from a Be(0.1)

set.seed(100)
x <- rbinom(100, size=1, prob=0.1)
hist(x)


probability distributions

Normal distribution N(µ, σ²)
µ is the mean and σ² is the variance.

Extremely important because of the Central Limit Theorem: if a random variable is the sum of a large number of small, independent random variables, it will be approximately normally distributed.

x <- seq(-4, 4, 0.1)
f <- dnorm(x, mean=0, sd=1)
plot(x, f, xlab="x", ylab="density", lwd=5, type="l")

The area under the curve between 0 and 2 is the probability of observing a value between 0 and 2.
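The area between 0 and 2 can be computed from the cumulative distribution function pnorm() and cross-checked by numerical integration of the density. The shading code is a sketch added for illustration; the names `p`, `p_num` and `xs` are assumptions.

```r
# Probability of observing a value between 0 and 2 under N(0, 1)
p <- pnorm(2, mean = 0, sd = 1) - pnorm(0, mean = 0, sd = 1)
p   # approximately 0.477

# The same area, obtained by integrating the density numerically
p_num <- integrate(dnorm, lower = 0, upper = 2)$value

# Shade the corresponding region under the density curve
x <- seq(-4, 4, 0.1)
plot(x, dnorm(x), type = "l", xlab = "x", ylab = "density")
xs <- seq(0, 2, 0.01)
polygon(c(0, xs, 2), c(0, dnorm(xs), 0), col = "grey", border = NA)
```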


arbitrary probabilities

[Figure: the unit interval from 0 to 1, divided into four segments labelled A, C, G, T.]

Map the probabilities onto the unit interval, then use runif() to choose.
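One way to implement this mapping is with cumulative sums as segment boundaries and findInterval() to look up the segment each uniform number falls into. The nucleotide probabilities below are hypothetical values chosen for illustration.

```r
set.seed(42)

# Hypothetical nucleotide probabilities (they must sum to 1)
probs <- c(A = 0.1, C = 0.4, G = 0.4, T = 0.1)

# Inner boundaries of the segments on the unit interval: 0.1, 0.5, 0.9
breaks <- cumsum(probs)[-length(probs)]

# Draw uniform random numbers and look up the segment each one falls into
u   <- runif(10000)
idx <- findInterval(u, breaks) + 1   # 1 = A, 2 = C, 3 = G, 4 = T
nuc <- names(probs)[idx]

# Observed frequencies should be close to the requested probabilities
freq <- table(factor(nuc, levels = names(probs))) / length(nuc)
freq
```

For routine work, sample(names(probs), 10000, replace = TRUE, prob = probs) gives the same kind of result in one call; the explicit version shows the mechanism.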


arbitrary probability distributions

Choose a random point in the domain of interest and calculate the function value at that point. Then choose a uniform random value between the minimum and maximum of the function in the domain. If the random value is smaller than the function value, accept the point, otherwise reject it.

Think of this as randomly placing dots on a plot of the function, then accepting those that lie under the curve.

Try this.
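The accept/reject procedure described here can be tried with a few lines of R. This is a sketch under stated assumptions: the target shape sin(x) on the interval from 0 to pi is a hypothetical choice, and the names `f`, `accepted` and `n` are introduced only for this example.

```r
set.seed(7)

f <- function(x) sin(x)   # hypothetical target shape on [0, pi]; its maximum is 1
n <- 20000

x <- runif(n, 0, pi)      # random points in the domain
y <- runif(n, 0, 1)       # random heights between the minimum (0) and maximum (1) of f

accepted <- x[y < f(x)]   # keep the points whose dot lies under the curve

# The accepted values are distributed according to the shape of f
hist(accepted, breaks = 40, freq = FALSE, main = "accepted samples")
curve(sin(x) / 2, from = 0, to = pi, add = TRUE, col = 2, lwd = 2)  # f scaled to area 1
```

The fraction of accepted points equals the area under f divided by the area of the bounding box, here 2/pi.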


Multivariate data - scatter plots

Biological data sets often contain several variables: they are multivariate.
Scatter plots allow us to look at two variables at a time.

This can be used to assess independence and identify subgroups.

# GvHD flow cytometry data
gvhd <- read.table("GvHD.txt", header=TRUE)
# Only extract the CD3 positive cells
gvhdCD3p <- as.data.frame(gvhd[gvhd[, 5]>280, 3:6])
plot(gvhdCD3p[, 1:2])


trellis graphics

Trellis Graphics is a family of techniques for viewing complex, multi-variable data sets.

plot(gvhdCD3p, pch=".")

Tip:
Many more possibilities are available in the "lattice" package. See ?Lattice
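As a small taste of the lattice package, the sketch below draws one scatter plot panel per group. It uses R's built-in iris data rather than the GvHD file, purely so that it runs without the workshop's data; that substitution is an assumption of this example.

```r
library(lattice)

# One panel per species, all panels sharing the same axes
p <- xyplot(Sepal.Length ~ Sepal.Width | Species,
            data = iris,
            layout = c(3, 1))
print(p)   # lattice plots are objects; printing draws them
```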


Boxplots

The boxplot function can be used to display several variables at a time.

boxplot(gvhdCD3p)

Exercise: Interpret this plot.


EDA: HIV microarray data

HIV Data:
The expression levels of 7680 genes were measured in CD4-T-cell lines at time t = 24 hours after infection with HIV type 1 virus.

•  12 positive controls (HIV genes).
•  4 replicates (2 with a dye swap)


EDA: HIV microarray data

We get the data after the image analysis is done.
For each spot we have an estimate of the intensity in both channels.
With 4 replicate arrays and 2 channels each, the data matrix therefore is of size 7680 x 8.

[Figure: one of the array chips.]


EDA: Transformations

Observations:
The data are highly skewed. The standard deviation is not constant, as it increases with the mean.

Solution:
Look for a transformation that will make the data more symmetric and the variance more constant.
With positive data the log transformation is often appropriate.


EDA: Transformations

The standard deviation has become almost independent of the mean.

# 'apply' with margin 1 applies the function to every row of the data matrix
# ('data' is assumed to hold the log-transformed intensities)
rowMean <- apply(data[, 1:4], 1, mean)
rowSd <- apply(data[, 1:4], 1, sd)
plot(rowMean, rowSd)
trend <- lowess(rowMean, rowSd)
lines(trend, col=2, lwd=5)
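The effect of the log transformation on the mean-to-sd relationship can be reproduced without the HIV data. The sketch below simulates intensities with multiplicative noise, which is an assumption made for illustration; the gene count, intensity range, noise level and all variable names are hypothetical.

```r
set.seed(123)
nGenes <- 2000

# Hypothetical "true" intensity for each gene, spread over several orders of magnitude
level <- 2^runif(nGenes, 6, 14)

# Four replicate measurements with multiplicative noise (one column per replicate)
raw <- sapply(1:4, function(i) level * 2^rnorm(nGenes, sd = 0.3))

rawMean <- rowMeans(raw)
rawSd   <- apply(raw, 1, sd)

logged  <- log2(raw)
logMean <- rowMeans(logged)
logSd   <- apply(logged, 1, sd)

op <- par(mfrow = c(1, 2))
plot(rawMean, rawSd, pch = ".", main = "raw")
lines(lowess(rawMean, rawSd), col = 2, lwd = 3)   # sd grows with the mean
plot(logMean, logSd, pch = ".", main = "log2")
lines(lowess(logMean, logSd), col = 2, lwd = 3)   # trend is roughly flat
par(op)

cor(rawMean, rawSd)   # strongly positive
cor(logMean, logSd)   # close to zero
```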


EDA for microarray data: always log

•  Makes the data more symmetric, large observations are not as influential
•  The variance is (more) constant
•  Turns multiplication into addition (log(ab)=log(a)+log(b))
•  In practice use log base 2, log2(x)=log(x)/log(2)
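The two identities on this slide are easy to confirm at the console. The numbers below are arbitrary example values chosen for illustration.

```r
a <- 8
b <- 32

# Multiplication becomes addition on the log scale
all.equal(log2(a * b), log2(a) + log2(b))

# log2() is the natural log divided by log(2)
all.equal(log2(a), log(a) / log(2))

# Ratios become symmetric around zero: halving and doubling differ only in sign
fold <- log2(c(0.25, 0.5, 1, 2, 4))
fold
```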


Summary

EDA should be the first step in any statistical analysis!

Good modeling starts and ends with EDA.

R provides a great framework for EDA.
