Introduction to Large-Scale Biomedical Data Analysis
A grand overview of the course
• Introductory course created for the BIG (Bioinformatics, Imaging and Genetics) concentration.
• Purpose of the course: introduce modern high-dimensional biomedical data from:
  – Bioinformatics and computational biology.
  – Biomedical imaging.
  – Statistical genetics.
• Focus on:
  – Scientific background: questions and motivations.
  – Technologies.
  – Data and their characteristics.
  – Brief overview of some statistical methods, and of opportunities and challenges for statisticians.
• There will be a lot of material and new terminology!
• Not covered in this course: detailed statistical theory and methods for data analysis.
• This is a knowledge-centric class: the big picture and concepts are more important.
Contents of the course
Format of the course
• Co-taught by multiple instructors.
• Students are evaluated by 3 reading assignments: you need to write reading reports!
Lecture 1: Introduction to next generation sequencing
Outline
• Biological background: DNA and DNA sequencing.
• Next generation sequencing (NGS) technologies.
• Data from NGS.
• Typical NGS data analysis workflow.
• An application of NGS: RNA-seq.
Background: DNA and sequencing
DNA (DeoxyriboNucleic Acid)
• A molecule that contains the genetic instructions of all known living organisms and some viruses.
• Resides in the cell nucleus, where DNA is organized into long structures called chromosomes.
• Most DNA molecules consist of two long polymers (strands) that entwine in the shape of a double helix.
• Each strand is a chain of simple units called nucleotides, distinguished by their bases: A, C, G, T.
• The bases from the two strands are complementary by base pairing: A-T, C-G.
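The base-pairing rule above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the reverse complement relies on the fact (not stated on the slide) that the two strands run in opposite directions.

```python
# Watson-Crick base pairing as stated above: A-T and C-G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())

def reverse_complement(strand):
    """Complement read in the opposite direction (assumes antiparallel strands)."""
    return complement(strand)[::-1]

print(complement("ACAGGT"))          # TGTCCA
print(reverse_complement("ACAGGT"))  # ACCTGT
```

Applying `reverse_complement` twice returns the original strand, which is a quick sanity check on the pairing table.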
DNA sequence
• The order of occurrence of the bases in a DNA molecule is called the sequence of the DNA. The DNA sequence is usually stored in a big text file:
ACAGGTTTGCTGGTGACCAGTTCTTCATGAGGGACCATCTATCACAACAGAGAAAGCACTTGGATCCACCAGGGGCTGCCAGGGGAAGCAGCATGGGAGCCTGAACCATGAAGCAGGAAGCACCTGTCTGTAGGGGGAAGTGATGGAAGGACATGGGCACAGAAGGGTGTAGGTTTTGTGTCTGGAGGACACTGGGAGTGGCTCCTGGCATTGAAACAGGTGTGTAGAAGGATGTGGTGGGACCTACAGACAGACTGGAATCTAAGGGACACTTGAATCCCAGTGTGACCATGGTCTTTAAGGACAGGTTGGggccaggcacagtggctcatgcctgtaatcccagcact
• Some interesting facts:
  – The total length of the human DNA is 3 billion bases.
  – The difference in DNA sequence between two individuals is less than 1%.
  – Human and chimpanzee have 96% of their sequences identical. Human and mouse: 70%.
Genome size (total length of DNA)
Organism Genome size (bp) # genes
E. coli 4.6M 4,300
S. cerevisiae (yeast) 12.5M 5,800
C. elegans (worm) 100M 20,000
A. thaliana (plant) 115M 28,000
D. melanogaster (fly) 123M 13,000
M. musculus (mouse) 3G 23,800
H. sapiens (human) 3.3G 25,000
DNA sequencing
• Technologies to determine the nucleotide bases of a DNA molecule.
• Motivation: decipher the genetic codes hidden in DNA sequences for different biological processes.
• Genome projects: determine DNA sequences for different species, e.g., human genome project.
• Genomic research (in a nutshell): study the functions of DNA sequences and related components.
Sequencing technologies
• Traditional technology: Sanger sequencing.
  – Slow (low throughput) and expensive: it took the Human Genome Project (HGP) 13 years and $3 billion to sequence the entire human genome.
  – Relatively accurate.
• New technology: different types of high-throughput sequencing.
Next generation sequencing (NGS)
• Aka: high-throughput sequencing, second-generation sequencing.
• Able to sequence a large number of short sequence segments in a short period:
  – High-throughput: hundreds of millions of sequences in a run.
  – Cheap: sequencing an entire human genome now costs a few thousand dollars.
  – Higher error rate compared to Sanger sequencing.
NGS technologies
• Complicated, involving many biochemical reactions:
  – Sequencing by synthesis.
  – Sequencing by ligation.
  – Pyrosequencing.
• In a nutshell:
  – Cut the long DNA into smaller segments (several hundred to several thousand bases).
  – Sequence each segment: start from one end and sequence along the chain, base by base.
  – The process stops after a while because the noise level becomes too high.
  – The results of sequencing are many sequence pieces. The lengths vary, usually a few thousand bases from Sanger and several hundred from NGS.
  – The sequence pieces are called “reads” for NGS data.
Single-end vs. paired-end sequencing
• Sequence one end or both ends of the DNA segments.
• Paired-end sequencing: reads are “paired”, separated by a certain length (the length of the DNA segments).
• Paired-end data can be used as single-end data, but contain extra information from the read pairing.
• Useful in some cases, for example, detecting structural variations in the genome.
Available NGS platforms
• Major players:
  – Illumina: HiSeq, Genome Analyzer (GA)
  – LifeTech: SOLiD, Ion Torrent
  – Roche 454
• Others:
  – Complete Genomics
  – Pacific Biosciences
  – Helicos
Introduction to NGS Applications
Applications of NGS
• NGS has a wide range of applications.
  – DNA-seq: sequence genomic DNA.
  – RNA-seq: sequence RNA products.
  – ChIP-seq: detect protein-DNA interaction sites.
  – Bisulfite sequencing (BS-seq): measure DNA methylation strengths.
  – A lot of others.
DNA-seq
• Sequence the untreated genomic DNA.
  – Obtain DNA from cells, cut it into small pieces, then sequence the segments.
• Goals:
  – Genome re-sequencing: compare to the reference genome and look for genetic variants:
    • Single nucleotide polymorphisms (SNPs).
    • Insertions/deletions (indels).
    • Copy number variations (CNVs).
    • Other structural variations (gene fusion, etc.).
  – De novo assembly of a new unknown genome.
Variations of DNA-seq
• Targeted sequencing, e.g., exome sequencing.
  – Sequence the genomic DNA at targeted genomic regions.
  – Cheaper than whole-genome DNA-seq, so the money saved can be spent on a bigger sample size (more individuals).
  – The targeted genomic regions need to be “captured” first using technologies like microarrays.
• Metagenomic sequencing.
  – Sequence the DNA of a mixture of species, mostly microbes, in order to understand the microbial environments.
  – The goal is to determine the number of species, their genomes and their proportions in the population.
  – De novo assembly is required. But the number and proportions of species are unknown, which makes assembly challenging.
ChIP-seq
• Chromatin immunoprecipitation (ChIP) followed by sequencing (seq): the sequencing version of ChIP-chip.
• A type of “captured” sequencing. The ChIP step captures the genomic regions of interest.
• Used to detect locations of certain “events” on the genome: genome-wide location analysis (GWLA). Examples of events:
  – Transcription factor binding.
  – DNA methylation and histone modifications.
RNA-seq
• Sequence the “transcriptome”: the set of RNA molecules.
• Goals:
  – Catalog RNA products.
  – Determine transcriptional structures: alternative splicing, gene fusion, etc.
  – Quantify gene expression: the sequencing version of the gene expression microarray.
NGS data and analyses workflow
Workflow of NGS data analysis
[Flowchart: raw images → (imaging analysis) → fluorescent intensities → (base calling) → sequence reads → (alignment) → aligned reads, or (de novo assembly) → contigs → downstream analysis: variant calling (DNA-seq), DE/splicing (RNA-seq), peak/DMR detection (ChIP/MeDIP-seq), ...]
Base calling
• The “raw” data from the sequencer are many images.
• To obtain the sequence reads, a “base calling” step is needed:
  – First extract data (fluorescent intensities) from the images.
  – Next convert the intensities (continuous) to base calls (categorical).
  – Both steps involve many statistical methods to extract signals from noisy data.
• For most people, sequencing data analyses start from the sequence reads (the results of base calling).
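The second step, turning continuous intensities into a categorical base call, can be illustrated with a deliberately naive rule. This sketch is not how any real base caller works: the four-channel input, the channel order and the error model (share of signal outside the winning channel) are all assumptions made for the example. Only the Phred relation Q = -10 log10(p) is standard.

```python
import math

BASES = "ACGT"

def call_base(intensities):
    """Call the base with the highest channel intensity.

    `intensities` is a hypothetical 4-tuple of non-negative channel
    intensities in the order A, C, G, T. The error probability is a crude
    assumption: the share of total signal NOT in the winning channel.
    """
    total = sum(intensities)
    if total <= 0:
        return "N", 0                      # no signal: ambiguous base, lowest quality
    best = max(range(4), key=lambda k: intensities[k])
    p_error = 1.0 - intensities[best] / total
    p_error = max(p_error, 1e-4)           # cap the quality at Q40
    quality = round(-10 * math.log10(p_error))   # Phred scale
    return BASES[best], quality

print(call_base((5, 90, 3, 2)))   # ('C', 10): 10% of the signal is off-channel
```

Real base callers model cross-talk between channels, signal decay along the read and phasing effects, which is where the statistical methods mentioned above come in.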
Raw sequence reads data from NGS after base calling
• Large text file (millions of lines) with a simple format.
  – Most frequently used: fasta/fa format for storing the sequences, or fastq format for storing both the sequences and the corresponding quality scores.
• fa format:
>5_143_428_832
GATATTGTAGCATAACGCAACTTGGGAGGTGAGCTT
>5_143_984_487
GTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATG
>5_143_963_690
GGTATATGCACAAAATGAGATGCTTGCTTATCAACA
>5_143_957_461
GGAGGGTGTCAATCCTGACGGTTATTTCCTAGACAA
>5_143_808_403
GATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCT
(Each record has two lines: the read name, starting with ">", and the read sequence.)
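Reading this format takes only a few lines. The sketch below is a minimal parser, assuming plain-text input where every name line starts with ">"; the function name `parse_fasta` is made up for this example.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []     # start a new record
        else:
            chunks.append(line)             # a sequence may span several lines
    if name is not None:
        yield name, "".join(chunks)

example = """>5_143_428_832
GATATTGTAGCATAACGCAACTTGGGAGGTGAGCTT
>5_143_984_487
GTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATG
""".splitlines()

reads = list(parse_fasta(example))
print(len(reads), len(reads[0][1]))   # 2 reads, each 36 bases long
```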
• fastq format:
@HWI-EAS165:1:1:50:908:1
CTGCGGTCTCTAAAGTGCCATCTCATTGTGCTTTGTATCAGTCAGTGCTGGA
+
BCCBCB8ABBBBBBB:BC=8@BBA:@BB@BBBCBB<9BBAC;A<C?BAAB<#
@HWI-EAS165:1:1:50:0:1
NCAACCCCCACAGTAATATGTAAAACAAAAACTAAAACCAGGAGCTGAAGGG
+
#BABABBBBBB@08<@?A@7:A@CCBCCCCBBBCCBB=?BBBB@7@B=A>:2
@HWI-EAS165:1:1:50:708:1
GGTCAGCATGTCTTCTGTTAAGTGCTTGCACAAGCTAGCCTCTGCCTATGGG
+
BB@A;B>@A@@=BB=BB?A>@@>B?ABBA=A?@@>@@A:=?>?A@=B8@@AB
@HWI-EAS165:1:1:50:1494:1
CTGGTGTCACACAAGCAGGTCTCCTGTGTTGACTTCACCAGACACTGTCATT
+
BCBB@AB@1ABBBBBBAAB?BBBBAB<A?AA>BB@?1ABBA@BBBA@;B>>:
(Each record has four lines: read name, read sequence, separator, quality scores.)
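The quality line encodes one score per base as a printable character. A minimal sketch of decoding it follows, assuming four lines per record and an ASCII offset of 33 (Phred+33); the offset is an assumption, chosen because the example contains characters such as "#" and "8" that lie below the offset of 64 used by some older Illumina pipelines. The function name is hypothetical.

```python
def parse_fastq(lines, offset=33):
    """Yield (name, sequence, qualities) from FASTQ lines, four lines per read."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for k in range(0, len(lines) - 3, 4):
        name, seq, sep, qual = lines[k:k + 4]
        if not name.startswith("@") or not sep.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (k + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in " + name)
        # Each quality character maps to an integer Phred score.
        yield name[1:], seq, [ord(c) - offset for c in qual]

record = [
    "@HWI-EAS165:1:1:50:908:1",
    "CTGCGGTCTCTAAAGTGCCATCTCATTGTGCTTTGTATCAGTCAGTGCTGGA",
    "+",
    "BCCBCB8ABBBBBBB:BC=8@BBA:@BB@BBBCBB<9BBAC;A<C?BAAB<#",
]
name, seq, quals = next(parse_fastq(record))
# Phred scale: Q = -10 * log10(error probability), so Q = 2 means p is about 0.63.
error_prob_last = 10 ** (-quals[-1] / 10)
print(quals[0], quals[-1])   # 33 for 'B', 2 for '#'
```

Under this assumed offset the first base of the read is called with high confidence, while the last base is close to a guess.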
Sequence alignment and assembly
• Sequencing a known genome: alignment.
  – Use the known genome (called the “reference genome”) as a blueprint.
  – Determine where each read is located in the reference genome.
• Sequencing a whole new genome: assembly.
  – New genome: a species with an unknown genome, or a genome believed to be very different from the reference (e.g., cancer).
  – Basically, the short reads are “stitched” together to form long sequences called “contigs”.
  – Overlaps among sequence reads are required, so it needs a lot of reads (deep coverage).
  – More computationally intensive.
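The "stitching" idea can be made concrete with a toy example: find the longest suffix of one read that equals a prefix of another, then merge the two. This is only a sketch with made-up function names and exact matching; real assemblers handle sequencing errors, repeats and millions of reads with far more elaborate graph-based methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def merge(a, b, min_len=3):
    """Stitch two reads into one contig if they overlap by at least min_len."""
    n = overlap(a, b, min_len)
    return a + b[n:] if n else None

print(overlap("ACGTAC", "TACGGA"))   # 3, the shared piece is "TAC"
print(merge("ACGTAC", "TACGGA"))     # ACGTACGGA
```

Without a shared piece of at least `min_len` bases the reads cannot be joined, which is why assembly needs deep coverage.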
Alignment
• For most people, alignment is the first step of sequencing data analysis, since the machine outputs raw sequence read files (fastq).
• Need: a sequence reads file and a reference genome.
• It is basically a string search problem: where is the short (50-letter) string located within the reference string of 3 billion letters?
• Brute-force searching is okay for a single read, but it is computationally infeasible to align millions of reads this way.
• Clever algorithms are needed to preprocess the reference genome (indexing), which is beyond the scope of this class.
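To give a feel for why preprocessing helps, here is a small sketch that contrasts a brute-force scan with a lookup in a precomputed table of short substrings (k-mers). The hash-table index is a simple stand-in chosen for illustration, not the index used by the aligners discussed in this lecture, and it only finds exact matches. All function names are hypothetical.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def align_exact(read, reference, index, k):
    """Return all positions where `read` matches the reference exactly."""
    hits = []
    for pos in index.get(read[:k], []):      # candidate positions from the seed
        if reference.startswith(read, pos):  # verify the full read
            hits.append(pos)
    return hits

def align_brute_force(read, reference):
    """Scan every reference position; fine for one read, slow for millions."""
    return [p for p in range(len(reference) - len(read) + 1)
            if reference.startswith(read, p)]

reference = "ACAGGTTTGCTGGTGACCAGTTCTTCATGAGG"   # start of the example sequence above
index = build_index(reference, k=4)              # built once, reused for every read
print(align_exact("GCTGGTGACC", reference, index, k=4))   # [8]
```

The index is built once per reference genome, after which each read only needs to be checked at a handful of candidate positions instead of at every position.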
Popular alignment software
• Bowtie: fast, but less accurate.
• BWA (Burrows-Wheeler Aligner): same algorithm as Bowtie, but allows gaps in alignments.
  – About 5-10 times slower than Bowtie, but provides better results, especially for paired-end data.
• Maq (Mapping and Assembly with Qualities): with SNP calling capabilities.
• ELAND: Illumina’s commercial software.
• A lot of others. See http://en.wikipedia.org/wiki/List_of_sequence_alignment_software for more details.
• Our suggestion: use Bowtie for single-end data and BWA for paired-end data.
Once you have the reads aligned
• Downstream analyses depend on the purpose.
• Often one wants to manipulate and visualize the alignment results. There are several useful tools:
  – File manipulation (format conversion, counting, etc.): samtools/Rsamtools, BEDTools, bamtools, IGV tools.
  – Visualization: samtools (text version), IGV (Java GUI).
RNA-seq data analysis
A little biological background
Central dogma of molecular biology: “The central dogma of molecular biology deals with the detailed residue-by-residue transfer of sequential information. It states that information cannot be transferred back from protein to either protein or nucleic acid.”
Gene Expression
Gene expression levels are measured through their mRNA abundance.
[Figure: a gene as DNA (2 copies) → mRNA (multiple copies) → protein (multiple copies).]
  – Protein: the amount of these matters! But they are difficult to measure.
  – mRNA: the amount of these is easy to measure, and it is positively correlated with the protein amount!
High-throughput technologies to measure mRNA abundance
• Traditionally: gene expression microarray.
• Now: RNA-seq.
  – Data have greater dynamic range and higher signal-to-noise ratios.
  – Contain more information: structural changes of genes.
• Results show good correlation between microarray and RNA-seq data.
RNA-seq
• Sequence the set of RNA molecules.
• Scientific goals:
  – Catalogue RNA products.
  – Determine transcriptional structures: alternative splicing, gene fusion, etc.
  – Quantify gene expression: the sequencing version of the gene expression microarray.
Experimental procedures
• Procedures:
  – RNA -> cDNA.
  – Shearing and amplification of cDNA.
  – Sequence the cDNA segments.
• Compared to expression microarray:
  – Clearer signal without hybridization.
  – Better dynamic range for expression.
  – Provides richer information.
RNA-seq data analyses
• Gene expression analysis: compare the expression of genes under different biological conditions.
• Structural analysis of the transcriptome: discover and compare structural variations of mRNA:
  – Alternative splicing.
  – Gene fusion.
• Small RNA analyses.
RNA-seq gene expression analysis
• Goal: compare gene expression between different samples.
• Steps:
  1. Summarization: get a number for each gene to represent its expression level.
  2. Normalization: remove technical artifacts so that data from different samples are comparable.
  3. Differential expression detection: gene-by-gene statistical test, usually based on a negative binomial model.
• Popular software:
  – R/Bioconductor packages: DESeq, edgeR, DSS.
  – Stand-alone: Cufflinks and Cuffdiff.
Differential expression
• Biological motivation is the same as in gene expression microarray: find differentially expressed (DE) genes.
• Microarray methods are not directly applicable: continuous vs. count data, but ideas can be borrowed.
• Usually needs multiple replicates per sample, so that means and variances can be evaluated.
• The test is carried out gene by gene.
• Two major tasks in developing a test:
  – Within-group variance estimation.
  – Test procedure and inference.
RNA-seq data for DE analysis
• A matrix of integers: rows are genes, columns are samples.

         Cancer  Cancer  Cancer  Cancer  Normal  Normal  Normal  Normal
Gene 1       20      25      16       7       4       1      21      11
Gene 2      493     628     133     450      28      97     462     225
Gene 3      128     146     121      25      30      26     106      80
Gene 4        1       0       0       0       1       0       0       0
Gene 5       58     105      27      40      19      28      88      54
Gene 6        0       0       1       0       0       0       0       0
...
Data model for replicated samples
• For a sample with M replicates, the count for gene i in replicate j, denoted by $Y_{ij}$, is often modeled by the following hierarchical model:
  $$Y_{ij} \mid \lambda_i \sim \mathrm{Poisson}(\lambda_i), \qquad \lambda_i \sim \mathrm{Gamma}(\alpha, \beta)$$
• Marginally, the Gamma-Poisson compound distribution is negative binomial, so the counts for a gene from multiple replicates are modeled as negative binomial:
  $$Y_{ij} \sim \mathrm{NB}(\alpha, \beta)$$
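The hierarchical model can be checked by simulation: draw a rate from a Gamma distribution, then draw a count from a Poisson distribution with that rate. The sketch below uses only the Python standard library. It makes two assumptions that go beyond the slide: a fresh rate is drawn for every replicate, and the Gamma is parameterized by shape and scale so that its mean is `mean` and its shape is `1 / dispersion`. The function names are made up for this example.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the moderate rates used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_nb_counts(mean, dispersion, n, seed=1):
    """Draw n counts from the Gamma-Poisson hierarchy."""
    rng = random.Random(seed)
    shape = 1.0 / dispersion          # Gamma shape
    scale = mean * dispersion         # Gamma scale, so that E[lambda] = mean
    counts = []
    for _ in range(n):
        lam = rng.gammavariate(shape, scale)   # unobserved expression level
        counts.append(sample_poisson(lam, rng))
    return counts

counts = sample_nb_counts(mean=10.0, dispersion=0.5, n=20000)
m = sum(counts) / len(counts)
v = sum((c - m) ** 2 for c in counts) / (len(counts) - 1)
# Expect m near 10 and v near 10 + 0.5 * 10**2 = 60.
print(round(m, 2), round(v, 2))
```

The sample variance comes out several times larger than the sample mean, which a pure Poisson model (variance equal to the mean) cannot reproduce.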
A little more about the NB distribution
• NB is an over-dispersed Poisson:
  – Poisson: $\mathrm{var} = \mu$
  – NB: $\mathrm{var} = \mu + \mu^2 \Phi$
• The dispersion parameter Φ approximates the squared coefficient of variation:
  $$\Phi = \frac{\mathrm{var} - \mu}{\mu^2} \approx \frac{\mathrm{var}}{\mu^2}$$
• The dispersion Φ represents the biological variance.
• The NB distribution can be parameterized by mean and dispersion.

From H. Wu and others (p. 4):

2010; Robinson and Smyth, 2007; Hardcastle and Kelly, 2010). The variance from a NB distribution depends on the mean µ in the relationship:

$$\mathrm{var} = \mu + \mu^2 \Phi \qquad (1.1)$$

where the first term represents variance due to Poisson sampling error and the second term represents variance due to variation between biological replicates. The parameter Φ is referred to as the dispersion parameter. Notice that Φ is the reciprocal of the shape parameter in the Gamma distribution, thus is the squared coefficient of variation (CV). Therefore, Φ represents the variation of a gene’s expression relative to its mean.

Most statisticians agree that the over-dispersion problem needs to be addressed. The difference is how the dispersion is modeled, estimated and used in inference. Robinson and Smyth (2008) assume a common dispersion for all genes and use information from all genes to estimate a global Φ. This stabilizes the estimation for Φ but a common dispersion means identical CV for all genes, while it is known that some genes are more tightly controlled (for example, housekeeping genes) and other genes vary much more relative to their means (for example, immune-modulated and stress-induced genes (Pritchard and others, 2001)). The gene specific biological variation is reproducible across technologies (Hansen and others, 2011) and is not reflected by a common dispersion. As a result, genes that are naturally more variable are more likely to be reported as DE due to an underestimate of their dispersion. Anders and Huber (2010) introduce DESeq, another method based on the NB model. Instead of assuming the second term in Equation 1.1 to be proportional to $\mu^2$ with a dispersion Φ, they let the variance be a smooth function of the mean (with proper offset accounting for sequencing depth). However, their model still assumes that conditioning on the mean expression, the variance is constant.
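Solving the NB variance relationship for the dispersion gives a simple method-of-moments estimate, sketched below for the four cancer replicates of Gene 2 in the count table. This is an illustration only: it ignores differences in sequencing depth between samples, floors negative estimates at zero (an assumption of this sketch), and with four replicates it is very noisy, which is exactly the problem that motivates sharing information across genes.

```python
def moment_dispersion(counts):
    """Method-of-moments estimate phi = (var - mean) / mean^2, floored at 0."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    if mean == 0:
        return 0.0
    return max((var - mean) / mean ** 2, 0.0)

gene2_cancer = [493, 628, 133, 450]   # Gene 2, cancer replicates, from the count table
phi = moment_dispersion(gene2_cancer)
cv = phi ** 0.5                       # dispersion is roughly the squared CV
print(round(phi, 4), round(cv, 2))    # roughly 0.24 and 0.49
```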
Simple ideas for RNA-seq DE analysis
• Transform the data to a continuous scale (e.g., by taking logs), then use microarray methods:
  – Troublesome for genes with low counts.
• For each gene, perform a two-group Poisson or NB test for equal means. But:
  – The number of replicates is usually small, so asymptotic theory does not apply and the results are not reliable.
  – As in microarray analysis, information from all genes can be combined to improve inference (e.g., variance shrinkage).
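DESeq (Anders and Huber 2010, Genome Biology)
• Counts are assumed to follow an NB distribution, parameterized by mean and variance ($K_{ij}$ is the count for gene i in sample j):
  $$K_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma_{ij}^2)$$
• The variance is the sum of shot noise and raw variance ($s_j$ is the size factor of sample j and $\rho(j)$ its experimental condition):
  $$\sigma_{ij}^2 = \underbrace{\mu_{ij}}_{\text{shot noise}} + \underbrace{s_j^2 \, v_{i,\rho(j)}}_{\text{raw variance}}$$
• The raw variance is a smooth function of the mean, so genes with the same mean are assumed to have the same variance.
• Hypothesis testing uses an exact test. With $k_{iA}$ and $k_{iB}$ the observed count sums in conditions A and B, $k_{iS}$ their total, and $p(a, b)$ the null probability of observing sums a and b:
  $$p_i = \frac{\sum_{\substack{a + b = k_{iS} \\ p(a,b) \le p(k_{iA}, k_{iB})}} p(a, b)}{\sum_{a + b = k_{iS}} p(a, b)}$$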
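The exact test in the last bullet can be sketched directly: enumerate every way of splitting the total count between the two conditions, and add up the probabilities of the splits that are no more likely than the observed one. The code below is a minimal illustration, not the DESeq implementation. It assumes the NB mean and variance of each condition's count sum under the null are already known and passed in as arguments, and all function names are made up.

```python
import math

def nb_pmf(k, mean, var):
    """NB probability of count k, parameterized by mean and variance (var > mean)."""
    p = mean / var
    r = mean * mean / (var - mean)
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)

def exact_test(k_a, k_b, mean_a, var_a, mean_b, var_b):
    """P value: total probability of splits (a, b) of k_a + k_b that are
    no more likely than the observed split, conditional on the overall sum."""
    total = k_a + k_b
    # p(a, b) = Pr(K_A = a) * Pr(K_B = b), assuming independent samples.
    probs = [nb_pmf(a, mean_a, var_a) * nb_pmf(total - a, mean_b, var_b)
             for a in range(total + 1)]
    p_obs = probs[k_a]
    extreme = sum(p for p in probs if p <= p_obs * (1 + 1e-9))
    return extreme / sum(probs)

balanced = exact_test(20, 20, 20.0, 60.0, 20.0, 60.0)
lopsided = exact_test(0, 40, 20.0, 60.0, 20.0, 60.0)
print(balanced, lopsided)   # balanced split: P value 1; lopsided split: very small
```

The small tolerance in the comparison guards against floating-point ties between splits that are equally likely in theory.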
data, and show that it provides better fits (Section Model). As a result, more balanced selection of differentially expressed genes throughout the dynamic range of the data can be obtained (Section Testing for differential expression). We demonstrate the method by applying it to four data sets (Section Applications) and discuss how it compares to alternative approaches (Section Conclusions).

Results and Discussion
Model
Description
We assume that the number of reads in sample j that are assigned to gene i can be modeled by a negative binomial (NB) distribution,

$$K_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma_{ij}^2), \qquad (1)$$

which has two parameters, the mean $\mu_{ij}$ and the variance $\sigma_{ij}^2$. The read counts $K_{ij}$ are non-negative integers. The probabilities of the distribution are given in Supplementary Note A. (All Supplementary Notes are in Additional file 1.) The NB distribution is commonly used to model count data when overdispersion is present [12].

In practice, we do not know the parameters $\mu_{ij}$ and $\sigma_{ij}^2$, and we need to estimate them from the data. Typically, the number of replicates is small, and further modelling assumptions need to be made in order to obtain useful estimates. In this paper, we develop a method that is based on the following three assumptions.

First, the mean parameter $\mu_{ij}$, that is, the expectation value of the observed counts for gene i in sample j, is the product of a condition-dependent per-gene value $q_{i,\rho(j)}$ (where $\rho(j)$ is the experimental condition of sample j) and a size factor $s_j$,

$$\mu_{ij} = q_{i,\rho(j)} \, s_j. \qquad (2)$$

$q_{i,\rho(j)}$ is proportional to the expectation value of the true (but unknown) concentration of fragments from gene i under condition $\rho(j)$. The size factor $s_j$ represents the coverage, or sampling depth, of library j, and we will use the term common scale for quantities, such as $q_{i,\rho(j)}$, that are adjusted for coverage by dividing by $s_j$.

Second, the variance $\sigma_{ij}^2$ is the sum of a shot noise term and a raw variance term,

$$\sigma_{ij}^2 = \underbrace{\mu_{ij}}_{\text{shot noise}} + \underbrace{s_j^2 \, v_{i,\rho(j)}}_{\text{raw variance}}. \qquad (3)$$

Third, we assume that the per-gene raw variance parameter $v_{i,\rho}$ is a smooth function of $q_{i,\rho}$,

$$v_{i,\rho(j)} = v_{\rho}(q_{i,\rho(j)}). \qquad (4)$$

This assumption is needed because the number of replicates is typically too low to get a precise estimate of the variance for gene i from just the data available for this gene. This assumption allows us to pool the data from genes with similar expression strength for the purpose of variance estimation.

The decomposition of the variance in Equation (3) is motivated by the following hierarchical model: We assume that the actual concentration of fragments from gene i in sample j is proportional to a random variable $R_{ij}$, such that the rate that fragments from gene i are sequenced is $s_j r_{ij}$. For each gene i and all samples j of condition ρ, the $R_{ij}$ are i.i.d. with mean $q_{i\rho}$ and variance $v_{i\rho}$. Thus, the count value $K_{ij}$, conditioned on $R_{ij} = r_{ij}$, is Poisson distributed with rate $s_j r_{ij}$. The marginal distribution of $K_{ij}$ - when allowing for variation in $R_{ij}$ - has the mean $\mu_{ij}$ and (according to the law of total variance) the variance given in Equation (3). Furthermore, if the higher moments of $R_{ij}$ are modeled according to a gamma distribution, the marginal distribution of $K_{ij}$ is NB (see, for example, [12], Section 4.2.2).

Fitting
We now describe how the model can be fitted to data. The data are an n × m table of counts, $k_{ij}$, where i = 1,..., n indexes the genes, and j = 1,..., m indexes the samples. The model has three sets of parameters:
(i) m size factors $s_j$; the expectation values of all counts from sample j are proportional to $s_j$.
(ii) for each experimental condition ρ, n expression strength parameters $q_{i\rho}$; they reflect the expected abundance of fragments from gene i under condition ρ, that is, expectation values of counts for gene i are proportional to $q_{i\rho}$.
(iii) The smooth functions $v_\rho : \mathbb{R}^+ \to \mathbb{R}^+$; for each condition ρ, $v_\rho$ models the dependence of the raw variance $v_{i\rho}$ on the expected mean $q_{i\rho}$.

The purpose of the size factors $s_j$ is to render counts from different samples, which may have been sequenced to different depths, comparable. Hence, the ratios $(\mathbb{E} K_{ij}) / (\mathbb{E} K_{ij'})$ of expected counts for the same gene i in different samples j and j' should be equal to the size ratio $s_j / s_{j'}$ if gene i is not differentially expressed or samples j and j' are replicates. The total number of reads, $\sum_i k_{ij}$, may seem to be a good measure of sequencing depth and hence a reasonable choice for $s_j$. Experience with real data, however, shows this not always to be the case, because a few highly and differentially expressed genes may have strong influence on the total read count, causing the ratio of total read counts not to be a good estimate for the ratio of expected counts.

Source: Anders and Huber, Genome Biology 2010, 11:R106, http://genomebiology.com/2010/11/10/R106 (page 2 of 12)
Hence, to estimate the size factors, we take the median ofthe ratios of observed counts. Generalizing the procedurejust outlined to the case of more than two samples, we use:
sk
k
ji
ij
ivv
m m^
/.=
⎛
⎝⎜
⎞
⎠⎟
=∏median
1
1 (5)
The denominator of this expression can be interpretedas a pseudo-reference sample obtained by taking thegeometric mean across samples. Thus, each size factorestimate s j^ is computed as the median of the ratios ofthe j-th sample’s counts to those of the pseudo-reference.(Note: While this manuscript was under review, Robinsonand Oshlack [13] suggested a similar method.)To estimate qir, we use the average of the counts from
the samples j corresponding to condition r, transformedto the common scale:
qm
k
si
ij
jj j
^
^: ( )
,�� � �
==
∑1(6)
where mr is the number of replicates of condition r andthe sum runs over these replicates. the functions vr, wefirst calculate sample variances on the common scale
wm
k
sqi
ij
ji
j j
��
�� �
=−
−⎛
⎝
⎜⎜⎜
⎞
⎠
⎟⎟⎟=
∑1
1
2
^^
: ( )
(7)
and define
zq
m si
i
jj j
��
� � �
==
∑^
^: ( )
.1 (8)
In Supplementary Note B in Additional file 1 we showthat wir - zir is an unbiased estimator for the raw varianceparameter vir of Equation (3).However, for small numbers of replicates, mr, as is
typically the case in applications, the values wir are highlyvariable, and wir - zir would not be a useful varianceestimator for statistical inference. Instead, we use localregression [14] on the graph ( , )q̂ wi i� �
to obtain asmooth function wr(q), with
v q w q zi i i^ ^ ^( ) ( )� � � � �= − (9)
as our estimate for the raw variance.Some attention is needed to avoid estimation biases in
the local regression. wir is a sum of squared randomvariables, and the residuals w w qi i� �− ( )^ are skewed.Following References [15], Chapter 8 and [14], Section
9.1.2, we use a generalized linear model of the gammafamily for the local regression, using the implementationin the locfit package [16].
Testing for differential expressionSuppose that we have mA replicate samples for biologi-cal condition A and mB samples for condition B. Foreach gene i, we would like to weigh the evidence in thedata for differential expression of that gene betweenthe two conditions. In particular, we would like to testthe null hypothesis qiA = qiB, where qiA is the expressionstrength parameter for the samples of condition A, andqiB for condition B. To this end, we define, as test statis-tic, the total counts in each condition,
K K K Ki ij
j j
i ij
j j
A
A
B
B
= == =
∑ ∑: ( ) : ( )
, ,
� �(10)
and their overall sum $K_{iS} = K_{iA} + K_{iB}$. From the error model described in the previous Section, we show below that, under the null hypothesis, we can compute the probabilities of the events $K_{iA} = a$ and $K_{iB} = b$ for any pair of numbers a and b. We denote this probability by p(a, b). The P value of a pair of observed count sums $(k_{iA}, k_{iB})$ is then the sum of all probabilities less than or equal to $p(k_{iA}, k_{iB})$, given that the overall sum is $k_{iS}$:

$$p_i = \frac{\displaystyle\sum_{\substack{a+b=k_{iS} \\ p(a,b) \le p(k_{iA},k_{iB})}} p(a,b)}{\displaystyle\sum_{a+b=k_{iS}} p(a,b)}. \qquad (11)$$
The variables a and b in the above sums take the values $0, \ldots, k_{iS}$. The approach presented so far follows that of Robinson and Smyth [11] and is analogous to that taken by other conditioned tests, such as Fisher’s exact test. (See Reference [17], Chapter 3 for a discussion of the merits of conditioning in tests.)

Computation of p(a, b). First, assume that, under the null hypothesis, counts from different samples are independent. Then, $p(a, b) = \Pr(K_{iA} = a)\,\Pr(K_{iB} = b)$. The problem thus is computing the probability of the event $K_{iA} = a$, and, analogously, of $K_{iB} = b$. The random variable $K_{iA}$ is the sum of $m_A$ NB-distributed random variables. We approximate its distribution by a NB distribution whose parameters we obtain from those of the $K_{ij}$. To this end, we first compute the pooled mean estimate from the counts of both conditions,

$$\hat{q}_{i0} = \sum_{j:\rho(j)\in\{A,B\}} k_{ij} / \hat{s}_j, \qquad (12)$$
Anders and Huber, Genome Biology 2010, 11:R106, http://genomebiology.com/2010/11/10/R106
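The conditioned P value of Equation (11) can be illustrated with a small base-R sketch. It is not the DESeq code: the function name `exact_nb_pvalue` is hypothetical, and the NB mean and size (1/dispersion) of each count sum are assumed to be already known, whereas the paper derives them from the fitted model.

```r
# Sketch of Equation (11) for a single gene.
# k_a, k_b: observed count sums in conditions A and B
# mu_a, mu_b, size_a, size_b: assumed NB parameters of the two sums under the null
exact_nb_pvalue <- function(k_a, k_b, mu_a, mu_b, size_a, size_b) {
  k_s <- k_a + k_b
  a <- 0:k_s                       # all splits (a, b) with a + b = k_s
  b <- k_s - a
  p_ab <- dnbinom(a, size = size_a, mu = mu_a) *
          dnbinom(b, size = size_b, mu = mu_b)
  p_obs <- p_ab[k_a + 1]           # probability of the observed split
  # sum probabilities no larger than the observed one; the small relative
  # tolerance guards against floating-point ties
  sum(p_ab[p_ab <= p_obs * (1 + 1e-7)]) / sum(p_ab)
}

p_balanced <- exact_nb_pvalue(10, 10, mu_a = 10, mu_b = 10, size_a = 5, size_b = 5)
p_extreme  <- exact_nb_pvalue(0, 20, mu_a = 10, mu_b = 10, size_a = 5, size_b = 5)
print(c(p_balanced, p_extreme))
```

As with Fisher's exact test, a balanced split of the total count gives a large P value and a lopsided split gives a small one.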
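Bioconductor package DESeq
• Inputs are:
– integer matrix for gene counts, rows for genes and columns for samples.
– experimental design: a condition label for each column (sample).

library(DESeq)
conds = c(0,0,0,1,1,1)                  # condition label for each column of data
cds = newCountDataSet(data, conds)
cds = estimateSizeFactors(cds)
cds = estimateVarianceFunctions(cds)    # later DESeq releases use estimateDispersions() for this step
fit.DEseq = nbinomTest(cds, 0, 1)       # test condition 0 against condition 1
pval.DEseq = fit.DEseq$pval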
edgeR
• From a series of papers by Robinson et al. (the same group developed limma): 2007 Bioinformatics, 2008 Biostatistics, 2010 Bioinformatics.
• Empirical Bayes ideas to “shrink” gene-specific estimates and get better estimates for variances.
• The parameter to shrink is the over-dispersion (Φ) in NB, which controls the within-group variances.
• There is no conjugate prior, so shrinkage is not straightforward.
• Used a conditional weighted likelihood approach to establish an approximate EB estimator for Φ.
Bioconductor package edgeR
• Inputs are the same as DESeq: an integer matrix for counts and column labels for design.

library(edgeR)
d = DGEList(counts=data, group=c(0,0,0,1,1,1), lib.size=colSums(data))
d = calcNormFactors(d)
d = estimateCommonDisp(d)
d = estimateTagwiseDisp(d, trend=TRUE)
fit.edgeR = exactTest(d)
pval.edgeR = fit.edgeR$table$p.value    # this column is named PValue in later edgeR releases
DSS (Wu et al. 2012, Biostatistics)
• Found that the shrinkage from DESeq and edgeR is too strong.
[Figure: estimated dispersion versus true dispersion, both axes on a log scale from 0.005 to 5.000. Panels: (a) edgeR, (b) DESeq.]
A hierarchical model for the counts
• Ygi: observed counts for gene g, sample i.
• Θgi: unobserved true expression for gene g, sample i.
• Φg: dispersion (related to biological variance) for gene g.
• si: library size for sample i.
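One consequence of this hierarchy, worked out here as a brief sketch, is the negative binomial mean-variance relationship. Lower-case θ and φ stand for the slide's Θ and Φ. The derivation assumes that counts are Poisson with rate θ_gi s_i given θ_gi, and that θ_gi follows a Gamma distribution with mean µ_{g,k(i)} and dispersion φ_g (shape 1/φ_g), so that Var(θ_gi) = µ²_{g,k(i)} φ_g:

```latex
\begin{align*}
E[Y_{gi}] &= E\big[E[Y_{gi}\mid\theta_{gi}]\big] = s_i\,E[\theta_{gi}] = s_i\,\mu_{g,k(i)} =: \gamma_{gi}, \\
\mathrm{Var}(Y_{gi}) &= E\big[\mathrm{Var}(Y_{gi}\mid\theta_{gi})\big] + \mathrm{Var}\big(E[Y_{gi}\mid\theta_{gi}]\big) \\
&= s_i\,E[\theta_{gi}] + s_i^2\,\mathrm{Var}(\theta_{gi}) \\
&= \gamma_{gi} + \gamma_{gi}^2\,\phi_g .
\end{align*}
```

The first term is the Poisson sampling variance and the second is the extra variance between biological replicates, which is why φ_g is read as a measure of biological variation.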
8 H. WU AND OTHERS
the estimates do not correlate well with the truth. Although both edgeR and DESeq estimate the central magnitude of the dispersions well, they fail to capture the variations in dispersions among genes. The near horizontal clusters of points show that the dispersions of many genes are estimated to be approximately the same, which indicates over-shrinkage of the dispersions.

Since the dispersion parameter represents biological variation, having a good estimate of φ is crucial for finding DE that is beyond natural biological variation. In order to account for gene specific dispersion, we take advantage of real RNA-seq data sets with replicates and notice that the distribution of the logarithm of sample dispersion, log(φ), is approximately Gaussian as shown in Figure 2. In the next section we describe a novel shrinkage estimator for φ using a log-normal prior and a NB likelihood. In Section 5 we show that the proposed method greatly improves the estimation of φ and leads to better detection of DE.
4. METHODS
Denote the observed counts from gene g in sample i by $Y_{gi}$, and the unobserved expression rate by $\theta_{gi}$. We assume the following hierarchical model:

$$Y_{gi} \mid \theta_{gi} \sim \mathrm{Poisson}(\theta_{gi} s_i)$$
$$\theta_{gi} \mid \phi_g \sim \mathrm{Gamma}(\mu_{g,k(i)}, \phi_g)$$
$$\phi_g \sim \text{log-normal}(m_0, \tau^2)$$

Here the Gamma distribution is parametrized with mean and dispersion ($\phi_g$ is the reciprocal of the shape parameter in the common parameterization). k(i) denotes the treat-
The posterior
• With the negative binomial parameterized by mean and dispersion, the posterior for the dispersion is given by Equation 4.1 of Wu et al., reproduced in the excerpt below.
• Here, γgi is the expected value for Ygi.
• It’s a penalized likelihood that penalizes (1) dispersions far away from the prior mean; and (2) large dispersions.
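The penalized likelihood can be made concrete with a short base-R sketch. The excerpt from Wu et al. states that the posterior mode is found with R's `optimize`; everything else here is an assumption for illustration, not the DSS implementation: the function names, the search interval, and the hyperparameter values `m0` and `tau` are hypothetical, and the expected counts `gamma` are taken as given.

```r
# Log posterior of the dispersion phi for one gene, up to an additive constant.
# y: counts; gamma: expected counts (assumed known); m0, tau: log-normal prior
log_posterior <- function(phi, y, gamma, m0, tau) {
  sum(lgamma(1 / phi + y)) - length(y) * lgamma(1 / phi) -
    sum(log(1 + gamma * phi)) / phi +
    sum(y * (log(gamma * phi) - log(1 + gamma * phi))) -
    (log(phi) - m0)^2 / (2 * tau^2) - log(phi) - log(tau)
}

# Posterior mode by one-dimensional maximization over an assumed interval.
shrink_dispersion <- function(y, gamma, m0, tau, interval = c(1e-4, 10)) {
  optimize(log_posterior, interval = interval,
           y = y, gamma = gamma, m0 = m0, tau = tau, maximum = TRUE)$maximum
}

y <- c(10, 12, 11)                 # toy counts for one gene
phi_tilde <- shrink_dispersion(y, gamma = rep(11, 3), m0 = log(0.1), tau = 1)
print(phi_tilde)
```

Because the toy counts vary less than a Poisson sample would, the likelihood favours a small dispersion, and the prior term keeps the estimate from collapsing to zero.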
10 H. WU AND OTHERS
Supplementary Materials):
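$$\begin{aligned}
\log[p(\phi_g \mid Y_{gi}, \gamma_{gi}, i = 1, \ldots, n)] \propto{} & \sum_i \psi(\phi_g^{-1} + Y_{gi}) - n\,\psi(\phi_g^{-1}) - \phi_g^{-1} \sum_i \log(1 + \gamma_{gi}\phi_g) \\
& + \sum_i Y_{gi}\,[\log(\gamma_{gi}\phi_g) - \log(1 + \gamma_{gi}\phi_g)] \\
& - \frac{[\log(\phi_g) - m_0]^2}{2\tau^2} - \log(\phi_g) - \log(\tau), \qquad (4.1)
\end{aligned}$$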
where $\psi(\cdot)$ is the log gamma function and n is the number of samples. To obtain a point estimate of $\phi_g$, we could compute the posterior mean by using numerical methods such as importance sampling. It is however too computationally intensive to be practical for RNA-seq data. Instead we compute the posterior mode by maximizing an approximation of Equation 4.1 and denote the estimate $\tilde{\phi}_g$. Specifically, we substitute $\gamma_{gi}$ by $\hat{\gamma}_{gi} \equiv \hat{\mu}_{g,k(i)} s_i$ where $\hat{\mu}_{g,k(i)} = \frac{\sum_{j:k(j)=k(i)} Y_{gj}/s_j}{n_{k(i)}}$, and $n_{k(i)}$ is the number of samples in the same treatment group as i. The estimate $\hat{\mu}_{g,k(i)}$ is the same as the sample estimate of true concentration in Anders and Huber (2010). We plug in the hyperparameters $m_0$, $\tau^2$ estimated from the data, and maximize the approximated Equation 4.1 using the Newton-Raphson method. In practice we use the “optimize” function in R, which provides excellent computational performance.

The hyperparameters $m_0$, $\tau^2$ are estimated from the data by pooling information from all genes (details in the next subsection). The estimate $\tilde{\phi}_g$ is therefore an empirical Bayes estimate shrunken towards the common prior. It can also be viewed as an estimate that maximizes a penalized pseudo likelihood, as we are maximizing an approximation of the penalized likelihood (Equation 4.1), with penalty $-[\log(\phi_g) - m_0]^2/(2\tau^2) - \log(\phi_g)$, by replacing the nuisance parameter $\gamma_{gi}$ by its estimate $\hat{\gamma}_{gi}$. The first term in the penalty, $[\log(\phi_g) - m_0]^2/(2\tau^2)$, penalizes values that deviate far from the common prior $m_0$
DE analysis of RNA-seq 9
ment group of sample i. $s_i$ represents the normalizing factor, such as the relative library size, for sample i. This normalization factor could also be a gene specific factor, taking into account varying sequencing efficiency due to GC content and/or length (Hansen and others, 2012; Risso and others, 2012), but would nonetheless be treated as a known constant. Based on the hierarchical model, the marginal distribution of $Y_{gi}$ given $\mu_{g,k(i)}$ and $\phi_g$ is NB with mean $\gamma_{gi} = \mu_{g,k(i)} s_i$ and dispersion $\phi_g$. In the simplest setting where there are only two groups, the goal of DE detection is to test, for each gene, whether the mean expressions are identical in both groups, i.e., $\mu_{g,1} = \mu_{g,2}$, as noted by other researchers (Robinson and Smyth, 2007, 2008; Robinson and others, 2010; Anders and Huber, 2010).

Estimating $\phi_g$ is a crucial step in DE detection and shrinkage estimators have been shown to be useful in typical RNA-seq experiments with a small number of replicates (Robinson and others, 2010; Anders and Huber, 2010; Hardcastle and Kelly, 2010). As discussed in previous works (Robinson and Smyth, 2007), there is no conjugate prior for $\phi_g$ to ease the computation of a posterior distribution. Thus we decide to use a prior that approximates the $\phi_g$ in real RNA-seq data. We chose the log-normal distribution, motivated by Figure 2.

For gene g, the conditional posterior distribution of $\phi_g$ given all observed counts and means, $p(\phi_g \mid Y_{gi}, \gamma_{gi}, i = 1, \ldots, n)$, satisfies (detailed derivation can be found at
Testing and inference in two-‐group comparison
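• Wald test: $t_g = (\hat{\mu}_{g,1} - \hat{\mu}_{g,2}) / \sqrt{\hat{\sigma}^2_{g,1} + \hat{\sigma}^2_{g,2}}$.
– With dispersions, variances can be computed according to the NB distribution: $\mathrm{var} = \mu + \mu^2 \phi$.
– The variance for $\hat{\mu}_{g,1}$ is: $\hat{\sigma}^2_{g,1} = \frac{1}{n_1^2} \left[ \hat{\mu}_{g,1} \sum_{j:k(j)=1} \frac{1}{s_j} + n_1 \hat{\mu}^2_{g,1} \tilde{\phi}_g \right]$, and similarly for group 2.
• Inferences: use normal P-values and local FDR.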
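The Wald test for one gene can be sketched in base R as follows. This is a hedged illustration of the formulas in the excerpt from Wu et al., not the `waldTest` function of DSS: the function name `wald_two_group` is hypothetical, and the shrunken dispersion `phi` is assumed to have been estimated beforehand.

```r
# Sketch: two-group Wald test for a single gene.
# y: counts; group: group label (1 or 2) per sample;
# s: normalizing factors; phi: shrunken dispersion (assumed given)
wald_two_group <- function(y, group, s, phi) {
  est <- function(k) {
    idx <- group == k
    n <- sum(idx)
    mu <- mean(y[idx] / s[idx])                          # group mean on the common scale
    v <- (mu * sum(1 / s[idx]) + n * mu^2 * phi) / n^2   # estimated variance of mu
    c(mu = mu, var = v)
  }
  e1 <- est(1)
  e2 <- est(2)
  stat <- unname((e1["mu"] - e2["mu"]) / sqrt(e1["var"] + e2["var"]))
  list(stat = stat, pvalue = 2 * pnorm(-abs(stat)))      # two-sided normal P value
}

res <- wald_two_group(y = c(10, 12, 11, 30, 33, 29),
                      group = c(1, 1, 1, 2, 2, 2),
                      s = rep(1, 6), phi = 0.05)
print(res)
```

With all normalizing factors equal to 1, the variance of each group mean reduces to (µ + µ²φ)/n, the NB variance divided by the group size.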
4 H. WU AND OTHERS
“over-dispersion” problem since even non-differentially expressed genes show variation greater than expected by Poisson.

A common model to address the over-dispersion problem is the Negative Binomial (NB) model. The NB model is a Gamma-Poisson mixture and can be interpreted as the following: the Gamma distribution models the unobserved true expression levels in each biological sample, and conditioning on the expression level, the measurement from the sequencing machine follows a Poisson distribution (Anders and Huber, 2010; Robinson and Smyth, 2007; Hardcastle and Kelly, 2010). The variance from a NB distribution depends on the mean in the relationship:

$$\mathrm{var} = \mu + \mu^2 \phi \qquad (1.1)$$

where the first term represents variance due to Poisson sampling error and the second term represents variance due to variation between biological replicates. The parameter φ is referred to as the dispersion parameter. Notice that φ is the inverse of the shape parameter in the Gamma distribution, and thus is the squared coefficient of variation (CV). This is crucial in the context of gene expression, because although expression is a stochastic phenomenon and under constant regulation, the biological system is also robust and thus the CV is usually small between replicates. Based on RNA-seq data from human populations (Cheung and others, 2010), we observed that 70% of the genes have CV less than 0.5. CV in inbred animal models or established cell lines is likely even lower.
HAO: do we have the log(phi.hat) from high counts and can we show that it looks
somewhat normal? This can be our motivation for using a lognormal prior.
Most statisticians agree that the over-dispersion problem needs to be addressed. The
difference is how the dispersion is modeled, estimated and used in inference. Robinson
12 H. WU AND OTHERS
pseudo datasets we obtain the mean $S^2$ as the baseline. Lastly, we estimate $\tau^2$ as $\hat{\tau}^2 = \max([\mathrm{IQR}\{\log(\hat{\phi}_g)\}/1.349]^2 - S^2, c_0)$. Here $c_0$ is the lower bound for $\hat{\tau}^2$. In practice we use $c_0 = 0.01$.

Notice that $\hat{\phi}_g$ is only used in estimating the hyper-parameters and not directly used to obtain the shrinkage estimate $\tilde{\phi}_g$. Although each $\hat{\phi}_g$ is a rather crude estimate of the dispersion, especially for a small number of samples, the large number of genes in RNA-seq data allows us to obtain stable estimates of the hyper-parameters. Once the prior is established in the empirical Bayesian method, our shrinkage estimate $\tilde{\phi}_g$ directly comes from maximizing Equation 4.1.

Test statistic and False Discovery Rate. With all parameters estimated, a hypothesis test for the comparison of any two groups can be carried out using a Wald test or exact test. Previous reports (Robinson and Smyth, 2007) and our simulation show that both procedures provide similar performance. We choose to use the Wald test for its better computational performance and simplicity. In addition, it can be generalized to more complex experimental designs.

The Wald test statistic for two-group comparison is constructed as:

$$t_g = \frac{\hat{\mu}_{g,1} - \hat{\mu}_{g,2}}{\sqrt{\hat{\sigma}^2_{g,1} + \hat{\sigma}^2_{g,2}}}$$

where $\hat{\sigma}^2_{g,1} \equiv \frac{1}{n_1^2} \left[ \hat{\mu}_{g,1} \left( \sum_{j:k(j)=1} \frac{1}{s_j} \right) + n_1 \hat{\mu}^2_{g,1} \tilde{\phi}_g \right]$ is the estimated variance for $\hat{\mu}_{g,1}$ (detailed derivation in Supplementary Materials), and $\hat{\sigma}^2_{g,2}$ is similarly defined. When there is no need for normalization ($s_j = 1\ \forall j$), $\hat{\sigma}^2_{g,1}$ reduces to $(\hat{\mu}_{g,1} + \hat{\mu}^2_{g,1} \tilde{\phi}_g)/n_1$, an estimate of the variance in Equation 1.1.
It is not trivial to derive the null distribution for the test statistic and asymptotic
DSS Bioconductor package
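• Inputs are the same as DESeq and edgeR: an integer matrix for counts and column labels for design.

library(DSS)
conds = c(0,0,0,1,1,1)                  # condition label for each column of X
seqData = newSeqCountSet(X, conds)      # X: integer count matrix, genes in rows
seqData = estNormFactors(seqData)
seqData = estDispersion(seqData)
result = waldTest(seqData, 0, 1)        # test condition 0 against condition 1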
Summary for RNA-seq DE test
• Based on my experience and simulation results:
– All methods provide very similar results when the variation of dispersions is small.
– When the variation of dispersions is large, DSS outperforms the other methods.
• Testing procedures for multiple-factor designs are still not well developed; they are mostly based on glm.
• There is still room for statistical research.
Review of this class
• I have covered some aspects of next generation sequencing:
– Biological backgrounds.
– Sequencing technologies.
– Brief introduction to NGS applications: DNA-seq, RNA-seq and ChIP-seq.
– RNA-seq analyses: motivation, statistical methods for DE analysis (very briefly), useful software.