Genome Annotation Haixu Tang School of Informatics.


Page 1: Genome Annotation Haixu Tang School of Informatics.

Genome Annotation

Haixu Tang, School of Informatics

Page 2: Genome Annotation Haixu Tang School of Informatics.

Genome and genes

• Genome: an organism’s genetic material (analogy: a car encyclopedia)

• Gene: a discrete unit of hereditary information located on the chromosomes and consisting of DNA (analogy: the chapters that explain how to make the components of a car, or how to use and drive it)

Page 3: Genome Annotation Haixu Tang School of Informatics.

Gene Prediction: Computational Challenge

aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg


Page 5: Genome Annotation Haixu Tang School of Informatics.

Gene Prediction: Computational Challenge

aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg

Gene!

Page 6: Genome Annotation Haixu Tang School of Informatics.

Gene Prediction: Computational Challenge

• Gene: A sequence of nucleotides coding for protein

• Gene Prediction Problem: Determine the beginning and end positions of genes in a genome

Page 7: Genome Annotation Haixu Tang School of Informatics.

Central Dogma: DNA -> RNA -> Protein

DNA: CCTGAGCCAACTATTGATGAA

(transcription)

RNA: CCUGAGCCAACUAUUGAUGAA

(translation)

Protein: PEPTIDE
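The two steps on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming the standard genetic code; the names `CODON_TABLE`, `transcribe`, and `translate` are illustrative and not from the slides.

```python
from itertools import product

# Standard genetic code in the usual compact form: codons are enumerated
# with bases in the order T, C, A, G; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Rewrite the coding strand as RNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    dna = rna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)
print(translate(rna))
```

Applied to the slide's DNA string, this yields the RNA string shown on the slide and the one-letter amino acid sequence PEPTIDE.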

Page 8: Genome Annotation Haixu Tang School of Informatics.

Translating Nucleotides into Amino Acids

• Codon: 3 consecutive nucleotides

• 4^3 = 64 possible codons

• Genetic code is degenerate and redundant

– Includes start and stop codons

– An amino acid may be coded by more than one codon (codon degeneracy)
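The counts above can be checked by enumerating the code. The sketch below assumes the standard genetic code and groups the 64 codons by the amino acid they encode; `codons_by_amino_acid` is an illustrative name.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, bases enumerated in the order T, C, A, G;
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons by the amino acid (or stop signal) they encode.
codons_by_amino_acid = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    codons_by_amino_acid[amino_acid].append(codon)

print(len(CODON_TABLE))  # number of possible codons
for amino_acid, codons in sorted(codons_by_amino_acid.items()):
    print(amino_acid, len(codons), " ".join(codons))
```

Leucine, serine, and arginine each have six codons, while methionine (ATG, the start codon) and tryptophan have only one.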

Page 9: Genome Annotation Haixu Tang School of Informatics.

Codons

• In 1961 Sydney Brenner and Francis Crick discovered frameshift mutations

• Systematically deleted nucleotides from DNA

– Single and double deletions dramatically altered protein product

– Effects of triple deletions were minor

– Conclusion: every triplet of nucleotides, each codon, codes for exactly one amino acid in a protein

Page 10: Genome Annotation Haixu Tang School of Informatics.

Genetic Code and Stop Codons

UAA, UAG and UGA are the three stop codons; together with the start codon AUG (ATG in DNA) they delineate Open Reading Frames

Page 11: Genome Annotation Haixu Tang School of Informatics.

Six Frames in a DNA Sequence

• Stop codons: TAA, TAG, TGA

• Start codon: ATG

5’ GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG 3’

3’ CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC 5’

Each strand is read in three frames (starting at its first, second, or third base), giving six frames in total.
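A short sketch of how the six frames can be enumerated: three offsets on the given strand and three on its reverse complement. The function names and the frame labels ("+1" to "-3") are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frames(dna: str) -> dict:
    """Split a sequence into codons for all six reading frames."""
    frames = {}
    strands = (("+", dna.upper()), ("-", reverse_complement(dna)))
    for label, strand in strands:
        for offset in range(3):
            # Drop trailing bases that do not fill a whole codon.
            usable = len(strand) - (len(strand) - offset) % 3
            frames[f"{label}{offset + 1}"] = [
                strand[i:i + 3] for i in range(offset, usable, 3)
            ]
    return frames


for name, codons in six_frames("ATGAAATGA").items():
    print(name, codons)
```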

Page 12: Genome Annotation Haixu Tang School of Informatics.

Open Reading Frames (ORFs)

• Detect potential coding regions by looking at ORFs

– A genome of length n comprises n/3 codons

– Stop codons break the genome into segments between consecutive stop codons

– The subsegments of these that start from the start codon (ATG) are ORFs

• ORFs in different frames may overlap

[Figure: genomic sequence with an open reading frame running from ATG to TGA]
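The ORF definition above translates into a simple scan. This is a minimal sketch for the three forward frames only; `find_orfs` and its `min_codons` parameter are illustrative names, and the choice to start each ORF at the first ATG after a stop codon is an assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna: str, min_codons: int = 0) -> list:
    """Return (frame, start, end) for ORFs in the three forward frames.

    start is the index of the A in ATG; end is one past the stop codon.
    Within each stop-to-stop segment only the first ATG is used, which
    gives the longest ORF in that segment.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs


sequence = "CCATGAAATGACC"
for frame, start, end in find_orfs(sequence):
    print(frame, start, end, sequence[start:end])
```

Running the same function on the reverse complement of the sequence covers the other three frames.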

Page 13: Genome Annotation Haixu Tang School of Informatics.

Long vs. Short ORFs

• A long open reading frame may be a gene

– At random, we should expect one stop codon every (64/3) ~= 21 codons

– However, genes are usually much longer than this

• A basic approach is to scan for ORFs whose length exceeds a certain threshold

– This is naïve because some genes (e.g. some neural and immune system genes) are relatively short
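The 64/3 figure follows from treating each codon as an independent, uniform draw from the 64 possibilities, 3 of which are stops. Under that assumption (an idealization, since real genomes are not uniform), the chance that a random stretch survives without a stop codon falls off geometrically. The function names below are illustrative.

```python
P_STOP = 3 / 64  # chance that a uniformly random codon is a stop codon


def expected_codons_between_stops(p_stop: float = P_STOP) -> float:
    """Mean of the geometric distribution: one stop every 1/p codons."""
    return 1 / p_stop


def prob_no_stop(num_codons: int, p_stop: float = P_STOP) -> float:
    """Probability that num_codons random codons contain no stop."""
    return (1 - p_stop) ** num_codons


print(expected_codons_between_stops())
for length in (21, 50, 100, 300):
    print(length, prob_no_stop(length))
```

A stop-free run of 100 codons already has a probability below 1% under this model, which is why a long ORF is treated as evidence for a gene.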

Page 14: Genome Annotation Haixu Tang School of Informatics.

Testing ORFs: Codon Usage

• Create a 64-element hash table and count the frequencies of codons in an ORF

• Amino acids typically have more than one codon, but in nature certain codons are used more often than others

• Uneven use of the codons may characterize a real gene

• This compensates for pitfalls of the ORF length test
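The hash table described on this slide maps naturally onto a Python `Counter`. The sketch below counts the codons of an ORF and converts them to a per-thousand frequency; the function names are illustrative.

```python
from collections import Counter


def codon_counts(orf: str) -> Counter:
    """Count each codon in an ORF, reading from its first base."""
    orf = orf.upper()
    return Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))


def codon_frequencies_per_thousand(orf: str) -> dict:
    """Express codon counts as occurrences per 1000 codons."""
    counts = codon_counts(orf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: 1000 * n / total for codon, n in counts.items()}


example_orf = "ATGCTGCTGTAA"
print(codon_counts(example_orf))
print(codon_frequencies_per_thousand(example_orf))
```

Comparing these frequencies with an organism's known codon usage is one way to test whether an ORF looks like a real gene.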

Page 15: Genome Annotation Haixu Tang School of Informatics.

Codon Usage in Human Genome

Page 16: Genome Annotation Haixu Tang School of Informatics.

Codon Usage in Mouse Genome

AA   codon  /1000  frac
Ser  TCG     4.31  0.05
Ser  TCA    11.44  0.14
Ser  TCT    15.70  0.19
Ser  TCC    17.92  0.22
Ser  AGT    12.25  0.15
Ser  AGC    19.54  0.24
Pro  CCG     6.33  0.11
Pro  CCA    17.10  0.28
Pro  CCT    18.31  0.30
Pro  CCC    18.42  0.31
Leu  CTG    39.95  0.40
Leu  CTA     7.89  0.08
Leu  CTT    12.97  0.13
Leu  CTC    20.04  0.20
Ala  GCG     6.72  0.10
Ala  GCA    15.80  0.23
Ala  GCT    20.12  0.29
Ala  GCC    26.51  0.38
Gln  CAG    34.18  0.75
Gln  CAA    11.51  0.25

Page 17: Genome Annotation Haixu Tang School of Informatics.

Transcription in prokaryotes

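[Figure labels, from 5’ (upstream) to 3’ (downstream): promoter, transcription start site, untranslated region, start codon, coding region, stop codon, untranslated region, transcription stop site. The transcribed region runs from the transcription start site to the transcription stop site.]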

Page 18: Genome Annotation Haixu Tang School of Informatics.

Microbial gene finding

• Microbial genomes tend to be gene-rich (80%-90% of the sequence is coding sequence)

• Major problem: finding genes without known homologues.

Page 19: Genome Annotation Haixu Tang School of Informatics.

Open Reading Frame

An Open Reading Frame (ORF) is a sequence of codons which starts with a start codon, ends with a stop codon and has no stop codons in between.

Searching for ORFs: consider all 6 possible reading frames, 3 forward and 3 reverse

Is the ORF a coding sequence?

1. Must be long enough (roughly 300 bp or more)

2. Should have average amino-acid composition specific for a given organism.

3. Should have codon usage specific for the given organism.

Page 20: Genome Annotation Haixu Tang School of Informatics.

Gene finding using codon frequency

[Figure: the codon frequencies of the input sequence are compared with the frequency in coding regions and the frequency in non-coding regions; the outcome is a call of coding region or non-coding region]

Page 21: Genome Annotation Haixu Tang School of Informatics.

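Example

Codon position   A     C     T     G
1                28%   33%   18%   21%
2                32%   16%   21%   32%
3                33%   15%   14%   38%
frequency in
genome           31%   18%   19%   31%

Assume: bases making codon are independent

\[ \frac{P(x \mid \text{in coding})}{P(x \mid \text{random})} = \prod_i \frac{P(A_i \text{ at } i\text{th position})}{P(A_i \text{ in the sequence})} \]

Score of AAAGAT (the fifth base, A, falls on codon position 2, so its factor is .32):

\[ \frac{.28 \cdot .32 \cdot .33 \cdot .21 \cdot .32 \cdot .14}{.31 \cdot .31 \cdot .31 \cdot .31 \cdot .31 \cdot .19} \]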
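The score can be computed mechanically by reading every factor from the table: the numerator uses the frequency of each base at its codon position (1, 2, or 3), the denominator uses the genome-wide frequency. The sketch below assumes the sequence starts at codon position 1; the dictionary and function names are illustrative.

```python
# Base frequencies by codon position, copied from the table on this slide.
POSITION_FREQ = {
    1: {"A": 0.28, "C": 0.33, "T": 0.18, "G": 0.21},
    2: {"A": 0.32, "C": 0.16, "T": 0.21, "G": 0.32},
    3: {"A": 0.33, "C": 0.15, "T": 0.14, "G": 0.38},
}
# Base frequencies over the whole genome (last row of the table).
GENOME_FREQ = {"A": 0.31, "C": 0.18, "T": 0.19, "G": 0.31}


def coding_score(sequence: str) -> float:
    """Likelihood ratio P(x | coding) / P(x | random).

    Assumes the bases are independent and that the first base of the
    sequence sits at codon position 1.
    """
    ratio = 1.0
    for index, base in enumerate(sequence.upper()):
        codon_position = index % 3 + 1
        ratio *= POSITION_FREQ[codon_position][base] / GENOME_FREQ[base]
    return ratio


print(coding_score("AAAGAT"))
```

A ratio above 1 favors the coding model and a ratio below 1 favors the random model; in practice the logarithm of the ratio is often used so that scores add up along the sequence.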

Page 22: Genome Annotation Haixu Tang School of Informatics.

Using codon frequency to find correct reading frame

Consider sequence x1 x2 x3 x4 x5 x6 x7 x8 x9…, where xi is a nucleotide

let

p1 = p(x1 x2 x3) p(x4 x5 x6)…

p2 = p(x2 x3 x4) p(x5 x6 x7)…

p3 = p(x3 x4 x5) p(x6 x7 x8)…

then the probability that the ith reading frame is the coding frame is:

\[ P_i = \frac{p_i}{p_1 + p_2 + p_3} \]

Algorithm:

• Slide a window along the sequence and compute Pi

• Plot the results
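A sketch of the windowed computation, assuming a table of codon probabilities for coding sequence is already available as a dictionary. The `floor` value for unseen codons, the window size, and the function names are assumptions for illustration; the products are accumulated as sums of logarithms to avoid underflow on long windows.

```python
import math


def frame_probabilities(window: str, codon_prob: dict, floor: float = 1e-6) -> list:
    """Return [P1, P2, P3] for one window.

    codon_prob maps codons to their probability in coding sequence;
    codons missing from the table get the small probability `floor`.
    Every frame is scored over the same number of codons.
    """
    num_codons = (len(window) - 2) // 3
    log_p = []
    for offset in range(3):
        total = 0.0
        for k in range(num_codons):
            codon = window[offset + 3 * k: offset + 3 * k + 3]
            total += math.log(codon_prob.get(codon, floor))
        log_p.append(total)
    # Normalize in log space: P_i = p_i / (p_1 + p_2 + p_3).
    largest = max(log_p)
    weights = [math.exp(value - largest) for value in log_p]
    return [w / sum(weights) for w in weights]


def sliding_frame_probabilities(sequence, codon_prob, window=12, step=3):
    """Compute frame probabilities for each window position.

    With a step that is a multiple of 3, frame i of every window
    corresponds to the same frame of the full sequence.
    """
    return [
        (start, frame_probabilities(sequence[start:start + window], codon_prob))
        for start in range(0, len(sequence) - window + 1, step)
    ]


toy_table = {"ATG": 0.5, "GCT": 0.5}  # hypothetical codon probabilities
for start, probs in sliding_frame_probabilities("ATGGCTATGGCTATG", toy_table):
    print(start, [round(p, 3) for p in probs])
```

The list of (position, probabilities) pairs is what would be plotted in the last step of the algorithm.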

Page 23: Genome Annotation Haixu Tang School of Informatics.

Eukaryotic gene finding

• On average, a vertebrate gene is about 30 kb long

• Coding region takes about 1 kb

• Exon sizes vary from double-digit numbers of bases to kilobases

• An average 5’ UTR is about 750 bp

• An average 3’ UTR is about 450 bp, but both can be much longer.

Page 24: Genome Annotation Haixu Tang School of Informatics.

Exons and Introns

• In eukaryotes, the gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns)

• This makes computational gene prediction in eukaryotes even more difficult

• Prokaryotes don’t have introns: genes in prokaryotes are continuous

Page 25: Genome Annotation Haixu Tang School of Informatics.

Gene Structure

Page 26: Genome Annotation Haixu Tang School of Informatics.

Gene structure in eukaryotes

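[Figure labels, from 5’ to 3’: promoter, transcription start site, untranslated region, start codon, initial exon, internal exons, final exon, stop codon, untranslated region, transcription stop site. Each intron lies between a donor site (GT) and an acceptor site (AG). The transcribed region runs from the transcription start site to the transcription stop site.]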

Page 27: Genome Annotation Haixu Tang School of Informatics.

Central Dogma and Splicing

[Figure: a gene with exon1, intron1, exon2, intron2, exon3 goes through transcription, splicing, and translation]

exon = coding

intron = non-coding

Page 28: Genome Annotation Haixu Tang School of Informatics.

Splicing Signals

Exons are interspersed with introns; an intron typically begins with GT (the donor site) and ends with AG (the acceptor site)
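The GT and AG signals alone give a very permissive first pass: every GT followed later by an AG is a candidate intron. The sketch below enumerates these candidates; the function name and the length limits are illustrative, and real gene finders score the surrounding sequence rather than relying on the dinucleotides.

```python
import re


def candidate_introns(dna: str, min_len: int = 4, max_len=None) -> list:
    """List (start, end) spans that begin with GT and end with AG.

    start is the index of the G in GT; end is one past the G in AG.
    """
    dna = dna.upper()
    # Lookahead patterns find overlapping occurrences as well.
    donors = [m.start() for m in re.finditer("(?=GT)", dna)]
    acceptor_ends = [m.start() + 2 for m in re.finditer("(?=AG)", dna)]
    spans = []
    for donor in donors:
        for end in acceptor_ends:
            length = end - donor
            if length >= min_len and (max_len is None or length <= max_len):
                spans.append((donor, end))
    return spans


sequence = "CCGTAAAAGCC"
for start, end in candidate_introns(sequence):
    print(start, end, sequence[start:end])
```

Because GT and AG occur by chance roughly once every 16 bases each, the number of candidates grows quickly with sequence length, which is why the dinucleotides are only a necessary condition.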

Page 29: Genome Annotation Haixu Tang School of Informatics.

Splice site detection

Donor site (5’ to 3’): base frequency (%) by position

Position   -8   …   -2   -1    0    1    2   …   17
A          26   …   60    9    0    1   54   …   21
C          26   …   15    5    0    1    2   …   27
G          25   …   12   78   99    0   41   …   27
T          23   …   13    8    1   98    3   …   25
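A table like this can be used as a position weight matrix to score candidate donor sites. The sketch below uses only the five columns given explicitly in the table (positions -2 to 2) and scores a 5-base window as a log-odds sum; the uniform background of 0.25 and the pseudocount are assumptions, not values from the slides.

```python
import math

# Percentages for positions -2..2, copied from the donor site table.
DONOR_PWM = {
    -2: {"A": 60, "C": 15, "G": 12, "T": 13},
    -1: {"A": 9, "C": 5, "G": 78, "T": 8},
    0: {"A": 0, "C": 0, "G": 99, "T": 1},
    1: {"A": 1, "C": 1, "G": 0, "T": 98},
    2: {"A": 54, "C": 2, "G": 41, "T": 3},
}
BACKGROUND = 0.25  # assumed uniform background frequency


def donor_score(site: str, pseudocount: float = 0.5) -> float:
    """Log-odds score (in bits) of a 5-base window covering -2..2."""
    if len(site) != len(DONOR_PWM):
        raise ValueError("site must be exactly 5 bases long")
    score = 0.0
    for position, base in zip(sorted(DONOR_PWM), site.upper()):
        column = DONOR_PWM[position]
        # The pseudocount keeps zero entries from giving log(0).
        prob = (column[base] + pseudocount) / (sum(column.values()) + 4 * pseudocount)
        score += math.log2(prob / BACKGROUND)
    return score


for candidate in ("AGGTA", "AGGTG", "CCCCC"):
    print(candidate, round(donor_score(candidate), 2))
```

Windows matching the consensus around the GT dinucleotide score well above zero, while unrelated windows score strongly negative.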

Page 30: Genome Annotation Haixu Tang School of Informatics.

Consensus splice sites

Donor: 7.9 bits

Acceptor: 9.4 bits
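Totals in bits like these are usually obtained by summing, over the positions of the site, the information content of each column: 2 bits minus the Shannon entropy of its base frequencies. The sketch below computes that per-column value and ignores the small-sample correction used in practice; the function name is illustrative.

```python
import math


def column_information(percentages) -> float:
    """Information content in bits of one column: 2 - entropy.

    percentages are the frequencies of A, C, G, T at one position;
    they are normalized, so they need not sum to exactly 100.
    """
    total = sum(percentages)
    entropy = 0.0
    for value in percentages:
        if value > 0:
            p = value / total
            entropy -= p * math.log2(p)
    return 2.0 - entropy


print(column_information([25, 25, 25, 25]))  # uninformative column
print(column_information([0, 0, 99, 1]))     # nearly invariant G
print(column_information([60, 15, 12, 13]))  # moderately conserved
```

A column with equal base frequencies contributes nothing, while an invariant base contributes the full 2 bits.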

Page 31: Genome Annotation Haixu Tang School of Informatics.

Promoters

• Promoters are DNA segments upstream of transcripts that initiate transcription

• Promoter attracts RNA Polymerase to the transcription start site

[Figure: the promoter lies at the 5’ end, upstream of the transcribed region]

Page 32: Genome Annotation Haixu Tang School of Informatics.

Two Approaches to Eukaryotic Gene Prediction

• Statistical: coding segments (exons) have typical sequences on either end and use different subwords than non-coding segments (introns).

• Similarity-based: many human genes are similar to genes in mice, chicken, or even bacteria. Therefore, already known mouse, chicken, and bacterial genes may help to find human genes.

Page 33: Genome Annotation Haixu Tang School of Informatics.

Ribosomal Binding Site

Page 34: Genome Annotation Haixu Tang School of Informatics.

Donor and Acceptor Sites: Motif Logos

Donor: 7.9 bits

Acceptor: 9.4 bits

(Stephens & Schneider, 1996)

(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

Page 35: Genome Annotation Haixu Tang School of Informatics.

Similarity-based gene finding

• Alignment of

– Genomic sequence and (assembled) EST sequences

– Genomic sequence and known (similar) protein sequences

– Two or more similar genomic sequences

Page 36: Genome Annotation Haixu Tang School of Informatics.

Expressed Sequence Tags

1. Start from a cell or tissue

2. Isolate mRNA and reverse transcribe into cDNA

3. Clone cDNA into a vector to make a cDNA library

4. Pick a clone and sequence the 5’ and 3’ ends of the cDNA insert; each read is an EST

5. Submit to dbEST


Page 38: Genome Annotation Haixu Tang School of Informatics.

Splicing Sequence Alignment

Potential splicing sites

Page 39: Genome Annotation Haixu Tang School of Informatics.

Using Similarities to Find the Exon Structure

• Human EST (mRNA) sequence is aligned to different locations in the human genome

• Find the “best” path to reveal the exon structure of human gene

[Figure: local alignments plotted with the EST sequence on one axis and the human genome on the other]
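One simple way to read "best path" is as a chaining problem: given scored local alignments between the EST and the genome, pick the highest-scoring set that appears in the same order in both sequences without overlapping. The sketch below is an illustrative dynamic program under that assumption, not necessarily the method used in the lecture; the block format and function name are hypothetical.

```python
def best_exon_chain(blocks):
    """Pick the best-scoring colinear chain of alignment blocks.

    Each block is (genome_start, genome_end, est_start, est_end, score)
    with half-open coordinates. A block may follow another only if it
    starts at or after the other's end in both the genome and the EST.
    Returns (total_score, chain).
    """
    if not blocks:
        return 0, []
    blocks = sorted(blocks, key=lambda b: (b[1], b[3]))
    best = [b[4] for b in blocks]   # best chain score ending at block j
    previous = [None] * len(blocks)
    for j, later in enumerate(blocks):
        for i in range(j):
            earlier = blocks[i]
            compatible = earlier[1] <= later[0] and earlier[3] <= later[2]
            if compatible and best[i] + later[4] > best[j]:
                best[j] = best[i] + later[4]
                previous[j] = i
    # Trace back from the best final block.
    j = max(range(len(blocks)), key=best.__getitem__)
    total = best[j]
    chain = []
    while j is not None:
        chain.append(blocks[j])
        j = previous[j]
    return total, chain[::-1]


# Hypothetical alignment blocks for illustration.
example_blocks = [
    (100, 200, 0, 100, 90),
    (150, 260, 80, 190, 50),
    (500, 620, 100, 220, 110),
    (900, 950, 220, 270, 45),
]
score, chain = best_exon_chain(example_blocks)
print(score)
for block in chain:
    print(block)
```

The genome intervals of the chosen blocks are the predicted exons, and the gaps between consecutive blocks are the predicted introns.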

Page 40: Genome Annotation Haixu Tang School of Informatics.

An annotated gene in human genome