CSE182-L12
![Page 1: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/1.jpg)
CSE182-L12
Gene Finding
![Page 2: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/2.jpg)
Silly Quiz
• Who are these people, and what is the occasion?
![Page 3: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/3.jpg)
Gene Features
[Figure: gene structure diagram labeling the transcription start, 5’ UTR, translation start (ATG), exons, introns, donor and acceptor splice sites, and 3’ UTR]
![Page 4: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/4.jpg)
DNA Signals
• Coding versus non-coding
• Splice Signals
• Translation start

[Figure: gene structure diagram labeling the transcription start, 5’ UTR, translation start (ATG), exons, introns, donor and acceptor splice sites, and 3’ UTR]
![Page 5: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/5.jpg)
PWMs
• Fixed length for the splice signal.
• Each position is generated independently according to a distribution.
• Figure shows data from > 1200 donor sites.

[Figure: example donor sites, positions -3 to +6]
Position:  -3 -2 -1 +1 +2 +3 +4 +5 +6
            A  A  G  G  T  G  A  G  T
            C  C  G  G  T  A  A  G  T
            G  A  G  G  T  G  A  G  G
            T  A  G  G  T  A  A  G  G
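A minimal sketch of how a PWM of this kind could be built and used to score a candidate site. The four 9-mers are toy donor-like sequences chosen for illustration (far fewer than the > 1200 sites behind the figure), and the pseudocount and uniform background are assumptions, not values from the lecture.

```python
import math

ALPHABET = "ACGT"

# Toy training set (assumption): 9-mers covering positions -3..-1 and +1..+6.
DONOR_SITES = ["AAGGTGAGT", "CCGGTAAGT", "GAGGTGAGG", "TAGGTAAGG"]

def build_pwm(sites, pseudocount=1.0):
    """Return one {base: probability} dict per position of the signal."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so no probability is zero.
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in ALPHABET})
    return pwm

def log_odds_score(pwm, seq, background=0.25):
    """Sum of per-position log2 odds; positions are treated as independent."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(seq))

pwm = build_pwm(DONOR_SITES)
print(log_odds_score(pwm, "AAGGTGAGT"))  # a training site: positive score
print(log_odds_score(pwm, "ACGTACGTA"))  # an unrelated 9-mer: negative score
```

Because each position contributes its own term to the sum, this model cannot reward or penalize combinations of bases at different positions.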
![Page 6: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/6.jpg)
MDD (Maximal Dependence Decomposition)
• PWMs do not capture correlations between positions.
• Many position pairs in the Donor signal are correlated.
![Page 7: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/7.jpg)
MDD method
• Choose the position i which has the highest correlation score.
• Split sequences into two: those which have the consensus at position i, and the remaining.
• Recurse until <Terminating conditions>
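The recursion above can be sketched as follows. The slide does not define the correlation score or the terminating conditions, so this sketch makes two assumptions: the score for position i is the sum over other positions j of a chi-square statistic between "consensus at i (yes/no)" and "base at j", and recursion stops on a small subset, a depth limit, or no remaining dependence. The names `min_size` and `max_depth` are hypothetical stand-ins.

```python
from collections import Counter

def consensus(seqs, pos):
    """Most frequent base at a position."""
    return Counter(s[pos] for s in seqs).most_common(1)[0][0]

def chi_square(seqs, i, j):
    """Chi-square statistic: (consensus at i? yes/no) versus (base at j)."""
    cons = consensus(seqs, i)
    n = len(seqs)
    observed = Counter((s[i] == cons, s[j]) for s in seqs)
    row = Counter(s[i] == cons for s in seqs)
    col = Counter(s[j] for s in seqs)
    stat = 0.0
    for r in row:
        for b in col:
            expected = row[r] * col[b] / n
            stat += (observed[(r, b)] - expected) ** 2 / expected
    return stat

def best_split_position(seqs, positions):
    """Position whose consensus indicator depends most on the other positions."""
    scores = {i: sum(chi_square(seqs, i, j) for j in positions if j != i)
              for i in positions}
    return max(scores, key=scores.get), scores

def mdd(seqs, positions, min_size=4, max_depth=3, depth=0):
    """Return a nested dict describing the decomposition tree."""
    # Assumed terminating conditions (the slide leaves them unspecified).
    if len(seqs) < min_size or depth >= max_depth or len(positions) < 2:
        return {"leaf": seqs}
    i, scores = best_split_position(seqs, positions)
    if scores[i] == 0:  # no dependence left to exploit
        return {"leaf": seqs}
    cons = consensus(seqs, i)
    match = [s for s in seqs if s[i] == cons]
    other = [s for s in seqs if s[i] != cons]
    rest = [p for p in positions if p != i]
    return {"position": i, "consensus": cons,
            "match": mdd(match, rest, min_size, max_depth, depth + 1),
            "other": mdd(other, rest, min_size, max_depth, depth + 1)}

# Toy data: positions 0 and 1 always agree, position 2 is independent.
seqs = ["AAC", "AAG", "AAT", "AAA", "CCC", "CCG", "CCT", "CCA"]
tree = mdd(seqs, [0, 1, 2])
print(tree["position"], tree["consensus"])
```

In a full model, each leaf would get its own PWM trained on the sequences that reached it.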
![Page 8: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/8.jpg)
MDD for Donor sites
![Page 9: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/9.jpg)
Gene prediction: Summary
• Various signals distinguish coding regions from non-coding
• HMMs are a reasonable model for Gene structures, and provide a uniform method for combining various signals.
• Further improvement may come from improved signal detection
![Page 10: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/10.jpg)
How many genes do we have?
[Figure labels: Nature, Science]
![Page 11: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/11.jpg)
Alternative splicing
![Page 12: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/12.jpg)
Comparative methods
• Gene prediction is harder with alternative splicing.
• One approach might be to use comparative methods to detect genes.
• Given a similar mRNA/protein (from another species, perhaps?), can you find the best parse of a genomic sequence that matches that target sequence?
• Yes, with a variant on alignment algorithms that penalizes introns separately from other gaps.
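A minimal sketch of such an alignment variant, assuming a global alignment of an mRNA against a genomic sequence in which an intron costs a flat penalty regardless of its length, while ordinary gaps cost a per-base penalty. All scoring values are illustrative, and the sketch ignores splice signals (GT/AG) and minimum intron lengths that a real tool would use. It returns only the score, not the parse.

```python
NEG_INF = float("-inf")

def spliced_align_score(mrna, genome, match=1, mismatch=-1, gap=2, intron=5):
    """Best global score of mrna against genome with flat-cost introns."""
    n, m = len(mrna), len(genome)
    # M[i][j]: best score for mrna[:i] vs genome[:j], not ending inside an intron.
    # I[i][j]: best score for mrna[:i] vs genome[:j], genome[j-1] inside an intron.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if j > 0:
                # Open an intron (flat cost) or extend an open one for free.
                I[i][j] = max(M[i][j - 1] - intron, I[i][j - 1])
            best = NEG_INF
            if i > 0 and j > 0:
                s = match if mrna[i - 1] == genome[j - 1] else mismatch
                # An aligned pair may follow an exon base or close an intron.
                best = max(best, max(M[i - 1][j - 1], I[i - 1][j - 1]) + s)
            if i > 0:
                best = max(best, M[i - 1][j] - gap)  # mRNA base left unaligned
            if j > 0:
                best = max(best, M[i][j - 1] - gap)  # ordinary genomic gap
            M[i][j] = best
    return max(M[n][m], I[n][m])

exon1, intron_seq, exon2 = "ACGT", "GTAAAAAAAAAG", "ACGT"
print(spliced_align_score(exon1 + exon2, exon1 + intron_seq + exon2))
```

With the default values, skipping the 12-base insert as one intron costs 5, whereas treating it as 12 ordinary gap bases would cost 24, so the intron parse wins.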
![Page 13: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/13.jpg)
Comparative gene finding tools
• Genscan/Genie: de novo (HMM-based) gene finders
• Procrustes/Sim4: mRNA vs. genomic
• Genewise: proteins versus genomic
• CEM: genomic versus genomic
• Twinscan: Combines comparative and de novo approach.
![Page 14: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/14.jpg)
Databases
• RefSeq and other databases maintain sequences of full-length transcripts.
• We can query using sequence.
![Page 15: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/15.jpg)
De novo Gene prediction: Summary
• Various signals distinguish coding regions from non-coding
• HMMs are a reasonable model for Gene structures, and provide a uniform method for combining various signals.
• Further improvement may come from improved signal detection
![Page 19: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/19.jpg)
Comparative gene finding tools
• Procrustes/Sim4: mRNA vs. genomic
• Genewise: proteins versus genomic
• CEM: genomic versus genomic
• Twinscan: Combines comparative and de novo approach.
![Page 20: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/20.jpg)
Course
• Sequence Comparison (BLAST & other tools)
• Protein Motifs:
  – Profiles/Regular Expression/HMMs
• Protein Sequence Identification via Mass Spec.
• Discovering protein coding genes
  – Gene finding HMMs
  – DNA signals (splice signals)
![Page 21: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/21.jpg)
Genome Assembly
![Page 22: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/22.jpg)
DNA Sequencing
• DNA is double-stranded
• The strands are separated, and a polymerase is used to copy the second strand.
• Special bases terminate this process early.
![Page 23: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/23.jpg)
• A break at T is shown here.
• Measuring the lengths using electrophoresis allows us to get the position of each T
• The same can be done with every nucleotide. Color coding can help separate different nucleotides
![Page 24: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/24.jpg)
• Automated detectors ‘read’ the terminating bases.
• The signal decays after about 1000 bases, which limits the length of a single read.
![Page 25: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/25.jpg)
Sequencing Genomes: Clone by Clone
• Clones are constructed to span the entire length of the genome.
• These clones are ordered and oriented correctly (Mapping)
• Each clone is sequenced individually
![Page 26: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/26.jpg)
Shotgun Sequencing
• Shotgun sequencing of clones was considered viable
• However, researchers in 1999 proposed shotgunning the entire genome.
![Page 27: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/27.jpg)
Library
• Create vectors of the sequence and introduce them into bacteria. As bacteria multiply you will have many copies of the same clone.
![Page 28: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/28.jpg)
Sequencing
![Page 29: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/29.jpg)
Questions
• Algorithmic: How do you put the genome back together from the pieces? Will be discussed in the next lecture.
• Statistical: How many pieces do you need to sequence, etc.?
  – The answer to the statistical questions had already been given in the context of mapping, by Lander and Waterman.
![Page 30: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/30.jpg)
Lander Waterman Statistics
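G = Genome Length
L = Clone Length
N = Number of Clones
T = Required Overlap
c = Coverage = LN/G
α = N/G
θ = T/L
σ = 1-θ

[Figure: clones of length L placed along a genome of length G; a maximal set of overlapping clones forms an island]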
![Page 31: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/31.jpg)
LW statistics: questions
• As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island.
• Q1: What is the expected number of islands?
• Ans: N exp(-cσ), which reduces to N exp(-c) when the required overlap T is negligible (σ ≈ 1).
• As c grows, the number increases at first and then gradually decreases.
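The rise and fall can be seen by tabulating the formula, writing N as cG/L. This is a small sketch; the genome and clone lengths are hypothetical values chosen only for illustration.

```python
import math

def expected_islands(G, L, c, theta=0.0):
    """Expected number of islands, N * exp(-c * sigma), with N = c * G / L."""
    N = c * G / L
    sigma = 1 - theta
    return N * math.exp(-c * sigma)

G, L = 1_000_000, 1_000  # hypothetical genome and clone lengths
for c in (0.25, 0.5, 1, 2, 4, 8):
    print(c, round(expected_islands(G, L, c), 1))
```

For fixed G and L the expression is proportional to c·exp(-cσ), which peaks at c = 1/σ: below that point new clones mostly start new islands, and above it they mostly merge existing ones.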
![Page 32: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/32.jpg)
Analysis: Expected Number Islands
• Computing Expected # islands.
• Let $X_i = 1$ if an island ends at position i, $X_i = 0$ otherwise.
• Number of islands = $\sum_i X_i$
• Expected # islands = $E(\sum_i X_i) = \sum_i E(X_i)$
![Page 33: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/33.jpg)
Prob. of an island ending at i
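• $E(X_i)$ = Prob(island ends at position i)
• = Prob(a clone began at position i-L+1 AND no clone began in the next L-T positions)

[Figure: a clone of length L ending at position i, with the required overlap T marked at its right end]

$$E(X_i) = \alpha (1-\alpha)^{L-T} \approx \alpha e^{-\alpha (L-T)} = \alpha e^{-c\sigma}$$

$$\text{Expected \# islands} = \sum_i E(X_i) = G \alpha e^{-c\sigma} = N e^{-c\sigma}$$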
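A quick Monte Carlo check of the island count, as a sketch under simplifying assumptions: clone start positions are drawn uniformly and independently, all clones have the same length, and two consecutive clones belong to the same island when they overlap by at least T. The parameter values and the seed are arbitrary.

```python
import math
import random

def count_islands(G, L, N, T, rng):
    """Place N clones of length L uniformly on [0, G] and count islands."""
    starts = sorted(rng.uniform(0, G - L) for _ in range(N))
    islands = 1
    for prev, nxt in zip(starts, starts[1:]):
        # Equal clone lengths: consecutive starts more than L - T apart
        # means the overlap is below T, so a new island begins.
        if nxt - prev > L - T:
            islands += 1
    return islands

def simulate(G=100_000, L=100, c=2.0, T=0, trials=20, seed=0):
    """Return (simulated mean island count, predicted N * exp(-c * sigma))."""
    rng = random.Random(seed)
    N = int(c * G / L)
    mean = sum(count_islands(G, L, N, T, rng) for _ in range(trials)) / trials
    predicted = N * math.exp(-c * (1 - T / L))
    return mean, predicted

mean, predicted = simulate()
print(mean, predicted)
```

The two numbers should agree to within a few percent; small differences come from sampling noise and edge effects at the ends of the genome.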
![Page 34: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/34.jpg)
LW statistics
• Pr[Island contains exactly j clones]?
• Consider an island that has already begun. With probability $e^{-c\sigma}$, the current clone is not continued by another clone, and the island ends. Therefore
• Pr[Island contains exactly j clones] =

$$(1 - e^{-c\sigma})^{j-1} e^{-c\sigma}$$

• Expected # j-clone islands

$$= N e^{-c\sigma} (1 - e^{-c\sigma})^{j-1} e^{-c\sigma}$$
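As a sanity check on these formulas, the expected counts of j-clone islands should add up to the expected number of islands, and the clones they contain should add up to N. The sketch below verifies both numerically; the values of N, c and σ are arbitrary illustrations.

```python
import math

def expected_j_clone_islands(N, c, sigma, j):
    """Expected number of islands containing exactly j clones."""
    p_end = math.exp(-c * sigma)  # probability that a clone ends its island
    return N * p_end * (1 - p_end) ** (j - 1) * p_end

N, c, sigma = 2000, 2.0, 1.0  # illustrative values
counts = [expected_j_clone_islands(N, c, sigma, j) for j in range(1, 400)]

total_islands = sum(counts)
total_clones = sum(j * x for j, x in enumerate(counts, start=1))
print(total_islands)  # should be close to N * exp(-c * sigma)
print(total_clones)   # should be close to N
```

The sum is truncated at j = 399, which is harmless here because the geometric tail beyond that point is negligible.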
![Page 35: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/35.jpg)
Expected # of clones in an island
$$e^{c\sigma}$$
Why?
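One way to answer this, sketched here from the formulas already given: the number of clones in an island is geometric with success probability $e^{-c\sigma}$, so its mean is the reciprocal of that probability. Equivalently, N clones are shared among an expected $N e^{-c\sigma}$ islands.

```latex
\begin{aligned}
E[\text{clones per island}]
  &= \sum_{j \ge 1} j \,(1 - e^{-c\sigma})^{j-1} e^{-c\sigma}
   = \frac{1}{e^{-c\sigma}}
   = e^{c\sigma}, \\
\frac{N}{N e^{-c\sigma}} &= e^{c\sigma}.
\end{aligned}
```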
![Page 36: CSE182-L12](https://reader035.fdocuments.net/reader035/viewer/2022070418/568157ed550346895dc56336/html5/thumbnails/36.jpg)
Expected length of an island
$$L \left[ \left( \frac{e^{c\sigma} - 1}{c} \right) + (1 - \sigma) \right]$$
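The formula can be evaluated directly. The sketch below also checks one limiting case: as the coverage c approaches 0, islands are almost always single clones, so the expected length should approach L. The numeric values are illustrative assumptions.

```python
import math

def expected_island_length(L, c, sigma):
    """L * [ (exp(c*sigma) - 1) / c + (1 - sigma) ]."""
    # expm1 keeps precision when c * sigma is very small.
    return L * (math.expm1(c * sigma) / c + (1 - sigma))

L = 1000  # hypothetical clone length
for c in (0.5, 1, 2, 5):
    print(c, round(expected_island_length(L, c, sigma=1.0)))

# Limiting case: at very low coverage an island is a single clone.
print(expected_island_length(L, 1e-6, sigma=0.9))
```

The expected length grows roughly like exp(cσ)/c, which is why a modest increase in coverage merges islands into much longer contiguous stretches.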