Lecture #3: Finding genes 9/9/09
Looking at my lecture slides
This has the main points
But sometimes there is more information in the info panel below. Use the “Normal view” to see that panel.
              HGP              Venter                 Watson              Quake, Stanford
Method        Clone by clone   Whole genome shotgun   454 Life Sciences   Heliscope single molecule seq
Coverage      8-10x            7.5x                   7.4x                28x
Read length   600 bp           700 bp                 230 bp              20-70 bp
Time          13 yrs           4 yrs                  4.5 months          ?
Cost          $2.7B            $100M                  $1.5M               $50k
Authors       2800             31                     27                  3
Last lecture and homework
Questions on genomes??
Did you learn anything interesting about genomes??
Big picture
Questions
Evolution of sensory systems
How complex are they?
Gather info
Diversity of receptors
Signal transduction pathways
Hypothesis
If we can find the genes for receptors and signal transduction, we will understand how the senses work
Questions for today
1. What is a gene?
2. How many genes are in a genome?
3. How do you find a gene in Genbank?
4. How do you find a gene in a genome?
Q1. What is a gene?
DNA sequence necessary to make an RNA message
“A locatable region of genomic sequence, which is associated with regulatory regions, transcribed regions, or other functional sequence regions” - Sequence Ontology Consortium
Gene = exons + introns
Not all exons code for protein
[Diagram: Exon1 - intron - Exon2 - intron - Exon3, with the 5’UTR and 3’UTR labeled at the two ends]
Gene = exons + introns + regulatory elements
Promoter and enhancer elements upstream, downstream or in introns
[Diagram: Exon1, Exon2, Exon3 with 5’UTR and 3’UTR, as on the previous slide]
Central dogma*
DNA → mRNA → protein
Information flows from DNA to RNA to protein
*Francis Crick, 1958
Gene is transcribed to mRNA and translated to protein
[Diagram: DNA (Exon1, Exon2, Exon3) → mRNA (Exon1 Exon2 Exon3 followed by AAAAAAAA) → protein]
Transcription: DNA → mRNA
Introns are spliced out and polyA tail is added to make mRNA
[Diagram: in the DNA, each intron is bounded by GT ... AG; the mRNA is Exon1 Exon2 Exon3 followed by AAAAAA]
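A minimal sketch of the splicing step, assuming the intron positions are already known. The toy gene, the coordinates and the function name `splice` are made up for illustration; they are not from the lecture.

```python
def splice(pre_mrna, introns, poly_a_length=8):
    """Remove introns given as (start, end) half-open spans, then add a poly-A tail.

    Each intron is checked against the GT ... AG rule from the slide.
    """
    pieces = []
    position = 0
    for start, end in sorted(introns):
        intron = pre_mrna[start:end]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError(f"intron {start}-{end} is not bounded by GT ... AG")
        pieces.append(pre_mrna[position:start])  # keep the exon before this intron
        position = end
    pieces.append(pre_mrna[position:])           # keep the last exon
    return "".join(pieces) + "A" * poly_a_length


# Hypothetical toy gene: three short exons separated by two introns.
gene = "ATGGCC" + "GTAAGTTTTCAG" + "GATTGG" + "GTACGCTTTTAG" + "CCTTAA"
introns = [(6, 18), (24, 36)]
mrna = splice(gene, introns)
print(mrna)
```

The sequence is kept in DNA letters (T rather than U) to match the slides.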
Translation: mRNA → protein
Protein goes from start to stop
[Diagram: the protein spans the mRNA from the start codon ATG to the stop codon TAA]
Q2. How many genes in a genome?
Species                    Genome (Mb)   # genes
Escherichia coli           4.6           4377
Saccharomyces cerevisiae   12            5770
Caenorhabditis elegans     100           20099
Drosophila melanogaster    132           14651
Homo sapiens               3000          ????
# human genes
GeneSweep contest
Leading scientists bet on the # of human genes before sequence was available
Estimates ranged from 26,000 to 150,000
Humans are more complex so should have more genes
And the winner is ….
# human genes
Lee Rowen, U Wash had the lowest number (25,947)
Predicted # in 2003:
24,500 genes
New estimates put this number at 23,299
# genes is proportional to genome size to a point
[Plot: # of genes (0 to 35,000) against genome size in Mb (0 to 3,500) for E. coli, yeast, fly, worm, puffer, stickleback, zebrafish, mouse and human]
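One way to see "proportional to a point" is to compute genes per Mb. This sketch uses the genome sizes and gene counts from the Q2 table, plus the 2003 prediction of 24,500 for human. The variable names are just for illustration.

```python
# (genome size in Mb, number of genes) as quoted in the lecture
genomes = {
    "Escherichia coli": (4.6, 4377),
    "Saccharomyces cerevisiae": (12, 5770),
    "Caenorhabditis elegans": (100, 20099),
    "Drosophila melanogaster": (132, 14651),
    "Homo sapiens": (3000, 24500),  # 2003 prediction, not a measured count
}

# Genes per Mb: if gene number scaled with genome size, this would be constant.
density = {species: genes / size_mb for species, (size_mb, genes) in genomes.items()}

for species, per_mb in density.items():
    print(f"{species:26s} {per_mb:8.1f} genes per Mb")
```

The density drops by about two orders of magnitude from E. coli to human, so the big genomes are mostly not genes.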
Problem - How to predict genes?
Assumptions:
Exons are GC rich
Exons can’t contain stop codons
Introns bounded by GT … AG
Exons and introns fall in certain size range
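Two of these assumptions are easy to try by hand. The sketch below measures GC content and scans the forward strand for open reading frames (ATG up to the first in-frame stop). It is a toy illustration, not how real gene predictors work; they are trained statistical models. The function names and the test sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(dna):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in dna) / len(dna)


def find_orfs(dna, min_codons=3):
    """Return (start, end) half-open spans of ATG ... stop on the forward strand."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # include the stop codon
                start = None                    # close it and keep scanning
    return sorted(orfs)


toy = "CCATGGCTTGGTAACC"  # hypothetical sequence
print(gc_content(toy), find_orfs(toy))
```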
Problems with gene predictions
Genes are predicted which have never been detected as transcripts
Are they real?
C. elegans ORFeome (Brent et al)
Go looking for transcripts that are predicted but not yet observed
Find some of the predictions but not others
Use this to improve predictions
Some genes are unusual: Titin has >200 exons and is 101,519 bp!
From proteins
First genes found from proteins
Isolate and purify
Protein sequencing to determine AA order
Can design primers to AA sequence to isolate full length DNA
From mRNA
Isolate mRNA
Reverse transcribe it to make cDNA
Clone into vector
Pick clones to make a library
Sequence
Transcriptome = set of all mRNAs
Specific to
Organism
Tissue
Developmental stage
Try to catalog all genes expressed
cDNA library
Tissue specific genes
From targeted search - PCR
PCR amplify using primers specific to a gene of interest
FASTA is: >identifier, then the sequence
>identifier
sequence
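A minimal sketch of reading this format, assuming the whole header line after ">" is used as the identifier and that a sequence may be wrapped over several lines. The function name and the two records are made up for illustration.

```python
def parse_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                          # skip blank lines
        if line.startswith(">"):
            identifier = line[1:].strip()     # header line starts a new record
            records[identifier] = []
        elif identifier is not None:
            records[identifier].append(line)  # sequence lines are joined below
    return {name: "".join(parts) for name, parts in records.items()}


example = """>gene_one
TCAAGCAAACTAGA
CAACAGAAGATGGC
>gene_two
ATGGCC
"""
sequences = parse_fasta(example)
print(sequences)
```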
Q#4: How do we find a gene in a genome?
This is harder than targeted sequencing. A genome is just a string of letters
TCAAGCAAACTAGACAACAGAAGATGGCTTGGGAAGGAGGAATTGAGCCCAATGGCACTGAAGGCAAGAACTTCTACATTCCCATGTCCAACAGGACTGGGATTGTTAGAAGTCCTTTTGAGTACACTCAGTATTACCTGGCAGACCCGATCTTTTTCAAGCTCCTGGCTTTCTACATGTTCTTCCTGATCTGCACTGGGACTCCCATCAACAGCCTGACATTGTTTGTAACTGCTCAGAACAAGAAGCTCCGGCAACCTCTCAACTATATCCTGGTCAACCTGGCTGTGGCTGGACTCATCATGTGCTGCTTTGGATTCACCATCACCATCACATCAGCTTTTAATGGCTACTTCATTCTTGGATCCACCTTCTGTGCAATTGAGGGATTCATGGCCACACTAGGAGGTAAGCAAGAAGTCAGATCCTTTTCAGGATCCTTTCTATTTCATTGGCAGATATACAATATCAATGAATAACTCACCTTTTCTGTCTACAGGTGAAGTTGCTCTCTGGTCACTTGTTGTCCTGGCTATTGAGAGATACATTGTGGTCTGCAAACCCATGGGAAGCTTCAAGTTCTCTGGAGCTCATGCTGGTGCTGGAGTACTCTTCACATGGATCATGGCAATGGCTTGTGCTGCACCTCCACTCTTTGGATGGTCCAGGTACTCAAATATTTCTTAATATTTTATTTAGTTAAACAGCCTTTTGTACTTTTAAGCCACTAACTTGAAAATTAGATATTTTCACTCACAAAACTAGATGTTAAATGTAAAGAGCTATTTTTACTGAATGAGGACATGACTTTCTTTCTCACAATCTCACACAGGTACATTCCTGAGGGAATGCAGTGTTCCTGTGGTCCTGACTACTACACACTGGCTCCAGGTTTCAACAATGAGTCATATGTCATCTACATGTTTGTTGTTCACTTCTTCGTTCCTGTCTTCATCATTTTCTTCACTTATGGAAGCCTTGTGATGACAGTCAAAGCTGTAAGTGAAGCTAAAGTTCTTAAATTATTTATAATAGATCAGTTAATCTAACCAAGCTAGCCATTGCTACTGTGGAATTTATTGGTGTTACATTAACCTAACTGACCATAAAACTTAACAGGCAGCAGCACAGCAGCAGGACTCAGCTTCTACCCAGAAAGCTGAGAAGGAAGTGACCCGTATGTGTGTCTTGATGGTCATGGGCTTCCTAATAGCTTGGACACCGTATGCTAGCTTTGCTGGTTGGATTTTCATGAACAAGGGAGCTTCTTTCAGTGCCCTCACTGCAGCCATCCCTGCTTTCTTTGCAAAAAGCTCAGCCTTGTACAACCCTGTTATCTACGTGCTAATGAACAAACAGGTTGGTGTTACGTATTCTCATAGTTTTCTTCTCTGTTTTTCAGTCTTTTGCTGTTTACTGAGTTTCTGAATTGGCTGTCTTTCAGTTCCGTAACTGCATGCTATCCACCATTGGAATGGGCGGCATGGTGGAGGATGAGACCTCAGTTTCAACAAGCAAGACAGAGGTGTCCTCTGTGTCTTAATCTTGATGGCATCTTCAGATATAAGGACACTGATGATCGCTCGCAAATTTTCAAAATTCCCATTAGA
How do we know if genes are there?
Gene predictions
Train algorithm to find genes
Annotation
Comparisons with messenger RNA
Comparisons with known genes in other organisms
Comparisons with mRNA
[The next three slides repeat the genomic sequence shown above, marked up as Exons and Introns]
Comparing genes with other species
Hypothesis
Genes perform same job in different organisms
Sequence needed to perform function will be conserved
Conservation
Highest in exons
Lowest in introns, UTRs
BLAST
Use Basic Local Alignment Search Tool to find sequences in database which are similar to one you have in hand
Your query sequence is compared to all sequences in database and given a score based on the alignment
Ways to use BLAST
Find sequence in a database
Find sequence in a genome
Can use pairwise BLAST to compare gene in one organism with another
Compare known mRNA vs genomic region
How BLAST works
1. Divides query sequence and database sequences into “words”
2. Compares all words and finds ones that are similar
3. Calculates a similarity score based on scoring method
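The first two steps can be sketched as follows. This is a simplification that only finds exact word matches with a made-up word size of 4; the real program is more elaborate. The function names are hypothetical.

```python
from collections import defaultdict


def split_into_words(sequence, word_size):
    """All overlapping words of length word_size, with their start positions."""
    return [(i, sequence[i:i + word_size])
            for i in range(len(sequence) - word_size + 1)]


def find_word_hits(query, subject, word_size=4):
    """Return (query_pos, subject_pos) for every word the two sequences share."""
    index = defaultdict(list)
    for pos, word in split_into_words(subject, word_size):
        index[word].append(pos)              # look-up table of subject words
    return [(q_pos, s_pos)
            for q_pos, word in split_into_words(query, word_size)
            for s_pos in index.get(word, [])]


hits = find_word_hits("ACTGCGT", "ACTGCCT")
print(hits)
```

Each hit is a seed: a place where the query and a database sequence agree well enough to be worth scoring.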
Compare two nucleotide sequences
Score them for similarity
Match = +2
Mismatch = -3
Query  A  C  T  G  C  G  T   compared to
Dbase  A  C  T  G  C  C  T
      +2 +2 +2 +2 +2 -3 +2   = score of 9
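The arithmetic on this slide can be checked with a few lines. A minimal sketch, assuming the two sequences are the same length and are compared position by position with no gaps; the function name is made up.

```python
def score_ungapped(query, dbase, match=2, mismatch=-3):
    """Sum +2 for every matching position and -3 for every mismatch."""
    if len(query) != len(dbase):
        raise ValueError("sequences must be the same length")
    return sum(match if q == d else mismatch for q, d in zip(query, dbase))


slide_score = score_ungapped("ACTGCGT", "ACTGCCT")
print(slide_score)  # six matches and one mismatch
```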
BLAST compare
Then extend word on either side to see how long a match can be made
Score match
Q GTCAACACTGCGTACGTCC
D CGTAACACTGCCTACTCAG
Matched stretch AACACTGCGTAC / AACACTGCCTAC: 2 2 2 2 2 2 2 2 -3 2 2 2 = score of 19
Do this between query and every entry in the database
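A sketch of the extension step, using the slide's two sequences and the +2 / -3 scores. The stopping rule here (give up once the running score falls more than `x_drop` below the best seen) and the seed position are assumptions made for the example; the slide does not say when to stop extending.

```python
def extend_seed(query, subject, start, word_len, match=2, mismatch=-3, x_drop=5):
    """Extend an ungapped seed left and right; return (left, right, score).

    The seed is query[start:start+word_len] aligned to the same positions in
    subject. The returned span is half-open: query[left:right].
    """
    def pair(i):
        return match if query[i] == subject[i] else mismatch

    score = sum(pair(i) for i in range(start, start + word_len))

    # Extend to the right, remembering where the running gain was highest.
    best_gain, gain, right = 0, 0, start + word_len
    for i in range(start + word_len, min(len(query), len(subject))):
        gain += pair(i)
        if gain > best_gain:
            best_gain, right = gain, i + 1
        elif best_gain - gain > x_drop:
            break
    score += best_gain

    # Extend to the left in the same way.
    best_gain, gain, left = 0, 0, start
    for i in range(start - 1, -1, -1):
        gain += pair(i)
        if gain > best_gain:
            best_gain, left = gain, i
        elif best_gain - gain > x_drop:
            break
    score += best_gain
    return left, right, score


q = "GTCAACACTGCGTACGTCC"
d = "CGTAACACTGCCTACTCAG"
left, right, score = extend_seed(q, d, start=4, word_len=4)  # seed word "ACAC"
print(q[left:right], d[left:right], score)
```

With these settings the extension runs through the single mismatch and recovers the 12-letter stretch scored on the slide.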
BLAST compare
Calculate E, the number of matches this good that you would expect to find by chance
If E is small, match is significant
E < 1 x 10^-100
If E is big, match could have occurred randomly
E = 1
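For reference, the usual form of this calculation is E = K · m · n · e^(-λS), where S is the alignment score, m the query length and n the total database length. A sketch is below; the values of `k` and `lam` are placeholders chosen only so the code runs, since the real values depend on the scoring scheme and are not given in the lecture.

```python
import math


def expected_chance_hits(score, query_len, db_len, k=0.1, lam=0.3):
    """E = K * m * n * exp(-lambda * S); k and lam are placeholder values."""
    return k * query_len * db_len * math.exp(-lam * score)


# Higher scores give smaller E; bigger databases give bigger E.
for s in (20, 50, 100):
    print(s, expected_chance_hits(s, query_len=100, db_len=10**6))
```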
Types of blast queries
Type      Query                   Database
Blastn    Nucleotide              Nucleotide
Blastp    Protein                 Protein
Blastx    Translated nucleotide   Protein
Tblastn   Protein                 Translated nucleotide
Tblastx   Translated nucleotide   Translated nucleotide
Protein sequences will be more similar than nucleotide sequences
Redundancy of amino acid code
CCA, CCC, CCG, CCT = P, proline
So two nucleotide sequences could differ at 1/3 of their bases but the protein sequences would be identical
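To see this redundancy in action, here is a small translation sketch using the standard genetic code. The two 9-base sequences are made up; they differ at every third position (3 of 9 bases) and still give the same protein.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate DNA in frame 0, ignoring any incomplete codon at the end."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("CCACCCCCGCCT"))                    # the four proline codons
print(translate("CCAGCTGGA"), translate("CCCGCCGGG"))
```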
If we compare coding sequences
Amino acids should be similar in important parts of molecule
More likely to keep same chemical character so change within AA group
Non-polar A, G, I, L, P, V
Polar M, N, Q, S, T
Aromatic F, W, Y
Polar + charge H, K, R
Polar - charge D, E
Cysteine C
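The grouping above can be written as a look-up so that a substitution can be labeled as staying within a group or not. A minimal sketch using exactly the groups listed on the slide; the names `AA_GROUPS` and `same_group` are made up.

```python
# Amino acid groups as listed on the slide (one-letter codes)
AA_GROUPS = {
    "non-polar": "AGILPV",
    "polar": "MNQST",
    "aromatic": "FWY",
    "polar + charge": "HKR",
    "polar - charge": "DE",
    "cysteine": "C",
}

# Reverse look-up: amino acid -> group name
GROUP_OF = {aa: group for group, members in AA_GROUPS.items() for aa in members}


def same_group(aa1, aa2):
    """True if swapping aa1 for aa2 keeps the same chemical character."""
    return GROUP_OF[aa1] == GROUP_OF[aa2]


print(same_group("D", "E"))  # both negatively charged
print(same_group("A", "D"))  # non-polar to negatively charged
```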
Need a way to quantify or score AA similarity
Compare set of real protein sequences
Determine probability that AA switches from one to another one
Give each change a score
+ score = more likely to occur
- score = less likely to occur
Similarity matrix gives scores
A = alanine (NP), C = cysteine (C), D = aspartic acid (-P), E = glutamic acid (-P), F = phenylalanine (A), G = glycine (NP), H = histidine (+P)
(NP = non-polar, C = cysteine, -P = polar - charge, A = aromatic, +P = polar + charge)
Different matrices are used depending on how different proteins / organisms are
Generate different matrices for proteins which are less or more different
BLOSUM 80 proteins 80% similar
BLOSUM 62 62%
BLOSUM 45 45%
Gather different protein sets to make each of these matrices
Similarity matrix
[Matrix figure for the same amino acids (A, C, D, E, F, G, H); the only entries that survive here are 6 and -2]
BLAST compare
Score and select best word match
Query  A G D E C D A
Data   A G D D C D G
       4 6 6 2 9 6 0  = 33
Query  A G D E C D A
Data   A A H E C D A
       4 0 1 5 9 6 4  = 29
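The two word scores above can be reproduced with a small look-up table. The sketch below contains only the pair scores that appear on this slide, not a full similarity matrix, and the names `SLIDE_SCORES` and `score_words` are made up.

```python
# Pair scores read off the slide; keys are sorted so (D, E) also covers (E, D).
SLIDE_SCORES = {
    ("A", "A"): 4, ("A", "G"): 0, ("C", "C"): 9, ("D", "D"): 6,
    ("D", "E"): 2, ("D", "H"): 1, ("E", "E"): 5, ("G", "G"): 6,
}


def score_words(query, data):
    """Sum the similarity score of each aligned pair of amino acids."""
    return sum(SLIDE_SCORES[tuple(sorted(pair))] for pair in zip(query, data))


first = score_words("AGDECDA", "AGDDCDG")
second = score_words("AGDECDA", "AAHECDA")
print(first, second)  # the higher-scoring word is the better match
```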
Blast choices
What database do you want to search?
What do you want to compare?
What program do you want to do the comparing?