Lecture #3: Finding genes

9/9/09


Looking at my lecture slides

This has the main points

But there is more information sometimes in the info panel below. Use the “Normal view” to see that panel.

              HGP              Venter                 Watson              Quake, Stanford
Method        Clone by clone   Whole genome shotgun   454 Life Sciences   Heliscope single molecule seq
Coverage      8-10x            7.5x                   7.4x                28x
Read length   600 bp           700 bp                 230 bp              20-70 bp
Time          13 yrs           4 yrs                  4.5 months          ?
Cost          $2.7B            $100M                  $1.5M               $50k
Authors       2800             31                     27                  3

Last lecture and homework

Questions on genomes?

Did you learn anything interesting about genomes?

Big picture

Questions
  Evolution of sensory systems
  How complex are they?
Gather info
  Diversity of receptors
  Signal transduction pathways

Hypothesis

If we can find the genes for receptors and signal transduction, we will understand how the senses work.

Questions for today

1. What is a gene?
2. How many genes are in a genome?
3. How do you find a gene in GenBank?
4. How do you find a gene in a genome?

Q1. What is a gene?

DNA sequence necessary to make an RNA message

“A locatable region of genomic sequence, which is associated with regulatory regions, transcribed regions, or other functional sequence regions” - Sequence Ontology Consortium

Gene = exons + introns

Not all exons code for protein

[Diagram: Exon1 - intron - Exon2 - intron - Exon3, with the 5’UTR and 3’UTR at the two ends]

Gene = exons + introns + regulatory elements

Promoter and enhancer elements upstream, downstream or in introns

[Diagram: Exon1, Exon2, Exon3 with 5’UTR and 3’UTR]

Central dogma*

DNA → mRNA → protein

Information flows from DNA to RNA to protein

*Francis Crick, 1958

Gene is transcribed to mRNA and translated to protein

[Diagram: DNA (Exon1, Exon2, Exon3) → mRNA (Exon1 Exon2 Exon3 AAAAAAAA) → protein]

Transcription: DNA → mRNA

Introns are spliced out and polyA tail is added to make mRNA

[Diagram: in the DNA, each intron starts with GT and ends with AG; the mRNA is Exon1 Exon2 Exon3 AAAAAA]
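The splicing step can be sketched in a few lines of Python. This is only an illustration: the function name `splice`, the toy sequence and the intron coordinates are made up for the example, and the length of the polyA tail is an arbitrary choice.

```python
def splice(pre_mrna, introns, poly_a=6):
    """Remove introns (given as (start, end) slices) and add a polyA tail.

    Each intron is checked against the GT ... AG rule from the slide.
    The coordinates and the tail length are illustrative assumptions.
    """
    exons = []
    pos = 0
    for start, end in sorted(introns):
        intron = pre_mrna[start:end]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError(f"intron {start}-{end} is not bounded by GT ... AG")
        exons.append(pre_mrna[pos:start])  # keep the exon before this intron
        pos = end
    exons.append(pre_mrna[pos:])  # last exon
    return "".join(exons) + "A" * poly_a


# Toy gene: exon, one intron (GTAAGTTTTCAG), exon
toy = "ATGGCC" + "GTAAGTTTTCAG" + "GGTTAA"
print(splice(toy, [(6, 18)]))
```

The sequence is written with T rather than U, as on the slides.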

Translation: mRNA → protein

Protein goes from start to stop

[Diagram: mRNA (Exon1 Exon2 Exon3 AAAAAA) with the protein running from the ATG start codon to the TAA stop codon]
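"Start to stop" can be made concrete with a small sketch that reads codons from the first ATG until it reaches an in-frame stop codon. The function name `find_orf` and the example sequence are made up for illustration, and real gene finders do considerably more than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orf(mrna):
    """Return the region from the first ATG to the first in-frame stop codon.

    The stop codon is included. Returns None if there is no ATG or no
    in-frame stop after it.
    """
    start = mrna.find("ATG")
    if start == -1:
        return None
    # Step through the message three bases at a time, in the frame set by ATG
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return mrna[start:i + 3]
    return None


print(find_orf("GGATGGCTTGGTAACC"))  # ATGGCTTGGTAA
```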

Q2. How many genes in a genome?

Species                    Genome (Mb)   # genes
Escherichia coli           4.6           4377
Saccharomyces cerevisiae   12            5770
Caenorhabditis elegans     100           20099
Drosophila melanogaster    132           14651
Homo sapiens               3000          ????

# human genes

GeneSweep contest
Leading scientists bet on the # of human genes before sequence was available
Estimates ranged from 26,000 to 150,000
Humans are more complex so should have more genes

And the winner is ….

# human genes

Lee Rowen, U Wash had the lowest number (25,947)

Predicted # in 2003: 24,500 genes

New estimates put this number at 23,299

# genes is proportional to genome size to a point

[Plot: # of genes (0 to 35,000) against genome size in Mb (0 to 3,500) for E. coli, yeast, fly, worm, puffer, stickleback, zebrafish, mouse and human]

Problem - How to predict genes?

Assumptions:
Exons are GC rich
Exons can’t contain stop codons
Introns bounded by GT … AG
Exons and introns fall in certain size range
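Each of these assumptions can be turned into a simple test on a candidate piece of sequence. The sketch below is a toy, not a real gene predictor: the function names and the intron size limits (`min_len`, `max_len`) are assumptions chosen for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of bases that are G or C."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def has_inframe_stop(seq, frame=0):
    """True if reading seq in the given frame (0, 1 or 2) hits a stop codon."""
    return any(seq[i:i + 3] in STOP_CODONS for i in range(frame, len(seq) - 2, 3))


def looks_like_intron(seq, min_len=20, max_len=10000):
    """GT ... AG boundaries plus a size range (the limits are made-up values)."""
    return seq.startswith("GT") and seq.endswith("AG") and min_len <= len(seq) <= max_len


print(gc_content("ATGC"))
print(has_inframe_stop("ATGTAA"))
print(looks_like_intron("GT" + "A" * 20 + "AG"))
```

A real predictor combines evidence like this statistically rather than applying each rule as a hard yes or no.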

Gene predicting…

[Figure: Pennisi, Science 2003]

Gene predicting is hard

[Figure: Pennisi, Science 2003]

Problems with gene predictions

Genes are predicted which have never been detected as transcripts
Are they real?

C. elegans ORFeome (Brent et al)
Go looking for transcripts that are predicted but not yet observed
Find some of the predicted transcripts but not others
Use this to improve predictions

Some genes are unusual: Titin has >200 exons and is 101,519 bp!

Q#3 How do we find genes de novo?

From proteins

First genes found from proteins
Isolate and purify
Protein sequencing to determine AA order
Can design primers to AA sequence to isolate full length DNA

From mRNA

Isolate mRNA
Reverse transcribe it to make cDNA
Clone into vector
Pick clones to make a library
Sequence

Transcriptome = set of all mRNAs

Specific to
  Organism
  Tissue
  Developmental stage
Try to catalog all genes expressed

cDNA library

Tissue specific genes

From targeted search - PCR

PCR amplify using primers specific to a gene of interest

Sequences deposited in GenBank

Accession number

Sequence type

To get fasta sequence

FASTA is:
>identifier
sequence

Protein sequence

From protein display, get fasta protein sequence

FASTA
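Since a FASTA record is just a ">identifier" line followed by sequence lines, it is easy to read with a few lines of Python. This is a minimal sketch: the function name `parse_fasta` and the two example records are made up, and for real work an established library parser is the safer choice.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping identifier line -> sequence.

    Sequence lines belonging to one record are joined together.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]  # drop the leading ">"
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


example = ">seq1 demo\nACGT\nTTAA\n>seq2\nGGCC\n"
print(parse_fasta(example))
```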

Q#4: How do we find a gene in a genome?

This is harder than targeted sequencing. A genome is just a string of letters

TCAAGCAAACTAGACAACAGAAGATGGCTTGGGAAGGAGGAATTGAGCCCAATGGCACTGAAGGCAAGAACTTCTACATTCCCATGTCCAACAGGACTGGGATTGTTAGAAGTCCTTTTGAGTACACTCAGTATTACCTGGCAGACCCGATCTTTTTCAAGCTCCTGGCTTTCTACATGTTCTTCCTGATCTGCACTGGGACTCCCATCAACAGCCTGACATTGTTTGTAACTGCTCAGAACAAGAAGCTCCGGCAACCTCTCAACTATATCCTGGTCAACCTGGCTGTGGCTGGACTCATCATGTGCTGCTTTGGATTCACCATCACCATCACATCAGCTTTTAATGGCTACTTCATTCTTGGATCCACCTTCTGTGCAATTGAGGGATTCATGGCCACACTAGGAGGTAAGCAAGAAGTCAGATCCTTTTCAGGATCCTTTCTATTTCATTGGCAGATATACAATATCAATGAATAACTCACCTTTTCTGTCTACAGGTGAAGTTGCTCTCTGGTCACTTGTTGTCCTGGCTATTGAGAGATACATTGTGGTCTGCAAACCCATGGGAAGCTTCAAGTTCTCTGGAGCTCATGCTGGTGCTGGAGTACTCTTCACATGGATCATGGCAATGGCTTGTGCTGCACCTCCACTCTTTGGATGGTCCAGGTACTCAAATATTTCTTAATATTTTATTTAGTTAAACAGCCTTTTGTACTTTTAAGCCACTAACTTGAAAATTAGATATTTTCACTCACAAAACTAGATGTTAAATGTAAAGAGCTATTTTTACTGAATGAGGACATGACTTTCTTTCTCACAATCTCACACAGGTACATTCCTGAGGGAATGCAGTGTTCCTGTGGTCCTGACTACTACACACTGGCTCCAGGTTTCAACAATGAGTCATATGTCATCTACATGTTTGTTGTTCACTTCTTCGTTCCTGTCTTCATCATTTTCTTCACTTATGGAAGCCTTGTGATGACAGTCAAAGCTGTAAGTGAAGCTAAAGTTCTTAAATTATTTATAATAGATCAGTTAATCTAACCAAGCTAGCCATTGCTACTGTGGAATTTATTGGTGTTACATTAACCTAACTGACCATAAAACTTAACAGGCAGCAGCACAGCAGCAGGACTCAGCTTCTACCCAGAAAGCTGAGAAGGAAGTGACCCGTATGTGTGTCTTGATGGTCATGGGCTTCCTAATAGCTTGGACACCGTATGCTAGCTTTGCTGGTTGGATTTTCATGAACAAGGGAGCTTCTTTCAGTGCCCTCACTGCAGCCATCCCTGCTTTCTTTGCAAAAAGCTCAGCCTTGTACAACCCTGTTATCTACGTGCTAATGAACAAACAGGTTGGTGTTACGTATTCTCATAGTTTTCTTCTCTGTTTTTCAGTCTTTTGCTGTTTACTGAGTTTCTGAATTGGCTGTCTTTCAGTTCCGTAACTGCATGCTATCCACCATTGGAATGGGCGGCATGGTGGAGGATGAGACCTCAGTTTCAACAAGCAAGACAGAGGTGTCCTCTGTGTCTTAATCTTGATGGCATCTTCAGATATAAGGACACTGATGATCGCTCGCAAATTTTCAAAATTCCCATTAGA

How do we know if genes are there?

Gene predictions
  Train algorithm to find genes
Annotation
  Comparisons with messenger RNA
  Comparisons with known genes in other organisms

Comparisons with mRNA

[Slides: the same genomic sequence as above, shown three times, with a key for exons and introns]

Comparing genes with other species

Hypothesis
  Genes perform same job in different organisms
  Sequence needed to perform function will be conserved
Conservation
  Highest in exons
  Lowest in introns, UTRs

BLAST

Use Basic Local Alignment Search Tool to find sequences in database which are similar to one you have in hand

Your query sequence is compared to all sequences in database and given a score based on the alignment

The beauty of BLAST

You can find the needle in the haystack without suffering

Ways to use BLAST

Find sequence in a database
Find sequence in a genome
Can use pairwise BLAST to compare gene in one organism with another
Compare known mRNA vs genomic region

How BLAST works

1. Divides query sequence and database sequences into “words”
2. Compares all words and finds ones that are similar
3. Calculates a similarity score based on scoring method
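The first two steps can be sketched as follows. This toy version only finds identical words, whereas the slide says BLAST finds words that are "similar"; the function names and the word length used in the example are assumptions for illustration.

```python
def words(seq, k):
    """All overlapping words of length k in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shared_words(query, subject, k):
    """Return (query_pos, subject_pos) pairs where the same word occurs in both."""
    index = {}
    for j, word in enumerate(words(subject, k)):
        index.setdefault(word, []).append(j)  # word -> positions in subject
    return [(i, j) for i, word in enumerate(words(query, k)) for j in index.get(word, [])]


print(words("ACTGC", 3))               # ['ACT', 'CTG', 'TGC']
print(shared_words("ACTG", "TTACT", 3))  # [(0, 2)]
```

Indexing the words of one sequence first means each query word is looked up directly instead of being compared against every database word.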

Compare two nucleotide sequences

Score them for similarity
Match = +2
Mismatch = -3

Query  A  C  T  G  C  G  T
Dbase  A  C  T  G  C  C  T
Score  2  2  2  2  2  -3 2   = score of 9
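The arithmetic on this slide is six matches at +2 and one mismatch at -3, so 12 - 3 = 9. A minimal sketch of the same scoring (the function name `score_alignment` is made up; there are no gaps, as in the slide example):

```python
def score_alignment(query, subject, match=2, mismatch=-3):
    """Score two equal-length sequences position by position, without gaps."""
    if len(query) != len(subject):
        raise ValueError("sequences must be the same length")
    return sum(match if q == s else mismatch for q, s in zip(query, subject))


print(score_alignment("ACTGCGT", "ACTGCCT"))  # 9
```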

BLAST compare

Then extend word on either side to see how long a match can be made
Score match

Q      G  T  C  A  A  C  A  C  T  G  C  G  T  A  C  G  T  C  C
D      C  G  T  A  A  C  A  C  T  G  C  C  T  A  C  T  C  A  G
Score           2  2  2  2  2  2  2  2  -3 2  2  2               = score of 19

Do this between query and every entry in the database
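One way to sketch the extension step: start from a shared word, walk outward in each direction, and keep the longest stretch that improves the score. The stopping rule used here (give up once the running score falls more than `x_drop` below the best seen) and its value are assumptions for this example, as are the function name and the choice of the word ACTG as the starting point.

```python
def extend_seed(query, subject, q_start, s_start, k, match=2, mismatch=-3, x_drop=5):
    """Extend a shared word of length k in both directions, without gaps.

    Returns (score, query_start, query_end) for the best extended match,
    with query_end exclusive. x_drop is an illustrative cutoff.
    """
    def score(a, b):
        return match if a == b else mismatch

    seed_score = sum(score(query[q_start + i], subject[s_start + i]) for i in range(k))

    def extend(q_pos, s_pos, step):
        best = running = 0
        best_len = length = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            running += score(query[q_pos], subject[s_pos])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # score has dropped too far; stop extending
            q_pos += step
            s_pos += step
        return best, best_len

    right_score, right_len = extend(q_start + k, s_start + k, 1)
    left_score, left_len = extend(q_start - 1, s_start - 1, -1)
    total = seed_score + left_score + right_score
    return total, q_start - left_len, q_start + k + right_len


Q = "GTCAACACTGCGTACGTCC"
D = "CGTAACACTGCCTACTCAG"
# The word ACTG sits at position 6 (counting from 0) in both sequences
print(extend_seed(Q, D, 6, 6, 4))  # (19, 3, 15)
```

With these settings the extended match covers AACACTGCGTAC in the query and scores 19, the same region and score as on the slide.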

BLAST compare

Calculate E, the number of matches this good that would be expected by chance
If E is small, match is significant
  E < 1 x 10^-100
If E is big, match could have occurred randomly
  E = 1
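The slides do not give the formula for E. A commonly quoted form for ungapped local alignments is E = K * m * n * exp(-lambda * S), where S is the alignment score, m and n are the query and database lengths, and K and lambda depend on the scoring scheme. The sketch below uses that form; the default values of `k` and `lam` are placeholders chosen for illustration, not the values BLAST uses.

```python
import math


def expected_hits(score, query_len, db_len, k=0.1, lam=0.3):
    """E = K * m * n * exp(-lambda * S).

    k and lam are placeholder values for illustration only.
    """
    return k * query_len * db_len * math.exp(-lam * score)


# A higher score gives a smaller E for the same query and database sizes
print(expected_hits(20, 100, 1_000_000))
print(expected_hits(50, 100, 1_000_000))
```

The useful point for reading BLAST output is the shape of the relationship: E shrinks exponentially as the score goes up and grows with the size of the database.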

Types of blast queries

Type      Query                   Database
Blastn    Nucleotide              Nucleotide
Blastp    Protein                 Protein
Blastx    Translated nucleotide   Protein
Tblastn   Protein                 Translated nucleotide
Tblastx   Translated nucleotide   Translated nucleotide

Protein sequences will be more similar than nucleotide sequences

Redundancy of amino acid code
  CCA, CCC, CCG, CCT = P, proline

So two nucleotide sequences could differ at 1/3 of their bases while the protein sequences would be identical
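A small check of the 1/3 claim, using only the four proline codons from the slide. The two example sequences and the function names are made up; the codon table here is deliberately incomplete and covers proline only.

```python
# Proline codons only (from the slide); not a full codon table
CODON_TABLE = {"CCA": "P", "CCC": "P", "CCG": "P", "CCT": "P"}


def translate(dna):
    """Translate a DNA string codon by codon using CODON_TABLE."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def fraction_different(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


seq1 = "CCACCACCA"
seq2 = "CCTCCGCCC"
print(translate(seq1), translate(seq2))  # PPP PPP
print(fraction_different(seq1, seq2))    # every third base differs
```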

If we compare coding sequences

Amino acids should be similar in important parts of molecule

More likely to keep same chemical character so change within AA group
  Non-polar        A, G, I, L, P, V
  Polar            M, N, Q, S, T
  Aromatic         F, W, Y
  Polar + charge   H, K, R
  Polar - charge   D, E
  Cysteine         C
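The groups on this slide can be written as a lookup table, which gives a quick yes/no test for whether a substitution stays within a group. The names `AA_GROUPS`, `GROUP_OF` and `is_conservative` are made up for this sketch; the group memberships are the ones listed on the slide.

```python
# Amino acid groups as listed on the slide (one-letter codes)
AA_GROUPS = {
    "non-polar": "AGILPV",
    "polar": "MNQST",
    "aromatic": "FWY",
    "polar + charge": "HKR",
    "polar - charge": "DE",
    "cysteine": "C",
}

# Reverse lookup: amino acid -> group name
GROUP_OF = {aa: group for group, members in AA_GROUPS.items() for aa in members}


def is_conservative(a, b):
    """True if both amino acids fall in the same group."""
    return GROUP_OF[a] == GROUP_OF[b]


print(is_conservative("D", "E"))  # True: both polar with - charge
print(is_conservative("A", "D"))  # False: non-polar vs polar with - charge
```

A yes/no grouping like this is coarse, which is why the next slides move to numeric similarity scores.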

Need a way to quantify or score AA similarity

Compare set of real protein sequences

Determine probability that AA switches from one to another one
Give each change a score
  + score = more likely to occur
  - score = less likely to occur

Similarity matrix gives scores

A = alanine (NP), C = cysteine (C), D = aspartic acid (-P), E = glutamic acid (-P), F = phenylalanine (A), G = glycine (NP), H = histidine (+P)

[Matrix figure]

Different matrices are used depending on how different proteins / organisms are

Generate different matrices for proteins which are less or more different
  BLOSUM 80   proteins 80% similar
  BLOSUM 62   62%
  BLOSUM 45   45%

Gather different protein sets to make each of these matrices

BLAST compare

Compare protein sequences
Query     AGDECDA
Database  AGDDCDG
          AAHECDA

Similarity matrix

[Matrix figure, same amino acid key as the earlier similarity matrix slide; scores picked out: 6, 6, -2, -2]

BLAST compare

Score and select best word match

Query  A  G  D  E  C  D  A
Data   A  G  D  D  C  D  G
Score  4  6  6  2  9  6  0   = 33

Query  A  G  D  E  C  D  A
Data   A  A  H  E  C  D  A
Score  4  0  1  5  9  6  4   = 29
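The two totals can be reproduced with a lookup of pair scores. The table below contains only the pair scores that appear in this slide's two examples; it is not a complete similarity matrix, and the names `PAIR_SCORES` and `score_protein` are made up for the sketch.

```python
# Pair scores read off the two examples on the slide (not a full matrix)
PAIR_SCORES = {
    frozenset("A"): 4,    # A with A
    frozenset("G"): 6,    # G with G
    frozenset("D"): 6,    # D with D
    frozenset("E"): 5,    # E with E
    frozenset("C"): 9,    # C with C
    frozenset("ED"): 2,
    frozenset("AG"): 0,
    frozenset("DH"): 1,
}


def score_protein(query, subject):
    """Sum the pair scores position by position (no gaps)."""
    return sum(PAIR_SCORES[frozenset((q, s))] for q, s in zip(query, subject))


query = "AGDECDA"
for hit in ("AGDDCDG", "AAHECDA"):
    print(hit, score_protein(query, hit))
```

Using `frozenset` keys makes the lookup symmetric, so the score for D against E is the same as for E against D. The first database sequence scores 33 and the second 29, so the first is the better match.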

Blast choices

What database do you want to search?

What do you want to compare?

What program do you want to do the comparing?

Nucleotide BLAST

Choices for programs

Megablast
  Highly similar sequences, >95%
  Word length 28
Discontiguous megablast
  Pretty similar seqs
  Word length 11
Blastn
  Dissimilar seqs
  Word length 11