Last lecture summary
![Page 1: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/1.jpg)
Last lecture summary
![Page 2: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/2.jpg)
Sequencing strategies
• Hierarchical genome shotgun (HGS) – Human Genome Project
  • “map first, sequence second”
  • clone-by-clone: cloning is performed twice (BAC, plasmid)
![Page 3: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/3.jpg)
Sequencing strategies
• Whole genome shotgun (WGS) – Celera
  • shotgun, no mapping
• Coverage – the average number of reads representing a given nucleotide in the reconstructed sequence. HGS: 8, WGS: 20
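The coverage definition above can be turned into simple arithmetic: average coverage is the total number of sequenced bases divided by the genome length. The sketch below is only an illustration; the function names and the example numbers (600 bp reads, a 3 billion bp genome) are assumptions chosen to match figures mentioned elsewhere in these slides, not values from an actual project.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage = total sequenced bases / genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    genome = 3_000_000_000  # assumed human genome size in bp
    read_len = 600          # assumed Sanger read length in bp
    print(reads_needed(8, read_len, genome))   # reads for HGS-like coverage
    print(reads_needed(20, read_len, genome))  # reads for WGS-like coverage
```

Note that this is the average over the whole genome; individual positions may be covered by more or fewer reads.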
![Page 4: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/4.jpg)
Genome assembly
• reads, contigs, scaffolds
• base calling, sequence assembly, PHRED/PHRAP
![Page 5: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/5.jpg)
Human genome
• 3 billion bp, ~20 000 – 25 000 genes
• Only 1.1 – 1.4 % of the genome sequence codes for proteins.
• State of completion:
  • best estimate – 92.3% is complete
  • problematic unfinished regions: centromeres, telomeres (both contain highly repetitive sequences), some unclosed gaps
  • It is likely that the centromeres and telomeres will remain unsequenced until new technology is developed
• Genome is stored in databases
  • Primary database – GenBank (http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucleotide)
  • Additional data and annotation, tools for visualizing and searching
    • UCSC (http://genome.ucsc.edu)
    • Ensembl (http://www.ensembl.org)
![Page 6: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/6.jpg)
New stuff
![Page 7: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/7.jpg)
New generation sequencing (NGS)
• The completion of the human genome was just the start of the modern DNA sequencing era – “high-throughput next generation sequencing” (NGS).
• New approaches reduce time and cost.
• Holy Grail of sequencing – a complete human genome for below $1000.
• Archon X Prize
  • http://genomics.xprize.org/
  • A $10 million prize is to be awarded to the private company that is able to sequence 100 human genomes within 10 days at a cost of no more than $10 000 per genome
![Page 8: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/8.jpg)
1st and 2nd generation of sequencers
• 1st generation – ABI Prism 3700 (Sanger, fluorescence, 96 capillaries), used in HGP and in Celera
  • The Sanger method still beats NGS on read length (600 bp)
• 2nd generation – birth of HT-NGS in 2005. 454 Life Sciences developed the GS 20 sequencer, which combines PCR with pyrosequencing.
  • Pyrosequencing – sequencing-by-synthesis
  • Relies on detection of pyrophosphate release on nucleotide incorporation rather than chain termination with ddNTPs.
  • The release of pyrophosphate is detected by a flash of light (chemiluminescence).
  • Average read length: 400 bp
  • Roche GS-FLX 454 (successor of GS 20) was used for sequencing J. Watson’s genome.
![Page 9: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/9.jpg)
3rd generation
• The 2nd generation still uses PCR amplification, which may introduce base sequence errors or favor certain sequences over others.
• To overcome this, the emerging 3rd generation of sequencers performs single molecule sequencing (i.e. the sequence is determined directly from one DNA molecule, with no amplification or cloning).
• Compared to the 2nd generation, these instruments offer higher throughput, longer read lengths (~1000 bp), higher accuracy, a smaller amount of starting material, and lower cost
![Page 10: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/10.jpg)
NHGRI Costs
The National Human Genome Research Institute (NHGRI) tracks the costs associated with sequencing.
source: http://www.genome.gov/27541954
[Figure: NHGRI sequencing cost chart; annotations: “$0.19”, “transition to 2nd generation”]
![Page 11: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/11.jpg)
Which genomes were sequenced?
• http://www.ncbi.nlm.nih.gov/sites/genome
• GOLD – Genomes online database (http://www.genomesonline.org/)
  • information regarding complete and ongoing genome projects
![Page 12: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/12.jpg)
Important genomics projects
• The analysis of personal genomes has demonstrated how difficult it is to draw medically or biologically relevant conclusions from individual sequences.
• More genomes need to be sequenced to learn how genotype correlates with phenotype.
• 1000 Genomes project (http://www.1000genomes.org/) started in 2009. Its goal is to sequence the genomes of at least 1000 people from around the world to create a detailed and medically useful picture of human genetic variation.
  • 2nd generation sequencers are used in 1000 Genomes.
  • 10 000 Genomes will start soon.
![Page 13: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/13.jpg)
Important genomics projects
• ENCODE project (ENCyclopedia Of DNA Elements, http://www.genome.gov/ENCODE/)
  • by NHGRI
  • identify all functional elements in the human genome sequence
  • Defined regions of the human genome corresponding to 30 Mb (1%) have been selected.
  • These regions serve as the foundation on which to test and evaluate the effectiveness and efficiency of a diverse set of methods and technologies for finding various functional elements in human DNA.
![Page 14: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/14.jpg)
Rapid Evolution of Next Generation Sequencing Technologies
Data unit: approximately 10x coverage of the human genome
• 2000: Human genome working drafts took 10 years and cost about $3 billion
• 2008: Major genome centers can sequence the same number of base pairs every 4 days
  • 1000 Genomes project launched
  • World-wide capacity dramatically increasing
• 2009: Every 4 hours ($25,000)
• 2010: Every 14 minutes ($5,000)
  • Illumina HiSeq2000 machine produces 200 gigabases per 8 day run
![Page 15: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/15.jpg)
cDNA
1. isolate mRNA from suitable cells
2. convert it to complementary DNA (cDNA) using the enzyme reverse transcriptase (+ DNA polymerase)
• cDNA contains only expressed genes: no intergenic regions, no introns (just exons).
• Because the desired gene sequences usually still represent only a tiny proportion of the total cDNA population, the cDNA fragments are amplified by cloning/PCR.
• cDNA library – a library is defined simply as a collection of different DNA sequences that have been incorporated into a vector.
![Page 16: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/16.jpg)
ESTs
• Expressed Sequence Tag
• Their use was promoted by Craig Venter. At that time (1991) it was a revolutionary way of identifying genes.
• An EST is a short subsequence (200-800 bp) of a cDNA sequence. ESTs are unedited, randomly selected single-pass sequence reads derived from cDNA libraries.
• They can be generated either from the 5’ or from the 3’ end.
[Figure: mRNA, cDNA, 5’ ESTs and 3’ ESTs]
![Page 17: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/17.jpg)
ESTs
• ESTs and cDNA sequences provide direct evidence for all sampled transcripts and they are currently the most important resources for transcriptome exploration.
• ESTs/cDNA sequences cover the genes expressed in the given tissue of the given organism under the given conditions.
  • housekeeping genes – gene products required by the cell under all growth conditions (genes for DNA polymerase, RNA polymerase, rRNA, tRNA, …)
  • tissue specific genes – different genes are expressed in the brain and in the liver, enzymes responding to a specific environmental condition such as DNA damage, …
![Page 18: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/18.jpg)
ESTs vs. whole genome
• Whole genome sequencing is still impractical and expensive for organisms with large genome size.
• Genome expansion, as a result of retrotransposon repeats, makes whole genome sequencing less attractive for plants such as maize.
  • Transposons – sequences of DNA that can move (transpose) themselves to new positions within the genome.
  • Retrotransposons – subclass of transposons, they can amplify themselves. Ubiquitous in eukaryotic organisms (45%-48% in mammals, 42% in human). Particularly abundant in plants (maize – 49-78%, wheat – 68%)
  • Genome expansion – increase in genome size, one of the elements of genome evolution
![Page 19: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/19.jpg)
EST properties
• An individual raw EST carries negligible biological information; it is just a very short copy of mRNA.
• It is highly error prone, especially at the ends. The overall sequence quality is usually significantly better in the middle.
Nagaraj SH, Gasser RB, Ranganathan S. A hitchhiker's guide to expressed sequence tag (EST) analysis. Brief Bioinform. 2007 8(1):6-21. PMID: 16772268.
![Page 20: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/20.jpg)
Problems in ESTs
• Redundancy
• Under-representation and over-representation of selected host transcripts (i.e. sequence bias)
• Base calling errors (as high as 5%)
• Contamination from vector sequences
• Repeats may pose problems
• Natural sequence variations (e.g. SNPs) – how to distinguish them from sequencing artifacts?
![Page 21: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/21.jpg)
ESTs on the web
• Largest repository: dbEST (http://www.ncbi.nlm.nih.gov/dbEST/)
  • 1.7.2011 – 69 992 536 ESTs from more than 1 000 organisms
• UniGene (http://www.ncbi.nlm.nih.gov/unigene) stores unique genes and represents a nonredundant set of gene-oriented clusters generated from ESTs.
![Page 22: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/22.jpg)
EST analysis
[Figure: generic steps involved in EST analysis]
The aim of the analysis: augment weak signals, build a consensus and, when a multitude of ESTs are analysed, reconstruct the transcriptome of the organism.
Nagaraj SH, Gasser RB, Ranganathan S. A hitchhiker's guide to expressed sequence tag (EST) analysis. Brief Bioinform. 2007 8(1):6-21. PMID: 16772268.
![Page 23: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/23.jpg)
EST preprocessing
• Reduces the overall noise in EST data to improve the efficacy of subsequent analyses.
• Remove contaminating vector fragments.
  • Compare ESTs with non-redundant vector databases (UniVec – http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html, EMVEC – http://www.ebi.ac.uk/Tools/sss/ncbiblast/vectors.html)
• Repeats must be detected and masked using RepeatMasker (http://www.repeatmasker.org/).
• Resources for EST pre-processing: page 12 in Nagaraj SH, Gasser RB, Ranganathan S. A hitchhiker's guide to expressed sequence tag (EST) analysis. Brief Bioinform. 2007 8(1):6-21. PMID: 16772268.
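The idea of vector screening and masking can be sketched in a few lines. This is a deliberately simplified illustration: it only finds exact copies of known vector fragments, whereas real screening against UniVec or EMVEC relies on similarity searches that tolerate sequencing errors. The function names, the mask character "X" and the toy fragment are assumptions made for this example.

```python
def mask_vector(est, vector_fragments, mask_char="X"):
    """Replace every exact occurrence of a vector fragment with mask characters."""
    seq = est.upper()
    masked = list(seq)
    for fragment in vector_fragments:
        fragment = fragment.upper()
        if not fragment:
            continue
        start = seq.find(fragment)
        while start != -1:
            end = start + len(fragment)
            masked[start:end] = mask_char * len(fragment)
            start = seq.find(fragment, start + 1)  # also catch overlapping hits
    return "".join(masked)


def trim_masked(seq, mask_char="X"):
    """Drop masked bases at both ends, where vector contamination usually sits."""
    return seq.strip(mask_char)


if __name__ == "__main__":
    toy_vector = ["GAATTC"]  # hypothetical vector fragment, for illustration only
    masked = mask_vector("GAATTCACGTACGT", toy_vector)
    print(masked)
    print(trim_masked(masked))
```

Masking rather than deleting keeps the coordinates of the remaining bases unchanged, which is convenient when later steps refer back to positions in the raw read.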
![Page 24: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/24.jpg)
EST clustering
• Collect overlapping ESTs from the same transcript of a single gene into a unique cluster to reduce redundancy.
• Clustering is based on sequence similarity.
  • Different steps for EST clustering are described in detail in Ptitsyn A, Hide W. CLU: a new algorithm for EST clustering. BMC Bioinformatics. 2005; 6 Suppl 2:S3. PubMed PMID: 16026600
• The maximum informative consensus sequence is generated by ‘assembling’ these clusters, each of which could represent a putative gene. This step serves to elongate the sequence length by culling information from several short EST sequences simultaneously.
• Sequence clustering and assembly: CAP3
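Similarity-based clustering can be illustrated with a toy procedure: two ESTs are linked when they share at least one k-mer, and linked ESTs end up in the same cluster (single linkage, implemented with union-find). This is not the CLU or CAP3 algorithm; the function names, the k-mer size and the example sequences are assumptions chosen only to show the principle.

```python
def kmers(seq, k):
    """Set of all substrings of length k (empty if the sequence is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8, min_shared=1):
    """Group ESTs that share at least min_shared k-mers; returns lists of indices."""
    parent = list(range(len(ests)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    kmer_sets = [kmers(s.upper(), k) for s in ests]
    for i in range(len(ests)):
        for j in range(i + 1, len(ests)):
            if len(kmer_sets[i] & kmer_sets[j]) >= min_shared:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(ests)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    toy_ests = [
        "ACGTACGTGGCCTTAA",  # overlaps the next one through GGCCTTAA
        "GGCCTTAAGCGCGCGC",
        "TGCATGCATGCATGCA",  # unrelated
    ]
    print(cluster_ests(toy_ests))
```

The all-against-all comparison is quadratic in the number of ESTs, so this sketch is only suitable for small inputs; it also ignores sequencing errors and reverse-complement reads.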
![Page 25: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/25.jpg)
Functional annotations
• Database similarity searches (BLAST) are subsequently performed against relevant DNA databases and possible functionality is assigned for each query sequence if significant database matches are found.
• Additionally, a consensus sequence can be conceptually translated to a putative peptide and then compared with protein sequence databases. Protein-centric functional annotation, including domain and motif analysis, can be carried out using protein analysis tools.
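Conceptual translation can be sketched as follows, assuming the standard genetic code and a DNA-alphabet (A, C, G, T) consensus sequence. Because the correct reading frame of an EST consensus is generally unknown, the sketch translates all six frames; picking the most plausible frame is left out. All names here are illustrative.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna, frame=0):
    """Translate one reading frame; '*' marks a stop, 'X' an unrecognised codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frame_translations(dna):
    """Three forward frames followed by three frames of the reverse strand."""
    reverse = reverse_complement(dna)
    return [translate(dna, f) for f in range(3)] + [translate(reverse, f) for f in range(3)]


if __name__ == "__main__":
    for peptide in six_frame_translations("ATGGCCTAA"):
        print(peptide)
```

The resulting putative peptides are what would then be compared with protein sequence databases.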
![Page 26: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/26.jpg)
EST analysis pipelines
• Large-scale sequencing projects (thousands of ESTs generated daily) – store, organize and annotate EST data in an automatic pipeline.
• Database of raw chromatograms → clean, cluster, assemble, generate consensus, translate, assign putative function based on various DNA/protein similarity searches
• examples:
  • TGI Clustering tools (TGICL) http://compbio.dfci.harvard.edu/tgi/software/
  • PartiGene http://nebc.nerc.ac.uk/tools/other-tools/est
![Page 27: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/27.jpg)
Sequence Alignment
![Page 28: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/28.jpg)
What is sequence alignment?
CTTTTCAAGGCTTA GGCTTATTATTGC
CTTTTCAAGGCTTA GGCTATTATTGC
CTTTTCAAGGCTTA GGCT-ATTATTGC
Fragment overlaps
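The fragment pairs on this slide can be checked with a small sketch that looks for the longest exact suffix-prefix overlap between two fragments. The function names and the minimum overlap length are assumptions. The first pair overlaps exactly; the second pair (with one T missing) has no exact overlap, which is the situation that calls for a gap.

```python
def longest_overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if none)."""
    for n in range(min(len(a), len(b)), max(min_len, 1) - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def merge(a, b, min_len=3):
    """Join two fragments over their exact overlap, or return None without one."""
    n = longest_overlap(a, b, min_len)
    return a + b[n:] if n else None


if __name__ == "__main__":
    left = "CTTTTCAAGGCTTA"
    print(longest_overlap(left, "GGCTTATTATTGC"), merge(left, "GGCTTATTATTGC"))
    print(longest_overlap(left, "GGCTATTATTGC"), merge(left, "GGCTATTATTGC"))
```

Exact matching is the simplest possible criterion; tolerating mismatches and gaps requires a proper alignment method.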
![Page 29: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/29.jpg)
What is sequence alignment?
“EST clustering” – overlapping reads:
CCCCATGGTGGCGGCAGGTGACAG
CATGGGGGAGGATGGGGACAGTCCGG
TTACCCCATGGTGGCGGCTTGGGAAACTT
TGGCGGCTCGGGACAGTCGCGCATAAT
CCATGGTGGTGGCTGGGGATAGTA
TGAGGCAGTCGCGCATAATTCCG
consensus:
TTACCCCATGGTGGCGGCTGGGGACAGTCGCGCATAATTCCG
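How a consensus arises from overlapping reads can be sketched with a column-wise majority vote. The sketch assumes that the offset of every read relative to the consensus is already known (finding those offsets is the alignment problem itself) and that reads contain no gaps. The reads and offsets below are a made-up toy example, not the reads from the slide.

```python
from collections import Counter


def majority_consensus(placed_reads, unknown="N"):
    """Column-wise majority vote over (offset, read) pairs with known offsets."""
    if not placed_reads:
        return ""
    length = max(offset + len(read) for offset, read in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, read in placed_reads:
        for i, base in enumerate(read.upper()):
            columns[offset + i][base] += 1
    # Uncovered columns get the placeholder; ties go to the base seen first.
    return "".join(
        col.most_common(1)[0][0] if col else unknown for col in columns
    )


if __name__ == "__main__":
    toy_reads = [            # hypothetical reads with assumed offsets
        (0, "ACGTAC"),
        (2, "GTACGG"),
        (4, "ACGGTT"),
        (2, "GAACGG"),       # carries one base-calling error (A instead of T)
    ]
    print(majority_consensus(toy_reads))
```

With enough reads covering each position, isolated base-calling errors are outvoted, which is the "augment weak signals" effect of building a consensus.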
![Page 30: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/30.jpg)
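Sequence alphabet

Amino acids, grouped by side chain charge at physiological pH 7.4

| Side chain class | Name | 3 letters | 1 letter |
|---|---|---|---|
| Positively charged side chains | Arginine | Arg | R |
| | Histidine | His | H |
| | Lysine | Lys | K |
| Negatively charged side chains | Aspartic Acid | Asp | D |
| | Glutamic Acid | Glu | E |
| Polar uncharged side chains | Serine | Ser | S |
| | Threonine | Thr | T |
| | Asparagine | Asn | N |
| | Glutamine | Gln | Q |
| Special | Cysteine | Cys | C |
| | Selenocysteine | Sec | U |
| | Glycine | Gly | G |
| | Proline | Pro | P |
| Hydrophobic side chains | Alanine | Ala | A |
| | Leucine | Leu | L |
| | Isoleucine | Ile | I |
| | Methionine | Met | M |
| | Phenylalanine | Phe | F |
| | Tryptophan | Trp | W |
| | Tyrosine | Tyr | Y |
| | Valine | Val | V |

Nucleotides

| Name | 1 letter |
|---|---|
| Adenine | A |
| Thymine | T |
| Cytosine | C |
| Guanine | G |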
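In code, a sequence alphabet is simply a set of allowed letters. The sketch below checks which alphabet a sequence fits; the names are illustrative. Note that A, C, G and T are also valid one-letter amino acid codes, so a short sequence made only of these letters is reported as DNA here by convention, not by proof.

```python
DNA_ALPHABET = set("ACGT")
# The 20 standard amino acids plus selenocysteine (U), as one-letter codes.
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY") | {"U"}


def classify_sequence(seq):
    """Return 'dna', 'protein' or 'invalid' depending on the letters used."""
    letters = set(seq.upper())
    if not letters:
        return "invalid"
    if letters <= DNA_ALPHABET:
        return "dna"
    if letters <= PROTEIN_ALPHABET:
        return "protein"
    return "invalid"


if __name__ == "__main__":
    for example in ("AGTACGTATCGTATAGCGTAA", "MKVLAAGIW", "ACGT-XZ"):
        print(example, classify_sequence(example))
```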
![Page 31: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/31.jpg)
Sequence alignment
• Procedure of comparing sequences
• Point mutations – easy (gapless alignment)
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
• More difficult example
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT
• However, gaps can be inserted to get something like this (gapped alignment)
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
• insertion × deletion: indel
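The two cases on this slide can be sketched in code: counting point mutations in a gapless alignment, and inserting gaps with a basic global alignment by dynamic programming (the Needleman-Wunsch scheme with a linear gap penalty). The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions, and with other values or tie-breaking rules the gaps may land in different positions than on the slide.

```python
def count_mismatches(a, b):
    """Number of differing positions in a gapless alignment of equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("gapless comparison needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b) with '-' for gaps."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    ref = "ACGTCTGATACGCCGTATAGTCTATCT"
    print(count_mismatches(ref, "ACGTCTGATTCGCCCTATCGTCTATCT"))
    score, top, bottom = needleman_wunsch(ref, "CTGATTCGCATCGTCTATCT")
    print(score)
    print(top)
    print(bottom)
```

A gap in the shorter sequence can be read either as a deletion in it or as an insertion in the other sequence, which is why the neutral term indel is used.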
![Page 32: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/32.jpg)
Why align sequences – continuation
• The draft human genome is available
• Automated gene finding is possible
• Gene: AGTACGTATCGTATAGCGTAA
  • What does it do?
• One approach: Is there a similar gene in another species?
  • Align sequences with known genes
  • Find the gene with the “best” match
![Page 33: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/33.jpg)
Flavors of sequence alignment
pair-wise alignment × multiple sequence alignment
![Page 34: Last lecture summary](https://reader035.fdocuments.net/reader035/viewer/2022062323/568163bb550346895dd4d3cd/html5/thumbnails/34.jpg)
Flavors of sequence alignment
global alignment × local alignment
• global – align entire sequence
• local – stretches of sequence with the highest density of matches are aligned, generating islands of matches or subalignments in the aligned sequences
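The difference between the two flavors shows up directly in the dynamic programming recurrences. The sketch below computes only the optimal scores: the global version (Needleman-Wunsch style) must pay for every unaligned end, while the local version (Smith-Waterman style) may restart at zero and reports the best-scoring island. The scoring values and the toy sequences are assumptions for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score when both sequences must be aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * gap]
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all pairs of substrings; scores are never allowed below zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            value = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(value)
            best = max(best, value)
        prev = cur
    return best


if __name__ == "__main__":
    # Toy sequences sharing the island ACGTACG inside unrelated flanks.
    a = "TTTT" + "ACGTACG" + "TTTTT"
    b = "GGGG" + "ACGTACG" + "GGGG"
    print("global:", global_score(a, b))
    print("local: ", local_score(a, b))
```

For sequences that are similar only in one region, the local score isolates that region, whereas the global score is dragged down by the unrelated flanks.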