CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TA: Eugene Fratkin...
-
date post
21-Dec-2015 -
Category
Documents
-
view
220 -
download
0
Transcript of Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TA: Eugene Fratkin...
Welcome to CS262: Computational
Genomics
Instructor:Serafim Batzoglou
TA:Eugene Fratkin
Tuesday&Thursday 2:45-4:00Skilling Auditorium
Goals of this course
• Introduction to Computational Biology & Genomics
Basic concepts and scientific questions
Why does it matter?
Basic biology for computer scientists
In-depth coverage of algorithmic techniques
Current active areas of research
• Useful algorithms
Dynamic programming
String algorithms
HMMs and other graphical models for sequence analysis
Topics in CS262
Part 1: Basic Algorithms
Sequence Alignment & Dynamic Programming
Hidden Markov models, Context Free Grammars, Conditional Random Fields
Part 2: Topics in computational genomics and areas of active research
DNA sequencing
Comparative genomics
Genes: finding genes, gene regulation
Proteins, families, and evolution
Networks of protein interactions
Course responsibilities
• Homeworks
4 challenging problem sets, 4-5 problems/pset• Due at beginning of class• Up to 3 late days (24-hr periods) for the quarter
Collaboration allowed – please give credit• Teams of 2 or 3 students• Individual writeups• If individual (no team) then drop score of worst problem per problem set
• (Optional) Scribing
Due one week after the lecture, except special permission
Scribing grade replaces 2 lowest problems from all problem sets• First-come first-serve, email staff list to sign up
Reading material
• Books
“Biological sequence analysis” by Durbin, Eddy, Krogh, Mitchison
• Chapters 1-4, 6, 7-8, 9-10
“Algorithms on strings, trees, and sequences” by Gusfield
• Chapters 5-7, 11-12, 13, 14, 17
• Papers
• Lecture notes
Birth of Molecular Biology
T
C
A
C
T
G
G
C
G
A
G
T
C
A
G
C
DNA
Phosphate Group
Sugar
NitrogenousBase
A, C, G, T
PhysicistOrnithologist
Chromosomes
H1DNA
H2A, H2B, H3, H4
~146bp
telomere
centromerenucleosome
chromatin
In humans:2x22 autosomesX, Y sex chromosomes
The Genetic Dogma
3’
5’
5’
3’TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA
ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT
AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA
(transcription)
(translation)
Single-stranded RNA
protein
Double-stranded DNA
DNA to RNA to Protein to Cell
DNA, ~3x109 long in humansContains ~ 22,000 genes
G
A
G
U
C
A
G
C
messenger-RNA
transcription translation folding
Gene Transcription
3’
5’
5’
3’
The promoter lies upstream of a gene
Transcription factors recognize transcription factor binding sites and bind to them, forming a complex
RNA polymerase binds the complex
G A T T A C A . . .
C T A A T G T . . .
Gene Transcription
3’
5’
5’
3’
The two strands are separated
G A T T A C
A . . .
C T A A T G T . . .
Gene Transcription
3’
5’
5’
3’
An RNA copy of the 5’→3’ sequence is created from the 3’→5’ template
G A T T A C
A . . .
C T A A T G T . . .
G A U U A C A
Gene Transcription
3’
5’
5’
3’
G A U U A C A . . .
G A T T A C A . . .
C T A A T G T . . .
pre-mRNA 5’ 3’
How many?
• Genes: ~22,000 in the human genome
• Exons per gene:~ 8 on average (max: 148)
• Nucleotides per exon:170 on average (max: 12k)
• Nucleotides per intron:5,500 on average (max: 500k)
• Nucleotides per gene:45k on average (max: 2,2M)
• Composed of a chain of amino acids.
R
|
H2N--C--COOH
|
H
Proteins
20 possible groupsAlanineArginine
AsparagineAspartateCysteine
GlutamateGlutamine
GlycineHistidine
IsoleucineLeucineLysine
MethioninePhenylalanine
ProlineSerine
ThreonineTryptophan
TyrosineValine
R R | | H2N--C--COOH H2N--C--COOH | | H H
Proteins
AlanineArginine
AsparagineAspartateCysteine
GlutamateGlutamine
GlycineHistidine
IsoleucineLeucineLysine
MethioninePhenylalanine
ProlineSerine
ThreonineTryptophan
TyrosineValine
Dipeptide
R O R | II | H2N--C--C--NH--C--COOH | | H H
This is a peptide bond
AlanineArginine
AsparagineAspartateCysteine
GlutamateGlutamine
GlycineHistidine
IsoleucineLeucineLysine
MethioninePhenylalanine
ProlineSerine
ThreonineTryptophan
TyrosineValine
Protein structure
• Linear sequence of amino acids folds to form a complex 3-D structure
• The structure of a protein is intimately connected to its function
Translation
The ribosome synthesizes a protein by reading the mRNA in triplets (codons). Each codon is translated to an amino acid.
mRNA
P site A site
The Genetic Code
U C A G
U
UUU Phenylalanine (Phe) UCU Serine (Ser) UAU Tyrosine (Tyr) UGU Cysteine (Cys) U
UUC Phe UCC Ser UAC Tyr UGC Cys C
UUA Leucine (Leu) UCA Ser UAA STOP UGA STOP A
UUG Leu UCG Ser UAG STOP UGG Tryptophan (Trp) G
C
CUU Leucine (Leu) CCU Proline (Pro) CAU Histidine (His) CGU Arginine (Arg) U
CUC Leu CCC Pro CAC His CGC Arg C
CUA Leu CCA Pro CAA Glutamine (Gln) CGA Arg A
CUG Leu CCG Pro CAG Gln CGG Arg G
A
AUU Isoleucine (Ile) ACU Threonine (Thr) AAU Asparagine (Asn) AGU Serine (Ser) U
AUC Ile ACC Thr AAC Asn AGC Ser C
AUA Ile ACA Thr AAA Lysine (Lys) AGA Arginine (Arg) A
AUG Methionine (Met) or START ACG Thr AAG Lys AGG Arg G
G
GUU Valine (Val) GCU Alanine (Ala) GAU Aspartic acid (Asp) GGU Glycine (Gly) U
GUC Val GCC Ala GAC Asp GGC Gly C
GUA Val GCA Ala GAA Glutamic acid (Glu) GGA Gly A
GUG Val GCG Ala GAG Glu GGG Gly G
Errors?
• What if the transcription / translation machinery makes mistakes?
• What is the effect of mutations in coding regions?
Reading Frames
G C U U G U U U A C G A A U U A G
G C U U G U U U A C G A A U U A G
G C U U G U U U A C G A A U U A G
G C U U G U U U A C G A A U U A G
Synonymous Mutation
G C U U G U U U A C G A A U U A G
Ala Cys Leu Arg Ile
G C U U G U U U A C G A A U U A G
G
G C U U G U U U G C G A A U U A G
Ala Cys Leu Arg Ile
Missense Mutation
G C U U G U U U A C G A A U U A G
Ala Cys Leu Arg Ile
G C U U G U U U A C G A A U U A G
G
G C U U G G U U A C G A A U U A G
Ala Trp Leu Arg Ile
Nonsense Mutation
G C U U G U U U A C G A A U U A G
Ala Cys Leu Arg Ile
G C U U G U U U A C G A A U U A G
A
G C U U G A U U A C G A A U U A G
Ala STOP
Frameshift
G C U U G U U U A C G A A U U A G
Ala Cys Leu Arg Ile
G C U U G U U U A C G A A U U A G
G C U U G U U A C G A A U U A G
Ala Cys Tyr Glu Leu
21st Century
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTCTCTCTAGTCTACGTGCTGTATGCGTTAGTGTCGTCGTCTAGTAGTCGCGATGCTCTGATGTTAGAGGATGCACGATGCTGCTGCTACTAGCGTGCTGCTGCGATGTAGCTGTCGTACGTGTAGTGTGCTGTAAGTCGAGTGTAGCTGGCGATGTATCGTGGT
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTCTCTCTAGTCTACGTGCTGTATGCGTTAGTGTCGTCGTCTAGTAGTCGCGATGCTCTGATGTTAGAGGATGCACGATGCTGCTGCTACTAGCGTGCTGCTGCGATGTAGCTGTCGTACGTGTAGTGTGCTGTAAGTCGAGTGTAGCTGGCGATGTATCGTGGT
AGTAGGACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTCTCTCTAGTCTACGTGCTGTATGCGTTAGTGTCGTCGTCTAGTAGTCGCGATGCTCTGATGTTAGAGGATGCACGATGCTGCTGCTACTAGCGTGCTGCTGCGATGTAGCTGTCGTACGTGTAGTGTGCTGTAAGTCGAGTGTAGCTGGCGATGTATCGTGGT
Computational Biology
• Organize & analyze massive amounts of biological data
Enable biologists to use data
Form testable hypotheses
Discover new biology
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTCTCTCTAGTCTACGTGCTGTATGCGTTAGTGTCGTCGTCTAGTAGTCGCGATGCTCTGATGTTAGAGGATGCACGATGCTGCTGCTACTAGCGTGCTGCTGCGATGTAGCTGTCGTACGTGTAGTGTGCTGTAAGTCGAGTGTAGCTGGCGATGTATCGTGGT
DNA to RNA to Protein to Cell
DNA, ~3x109 long in humansContains ~ 22,000 genes
G
A
G
U
C
A
G
C
messenger-RNA
transcription translation folding
Some Topics in CS2621. Sequencing
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
3x109 nucleotides
~500 nucleotides
Some Topics in CS2621. Sequencing
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
3x109 nucleotides
Computational Fragment AssemblyIntroduced ~19801995: assemble up to 1,000,000 long DNA pieces2000: assemble whole human genome
A big puzzle~60 million pieces
4. Sequence ComparisonSequence conservation implies function
Sequence comparison is key to• Finding genes• Determining function• Uncovering the evolutionary processes
Sequence Comparison—Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- | | | | | | | | | | | | | x | | | | | | | | | | |
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Sequence AlignmentIntroduced ~1970BLAST: 1990, most cited paper in historyStill very active area of research
query
DB
BLAST
5. RNA Structure
Predict:
Given: A G C A G A G U G G …
an unfolded RNA sequence
A G C A C A G U G A …
+ aligned homologs
A C U A G A C A G G
…C G C C G A G U C G
…
A G C A G U G U G G …
bulge
loop
helix (stem)
hairpin loop
internal loop multi-
branch loop
which nucleotides base pair?
6. Protein networks
Fresh research area• Construct networks from
multiple data sources
• Navigate networks
• Compare networks across organisms
Computer scientists vs Biologists
• Nothing is ever true or false in Biology
• Everything is true or false in computer science
Computer scientists vs Biologists
• Biologists strive to understand the complicated, messy natural world
• Computer scientists seek to build their own clean and organized virtual worlds
• Biologists are obsessed with being the first to discover something
• Computer scientists are obsessed with being the first to invent or prove something
Computer scientists vs Biologists
• Biologists are comfortable with the idea that all data have errors
• Computer scientists are not
Computer scientists vs Biologists
• Computer scientists get high-paid jobs after graduation
• Biologists typically have to complete one or more 5-year post-docs...
Computer scientists vs Biologists