Welcome to CS262: Computational Genomics

Instructor: Serafim Batzoglou

TA: Eugene Fratkin

Tuesday & Thursday 2:45-4:00, Skilling Auditorium

Goals of this course

• Introduction to Computational Biology & Genomics

Basic concepts and scientific questions

Why does it matter?

Basic biology for computer scientists

In-depth coverage of algorithmic techniques

Current active areas of research

• Useful algorithms

Dynamic programming

String algorithms

HMMs and other graphical models for sequence analysis

Topics in CS262

Part 1: Basic Algorithms

Sequence Alignment & Dynamic Programming

Hidden Markov models, Context Free Grammars, Conditional Random Fields

Part 2: Topics in computational genomics and areas of active research

DNA sequencing

Comparative genomics

Genes: finding genes, gene regulation

Proteins, families, and evolution

Networks of protein interactions

Course responsibilities

• Homeworks

4 challenging problem sets, 4-5 problems/pset

• Due at beginning of class

• Up to 3 late days (24-hr periods) for the quarter

Collaboration allowed – please give credit

• Teams of 2 or 3 students

• Individual writeups

• If individual (no team), then drop score of worst problem per problem set

• (Optional) Scribing

Due one week after the lecture, except by special permission

Scribing grade replaces 2 lowest problems from all problem sets

• First-come, first-served; email staff list to sign up

Reading material

• Books

“Biological sequence analysis” by Durbin, Eddy, Krogh, Mitchison

• Chapters 1-4, 6, 7-8, 9-10

“Algorithms on strings, trees, and sequences” by Gusfield

• Chapters 5-7, 11-12, 13, 14, 17

• Papers

• Lecture notes

Birth of Molecular Biology

[Figure: DNA strand T C A C T G G C G A G T C A G C. Each nucleotide has a phosphate group, a sugar, and a nitrogenous base (A, C, G, T).]

[Photo captions: Physicist, Ornithologist]

[Figure: DNA strand T C A C T G G C G A G T C A G C next to RNA strand G A G U C A G C]

DNA base pairs: A - T, G - C

DNA → RNA: T → U

DNA

• DNA is written 5’ to 3’ by convention

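• AGACC = GGTCT: each is the reverse complement of the other, so both name the same double-stranded molecule

[Figure: the two strands run in opposite directions, one 5’ to 3’ and the other 3’ to 5’]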
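The AGACC = GGTCT identity on this slide can be checked with a small sketch. The function name `reverse_complement` is an illustrative choice, not something from the course materials, and the code assumes an uppercase sequence over A, C, G, T only.

```python
# Map each DNA base to its Watson-Crick partner (A-T, G-C).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    # Complement every base, then reverse so the result is again 5' to 3'.
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement("AGACC"))  # GGTCT
```

Applying the function twice returns the original sequence, which is why either strand can stand for the same piece of double-stranded DNA.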

Chromosomes

[Figure labels: DNA, H1, H2A, H2B, H3, H4, ~146bp, nucleosome, chromatin, centromere, telomere]

In humans:

• 2x22 autosomes

• X, Y sex chromosomes

The Genetic Dogma

Double-stranded DNA

5’ TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA 3’
3’ ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT 5’

(transcription)

Single-stranded RNA

AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA

(translation)

protein

DNA to RNA to Protein to Cell

DNA, ~3x10^9 long in humans

Contains ~22,000 genes

[Figure: DNA, then messenger-RNA (G A G U C A G C), then protein; the steps are labeled transcription, translation, folding]

Gene Transcription

5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’

Gene Transcription

The promoter lies upstream of a gene

Transcription factors recognize transcription factor binding sites and bind to them, forming a complex

RNA polymerase binds the complex

5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’

Gene Transcription

The two strands are separated

5’ G A T T A C A . . . 3’

3’ C T A A T G T . . . 5’

Gene Transcription

An RNA copy of the 5’→3’ sequence is created from the 3’→5’ template

DNA 5’ G A T T A C A . . . 3’
RNA 5’ G A U U A C A 3’
DNA 3’ C T A A T G T . . . 5’
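The copying step on this slide can be written as a short sketch. The names `PAIR` and `transcribe` are illustrative, and the code assumes the template strand is supplied in 3’→5’ order, as it is drawn on the slide.

```python
# RNA base that pairs with each DNA base on the template strand.
PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5):
    """Build the RNA (5' to 3') complementary to a template given 3' to 5'."""
    return "".join(PAIR[base] for base in template_3_to_5)

print(transcribe("CTAATGT"))  # GAUUACA
```

The result matches the 5’→3’ DNA strand G A T T A C A with every T replaced by U.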

Gene Transcription

pre-mRNA 5’ G A U U A C A . . . 3’

DNA 5’ G A T T A C A . . . 3’
DNA 3’ C T A A T G T . . . 5’

RNA Processing

[Figure: the pre-mRNA (exons and introns) is processed into mRNA, which carries a 5’ cap, a 5’ UTR, a 3’ UTR, and a poly(A) tail]

Gene Structure

[Figure: gene structure from 5’ to 3’: promoter, 5’ UTR, exons separated by introns, 3’ UTR; regions are marked as coding or non-coding]

How many?

• Genes: ~22,000 in the human genome

• Exons per gene: ~8 on average (max: 148)

• Nucleotides per exon: 170 on average (max: 12k)

• Nucleotides per intron: 5,500 on average (max: 500k)

• Nucleotides per gene: 45k on average (max: 2.2M)

Proteins

• Composed of a chain of amino acids.

     R
     |
H2N--C--COOH
     |
     H

20 possible groups (R):

Alanine, Arginine, Asparagine, Aspartate, Cysteine, Glutamate, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine, Valine

Proteins

     R                 R
     |                 |
H2N--C--COOH      H2N--C--COOH
     |                 |
     H                 H

Dipeptide

     R  O      R
     |  ||     |
H2N--C--C--NH--C--COOH
     |         |
     H         H

The C--NH link between the two amino acids is a peptide bond

Protein structure

• Linear sequence of amino acids folds to form a complex 3-D structure

• The structure of a protein is intimately connected to its function

Translation

The ribosome synthesizes a protein by reading the mRNA in triplets (codons). Each codon is translated to an amino acid.

[Figure: ribosome on the mRNA, showing its P site and A site]

The Genetic Code

Rows: first base. Columns: second base. Rightmost column: third base.

    U                              C                    A                        G
U   UUU Phenylalanine (Phe)        UCU Serine (Ser)     UAU Tyrosine (Tyr)       UGU Cysteine (Cys)    U
    UUC Phe                        UCC Ser              UAC Tyr                  UGC Cys               C
    UUA Leucine (Leu)              UCA Ser              UAA STOP                 UGA STOP              A
    UUG Leu                        UCG Ser              UAG STOP                 UGG Tryptophan (Trp)  G

C   CUU Leucine (Leu)              CCU Proline (Pro)    CAU Histidine (His)      CGU Arginine (Arg)    U
    CUC Leu                        CCC Pro              CAC His                  CGC Arg               C
    CUA Leu                        CCA Pro              CAA Glutamine (Gln)      CGA Arg               A
    CUG Leu                        CCG Pro              CAG Gln                  CGG Arg               G

A   AUU Isoleucine (Ile)           ACU Threonine (Thr)  AAU Asparagine (Asn)     AGU Serine (Ser)      U
    AUC Ile                        ACC Thr              AAC Asn                  AGC Ser               C
    AUA Ile                        ACA Thr              AAA Lysine (Lys)         AGA Arginine (Arg)    A
    AUG Methionine (Met) or START  ACG Thr              AAG Lys                  AGG Arg               G

G   GUU Valine (Val)               GCU Alanine (Ala)    GAU Aspartic acid (Asp)  GGU Glycine (Gly)     U
    GUC Val                        GCC Ala              GAC Asp                  GGC Gly               C
    GUA Val                        GCA Ala              GAA Glutamic acid (Glu)  GGA Gly               A
    GUG Val                        GCG Ala              GAG Glu                  GGG Gly               G

Translation (tRNA)

[Figure: a tRNA carrying Tryptophan; its anticodon C C A pairs with the Tryptophan codon U G G]
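The relation between the Tryptophan codon UGG in the genetic code table and the anticodon C C A shown here can be checked with a short sketch. The function name `anticodon` is illustrative, and the code assumes plain Watson-Crick pairing, with both sequences written 5’ to 3’ and no wobble pairing.

```python
# RNA base pairing: A-U and G-C.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")

def anticodon(codon):
    """Return the anticodon (5' to 3') that pairs with an mRNA codon (5' to 3')."""
    # The two molecules pair antiparallel, so complement and then reverse.
    return codon.translate(RNA_COMPLEMENT)[::-1]

print(anticodon("UGG"))  # CCA
```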

Translation

5’ . . . A U U A U G G C C U G G A C U U G A . . . 3’

A U U: UTR

A U G: Met (Start Codon)

G C C: Ala

U G G: Trp

A C U: Thr

Translation

5’ . . . A U U A U G G C C U G G A C U U G A . . . 3’

[Figure: the chain Met Ala has been built; Trp is the next amino acid to be added]
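The translation slides can be traced with a minimal sketch that looks up each codon in the standard genetic code. The names `CODON_TABLE` and `translate` are illustrative. The code uses one-letter amino acid codes (M = Met, A = Ala, W = Trp, T = Thr) and makes the simplifying assumption that translation starts at the first AUG and ends at the first in-frame stop codon.

```python
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
# One-letter amino acid codes; "*" marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna):
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUUAUGGCCUGGACUUGA"))  # MAWT
```

On the slide's sequence the leading A U U is skipped as UTR, and the final U G A stops translation, giving Met Ala Trp Thr.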

Errors?

• What if the transcription / translation machinery makes mistakes?

• What is the effect of mutations in coding regions?

Reading Frames

G C U U G U U U A C G A A U U A G

Frame 1: GCU UGU UUA CGA AUU AG

Frame 2: G CUU GUU UAC GAA UUA G

Frame 3: GC UUG UUU ACG AAU UAG
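A small sketch of how one sequence splits into codons under each of the three reading frames. The function name `reading_frames` is illustrative, and incomplete codons at the ends are dropped.

```python
def reading_frames(rna):
    """Return the complete codons for each of the three forward reading frames."""
    frames = []
    for offset in range(3):
        # Step through the sequence three bases at a time, starting at offset.
        codons = [rna[i:i + 3] for i in range(offset, len(rna) - 2, 3)]
        frames.append(codons)
    return frames

for number, codons in enumerate(reading_frames("GCUUGUUUACGAAUUAG"), start=1):
    print(f"Frame {number}:", " ".join(codons))
```

Only the third frame of this sequence ends in a stop codon (UAG), which shows that the choice of frame changes what the ribosome would read.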

Synonymous Mutation

Original: GCU UGU UUA CGA AUU AG
          Ala Cys Leu Arg Ile

A → G in the third codon:

Mutated:  GCU UGU UUG CGA AUU AG
          Ala Cys Leu Arg Ile

Missense Mutation

Original: GCU UGU UUA CGA AUU AG
          Ala Cys Leu Arg Ile

U → G in the second codon:

Mutated:  GCU UGG UUA CGA AUU AG
          Ala Trp Leu Arg Ile

Nonsense Mutation

Original: GCU UGU UUA CGA AUU AG
          Ala Cys Leu Arg Ile

U → A in the second codon:

Mutated:  GCU UGA UUA CGA AUU AG
          Ala STOP

Frameshift

Original: GCU UGU UUA CGA AUU AG
          Ala Cys Leu Arg Ile

One U deleted:

Mutated:  GCU UGU UAC GAA UUA G
          Ala Cys Tyr Glu Leu
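The four mutation types on these slides can be told apart by translating the sequence before and after the change. The sketch below is a simplified classifier; the names `translate_frame` and `classify` are illustrative, translation is assumed to start at the first base, and any length change that is not a multiple of three is called a frameshift.

```python
from itertools import product

# Standard genetic code with one-letter amino acid codes; "*" is a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate_frame(rna):
    """Translate from the first base, stopping after the first stop codon."""
    protein = ""
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        protein += amino_acid
        if amino_acid == "*":
            break
    return protein

def classify(original, mutated):
    """Label a mutation as frameshift, synonymous, nonsense, or missense."""
    if (len(original) - len(mutated)) % 3 != 0:
        return "frameshift"
    before, after = translate_frame(original), translate_frame(mutated)
    if before == after:
        return "synonymous"
    if after.endswith("*") and not before.startswith(after):
        return "nonsense"  # a new, earlier stop codon
    return "missense"

original = "GCUUGUUUACGAAUUAG"
print(classify(original, "GCUUGUUUGCGAAUUAG"))  # synonymous
print(classify(original, "GCUUGGUUACGAAUUAG"))  # missense
print(classify(original, "GCUUGAUUACGAAUUAG"))  # nonsense
print(classify(original, "GCUUGUUACGAAUUAG"))   # frameshift
```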

Noncoding RNA

RNA 5’ G A U U A C A . . . 3’

DNA 5’ G A T T A C A . . . 3’
DNA 3’ C T A A T G T . . . 5’

Genetics in the 20th Century

21st Century

AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTCTCTCTAGTCTACGTGCTGTATGCGTTAGTGTCGTCGTCTAGTAGTCGCGATGCTCTGATGTTAGAGGATGCACGATGCTGCTGCTACTAGCGTGCTGCTGCGATGTAGCTGTCGTACGTGTAGTGTGCTGTAAGTCGAGTGTAGCTGGCGATGTATCGTGGT

AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTCTCTCTAGTCTACGTGCTGTATGCGTTAGTGTCGTCGTCTAGTAGTCGCGATGCTCTGATGTTAGAGGATGCACGATGCTGCTGCTACTAGCGTGCTGCTGCGATGTAGCTGTCGTACGTGTAGTGTGCTGTAAGTCGAGTGTAGCTGGCGATGTATCGTGGT

AGTAGGACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTCTCTCTAGTCTACGTGCTGTATGCGTTAGTGTCGTCGTCTAGTAGTCGCGATGCTCTGATGTTAGAGGATGCACGATGCTGCTGCTACTAGCGTGCTGCTGCGATGTAGCTGTCGTACGTGTAGTGTGCTGTAAGTCGAGTGTAGCTGGCGATGTATCGTGGT

Computational Biology

• Organize & analyze massive amounts of biological data

Enable biologists to use data

Form testable hypotheses

Discover new biology

AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTCTCTCTAGTCTACGTGCTGTATGCGTTAGTGTCGTCGTCTAGTAGTCGCGATGCTCTGATGTTAGAGGATGCACGATGCTGCTGCTACTAGCGTGCTGCTGCGATGTAGCTGTCGTACGTGTAGTGTGCTGTAAGTCGAGTGTAGCTGGCGATGTATCGTGGT

DNA to RNA to Protein to Cell

DNA, ~3x10^9 long in humans

Contains ~22,000 genes

[Figure: DNA, then messenger-RNA (G A G U C A G C), then protein; the steps are labeled transcription, translation, folding]

Some Topics in CS262

1. Sequencing

AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT

3x10^9 nucleotides

~500 nucleotides

Some Topics in CS262

1. Sequencing

AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT

3x10^9 nucleotides

Computational Fragment Assembly

• Introduced ~1980

• 1995: assemble DNA pieces up to 1,000,000 nucleotides long

• 2000: assemble whole human genome

A big puzzle: ~60 million pieces
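The numbers on the two sequencing slides can be combined in a quick calculation. It rests on an assumption the slides do not state outright: that the ~60 million puzzle pieces are the ~500-nucleotide fragments shown earlier. Under that reading, each position of the genome is covered by about ten pieces on average.

```python
genome_length = 3_000_000_000   # ~3x10^9 nucleotides
fragment_length = 500           # ~500 nucleotides per piece (assumed to be the puzzle pieces)
num_fragments = 60_000_000      # ~60 million pieces

# Average number of fragments covering each position of the genome.
coverage = num_fragments * fragment_length / genome_length
print(coverage)  # 10.0
```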

Complete genomes today

More than 300 complete genomes have been sequenced

Where are the genes?

2. Gene Finding

In humans:

• ~22,000 genes

• ~1.5% of human DNA

[Figure: sequence signals marked along a gene: atg, tga, ggtgag (three times), caggtg, cagatg, cagttg, caggccggtgag]
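The atg and tga signals in the figure suggest the simplest possible gene finder: scan for a start codon followed by an in-frame stop codon. The sketch below is a deliberately naive illustration, not the method taught in the course. It ignores introns and the splice signals entirely, and the name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
import re

STOP_CODONS = {"taa", "tag", "tga"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) spans that begin with atg and end at an in-frame stop."""
    dna = dna.lower()
    orfs = []
    # The lookahead pattern finds every atg, including overlapping ones.
    for start in (match.start() for match in re.finditer("(?=atg)", dna)):
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                break
    return orfs

sequence = "ccatggcctggacttgacc"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

Real gene finders have to model exons and introns as well, which is where the hidden Markov models listed in Part 1 of the course come in.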

3. Molecular Evolution

Evolution at the DNA level

[Figure: DNA changes passed on to the next generation; some are marked OK, some are marked X, and one is marked "Still OK?"]

4. Sequence Comparison

Sequence conservation implies function

Sequence comparison is key to

• Finding genes

• Determining function

• Uncovering the evolutionary processes

Sequence Comparison—Alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC

TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
 || |||||||  ||||x|  ||x|||  |||||
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

(| = match, x = mismatch, - = gap)

Sequence Alignment

• Introduced ~1970

• BLAST: 1990, one of the most cited papers in history

• Still very active area of research

[Figure: BLAST compares a query against a database (DB)]
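An alignment like the one above can be computed with dynamic programming, the first topic listed for Part 1. The sketch below is a minimal global alignment in the style of Needleman and Wunsch. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration, so its output on the slide's sequences need not match the alignment shown.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))

total, row_a, row_b = needleman_wunsch(
    "AGGCTATCACCTGACCTCCAGGCCGATGCCC", "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
)
print(total)
print(row_a)
print(row_b)
```

The table has one cell per pair of prefixes, so the running time grows with the product of the two sequence lengths. That cost is one reason heuristic tools such as BLAST are used for database searches.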

5. RNA Structure

Given:

an unfolded RNA sequence

A G C A G A G U G G …

+ aligned homologs

A G C A C A G U G A …
A C U A G A C A G G …
C G C C G A G U C G
A G C A G U G U G G …

Predict:

which nucleotides base pair?

[Figure: RNA secondary structure elements: helix (stem), hairpin loop, bulge loop, internal loop, multi-branch loop]
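One classic way to phrase "which nucleotides base pair?" as an algorithm is to maximize the number of nested base pairs with dynamic programming (the Nussinov approach). The slides do not name this method, so the sketch below is only an illustration of the problem. It uses a single sequence and no homologs, and it makes two assumptions: G-U wobble pairs are allowed, and a hairpin loop must contain at least `min_loop` unpaired bases.

```python
def can_pair(x, y):
    """Watson-Crick pairs plus the G-U wobble pair (an assumption of this sketch)."""
    return {x, y} in ({"A", "U"}, {"G", "C"}, {"G", "U"})

def max_base_pairs(rna, min_loop=3):
    """Return the maximum number of nested base pairs in rna."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case 1: position i stays unpaired
            # case 2: position i pairs with some position k
            for k in range(i + min_loop + 1, j + 1):
                if can_pair(rna[i], rna[k]):
                    inside = best[i + 1][k - 1]
                    outside = best[k + 1][j] if k + 1 <= j else 0
                    score = max(score, 1 + inside + outside)
            best[i][j] = score
    return best[0][n - 1]

print(max_base_pairs("GGGAAAUCC"))  # 3
```

In the example the three pairs G-C, G-C, and G-U form a helix (stem) closed by the hairpin loop A A A.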

6. Protein networks

Fresh research area

• Construct networks from multiple data sources

• Navigate networks

• Compare networks across organisms

Computer Scientists vs Biologists

Computer scientists vs Biologists

• Nothing is ever true or false in Biology

• Everything is true or false in computer science

Computer scientists vs Biologists

• Biologists strive to understand the complicated, messy natural world

• Computer scientists seek to build their own clean and organized virtual worlds

• Biologists are obsessed with being the first to discover something

• Computer scientists are obsessed with being the first to invent or prove something

Computer scientists vs Biologists

• Biologists are comfortable with the idea that all data have errors

• Computer scientists are not

Computer scientists vs Biologists

• Computer scientists get high-paid jobs after graduation

• Biologists typically have to complete one or more 5-year post-docs...

Computer scientists vs Biologists

Computer Science is to Biology what Mathematics is to Physics