Shinichi MorishitaMolecular Biology of the Cell - Fifth Edition Figure 5－75 Figure removed due to...

Computational Analysis of Genomes

Shinichi Morishita

The figures, photos and moving images with ‡marks attached belong to their copyright holders. Reusing or reproducing them is prohibited unless permission is obtained directly from such copyright holders.

Chromosomes, Chromatin Structure and Genomes

Ref: Annunziato,

A.

DNA packaging: Nucleosomes

and chromatin.

Nature Education

1(1),

(2008)

‡

Sequencing a Genome

• Do the genome size and number of chromosomes characterize an organism?

• What could the sequenced genome be used for?

• How to sequence genomes?

• How to identify the gene coding regions?• Recent revolutionary advances in genome

sequencing equipment

• How to infer chromatin structure?

Genome Size

Courtesy of Dr. T. Ryan Gregory http://www.genomesize.com/statistics.php

‡

1 pg (10-12 g) ≅

1 billion base pairs (more accurately, 978 million pairs)

Why Are Genomes So Different in Size?

Human genome著作権の都合により、ここに挿入され

ていた画像を削除しました。

Molecular Biology of the Cell - Fifth EditionFigure 5－75

Figure removed due to

copyright restrictions

Molecular Biology of the Cell - Fifth EditionGarland Science (2008)Figure 5－75

retrovirus-like element

simple repeat sequence

repetitive sequence unique sequence

Overlapping fragments

Intron

Protein coding region

GENE

Nonrepetive DNA sequence, neither be found in intron nor codon

Human Chromosomes

U.S. National Library of Medicinehttp://ghr.nlm.nih.gov/handbook/illustrations/normalkaryotype

‡

millions of years ago

Mesozoic CenozoicPaleozoic

500 400 300 200 100 0

Cambrian TriassicJurassic

CretaceousPaleogene

Neogene

Verterata

LampreyAgnathaChondrichthyes

Gnathostomata

Shark, ray

Osteichthyes

ActinopterygiiPolypteriformes

Holostei

Bichir

Sturgeon, gar, bowfin

Teleostei

Zebrafish

Killifish

Green spotted puffer

Tiger blowfish

Sarcopterygii

Coelacanth, lungfishDipnoi

Tetrapoda

Amniota

AmphibiaFrog

Reptilia

Aves

SquamataCrocodilia

Testudinata

Lizard, snake

Chicken

CrocodileTurtle

Mammalia

Dog

Mouse, rat

Human, chimpanzee

Vertebrate ChromosomeCounts

Nakatani

et al., 2007 , Genome Res., 17, 1254‐1265

CarboniferousPermianDevonian

SilurianOrdovician

http://ja.wikipedia.org/wiki/%E7%94%BB%E5%83%8F:Flussneunauge.jpg

http://upload.wikimedia.org/wikipedia/commons/8/8f/Polypterus_weeksii_1.jpg

http://ja.wikipedia.org/wiki/%E7%94%BB%E5%83%8F:CommonStrugeon.png

http://www.ensembl.org/Fugu_rubripes/

http://www.ensembl.org/Tetraodon_nigroviridis/

http://www.ensembl.org/Danio_rerio/

http://www.ensembl.org/Xenopus_tropicalis/

http://www.ensembl.org/Gallus_gallus/

http://www.ensembl.org/Homo_sapiens/

http://www.ensembl.org/Pan_troglodytes/

http://www.ensembl.org/Canis_familiaris/

http://www.ensembl.org/Mus_musculus/

http://www.ensembl.org/Rattus_norvegicus/

Chromosome Count DistributionVertical axis: No. species,

Horizontal axis: Chromosome count500 400 300 200 100 0

Verterata

Gnathostomata

Osteichthyes


Holostei

Teleostei

Sarcopterygii

Dipnoi

TetrapodaReptilia

Agnatha

Chondrichthyes

Amphibia

Aves

Squamata

Figure removed due to


Nakatani et al., 2007 , Genome Res., 17, 1254-1265Figure6

Nakatani

et al., 2007 , Genome Res., 17, 1254‐1265



CretaceousPaleogene

NeogeneCarboniferousPermianDevonian

SilurianOrdovician

Amniota

Mammalia

Testudinata

Crocodilia









（小鹿）Figure removed due to


Molecular Biology of the Cell - Fifth EditionGarland Science (2008)Figure 4-14

What Uses Has the Genome?

• It tells us whether a given gene is present.• Chicken genome sequenced Dec 2004

• Do chickens have a poor sense of smell?

• There are calculated to be 218 genes that could be olfactory receptors.

• What happened to the genes for flight?

The Sanger method allows reading of 500 to 800 bases . . .

Ref: Annunziato,

A.

DNA packaging: Nucleosomes

and chromatin.

Nature Education

1(1),

(2008)

How to Sequence the Genome?

‡

Applied Biosystems Japan Co., Ltd.‡

Applied Biosystems Japan Co., Ltd.‡

A C G T

A jigsaw puzzle, of millions or tens of millions of pieces,whose source image is unknown.

Copy the genome

Use a high-velocity water jet to fragment the genome into random pieces

Collect fragments of around 2,000 base pairs (or other suitable volume)

Read 500 to 800 bases at either end of the group

Assemble read sequences into contiguous fragments

Example where it takes more than one method to assemble read sequences

Assemble contiguous fragments

Assemble non-contiguous fragments

Gene Code Regions Within a Genome

Barry Shell, www.science.cahttp://www.science.ca/scientists/scientistprofile.php?pID=19&pg=1

‡

http://www.science.ca/

http://www.science.ca/scientists/scientistprofile.php?pID=19&pg=1

Can Gene Coding Regions Be Predicted from a Genome Alone?

CCATA TATA

GGTAAGG

CAGG

ATG(Start codon)

TAA,TAG,TGA(Stop codon)

Protein coding region

AATAAA

Coding potential

The frequency of codon usage is a bias specific to the organism.The periodicity of a coding region is the nucleotide triplet.The standard bias used is that of a six-nucleotide (two-codon) frequency.Hidden Markov model

millions of years ago


500 400 300 200 100 0


CretaceousPaleogene

Neogene

Verterata

LampreyAgnathaChondrichthyes

Gnathostomata

Shark, ray

Osteichthyes


Holostei

Bichir

Sturgeon, gar, bowfin

Teleostei

Zebrafish

Killifish

Green spotted puffer

Tiger blowfish

Sarcopterygii

Coelacanth, lungfishDipnoi

Tetrapoda

Amniota

AmphibiaFrog

Reptilia

Aves

SquamataCrocodilia

Testudinata

Lizard, snake

Chicken

CrocodileTurtle

Mammalia

Dog

Mouse, rat

Human, chimpanzee

Nakatani

et al., 2007 , Genome Res., 17, 1254‐1265

CarboniferousPermianDevonian

SilurianOrdovician

Sequenced VertebrateGenomes

Compare genomes, identify stored regions, predict genes


http://upload.wikimedia.org/wikipedia/commons/8/8f/Polypterus_weeksii_1.jpg

http://ja.wikipedia.org/wiki/%E7%94%BB%E5%83%8F:CommonStrugeon.png


http://www.ensembl.org/Tetraodon_nigroviridis/





http://www.ensembl.org/Pan_troglodytes/



http://www.ensembl.org/Rattus_norvegicus/

http://mlab.cb.k.u-tokyo.ac.jp/index_jp.html

http://mlab.cb.k.u-tokyo.ac.jp/index_eng.html








http://mlab.cb.k.u-tokyo.ac.jp/index_jp.html

Compare genomes, identify stored regions, predict genes

Dubchak and Frazer, 2003 , Genome Biology, 4,122http://genomebiology.com/2003/4/12/122

‡

Acquiring Gene Sequences

• Synthesize cDNA from mRNA

• Combine cDNA into vectors, propagate and store copies: cDNA library

• Areas where Japan has global prominence:Sumio Sugano (The University of Tokyo, Institute of Medical Science): Human, otherYoshihide Hayashizaki (Riken): Mouse

• Extreme difficulty of identifying all mRNA

Figure removed

due to



Accelerating Genome Sequencing

Illumina GAIIx ABI SOLiD 3 Roche 454FLX Titanium

Read length (nucleotides) 75 x 2 = 150 50 500Reads (hundred mn) / run 0.96～1.2 4 0.01Days / run 9.5 16 0.4 (10 hr)Nucleotides per unit of timeNucleotides (hundred mn)/day 15～19 12.5 12Sample volume (µg) 0.1～1 0.01～ 5 3～5

1.4x/year

2.9x/year

Partial collection of gene sequences

• Genome re-sequencing, transcription start sites, chromatin structures, DNA methylation, RNA sequencing ==> Illumina GA

• De novo sequencing of large genomes, full-length cDNA sequencing, selective splicing ==> Roche 454

• Single-molecule prediction ==> Observations of early development

• Re-sequencing of the Watson genome (454, c.250 bp)

• Re-sequencing of the Asian human genome (Illumina, c.35 bp): mutations, insertions/deletions, inversions, etc.

• DNA methylation (Roche 454, 100-250 bp; Illumina, 36 bp after target capture)

• RNA sequencing (Illumina: 25～35 bp)• Cover the start points of transcription

(Illumina/SOLiD:

25bp)

• Chromatin structures (Illumina/SOLiD, 25 bp)

• Partial sequencing of Neanderthal genome• Chromatin structures (c.100 bp read length)• De novo sequencing of human and other large

vertebrate genomes, full-length cDNA sequencing (500-800 bp read lengths)

1.4x/yr

2.9x/yr

• Moore's Law: CPU performance (the number of transistors on a chip) doubles every 1.5 years.

• The performance of next- generation sequencers outstripping Moore's Law will improve some tenfold over four years.

• Parallelize ten-times the number of CPUs to maintain processing speeds.

• Bottleneck simultaneous access to secondary storage devices

Parallelization of Computational Resources to Improve Performance of Next-Generation Sequencers

Moore's law:1.6x/year

10xdifference

N megabytes/sec KxN megabytes/sec

Access parallelization

HA8000 Cluster System at the U. Tokyo Information Technology Center

Nodes

Logical operational

performance

147.2 GFLOPS

Processors (cores) 4(16)

Main memory32 GB (936 nodes)128 GB (16 nodes)

Local disk capacity 250 GB (incl. RAID 1 OS space)

Processors

Processors (clock speed) AMD Opteron

8386 (2.3 GHz)

Cache memoryL2: 512 KB/core

L3: 2 MB/processor

Logical core operational

performance

9.2 GFLOPS

http://www.cc.u‐tokyo.ac.jp/ha8000/

Top-performing supercomputer in Japan

Global Top 50027th in Nov 200816th in Jun 2008

the U. Tokyo Information Technology Center‡

Type B computational node group (35 + 16 nodes)Login nodes (4 nodes)

Computational node

Login

node

Storage system

Computational node

Computational node

Computational node

Computational node

Computational node

External rou

ter Type A computational

node group (512 nodes)Type A computational node group (128 nodes)

Type B computational node group (256 nodes)

Jun Wang (1976 ‐

)

‡

Comprehensive Description of Chromatin Structure

‡

Jeremy M. Berg,2006,Biochemistry 6th edition,W.H. Freeman & Co.Figure removed

due to



Can nucleosome core positions be predicted from a genome sequence alone?

Reprinted by permission from Macmillan Publishers Ltd: Segal et al., Nature 442(7104):772-8 , copyright （2006)

‡

(out of the minor groove)

(inside of the minor groove)

dinucleotide

DNA of nucreosomeHistone core

• A nucleosome core is present every 160-200 base pairs

• The human genome has 15 to 20 million nucleosome cores

• c.2,000 sequences/day in 2002 (ABI 3730)

• 10 million sequences/day attainable since 2007 (Illumina GA)

Linker DNAHistone

core of

nucreosome

Linker DNA breakage by

micrococcus nuclease(digestive ferment)

Take a pieces of DNA only the

part of twin around

11nm

Read both ends out from a gene‐sequencing machine

In a population of cells, positions of nucleosome

cores are unlikely to be stable.

Figure removed

due to


Molecular Biology of the Cell - Fifth EditionGarland Science (2008)Figure 4-23 （part2of2）

Summary

• Genome size and chromosome count do not necessarily characterize an organism.

• A genome has many different uses.• Repeated sequences complicate genome sequencing.

Computational analysis is essential.• Prediction, genome comparison and cDNA

collection are

used in conjunction to infer gene code regions.• The capability of genome sequencers is making

tremendous strides in recent years.• It is now possible to characterize chromatin structures.

Shinichi MorishitaMolecular Biology of the Cell - Fifth Edition Figure 5－75 Figure removed due to...

Documents

Transcript of Shinichi MorishitaMolecular Biology of the Cell - Fifth Edition Figure 5－75 Figure removed due to...