[email protected]
Université Libre de Bruxelles, Belgium
Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe)
http://www.bigre.ulb.ac.be/

Genome analysis

Bioinformatics

Contents

Genome annotation
Comparative genomics
• Phylogenetic profiles
• Gene fusion analysis
• Phylogenetic footprinting



From sequences to genomes


Before the 1990s, DNA sequencing required a considerable investment of human labour. A PhD student could spend a significant fraction of a thesis sequencing a single gene.

Genome projects stimulated the development of automatic sequencing methods, and led to important technological improvements.

There are currently (2008) several hundred publicly available, fully sequenced genomes.

The NCBI genome distribution (ftp://ftp.ncbi.nih.gov/genomes/) contains
• >650 prokaryotes (Bacteria and Archaea)
• Insects (Drosophila melanogaster, Apis mellifera)
• Plants (Arabidopsis thaliana, rice, maize)
• A worm (Caenorhabditis elegans)
• Some fungi (Saccharomyces cerevisiae, Schizosaccharomyces pombe, … )
• Some mammals (Homo sapiens, Mus musculus, Rattus norvegicus)

Other genome centres give access to other genomes.
• ENSEMBL (http://www.ensembl.org/) maintains many vertebrate genomes
• UCSC (http://genome.ucsc.edu/) maintains metazoan genomes (including insects)
• Sanger Institute (http://www.sanger.ac.uk/genbiol/)
• Integr8: ~800 genomes in 2008.

Many other genomes were sequenced by commercial companies, and are not available to the public.


Gene organization

Source: Mount (2000)

Gene function

After locating the genes on the sequence, we have to predict their function.

Some genes had already been characterized before the genome project, but these are generally a minority of those found in the genome.

For the majority of the genes, one tries to predict function on the basis of similarities between the sequence of the newly sequenced gene and some previously known genes (function assignment by sequence similarity).

Example: yeast genome (1996): there are still 2500 genes (39%) whose function is completely unknown. However,
• Yeast is among the best-known model organisms (genetics, molecular biology).
• The full genome has been available since 1996.

When the first draft of the human genome was published, 60% of the predicted genes were of unknown function.

>PHO4,SPBC428.03C : THIAMINE-REPRESSIBLE ACID PHOSPHATASE PRECURSOR: Q01682;Q9UU70;
Length = 463
Score = 161 bits (408), Expect = 1e-40
Identities = 138/473 (29%), Positives = 223/473 (46%), Gaps = 47/473 (9%)

Query: 9   ILAASLVNAGTIPLGKLSDIDKIGTQTEIFPFLGGSGPYYSFPGDYGISRDLPESCEMKQ 68
           +LAAS+V+AG S + + LG Y+ P G + PESC +KQ
Sbjct: 10  LLAASIVHAGK------SQFEAFENEFYFKDHLGTISVYHE-PYFNGPTTSFPESCAIKQ 62

Query: 69  VQMVGRHGERYPT-------VSKAKSIMTTWYKLSNYTGQFSGALSFLNDDYEFFIRDTK 121
           V ++ RHG R PT VS A+ I KL N G S+ + F T
Sbjct: 63  VHLLQRHGSRNPTGDDTATDVSSAQYIDIFQNKLLN--GSIPVNFSYPENPLYFVKHWTP 120

Query: 122 NLEMETTLANSVNVLNPYTGEMNAKRHARDFLAQYGYMVENQTSFAVFTSNSNRCHDTAQ 181
           ++ E S + G + R +Y Y + + + + T+ R D+A+
Sbjct: 121 VIKAENADQLSSS------GRIELFDLGRQVFERY-YELFDTDVYDINTAAQERVVDSAE 173

Query: 182 YFIDGL-GDKFN--ISLQTISEAESAGANTLSAHHSCPAWDDDVNDDILKK-----YDTK 233
           +F G+ GD + + E +SAGAN+L+ ++SCP ++D+ D+ + +
Sbjct: 174 WFSYGMFGDDMQNKTNFIVLPEDDSAGANSLAMYYSCPVYEDNNIDENTTEAAHTSWRNV 233

Query: 234 YLSGIAKRLNKE-NKGLNLTSSDANTFFAWCAYEINARGYSDICNIFTKDELVRFSYGQD 292
           +L IA RLNK + G NLT SD + + C YEI R SD C++FT E + F Y D
Sbjct: 234 FLKPIANRLNKYFDSGYNLTVSDVRSLYYICVYEIALRDNSDFCSLFTPSEFLNFEYDSD 293

Query: 293 LETYYQTGPGYDVVRSVGANLFNASVKLLKE--SEVQDQKVWLSFTHDTDILNYLTTIGI 350
           L+ Y GP + ++G N L++ + D+KV+L+FTHD+ I+ +G
Sbjct: 294 LDYAYWGGPASEWASTLGGAYVNNLANNLRKGVNNASDRKVFLAFTHDSQIIPVEAALGF 353

Query: 351 IDDKNNLTAEH-VPFMENTF----HRSWYVPQGARVYTEKFQCS-NDTYVRYVINDAVVP 404
           D +T EH +P +N F S +VP + TE F CS N YVR+++N V P
Sbjct: 354 FPD---ITPEHPLPTDKNIFTYSLKTSSFVPFAGNLITELFLCSDNKYYVRHLVNQQVYP 410

Query: 405 IETCSTGPGFS----CEINDFYDYAEKRVAGTDFLKVCNVSSVSNSTELTFFW 453
           + C GP + CE++ + + + + + ++ + N ++ST +T ++
Sbjct: 411 LTDCGYGPSGASDGLCELSAYLNSSVRVNSTSNGIANFNSQCQAHSTNVTVYY 463


Some milestones

Species name | Common name | Publication year | Genome size (Mb) | Gene number | Mean intergenic distance (kb) | Coding sequences (%) | Non-coding fraction (%) | Repetitive elements (%) | Transcribed fraction (%) | Remarks

Bacteria
Mycoplasma genitalium | Mycoplasma | 1995 | 0.6 | 481 | 1.2 | 90 | 10 | | | Small genome (intracellular)
Haemophilus influenzae | | 1995 | 1.8 | 1 717 | 1.0 | 86 | 14 | | | First bacterial genome sequenced
Escherichia coli | Enterobacteria | 1997 | 4.6 | 4 289 | 1.1 | 87 | 13 | | |

Yeasts
Saccharomyces cerevisiae | Baker's yeast | 1996 | 12 | 6 286 | 1.9 | 72 | 28 | | | First eukaryote genome

Animals
Caenorhabditis elegans | Nematode worm | 1998 | 97 | 19 000 | 5 | 27 | 73 | | | First metazoan genome
Drosophila melanogaster | Fruit fly | 2000 | 165 | 16 000 | 10 | 15 | 85 | | |
Ciona intestinalis | | | 174 | 14 180 | 12 | | | | |
Danio rerio | Zebrafish | | 1 527 | 18 957 | 81 | | | | |
Xenopus laevis | Amphibian | | 1 511 | 18 023 | 84 | | | | |
Gallus gallus | Chicken | | 2 961 | 16 736 | 177 | | | | |
Ornithorhynchus anatinus | Platypus | | 1 918 | 17 951 | 107 | | | | |
Mus musculus | Mouse | 2002 | 3 421 | 23 493 | 146 | | | | |
Pan troglodytes | Chimp | | 2 929 | 20 829 | 141 | | | | |
Homo sapiens | Human | 2001 | 3 200 | 21 528 | 149 | 2 | 98 | 46 | 28 | Draft version in 2001
1000 human genomes | | > 2008 | | | | | | | | Project announced Jan 2008

Plants
Arabidopsis thaliana | | 2001 | 120 | 27 000 | 4 | 30 | 70 | | | First plant genome sequenced
Oryza sativa | Rice | | 390 | 37 544 | 10 | | | | |
Zea mays | Maize | | 2 500 | 50 000 | 50 | | | 50 | | Number of genes is an approximation
Triticum aestivum | Wheat | | 16 000 | | | | | | | Hexaploid genome
Lilium | | | 120 000 | | | | | | |
Psilotum nudum | | | 250 000 | | | | | | |

Genes and genome size

In prokaryotes, the number of genes increases linearly with genome size

In eukaryotes, this is not the case: the genome size increases faster than the number of genes

Genes and genome size

Beware: the axes are logarithmic.

This plot shows the same data as the previous one, but on a logarithmic scale, so that mammals are visible as well.

Gene spacing

Gene spacing increases considerably with the complexity of the organisms.

Note: the X axis is logarithmic, not the Y axis, so the increase appears roughly exponential.
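As a rough check on this trend, the values in the "Some milestones" table can be used to compute the average genome length per gene. The sketch below is an illustration and not part of the original slides: the helper name `kb_per_gene` is hypothetical, and dividing genome size by gene number ignores the length of the genes themselves, so it only approximates the spacing. The genome sizes and gene numbers are copied from the table.

```python
# Genome size (Mb) and gene number, copied from the "Some milestones" table.
GENOMES = {
    "Mycoplasma genitalium": (0.6, 481),
    "Escherichia coli": (4.6, 4289),
    "Saccharomyces cerevisiae": (12, 6286),
    "Caenorhabditis elegans": (97, 19000),
    "Drosophila melanogaster": (165, 16000),
    "Homo sapiens": (3200, 21528),
}


def kb_per_gene(size_mb, n_genes):
    """Average genome length per gene, in kb (hypothetical helper)."""
    return size_mb * 1000 / n_genes


if __name__ == "__main__":
    for species, (size_mb, n_genes) in GENOMES.items():
        print(f"{species:26s} {kb_per_gene(size_mb, n_genes):7.1f} kb per gene")
```

For these six genomes the ratio appears to agree with the "mean intergenic distance" column of the table, from about 1 kb in bacteria to about 149 kb in human.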

Proportion of intergenic regions

Beware: the X axis is logarithmic.

The proportion of intergenic regions increases with the complexity of an organism.

In addition (not shown here), introns represent an increasing fraction of the genome.

For example, the exonic fraction represents <5% of the human genome.

Protein size versus genome size

Protein sequences are shorter in prokaryotes than in eukaryotes.

Among eukaryotes, the increase in genome size is not correlated with an increase in protein size: higher eukaryotes have much larger genomes than fungi, without any increase in protein size.



Genome annotation

Gene prediction

Starting from a completely sequenced genome, predict the positions of genes.

Elements of prediction

Open reading frames (ORFs)
• Start and stop codons, separated by a continuous series of non-stop codons.

Region content
• Hexanucleotide composition
• Codon adaptation index (CAI)

Signals
• In prokaryotes: Shine-Dalgarno boxes.
• In eukaryotes: intron/exon boundary elements (splicing signals).

Similarity with known genes.
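The first element of prediction, the open reading frame, can be illustrated with a minimal sketch. It is not the method of any particular gene prediction program: the function name `find_orfs`, the `min_codons` parameter and the choice of ATG as the only start codon are assumptions made for the example, and only the three forward-strand frames are scanned (a real scan would also cover the reverse complement and combine ORFs with the other criteria listed above).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative start codons are ignored


def find_orfs(seq, min_codons=3):
    """Scan the three forward frames of a DNA sequence for ORFs.

    Returns a list of (start, end, frame) tuples with 0-based start,
    exclusive end (the stop codon is included) and frame in {0, 1, 2}.
    An ORF is kept if it has at least min_codons codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    toy = "CCATGAAATTTGGGTAACC"  # hypothetical toy sequence
    for start, end, frame in find_orfs(toy):
        print(frame, start, end, toy[start:end])
```

Because every long stretch without a stop codon is reported, a filter based on ORFs alone tends to over-predict, which is why region content, signals and similarity are used as additional evidence.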

Gene prediction - limitations

Typical problems:
• Gene prediction programs are trained for a specific organism, and can give very poor results with other organisms (e.g., the first rounds of annotation of A. thaliana were done with programs trained for mammals).
• Any gene prediction program will unavoidably predict false genes and miss some true genes.
• The prediction of intron/exon boundaries is particularly difficult.
• For prokaryotes, the predicted start codons are sometimes imprecise.

Example: genome of the yeast Saccharomyces cerevisiae
• For the yeast genome, the gene detection protocol used in 1996 was over-predictive. The program relied essentially on ORFs, and predicted 6400 genes.
• Some researchers estimated that ~1,000 ORFs might be false predictions.
• Since 1996, the reality of the predicted genes has been tested by combining several methods of functional genomics (expression studies, mutant phenotypes, comparative genomics between closely related species, …).
• A few hundred of the initially predicted genes have been removed from the annotations.

Non-coding genes

There are many types of non-coding genes:
• tRNA: transfer RNA
• rRNA: ribosomal RNA
• snRNA: small nuclear RNA (elements of the spliceosome)
• snoRNA: methylation guides
• ...

Detection of non-coding RNA genes: they are generally transcribed by polymerases I and III, and have different promoters than protein-coding genes.

Annotation of gene function

Once a genomic region has been predicted to contain a gene, the next step is to predict the function of this gene.

The translated product is compared with all known proteins, and a putative function can be assigned on the basis of high similarity matches.

Problems
• Sequence similarity is not always sufficient to confer the same function. Where should the threshold be put?
• Some proteins might have similar functions with different sequences (convergent evolution).
• Once a gene has been assigned some putative function, this will be used to assign the same function to other genes: expansion of errors.

We should thus be aware that gene annotations have to be taken with caution.
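The threshold question can be made concrete with a small sketch of annotation transfer from the best similarity hit. Everything here is illustrative: the function name `transfer_annotation`, the tuple layout of the hits and the default cutoff of 1e-10 are assumptions, not values taken from the slides or from any annotation pipeline. The example hit reuses the PHO4 match shown earlier (Expect = 1e-40).

```python
def transfer_annotation(hits, max_evalue=1e-10):
    """Assign a putative function from the most significant database hit.

    hits is a list of (subject_id, description, evalue) tuples.
    max_evalue is an arbitrary cutoff chosen for this example.
    """
    significant = [hit for hit in hits if hit[2] <= max_evalue]
    if not significant:
        # No convincing match: the gene stays without a known function.
        return "hypothetical protein"
    subject_id, description, _ = min(significant, key=lambda hit: hit[2])
    return f"putative {description} (by similarity to {subject_id})"


if __name__ == "__main__":
    hits = [("PHO4", "thiamine-repressible acid phosphatase", 1e-40)]
    print(transfer_annotation(hits))
    print(transfer_annotation([("X", "kinase", 0.5)]))
```

Keeping the source of the annotation in the result ("by similarity to ...") is a design choice that makes it possible to trace an assignment back when the original annotation turns out to be wrong.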

Genes with unknown function

When genomes of model organisms were sequenced, about 40% of the predicted genes could not be associated with any known function.

These genes are annotated as "hypothetical proteins".

Note: in the yeast genome, many of these hypothetical proteins have been removed from the annotations since 1996, because they were false predictions.



Comparative genomics

Phylogenetic footprinting

One of the main reasons for sequencing the mouse genome was to detect regions conserved between mouse and human, which would reveal exons and regulatory regions.

The fact that an unknown gene is found in different genomes gives more confidence in the existence of this gene.

Another important goal was to detect conserved regions in non-coding regions.
• On the basis of a few known cases, it has been shown that conserved non-coding regions contain a high concentration of regulatory elements.
• The detection of conserved non-coding sequences thus gives indications about regions potentially involved in regulation.
• Such conserved regions are called phylogenetic footprints.

[Figure: alignment of Genome 1 and Genome 2, showing a conserved exon and a conserved non-coding region]
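One simple way to look for such footprints is to slide a window along a pairwise alignment and report the stretches where the percentage of identical positions is high. The sketch below only illustrates that idea: the function name `conserved_regions`, the window size and the identity cutoff are assumptions, and it expects two sequences that have already been aligned (same length, gaps written as "-").

```python
def conserved_regions(aln1, aln2, window=10, min_identity=0.8):
    """Return merged (start, end) intervals of an alignment in which the
    sliding-window identity is at least min_identity (end is exclusive)."""
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have the same length")
    # 1 where both sequences carry the same residue (gaps never match)
    match = [int(a == b and a != "-")
             for a, b in zip(aln1.upper(), aln2.upper())]
    regions = []
    for start in range(len(match) - window + 1):
        identity = sum(match[start:start + window]) / window
        if identity >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                # overlaps or touches the previous region: extend it
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


if __name__ == "__main__":
    # Hypothetical toy alignment: conserved ends, divergent middle.
    genome1 = "ACGTACGTAC" + "T" * 10 + "GGCCGGCCGG"
    genome2 = "ACGTACGTAC" + "A" * 10 + "GGCCGGCCGG"
    print(conserved_regions(genome1, genome2))
```

Whether a reported region is an exon or a candidate regulatory element still has to be decided from the annotation: conserved stretches outside known exons are the ones of interest for footprinting.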

Phylogenetic profiles

For each gene of the query genome (e.g. E. coli), orthologs are searched for in all the sequenced genomes.

Each gene is characterized by a profile of presence/absence in all the sequenced genomes.

Groups of genes having similar phylogenetic profiles are likely to be functionally related.

Genomes: A.aeolicus, C.muridarum, C.pneumoniae.AR39, Nostoc.sp, Synechocystis.PCC6803, B.halodurans, B.subtilis, C.acetobutylicum, C.glutamicum, C.perfringens, L.innocua, L.lactis, M.genitalium, M.leprae, M.pneumoniae, M.pulmonis, S.aureus.MW2, S.coelicolor, S.pneumoniae.R6, S.pyogenes, T.tengcongensis, U.urealyticum, F.nucleatum, A.tumefaciens.C58, B.aphidicola.Sg, B.melitensis, C.crescentus, C.jejuni, H.influenzae, H.pylori.26695, M.loti, N.meningitidis.MC58, P.aeruginosa, R.conorii, R.solanacearum, S.meliloti, V.cholerae, X.campestris, B.burgdorferi, T.pallidum, T.maritima, D.radiodurans, A.pernix, P.aerophilum, S.solfataricus, S.tokodaii, A.fulgidus, Halobacterium.sp, M.acetivorans, M.jannaschii, M.kandleri, M.thermoautotrophicum, P.abyssi, T.acidophilum, T.volcanium, S.cerevisiae, S.pombe, C.elegans, D.melanogaster, E.cuniculi, A.thaliana

Gene      Profile (1 = ortholog present, 0 = absent)
16127995  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16127996  1 0 0 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1
16127997  1 0 0 1 1 1 1 1 1 0 1 1 0 1 0 0 1 1 1 0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0
16127998  0 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1
16127999  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16128000  0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 1 0 1 0 0 1 1 1 1 0
16128001  0 1 1 0 0 1 1 0 1 1 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 0 1 1 1 0 1 1 0 0 1
16128002  0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0
16128003  1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 1 0
16128004  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16128005  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
16128006  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16128007  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16128008  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
16128009  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
16128010  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Pellegrini et al. (1999). Proc Natl Acad Sci U S A 96(8), 4285-8.
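The comparison of profiles can be sketched with a Hamming distance, that is, the number of genomes in which two genes differ in presence/absence. This is a simplified illustration and not the procedure of the cited paper: the function names, the `max_distance` parameter and the toy profiles are hypothetical, and skipping genes that are present in all genomes or in none is an assumption of this sketch (such profiles carry no information about co-occurrence).

```python
from itertools import combinations


def hamming(p, q):
    """Number of genomes in which two presence/absence profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p, q))


def similar_pairs(profiles, max_distance=1):
    """Return sorted pairs of genes whose profiles differ in at most
    max_distance genomes, ignoring all-0 and all-1 profiles."""
    pairs = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        if len(set(p1)) < 2 or len(set(p2)) < 2:
            continue  # uninformative profile
        if hamming(p1, p2) <= max_distance:
            pairs.append((g1, g2))
    return pairs


if __name__ == "__main__":
    # Hypothetical toy profiles over six genomes.
    toy_profiles = {
        "geneA": [1, 0, 1, 1, 0, 1],
        "geneB": [1, 0, 1, 1, 0, 1],
        "geneC": [0, 1, 0, 0, 1, 0],
        "geneD": [1, 0, 1, 0, 0, 1],
        "geneE": [0, 0, 0, 0, 0, 0],
    }
    print(similar_pairs(toy_profiles))
```

With real data, the quality of the result depends on how orthologs were detected in the first place, since each 1 or 0 in a profile is itself a prediction.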

Gene fusion analysis

It is quite frequent to observe that two genes of a given organism are fused into a single gene in another organism.

Fusions between more than 2 genes are occasionally observed.

Fused genes are likely to be functionally related.

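[Figure: two examples of gene fusion. (1) Five component genes A, B, C, D, E in the query genome (E. coli) correspond to one composite gene C^D^A^B^E in a reference genome (yeast). (2) Two component genes A, B in the query genome (E. coli) correspond to one composite gene A^B in each of the reference genomes B. subtilis and H. pylori.]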

References
Marcotte et al. (1999). Science 285(5428), 751-3.
Marcotte et al. (1999). Nature 402(6757), 83-6.
Enright et al. (1999). Nature 402(6757), 86-90.
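The detection of candidate fusions can be sketched as follows: two different genes of the query genome are reported when both match the same protein of a reference genome, over regions that do not overlap. This is a simplified illustration, not the procedure of the cited papers: the function name `find_fusions`, the layout of the hit tuples and the coordinates in the example are hypothetical, and further checks (for instance, that the two components are not similar to each other) are left out.

```python
from collections import defaultdict
from itertools import combinations


def find_fusions(hits, max_overlap=0):
    """Report candidate composite proteins.

    hits is an iterable of (query_gene, reference_protein, ref_start, ref_end)
    with inclusive coordinates on the reference protein.
    Returns a sorted list of (reference_protein, gene1, gene2).
    """
    by_reference = defaultdict(list)
    for query, reference, start, end in hits:
        by_reference[reference].append((query, start, end))

    fusions = []
    for reference, matches in by_reference.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(matches, 2):
            if q1 == q2:
                continue  # two hits of the same gene are not a fusion
            # length of the shared segment; zero or negative when disjoint
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                fusions.append((reference, *sorted((q1, q2))))
    return sorted(fusions)


if __name__ == "__main__":
    # Hypothetical similarity hits of query genes on reference proteins.
    toy_hits = [
        ("A", "composite1", 1, 200),
        ("B", "composite1", 210, 450),
        ("C", "other", 1, 300),
        ("D", "other", 50, 280),
    ]
    print(find_fusions(toy_hits))
```

In the toy data, A and B cover separate halves of `composite1` and are reported, whereas C and D match the same region of `other`, which suggests two similar genes rather than a fusion.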



Conclusion

The genome challenge

Despite the availability of several hundred genomes, we are far from understanding the organization and function of a single genome.

In particular, a lot of work remains to be done to decipher the genomes of higher organisms.
• Genome sequence by itself is far from sufficient for this.
• Since 1997, several high-throughput methods have been invented to give complementary information about gene function (see courses on transcriptome, proteome and interactome).

