[email protected] Université Libre de Bruxelles, Belgium
Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe) http://www.bigre.ulb.ac.be/
Genome analysis
Bioinformatics
Contents
Genome annotation
Comparative genomics
• Phylogenetic profiles
• Gene fusion analysis
• Phylogenetic footprinting
From sequences to genomes
Before the 1990s, DNA sequencing required a considerable investment of human labor. A PhD student could spend a significant fraction of a thesis sequencing a single gene.
Genome projects stimulated the development of automatic sequencing methods, and led to important technological improvements.
There are currently (2008) several hundred publicly available fully sequenced genomes.
The NCBI genome distribution (ftp://ftp.ncbi.nih.gov/genomes/) contains
• >650 prokaryotes (Bacteria and Archaea)
• Insects (Drosophila melanogaster, Apis mellifera)
• Plants (Arabidopsis thaliana, rice, maize)
• A worm (Caenorhabditis elegans)
• Some fungi (Saccharomyces cerevisiae, Schizosaccharomyces pombe, …)
• Some mammals (Homo sapiens, Mus musculus, Rattus norvegicus)
Other genome centres give access to other genomes.
• ENSEMBL (http://www.ensembl.org/) maintains many vertebrate genomes
• UCSC (http://genome.ucsc.edu/) maintains genomes of metazoans (including insects)
• Sanger Institute (http://www.sanger.ac.uk/genbiol/)
• Integr8: ~800 genomes in 2008
Many other genomes were sequenced by commercial companies, and are not available to the public.
Gene organization
Source: Mount (2000)
Gene function
After locating the genes in the sequence, we have to predict their function.
Some genes had already been characterized before the genome project, but these are generally a minority of those found in the genome.
For the majority of the genes, one tries to predict function on the basis of similarities between the sequence of the newly sequenced gene and some previously known genes (function assignment by sequence similarity).
Example: yeast genome (1996): there are still 2500 genes (39%) whose function is completely unknown. However,
Yeast is among the best-known model organisms (genetics, molecular biology).
The full genome has been available since 1996.
When the first draft of the human genome was published, 60% of the predicted genes were of unknown function.
>PHO4,SPBC428.03C : THIAMINE-REPRESSIBLE ACID PHOSPHATASE PRECURSOR: Q01682;Q9UU70; Length = 463
Score = 161 bits (408), Expect = 1e-40
Identities = 138/473 (29%), Positives = 223/473 (46%), Gaps = 47/473 (9%)

Query: 9 ILAASLVNAGTIPLGKLSDIDKIGTQTEIFPFLGGSGPYYSFPGDYGISRDLPESCEMKQ 68
+LAAS+V+AG S + + LG Y+ P G + PESC +KQ
Sbjct: 10 LLAASIVHAGK------SQFEAFENEFYFKDHLGTISVYHE-PYFNGPTTSFPESCAIKQ 62

Query: 69 VQMVGRHGERYPT-------VSKAKSIMTTWYKLSNYTGQFSGALSFLNDDYEFFIRDTK 121
V ++ RHG R PT VS A+ I KL N G S+ + F T
Sbjct: 63 VHLLQRHGSRNPTGDDTATDVSSAQYIDIFQNKLLN--GSIPVNFSYPENPLYFVKHWTP 120

Query: 122 NLEMETTLANSVNVLNPYTGEMNAKRHARDFLAQYGYMVENQTSFAVFTSNSNRCHDTAQ 181
++ E S + G + R +Y Y + + + + T+ R D+A+
Sbjct: 121 VIKAENADQLSSS------GRIELFDLGRQVFERY-YELFDTDVYDINTAAQERVVDSAE 173

Query: 182 YFIDGL-GDKFN--ISLQTISEAESAGANTLSAHHSCPAWDDDVNDDILKK-----YDTK 233
+F G+ GD + + E +SAGAN+L+ ++SCP ++D+ D+ + +
Sbjct: 174 WFSYGMFGDDMQNKTNFIVLPEDDSAGANSLAMYYSCPVYEDNNIDENTTEAAHTSWRNV 233

Query: 234 YLSGIAKRLNKE-NKGLNLTSSDANTFFAWCAYEINARGYSDICNIFTKDELVRFSYGQD 292
+L IA RLNK + G NLT SD + + C YEI R SD C++FT E + F Y D
Sbjct: 234 FLKPIANRLNKYFDSGYNLTVSDVRSLYYICVYEIALRDNSDFCSLFTPSEFLNFEYDSD 293

Query: 293 LETYYQTGPGYDVVRSVGANLFNASVKLLKE--SEVQDQKVWLSFTHDTDILNYLTTIGI 350
L+ Y GP + ++G N L++ + D+KV+L+FTHD+ I+ +G
Sbjct: 294 LDYAYWGGPASEWASTLGGAYVNNLANNLRKGVNNASDRKVFLAFTHDSQIIPVEAALGF 353

Query: 351 IDDKNNLTAEH-VPFMENTF----HRSWYVPQGARVYTEKFQCS-NDTYVRYVINDAVVP 404
D +T EH +P +N F S +VP + TE F CS N YVR+++N V P
Sbjct: 354 FPD---ITPEHPLPTDKNIFTYSLKTSSFVPFAGNLITELFLCSDNKYYVRHLVNQQVYP 410

Query: 405 IETCSTGPGFS----CEINDFYDYAEKRVAGTDFLKVCNVSSVSNSTELTFFW 453
+ C GP + CE++ + + + + + ++ + N ++ST +T ++
Sbjct: 411 LTDCGYGPSGASDGLCELSAYLNSSVRVNSTSNGIANFNSQCQAHSTNVTVYY 463
Some milestones
Species name, Common name, Publication year, Genome size (Mb), Gene number, Mean intergenic distance (Kb), Fraction of coding sequences (%), Non-coding fraction (%), Repetitive elements (%), Transcribed fraction (%), Remarks
Bacteria
Mycoplasma genitalium Mycoplasma 1995 0.6 481 1.2 90 10 Small genome (intracellular)
Haemophilus influenzae 1995 1.8 1,717 1.0 86 14 First bacterial genome sequenced
Escherichia coli Enterobacteria 1997 4.6 4,289 1.1 87 13
Yeasts
Saccharomyces cerevisiae Baker's yeast 1996 12 6,286 1.9 72 28 First eukaryote genome
Animals
Caenorhabditis elegans Nematode worm 1998 97 19,000 5 27 73 First metazoan genome
Drosophila melanogaster Fruit fly 2000 165 16,000 10 15 85
Ciona intestinalis 174 14,180 12
Danio rerio Zebrafish 1,527 18,957 81
Xenopus laevis Amphibian 1,511 18,023 84
Gallus gallus Chicken 2,961 16,736 177
Ornithorhynchus anatinus Platypus 1,918 17,951 107
Mus musculus Mouse 2002 3,421 23,493 146
Pan troglodytes Chimp 2,929 20,829 141
Homo sapiens Human 2001 3,200 21,528 149 2 98 46 28 Draft version in 2001
1000 human genomes > 2008 Project announced Jan 2008
Plants
Arabidopsis thaliana 2001 120 27,000 4 30 70 First plant genome sequenced
Oryza sativa Rice 390 37,544 10
Zea mays Maize 2,500 50,000 50 50 Number of genes is an approximation
Triticum aestivum Wheat 16,000 Hexaploid genome
Lilium 120,000
Psilotum nudum 250,000
Genes and genome size
In prokaryotes, the number of genes increases linearly with genome size.
In eukaryotes, this is not the case: the genome size increases faster than the number of genes.
Genes and genome size
Beware: the axes are logarithmic.
This plot represents the same data as the previous one, but on a logarithmic scale, so that mammals are visible as well.
Gene spacing
Gene spacing increases considerably with the complexity of the organisms.
Note: the X axis is logarithmic, not the Y axis, so the increase appears roughly exponential.
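The values in the "Mean intergenic distance" column of the milestones table appear to match genome size divided by gene number, which is strictly a mean spacing per gene (genes included) rather than a pure intergenic distance. A minimal Python sketch of that arithmetic, using four rows of the table; the function name `mean_gene_spacing_kb` is an illustrative choice, not something taken from the slides:

```python
# (genome size in Mb, gene number), copied from the "Some milestones" table
genomes = {
    "Escherichia coli": (4.6, 4289),
    "Saccharomyces cerevisiae": (12, 6286),
    "Drosophila melanogaster": (165, 16000),
    "Homo sapiens": (3200, 21528),
}


def mean_gene_spacing_kb(genome_size_mb, gene_number):
    """Mean spacing per gene in kb: genome size (converted to kb) / gene number."""
    return genome_size_mb * 1000 / gene_number


for species, (size_mb, n_genes) in genomes.items():
    spacing = mean_gene_spacing_kb(size_mb, n_genes)
    print(f"{species}: {spacing:.1f} kb per gene")
```

Going from E. coli to human, the spacing grows by roughly two orders of magnitude while the gene number grows only about fivefold.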
Proportion of intergenic regions
Beware: the X axis is logarithmic.
The proportion of intergenic regions increases with the complexity of an organism.
In addition (not shown here), introns represent an increasing fraction of the genome.
For example, the exonic fraction represents <5% of the human genome.
Protein size versus genome size
Protein sequences are shorter in prokaryotes than in eukaryotes.
Among eukaryotes, the increase in genome size is not correlated with an increase in protein size: higher eukaryotes have a much larger genome than fungi, without any increase in protein size.
Genome annotation
Gene prediction
Starting from a completely sequenced genome, predict the positions of genes.
Elements of prediction
Open Reading Frames
• Start and stop codons, separated by a continuous set of non-stop codons.
Region content
• Hexanucleotide composition
• Codon adaptation index (CAI).
Signals
• In prokaryotes: Shine-Dalgarno boxes.
• In eukaryotes: intron/exon boundary elements (splicing signals).
Similarity with known genes.
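The first element of prediction, the open reading frame, can be illustrated with a minimal Python sketch. It assumes the standard genetic code (ATG start; TAA, TAG, TGA stops) and scans only the three forward frames; the function name `find_orfs` and the `min_codons` cutoff are illustrative assumptions, and real gene finders combine this with the content and signal evidence listed above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
START_CODON = "ATG"


def find_orfs(seq, min_codons=3):
    """Return (start, end) positions (0-based, end exclusive) of ORFs found
    in the three forward reading frames. An ORF runs from the first ATG in a
    frame to the next in-frame stop codon (stop included)."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i + 3 - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and look for the next start
    return orfs


example = "CCATGAAATTTTAGGG"
for start, end in find_orfs(example):
    print(start, end, example[start:end])
```

A complete scan would also process the reverse complement, and a realistic `min_codons` would be far larger than 3, since short ORFs occur frequently by chance.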
Gene prediction - limitations
Typical problems:
• Gene prediction programs are trained for a specific organism, and can give very bad results with other organisms (e.g., the first rounds of annotation of A. thaliana were done with programs trained for mammals).
• Any gene prediction program will unavoidably predict false genes, and miss some true genes.
• The prediction of intron/exon boundaries is particularly difficult.
• For prokaryotes, the predicted start codons are sometimes imprecise.
Example: genome of the yeast Saccharomyces cerevisiae
• For the yeast genome, the gene detection protocol used in 1996 was over-predictive. The program relied essentially on ORFs, and predicted 6400 genes.
• Some researchers estimated that ~1,000 ORFs might be false predictions.
• Since 1996, the reality of the predicted genes has been tested by combining several methods of functional genomics (expression studies, mutant phenotypes, comparative genomics between closely related species, …).
• A few hundred of the initially predicted genes have been removed from the annotations.
Non-coding genes
There are many types of non-coding genes
• tRNA: transfer RNA
• rRNA: ribosomal RNA
• snRNA: small nuclear RNA (elements of the spliceosome)
• snoRNA: methylation guides
• ...
Detection of non-coding RNA
• Non-coding RNA genes are generally transcribed by polymerases I and III, and have promoters different from those of protein-coding genes.
Annotation of gene function
Once a genomic region has been predicted to contain a gene, the next step is to predict the function of this gene.
The translated product is compared with all known proteins, and a putative function can be assigned on the basis of high similarity matches.
Problems
• Sequence similarity is not always sufficient to confer the same function. Where to put the threshold?
• Some proteins might have similar functions with different sequences (convergent evolution).
• Once a gene has been assigned some putative function, this will be used to assign the same function to other genes (propagation of errors).
We should thus be aware that gene annotations have to be taken with caution.
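The threshold question can be made concrete with a small Python sketch of annotation transfer. Everything about the interface is an assumption for illustration: the `Hit` record, the function `transfer_annotation`, and the default cutoffs (`max_evalue`, `min_identity`) are arbitrary choices, not values recommended in the slides. The example hit reuses the figures of the acid phosphatase alignment shown earlier (Expect = 1e-40, 29% identity).

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject_id: str
    description: str
    evalue: float
    percent_identity: float


def transfer_annotation(hits, max_evalue=1e-10, min_identity=30.0):
    """Return the description of the best hit passing both thresholds,
    or None when no hit qualifies (gene stays a 'hypothetical protein')."""
    passing = [
        h for h in hits
        if h.evalue <= max_evalue and h.percent_identity >= min_identity
    ]
    if not passing:
        return None
    return min(passing, key=lambda h: h.evalue).description


hits = [
    Hit("SPBC428.03C", "THIAMINE-REPRESSIBLE ACID PHOSPHATASE PRECURSOR", 1e-40, 29.0),
]
print(transfer_annotation(hits))                     # rejected at 30% identity
print(transfer_annotation(hits, min_identity=25.0))  # accepted at 25% identity
```

The same hit is rejected or accepted depending on a cutoff that was chosen arbitrarily, which is exactly why transferred annotations should be treated as putative.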
Genes with unknown function
When genomes of model organisms were sequenced, about 40% of the predicted genes could not be associated with any known function.
These genes are annotated as "hypothetical proteins".
Note
In the yeast genome, many of these hypothetical proteins have been removed from the annotations since 1996, because they were false predictions.
Comparative genomics
Phylogenetic footprinting
One of the main reasons for sequencing the mouse genome was to detect conserved regions between mouse and human, which will reveal exons and regulatory regions.
The fact that an unknown gene is found in different genomes gives more confidence in the existence of this gene.
Another important goal was to detect conserved regions in non-coding regions.
• On the basis of a few known cases, it has been shown that conserved non-coding regions contain a high concentration of regulatory elements.
• The detection of conserved non-coding sequences thus gives indications about regions potentially involved in regulation.
• Such conserved regions are called phylogenetic footprints.
[Figure: Genome 1 and Genome 2, with labels "conserved non-coding region" and "conserved exon"]
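One simple way to look for such footprints, sketched below in Python, is to slide a window along two sequences that have already been aligned and report windows with a high fraction of identical columns. This is only an illustration under stated assumptions: the function name `conserved_windows`, the window size and the identity cutoff are hypothetical, the input sequences are invented, and dedicated footprinting tools rely on more elaborate models.

```python
def conserved_windows(aln1, aln2, window=10, min_identity=0.8):
    """Return start positions of alignment windows in which the fraction of
    identical, non-gap columns is at least min_identity."""
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have the same length")
    starts = []
    for start in range(len(aln1) - window + 1):
        pairs = zip(aln1[start:start + window], aln2[start:start + window])
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        if matches / window >= min_identity:
            starts.append(start)
    return starts


# Invented toy alignment: a conserved block followed by a diverged block
genome1 = "ACGTACGTAC" + "TTTTTTTTTT"
genome2 = "ACGTACGTAC" + "GGGGGGGGGG"
print(conserved_windows(genome1, genome2))
```

Overlapping windows that pass the cutoff would normally be merged into a single conserved region before interpretation.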
Phylogenetic profiles
For each gene of the query genome (e.g. E.coli), orthologs are searched in all the sequenced genomes.
Each gene is characterized by a profile of presence/absence in all the sequenced genomes.
Groups of genes having similar phylogenetic profiles are likely to be functionally related.
Gene A.aeolicus C.muridarum C.pneumoniae.AR39 Nostoc.sp Synechocystis.PCC6803 B.halodurans B.subtilis C.acetobutylicum C.glutamicum C.perfringens L.innocua L.lactis M.genitalium M.leprae M.pneumoniae M.pulmonis S.aureus.MW2 S.coelicolor S.pneumoniae.R6 S.pyogenes T.tengcongensis U.urealyticum F.nucleatum A.tumefaciens.C58 B.aphidicola.Sg B.melitensis C.crescentus C.jejuni H.influenzae H.pylori.26695 M.loti N.meningitidis.MC58 P.aeruginosa R.conorii R.solanacearum S.meliloti V.cholerae X.campestris B.burgdorferi T.pallidum T.maritima D.radiodurans A.pernix P.aerophilum S.solfataricus S.tokodaii A.fulgidus Halobacterium.sp M.acetivorans M.jannaschii M.kandleri M.thermoautotrophicum P.abyssi T.acidophilum T.volcanium S.cerevisiae S.pombe C.elegans D.melanogaster E.cuniculi A.thaliana
16127995 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16127996 1 0 0 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1
16127997 1 0 0 1 1 1 1 1 1 0 1 1 0 1 0 0 1 1 1 0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0
16127998 0 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1
16127999 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16128000 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 1 0 1 0 0 1 1 1 1 0
16128001 0 1 1 0 0 1 1 0 1 1 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 0 1 1 1 0 1 1 0 0 1
16128002 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0
16128003 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 1 0
16128004 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16128005 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
16128006 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16128007 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16128008 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
16128009 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
16128010 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Pellegrini et al. (1999). Proc Natl Acad Sci U S A 96(8), 4285-8.
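Comparing profiles can be sketched in a few lines of Python. The version below pairs genes whose presence/absence vectors differ in at most `max_distance` genomes (Hamming distance) and skips profiles that are all 0 or all 1, since those carry no co-occurrence signal. The gene names, the toy profiles and the cutoff are hypothetical, and Hamming distance is just one possible similarity measure, not necessarily the one used in the cited paper.

```python
from itertools import combinations


def hamming_distance(p, q):
    """Number of genomes in which two presence/absence profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p, q))


def is_informative(profile):
    """A profile present everywhere or absent everywhere is uninformative."""
    return 0 < sum(profile) < len(profile)


def similar_profile_pairs(profiles, max_distance=1):
    """Return pairs of gene IDs whose informative profiles are within
    max_distance of each other."""
    informative = {g: p for g, p in profiles.items() if is_informative(p)}
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(informative), 2)
        if hamming_distance(informative[g1], informative[g2]) <= max_distance
    ]


# Hypothetical profiles over six genomes (1 = ortholog present)
profiles = {
    "geneA": (1, 0, 1, 1, 0, 0),
    "geneB": (1, 0, 1, 1, 0, 1),
    "geneC": (0, 1, 0, 0, 1, 1),
    "geneD": (0, 0, 0, 0, 0, 0),
}
print(similar_profile_pairs(profiles))
```

With many genomes, a pairwise comparison like this is usually followed by clustering, so that whole groups of co-occurring genes can be proposed as functionally related.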
Gene fusion analysis
It is quite frequent to observe that two genes of a given organism are fused into a single gene in another organism.
Fusions between more than 2 genes are occasionally observed.
Fused genes are likely to be functionally related.
[Figure: query genome E.coli, 5 components (A, B, C, D, E); reference genome Yeast, 1 composite (C^D^A^B^E)]
[Figure: query genome E.coli, 2 components (A, B); reference genomes B.subtilis and H.pylori, 1 composite each (A^B)]
References
Marcotte et al. (1999). Science 285(5428), 751-3.
Marcotte et al. (1999). Nature 402(6757), 83-6.
Enright et al. (1999). Nature 402(6757), 86-90.
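The idea behind gene fusion analysis can be sketched in Python as follows, assuming similarity hits between query genes and reference proteins are already available. Two query genes become a fusion candidate when they match non-overlapping segments of the same reference protein (the composite). The data structure, the coordinates and the function name `fusion_candidates` are hypothetical, and the published methods apply additional filters that are not reproduced here.

```python
from itertools import combinations


def fusion_candidates(hits):
    """hits maps a composite (reference protein) ID to a list of
    (query_gene, start, end) matches on that protein.
    Return sorted (composite, gene1, gene2) triples for pairs of distinct
    query genes matching non-overlapping segments of the same composite."""
    candidates = set()
    for composite, segments in hits.items():
        for (g1, s1, e1), (g2, s2, e2) in combinations(segments, 2):
            non_overlapping = e1 < s2 or e2 < s1
            if g1 != g2 and non_overlapping:
                candidates.add((composite, *sorted((g1, g2))))
    return sorted(candidates)


# Hypothetical hits: A and B cover distinct parts of composite1,
# whereas C and D overlap on composite2 and are not reported.
hits = {
    "composite1": [("A", 1, 200), ("B", 210, 450)],
    "composite2": [("C", 1, 300), ("D", 100, 350)],
}
print(fusion_candidates(hits))
```

Requiring non-overlapping segments is a deliberate simplification: it prevents two paralogs that match the same domain of the composite from being reported as a fusion.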
Conclusion
The genome challenge
Despite the availability of several hundred genomes, we are far from understanding the organization and function of a single genome.
In particular, a lot of work remains to be done to decipher the genomes of higher organisms.
• Genome sequence by itself is far from sufficient for this.
• Since 1997, several high-throughput methods have been invented to give complementary information about gene function (see courses on transcriptome, proteome and interactome).