Introduction to Bioinformatics Patricia M. Palagi Swiss Institute of Bioinformatics (SIB) PI Group (PIG)


Introduction to Bioinformatics

Patricia M. Palagi

Swiss Institute of Bioinformatics (SIB)

PI Group (PIG)


Bioinformatics: definition

The application of computer science to molecular biology

In particular, to the study of macromolecules such as proteins, nucleic acids and oligosaccharides (sugars)


Some synonyms for molecular bioinformatics

• Computational biology
• Biocomputing
• Genome computing
• Sequence analysis (restrictive)


Molecular bioinformatics is sometimes confused with...

• «Bio-inspired» computer science (artificial life, neural networks, genetic algorithms);
• Biomathematics or biostatistics;
• Modelling of biological systems.


2 components of bioinformatics

• Databases
– Nucleic acid sequence databases (EMBL / GenBank / DDBJ) and protein sequence databases (SWISS-PROT / TrEMBL);
– Databases specialized for genomics (FlyBase, OMIM), mutations, 3D structures (PDB), 2D gels (SWISS-2DPAGE), references (Medline), etc.;
– More than 1,000 are currently available;
– They can generally be accessed from the Web;
– Sizes range from <10 Kb to >10 Gb;
– Frequency of update: from daily (EMBL) to annually.


2 components of bioinformatics

• Tools
– Programs to analyze raw experimental results (from sequencing machines, mass spectrometers, etc.);
– Programs to analyze the intrinsic properties of DNA or protein sequences;
– Sequence comparison and similarity search tools;
– Micro-array analysis software;
– Three-dimensional structure visualization and modelling tools;
– These software tools are either part of commercial packages or are available to all on the WWW.


Some important facts on bioinformatics

• It is a discipline that complements but does not supplant experimental research;
• It can help plan experiments, not replace them;
• It is not cheap;
• Good bioinformatic studies take significant amounts of time;
• Like anywhere else: some garbage in, lots of garbage out!


Bioinformatics and the discovery process in biology

• Discoveries are made through studies of anomalies;

• Computer analysis tends to smooth out the ‘spikes’ of anomalies;

• We need to make sure that we do not throw the baby out with the bathwater.


A common fallacy

• Genome projects are providing massive amounts of data.
• Yes, they are providing lots of sequence data, but little information on the proteins themselves and no characterization data;
• The amount of data is relatively small in absolute terms. Compared to images, sequence data does not cause real problems in terms of storage or processing.


Genome sizes (in base pairs)

Viroid: 300
Small phage (virus infecting a bacterium): 2,000
HIV: 10,000
Herpes virus: 150,000
Mycoplasma genitalium (parasitic bacterium): 600,000
Bacteria: 1 to 13 million
Baker’s yeast: 13 million
Drosophila (fruit fly): 180 million
Fugu (fish): 360 million
Human: 3.2 billion
Pine: 68 billion
Salamander: 81 billion
Amoeba: 670 billion


CCCCTGACGACCGATTCAAAAACCACTTTCCTCTTTTACGGCGCCCTAGCGCTATGGCGGTGAAGACTGCTTGACATTAACATGCCTGTTGAGGCTAGAGAATCCATGCGAAGGCGGTTCGGAAACTGCTTCGAAGGCGTGGGGTGGTGCGGGTGGGATTTGAACCCACGCAGGCCTACGCCATCGGGTCCTAAGCCCGACCCCTTTGGCCAGGCTCGGGCACCCCCGCACCGTGTAGTCTTTAGGTTTAGCTTTCAGGGTTAAAACGGTTTAACACTCATGAGTATCACTGGGCTGGCTACTGGGCTCTGCATTCCCGAGGCCATGCTGCCCGTGAGGAATAACGGGTCTGAGGAGCCGTTGACAGGTTGCCATTTGGCCTTGCCCCCAAAAGTGATGCTGTGGATCACGACCTCCTCGGAGGAGGGGAGCCTCAGCATACACTTTATAAAAGGCTTTAAGGGTTTAGCCGGATAATGTTGTTGGGGCGTGCAGCGGCAAGTGCTGCAGCTCATGGGTATGGTATGCGGCTTTGCCTGGTGATGCGGTTTGGCCCCCGTTGTCTGCGACGTCTGCGGTGTTAGGAGGGCTGTGGTGCTGCAGCCACACGGGAAGGCGGCTCTGCAGGGAGTGCTTTAGGGAGGATATAGTGGGGAGGGTCAGGAGGGAGGTTGAGAGGTGGGGGATGATAGGCCCTGGGGAGACGGTCCTCCTAGGCCTGAGCGGCGGTAAGGACAGCTATGTCCTGCTGGACCCTCTCCGAGATAGTCGGGCCCTCGAGGCTGGTGGCGGTGTCTATAGTGGAGGGCATACCGGGGTACAACAGGGAGGGAGATATCGAGAAGATCAGGAGGGTGGCCGCGGCTAGGGGCGTCGACGTGATAGTGACGAGCATAAGGGAGTATGGGGGCCAGCCTCTATGAGATATACTCCAGGGCCCGAGGGAGGGGGGCGGGCCACGCCGCCTGCACCTACTGCGGCATAAGCAGGAGGAGGATACTTGCCCTCTACGCCCGCCTCTACGGCGCCCACAAGGTCGCTACGGCCCACAACCACGACGAGGCGCAGACAGCTATAGTGAACTTCCTCAGGGGGGACTGGGTTGGCATGCTGAAAACACACCCCCTCTACAGGAGCGGGGGCGAGGACCTGGTTCCAAGGATAAAGCCTCTTAGGAAAGTCTACGAGTGGGAGACGGCCAGCTATGGTACTCCACCGCTACCCCATCCAGGAGGCTGAATGCCCCTTCATAAACATGAACCCAACCCTCAGGGCGAGGGTGAGGACGGCCCTGAGGGTGCTAGAGGAGAGGAGCCCGGGCACCCTGCTCAGGATGATGGAGAGGCTCGACGAGGATGAGGCCGCTGGCCCAGGCCATGAAGCCCTCCTCCCTAGGCAGGTGCGAGAGATGCGGGGAGCCGACCAGCCCGAAGAGGAGGCTCTGCAAGCTCTGCGAGCTCCTGGAGGAGGCCGGGTTCCAGGAGCCCATCTACGCGATCGCAGGGAGCAAGAGATTAAGGCTTCAGAGCCCCACCGCTAGCCCTGGGTGAACGCGCTATGGCAAAGCCAAAGGTTAGCCTGCCGGAGGATGTGGAGCCCCCCAAGGCTATAGTCAAGAAGCCTAGGCTAGTGAAGCTAGGCCCCGTAGACCCGGGGAGGAGGGGAAGGGGGTTCAGCCTAGGCGAGCTCGCGGAGGCTGGGCTAGACGCTAAAAAGGCGAGGAAGCTTGGCCTGCACGTGGACACGAGGAGGAGGACGGTCCACCCGTGGAACGTGGAGGCCCTCAAGAAGTATATAGAGAGGCTTAGAGGCGGGCGTAGAGGTCTAGACCCCGGGGCTATATACTACCACTTCGCCCTCCCCATTATACTATCCACATCCACCCTGGCCCTCCCCACCTCCAGGACCTCAATATCCCCCTCAGCCCTGGTGTACACGCTCAAAGACGGCTCCCTGTAAGGCCCTGGTCACCACCCCCACGTGAATCACCCCTCCCGCGT
GTACGGCGGCTATAAGCCCCCTCTCCCAGCCCTCCCGGAGGACGCGGAGCCCGGAGCCTACTCCGACCCTACCGCCCCTCCTCGCCACAACCACTATGTCCCCGTCAATCTCACCATAGAGGGCGGCTGGGTGTAGGGCCTTGAGGGCCTCGTGGGCCAGAGGCTCCCCCCGGAATATCGGCGCGCCAACTATCTCGGCCTCGCCGGGCCTGACCCTCCTCTCCCTCCCTCCCGAGGTCCTAAGGGCTATCAGCCTCTTATGAAGAGCCCTCTCCCCCCGGCTCTTGCCCGCCTCTCCAGCCAGCCTCTCCACAGACAGAGTGTCAAGCCCCCACACCCTCTCGAGCAGCCTGGCCCGTCGGCTGGCTATGCCCACCGCGACTACAAGCCTTGCTCTAGAGGCTATGGGGGCTGCCTTAGACTCGAGCCCCTCCCACAGTGATATCCAGCCATCTGTATCCACTACCACCTGGCTGGCCAGTGAGGCCAATCTAGATGCGCAGGCGAGGTAGCGGGACTCCGACCCCCGGGGGGTGAAGCCGCCGACGAAACACGGCTCACTCGAGAACGAGTCGTCTAGGCCCGGGACGGCCACGCCCTGTGGAGACGCCAGCGCCATAAACCCCGGGGCGAAGACCTCGTTCTGGCCTATATCCGCCGACAGCAGTCTATACCCACCACCGCCCCTGTTAACTATCCAAGCCGCTATGCTCTTACCGGAGTCGCTCGGCCCCACAATAGCCACCCTGCCCCGCTGAGAGGCCTCCCTGGCTATGGAGTCGAACCTGTTGTAAGCCTCCTCCACGCCCCCTGTGGAGACTACACCGGACACAATAGCCCTCCCCTCAACCCTGGCGACCGACCTGCCTGCAGGGACCACTAGAGTAGAGCCCTCCCCCAGCCTTCCACCCAAAACCTCTGCAGCACCCTCTACAACCTCTATCCTCCCCGGGCCGCGGACTAGCGCCGAGCCCCATGCAATCTCCACAGGCAAAGCTTTAAACCCCCAGGTAAGATATGTGAACCGGGCCGCGGTAGTATAGCCTGGACTAGTATGCGGGCCTGTCAAGGGCCCCGCCTCCGCCCCACCCTCATTCTACTACACGCTTATCAGGATAAACAGCCGGGCAAACGTTTTTAACCCCGCCGAAATTCATACTCCCGGGGCGGAGGCGGGCCTGCGGAGAGCCCGTGACCCGGGTTCAAATCCCGGCCGCGGCGCCAATAATCCTCGCGGCCCGCCTTCAAGACTCACTAAACCCCGGTTGAGCACCCGCAGCATCGATGCTAAGGCTCGAGCCATGCATAGCCGCGGGGGGTGGGGGGATTTGGCGAGGCCTGTTGAGGCGGTAAAGAGGCTGCTGGAGAGGTGGCTGGAGGGTAGGAGGAGGGGTTATGTCCTTACGCTTGTAGCTCTTAGAAGGCTTGAGGAGAGGGGGGAGGAGGCTACTGTAGAGAGTAGGGAGGAGGGCCTGAGGATTCTGGAGAGGACGGAGGGGAGGATAGACTGGGGTGTTACTAGGGATGAGTACACTGTCAACATGGTCTCCAGCGTTCTTCGCGAGCTGGCCGAGAGCGGCCTTGTCGAGATGGTGGACGGCGGGAGGAGCGTCAGGTACAGGATAGCGAGGGATGCTGAGGAGGAGTTCCTCTCCAGCTTCGGCCACCTCCTGCAGCTTGTGAGGATGCCGAAGTAGCGTTAAAGCCCTAGGTGCCAGAGGCCGCCGGAGGCTAAGAGGCCGATGAAGGCCTTGAGAGGTGCCGCCAAGCTATCCCTATCCCTGCTGCTCTTTTGGGCTAGCTACTCGATCTACTACACTATAACGAGGCGTGCTGTAGAGGAGGGCCTAGGAGAGGGATCCTACCTCCTGGGCGTCTTGATGTCGGGGGCTGAGGAGGCGCCGCTCGCGTCAATAGTCCTTGGCTACCTGGCGGACAGGCTAGGCTACCGCTTACCCCTGGCCCTGGGCCTGTTTGAGGCTGGGCTGGTCGCTGCAATG
GCCTTCACCCCCCTAGAGACCTACCCCATACTGGCTGGGGCTGCGTCGCTAGTCTACGCCTCATACTCCGCCCTAATGGGCCTCGTCCTGGGTGAGAGCGGGGGGAGCGGCTTCAGGTACAGTGTTATAGCAGCCTTCGGCAGCCTTGGCTGGGCTCTCGGCGGGTTGGCGGGGGGAGCGGCTTACTCCCGCCTGGGGTCACTGGGGCTAGTGGCCGCAGCCCTCATGGCCGCCTCATACCTAGTCGCCCTCTCAGCCTCGCCCCCCCGCGGCGGCGCGGCGCCCAGTGTGGGGGAGACGATAACCGCTCTGAAGGGGGTTCTGCCCCTATTTGCAAGCCTCTCAACCAGCTGGGCGGCGGGCTTCTTCTTCGGGGCTGCCAGCATAAGGCTTAGCGAGGCGCTCGAGAGCCCTATCGCCTACGGGCTAGTGCTGACCACCGTCCCCGCACTCCTAGGCTTCCTGGCGAGGCCTGCGGCGGGCAGGCTGGTCGACAAGGCCGGGGCTGTAGTGCTTGCGTTGTCCAACGCGGCATACTCCCTTCTCGCCCTAGTTTTCGGCCTGCCCACCAGTCCGGCCCTGCTGGCCCTTGCATGGAGCCTGCCCCTATACCCCTTTAGGGATGCCGCCGCGGCCATCGCAGTTAGCAGCAGGCTTGAGAGGCTGCAGGCGACGGCCGCGGGGCTGCTCTCAGCGAGCGAGAGCGTCGGCGGCGCTGCAACCCTTGCCCTGGCACTGCTCCTGGATGGGGGGTTTAGGGAGATGATGACGGCTTCAATAGCCCTTATGCTCCTCTCCACCCTACTCCTCGCAGACCACTCTACGGCTCCACGCCGAGAGCCCTGTCCCCGGCGTCGCCAAGGCCCGGCACTATGAAGTAGTTCTCGTCCAGCTCGGGGTCTAGGGCTAGCGTGTATATGGGGGTGTCGCCGTAGAGGGATGATATGTACTCGACGCCCCTGGACGCTATTATAGAGCCTATAACGACCTTGCTGGCCCCCCTGTCTCTGGCCAGCCTCACGGCCTCCGCCACAGTCTTGCCCGTGGCCAGCATCGGGTCTAGAACGACGGCGGGGCCGTCGAACATGCGGGGTAGCCTGGAGTAGTAGATCTATCTTGAGCCTGCCCGGCTCCTCGACCCTCCTGGCTGCTACGAGGGCTATCCTCGCCTCCGGCATCATCGAGGCGAAACCCTCTACCATGGGGAGGCTAGCCCCGAGTATCCCTACGAGGTAGACGGGCCCCGCTGGCGCCAGCTCCGCCTTAGCCCCCAGGGGGGTCTCCACCTCCTCCTCCACCCACCCGAGCTCGCCCGCAATGTACACCGCCAGTATGGAGCCCGCTATCCTGACGTACCTCCTAAACTCCGGGAACCCGGTTGTCCGGTCCCTGAGAACCTTGAGGACGTAGGCTAGGGGTGTTTCGCCCCCAATAACCCTAACTGCCGCCACCATGGGAACCTCTAGGTAGTGGTTGAGGCTCCGGAGCTTAAGAGGGTTAAACTCCAGGATGGCCACCTGGGTGCCGCCGGGGATTGGACAGTAGGGTTCTAGAGTCCGCGAGAGCCCTATCCCGCTACCCCCTCTGCGACCGCTGCCTCGGCAGGCTCTTCGCTAGGCTTGGGAGAGGCTGGAGCAATAGGGAGCGGGGAGAGGCTGTCAAGAGGGTTCTGGTGATGGAGCTTCACAGGAGGGTCCTCGAGGGGGATGAGGCGTTGAAAACCCTGGTCTCTGCAGCTCCGAACATAGGGGAGGTGGCAAGGGATGTCGTGGAGCACCTCTCCCCAGGTTCCTACAGGGAGGGCGGCCCATGCGCTGTCTGCGGCGGGCGGCTGGAGAGTGTTATAGCCTCAGCGGTGGAGGGGTACAGGCTGCTAAGGGCTTACGATATCGAGAGGTTCGTAGTCGGGGTCCGGCTAGAGAGAGGTGTTGCCATGGCTGAGGAGGAGGTAAAGCTGGCCGCCGGCGCCGGGTACGGCGAGTCCATTAAGGCTGAGATCA
GGAGGGAGGTCAAGCTCCTGGTGAGCCGGGGTGGAGTGACCGTGGACTTCGACAGCCCTGAAGCGACCCTAATGGTGGAGTTCCCCGGGGGCGGGGTTGACATACAGGTCAACAGCCTGCTCTACAAGGCTAGGTACTGGAAGCTTGCCAGGAACATAAGGGCATACTGGCCCACGCCAGAGGGGCCGAGGTACTTCAGCGTGGAGCAGGCTCTATGGCCGGTTCTAAAGCTCACTGGGGGGGAGAGGCTGGTTGTACACGCTGCTGGCAGGGAGGATGTAGACGCCAGGATGCTGGGCAGCGGGAGGCCGATAGTCGAGGTCAAGTCGCCTAGGCGCAGGAGGATCCCGCTTGAGGAGCTGGAGGCGGCCGCCAACGCCGGCGGGAAGGGGCTGGTTAGGTTCAGGTTCGAGACGGCTGCCAAGCGTGCCGAGGTCGCGCTTTACAAGGAGGAGACTGCGGTTAGGAAGGTGTACCGCGCCCTGGTAGCGGTGGAGGGTGGTGTTAGTGAGGTGGATGTTGAAGGGTTGAGGAGGGCTCTCGAGGGCGCGGTTATAATGCAGAGGACGCCCTCCAGGGTCCTCCATAGGAGGCCGGATATACTGAGGAGGAGGCTCTACAGCCTAGACTGCAGCCCCCTGGAGGGGGCGCCTCTGATGGAGTGCATATTGGAGGCGGAAGGGGGTCTCTACATCAAGGAGCTGGTCAGCGGTGATGGCGGGAGAACCAGGCCAAGCTTCGCTGAGGTCCTCGGCAGGGATGTGTGTATAGAGCTCGACGTGGTGTGGGTGGAGCATGAAGCTCCAGCCGCACCCGGCTAAAGCTAAATTAAGCTGGGCTGAGCAAAATACCGGGGGGAGCGTAGGTTGGTCAAGGCACCTAGAGGCTATAGGAACAGGACTAGGAGGCTGAGGAAGCCTGTGAGGGAGAAGGGCAGCATACCCAGGCTCAGCACCTACCTTAGGGAGTACAGGGTGGGCGATAAGGTGGCTATAATCATAAACCCCTCCTTCCCAGACTGGGGCATGCCCCACAGGAGGTTCCACGGGCTGACGGGAACGGTGGGGAAGAGGGGCGAGGCCTACGAGGTAGAGGTCTATCTGGGTAGGAAGAGGAAGACCCTCTTCGTCCCCCCCGTGCACCTCAAACCCCTCAGCACAGCCGCCGAGAGGCGGGGCAGCTAGAGCTGTCCCCACGGTTCCACGCTGGAGGGGGTGCTAGTGTTGGAGAGGAGGATCCTAGAGTATAAGGCGGTGCCCTACCAGGTAGCCAAGAAGTATATGTACGAGAGGGTTAGGGAGGGCGACATAATATCGATACAGGAGTCGACTTGGGAGTACTTCAGGAAGGTAGTGTTCTGCGACCCGGAGGCTGCCTCCGAGCTTGTTGAGGAGATTGTGAAGGAGGGTGTCAGCCGTGAGGCGCGGCGAACATCGCGAGCATATGCCCCAAGACCGAGGGCGAGCTCAGGAGCATTCTCGAGATGGACAGGAGCATAACCTCCGTACACATGGCTAGCAAACTGTACCCCATAGTTTCCAAATACTGCAAGGACTAGACCCCGCCCCCCTTCAGCCCGGGGATTAACAGTTTAATCTCCGCGTCCCAACCATATTTATGTTGATAGCGGCTGTACGGAGAGTGTTGAGAAGTGTCTAGACCCCGCCCCCGCGACAGGAAGCCCCCCCACCAGGGGAGGCCGCAGCCCCACATCGCCGCCCTTGAGGTGGAGGCTATAGTTCTGGACTACATACCCGAGGGCTACCCGAGAGACCCCCACAGGGAGCACCGCAGTAAGCCCGTCGTTCAGCTCGGGGTTAGGAGGCTGCACCTAGTCGACGGTGTCCCCCTCCATGAGGTCGATATACTGGAGCGGGTCACCCTGGCTAGGGAGGTTGTGTATAGCGTCCCCATAGTGGCCCGGCTCCCCGGGGGGGTCGAGAGGAGGGTGAAAAGTGTTAGTCGCGGTAACATGCCTCCCCGGCCAGGCGCGGGAGGGC
GGGGTCAGGGAGATATACTGCTACCCCCTCTCCTACGCCGACCAGGCGACCCTGGAGGCGCTGCAGCAGCTCCTGGGTGAGGGGGACGAGAGGCACAGGTATATACTTGTGTCCCCCGACAAGCTCTCCGAGGTGGCCAGAGGTCACGGCCTCTCGGGGAAGATAGTGAGCACGCCCAGAGACCCTATATCCTACCAGGACCTCACCGACGTCGCCAGGGCTACGCTGCCGGACGCTGTGAGGAAGCTGGTCAGGGAGAGGGACTTCTTCGTGGAGTTCTTCAACGTGGCCGAGCCGATAAACATAAGGATACACGCGCTGGAGGCCCTAAAGGGTGTGGGTAAGAAGATGGCTAGGCACCTCCTCCTCGAGAGGGAGAGGCGTAGGTTCACGAGTTTCGAGGAGGTGAAGA

The human genome is 380,000 times longer than the sequence shown here

Cost: $2.7 billion
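
The scale comparison can be checked with a quick calculation. The sketch below assumes the 3.2 billion base pairs quoted for the human genome and the factor of 380,000 quoted for the displayed sequence; the variable names are illustrative only.

```python
# Rough check of the scale comparison, using figures quoted on the slides.
human_genome_bp = 3_200_000_000   # human genome size in base pairs
ratio = 380_000                   # quoted factor between genome and displayed sequence

# Implied length of the displayed sequence, in bases.
shown_bp = human_genome_bp / ratio
print(f"Sequence shown: roughly {shown_bp:,.0f} bases")
```

In other words, the block of sequence on the slide corresponds to a few thousand bases.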


The computational challenges in bioinformatics

• Currently they lie in two different aspects of the bioinformatics «galaxy»:
– High-throughput raw data acquisition, tracking and preliminary analysis. Current genomics (DNA sequencing), transcriptomics (micro-array) and proteomics (MS) projects require large amounts of storage space (terabytes), lots of computing power and specialized software systems such as LIMS (Laboratory Information Management Systems).
– Modelling, whether of 3D structures (homology modelling, docking, etc.) or of life processes (pathways, cellular development, etc.).


Some bioinformatics research and service centers

• National Center for Biotechnology Information (NCBI) in the USA;

• European Bioinformatics Institute (EBI) in the UK;
• Swiss Institute of Bioinformatics (SIB);
• Australian National Genome Information Service (ANGIS);
• Canadian Bioinformatics Resource (CBR);
• Peking Center of Bioinformatics (CBI);
• Singapore BioInformatics Centre (BIC);
• South African National Bioinformatics Institute (SANBI).


www.ncbi.nlm.nih.gov


www.ebi.ac.uk


www.isb-sib.ch


ExPASy


A page to navigate through multiple sites in bioinformatics: www.expasy.org/alinks.html


The contents of the SWISS-PROT protein knowledgebase

• Sequences!
• ANNOTATIONS
• References
• Taxonomic data
• Keywords
• Cross-references
• Documentation

Annotations cover:
• Function(s); role(s)
• Post-translational modifications
• Domains
• Subcellular location
• Protein/protein interactions
• Similarities
• Diseases, mutagenesis
• Conflicts and variants


SWISS-PROT (cross-referenced to the databases below)

• Domains, functional sites, protein families: PROSITE, InterPro, Pfam, PRINTS, ProDom, SMART
• Nucleotide sequence db: EMBL, GenBank, DDBJ
• 3D/Structural dbs: HSSP, PDB
• Organism-spec. dbs: DictyDb, EcoGene, FlyBase, HIV, Leproma, MaizeDB, Mendel, MGD, MypuList, SGD, StyGene, SubtiList, TIGR, TubercuList, WormPep, YEPD, Zfin
• Protein-specific dbs: GCRDb, MEROPS, REBASE, TRANSFAC
• 2D-gel protein dbs: SWISS-2DPAGE, ANU-2DPAGE, COMPLUYEAST-2DPAGE, ECO2DBASE, HSC-2DPAGE, Aarhus and Ghent, MAIZE-2DPAGE, PHCI-2DPAGE, PMMA-2DPAGE, Siena-2DPAGE
• Human diseases: MIM
• PTM: CarbBank, GlycoSuiteDB


Some fields of application of bioinformatics

1: Data acquisition

• Examples: DNA sequencing, mass spectrometry (MS), 2D-gel image acquisition;
• Software programs tightly linked to the instrumentation hardware;
• The main issues are in the field of signal detection and image analysis;
• There is no biological context at this level.


How to detect spots in 2D gels

How many spots?

Separation of these 2 spots


2: Preliminary data analysis

• Examples: “base calling” in DNA sequencing, interpretation of mass spectra, detection of spots on 2D-gels;
• There is a need for sophisticated algorithms that can extract a maximum of information (optimization of the “signal/noise ratio”);
• These algorithms require some knowledge of the biological context.


DNA sequencing

Program to analyze data from a DNA sequencing machine

Example: pregap4 from Rodger Staden: https://sourceforge.net/projects/staden


3: Assembly of DNA sequences

• Sequence assembly: the reconstruction of a complete DNA sequence from fragments of 100 to 300 base pairs. These fragments are expected to overlap;
• The assembled sequence is called a «contig»;
• This step is required for the «shotgun» sequencing method, where all or part of a genome is broken down into small pieces;
• It is not a trivial task because of: (a) sequencing errors; (b) sequence repeats.
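
The idea of rebuilding a contig from overlapping fragments can be illustrated with a toy greedy merge. This is only a sketch under strong assumptions (error-free reads, forward strand only, no repeats); the function names `overlap` and `greedy_assemble` are hypothetical and do not come from any assembly package.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments sharing the longest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


print(greedy_assemble(["ATGCGT", "CGTTAC", "TACGGA"]))  # ['ATGCGTTACGGA']
```

With sequencing errors the exact-match test in `overlap` fails, and with repeats the longest overlap may join the wrong fragments, which is why real assemblers are considerably more elaborate.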


4: Coding sequence detection

• How to find genes in genomic sequences;
• A problem whose complexity is directly correlated with the complexity of a genome. It is easy to find genes in bacteria; very difficult in «higher» eukaryotes (human, Drosophila, etc.);
• Various computer methods are used to tackle this problem, combining intrinsic (transcription signal detection, statistical analysis) and extrinsic (similarity with known genes) approaches.
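
The simplest intrinsic approach, adequate only for intron-free (bacterial-style) sequences, is to scan for open reading frames. The sketch below is an illustration, not the method used by the tools named in this deck; `find_orfs` and its `min_codons` parameter are hypothetical, and only the three forward frames are scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) positions, 0-based and end-exclusive, of ATG...stop ORFs on the forward strand."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None                     # look for the next ORF in this frame
    return sorted(orfs)


print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14)]
```

In eukaryotes, introns interrupt the reading frame, so a scan like this misses most genes; that is where splice-site models and similarity to known genes come in.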


HMMgene


Netgene2


Genebuilder


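Summary of results

[Figure: gene structure predicted by HMMgene, Genebuilder and Netgene2, with the 5’ and 3’ ends marked. Positions shown on the figure: 1083, 1003, 1305, 1406, 1452, 1661, 1914, 2000. Positions shown with a score: 1084 (1.00), 1304 (0.77), 1407 (0.89), 1451 (0.90), 1662 (1.00), 1913 (1.00).]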


Not easy sometimes…

Ex: Chromosome 21


5: DNA sequence analysis

• Analysis of restriction sites (enzymes);
• Detection of regions of low complexity;
• Translation of DNA into protein;
• Detection of sequence repeats such as microsatellites, minisatellites, Alu repeats, LINE-1 elements and many others;
• Detection of important non-coding DNA elements such as transcription signals (promoter elements), origins of replication, etc.;
• Detection of tRNA sequences and of other types of RNA (examples: rRNA, uRNA, tmRNA).
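
Two of these analyses, translation and restriction-site search, are simple enough to sketch in a few lines. The code below assumes the standard genetic code and uses the EcoRI recognition site GAATTC as an example; the helper names `translate` and `find_sites` are illustrative.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate a DNA string in frame 1; '*' marks a stop, 'X' an unrecognized codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def find_sites(dna: str, site: str) -> list[int]:
    """Return 0-based start positions of every occurrence of a recognition site."""
    dna, site = dna.upper(), site.upper()
    hits, pos = [], dna.find(site)
    while pos != -1:
        hits.append(pos)
        pos = dna.find(site, pos + 1)
    return hits


print(translate("ATGGCCTAA"))                      # MA*
print(find_sites("AAGAATTCGGGAATTC", "GAATTC"))    # [2, 10]
```

A fuller tool would also scan the reverse strand and handle degenerate recognition sites, which this sketch leaves out.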


Restriction enzyme (Webcut)

[Figure: Webcut restriction map, with regions labelled 1 to 4]

5 enzymes cut 3 times because 4 CDS


6: Similarity searches

• The essential tool in molecular bioinformatics: the comparison of a DNA or protein sequence (“query”) with all or part of the known sequences (“database”);
• No theoretical challenge; but two issues:
– Optimization of computing speed, either using algorithmic shortcuts or specialized hardware;
– Optimization of the use of biological information (how to make these programs “smarter”).


Alignment of 2 sequences. An example

MY-TAIL--ORIS-RICH-
#x ####  x#x# ####
MONTAILLEURESTRICHE

Identities (#), mismatches (x), insertions (-)
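
An alignment of this kind can be computed by dynamic programming. The sketch below uses the classic Needleman-Wunsch global alignment with a deliberately simple scoring scheme (match +1, mismatch -1, gap -1); these scores and the function name `global_align` are assumptions for illustration, and real tools use substitution matrices and more refined gap penalties.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Needleman-Wunsch global alignment; returns (score, top line, marker line, bottom line)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, mid, bottom = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            mid.append("#" if a[i - 1] == b[j - 1] else "x")
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            mid.append(" ")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            mid.append(" ")
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(mid)), "".join(reversed(bottom))


result = global_align("MYTAILORISRICH", "MONTAILLEURESTRICHE")
print(*result[1:], sep="\n")
```

Several alignments can share the best score, so the printed lines need not match the slide character for character; the markers follow the same convention (# identity, x mismatch, - gap).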


BLAST


Statistical measure


BLASTN


BLASTN (nt sequence against ESTs)

Introns


BLASTP

ribosomal protein L24 [Homo sapiens] ribosomal protein L24 [Mus musculus]


7: Protein primary sequence analysis

• Physico-chemical characterization;
• Detection of topogenic regions (e.g. signal sequences, transit peptides) -> sub-cellular localization;
• Detection of transmembrane regions;
• Prediction of functional regions (conserved regions);
• Prediction of post-translational modification (PTM) sites;
• Prediction of antigenicity;
• Search for compositionally biased sequences (e.g. low-complexity sequences, PEST regions, etc.);
• Detection of sequence repeats.


Tools to calculate pI/Mw

Resolving physico-chemical characteristics


Mw, pI and composition


Mw, pI and composition
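
Composition and molecular weight follow directly from the sequence. The sketch below sums commonly tabulated average residue masses and adds one water molecule for the free termini; the mass values are quoted from memory of standard tables and should be checked against an authoritative source before serious use. The function names are illustrative, and pI is not computed here.

```python
from collections import Counter

# Average residue masses in daltons (assumed values from standard tables).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one H2O accounts for the N- and C-terminal groups


def composition(seq: str) -> Counter:
    """Count how many times each amino acid occurs."""
    return Counter(seq.upper())


def molecular_weight(seq: str) -> float:
    """Average molecular weight of an unmodified peptide chain, in daltons."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in seq.upper()) + WATER


peptide = "MKTAYIAK"
print(composition(peptide).most_common(3))
print(f"{molecular_weight(peptide):.1f} Da")
```

Post-translational modifications and signal peptide removal change the observed mass, so a value computed this way is only a starting estimate.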


Sub-cellular localisation PSORT II


Signal sequence: SignalP V1.1

Signal peptide cleavage sites in amino acid sequences


Hydrophobic regions

• Methods based on different scales: numerical values assigned to each of the 20 amino acid types.

Kyte & Doolittle scale:

Ala:  1.8    Leu:  3.8
Arg: -4.5    Lys: -3.9
Asn: -3.5    Met:  1.9
Asp: -3.5    Phe:  2.8
Cys:  2.5    Pro: -1.6
Gln: -3.5    Ser: -0.8
Glu: -3.5    Thr: -0.7
Gly: -0.4    Trp: -0.9
His: -3.2    Tyr: -1.3
Ile:  4.5    Val:  4.2


ProtScale

• Tool to plot various protein physicochemical parameters along the sequence;
• More than 50 amino-acid scales are available: hydrophobicity/hydrophilicity, secondary structure propensity (alpha helix, beta sheet, turn, etc.); amino-acid composition; number of codons; bulkiness; flexibility; etc.;
• WWW site: www.expasy.org/tools/protparam.html
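
Plotting a scale along a sequence usually means averaging the per-residue values over a sliding window. The sketch below does this for the hydropathy values listed earlier in the deck; the window size of 9 and the function name `hydropathy_profile` are assumptions for illustration, not a description of how ProtScale itself is implemented.

```python
# Hydropathy values per residue (one-letter codes), as listed on the scale slide.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 9) -> list[float]:
    """Mean hydropathy of every window of consecutive residues along the sequence."""
    values = [HYDROPATHY[aa] for aa in seq.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


profile = hydropathy_profile("MKTLLILAVVAAALAQDEKRS", window=9)
print([round(v, 2) for v in profile])
```

Stretches where the averaged value stays high are candidate hydrophobic (for example membrane-spanning) regions; the threshold and window to use depend on the question being asked.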


ProtScale (Kyte & Doolittle)


ProtScale (Chou & Fasman)


Transmembrane regions (TM)

• 13% to 35% of the proteins encoded by a genome are predicted to have one or more TM regions;

• Eukaryotic genomes are richer than microbial genomes in TM-containing proteins;

• All kinds of TM proteins: from 1 to 14 alpha-helical TM regions, different topologies, different target membranes, etc.


ProfileScan: looking for functional regions


Looking for functional regions

ATPase family


LOGO

ATPase signature


Prediction of post-translational modifications (PTM)

• For the prediction of cleavage sites of signal sequences and transit peptides, see the section on the prediction of topogenic regions;

• To predict some PTMs, a pattern (consensus sequence) can be used. Such patterns are found in the PROSITE database;
• Example: potential N-glycosylation sites: N-{P}-[ST]-{P};
• NetOGlyc: neural network for the prediction of mucin-type O-glycosylation sites: www.cbs.dtu.dk/services/NetOGlyc/
• DGPI: prediction of GPI-anchor sites: www.bigfoot.com/~dgpi
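
A PROSITE-style pattern such as N-{P}-[ST]-{P} maps naturally onto a regular expression: {P} means "any residue except P" and [ST] means "S or T". The sketch below handles only this small subset of the pattern syntax (no repeats or anchors), and the helper names are hypothetical.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern (letters, x, [..], {..}) to a regular expression."""
    parts = []
    for element in pattern.split("-"):
        if element.startswith("{") and element.endswith("}"):
            parts.append("[^" + element[1:-1] + "]")   # excluded residues
        elif element == "x":
            parts.append(".")                          # any residue
        else:
            parts.append(element)                      # literal or [..] class
    return "".join(parts)


def find_motif(seq: str, pattern: str) -> list[tuple[int, str]]:
    """Return (1-based position, matched residues) for every hit, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(seq.upper())]


print(prosite_to_regex("N-{P}-[ST]-{P}"))                # N[^P][ST][^P]
print(find_motif("MKNGSAANPTLNNTS", "N-{P}-[ST]-{P}"))   # [(3, 'NGSA'), (12, 'NNTS')]
```

A pattern hit only marks a potential site; short patterns like this one match many sequences by chance, so hits need experimental or contextual support.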


Phosphorylation site prediction

http://www.cbs.dtu.dk/services/NetPhos/

Sequence  484  ISPTTINTC  0.065  .
Sequence  487  TTINTCGAI  0.029  .
Sequence  499  CFDKTGTLT  0.077  .
Sequence  501  DKTGTLTED  0.845  *T*
Sequence  503  TGTLTEDGL  0.533  *T*


Sulfinator

Sulfation

Glycosylation


8: Multiple sequence alignments

• Alignment of two DNA or protein sequences (binary alignment);
• Alignment of multiple sequences.


CLUSTAL dendrogram

Multiple alignment: a dendrogram


9: RNA folding

• Predicts an optimal secondary structure for an RNA;
• Generally applied to tRNAs and rRNAs, but also to parts of mRNAs;
• Makes use of information on base pairing, local energy minimization and structural constraints.
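
Energy-based folding needs thermodynamic parameters, but the underlying dynamic programming can be shown with a simpler stand-in: maximizing the number of nested base pairs (the Nussinov approach). This is a named substitute for energy minimization, not the method of any particular tool; `nussinov_pairs` and the minimum loop length of 3 are assumptions for illustration.

```python
def nussinov_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs, requiring at least min_loop unpaired bases in each hairpin."""
    rna = rna.upper()
    can_pair = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(rna)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]       # best[i][j]: max pairs within rna[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]           # case 1: base j stays unpaired
            for k in range(i, j - min_loop): # case 2: base j pairs with base k
                if (rna[k], rna[j]) in can_pair:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(nussinov_pairs("GGGAAACCC"))  # 3
```

Counting pairs ignores stacking and loop energies, so the structure it favours can differ from the minimum free energy structure; it does show why the problem is tractable despite the huge number of possible foldings.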


10: Protein secondary and tertiary structure analysis

• Prediction of secondary structure by statistical methods or by neural networks;

• Prediction of the 3D structure directly from the sequence (“ab-initio”). This is still a major challenge!;

• Modelling by homology: prediction of the structure of a new protein whose sequence is similar to that of a protein whose structure is already known;

• Simulation of the “docking” of two proteins or between a protein and a small molecule.


Secondary structure prediction

GOR IV


3D structure modelling

Protein sequence

Protein structure

?


11: Phylogenetic analysis

• Reconstruction of the molecular evolution of families of proteins;
• Reconstruction of the evolution of living species; creation of taxonomic trees;
• Reconstruction of the evolution of metabolic pathways.
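
Many tree-building methods start from pairwise distances between aligned sequences. The sketch below computes the simplest such measure, the proportion of differing sites (p-distance), skipping gapped columns; the function names and the toy alignment are illustrative, and real analyses normally apply an evolutionary model to correct for multiple substitutions.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences, ignoring gapped columns."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue                  # skip columns with a gap in either sequence
        compared += 1
        differences += x != y
    return differences / compared if compared else 0.0


def distance_matrix(aligned: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise p-distances for every pair of named sequences."""
    names = list(aligned)
    return {
        (p, q): p_distance(aligned[p], aligned[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }


toy_alignment = {"seqA": "ACGTACGT", "seqB": "ACGTACGA", "seqC": "AC-TTCGA"}
for pair, d in distance_matrix(toy_alignment).items():
    print(pair, round(d, 3))
```

A clustering method such as UPGMA or neighbour joining can then turn a matrix like this into a tree.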


Reptiles: a paraphyletic group


12: Proteomics tools

• Tools to identify proteins from the results of 2D-gel and mass-spectrometric (MS) experiments;

• They also allow identified proteins to be characterized further by predicting and, in some cases, proving the presence of post-translational modifications;

• This subfield of bioinformatics is also known as “proteomatics” (Appel 1998).
