Challenges of Bioinformatics

Bioinformatics: biology by other means

Alberto Labarga

UGR, November 2008

Computational Biology

Bioinformatics[Biological Information]

1859 1866 1870 1900 1902

Towards a scientific theory of heredity

In 1859 Charles Darwin publishes 'The Origin of Species', which proposes that living beings are the result of natural selection and that all creatures have evolved over the generations through small changes.

Mendel's laws, published in 1866 and rediscovered in 1900.

In 1870 the Swiss scientist Friedrich Miescher isolates the material stored in the cell nucleus, which consists mainly of proteins and nucleic acids. At the time it was believed that the element storing hereditary information had to be protein, built from 20 amino acids, whereas nucleic acids had only 4 components.

At the beginning of the 20th century, Phoebus Levene discovered that DNA is a chain of nucleotides, each composed of a sugar (deoxyribose), a phosphate group and a nitrogenous base, which can be of four types: adenine, thymine, guanine and cytosine.

Walter Sutton, a graduate student in E. B. Wilson's lab at Columbia University, observed that in meiosis, the process of cell division that produces sperm and egg cells, each sperm or egg receives only one chromosome of each type. (In other parts of the body, cells have two chromosomes of each type, one inherited from each parent.) The segregation pattern of chromosomes during meiosis matched the segregation patterns of Mendel's genes.

1928 1944 1949 1952 1953

The discovery of DNA

1928, Frederick Griffith: the transforming principle. When he mixed R pneumococci with S pneumococci that had previously been killed by heat, the mice died. Moreover, in the blood of these dead mice Griffith found pneumococci with a capsule (S).

In 1944 Oswald Avery and his colleagues, who were studying Pneumococcus, the bacterium that causes pneumonia, discovered that bacteria have nucleic acids and that DNA is the molecule responsible for storing genes. Other studies with viruses went on to confirm this theory, even though it was still believed that DNA was too simple.

Life can be seen as a process of storing and transmitting biological information.

Chromosomes are the carriers of this information.

The information is stored in the form of a molecular code.

To understand life we must identify these molecules and decipher the code.

1949: DNA is duplicated during cell division. Chargaff: A = T and G = C


1952 - Hershey-Chase Experiment

M.H.F. Wilkins, A.R. Stokes, H.R. Wilson: Molecular Structure of Deoxypentose Nucleic Acids. Nature 171, 738 (1953)

R.E. Franklin and R.G. Gosling: Molecular Configuration in Sodium Thymonucleate. Nature 171, 740 (1953)

MOLECULAR STRUCTURE OF NUCLEIC ACIDS

“We wish to propose a structure for the salt of desoxyribose nucleic acid (DNA). This structure has novel features which are of considerable biological interest”

Nature, 25 April 1953

“It has not escaped our attention that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.”

The base pairs

1955 1959 1962 1966

In 1955 Ochoa, together with the French-Russian biochemist Marianne Grunberg-Manago, published in the Journal of the American Chemical Society the isolation of a bacterial enzyme that catalyses the synthesis of RNA, the intermediary between DNA and proteins. The discoverers named the enzyme "polynucleotide phosphorylase"; it was at first taken for the enzyme that makes RNA in the cell, a role later assigned to RNA polymerase. The discovery of polynucleotide phosphorylase made it possible to prepare synthetic polynucleotides of varying base composition, with which Severo Ochoa's group, in parallel with Marshall Nirenberg's group, went on to decipher the genetic code.


When Perutz arrived in Cambridge, the largest molecular structure that had been solved was that of the pigment phthalocyanine, with 58 atoms. A protein has thousands of atoms. Bernal, his supervisor, had taken some X-ray diffraction images of crystals of a protein, pepsin, but had not managed to interpret them. The subject Perutz chose for his thesis was another protein, haemoglobin, the oxygen carrier that gives our blood its red colour. Haemoglobin has no fewer than 11,000 atoms. It took him 23 years.

Over the course of several years, Marshall Nirenberg, Har Khorana and Severo Ochoa and their colleagues elucidated the genetic code, showing how nucleic acids with their 4-letter alphabet determine the order of the 20 kinds of amino acids in proteins. Messenger RNA is interpreted three letters at a time; a set of three nucleotides forms a "codon" that encodes an amino acid. A three-letter word made of four possible letters can have 64 (4 x 4 x 4) permutations, which is more than enough to encode the 20 amino acids in living beings.
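The counting argument above can be checked with a few lines of Python. This is only an illustrative sketch: the names `BASES`, `codons` and `min_codon_length` are made up for the example and do not come from the slides.

```python
from itertools import product

BASES = "ACGU"  # the four letters of messenger RNA

# Every ordered triple of bases is one codon: 4 * 4 * 4 of them.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]


def min_codon_length(n_letters: int, n_amino_acids: int) -> int:
    """Smallest word length whose number of words covers all amino acids."""
    length = 1
    while n_letters ** length < n_amino_acids:
        length += 1
    return length
```

Two-letter words would give only 16 combinations, fewer than the 20 amino acids, which is why three is the shortest workable codon length.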

From DNA to protein

1970 1971 1975 1977 1980

Understanding the mechanisms, creating the tools

The Central Dogma

Protein Data Bank (PDB): created in 1971 with seven structures

Recombinant DNA is a DNA molecule formed by joining two heterologous molecules, that is, molecules of different origin.

It is produced with restriction enzymes, which are able to "cut" DNA at specific points.

Put very simply, we "cut" a human gene and "paste" it into the DNA of a bacterium; if, for example, it is the gene that governs the production of insulin, putting it into a bacterium "forces" the bacterium to make insulin.


A precursor RNA may often be matured to mRNAs with alternative structures. An example where alternative splicing has a dramatic consequence is somatic sex determination in the fruit fly Drosophila melanogaster.

In this system, the female-specific Sxl protein is a key regulator. It controls a cascade of alternative RNA splicing decisions that finally result in female flies.

1981 1982 1983 1985 1987 1990

Understanding the mechanisms, creating the tools

Read out the letters from a DNA sequence

GTGAGGCGCTGC

1983: The polymerase chain reaction (PCR) is a molecular biology technique, described in 1986 by Kary Mullis, whose aim is to obtain a large number of copies of a particular DNA fragment starting from a minimal amount; in theory a single copy of the original fragment, or template, is enough.
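Why a single template can be enough follows from the arithmetic of doubling. The sketch below assumes an idealised reaction in which every cycle multiplies the number of copies by (1 + efficiency); the function name and the `efficiency` parameter are illustrative assumptions, not part of the original slide.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Copies after `cycles` rounds of PCR under an idealised model.

    efficiency = 1.0 means every molecule is duplicated in every cycle
    (perfect doubling); real reactions fall short of this and plateau.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles
```

Under perfect doubling, 30 cycles starting from one molecule give 2^30 copies, roughly a thousand million.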


Total nucleotides (Nov 07): 188,490,792,445

Number of entries (Nov 07): 106,144,026

The Human Genome Project (HGP) consists of determining the relative positions of all the nucleotides (or base pairs) of the human genome and identifying the roughly 100,000 genes it was then estimated to contain.

The project, with a budget of 3 billion dollars, was founded in 1990 by the Department of Energy and the National Institutes of Health of the United States, with a planned duration of 15 years.


"Imagine several copies of a book, each cut into 10 million small pieces, in such a way that the pieces overlap. Suppose that 1 million pieces have been lost and that the other 9 million are stained with ink. Recover the original text."

HUGO: Idealized representation of the hierarchical shotgun sequencing strategy. A library is constructed by fragmenting the target genome and cloning it into a large-fragment cloning vector; here, BAC vectors are shown. The genomic DNA fragments represented in the library are then organized into a physical map and individual BAC clones are selected and sequenced by the random shotgun strategy. Finally, the clone sequences are assembled to reconstruct the sequence of the genome.

1990 1995 1996 1997 1998 1999 2001

Deciphering the book of life

S.F. Altschul et al. (1990), "Basic Local Alignment Search Tool", J. Molec. Biol., 215(3): 403-10. 15,306 citations

S.F. Altschul et al. (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res., vol. 25, no. 17, pp. 3389-402.

• SSAHA (Ning et al., 2001)

• http://www.sanger.ac.uk/Software/analysis/SSAHA/

• SSAHA is an algorithm for very fast matching and alignment of DNA sequences. It stands for Sequence Search and Alignment by Hashing Algorithm. It achieves its fast search speed by converting sequence information into a `hash table' data structure, which can then be searched very rapidly for matches.

• BLAT (J. Kent, 2002)

• http://genome.ucsc.edu/cgi-bin/hgBlat

• BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 20 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more.
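The hash-table idea described for SSAHA can be illustrated with a toy k-mer index. This is a simplified sketch, not the SSAHA or BLAT implementation: it indexes every overlapping k-letter word, and the function names and the choice of k are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(subject: str, k: int) -> dict:
    """Map every k-letter word of `subject` to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def find_hits(index: dict, query: str, k: int) -> list:
    """Return (query_pos, subject_pos) pairs that share an exact k-mer."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, spos))
    return hits


def best_diagonal(hits: list):
    """Offset (subject_pos - query_pos) supported by the most hits, or None."""
    counts = defaultdict(int)
    for qpos, spos in hits:
        counts[spos - qpos] += 1
    return max(counts, key=counts.get) if counts else None
```

Building the index costs one pass over the database; each query word is then a constant-time lookup, and hits that fall on the same diagonal point to a candidate ungapped match that a real tool would go on to extend and score.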

J. Thompson, T. Gibson, D. Higgins (1994), CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment … Nuc. Acids. Res. 22, 4673-4680

Flowchart of computation steps in Clustal W (Thompson et al., 1994)

1. Pairwise alignment: calculation of distance matrix
2. Creation of unrooted neighbor-joining tree
3. Rooted NJ tree (guide tree) and calculation of sequence weights
4. Progressive alignment following the guide tree

Other methods

Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217.

Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.

Katoh, K., Kuma, K., Toh, H., Miyata, T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res., 33, 511–518.

Lassmann, T., Sonnhammer, E. (2005) Kalign: an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics, 6, 298.

Larkin, M.A. et al. (2007) ClustalW and ClustalX version 2. Bioinformatics, 23(21), 2947–2948.

Tree of Life

http://tolweb.org/tree/phylogeny.html http://itol.embl.de/

1995: The first complete genome of a free-living organism, Haemophilus influenzae.

1996: The yeast genome is completed: approximately 6,000 genes and 14,000,000 base pairs.

1997: The genome of the bacterium E. coli is sequenced: 4,600 genes and 4.5 million nucleotides.

1998: The genome of the worm Caenorhabditis elegans: 18,000 genes and about 100 million nucleotides.

1999: The complete sequence of chromosome 22 is obtained. The HGP is ahead of schedule. The small number of genes found (about 300) comes as a surprise.

Fire A, Xu S, Montgomery M, Kostas S, Driver S, Mello C (1998). "Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans". Nature 391 (6669): 806–11. doi:10.1038/35888. PMID 9486653


Hamilton A, Baulcombe D (1999). "A species of small antisense RNA in posttranscriptional gene silencing in plants". Science 286 (5441): 950–2. PMID 10542148


Dr Alan Wolffe (1999)

• Epigenetics is heritable changes in gene expression that occur without a change in DNA sequence

• Such changes cannot be attributed to changes in DNA sequence (mutations)

• They are as irreversible as mutations (or difficult to reverse)

Gene prediction

In humans:

~22,000 genes

~1.5% of human DNA

Where are the genes?

The GENCODE pipeline

1. mapping of known transcript sequences (ESTs, cDNAs, proteins) onto the human genome

2. manual curation to resolve conflicting evidence

3. additional computational predictions

4. experimental verification

5. FINAL ANNOTATION

August 2008, Bioinformatics tools for Comparative Genomics of Vectors

Genome annotation - building a pipeline

Genome sequence

Map repeats

Genefinding

Protein-coding genes

Map ESTs Map Peptides

nc-RNAs

Functional annotation

Release



Genefinding - ab initio predictions

Use compositional features of the DNA sequence to define coding

segments (essentially exons)

ORFs

Coding bias

Splice site consensus sequences

Start and stop codons

Each feature is assigned a log likelihood score

Use dynamic programming to find the highest scoring path

Need to be trained using a known set of coding sequences

Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
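The simplest of the signals listed above, open reading frames bounded by start and stop codons, can be sketched in a few lines. This is a toy illustration only: it scans the forward strand, ignores splicing, coding bias and likelihood scores, and is not how Genefinder, Augustus or the other tools named here work internally. The function name and the `min_codons` parameter are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list:
    """Return (start, end) of forward-strand ORFs: ATG ... in-frame stop.

    `end` is exclusive and includes the stop codon. `min_codons` counts
    the ATG and the codons after it, but not the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None  # look for the next ATG in this frame
    return orfs
```

A real gene finder would score each candidate with the trained log-likelihood features and pick the best-scoring path by dynamic programming rather than report every ORF.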



ab initio prediction

[Figure: signals along the genome, on both strands: coding potential, ATG and stop codons, splice sites.]




Find best prediction

[Figure: the best gene prediction chosen from the coding potential, ATG and stop codon, and splice site signals along the genome.]



Genefinding - similarity

Use known coding sequence to define coding regions

EST sequences

Peptide sequences

Needs to handle fuzzy alignment regions around splice sites

Needs to attempt to find start and stop codons

Examples: EST2Genome, exonerate, genewise

Use 2 or more genomic sequences to predict genes based on

conservation of exon sequences

Examples: Twinscan and SLAM



Similarity-based prediction

Align

Create prediction

Genome

cDNA/peptide

Example of a simple HMM

EPFL – Bioinformatics I – 05 Dec 2005

Top: model architecture and parameters. Bottom: sequence generation process.

green: state transition probabilities, red: emission probabilities.

Prob(sequence, path|model) = 6.8e-8.
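The quantity Prob(sequence, path | model) is the product of the start probability, the transition probabilities along the path and the emission probability of each symbol. The sketch below uses a hypothetical two-state model (AT-rich versus GC-rich); its parameters are invented for illustration and are not those of the figure, so it does not reproduce the value 6.8e-8.

```python
def joint_probability(sequence, path, start, trans, emit):
    """P(sequence, path | model) for a first-order hidden Markov model."""
    if len(sequence) != len(path) or not sequence:
        raise ValueError("sequence and path must be non-empty and equally long")
    # First position: start in path[0] and emit the first symbol.
    prob = start[path[0]] * emit[path[0]][sequence[0]]
    # Each later position: one state transition, then one emission.
    for i in range(1, len(sequence)):
        prob *= trans[path[i - 1]][path[i]] * emit[path[i]][sequence[i]]
    return prob


# Hypothetical model parameters (assumptions, not taken from the slide).
START = {"AT-rich": 0.5, "GC-rich": 0.5}
TRANS = {
    "AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},
    "GC-rich": {"AT-rich": 0.1, "GC-rich": 0.9},
}
EMIT = {
    "AT-rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC-rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
```

Because the value is a product of many numbers below one, it shrinks quickly with sequence length, which is why practical implementations work with log probabilities.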

Automatic Annotation vs Manual

Automatic Annotation

• Quick whole genome analysis ~ weeks

• Consistent annotation

• Use unfinished sequence/shotgun assembly

• No polyA sites/signals, pseudogenes

• Predicts ~70% loci

Manual Annotation

• Extremely slow: ~3 months for chromosome 6

• Needs finished sequence

• Flexible, can deal with inconsistencies in data

• Most rules have exceptions

• Consult publications as well as databases

Analysis: EGASP predictions vs manual annotation

[Figure: four bar charts of sensitivity (Sn) and specificity (Sp), in percent, at nucleotide, exon, transcript and gene level, for the predictions labelled 9_101_1, 20_79_1, 36_46_1 and 41_77_1.]
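The Sn and Sp measures can be computed by comparing a predicted set of features with the annotated set. The sketch assumes the convention usual in gene-finding evaluation, where Sn = TP / (TP + FN) and Sp = TP / (TP + FP) (what other fields call precision); the function names and the interval representation are illustrative assumptions.

```python
def sensitivity_specificity(predicted: set, annotated: set):
    """Return (Sn, Sp) with Sn = TP/(TP+FN) and Sp = TP/(TP+FP)."""
    true_positives = len(predicted & annotated)
    sn = true_positives / len(annotated) if annotated else 0.0
    sp = true_positives / len(predicted) if predicted else 0.0
    return sn, sp


def coding_positions(exons) -> set:
    """Expand (start, end) exon intervals, end exclusive, into positions."""
    return {pos for start, end in exons for pos in range(start, end)}
```

At nucleotide level the sets hold coding positions; at exon level they hold whole `(start, end)` intervals, so an exon with one wrong boundary counts as a miss, which is why exon-level scores are typically lower than nucleotide-level ones.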

2002 2004 2005 2007 2010

And this is only the beginning

Date       Published complete genomes   Ongoing prokaryotic genomes   Ongoing eukaryotic genomes
10/3/02    104                          316                           218
8/28/03    156                          386                           246
5/07       500                          1500                          700
10/08      874                          2124                          1004

About 4000 genome projects in total by 10/08.

http://www.genomesonline.org

Illumina / Solexa Genome Analyzer: 2000 Mb / run

Applied Biosystems ABI 3730XL: 1 Mb / day

Roche / 454 Genome Sequencer FLX: 100 Mb / run

Applied Biosystems SOLiD: 3000 Mb / run

[Figure: number of bases per run (millions) against date of introduction, 1994 to 2006, for the ABI 370/377, ABI 3700, ABI 3730 and 454-GS20 (32,000,000).]

Although human beings share 99.9 percent of their genetic information, we have small variations called single nucleotide polymorphisms, or SNPs (pronounced "snip"). There are an estimated 10 million SNPs in the human species, and these differences are thought to be related to greater resistance or susceptibility to diseases and drugs.

VARIATION IN THE HUMAN DNA SEQUENCE

Mutation rate = 10^-8 per site per generation

Number of generations from the common ancestor to present-day humans: 10^4 to 10^5
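These two figures give a rough estimate of how much two human sequences should differ. The sketch assumes a simple neutral model in which both lineages accumulate mutations independently since their common ancestor (hence the factor of 2) and repeated hits at the same site are ignored; the function names are illustrative.

```python
def expected_divergence(mutation_rate: float, generations: float) -> float:
    """Expected differences per site between two present-day sequences.

    Each of the two lineages has accumulated `mutation_rate` changes per
    site per generation for `generations` since the common ancestor.
    """
    return 2.0 * mutation_rate * generations


def percent_identity(mutation_rate: float, generations: float) -> float:
    """Percentage of sites expected to be identical under the same model."""
    return 100.0 * (1.0 - expected_divergence(mutation_rate, generations))
```

With a rate of 10^-8 and 10^4 to 10^5 generations the expected divergence lies between 2 x 10^-4 and 2 x 10^-3 per site, the same order of magnitude as the 0.1 percent difference implied by 99.9 percent shared sequence.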

2004: ENCODE, the ENCyclopedia Of DNA Elements

Functional genomics

• Comparative genomics
• Sequence (DNA/RNA) & phylogeny
• Regulation of gene expression; transcription factors & micro RNAs
• Protein sequence analysis & evolution
• Protein families, motifs and domains
• Protein structure & function: computational crystallography
• Protein interactions & complexes: modelling and prediction
• Chemical biology
• Pathway analysis
• Systems modelling
• Image analysis
• Data integration & literature mining

1. Copies of the DNA of the genes of interest are prepared, which are printed on the chip.
2. The RNA samples of interest are prepared (control and sample).
3. Reverse transcription.
4. Fluorescence is added.
5. The samples are hybridised on the microarray.
6. The chip is excited with different lasers (laser 1, laser 2): the control responds to one of them and the sample to the other.
7. Comparing the two images shows which genes are expressed differently.

Schena et al. Science 1995

Microarray analysis

Clinical prediction of Leukemia type

• 2 types

– Acute lymphoid (ALL)

– Acute myeloid (AML)

• Different treatment & outcomes

• Predict type before treatment?

Golub et al. Science 286:531-537 (1999)

Biomarkers discovery

Stages: data management, statistical analysis, annotation, network analysis, selection

Funnel: 30,000 genes, 1500 genes, 150 genes, 50 elements, 10 targets

RT-PCR Standard Processing Procedure

Step 1: Calculate Ct with SDS and export text file (TaqMan Assays)

Step 2: Retrieve data and define experiment design

Step 3: Biological Replicates

Step 4: Selection of Optimal Endogenous Controls & Calculation of ΔCt

Step 5: Differential Expression Analysis ΔΔCt

Checkpoints: Overview Plates & Samples; Quality Control (Raw Values); Discard Samples; Quality Control (ΔCt Overview)
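The ΔCt and ΔΔCt quantities named in steps 4 and 5 are simple differences of cycle-threshold values. The sketch below assumes the widely used 2^(-ΔΔCt) fold-change calculation, which presumes near-perfect amplification efficiency for both target and endogenous control; whether this exact formula is the one used in the procedure on the slide is an assumption, and the function names are illustrative.

```python
def delta_ct(ct_target: float, ct_reference: float) -> float:
    """ΔCt: target gene Ct normalised to the endogenous control Ct."""
    return ct_target - ct_reference


def fold_change(ct_target_treated: float, ct_ref_treated: float,
                ct_target_control: float, ct_ref_control: float) -> float:
    """Relative expression, treated versus control, as 2 ** (-ΔΔCt)."""
    ddct = (delta_ct(ct_target_treated, ct_ref_treated)
            - delta_ct(ct_target_control, ct_ref_control))
    return 2.0 ** (-ddct)
```

A lower Ct means more starting template, so a ΔΔCt of -2 corresponds to a four-fold increase in expression.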


Example of Array CGH Technology*

Chari et al, Cancer Informatics, 2006, 2, 48-58

Source: http://www.chiponchip.org/

ChIP-on-chip

DNA-binding proteins are crosslinked to DNA with formaldehyde in vivo

Isolate the chromatin. Shear DNA along with bound proteins into small fragments.

Bind antibodies specific to the DNA-binding protein to isolate the complex by precipitation. Reverse the cross-linking to release the DNA and digest the proteins.

Use PCR (Polymerase Chain Reaction) to amplify specific DNA sequences to see if they were precipitated with the antibody

ChIP (Chromatin ImmunoPrecipitation)

• Chromatin immunoprecipitation, or ChIP, refers to a procedure used to determine whether a given protein binds to a specific DNA sequence in vivo

Protein Microarray

G. MacBeath and S.L. Schreiber, 2000, Science 289:1760

arrayIT™

Spotting platform and protein microarray

Different Kinds of Protein Arrays*

Antibody Array Antigen Array Ligand Array

Detection by: SELDI MS, fluorescence, SPR, electrochemical, radioactivity, microcantilever

The Microarray Study Process

Preprocessing

Some Questions:

• Which genes have expression levels that are correlated

with some external variable?

• For a given pathway, which of the genes in our collection

are most likely to be involved?

• For a diffuse disease, which genes are associated with

different outcomes?

Challenges for Data Analysis

• Normalization (removing systematic measurement effects)

• Variable Selection (Identification of relevant Variables)

• Large sample Effects:

Type I and Type II errors (False positives / False negatives)

• Dimensionality Reduction

• Identification of new disease classes

• Classification of data into known disease classes

Data Analysis Methods

Dimension Reduction

• PCA (Principal Component Analysis)

• ICA (Independent Component Analysis)

• Multidimensional Scaling

Unsupervised Learning

• K-Means / K-Medoid

• Hierarchical Clustering Algorithms

Supervised Learning

• Linear Discriminant Analysis

• Maximum Likelihood Discrimination

• Nearest Neighbor Methods

• Decision Trees

• Random Forests

Matrix factorization


Popular Classification Methods

• Decision Trees/Rules: find smallest gene sets, but not robust, poor performance

• Neural Nets: work well for reduced number of genes

• K-nearest neighbor: good results for small number of genes, but no model

• Naïve Bayes: simple, robust, but ignores gene interactions

• Support Vector Machines (SVM): good accuracy, does own gene selection, but hard to understand

• Specialized methods, D/S/A (Dudoit), …

Support Vector Machine (SVM)

• Main idea: select the hyperplane that is most likely to generalize to future data

Best Practices

• Capture the complete process, from raw data to final results

• Gene (feature) selection inside cross-validation

• Randomization testing

• Robust classification algorithms

– Simple methods give good results

– Advanced methods can be better

• Wrapper approach for best gene subset selection

• Use bagging to improve accuracy

• Remove/relabel mislabeled or poorly differentiated samples

Alistair Chalk, 2008

Enrichment Analysis

• What are major enriched GO terms?

• What are the highly active pathways?

• What are the frequently interacting proteins?

• What are the known disease associations?
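The first of these questions is commonly answered with a hypergeometric (one-sided Fisher) test: given how many genes in the whole population carry a GO term, how surprising is the number seen in the selected list? The sketch below is a minimal standard-library version; the function and parameter names are assumptions, and a real analysis would also correct for testing many terms at once.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int,
                       selected: int, overlap: int) -> float:
    """P(X >= overlap) under the hypergeometric distribution.

    `selected` genes are drawn without replacement from `population`
    genes, of which `annotated` carry the term of interest.
    """
    total = comb(population, selected)
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

For example, if 3 of 10 genes carry a term and all 3 selected genes carry it, the probability of that happening by chance is 1/120.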

Meta-analysis example: “Creation and implications of a phenome-genome network”, Butte and Kohane. Nat Biotech. 2006


• Clustered experiments based on mapping concepts found in sample annotations to UMLS meta-thesaurus.

• Relationships found between phenotype (e.g., aging), disease (e.g., leukemia), environmental (e.g., injury) and experimental (e.g., muscle cells) factors and genes with differential expression.

• “the ease and accuracy of automating inferences across data are crucially dependent on the accuracy and consistency of the human annotation process, which will only happen when every investigator has a better prospective understanding of the long-term value of the time invested in improving annotations.”

Systems biology

PPI ANNOTATION AND DATABASES

Database   Reference                 URL
HPID       Han et al., 2004          http://www.hpid.org
IntAct     Hermjakob et al., 2004    http://www.ebi.ac.uk/intact
HPRD       Peri et al., 2004         http://www.hprd.org/
DIP        Xenarios et al., 2002     http://dip.doe-mbi.ucla.edu/
MINT       Zanzoni et al., 2002      http://mint.bio.uniroma2.it/mint

IMEx agreement to share curation efforts

Proteomics Standards Initiative (PSI) recommendation

Molecular Interaction (MI) Ontology

Large scale experiments

Literature curation

Complex networks

• Many systems can be represented as networks (graphs)
– Nodes: individual components (proteins)
– Edges: relationships (interactions)

• They share common properties
– Scale-free
– Hierarchical
– Clustering

• Some properties may be intrinsic and can be understood better when put into the context of evolution

Detecting Hierarchical Organization

Summary: Network Measures

• Degree ki

The number of edges involving node i

• Degree distribution P(k)

The probability (frequency) of nodes of degree k

• Mean path length

The avg. shortest path between all node pairs

• Network Diameter

– i.e. the longest shortest path

• Clustering Coefficient

– A high CC is found for modules
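The measures summarised above can be computed directly on a small undirected graph stored as an adjacency dictionary. This is a minimal standard-library sketch that assumes a connected graph with at least two nodes; the function names and the graph representation are illustrative choices.

```python
from collections import deque


def degrees(graph: dict) -> dict:
    """Degree k_i: number of edges involving each node i."""
    return {node: len(nbrs) for node, nbrs in graph.items()}


def degree_distribution(graph: dict) -> dict:
    """P(k): fraction of nodes that have degree k."""
    counts = {}
    for k in degrees(graph).values():
        counts[k] = counts.get(k, 0) + 1
    return {k: c / len(graph) for k, c in counts.items()}


def clustering_coefficient(graph: dict, node) -> float:
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    nbrs = list(graph[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in graph[nbrs[i]])
    return 2.0 * links / (k * (k - 1))


def shortest_path_lengths(graph: dict, source) -> dict:
    """Breadth-first search distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def mean_path_length_and_diameter(graph: dict):
    """Average and longest shortest path over all reachable node pairs."""
    lengths = [d for s in graph
               for t, d in shortest_path_lengths(graph, s).items() if t != s]
    return sum(lengths) / len(lengths), max(lengths)


# Toy network: a triangle A-B-C with a pendant node D attached to C.
TOY = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
```

In the toy network, node A sits inside a fully connected triangle and so has a clustering coefficient of 1, while the hub C, which also links out to D, scores lower.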

Mapping the phenotypic data to the network

Begley TJ, Rosenbach AS, Ideker T, Samson LD. Damage recovery pathways in Saccharomyces cerevisiae revealed by genomic phenotyping and interactome mapping. Mol Cancer Res. 2002 Dec;1(2):103-12.

• Systematic phenotyping of 1615 gene knockout strains in yeast

• Evaluation of growth of each strain in the presence of MMS (and other DNA damaging agents)

• Screening against a network of 12,232 protein interactions

The Role of Proteomics

• The existence of an ORF does not imply the existence of a functional gene.

• Limitations of comparative genomics.

• mRNA levels may not correlate with protein levels.

• Protein modifications: post-transcriptional modifications, isoforms, post-translational modifications, mutants.

• Issues of proteolysis, sequestration, etc. are relevant only at the protein level.

• Protein complex composition, protein-protein interactions, structures.

Structural proteomics

• Folding

• Structure and function

• Protein structure prediction

• Secondary structure

• Tertiary structure

• Function

• Post-translational modification

• Prot.-Prot. Interaction -- Docking algorithm

• Molecular dynamics/Monte Carlo

What kinds of methods are around?

5 main levels of protein Structure prediction:

1. Extensive Sequence Search

2. Threading and 1D-3D profiles

3. Ab initio prediction of protein structure

4. Comparative Modelling

5. Docking (domain interaction prediction)

Prediction of Protein Structures

• Examples: a few good examples

[Figure: four pairs of actual and predicted protein structures shown side by side.]

MODPIPE: Large-Scale Comparative Protein Structure Modeling

START

For each target sequence (PSI-BLAST):
1. Get profile for sequence (NR)
2. Scan sequence profile against representative PDB chains
3. Scan PDB chain profiles against sequence
4. Select templates using permissive E-value cutoff

For each template structure (MODELLER):
1. Expand match to cover complete domains
2. Align matched parts of sequence and structure
3. Build model for target segment by satisfaction of spatial restraints
4. Evaluate model

END

R. Sánchez & A. Šali, Proc. Natl. Acad. Sci. USA 95, 13597, 1998.

N. Eswar, M. Marti-Renom, M.S. Madhusudhan, B. John, A. Fiser, R. Sánchez, F. Melo, N. Mirkovic, A. Šali.

Structural Proteomics: The Motivation*

[Figure: number of sequences and number of structures by year, 1980 to 2005, on two axes running to 2,000,000 and 200,000.]

The hierarchies of protein structure


Docking Programs

• Dock (UCSF)

• Autodock (Scripps)

• Glide (Schrodinger)

• ICM (Molsoft)

• FRED (Open Eye)

• Gold, FlexX, etc.

Cell cycle network from KEGG


Graphical Notation: a necessity for the conceptual representation of biopathways

Thiery & Sleeman, Nat. Rev. Mol. Cell. Biol 7:131 (2006)

Aladjem et al., Science STKE pe8 (2004)

Qualitative / Mechanistic: various degrees of detail, mixed levels of presentation

Strategies: simulate or analyse? (or rather what to do first)

Simulate first: convert diagram into a quantitative model; simulate model behavior numerically; obtain qualitative understanding through numerical results and model reduction.

Analyse first: qualitatively analyze network topology, stability, etc.; identify “elementary modes”; build and simulate a reduced model.

Space of modeling methods

[Figure: modeling methods arranged along axes such as continuous ↔ discrete; methods shown include stochsim and Boolean networks.]

Continuum of modeling approaches: top-down to bottom-up

Frazier et al. (2003) Science 11 April Vol 300:290-293

Data integration

Nucleic Acids Research article lists 1078 public databases

Nucleic Acids Research, 2008, Vol. 36, Database issue

http://nar.oxfordjournals.org/cgi/reprint/36/suppl_1/D2

Growth in Available Bioinformatics Databases

Too much unintegrated data

• Data sources incompatible

• No (or few) standard naming conventions

• No common interface (varying tools for browsing, querying and visualizing data)

– Small, isolated, independent, groups/individuals

– Loosely coupled provider-consumer of resources.

– Commonly resource consumers

– Boutique suppliers.

– Poor access to systems administrators

– Large experiments or large research groups/labs, possibly distributed

– Large service provider institutes.

– Tightly coupled provider-consumer of resources.

– Commonly resource providers.

– Some or lots of access to systems administrators

Challenges: Names and Identity

Q93038 = Tumor necrosis factor receptor superfamily member 25 precursor

• GENE: Name=TNFRSF25
• WSL-1 protein
• Apoptosis-mediating receptor DR3
• Apoptosis-mediating receptor TRAMP
• Death domain receptor 3
• WSL protein
• Apoptosis-inducing receptor AIR
• Apo-3
• Lymphocyte-associated receptor of death
• LARD

Accession numbers: Q92983, O00275, O00276, O00277, O00278, O00279, O00280, O14865, O14866, P78507, P78515, Q93036, Q93037, Q99722, Q99830, Q99831, Q9BY86, Q9UME0, Q9UME1, Q9UME5

Annotation history: http://www.expasy.org/uniprot/Q93038

GUIDs: Life Science Identifier? Normalisation
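The identity problem can be made concrete with a small lookup that normalises every alias to a single identifier. The sketch assumes, as the slide suggests, that the listed accession numbers and names all refer to the entry Q93038; the function names are hypothetical, and a real resolver would query the database rather than rely on a hard-coded list.

```python
PRIMARY = "Q93038"

# Accessions and names listed on the slide, assumed to denote the same protein.
ACCESSIONS = [
    "Q92983", "O00275", "O00276", "O00277", "O00278", "O00279", "O00280",
    "O14865", "O14866", "P78507", "P78515", "Q93036", "Q93037", "Q99722",
    "Q99830", "Q99831", "Q9BY86", "Q9UME0", "Q9UME1", "Q9UME5",
]
NAMES = [
    "WSL-1 protein", "Apoptosis-mediating receptor DR3",
    "Apoptosis-mediating receptor TRAMP", "Death domain receptor 3",
    "WSL protein", "Apoptosis-inducing receptor AIR", "Apo-3",
    "Lymphocyte-associated receptor of death", "LARD", "TNFRSF25",
]


def build_lookup(primary: str, accessions: list, names: list) -> dict:
    """Map every alias, compared case-insensitively, to the primary accession."""
    lookup = {primary.upper(): primary}
    for alias in list(accessions) + list(names):
        lookup[alias.strip().upper()] = primary
    return lookup


def normalise(identifier: str, lookup: dict):
    """Return the primary accession for a known alias, or None if unknown."""
    return lookup.get(identifier.strip().upper())
```

Thirty different strings collapse to one key here, which is the kind of normalisation an integration layer has to perform before records from different sources can be joined.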

Why must we support standards?

• Unambiguous representation, description

and communication

– Final results and metadata

• Interoperability

– Data management and analysis

• Integration of omics data in systems biology

What to standardize?

• CONTENT: Minimal/Core Information to be reported

• MIBBI (http://www.mibbi.org)

• SEMANTIC: Terminology Used -> Ontologies

• OBI (http://obi-ontology.org)

• SYNTAX: Data Model, Data Exchange

• FuGE (http://fuge.sourceforge.net/)

MIBBI: Standard Content

Promoting Coherent Minimum Reporting Requirements for Biological and Biomedical Investigations: The MIBBI Project, Taylor et al., Nature Biotech.

Link Integration: Integration Lite

[Diagram: user interface, application and application interface, with an ontology authority and an identity authority.]

Warehouse

[Diagram: user interface and application on top of a unified model with data access and query, fed by wrappers.]

• Copy the data sets, clean and massage data into shape

• Combine them into a (different) pre-determined model before query

• ATLAS, MRS, e-Fungi, GIMS, Medicel Integrator, MIPS, BioMART

• Often called “Knowledge bases”

View integration

• Data at Source; Virtual integrating database view

• Global as View / Local as View mappings between models

• Map from model to databases dynamically so always fresh

• TAMBIS, Information Integrator, K4, ComparaGrid, UTOPIA, caCORE

[Diagram: user interface and application on top of a unified model with data access and query, connected to the sources through wrappers.]

Specialist Integrating Application

E.g. Ensembl, UTOPIA

• Very popular. Known to be one application.

[Diagram: user interface, application, wrappers]

Workflows

• Data flow protocol. Automated data chaining.

• General technique for describing and enacting a process

• Describes what you want to do, not how you want to do it

• Various degrees of data type compliance anticipated
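The idea of describing what to do and leaving the chaining to an engine can be sketched in a few lines. Everything here is hypothetical: the three steps, the toy input string, and the `run` function are stand-ins for real tools and for a real workflow engine.

```python
from typing import Any, Callable, Dict, List, Tuple


# Hypothetical processing steps; real ones would wrap tools or web services.
def fetch(acc: str) -> str:
    return {"Q93038": "acgtac"}.get(acc, "")  # toy string, not real data


def upper(seq: str) -> str:
    return seq.upper()


def length(seq: str) -> int:
    return len(seq)


# The workflow is data: (output name, input it consumes, function to apply).
WORKFLOW: List[Tuple[str, str, Callable[[Any], Any]]] = [
    ("sequence", "accession", fetch),
    ("clean", "sequence", upper),
    ("length", "clean", length),
]


def run(workflow, inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Minimal engine: feed each step the named output of an earlier step."""
    data = dict(inputs)
    for name, source, func in workflow:
        data[name] = func(data[source])
    return data


result = run(WORKFLOW, {"accession": "Q93038"})
print(result["clean"], result["length"])
```

The workflow itself is only a list of named steps, so it can be stored, shared, and re-enacted without changing the engine.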

[Diagram: user interface, application, workflow engine, wrapper]

Mash-Up Data Marshalling

• Content syndication and feeds

• Emphasis on User creating specific integration by mapping.

• Just in time, just enough design

• On demand integration

[Diagram: user interface, mash-up application, protocol objects, protocols]

Composite applications


Can the Semantic Web help?

• Slight problem: we have no first-class infrastructure for metadata migration and management, one in which metadata lives outside the application, in the middleware, and which can handle progressive curation

[Diagram: user interface, application, access and query, wrappers]

Semantic Enrichment

Model flattening

Mapping Transparency

[Diagram labels: dataflow workflow, web services (ws), curation, submission, advanced search, retrieve data, submit data]

Service Oriented Architecture

http://www.ebi.ac.uk/fasta/lgicp.html

http://www.ebi.ac.uk/blast2/asd.html

http://www.ebi.ac.uk/clustalw/

Distributed Annotation System

An Integrative Analysis Example

• Relational data mining

• Text mining

• Spectrum data mining

• Chemical sequence data model

• Visualizing relational data clusters

• Visualizing multidimensional data

• Visualizing sequence data

• Visualizing pathway data

• Text mining visualization

• Visualizing cluster statistics

• Visualizing serial/spectrum data

• Decision tree model of metabonomic profile

• Chemical structure visualization

From experiments to scientific publications

1- Experiments

Planning and carrying out experiments (lab work)

2- Results

Processing and interpretation of obtained results

3- Scientific peer-reviewed articles

'Relevant' results are published in scientific journals

PubMed/Medline database at NCBI

- Developed at the National Center for Biotechnology Information (NCBI).

- The core 'Textome'.

- Repository of citation entries of scientific articles.

- PubMed titles and abstracts are the primary data source for Bio-NLP.

- ~450,000 new abstracts per year

- More than 4,800 biomedical journals

- Entrez search engine

Scientific Journals

Journal-specific information:

• Format

• Paper structure (sections)

• Article type

Data in scientific articles

Free Text

Title

Abstracts

Keywords

Text body

References

Tables

Figures

Biomedical literature characteristics

- Heavy use of domain-specific terminology (12% biochemistry-related technical terms).

- Polysemic words (word sense disambiguation).

- Most words with low frequency (data sparseness).

- New names and terms created.

- Typographical variants

- Different writing styles (native languages)
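Typographical variants are the easiest of these problems to illustrate. The sketch below groups surface forms that differ only in case, hyphenation, or spacing. The variant spellings are invented for illustration from names in the TNFRSF25 synonym list, and `normalise_term` is a hypothetical helper, not a method used by any particular Bio-NLP system.

```python
import re
from collections import defaultdict


def normalise_term(term: str) -> str:
    """Collapse case, hyphens, slashes and whitespace so variants compare equal."""
    return re.sub(r"[\s\-_/]+", "", term).lower()


# Invented variant spellings of names from the TNFRSF25 synonym list.
mentions = ["Apo-3", "APO3", "apo 3", "LARD", "Lard", "TNFRSF25", "TNFRSF-25"]

groups = defaultdict(list)
for mention in mentions:
    groups[normalise_term(mention)].append(mention)

for key, variants in groups.items():
    print(key, variants)
```

Note that the same normalisation would also merge the protein name LARD with the ordinary English word "lard", which is the polysemy problem listed above: string normalisation alone cannot decide which sense is meant.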

BioCreative

BioCreative results

1: Chiang et al.

2: Couto et al.

3: Ehrler et al.

4: Ray et al.

5: Rice et al.

6: Verspoor et al.

TP: prediction in which both the protein and the GO term were evaluated as correct

Precision: TP / total number of evaluated submissions
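The precision measure defined above is a single ratio. The sketch below computes it with basic input checks. The counts are invented for illustration and are not BioCreative results.

```python
def precision(true_positives: int, evaluated: int) -> float:
    """Precision = TP / total number of evaluated submissions."""
    if evaluated <= 0:
        raise ValueError("no evaluated submissions")
    if not 0 <= true_positives <= evaluated:
        raise ValueError("TP must be between 0 and the number evaluated")
    return true_positives / evaluated


# Invented counts for illustration only; these are not BioCreative results.
print(precision(30, 120))
```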

Data Integration

• Standards, DBs

Knowledge Discovery

• Algorithms, Informatics, Machine Learning

Integrate knowledge

• Text mining, Ontologies

Modelling

• Pathways, Circuits, Abstraction

Infrastructure

Support Research

The challenges of biology in the next 50 years

• List all the molecular components that make up an organism:

– Genes, proteins, and other functional elements

• Understand the function of each component

• Understand how they interact

• Study how function has evolved

• Find the genetic defects that cause disease

• Design drugs and therapies rationally

• Sequence each individual's genome and use it in personalized medicine

• Bioinformatics is an essential component in achieving all of these goals
