Challenges in Bioinformatics
Bioinformatics: biology by other means
Alberto Labarga
UGR, November 2008
Computational Biology
Bioinformatics[Biological Information]
1859 1866 1870 1900 1902
Towards a scientific theory of heredity
In 1859 Charles Darwin publishes 'On the Origin of Species', which proposes that living beings are the result of natural selection and that all creatures have evolved over the generations through small changes.
Mendel's laws, published in 1866, rediscovered in 1900.
In 1870 the Swiss scientist Friedrich Miescher isolates the material stored in the cell nucleus, composed mainly of proteins and nucleic acids. At the time it was believed that the carrier of hereditary information had to be protein, built from 20 amino acids, whereas nucleic acids had only 4 components.
At the beginning of the 20th century, Phoebus Levene discovered that DNA is a chain of nucleotides, each composed of a sugar (deoxyribose), a phosphate group and a nitrogenous base of one of four types: adenine, thymine, guanine and cytosine.
Walter Sutton, a graduate student in E. B. Wilson’s
lab at Columbia University, observed that in the
process of cell division, called meiosis, that produces
sperm and egg cells, each sperm or egg receives only
one chromosome of each type. (In other parts of the
body, cells have two chromosomes of each type, one
inherited from each parent.) The segregation pattern
of chromosomes during meiosis matched the
segregation patterns of Mendel’s genes.
1928 1944 1949 1952 1953
The discovery of DNA
1928 Frederick Griffith: the transforming principle. When he mixed R pneumococci with heat-killed S pneumococci, the mice died. Moreover, in the blood of these dead mice Griffith found encapsulated (S) pneumococci.
In 1944 Oswald Avery and his collaborators, who were studying Pneumococcus, the bacterium that causes pneumonia, found that bacteria contain nucleic acids and that DNA is the molecule that stores the genes. Later studies with viruses confirmed this theory, even though DNA was still believed to be too simple.
Life can be seen as a process of storage and transmission of biological information.
Chromosomes are the carriers of this information.
The information is stored in the form of a molecular code.
To understand life we must identify these molecules and decipher the code.
1949: DNA is duplicated during cell division. Chargaff: A = T and G = C
1952: Hershey-Chase experiment
M.H.F. Wilkins, A.R. Stokes, H.R. Wilson:
Molecular Structure of Deoxypentose Nucleic
Acids. Nature 171, 738 (1953)
R.E. Franklin and R.G. Gosling
Molecular Configuration in Sodium
Thymonucleate, Nature 171, 740
(1953)
MOLECULAR STRUCTURE
OF NUCLEIC ACIDS
“We wish to propose a
structure for the salt of
desoxyribose nucleic acid
(DNA). This structure has
novel features which are of
considerable biological
interest”
Nature, 25 April 1953
“It has not escaped our
attention that the specific
pairing we have
postulated immediately
suggests a possible
copying mechanism for
the genetic material.”
The base pairs
1955 1959 1962 1966
In 1955 Ochoa, together with the French-Russian biochemist Marianne Grunberg-Manago, published in the Journal of the American Chemical Society the isolation of a bacterial enzyme, from Azotobacter vinelandii, that catalyzes the synthesis of RNA, the intermediary between DNA and proteins. The discoverers named the enzyme "polynucleotide phosphorylase"; it was at first taken to be the cell's RNA-synthesizing enzyme, a role later attributed to RNA polymerase. The discovery of polynucleotide phosphorylase made it possible to prepare synthetic polynucleotides of varying base composition, with which Severo Ochoa's group, in parallel with Marshall Nirenberg's group, went on to decipher the genetic code.
When Perutz arrived in Cambridge, the largest molecular structure that had been solved was that of the pigment phthalocyanine, with 58 atoms. A protein has thousands of atoms. Bernal, his supervisor, had recorded some X-ray diffraction images of crystals of a protein, pepsin, but had not managed to interpret them. The subject Perutz chose for his thesis was another protein, hemoglobin, the oxygen carrier that gives our blood its red color. Hemoglobin has no fewer than 11,000 atoms. It took him 23 years.
Over the course of several years,
Marshall Nirenberg, Har Khorana and
Severo Ochoa and their colleagues
elucidated the genetic code – showing
how nucleic acids with their 4-letter
alphabet determine the order of the 20
kinds of amino acids in proteins.
Messenger RNA is interpreted three
letters at a time; a set of three
nucleotides forms a "codon" that
encodes an amino acid. A three-letter
word made of four possible letters can
have 64 (4 x 4 x 4) permutations, which
is more than enough to encode the 20
amino acids in living beings.
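The counting argument above can be checked with a few lines of Python. This is a minimal sketch: the standard genetic code is written as a 64-letter string in U, C, A, G order, and the names `CODON_TABLE` and `translate` are illustrative choices, not part of the original slides.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code listed in UCAG order ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Enumerate all 4 x 4 x 4 three-letter words and pair each with its meaning.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Read an mRNA three letters at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate("AUGGCCUAA")` reads AUG (M), GCC (A) and then stops at UAA. The table holds 64 codons but only 20 distinct amino acids plus the stop signal, which is why several codons encode the same amino acid.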
From DNA to protein
1970 1971 1975 1977 1980
Understanding the mechanisms, building the tools
The Central Dogma
Protein Data Bank (PDB): created in 1971 with seven structures
Recombinant DNA is a DNA molecule formed by joining two heterologous molecules, that is, molecules of different origin.
It is made using restriction enzymes, which can "cut" DNA at specific points.
Put very simply, we "cut" a human gene and "paste" it into the DNA of a bacterium; if, for example, it is the gene that controls the production of insulin, placing it in a bacterium "forces" the bacterium to produce insulin.
A precursor-RNA may often be matured to
mRNAs with alternative structures. An example
where alternative splicing has a dramatic
consequence is somatic sex determination in the
fruit fly Drosophila melanogaster.
In this system, the female-specific sxl-protein
is a key regulator. It controls a cascade of
alternative RNA splicing decisions that finally
result in female flies.
1981 1982 1983 1985 1987 1990
Understanding the mechanisms, building the tools
Read out the letters from a DNA sequence
GTGAGGCGCTGC
1983: The polymerase chain reaction, known as PCR, is a molecular biology technique described in 1986 by Kary Mullis. Its goal is to obtain a large number of copies of a particular DNA fragment starting from a minimal amount; in theory a single copy of the original fragment, or template, is enough.
Total nucleotides (Nov 07: 188,490,792,445)
Number of entries (Nov 07: 106,144,026)
The Human Genome Project (HGP) aims to determine the relative positions of all the nucleotides (or base pairs) of the human genome and to identify the roughly 100,000 genes then thought to be present in it.
The project, with a budget of 3 billion dollars, was launched in 1990 by the Department of Energy and the National Institutes of Health of the United States, with a planned duration of 15 years.
"Imagine several copies of a book, each cut into 10 million small pieces, in such a way that the pieces overlap. Suppose that 1 million pieces have been lost and that the other 9 million are stained with ink. Recover the original text."
HUGO: Idealized representation of the hierarchical shotgun sequencing strategy. A library is constructed by
fragmenting the target genome and cloning it into a large-fragment cloning vector; here, BAC vectors are shown. The
genomic DNA fragments represented in the library are then organized into a physical map and individual BAC clones
are selected and sequenced by the random shotgun strategy. Finally, the clone sequences are assembled to reconstruct
the sequence of the genome.
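The final assembly step can be illustrated with a toy greedy merger. This is only a sketch of the overlap idea under strong assumptions (error-free reads, a single strand, no repeats longer than the minimum overlap); production assemblers are far more elaborate, and the function names here are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Drop fragments that are fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more: leave separate contigs
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags
```

Calling `greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"])` rebuilds the single contig `ATGCGTACGTTAGC`. The greedy choice is simple but can be misled by repeated sequence, which is one reason the hierarchical strategy first maps clones before sequencing them.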
1990 1995 1996 1997 1998 1999 2001
Deciphering the book of life
S.F. Altschul, et al. (1990), "Basic Local
Alignment Search Tool," J. Molec.
Biol., 215(3): 403-10, 1990. 15,306
citations
Altschul, S.F. et al (1997), “Gapped
BLAST and PSI-BLAST: a new
generation of protein database search
programs”, Nucleic Acids Res., vol. 25,
no. 17, pp. 3389-402.
• SSAHA (Ning et al., 2001)
• http://www.sanger.ac.uk/Software/analysis/SSAHA/
• SSAHA is an algorithm for very fast matching and alignment of DNA
sequences. It stands for Sequence Search and Alignment by Hashing
Algorithm. It achieves its fast search speed by converting sequence
information into a `hash table' data structure, which can then be
searched very rapidly for matches.
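The hash-table idea can be sketched in a few lines. This is not the SSAHA implementation: it is a simplified illustration that indexes every overlapping k-mer of the database and counts query k-mers that fall on the same diagonal; all names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map every k-mer to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((name, offset))
    return index


def find_hits(index, query, k):
    """Count shared k-mers per (sequence name, diagonal).

    The diagonal is subject offset minus query offset, so k-mers that
    belong to the same ungapped match share one diagonal.
    """
    hits = defaultdict(int)
    for q_off in range(len(query) - k + 1):
        for name, s_off in index.get(query[q_off:q_off + k], ()):
            hits[(name, s_off - q_off)] += 1
    return dict(hits)
```

Building the index costs one pass over the database; after that each query k-mer is a dictionary lookup, which is where the speed comes from. Diagonals with many hits are the candidate regions that a real tool would go on to align in detail.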
• BLAT (J. Kent, 2002)
• http://genome.ucsc.edu/cgi-bin/hgBlat
• BLAT on DNA is designed to quickly find sequences of 95% and greater
similarity of length 40 bases or more. It may miss more divergent or
shorter sequence alignments. It will find perfect sequence matches of 33
bases, and sometimes find them down to 20 bases. BLAT on proteins
finds sequences of 80% and greater similarity of length 20 amino acids
or more.
J. Thompson, T. Gibson, D.
Higgins (1994), CLUSTAL W:
improving the sensitivity of
progressive multiple sequence
alignment … Nuc. Acids. Res. 22,
4673 - 4680
Flowchart of computation steps in
Clustal W (Thompson et al., 1994)
Pairwise alignment: calculation of distance matrix
Creation of unrooted neighbor-joining tree
Rooted NJ tree (guide tree) and calculation of sequence weights
Progressive alignment following the guide tree
Other methods
Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for
fast and accurate multiple sequence alignment. J. Mol. Biol, 302, 205–217.
Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high
accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.
Katoh, K., Kuma, K., Toh, H., Miyata, T. (2005) MAFFT version 5:
improvement in accuracy of multiple sequence alignment. Nucleic Acids
Res, 33, 511–518.
Lassmann, T., Sonnhammer, E. (2005) Kalign – an accurate and fast multiple
sequence alignment algorithm. BMC Bioinformatics , 6, 298.
Larkin M.A. et al. (2007) ClustalW and ClustalX version 2. Bioinformatics 2007
23(21): 2947-2948.
Tree of Life
http://tolweb.org/tree/phylogeny.html http://itol.embl.de/
1995: The first complete genome of a free-living organism, Haemophilus influenzae.
1996: The yeast genome is completed: approximately 6,000 genes and 14,000,000 base pairs.
1997: The genome of the bacterium E. coli is sequenced: 4,600 genes and 4.5 million nucleotides.
1998: The genome of the worm Caenorhabditis elegans: 18,000 genes and about 100 million nucleotides.
1999: The complete sequence of chromosome 22 is obtained. The HGP is ahead of schedule. The small number of genes found (about 300) comes as a surprise.
Fire A, Xu S, Montgomery M, Kostas
S, Driver S, Mello C (1998). "Potent
and specific genetic interference by
double-stranded RNA in
Caenorhabditis elegans". Nature 391
(6669): 806–11. doi:10.1038/35888.
PMID 9486653
Hamilton A, Baulcombe D
(1999). "A species of small
antisense RNA in
posttranscriptional gene
silencing in plants". Science
286 (5441): 950–2.
PMID 10542148
Dr Alan Wolffe (1999)
• Epigenetics: heritable changes in gene expression that occur without a change in DNA sequence
• Such changes cannot be attributed to changes in DNA sequence (mutations)
• They are as irreversible as mutations (or difficult to reverse)
Gene prediction
In humans:
~22,000 genes
~1.5% of human DNA
Where are the genes?
The GENCODE pipeline
1. mapping of known transcript sequences (ESTs, cDNAs, proteins) onto the human genome
2. manual curation to resolve conflicting evidence
3. additional computational predictions
4. experimental verification
5. FINAL ANNOTATION
August 2008 Bioinformatics tools for Comparative
Genomics of Vectors
Genome annotation - building a pipeline
Genome sequence
Map repeats
Genefinding
Protein-coding genes
Map ESTs Map Peptides
nc-RNAs
Functional annotation
Release
Genefinding - ab initio predictions
Use compositional features of the DNA sequence to define coding
segments (essentially exons)
ORFs
Coding bias
Splice site consensus sequences
Start and stop codons
Each feature is assigned a log likelihood score
Use dynamic programming to find the highest scoring path
Need to be trained using a known set of coding sequences
Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
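One of the features listed above, open reading frames bounded by start and stop codons, is easy to illustrate. The sketch below scans only the forward strand and applies no scoring, so it is far simpler than the trained gene finders named in the examples; the function name and the `min_codons` threshold are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    end is exclusive and includes the stop codon; min_codons is the minimum
    number of codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs
```

For the toy input `"CCATGGCTTGA"` the only ORF is in frame 2, covering `ATGGCTTGA`. A real ab initio predictor would combine such candidates with coding bias and splice-site scores and pick the highest-scoring path by dynamic programming.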
ab initio prediction
[Figure: tracks along the genome showing coding potential, ATG & stop codons and splice sites; the final step is to find the best prediction.]
Genefinding - similarity
Use known coding sequence to define coding regions
EST sequences
Peptide sequences
Needs to handle fuzzy alignment regions around splice sites
Needs to attempt to find start and stop codons
Examples: EST2Genome, exonerate, genewise
Use 2 or more genomic sequences to predict genes based on
conservation of exon sequences
Examples: Twinscan and SLAM
Similarity-based prediction
[Figure: a cDNA or peptide sequence is aligned to the genome and a prediction is created from the alignment.]
Example of a simple HMM
EPFL – Bioinformatics I – 05 Dec 2005
Top: model architecture and parameters. Bottom: sequence generation process.
green: state transition probabilities, red: emission probabilities.
Prob(sequence, path|model) = 6.8e-8.
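The quantity Prob(sequence, path|model) is the product of the transition and emission probabilities along the path. The sketch below shows that computation; the two-state model and its parameters are hypothetical and are not the ones drawn on the slide, so it does not reproduce the 6.8e-8 value.

```python
def joint_probability(sequence, path, start, transitions, emissions):
    """P(sequence, path | model) for a first-order hidden Markov model."""
    if len(sequence) != len(path):
        raise ValueError("sequence and path must have the same length")
    prob = 1.0
    previous = None
    for symbol, state in zip(sequence, path):
        # Enter the state: start probability first, then transitions.
        prob *= start[state] if previous is None else transitions[previous][state]
        # Emit the observed symbol from that state.
        prob *= emissions[state][symbol]
        previous = state
    return prob


# Hypothetical two-state model (AT-rich versus GC-rich regions).
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANSITIONS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMISSIONS = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
```

For the sequence `"AC"` and the path AT_rich then GC_rich, the product is 0.5 x 0.4 x 0.1 x 0.4 = 0.008. For long sequences these products underflow quickly, so practical code sums log probabilities instead.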
Automatic Annotation vs Manual
Automatic Annotation
• Quick whole genome analysis ~ weeks
• Consistent annotation
• Use unfinished sequence/shotgun assembly
• No polyA sites/signals, pseudogene
• Predicts ~70% loci
Manual Annotation
• Extremely slow: ~3 months for Chr 6
• Need finished seq
• Flexible, can deal with inconsistencies in data
• Most rules have exceptions
• Consult publications as well as databases
Analysis of EGASP predictions vs manual annotation
[Figure: four bar charts of sensitivity (Sn) and specificity (Sp), in percent, at the nucleotide, exon, transcript and gene level for predictions 9_101_1, 20_79_1, 36_46_1 and 41_77_1.]
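The Sn and Sp measures can be made concrete at the nucleotide level. The sketch below assumes the convention common in gene-prediction benchmarks, where Sn = TP / (TP + FN) and Sp = TP / (TP + FP); the helper names and the half-open exon coordinates are illustrative choices.

```python
def positions(exons):
    """Expand half-open (start, end) exon intervals into a set of positions."""
    return {p for start, end in exons for p in range(start, end)}


def sensitivity_specificity(annotated, predicted):
    """Nucleotide-level Sn and Sp from sets of coding positions."""
    tp = len(annotated & predicted)   # coding in both
    fn = len(annotated - predicted)   # annotated but missed
    fp = len(predicted - annotated)   # predicted but not annotated
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp
```

A prediction covering positions 50 to 149 against an annotation covering 0 to 99 gives Sn = Sp = 0.5. The exon, transcript and gene levels use the same ratios but count a unit as correct only when its boundaries match exactly, which is why those scores are lower.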
2002 2004 2005 2007 2010
And this is only the beginning
Genome projects (http://www.genomesonline.org), chart scale up to 4000
Date      Published complete genomes   Ongoing prokaryotic genomes   Ongoing eukaryotic genomes
10/3/02   104                          316                           218
8/28/03   156                          386                           246
5/07      500                          1500                          700
10/08     874                          2124                          1004
Applied Biosystems ABI 3730XL: 1 Mb / day
Roche / 454 Genome Sequencer FLX: 100 Mb / run
Illumina / Solexa Genome Analyzer: 2000 Mb / run
Applied Biosystems SOLiD: 3000 Mb / run
[Figure: bases per run (millions) by date of introduction, 1994 to 2006, for the ABI 370/377, ABI 3700, ABI 3730 and 454-GS20 (32,000,000).]
Although humans share 99.9 percent of their genetic information, we have small variations called single nucleotide polymorphisms, or SNPs (pronounced "snip"). There are an estimated 10 million SNPs in the human species, and these differences are thought to be related to greater resistance or susceptibility to diseases and drugs.
VARIATION IN THE HUMAN DNA SEQUENCE
Mutation rate = 10^-8 per site per generation
Number of generations from the common ancestor to present-day humans: 10^4 to 10^5
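These two figures are enough for a back-of-the-envelope estimate of how much two human sequences should differ. The sketch assumes neutral mutations, no repeated hits at the same site, and that both lineages accumulate changes independently since the common ancestor (hence the factor of 2); the function name is illustrative.

```python
def expected_divergence(mutation_rate, generations):
    """Expected fraction of sites that differ between two sequences whose
    common ancestor lived the given number of generations ago."""
    return 2.0 * mutation_rate * generations


low = expected_divergence(1e-8, 1e4)    # about 0.0002, i.e. 0.02 percent
high = expected_divergence(1e-8, 1e5)   # about 0.002, i.e. 0.2 percent
```

The resulting range, 0.02 to 0.2 percent, brackets the 0.1 percent difference implied by 99.9 percent shared sequence.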
ENCyclopedia Of DNA Elements (ENCODE)
Functional genomics
Comparative genomics
Sequence (DNA/RNA) & phylogeny
Regulation of gene expression; transcription factors & micro RNAs
Protein sequence analysis & evolution
Protein families, motifs and domains
Protein structure & function: computational crystallography
Protein interactions & complexes: modelling and prediction
Chemical biology
Pathway analysis
Systems modelling
Image analysis
Data integration & literature mining
The RNA samples of interest are prepared (control and sample)
Reverse transcription
A fluorescent label is added
Copies of the DNA of the genes of interest are prepared and printed on the chip
The samples are hybridized on the microarray
The chip is excited with two different lasers (Laser 1, Laser 2): the control responds to one of them and the sample to the other
Comparing the two images shows which genes are expressed differently
Schena et al. Science 1995
Microarray analysis
Clinical prediction of Leukemia type
• 2 types
– Acute lymphoid (ALL)
– Acute myeloid (AML)
• Different treatment & outcomes
• Predict type before treatment?
Golub et al. Science 286:531-537 (1999)
Biomarker discovery
Stages: data management, statistical analysis, annotation, network analysis, selection
Funnel: 30,000 genes → 1500 genes → 150 genes → 50 elements → 10 targets
RT-PCR Standard Processing Procedure (TaqMan Assays)
Step 1: Calculate Ct with SDS and export text file
Step 2: Retrieve data and define experiment design
Step 3: Biological replicates
Step 4: Selection of optimal endogenous controls & calculation of ΔCt
Step 5: Differential expression analysis (ΔΔCt)
Checkpoints: overview of plates & samples; quality control of raw values; discard samples; quality control; ΔCt overview
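The ΔCt and ΔΔCt calculations in steps 4 and 5 reduce to simple arithmetic on cycle-threshold values. The sketch below uses the common 2^-ΔΔCt formulation, which assumes perfect amplification efficiency; the function and argument names are illustrative and the Ct values in the usage note are made up.

```python
from statistics import mean


def delta_ct(target_ct, control_ct):
    """ΔCt: mean Ct of the target gene minus mean Ct of the endogenous control."""
    return mean(target_ct) - mean(control_ct)


def fold_change(treated_target, treated_control,
                reference_target, reference_control):
    """Relative expression of treated versus reference samples (2^-ΔΔCt)."""
    ddct = (delta_ct(treated_target, treated_control)
            - delta_ct(reference_target, reference_control))
    return 2.0 ** (-ddct)
```

With replicate Ct values of 22 and 23 for the target and 18 and 19 for the control in treated samples (ΔCt = 4), against 24 and 25 versus 18 and 19 in reference samples (ΔCt = 6), ΔΔCt is -2 and the target is expressed four times higher in the treated samples.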
Example of Array CGH Technology*
Chari et al, Cancer Informatics, 2006, 2, 48-58
Source: http://www.chiponchip.org/
ChIP-on-chip
DNA-binding proteins are crosslinked to DNA with formaldehyde in vivo.
Isolate the chromatin. Shear DNA along with bound proteins into small fragments.
Bind antibodies specific to the DNA-binding protein to isolate the complex by precipitation. Reverse the cross-linking to release the DNA and digest the proteins.
Use PCR (Polymerase Chain Reaction) to amplify specific DNA sequences to see if they were precipitated with the antibody.
ChIP (Chromatin ImmunoPrecipitation)
• Chromatin immunoprecipitation, or ChIP, refers to a procedure used to determine whether a given protein binds to a specific DNA sequence in vivo
Protein Microarray
G. MacBeath and S.L. Schreiber, 2000, Science 289:1760
arrayIT™
Spotting platform and protein microarray
Different Kinds of Protein Arrays*
Antibody Array Antigen Array Ligand Array
Detection by: SELDI MS, fluorescence, SPR, electrochemical, radioactivity, microcantilever
The Microarray Study Process
Preprocessing
Some Questions:
• Which genes have expression levels that are correlated
with some external variable?
• For a given pathway, which of the genes in our collection
are most likely to be involved?
• For a diffuse disease, which genes are associated with
different outcomes?
Challenges for Data Analysis
• Normalization (removing systematic measurement effects)
• Variable Selection (Identification of relevant Variables)
• Large sample Effects:
Type I and Type II errors (False positives / False negatives)
• Dimensionality Reduction
• Identification of new disease classes
• Classification of data into known disease classes
Data Analysis Methods
Dimension Reduction
• PCA (Principal Component Analysis)
• ICA (Independent Component Analysis)
• Multidimensional Scaling
Unsupervised Learning
• K-Means / K-Medoid
• Hierarchical Clustering Algorithms
Supervised Learning
• Linear Discriminant Analysis
• Maximum Likelihood Discrimination
• Nearest Neighbor Methods
• Decision Trees
• Random Forests
Matrix factorization
Popular Classification Methods
• Decision Trees/Rules – Find smallest gene sets, but not robust – poor performance
• Neural Nets - work well for reduced number of genes
• K-nearest neighbor – good results for small number of genes, but no model
• Naïve Bayes – simple, robust, but ignores gene interactions
• Support Vector Machines (SVM)– Good accuracy, does own gene selection,
but hard to understand
• Specialized methods, D/S/A (Dudoit), …
Support Vector Machine (SVM)
• Main idea: select the hyperplane that is most likely to generalize to future data
Best Practices
• Capture the complete process, from raw data to final results
• Gene (feature) selection inside cross-validation
• Randomization testing
• Robust classification algorithms
– Simple methods give good results
– Advanced methods can be better
• Wrapper approach for best gene subset selection
• Use bagging to improve accuracy
• Remove/relabel mislabeled or poorly differentiated samples
Alistair Chalk, 2008
Enrichment Analysis
• What are major enriched GO terms?
• What are the highly active pathways?
• What are the frequently interacting proteins?
• What are the known disease associations?
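Questions such as "which GO terms are enriched?" are commonly answered with a hypergeometric (one-sided Fisher) test. The slides do not name a specific test, so the sketch below is one standard option shown under that assumption; the function and parameter names are illustrative.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, annotated_genes, selected_genes, overlap):
    """P(X >= overlap): probability of drawing at least `overlap` annotated
    genes when `selected_genes` are sampled without replacement from
    `total_genes`, of which `annotated_genes` carry the term."""
    upper = min(annotated_genes, selected_genes)
    favourable = sum(
        comb(annotated_genes, k)
        * comb(total_genes - annotated_genes, selected_genes - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(total_genes, selected_genes)
```

For a toy universe of 10 genes, 5 of them annotated with a term, picking 2 genes that both carry the term has probability 10/45, about 0.22. With thousands of terms tested at once, the resulting p-values need a multiple-testing correction before they are interpreted.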
Meta-analysis example: “Creation and
implications of a phenome-genome network”
Butte and Kohane. Nat Biotech. 2006
• Clustered experiments based on mapping concepts found in sample annotations to UMLS meta-thesaurus.
• Relationships found between phenotype (e.g., aging), disease (e.g., leukemia), environmental (e.g., injury) and experimental (e.g., muscle cells) factors and genes with differential expression.
• “the ease and accuracy of automating inferences across data are crucially dependent on the accuracy and consistency of the human annotation process, which will only happen when every investigator has a better prospective understanding of the long-term value of the time invested in improving annotations.”
Systems biology
PPI ANNOTATION AND DATABASES
Database   Reference                URL
HPID       Han et al., 2004         http://www.hpid.org
IntAct     Hermjakob et al., 2004   http://www.ebi.ac.uk/intact
HPRD       Peri et al., 2004        http://www.hprd.org/
DIP        Xenarios et al., 2002    http://dip.doe-mbi.ucla.edu/
MINT       Zanzoni et al., 2002     http://mint.bio.uniroma2.it/mint
IMEx agreement to share curation efforts
Proteomics Standards Initiative (PSI) recommendation
Molecular Interaction (MI) Ontology
Large scale experiments
Literature curation
Complex networks
• Many systems can be represented as networks (graphs)
– Nodes: individual components (proteins)
– Edges: relationships (interactions)
• They share common properties
– Scale-free
– Hierarchical
– Clustering
• Some properties may be intrinsic and can be better understood when put into the context of evolution
Detecting Hierarchical Organization
Summary: Network Measures
• Degree ki
The number of edges involving node i
• Degree distribution P(k)
The probability (frequency) of nodes of degree k
• Mean path length
The avg. shortest path between all node pairs
• Network Diameter
– i.e. the longest shortest path
• Clustering Coefficient
– A high CC is found for modules
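The measures summarized above can be computed directly on a small undirected graph stored as a dictionary of neighbour sets. This is a minimal sketch for connected graphs; the function names and the toy network in the usage note are illustrative.

```python
from collections import deque


def degrees(graph):
    """Degree k_i of every node: the number of edges that involve it."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = list(graph[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i, a in enumerate(neighbours)
        for b in neighbours[i + 1:]
        if b in graph[a]
    )
    return 2.0 * links / (k * (k - 1))


def shortest_path_lengths(graph, source):
    """Breadth-first search distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def mean_path_length_and_diameter(graph):
    """Average and longest shortest path over all ordered node pairs."""
    lengths = [
        d
        for source in graph
        for target, d in shortest_path_lengths(graph, source).items()
        if target != source
    ]
    return sum(lengths) / len(lengths), max(lengths)
```

On a toy network with edges A-B, A-C, B-C and C-D, node C has degree 3 and clustering coefficient 1/3, node A sits in a fully connected triangle (coefficient 1.0), the mean path length is 4/3 and the diameter is 2.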
Mapping the phenotypic data to the network
Begley TJ, Rosenbach AS, Ideker T,
Samson LD. Damage recovery pathways
in Saccharomyces cerevisiae revealed by
genomic phenotyping and interactome
mapping. Mol Cancer Res. 2002
Dec;1(2):103-12.
•Systematic phenotyping
of 1615 gene knockout
strains in yeast
•Evaluation of growth of
each strain in the presence
of MMS (and other DNA
damaging agents)
•Screening against a
network of 12,232 protein
interactions
The Role of Proteomics
• The existence of an ORF does not imply the
existence of a functional gene.
• Limitations of comparative genomics.
• mRNA levels may not correlate with protein levels.
• Protein modifications: post-transcriptional modifications, isoforms, post-translational modifications, mutants.
• Issues of proteolysis, sequestration, etc. relevant only
at the protein level.
• Protein complex composition, protein-protein
interactions, structures.
Structural proteomics
• Folding
• Structure and function
• Protein structure prediction
• Secondary structure
• Tertiary structure
• Function
• Post-translational modification
• Prot.-Prot. Interaction -- Docking algorithm
• Molecular dynamics/Monte Carlo
What kinds of methods are around?
5 main levels of protein structure prediction:
1. Extensive Sequence Search
2. Threading and 1D-3D profiles
3. Ab initio prediction of protein structure
4. Comparative Modelling
5. Docking (domain interaction prediction)
Prediction of Protein Structures
• Examples: a few good examples
[Figure: pairs of actual and predicted structures.]
MODPIPE: Large-Scale Comparative Protein Structure Modeling
START
For each target sequence:
  Get profile for sequence (NR)
  Scan sequence profile against representative PDB chains
  Scan PDB chain profiles against sequence
  Select templates using permissive E-value cutoff
  For each template structure:
    Expand match to cover complete domains
    Align matched parts of sequence and structure
    Build model for target segment by satisfaction of spatial restraints
    Evaluate model
END
Tools shown: PSI-BLAST (profile and scanning steps), MODELLER (alignment and model building)
R. Sánchez & A. Šali, Proc. Natl. Acad. Sci. USA 95, 13597, 1998.
N. Eswar, M. Marti-Renom, M.S. Madhusudhan, B. John, A. Fiser, R. Sánchez, F. Melo, N. Mirkovic, A. Šali.
3/25/03
Structural Proteomics:
The Motivation*
[Figure: number of known sequences (axis up to 2,000,000) versus number of known structures (axis up to 200,000), 1980 to 2005.]
The hierarchies of protein structure
Docking Programs
• Dock (UCSF)
• Autodock (Scripps)
• Glide (Schrödinger)
• ICM (Molsoft)
• FRED (OpenEye)
• Gold, FlexX, etc.
Cell cycle network from KEGG
Graphical Notation: a necessity for the conceptual representation
of biopathways
Thiery & Sleeman, Nat. Rev. Mol.
Cell. Biol 7:131 (2006)
Qualitative Mechanistic
various degree of
detail, mixed level
of presentation
Aladjem et al., Science STKE pe8
(2004)
Strategies: simulate or analyse? (or rather what to do first)
Simulate: convert diagram into a quantitative model → simulate model behavior numerically → obtain qualitative understanding through numerical results and model reduction
Analyse: qualitatively analyze network topology, stability, etc. → identify "elementary modes" → build and simulate a reduced model
Space of modeling methods
[Figure: axis from continuous to discrete; labelled methods include stochsim and Boolean networks.]
Continuum of modeling approaches
Top-down Bottom-up
Frazier et al. (2003) Science 11 April Vol 300:290-293
Data integration
Nucleic Acids Research article lists
1078 public databases
Nucleic Acids Research, 2008, Vol. 36, Database issue
http://nar.oxfordjournals.org/cgi/reprint/36/suppl_1/D2
Growth in Available Bioinformatics Databases
Too much unintegrated data
• Data sources incompatible
• No (or few) standard naming convention
• No common interface (varying tools for browsing,
querying and visualizing data)
– Small, isolated, independent, groups/individuals
– Loosely coupled provider-consumer of resources.
– Commonly resource consumers
– Boutique suppliers.
– Poor access systems admins
– Large experiments or large research groups/labs, possibly distributed
– Large service provider institutes.
– Tightly coupled provider-consumer of resources.
– Commonly resource providers.
– Some or lots of access to sys admin
Challenges: Names and Identity
Q93038 = Tumor necrosis factor receptor superfamily member 25 precursor
Accession numbers: Q92983, O00275, O00276, O00277, O00278, O00279, O00280, O14865, O14866, P78507, P78515, Q93036, Q93037, Q99722, Q99830, Q99831, Q9BY86, Q9UME0, Q9UME1, Q9UME5
Names:
• WSL-1 protein
• Apoptosis-mediating receptor DR3
• Apoptosis-mediating receptor TRAMP
• Death domain receptor 3
• WSL protein
• Apoptosis-inducing receptor AIR
• Apo-3
• Lymphocyte-associated receptor of death
• LARD
• GENE: Name=TNFRSF25
Annotation history: http://www.expasy.org/uniprot/Q93038
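A common way to tame this kind of naming problem is to resolve every alias to one primary identifier before integrating data. The sketch below is a hypothetical in-memory resolver, not a real UniProt client: it uses a few of the accessions and names from this slide and assumes they all refer to Q93038.

```python
PRIMARY = "Q93038"
# A subset of the accessions and names listed on the slide (assumed aliases).
ALIASES = [
    "Q92983", "O00275", "P78515", "Q9UME5",
    "WSL-1 protein", "Death domain receptor 3", "Apo-3", "LARD", "TNFRSF25",
]


def build_resolver(primary, aliases):
    """Map the primary identifier and each alias (case-insensitive) to it."""
    return {name.strip().lower(): primary for name in [primary, *aliases]}


def resolve(resolver, identifier):
    """Return the primary identifier, or None if the name is unknown."""
    return resolver.get(identifier.strip().lower())
```

With `resolver = build_resolver(PRIMARY, ALIASES)`, both `resolve(resolver, "o00275")` and `resolve(resolver, "LARD")` return `"Q93038"`, while an unknown name returns `None` rather than a guess. Real alias tables also have to cope with names that map to more than one entry, which a plain dictionary cannot represent.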
GUIDs
Life Science
Identifier?
Normalisation
Why support standards?
• Unambiguous representation, description and communication
– Final results and metadata
• Interoperability
– Data management and analysis
• Integration of OMICS in systems biology
What to standardize?
• CONTENT: Minimal/Core Information to be reported
• MIBBI (http://www.mibbi.org)
• SEMANTIC: Terminology Used -> Ontologies
• OBI (http://obi-ontology.org)
• SYNTAX: Data Model, Data Exchange
• FuGE (http://fuge.sourceforge.net/)
MIBBI: Standard Content
Promoting Coherent Minimum Reporting Requirements for Biological and Biomedical Investigations: The MIBBI Project, Taylor et al., Nature Biotech.
Link Integration: Integration Lite
[Figure labels: user interface, application, application interface, ontology authority, identity authority.]
Warehouse
[Figure labels: user interface, application, unified model, data access and query, wrappers.]
• Copy the data sets, clean and massage data into shape
• Combine them into a (different) pre-determined model before query
• ATLAS, MRS, e-Fungi, GIMS, Medicel Integrator, MIPS, BioMART
• Often called “Knowledge bases”
View integration
• Data at Source; Virtual integrating database view
• Global as View / Local as View mappings between models
• Map from model to databases dynamically so always fresh
• TAMBIS, Information Integrator, K4, ComparaGrid, UTOPIA, caCORE
[Figure labels: wrappers, application, user interface, unified model, data access and query.]
Specialist Integrating Application
E.g. Ensembl, UTOPIA
• Very popular. Known to be one application.
[Figure labels: application, user interface, wrappers.]
Workflows
• Data flow protocol. Automated data chaining.
• General technique for describing and enacting a process
• Describes what you want to do, not how you want to do it
• Various degrees of data type compliance anticipated
[Figure labels: application, user interface, wrapper, workflow engine.]
Mash-Up Data Marshalling
• Content syndication and feeds
• Emphasis on User creating specific integration by mapping.
• Just in time, just enough design
• On demand integration
[Figure labels: mash-up application, user interface, protocol objects, protocols.]
Composite applications
Semantic Web help?
• Slight problem: we have no first class metadata migration and
management infrastructure, where metadata is outside the application and
in the middleware, and we can handle progressive curation
[Figure labels: wrappers, application, user interface, access and query.]
Semantic Enrichment
Model flattening
Mapping Transparency
dataflow workflow
ws ws ws ws ws
curation
submission
Advanced Search
Retrieve data
Submit data
Service Oriented Architecture
Distributed Annotation System
An Integrative Analysis Example
Relational data mining
Text mining
Spectrum data mining
Chemical sequence data model
Visualizing relational data clusters
Visualizing multidimensional data
Visualizing sequence data
Visualizing pathway data
Text mining visualization
Visualizing cluster statistics
Visualizing serial/spectrum data
Decision tree model of metabonomic profile
Chemical structure visualization
From experiments to scientific publications
1- Experiments: planning and carrying out experiments (lab work)
2- Results: processing and interpretation of obtained results
3- Scientific peer-reviewed articles: 'relevant' results are published in scientific journals
PubMed/Medline database at NCBI
- Developed at the National Center for Biotechnology Information (NCBI).
- The core 'Textome'.
- Repository of citation entries of scientific articles.
- PubMed titles and abstracts are the primary data source for Bio-NLP.
- ~450,000 new abstracts per year
- > 4,800 biomedical journals
- ENTREZ search engine
Scientific journals
Journal-specific information:
• Format
• Paper structure (sections)
• Article type
Data in scientific articles
Free Text
Title
Abstracts
Keywords
Text body
References
Tables Figures
Biomedical literature characteristics
- Heavy use of domain-specific terminology (12% biochemistry-related technical terms).
- Polysemic words (word sense disambiguation).
- Most words have low frequency (data sparseness).
- New names and terms are created.
- Typographical variants.
- Different writing styles (native languages).
BioCreative
BioCreative results
1: Chiang et al.
2: Couto et al.
3: Ehrler et al.
4: Ray et al.
5: Rice et al.
6: Verspoor et al.
TP: prediction evaluated as correct for both the protein and the GO term
Precision: TP / total number of evaluated submissions
Data Integration
• Standards, DBs
Knowledge Discovery
• Algorithms, Informatics, Machine Learning
Integrate knowledge
• Text mining, Ontologies
Modelling
• Pathways, Circuits, Abstraction
Infrastructure
Support Research
The challenges of biology in the next 50 years
• A list of all the molecular components that make up an organism:
– Genes, proteins and other functional elements
• Understand the function of each component
• Understand how they interact
• Study how function has evolved
• Find the genetic defects that cause disease
• Design drugs and therapies rationally
• Sequence each individual's genome and use it in personalized medicine
• Bioinformatics is an essential component in achieving all of these goals