Bioinformatics A Biased Overview

Post on 06-May-2015


Lecture given to graduate microbiology students on examples of work in bioinformatics.


A {Biased} Overview of Bioinformatics

with Examples Drawn from Our Own Work

Philip E. Bourne Professor of Pharmacology UCSD

pbourne@ucsd.edu

Bioinformatics - Overview

There Are Multiple Types of Informatics in the Life Sciences

• Pharmacy Informatics: drug dosing, pharmacokinetics, pharmacy information systems
• Biomedical Informatics: EHR, decision support systems, hospital information systems
• Bioinformatics: algorithms, genomics, proteomics, biological networks, systems biology

Note: These are only representative examples

There Are Multiple Types of Informatics in the Life Sciences

• Pharmacy Informatics, Biomedical Informatics and Bioinformatics: controlled vocabularies, ontologies, literature searching, data management, pharmacogenomics, personalized medicine

Note: These are only representative examples

Bioinformatics In One Slide

Biological Experiment → Data → Information → Knowledge → Discovery

Collect → Characterize → Compare → Model → Infer

[Figure: biological complexity (sequence, structure, assembly, sub-cellular, cellular, organ, higher-life) plotted against technology (computing power, sequencing, data from 1 to 10^5, # people/web site from 10^6 down to 1) over the years 1990 to 2005. Landmarks shown: ESTs, E. coli genome, yeast genome, C. elegans genome, Human Genome Project, human genome, gene chips, 1 small genome/month, 1000's GWAS, virus structure, ribosome, model metabolic pathway of E. coli, genetic circuits, neuronal modeling, cardiac modeling, brain mapping, virtual communities, blogs, Facebook.]

The Omics Revolution

Bioinformatics – One Definition

• The integration of biological data in digital form from different sources and possibly different scales (complexity), usually collected by others, and subsequently analyzed to offer new biological insights


Biological Scales (Complexity)


Genomics

Proteomics

Protein-protein interactions

Biological Networks

Systems Biology

We will look at an example of how bioinformatics is used at each scale

Some Thoughts on Genomic Data

• It's scary
• It's time to consider cost vs benefit
• Reductionism is not a dirty word
• We need to do more with the long tail

On the Future of Genomic Data. Science 11 February 2011: vol. 331 no. 6018 728-729

Bioinformatics & Metagenomics

• New type of genomics
• New data (and lots of it) and new types of data
– 17M new (predicted proteins!) 4-5x growth in just a few months and much more coming
– New challenges and exacerbation of old challenges

Bioinformatics at Different Scales - Genomics

Metagenomics: Early Results

• More than 99.5% of DNA in every environment studied represents unknown organisms
• Most genes represent distant homologs of known genes, but there are thousands of new families
• Environments being studied:
– Water (ocean, lakes)
– Air
– Soil
– Human body (gut, oral cavity, human microbiome)

Metagenomics: New Discoveries

Environmental (red) vs. Currently Known PTPases (blue)

[Figure: PTPase groups numbered 1 to 4, with one region labeled "Higher eukaryotes"]

Proteomics


It's Not Just About Numbers, It's About Complexity

[Figure: number of released entries by year. Courtesy of the RCSB Protein Data Bank]

Bioinformatics at Different Scales - Proteomics

Determining 3D Structures – The Impact of Bioinformatics

Basic Steps

• Target Selection
• Crystallomics: isolation, expression, purification, crystallization
• Data Collection
• Structure Solution
• Structure Refinement
• Functional Annotation
• Publish

Structural biology moves from being functionally driven to genomically driven

Annotations on the steps: fill in protein fold space; robotics; -ve data; software engineering; functional prediction; not necessarily

Nature’s Reductionism

There are ~20^300 possible proteins >>>> all the atoms in the Universe

~20M protein sequences from UniProt/TrEMBL

~75,000 protein structures

Yield ~1500 folds, ~2000 superfamilies, ~4000 families (SCOP 1.75)
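The scale of that sequence space can be checked with a few lines of arithmetic. This is only a sketch: it assumes the slide's figure means 20 raised to the power 300 (20 amino acids, a chain of 300 residues), and it uses the commonly quoted rough estimate of 10^80 atoms in the observable universe, which is not a number from the slides.

```python
import math

AMINO_ACIDS = 20     # standard amino acid alphabet
CHAIN_LENGTH = 300   # assumption: the exponent in the slide's 20^300
ATOMS_LOG10 = 80     # assumption: ~10^80 atoms, a commonly quoted rough estimate

# Work in log10 so the numbers stay readable.
sequences_log10 = CHAIN_LENGTH * math.log10(AMINO_ACIDS)
excess_log10 = sequences_log10 - ATOMS_LOG10

print(f"possible sequences ~ 10^{sequences_log10:.0f}")
print(f"ratio to atoms     ~ 10^{excess_log10:.0f}")
```

Against that background, the roughly 1500 folds observed so far are a very small slice of what is combinatorially possible.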


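Structure Provides an Evolutionary Fingerprint

Distribution among the three kingdoms as taken from SUPERFAMILY

• Superfamily distributions would seem to be related to the complexity of life

[Figure: Venn diagram of SCOP folds (765 total) in Eukaryota (650), Archaea (416) and Bacteria (564)]

Eukaryota only           135
Archaea only               2
Bacteria only             42
Eukaryota and Archaea     10
Eukaryota and Bacteria   118
Archaea and Bacteria      17
All three                387

[Figure: second Venn diagram, counts given as any genome / all genomes: 153/14, 9/1, 21/2, 310/0, 645/49, 29/0, 68/0]

Using Protein Structure to Study Evolution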

Method – Distance Determination

SCOP superfamilies (FSF) assigned to organisms by SUPERFAMILY

Presence/Absence Data Matrix

FSF       C. intestinalis  C. briggsae  F. rubripes
a.1.1     1                1            1
a.1.2     1                1            1
a.10.1    0                0            1
a.100.1   1                1            1
a.101.1   0                0            0
a.102.1   0                1            1
a.102.2   1                1            1

Distance Matrix

                 C. intestinalis  C. briggsae  F. rubripes
C. intestinalis  0                101          109
C. briggsae                       0            144
F. rubripes                                    0
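The step from a presence/absence matrix to a distance matrix can be sketched as follows. The slide does not state which distance measure was used, so this example assumes a simple mismatch count (the number of superfamilies present in exactly one of two organisms). It uses only the seven rows shown on the slide, so its output will not reproduce the slide's distances, which come from the full set of superfamilies.

```python
from itertools import combinations

# Superfamilies in the order shown on the slide.
fsfs = ["a.1.1", "a.1.2", "a.10.1", "a.100.1", "a.101.1", "a.102.1", "a.102.2"]

# Presence (1) / absence (0) per organism, copied from the slide excerpt.
presence = {
    "C. intestinalis": [1, 1, 0, 1, 0, 0, 1],
    "C. briggsae":     [1, 1, 0, 1, 0, 1, 1],
    "F. rubripes":     [1, 1, 1, 1, 0, 1, 1],
}

def mismatch_distance(a, b):
    """Assumed metric: count positions where exactly one organism has the FSF."""
    return sum(x != y for x, y in zip(a, b))

# Upper triangle of the distance matrix, one entry per organism pair.
distances = {
    (x, y): mismatch_distance(presence[x], presence[y])
    for x, y in combinations(presence, 2)
}

for (x, y), d in distances.items():
    print(f"{x} vs {y}: {d}")
```

A matrix built this way can then be handed to any distance-based tree-building method.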


If Structure is so Conserved, is it a Useful Tool in the Study of Evolution?

The Answer Would Appear to be Yes

• It is possible to generate a reasonable tree of life from merely the presence or absence of superfamilies (FSFs) within a given proteome


Yang, Doolittle & Bourne (2005) PNAS 102(2) 373-8


The Influence of Environment on Life

Chris Dupont, Scripps Institution of Oceanography, UCSD

Dupont, Yang, Palenik, Bourne. 2006 PNAS 103(47) 17822-17827


Consider the Distribution of Disulfide Bonds among Folds

• Disulfides are only stable under oxidizing conditions
• Oxygen content gradually accumulated during the earth’s evolution
• The divergence of the three kingdoms occurred 1.8-2.2 billion years ago
• Oxygen began to accumulate ~2.0 billion years ago
• Logical deduction – disulfides more prevalent in folds (organisms) that evolved later
• This would seem to hold true
• Can we take this further?

[Figure: Venn diagram of disulfide-containing folds in Eukaryota, Archaea and Bacteria; SCOP fold (708 total)]

Eukaryota only           31.9% (43/135)
Archaea only             0% (0/2)
Bacteria only            16.7% (7/42)
Eukaryota and Archaea    0% (0/10)
Eukaryota and Bacteria   14.4% (17/118)
Archaea and Bacteria     5.9% (1/17)
All three                4.7% (18/387)

Evolution of the Earth

• 4.5 billion years of change
• 300 ± 50 K
• 1-5 atmospheres
• Constant photoenergy
• Chemical and geological changes
• Life has evolved in this time
• The ocean was the “cradle” for 90% of evolution

Theoretical Levels of Trace Metals and Oxygen in the Deep Ocean Through Earth’s History

[Figure: concentration (O2 in arbitrary units, Zn and Fe in moles L-1) against billions of years before present (4.5 to 0), with curves for oxygen, zinc, iron, cobalt and manganese, and phylogenetic tree symbols for Bacteria, Archaea and Eukarya]

• Whether the deep ocean became oxic or euxinic following the rise in atmospheric oxygen (~2.3 Gya) is debated, therefore both are shown (oxic ocean: solid lines, euxinic ocean: dashed lines).

• The phylogenetic tree symbols at the top of the figure show one idea as to the theoretical periods of diversification for each Superkingdom.

Replotted from Saito et al, 2003 Inorganica Chimica Acta 356: 308-318

The Gaia Hypothesis

Gaia - a complex entity involving the Earth's biosphere, atmosphere, oceans, and soil; the totality constituting a feedback system which seeks an optimal physical and chemical environment for life on this planet.

James Lovelock

Gaia (pronounced /'geɪ.ə/ or /'gaɪ.ə/), "land" or "earth", from the Greek Γαῖα, is a Greek goddess personifying the Earth


The Question

• Have the emergent properties of an organism as judged by its protein content been influenced by the environment?

• We will do this by consideration of the metallomes of a broad range of species

• The metallomes can only be deduced by consideration of the protein structures to which the metal is covalently bound

• We will hypothesize that these emergent properties in turn influenced the environment


Superfamily Distribution As Well As Overall Content Has Changed

Bacteria Fe superfamilies

a.1.1 a.1.2 a.104.1 a.110.1 a.119.1 a.138.1 a.2.11 a.24.3 a.24.4 a.25.1
a.3.1 a.39.3 a.56.1 a.93.1 b.1.13 b.2.6 b.3.6 b.33.1 b.70.2 b.82.2
c.56.6 c.83.1 c.96.1 d.134.1 d.15.4 d.174.1 d.178.1 d.35.1 d.44.1 d.58.1
e.18.1 e.19.1 e.26.1 e.5.1 f.21.1 f.21.2 f.24.1 f.26.1 g.35.1 g.36.1
g.41.5

Eukaryotic Fe superfamilies

a.1.1 a.1.2 a.104.1 a.110.1 a.119.1 a.138.1 a.2.11 a.24.3 a.24.4 a.25.1
a.3.1 a.39.3 a.56.1 a.93.1 b.1.13 b.2.6 b.3.6 b.33.1 b.70.2 b.82.2
c.56.6 c.83.1 c.96.1 d.134.1 d.15.4 d.174.1 d.178.1 d.35.1 d.44.1 d.58.1
e.18.1 e.19.1 e.26.1 e.5.1 f.21.1 f.21.2 f.24.1 f.26.1 g.35.1 g.36.1
g.41.5

Metal Binding Proteins are Not Consistent Across Superkingdoms

[Figure, two panels (A, B): total Zn-binding domains in a proteome against total domains in a proteome on logarithmic axes; slope of fitted power law (0 to 2) for Zn, Fe, Mn and Co in Archaea, Bacteria and Eukarya]

Since these data are derived from current species they are independent of evolutionary events such as duplication, gene loss, horizontal transfer and endosymbiosis

Power Laws: Fundamental Constants in the Evolution of Proteomes

A slope of 1 indicates that a group of structural domains is in equilibrium with genome growth, while a slope > 1 indicates that the group of domains is being preferentially duplicated (or retained in the case of genome reductions).

van Nimwegen E (2006) in: Koonin EV, Wolf YI, Karev GP, (Ed.). Power laws, scale-free networks, and genome biology
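The slope in question is the exponent of a power law, which becomes a straight-line slope once both axes are on a logarithmic scale. The sketch below fits that slope by ordinary least squares on log10 values. The proteome sizes and domain counts are made up for illustration, and the fitting procedure is an assumption rather than the method used in the cited work.

```python
import math

def power_law_slope(total_domains, group_domains):
    """Least-squares slope of log10(group count) against log10(total count)."""
    xs = [math.log10(t) for t in total_domains]
    ys = [math.log10(g) for g in group_domains]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Made-up proteome sizes (total domains), for illustration only.
totals = [1_000, 3_000, 10_000, 30_000, 100_000]

# A group that keeps pace with genome growth: count proportional to total.
in_step = [0.02 * t for t in totals]
# A group that is preferentially duplicated: count grows as total^1.5.
favoured = [0.001 * t ** 1.5 for t in totals]

slope_in_step = power_law_slope(totals, in_step)
slope_favoured = power_law_slope(totals, favoured)
print(round(slope_in_step, 2), round(slope_favoured, 2))
```

With real data the points scatter around the line, so the fitted slope carries an uncertainty that should be reported alongside it.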


Why are the Power Laws Different for Each Superkingdom?

• Power laws are likely influenced by selective pressure. Qualitatively, the differences in the power law slopes describing Eukarya and Prokarya are correlated to the shifts in trace metal geochemistry that occur with the rise in oceanic oxygen

• We hypothesize that proteomes contain an imprint of the environment at the time of the last common ancestor in each Superkingdom

• This suggests that Eukarya evolved in an oxic environment, whereas the Prokarya evolved in anoxic environments


Do the Metallomes Contain Further Support for this Hypothesis?

Superkingdom  Fold family                  % Fe-binding  Fe bound by  O2
Eukarya       Cytochrome P450              0.44 ± 0.48   heme         yes
Eukarya       Cytochrome c3-like           0.13 ± 0.3    heme         no
Eukarya       Cytochrome b5                0.12 ± 0.09   heme         no
Eukarya       Purple acid phosphatase      0.11 ± 0.08   amino        no
Eukarya       Penicillin synthase-like     0.07 ± 0.1    amino        yes
Eukarya       Hypoxia-inducible factor     0.07 ± 0.04   amino        yes
Eukarya       Di-heme elbow motif          0.06 ± 0.01   heme         no
Archaea       4Fe-4S ferredoxins           1.80 ± 0.7    Fe-S         no
Archaea       MoCo biosynthesis proteins   1.60 ± 0.3    Fe-S         no
Archaea       Heme-binding PAS domain      1.10 ± 1.0    heme         no
Archaea       HemN                         0.80 ± 0.20   Fe-S         note 1
Archaea       α-helical ferredoxin         0.60 ± 0.16   Fe-S         no
Archaea       Biotin synthase              0.55 ± 0.1    Fe-S         no
Archaea       ROO N-terminal domain-like   0.5 ± 0.1     amino        note 2
Bacteria      High potential iron protein  0.38 ± 0.25   Fe-S         no
Bacteria      Heme-binding PAS domain      0.3 ± 0.4     heme         note 1
Bacteria      MoCo biosynthesis proteins   0.21 ± 0.15   Fe-S         no
Bacteria      HemN                         0.2 ± 0.15    Fe-S         no
Bacteria      4Fe-4S ferredoxins           0.2 ± 0.2     Fe-S         no
Bacteria      Cytochrome c                 0.14 ± 0.2    heme         no
Bacteria      α-helical ferredoxin         0.12 ± 0.09   Fe-S         no

Overall percent of Fe bound by

Superkingdom  Fe-S     heme     amino
Eukarya       21 ± 9   47 ± 19  32 ± 12
Archaea       68 ± 12  13 ± 14  19 ± 6
Bacteria      47 ± 11  22 ± 12  31 ± 16

1. Some, but not all, PAS domains actually sense oxygen
2. The Rubredoxin oxygen:oxidoreductase (ROO) protein does not contact oxygen, but catalyzes an oxygen reduction pathway

e- Transfer Proteins

Same Broad Function, Same Metal, Different Chemistry Induced by the Environment?

Fe-S clusters
• Fe bound by S
• Cluster held in place by Cys
• Generally negative reduction potentials
• Very susceptible to oxidation

Cytochromes
• Fe bound by heme (and amino-acids)
• Generally positive reduction potentials
• Less susceptible to oxidation

Hypothesis

• Emergence of cyanobacteria changed oxygen concentrations

• Impacted relative metal ion concentrations in the ocean

• Organisms evolved to use these metals in new ways to evolve new biological processes, e.g. complex signaling

• This in turn further impacted the environment

• Only protein structures could reveal such dependencies


Bioinformatics in the Context of Drug Discovery


Our Motivation

• Tykerb – Breast cancer
• Gleevec – Leukemia, GI cancers
• Nexavar – Kidney and liver cancer
• Staurosporine – natural product – alkaloid – many uses, e.g., antifungal, antihypertensive

Collins and Workman 2006 Nature Chemical Biology 2 689-700

Motivators

A Reverse Engineering Approach to Drug Discovery Across Gene Families

1. Characterize ligand binding site of primary target (Geometric Potential)
2. Identify off-targets by ligand binding site similarity (Sequence order independent profile-profile alignment)
3. Extract known drugs or inhibitors of the primary and/or off-targets
4. Search for similar small molecules
5. Dock molecules to both primary and off-targets
6. Statistical analysis of docking score correlations

Computational Methodology

Xie and Bourne 2009 Bioinformatics 25(12) 305-312
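The final step, comparing docking scores across targets, can be illustrated with a plain Pearson correlation. This is a sketch under stated assumptions: the scores are invented, lower is taken to mean better binding, and the slides do not say which statistic the published method uses.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented docking scores for the same five molecules on two proteins.
primary_scores = [-9.1, -8.4, -7.9, -6.5, -5.2]     # primary target
off_target_scores = [-8.7, -8.0, -7.1, -6.9, -4.8]  # candidate off-target

r = pearson(primary_scores, off_target_scores)
print(f"r = {r:.2f}")
```

A high correlation would suggest that molecules binding the primary target well also tend to bind the candidate off-target, which is the kind of signal this step looks for.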

The Problem with Tuberculosis

• One third of global population infected
• 1.7 million deaths per year
• 95% of deaths in developing countries
• Anti-TB drugs hardly changed in 40 years
• MDR-TB and XDR-TB pose a threat to human health worldwide
• Development of novel, effective and inexpensive drugs is an urgent priority

Repositioning - The TB Story

The TB-Drugome

1. Determine the TB structural proteome

2. Determine all known drug binding sites from the PDB

3. Determine which of the sites found in 2 exist in 1

4. Call the result the TB-drugome

A Multi-target/drug Strategy Kinnings et al 2010 PLoS Comp Biol 6(11): e1000976
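The four steps amount to matching one set of binding sites against another. The sketch below shows only that bookkeeping. The protein names, drug names and pocket labels are invented, and `sites_match` is a hypothetical stand-in for a real binding-site similarity test, which in practice compares three-dimensional structure rather than labels.

```python
def build_drugome(tb_proteins, drug_sites, sites_match):
    """Map each drug to the TB proteins whose pockets resemble its binding site."""
    drugome = {}
    for drug, site in drug_sites.items():
        hits = sorted(
            protein
            for protein, pocket in tb_proteins.items()
            if sites_match(site, pocket)
        )
        if hits:  # keep only drugs with at least one similar TB pocket
            drugome[drug] = hits
    return drugome

# Step 1: the TB structural proteome (toy entries, placeholder pocket labels).
tb_proteins = {"protein_A": "pocket_x", "protein_B": "pocket_y", "protein_C": "pocket_x"}

# Step 2: known drug binding sites from the PDB (toy entries).
drug_sites = {"drug_1": "pocket_x", "drug_2": "pocket_z"}

# Step 3: hypothetical similarity test; a real one would compare structures.
def sites_match(site, pocket):
    return site == pocket

# Step 4: the resulting drug-to-protein map is the drugome.
tb_drugome = build_drugome(tb_proteins, drug_sites, sites_match)
print(tb_drugome)
```

Passing the similarity test in as a function keeps the bookkeeping separate from the much harder problem of deciding when two pockets are alike.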

1. Determine the TB Structural Proteome

[Figure: TB proteome of 3,996 proteins: 284 solved structures, 1,446 homology models, 2,266 with neither]

• High quality homology models from ModBase (http://modbase.compbio.ucsf.edu) increase structural coverage from 7.1% to 43.3%
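The coverage figures follow directly from the counts on the slide. The check below reads those counts as 284 solved structures and 1,446 homology models in a proteome of 3,996 proteins, which is an interpretation of the figure labels.

```python
solved = 284              # solved structures
homology_models = 1_446   # homology models
proteome = 3_996          # proteins in the TB proteome

# Structural coverage before and after adding homology models, in percent.
before = 100 * solved / proteome
after = 100 * (solved + homology_models) / proteome
uncovered = proteome - solved - homology_models

print(f"{before:.1f}% -> {after:.1f}%, {uncovered} proteins with neither")
```

The remainder, 2,266 proteins, matches the fourth number on the slide.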


2. Determine all Known Drug Binding Sites in the PDB

• Searched the PDB for protein crystal structures bound with FDA-approved drugs

• 268 drugs bound in a total of 931 binding sites

[Figure: no. of drug binding sites per drug; labeled drugs include methotrexate, chenodiol, alitretinoin, conjugated estrogens, darunavir and acarbose]

Map 2 onto 1 – The TB-Drugome

http://funsite.sdsc.edu/drugome/TB/

Similarities between the binding sites of M.tb proteins (blue), and binding sites containing approved drugs (red).

From a Drug Repositioning Perspective

• Similarities between drug binding sites and TB proteins are found for 61/268 drugs

• 41 of these drugs could potentially inhibit more than one TB protein

[Figure: no. of potential TB targets per drug; labeled drugs include raloxifene, alitretinoin, conjugated estrogens, methotrexate, ritonavir, testosterone, levothyroxine and chenodiol]

Top 5 Most Highly Connected Drugs

Drug: levothyroxine
Intended targets: transthyretin, thyroid hormone receptor α & β-1, thyroxine-binding globulin, mu-crystallin homolog, serum albumin
Indications: hypothyroidism, goiter, chronic lymphocytic thyroiditis, myxedema coma, stupor
No. of connections: 14
TB proteins: adenylyl cyclase, argR, bioD, CRP/FNR trans. reg., ethR, glbN, glbO, kasB, lrpA, nusA, prrA, secA1, thyX, trans. reg. protein

Drug: alitretinoin
Intended targets: retinoic acid receptor RXR-α, β & γ, retinoic acid receptor α, β & γ-1&2, cellular retinoic acid-binding protein 1&2
Indications: cutaneous lesions in patients with Kaposi's sarcoma
No. of connections: 13
TB proteins: adenylyl cyclase, aroG, bioD, bpoC, CRP/FNR trans. reg., cyp125, embR, glbN, inhA, lppX, nusA, pknE, purN

Drug: conjugated estrogens
Intended targets: estrogen receptor
Indications: menopausal vasomotor symptoms, osteoporosis, hypoestrogenism, primary ovarian failure
No. of connections: 10
TB proteins: acetylglutamate kinase, adenylyl cyclase, bphD, CRP/FNR trans. reg., cyp121, cysM, inhA, mscL, pknB, sigC

Drug: methotrexate
Intended targets: dihydrofolate reductase, serum albumin
Indications: gestational choriocarcinoma, chorioadenoma destruens, hydatidiform mole, severe psoriasis, rheumatoid arthritis
No. of connections: 10
TB proteins: acetylglutamate kinase, aroF, cmaA2, CRP/FNR trans. reg., cyp121, cyp51, lpd, mmaA4, panC, usp

Drug: raloxifene
Intended targets: estrogen receptor, estrogen receptor β
Indications: osteoporosis in post-menopausal women
No. of connections: 9
TB proteins: adenylyl cyclase, CRP/FNR trans. reg., deoD, inhA, pknB, pknE, Rv1347c, secA1, sigC

Chang et al. 2010 PLoS Comp. Biol. 6(9): e1000938

Systems Biology & Drug Discovery


Bioinformatics & Patient Care


7. Social Change

Josh Sommer and Chordoma Disease

http://fora.tv/2010/04/23/Sage_Commons_Josh_Sommer_Chordoma_Foundation#fullprogram

5. Personalized Medicine

http://pharmacogenomics.ucsd.edu/

Additional Reading

• http://en.wikipedia.org/wiki/Bioinformatics


Questions?

pbourne@ucsd.edu


9. Translational Medicine