Bioinformatics A Biased Overview

A {Biased} Overview of Bioinformatics with Examples Drawn from Our Own Work Philip E. Bourne Professor of Pharmacology UCSD [email protected] 1 Bioinformatics - Overview

description

Lecture given to graduate microbiology students on examples of work in bioinformatics. Date:

Transcript of Bioinformatics A Biased Overview

Page 1: Bioinformatics A Biased Overview

A {Biased} Overview of Bioinformatics

with Examples Drawn from Our Own Work

Philip E. Bourne Professor of Pharmacology UCSD

[email protected]

1 Bioinformatics - Overview

Page 2: Bioinformatics A Biased Overview

There Are Multiple Types of Informatics in the Life Sciences

Pharmacy Informatics: drug dosing, pharmacokinetics, pharmacy information systems

Biomedical Informatics: EHR, decision support systems, hospital information systems

Bioinformatics: algorithms, genomics, proteomics, biological networks, systems biology

Note: These are only representative examples

Page 3: Bioinformatics A Biased Overview

There Are Multiple Types of Informatics in the Life Sciences

Pharmacy Informatics

Biomedical Informatics

Bioinformatics

Controlled vocabularies, ontologies, literature searching, data management, pharmacogenomics, personalized medicine

Note: These are only representative examples

Page 4: Bioinformatics A Biased Overview

Bioinformatics In One Slide

[Figure: timeline with ticks at 90, 95, 00 and 05]

Pipeline: Biological Experiment, Data, Information, Knowledge, Discovery

Activities: Collect, Characterize, Compare, Model, Infer

Complexity scale: Sequence, Structure, Assembly, Sub-cellular, Cellular, Organ, Higher-life

Technology axes: Computing Power, Sequencing, Data (1, 10, 100, 1000, 10^5), # People/Web Site (10^6, 10^2, 1)

Milestones shown: ESTs, E. coli Genome, Yeast Genome, C. elegans Genome, 1 Small Genome/Mo., Gene Chips, Human Genome Project, Human Genome, Virus Structure, Ribosome, Model Metabolic Pathway of E. coli, Genetic Circuits, Neuronal Modeling, Cardiac Modeling, Brain Mapping, Virtual Communities, Blogs, Facebook, 1000's GWAS

The Omics Revolution

Page 5: Bioinformatics A Biased Overview

Bioinformatics – One Definition

• The integration of biological data in digital form from different sources and possibly different scales (complexity), usually collected by others, and subsequently analyzed to offer new biological insights

Page 6: Bioinformatics A Biased Overview

Biological Scales (Complexity)


Genomics

Proteomics

Protein-protein interactions

Biological Networks

Systems Biology

We will look at an example of how bioinformatics is used at each scale

Page 7: Bioinformatics A Biased Overview

Some Thoughts on Genomic Data

• It's scary

• It's time to consider cost vs benefit

• Reductionism is not a dirty word

• We need to do more with the long tail

On the Future of Genomic Data. Science 11 February 2011: vol. 331 no. 6018 728-729

Page 8: Bioinformatics A Biased Overview

Bioinformatics & Metagenomics

• New type of genomics

• New data (and lots of it) and new types of data
  – 17M new (predicted) proteins! 4-5x growth in just a few months and much more coming
  – New challenges and exacerbation of old challenges

Bioinformatics at Different Scales - Genomics

Page 9: Bioinformatics A Biased Overview

Metagenomics: Early Results

• More than 99.5% of DNA in every environment studied represents unknown organisms

• Most genes represent distant homologs of known genes, but there are thousands of new families

• Environments being studied:
  – Water (ocean, lakes)
  – Air
  – Soil
  – Human body (gut, oral cavity, human microbiome)

Page 10: Bioinformatics A Biased Overview

Metagenomics: New Discoveries

Environmental (red) vs. Currently Known PTPases (blue)

[Figure: groups labeled 1 to 4, with a "Higher eukaryotes" label]

Page 11: Bioinformatics A Biased Overview

Proteomics


Page 12: Bioinformatics A Biased Overview

It's Not Just About Numbers, It's About Complexity

[Figure: number of released entries by year. Courtesy of the RCSB Protein Data Bank]

Bioinformatics at Different Scales - Proteomics

Page 13: Bioinformatics A Biased Overview

Determining 3D Structures – The Impact of Bioinformatics

Basic Steps: Target Selection; Crystallomics (Isolation, Expression, Purification, Crystallization); Data Collection; Structure Solution; Structure Refinement; Functional Annotation; Publish

Structural biology moves from being functionally driven to genomically driven

Annotations on the steps: Fill in protein fold space; Robotics; -ve data; Software engineering; Functional prediction; Not necessarily

Page 14: Bioinformatics A Biased Overview

Bioinformatics at Different Scales - Proteomics

Page 15: Bioinformatics A Biased Overview

Nature’s Reductionism

There are ~20^300 possible proteins >>>> all the atoms in the Universe

~20M protein sequences from UniProt/TrEMBL

~75,000 protein structures

Yield ~1500 folds, ~2000 superfamilies, ~4000 families (SCOP 1.75)

Using Protein Structure to Study Evolution
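The size comparison on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that reads the slide's figure as 20 raised to the power 300 (20 amino acids, a chain of 300 residues). The estimate of roughly 10^80 atoms in the observable Universe is an assumption brought in for comparison; it does not come from the slide.

```python
import math

AMINO_ACIDS = 20               # standard amino acid alphabet
CHAIN_LENGTH = 300             # residues; assumed reading of the slide's exponent
ATOMS_IN_UNIVERSE_LOG10 = 80   # assumed order-of-magnitude estimate, not from the slide

# Python integers are arbitrary precision, so this is exact.
possible_sequences = AMINO_ACIDS ** CHAIN_LENGTH

# Work in log10 to compare magnitudes without printing a 391-digit number.
log10_sequences = CHAIN_LENGTH * math.log10(AMINO_ACIDS)
log10_excess = log10_sequences - ATOMS_IN_UNIVERSE_LOG10

print(f"20^300 is about 10^{log10_sequences:.1f}")
print(f"That exceeds the atom estimate by a factor of about 10^{log10_excess:.1f}")
```

Against this space, the ~20M known sequences and ~1500 folds quoted on the slide are a vanishingly small sample, which is the sense of "reductionism" in the title.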

Page 16: Bioinformatics A Biased Overview

Structure Provides an Evolutionary Fingerprint

Distribution among the three kingdoms as taken from SUPERFAMILY

• Superfamily distributions would seem to be related to the complexity of life

[Venn diagram: SCOP folds (765 total) in Eukaryota (650), Archaea (416) and Bacteria (564)]

Region                    Folds
Eukaryota only            135
Archaea only              2
Bacteria only             42
Eukaryota and Archaea     10
Eukaryota and Bacteria    118
Archaea and Bacteria      17
All three                 387

Second set of region labels (any genome / all genomes): 153/14, 9/1, 21/2, 310/0, 645/49, 29/0, 68/0
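The seven region counts on this slide can be checked against the three superkingdom totals. The sketch below assumes one particular assignment of counts to Venn regions; that assignment is inferred from the arithmetic (it is the one that reproduces 650, 416 and 564) rather than stated in the slide.

```python
# Region counts of the SCOP-fold Venn diagram. The mapping of each count to a
# region is an inferred assumption, chosen so that the totals below match.
regions = {
    frozenset({"Eukaryota"}): 135,
    frozenset({"Archaea"}): 2,
    frozenset({"Bacteria"}): 42,
    frozenset({"Eukaryota", "Archaea"}): 10,
    frozenset({"Eukaryota", "Bacteria"}): 118,
    frozenset({"Archaea", "Bacteria"}): 17,
    frozenset({"Eukaryota", "Archaea", "Bacteria"}): 387,
}


def kingdom_total(kingdom):
    """Sum every region that includes the given superkingdom."""
    return sum(count for members, count in regions.items() if kingdom in members)


totals = {k: kingdom_total(k) for k in ("Eukaryota", "Archaea", "Bacteria")}
folds_in_any_region = sum(regions.values())

print(totals)
print(folds_in_any_region)
```

The regions sum to fewer folds than the 765 quoted as the total. One possible reading is that some folds were not assigned to any genome, but the slide does not say.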

Page 17: Bioinformatics A Biased Overview

Method – Distance Determination

SCOP superfamilies (FSF), assigned to organisms by SUPERFAMILY

Presence/Absence Data Matrix

FSF        C. intestinalis   C. briggsae   F. rubripes
a.1.1      1                 1             1
a.1.2      1                 1             1
a.10.1     0                 0             1
a.100.1    1                 1             1
a.101.1    0                 0             0
a.102.1    0                 1             1
a.102.2    1                 1             1

Distance Matrix

                  C. intestinalis   C. briggsae   F. rubripes
C. intestinalis   0                 101           109
C. briggsae                         0             144
F. rubripes                                       0
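To make the step from presence/absence data to pairwise distances concrete, here is a minimal sketch using the seven superfamily rows shown on this slide. It assumes a simple Hamming distance (the number of superfamilies present in one organism but not the other); the slide does not state which distance measure was actually used, and the values 101, 109 and 144 come from the full superfamily set rather than this excerpt.

```python
from itertools import combinations

organisms = ["C. intestinalis", "C. briggsae", "F. rubripes"]

# Presence (1) / absence (0) of each superfamily, in the order of `organisms`.
presence = {
    "a.1.1":   (1, 1, 1),
    "a.1.2":   (1, 1, 1),
    "a.10.1":  (0, 0, 1),
    "a.100.1": (1, 1, 1),
    "a.101.1": (0, 0, 0),
    "a.102.1": (0, 1, 1),
    "a.102.2": (1, 1, 1),
}


def hamming(i, j):
    """Count superfamilies present in exactly one of organisms i and j."""
    return sum(row[i] != row[j] for row in presence.values())


# Upper triangle of the distance matrix, keyed by organism pair.
distances = {
    (organisms[i], organisms[j]): hamming(i, j)
    for i, j in combinations(range(len(organisms)), 2)
}

for pair, d in distances.items():
    print(pair, d)
```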

Page 18: Bioinformatics A Biased Overview

If Structure is so Conserved, is it a Useful Tool in the Study of Evolution?

The Answer Would Appear to be Yes

• It is possible to generate a reasonable tree of life from merely the presence or absence of superfamilies (FSFs) within a given proteome


Yang, Doolittle & Bourne (2005) PNAS 102(2) 373-8
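As an illustration of how a distance matrix can be turned into a tree, the sketch below applies average-linkage clustering (UPGMA) to the three-organism distance matrix from the method slide. UPGMA is swapped in here for simplicity; this transcript does not say which tree-building method the authors used, and with only three taxa the resulting grouping should not be read as a phylogenetic result.

```python
def upgma(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples.

    `dist` maps frozenset({a, b}) to the distance between leaves a and b.
    """
    # Each cluster is (subtree, leaves); leaves are kept to average distances.
    clusters = [(name, (name,)) for name in names]

    def cluster_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset({a, b})] for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the closest pair of clusters by mean leaf-to-leaf distance.
        index_pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = min(
            index_pairs,
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


names = ["C. intestinalis", "C. briggsae", "F. rubripes"]
dist = {
    frozenset({"C. intestinalis", "C. briggsae"}): 101,
    frozenset({"C. intestinalis", "F. rubripes"}): 109,
    frozenset({"C. briggsae", "F. rubripes"}): 144,
}

tree = upgma(names, dist)
print(tree)
```

UPGMA assumes a roughly constant rate of change along all lineages, which is one reason other methods are often preferred for real phylogenies.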

Page 19: Bioinformatics A Biased Overview

The Influence of Environment on Life

Chris Dupont, Scripps Institution of Oceanography, UCSD

Dupont, Yang, Palenik, Bourne. 2006 PNAS 103(47) 17822-17827

Page 20: Bioinformatics A Biased Overview

Consider the Distribution of Disulfide Bonds among Folds

• Disulfides are only stable under oxidizing conditions

• Oxygen content gradually accumulated during the earth’s evolution

• The divergence of the three kingdoms occurred 1.8-2.2 billion years ago

• Oxygen began to accumulate ~2.0 billion years ago

• Logical deduction: disulfides are more prevalent in folds (organisms) that evolved later

• This would seem to hold true

• Can we take this further?

[Venn diagram: SCOP folds (708 total) in Eukaryota, Archaea and Bacteria, with the share of folds containing disulfides in each region]

Region                    Folds with disulfides
Eukaryota only            31.9% (43/135)
Archaea only              0% (0/2)
Bacteria only             16.7% (7/42)
Eukaryota and Archaea     0% (0/10)
Eukaryota and Bacteria    14.4% (17/118)
Archaea and Bacteria      5.9% (1/17)
All three                 4.7% (18/387)

Page 21: Bioinformatics A Biased Overview

Evolution of the Earth

• 4.5 billion years of change

• 300 ± 50 K

• 1-5 atmospheres

• Constant photoenergy

• Chemical and geological changes

• Life has evolved in this time

• The ocean was the “cradle” for 90% of evolution

Page 22: Bioinformatics A Biased Overview

Theoretical Levels of Trace Metals and Oxygen in the Deep Ocean Through Earth’s History

• Whether the deep ocean became oxic or euxinic following the rise in atmospheric oxygen (~2.3 Gya) is debated, therefore both are shown (oxic ocean-solid lines, euxinic ocean-dashed lines).

• The phylogenetic tree symbols at the top of the figure show one idea as to the theoretical periods of diversification for each Superkingdom.

[Figure: concentration (O2 in arbitrary units, Zn and Fe in moles L-1) against billions of years before present (0 to 4.5). Curves: Oxygen, Zinc, Iron, Cobalt, Manganese. Diversification symbols: Bacteria, Archaea, Eukarya.]

Replotted from Saito et al., 2003, Inorganica Chimica Acta 356: 308-318

Page 23: Bioinformatics A Biased Overview


The Gaia Hypothesis

Gaia - a complex entity involving the Earth's biosphere, atmosphere, oceans, and soil; the totality constituting a feedback system which seeks an optimal physical and chemical environment for life on this planet.

James Lovelock

Gaia (pronounced /'geɪ.ə/ or /'gaɪ.ə/) "land" or "earth", from the Greek Γαῖα; is a Greek goddess personifying the Earth


Page 24: Bioinformatics A Biased Overview


The Question

• Have the emergent properties of an organism as judged by its protein content been influenced by the environment?

• Will do this by consideration of the metallomes of a broad range of species

• The metallomes can only be deduced by consideration of the protein structures to which the metal is covalently bound

• Will hypothesize that these emergent properties in turn influenced the environment


Page 25: Bioinformatics A Biased Overview

Superfamily Distribution As Well As Overall Content Has Changed

Bacteria Fe superfamilies

a.1.1, a.1.2, a.104.1, a.110.1, a.119.1, a.138.1, a.2.11, a.24.3, a.24.4, a.25.1, a.3.1, a.39.3, a.56.1, a.93.1,
b.1.13, b.2.6, b.3.6, b.33.1, b.70.2, b.82.2,
c.56.6, c.83.1, c.96.1,
d.134.1, d.15.4, d.174.1, d.178.1, d.35.1, d.44.1, d.58.1,
e.18.1, e.19.1, e.26.1, e.5.1,
f.21.1, f.21.2, f.24.1, f.26.1,
g.35.1, g.36.1, g.41.5

Eukaryotic Fe superfamilies

a.1.1, a.1.2, a.104.1, a.110.1, a.119.1, a.138.1, a.2.11, a.24.3, a.24.4, a.25.1, a.3.1, a.39.3, a.56.1, a.93.1,
b.1.13, b.2.6, b.3.6, b.33.1, b.70.2, b.82.2,
c.56.6, c.83.1, c.96.1,
d.134.1, d.15.4, d.174.1, d.178.1, d.35.1, d.44.1, d.58.1,
e.18.1, e.19.1, e.26.1, e.5.1,
f.21.1, f.21.2, f.24.1, f.26.1,
g.35.1, g.36.1, g.41.5

Page 26: Bioinformatics A Biased Overview

Metal Binding Proteins are Not Consistent Across Superkingdoms

[Figure, panels A and B. One panel plots total Zn-binding domains in a proteome against total domains in a proteome on logarithmic axes; the other shows the slope of the fitted power law (axis 0 to 2) for Zn, Fe, Mn and Co in Archaea, Bacteria and Eukarya.]

Since these data are derived from current species they are independent of evolutionary events such as duplication, gene loss, horizontal transfer and endosymbiosis

Page 27: Bioinformatics A Biased Overview

Power Laws: Fundamental Constants in the Evolution of Proteomes

A slope of 1 indicates that a group of structural domains is in equilibrium with genome growth, while a slope > 1 indicates that the group of domains is being preferentially duplicated (or retained in the case of genome reductions).

van Nimwegen E (2006) in: Koonin EV, Wolf YI, Karev GP, (Ed.). Power laws, scale-free networks, and genome biology
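The slope described here is the exponent of a power law, which can be estimated as the slope of a straight-line fit on log-log axes. The following is a minimal sketch of that fit using ordinary least squares. The proteome sizes and domain counts are synthetic values made up for illustration, and the function name `power_law_slope` is hypothetical; neither comes from the slides or the cited work.

```python
import math


def power_law_slope(total_domains, group_domains):
    """Least-squares slope of log10(group) against log10(total)."""
    xs = [math.log10(x) for x in total_domains]
    ys = [math.log10(y) for y in group_domains]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic proteome sizes (made up for illustration, not data from the slides).
totals = [500, 1_000, 5_000, 20_000, 80_000]

# A domain group that grows in step with the proteome (exponent 1).
in_step = [0.02 * t for t in totals]

# A domain group that is preferentially duplicated (exponent 1.5).
amplified = [0.001 * t ** 1.5 for t in totals]

slope_in_step = power_law_slope(totals, in_step)
slope_amplified = power_law_slope(totals, amplified)

print(round(slope_in_step, 3))
print(round(slope_amplified, 3))
```

Fitting in log space means the prefactor (0.02 or 0.001 above) only shifts the intercept, so the slope isolates how the group scales with proteome size.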


Page 28: Bioinformatics A Biased Overview


Why are the Power Laws Different for Each Superkingdom?

• Power laws are likely influenced by selective pressure. Qualitatively, the differences in the power law slopes describing Eukarya and Prokarya are correlated to the shifts in trace metal geochemistry that occur with the rise in oceanic oxygen

• We hypothesize that proteomes contain an imprint of the environment at the time of the last common ancestor in each Superkingdom

• This suggests that Eukarya evolved in an oxic environment, whereas the Prokarya evolved in anoxic environments


Page 29: Bioinformatics A Biased Overview

Do the Metallomes Contain Further Support for this Hypothesis?

Superkingdom   Fold family                    % Fe-binding   Fe bound by   O2
Eukarya        Cytochrome P450                0.44 ± 0.48    heme          yes
Eukarya        Cytochrome c3-like             0.13 ± 0.3     heme          no
Eukarya        Cytochrome b5                  0.12 ± 0.09    heme          no
Eukarya        Purple acid phosphatase        0.11 ± 0.08    amino         no
Eukarya        Penicillin synthase-like       0.07 ± 0.1     amino         yes
Eukarya        Hypoxia-inducible factor       0.07 ± 0.04    amino         yes
Eukarya        Di-heme elbow motif            0.06 ± 0.01    heme          no
Archaea        4Fe-4S ferredoxins             1.80 ± 0.7     Fe-S          no
Archaea        MoCo biosynthesis proteins     1.60 ± 0.3     Fe-S          no
Archaea        Heme-binding PAS domain        1.10 ± 1.0     heme          no
Archaea        HemN                           0.80 ± 0.20    Fe-S          (1)
Archaea        α-helical ferredoxin           0.60 ± 0.16    Fe-S          no
Archaea        biotin synthase                0.55 ± 0.1     Fe-S          no
Archaea        ROO N-terminal domain-like     0.5 ± 0.1      amino         (2)
Bacteria       High potential iron protein    0.38 ± 0.25    Fe-S          no
Bacteria       Heme-binding PAS domain        0.3 ± 0.4      heme          (1)
Bacteria       MoCo biosynthesis proteins     0.21 ± 0.15    Fe-S          no
Bacteria       HemN                           0.2 ± 0.15     Fe-S          no
Bacteria       4Fe-4S ferredoxins             0.2 ± 0.2      Fe-S          no
Bacteria       cytochrome c                   0.14 ± 0.2     heme          no
Bacteria       α-helical ferredoxin           0.12 ± 0.09    Fe-S          no

Overall percent of Fe bound by

Superkingdom   Fe-S       heme       amino
Eukarya        21 ± 9     47 ± 19    32 ± 12
Archaea        68 ± 12    13 ± 14    19 ± 6
Bacteria       47 ± 11    22 ± 12    31 ± 16

(1) Some, but not all, PAS domains actually sense oxygen
(2) The Rubredoxin oxygen:oxidoreductase (ROO) protein does not contact oxygen, but catalyzes an oxygen reduction pathway

Page 30: Bioinformatics A Biased Overview

e- Transfer Proteins

Same Broad Function, Same Metal, Different Chemistry

Induced by the Environment?

Fe-S clusters
• Fe bound by S
• Cluster held in place by Cys
• Generally negative reduction potentials
• Very susceptible to oxidation

Cytochromes
• Fe bound by heme (and amino-acids)
• Generally positive reduction potentials
• Less susceptible to oxidation

Page 31: Bioinformatics A Biased Overview


Hypothesis

• Emergence of cyanobacteria changed oxygen concentrations

• Impacted relative metal ion concentrations in the ocean

• Organisms evolved to use these metals in new ways, giving rise to new biological processes, e.g., complex signaling

• This in turn further impacted the environment

• Only protein structures could reveal such dependencies


Page 32: Bioinformatics A Biased Overview

Bioinformatics in the Context of Drug Discovery


Page 33: Bioinformatics A Biased Overview

Our Motivation

• Tykerb – Breast cancer

• Gleevec – Leukemia, GI cancers

• Nexavar – Kidney and liver cancer

• Staurosporine – natural product – alkaloid – many uses, e.g., antifungal, antihypertensive

Collins and Workman 2006 Nature Chemical Biology 2 689-700

Motivators

Page 34: Bioinformatics A Biased Overview

A Reverse Engineering Approach to Drug Discovery Across Gene Families

Characterize ligand binding site of primary target (Geometric Potential)

Identify off-targets by ligand binding site similarity (Sequence order independent profile-profile alignment)

Extract known drugs or inhibitors of the primary and/or off-targets

Search for similar small molecules

Dock molecules to both primary and off-targets

Statistical analysis of docking score correlations

Computational Methodology

Xie and Bourne 2009 Bioinformatics 25(12) 305-312

Page 35: Bioinformatics A Biased Overview

The Problem with Tuberculosis

• One third of global population infected

• 1.7 million deaths per year

• 95% of deaths in developing countries

• Anti-TB drugs hardly changed in 40 years

• MDR-TB and XDR-TB pose a threat to human health worldwide

• Development of novel, effective and inexpensive drugs is an urgent priority

Repositioning - The TB Story

Page 36: Bioinformatics A Biased Overview

The TB-Drugome

1. Determine the TB structural proteome

2. Determine all known drug binding sites from the PDB

3. Determine which of the sites found in 2 exist in 1

4. Call the result the TB-drugome

A Multi-target/drug Strategy Kinnings et al 2010 PLoS Comp Biol 6(11): e1000976

Page 37: Bioinformatics A Biased Overview

1. Determine the TB Structural Proteome

[Figure: TB proteome of 3,996 proteins: 284 with solved structures, 1,446 with homology models, 2,266 remaining]

• High quality homology models from ModBase (http://modbase.compbio.ucsf.edu) increase structural coverage from 7.1% to 43.3%
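The coverage percentages quoted here follow from the counts in the figure on this slide. The sketch below assumes that the 1,446 homology models cover proteins that have no solved structure, so that the two groups can simply be added; the slide does not state this explicitly.

```python
proteome_size = 3996   # proteins in the TB proteome
solved = 284           # proteins with solved structures
modelled = 1446        # proteins with homology models (assumed disjoint from `solved`)

coverage_solved = solved / proteome_size
coverage_with_models = (solved + modelled) / proteome_size
uncovered = proteome_size - solved - modelled

print(f"Solved structures only: {coverage_solved:.1%}")
print(f"With homology models:   {coverage_with_models:.1%}")
print(f"Proteins without structural coverage: {uncovered}")
```

Under that assumption the remaining count matches the fourth number shown in the figure.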

A Multi-target/drug Strategy Kinnings et al 2010 PLoS Comp Biol 6(11): e1000976

Page 38: Bioinformatics A Biased Overview

2. Determine all Known Drug Binding Sites in the PDB

• Searched the PDB for protein crystal structures bound with FDA-approved drugs

• 268 drugs bound in a total of 931 binding sites

[Figure: number of drug binding sites per drug; labeled drugs include methotrexate, chenodiol, alitretinoin, conjugated estrogens, darunavir and acarbose]

A Multi-target/drug Strategy Kinnings et al 2010 PLoS Comp Biol 6(11): e1000976

Page 39: Bioinformatics A Biased Overview

Map 2 onto 1 – The TB-Drugome

http://funsite.sdsc.edu/drugome/TB/

Similarities between the binding sites of M.tb proteins (blue), and binding sites containing approved drugs (red).

Page 40: Bioinformatics A Biased Overview

From a Drug Repositioning Perspective

• Similarities between drug binding sites and TB proteins are found for 61/268 drugs

• 41 of these drugs could potentially inhibit more than one TB protein

[Figure: number of potential TB targets per drug; labeled drugs include raloxifene, alitretinoin, conjugated estrogens, methotrexate, ritonavir, testosterone, levothyroxine and chenodiol]

A Multi-target/drug Strategy Kinnings et al 2010 PLoS Comp Biol 6(11): e1000976

Page 41: Bioinformatics A Biased Overview

Top 5 Most Highly Connected Drugs

Drug: levothyroxine
Intended targets: transthyretin, thyroid hormone receptor α & β-1, thyroxine-binding globulin, mu-crystallin homolog, serum albumin
Indications: hypothyroidism, goiter, chronic lymphocytic thyroiditis, myxedema coma, stupor
No. of connections: 14
TB proteins: adenylyl cyclase, argR, bioD, CRP/FNR trans. reg., ethR, glbN, glbO, kasB, lrpA, nusA, prrA, secA1, thyX, trans. reg. protein

Drug: alitretinoin
Intended targets: retinoic acid receptor RXR-α, β & γ, retinoic acid receptor α, β & γ-1&2, cellular retinoic acid-binding protein 1&2
Indications: cutaneous lesions in patients with Kaposi's sarcoma
No. of connections: 13
TB proteins: adenylyl cyclase, aroG, bioD, bpoC, CRP/FNR trans. reg., cyp125, embR, glbN, inhA, lppX, nusA, pknE, purN

Drug: conjugated estrogens
Intended targets: estrogen receptor
Indications: menopausal vasomotor symptoms, osteoporosis, hypoestrogenism, primary ovarian failure
No. of connections: 10
TB proteins: acetylglutamate kinase, adenylyl cyclase, bphD, CRP/FNR trans. reg., cyp121, cysM, inhA, mscL, pknB, sigC

Drug: methotrexate
Intended targets: dihydrofolate reductase, serum albumin
Indications: gestational choriocarcinoma, chorioadenoma destruens, hydatidiform mole, severe psoriasis, rheumatoid arthritis
No. of connections: 10
TB proteins: acetylglutamate kinase, aroF, cmaA2, CRP/FNR trans. reg., cyp121, cyp51, lpd, mmaA4, panC, usp

Drug: raloxifene
Intended targets: estrogen receptor, estrogen receptor β
Indications: osteoporosis in post-menopausal women
No. of connections: 9
TB proteins: adenylyl cyclase, CRP/FNR trans. reg., deoD, inhA, pknB, pknE, Rv1347c, secA1, sigC
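The drug-to-protein links on this slide form a small bipartite network, and "No. of connections" is each drug's degree in it. The sketch below transcribes the TB protein lists from this slide into a dictionary, recomputes the connection counts, and looks for proteins linked to several drugs. The variable names are illustrative only, and this is not the analysis pipeline used to build the TB-drugome.

```python
from collections import Counter

# TB proteins linked to each drug, transcribed from the slide's table.
connections = {
    "levothyroxine": [
        "adenylyl cyclase", "argR", "bioD", "CRP/FNR trans. reg.", "ethR",
        "glbN", "glbO", "kasB", "lrpA", "nusA", "prrA", "secA1", "thyX",
        "trans. reg. protein",
    ],
    "alitretinoin": [
        "adenylyl cyclase", "aroG", "bioD", "bpoC", "CRP/FNR trans. reg.",
        "cyp125", "embR", "glbN", "inhA", "lppX", "nusA", "pknE", "purN",
    ],
    "conjugated estrogens": [
        "acetylglutamate kinase", "adenylyl cyclase", "bphD",
        "CRP/FNR trans. reg.", "cyp121", "cysM", "inhA", "mscL", "pknB",
        "sigC",
    ],
    "methotrexate": [
        "acetylglutamate kinase", "aroF", "cmaA2", "CRP/FNR trans. reg.",
        "cyp121", "cyp51", "lpd", "mmaA4", "panC", "usp",
    ],
    "raloxifene": [
        "adenylyl cyclase", "CRP/FNR trans. reg.", "deoD", "inhA", "pknB",
        "pknE", "Rv1347c", "secA1", "sigC",
    ],
}

# Degree of each drug node: how many TB proteins it is linked to.
degree = {drug: len(proteins) for drug, proteins in connections.items()}

# Degree of each protein node: how many of these drugs are linked to it.
protein_counts = Counter(p for proteins in connections.values() for p in proteins)
shared_by_all = [p for p, n in protein_counts.items() if n == len(connections)]

print(degree)
print(protein_counts.most_common(3))
print(shared_by_all)
```

Proteins that recur across several drugs are the kind of multi-drug targets the TB-drugome is meant to surface, though a shared binding-site similarity is only a starting hypothesis for experimental follow-up.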

Page 42: Bioinformatics A Biased Overview

Chang et al. 2010 Plos Comp. Biol. 6(9): e1000938

Systems Biology & Drug Discovery


Page 43: Bioinformatics A Biased Overview

Bioinformatics & Patient Care


Page 44: Bioinformatics A Biased Overview

7. Social Change

Josh Sommer and Chordoma Disease

http://fora.tv/2010/04/23/Sage_Commons_Josh_Sommer_Chordoma_Foundation#fullprogram

Page 45: Bioinformatics A Biased Overview

5. Personalized Medicine

http://pharmacogenomics.ucsd.edu/

Page 46: Bioinformatics A Biased Overview

Additional Reading

• http://en.wikipedia.org/wiki/Bioinformatics


Page 47: Bioinformatics A Biased Overview

Questions?

[email protected]


Page 48: Bioinformatics A Biased Overview

9. Translational Medicine