Modeling Biological Systems and Analyzing Large-Scale Data Sets

Page 1: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Modeling Biological Systems and Analyzing Large-Scale Data Sets

Ilya Shmulevich

Page 2: Modeling Biological Systems and Analyzing Large-Scale Data Sets

TCGA Data Types

Page 3: Modeling Biological Systems and Analyzing Large-Scale Data Sets

TCGA Research Network

Page 4: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Heterogeneous data

Page 5: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Clinical variables contributing to tumor aggressiveness

Scale: Less Aggressive to More Aggressive

• Distant Metastasis: M0 = No, M1 = Yes

• Tumor Stage: Early (I-II), Late (III-IV)

• Fraction Lymph Nodes Positive by H & E: 0 – 100 %

• Lymphatic Invasion Present: No, Yes

• Vascular Invasion Present: No, Yes

• Histological Type: Mucinous, Non-mucinous

Nature, 487, 330-337, 2012.

Vesteinn Thorsson

Page 6: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Vesteinn Thorsson

FBXW7

Page 7: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Vesteinn Thorsson

Nature, 487, 330-337, 2012.

Page 8: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Vesteinn Thorsson, Dick Kreisberg

Nature, 487, 330-337, 2012.

Page 9: Modeling Biological Systems and Analyzing Large-Scale Data Sets

http://explorer.cancerregulome.org

Web-based Apps

Page 10: Modeling Biological Systems and Analyzing Large-Scale Data Sets

The Regulome Explorer is an interactive web application that allows the user to explore multivariate relationships in data

Richard Kreisberg, Jake Lin, Timo Erkkila, Sheila Reynolds

explorer.cancerregulome.org

Page 11: Modeling Biological Systems and Analyzing Large-Scale Data Sets

explorer.cancerregulome.org

Page 12: Modeling Biological Systems and Analyzing Large-Scale Data Sets

RF-ACE is a multivariate statistical inference method based on ensembles of decision trees. It seeks to uncover significant associations between features in the input data matrix.

Timo Erkkilä, Sheila Reynolds, Kari Torkkola

Page 13: Modeling Biological Systems and Analyzing Large-Scale Data Sets

RF-ACE has high predictive power and is resistant to over-fitting.

Computational challenges:

• mixed data types: continuous, discrete, and categorical

• tens of thousands of features × tens or hundreds of samples

• non-linear, noisy, and multivariate relationships

• correlated features

• missing data

http://code.google.com/p/rf-ace/

Timo Erkkilä

RF-ACE features:

• handles mixed variable types

• does not require imputation of missing values

• random subsampling rather than combinatorial search

• statistical testing removes redundant features

• “importance” p-value for each candidate predictor

• fast, portable implementation in C++
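
The "importance" p-value idea can be illustrated with a small permutation test. The sketch below is not RF-ACE's C++ implementation: it assumes a single decision stump as a stand-in for the tree ensemble, handles numeric features only, and uses hypothetical helper names (stump_importance, contrast_p_value). Each feature's importance is compared with the importance of randomly shuffled copies of itself (artificial contrasts), and samples with missing values are skipped rather than imputed.

```python
import random


def stump_importance(x, y):
    """Largest drop in squared error of y from one threshold split on x.

    Samples where x is None (missing) are skipped, not imputed.
    """
    pairs = sorted((xi, yi) for xi, yi in zip(x, y) if xi is not None)
    n = len(pairs)
    if n < 2:
        return 0.0
    ys = [yi for _, yi in pairs]
    total = sum(ys)
    total_sq = sum(v * v for v in ys)
    sse_all = total_sq - total * total / n
    best = 0.0
    left_sum = left_sq = 0.0
    for i in range(1, n):
        left_sum += ys[i - 1]
        left_sq += ys[i - 1] ** 2
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between equal feature values
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        sse_split = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        best = max(best, sse_all - sse_split)
    return best


def contrast_p_value(x, y, n_contrasts=200, rng=None):
    """Empirical p-value: how often a shuffled copy of x scores at least as well as x."""
    rng = rng or random.Random(0)
    observed = stump_importance(x, y)
    hits = 0
    for _ in range(n_contrasts):
        contrast = list(x)
        rng.shuffle(contrast)  # breaks any real link between x and y
        if stump_importance(contrast, y) >= observed:
            hits += 1
    return (hits + 1) / (n_contrasts + 1)


# Synthetic toy data (illustrative only, not TCGA data).
rng = random.Random(42)
signal = [rng.random() for _ in range(60)]
noise = [rng.random() for _ in range(60)]
target = [1.0 if s > 0.5 else 0.0 for s in signal]
signal[3] = None  # one missing measurement

p_signal = contrast_p_value(signal, target, rng=rng)
p_noise = contrast_p_value(noise, target, rng=rng)
print(p_signal, p_noise)
```

A feature that truly determines the target should receive the smallest attainable p-value, 1 / (n_contrasts + 1), while an unrelated feature should not be systematically favored.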

Page 14: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Google I/O keynote presentation, June 27, 2012

600,000 cores

Page 15: Modeling Biological Systems and Analyzing Large-Scale Data Sets
Page 16: Modeling Biological Systems and Analyzing Large-Scale Data Sets

A multilevel pan-cancer view: from genes to hallmarks

Theo Knijnenburg

Page 17: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Mutational investment

Page 18: Modeling Biological Systems and Analyzing Large-Scale Data Sets

explorer.cancerregulome.org

Billions of Associations!

Page 19: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Motivating questions

• Repurposing

– Which existing cancer drugs may be therapeutic in which other cancers?

– Which inhibitors with no current cancer indications may be therapeutic in certain cancers?

• Opportunity

– TCGA primary tumor data may serve as the basis for guided investigation of these open questions

Page 20: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Guiding principle

• The direct protein target for most inhibitors is not the sensitizing aberrated protein itself

– e.g., AKT1 inhibitors are most effective against cell lines with PTEN mutations

Song et al. (2012)

Page 21: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Proof of concept: Associations between drug targets (e.g., AKT1) and sensitizing aberrations (e.g., PTEN) also evident in TCGA

PTEN mutations in UCEC

AKT1 protein expression related to PTEN mutation in UCEC

[Figure: AKT1 RPPA protein expression versus PTEN mutation status]

genespot.org

cancerregulome.org

Page 22: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Association

drug target : sensitizing aberration pairs

Proof of concept: Associations between drug targets (e.g., AKT1) and sensitizing aberrations (e.g., PTEN) also evident in TCGA

Page 23: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Approach

• Create large heterogeneous graph of associations from TCGA data, literature, databases, …

– [Billions of edges, Terabytes of data]

• Query on Cray YarcData uRiKA graph analytics appliance

– No locality of reference, graphs hard to partition

– [Minutes rather than hours per query]

• Identify aberrated gene → target → drug relationships for drugs with and without known efficacy in cancer

Genomic Aberration: TP53 mutation

Synthetic lethal protein targets: ATR, CHEK1, PAK3, PLK1, SGK2, WEE1

Candidate compounds
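
The aberrated gene → target → drug lookup can be sketched as a two-hop traversal over a small edge list. This is only an illustration in plain Python, not the uRiKA workflow: the relation names (associated_with, targeted_by) and the function candidate_chains are hypothetical, and the four edges are limited to the PTEN and TP53 examples reported in this talk.

```python
from collections import defaultdict

# (subject, relation, object) edges; the relation names are hypothetical labels.
EDGES = [
    ("PTEN mutation", "associated_with", "PIK3R1"),
    ("PIK3R1", "targeted_by", "Wortmannin"),
    ("TP53 mutation", "associated_with", "ABCG2"),
    ("ABCG2", "targeted_by", "Nelfinavir"),
]


def candidate_chains(edges, aberration):
    """Return sorted (aberration, target, drug) chains reachable in two hops."""
    targets = defaultdict(set)
    drugs = defaultdict(set)
    for subject, relation, obj in edges:
        if relation == "associated_with":
            targets[subject].add(obj)
        elif relation == "targeted_by":
            drugs[subject].add(obj)
    return sorted(
        (aberration, target, drug)
        for target in targets.get(aberration, ())
        for drug in drugs.get(target, ())
    )


chains = candidate_chains(EDGES, "PTEN mutation")
print(chains)
```

On a graph with billions of edges the same two-hop pattern would be expressed as a query against the graph store rather than as in-memory dictionaries.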

Page 24: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Integrating multiple data sources into a (big) graph

Genomic aberrations Therapeutic targets Candidate inhibitors

RNAi

Page 25: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Graph Data Model: Resource Description Framework (RDF)

[Figure: example RDF graph. The edges are not recoverable from the transcript; the node and edge labels are listed below.]

Resources: <http://www.systemsbiology.net/tp53y_n_somatic>, <http://www.systemsbiology.net/tp53>, <http://www.systemsbiology.net/brca2>, <http://www.systemsbiology.net/biotin>, <http://www.systemsbiology.net/Drug>, <http://www.systemsbiology.net/gata3gexp>

Blank nodes: _:blankGeneNMD, _:blankDrugGeneNMD, _:blankPairwise

Predicates: <http://www.systemsbiology.net/feature#label>, <http://www.systemsbiology.net/feature#source>, <http://www.systemsbiology.net/nmd#term1>, <http://www.systemsbiology.net/nmd#term2>, <http://www.systemsbiology.net/nmd#combocount>, <http://www.systemsbiology.net/nmd#nmd>, <http://www.systemsbiology.net/association#dataset>, <http://www.systemsbiology.net/association#feature1>, <http://www.systemsbiology.net/association#feature2>, <http://www.systemsbiology.net/association#correlation>, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>

Literals: gnab, gexp, brca_05nov, 0.25151, 1.1628, -0.511
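
RDF represents every fact as a (subject, predicate, object) triple, and a SPARQL query is at its core a set of triple patterns with shared variables. The sketch below mimics that in plain Python with no RDF library. The URIs and literals are taken from the slide, but how they connect is an assumption, since the edges of the original figure are not recoverable; unify and query are hypothetical helper names.

```python
NS = "http://www.systemsbiology.net/"

# Assumed wiring of one pairwise association node; only the labels come from the slide.
TRIPLES = [
    ("_:blankPairwise", NS + "association#feature1", NS + "tp53y_n_somatic"),
    ("_:blankPairwise", NS + "association#feature2", NS + "gata3gexp"),
    ("_:blankPairwise", NS + "association#correlation", -0.511),
    ("_:blankPairwise", NS + "association#dataset", "brca_05nov"),
    (NS + "tp53y_n_somatic", NS + "feature#source", "gnab"),
    (NS + "gata3gexp", NS + "feature#source", "gexp"),
]


def unify(pattern, triple, binding):
    """Match one triple pattern against one triple; return the extended binding or None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if isinstance(term, str) and term.startswith("?"):
            if term in new and new[term] != value:
                return None  # variable already bound to something else
            new[term] = value
        elif term != value:
            return None  # constant term does not match
    return new


def query(patterns, triples):
    """Join triple patterns left to right, like a SPARQL basic graph pattern."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                new = unify(pattern, triple, binding)
                if new is not None:
                    extended.append(new)
        bindings = extended
    return bindings


# Associations whose first feature comes from the "gnab" source.
PATTERNS = [
    ("?a", NS + "association#feature1", "?f1"),
    ("?a", NS + "association#feature2", "?f2"),
    ("?a", NS + "association#correlation", "?r"),
    ("?f1", NS + "feature#source", "gnab"),
]

results = query(PATTERNS, TRIPLES)
for row in results:
    print(row["?f1"], row["?f2"], row["?r"])
```

The nested-loop join is fine for a handful of triples; a real triple store indexes the triples so that each pattern is answered without scanning the whole graph.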

Page 26: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Example SPARQL Query

Seed Gene List

Associated Genes

Small Molecules

Cancer Type

Literature

TCGA

Database

Literature

cancer.gov approved drugs

Page 27: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Example Result: PTEN associations in UCEC

Genomic aberrations: PTEN

Candidate targets: ASRGL1, ESR1, GLYATL2, PLIN3, HADH, NT5E, PIK3R3, GABRE, PGR, FBP1, SMPD3, GRIN1, PIK3R1, RARG, AADAT, CACNA2D2, SST, SRD5A1, B4GALT1, ADRA1B, KCNJ12, RYR1, SLC6A14, RETSAT, FAAH, SRR, NQO1, CEACAM1, KCNK6, ACADS, CRAT, ELOVL4, FOLH1, ALDH1A3, SORD, ASS1, NADSYN1, PRNP, NDUFA11, KCNH2, CPS1, SLC22A5, HMGCR, ALDH18A1, PARS2, GLS, B4GALT4, ACACB, SLC38A3, GSR, OAZ3, TCN1, SLC1A1, SMPD4, BHMT2, HSD17B4, GRIK5, GLDC, PPIB, PIPOX, ADA, SCN3B, S100A1, PLG, SLC1A4, CBS, GLRB, ACVR1B, SLC6A2

Candidate inhibitors: Acepromazine, Acitretin, Adapalene, Adenine, Adenosine monophosphate, Adenosine triphosphate, Adinazolam, Alfuzosin, Alitretinoin, Allylestrenol, Alpha-Linolenic Acid, Alprazolam, Alteplase, Aminocaproic Acid, Amiodarone, Amitriptyline, Amoxapine, Amsacrine, Anistreplase, Aprotinin, Arcitumomab, Aripiprazole, Astemizole, Atomoxetine, Atorvastatin, Bepridil, Biotin, Bromazepam, Bromocriptine, Bupropion, Caffeine, Capromab, Carglumic acid, Carmustine, Carvedilol, Chlordiazepoxide, Chlorotrianisene, Chlorpheniramine, Chlorpromazine, Cinolazepam, Cisapride, Clobazam, Clomifene, Clomipramine, Clonazepam, Clorazepate, Clotiazepam, Clozapine, Cocaine, Conjugated Estrogens, Cysteamine, Danazol, Dantrolene, Dapiprazole, Debrisoquin, Desipramine, Desogestrel, Desvenlafaxine, Dexmethylphenidate, Dextroamphetamine, Diazepam, Dicumarol, Dienestrol, Diethylpropion, Diethylstilbestrol, Dipyridamole, Dofetilide, Doxazosin, Doxepin, Drospirenone, Droxidopa, Duloxetine, Dutasteride, Dydrogesterone, Ephedra, Ephedrine, Epinephrine, Ergotamine, Escitalopram, Estazolam, Estradiol, Estramustine, Estriol, Estrone, Estropipate, Ethinyl Estradiol, Ethynodiol Diacetate, Etonogestrel, Felodipine, Finasteride, Fludiazepam, Fluoxymesterone, Flurazepam, Fluticasone Propionate, Fluvastatin, Fulvestrant, Gabapentin, Galsulfase, Ginkgo biloba, Glutathione, Glycine, Guanadrel Sulfate, Guanethidine, Halazepam, Halofantrine, Hydroxocobalamin, Ibutilide, Idursulfase, Imipramine, Isoproterenol, Isradipine, Ketazolam, Labetalol, L-Alanine, L-Arginine, L-Asparagine, L-Aspartic Acid, L-Carnitine, L-Citrulline, L-Cysteine, Levonordefrin, Levonorgestrel, L-Glutamic Acid, L-Histidine, Lindane, Lisdexamfetamine, L-Methionine, Lorazepam, L-Ornithine, Lovastatin, L-Proline, L-Serine, Maprotiline, Mazindol, Medroxyprogesterone, Megestrol, Melatonin, Menadione, Meperidine, Mestranol, Methamphetamine, Methotrimeprazine, Methoxamine, Methylphenidate, Mianserin, Miconazole, Midazolam, Midodrine, Mifepristone, Milnacipran, Modafinil, N-Acetyl-D-glucosamine, NADH, Naloxone, Nefazodone, Nicardipine, Nitrazepam, Nitrendipine, Norelgestromin, Norepinephrine, Norethindrone, Norgestimate, Nortriptyline, Olanzapine, Olopatadine, Orphenadrine, Oxazepam, Paliperidone, Paroxetine, Pentostatin, Pentoxifylline, Pergolide, Phendimetrazine, Phenmetrazine, Phentermine, Phenylephrine, Phosphatidylserine, Pimozide, Pravastatin, Prazepam, Prazosin, Progesterone, Promazine, Propafenone, Propericiazine, Propiomazine, Protriptyline, Pseudoephedrine, Pyridoxine, Quazepam, Quetiapine, Quinestrol, Quinidine, Raloxifene, Reboxetine, Reteplase, Risperidone, Rosuvastatin, S-Adenosylmethionine, Sertindole, Sibutramine, Simvastatin, Sotalol, Streptokinase, Suramin, Tamoxifen, Tamsulosin, Tazarotene, Temazepam, Tenecteplase, Terazosin, Terfenadine, Tetracycline, Thiopental, Thioproperazine, Thioridazine, Toremifene, Tramadol, Tranexamic Acid, Tretinoin, Triazolam, Trilostane, Trimipramine, Urokinase, Venlafaxine, Verapamil, Vitamin A, Xylometazoline, Ziprasidone, Zonisamide

Page 28: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Example Result: PTEN associations in UCEC

Genomic aberrations Candidate targets Candidate inhibitors

PTEN PIK3R1/PIK3CA Wortmannin

[Figure: PIK3R1 gene expression versus PTEN mutation status]

PDB id 3hhm

Page 29: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Repurposing existing cancer drugs in other cancers

Genomic aberrations Candidate targets Candidate inhibitors

Existing cancer indication Target Cancer Drug A

New cancer indication

Page 30: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Example Result

• TP53 is frequently mutated in most tumor types

• ABCG2, also known as Breast Cancer Resistance Protein (BCRP), is associated with TP53 mutation in TCGA breast cancer data

• Nelfinavir, an HIV protease inhibitor, also binds ABCG2 and many other proteins

• High-throughput cell line screening of breast cancer cells recently identified Nelfinavir as a selective inhibitor. “It can be brought to HER2-breast cancer treatment trials with the same dosage regimen as that used among HIV patients.” [Shim et al. JNCI 2012]

Page 31: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Source: EMBO Rep. 2004 May; 5(5): 470–476.

Source: http://www.sjrcd.org/soilhealth/soilagg.html

Source: http://www.webmd.com

Source: http://www.theregister.co.uk

Understanding behavior of massive multicellular systems: BioCellion

Ductal Carcinoma model:

Nicholas Flann, Utah State Univ.

Page 32: Modeling Biological Systems and Analyzing Large-Scale Data Sets

Acknowledgments

Brady Bernard, Ryan Bressler, Andrea Eakin, Timo Erkkilä, Lisa Iype, Seunghwa Kang, Theo Knijnenburg, Roger Kramer, Richard Kreisberg, Kalle Leinonen, Jake Lin, Yuexin Liu, Michael Miller, Sheila Reynolds, Hector Rovira, Vesteinn Thorsson, Da Yang, Wei Zhang