Download - A. Holzinger LV709 - human-centered.ai...2015/11/05  · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer. Or have a


Status as of 08.11.2015 10:00

Dear Students, welcome to the 5th lecture of our course. Please remember from the last lecture the basic architecture of a hospital information system, the complexity of medical workflows, the challenges of data integration, data fusion, data curation; the building blocks of hospital information systems, databases, data warehouses, data marts; the difference between knowledge discovery and information retrieval; please remember the formal description of an information retrieval model. The best practice example is the PageRank algorithm, see: Hastie, T., Tibshirani, R. & Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer. Or have a look at the reprint paper: Brin, S. & Page, L. 2012. Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer Networks, 56, (18), 3825-3833. http://www.sciencedirect.com/science/article/pii/S1389128612003611 doi:10.1016/j.comnet.2012.10.007

Please always be aware of the definition of biomedical informatics (Medizinische Informatik): Biomedical informatics is the interdisciplinary field that studies and pursues the effective use of biomedical data, information, and knowledge for scientific inquiry, problem solving, and decision making, motivated by efforts to improve human health (and well-being).

WS 2015

A. Holzinger                                                        LV709.049                                                 11.11.2015


In vivo (Latin for "within the living") is experimentation using a whole, living organism as opposed to a partial or dead organism, or an in vitro ("within the glass", i.e., in a test tube or petri dish) controlled environment.


It is widely acknowledged in machine learning that the performance of a learning algorithm is dependent on both its parameters and the training data. Yet, the bulk of algorithmic development has focused on adjusting model parameters without fully understanding the data that the learning algorithm is modeling. As such, algorithmic development for classification problems has largely been measured by classification accuracy, precision, or a similar metric on benchmark data sets. As most machine learning research is focused on the data set level, one is concerned with maximizing p(h|t), where h: X → Y is a hypothesis or function mapping input feature vectors X to their corresponding label vectors Y, and t = {(x_i, y_i) : x_i ∈ X ∧ y_i ∈ Y} is a training set.

One of the methods for privacy preserving data mining is that of anonymization, in which a record is released only if it is indistinguishable from k other entities in the data. We note that methods such as k-anonymity are highly dependent upon spatial locality in order to effectively implement the technique in a statistically robust way. In high dimensional space the data becomes sparse, and the concept of spatial locality is no longer easy to define from an application point of view. Aggarwal, C. C. On k-anonymity and the curse of dimensionality. Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), 2005. 901-909.
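
The idea can be made concrete with a minimal sketch. It assumes the common convention that a table is k-anonymous when every combination of quasi-identifier values occurs at least k times; the records and field names below are hypothetical toy data, not taken from Aggarwal's paper.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical toy records; the field names are illustrative only.
records = [
    {"age_group": "30-39", "zip_prefix": "801", "diagnosis": "A"},
    {"age_group": "30-39", "zip_prefix": "801", "diagnosis": "B"},
    {"age_group": "40-49", "zip_prefix": "802", "diagnosis": "A"},
    {"age_group": "40-49", "zip_prefix": "802", "diagnosis": "C"},
]

# With two attributes, each combination is shared by two records.
two_attrs = is_k_anonymous(records, ["age_group", "zip_prefix"], k=2)
# Adding one more attribute (one more dimension) makes every record unique.
three_attrs = is_k_anonymous(records, ["age_group", "zip_prefix", "diagnosis"], k=2)
print(two_attrs, three_attrs)
```

In this toy table the check succeeds for two attributes and fails once a third is added, which is a small-scale picture of why anonymization becomes harder as the dimensionality grows.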

Holzinger, A., Stocker, C. & Dehmer, M. 2014. Big Complex Biomedical Data: Towards a Taxonomy of Data. In: Obaidat, M. S. & Filipe, J. (eds.) Communications in Computer and Information Science CCIS 455. Berlin Heidelberg: Springer, pp. 3-18.


https://www.projectrhea.org/rhea/index.php/File:Complexitytable.png

P stands for "polynomial time". This is the subset of problems that can be guaranteed to be solved in a polynomial amount of time related to their input length. Problems in P commonly operate on single inputs, lists, or matrices, and can occasionally apply to graphs. The typical types of operations they perform are mathematical operators, sorting, finding minimum and maximum values, determinants, and many others.

NP stands for "nondeterministic polynomial time". These problems are ones that can be solved in polynomial time using a nondeterministic computer. This concept is a little harder to understand, so another definition that is a consequence of the first is often used. NP problems are problems that can be checked, or "certified", in polynomial time. The output of an NP solving program is called a certificate, and the polynomial time program that checks the certificate for its validity is called the certification program.

NP-hard: A problem is NP-hard if it is at least as hard as the hardest problems in NP. This leads to two possibilities: either the problem is in NP and also considered NP-hard, or it is more difficult than any NP problem.

NP-complete: This classification is the intersection of NP and NP-hard. If a problem is in NP and also NP-hard, then it is considered NP-complete. This class of problems is arguably the most interesting for its consequences on many other types of problems.
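
To illustrate what "checking a certificate in polynomial time" means, here is a small sketch for the subset-sum problem, a well-known NP-complete problem. The function name and the example numbers are chosen here for illustration only. Finding a suitable subset may take exponential time in the worst case, but verifying a proposed subset only needs counting and one addition pass.

```python
from collections import Counter

def verify_subset_sum(numbers, target, certificate):
    """Certification program: check that `certificate` only uses values
    available in `numbers` (respecting multiplicity) and sums to `target`."""
    available = Counter(numbers)
    chosen = Counter(certificate)
    uses_only_given = all(chosen[v] <= available[v] for v in chosen)
    return uses_only_given and sum(certificate) == target

numbers = [3, 34, 4, 12, 5, 2]
ok = verify_subset_sum(numbers, 9, [4, 5])       # valid certificate
bad = verify_subset_sum(numbers, 9, [4, 4, 1])   # sums to 9, but uses values not available
print(ok, bad)
```

The verifier runs in time linear in the input size, which is the defining property of a problem in NP; nothing in the sketch says how to find the certificate efficiently.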

For those who want to go deeper into complexity theory, there is excellent MIT OpenCourseWare by Erik Demaine, http://erikdemaine.org/ https://www.youtube.com/watch?v=moPtwq_cVH8 You can do some experimentation of your own via http://www.algomation.com


Key problems in dealing with data in the life sciences include:
• Complexity of our world
• High-dimensionality (curse of dimensionality (Catchpoole et al., 2010))
• Most of the data is weakly-structured and unstructured

A grand challenge in health care is the complexity of data, implicating two issues: structurization and standardization. As we have learned in lecture 2, very little of the data is structured. Most of our data is weakly structured (Holzinger, 2012). In the language of business there is often the use of the word "unstructured", but we have to use this word with care; unstructured would mean, in a strict mathematical sense, that we are talking about total randomness and complete uncertainty, which would mean noise, where standard methods fail or lead to the modeling of artifacts, and only statistical approaches may help. The correct term would be unmodeled data; otherwise we shall speak about unstructured information. Please mind the differences.

To the image above: Advances in genetics and genomics have accelerated the discovery-based (= hypotheses generating) research that provides a powerful complement to the direct hypothesis-driven molecular, cellular and systems sciences. For example, genetic and functional genomic studies have yielded important insights into neuronal function and disease. One of the most exciting and challenging frontiers in neuroscience involves harnessing the power of large-scale genetic, genomic and phenotypic data sets, and the development of tools for data integration and data mining (Geschwind & Konopka, 2009).


Do not confuse structure with standardization (see Slide 2-9). Data can be standardized (e.g. numerical entries in laboratory reports) and non-standardized. A typical example is non-standardized text, imprecisely called "Free-Text" or "unstructured data", in an electronic patient record (Kreuzthaler et al., 2011).

Standardized data is the basis for accurate communication. In the medical domain, many different people work at different times in various locations. Data standards can ensure that information is interpreted by all users with the same understanding. Moreover, standardized data facilitate comparability of data and interoperability of systems. It supports the reusability of the data, improves the efficiency of healthcare services and avoids errors by reducing duplicated efforts in data entry.

Data standardization refers to a) the data content; b) the terminologies that are used to represent the data; c) how data is exchanged; and d) how knowledge, e.g. clinical guidelines, protocols, decision support rules, checklists, standard operating procedures are represented in the health information system (refer to IOM). Technical elements for data sharing require standardization of identification, record structure, terminology, messaging, privacy etc. The most used standardized data set to date is the International Classification of Diseases (ICD), which was first adopted in 1900 for collecting statistics (Ahmadian et al., 2011), which we will discuss in → Lecture 3.

Non-standardized data is the majority of data and inhibits data quality, data exchange and interoperability.

Well-structured data is the minority of data and an idealistic case when each data element has an associated defined structure, relational tables, or the Resource Description Framework RDF, or the Web Ontology Language OWL (see → Lecture 3). Note: Ill-structured is a term often used for the opposite of well-structured, although this term originally was used in the context of problem solving (Simon, 1973).

Semi-structured is a form of structured data that does not conform with the strict formal structure of tables and data models associated with relational databases but contains tags or markers to separate structure and content, i.e. such data are schema-less or self-describing; a typical example is a markup language such as XML (see → Lecture 3 and 4).

Weakly-structured data is most of our data in the whole universe, whether it is in macroscopic (astronomy) or microscopic structures (biology); see → Lecture 5.

Non-structured data or unstructured data is an imprecise definition used for information expressed in natural language, when no specific structure has been defined. This is an issue for debate: text also has some structure: words, sentences, paragraphs. If we are very precise, unstructured data would mean that the data is completely randomized, which is usually called noise and is defined by (Duda, Hart & Stork, 2000) as any property of data which is not due to the underlying model but instead to randomness (either in the real world, from the sensors or the measurement procedure).


A look at the typical view of a hospital information system shows us the organization of well-structured data: Standardized and well-structured data is the basis for accurate communication. In the medical domain, many different people work at different times in various locations. Data standards can ensure that information is interpreted by all users with the same understanding. Moreover, standardized data facilitate comparability of data and interoperability of systems. It supports the reusability of the data, improves the efficiency of healthcare services and avoids errors by reducing duplicated efforts in data entry.

Remember: Data standardization refers to a) the data content; b) the terminologies that are used to represent the data; c) how data is exchanged; and d) how knowledge, e.g. clinical guidelines, protocols, decision support rules, checklists, standard operating procedures are represented in the health information system.

Note: The opposite, i.e. non-standardized data, is the majority of data and inhibits data quality, data exchange and interoperability.

Remark: Care2x is an Open Source Information System, see: http://care2x.org See → Lecture 10 for more details.


This is a medical example for semi-structured data in XML (Holzinger, 2003). The eXtensible Markup Language (XML) is a flexible text format recommended by the W3C for data exchange and derived from SGML (ISO 8879), (Usdin & Graham, 1998). XML is often classified as semi-structured; however, this is in some way misleading, as the data itself is still structured, but in a flexible rather than a static way (Forster & Vossen, 2012). Such data does not conform to the formal structure of tables and data models as for example in relational databases, but at least contains tags/markers to separate semantic elements and enforce hierarchies of records and fields within these data.
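
A small sketch may help to see what "flexible rather than static" structure means in practice. The record below is hypothetical (element names, attribute names and values are invented for illustration and are not the example from the slide); it is parsed with Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical patient record; element and attribute names are illustrative only.
xml_text = """
<patient id="p001">
  <name>Jane Doe</name>
  <visit date="2015-11-05">
    <diagnosis code="J45">Asthma</diagnosis>
    <note>Patient reports shortness of breath at night.</note>
  </visit>
  <visit date="2015-11-11">
    <note>Follow-up, symptoms improved.</note>
  </visit>
</patient>
""".strip()

root = ET.fromstring(xml_text)

# Tags separate the semantic elements and enforce a hierarchy ...
visits = root.findall("visit")
# ... but the two visits do not need to carry the same fields.
codes = [d.get("code") for d in root.iter("diagnosis")]
notes = [n.text for n in root.iter("note")]

print(root.get("id"), len(visits), codes, len(notes))
```

Note that the second visit has no diagnosis element at all: a relational table would need a NULL column for it, whereas the XML document simply omits the element.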


This example by (Rassinoux et al., 2003) shows how XML can be used in the hospital information system: The structure of any new document edited in the Patient Record (here: DPI) is based on a template defined in XML format (left). These templates play the role of DTDs or XML schemas as they precisely define the structure and content type of each paragraph, thus validating the document at the application level. Such a structure embeds a <HEADER> and a <BODY>. The header encapsulates the properties that are inherent to the new document and that will be useful to further classify it, according to various criteria, including: the patient identification, the document type, the identifier of its redactors and of the hospitalization stay or ambulatory consultation to which the document will be attached in the patient trajectory, etc. The body encapsulates the content, and is divided into two parts: The <STRUCDOC> part describes the semantic entities that compose the document. The <FULLDOC> part embeds the document itself with its page layout information, which can be stored either as a draft, a temporary text or as a definitive text. This format guarantees the storage of dynamic and controlled fields for data input, thus allowing the combination of free text and structured data entry in the document. Once the document is no longer editable, it is definitively saved into the RTF format. A CDATA section is utilized for storing the rough document whatever its format, as it permits to disregard blocks of text containing characters that would otherwise be regarded as markup (Rassinoux, Lovis, Baud & Geissbuhler, 2003).


On top in this slide you can see a sample XML describing genes from Drosophila melanogaster involved in long-term memory. Nested within the gene elements are sub-elements related to the parent. The first gene includes two nucleic acid sequences, a protein product, and a functional annotation. Additional information is provided by attributes, such as the organism. This example illustrates the difficulty of modeling many-to-many relationships, such as the relationship between genes and functions. Information about functions must be repeated under each gene with that function. If we invert the nesting (i.e., nesting genes inside function elements), then we must repeat information about genes with more than a single function. Below the XML we see the same information about genes, but using RDF and OWL. Both genes are instances of the class FlyGene, which has been defined as the set of all Genes for the organism D. melanogaster. The functional information is represented using a hierarchical taxonomy, in which Long-Term Memory is a subclass of Memory (Louie et al., 2007).

Remark: Drosophila melanogaster is a model organism and shares many genes with humans. Although Drosophila is an insect whose genome has only about 14,000 genes (half of humans), a remarkable number of these have very close counterparts in humans; some even occur in the same order in the fly's DNA as in our own. This, plus the organism's more than 100-year history in the lab, makes it one of the most important models for studying basic biology and disease (see e.g. http://www.lbl.gov/Science-Articles/Archive/sabl/2007/Feb/drosophila.html).

Note: The relational data model requires preciseness: The data must be regular, complete and structured. However, in biology the relationships are mostly imprecise. Genomic medicine is extremely data intensive and there is an increasing diversity in the type of data: DNA sequence, mutation, expression arrays, haplotype, proteomic etc. In bioinformatics many heterogeneous data sources are used to model complex biological systems (Rassinoux, Lovis, Baud & Geissbuhler, 2003), (Achard, Vaysseix & Barillot, 2001). The challenge in genomic medicine is to integrate and analyze these diverse and huge data sources to elucidate physiology and in particular disease physiology. XML is suited for describing semi-structured data, including a kind of natural modeling of biological entities, because it allows features such as nesting (see Slide 5-6 on top). Still, a key limitation of XML is that it is difficult to model complex relationships; for example, there is no obvious way to represent many-to-many relationships, which are needed to model complex pathways.
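
The difference between nesting and a graph-like representation can be sketched in a few lines. The code below is not RDF or OWL; it only mimics the idea of subject-predicate-object triples with plain Python tuples. The gene identifiers `geneA` and `geneB`, the function `Learning` and the predicate names are hypothetical; only the class names FlyGene, Memory and LongTermMemory are taken from the text.

```python
# Each fact is stated exactly once, so a function shared by several genes
# (many-to-many) does not have to be repeated under each gene.
triples = {
    ("geneA", "type", "FlyGene"),
    ("geneB", "type", "FlyGene"),
    ("geneA", "hasFunction", "LongTermMemory"),
    ("geneB", "hasFunction", "LongTermMemory"),
    ("geneB", "hasFunction", "Learning"),
    ("LongTermMemory", "subClassOf", "Memory"),
}

def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is a stated fact."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def superclasses(cls):
    """Follow subClassOf links transitively."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for parent in objects(current, "subClassOf"):
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found

def genes_with_function(function):
    """Genes annotated with `function` or with any of its subclasses."""
    return {s for s, p, o in triples
            if p == "hasFunction" and (o == function or function in superclasses(o))}

memory_genes = genes_with_function("Memory")
print(sorted(memory_genes))
```

Because Long-Term Memory is declared a subclass of Memory, a query for Memory also finds the genes annotated with the more specific function, which is the kind of inference a taxonomy supports and plain nesting does not.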


The human protein interaction network and its connection to positive selection. Proteins likely to be under positive selection are colored in shades of red (light red, low likelihood of positive selection; dark red, high likelihood) (6). Proteins estimated not to be under positive selection are in yellow, and proteins for which the likelihood of positive selection was not estimated are in white (6).


http://www.barabasilab.com/pubs/CCNR‐ALB_Publications/200907‐24_Science‐Decade/200907‐24_Science‐CoverImage.gif

An emerging trend in many scientific disciplines is a strong tendency toward being transformed into some form of information science. One important pathway in this transition has been via the application of network analysis. The basic methodology in this area is the representation of the structure of an object of investigation by a graph representing a relational structure. It is because of this general nature that graphs have been used in many diverse branches of science including bioinformatics, molecular and systems biology, theoretical physics, computer science, chemistry, engineering, drug discovery, and linguistics, to name just a few. An important feature of the book "Statistical and Machine Learning Approaches for Network Analysis" is to combine theoretical disciplines such as graph theory, machine learning, and statistical data analysis and, hence, to arrive at a new field to explore complex networks by using machine learning techniques in an interdisciplinary manner.

The age of network science has definitely arrived. Large-scale generation of genomic, proteomic, signaling, and metabolomic data is allowing the construction of complex networks that provide a new framework for understanding the molecular basis of physiological and pathological states. Networks and network-based methods have been used in biology to characterize genomic and genetic mechanisms as well as protein signaling. Diseases are looked upon as abnormal perturbations of critical cellular networks. Onset, progression, and intervention in complex diseases such as cancer and diabetes are analyzed today using network theory. Once the system is represented by a network, methods of network analysis can be applied to extract useful information regarding important system properties and to investigate its structure and function. Various statistical and machine learning methods have been developed for this purpose and have already been applied to networks.

Dehmer, M. & Basak, S. C. 2012. Statistical and Machine Learning Approaches for Network Analysis, Wiley Online Library.


The concept of network structures is fascinating, compelling and powerful and applicable in nearly any domain at any scale. Network theory can be traced back to graph theory, developed by Leonhard Euler in 1736 (see → Slide 5-8). However, stimulated by works e.g. from Barabási, Albert & Jeong (1999), research on complex networks has only recently been applied to biomedical informatics. As an extension of classical graph theory, see for example (Diestel, 2010), complex network research focuses on the characterization, analysis, modeling and simulation of complex systems involving many elements and connections, examples including the internet, gene regulatory networks, protein-protein networks, social relationships and the Web and many more. Attention is given not only to trying to identify special patterns of connectivity, such as the shortest average path between pairs of nodes (Newman, 2003), but also to considering the evolution of connectivity and the growth of networks, an example from biology being the evolution of protein-protein interaction networks in different species (→ Slide 5-8).

In order to understand complex biological systems, the three following key concepts need to be considered:
(i) emergence, the discovery of links between elements of a system, because the study of individual elements such as genes, proteins and metabolites is insufficient to explain the behavior of whole systems;
(ii) robustness, biological systems maintain their main functions even under perturbations imposed by the environment; and
(iii) modularity, vertices sharing similar functions are highly connected.

Network theory can largely be applied for biomedical informatics, because many tools are already available (Costa, Rodrigues & Cristino, 2008).


Figure 1, p. 74 - preceding to the yeast protein network.

A graph G(V, E) describes a structure which consists of nodes, aka vertices V, connected by a set of pairs of distinct nodes (links), called edges E, where each edge is a pair {a, b} with a, b ∈ V; a ≠ b. Graphs containing cycles and/or alternative paths are referred to as networks. The vertices and edges can have a range of properties defined as colors, which also may have quantitative values, referred to as weights. In this slide we see the basic building block symbols of a biological network as used in bioinformatics. The blue dots are serving as network hubs, the red block is a critical node (on a critical link), the white balls are bottlenecks, the stars second order hubs etc. (Hodgman, French & Westhead, 2010).


In order to represent network data in computers it is not comfortable to use sets; more practical are matrices. The simplest form of a graph representation is the so-called adjacency matrix. In this slide we see an undirected (left) and a directed graph and their respective adjacency matrices. If the graph is undirected, the adjacency matrix is symmetric, i.e., the elements a_ij = a_ji for any i and j.
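
As a minimal sketch of this representation, the following code builds the adjacency matrix of a small example graph once as an undirected and once as a directed graph. The four-node edge list is an arbitrary example chosen here, not the graph shown on the slide.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 adjacency matrix from a list of (i, j) pairs."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
        if not directed:
            a[j][i] = 1  # an undirected link is stored in both directions
    return a

def is_symmetric(a):
    """Check a_ij == a_ji for all i, j."""
    n = len(a)
    return all(a[i][j] == a[j][i] for i in range(n) for j in range(n))

# Arbitrary example graph with nodes 0..3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
undirected = adjacency_matrix(4, edges)
directed = adjacency_matrix(4, edges, directed=True)

for row in undirected:
    print(row)
print(is_symmetric(undirected), is_symmetric(directed))
```

The undirected matrix is symmetric by construction, while the directed one is not, because the link from node 0 to node 1 has no counterpart from node 1 to node 0.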


This tool is a nice example of the usefulness of adjacency matrices: The InfoVis Toolkit is an interactive graphics toolkit developed by Jean-Daniel Fekete at INRIA (The French National Institute for Computer Science and Control). The toolkit implements nine types of visualization: Scatter Plots, Time Series, Parallel Coordinates and Matrices for tables; Node-Link diagrams, Icicle trees and Treemaps for trees; Adjacency Matrices and Node-Link diagrams for graphs. Node-Link visualizations provide several variants (8 for graphs and 4 for trees). There are also a number of interactive controls and information displays, including dynamic query sliders, fisheye lenses, and excentric labels. Information about the InfoVis Toolkit can be found at http://ivtk.sourceforge.net

The InfoVis Toolkit provides interactive components such as range sliders and tailored control panels required to configure the visualizations. These components are integrated into a coherent framework that simplifies the management of rich data structures and the design and extension of visualizations. Supported data structures include tables, trees and graphs. All visualizations can use fisheye lenses and dynamic labeling (Fekete, 2004).


Illustration of the meaning of commonly used terms. The process of digital image formation in microscopy is described in other books. Image processing takes an image as input and produces a modified version of it (in the case shown, the object contours are enhanced using an operation known as edge detection, described in more detail elsewhere in this booklet). Image analysis concerns the extraction of object features from an image. In some sense, computer graphics is the inverse of image analysis: it produces an image from given primitives, which could be numbers (the case shown), or parameterized shapes, or mathematical functions. Computer vision aims at producing a high-level interpretation of what is contained in an image. This is also known as image understanding. Finally, the aim of visualization is to transform higher-dimensional image data into a more primitive representation to facilitate exploring the data.


The truly multi-disciplinary network science has led to a wide variety of quantitative measurements of their topological characteristics (Costa et al., 2007). The identification between a graph and an adjacency matrix makes all the powerful methods of linear algebra, graph theory and statistical mechanics available to us for investigating specific network characteristics:

Order (a in Figure Slide 5-11) = total number of nodes n.

Size = total number of links:

$$\sum_i \sum_j a_{ij}$$

Clustering coefficient (b in Slide 5-11) = the degree of concentration of the connections of the node's neighbors in a graph; it gives a measure of local inhomogeneity of the link density, i.e. the level of connectedness of the graph. It is calculated as the ratio between the actual number $t_i$ of links connecting the neighborhood (the nodes immediately connected to a chosen node) of a node and the maximum possible number of links in that neighborhood, where $k_i$ is the degree of node $i$:

$$C_i = \frac{2 t_i}{k_i (k_i - 1)}$$

For the whole network, the clustering coefficient is the arithmetic mean:

$$C = \frac{1}{n} \sum_i C_i$$

Path length (c in Slide 5-11) = the arithmetical mean of all the distances. The characteristic path length of node $i$ provides information about how closely node $i$ is connected to all other nodes in the network and is given by the distance $d(i,j)$ between node $i$ and all other nodes $j$ in the network. The path length $l$ provides important information about the level of global communication efficiency of a network:

$$l = \frac{1}{n(n-1)} \sum_{i \neq j} d_{ij}$$

Note: Numerical methods, e.g. Dijkstra's algorithm (1959), are used to calculate the shortest paths between any two nodes in a network.
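
The measures above can be tried out on a small example. The sketch below works under some assumptions that are not stated on the slide: the graph is undirected, unweighted and connected; the double sum over a symmetric adjacency matrix counts every link twice, so it is halved to obtain the number of links; nodes with fewer than two neighbors get a clustering coefficient of 0; and a breadth-first search is used instead of Dijkstra's algorithm, which gives the same distances when all links have weight 1. The four-node matrix is an arbitrary example.

```python
from collections import deque

def order(a):
    """Number of nodes n."""
    return len(a)

def size(a):
    """Number of links; the symmetric matrix counts each link twice."""
    return sum(sum(row) for row in a) // 2

def local_clustering(a, i):
    """C_i = 2 t_i / (k_i (k_i - 1)); defined here as 0 if k_i < 2."""
    neighbors = [j for j, linked in enumerate(a[i]) if linked]
    k = len(neighbors)
    if k < 2:
        return 0.0
    # t counts the links among the neighbors of node i.
    t = sum(a[u][v] for idx, u in enumerate(neighbors) for v in neighbors[idx + 1:])
    return 2 * t / (k * (k - 1))

def clustering_coefficient(a):
    """Arithmetic mean of the local clustering coefficients."""
    return sum(local_clustering(a, i) for i in range(len(a))) / len(a)

def distances_from(a, source):
    """Hop distances from source to all reachable nodes (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, linked in enumerate(a[u]):
            if linked and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_length(a):
    """l = 1/(n(n-1)) * sum over ordered pairs i != j of d_ij (connected graph assumed)."""
    n = len(a)
    total = sum(d for i in range(n) for j, d in distances_from(a, i).items() if j != i)
    return total / (n * (n - 1))

# Arbitrary example: a triangle 0-1-2 with a pendant node 3 attached to node 2.
a = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
print(order(a), size(a), clustering_coefficient(a), path_length(a))
```

For this graph the local coefficients are 1, 1, 1/3 and 0, so C = 7/12, and the sixteen ordered distances sum to 16, so l = 16/12 = 4/3.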


Centrality (d in Slide 5-12) = the level of "betweenness centrality" of a node $i$; it indicates how many of the shortest paths between the nodes of the network pass through node $i$. A high betweenness centrality indicates that this node is important in interconnecting the nodes of the network, marking a potential hub role (refer to → Slide 5-8) of this node in the overall network.

Nodal degree (e in Slide 5-12) = number of links connecting $i$ to its neighbors. The degree of node $i$ is defined as its total number of connections:

$$k_i = \sum_j a_{ij}$$

The degree probability distribution $P(k)$ describes the probability that a node is connected to $k$ other nodes in the network.

Modularity (f in Slide 5-12) = describes the possible formation of communities in the network, indicating how strongly groups of nodes form relatively isolated sub-networks within the full network (refer also to → Slide 5-8).
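
Degree, degree distribution and betweenness centrality can be computed by hand for small graphs. The following sketch assumes an undirected, unweighted graph given as an adjacency list, counts each unordered pair of nodes once, and does not normalize the betweenness scores; other conventions exist. The graph is an arbitrary four-node example.

```python
from collections import Counter, deque

def degrees(adj):
    """k_i = number of links of node i."""
    return {v: len(neigh) for v, neigh in adj.items()}

def degree_distribution(adj):
    """P(k) = fraction of nodes that have exactly k links."""
    counts = Counter(len(neigh) for neigh in adj.values())
    return {k: c / len(adj) for k, c in counts.items()}

def bfs_counts(adj, source):
    """Return (dist, sigma): hop distances and number of shortest paths from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]  # every shortest path to u extends to v
    return dist, sigma

def betweenness(adj):
    """For each node v, sum over pairs (s, t) of the fraction of shortest
    s-t paths that pass through v."""
    nodes = list(adj)
    info = {s: bfs_counts(adj, s) for s in nodes}
    score = {v: 0.0 for v in nodes}
    for idx, s in enumerate(nodes):
        dist_s, sigma_s = info[s]
        for t in nodes[idx + 1:]:
            if t not in dist_s:
                continue  # s and t are not connected
            for v in nodes:
                if v == s or v == t or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score

# Arbitrary example: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
k = degrees(adj)
p_k = degree_distribution(adj)
b = betweenness(adj)
print(k, p_k, b)
```

Node 2 is the only node lying on shortest paths between other nodes (0 to 3 and 1 to 3), so it is the only node with a non-zero betweenness, which matches its role as the connector to node 3.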


A regular network (a in Slide 5-13) has a local character, characterized by a high clustering coefficient (C in Slide 5-13) and a high path length (L, Slide 5-13). It takes a large number of steps to travel from a specific node to a node on the other end of the graph.

The opposite extreme is the random network, where all connections are distributed randomly across the network; the result is a graph with a random organization (outer right in Slide 5-13). In contrast to the local character of the regular network, a random network has a more global character, with a low C and a much shorter path length L than the regular network.

In between lies the small-world network (center of Slide 5-13), which is very robust and combines a high level of local and global efficiency. Watts & Strogatz (1998) showed that with a low probability p of randomly reconnecting a connection in the regular network, a so-called small-world organization arises. It has both a high C and a low L, combining a high level of local clustering with still a short average travel distance. Many networks in nature are small-world (e.g. the internet, protein networks, social networks, functional and structural brain networks), combining a high level of segregation with a high level of global information integration. In addition, such networks can have a heavy-tailed connectivity distribution, in contrast to random networks in which the nodes roughly all have the same number of connections.

Scale-free networks (B in Slide 5-13) are characterized by a degree probability distribution that follows a power-law function, indicating that on average a node has only a few connections, with the exception of a small number of nodes that are heavily connected. These nodes are often referred to as hub nodes (see → Slide 5-8) and they play a central role in the level of efficiency of the network, as they are responsible for keeping the overall travel distance in the network to a minimum. As these hub nodes play a key role in the organization of the network, scale-free networks tend to be vulnerable to specialized attack on the hub nodes.

Modular networks (c in Slide 5-13) show the formation of so-called communities, consisting of a subset of nodes that are mostly connected to their direct neighbors in their community and to a lesser extent to the other nodes in the network. Such networks are characterized by a high level of modularity of the nodes.
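The two quantities used above, the clustering coefficient C and the average path length L, can be computed directly on a small graph. The following is a minimal sketch, assuming a ring lattice as the regular network and a simple Watts-Strogatz style rewiring step; the function names, the graph size (20 nodes, 4 neighbors each), the rewiring probability 0.1 and the random seed are illustrative choices, not values from the lecture.

```python
import random
from collections import deque

def ring_lattice(n, k):
    """Regular network: every node is linked to its k nearest ring neighbors (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def rewire(adj, p, rng):
    """With probability p, replace edge (i, j) by (i, t) for a random new target t."""
    n = len(adj)
    new = {i: set(nb) for i, nb in adj.items()}
    for i in range(n):
        for j in sorted(adj[i]):
            if i < j and rng.random() < p:
                candidates = [t for t in range(n) if t != i and t not in new[i]]
                if candidates:
                    t = rng.choice(candidates)
                    new[i].discard(j)
                    new[j].discard(i)
                    new[i].add(t)
                    new[t].add(i)
    return new

def clustering_coefficient(adj):
    """Mean over all nodes of the fraction of neighbor pairs that are themselves linked."""
    total = 0.0
    for nb in adj.values():
        k = len(nb)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for a in nb for b in nb if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean shortest-path length (in hops) over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

regular = ring_lattice(20, 4)
rewired = rewire(regular, 0.1, random.Random(1))
for name, g in (("regular", regular), ("rewired", rewired)):
    print(name, clustering_coefficient(g), average_path_length(g))
```

Comparing the two printed lines for different values of p is one way to see how C and L change as the regular network is gradually randomized.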

There are many ways to construct a proximity graph representation from a set of data points that are embedded in R^d. Let us consider a set of data points {x_1, ..., x_n} ⊂ R^d. To each data point we associate a vertex of a proximity graph G to define a set of vertices V = {v_1, v_2, ..., v_n}. Determining the edge set E of the proximity graph G requires defining the neighbors of each vertex v_i according to its embedding x_i. Consequently, a proximity graph is a graph in which two vertices are connected by an edge iff the data points associated to the vertices satisfy particular geometric requirements. Such geometric requirements are usually based on a metric measuring the distance between two data points. A usual choice of metric is the Euclidean metric. Look at the slide:
a) is our initial set of points in the plane R^2.
b) ε-ball graph: v_i ∼ v_j if x_j ∈ B(v_i; ε).
c) k-nearest-neighbor graph (k-NNG): v_i ∼ v_j if the distance between x_i and x_j is among the k smallest distances from x_i to other data points. The k-NNG is a directed graph since one can have x_i among the k-nearest neighbors of x_j but not vice versa.
d) Euclidean Minimum Spanning Tree (EMST) graph: a connected tree sub-graph that contains all the vertices and has a minimum sum of edge weights. The weight of the edge between two vertices is the Euclidean distance between the corresponding data points.
e) Symmetric k-nearest-neighbor graph (Sk-NNG): v_i ∼ v_j if x_i is among the k-nearest neighbors of x_j or vice versa.
f) Mutual k-nearest-neighbor graph (Mk-NNG): v_i ∼ v_j if x_i is among the k-nearest neighbors of x_j and vice versa. All vertices in a mutual k-NN graph have a degree upper-bounded by k, which is not usually the case with standard k-NN graphs.
g) Relative Neighborhood Graph (RNG): v_i ∼ v_j iff there is no vertex in B(v_i; D(v_i, v_j)) ∩ B(v_j; D(v_i, v_j)).
h) Gabriel Graph (GG).
i) The β-Skeleton Graph (β-SG).
For details please refer to (Lézoray & Grady, 2012), or to a classical graph theory book, e.g. (Harary, 1969), (Bondy & Murty, 1976), (Golumbic, 2004), (Diestel, 2010).
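Some of the constructions listed above (ε-ball, k-NN, symmetric k-NN and mutual k-NN) can be sketched in a few lines. This is a brute-force illustration under the Euclidean metric, assuming Python 3.8+ for `math.dist`; the function names and the four example points are made up for the illustration, and ties between equal distances are broken by point index.

```python
import math

def epsilon_ball_graph(points, eps):
    """Undirected edges (i, j), i < j, between points at distance <= eps."""
    n = len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(points[i], points[j]) <= eps}

def knn_arcs(points, k):
    """Directed k-NN graph: arc (i, j) if x_j is one of the k nearest points to x_i."""
    n = len(points)
    arcs = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        arcs.update((i, j) for j in others[:k])
    return arcs

def symmetric_knn(points, k):
    """Edge if x_i is among the k nearest neighbors of x_j OR vice versa."""
    return {tuple(sorted(arc)) for arc in knn_arcs(points, k)}

def mutual_knn(points, k):
    """Edge if x_i is among the k nearest neighbors of x_j AND vice versa."""
    arcs = knn_arcs(points, k)
    return {(i, j) for (i, j) in arcs if i < j and (j, i) in arcs}

# Four illustrative points on a line in the plane R^2
pts = [(0, 0), (1, 0), (3, 0), (7, 0)]
print(sorted(epsilon_ball_graph(pts, 2)))
print(sorted(knn_arcs(pts, 1)))
print(sorted(symmetric_knn(pts, 1)))
print(sorted(mutual_knn(pts, 1)))
```

The example makes the asymmetry of the k-NNG visible: point 2 has point 1 as its nearest neighbor, but the nearest neighbor of point 1 is point 0, so the pair (1, 2) appears in the symmetric graph and not in the mutual one.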

Slide 5-16: Graphs from Images. In this slide we see the examples of a) a real image with the quadtree tessellation, b) the region adjacency graph associated to the quadtree partition, c) irregular tessellation using image-dependent superpixel Watershed Segmentation (Vincent & Soille, 1991), d) irregular tessellation using image-dependent SLIC superpixels (Lucchi et al., 2010); SLIC = Simple Linear Iterative Clustering.

A straightforward implementation of the original Vincent-Soille algorithm is difficult if plateaus occur. Therefore, an alternative approach was proposed by (Meijster & Roerdink, 1995), in which the image is first transformed to a directed valued graph with distinct neighbor values, called the components graph of f. On this graph the watershed transform can be computed by a simplified version of the Vincent-Soille algorithm, where FIFO queues are no longer necessary, since there are no plateaus in the graph (Roerdink & Meijster, 2000).

The original natural digital image is first transformed into grey-scale, then the Watershed algorithm is applied, and then the centroid function is calculated; the results are representative point sets in the plane.

The Delaunay Triangulation (DT): v_i ∼ v_j iff there is a closed ball B(•; r) with v_i and v_j on its boundary and no other vertex v_k contained in it. The dual to the DT is the Voronoi irregular tessellation, where each Voronoi cell is defined by the set {x ∈ R^n | D(x, v_k) ≤ D(x, v_j) for all v_j ≠ v_k}. In such a graph, ∀ v_i, deg(v_i) = 3 (Lézoray & Grady, 2012).

This animation shows the construction of a Delaunay graph: First the red points on the plane are drawn, then we insert the blue edges and the blue vertices of the Voronoi graph; finally, the red edges are drawn, which build the Delaunay graph (Kropatsch, Burge & Glantz, 2001).

http://oldwww.prip.tuwien.ac.at/research/research‐areas/structure‐and‐topology/graphs‐in‐image‐analysis/graphs‐in‐image‐analysis/use‐of‐graphs‐in‐image‐analysis/voronoi‐graph‐and‐delaunay‐graph

In this Slide we see information-theoretic network measures evaluated on publication networks, here from the excellence network of RWTH Aachen University. Those measures can be understood as graph complexity measures which evaluate the structural complexity based on the corresponding concept. A possible useful interpretation of these measures helps to understand the differences in subgraphs of a cluster. For example, one could apply community detection algorithms and compare entropy measures of such detected communities. Relating these data to social measures (e.g. balanced scorecard data) of sub-communities could yield indicators of collaboration success or lack thereof. The node size shows the node degree, whereas the node color shows the betweenness centrality; a darker color means higher centrality (Holzinger et al., 2013a).

A further example shall demonstrate the usefulness of graph theory and network analysis: This graph shows the medical knowledge space of a standard quick reference guide for emergency doctors and paramedics in the German-speaking area. It has been developed, tested in the medical real world and constantly improved for 20 years by Dr. med. Ralf Müller, emergency doctor at Graz-LKH University Hospital, and is practically in the pocket of every emergency and family doctor and paramedic in the German-speaking area (Holzinger et al., 2013b).

Up to now we know that graphs and graph theory are powerful tools to map data structures and to find novel connections between single data objects (Strogatz, 2001), (Dorogovtsev & Mendes, 2003). The inferred graphs can be further analyzed by using graph-theoretical, statistical and machine learning techniques (Dehmer, Emmert-Streib & Mehler, 2011). A mapping of the already existing "knowledge space", which is approved in medical practice, as a conceptual graph and the subsequent visual and graph-theoretical analysis may provide novel insights on hidden patterns in the data. Another benefit of the graph-based data structure is the applicability of methods from network topology, network analysis and data mining, e.g. the small-world phenomenon (Barabasi & Albert, 1999), (Kleinberg, 2000), and cluster analysis (Koontz, Narendra & Fukunaga, 1976), (Wittkop et al., 2011).

The graph-theoretic data of the graph seen in this Slide include: number of nodes = 641, number of edges = 1250; red are agents, black are conditions, blue are pharmacological groups, grey are other documents. The average degree of this graph = 3.888, the average path length = 4.683, the network diameter = 9.

The nodes of the sample graph represent: drugs, clinical guidelines, patient conditions (indication, contraindication), pharmacological groups, tables and calculations of medical scores, algorithms and other medical documents; and the edges represent 3 crucial types of relations inducing medical relevance between two active substances, i.e.: pharmacological groups, indications and contra-indications. The following example will demonstrate the usefulness of this approach.

This example shows us how conveniently we can find the shortest path between two nodes as well as the navigation route between these nodes. Computing shortest paths is a fundamental and ubiquitous problem in network analysis. We can, e.g., apply the Dijkstra algorithm, which solves the shortest path problem for a graph with non-negative edge path costs, producing a shortest path tree. This algorithm is often used in routing and as a subroutine in other graph algorithms: For a given node, the algorithm finds the path with lowest cost (i.e. the shortest path) between that node and every other node (Henzinger et al., 1997).
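A compact way to see the algorithm at work is the following sketch using a binary heap as priority queue. The graph, its node labels A to D and the edge costs are invented for the illustration and are not taken from the medical knowledge graph in the Slide.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path costs and shortest-path tree for non-negative edge costs.

    graph maps each node to a dict {neighbor: cost}.
    Returns (dist, parent), where parent encodes the shortest path tree.
    """
    dist = {source: 0}
    parent = {source: None}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # outdated heap entry, a shorter path to u was already found
        for v, cost in graph.get(u, {}).items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent

def path_to(parent, target):
    """Follow the parent pointers back from target to the source."""
    path = []
    while target is not None:
        path.append(target)
        target = parent[target]
    return path[::-1]

# Hypothetical weighted directed graph
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 5},
    "C": {"D": 1},
    "D": {},
}
dist, parent = dijkstra(graph, "A")
print(dist["D"], path_to(parent, "D"))  # 4 ['A', 'B', 'C', 'D']
```

The `parent` dictionary is the shortest path tree mentioned above: one call from a given node yields the cheapest route to every other reachable node.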

Here we see the relationship between Adrenaline (center black node) and Dobutamine (top left black node). Blue: Pharmacological Group; Dark red: Contraindication; Light red: Condition. The Green nodes (from dark to light) are:
1. Application (one or more indications + corresponding dosages)
2. Single indication with additional details (e.g. "VF after 3rd Shock")
3. Condition (e.g. VF, Ventricular Fibrillation)

Our brain forms one integrative complex network, linking all brain regions and sub-networks together (Van Den Heuvel & Hulshoff Pol, 2010). Examining the organization of this network provides insights into how our brain works. Graph theory provides a framework in which the topology of complex networks can be examined, and can thus reveal new information about both the local and global organization of functional brain networks. In the slide we can see how the modeling of the functional brain by a graph works: edges are the connections between regions that are functionally linked. First, the collection of nodes is defined (A); second, the existence of functional connections between the nodes in the network needs to be defined, resulting in a connectivity matrix (B). Finally, the existence of a connection between two points can be defined as whether their level of functional connectivity exceeds a certain predefined threshold (C) (Van Den Heuvel & Hulshoff Pol, 2010).
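Steps (B) and (C) amount to thresholding a symmetric connectivity matrix. A minimal sketch follows; the 3 x 3 matrix values and the threshold 0.3 are invented for the illustration, and real analyses would use measured functional connectivity between many more regions.

```python
def threshold_graph(conn, threshold):
    """Undirected edge set {(i, j), i < j} where connectivity exceeds the threshold."""
    n = len(conn)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if conn[i][j] > threshold}

def degrees(n, edges):
    """Number of suprathreshold connections per node."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg

# Hypothetical connectivity matrix for three regions (symmetric, diagonal ignored)
conn = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]
edges = threshold_graph(conn, 0.3)
print(sorted(edges), degrees(3, edges))
```

The choice of threshold directly controls how dense the resulting graph is, so graph measures computed afterwards should be read with that choice in mind.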

Development of the human heart starts 2 weeks after fertilization, with the formation of the cardiac crescent and the subsequent formation and looping of the primitive heart tube. Insight into the biology of molecular networks is an important field, as anomalies in these systems underlie a wide spectrum of polygenic human disorders, ranging from schizophrenia to congenital heart disease (CHD). Understanding the functional architecture of networks that organize the development of organs, see e.g. (Chien, Domian & Parker, 2008), lays the foundation of novel approaches in regenerative medicine, since manipulation of such systems is necessary for the success of tissue engineering technologies and stem cell therapy. Lage et al. (2010) developed a framework for gaining new insights into the systems biology of the protein networks driving organ development and related polygenic human disease phenotypes, exemplified with heart development and CHD.

In the Slide we see examples of four functional networks driving the development of different anatomical structures in the human heart. These four networks are constructed by analyzing the interaction patterns of four different sets of cardiac development (CD) proteins corresponding to the morphological groups 'atrial septal defects', 'abnormal atrioventricular valve morphology', 'abnormal myocardial trabeculae morphology', and 'abnormal outflow tract development'. CD proteins from the relevant groups are shown in orange and their interaction partners are shown in gray. Functional modules annotated by literature curation are indicated with a colored background. In the center of the Figure is a haematoxylin-eosin stained frontal section of the heart from a 37-day human embryo, where tissues affected by the four networks are marked: AS (developing atrial septum), EC (endocardial cushions, which are anatomical precursors to the atrioventricular valves), VT (developing ventricular trabeculae), and OFT (developing outflow tract).

In this Slide we see an overview of the modular organization of heart development: (A) Protein interaction networks are plotted at the resolution of functional modules. Each module is color coded according to functional assignment as determined by literature curation. The number of proteins in each module is proportional to the area of its corresponding node. Edges indicate direct (lines) or indirect (dotted lines) interactions between proteins from the relevant modules. (B) Recycling of functional modules during heart development. The bars represent functional modules and recycling is indicated by arrows. The bars follow the color code of (A) and the height of the bars represents the number of proteins in each module, as shown on the left on the y axis (Lage et al., 2010).

Note: Phenotype = an organism's observable characteristics (traits), e.g. morphology, biochemical/physiological properties, behaviour, etc. Phenotypes result from the expression of an organism's genes as well as the influence of environmental factors and the interactions between them. Genotype = the inherited instructions an organism carries within its genetic code.

Diseases (e.g. obesity, diabetes, atherosclerosis etc.) result from multiple genetic and environmental factors, and importantly, interactions between genetic and environmental factors. This Slide shows the vast networks of molecular interactions. It can be seen that the gastrointestinal (GI) tract, vasculature, immune system, heart and brain are all potentially involved in either the onset of diseases such as atherosclerosis or in comorbidities such as myocardial infarction and stroke brought on by such diseases. Further, the risks of comorbidities for diseases such as atherosclerosis are increased by other diseases, such as hypertension, which may, in turn, involve other organs, such as the kidney. The role that each organ and tissue type plays in a given disease is largely determined by genetic background and environment, where different perturbations to the genetic background (perturbations corresponding to DNA variations that affect gene function, which, in turn, leads to disease) and/or environment (changes in diet, levels of stress, level of activity, and so on) define the subtypes of disease manifested in any given individual. Although the physiology of diseases such as atherosclerosis is beginning to be better understood, what have not been fully exploited to date are the vast networks of molecular interactions within the cells.

We see clearly in the Slide that there is a diversity of molecular networks functioning in any given tissue, including genomics networks, networks of coding and noncoding RNA, protein interaction networks, protein state networks, signaling networks, and networks of metabolites. Further, these networks are not acting in isolation within each cell, but instead interact with one another to form complex, giant molecular networks within and between cells that drive all activity in the different tissues, as well as signaling between tissues. Variations in DNA and environment lead to changes in these molecular networks, which, in turn, induce complicated physiological processes that can manifest as disease.

Despite this vast complexity, the classic approach to elucidating genes that drive disease has focused on single genes or single linearly ordered pathways of genes thought to be associated with disease. This narrow approach is a natural consequence of the limited set of tools that were available for querying biological systems; such tools were not capable of enabling a more holistic approach, resulting in the adoption of a reductionist approach to teasing apart pathways associated with complex disease phenotypes. The emerging view is that complex biological systems are best modeled as highly modular, fluid systems exhibiting a plasticity that allows them to adapt to a vast array of conditions; the history of science demonstrates that this view, although long the ideal, was never within reach, given the unavailability of tools adequate to carrying out this type of research. The explosion of large-scale, high-throughput technologies in the biological sciences over the past 15 to 20 years has motivated a rapid paradigm shift away from reductionism in favor of a systems-level view of biology (Schadt & Lum, 2006).

The three main types of biological networks: (a) a transcriptional regulatory network has two components: transcription factors (TF) and target genes (TG), where a TF regulates the transcription of TGs; (b) protein-protein interaction networks: two proteins are connected if there is a docking between them; (c) a metabolic network is constructed considering the reactants, chemical reactions and enzymes.

The extreme complexity of the E. coli transcriptional regulatory network. In this graphical representation, nodes are genes, and edges represent regulatory interactions. The network was reconstructed using data from the RegulonDB (Salgado et al., 2006). This figure highlights the extreme complexity in regulatory networks. To obtain a deeper understanding of regulatory complexity, scientists must first discover biologically relevant organizational principles to unravel the hidden architecture governing these networks (see Nature Education: http://www.nature.com/scitable/content/the‐extreme‐complexity‐of‐the‐e‐coli‐14457504)

The complexity of organisms arises as a consequence of elaborate regulation of gene expression rather than from differences in genetic content in terms of the number of genes. The transcription network is a critical system that regulates gene expression in a cell. Transcription factors (TFs) respond to changes in the cellular environment, regulating the transcription of target genes (TGs) and connecting functional protein interactions to the genetic information encoded in inherited genomic DNA in order to control the timing and sites of gene expression during biological development. The interactions between TFs and TGs can be represented as a directed graph: The two types of nodes (TF and TG) are connected by arcs (see → Slide 5-31, arrows) when regulatory interaction occurs between regulators and targets.

Transcriptional regulatory networks display interesting properties that can be interpreted in a biological context to better understand the complex behavior of gene regulatory networks. At a local network level, these networks are organized in substructures such as motifs and modules. Motifs represent the simplest units of a network architecture required to create specific patterns of inter-regulation between TFs and TGs. The three most common types of motifs found in gene regulatory networks are: (1) single input, (2) multiple input and (3) feed-forward loop. Target genes belonging to the same single and multiple input motifs tend to be co-expressed, and the level of co-expression is higher when multiple transcription factors are involved. Modularity in the regulatory networks arises from groups of highly connected motifs that are hierarchically organized, in which modules are divided into smaller ones. The evolution of gene regulatory networks mainly occurs through extensive duplication of transcription factors and target genes with inheritance of regulatory interactions from ancestral genes, while the evolution of motifs does not show common ancestry but is a result of convergent evolution (Costa, Rodrigues & Cristino, 2008).
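Since the regulatory network is a directed graph, a motif such as the feed-forward loop can be found by a simple search over arcs. The sketch below assumes the usual reading of a feed-forward loop (X regulates Y, and both X and Y regulate Z); the node names TF1, TF2, G1 and G2 are hypothetical and not taken from RegulonDB.

```python
def feed_forward_loops(arcs):
    """Return all triples (x, y, z) with arcs x->y, y->z and x->z."""
    succ = {}
    for u, v in arcs:
        succ.setdefault(u, set()).add(v)
    loops = set()
    for x, targets in succ.items():
        for y in targets:
            for z in succ.get(y, ()):
                # z must also be a direct target of x; skip self-regulation
                if z in targets and len({x, y, z}) == 3:
                    loops.add((x, y, z))
    return loops

# Hypothetical regulatory arcs (regulator, target)
arcs = [("TF1", "TF2"), ("TF1", "G1"), ("TF2", "G1"), ("TF2", "G2")]
print(feed_forward_loops(arcs))  # {('TF1', 'TF2', 'G1')}
```

Counting how often such a pattern occurs, compared with randomized versions of the same network, is the usual basis for calling it a motif.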

The interactions between proteins are essential to keep the molecular systems of living cells working properly. Protein-protein interaction (PPI) is important for various biological processes such as cell-cell communication, the perception of environmental changes, protein transport and modification. Complex network theory is suitable to study protein-protein interaction maps because of its universality and integration in representing complex systems. In complex network analysis each protein is represented as a node and the physical interactions between proteins are indicated by the edges in the network. Many complex networks are naturally divided into communities or modules, where links within modules are much denser than those across modules (e.g. human individuals belonging to the same ethnic groups interact more than those from different ethnic groups). Cellular functions are also organized in a highly modular manner, where each module is a discrete object composed of a group of tightly linked components and performs a relatively independent task. It is interesting to ask whether this modularity in cellular function arises from modularity in molecular interaction networks such as the transcriptional regulatory network and the PPI network.

The Slide shows a hypothetical protein complex (A). Binary protein-protein interactions (PPI) are depicted by direct contacts between proteins. Although five proteins (A, B, C, D, and E) are identified through the use of a bait protein (red), only A and D directly bind to the bait. (B) shows the true PPI network topology of the protein complex. (C) depicts the PPI network topology of the protein complex inferred by the "matrix" model, where all proteins in a complex are assumed to interact with each other. Finally, (D) demonstrates the PPI network topology of the protein complex inferred by the "spoke" model, where all proteins in a complex are assumed to interact with the bait, but no other interactions are allowed (Wang & Zhang, 2007).
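The difference between the two inference models can be made concrete by generating the edge sets they imply for one complex. This is a minimal sketch with the bait and the five proteins A to E from the Slide; the function names are illustrative, and the true topology (B) is not encoded here because the text only states that A and D bind the bait directly.

```python
from itertools import combinations

def matrix_model(bait, preys):
    """Matrix model: every pair of proteins in the complex is assumed to interact."""
    members = [bait] + list(preys)
    return {frozenset(pair) for pair in combinations(members, 2)}

def spoke_model(bait, preys):
    """Spoke model: every co-purified protein is assumed to interact with the bait only."""
    return {frozenset((bait, p)) for p in preys}

preys = ["A", "B", "C", "D", "E"]
matrix_edges = matrix_model("bait", preys)
spoke_edges = spoke_model("bait", preys)
print(len(matrix_edges), len(spoke_edges))  # 15 5
```

For a complex with n proteins including the bait, the matrix model yields n(n-1)/2 edges and the spoke model n-1, which shows how strongly the inferred network density depends on the chosen model.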

Correlated motif mining (CMM) is the challenge to find overrepresented pairs of patterns (motifs) in sequences of interacting proteins. Algorithmic solutions for CMM thereby provide a computational method for predicting binding sites for protein interaction. The task is basically to find motifs X and Y (Figure 119) that each truly represent an overrepresented consensus pattern in the sequences of the proteins in V_X and V_Y, respectively, in order to increase the likelihood that they correspond to or overlap with a so-called binding site: a site on the surface of the molecule that makes interactions between proteins from V_X and V_Y possible through a molecular lock-and-key mechanism. We call {X, Y} a (k_x, k_y, k_xy)-motif pair of a PPI network G = (V, E, λ) if |V_X| = k_x, |V_Y| = k_y and |V_X ∩ V_Y| = k_xy. It is called complete if all vertices from V_X are connected with all vertices from V_Y (Boyen et al., 2011).
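The definition of a (k_x, k_y, k_xy)-motif pair can be checked mechanically on a toy network. The sketch below makes several assumptions that go beyond the text: λ is modeled as a dictionary from protein to sequence, a motif is a short string in which '.' matches any residue, V_X is the set of proteins whose sequence contains X, and a protein lying in both V_X and V_Y is not required to interact with itself. The sequences and interactions are invented.

```python
def matches(motif, sequence):
    """True if the motif occurs in the sequence; '.' matches any single residue."""
    m = len(motif)
    return any(
        all(c == "." or c == s for c, s in zip(motif, sequence[i:i + m]))
        for i in range(len(sequence) - m + 1)
    )

def motif_pair(seqs, edges, x, y):
    """Return ((k_x, k_y, k_xy), complete) for the motif pair {x, y}."""
    vx = {p for p, s in seqs.items() if matches(x, s)}
    vy = {p for p, s in seqs.items() if matches(y, s)}
    complete = all(u == v or frozenset((u, v)) in edges
                   for u in vx for v in vy)
    return (len(vx), len(vy), len(vx & vy)), complete

# Hypothetical labeled PPI network: protein -> sequence, plus interaction edges
seqs = {"P1": "MKLAV", "P2": "GKLAT", "P3": "TTWCY", "P4": "AAWCG"}
edges = {frozenset(e) for e in [("P1", "P3"), ("P1", "P4"),
                                ("P2", "P3"), ("P2", "P4")]}
print(motif_pair(seqs, edges, "KLA", "WC."))  # ((2, 2, 0), True)
```

Removing any one of the four interactions makes the same motif pair incomplete, which is the situation a search for high-support motif pairs has to score.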

In genetics, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance. For proteins, a sequence motif is distinguished from a structural motif, a motif formed by the three-dimensional arrangement of amino acids, which may not be adjacent. In a chain-like biological molecule, such as a protein or nucleic acid, a structural motif is a supersecondary structure, which appears also in a variety of other molecules. Motifs do not allow us to predict the biological functions because they are found in proteins and enzymes with dissimilar functions. Network motifs are connectivity patterns (sub-graphs) that occur much more often than they do in random networks. Most networks studied in biology, ecology and other fields have been found to show a small set of network motifs; surprisingly, in most cases the networks seem to be largely composed of these network motifs, occurring again and again.

The general steepest ascent algorithm with abstract neighbor function applied to CMM (SA-CMM).

Since the decision problem associated with CMM is in NP, we can efficiently check if a motif pair has higher support than another, which makes it possible to tackle CMM as a search problem in the space of all possible (l,d)-motif pairs. If we add the assumption that similar motifs can be expected to get similar support, it has the typical form of a combinatorial optimization problem. In combinatorial optimization, the objective is to find a point in a discrete search space which maximizes a user-provided function f. A number of heuristic algorithms called metaheuristics are known to yield stable results, e.g. the steepest ascent algorithm (Aarts & Lenstra, 1997), illustrated as pseudocode in the Slide.
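The generic scheme can be sketched independently of CMM. In the following illustration the search space is the integer grid, the neighbor function moves one step along one coordinate, and f is an arbitrary toy objective; in SA-CMM the points would instead be motif pairs and f their support, neither of which is reproduced here.

```python
def steepest_ascent(start, neighbors, f):
    """Move to the best neighbor until no neighbor improves f (a local maximum)."""
    current = start
    while True:
        best = max(neighbors(current), key=f, default=None)
        if best is None or f(best) <= f(current):
            return current
        current = best

def grid_neighbors(point):
    """Abstract neighbor function for the toy example: the four adjacent grid points."""
    x, y = point
    return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]

def toy_objective(point):
    """Toy function with a single maximum at (3, -1)."""
    x, y = point
    return -(x - 3) ** 2 - (y + 1) ** 2

print(steepest_ascent((0, 0), grid_neighbors, toy_objective))  # (3, -1)
```

Because the algorithm stops at the first point without a better neighbor, on objectives with several local maxima the result depends on the starting point, which is why restarts or other metaheuristics are commonly combined with it.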

Metabolism is primarily determined by genes, environment and nutrition. It consists of chemical reactions catalyzed by enzymes to produce essential components such as amino acids, sugars and lipids, and also the energy necessary to synthesize and use them in constructing cellular components. Since the chemical reactions are organized into metabolic pathways, in which one chemical is transformed into another by enzymes and co-factors, such a structure can be naturally modeled as a complex network. In this way, metabolic networks are directed and weighted graphs, whose vertices can be metabolites, reactions and enzymes, with two types of edges that represent mass flow and catalytic reactions. One widely considered catalogue of metabolic pathways available on-line is the Kyoto Encyclopedia of Genes and Genomes (KEGG). In the Slide we see a simple metabolic network involving five metabolites M1-M5 and three enzymes E1-E3, of which the last catalyzes an irreversible reaction (Hodgman, French & Westhead, 2010).

Such metabolic structures can be very large, as can be seen in this Slide. The enzyme-coding genes under the control of TrmB (the Thermococcus regulator of maltose binding, which acts as a repressor for genes encoding glycolytic enzymes and as an activator for genes encoding gluconeogenic enzymes) are included in the metabolic pathways shown in the Slide (13 are unique to archaea and 35 are conserved across species from all three domains of life). Integrated analysis of the metabolic and gene regulatory network architecture reveals various interesting scenarios (Schmid et al., 2009).

Electronic patient records (EPR) remain an unexplored, but rich data source for discovering e.g. correlations between diseases. Roque et al. (2011) describe a general approach for gathering phenotypic descriptions of patients from medical records in a systematic and non-cohort dependent manner: By extracting phenotype information from the "free-text" (= unstructured information) in such records, they demonstrated that they can extend the information contained in the structured record data, and use it for producing fine-grained patient stratification and disease co-occurrence statistics. Their approach uses a dictionary based on the International Classification of Diseases (ICD-10) ontology and is therefore in principle language independent. As a use case they show how records from a Danish psychiatric hospital lead to the identification of disease correlations, which subsequently can be mapped to systems biology frameworks.

Disease-disease correlations. Heatmap of the most significant 100 ICD10 codes, based on ranking the list of 802 candidate pairs by their comorbidity scores. Chapter colors are highlighted next to the ICD10 codes. Diseases that often occur together have red color in the heatmap, while those with lower than expected co-occurrence are colored blue. The color label shows the log2 change of comorbidity between two diseases when compared to the expected level. doi:10.1371/journal.pcbi.1002141.g002

Roque et al. (2011) have used text mining to automatically extract clinically relevant terms from 5543 psychiatric patient records and mapped these to disease codes in the ICD10. They clustered patients together based on the similarity of their profiles. The result is a patient stratification based on more complete profiles than the primary diagnosis, which is typically used. Figure 124 illustrates the general approach to capture correlations between different disorders. Several clusters of ICD10 codes relating to the same anatomical area or type of disorder can be identified along the diagonal of the heatmap, ranging from trivial correlations (e.g., different arthritis disorders), to correlations of cause and effect codes (e.g., stroke and mental/behavioural disorders), to social and habitual correlations (e.g. drug abuse, liver diseases and HIV).
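The idea of a log2 change relative to the expected level can be illustrated with a simple observed/expected ratio. The sketch below assumes that the expected co-occurrence of two codes is the count implied by statistical independence; this is an illustrative simplification and not necessarily the comorbidity score used by Roque et al. The eight patient records are invented, and the ICD10 codes serve only as labels.

```python
import math
from collections import Counter
from itertools import combinations

def comorbidity_log2(patients):
    """log2(observed / expected) co-occurrence for every pair of codes seen together.

    patients is a list of sets of codes; expected = n_a * n_b / N assumes independence.
    """
    n = len(patients)
    single = Counter(code for codes in patients for code in codes)
    pairs = Counter(pair for codes in patients
                    for pair in combinations(sorted(codes), 2))
    return {
        (a, b): math.log2(observed / (single[a] * single[b] / n))
        for (a, b), observed in pairs.items()
    }

# Hypothetical records: each set holds the codes assigned to one patient
patients = [
    {"F10", "K70"}, {"F10", "K70"}, {"F10"}, {"F10"},
    {"M05"}, {"M05"}, {"M05"}, {"M05"},
]
print(comorbidity_log2(patients))  # {('F10', 'K70'): 1.0}
```

A positive value corresponds to the red cells of the heatmap (more co-occurrence than expected) and a negative value to the blue cells; pairs that never co-occur are not scored by this sketch.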

Homology (plural: homologies) originates from the Greek ὁμολογέω (homologeo), meaning “to conform” (German: übereinstimmen). The term has its origins in biology and anthropology, where it is used for a correspondence of structures in two life forms with a common evolutionary origin (Darwin, 1859). In chemistry it is used for the relationship between the elements in the same group of the periodic table, or between organic compounds in a homologous series. In mathematics, homology is a formalism for talking in a quantitative and unambiguous manner about how a space is connected (Edelsbrunner & Harer, 2010). Basically, homology is a concept that is used in many branches of algebra and topology. Historically, the term was first used in a topological sense by Henri Poincaré.

In bioinformatics, homology modelling is a mature technique that can be used to address many problems in molecular medicine. It is one of the most efficient methods to predict protein structures. With the increase in the number of medically relevant protein sequences, resulting from automated sequencing in the laboratory, and in the fraction of all known structural folds, homology modelling will be even more important to personalized and molecular medicine in the future. In homology modelling, a protein sequence with an unknown structure (the target) is aligned with one or more protein sequences with known structures (the templates). The method is based on the principle that homologous proteins have similar structures. The prerequisite for successful homology modelling is a detectable similarity between the target sequence and the template sequences (more than 30%), allowing the construction of a correct alignment. Homology modelling is a knowledge-based structure prediction relying on observed features in known homologous protein structures. By exploiting this information from template structures, the structural model of the target protein can be constructed (Wiltgen & Tilz, 2009).

Two well-known homology modelling programs, which are free for academic research, are MODELLER (http://salilab.org/modeller) and SWISS-MODEL (http://swissmodel.expasy.org). The slide shows the comparison of two proteins: the sequences of both proteins are about 95% (53 of 56) identical (only residues 20, 30 and 45 differ), yet the structures are totally different.
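
The sequence identity quoted above can be computed directly from two aligned sequences. The following is a minimal sketch: the 56-residue sequences are hypothetical (they are not the proteins on the slide), and identity is taken as matches divided by alignment length, which is only one of several conventions in use.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of identical, non-gap positions in two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    return 100.0 * matches / len(seq_a)

# Hypothetical 56-residue sequence built from the 20 amino acid letters.
base = ("ACDEFGHIKLMNPQRSTVWY" * 3)[:56]

# Mutate the three positions named in the text (1-based: 20, 30 and 45).
residues = list(base)
for pos in (20, 30, 45):
    residues[pos - 1] = "A" if base[pos - 1] != "A" else "C"
variant = "".join(residues)

identity = percent_identity(base, variant)
print(f"{identity:.1f}% identical")

# Rule of thumb from the text: a template needs more than 30% identity.
print("usable as template:", identity > 30.0)
```

With 53 of 56 positions identical the value is 94.6%, which rounds to the 95% stated above. Passing the 30% threshold says nothing about whether the structures really agree, which is exactly the point of the two proteins on the slide.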

All the areas we have touched on in this lecture are extremely important for the concept of personalized and molecular medicine and will keep us busy for the next decades. Data mining is perhaps the most central and most important computational subject in this respect.

All these approaches are producing gigantic amounts of highly complex data sets!

See the recent article in Science: doubling of data in proteomics every 18 months.

My DEDICATION is to make data valuable … Thank you!

The Klein bottle is the symbol for geometry and topology.

Topological data analysis (TDA) is a fast-growing branch of applied mathematics and of enormous importance for data mining and knowledge discovery, particularly from large, high-dimensional, incomplete and noisy dirty data.

http://psychology.wikia.com/wiki/Information_retrieval

Network motifs in integrated molecular networks represent functional relationships between distinct data types. They aggregate to form dense topological structures corresponding to functional modules which cannot be detected by traditional graph clustering algorithms.

http://www.nature.com/nri/journal/v3/n10/fig_tab/nri1200_F2.html

http://www.maa.org/cvm/1998/01/tprppoh/article/Pictures/KleinBottle.gif

Nesting = recursion, subroutines, information hiding,

At the top of Figure 39 we see a sample XML document describing genes involved in long-term memory of a sample specimen, Drosophila melanogaster. Nested within the gene elements are sub-elements related to the parent. The first gene includes two nucleic acid sequences, a protein product, and a functional annotation. Additional information is provided by attributes, such as the organism. This example illustrates the difficulty of modeling many-to-many relationships, such as the relationship between genes and functions. Information about functions must be repeated under each gene with that function. If we invert the nesting (i.e., nest genes inside function elements), then we must repeat information about genes with more than a single function. At the bottom of Figure 39 we see the same information about genes, but using RDF and OWL. Both genes are instances of the class FlyGene, which has been defined as the set of all Genes for the organism D. melanogaster. The functional information is represented using a hierarchical taxonomy, in which Long-Term Memory is a subclass of Memory (Louie et al., 2007).
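
The difference between the nested and the graph-based representation can be sketched in a few lines. The XML below is not the content of Figure 39: element names, attribute names and the gene names geneA and geneB are hypothetical, and plain Python tuples stand in for RDF triples (no RDF library is used).

```python
import xml.etree.ElementTree as ET

# Hypothetical nested XML: the function is repeated under every gene that has it.
xml_text = """
<genes organism="Drosophila melanogaster">
  <gene name="geneA">
    <function>Long-Term Memory</function>
  </gene>
  <gene name="geneB">
    <function>Long-Term Memory</function>
    <function>Learning</function>
  </gene>
</genes>
"""

root = ET.fromstring(xml_text)

# Flatten the tree into (subject, predicate, object) triples.
# In the set, each function is a single node shared by all its genes.
triples = set()
for gene in root.findall("gene"):
    name = gene.get("name")
    triples.add((name, "rdf:type", "FlyGene"))
    for function in gene.findall("function"):
        triples.add((name, "hasFunction", function.text))

# Taxonomy statement taken from the text above.
triples.add(("Long-Term Memory", "rdfs:subClassOf", "Memory"))

def genes_with_function(statements, function):
    """All subjects linked to the given function, in sorted order."""
    return sorted(s for s, p, o in statements if p == "hasFunction" and o == function)

print(genes_with_function(triples, "Long-Term Memory"))
```

In the triple form the many-to-many relationship can be queried from either side (genes of a function, functions of a gene) without inverting any nesting.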

Macroscopic structure: star cluster M30. Let us look into the macroscopic area first and look for some similarities … This is the globular star cluster M30 (NGC 7099), which includes some 100,000 stars, has a diameter of about 100 light-years, and is approx. 40,000 light-years away from Earth. Look at the structure, look at the similarity, and consider the time: by the time our eyes see this structure, it might already have vanished (Darwin Channel).

From these large macroscopic structures to tiny microscopic structures: here is X-ray crystallography, which is a standard method to analyse the arrangement of objects (atoms, molecules) within a crystal structure. This data contains the mean positions of the entities within the substance, their chemical relationship, and various others … and the data is stored, for example in the case of a protein structure, in the Protein Data Bank (PDB). This database contains vast amounts of data. If a medical professional looks at the data, he or she sees only lengthy tables of numbers …
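
To give an impression of these “lengthy tables of numbers”, here is a minimal sketch of reading atom positions from PDB-style ATOM records. The three atoms and their coordinates are hypothetical; the records are written with a format string so that the fixed columns are explicit (coordinates in columns 31 to 54), and the atom-name alignment is simplified compared with real PDB files.

```python
# Hypothetical atoms: serial, atom name, residue, chain, residue number, x, y, z.
atoms = [
    (1, "N", "MET", "A", 1, 27.340, 24.430, 2.614),
    (2, "CA", "MET", "A", 1, 26.266, 25.413, 2.842),
    (3, "C", "MET", "A", 1, 26.913, 26.639, 3.531),
]

# Fixed-column ATOM record: serial in columns 7-11, atom name in 13-16,
# residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54.
TEMPLATE = "ATOM  {:5d} {:<4s} {:>3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}"
records = [TEMPLATE.format(*atom) for atom in atoms]

def parse_atom(line):
    """Extract name, residue, chain and coordinates by column position."""
    return {
        "name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }

parsed = [parse_atom(line) for line in records if line.startswith("ATOM")]

# Mean position of the parsed atoms.
centroid = tuple(sum(p["xyz"][i] for p in parsed) / len(parsed) for i in range(3))

for line in records:
    print(line)
print("centroid:", tuple(round(c, 3) for c in centroid))
```

Parsing by column position rather than by splitting on whitespace is the safer choice for this format, because neighbouring fields may touch without a separating space.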

Structures! This is now our keyword. When we talk about structures, we will see some really interesting aspects of structures. A good example of a data-intensive and highly complex microscopic structure is a yeast protein network. Note: yeasts (German: Hefe) are eukaryotic micro-organisms (fungi) with currently 1,500 known species, estimated to be only 1% of all yeast species. Yeasts are unicellular, typically measuring 4 µm in diameter. In this picture you can see the first protein interaction network (published by Jeong et al., 2001). The nodes are the proteins. The links are the physical interactions (bindings). The red nodes are proteins whose removal is lethal to the organism, the green ones are non-lethal and the yellow ones are not yet known. You may ask whether this structure is useful. Well, what we get out of this yeast is something which some of us may really like: Prost! The problem with such structures is that they are very big and that there are so many! Knowledge management can help to discover such unknown structures amongst the enormous set of uncharacterized data. We will come back to such structural homologies later. Now let us take a closer look at what knowledge management can do for us.

When thinking about data, we should always keep two fundamental physical aspects in mind: time-related aspects (e.g., entropy of data) and space-related aspects (e.g., topology of data).

http://www.youtube.com/watch?v=oBkOYQ02chs TEDxWarwick 2010, Roger Penrose in Space-Time Geometry.
http://www.youtube.com/watch?v=aSz5BjExs9o Visualizing Eleven Dimensions

Clouds of data. Very often, data is represented as an unordered sequence of points in a Euclidean n-dimensional space E^n. Data coming from an array of sensor readings in an engineering testbed, from questionnaire responses in a psychology experiment, or from population sizes in a complex ecosystem all reside in a space of potentially high dimension. The global ‘shape’ of the data may often provide important information about the underlying phenomena which the data represents. One type of data set for which global features are present and significant is the so-called point cloud data coming from physical objects in 3-d. Touch probes, point lasers, or line lasers sweep a suspended body and sample the surface, recording coordinates of anchor points on the surface of the body. The cloud of such points can be quickly obtained and used in a computer representation of the object. A temporal version of this situation is to be found in motion-capture data, where geometric points are recorded as time series. In both of these settings, it is important to identify and recognize global features: where is the index finger, the keyhole, the fracture?
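
One very simple global feature of a point cloud is the number of connected pieces it falls into at a given scale. The sketch below is an illustration under stated assumptions, not a method from the lecture: two points are linked if they are at most eps apart, the components are counted with a union-find structure, and the five points are hypothetical.

```python
import math

def connected_components(points, eps):
    """Count components of the graph that links points at most eps apart."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)  # merge the two components

    return len({find(i) for i in range(len(points))})

# Hypothetical cloud: two well-separated groups of points in 3-d.
cloud = [(0, 0, 0), (0.5, 0, 0), (0, 0.5, 0), (5, 5, 5), (5.5, 5, 5)]

for eps in (0.1, 1.0, 10.0):
    print(eps, connected_components(cloud, eps))
```

Watching how this count changes as eps grows (here from five isolated points, to two groups, to a single piece) is the basic idea behind the scale-dependent view of shape used in topological data analysis.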

Network metrics: a = order, b = clustering coefficient, c = path length, d = centrality, e = nodal degree, f = modularity
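
Four of these metrics (order, nodal degree, clustering coefficient, path length) can be computed with the standard library alone. The sketch below uses a hypothetical five-node undirected graph and assumes the graph is connected; centrality and modularity are left out to keep it short.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected network given as an edge list.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

def degree(g, node):
    """Nodal degree: number of direct neighbours."""
    return len(g[node])

def clustering_coefficient(g, node):
    """Fraction of neighbour pairs that are themselves connected."""
    neighbours = g[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in g[u])
    return 2.0 * links / (k * (k - 1))

def bfs_distances(g, source):
    """Shortest path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_path_length(g):
    """Mean shortest path length over all ordered node pairs (connected graph)."""
    total = pairs = 0
    for u in g:
        for v, d in bfs_distances(g, u).items():
            if v != u:
                total += d
                pairs += 1
    return total / pairs

print("order:", len(graph))
print("degree of C:", degree(graph, "C"))
print("clustering of C:", round(clustering_coefficient(graph, "C"), 3))
print("average path length:", round(average_path_length(graph), 3))
```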

http://www.google.com/patents/US6384826

Representative examples of disease complexes are displayed. Diseases are associated with tissues by using our disease-tissue matrix, and expression data are from the GNF data set. The expression levels of complexes are shown as z scores. If a disease is associated with more than 3 tissues, only the 3 most associated tissues are shown for clarity. In a given complex, proteins relevant to the disease in question are yellow. The figure shows the general tendency of overexpression of the complexes in the tissues in which they are involved in pathology compared with their expression level in other tissues. All members of the complexes can be seen in
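
As a reminder of what a z score expresses, here is a minimal sketch. How the z scores in the figure were computed is not stated in these notes; the sketch assumes standardization of one complex's expression level across tissues, and the tissue values are hypothetical.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical, z scores are undefined")
    return [(v - mu) / sigma for v in values]

# Hypothetical expression levels of one protein complex in four tissues.
expression = {"heart": 9.0, "liver": 3.0, "brain": 3.0, "kidney": 5.0}

z_by_tissue = dict(zip(expression, z_scores(list(expression.values()))))
for tissue, z in z_by_tissue.items():
    print(f"{tissue}: {z:+.2f}")
```

A clearly positive z score in the tissue affected by the disease, as for "heart" in this toy example, is the kind of overexpression pattern the figure describes.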

Three-dimensional structure of ventricular muscle basket weave, coronary arterial tree, and pacemaker and conduction system. One of the central challenges of cell-based therapy for regenerating specific heart components is guiding transplanted cells into a functional syncytium with the existing three-dimensional architecture. Transplanted cells must make functional connections with neighboring specialized heart cells to result in a net gain of global function. Transplanted myogenic progenitors, for example, must align with and integrate into the existing ventricular muscle basket weave to allow synchronous contraction and relaxation of graft and host myocardium. Integration of pacemaker and conduction system progenitors into the appropriate tissue type is necessary to generate a biological pacemaker and avoid cardiac arrhythmia. For example, having a transplanted heart muscle progenitor integrate into the conduction system might have arrhythmogenic consequences, as would the introduction of cells with independent pacemaker potential in the heart. Similarly, cell-based therapies to promote coronary collateral formation or neo-arteriogenesis require functional integration of transplanted cells with the host coronary arterial tree.
