A. Holzinger LV709 - human-centered.ai...2015/11/05  · The Elements of Statistical Learning: Data...

Status as of 08.11.2015 10:00

Dear Students, welcome to the 5th lecture of our course. Please remember from the last lecture the basic architecture of a hospital information system; the complexity of medical workflows; the challenges of data integration, data fusion and data curation; the building blocks of hospital information systems (databases, data warehouses, data marts); and the difference between knowledge discovery and information retrieval. Please also remember the formal description of an information retrieval model. The best-practice example is the PageRank algorithm, see: Hastie, T., Tibshirani, R. & Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer. Or have a look at the reprint paper: Brin, S. & Page, L. 2012. Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer Networks, 56, (18), 3825-3833. http://www.sciencedirect.com/science/article/pii/S1389128612003611 doi:10.1016/j.comnet.2012.10.007

Please always be aware of the definition of biomedical informatics (Medizinische Informatik): Biomedical informatics is the interdisciplinary field that studies and pursues the effective use of biomedical data, information, and knowledge for scientific inquiry, problem solving, and decision making, motivated by efforts to improve human health (and well-being).

1 WS 2015 A. Holzinger LV709.049 11.11.2015


Page 5:

In vivo (Latin for "within the living") is experimentation using a whole, living organism, as opposed to a partial or dead organism or an in vitro ("within the glass", i.e., in a test tube or Petri dish) controlled environment.

Page 9:

It is widely acknowledged in machine learning that the performance of a learning algorithm depends on both its parameters and the training data. Yet the bulk of algorithmic development has focused on adjusting model parameters without fully understanding the data that the learning algorithm is modeling. As such, algorithmic development for classification problems has largely been measured by classification accuracy, precision, or a similar metric on benchmark data sets. As most machine learning research is focused on the data set level, one is concerned with maximizing $p(h \mid t)$, where $h: X \to Y$ is a hypothesis or function mapping input feature vectors $X$ to their corresponding label vectors $Y$, and $t = \{(x_i, y_i) : x_i \in X \wedge y_i \in Y\}$ is a training set.

One of the methods for privacy-preserving data mining is anonymization, in which a record is released only if it is indistinguishable from k other entities in the data. We note that methods such as k-anonymity are highly dependent upon spatial locality in order to effectively implement the technique in a statistically robust way. In high-dimensional space the data becomes sparse, and the concept of spatial locality is no longer easy to define from an application point of view. Aggarwal, C. C. On k-anonymity and the curse of dimensionality. Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), 2005. 901-909.
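To make the effect of dimensionality on k-anonymity concrete, here is a minimal Python sketch. It is not taken from Aggarwal's paper: the records, the attribute names and the helper `is_k_anonymous` are hypothetical. It also assumes the common convention that every group of records sharing the same quasi-identifier values must have at least k members.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k records (assumed convention, see lead-in)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical toy records; attribute names are illustrative only.
records = [
    {"age_band": "30-39", "zip3": "801", "diagnosis": "flu"},
    {"age_band": "30-39", "zip3": "801", "diagnosis": "asthma"},
    {"age_band": "40-49", "zip3": "802", "diagnosis": "flu"},
    {"age_band": "40-49", "zip3": "802", "diagnosis": "diabetes"},
]

# Two attributes: each combination is shared by two records.
low_dim = is_k_anonymous(records, ["age_band", "zip3"], k=2)
# Three attributes: every record becomes unique, so the check fails.
high_dim = is_k_anonymous(records, ["age_band", "zip3", "diagnosis"], k=2)
```

Adding one more attribute is enough to make every toy record unique. This is the sparsity problem in miniature: the more dimensions are treated as identifying, the harder it becomes to find k indistinguishable records.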

Holzinger, A., Stocker, C. & Dehmer, M. 2014. Big Complex Biomedical Data: Towards a Taxonomy of Data. In: Obaidat, M. S. & Filipe, J. (eds.) Communications in Computer and Information Science CCIS 455. Berlin Heidelberg: Springer, pp. 3-18.

Page 10:

https://www.projectrhea.org/rhea/index.php/File:Complexitytable.png

P stands for "polynomial time". This is the subset of problems that are guaranteed to be solvable in an amount of time polynomial in their input length. Problems in P commonly operate on single inputs, lists, or matrices, and can occasionally apply to graphs. The typical operations they perform are mathematical operators, sorting, finding minimum and maximum values, determinants, and many others.

NP stands for "nondeterministic polynomial time". These problems are ones that can be solved in polynomial time using a nondeterministic computer. This concept is a little harder to understand, so another definition that is a consequence of the first is often used: NP problems are problems whose solutions can be checked, or "certified", in polynomial time. The output of an NP-solving program is called a certificate, and the polynomial-time program that checks the certificate for its validity is called the certification program.

NP-hard: A problem is NP-hard if it is at least as hard as the hardest problems in NP. This leads to two possibilities: either the problem is in NP and also NP-hard, or it lies outside NP.

NP-complete: This classification is the intersection of NP and NP-hard. If a problem is in NP and also NP-hard, then it is considered NP-complete. This class of problems is arguably the most interesting for its consequences on many other types of problems.
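The notion of a certificate can be illustrated with subset sum, a well-known NP-complete problem: given a list of integers and a target, is there a subset that adds up to the target? The Python sketch below is only an illustration and is not part of the lecture material. The function name and the example numbers are assumptions. It shows the certification program, not a solver: checking a proposed subset takes time linear in the input, while no polynomial-time method for finding one is known.

```python
def verify_subset_sum(numbers, target, certificate):
    """Certification program: check in polynomial time that `certificate`
    (a list of distinct, valid indices into `numbers`) selects values
    that sum to `target`."""
    if len(set(certificate)) != len(certificate):
        return False  # an index was used twice
    if any(i < 0 or i >= len(numbers) for i in certificate):
        return False  # an index is out of range
    return sum(numbers[i] for i in certificate) == target

numbers = [3, 34, 4, 12, 5, 2]                 # hypothetical instance
ok = verify_subset_sum(numbers, 9, [2, 4])     # 4 + 5
bad = verify_subset_sum(numbers, 9, [0, 1])    # 3 + 34
```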

For those who want to go deeper into complexity theory, there is excellent MIT OpenCourseWare by Erik Demaine, http://erikdemaine.org/ https://www.youtube.com/watch?v=moPtwq_cVH8 You can do some experimentation of your own via http://www.algomation.com

Page 11:

Key problems in dealing with data in the life sciences include:
• Complexity of our world
• High dimensionality (curse of dimensionality (Catchpoole et al., 2010))
• Most of the data is weakly structured and unstructured

A grand challenge in health care is the complexity of data, implicating two issues: structurization and standardization. As we have learned in lecture 2, very little of the data is structured. Most of our data is weakly structured (Holzinger, 2012). In the language of business the word "unstructured" is often used, but we have to use this word with care: unstructured would mean, in a strict mathematical sense, that we are talking about total randomness and complete uncertainty, which would mean noise, where standard methods fail or lead to the modeling of artifacts, and only statistical approaches may help. The correct term would be unmodeled data, or we should speak of unstructured information. Please mind the differences.

To the image above: Advances in genetics and genomics have accelerated the discovery-based (= hypothesis-generating) research that provides a powerful complement to the direct hypothesis-driven molecular, cellular and systems sciences. For example, genetic and functional genomic studies have yielded important insights into neuronal function and disease. One of the most exciting and challenging frontiers in neuroscience involves harnessing the power of large-scale genetic, genomic and phenotypic data sets, and the development of tools for data integration and data mining (Geschwind & Konopka, 2009).

Page 12:

Do not confuse structure with standardization (see Slide 2-9). Data can be standardized (e.g. numerical entries in laboratory reports) or non-standardized. A typical example is non-standardized text, imprecisely called "free text" or "unstructured data", in an electronic patient record (Kreuzthaler et al., 2011).

Standardized data is the basis for accurate communication. In the medical domain, many different people work at different times in various locations. Data standards can ensure that information is interpreted by all users with the same understanding. Moreover, standardized data facilitates comparability of data and interoperability of systems. It supports the reusability of the data, improves the efficiency of health care services and avoids errors by reducing duplicated efforts in data entry.

Data standardization refers to a) the data content; b) the terminologies that are used to represent the data; c) how data is exchanged; and d) how knowledge, e.g. clinical guidelines, protocols, decision support rules, checklists and standard operating procedures, is represented in the health information system (refer to IOM). Technical elements for data sharing require standardization of identification, record structure, terminology, messaging, privacy etc. The most used standardized data set to date is the International Classification of Diseases (ICD), which was first adopted in 1900 for collecting statistics (Ahmadian et al., 2011) and which we will discuss in → Lecture 3.

Non-standardized data is the majority of data and inhibits data quality, data exchange and interoperability.

Well-structured data is the minority of data and an idealistic case in which each data element has an associated defined structure, e.g. relational tables, the Resource Description Framework (RDF), or the Web Ontology Language (OWL) (see → Lecture 3). Note: Ill-structured is a term often used for the opposite of well-structured, although this term was originally used in the context of problem solving (Simon, 1973).

Semi-structured data is a form of structured data that does not conform to the strict formal structure of tables and data models associated with relational databases but contains tags or markers to separate structure and content, i.e. it is schema-less or self-describing; a typical example is a markup language such as XML (see → Lecture 3 and 4).

Weakly structured data makes up most of our data in the whole universe, whether in macroscopic (astronomy) or microscopic structures (biology); see → Lecture 5.

Non-structured data or unstructured data is an imprecise term used for information expressed in natural language when no specific structure has been defined. This is an issue for debate: text also has some structure, namely words, sentences and paragraphs. If we are very precise, unstructured data would mean that the data is completely randomized, which is usually called noise and is defined by Duda, Hart & Stork (2000) as any property of data which is not due to the underlying model but instead to randomness (either in the real world, from the sensors or the measurement procedure).

Page 13:

A look at the typical view of a hospital information system shows us the organization of well-structured data: Standardized and well-structured data is the basis for accurate communication. In the medical domain, many different people work at different times in various locations. Data standards can ensure that information is interpreted by all users with the same understanding. Moreover, standardized data facilitates comparability of data and interoperability of systems. It supports the reusability of the data, improves the efficiency of health care services and avoids errors by reducing duplicated efforts in data entry.

Remember: Data standardization refers to a) the data content; b) the terminologies that are used to represent the data; c) how data is exchanged; and d) how knowledge, e.g. clinical guidelines, protocols, decision support rules, checklists and standard operating procedures, is represented in the health information system.

Note: The opposite, i.e. non-standardized data, is the majority of data and inhibits data quality, data exchange and interoperability.

Remark: Care2x is an open source information system, see: http://care2x.org See → Lecture 10 for more details.

Page 14:

This is a medical example of semi-structured data in XML (Holzinger, 2003). The eXtensible Markup Language (XML) is a flexible text format recommended by the W3C for data exchange and derived from SGML (ISO 8879) (Usdin & Graham, 1998). XML is often classified as semi-structured; however, this is in some way misleading, as the data itself is still structured, but in a flexible rather than a static way (Forster & Vossen, 2012). Such data does not conform to the formal structure of tables and data models as, for example, in relational databases, but at least contains tags/markers to separate semantic elements and enforce hierarchies of records and fields within the data.
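As a small illustration of "structured, but in a flexible way", the following Python sketch parses a made-up patient fragment with the standard library module `xml.etree.ElementTree`. The element and attribute names (`patient`, `lab`, `note`, `allergy`) are hypothetical and are not taken from the slide. The point is that the tags impose a hierarchy, while repeated and missing elements are allowed without any table schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical patient record; element and attribute names are illustrative only.
xml_text = """
<patient id="p001">
  <name>Jane Doe</name>
  <lab test="glucose" unit="mg/dL">98</lab>
  <lab test="hemoglobin" unit="g/dL">13.5</lab>
  <note>Patient reports mild headache since Monday.</note>
</patient>
""".strip()

root = ET.fromstring(xml_text)

# Repeated elements: any number of <lab> entries may occur.
labs = {lab.get("test"): float(lab.text) for lab in root.findall("lab")}

# Free text lives next to the structured fields.
note = root.findtext("note")

# Optional elements: a missing <allergy> is not an error, just absent.
allergy = root.findtext("allergy")
```

A relational table would need a fixed column for every possible field. Here the record simply carries whatever elements exist, and the consumer decides how to treat what is missing.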

Page 15:

This example by (Rassinoux et al., 2003) shows how XML can be used in the hospital information system: The structure of any new document edited in the Patient Record (here: DPI) is based on a template defined in XML format (left). These templates play the role of DTDs or XML schemas, as they precisely define the structure and content type of each paragraph, thus validating the document at the application level. Such a structure embeds a <HEADER> and a <BODY>.

The header encapsulates the properties that are inherent to the new document and that will be useful to further classify it according to various criteria, including the patient identification, the document type, the identifier of its redactors and of the hospitalization stay or ambulatory consultation to which the document will be attached in the patient trajectory, etc.

The body encapsulates the content and is divided into two parts: The <STRUCDOC> part describes the semantic entities that compose the document. The <FULLDOC> part embeds the document itself with its page layout information, which can be stored either as a draft, a temporary text or a definitive text. This format guarantees the storage of dynamic and controlled fields for data input, thus allowing the combination of free text and structured data entry in the document. Once the document is no longer editable, it is definitively saved in RTF format. A CDATA section is used for storing the rough document whatever its format, as it allows the parser to disregard blocks of text containing characters that would otherwise be regarded as markup (Rassinoux, Lovis, Baud & Geissbuhler, 2003).

Page 16:

On top in this slide you can see a sample XML describing genes from Drosophila melanogaster involved in long-term memory. Nested within the gene elements are sub-elements related to the parent. The first gene includes two nucleic acid sequences, a protein product, and a functional annotation. Additional information is provided by attributes, such as the organism. This example illustrates the difficulty of modeling many-to-many relationships, such as the relationship between genes and functions. Information about functions must be repeated under each gene with that function. If we invert the nesting (i.e., nest genes inside function elements), then we must repeat information about genes with more than a single function. Below the XML we see the same information about genes, but using RDF and OWL. Both genes are instances of the class FlyGene, which has been defined as the set of all Genes for the organism D. melanogaster. The functional information is represented using a hierarchical taxonomy, in which Long-Term Memory is a subclass of Memory (Louie et al., 2007).

Remark: Drosophila melanogaster is a model organism and shares many genes with humans. Although Drosophila is an insect whose genome has only about 14,000 genes (about half as many as humans), a remarkable number of these have very close counterparts in humans; some even occur in the same order in the fly's DNA as in our own. This, plus the organism's more than 100-year history in the lab, makes it one of the most important models for studying basic biology and disease (see e.g. http://www.lbl.gov/Science-Articles/Archive/sabl/2007/Feb/drosophila.html).

Note: The relational data model requires preciseness: the data must be regular, complete and structured. However, in biology the relationships are mostly imprecise. Genomic medicine is extremely data intensive and there is an increasing diversity in the type of data: DNA sequence, mutation, expression arrays, haplotype, proteomic etc. In bioinformatics many heterogeneous data sources are used to model complex biological systems (Rassinoux, Lovis, Baud & Geissbuhler, 2003), (Achard, Vaysseix & Barillot, 2001). The challenge in genomic medicine is to integrate and analyze these diverse and huge data sources to elucidate physiology and in particular disease physiology. XML is suited for describing semi-structured data, including a kind of natural modeling of biological entities, because it allows features such as nesting (see Slide 5-6 on top). Still, a key limitation of XML is that it is difficult to model complex relationships; for example, there is no obvious way to represent many-to-many relationships, which are needed to model complex pathways.
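The difference between nesting and a graph-style representation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the data from the slide: the gene names `geneA` and `geneB`, the function `Olfaction` and the predicate `hasFunction` are hypothetical, and a plain set of (subject, predicate, object) tuples stands in for a real RDF store. Each gene-function link is stated once, and the subclass relation (Long-Term Memory is a subclass of Memory) is followed by a small helper.

```python
# Hypothetical triples; names are chosen only for illustration.
triples = {
    ("geneA", "rdf:type", "FlyGene"),
    ("geneB", "rdf:type", "FlyGene"),
    ("geneA", "hasFunction", "LongTermMemory"),
    ("geneB", "hasFunction", "LongTermMemory"),
    ("geneB", "hasFunction", "Olfaction"),
    ("LongTermMemory", "rdfs:subClassOf", "Memory"),
}

def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is stated."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is stated."""
    return {s for s, p, o in triples if p == predicate and o == obj}

def functions_of(gene):
    """Functions of a gene, including superclasses via rdfs:subClassOf."""
    found = set()
    frontier = objects(gene, "hasFunction")
    while frontier:
        f = frontier.pop()
        if f not in found:
            found.add(f)
            frontier |= objects(f, "rdfs:subClassOf")
    return found

genes_for_ltm = subjects("hasFunction", "LongTermMemory")
funcs_b = functions_of("geneB")
```

Both directions of the many-to-many relationship (genes of a function, functions of a gene) are answered from the same statements, so nothing has to be repeated as it would be in either nesting order of the XML.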

Page 17:

The human protein interaction network and its connection to positive selection. Proteins likely to be under positive selection are colored in shades of red (light red, low likelihood of positive selection; dark red, high likelihood) (6). Proteins estimated not to be under positive selection are in yellow, and proteins for which the likelihood of positive selection was not estimated are in white (6).

Page 18:

http://www.barabasilab.com/pubs/CCNR‐ALB_Publications/200907‐24_Science‐Decade/200907‐24_Science‐CoverImage.gif

An emerging trend in many scientific disciplines is a strong tendency toward being transformed into some form of information science. One important pathway in this transition has been via the application of network analysis. The basic methodology in this area is the representation of the structure of an object of investigation by a graph representing a relational structure. It is because of this general nature that graphs have been used in many diverse branches of science including bioinformatics, molecular and systems biology, theoretical physics, computer science, chemistry, engineering, drug discovery, and linguistics, to name just a few. An important feature of the book "Statistical and Machine Learning Approaches for Network Analysis" is to combine theoretical disciplines such as graph theory, machine learning, and statistical data analysis and, hence, to arrive at a new field to explore complex networks by using machine learning techniques in an interdisciplinary manner.

The age of network science has definitely arrived. Large-scale generation of genomic, proteomic, signaling, and metabolomic data is allowing the construction of complex networks that provide a new framework for understanding the molecular basis of physiological and pathological states. Networks and network-based methods have been used in biology to characterize genomic and genetic mechanisms as well as protein signaling. Diseases are looked upon as abnormal perturbations of critical cellular networks. Onset, progression, and intervention in complex diseases such as cancer and diabetes are analyzed today using network theory. Once the system is represented by a network, methods of network analysis can be applied to extract useful information regarding important system properties and to investigate its structure and function. Various statistical and machine learning methods have been developed for this purpose and have already been applied to networks.

Dehmer, M. & Basak, S. C. 2012. Statistical and Machine Learning Approaches for Network Analysis, Wiley Online Library.

Page 19:

The concept of network structures is fascinating, compelling and powerful, and applicable in nearly any domain at any scale. Network theory can be traced back to graph theory, developed by Leonhard Euler in 1736 (see → Slide 5-8). However, stimulated by works such as Barabási, Albert & Jeong (1999), research on complex networks has only recently been applied to biomedical informatics. As an extension of classical graph theory, see for example (Diestel, 2010), complex network research focuses on the characterization, analysis, modeling and simulation of complex systems involving many elements and connections; examples include the internet, gene regulatory networks, protein-protein networks, social relationships, the Web and many more. Attention is given not only to identifying special patterns of connectivity, such as the shortest average path between pairs of nodes (Newman, 2003), but also to the evolution of connectivity and the growth of networks, an example from biology being the evolution of protein-protein interaction networks in different species (→ Slide 5-8).

In order to understand complex biological systems, the following three key concepts need to be considered:
(i) emergence: the discovery of links between elements of a system, because the study of individual elements such as genes, proteins and metabolites is insufficient to explain the behavior of whole systems;
(ii) robustness: biological systems maintain their main functions even under perturbations imposed by the environment; and
(iii) modularity: vertices sharing similar functions are highly connected.

Network theory can largely be applied in biomedical informatics, because many tools are already available (Costa, Rodrigues & Cristino, 2008).

Page 20:

Figure 1, p. 74, preceding the yeast protein network.

A graph G(V,E) describes a structure which consists of nodes, a.k.a. vertices V, connected by a set of links called edges E, where each edge is a pair {a,b} of distinct nodes with a,b ∈ V; a ≠ b. Graphs containing cycles and/or alternative paths are referred to as networks. The vertices and edges can have a range of properties defined as colors, which may also have quantitative values, referred to as weights. In this slide we see the basic building block symbols of a biological network as used in bioinformatics. The blue dots serve as network hubs, the red block is a critical node (on a critical link), the white balls are bottlenecks, the stars are second-order hubs, etc. (Hodgman, French & Westhead, 2010).

Page 21:

In order to represent network data in computers, it is not convenient to use sets; matrices are more practical. The simplest form of graph representation is the so-called adjacency matrix. In this slide we see an undirected (left) and a directed graph and their respective adjacency matrices. If the graph is undirected, the adjacency matrix is symmetric, i.e., the elements $a_{ij} = a_{ji}$ for any $i$ and $j$.
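The symmetry property can be checked directly. The Python sketch below builds adjacency matrices as plain nested lists. The 4-node edge list is a hypothetical example, not the graph drawn on the slide, and the function names are assumptions made for illustration.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 adjacency matrix from (i, j) node-index pairs."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
        if not directed:
            a[j][i] = 1  # an undirected link is stored in both directions
    return a

def is_symmetric(a):
    """True if a[i][j] == a[j][i] for all i, j."""
    n = len(a)
    return all(a[i][j] == a[j][i] for i in range(n) for j in range(n))

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]  # hypothetical 4-node example
undirected = adjacency_matrix(4, edges)
directed = adjacency_matrix(4, edges, directed=True)
```

Read as an undirected graph, the edge list yields a symmetric matrix. Read as directed links, the same list does not, because the entry for 0 → 1 is set while the entry for 1 → 0 stays zero.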

Page 22:

This tool is a nice example of the usefulness of adjacency matrices: The InfoVis Toolkit is an interactive graphics toolkit developed by Jean-Daniel Fekete at INRIA (the French National Institute for Computer Science and Control). The toolkit implements nine types of visualization: Scatter Plots, Time Series, Parallel Coordinates and Matrices for tables; Node-Link diagrams, Icicle trees and Treemaps for trees; Adjacency Matrices and Node-Link diagrams for graphs. Node-Link visualizations come in several variants (8 for graphs and 4 for trees). There are also a number of interactive controls and information displays, including dynamic query sliders, fisheye lenses, and excentric labels. Information about the InfoVis Toolkit can be found at http://ivtk.sourceforge.net

The InfoVis Toolkit provides interactive components such as range sliders and tailored control panels required to configure the visualizations. These components are integrated into a coherent framework that simplifies the management of rich data structures and the design and extension of visualizations. Supported data structures include tables, trees and graphs. All visualizations can use fisheye lenses and dynamic labeling (Fekete, 2004).

Page 23:

Illustration of the meaning of commonly used terms. The process of digital image formation in microscopy is described in other books. Image processing takes an image as input and produces a modified version of it (in the case shown, the object contours are enhanced using an operation known as edge detection, described in more detail elsewhere in this booklet). Image analysis concerns the extraction of object features from an image. In some sense, computer graphics is the inverse of image analysis: it produces an image from given primitives, which could be numbers (the case shown), or parameterized shapes, or mathematical functions. Computer vision aims at producing a high-level interpretation of what is contained in an image. This is also known as image understanding. Finally, the aim of visualization is to transform higher-dimensional image data into a more primitive representation to facilitate exploring the data.

Page 24:

The truly multi-disciplinary field of network science has led to a wide variety of quantitative measurements of the topological characteristics of networks (Costa et al., 2007). The identification between a graph and an adjacency matrix makes all the powerful methods of linear algebra, graph theory and statistical mechanics available to us for investigating specific network characteristics:

Order (a in Slide 5-11) = total number of nodes $n$.

Size = total number of links, obtained from $\sum_i \sum_j a_{ij}$ (for an undirected graph with a symmetric adjacency matrix, this double sum counts every link twice).

Clustering coefficient (b in Slide 5-11) = the degree of concentration of the connections of the node's neighbors in a graph; it gives a measure of the local inhomogeneity of the link density, i.e. the level of connectedness of the graph. It is calculated as the ratio between the actual number $t_i$ of links connecting the neighborhood of a node (the nodes immediately connected to the chosen node) and the maximum possible number of links in that neighborhood, where $k_i$ is the degree of node $i$:

$$C_i = \frac{2 t_i}{k_i (k_i - 1)}$$

For the whole network, the clustering coefficient is the arithmetic mean:

$$C = \frac{1}{n} \sum_i C_i$$

Path length (c in Slide 5-11) = the arithmetic mean of all the distances. The characteristic path length of node $i$ provides information about how closely node $i$ is connected to all other nodes in the network and is given by the distance $d(i,j)$ between node $i$ and all other nodes $j$ in the network. The path length $l$ provides important information about the level of global communication efficiency of a network:

$$l = \frac{1}{n(n-1)} \sum_{i \neq j} d_{ij}$$

Note: Numerical methods, e.g. Dijkstra's algorithm (1959), are used to calculate the shortest paths between any two nodes in a network.
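The measures above can be computed directly from an adjacency matrix. The following Python sketch makes several assumptions that are not in the lecture notes: the 4-node undirected graph is a hypothetical example, nodes with fewer than two neighbors are given a clustering coefficient of 0, the graph is assumed to be connected, and breadth-first search replaces Dijkstra's algorithm because all links are unweighted.

```python
from collections import deque

def neighbors(a, i):
    """Indices of the nodes directly linked to node i."""
    return [j for j, x in enumerate(a[i]) if x]

def clustering(a, i):
    """C_i = 2 t_i / (k_i (k_i - 1)); 0.0 assumed when k_i < 2."""
    nb = neighbors(a, i)
    k = len(nb)
    if k < 2:
        return 0.0
    # t_i: links among the neighbors of i, each unordered pair counted once
    t = sum(a[u][v] for idx, u in enumerate(nb) for v in nb[idx + 1:])
    return 2 * t / (k * (k - 1))

def distances_from(a, s):
    """Shortest hop distances from s by breadth-first search."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in neighbors(a, u):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_length(a):
    """l = 1 / (n (n - 1)) * sum of d_ij over i != j (connected graph assumed)."""
    n = len(a)
    total = sum(d for s in range(n) for d in distances_from(a, s).values())
    return total / (n * (n - 1))

# Hypothetical undirected graph with links 0-1, 1-2, 2-0, 2-3
a = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]

order = len(a)
size = sum(map(sum, a)) // 2  # the double sum counts each undirected link twice
C = sum(clustering(a, i) for i in range(order)) / order
l = path_length(a)
```

For this toy graph the triangle 0-1-2 gives nodes 0 and 1 a clustering coefficient of 1, node 2 has 1/3 because only one of its three neighbor pairs is linked, and node 3 contributes 0 under the stated convention.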

Page 25:

Centrality (d in Slide 5-12) = the level of “betweenness centrality” of a node $i$; it indicates how many of the shortest paths between the nodes of the network pass through node $i$. A high betweenness centrality indicates that this node is important in interconnecting the nodes of the network, marking a potential hub role (refer to → Slide 5-8) of this node in the overall network.

Nodal degree (e in Slide 5-12) = number of links connecting $i$ to its neighbors. The degree of node $i$ is defined as its total number of connections: $k_i = \sum_j a_{ij}$. The degree probability distribution $P(k)$ describes the probability that a node is connected to $k$ other nodes in the network.

Modularity (f in Slide 5-12) = describes the possible formation of communities in the network, indicating how strongly groups of nodes form relatively isolated sub-networks within the full network (refer also to → Slide 5-8).
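As a rough illustration of degree, degree distribution and betweenness centrality, the following Python sketch works on an undirected, unweighted graph stored as a dictionary of neighbor sets. The betweenness computation follows Brandes' accumulation scheme, which is one common way to do it and is an assumption here, not something prescribed by the slides; the star-shaped example graph and all names are hypothetical.

```python
from collections import Counter, deque

def degrees(graph):
    """k_i: number of links of each node."""
    return {v: len(nb) for v, nb in graph.items()}

def degree_distribution(graph):
    """P(k): fraction of nodes that have exactly k links."""
    counts = Counter(len(nb) for nb in graph.values())
    return {k: c / len(graph) for k, c in sorted(counts.items())}

def betweenness(graph):
    """Unnormalized betweenness for an undirected, unweighted graph."""
    bc = {v: 0.0 for v in graph}
    for s in graph:
        order = []                              # nodes in order of discovery
        preds = {v: [] for v in graph}          # predecessors on shortest paths
        sigma = {v: 0 for v in graph}           # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):               # back-propagate dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}    # each unordered pair was seen twice

# Hypothetical star graph: one hub linked to three leaves
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
k = degrees(star)
p_k = degree_distribution(star)
b = betweenness(star)
```

In the star graph every shortest path between two leaves passes through the hub, so the hub has betweenness 3 (one per leaf pair) while the leaves have 0.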

Regular network (a in Slide 5-13): has a local character, characterized by a high clustering coefficient (C in Slide 5-13) and a high path length (L, Slide 5-13). It takes a large number of steps to travel from a specific node to a node on the other end of the graph.

Random network: the other limiting case, where all connections are distributed randomly across the network; the result is a graph with a random organization (outer right in Slide 5-13). In contrast to the local character of the regular network, a random network has a more global character, with a low C and a much shorter path length L than the regular network.

Small-world network (center of Slide 5-13): an intermediate case; such networks are very robust and combine a high level of local and global efficiency. Watts & Strogatz (1998) showed that with a low probability p of randomly reconnecting a connection in the regular network, a so-called small-world organization arises. It has both a high C and a low L, combining a high level of local clustering with a still short average travel distance. Many networks in nature are small-world (e.g. the internet, protein networks, social networks, functional and structural brain networks etc.), combining a high level of segregation with a high level of global information integration. In addition, such networks can have a heavy-tailed connectivity distribution, in contrast to random networks, in which the nodes all have roughly the same number of connections.

Scale-free networks (B in Slide 5-13): are characterized by a degree probability distribution that follows a power-law function, indicating that on average a node has only a few connections, with the exception of a small number of nodes that are heavily connected. These nodes are often referred to as hub nodes (see → Slide 5-8) and they play a central role in the level of efficiency of the network, as they are responsible for keeping the overall travel distance in the network to a minimum. As these hub nodes play a key role in the organization of the network, scale-free networks tend to be vulnerable to targeted attacks on the hub nodes.

Modular networks (c in Slide 5-13): show the formation of so-called communities, consisting of a subset of nodes that are mostly connected to their direct neighbors in their community and to a lesser extent to the other nodes in the network. Such networks are characterized by a high level of modularity of the nodes.
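The transition from a regular to a random network by rewiring can be illustrated with a short Python sketch. It is a simplified, Watts-Strogatz-style procedure written under assumptions: the regular network is a ring lattice with an even number k of neighbors per node, and each link keeps its lower-numbered end and moves the other end with probability p. Function names, the seed and the parameter values are illustrative only.

```python
import random

def ring_lattice(n, k):
    """Regular ring: every node is linked to its k nearest neighbors (k even, n > k)."""
    edges = set()
    for i in range(n):
        for step in range(1, k // 2 + 1):
            edges.add(frozenset((i, (i + step) % n)))
    return edges

def rewire(edges, n, p, rng):
    """Reconnect one end of each link with probability p, avoiding duplicates."""
    result = set(edges)
    for edge in sorted(edges, key=sorted):      # fixed order for reproducibility
        if rng.random() < p:
            u = min(edge)                       # end of the link that stays in place
            candidates = [w for w in range(n)
                          if w != u and frozenset((u, w)) not in result]
            if candidates:
                result.remove(edge)
                result.add(frozenset((u, rng.choice(candidates))))
    return result

lattice = ring_lattice(20, 4)                          # p = 0: regular network
small_world = rewire(lattice, 20, 0.1, random.Random(1))   # low p: a few shortcuts
randomized = rewire(lattice, 20, 1.0, random.Random(1))    # p = 1: every link rewired
```

The number of links stays constant under rewiring, so differences in clustering and path length between the three graphs come from the wiring alone.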

There are many ways to construct a proximity graph representation from a set of data points that are embedded in $\mathbb{R}^d$. Let us consider a set of data points $\{x_1, \ldots, x_n\} \subset \mathbb{R}^d$. To each data point we associate a vertex of a proximity graph $G$ to define a set of vertices $V = \{v_1, v_2, \ldots, v_n\}$. Determining the edge set $E$ of the proximity graph $G$ requires defining the neighbors of each vertex $v_i$ according to its embedding $x_i$. Consequently, a proximity graph is a graph in which two vertices are connected by an edge iff the data points associated with the vertices satisfy particular geometric requirements. Such geometric requirements are usually based on a metric measuring the distance between two data points. A usual choice of metric is the Euclidean metric. Look at the slide:

a) is our initial set of points in the plane $\mathbb{R}^2$.
b) ε-ball graph: $v_i \sim v_j$ if $x_j \in B(v_i; \varepsilon)$.
c) k-nearest-neighbor graph (k-NNG): $v_i \sim v_j$ if the distance between $x_i$ and $x_j$ is among the k smallest distances from $x_i$ to other data points. The k-NNG is a directed graph, since one can have $x_i$ among the k nearest neighbors of $x_j$ but not vice versa.
d) Euclidean Minimum Spanning Tree (EMST) graph: a connected tree sub-graph that contains all the vertices and has a minimum sum of edge weights. The weight of the edge between two vertices is the Euclidean distance between the corresponding data points.
e) Symmetric k-nearest-neighbor graph (Sk-NNG): $v_i \sim v_j$ if $x_i$ is among the k nearest neighbors of $x_j$ or vice versa.
f) Mutual k-nearest-neighbor graph (Mk-NNG): $v_i \sim v_j$ if $x_i$ is among the k nearest neighbors of $x_j$ and vice versa. All vertices in a mutual k-NN graph have a degree upper-bounded by k, which is not usually the case with standard k-NN graphs.
g) Relative Neighborhood Graph (RNG): $v_i \sim v_j$ iff there is no vertex in $B(v_i; D(v_i, v_j)) \cap B(v_j; D(v_i, v_j))$.
h) Gabriel Graph (GG)
i) The β-Skeleton Graph (β-SG)

For details please refer to (Lézoray & Grady, 2012), or to a classical graph theory book, e.g. (Harary, 1969), (Bondy & Murty, 1976), (Golumbic, 2004), (Diestel, 2010).
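Three of the constructions listed above (ε-ball, symmetric k-NN and mutual k-NN) can be sketched directly from their definitions. The Python code below is a brute-force illustration that assumes the Euclidean metric and breaks distance ties by point index; the function names and the four example points are hypothetical.

```python
import math

def epsilon_ball_graph(points, eps):
    """Undirected edges (i, j), i < j, whose points are at most eps apart."""
    n = len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(points[i], points[j]) <= eps}

def knn_sets(points, k):
    """For each point, the indices of its k nearest neighbors (ties by index)."""
    n = len(points)
    result = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: (math.dist(points[i], points[j]), j))
        result.append(set(others[:k]))
    return result

def symmetric_knn_graph(points, k):
    """Edge if x_i is among the k nearest neighbors of x_j OR vice versa."""
    nn, n = knn_sets(points, k), len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if j in nn[i] or i in nn[j]}

def mutual_knn_graph(points, k):
    """Edge if x_i is among the k nearest neighbors of x_j AND vice versa."""
    nn, n = knn_sets(points, k), len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if j in nn[i] and i in nn[j]}

# Hypothetical points on a line: three close together, one far away
pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
ball = epsilon_ball_graph(pts, 1.5)
sym = symmetric_knn_graph(pts, 1)
mut = mutual_knn_graph(pts, 1)
```

The far-away point is connected in the symmetric graph (it has a nearest neighbor) but isolated in the mutual graph, which shows why the mutual variant bounds the degree by k.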

Slide 5-16: Graphs from Images. In this slide we see examples of a) a real image with the quadtree tessellation, b) the region adjacency graph associated with the quadtree partition, c) an irregular tessellation using image-dependent superpixels from Watershed segmentation (Vincent & Soille, 1991), d) an irregular tessellation using image-dependent SLIC superpixels (Lucchi et al., 2010); SLIC = Simple Linear Iterative Clustering.
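How a region adjacency graph is derived from a tessellation can be shown with a minimal Python sketch. It assumes the tessellation is already available as a label image (a 2D grid of region ids, however it was produced) and that regions count as adjacent when they share a pixel edge (4-connectivity). The tiny 3x3 label image is made up for illustration.

```python
def region_adjacency_graph(labels):
    """Edges between region ids that touch horizontally or vertically."""
    rows, cols = len(labels), len(labels[0])
    edges = set()
    for r in range(rows):
        for c in range(cols):
            # Looking only right and down visits every neighboring pixel pair once
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols and labels[r][c] != labels[rr][cc]:
                    a, b = labels[r][c], labels[rr][cc]
                    edges.add((min(a, b), max(a, b)))
    return edges

# Hypothetical label image with three regions
label_image = [
    [1, 1, 2],
    [1, 1, 2],
    [3, 3, 2],
]
rag = region_adjacency_graph(label_image)
```

Each region becomes one vertex and each shared border one edge, regardless of whether the regions come from a quadtree, a watershed or SLIC superpixels.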

A straightforward implementation of the original Vincent-Soille algorithm is difficult if plateaus occur. Therefore, an alternative approach was proposed by (Meijster & Roerdink, 1995), in which the image is first transformed into a directed valued graph with distinct neighbor values, called the components graph of f. On this graph the watershed transform can be computed by a simplified version of the Vincent-Soille algorithm, where FIFO queues are no longer necessary, since there are no plateaus in the graph (Roerdink & Meijster, 2000).

The original natural digital image is first transformed into grey-scale, then the Watershed algorithm is applied, and then the centroid function is calculated; the results are representative point sets in the plane.
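The last step of this pipeline, turning segmented regions into a point set, can be sketched as follows. The Python code assumes the segmentation result is available as a label image (one region id per pixel) and takes the centroid of a region to be the mean row and column of its pixels; the grey-scale conversion and the watershed itself are not shown. The label image is a made-up example.

```python
def region_centroids(labels):
    """Map each region id to the mean (row, column) of its pixels."""
    sums = {}                                   # id -> [row sum, column sum, count]
    for r, row in enumerate(labels):
        for c, region in enumerate(row):
            acc = sums.setdefault(region, [0, 0, 0])
            acc[0] += r
            acc[1] += c
            acc[2] += 1
    return {region: (sr / n, sc / n) for region, (sr, sc, n) in sums.items()}

# Hypothetical label image with three regions
label_image = [
    [1, 1, 2],
    [1, 1, 2],
    [3, 3, 2],
]
centroids = region_centroids(label_image)
```

The resulting coordinates form the representative point set on which proximity graphs or a Delaunay triangulation can then be built.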

The Delaunay Triangulation (DT): $v_i \sim v_j$ iff there is a closed ball $B(\cdot; r)$ with $v_i$ and $v_j$ on its boundary and no other vertex $v_k$ contained in it. The dual to the DT is the Voronoi irregular tessellation, where each Voronoi cell is defined by the set $\{x \in \mathbb{R}^n \mid D(x, v_k) \le D(x, v_j) \text{ for all } v_j \ne v_k\}$. In such a graph, $\forall v_i$, $\deg(v_i) = 3$ (Lézoray & Grady, 2012).
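The empty-ball condition can be checked by brute force for small planar point sets. The Python sketch below uses the equivalent triangle formulation: three points form a Delaunay triangle if their circumcircle contains no other point in its interior. This is an illustrative O(n^4) approach chosen for clarity, not an efficient algorithm; points lying exactly on a circle are tolerated, and the example coordinates are hypothetical.

```python
from itertools import combinations

def circumcircle(a, b, c):
    """Center and squared radius of the circle through a, b, c (None if collinear)."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), (ax - ux) ** 2 + (ay - uy) ** 2

def delaunay_edges(points, tol=1e-9):
    """Edges (i, j), i < j, of all triangles whose circumcircle is empty."""
    edges = set()
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        (ux, uy), r2 = circle
        empty = all((px - ux) ** 2 + (py - uy) ** 2 >= r2 - tol
                    for m, (px, py) in enumerate(points) if m not in (i, j, k))
        if empty:
            edges.update({(i, j), (i, k), (j, k)})
    return edges

# Hypothetical points: a right triangle plus one point beyond its hypotenuse
pts = [(0, 0), (4, 0), (0, 4), (5, 5)]
dt = delaunay_edges(pts)
```

In this example the points 0 and 3 are not connected, because any circle through both of them contains point 1 or point 2.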

This animation shows the construction of a Delaunay graph: First the red points on the plane are drawn, then we insert the blue edges and the blue vertices of the Voronoi graph, and finally the red edges are drawn to build the Delaunay graph (Kropatsch, Burge & Glantz, 2001).

http://oldwww.prip.tuwien.ac.at/research/research‐areas/structure‐and‐topology/graphs‐in‐image‐analysis/graphs‐in‐image‐analysis/use‐of‐graphs‐in‐image‐analysis/voronoi‐graph‐and‐delaunay‐graph

In this Slide we see the evaluated information-theoretic network measures on publication networks, here from the excellence network of RWTH Aachen University. Those measures can be understood as graph complexity measures which evaluate the structural complexity based on the corresponding concept. A possible useful interpretation of these measures helps to understand the differences in subgraphs of a cluster. For example, one could apply community detection algorithms and compare entropy measures of such detected communities. Relating these data to social measures (e.g. balanced scorecard data) of sub-communities could be used as indicators of collaboration success or lack thereof. The node size shows the node degree, whereas the node color shows the betweenness centrality; a darker color means higher centrality (Holzinger et al., 2013a).

A further example shall demonstrate the usefulness of graph theory and network analysis: This graph shows the medical knowledge space of a standard quick reference guide for emergency doctors and paramedics in the German-speaking area. It has been subsequently developed, tested in the medical real world and constantly improved for 20 years by Dr. med. Ralf Müller, emergency doctor at Graz-LKH University Hospital, and is practically in the pocket of every emergency doctor, family doctor and paramedic in the German-speaking area (Holzinger et al., 2013b).

Up to now we know that graphs and graph theory are powerful tools to map data structures and to find novel connections between single data objects (Strogatz, 2001), (Dorogovtsev & Mendes, 2003). The inferred graphs can be further analyzed by using graph-theoretical, statistical and machine learning techniques (Dehmer, Emmert-Streib & Mehler, 2011). A mapping of the already existing “knowledge space”, approved in medical practice, as a conceptual graph and the subsequent visual and graph-theoretical analysis may provide novel insights on hidden patterns in the data. Another benefit of the graph-based data structure is the applicability of methods from network topology, network analysis and data mining, e.g. the small-world phenomenon (Barabasi & Albert, 1999), (Kleinberg, 2000), and cluster analysis (Koontz, Narendra & Fukunaga, 1976), (Wittkop et al., 2011).

The graph-theoretic data of the graph seen in this Slide include: number of nodes = 641, number of edges = 1250; red are agents, black are conditions, blue are pharmacological groups, grey are other documents. The average degree of this graph = 3.888, the average path length = 4.683, the network diameter = 9.

The nodes of the sample graph represent: drugs, clinical guidelines, patient conditions (indication, contraindication), pharmacological groups, tables and calculations of medical scores, algorithms and other medical documents; and the edges represent 3 crucial types of relations inducing medical relevance between two active substances, i.e.: pharmacological groups, indications and contra-indications. The following example will demonstrate the usefulness of this approach.

This example shows us how conveniently we can find which path between two nodes is the shortest, as well as the navigation route between these nodes. Computing shortest paths is a fundamental and ubiquitous problem in network analysis. We can, e.g., apply the Dijkstra algorithm, which solves the shortest path problem for a graph with non-negative edge path costs, producing a shortest path tree. This algorithm is often used in routing and as a subroutine in other graph algorithms: For a given node, the algorithm finds the path with the lowest cost (i.e. the shortest path) between that node and every other node (Henzinger et al., 1997).
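A compact Python sketch of Dijkstra's algorithm with a binary heap illustrates how the shortest path tree is obtained. The graph is assumed to be a dictionary mapping each node to its neighbors and non-negative edge costs; the node names and costs below are abstract placeholders and do not come from the medical knowledge graph.

```python
import heapq

def dijkstra(graph, source):
    """Return (dist, parent): lowest path cost and shortest path tree from source."""
    dist = {source: 0}
    parent = {source: None}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry, a shorter path was found
        for v, cost in graph.get(u, {}).items():
            nd = d + cost                 # costs must be non-negative
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent

def path_to(parent, target):
    """Walk the shortest path tree back from target to the source."""
    path = []
    while target is not None:
        path.append(target)
        target = parent[target]
    return path[::-1]

# Hypothetical undirected graph with edge costs
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
dist, parent = dijkstra(graph, "A")
route = path_to(parent, "D")
```

The `parent` dictionary is the shortest path tree: following it backwards from any node yields the navigation route from the source.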

Here we see the relationship between Adrenaline (center black node) and Dobutamine (top left black node). Blue: pharmacological group; dark red: contraindication; light red: condition. The green nodes (from dark to light) are:
1. Application (one or more indications + corresponding dosages)
2. Single indication with additional details (e.g. “VF after 3rd shock”)
3. Condition (e.g. VF, ventricular fibrillation)

Our brain forms one integrative complex network, linking all brain regions and sub-networks together (Van Den Heuvel & Hulshoff Pol, 2010). Examining the organization of this network provides insights into how our brain works. Graph theory provides a framework in which the topology of complex networks can be examined; thus it can reveal novelties about both the local and global organization of functional brain networks. In the slide we can see how the modeling of the functional brain by a graph works: edges are the connections between regions that are functionally linked. First, the collection of nodes is defined (A); second, the existence of functional connections between the nodes in the network needs to be defined, resulting in a connectivity matrix (B). Finally, the existence of a connection between two points can be defined as whether their level of functional connectivity exceeds a certain predefined threshold (C) (Van Den Heuvel & Hulshoff Pol, 2010).
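Step (C), turning a connectivity matrix into a graph, amounts to a simple thresholding operation. The Python sketch below assumes a symmetric matrix of functional connectivity values between regions; the three-region matrix and the threshold of 0.4 are invented for illustration and carry no neuroscientific meaning.

```python
def threshold_graph(connectivity, threshold):
    """Binary adjacency matrix: 1 where connectivity exceeds the threshold."""
    n = len(connectivity)
    return [[1 if i != j and connectivity[i][j] > threshold else 0
             for j in range(n)]
            for i in range(n)]

# Hypothetical connectivity values between three regions
connectivity = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]
adjacency = threshold_graph(connectivity, 0.4)
```

The choice of threshold determines how dense the resulting graph is, so graph measures computed afterwards should be interpreted with the threshold in mind.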

Development of the human heart starts 2 weeks after fertilization, with the formation of the cardiac crescent and the subsequent formation and looping of the primitive heart tube. Insight into the biology of molecular networks is an important field, as anomalies in these systems underlie a wide spectrum of polygenic human disorders, ranging from schizophrenia to congenital heart disease (CHD). Understanding the functional architecture of networks that organize the development of organs, see e.g. (Chien, Domian & Parker, 2008), lays the foundation of novel approaches in regenerative medicine, since manipulation of such systems is necessary for the success of tissue engineering technologies and stem cell therapy. Lage et al. (2010) developed a framework for gaining new insights into the systems biology of the protein networks driving organ development and related polygenic human disease phenotypes, exemplified with heart development and CHD.

In the Slide we see examples of four functional networks driving the development of different anatomical structures in the human heart. These four networks are constructed by analyzing the interaction patterns of four different sets of cardiac development (CD) proteins corresponding to the morphological groups ‘atrial septal defects’, ‘abnormal atrioventricular valve morphology’, ‘abnormal myocardial trabeculae morphology’, and ‘abnormal outflow tract development’. CD proteins from the relevant groups are shown in orange and their interaction partners are shown in gray. Functional modules annotated by literature curation are indicated with a colored background. In the center of the Figure is a haematoxylin-eosin stained frontal section of the heart from a 37-day human embryo, where tissues affected by the four networks are marked: AS (developing atrial septum), EC (endocardial cushions, which are anatomical precursors to the atrioventricular valves), VT (developing ventricular trabeculae), and OFT (developing outflow tract).

In this Slide we see an overview of the modular organization of heart development: (A) Protein interaction networks are plotted at the resolution of functional modules. Each module is color coded according to functional assignment as determined by literature curation. The number of proteins in each module is proportional to the area of its corresponding node. Edges indicate direct (lines) or indirect (dotted lines) interactions between proteins from the relevant modules. (B) Recycling of functional modules during heart development. The bars represent functional modules and recycling is indicated by arrows. The bars follow the color code of (A) and the height of the bars represents the number of proteins in each module, as shown on the left on the y axis (Lage et al., 2010).

Note: Phenotype = an organism's observable characteristics (traits), e.g. morphology, biochemical/physiological properties, behaviour, etc. Phenotypes result from the expression of an organism's genes as well as the influence of environmental factors and the interactions between them. Genotype = the inherited instructions within an organism's genetic code.

Diseases (e.g. obesity, diabetes, atherosclerosis etc.) result from multiple genetic and environmental factors, and importantly, interactions between genetic and environmental factors. This Slide shows the vast networks of molecular interactions. It can be seen that the gastrointestinal (GI) tract, vasculature, immune system, heart and brain are all potentially involved in either the onset of diseases such as atherosclerosis or in comorbidities such as myocardial infarction and stroke brought on by such diseases. Further, the risks of comorbidities for diseases such as atherosclerosis are increased by other diseases, such as hypertension, which may, in turn, involve other organs, such as the kidney. The role that each organ and tissue type plays in a given disease is largely determined by genetic background and environment, where different perturbations to the genetic background (perturbations corresponding to DNA variations that affect gene function, which, in turn, leads to disease) and/or environment (changes in diet, levels of stress, level of activity, and so on) define the subtypes of disease manifested in any given individual.

Although the physiology of diseases such as atherosclerosis is beginning to be better understood, what have not been fully exploited to date are the vast networks of molecular interactions within the cells. We see clearly in the Slide that there is a diversity of molecular networks functioning in any given tissue, including genomics networks, networks of coding and noncoding RNA, protein interaction networks, protein state networks, signaling networks, and networks of metabolites. Further, these networks are not acting in isolation within each cell, but instead interact with one another to form complex, giant molecular networks within and between cells that drive all activity in the different tissues, as well as signaling between tissues. Variations in DNA and environment lead to changes in these molecular networks, which, in turn, induce complicated physiological processes that can manifest as disease.

Despite this vast complexity, the classic approach to elucidating genes that drive disease has focused on single genes or single linearly ordered pathways of genes thought to be associated with disease. This narrow approach is a natural consequence of the limited set of tools that were available for querying biological systems; such tools were not capable of enabling a more holistic approach, resulting in the adoption of a reductionist approach to teasing apart pathways associated with complex disease phenotypes. The emerging view is that complex biological systems are best modeled as highly modular, fluid systems exhibiting a plasticity that allows them to adapt to a vast array of conditions; the history of science demonstrates that this view, although long the ideal, was never within reach, given the unavailability of tools adequate to carrying out this type of research. The explosion of large-scale, high-throughput technologies in the biological sciences over the past 15 to 20 years has motivated a rapid paradigm shift away from reductionism in favor of a systems-level view of biology (Schadt & Lum, 2006).

The three main types of biological networks: (a) a transcriptional regulatory network has two components: transcription factors (TFs) and target genes (TGs), where a TF regulates the transcription of TGs; (b) protein-protein interaction networks: two proteins are connected if there is a docking between them; (c) a metabolic network is constructed considering the reactants, chemical reactions and enzymes.

The extreme complexity of the E. coli transcriptional regulatory network. In this graphical representation, nodes are genes, and edges represent regulatory interactions. The network was reconstructed using data from RegulonDB (Salgado et al., 2006). This figure highlights the extreme complexity in regulatory networks. To obtain a deeper understanding of regulatory complexity, scientists must first discover biologically relevant organizational principles to unravel the hidden architecture governing these networks (see Nature Education: http://www.nature.com/scitable/content/the-extreme-complexity-of-the-e-coli-14457504).

The complexity of organisms arises as a consequence of elaborate regulation of gene expression rather than from differences in genetic content in terms of the number of genes. The transcription network is a critical system that regulates gene expression in a cell. Transcription factors (TFs) respond to changes in the cellular environment, regulating the transcription of target genes (TGs) and connecting functional protein interactions to the genetic information encoded in inherited genomic DNA in order to control the timing and sites of gene expression during biological development. The interactions between TFs and TGs can be represented as a directed graph: The two types of nodes (TF and TG) are connected by arcs (see → Slide 5-31, arrows) when regulatory interaction occurs between regulators and targets.

Transcriptional regulatory networks display interesting properties that can be interpreted in a biological context to better understand the complex behavior of gene regulatory networks. At a local network level, these networks are organized in substructures such as motifs and modules. Motifs represent the simplest units of a network architecture required to create specific patterns of inter-regulation between TFs and TGs. The three most common types of motifs found in gene regulatory networks are: (1) single input, (2) multiple input and (3) feed-forward loop. Target genes belonging to the same single and multiple input motifs tend to be co-expressed, and the level of co-expression is higher when multiple transcription factors are involved. Modularity in the regulatory networks arises from groups of highly connected motifs that are hierarchically organized, in which modules are divided into smaller ones. The evolution of gene regulatory networks mainly occurs through extensive duplication of transcription factors and target genes with inheritance of regulatory interactions from ancestral genes, while the evolution of motifs does not show common ancestry but is a result of convergent evolution (Costa, Rodrigues & Cristino, 2008).
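To make the directed-graph view concrete, the following Python sketch stores regulatory arcs as a dictionary from regulator to its set of targets and searches for feed-forward loops, i.e. triples where X regulates Y, Y regulates Z, and X also regulates Z directly. The gene and factor names are placeholders, not real regulatory interactions.

```python
def feed_forward_loops(regulates):
    """Return sorted triples (x, y, z) with arcs x->y, y->z and x->z."""
    loops = []
    for x, targets_of_x in regulates.items():
        for y in targets_of_x:
            if y == x:
                continue                        # ignore self-regulation
            for z in regulates.get(y, ()):
                if z not in (x, y) and z in targets_of_x:
                    loops.append((x, y, z))
    return sorted(loops)

# Hypothetical network: TF1 regulates TF2, G1 and G2; TF2 regulates G1
regulates = {
    "TF1": {"TF2", "G1", "G2"},
    "TF2": {"G1"},
}
ffl = feed_forward_loops(regulates)
```

In the same representation, G2 with its single regulator TF1 would be part of a single input motif, while G1 with two regulators is a multiple input target.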

The interactions between proteins are essential to keep the molecular systems of living cells working properly. Protein-protein interaction (PPI) is important for various biological processes such as cell-cell communication, the perception of environmental changes, protein transport and modification. Complex network theory is suitable for studying protein-protein interaction maps because of its universality and integration in representing complex systems. In complex network analysis each protein is represented as a node and the physical interactions between proteins are indicated by the edges in the network. Many complex networks are naturally divided into communities or modules, where links within modules are much denser than those across modules (e.g. human individuals belonging to the same ethnic groups interact more than those from different ethnic groups). Cellular functions are also organized in a highly modular manner, where each module is a discrete object composed of a group of tightly linked components and performs a relatively independent task. It is interesting to ask whether this modularity in cellular function arises from modularity in molecular interaction networks such as the transcriptional regulatory network and the PPI network.

The Slide shows a hypothetical protein complex (A). Binary protein-protein interactions (PPI) are depicted by direct contacts between proteins. Although five proteins (A, B, C, D, and E) are identified through the use of a bait protein (red), only A and D directly bind to the bait. (B) shows the true PPI network topology of the protein complex. (C) depicts the PPI network topology of the protein complex inferred by the “matrix” model, where all proteins in a complex are assumed to interact with each other. Finally, (D) demonstrates the PPI network topology of the protein complex inferred by the “spoke” model, where all proteins in a complex are assumed to interact with the bait, but no other interactions are allowed (Wang & Zhang, 2007).
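The difference between the two inference models can be shown in a few lines of Python. The sketch assumes that an experiment yields one bait and a list of co-identified proteins; the labels mirror the hypothetical complex described above, and the function names are illustrative.

```python
from itertools import combinations

def spoke_model(bait, preys):
    """Every identified protein is assumed to interact with the bait only."""
    return {frozenset((bait, p)) for p in preys}

def matrix_model(bait, preys):
    """All proteins in the complex are assumed to interact with each other."""
    return {frozenset(pair) for pair in combinations([bait, *preys], 2)}

preys = ["A", "B", "C", "D", "E"]
spoke = spoke_model("bait", preys)
matrix = matrix_model("bait", preys)
```

With five identified proteins the spoke model infers 5 interactions and the matrix model 15, whereas only two direct bait interactions (with A and D) exist in the hypothetical complex; both models therefore contain false positives, the matrix model considerably more.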

Correlated motif mining (CMM) is the challenge of finding overrepresented pairs of patterns (motifs) in sequences of interacting proteins. Algorithmic solutions for CMM thereby provide a computational method for predicting binding sites for protein interaction. The goal is for the motifs X and Y (Figure 119) to truly represent an overrepresented consensus pattern in the sequences of the proteins in $V_X$ and $V_Y$, respectively, in order to increase the likelihood that they correspond to or overlap with a so-called binding site, i.e. a site on the surface of the molecule that makes interactions between proteins from $V_X$ and $V_Y$ possible through a molecular lock-and-key mechanism. We call $\{X, Y\}$ a $(k_x, k_y, k_{xy})$-motif pair of a PPI network $G = (V, E, \lambda)$ if $|V_X| = k_x$, $|V_Y| = k_y$ and $|V_X \cap V_Y| = k_{xy}$. It is called complete if all vertices from $V_X$ are connected with all vertices from $V_Y$ (Boyen et al., 2011).
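The definition of a motif pair can be illustrated with a small Python sketch. It rests on several assumptions that are not taken from the source: a motif is a short string in which the character "x" acts as a wildcard, a protein belongs to $V_X$ if its sequence contains a match of X anywhere, and self-interactions are ignored when testing completeness. The sequences and interactions are toy data.

```python
def matches(motif, sequence, wildcard="x"):
    """True if the motif occurs somewhere in the sequence (wildcard matches anything)."""
    length = len(motif)
    return any(all(m == wildcard or m == s
                   for m, s in zip(motif, sequence[i:i + length]))
               for i in range(len(sequence) - length + 1))

def selected(motif, sequences):
    """Vertex set of all proteins whose sequence contains the motif."""
    return {v for v, seq in sequences.items() if matches(motif, seq)}

def motif_pair_stats(x, y, sequences, edges):
    """Return ((k_x, k_y, k_xy), complete) for the motif pair {x, y}."""
    vx, vy = selected(x, sequences), selected(y, sequences)
    pairs = {frozenset((u, v)) for u in vx for v in vy if u != v}
    complete = all(p in edges for p in pairs)
    return (len(vx), len(vy), len(vx & vy)), complete

# Toy PPI network: sequences label the vertices, edges are interactions
sequences = {"p1": "AKLM", "p2": "GKLT", "p3": "CDEF", "p4": "QDEW"}
edges = {frozenset(e) for e in [("p1", "p3"), ("p1", "p4"), ("p2", "p3"), ("p2", "p4")]}
stats, complete = motif_pair_stats("KL", "DE", sequences, edges)
```

Here the pair {KL, DE} selects two proteins each, with no overlap, and every selected protein of one side interacts with every selected protein of the other side, so the pair is complete.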

In genetics, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance. For proteins, a sequence motif is distinguished from a structural motif, a motif formed by the three-dimensional arrangement of amino acids, which may not be adjacent. In a chain-like biological molecule, such as a protein or nucleic acid, a structural motif is a supersecondary structure, which appears also in a variety of other molecules. Motifs do not allow us to predict the biological functions, because they are found in proteins and enzymes with dissimilar functions. Network motifs are connectivity patterns (sub-graphs) that occur much more often than they do in random networks. Most networks studied in biology, ecology and other fields have been found to show a small set of network motifs; surprisingly, in most cases the networks seem to be largely composed of these network motifs, occurring again and again.

The general steepest ascent algorithm with abstract neighbor function applied to CMM (SA-CMM).

Since the decision problem associated with CMM is in NP, we can efficiently check if a motif pair has higher support than another, which makes it possible to tackle CMM as a search problem in the space of all possible (l,d)-motif pairs. If we add the assumption that similar motifs can be expected to get similar support, it has the typical form of a combinatorial optimization problem. In combinatorial optimization, the objective is to find a point in a discrete search space which maximizes a user-provided function f. A number of heuristic algorithms called metaheuristics are known to yield stable results, e.g. the steepest ascent algorithm (Aarts & Lenstra, 1997), illustrated as pseudocode in the Slide.
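The pseudocode on the slide is not reproduced in these notes, so the following Python function is only a generic sketch of steepest ascent with an abstract neighbor function, not the SA-CMM procedure itself. It assumes a finite neighborhood for every point and stops at the first local maximum; the toy objective over the integers stands in for the support of a motif pair.

```python
def steepest_ascent(start, neighbors, f):
    """Move to the best neighbor until no neighbor improves f (local maximum)."""
    current = start
    while True:
        best = max(neighbors(current), key=f, default=current)
        if f(best) <= f(current):
            return current
        current = best

# Toy problem: maximize f over the integers, neighbors are x - 1 and x + 1
def toy_objective(x):
    return -(x - 7) ** 2

def toy_neighbors(x):
    return [x - 1, x + 1]

optimum = steepest_ascent(0, toy_neighbors, toy_objective)
```

For CMM, a point would be a motif pair, the neighbor function would return similar motif pairs, and f would be the support; like all local search methods, the result depends on the starting point and may be a local rather than a global maximum.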

Metabolism is primarily determined by genes, environment and nutrition. It consists of chemical reactions catalyzed by enzymes to produce essential components such as amino acids, sugars and lipids, and also the energy necessary to synthesize and use them in constructing cellular components. Since the chemical reactions are organized into metabolic pathways, in which one chemical is transformed into another by enzymes and co-factors, such a structure can be naturally modeled as a complex network. In this way, metabolic networks are directed and weighted graphs, whose vertices can be metabolites, reactions and enzymes, with two types of edges that represent mass flow and catalytic reactions. One widely considered catalogue of metabolic pathways available online is the Kyoto Encyclopedia of Genes and Genomes (KEGG). In the Slide we see a simple metabolic network involving five metabolites M1-M5 and three enzymes E1-E3, of which the latter catalyzes an irreversible reaction (Hodgman, French & Westhead, 2010).

Such metabolic structures can be very large, as can be seen in this Slide. TrmB (the Thermococcus regulator of maltose binding) acts as a repressor for genes encoding glycolytic enzymes and as an activator for genes encoding gluconeogenic enzymes. The enzyme-coding genes under TrmB control are included in the metabolic pathways shown in the Slide (13 are unique to archaea and 35 are conserved across species from all three domains of life). Integrated analysis of the metabolic and gene regulatory network architecture reveals various interesting scenarios (Schmid et al., 2009).

Electronic patient records (EPR) remain an unexplored but rich data source for discovering, e.g., correlations between diseases. Roque et al. (2011) describe a general approach for gathering phenotypic descriptions of patients from medical records in a systematic and non-cohort dependent manner: by extracting phenotype information from the "free-text" (= unstructured information) in such records, they demonstrated that they can extend the information contained in the structured record data and use it for producing fine-grained patient stratification and disease co-occurrence statistics. Their approach uses a dictionary based on the International Classification of Diseases (ICD-10) ontology and is therefore in principle language independent. As a use case they show how records from a Danish psychiatric hospital lead to the identification of disease correlations, which subsequently can be mapped to systems biology frameworks.
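The idea of dictionary-based extraction from free text can be sketched in a few lines of Python. This is only an illustration, not the pipeline of Roque et al.: the mini-dictionary `ICD10_TERMS` and the function `extract_codes` are hypothetical, and a real system would also have to handle negation, spelling variants and abbreviations.

```python
import re

# Hypothetical mini-dictionary mapping terms to ICD-10 categories.
# A real dictionary would be derived from the full ICD-10 terminology.
ICD10_TERMS = {
    "schizophrenia": "F20",
    "depressive episode": "F32",
    "alcoholic liver disease": "K70",
}


def extract_codes(note, dictionary=ICD10_TERMS):
    """Return the set of ICD-10 codes whose terms occur in a free-text note."""
    text = note.lower()
    found = set()
    for term, code in dictionary.items():
        # Whole-word match so that a term is not found inside a longer word.
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            found.add(code)
    return found


note = "Patient with known schizophrenia; liver values suggest alcoholic liver disease."
print(sorted(extract_codes(note)))
```

Since the codes are language neutral, only the term dictionary would have to be exchanged to process records written in another language, which is the sense in which such an approach is language independent in principle.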

Disease-disease correlations. Heatmap of the most significant 100 ICD10 codes, based on ranking the list of 802 candidate pairs by their comorbidity scores. Chapter colors are highlighted next to the ICD10 codes. Diseases that often occur together are colored red in the heatmap, while those with lower than expected co-occurrence are colored blue. The color label shows the log2 change of comorbidity between two diseases when compared to the expected level. doi:10.1371/journal.pcbi.1002141.g002
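A log2 change relative to an expected level can be illustrated as follows. This sketch assumes that the expected co-occurrence is the count one would see if the two diseases were statistically independent; the exact comorbidity score used in the paper may be defined differently, and the numbers are invented.

```python
import math


def log2_comorbidity(n_a, n_b, n_ab, n_total):
    """log2 of observed co-occurrence over the count expected under independence.

    n_a, n_b : patients with disease A, with disease B
    n_ab     : patients with both (must be > 0)
    n_total  : all patients
    """
    expected = n_a * n_b / n_total
    return math.log2(n_ab / expected)


# Invented counts: 1000 patients, 100 with A, 50 with B.
print(log2_comorbidity(100, 50, 20, 1000))  # more often together than expected: positive
print(log2_comorbidity(100, 50, 5, 1000))   # exactly as expected: zero
print(log2_comorbidity(100, 50, 1, 1000))   # less often than expected: negative
```

Positive values correspond to the red cells of such a heatmap, negative values to the blue ones.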

Roque et al. (2011) used text mining to automatically extract clinically relevant terms from 5543 psychiatric patient records and mapped these to disease codes in the ICD10. They clustered patients together based on the similarity of their profiles. The result is a patient stratification based on more complete profiles than the primary diagnosis, which is typically used. Figure 124 illustrates the general approach to capture correlations between different disorders. Several clusters of ICD10 codes relating to the same anatomical area or type of disorder can be identified along the diagonal of the heatmap, ranging from trivial correlations (e.g., different arthritis disorders), to correlations of cause and effect codes (e.g., stroke and mental/behavioural disorders), to social and habitual correlations (e.g., drug abuse, liver diseases and HIV).
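Comparing patients by the similarity of their profiles can be sketched with a cosine similarity over sets of ICD10 codes. The choice of cosine similarity, the patient identifiers and the profiles below are assumptions for illustration only; the clustering step itself is not shown.

```python
import math
from itertools import combinations


def cosine_similarity(codes_a, codes_b):
    """Cosine similarity of two binary profiles given as sets of ICD10 codes."""
    if not codes_a or not codes_b:
        return 0.0
    return len(codes_a & codes_b) / math.sqrt(len(codes_a) * len(codes_b))


# Hypothetical patient profiles (primary diagnosis plus text-mined codes).
PROFILES = {
    "p1": {"F20", "F10", "K70"},
    "p2": {"F20", "F10"},
    "p3": {"F32", "I63"},
}


def most_similar_pair(profiles):
    """Return the pair of patient ids with the highest profile similarity."""
    return max(
        combinations(sorted(profiles), 2),
        key=lambda pair: cosine_similarity(profiles[pair[0]], profiles[pair[1]]),
    )


print(most_similar_pair(PROFILES))
```

A full stratification would feed the pairwise similarities into a clustering algorithm; the point here is only that richer profiles give the similarity measure more to work with than a single primary diagnosis.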

Homology (plural: homologies) originates from the Greek ὁμολογέω (homologeo), meaning "to conform" (in German: übereinstimmen). The term has its origins in biology and anthropology, where it is used for a correspondence of structures in two life forms with a common evolutionary origin (Darwin, 1859). In chemistry it is used for the relationship between the elements in the same group of the periodic table, or between organic compounds in a homologous series. In mathematics, homology is a formalism for talking in a quantitative and unambiguous manner about how a space is connected (Edelsbrunner & Harer, 2010). Basically, homology is a concept that is used in many branches of algebra and topology. Historically, the term was first used in a topological sense by Henri Poincaré.

In bioinformatics, homology modelling is a mature technique that can be used to address many problems in molecular medicine, and it is one of the most efficient methods to predict protein structures. With the increase in the number of medically relevant protein sequences, resulting from automated sequencing in the laboratory, and in the fraction of all known structural folds, homology modelling will be even more important to personalized and molecular medicine in the future.

Homology modelling is a knowledge-based prediction of protein structures that relies on observed features in known homologous protein structures. A protein sequence with an unknown structure (the target) is aligned with one or more protein sequences with known structures (the templates). The method is based on the principle that homologous proteins have similar structures. The prerequisite for successful homology modelling is a detectable similarity between the target sequence and the template sequences (more than 30%), allowing the construction of a correct alignment. By exploiting this information from template structures, the structural model of the target protein can be constructed (Wiltgen & Tilz, 2009). Two well-known homology modelling programs, which are free for academic research, are MODELLER (http://salilab.org/modeller) and SWISS-MODEL (http://swissmodel.expasy.org).

The slide shows the comparison of two proteins: the sequences of both proteins are 95% (53 of 56) identical (only residues 20, 30 and 45 differ), yet the structures are totally different.
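The identity figure quoted for the two proteins is simple arithmetic over an alignment. The following sketch reproduces it for two made-up 56-residue sequences that differ at residues 20, 30 and 45; the sequences and the helper `mutate` are hypothetical and are not the proteins from the slide.

```python
def percent_identity(seq_a, seq_b):
    """Return (matches, percent identity) for two already aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches, 100.0 * matches / len(seq_a)


def mutate(seq, positions, replacement="X"):
    """Replace the residues at the given 1-based positions (illustration only)."""
    chars = list(seq)
    for pos in positions:
        chars[pos - 1] = replacement
    return "".join(chars)


# Made-up 56-residue target: the 20 amino acid letters repeated and truncated.
target = ("ACDEFGHIKLMNPQRSTVWY" * 3)[:56]
template = mutate(target, [20, 30, 45])

matches, identity = percent_identity(target, template)
print(matches, len(target), round(identity))
```

53 of 56 matching positions give roughly 95% identity, far above the 30% threshold mentioned above. The example on the slide is a reminder that even such a high identity does not guarantee the same fold.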

All the areas we have touched on in this lecture are extremely important for the concepts of personalized medicine and molecular medicine and will keep us busy for the next decades. Data mining is perhaps the most central and most important computational subject in this respect.

All these approaches are producing gigantic amounts of highly complex data sets!

See the recent article in Science: doubling of data in proteomics every 18 months.

My DEDICATION is to make data valuable … Thank you!

The Klein bottle is the symbol for geometry and topology.

Topological data analysis (TDA) is a fast growing branch of applied mathematics and of enormous importance for data mining and knowledge discovery, particularly from large, high-dimensional, incomplete and noisy, dirty data.

http://psychology.wikia.com/wiki/Information_retrieval

Network motifs in integrated molecular networks represent functional relationships between distinct data types. They aggregate to form dense topological structures corresponding to functional modules which cannot be detected by traditional graph clustering algorithms.

http://www.nature.com/nri/journal/v3/n10/fig_tab/nri1200_F2.html

http://www.maa.org/cvm/1998/01/tprppoh/article/Pictures/KleinBottle.gif

Nesting = recursion, subroutines, information hiding,

At the top of Figure 39 we see a sample XML document describing genes involved in long-term memory in the organism Drosophila melanogaster. Nested within the gene elements are sub-elements related to the parent. The first gene includes two nucleic acid sequences, a protein product, and a functional annotation. Additional information is provided by attributes, such as the organism. This example illustrates the difficulty of modeling many-to-many relationships, such as the relationship between genes and functions. Information about functions must be repeated under each gene with that function. If we invert the nesting (i.e., nest genes inside function elements), then we must repeat information about genes with more than a single function. At the bottom of Figure 39 we see the same information about genes, but using RDF and OWL. Both genes are instances of the class FlyGene, which has been defined as the set of all Genes for the organism D. melanogaster. The functional information is represented using a hierarchical taxonomy, in which Long-Term Memory is a subclass of Memory (Louie et al., 2007).
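The repetition problem of nested XML, and the way a graph of statements avoids it, can be shown with a small Python sketch. The XML fragment, the gene names `geneA` and `geneB` and the predicate `hasFunction` are hypothetical stand-ins and not the content of Figure 39; the triples are plain tuples rather than real RDF.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested XML: the shared function is written once per gene.
XML = """
<genes organism="D. melanogaster">
  <gene name="geneA">
    <function>long-term memory</function>
  </gene>
  <gene name="geneB">
    <function>long-term memory</function>
    <function>olfactory learning</function>
  </gene>
</genes>
"""


def to_triples(xml_text):
    """Flatten the nested document into (subject, predicate, object) statements."""
    root = ET.fromstring(xml_text)
    triples = set()
    for gene in root.iter("gene"):
        name = gene.get("name")
        triples.add((name, "rdf:type", "FlyGene"))
        for function in gene.iter("function"):
            triples.add((name, "hasFunction", function.text))
    # The taxonomy is stated once, independent of how many genes use it.
    triples.add(("long-term memory", "rdfs:subClassOf", "memory"))
    return triples


def genes_with(triples, function):
    """Query in the direction the XML nesting does not support directly."""
    return sorted(s for s, p, o in triples if p == "hasFunction" and o == function)


triples = to_triples(XML)
print(genes_with(triples, "long-term memory"))
```

In the XML, the function element appears once under every gene that has it. In the triple set, each function is a single node that any number of genes can point to, so a many-to-many relationship needs no repetition and can be queried from either side.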

Macroscopic structure: this is the star cluster M30. Let us look into the macroscopic area first and look for some similarities. This is the globular star cluster M30 (NGC 7099), comprising some 100,000 stars, with a diameter of about 100 light-years, approx. 40,000 light-years away from Earth. Look at the structure, look at the similarity, and consider the time: by the time our eyes see this structure, it might already have vanished (Darwin Channel).

From these large macroscopic structures to tiny microscopic structures: here we see X-ray crystallography, which is a standard method to analyse the arrangement of objects (atoms, molecules) within a crystal structure. This data contains the mean positions of the entities within the substance, their chemical relationship, and various other properties. For a protein structure, for example, the data is stored in the Protein Data Bank (PDB). This database contains vast amounts of data. If a medical professional looks at the data, he or she sees only lengthy tables of numbers.
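To give an impression of these "lengthy tables of numbers", the sketch below reads two ATOM records in the fixed-column PDB text format and extracts the mean atomic positions. The two sample lines and their coordinates are made up for illustration; the column positions follow the published PDB file format for ATOM records.

```python
import math


def parse_atom(line):
    """Extract a few fields from one fixed-column PDB ATOM record."""
    return {
        "serial": int(line[6:11]),            # columns 7-11
        "name": line[12:16].strip(),          # columns 13-16
        "residue": line[17:20].strip(),       # columns 18-20
        "chain": line[21],                    # column 22
        "residue_number": int(line[22:26]),   # columns 23-26
        "xyz": (
            float(line[30:38]),               # x in Angstrom, columns 31-38
            float(line[38:46]),               # y, columns 39-46
            float(line[46:54]),               # z, columns 47-54
        ),
    }


# Made-up sample records (not taken from a real PDB entry).
SAMPLE = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842",
]

atoms = [parse_atom(line) for line in SAMPLE if line.startswith("ATOM")]
distance = math.dist(atoms[0]["xyz"], atoms[1]["xyz"])
print(atoms[0]["name"], atoms[1]["name"], round(distance, 2))
```

Each line is one atom; a full protein entry consists of thousands of such lines. The structure only becomes visible once the coordinates are turned into distances, contacts or a 3D rendering.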

Structures! This is now our keyword. When we talk about structures, we will see some really interesting aspects of structures. A good example of a data-intensive and highly complex microscopic structure is a yeast protein network. Note: yeasts (Hefe) are eukaryotic micro-organisms (fungi) with currently 1,500 known species, estimated to be only 1% of all yeast species. Yeasts are unicellular, typically measuring 4 µm in diameter. In this picture you can see the first protein interaction network (published by Jeong et al., 2001). The nodes are the proteins. The links are the physical interactions (bindings). The red nodes are proteins whose removal is lethal to the organism, the green ones are non-lethal, and for the yellow ones the effect is still unknown.

You may ask whether this structure is useful. Well, what we get out of this yeast is something which some of us may really like: Prost! The problem with such structures is that they are very big and that there are so many of them! Knowledge Management can help to discover such unknown structures amongst the enormous set of uncharacterized data. We will come back to such structural homologies later. Now let us take a closer look at what Knowledge Management can do for us.

When thinking about data, we should always keep two fundamental physical aspects in mind: time-related aspects (e.g. entropy of data) and space-related aspects (e.g. topology of data).

http://www.youtube.com/watch?v=oBkOYQ02chs TedxWarwick 2010, Roger Penrose in Space-Time Geometry.
http://www.youtube.com/watch?v=aSz5BjExs9o Visualizing Eleven Dimensions

Clouds of data. Very often, data is represented as an unordered sequence of points in a Euclidean n-dimensional space E^n. Data coming from an array of sensor readings in an engineering testbed, from questionnaire responses in a psychology experiment, or from population sizes in a complex ecosystem all reside in a space of potentially high dimension. The global 'shape' of the data may often provide important information about the underlying phenomena which the data represents. One type of data set for which global features are present and significant is the so-called point cloud data coming from physical objects in 3-d. Touch probes, point lasers, or line lasers sweep a suspended body and sample the surface, recording coordinates of anchor points on the surface of the body. The cloud of such points can be quickly obtained and used in a computer representation of the object. A temporal version of this situation is to be found in motion-capture data, where geometric points are recorded as time series. In both of these settings, it is important to identify and recognize global features: where is the index finger, the keyhole, the fracture?
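One of the simplest global features of a point cloud is the number of connected pieces it falls into when points closer than a chosen scale are linked. The sketch below counts these pieces with a union-find structure. The cloud, the scale values and the function name are assumptions for illustration; this shows only the most basic ingredient of a topological analysis, not a full method.

```python
import math


def connected_components(points, eps):
    """Number of clusters when points at distance <= eps are linked."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})


# Hypothetical 3-d cloud: two well separated groups of points.
CLOUD = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
    (5.0, 5.0, 5.0), (5.1, 5.0, 5.0),
]

for eps in (0.05, 0.2, 10.0):
    print(eps, connected_components(CLOUD, eps))
```

At a very small scale every point is its own component, at a very large scale everything merges into one. The interesting information lies in between: the two groups persist over a wide range of scales, which is what makes them a global feature rather than noise.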

Network metrics:
a = order
b = clustering coefficient
c = path length
d = centrality
e = nodal degree
f = modularity
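Some of these metrics can be computed directly from an adjacency list. The following sketch covers order, nodal degree, local clustering coefficient and average path length for a small undirected example graph; the graph is made up, it is assumed to be connected, and centrality and modularity are left out to keep the example short.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected graph: a triangle A-B-C with a pendant node D.
ADJ = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}


def order(adj):
    """Order = number of nodes."""
    return len(adj)


def degree(adj, node):
    """Nodal degree = number of neighbours."""
    return len(adj[node])


def clustering(adj, node):
    """Fraction of neighbour pairs that are themselves connected."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def bfs_distances(adj, source):
    """Shortest path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def average_path_length(adj):
    """Mean shortest path length over all ordered pairs of distinct nodes."""
    total = pairs = 0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs


print(order(ADJ), degree(ADJ, "C"), clustering(ADJ, "C"), average_path_length(ADJ))
```

Node C has the highest degree but the lowest non-zero clustering coefficient, because its neighbour D is connected to nobody else. Such differences between nodes are what these metrics are meant to expose.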

http://www.google.com/patents/US6384826

Representative examples of disease complexes are displayed. Diseases are associated with tissues by using our disease-tissue matrix, and expression data are from the GNF dataset. The expression levels of complexes are shown as z scores. If a disease is associated with more than 3 tissues, only the 3 most associated tissues are shown for clarity. In a given complex, proteins relevant to the disease in question are yellow. The figure shows the general tendency of overexpression of the complexes in the tissues in which they are involved in pathology compared with their expression level in other tissues. All members of the complexes can be seen in
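A z score expresses how far a value lies from the mean, measured in standard deviations. The sketch below standardizes the expression level of one complex across several tissues. The tissue names and values are invented, and the use of the population standard deviation is an assumption; the cited dataset may have been normalized differently.

```python
from statistics import mean, pstdev


def z_scores(expression):
    """Standardize expression levels across tissues: z = (x - mean) / std."""
    values = list(expression.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation; must be non-zero
    return {tissue: (value - mu) / sigma for tissue, value in expression.items()}


# Invented expression levels of one protein complex in four tissues.
EXPRESSION = {"heart": 8.0, "liver": 2.0, "brain": 2.0, "kidney": 4.0}

z = z_scores(EXPRESSION)
for tissue, score in sorted(z.items(), key=lambda item: -item[1]):
    print(tissue, round(score, 2))
```

A clearly positive z score in one tissue and negative scores elsewhere is the pattern the caption describes as overexpression of a complex in the tissue where it is involved in pathology.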

Three-dimensional structure of ventricular muscle basket weave, coronary arterial tree, and pacemaker and conduction system. One of the central challenges of cell-based therapy for regenerating specific heart components is guiding transplanted cells into a functional syncytium with the existing three-dimensional architecture. Transplanted cells must make functional connections with neighboring specialized heart cells to result in a net gain of global function. Transplanted myogenic progenitors, for example, must align with and integrate into the existing ventricular muscle basket weave to allow synchronous contraction and relaxation of graft and host myocardium. Integration of pacemaker and conduction system progenitors into the appropriate tissue type is necessary to generate a biological pacemaker and avoid cardiac arrhythmia. For example, having a transplanted heart muscle progenitor integrate into the conduction system might have arrhythmogenic consequences, as would the introduction of cells with independent pacemaker potential in the heart. Similarly, cell-based therapies to promote coronary collateral formation or neo-arteriogenesis require functional integration of transplanted cells with the host coronary arterial tree.
