Download - A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Transcript
Page 1: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Statusasof Mo,19.10.201508:30DearStudents – welcometothesecondlectureofourcourse“biomedicalinformatics”,pleaserememberfromthelastlecturethedefinition:AccordingtotheAmericanAssociationofMedicalInformatics(AMIA)thetermMedicalInformaticshasnowbeenexpandedtoBiomedicalInformaticsandisdefinedas“theinterdisciplinaryfieldthatstudiesandpursuestheeffectiveuseofbiomedicaldata,information,andknowledgeforscientificinquiry,problemsolving,anddecisionmaking,motivatedbyeffortstoimprovehumanhealth”.[1]Itisimportanttoknow:Bioinformatics+MedicalInformatics=BiomedicalInformatics(seeSlide1‐42).Note:Computersarejustthevehiclestorealizethecentralgoals:Toharnessthepowerofthemachinestosupportandtoamplifyhumanintelligence[2].[1]Shortliffe,E.H.2011.BiomedicalInformatics:DefiningtheScienceanditsRoleinHealthProfessionalEducation.In:Holzinger,A.&Simonic,K.‐M.(eds.)InformationQualityine‐Health.LectureNotesinComputerScienceLNCS7058.Heidelberg,NewYork:Springer,pp.711‐714.[2]Holzinger,A.2013.Human–ComputerInteractionandKnowledgeDiscovery(HCI‐KDD):Whatisthebenefitofbringingthosetwofieldstoworktogether?In:Cuzzocrea,A.,Kittl,C.,Simos,D.E.,Weippl,E.&Xu,L.(eds.)MultidisciplinaryResearchandPracticeforInformationSystems,SpringerLectureNotesinComputerScienceLNCS8127.Heidelberg,Berlin,NewYork:Springer,pp.319‐328.Regardingthecurrenttrendtowardspersonalizedmedicinehaveareadofthispaper:Holzinger,A.2014.TrendsinInteractiveKnowledgeDiscoveryforPersonalizedMedicine:CognitiveSciencemeetsMachineLearning.IEEEIntelligentInformaticsBulletin,15,(1),6‐14.Onlineavailable via:http://www.comp.hkbu.edu.hk/~cib/2014/Dec/article2/iib_vol15no1_article2.pdf

1WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 2: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Inthissecondlecturewestartwithalookondatasources,reviewsomedatastructures,discussstandardizationversusstructurization,reviewthedifferencesbetweendata,informationandknowledgeandclosewithanoverviewaboutinformationentropy.

2WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 3: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Acentral topicisthedimensionalityofdataandthe interrelated(connected)curseofdimensionalitywhichreferstovariousphenomenathatarisewhenanalyzingandorganizingdatainhigh‐dimensionalspaces(thousandsofdimensions)thatdonotoccurinlow‐dimensionalsettingssuchasthethree‐dimensionalphysicalspaceofoureverydayworld.

3WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 4: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

4WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 5: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

5WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 6: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

6WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 7: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Arecommendable smallbookletis:

Scheinerman,E.R.2011.MathematicalNotation:AGuideforEngineersandScientists,Baltimore(MD),Scheinerman.Whichalsoincludes themostimportantLATEXcommandsforproducingmaths symbols

http://www.ams.jhu.edu/~ers/notation/

7WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 8: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

componentsthroughpandw(Golan,Judge,andMiller;1996);

8WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 9: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Holzinger,A.,Dehmer,M.&Jurisica,I.2014.KnowledgeDiscoveryandinteractiveDataMininginBioinformatics‐ State‐of‐the‐Art,futurechallengesandresearchdirections.BMCBioinformatics,15,(S6),I1.

Hund,M.,Sturm,W.,Schreck,T.,Ullrich,T.,Keim,D.,Majnaric,L.&Holzinger,A.2015.AnalysisofPatientGroupsandImmunizationResultsBasedonSubspaceClustering.In:Guo,Y.,Friston,K.,Aldo,F.,Hill,S.&Peng,H.(eds.)BrainInformaticsandHealth,LectureNotesinArtificialIntelligenceLNAI9250.Cham:SpringerInternationalPublishing,pp.358‐368.

Holzinger,A.,Stocker,C.&Dehmer,M.2014.BigComplexBiomedicalData:TowardsaTaxonomyofData.In:Obaidat,M.S.&Filipe,J.(eds.)CommunicationsinComputerandInformationScienceCCIS455.BerlinHeidelberg:Springerpp.3‐18.

Relatedrecommended reading:

Dong‐Hee,S.&MinJae,C.2015.Ecologicalviewsofbigdata:Perspectivesandissues.TelematicsandInformatics,32,(2),311‐320.

Dong,X.L.&Srivastava,D.2015.BigDataIntegration.SynthesisLecturesonDataManagement,7,(1),1‐198.

Wu,X.D.,Zhu,X.Q.,Wu,G.Q.&Ding,W.2014.DataMiningwithBigData.IEEETransactionsonKnowledgeandDataEngineering,26,(1),97‐107.

Shneiderman,B.2014.TheBigPictureforBigData:Visualization.Science,343,(6172),730‐730.

Sackman,J.E.&Kuchenreuther,M.2014.MarryingBigDatawithPersonalizedMedicine.Biopharm International,27,(8),36‐38.

9WS 2015

Page 10: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Withregardto data,thedifferencebetweenclassicalstatisticsandmodern machinelearningisthatmachinelearningdiscoversintricatestructuresinlargedatasetstoindicatehowamachineshouldchangeitsinternalparameters.

A. Holzinger    LV 709.049 Med. Informatik                                

WS 2015 10

Page 11: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

JohnvonNeumannandhishigh‐speedcomputer,approx.1952

Ourfirstquestionis:Wheredoesthedatacomefrom?Thesecondquestion:Whatkindofdataisthis?Thethirdquestion:Howbigisthisdata?So,letuslookatsomebiomedicaldatasources(seeSlide2‐1):

11WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 12: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Duetotheincreasingtrendtowardspersonalizedandmolecularmedicine,biomedicaldataresultsfromvarioussourcesindifferentstructuraldimensions,rangingfromthemicroscopicworld(e.g.genomics,epigenomics,metagenomics,proteomics,metabolomics)tothemacroscopicworld(e.g.diseasespreadingdataofpopulationsinpublichealthinformatics).Justfororientation:theGlucosemoleculehasasizeof900 =900 10 andtheCarbonatomapprox.300 .Ahepatitisvirusisrelativelylargewith45 =45 10 andtheX‐Chromosomemuchbiggerwith7 =7 10 .Herealotof“bigdata”isproduced,e.g.genomics,metabolomicsandproteomicsdata.Thisisreally“bigdata”– thedatasetsenormouslylarge– whereasineachindividualweestimatemanyTerabytes(1TB=1 10 Byte=1000GByte)ofgenomicsdata,weareconfrontedwithPetabytesofproteomicsdataandthefusionofthoseforpersonalizedmedicineresultsinExabytes ofdata(1EB=1 10 Byte).

Ofcoursetheseamountsareforeachhumanindividual,however,wehaveacurrentworldpopulationof7Billion(1BillioninEnglishlanguageis1MilliardinEuropeanlanguage)people(=7 10 people).Soyoucanseethatthisisreally“bigdata”.This“natural”dataisthenfusedwith“produced”data,e.g.theunstructureddata(text)inthepatientrecords,ordatafromphysiologicalsensorsetc.– thesedataisalsorapidlyincreasinginsizeandcomplexity.Youcanimaginethatwithoutcomputationalintelligencewehavenochancetosurviveinthiscomplexbigdatasets.

http://learn.genetics.utah.edu/content/begin/cells/scale/C‐Atom340pm=340.10‐12mMoleculeGlucose900pmVirus HepatitisVirus45nm=45.10‐9mMicroscope200.10‐9mConfocalmicroscopy 20.10‐6mElectron‐Microscopy0,1.10‐9mX‐Chromosome7.10‐6mDNA2.10‐9mEncyme =Metabolomics

Holzinger,A.,Dehmer,M.&Jurisica,I.2014.KnowledgeDiscoveryandinteractiveDataMininginBioinformatics‐ State‐of‐the‐Art,futurechallengesandresearchdirections.BMCBioinformatics,15,(S6),I1.

12WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 13: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

MostofourcomputersareVon‐Neumannmachines(seechapter1),consequentlyatthelowestphysicallayer,dataisrepresentedaspatternsofelectricalon/offstates(1/0,H/L,high/low);wespeakofabit,whichisalsoknownasBit,theBasicindissolubleinformationunit(Shannon,1948).DonotconfusethisBitwiththeIEC60027‐2symbolbit– insmallletters– whichisusedasanSIdimensionprefix(e.g.1Kbit=1024bit,1Byte=8bit).Beginningwiththephysicallevelofdatawecandeterminevariouslevelsofdatastructures(seeSlide2‐2):Referto:http://physics.nist.gov/cuu/Units/binary.html1)Physicallevel: inaVon‐Neumannsystem:bit;inaQuantumsystem:qubitNote: Regardlessofitsphysicalrealization(e.g.voltage,ormechanicalstate,orblack/whiteetc.),abitisalwayslogicallyeither0or1(analog toalight‐switch).Aqubithassimilaritiestoaclassicalbit,butisoverallverydifferent:Aclassicalbitisascalarvariablewiththesinglevalueofeither0or1,sothevalueisunique,deterministicandunambiguous.Aqubitismoregeneralinthesensethatitrepresentsastatedefinedbyapairofcomplexnumbers , , whichexpresstheprobabilitythatareadingofthevalueofthequbitwillgiveavalueof0or1.Thus,aqubitcanbeinthestateof0,1,orsomemixture‐ referredtoasasuperposition‐ ofthe0and1states.Theweightsof0and1inthissuperpositionaredeterminedby(a,b)inthefollowingway:qubit≜ , ≜ 0 1 .Pleasebeawarethatthismodelofquantumcomputationisnottheonlyone(Lanzagorta &Uhlmann,2008).

Forarecentoverviewonquantumcomputation pleasereferto:http://peterwittek.com/book.html2)LogicalLevel:1)Primitivedatatypes,including:a)Booleandatatype(true/false);b)numericaldatatype(e.g.integer( ,floating‐pointnumbers(“reals”),etc.);2)compositedatatypes,including:a)array,b)record,c)union,d)set(storesvalueswithoutanyparticularorder,andnorepeatedvalues),e)object(containsothers);3)Stringandtexttypes,including:a)alphanumericcharacters,b)alphanumericstrings(=sequenceofcharacterstorepresentwordsandtext)3)AbstractLevel: includingabstractdatastructures,e.g.queue(FIFO),stack(LIFO),set(noorder,norepeatedvalues),lists,hashtable,arrays,trees,graphs,…4)TechnicalLevel: Applicationdataformats,e.g.text,vectorgraphics,pixelimages,audiosignals,videosequences,multimedia,…5)HospitalLevel: Narrative(textual,naturallanguage)patientrecorddata(structured/unstructuredandstandardized/non‐standardized),Omicsdata(genomics,proteomics,metabolomics,microarraydata,fluxomics,phenomics),numericalmeasurements(physiologicaldata,timeseries,labresults,vitalsigns,bloodpressure,CO2 partialpressure,temperature,…),recordedsignals(ECG,EEG,ENG,EMG,EOG,EP…),graphics(sketches,drawings,handwriting,…);audiosignals,images(cams,x‐ray,MR,CT,PET,…),etc.

13WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 14: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Inbiomedicalinformaticswehavealottodowithabstractdatatypes(ADT),consequentlywebrieflyreviewthemostimportantoneshere.FordetailspleaserefertoacourseonAlgorithm&Datastructures,ortoaclassictextbooksuchas(Aho,Hopcroft&Ullman,1983),(Cormen etal.,2009),orinGerman(Ottmann &Widmayer,2012),(Holzinger,2003)andpleasetakeintoconsiderationthatdatastructuresandalgorithmsgohandinhand,so amust‐have‐on‐the‐deskofeverycomputerscientistis:Cormen,T.H.,Leiserson,C.E.,Rivest,R.L.&Stein,C.2009.IntroductiontoAlgorithms(3rdedition),Cambridge(MA),TheMITPress.

Listisasequentialcollectionofitems , , … , accessibleoneafteranother,beginningattheheadandendingatthetailz.InaVon‐Neumannmachineitisawidelyuseddatastructureforapplicationswhichdonotneedrandomaccess.Itdiffersfromthestack(last‐in‐first‐out,LIFO)andqueue(first‐in‐first‐out,FIFO)datastructuresinsofar,thatadditionsandremovalscanbemadeatanypositioninthelist.Incontrasttoasimpleset theorderisimportant.AtypicalexamplefortheuseofalistisaDNAsequence.ThecombinationofGGGTTTAAAissuchalist,theelementsofthelistarethenucleotidebases.Nucleotides arethejoinedmoleculeswhichformthestructuralunitsoftheRNAandtheDNAandplaythecentralroleinmetabolism.

14WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 15: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Graph isapair , ,where isasetoffinite,non‐emptyvertices(nodes)and isasetofedges(lines,arcs),whichare2‐elementsubsetsof .If isasetoforderedpairsofvertices(arcs,directededges,arrows),thenitisadirectedgraph(digraph).Thedistancesbetweentheedgescanberepresentedwithinadistance‐matric(twodimensionalarray).

Theedgesinagraphcanbemultidimensionalobjects, e.g.vectorscontainingtheresultsofmultipleGen‐expressionmeasures.Forthispurposethedistanceoftwoedgescanbemeasuredbyvariousdistancemetrics.Graphsareideallysuitedforrepresentingnetworksinmedicineandbiology,e.g.metabolismpathways,etc.Inbioinformatics,distancematricesareusedtorepresentproteinstructuresinacoordinate‐independentmanner,aswellasthepairwisedistancesbetweentwosequencesinsequencespace.Theyareusedinstructuralandsequentialalignment,andforthedeterminationofproteinstructuresfromNMRorX‐raycrystallography.Evolutionarydynamicsactonpopulations.Neithergenes,norcells,norindividualsevolve;onlypopulationsevolve.ThissocalledMoranprocess describesthestochasticevolutionofafinitepopulationofconstantsize:Ineachtimestep,anindividualischosenforreproductionwithaprobabilityproportionaltoitsfitness;asecondindividualischosenfordeath.Theoffspringofthefirstindividualreplacesthesecondandindividualsoccupytheverticesofagraph.Ineachtimestep,anindividualisselectedwithaprobabilityproportionaltoitsfitness;theweightsoftheoutgoingedgesdeterminetheprobabilitiesthatthecorrespondingneighborwillbereplacedbytheoffspring.Theprocessisdescribedbyastochasticmatrix ,where denotestheprobabilitythatanoffspringofindividuali willreplaceindividualj.Ateachtimestep,anedge isselectedwithaprobabilityproportionaltoitsweightandthefitnessoftheindividualatitstail.TheMoranprocessisacompletegraphwithidenticalweights(Lieberman,Hauert &Nowak,2005).

Graphs canberepresentedcomputationallybyanAdjacencylist,AdjacencymatrixandanIncidencematrix.Thefirstpre‐processingstepistoproducepoint clouddatasetsfromrawdata,see:Holzinger,A.,Malle,B.,Bloice,M.,Wiltgen,M.,Ferri,M.,Stanganelli,I.&Hofmann‐Wellenhof,R.2014.OntheGenerationofPointCloudDataSets:StepOneintheKnowledgeDiscoveryProcess.In:Holzinger,A.&Jurisica,I.(eds.)InteractiveKnowledgeDiscoveryandDataMininginBiomedicalInformatics,LectureNotesinComputerScience,LNCS8401.BerlinHeidelberg:Springer,pp.57‐80.https://online.tugraz.at/tug_online/voe_main2.getVollText?pDocumentNr=974579&pCurrPk=83005

Forthe specifictaskofgettinggraphsfromimagedatahavealookat:Holzinger,A.,Malle,B.&Giuliani,N.2014.OnGraphExtractionfromImageData.In:Slezak,D.,Peters,J.F.,Tan,A.‐H.&Schwabe,L.(eds.)LectureNotesinArtificialIntelligence,LNAI8609.Heidelberg,Berlin:Springer,pp.552‐563.https://online.tugraz.at/tug_online/voe_main2.getVollText?pDocumentNr=868952&pCurrPk=80830

15WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 16: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Treeisacollectionofelementscallednodes,oneofwhichisdistinguishedasaroot,alongwitharelation("parenthood")thatplacesahierarchicalstructureonthenodes.Anode,likeanelementofalist,canbeofwhatevertypewewish.Weoftendepictanodeasaletter,astring,oranumberwithacirclearoundit.Formally,atreecanbedefinedrecursivelyinthefollowingmanner:1.Asinglenodebyitselfisatree.Thisnodeisalsotherootofthetree.2.Suppose isanodeand 1, 2, . . . , aretreeswithroots 1, 2, . . . , , respectively.Wecanconstructanewtreebymaking betheparentofnodes 1, 2, . . . , .Inthistree istherootand 1, 2, . . . , arethesubtrees oftheroot.Nodes 1, 2, . . . , arecalledthechildrenofnode .

Dendrogram (fromGreekdendron "tree",‐gramma "drawing")isatreediagramfrequentlyusedtoillustratethearrangementoftheclustersproducedbyhierarchicalclustering.Dendrograms areoftenusedincomputationalbiologytoillustratetheclusteringofgenesorsamples.Theoriginofsuchdendrograms canbefoundin(Darwin,1859).Theexampleby(Hufford etal.,2012)showsaneighbor‐joiningtreeandthechangingmorphologyofdomesticatedmaizeanditswildrelatives.Taxaintheneighbor‐joiningtreearerepresentedbydifferentcolors:parviglumis (green),landraces(red),improvedlines(blue),mexicana (yellow)andTripsacum (brown).Themorphologicalchangesareshownforfemaleinflorescencesandplantarchitectureduringdomesticationandimprovement.

16WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 17: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Please rememberthekeyproblemsindealingwithdatainclude:1)Heterogeneousdatasources(needfordatafusionanddataintegration)2)Complexityofthedata(high‐dimensionality)3)Noisy,uncertaindata(challengeofpre‐processing)4)Thediscrepancybetweendata‐information‐knowledge(variousdefinitions)5)Bigdatasets(manualhandlingofthedataisimpossible)

17WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 18: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Nowthatwehaveseensomeexamplesofdatafromthebiomedicaldomain,wecanlookatthe“bigpicture”.Manyika etal.(2011)localizedfourmajordatapoolsintheUShealthcareanddescribethatthedataarehighlyfragmented,withlittleoverlapandlowintegration.Moreover,theyreportthatapprox.30%ofclinicaltext/numericaldataintheUnitedStates,includingmedicalrecords,bills,laboratoryandsurgeryreports,isstillnotgeneratedelectronically.Evenwhenclinicaldataareindigitalform,theyareusuallyheldbyanindividualproviderandrarelyshared(seeSlide2‐4).Biomedicalresearchdata,e.g.clinicaltrials,predictivemodelingetc.,isproducedbyacademiaandpharmaceuticalcompaniesandstoredindatabasesandlibraries.Clinicaldataisproducedinthehospitalandarestoredinhospitalinformationsystems(HIS),picturearchivingandcommunicationsystems(PACS)orinlaboratorydatabases,etc.Muchdataishealthbusinessdataproducedbypayors,providers,insurances,etc.Finally,thereisanincreasingpoolofpatientbehaviorandsentimentdata,producedbyvariouscustomersandstakeholders,outsidethetypicalclinicalcontext,includingthegrowingdatafromthewellnessandambientassistedlivingdomain.

18WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 19: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Amajorchallengeinournetworkedworldistheincreasingamountofdata– todaycalled“bigdata”.Thetrendtowardspersonalizedmedicinehasresultedinasheermassofthegenerated(‐omics)data,(seeSlide2‐7).Inthelifesciencesdomain,mostdatamodelsarecharacterizedbycomplexity,whichmakesmanualanalysisverytime‐consumingandfrequentlypracticallyimpossible(Holzinger,2013).

MoreandmoreOmics‐dataaregenerated,including:1)Genomicsdata(e.g.sequenceannotation),2)Transcriptomics data(e.g.microarraydata);thetranscriptome isthesetofallRNAmolecules,includingmRNA,rRNA,tRNA andnon‐codingRNAproducedinthecells.3)Proteomicsdata:Proteomicstudiesgeneratelargevolumesofrawexperimentaldataandinferredbiologicalresultsstoredindatarepositories,mostlyopenlyavailable;anoverviewcanbefoundhere:(Riffle&Eng,2009).Theoutcomeofproteomicsexperimentsisalistofproteinsdifferentiallymodifiedorabundantinacertainphenotype.Thelargesizeofproteomicsdatasetsrequiresspecializedanalyticaltools,whichdealwithlargelistsofobjects4)Metabolomics(e.g.enzymeannotation),themetabolome representsthecollectionofallmetabolitesinacell,tissue,organororganism.5)Protein‐DNAinteractions,6)Protein‐proteininteractions;PPIareatthecoreoftheentireinteractomics systemofanylivingcell.7)Fluxomics (isotopictracing,metabolicpathways),8)Phenomics (biomarkers),9)Epigenetics,isthestudyofthechangesingeneexpression– othersthantheDNAsequence,thereforetheprefix“epi‐“10)Microbiomics11)LipidomicsOmics‐dataintegrationhelpstoaddressinterestingbiologicalquestionsonthebiologicalsystemsleveltowardspersonalizedmedicine(Joyce&Palsson,2006).

19WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 20: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

MoreandmoreOmics‐dataaregenerated,including:1)Genomicsdata(e.g.sequenceannotation),2)Transcriptomicsdata(e.g.microarraydata);thetranscriptomeisthesetofallRNAmolecules,includingmRNA,rRNA,tRNA andnon‐codingRNAproducedinthecells.3)Proteomicsdata:Proteomicstudiesgeneratelargevolumesofrawexperimentaldataandinferredbiologicalresultsstoredindatarepositories,mostlyopenlyavailable;anoverviewcanbefoundhere:(Riffle&Eng,2009).Theoutcomeofproteomicsexperimentsisalistofproteinsdifferentiallymodifiedorabundantinacertainphenotype.Thelargesizeofproteomicsdatasetsrequiresspecializedanalyticaltools,whichdealwithlargelistsofobjects(Bessarabova etal.,2012).4)Metabolomics(e.g.enzymeannotation),themetabolome representsthecollectionofallmetabolitesinacell,tissue,organororganism.5)Protein‐DNAinteractions,6)Protein‐proteininteractions;PPIareatthecoreoftheentireinteractomics systemofanylivingcell.7)Fluxomics (isotopictracing,metabolicpathways),8)Phenomics (biomarkers),9)Epigenetics,isthestudyofthechangesingeneexpression– othersthantheDNAsequence,thereforetheprefix“epi‐“10)Microbiomics11)LipidomicsOmics‐dataintegrationhelpstoaddressinterestingbiologicalquestionsonthebiologicalsystemsleveltowardspersonalizedmedicine(Joyce&Palsson,2006).

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2908408/

Formoreinformationpleasereferto:Gomez‐Cabrero,D.,Abugessaisa,I.,Maier,D.,Teschendorff,A.,Merkenschlager,M.,Gisel,A.,Ballestar,E.,Bongcam‐Rudloff,E.,Conesa,A.&Tegner,J.2014.Dataintegrationintheeraofomics:currentandfuturechallenges.BMCSystemsBiology,8,(Suppl 2),I1.http://www.biomedcentral.com/1752‐0509/8/S2/I1

20WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 21: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Afurtherchallengeistointegratethedataandtomakeitaccessibletotheclinician.Whilethereismuchresearchontheintegrationofheterogeneousinformationsystems,ashortcomingisintheintegrationofavailabledata.Datafusionistheprocessofmergingmultiplerecordsrepresentingthesamereal‐worldobjectintoasingle,consistent,accurate,andusefulrepresentation(Bleiholder &Naumann,2008).AnexampleforthemixofdifferentdataforsolvingamedicalproblemcanbeseeninSlide2‐8.

AgoodexampleforcomplexmedicaldataisRCQM,whichisanapplicationthatmanagestheflowofdataandinformationintherheumatologyoutpatientclinic(50patientsperday,5daysperweek)ofGrazUniversityHospital,onthebasisofaqualitymanagementprocessmodel.Eachexaminationproduces100+clinicalandfunctionalparametersperpatient.Thisamasseddataaremorphedintobetteruseableinformationbyapplyingscoringalgorithms(e.g.DiseaseActivityScore,DAS)andareconvolutedovertime.Togetherwithpreviousfindings,physiologicallaboratorydata,patientrecorddataandOmicsdatafromthePathologydepartment,thesedataconstitutetheinformationbasisforanalysisandevaluationofthediseaseactivity.Thechallengeisintheincreasingquantitiesofsuchhighlycomplex,multi‐dimensionalandtimeseriesdata,seeanexample here:Simonic,K.M.,Holzinger,A.,Bloice,M.&Hermann,J.OptimizingLong‐TermTreatmentofRheumatoidArthritiswithSystematicDocumentation.ProceedingsofPervasiveHealth‐ 5thInternationalConferenceonPervasiveComputingTechnologiesforHealthcare,2011Dublin.IEEE,550‐554.http://www.biomedcentral.com/1472‐6947/13/103

21WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 22: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Donotconfusestructurewithstandardization(seeSlide2‐9).Datacanbestandardized(e.g.numericalentriesinlaboratoryreports)andnon‐standardized.Atypicalexampleisnon‐standardizedtext– impreciselycalled“Free‐Text”or“unstructureddata”inanelectronicpatientrecord(Kreuzthaleretal.,2011).

Standardizeddata isthe basisforaccuratecommunication.Inthemedicaldomain,manydifferentpeopleworkatdifferenttimesinvariouslocations.Datastandards canensurethatinformationisinterpretedbyalluserswiththesameunderstanding.Moreover,standardizeddatafacilitatecomparabilityofdataandinteroperabilityofsystems.Itsupportsthereusabilityofthedata,improvestheefficiencyofhealthcareservicesandavoidserrorsbyreducingduplicatedeffortsindataentry.

Datastandardizationreferstoa)thedatacontent;b)theterminologiesthatareusedtorepresentthedata;c)howdataisexchanged;andiv)howknowledge,e.g.clinicalguidelines,protocols,decisionsupportrules,checklists,standardoperatingproceduresarerepresentedinthehealthinformationsystem(refertoIOM).Technicalelementsfordatasharingrequirestandardizationofidentification,recordstructure,terminology,messaging,privacyetc.ThemostusedstandardizeddatasettodateistheinternationalClassificationofDiseases(ICD),whichwasfirstadoptedin1900forcollectingstatistics(Ahmadian etal.,2011),whichwewilldiscussin→Lecture3.Non‐standardizeddata isthemajorityofdataandinhibitdataquality,dataexchangeandinteroperability.Well‐structureddata istheminorityofdataandanidealisticcasewheneachdataelementhasanassociateddefinedstructure,relationaltables,ortheresourcedescriptionframeworkRDF,ortheWebOntologyLanguageOWL(see→Lecture3).Note:Ill‐structured isatermoftenusedfortheoppositeofwell‐structured,althoughthistermoriginallywasusedinthecontextofproblemsolving(Simon,1973).Semi‐structuredisaformofstructureddatathatdoesnotconformwiththestrictformalstructureoftablesanddatamodelsassociatedwithrelationaldatabasesbutcontainstagsormarkerstoseparatestructureandcontent,i.e.areschema‐lessorself‐describing;atypicalexampleisamarkup‐languagesuchasXML(see→Lecture3and4).Weakly‐Structureddata isthemostofourdatainthewholeuniverse,whetheritisinmacroscopic(astronomy)ormicroscopicstructures(biology)– see→Lecture5.Non‐structureddata orunstructureddata isanimprecisedefinitionusedforinformation expressedinnaturallanguage,whennospecificstructurehasbeendefined.Thisisanissuefordebate:Texthasalsosomestructure:words,sentences,paragraphs.Ifweareveryprecise,unstructureddatawouldmeantthatthedataiscompleterandomized– whichisusuallycallednoiseandisdefinedby(Duda,Hart&Stork,2000)asanypropertyofdatawhichisnotduetotheunderlyingmodelbutinsteadtorandomness(eitherintherealworld,fromthesensorsorthemeasurementprocedure).

22WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 23: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

“Multivariate”and“multidimensional”aremodernwordsandconsequentlyoverusedinliterature.Eachitemofdataiscomposedofvariables, andifsuchadataitemisdefinedbymorethanonevariableitiscalledamultivariabledataitem.Variablesarefrequentlyclassifiedintotwocategories:dependentorindependent.

Somemorereadings onthehomepageofYosuhua Bengio,UniversityofMontreal:http://www.iro.umontreal.ca/~bengioy/yoshua_en/research.html

AndtheMILA Lab– MontrealInstituteforLearningAlgorithmshttp://www.mila.umontreal.ca/

23WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 24: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

InPhysics,EngineeringandStatisticsavariableisaphysicalpropertyofasubject,whosequantitycanbemeasured,e.g.mass,length,time,temperature,etc.Inmathematicsa0‐dimensionalspace(nil‐dimensional)isatopologicalspacethathasdimensionzero– which isaninfinitesimalsmallpoint.a 1‐dimensionalspaceisalineinR1a2‐dimensionalspaceistheplaneinR2A3‐dimensionalspaceisasphere(orcube,cylinderetc.)inR3

24WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 25: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

SMILESdata(.smi)consists ofastringobtainedbythesymbolnodesencounteredinadepth‐firsttreetraversalofachemicalgraph, whichisfirsttrimmedtoremovehydrogenatomsandcyclesarebrokentoturnitintoaspanningtree.Wherecycleshavebeenbroken,numericsuffixlabelsareincludedtoindicatetheconnectednodes.

25WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 26: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Proteomicanalysisofmesenchymal stemcells(MSCs).Two‐dimensionalgelelectrophoresiswasperformedusingwholeproteincellextractsfromP2MSCculturesofpatientswithrheumatoidarthritis(RA)(n=10)(A)andhealthycontrols(n=6)(B).Afterscanning,spotdetection,quantificationandnormalisation,gelswerecomparedusingHierarchicalClusteringSoftwareandPearsontest(C).Noclustercouldbedetectedusingtheseproteomicprofiles.

Proteomicanalysis:Two‐dimensionalelectrophoresiswasperformedusingP2MSCsinpatientswithRA(n=10)andhealthycontrols(n=6)(fig4A,B).ByusingtheHierarchicalClusteringmethod,wecouldnotdefineanyclusterthatmightdiscriminatepatientandcontrolcells(fig4C).ThePearsoncorrelationcoefficientwasnotsignificantlydifferentbetweenpatientandcontrolcells(r=0.933(0.022)andr=0.929(0.020),respectively).Thesedatacorroboratethelackofsignificantchangesincytokineproductionbetweenpatientsandcontrols.

26WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 27: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

http://www.rcsb.org/pdb/images/3ond_bio_r_500.jpgThePDBisalargerepository containing3‐Dstructuralinformation,establishedin1971Dataastoredin2Dbutcaninfactrepresentbiologicalentitiesinthreeormoredimensions

27WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 28: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Transaxial (left),coronal(middle),andsagittal(right)imagesofapatientwhowasscannedfor30mininlist‐modewiththeBrainPET scanner;therecordingwasstarted20minafterinjectionofabout300MBq fluor‐deoxy‐glucose.

28WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 29: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

InMathematics,henceinInformatics,however,avariableisassociatedwithaspace–oftenann‐dimensionalEuclideanspace – inwhichanentity(e.g.afunction)oraphenomenonofcontinuousnatureisdefined.Thedatalocationwithinthisspacecanbereferencedbyusingarangeofcoordinatesystems(e.g.Cartesian,Polar‐coordinates,etc.):Thedependentvariablesarethoseusedtodescribetheentity(forexamplethefunctionvalue)whilsttheindependentvariablesarethosethatrepresentthecoordinatesystemusedtodescribethespaceinwhichtheentityisdefined.Ifadatasetiscomposedofvariableswhoseinterpretationfitsthisdefinitionourgoalistounderstandhowthe‘entity’isdefinedwithinthen‐dimensionalEuclideanspace .Sometimeswemaydistinguishbetweenvariablesmeaningmeasurementofproperty,fromvariablesmeaningacoordinatesystem,byreferringtotheformerasvariate,andreferringtothelatterasdimension(DosSantos&Brodlie,2002), (dosSantos&Brodlie,2004).Aspaceisasetofpoints.Ametricspacehasanassociatedmetric,whichenablesustomeasuredistancesbetweenpointsinthatspaceand,inturn,implicitlydefinetheirneighborhoods.Consequently,ametricprovidesaspacewithatopology,andametricspaceisatopologicalone.Topologicalspacesfeelalientousbecauseweareaccustomedtohavingametric.BiomedicalExample:Aproteinisasinglechainofaminoacids,whichfoldsintoaglobularstructure.TheThermodynamicsHypothesisstatesthataproteinalwaysfoldsintoastateofminimumenergy.Topredictproteinstructure,wewouldliketomodelthefoldingofaproteincomputationally.Assuch,theproteinfoldingproblembecomesanoptimizationproblem:Wearelookingforapathtotheglobalminimuminaveryhigh‐dimensionalenergylandscape;

29WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 30: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Letuscollect ‐dimensional observationsintheEuclideanvectorspace andweget:Eq.2‐1

, … ,

Acloudofpointssampledfromanysource(e.g.medicaldata,sensornetworkdata,asolid3‐Dobject,surfaceetc.).ThosedatapointscanbecoordinatedasanunorderedsequenceinanarbitrarilyhighdimensionalEuclideanspace,wheremethodsofalgebraictopologycanbeapplied.Themainchallengeisinmappingthedatabackinto ortobemorepreciseinto ,becauseourretinaisinherentlyperceivingdatain .Thecloudofsuchdatapointscanbeusedasacomputationalrepresentationoftherespectivedataobject.Atemporalversioncanbefoundinmotion‐capturedata,wheregeometricpointsarerecordedastimeseries.Nowyouwillaskanobviousquestion:“Howdowevisualizeafour‐dimensionalobject?”Theobviousansweris:“Howdowevisualizeathreedimensionalobject?”Humansdonotseeinthreespatialdimensionsdirectly,butviasequencesofplanarprojectionsintegratedinamannerthatissensedifnotcomprehended.Littlechildrenspendasignificanttimeoftheirfirstyearoflifelearninghowtoinferthree‐dimensionalspatialdatafrompairedplanarprojections,andmanyyearsofpracticehavetunedaremarkableabilitytoextractglobalstructurefromrepresentationsinastrictlylowerdimension(Ghrist,2008).Becausewehavethesameproblemhereinthisbook,wemuststayin andthereforetheexampleinSlide2‐12(Zomorodian,2005).InEinstein'stheoryofSpecialRelativity,Euclidean3‐spaceplustime(the"4th‐dimension")areunifiedintotheMinkowskispace

30WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 31: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Ametricspacehasanassociatedmetric,whichenablestomeasurethedistancesbetweenpointsinthatspaceand,implicitlydefinetheirneighborhoods.Consequently,ametricprovidesaspacewithatopology,henceametricspaceisatopologicalspace.AsetXwithametricfunctiondiscalledametricspace.Wegiveitthemetrictopologyofd,wherethesetofopenballsMostofour“natural”spacesareaparticulartypeofmetricspaces:theEuclideanspaces:TheCartesianproductof copiesof ,thesetofrealnumbers,alongwiththeEuclideanmetric:Eq.2‐2

,

isthe ‐dimensionalEuclideanspace .Wemayinduceatopologyonsubsetsofmetricspacesasfollows:If ⊆ withtopology ,thenwegettherelativeorinducedtopology bydefiningFormoreinformationreferto(Zomorodian,2005)or(Edelsbrunner &Harer,2010).

31WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 32: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

KnowledgeDiscoveryfromData:Bygettinginsightintothedata;thegainedinformationcanbeusedtobuildupknowledge.Thegrandchallengeistomaphigherdimensionaldataintolowerdimensions,hencemakeitinteractivelyaccessibletotheend‐user(Holzinger,2012),(Holzinger,2013).Thismappingfrom → isthecoretaskofvisualizationandamajorcomponentforknowledgediscovery:Enablingeffectiveinteractivehumancontroloverpowerfulmachinealgorithmstosupporthumansensemaking(Holzinger,2012),(Holzinger,2013).

Holzinger,A.2013.Human–ComputerInteraction&KnowledgeDiscovery(HCI‐KDD):Whatisthebenefitofbringingthosetwofieldstoworktogether?In:AlfredoCuzzocrea,C.K.,Dimitris E.Simos,EdgarWeippl,Lida Xu (ed.)MultidisciplinaryResearchandPracticeforInformationSystems,SpringerLectureNotesinComputerScienceLNCS8127.Heidelberg,Berlin,NewYork:Springer,pp.319‐328.

An importanttopicissubspaceclustering:Hund,M.,Sturm,W.,Schreck,T.,Ullrich,T.,Keim,D.,Majnaric,L.&Holzinger,A.2015.AnalysisofPatientGroupsandImmunizationResultsBasedonSubspaceClustering.In:Guo,Y.,Friston,K.,Aldo,F.,Hill,S.&Peng,H.(eds.)BrainInformaticsandHealth,LectureNotesinArtificialIntelligenceLNAI9250.Cham:SpringerInternationalPublishing,pp.358‐368.https://online.tugraz.at/tug_online/voe_main2.getVollText?pDocumentNr=1198810&pCurrPk=85960

32WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 33: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Multivariatedataset isadatasetthathasmanydependentvariables andtheymightbecorrelatedtoeachothertovaryingdegrees.Usuallythistypeofdatasetisassociatedwithdiscretedatamodels.

Multidimensionaldataset isadatasetthathasmanyindependentvariables clearlyidentified,andoneormoredependentvariablesassociatedtothem.Usuallythistypeofdatasetisassociatedwithcontinuousdatamodels.

Inotherwords,everydataitem(orobject)inacomputerisrepresented(stored)asasetoffeatures.Insteadofthetermfeatureswemayusethetermdimensions,becauseanobjectwith ‐featurescanalsoberepresentedasamultidimensionalpointinan ‐dimensionalspace.Dimensionalityreductionistheprocessofmappingan ‐dimensionalpoint,intoalower ‐dimensionalspace– thisisthemainchallengeinvisualizationsee→Lecture9.

Thenumberofdimensionscansometimesbesmall,e.g.simple1D‐datasuchastemperaturemeasuredatdifferenttimes,to3Dapplicationssuchasmedicalimaging,wheredataiscapturedwithinavolume.Standardtechniques—contouringin2D;isosurfacing andvolumerenderingin3D—haveemergedovertheyearstohandlethissortofdata.Thereisnodimensionreductionissueintheseapplications,sincethedataanddisplaydimensionsessentiallymatch.

33WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 34: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Datacanbecategorizedintoqualitative(nominalandordinal)andquantitative(intervalandratio):Intervalandratiodataareparametric,andareusedwithparametrictoolsinwhichdistributionsarepredictable(andoftenNormal).Nominalandordinaldataarenon‐parametric,anddonotassumeanyparticulardistribution.Theyareusedwithnon‐parametrictoolssuchastheHistogram.Theclassicpaperonthetheoryofscalesofmeasurementis(Stevens,1946).

34WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 35: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Wecansummarizewhatwelearnedsofaraboutdata:Datacanbenumeric,non‐numeric,orboth.Non‐numericdatacanincludeanythingfromlanguagedata(text)tocategorical,image,orvideodata.Datamayrangefromcompletelystructured,suchascategoricaldata,tosemi‐structured,suchasanXMLFilecontainingmetainformation,tounstructured,suchasanarrative“free‐text”.Note,thattermunstructureddoesnotmeanthatthedataarewithoutanypattern,whichwouldmeancompleterandomnessanduncertainty,butratherthat“unstructureddata”areexpressedso,thatonlyhumanscanmeaningfullyinterpretit.Structureprovidesinformationthatcanbeinterpretedtodeterminedataorganizationandmeaning,henceitprovidesacontext fortheinformation.Theinherentstructureinthedatacanformabasisfordatarepresentation.Animportant,yetoftenneglectedissuearetemporalcharacteristicsofdata: Dataofalltypesmayhaveatemporal(time)association,andthisassociationmaybeeitherdiscreteorcontinuous(Thomas&Cook,2005).InMedicalInformaticswehaveapermanentinteractionbetweendata,informationandknowledge,withdifferentdefinitions(Bemmel&Musen,1997),seeSlide2‐16:

35WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 36: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Data arethephysicalentitiesatthelowestabstractionlevelwhichare,e.g.generatedbyapatient(patientdata)orabiologicalprocess(e.g.Omicsdata).Accordingto(Bemmel&Musen,1997)datacontainnomeaning.

Informationisderivedbyinterpretationofthedatabyaclinician(humanintelligence).

Knowledge isobtainedbyinductivereasoningwithpreviouslyinterpreteddata,collectedfrommanysimilarpatientsorprocesses,whichisaddedtothesocalledbodyofknowledgeinmedicine,theexplicitknowledge. Thisknowledgeisusedfortheinterpretationofotherdataandtogainimplicitknowledge whichguidestheclinicianintakingfurtheraction.

36WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 37: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Forhypothesisgenerationandtesting,fourtypesofinferencesexist(Peirce,1955):abstraction,abduction,deduction,andinduction.Thefirsttwodrivehypothesisgenerationwhilethelatterdrivehypothesistesting,seeSlide2‐17:Abstractionmeansthatdataarefilteredaccordingtotheirrelevancefortheproblemsolutionandchunkedinschemasrepresentinganabstract descriptionoftheproblem(e.g.,abstractingthatanadultmalewithhaemoglobinconcentrationlessthan14g/dL isananaemicpatient).Followingthis,hypothesesthatcouldaccountforthecurrentsituationarerelatedthroughaprocessofabduction,characterizedbya"backwardflow"ofinferencesacrossachainofdirectedrelationswhichidentifythoseinitialconditionsfromwhichthecurrentabstractrepresentationoftheproblemoriginates.Thisprovidestentativesolutionstotheproblemathandbywayofhypotheses.Forexample,knowingthatdisease willcausesymptom,abductionwilltrytoidentifytheexplanationforB,whiledeductionwillforecastthatapatientaffectedbydisease will

manifestsymptom :bothinferencesareusingthesamerelationalongtwodifferentdirections(Patel&Ramoni,1997).Abduction ischaracterizedbyacyclicalprocessofgeneratingpossibleexplanations(i.e.,identificationofasetofhypothesesthatareabletoaccountfortheclinicalcaseonthebasisoftheavailabledata)andtestingthoseexplanations(i.e.,evaluationofeachgeneratedhypothesisonthebasisofitsexpectedconsequences)fortheabnormalstateofthepatientathand(Patel,Arocha &Zhang,2004).

ThehypothesistestingprocedurescanbeinferredfromSlide2‐17:Generalknowledgeisgainedfrommanypatients,andthisgeneralknowledgeisthenappliedtoanindividualpatient.Wehavetodeterminebetween:Reasoning istheprocessbywhichaclinicianreachesaconclusionafterthinkingaboutallthefacts;Deduction consistsofderivingaparticularvalidconclusionfromasetofgeneralpremises;Induction consistsofderivingalikelygeneralconclusionfromasetofparticularstatements.Reasoninginthe“realworld”doesnotappeartofitneatlyintoanyofthesebasictypes.Therefore,athirdformofreasoninghasbeenrecognizedbyPeirce(1955),wheredeductionandinductionareinter‐mixed;

37WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 38: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Thequestion“whatisinformation?”isstillanopenquestioninbasicresearch,andanydefinitionisdependingontheviewtaken.Forexample,thedefinitiongivenbyCarl‐FriedrichvonWeizsäcker:“Informationiswhatisunderstood,”impliesthatinformationhasbothasenderandareceiverwhohaveacommonunderstandingoftherepresentationandthemeanstoconveyinformationusingsomepropertiesofthephysicalsystems,andhisaddendum:“Informationhasnoabsolutemeaning;itexistsrelativelybetweentwosemanticlevels”impliestheimportanceofcontext(Marinescu,2011).Withoutdoubtinformationisafundamentallyimportantconceptwithinourworldandlifeiscomplexinformation,seeSlide2‐14:

Manysystems,e.g.inthequantumworldtonotobeytheclassicalviewofinformation.Inthequantumworldandinthelifesciencestraditionalinformationtheoryoftenfailstoaccuratelydescribereality…forexampleinthecomplexityofalivingcell:Allcomplexlifeiscomposedofeukaryotic(nucleated)cells(Lane&Martin,2010).Agoodexampleofsuchacellistheprotist EuglenaGracilis (inGerman“Augentierchen”)withalengthofapprox.30 .Lifecanbeseenasadelicateinterplayofenergy,entropyandinformation,essentialfunctionsoflivingbeingscorrespondtothegeneration,consumption,processing,preservationandduplicationofinformation.

P:Complexity<>Information<>Energy<> Entropy

38WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 39: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

TheetymologicaloriginofthewordinformationcanbetracedbacktotheGreek“forma”andtheLatin“information”and“informare”,tobringsomethingintoashape(“in‐a‐form”).Consequently,thenaivedefinitionincomputerscienceis“informationisdataincontext” andthereforedifferentthandataorknowledge.However,wefollowthenotionof(Boisot &Canals,2004)anddefinethatinformationisanextractionfromdatathat,bymodifyingtherelevantprobabilitydistributions,hasdirectinfluenceonanagent’sknowledgebase.Forabetterunderstandingofthisconcept,wefirstreviewthemodelofhumaninformationprocessingbyWickens (1984):ThemodelbyWickens (1984)beautifullyemphasizesourviewondata,informationandknowledge:thephysicaldatafromthereal‐worldareperceivedasinformationthroughperceptualfilters,controlledbyselectiveattentionandformhypotheseswithintheworkingmemory.Thesehypothesesaretheexpectationsdependingonourpreviousknowledgeavailableinourmentalmodel,storedinthelong‐termmemory.Thesubjectivelybestalternativehypothesiswillbeselectedandprocessedfurtherandmaybetakenasoutcomeforanaction.Duetothefactthatthissystemisaclosedloop,wegetfeedbackthroughnewdataperceivedasnewinformationandtheprocessgoeson.

39WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 40: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Theincomingstimulifromthephysicalworldmustpassbothaperceptualfilterandaconceptualfilter.Theperceptualfilter orientatesthesenses(e.g.visualsense)tocertaintypesofstimuliwithinacertainphysicalrange(e.g.visualsignalrange,pre‐knowledge,attentionetc.).Onlythestimuliwhichpassthroughthisfiltergetregisteredasincomingdata–everythingelseisfilteredout.Atthispointitisimportanttofollowourphysicalprincipleofdata:todifferentiatebetweentwonotionsthatarefrequentlyconfused:anexperiment’s(raw,hard,measured,factual)dataandits(meaningful,subjective)interpretedinformationresults.Dataarepropertiesconcerningonlytheinstrument;itistheexpressionofafact. Theresultconcernsapropertyoftheworld.Thefollowingconceptualfiltersextractinformation‐bearingdatafromwhathasbeenpreviouslyregistered.Bothtypesoffiltersareinfluencedbytheagents’cognitiveandaffectiveexpectations,storedintheirmentalmodels.Theenormousutilityofdataresidesinthefactthatitcancarryinformationaboutthephysicalworld.Thisinformationmaymodifysetexpectationsorthestate‐of‐knowledge.Theseprinciplesallowanagent toactinadaptivewaysinthephysicalworld(Boisot &Canals,2004).Conferthisprocesswiththehumaninformationprocessingmodelby(Wickens,1984),seeninSlide2‐19anddiscussedin→Lecture7.

40WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 41: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Entropyhasmanydifferentdefinitionsandapplications,originallyinstatisticalphysicsandmostoftenitisusedasameasurefordisorder.Ininformationtheory,entropycanbeusedasameasurefortheuncertaintyinadataset.

To demonstratehowusefulentropycanbe‐ youcanhavealookatthispaper:Holzinger,A.,Stocker,C.,Peischl,B.&Simonic,K.‐M.2012.OnUsingEntropyforEnhancingHandwritingPreprocessing.Entropy,14,(11),2324‐2350.http://www.mdpi.com/1099‐4300/14/11/2324

41WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 42: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Theconceptofentropywasfirstintroducedinthermodynamics(Clausius,1850),whereitwasusedtoprovideastatementofthesecondlawofthermodynamics.Later,statisticalmechanicsprovidedaconnectionbetweenthemacroscopicpropertyofentropyandthemicroscopicstateofasystembyBoltzmann.Shannonwasthefirsttodefineentropyandmutualinformation.

Shannon(1948)usedaGedankenexperiment(thoughtexperiment)toproposeameasureofuncertaintyinadiscretedistributionbasedontheBoltzmannentropyofclassicalstatisticalmechanics,seeSlide2‐22:

42WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 43: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Anexampleshalldemonstratetheusefulnessofthisapproach:1)Let beadiscretedatasetwithassociatedprobabilities :Eq.2‐5

… , … ,

2)NowweapplyShannon’sequationEq.2‐4:Eq.2‐6

3)Weassumethatoursourcehastwovalues(ball=white,ball=black)LetusdothefamoussimpleGedankenexperiment(thoughtexperiment):Imagineaboxwhichcancontaintwocoloredballs:blackandwhite.Thisisoursetofdiscretesymbolswithassociatedprobabilities.Ifwegrabblindlyintothisboxtogetaball,wearedealingwithuncertainty,becausewedonotknowwhichballwetouch.Wecanask:Istheballblack?NO.THENitmustbewhite,soweneedonequestiontosurelyprovidetherightanswer.Becauseitisabinarydecision(YES/NO)themaximumnumberof(binary)questionsrequiredtoreducetheuncertaintyis:log ,where isthenumberofthepossibleoutcomes.Ifthereare eventswithequalprobability then 1/ .Ifyouhaveonly1blackball,thenlog 1 0,whichmeansthereisnouncertainty.Eq.2‐7

, with , 14)NowwesolvenumericallyEq.2‐6:Eq.2‐8

∗ log1

∗ log1

1Since rangesfrom0(forimpossibleevents)to1(forcertainevents),theentropyvaluerangesfrominfinity(forimpossibleevents)to0(forcertainevents).So,wecansummarizethattheentropyistheweightedaverageofthesurpriseforallpossibleoutcomes.Forourexamplewiththetwoballswecandrawthefollowingfunction:Theentropyvalueis1for =0,5anditisboth0foreither =0or =1.Thisexamplemightseemtrivial,buttheentropyprinciplehasbeendevelopedalotsinceShannonandtherearemanydifferentmethods,whichareveryusefulfordealingwithdata.

43WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 44: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Shannoncalledittheinformationentropy (akaShannonentropy)anddefined:Eq.2‐9

log1

log

where istheprobabilityoftheeventoccurring.If isnotidenticalforalleventsthentheentropy isaweightedaverageofallprobabilities,whichShannondefinedas:Eq.2‐10

2

Basically,theentropyp(x)approacheszeroifwehaveamaximumofstructure– andopposite,theentropyp(x)reacheshighvaluesifthereisnostructure– hence,ideally,iftheentropyisamaximum,wehavecompleterandomness,totaluncertainty.LowEntropymeansdifferences,structure,individuality– highEntropymeansnodifferences,nostructure,noindividuality.Consequently,lifeneedslowentropy.

44WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 45: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Theprinciplewhatwecaninferfromentropyvaluesis:1)Lowentropy valuesmeanhighprobability,highcertainty, henceahighdegreeofstructurization inthedata.2)Highentropy valuesmeanlowprobability,lowcertainty (≅ highuncertainty;‐),hencealowdegreeofstructurization inthedata.Maximumentropywouldmeancompleterandomnessandtotaluncertainty.Highlystructureddatacontainlowentropy;ideallyifeverythingisinorderandthereisnosurprise(nouncertainty)theentropyislow:Eq.2‐11

0

Eq.2‐12 log .

Ontheotherhandifthedataareweaklystructured– asforexampleinbiologicaldata–andthereisnoabilitytoguess(alldataisequallylikely)theentropyishigh:Ifwefollowthisapproach,“unstructureddata”wouldmeancompleterandomness.Letuslookonthehistoryofentropytounderstandwhatwecandoinfuture,seeSlide2‐25.

45WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 46: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Youmight arguewhatthepracticalpurposeofthisapproachis– manifoldapplications!

Example:Heartratevariabilityisthevariationofthetimeintervalbetweenconsecutiveheartbeats.Entropyisacommonlyusedtooltodescribetheregularityofdatasets.Entropyfunctionsaredefinedusingmultipleparameters,theselectionofwhichiscontroversialanddependsontheintendedpurpose.Mayeretal. (2014)describetheresultsoftestsconductedtosupportparameterselection,towardsthegoalofenablingfurtherbiomarkerdiscovery.Theydealtwithapproximate,sample,fuzzy,andfuzzymeasureentropies.AlldatawereobtainedfromPhysioNet https://www.physionet.org,afree‐access,on‐linearchiveofphysiologicalsignals,andrepresentvariousmedicalconditions.Fivetestsweredefinedandconductedtoexaminetheinfluenceof:varyingthethresholdvaluer(asmultiplesofthesamplestandarddeviation?,ortheentropy‐maximizingrChon),thedatalengthN,theweightingfactorsnforfuzzyandfuzzymeasureentropies,andthethresholdsrF andrL forfuzzymeasureentropy.TheresultsweretestedfornormalityusingLilliefors'compositegoodness‐of‐fittest.Consequently,thep‐valuewascalculatedwitheitheratwosamplet‐testoraWilcoxonranksumtest.Thefirsttestshowsacross‐overofentropyvalueswithregardtoachangeofr.Thus,aclearstatementthatahigherentropycorrespondstoahighirregularityisnotpossible,butisratheranindicatorofdifferencesinregularity.Nshouldbeatleast200datapointsforr=0.2?andshouldevenexceedalengthof1000forr=rChon.Theresultsfortheweightingparametersnforthefuzzymembershipfunctionshowdifferentbehaviorwhencoupledwithdifferentrvalues,thereforetheweightingparametershavebeenchosenindependentlyforthedifferentthresholdvalues.ThetestsconcerningrF andrL showedthatthereisnooptimalchoice,butr=rF =rL isreasonablewithr=rChon orr=0.2?.CONCLUSIONS:Someofthetestsshowedadependencyofthetestsignificanceonthedataathand.Nevertheless,asthemedicalconditionsareunknownbeforehand,compromiseshadtobemade.Optimalparametercombinationsaresuggestedforthemethodsconsidered.Yet,duetothehighnumberofpotentialparametercombinations,furtherinvestigationsofentropyforheartratevariabilitydatawillbenecessary.Mayer,C.,Bachler,M.,Hortenhuber,M.,Stocker,C.,Holzinger,A.&Wassertheurer,S.2014.Selectionofentropy‐measureparametersforknowledgediscoveryinheartratevariabilitydata.BMCBioinformatics,15,(Suppl 6),S2.http://www.ncbi.nlm.nih.gov/pubmed/25078574

46WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 47: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

TheoriginmaybefoundintheworkofJakob Bernoulli,describingtheprincipleofinsufficientreason:weareignorantofthewaysaneventcanoccur,theeventwilloccurequallylikelyinanyway.ThomasBayes(1763)andPierre‐SimonLaplace(1774)carriedonandHaroldJeffreys andDavidCoxsolidifieditintheBayesianStatistics,akastatisticalinference.ThesecondpathleadingtotheclassicalMaximumEntropy,en‐routewiththeShannonEntropy,canbeidentifiedwiththeworkofJamesClerkMaxwellandLudwigBoltzmann,continuedbyWillardGibbsandfinallyClaudeElwoodShannon.Thisworkisgearedtowarddevelopingthemathematicaltoolsforstatisticalmodelingofproblemsininformation.Thesetwoindependentlinesofresearchareverysimilar.Theobjectiveofthefirstlineofresearchistoformulateatheory/methodologythatallowsunderstandingofthegeneralcharacteristics (distribution)ofasystemfrompartialandincompleteinformation.Inthesecondrouteofresearch,thesameobjectiveisexpressedasdetermininghowtoassign(initial)numericalvaluesofprobabilitieswhenonlysome(theoretical)limitedglobalquantitiesoftheinvestigatedsystemareknown.RecognizingthecommonbasicobjectivesofthesetwolinesofresearchaidedJaynes inthedevelopmentofhisclassicalwork,theMaximumEntropyformalism.Thisformalismisbasedonthefirstlineofresearchandthemathematicsofthesecondlineofresearch.TheinterrelationshipbetweenInformationTheory,statisticsandinference,andtheMaximumEntropy(MaxEnt)principlebecameclearin1950ies,andmanydifferentmethodsarosefromtheseprinciples(Golan,2008),seenextSlide

47WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 48: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

MaximumEntropy(MaxEn),describedby(Jaynes,1957),isusedtoestimateunknownparametersofamultinomialdiscretechoiceproblem,whereastheGeneralizedMaximumEntropy(GME)modelincludesnoisetermsinthemultinomialinformationconstraints.Eachnoisetermismodeledasthemeanofafinitesetofaprioriknownpointsintheinterval 1,1 withunknownprobabilitieswherenoparametricassumptionsabouttheerrordistributionaremade.AGMEmodelforthemultinomialprobabilitiesandforthedistributions,associatedwiththenoisetermsisderivedbymaximizingthejointentropyofmultinomialandnoisedistributions,undertheassumptionofindependence(Jaynes,1957).TopologicalEntropy (TopEn),wasintroducedby(Adler,Konheim &McAndrew,1965)withthepurpose tointroducethenotionofentropyasaninvariantforcontinuousmappings:Let , beatopologicaldynamicalsystem,i.e.,let beanonemptycompactHausdorff spaceand : → acontinuousmap;theTopEn isanonnegativenumberwhichmeasuresthecomplexity ofthesystem(Adler,Downarowicz &Misiurewicz,2008).GraphEntropywasdescribedby(Mowshowitz,1968)tomeasurestructuralinformationcontentofgraphs,andadifferentdefinition,morefocusedonproblemsininformationandcodingtheory,wasintroducedby(Körner,1973).Graphentropyisoftenusedforthecharacterizationofthethe structureofgraph‐basedsystems,e.g.inmathematicalbiochemistry.Intheseapplicationstheentropyofagraphisinterpretedasitsstructuralinformationcontentandservesasacomplexitymeasure,andsuchameasureisassociatedwithanequivalencerelationdefinedonafinitegraph;byapplicationofShannon’sEq.2.4withtheprobabilitydistributionwegetanumericalvaluethatservesasanindexofthestructuralfeaturecapturedbytheequivalencerelation(Dehmer&Mowshowitz,2011).

MinimumEntropy (MinEn),describedby(Posner,1975),providesustheleastrandom,andtheleastuniformprobabilitydistributionofadataset,i.e.theminimumuncertainty,whichisthelimitofourknowledgeandofthestructureofthesystem.Often,theclassicalpatternrecognitionisdescribedasaquestforminimumentropy.Mathematically,itismoredifficulttodetermineaminimumentropyprobabilitydistributionthanamaximumentropyprobabilitydistribution;whilethelatterhasaglobalmaximumduetotheconcavityoftheentropy,theformerhastobeobtainedbycalculatingalllocalminima,consequentlytheminimumentropyprobabilitydistributionmaynotexistinmanycases(Yuan&Kesavan,1998).CrossEntropy (CE),discussedby(Rubinstein,1997),wasmotivatedbyanadaptivealgorithmforestimatingprobabilitiesofrareeventsincomplexstochasticnetworks,whichinvolvesvarianceminimization.CEcanalsobeusedforcombinatorialoptimizationproblems(COP).Thisisdonebytranslatingthe“deterministic”optimizationproblemintoarelated“stochastic”optimizationproblemandthenusingrareeventsimulationtechniques(DeBoeretal.,2005).Rényi entropy isageneralizationoftheShannonentropy(informationtheory),andTsallis entropyisageneralizationofthestandardBoltzmann–Gibbsentropy(statisticalphysics).Forusmoreimportantare:ApproximateEntropy(ApEn),describedby(Pincus,1991),isuseabletoquantifyregularityindatawithoutanyaprioriknowledgeaboutthesystem,seeanexampleinSlide2‐20.SampleEntropy(SampEn),wasusedby(Richman&Moorman,2000)foranewrelatedmeasureoftimeseriesregularity.SampEn wasdesignedtoreducethebiasofApEn andisbettersuitedfordatasetswithknownprobabilisticcontent.

48WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 49: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Problem:Monitoringbodymovementsalongwithvitalparametersduringsleepprovidesimportantmedicalinformationregardingthegeneralhealth,andcanthereforebeusedtodetecttrends(largeepidemiologystudies)todiscoversevereillnessesincludinghypertension(whichisenormouslyincreasinginoursociety).Thisseeminglysimpledata– onlyfromonenightperiod– demonstratesthecomplexityandtheboundariesofstandardmethods(forexampleFastFourierTransformation)todiscoverknowledge(forexampledeviations,similaritiesetc.).Duetothecomplexityanduncertaintyofsuchdatasets,standardmethods(suchasFFT)comprisethedangerofmodelingartifacts.Sincetheknowledgeofinterestformedicalpurposesisinanomalies(alterations,differences,a‐typicalities,irregularities),theapplicationofentropicmethodsprovidesbenefits.PhotographtakenduringtheEUProjectEMERGEandusedwithpermission.

49WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 50: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

1)Wehaveagivendataset wherecapital isthenumberofdatapoints:Eq.2‐13

, , … ,

2)Nowweformm‐dimensionalvectorsEq.2‐14

, , … ,3)Wemeasurethedistancebetweeneverycomponent,i.e.themaximumabsolutedifferencebetweentheirscalarcomponentsEq.2‐15

, max, ,…,

4)Welook– sotosay– inwhichdimensionisthebiggestdifference;asaresultwegettheApproximateEntropy(ifthereisnodifferencewehavezerorelativeentropy):Eq.2‐16

ApEn , lim →

where istherunlengthand isthetolerancewindow (letusassumethat isequalto ),ApEn (m,r)couldalsobewrittenasH ,5) iscomputedbyEq.2‐17

1 1

ln

withEq.2‐18

1

6) measureswithinthetolerance theregularityofpatternssimilartoagivenoneofwindowlength7)Finallyweincreasethedimensionto 1 andrepeatthestepsbeforeandgetasaresulttheapproximateentropyApEn ,ApEn , , isapproximatelythenegativenaturallogarithmoftheconditionalprobability(CP)thatadatasetoflength,havingrepeateditselfwithinatolerance for points,willalsorepeatitselffor 1 points.Animportantpointto

keepinmindabouttheparameter isthatitiscommonlyexpressedasafractionoftheStandarddeviation(SD)ofthedataandinthiswaymakesApEn ascale‐invariantmeasure.Alowvaluearisesfromahighprobabilityofrepeatedtemplatesequencesinthedata(Hornero etal.,2006).

50WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 51: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Inthisslidewecanseetheplotofthenormalizedapproximateentropyforeachoftheepisodesandthemedianacrossalltheepisodes.Fromthisfigurewecanseethattheentropyisaminimumwherewehavenoalterationsandentropyisincreasingwhenhavingirregularities.Ifwehavenodifferenceswegetzeroentropy

51WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 52: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Afinalexampleshouldmaketheadvantageofsuchanentropymethodtotallyclear:Intherightdiagramitishardtodiscoverirregularitiesforamedicalprofessional–especiallyoveralongerperiod,butananomalycaneasilybedetectedbydisplayingthemeasuredrelativeApEn.Whatcanwelearnfromthisexperiment?Approximateentropyisrelativelyunaffectedbynoise;itcanbeappliedtocomplextimeserieswithgoodreproduction;itisfiniteforstochastic,noisy,compositeprocesses;thevaluescorresponddirectlytoirregularities;anditisapplicabletomanyotherareas– forexamplefortheclassificationoflargesetsoftexts– theabilitytoguessalgorithmicallythesubjectofatextcollectionwithouthavingtoreaditwouldpermitautomatedclassification.

52WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 53: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

A. Holzinger    LV 709.049 Med. Informatik                                

WS 2015 53

Page 54: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Algorithm 1:WithrotateDataPoints”definedwecancalculatetheprojectionprofilepy;j( forarangeofdifferentanglesandwiththosewecancomputetheentropyHy, (X)foreachangle.Holzinger etal.(2012)implementedthisalgorithmasitisdescribedinAlgorithm2.Formoreinformationplease refertothepaper.Thisjustshalldemonstratethestrengthsandweaknessesofusingentropyforskew‐ andslant‐correction– whichisimportantinhandwritingrecognition.However,theentropybasedskewcorrectiondoesnotoutperformoldermethodslikeskewcorrectionbasedontheleastsquaresmethod:thenoiseinthedrawingdistortstherealminimaoftheentropydistribution.Inmanycaseswheretheglobalminimumwasthewrongchoice,therewasalocalminimumclosetotherealerrorangle.Eventhoughbothapproachesyieldsatisfyingresultforwordslongerthanfiveletters,wesuggestfurtherinvestigationintotheentropy‐basedskewcorrectionmethod,withnoisereductioninmind.Ontheotherhand,itshowsthatentropyisinfactusefulwhenperformingslantcorrection,asitdoesoutperformthewindow‐basedapproach!Theconclusionis,thatthewindow‐basedmethodistoomuchdependentonanumberoffactors.Itsperformanceisinfluencedalotbytheoutcomeofzonedetectionandbythewritingstyleofthewriter.Itisalsoinfluencedalotbywindowselection.Holzinger,A.,Stocker,C.,Peischl,B.&Simonic,K.‐M.2012.OnUsingEntropyforEnhancingHandwritingPreprocessing.Entropy,14,(11),2324‐2350

A. Holzinger    LV 709.049 Med. Informatik                                

WS 2015 54

Page 55: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Whatcanwelearnfromthisexperiment?Approximateentropyisrelativelyunaffectedbynoise;itcanbeappliedtocomplextimeserieswithgoodreproducibility;itisfiniteforstochastic,noisy,compositeprocesses;thevaluescorresponddirectlytoirregularities;

anditisapplicabletomanyotherareas– forexamplefortheclassificationoflargesetsoftexts– theabilitytoguessalgorithmicallythesubjectofatextcollectionwithouthavingtoreaditwouldpermitautomatedclassification.

55WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 56: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

56

My DEDICATION is to make data valuable … Thank you!

WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 57: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

57WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 58: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

58WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 59: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

MFC=MinimumFoot ClearanceStride=stepYoucanseebrilliantlywhatyoucanmeasurewithentropy– youcandetermineanomalies,i.e.thebalanceproblemsofelderlygait

MFCPoincaré plots.ToppanelsshowMFCtimeseriesfromahealthyelderlysubject(A)anditscorrespondingPoincaré plot(B).BottompanelsshowMFCtimeseriesfromanelderlysubjectwithbalanceproblem(C)anditscorrespondingPoincaré plot(D).

SignificantrelationshipsofmeanMFCwithPoincaré plotindexes(SD1,SD2)andApEn (r=0.70,p<0.05;r=0.86,p<0.01;r=0.74,p<0.05)werefoundinthefalls‐riskelderlygroup.Ontheotherhand,suchrelationshipswereabsentinthehealthyelderlygroup.Incontrast,theApEn valuesofMFCdataseriesweresignificantly(p<0.05)correlatedwithPoincaré plotindexesofMFCinthehealthyelderlygroup,whereascorrelationswereabsentinthefalls‐riskgroup.TheApEn valuesinthefalls‐riskgroup(meanApEn =0.18± 0.03)wassignificantly(p<0.05)higherthanthatinthehealthygroup(meanApEn =0.13± 0.13).ThehigherApEn valuesinthefalls‐riskgroupmightindicateincreasedirregularitiesandrandomnessintheirgaitpatternsandanindicationoflossofgaitcontrolmechanism.ApEn valuesofrandomlyshuffledMFCdataoffallsrisksubjectsdidnotshowanysignificantrelationshipwithmeanMFC.

59WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 60: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

60WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 61: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

61WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 62: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Surrogatedatarecords.AandBshowthemajorcomponents.A:themeanprocess,whichhassetpointandspikemodes.B:thebaselineprocess,heremeaningtheheartratevariability,modeledasGaussianrandomnumbers.C:theirsum,asurrogatedatarecord.D–F:amorerealisticsurrogatewiththesamefrequencycontentastheobserveddata.D:aclinicallyobserveddatarecordof4,096R‐Rintervals.Thelefthand ordinateislabeledinms andtherighthand ordinateinSD.E:a4,096‐pointisospectral surrogatedatasetformedusingtheinverseFouriertransformoftheperiodogram ofthedatainD.F:thesurrogatedataafteradditionofaclinicallyobserveddecelerationlasting50pointsandscaledsothatthevarianceoftherecordisincreasedfrom1to2.

62WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 63: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

63WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 64: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

64WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 65: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

65WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 66: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

66WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 67: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_entropy_sect018.htm

Wheremanyotherlanguagesrefertotables,rows,andcolumns/fields,SASusesthetermsdatasets,observations,andvariables.ThereareonlytwokindsofvariablesinSAS:numericandcharacter(string).Bydefaultallnumericvariablesarestoredas(8byte)real.Itispossibletoreduceprecisioninexternalstorageonly.DateanddatetimevariablesarenumericvariablesthatinherittheCtraditionandarestoredaseitherthenumberofdays(fordatevariables)orseconds(fordatetime variables).

http://www.sas.com/technologies/analytics/statistics/stat/index.html

67WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 68: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Hadoop andtheMapReduce programmingparadigmalreadyhaveasubstantialbaseinthebioinformaticscommunity– inparticular inthefieldofhigh‐throughput next‐generationsequencinganalysis.Thisisduetothecost‐effectivenessofHadoop‐basedanalysisoncommodityLinuxclusters,andinthecloudviadatauploadtocloudvendorswhohaveimplementedHadoop/HBase;andduetotheeffectivenessandease‐of‐useoftheMapReduce methodinparallelizationofmanydataanalysisalgorithms.

68WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 69: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Thechallengewefaceisthatanestimatedaverageof5%ofdataarestructured,therestiseithersemi‐structured,weaklystructuredandmostofourdataisunstructured.Maybethemostimportantfield forthefutureisdatamining– especiallynoveltechniquesofdatamining,includingbothtimeandspace(e.g.graph‐based,entropy‐based,topological‐baseddataminingapproaches).Readmorehere:Holzinger,A.2014.ExtravaganzaTutorialonHotIdeasforInteractiveKnowledgeDiscoveryandDataMininginBiomedicalInformatics.In:Slezak,D.,Tan,A.‐H.,Peters,J.F.&Schwabe,L.(eds.)BrainInformaticsandHealth,BIH2014,LectureNotesinArtificialIntelligence,LNAI8609.Heidelberg,Berlin:Springer,pp.502‐515.https://online.tugraz.at/tug_online/voe_main2.getVollText?pDocumentNr=764238&pCurrPk=79139

69WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 70: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

70WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 71: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

http://minnesotafuturist.pbworks.com/w/page/21441129/DIKW

Afunnydescription ofdatainformationknowledge.

71WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 72: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Avery placative image.Nicetolookat– buttheusefulnessisquestionable.

72WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 73: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Allthismodelsareveryquestionable. PleaserememberthatwefollowinourlecturethenotionofBoisot &Canals.

73WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 74: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Theinterestingissue ofthisgraphicisthatitincludesatime‐axis,whichisimportantfordecisionmakingandpredictiveanalytics.“Pastbehaviourisagoodpredictorforfuturebehaviour”Althoughthis isaoversimplification,scientistswhostudyhumanbehavioragreethatpastbehaviormaybeausefulmarkerforfuturebehavior,however,onlyundercertainspecificconditions. Readmore:Ajzen,I.1991.Thetheoryofplannedbehavior.Organizationalbehaviorandhumandecisionprocesses,50,(2),179‐211.

74WS 2015

A. Holzinger    LV 709.049 Med. Informatik