A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... ·...

74
Status as of Mo, 19.10.2015 08:30 Dear Students – welcome to the second lecture of our course “biomedical informatics”, please remember from the last lecture the definition: According to the American Association of Medical Informatics (AMIA) the term Medical Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary field that studies and pursues the effective use of biomedical data, information, and knowledge for scientific inquiry, problem solving, and decision making, motivated by efforts to improve human health”. [1] It is important to know: Bioinformatics + Medical Informatics = Biomedical Informatics (see Slide 1‐42). Note: Computers are just the vehicles to realize the central goals: To harness the power of the machines to support and to amplify human intelligence [2]. [1] Shortliffe, E. H. 2011. Biomedical Informatics: Defining the Science and its Role in Health Professional Education. In: Holzinger, A. & Simonic, K.‐M. (eds.) Information Quality in e‐Health. Lecture Notes in Computer Science LNCS 7058. Heidelberg, New York: Springer, pp. 711‐714. [2] Holzinger, A. 2013. Human–Computer Interaction and Knowledge Discovery (HCI‐ KDD): What is the benefit of bringing those two fields to work together? In: Cuzzocrea, A., Kittl, C., Simos, D. E., Weippl, E. & Xu, L. (eds.) Multidisciplinary Research and Practice for Information Systems, Springer Lecture Notes in Computer Science LNCS 8127. Heidelberg, Berlin, New York: Springer, pp. 319‐328. Regarding the current trend towards personalized medicine have a read of this paper: Holzinger, A. 2014. Trends in Interactive Knowledge Discovery for Personalized Medicine: Cognitive Science meets Machine Learning. IEEE Intelligent Informatics Bulletin, 15, (1), 6‐14. Online available via: http://www.comp.hkbu.edu.hk/~cib/2014/Dec/article2/iib_vol15no1_article2.pdf 1 WS 2015 A. Holzinger LV 709.049 Med. Informatik

Transcript of A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... ·...

Page 1: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Statusasof Mo,19.10.201508:30DearStudents – welcometothesecondlectureofourcourse“biomedicalinformatics”,pleaserememberfromthelastlecturethedefinition:AccordingtotheAmericanAssociationofMedicalInformatics(AMIA)thetermMedicalInformaticshasnowbeenexpandedtoBiomedicalInformaticsandisdefinedas“theinterdisciplinaryfieldthatstudiesandpursuestheeffectiveuseofbiomedicaldata,information,andknowledgeforscientificinquiry,problemsolving,anddecisionmaking,motivatedbyeffortstoimprovehumanhealth”.[1]Itisimportanttoknow:Bioinformatics+MedicalInformatics=BiomedicalInformatics(seeSlide1‐42).Note:Computersarejustthevehiclestorealizethecentralgoals:Toharnessthepowerofthemachinestosupportandtoamplifyhumanintelligence[2].[1]Shortliffe,E.H.2011.BiomedicalInformatics:DefiningtheScienceanditsRoleinHealthProfessionalEducation.In:Holzinger,A.&Simonic,K.‐M.(eds.)InformationQualityine‐Health.LectureNotesinComputerScienceLNCS7058.Heidelberg,NewYork:Springer,pp.711‐714.[2]Holzinger,A.2013.Human–ComputerInteractionandKnowledgeDiscovery(HCI‐KDD):Whatisthebenefitofbringingthosetwofieldstoworktogether?In:Cuzzocrea,A.,Kittl,C.,Simos,D.E.,Weippl,E.&Xu,L.(eds.)MultidisciplinaryResearchandPracticeforInformationSystems,SpringerLectureNotesinComputerScienceLNCS8127.Heidelberg,Berlin,NewYork:Springer,pp.319‐328.Regardingthecurrenttrendtowardspersonalizedmedicinehaveareadofthispaper:Holzinger,A.2014.TrendsinInteractiveKnowledgeDiscoveryforPersonalizedMedicine:CognitiveSciencemeetsMachineLearning.IEEEIntelligentInformaticsBulletin,15,(1),6‐14.Onlineavailable via:http://www.comp.hkbu.edu.hk/~cib/2014/Dec/article2/iib_vol15no1_article2.pdf

1WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 2: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Inthissecondlecturewestartwithalookondatasources,reviewsomedatastructures,discussstandardizationversusstructurization,reviewthedifferencesbetweendata,informationandknowledgeandclosewithanoverviewaboutinformationentropy.

2WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 3: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Acentral topicisthedimensionalityofdataandthe interrelated(connected)curseofdimensionalitywhichreferstovariousphenomenathatarisewhenanalyzingandorganizingdatainhigh‐dimensionalspaces(thousandsofdimensions)thatdonotoccurinlow‐dimensionalsettingssuchasthethree‐dimensionalphysicalspaceofoureverydayworld.

3WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 4: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

4WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 5: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

5WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 6: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

6WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 7: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Arecommendable smallbookletis:

Scheinerman,E.R.2011.MathematicalNotation:AGuideforEngineersandScientists,Baltimore(MD),Scheinerman.Whichalsoincludes themostimportantLATEXcommandsforproducingmaths symbols

http://www.ams.jhu.edu/~ers/notation/

7WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 8: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

componentsthroughpandw(Golan,Judge,andMiller;1996);

8WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 9: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Holzinger,A.,Dehmer,M.&Jurisica,I.2014.KnowledgeDiscoveryandinteractiveDataMininginBioinformatics‐ State‐of‐the‐Art,futurechallengesandresearchdirections.BMCBioinformatics,15,(S6),I1.

Hund,M.,Sturm,W.,Schreck,T.,Ullrich,T.,Keim,D.,Majnaric,L.&Holzinger,A.2015.AnalysisofPatientGroupsandImmunizationResultsBasedonSubspaceClustering.In:Guo,Y.,Friston,K.,Aldo,F.,Hill,S.&Peng,H.(eds.)BrainInformaticsandHealth,LectureNotesinArtificialIntelligenceLNAI9250.Cham:SpringerInternationalPublishing,pp.358‐368.

Holzinger,A.,Stocker,C.&Dehmer,M.2014.BigComplexBiomedicalData:TowardsaTaxonomyofData.In:Obaidat,M.S.&Filipe,J.(eds.)CommunicationsinComputerandInformationScienceCCIS455.BerlinHeidelberg:Springerpp.3‐18.

Relatedrecommended reading:

Dong‐Hee,S.&MinJae,C.2015.Ecologicalviewsofbigdata:Perspectivesandissues.TelematicsandInformatics,32,(2),311‐320.

Dong,X.L.&Srivastava,D.2015.BigDataIntegration.SynthesisLecturesonDataManagement,7,(1),1‐198.

Wu,X.D.,Zhu,X.Q.,Wu,G.Q.&Ding,W.2014.DataMiningwithBigData.IEEETransactionsonKnowledgeandDataEngineering,26,(1),97‐107.

Shneiderman,B.2014.TheBigPictureforBigData:Visualization.Science,343,(6172),730‐730.

Sackman,J.E.&Kuchenreuther,M.2014.MarryingBigDatawithPersonalizedMedicine.Biopharm International,27,(8),36‐38.

9WS 2015

Page 10: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Withregardto data,thedifferencebetweenclassicalstatisticsandmodern machinelearningisthatmachinelearningdiscoversintricatestructuresinlargedatasetstoindicatehowamachineshouldchangeitsinternalparameters.

A. Holzinger    LV 709.049 Med. Informatik                                

WS 2015 10

Page 11: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

JohnvonNeumannandhishigh‐speedcomputer,approx.1952

Ourfirstquestionis:Wheredoesthedatacomefrom?Thesecondquestion:Whatkindofdataisthis?Thethirdquestion:Howbigisthisdata?So,letuslookatsomebiomedicaldatasources(seeSlide2‐1):

11WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 12: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Duetotheincreasingtrendtowardspersonalizedandmolecularmedicine,biomedicaldataresultsfromvarioussourcesindifferentstructuraldimensions,rangingfromthemicroscopicworld(e.g.genomics,epigenomics,metagenomics,proteomics,metabolomics)tothemacroscopicworld(e.g.diseasespreadingdataofpopulationsinpublichealthinformatics).Justfororientation:theGlucosemoleculehasasizeof900 =900 10 andtheCarbonatomapprox.300 .Ahepatitisvirusisrelativelylargewith45 =45 10 andtheX‐Chromosomemuchbiggerwith7 =7 10 .Herealotof“bigdata”isproduced,e.g.genomics,metabolomicsandproteomicsdata.Thisisreally“bigdata”– thedatasetsenormouslylarge– whereasineachindividualweestimatemanyTerabytes(1TB=1 10 Byte=1000GByte)ofgenomicsdata,weareconfrontedwithPetabytesofproteomicsdataandthefusionofthoseforpersonalizedmedicineresultsinExabytes ofdata(1EB=1 10 Byte).

Ofcoursetheseamountsareforeachhumanindividual,however,wehaveacurrentworldpopulationof7Billion(1BillioninEnglishlanguageis1MilliardinEuropeanlanguage)people(=7 10 people).Soyoucanseethatthisisreally“bigdata”.This“natural”dataisthenfusedwith“produced”data,e.g.theunstructureddata(text)inthepatientrecords,ordatafromphysiologicalsensorsetc.– thesedataisalsorapidlyincreasinginsizeandcomplexity.Youcanimaginethatwithoutcomputationalintelligencewehavenochancetosurviveinthiscomplexbigdatasets.

http://learn.genetics.utah.edu/content/begin/cells/scale/C‐Atom340pm=340.10‐12mMoleculeGlucose900pmVirus HepatitisVirus45nm=45.10‐9mMicroscope200.10‐9mConfocalmicroscopy 20.10‐6mElectron‐Microscopy0,1.10‐9mX‐Chromosome7.10‐6mDNA2.10‐9mEncyme =Metabolomics

Holzinger,A.,Dehmer,M.&Jurisica,I.2014.KnowledgeDiscoveryandinteractiveDataMininginBioinformatics‐ State‐of‐the‐Art,futurechallengesandresearchdirections.BMCBioinformatics,15,(S6),I1.

12WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 13: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

MostofourcomputersareVon‐Neumannmachines(seechapter1),consequentlyatthelowestphysicallayer,dataisrepresentedaspatternsofelectricalon/offstates(1/0,H/L,high/low);wespeakofabit,whichisalsoknownasBit,theBasicindissolubleinformationunit(Shannon,1948).DonotconfusethisBitwiththeIEC60027‐2symbolbit– insmallletters– whichisusedasanSIdimensionprefix(e.g.1Kbit=1024bit,1Byte=8bit).Beginningwiththephysicallevelofdatawecandeterminevariouslevelsofdatastructures(seeSlide2‐2):Referto:http://physics.nist.gov/cuu/Units/binary.html1)Physicallevel: inaVon‐Neumannsystem:bit;inaQuantumsystem:qubitNote: Regardlessofitsphysicalrealization(e.g.voltage,ormechanicalstate,orblack/whiteetc.),abitisalwayslogicallyeither0or1(analog toalight‐switch).Aqubithassimilaritiestoaclassicalbit,butisoverallverydifferent:Aclassicalbitisascalarvariablewiththesinglevalueofeither0or1,sothevalueisunique,deterministicandunambiguous.Aqubitismoregeneralinthesensethatitrepresentsastatedefinedbyapairofcomplexnumbers , , whichexpresstheprobabilitythatareadingofthevalueofthequbitwillgiveavalueof0or1.Thus,aqubitcanbeinthestateof0,1,orsomemixture‐ referredtoasasuperposition‐ ofthe0and1states.Theweightsof0and1inthissuperpositionaredeterminedby(a,b)inthefollowingway:qubit≜ , ≜ 0 1 .Pleasebeawarethatthismodelofquantumcomputationisnottheonlyone(Lanzagorta &Uhlmann,2008).

Forarecentoverviewonquantumcomputation pleasereferto:http://peterwittek.com/book.html2)LogicalLevel:1)Primitivedatatypes,including:a)Booleandatatype(true/false);b)numericaldatatype(e.g.integer( ,floating‐pointnumbers(“reals”),etc.);2)compositedatatypes,including:a)array,b)record,c)union,d)set(storesvalueswithoutanyparticularorder,andnorepeatedvalues),e)object(containsothers);3)Stringandtexttypes,including:a)alphanumericcharacters,b)alphanumericstrings(=sequenceofcharacterstorepresentwordsandtext)3)AbstractLevel: includingabstractdatastructures,e.g.queue(FIFO),stack(LIFO),set(noorder,norepeatedvalues),lists,hashtable,arrays,trees,graphs,…4)TechnicalLevel: Applicationdataformats,e.g.text,vectorgraphics,pixelimages,audiosignals,videosequences,multimedia,…5)HospitalLevel: Narrative(textual,naturallanguage)patientrecorddata(structured/unstructuredandstandardized/non‐standardized),Omicsdata(genomics,proteomics,metabolomics,microarraydata,fluxomics,phenomics),numericalmeasurements(physiologicaldata,timeseries,labresults,vitalsigns,bloodpressure,CO2 partialpressure,temperature,…),recordedsignals(ECG,EEG,ENG,EMG,EOG,EP…),graphics(sketches,drawings,handwriting,…);audiosignals,images(cams,x‐ray,MR,CT,PET,…),etc.

13WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 14: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Inbiomedicalinformaticswehavealottodowithabstractdatatypes(ADT),consequentlywebrieflyreviewthemostimportantoneshere.FordetailspleaserefertoacourseonAlgorithm&Datastructures,ortoaclassictextbooksuchas(Aho,Hopcroft&Ullman,1983),(Cormen etal.,2009),orinGerman(Ottmann &Widmayer,2012),(Holzinger,2003)andpleasetakeintoconsiderationthatdatastructuresandalgorithmsgohandinhand,so amust‐have‐on‐the‐deskofeverycomputerscientistis:Cormen,T.H.,Leiserson,C.E.,Rivest,R.L.&Stein,C.2009.IntroductiontoAlgorithms(3rdedition),Cambridge(MA),TheMITPress.

Listisasequentialcollectionofitems , , … , accessibleoneafteranother,beginningattheheadandendingatthetailz.InaVon‐Neumannmachineitisawidelyuseddatastructureforapplicationswhichdonotneedrandomaccess.Itdiffersfromthestack(last‐in‐first‐out,LIFO)andqueue(first‐in‐first‐out,FIFO)datastructuresinsofar,thatadditionsandremovalscanbemadeatanypositioninthelist.Incontrasttoasimpleset theorderisimportant.AtypicalexamplefortheuseofalistisaDNAsequence.ThecombinationofGGGTTTAAAissuchalist,theelementsofthelistarethenucleotidebases.Nucleotides arethejoinedmoleculeswhichformthestructuralunitsoftheRNAandtheDNAandplaythecentralroleinmetabolism.

14WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 15: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Graph isapair , ,where isasetoffinite,non‐emptyvertices(nodes)and isasetofedges(lines,arcs),whichare2‐elementsubsetsof .If isasetoforderedpairsofvertices(arcs,directededges,arrows),thenitisadirectedgraph(digraph).Thedistancesbetweentheedgescanberepresentedwithinadistance‐matric(twodimensionalarray).

Theedgesinagraphcanbemultidimensionalobjects, e.g.vectorscontainingtheresultsofmultipleGen‐expressionmeasures.Forthispurposethedistanceoftwoedgescanbemeasuredbyvariousdistancemetrics.Graphsareideallysuitedforrepresentingnetworksinmedicineandbiology,e.g.metabolismpathways,etc.Inbioinformatics,distancematricesareusedtorepresentproteinstructuresinacoordinate‐independentmanner,aswellasthepairwisedistancesbetweentwosequencesinsequencespace.Theyareusedinstructuralandsequentialalignment,andforthedeterminationofproteinstructuresfromNMRorX‐raycrystallography.Evolutionarydynamicsactonpopulations.Neithergenes,norcells,norindividualsevolve;onlypopulationsevolve.ThissocalledMoranprocess describesthestochasticevolutionofafinitepopulationofconstantsize:Ineachtimestep,anindividualischosenforreproductionwithaprobabilityproportionaltoitsfitness;asecondindividualischosenfordeath.Theoffspringofthefirstindividualreplacesthesecondandindividualsoccupytheverticesofagraph.Ineachtimestep,anindividualisselectedwithaprobabilityproportionaltoitsfitness;theweightsoftheoutgoingedgesdeterminetheprobabilitiesthatthecorrespondingneighborwillbereplacedbytheoffspring.Theprocessisdescribedbyastochasticmatrix ,where denotestheprobabilitythatanoffspringofindividuali willreplaceindividualj.Ateachtimestep,anedge isselectedwithaprobabilityproportionaltoitsweightandthefitnessoftheindividualatitstail.TheMoranprocessisacompletegraphwithidenticalweights(Lieberman,Hauert &Nowak,2005).

Graphs canberepresentedcomputationallybyanAdjacencylist,AdjacencymatrixandanIncidencematrix.Thefirstpre‐processingstepistoproducepoint clouddatasetsfromrawdata,see:Holzinger,A.,Malle,B.,Bloice,M.,Wiltgen,M.,Ferri,M.,Stanganelli,I.&Hofmann‐Wellenhof,R.2014.OntheGenerationofPointCloudDataSets:StepOneintheKnowledgeDiscoveryProcess.In:Holzinger,A.&Jurisica,I.(eds.)InteractiveKnowledgeDiscoveryandDataMininginBiomedicalInformatics,LectureNotesinComputerScience,LNCS8401.BerlinHeidelberg:Springer,pp.57‐80.https://online.tugraz.at/tug_online/voe_main2.getVollText?pDocumentNr=974579&pCurrPk=83005

Forthe specifictaskofgettinggraphsfromimagedatahavealookat:Holzinger,A.,Malle,B.&Giuliani,N.2014.OnGraphExtractionfromImageData.In:Slezak,D.,Peters,J.F.,Tan,A.‐H.&Schwabe,L.(eds.)LectureNotesinArtificialIntelligence,LNAI8609.Heidelberg,Berlin:Springer,pp.552‐563.https://online.tugraz.at/tug_online/voe_main2.getVollText?pDocumentNr=868952&pCurrPk=80830

15WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 16: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Treeisacollectionofelementscallednodes,oneofwhichisdistinguishedasaroot,alongwitharelation("parenthood")thatplacesahierarchicalstructureonthenodes.Anode,likeanelementofalist,canbeofwhatevertypewewish.Weoftendepictanodeasaletter,astring,oranumberwithacirclearoundit.Formally,atreecanbedefinedrecursivelyinthefollowingmanner:1.Asinglenodebyitselfisatree.Thisnodeisalsotherootofthetree.2.Suppose isanodeand 1, 2, . . . , aretreeswithroots 1, 2, . . . , , respectively.Wecanconstructanewtreebymaking betheparentofnodes 1, 2, . . . , .Inthistree istherootand 1, 2, . . . , arethesubtrees oftheroot.Nodes 1, 2, . . . , arecalledthechildrenofnode .

Dendrogram (fromGreekdendron "tree",‐gramma "drawing")isatreediagramfrequentlyusedtoillustratethearrangementoftheclustersproducedbyhierarchicalclustering.Dendrograms areoftenusedincomputationalbiologytoillustratetheclusteringofgenesorsamples.Theoriginofsuchdendrograms canbefoundin(Darwin,1859).Theexampleby(Hufford etal.,2012)showsaneighbor‐joiningtreeandthechangingmorphologyofdomesticatedmaizeanditswildrelatives.Taxaintheneighbor‐joiningtreearerepresentedbydifferentcolors:parviglumis (green),landraces(red),improvedlines(blue),mexicana (yellow)andTripsacum (brown).Themorphologicalchangesareshownforfemaleinflorescencesandplantarchitectureduringdomesticationandimprovement.

16WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 17: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Please rememberthekeyproblemsindealingwithdatainclude:1)Heterogeneousdatasources(needfordatafusionanddataintegration)2)Complexityofthedata(high‐dimensionality)3)Noisy,uncertaindata(challengeofpre‐processing)4)Thediscrepancybetweendata‐information‐knowledge(variousdefinitions)5)Bigdatasets(manualhandlingofthedataisimpossible)

17WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 18: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Nowthatwehaveseensomeexamplesofdatafromthebiomedicaldomain,wecanlookatthe“bigpicture”.Manyika etal.(2011)localizedfourmajordatapoolsintheUShealthcareanddescribethatthedataarehighlyfragmented,withlittleoverlapandlowintegration.Moreover,theyreportthatapprox.30%ofclinicaltext/numericaldataintheUnitedStates,includingmedicalrecords,bills,laboratoryandsurgeryreports,isstillnotgeneratedelectronically.Evenwhenclinicaldataareindigitalform,theyareusuallyheldbyanindividualproviderandrarelyshared(seeSlide2‐4).Biomedicalresearchdata,e.g.clinicaltrials,predictivemodelingetc.,isproducedbyacademiaandpharmaceuticalcompaniesandstoredindatabasesandlibraries.Clinicaldataisproducedinthehospitalandarestoredinhospitalinformationsystems(HIS),picturearchivingandcommunicationsystems(PACS)orinlaboratorydatabases,etc.Muchdataishealthbusinessdataproducedbypayors,providers,insurances,etc.Finally,thereisanincreasingpoolofpatientbehaviorandsentimentdata,producedbyvariouscustomersandstakeholders,outsidethetypicalclinicalcontext,includingthegrowingdatafromthewellnessandambientassistedlivingdomain.

18WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 19: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Amajorchallengeinournetworkedworldistheincreasingamountofdata– todaycalled“bigdata”.Thetrendtowardspersonalizedmedicinehasresultedinasheermassofthegenerated(‐omics)data,(seeSlide2‐7).Inthelifesciencesdomain,mostdatamodelsarecharacterizedbycomplexity,whichmakesmanualanalysisverytime‐consumingandfrequentlypracticallyimpossible(Holzinger,2013).

MoreandmoreOmics‐dataaregenerated,including:1)Genomicsdata(e.g.sequenceannotation),2)Transcriptomics data(e.g.microarraydata);thetranscriptome isthesetofallRNAmolecules,includingmRNA,rRNA,tRNA andnon‐codingRNAproducedinthecells.3)Proteomicsdata:Proteomicstudiesgeneratelargevolumesofrawexperimentaldataandinferredbiologicalresultsstoredindatarepositories,mostlyopenlyavailable;anoverviewcanbefoundhere:(Riffle&Eng,2009).Theoutcomeofproteomicsexperimentsisalistofproteinsdifferentiallymodifiedorabundantinacertainphenotype.Thelargesizeofproteomicsdatasetsrequiresspecializedanalyticaltools,whichdealwithlargelistsofobjects4)Metabolomics(e.g.enzymeannotation),themetabolome representsthecollectionofallmetabolitesinacell,tissue,organororganism.5)Protein‐DNAinteractions,6)Protein‐proteininteractions;PPIareatthecoreoftheentireinteractomics systemofanylivingcell.7)Fluxomics (isotopictracing,metabolicpathways),8)Phenomics (biomarkers),9)Epigenetics,isthestudyofthechangesingeneexpression– othersthantheDNAsequence,thereforetheprefix“epi‐“10)Microbiomics11)LipidomicsOmics‐dataintegrationhelpstoaddressinterestingbiologicalquestionsonthebiologicalsystemsleveltowardspersonalizedmedicine(Joyce&Palsson,2006).

19WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 20: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

MoreandmoreOmics‐dataaregenerated,including:1)Genomicsdata(e.g.sequenceannotation),2)Transcriptomicsdata(e.g.microarraydata);thetranscriptomeisthesetofallRNAmolecules,includingmRNA,rRNA,tRNA andnon‐codingRNAproducedinthecells.3)Proteomicsdata:Proteomicstudiesgeneratelargevolumesofrawexperimentaldataandinferredbiologicalresultsstoredindatarepositories,mostlyopenlyavailable;anoverviewcanbefoundhere:(Riffle&Eng,2009).Theoutcomeofproteomicsexperimentsisalistofproteinsdifferentiallymodifiedorabundantinacertainphenotype.Thelargesizeofproteomicsdatasetsrequiresspecializedanalyticaltools,whichdealwithlargelistsofobjects(Bessarabova etal.,2012).4)Metabolomics(e.g.enzymeannotation),themetabolome representsthecollectionofallmetabolitesinacell,tissue,organororganism.5)Protein‐DNAinteractions,6)Protein‐proteininteractions;PPIareatthecoreoftheentireinteractomics systemofanylivingcell.7)Fluxomics (isotopictracing,metabolicpathways),8)Phenomics (biomarkers),9)Epigenetics,isthestudyofthechangesingeneexpression– othersthantheDNAsequence,thereforetheprefix“epi‐“10)Microbiomics11)LipidomicsOmics‐dataintegrationhelpstoaddressinterestingbiologicalquestionsonthebiologicalsystemsleveltowardspersonalizedmedicine(Joyce&Palsson,2006).

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2908408/

Formoreinformationpleasereferto:Gomez‐Cabrero,D.,Abugessaisa,I.,Maier,D.,Teschendorff,A.,Merkenschlager,M.,Gisel,A.,Ballestar,E.,Bongcam‐Rudloff,E.,Conesa,A.&Tegner,J.2014.Dataintegrationintheeraofomics:currentandfuturechallenges.BMCSystemsBiology,8,(Suppl 2),I1.http://www.biomedcentral.com/1752‐0509/8/S2/I1

20WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 21: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Afurtherchallengeistointegratethedataandtomakeitaccessibletotheclinician.Whilethereismuchresearchontheintegrationofheterogeneousinformationsystems,ashortcomingisintheintegrationofavailabledata.Datafusionistheprocessofmergingmultiplerecordsrepresentingthesamereal‐worldobjectintoasingle,consistent,accurate,andusefulrepresentation(Bleiholder &Naumann,2008).AnexampleforthemixofdifferentdataforsolvingamedicalproblemcanbeseeninSlide2‐8.

AgoodexampleforcomplexmedicaldataisRCQM,whichisanapplicationthatmanagestheflowofdataandinformationintherheumatologyoutpatientclinic(50patientsperday,5daysperweek)ofGrazUniversityHospital,onthebasisofaqualitymanagementprocessmodel.Eachexaminationproduces100+clinicalandfunctionalparametersperpatient.Thisamasseddataaremorphedintobetteruseableinformationbyapplyingscoringalgorithms(e.g.DiseaseActivityScore,DAS)andareconvolutedovertime.Togetherwithpreviousfindings,physiologicallaboratorydata,patientrecorddataandOmicsdatafromthePathologydepartment,thesedataconstitutetheinformationbasisforanalysisandevaluationofthediseaseactivity.Thechallengeisintheincreasingquantitiesofsuchhighlycomplex,multi‐dimensionalandtimeseriesdata,seeanexample here:Simonic,K.M.,Holzinger,A.,Bloice,M.&Hermann,J.OptimizingLong‐TermTreatmentofRheumatoidArthritiswithSystematicDocumentation.ProceedingsofPervasiveHealth‐ 5thInternationalConferenceonPervasiveComputingTechnologiesforHealthcare,2011Dublin.IEEE,550‐554.http://www.biomedcentral.com/1472‐6947/13/103

21WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 22: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Donotconfusestructurewithstandardization(seeSlide2‐9).Datacanbestandardized(e.g.numericalentriesinlaboratoryreports)andnon‐standardized.Atypicalexampleisnon‐standardizedtext– impreciselycalled“Free‐Text”or“unstructureddata”inanelectronicpatientrecord(Kreuzthaleretal.,2011).

Standardizeddata isthe basisforaccuratecommunication.Inthemedicaldomain,manydifferentpeopleworkatdifferenttimesinvariouslocations.Datastandards canensurethatinformationisinterpretedbyalluserswiththesameunderstanding.Moreover,standardizeddatafacilitatecomparabilityofdataandinteroperabilityofsystems.Itsupportsthereusabilityofthedata,improvestheefficiencyofhealthcareservicesandavoidserrorsbyreducingduplicatedeffortsindataentry.

Datastandardizationreferstoa)thedatacontent;b)theterminologiesthatareusedtorepresentthedata;c)howdataisexchanged;andiv)howknowledge,e.g.clinicalguidelines,protocols,decisionsupportrules,checklists,standardoperatingproceduresarerepresentedinthehealthinformationsystem(refertoIOM).Technicalelementsfordatasharingrequirestandardizationofidentification,recordstructure,terminology,messaging,privacyetc.ThemostusedstandardizeddatasettodateistheinternationalClassificationofDiseases(ICD),whichwasfirstadoptedin1900forcollectingstatistics(Ahmadian etal.,2011),whichwewilldiscussin→Lecture3.Non‐standardizeddata isthemajorityofdataandinhibitdataquality,dataexchangeandinteroperability.Well‐structureddata istheminorityofdataandanidealisticcasewheneachdataelementhasanassociateddefinedstructure,relationaltables,ortheresourcedescriptionframeworkRDF,ortheWebOntologyLanguageOWL(see→Lecture3).Note:Ill‐structured isatermoftenusedfortheoppositeofwell‐structured,althoughthistermoriginallywasusedinthecontextofproblemsolving(Simon,1973).Semi‐structuredisaformofstructureddatathatdoesnotconformwiththestrictformalstructureoftablesanddatamodelsassociatedwithrelationaldatabasesbutcontainstagsormarkerstoseparatestructureandcontent,i.e.areschema‐lessorself‐describing;atypicalexampleisamarkup‐languagesuchasXML(see→Lecture3and4).Weakly‐Structureddata isthemostofourdatainthewholeuniverse,whetheritisinmacroscopic(astronomy)ormicroscopicstructures(biology)– see→Lecture5.Non‐structureddata orunstructureddata isanimprecisedefinitionusedforinformation expressedinnaturallanguage,whennospecificstructurehasbeendefined.Thisisanissuefordebate:Texthasalsosomestructure:words,sentences,paragraphs.Ifweareveryprecise,unstructureddatawouldmeantthatthedataiscompleterandomized– whichisusuallycallednoiseandisdefinedby(Duda,Hart&Stork,2000)asanypropertyofdatawhichisnotduetotheunderlyingmodelbutinsteadtorandomness(eitherintherealworld,fromthesensorsorthemeasurementprocedure).

22WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 23: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

“Multivariate”and“multidimensional”aremodernwordsandconsequentlyoverusedinliterature.Eachitemofdataiscomposedofvariables, andifsuchadataitemisdefinedbymorethanonevariableitiscalledamultivariabledataitem.Variablesarefrequentlyclassifiedintotwocategories:dependentorindependent.

Somemorereadings onthehomepageofYosuhua Bengio,UniversityofMontreal:http://www.iro.umontreal.ca/~bengioy/yoshua_en/research.html

AndtheMILA Lab– MontrealInstituteforLearningAlgorithmshttp://www.mila.umontreal.ca/

23WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 24: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

InPhysics,EngineeringandStatisticsavariableisaphysicalpropertyofasubject,whosequantitycanbemeasured,e.g.mass,length,time,temperature,etc.Inmathematicsa0‐dimensionalspace(nil‐dimensional)isatopologicalspacethathasdimensionzero– which isaninfinitesimalsmallpoint.a 1‐dimensionalspaceisalineinR1a2‐dimensionalspaceistheplaneinR2A3‐dimensionalspaceisasphere(orcube,cylinderetc.)inR3

24WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 25: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

SMILESdata(.smi)consists ofastringobtainedbythesymbolnodesencounteredinadepth‐firsttreetraversalofachemicalgraph, whichisfirsttrimmedtoremovehydrogenatomsandcyclesarebrokentoturnitintoaspanningtree.Wherecycleshavebeenbroken,numericsuffixlabelsareincludedtoindicatetheconnectednodes.

25WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 26: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Proteomicanalysisofmesenchymal stemcells(MSCs).Two‐dimensionalgelelectrophoresiswasperformedusingwholeproteincellextractsfromP2MSCculturesofpatientswithrheumatoidarthritis(RA)(n=10)(A)andhealthycontrols(n=6)(B).Afterscanning,spotdetection,quantificationandnormalisation,gelswerecomparedusingHierarchicalClusteringSoftwareandPearsontest(C).Noclustercouldbedetectedusingtheseproteomicprofiles.

Proteomicanalysis:Two‐dimensionalelectrophoresiswasperformedusingP2MSCsinpatientswithRA(n=10)andhealthycontrols(n=6)(fig4A,B).ByusingtheHierarchicalClusteringmethod,wecouldnotdefineanyclusterthatmightdiscriminatepatientandcontrolcells(fig4C).ThePearsoncorrelationcoefficientwasnotsignificantlydifferentbetweenpatientandcontrolcells(r=0.933(0.022)andr=0.929(0.020),respectively).Thesedatacorroboratethelackofsignificantchangesincytokineproductionbetweenpatientsandcontrols.

26WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 27: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

http://www.rcsb.org/pdb/images/3ond_bio_r_500.jpgThePDBisalargerepository containing3‐Dstructuralinformation,establishedin1971Dataastoredin2Dbutcaninfactrepresentbiologicalentitiesinthreeormoredimensions

27WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 28: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Transaxial (left),coronal(middle),andsagittal(right)imagesofapatientwhowasscannedfor30mininlist‐modewiththeBrainPET scanner;therecordingwasstarted20minafterinjectionofabout300MBq fluor‐deoxy‐glucose.

28WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 29: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

InMathematics,henceinInformatics,however,avariableisassociatedwithaspace–oftenann‐dimensionalEuclideanspace – inwhichanentity(e.g.afunction)oraphenomenonofcontinuousnatureisdefined.Thedatalocationwithinthisspacecanbereferencedbyusingarangeofcoordinatesystems(e.g.Cartesian,Polar‐coordinates,etc.):Thedependentvariablesarethoseusedtodescribetheentity(forexamplethefunctionvalue)whilsttheindependentvariablesarethosethatrepresentthecoordinatesystemusedtodescribethespaceinwhichtheentityisdefined.Ifadatasetiscomposedofvariableswhoseinterpretationfitsthisdefinitionourgoalistounderstandhowthe‘entity’isdefinedwithinthen‐dimensionalEuclideanspace .Sometimeswemaydistinguishbetweenvariablesmeaningmeasurementofproperty,fromvariablesmeaningacoordinatesystem,byreferringtotheformerasvariate,andreferringtothelatterasdimension(DosSantos&Brodlie,2002), (dosSantos&Brodlie,2004).Aspaceisasetofpoints.Ametricspacehasanassociatedmetric,whichenablesustomeasuredistancesbetweenpointsinthatspaceand,inturn,implicitlydefinetheirneighborhoods.Consequently,ametricprovidesaspacewithatopology,andametricspaceisatopologicalone.Topologicalspacesfeelalientousbecauseweareaccustomedtohavingametric.BiomedicalExample:Aproteinisasinglechainofaminoacids,whichfoldsintoaglobularstructure.TheThermodynamicsHypothesisstatesthataproteinalwaysfoldsintoastateofminimumenergy.Topredictproteinstructure,wewouldliketomodelthefoldingofaproteincomputationally.Assuch,theproteinfoldingproblembecomesanoptimizationproblem:Wearelookingforapathtotheglobalminimuminaveryhigh‐dimensionalenergylandscape;

29WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 30: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Letuscollect ‐dimensional observationsintheEuclideanvectorspace andweget:Eq.2‐1

, … ,

Acloudofpointssampledfromanysource(e.g.medicaldata,sensornetworkdata,asolid3‐Dobject,surfaceetc.).ThosedatapointscanbecoordinatedasanunorderedsequenceinanarbitrarilyhighdimensionalEuclideanspace,wheremethodsofalgebraictopologycanbeapplied.Themainchallengeisinmappingthedatabackinto ortobemorepreciseinto ,becauseourretinaisinherentlyperceivingdatain .Thecloudofsuchdatapointscanbeusedasacomputationalrepresentationoftherespectivedataobject.Atemporalversioncanbefoundinmotion‐capturedata,wheregeometricpointsarerecordedastimeseries.Nowyouwillaskanobviousquestion:“Howdowevisualizeafour‐dimensionalobject?”Theobviousansweris:“Howdowevisualizeathreedimensionalobject?”Humansdonotseeinthreespatialdimensionsdirectly,butviasequencesofplanarprojectionsintegratedinamannerthatissensedifnotcomprehended.Littlechildrenspendasignificanttimeoftheirfirstyearoflifelearninghowtoinferthree‐dimensionalspatialdatafrompairedplanarprojections,andmanyyearsofpracticehavetunedaremarkableabilitytoextractglobalstructurefromrepresentationsinastrictlylowerdimension(Ghrist,2008).Becausewehavethesameproblemhereinthisbook,wemuststayin andthereforetheexampleinSlide2‐12(Zomorodian,2005).InEinstein'stheoryofSpecialRelativity,Euclidean3‐spaceplustime(the"4th‐dimension")areunifiedintotheMinkowskispace

30WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 31: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Ametricspacehasanassociatedmetric,whichenablestomeasurethedistancesbetweenpointsinthatspaceand,implicitlydefinetheirneighborhoods.Consequently,ametricprovidesaspacewithatopology,henceametricspaceisatopologicalspace.AsetXwithametricfunctiondiscalledametricspace.Wegiveitthemetrictopologyofd,wherethesetofopenballsMostofour“natural”spacesareaparticulartypeofmetricspaces:theEuclideanspaces:TheCartesianproductof copiesof ,thesetofrealnumbers,alongwiththeEuclideanmetric:Eq.2‐2

,

isthe ‐dimensionalEuclideanspace .Wemayinduceatopologyonsubsetsofmetricspacesasfollows:If ⊆ withtopology ,thenwegettherelativeorinducedtopology bydefiningFormoreinformationreferto(Zomorodian,2005)or(Edelsbrunner &Harer,2010).

31WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 32: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

KnowledgeDiscoveryfromData:Bygettinginsightintothedata;thegainedinformationcanbeusedtobuildupknowledge.Thegrandchallengeistomaphigherdimensionaldataintolowerdimensions,hencemakeitinteractivelyaccessibletotheend‐user(Holzinger,2012),(Holzinger,2013).Thismappingfrom → isthecoretaskofvisualizationandamajorcomponentforknowledgediscovery:Enablingeffectiveinteractivehumancontroloverpowerfulmachinealgorithmstosupporthumansensemaking(Holzinger,2012),(Holzinger,2013).

Holzinger,A.2013.Human–ComputerInteraction&KnowledgeDiscovery(HCI‐KDD):Whatisthebenefitofbringingthosetwofieldstoworktogether?In:AlfredoCuzzocrea,C.K.,Dimitris E.Simos,EdgarWeippl,Lida Xu (ed.)MultidisciplinaryResearchandPracticeforInformationSystems,SpringerLectureNotesinComputerScienceLNCS8127.Heidelberg,Berlin,NewYork:Springer,pp.319‐328.

An importanttopicissubspaceclustering:Hund,M.,Sturm,W.,Schreck,T.,Ullrich,T.,Keim,D.,Majnaric,L.&Holzinger,A.2015.AnalysisofPatientGroupsandImmunizationResultsBasedonSubspaceClustering.In:Guo,Y.,Friston,K.,Aldo,F.,Hill,S.&Peng,H.(eds.)BrainInformaticsandHealth,LectureNotesinArtificialIntelligenceLNAI9250.Cham:SpringerInternationalPublishing,pp.358‐368.https://online.tugraz.at/tug_online/voe_main2.getVollText?pDocumentNr=1198810&pCurrPk=85960

32WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 33: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Multivariatedataset isadatasetthathasmanydependentvariables andtheymightbecorrelatedtoeachothertovaryingdegrees.Usuallythistypeofdatasetisassociatedwithdiscretedatamodels.

Multidimensionaldataset isadatasetthathasmanyindependentvariables clearlyidentified,andoneormoredependentvariablesassociatedtothem.Usuallythistypeofdatasetisassociatedwithcontinuousdatamodels.

Inotherwords,everydataitem(orobject)inacomputerisrepresented(stored)asasetoffeatures.Insteadofthetermfeatureswemayusethetermdimensions,becauseanobjectwith ‐featurescanalsoberepresentedasamultidimensionalpointinan ‐dimensionalspace.Dimensionalityreductionistheprocessofmappingan ‐dimensionalpoint,intoalower ‐dimensionalspace– thisisthemainchallengeinvisualizationsee→Lecture9.

Thenumberofdimensionscansometimesbesmall,e.g.simple1D‐datasuchastemperaturemeasuredatdifferenttimes,to3Dapplicationssuchasmedicalimaging,wheredataiscapturedwithinavolume.Standardtechniques—contouringin2D;isosurfacing andvolumerenderingin3D—haveemergedovertheyearstohandlethissortofdata.Thereisnodimensionreductionissueintheseapplications,sincethedataanddisplaydimensionsessentiallymatch.

33WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 34: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Datacanbecategorizedintoqualitative(nominalandordinal)andquantitative(intervalandratio):Intervalandratiodataareparametric,andareusedwithparametrictoolsinwhichdistributionsarepredictable(andoftenNormal).Nominalandordinaldataarenon‐parametric,anddonotassumeanyparticulardistribution.Theyareusedwithnon‐parametrictoolssuchastheHistogram.Theclassicpaperonthetheoryofscalesofmeasurementis(Stevens,1946).

34WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 35: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Wecansummarizewhatwelearnedsofaraboutdata:Datacanbenumeric,non‐numeric,orboth.Non‐numericdatacanincludeanythingfromlanguagedata(text)tocategorical,image,orvideodata.Datamayrangefromcompletelystructured,suchascategoricaldata,tosemi‐structured,suchasanXMLFilecontainingmetainformation,tounstructured,suchasanarrative“free‐text”.Note,thattermunstructureddoesnotmeanthatthedataarewithoutanypattern,whichwouldmeancompleterandomnessanduncertainty,butratherthat“unstructureddata”areexpressedso,thatonlyhumanscanmeaningfullyinterpretit.Structureprovidesinformationthatcanbeinterpretedtodeterminedataorganizationandmeaning,henceitprovidesacontext fortheinformation.Theinherentstructureinthedatacanformabasisfordatarepresentation.Animportant,yetoftenneglectedissuearetemporalcharacteristicsofdata: Dataofalltypesmayhaveatemporal(time)association,andthisassociationmaybeeitherdiscreteorcontinuous(Thomas&Cook,2005).InMedicalInformaticswehaveapermanentinteractionbetweendata,informationandknowledge,withdifferentdefinitions(Bemmel&Musen,1997),seeSlide2‐16:

35WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 36: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Data arethephysicalentitiesatthelowestabstractionlevelwhichare,e.g.generatedbyapatient(patientdata)orabiologicalprocess(e.g.Omicsdata).Accordingto(Bemmel&Musen,1997)datacontainnomeaning.

Informationisderivedbyinterpretationofthedatabyaclinician(humanintelligence).

Knowledge isobtainedbyinductivereasoningwithpreviouslyinterpreteddata,collectedfrommanysimilarpatientsorprocesses,whichisaddedtothesocalledbodyofknowledgeinmedicine,theexplicitknowledge. Thisknowledgeisusedfortheinterpretationofotherdataandtogainimplicitknowledge whichguidestheclinicianintakingfurtheraction.

36WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 37: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Forhypothesisgenerationandtesting,fourtypesofinferencesexist(Peirce,1955):abstraction,abduction,deduction,andinduction.Thefirsttwodrivehypothesisgenerationwhilethelatterdrivehypothesistesting,seeSlide2‐17:Abstractionmeansthatdataarefilteredaccordingtotheirrelevancefortheproblemsolutionandchunkedinschemasrepresentinganabstract descriptionoftheproblem(e.g.,abstractingthatanadultmalewithhaemoglobinconcentrationlessthan14g/dL isananaemicpatient).Followingthis,hypothesesthatcouldaccountforthecurrentsituationarerelatedthroughaprocessofabduction,characterizedbya"backwardflow"ofinferencesacrossachainofdirectedrelationswhichidentifythoseinitialconditionsfromwhichthecurrentabstractrepresentationoftheproblemoriginates.Thisprovidestentativesolutionstotheproblemathandbywayofhypotheses.Forexample,knowingthatdisease willcausesymptom,abductionwilltrytoidentifytheexplanationforB,whiledeductionwillforecastthatapatientaffectedbydisease will

manifestsymptom :bothinferencesareusingthesamerelationalongtwodifferentdirections(Patel&Ramoni,1997).Abduction ischaracterizedbyacyclicalprocessofgeneratingpossibleexplanations(i.e.,identificationofasetofhypothesesthatareabletoaccountfortheclinicalcaseonthebasisoftheavailabledata)andtestingthoseexplanations(i.e.,evaluationofeachgeneratedhypothesisonthebasisofitsexpectedconsequences)fortheabnormalstateofthepatientathand(Patel,Arocha &Zhang,2004).

ThehypothesistestingprocedurescanbeinferredfromSlide2‐17:Generalknowledgeisgainedfrommanypatients,andthisgeneralknowledgeisthenappliedtoanindividualpatient.Wehavetodeterminebetween:Reasoning istheprocessbywhichaclinicianreachesaconclusionafterthinkingaboutallthefacts;Deduction consistsofderivingaparticularvalidconclusionfromasetofgeneralpremises;Induction consistsofderivingalikelygeneralconclusionfromasetofparticularstatements.Reasoninginthe“realworld”doesnotappeartofitneatlyintoanyofthesebasictypes.Therefore,athirdformofreasoninghasbeenrecognizedbyPeirce(1955),wheredeductionandinductionareinter‐mixed;

37WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 38: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Thequestion“whatisinformation?”isstillanopenquestioninbasicresearch,andanydefinitionisdependingontheviewtaken.Forexample,thedefinitiongivenbyCarl‐FriedrichvonWeizsäcker:“Informationiswhatisunderstood,”impliesthatinformationhasbothasenderandareceiverwhohaveacommonunderstandingoftherepresentationandthemeanstoconveyinformationusingsomepropertiesofthephysicalsystems,andhisaddendum:“Informationhasnoabsolutemeaning;itexistsrelativelybetweentwosemanticlevels”impliestheimportanceofcontext(Marinescu,2011).Withoutdoubtinformationisafundamentallyimportantconceptwithinourworldandlifeiscomplexinformation,seeSlide2‐14:

Manysystems,e.g.inthequantumworldtonotobeytheclassicalviewofinformation.Inthequantumworldandinthelifesciencestraditionalinformationtheoryoftenfailstoaccuratelydescribereality…forexampleinthecomplexityofalivingcell:Allcomplexlifeiscomposedofeukaryotic(nucleated)cells(Lane&Martin,2010).Agoodexampleofsuchacellistheprotist EuglenaGracilis (inGerman“Augentierchen”)withalengthofapprox.30 .Lifecanbeseenasadelicateinterplayofenergy,entropyandinformation,essentialfunctionsoflivingbeingscorrespondtothegeneration,consumption,processing,preservationandduplicationofinformation.

P:Complexity<>Information<>Energy<> Entropy

38WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 39: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

TheetymologicaloriginofthewordinformationcanbetracedbacktotheGreek“forma”andtheLatin“information”and“informare”,tobringsomethingintoashape(“in‐a‐form”).Consequently,thenaivedefinitionincomputerscienceis“informationisdataincontext” andthereforedifferentthandataorknowledge.However,wefollowthenotionof(Boisot &Canals,2004)anddefinethatinformationisanextractionfromdatathat,bymodifyingtherelevantprobabilitydistributions,hasdirectinfluenceonanagent’sknowledgebase.Forabetterunderstandingofthisconcept,wefirstreviewthemodelofhumaninformationprocessingbyWickens (1984):ThemodelbyWickens (1984)beautifullyemphasizesourviewondata,informationandknowledge:thephysicaldatafromthereal‐worldareperceivedasinformationthroughperceptualfilters,controlledbyselectiveattentionandformhypotheseswithintheworkingmemory.Thesehypothesesaretheexpectationsdependingonourpreviousknowledgeavailableinourmentalmodel,storedinthelong‐termmemory.Thesubjectivelybestalternativehypothesiswillbeselectedandprocessedfurtherandmaybetakenasoutcomeforanaction.Duetothefactthatthissystemisaclosedloop,wegetfeedbackthroughnewdataperceivedasnewinformationandtheprocessgoeson.

39WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 40: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Theincomingstimulifromthephysicalworldmustpassbothaperceptualfilterandaconceptualfilter.Theperceptualfilter orientatesthesenses(e.g.visualsense)tocertaintypesofstimuliwithinacertainphysicalrange(e.g.visualsignalrange,pre‐knowledge,attentionetc.).Onlythestimuliwhichpassthroughthisfiltergetregisteredasincomingdata–everythingelseisfilteredout.Atthispointitisimportanttofollowourphysicalprincipleofdata:todifferentiatebetweentwonotionsthatarefrequentlyconfused:anexperiment’s(raw,hard,measured,factual)dataandits(meaningful,subjective)interpretedinformationresults.Dataarepropertiesconcerningonlytheinstrument;itistheexpressionofafact. Theresultconcernsapropertyoftheworld.Thefollowingconceptualfiltersextractinformation‐bearingdatafromwhathasbeenpreviouslyregistered.Bothtypesoffiltersareinfluencedbytheagents’cognitiveandaffectiveexpectations,storedintheirmentalmodels.Theenormousutilityofdataresidesinthefactthatitcancarryinformationaboutthephysicalworld.Thisinformationmaymodifysetexpectationsorthestate‐of‐knowledge.Theseprinciplesallowanagent toactinadaptivewaysinthephysicalworld(Boisot &Canals,2004).Conferthisprocesswiththehumaninformationprocessingmodelby(Wickens,1984),seeninSlide2‐19anddiscussedin→Lecture7.

40WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 41: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Entropyhasmanydifferentdefinitionsandapplications,originallyinstatisticalphysicsandmostoftenitisusedasameasurefordisorder.Ininformationtheory,entropycanbeusedasameasurefortheuncertaintyinadataset.

To demonstratehowusefulentropycanbe‐ youcanhavealookatthispaper:Holzinger,A.,Stocker,C.,Peischl,B.&Simonic,K.‐M.2012.OnUsingEntropyforEnhancingHandwritingPreprocessing.Entropy,14,(11),2324‐2350.http://www.mdpi.com/1099‐4300/14/11/2324

41WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 42: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Theconceptofentropywasfirstintroducedinthermodynamics(Clausius,1850),whereitwasusedtoprovideastatementofthesecondlawofthermodynamics.Later,statisticalmechanicsprovidedaconnectionbetweenthemacroscopicpropertyofentropyandthemicroscopicstateofasystembyBoltzmann.Shannonwasthefirsttodefineentropyandmutualinformation.

Shannon(1948)usedaGedankenexperiment(thoughtexperiment)toproposeameasureofuncertaintyinadiscretedistributionbasedontheBoltzmannentropyofclassicalstatisticalmechanics,seeSlide2‐22:

42WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 43: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Anexampleshalldemonstratetheusefulnessofthisapproach:1)Let beadiscretedatasetwithassociatedprobabilities :Eq.2‐5

… , … ,

2)NowweapplyShannon’sequationEq.2‐4:Eq.2‐6

3)Weassumethatoursourcehastwovalues(ball=white,ball=black)LetusdothefamoussimpleGedankenexperiment(thoughtexperiment):Imagineaboxwhichcancontaintwocoloredballs:blackandwhite.Thisisoursetofdiscretesymbolswithassociatedprobabilities.Ifwegrabblindlyintothisboxtogetaball,wearedealingwithuncertainty,becausewedonotknowwhichballwetouch.Wecanask:Istheballblack?NO.THENitmustbewhite,soweneedonequestiontosurelyprovidetherightanswer.Becauseitisabinarydecision(YES/NO)themaximumnumberof(binary)questionsrequiredtoreducetheuncertaintyis:log ,where isthenumberofthepossibleoutcomes.Ifthereare eventswithequalprobability then 1/ .Ifyouhaveonly1blackball,thenlog 1 0,whichmeansthereisnouncertainty.Eq.2‐7

, with , 14)NowwesolvenumericallyEq.2‐6:Eq.2‐8

∗ log1

∗ log1

1Since rangesfrom0(forimpossibleevents)to1(forcertainevents),theentropyvaluerangesfrominfinity(forimpossibleevents)to0(forcertainevents).So,wecansummarizethattheentropyistheweightedaverageofthesurpriseforallpossibleoutcomes.Forourexamplewiththetwoballswecandrawthefollowingfunction:Theentropyvalueis1for =0,5anditisboth0foreither =0or =1.Thisexamplemightseemtrivial,buttheentropyprinciplehasbeendevelopedalotsinceShannonandtherearemanydifferentmethods,whichareveryusefulfordealingwithdata.

43WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 44: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Shannoncalledittheinformationentropy (akaShannonentropy)anddefined:Eq.2‐9

log1

log

where istheprobabilityoftheeventoccurring.If isnotidenticalforalleventsthentheentropy isaweightedaverageofallprobabilities,whichShannondefinedas:Eq.2‐10

2

Basically,theentropyp(x)approacheszeroifwehaveamaximumofstructure– andopposite,theentropyp(x)reacheshighvaluesifthereisnostructure– hence,ideally,iftheentropyisamaximum,wehavecompleterandomness,totaluncertainty.LowEntropymeansdifferences,structure,individuality– highEntropymeansnodifferences,nostructure,noindividuality.Consequently,lifeneedslowentropy.

44WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 45: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Theprinciplewhatwecaninferfromentropyvaluesis:1)Lowentropy valuesmeanhighprobability,highcertainty, henceahighdegreeofstructurization inthedata.2)Highentropy valuesmeanlowprobability,lowcertainty (≅ highuncertainty;‐),hencealowdegreeofstructurization inthedata.Maximumentropywouldmeancompleterandomnessandtotaluncertainty.Highlystructureddatacontainlowentropy;ideallyifeverythingisinorderandthereisnosurprise(nouncertainty)theentropyislow:Eq.2‐11

0

Eq.2‐12 log .

Ontheotherhandifthedataareweaklystructured– asforexampleinbiologicaldata–andthereisnoabilitytoguess(alldataisequallylikely)theentropyishigh:Ifwefollowthisapproach,“unstructureddata”wouldmeancompleterandomness.Letuslookonthehistoryofentropytounderstandwhatwecandoinfuture,seeSlide2‐25.

45WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 46: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Youmight arguewhatthepracticalpurposeofthisapproachis– manifoldapplications!

Example:Heartratevariabilityisthevariationofthetimeintervalbetweenconsecutiveheartbeats.Entropyisacommonlyusedtooltodescribetheregularityofdatasets.Entropyfunctionsaredefinedusingmultipleparameters,theselectionofwhichiscontroversialanddependsontheintendedpurpose.Mayeretal. (2014)describetheresultsoftestsconductedtosupportparameterselection,towardsthegoalofenablingfurtherbiomarkerdiscovery.Theydealtwithapproximate,sample,fuzzy,andfuzzymeasureentropies.AlldatawereobtainedfromPhysioNet https://www.physionet.org,afree‐access,on‐linearchiveofphysiologicalsignals,andrepresentvariousmedicalconditions.Fivetestsweredefinedandconductedtoexaminetheinfluenceof:varyingthethresholdvaluer(asmultiplesofthesamplestandarddeviation?,ortheentropy‐maximizingrChon),thedatalengthN,theweightingfactorsnforfuzzyandfuzzymeasureentropies,andthethresholdsrF andrL forfuzzymeasureentropy.TheresultsweretestedfornormalityusingLilliefors'compositegoodness‐of‐fittest.Consequently,thep‐valuewascalculatedwitheitheratwosamplet‐testoraWilcoxonranksumtest.Thefirsttestshowsacross‐overofentropyvalueswithregardtoachangeofr.Thus,aclearstatementthatahigherentropycorrespondstoahighirregularityisnotpossible,butisratheranindicatorofdifferencesinregularity.Nshouldbeatleast200datapointsforr=0.2?andshouldevenexceedalengthof1000forr=rChon.Theresultsfortheweightingparametersnforthefuzzymembershipfunctionshowdifferentbehaviorwhencoupledwithdifferentrvalues,thereforetheweightingparametershavebeenchosenindependentlyforthedifferentthresholdvalues.ThetestsconcerningrF andrL showedthatthereisnooptimalchoice,butr=rF =rL isreasonablewithr=rChon orr=0.2?.CONCLUSIONS:Someofthetestsshowedadependencyofthetestsignificanceonthedataathand.Nevertheless,asthemedicalconditionsareunknownbeforehand,compromiseshadtobemade.Optimalparametercombinationsaresuggestedforthemethodsconsidered.Yet,duetothehighnumberofpotentialparametercombinations,furtherinvestigationsofentropyforheartratevariabilitydatawillbenecessary.Mayer,C.,Bachler,M.,Hortenhuber,M.,Stocker,C.,Holzinger,A.&Wassertheurer,S.2014.Selectionofentropy‐measureparametersforknowledgediscoveryinheartratevariabilitydata.BMCBioinformatics,15,(Suppl 6),S2.http://www.ncbi.nlm.nih.gov/pubmed/25078574

46WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 47: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

TheoriginmaybefoundintheworkofJakob Bernoulli,describingtheprincipleofinsufficientreason:weareignorantofthewaysaneventcanoccur,theeventwilloccurequallylikelyinanyway.ThomasBayes(1763)andPierre‐SimonLaplace(1774)carriedonandHaroldJeffreys andDavidCoxsolidifieditintheBayesianStatistics,akastatisticalinference.ThesecondpathleadingtotheclassicalMaximumEntropy,en‐routewiththeShannonEntropy,canbeidentifiedwiththeworkofJamesClerkMaxwellandLudwigBoltzmann,continuedbyWillardGibbsandfinallyClaudeElwoodShannon.Thisworkisgearedtowarddevelopingthemathematicaltoolsforstatisticalmodelingofproblemsininformation.Thesetwoindependentlinesofresearchareverysimilar.Theobjectiveofthefirstlineofresearchistoformulateatheory/methodologythatallowsunderstandingofthegeneralcharacteristics (distribution)ofasystemfrompartialandincompleteinformation.Inthesecondrouteofresearch,thesameobjectiveisexpressedasdetermininghowtoassign(initial)numericalvaluesofprobabilitieswhenonlysome(theoretical)limitedglobalquantitiesoftheinvestigatedsystemareknown.RecognizingthecommonbasicobjectivesofthesetwolinesofresearchaidedJaynes inthedevelopmentofhisclassicalwork,theMaximumEntropyformalism.Thisformalismisbasedonthefirstlineofresearchandthemathematicsofthesecondlineofresearch.TheinterrelationshipbetweenInformationTheory,statisticsandinference,andtheMaximumEntropy(MaxEnt)principlebecameclearin1950ies,andmanydifferentmethodsarosefromtheseprinciples(Golan,2008),seenextSlide

47WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 48: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

MaximumEntropy(MaxEn),describedby(Jaynes,1957),isusedtoestimateunknownparametersofamultinomialdiscretechoiceproblem,whereastheGeneralizedMaximumEntropy(GME)modelincludesnoisetermsinthemultinomialinformationconstraints.Eachnoisetermismodeledasthemeanofafinitesetofaprioriknownpointsintheinterval 1,1 withunknownprobabilitieswherenoparametricassumptionsabouttheerrordistributionaremade.AGMEmodelforthemultinomialprobabilitiesandforthedistributions,associatedwiththenoisetermsisderivedbymaximizingthejointentropyofmultinomialandnoisedistributions,undertheassumptionofindependence(Jaynes,1957).TopologicalEntropy (TopEn),wasintroducedby(Adler,Konheim &McAndrew,1965)withthepurpose tointroducethenotionofentropyasaninvariantforcontinuousmappings:Let , beatopologicaldynamicalsystem,i.e.,let beanonemptycompactHausdorff spaceand : → acontinuousmap;theTopEn isanonnegativenumberwhichmeasuresthecomplexity ofthesystem(Adler,Downarowicz &Misiurewicz,2008).GraphEntropywasdescribedby(Mowshowitz,1968)tomeasurestructuralinformationcontentofgraphs,andadifferentdefinition,morefocusedonproblemsininformationandcodingtheory,wasintroducedby(Körner,1973).Graphentropyisoftenusedforthecharacterizationofthethe structureofgraph‐basedsystems,e.g.inmathematicalbiochemistry.Intheseapplicationstheentropyofagraphisinterpretedasitsstructuralinformationcontentandservesasacomplexitymeasure,andsuchameasureisassociatedwithanequivalencerelationdefinedonafinitegraph;byapplicationofShannon’sEq.2.4withtheprobabilitydistributionwegetanumericalvaluethatservesasanindexofthestructuralfeaturecapturedbytheequivalencerelation(Dehmer&Mowshowitz,2011).

MinimumEntropy (MinEn),describedby(Posner,1975),providesustheleastrandom,andtheleastuniformprobabilitydistributionofadataset,i.e.theminimumuncertainty,whichisthelimitofourknowledgeandofthestructureofthesystem.Often,theclassicalpatternrecognitionisdescribedasaquestforminimumentropy.Mathematically,itismoredifficulttodetermineaminimumentropyprobabilitydistributionthanamaximumentropyprobabilitydistribution;whilethelatterhasaglobalmaximumduetotheconcavityoftheentropy,theformerhastobeobtainedbycalculatingalllocalminima,consequentlytheminimumentropyprobabilitydistributionmaynotexistinmanycases(Yuan&Kesavan,1998).CrossEntropy (CE),discussedby(Rubinstein,1997),wasmotivatedbyanadaptivealgorithmforestimatingprobabilitiesofrareeventsincomplexstochasticnetworks,whichinvolvesvarianceminimization.CEcanalsobeusedforcombinatorialoptimizationproblems(COP).Thisisdonebytranslatingthe“deterministic”optimizationproblemintoarelated“stochastic”optimizationproblemandthenusingrareeventsimulationtechniques(DeBoeretal.,2005).Rényi entropy isageneralizationoftheShannonentropy(informationtheory),andTsallis entropyisageneralizationofthestandardBoltzmann–Gibbsentropy(statisticalphysics).Forusmoreimportantare:ApproximateEntropy(ApEn),describedby(Pincus,1991),isuseabletoquantifyregularityindatawithoutanyaprioriknowledgeaboutthesystem,seeanexampleinSlide2‐20.SampleEntropy(SampEn),wasusedby(Richman&Moorman,2000)foranewrelatedmeasureoftimeseriesregularity.SampEn wasdesignedtoreducethebiasofApEn andisbettersuitedfordatasetswithknownprobabilisticcontent.

48WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 49: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Problem:Monitoringbodymovementsalongwithvitalparametersduringsleepprovidesimportantmedicalinformationregardingthegeneralhealth,andcanthereforebeusedtodetecttrends(largeepidemiologystudies)todiscoversevereillnessesincludinghypertension(whichisenormouslyincreasinginoursociety).Thisseeminglysimpledata– onlyfromonenightperiod– demonstratesthecomplexityandtheboundariesofstandardmethods(forexampleFastFourierTransformation)todiscoverknowledge(forexampledeviations,similaritiesetc.).Duetothecomplexityanduncertaintyofsuchdatasets,standardmethods(suchasFFT)comprisethedangerofmodelingartifacts.Sincetheknowledgeofinterestformedicalpurposesisinanomalies(alterations,differences,a‐typicalities,irregularities),theapplicationofentropicmethodsprovidesbenefits.PhotographtakenduringtheEUProjectEMERGEandusedwithpermission.

49WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 50: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

1)Wehaveagivendataset wherecapital isthenumberofdatapoints:Eq.2‐13

, , … ,

2)Nowweformm‐dimensionalvectorsEq.2‐14

, , … ,3)Wemeasurethedistancebetweeneverycomponent,i.e.themaximumabsolutedifferencebetweentheirscalarcomponentsEq.2‐15

, max, ,…,

4)Welook– sotosay– inwhichdimensionisthebiggestdifference;asaresultwegettheApproximateEntropy(ifthereisnodifferencewehavezerorelativeentropy):Eq.2‐16

ApEn , lim →

where istherunlengthand isthetolerancewindow (letusassumethat isequalto ),ApEn (m,r)couldalsobewrittenasH ,5) iscomputedbyEq.2‐17

1 1

ln

withEq.2‐18

1

6) measureswithinthetolerance theregularityofpatternssimilartoagivenoneofwindowlength7)Finallyweincreasethedimensionto 1 andrepeatthestepsbeforeandgetasaresulttheapproximateentropyApEn ,ApEn , , isapproximatelythenegativenaturallogarithmoftheconditionalprobability(CP)thatadatasetoflength,havingrepeateditselfwithinatolerance for points,willalsorepeatitselffor 1 points.Animportantpointto

keepinmindabouttheparameter isthatitiscommonlyexpressedasafractionoftheStandarddeviation(SD)ofthedataandinthiswaymakesApEn ascale‐invariantmeasure.Alowvaluearisesfromahighprobabilityofrepeatedtemplatesequencesinthedata(Hornero etal.,2006).

50WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 51: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Inthisslidewecanseetheplotofthenormalizedapproximateentropyforeachoftheepisodesandthemedianacrossalltheepisodes.Fromthisfigurewecanseethattheentropyisaminimumwherewehavenoalterationsandentropyisincreasingwhenhavingirregularities.Ifwehavenodifferenceswegetzeroentropy

51WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 52: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Afinalexampleshouldmaketheadvantageofsuchanentropymethodtotallyclear:Intherightdiagramitishardtodiscoverirregularitiesforamedicalprofessional–especiallyoveralongerperiod,butananomalycaneasilybedetectedbydisplayingthemeasuredrelativeApEn.Whatcanwelearnfromthisexperiment?Approximateentropyisrelativelyunaffectedbynoise;itcanbeappliedtocomplextimeserieswithgoodreproduction;itisfiniteforstochastic,noisy,compositeprocesses;thevaluescorresponddirectlytoirregularities;anditisapplicabletomanyotherareas– forexamplefortheclassificationoflargesetsoftexts– theabilitytoguessalgorithmicallythesubjectofatextcollectionwithouthavingtoreaditwouldpermitautomatedclassification.

52WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 53: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

A. Holzinger    LV 709.049 Med. Informatik                                

WS 2015 53

Page 54: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Algorithm 1:WithrotateDataPoints”definedwecancalculatetheprojectionprofilepy;j( forarangeofdifferentanglesandwiththosewecancomputetheentropyHy, (X)foreachangle.Holzinger etal.(2012)implementedthisalgorithmasitisdescribedinAlgorithm2.Formoreinformationplease refertothepaper.Thisjustshalldemonstratethestrengthsandweaknessesofusingentropyforskew‐ andslant‐correction– whichisimportantinhandwritingrecognition.However,theentropybasedskewcorrectiondoesnotoutperformoldermethodslikeskewcorrectionbasedontheleastsquaresmethod:thenoiseinthedrawingdistortstherealminimaoftheentropydistribution.Inmanycaseswheretheglobalminimumwasthewrongchoice,therewasalocalminimumclosetotherealerrorangle.Eventhoughbothapproachesyieldsatisfyingresultforwordslongerthanfiveletters,wesuggestfurtherinvestigationintotheentropy‐basedskewcorrectionmethod,withnoisereductioninmind.Ontheotherhand,itshowsthatentropyisinfactusefulwhenperformingslantcorrection,asitdoesoutperformthewindow‐basedapproach!Theconclusionis,thatthewindow‐basedmethodistoomuchdependentonanumberoffactors.Itsperformanceisinfluencedalotbytheoutcomeofzonedetectionandbythewritingstyleofthewriter.Itisalsoinfluencedalotbywindowselection.Holzinger,A.,Stocker,C.,Peischl,B.&Simonic,K.‐M.2012.OnUsingEntropyforEnhancingHandwritingPreprocessing.Entropy,14,(11),2324‐2350

A. Holzinger    LV 709.049 Med. Informatik                                

WS 2015 54

Page 55: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Whatcanwelearnfromthisexperiment?Approximateentropyisrelativelyunaffectedbynoise;itcanbeappliedtocomplextimeserieswithgoodreproducibility;itisfiniteforstochastic,noisy,compositeprocesses;thevaluescorresponddirectlytoirregularities;

anditisapplicabletomanyotherareas– forexamplefortheclassificationoflargesetsoftexts– theabilitytoguessalgorithmicallythesubjectofatextcollectionwithouthavingtoreaditwouldpermitautomatedclassification.

55WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 56: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

56

My DEDICATION is to make data valuable … Thank you!

WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 57: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

57WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 58: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

58WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 59: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

MFC=MinimumFoot ClearanceStride=stepYoucanseebrilliantlywhatyoucanmeasurewithentropy– youcandetermineanomalies,i.e.thebalanceproblemsofelderlygait

MFCPoincaré plots.ToppanelsshowMFCtimeseriesfromahealthyelderlysubject(A)anditscorrespondingPoincaré plot(B).BottompanelsshowMFCtimeseriesfromanelderlysubjectwithbalanceproblem(C)anditscorrespondingPoincaré plot(D).

SignificantrelationshipsofmeanMFCwithPoincaré plotindexes(SD1,SD2)andApEn (r=0.70,p<0.05;r=0.86,p<0.01;r=0.74,p<0.05)werefoundinthefalls‐riskelderlygroup.Ontheotherhand,suchrelationshipswereabsentinthehealthyelderlygroup.Incontrast,theApEn valuesofMFCdataseriesweresignificantly(p<0.05)correlatedwithPoincaré plotindexesofMFCinthehealthyelderlygroup,whereascorrelationswereabsentinthefalls‐riskgroup.TheApEn valuesinthefalls‐riskgroup(meanApEn =0.18± 0.03)wassignificantly(p<0.05)higherthanthatinthehealthygroup(meanApEn =0.13± 0.13).ThehigherApEn valuesinthefalls‐riskgroupmightindicateincreasedirregularitiesandrandomnessintheirgaitpatternsandanindicationoflossofgaitcontrolmechanism.ApEn valuesofrandomlyshuffledMFCdataoffallsrisksubjectsdidnotshowanysignificantrelationshipwithmeanMFC.

59WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 60: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

60WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 61: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

61WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 62: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Surrogatedatarecords.AandBshowthemajorcomponents.A:themeanprocess,whichhassetpointandspikemodes.B:thebaselineprocess,heremeaningtheheartratevariability,modeledasGaussianrandomnumbers.C:theirsum,asurrogatedatarecord.D–F:amorerealisticsurrogatewiththesamefrequencycontentastheobserveddata.D:aclinicallyobserveddatarecordof4,096R‐Rintervals.Thelefthand ordinateislabeledinms andtherighthand ordinateinSD.E:a4,096‐pointisospectral surrogatedatasetformedusingtheinverseFouriertransformoftheperiodogram ofthedatainD.F:thesurrogatedataafteradditionofaclinicallyobserveddecelerationlasting50pointsandscaledsothatthevarianceoftherecordisincreasedfrom1to2.

62WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 63: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

63WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 64: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

64WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 65: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

65WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 66: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

66WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 67: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_entropy_sect018.htm

Wheremanyotherlanguagesrefertotables,rows,andcolumns/fields,SASusesthetermsdatasets,observations,andvariables.ThereareonlytwokindsofvariablesinSAS:numericandcharacter(string).Bydefaultallnumericvariablesarestoredas(8byte)real.Itispossibletoreduceprecisioninexternalstorageonly.DateanddatetimevariablesarenumericvariablesthatinherittheCtraditionandarestoredaseitherthenumberofdays(fordatevariables)orseconds(fordatetime variables).

http://www.sas.com/technologies/analytics/statistics/stat/index.html

67WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 68: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Hadoop andtheMapReduce programmingparadigmalreadyhaveasubstantialbaseinthebioinformaticscommunity– inparticular inthefieldofhigh‐throughput next‐generationsequencinganalysis.Thisisduetothecost‐effectivenessofHadoop‐basedanalysisoncommodityLinuxclusters,andinthecloudviadatauploadtocloudvendorswhohaveimplementedHadoop/HBase;andduetotheeffectivenessandease‐of‐useoftheMapReduce methodinparallelizationofmanydataanalysisalgorithms.

68WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 69: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Thechallengewefaceisthatanestimatedaverageof5%ofdataarestructured,therestiseithersemi‐structured,weaklystructuredandmostofourdataisunstructured.Maybethemostimportantfield forthefutureisdatamining– especiallynoveltechniquesofdatamining,includingbothtimeandspace(e.g.graph‐based,entropy‐based,topological‐baseddataminingapproaches).Readmorehere:Holzinger,A.2014.ExtravaganzaTutorialonHotIdeasforInteractiveKnowledgeDiscoveryandDataMininginBiomedicalInformatics.In:Slezak,D.,Tan,A.‐H.,Peters,J.F.&Schwabe,L.(eds.)BrainInformaticsandHealth,BIH2014,LectureNotesinArtificialIntelligence,LNAI8609.Heidelberg,Berlin:Springer,pp.502‐515.https://online.tugraz.at/tug_online/voe_main2.getVollText?pDocumentNr=764238&pCurrPk=79139

69WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 70: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

70WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 71: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

http://minnesotafuturist.pbworks.com/w/page/21441129/DIKW

Afunnydescription ofdatainformationknowledge.

71WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 72: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Avery placative image.Nicetolookat– buttheusefulnessisquestionable.

72WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 73: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Allthismodelsareveryquestionable. PleaserememberthatwefollowinourlecturethenotionofBoisot &Canals.

73WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

Page 74: A. Holzinger 709.049 Med. Informatikhuman-centered.ai/wordpress/wp-content/uploads/... · Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary

Theinterestingissue ofthisgraphicisthatitincludesatime‐axis,whichisimportantfordecisionmakingandpredictiveanalytics.“Pastbehaviourisagoodpredictorforfuturebehaviour”Althoughthis isaoversimplification,scientistswhostudyhumanbehavioragreethatpastbehaviormaybeausefulmarkerforfuturebehavior,however,onlyundercertainspecificconditions. Readmore:Ajzen,I.1991.Thetheoryofplannedbehavior.Organizationalbehaviorandhumandecisionprocesses,50,(2),179‐211.

74WS 2015

A. Holzinger    LV 709.049 Med. Informatik