Post on 09-Mar-2018
Summary
• KDD and Data Mining Tasks
• Finding the optimal approach
• Supervised Models
– Neural Networks
– Multi-Layer Perceptron
– Decision Trees
• Unsupervised Models
– Different Types of Clustering
– Distances and Normalization
– K-means
– Self-Organizing Maps
• Combining different models
– Committee Machines
– Introducing a Priori Knowledge
– Sleeping Expert Framework
Knowledge Discovery in Databases
• KDD may be defined as: "The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data".
• KDD is an interactive and iterative process involving several steps.
Clean your data!
• Data preprocessing transforms the raw data into a format that will be more easily and effectively processed for the purpose of the user.
• Some tasks:
– sampling: selects a representative subset from a large population of data;
– noise treatment;
– strategies to handle missing data: sometimes your rows will be incomplete, as not all parameters are measured for all samples;
– normalization;
– feature extraction: pulls out specified data that is significant in some particular context.
Use standard formats!
Missing Data
• Missing data are a part of almost all research, and we all have to decide how to deal with them.
• Complete Case Analysis: use only rows with all the values.
• Available Case Analysis.
• Substitution:
– Mean Value: replace the missing value with the mean value for that particular attribute.
– Regression Substitution: we can replace the missing value with a historical value from similar cases.
– Matching Imputation: for each unit with a missing y, find a unit with similar values of x in the observed data and take its y value.
– Maximum Likelihood, EM, etc.
• Some DM models can deal with missing data better than others.
• Which technique to adopt really depends on your data.
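As an illustration, mean-value substitution can be sketched in a few lines of Python; the toy dataset and the use of None as the missing-value marker are assumptions made for the example.

```python
# Mean-value substitution: a minimal sketch (toy data, None marks a missing value).

def impute_mean(rows, col):
    """Replace missing values (None) in column `col` with the column mean."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    for r in rows:
        if r[col] is None:
            r[col] = mean
    return rows

samples = [[5.1, 3.5], [4.9, None], [4.7, 3.1]]
impute_mean(samples, 1)
print(samples)  # the missing value becomes the mean of the observed ones, (3.5 + 3.1) / 2
```

Regression substitution and matching imputation follow the same pattern, but predict the replacement value from the other attributes instead of using a single column statistic.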
Data Mining
• Crucial task within the KDD process.
• Data Mining is about automating the process of searching for patterns in the data.
• In more detail, the most relevant DM tasks are:
– association
– sequence or path analysis
– clustering
– classification
– regression
– visualization
Finding a Solution via Purposes
• You have your data: what kind of analysis do you need?
• Regression:
– predict new values based on the past, inference;
– compute the new values for a dependent variable based on the values of one or more measured attributes.
• Classification:
– divide samples into classes;
– use a training set of previously labeled data.
• Clustering:
– partitioning of a data set into subsets (clusters) so that the data in each subset ideally share some common characteristics.
• Classification is in some way similar to clustering, but requires that the analyst know ahead of time how the classes are defined.
Classification
• Data mining technique used to predict group membership for data instances. There are two ways to assign a new value to a given class.
• Crisp classification:
– given an input, the classifier returns its label.
• Probabilistic classification:
– given an input, the classifier returns its probabilities of belonging to each class;
– useful when some mistakes can be more costly than others (e.g. "give me only data classified with probability > 90%");
– winner-takes-all and other rules:
• assign the object to the class with the highest probability (WTA);
• …but only if its probability is greater than 40% (WTA with thresholds).
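The winner-takes-all rule with a threshold can be sketched as follows; only the 40% threshold comes from the slide, while the class names and probability values are made up for illustration.

```python
# WTA with a rejection threshold: a minimal sketch (illustrative probabilities).

def wta(probabilities, threshold=0.4):
    """Return the most probable class, or None if the winner is below the threshold."""
    label, p = max(probabilities.items(), key=lambda kv: kv[1])
    return label if p > threshold else None

probs = {"setosa": 0.55, "versicolor": 0.30, "virginica": 0.15}
print(wta(probs))                 # "setosa": above the 40% threshold
print(wta(probs, threshold=0.6))  # None: the winner is not confident enough
```

Returning None instead of a label is one way to encode "reject this sample", which is exactly what makes probabilistic classification useful when some mistakes are more costly than others.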
Regression / Forecasting
• Data table statistical correlation:
– mapping without any prior assumption on the functional form of the data distribution;
– machine learning algorithms are well suited for this.
• Curve fitting:
– find a well-defined and known function underlying your data;
– theory/expertise can help.
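As a minimal curve-fitting sketch, assume the known functional form is a straight line y = ax + b; the least-squares coefficients then have a closed form (the data points below are illustrative).

```python
# Least-squares fit of a known functional form (a straight line), computed
# with the closed-form formulas; the data are illustrative.

def fit_line(xs, ys):
    """Return slope a and intercept b minimizing sum((y - (a*x + b))**2)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
print(fit_line(xs, ys))    # (2.0, 1.0)
```

When no functional form is known in advance, this is where the machine learning approach of the first bullet takes over.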
Machine Learning
• To learn: to get knowledge of by study, experience, or being taught.
• Types of Learning:
– Supervised
– Unsupervised
Unsupervised Learning
• The model is not provided with the correct results during the training.
• Can be used to cluster the input data into classes on the basis of their statistical properties only.
• Cluster significance and labeling.
• The labeling can be carried out even if the labels are only available for a small number of objects representative of the desired classes.
Supervised Learning
• Training data include both the input and the desired results.
• For some examples the correct results (targets) are known and are given in input to the model during the learning process.
• The construction of proper training, validation and test sets is crucial.
• These methods are usually fast and accurate.
• They have to be able to generalize: give the correct results when new data are given in input, without knowing the target a priori.
Generalization
• Refers to the ability to produce reasonable outputs for inputs not encountered during the training.
• In other words: NO PANIC when "never seen before" data are given in input!
A common problem: OVERFITTING
• The model learns the "data" and not the underlying function.
• It performs well on the data used during the training and poorly with new data.
• How to avoid it: use proper subsets, early stopping.
Datasets
• Training set: a set of examples used for learning, where the target value is known.
• Validation set: a set of examples used to tune the architecture of a classifier and estimate the error.
• Test set: used only to assess the performance of a classifier. It is never used during the training process, so the error on the test set provides an unbiased estimate of the generalization error.
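A sketch of how such a three-way split might be constructed; the 60/20/20 proportions and the fixed random seed are arbitrary choices made for the example.

```python
# Shuffle-and-split into training, validation and test sets (60/20/20 is an
# arbitrary but common choice; the seed fixes the shuffle for reproducibility).
import random

def split_dataset(data, f_train=0.6, f_val=0.2, seed=0):
    """Return (training, validation, test) subsets of `data`."""
    data = list(data)
    random.Random(seed).shuffle(data)  # shuffle to avoid ordering effects
    n_train = int(len(data) * f_train)
    n_val = int(len(data) * f_val)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

The shuffle matters: if the data are sorted by class (as in the IRIS file), a contiguous split would leave entire classes out of the training set.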
IRIS dataset
• IRIS:
– consists of 3 classes, 50 instances each;
– 4 numerical attributes (sepal and petal length and width, in cm);
– each class refers to a type of Iris plant (Setosa, Versicolor, Virginica);
– the first class is linearly separable from the other two, while the 2nd and the 3rd are not linearly separable from each other.
Data Selection
• "Garbage in, garbage out": training, validation and test data must be representative of the underlying model.
• All eventualities must be covered.
• Unbalanced datasets:
– since the network minimizes the overall error, the proportion of the types of data in the set is critical;
– inclusion of a loss matrix (Bishop, 1995);
– often, the best approach is to ensure even representation of different cases, then to interpret the network's decisions accordingly.
Artificial Neural Network
An Artificial Neural Network is an information processing paradigm that is inspired by the way biological nervous systems process information:
"a large number of highly interconnected simple processing elements (neurons) working together to solve specific problems"
A simple artificial neuron
• The basic computational element is often called a node or unit. It receives input from some other units, or from an external source.
• Each input has an associated weight w, which can be modified so as to model synaptic learning.
• The unit computes some function of the weighted sum of its inputs:

y = f( Σ_j w_j x_j )
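A minimal sketch of such a unit, assuming a logistic sigmoid as the activation function f; the weights and inputs are illustrative.

```python
# A single artificial unit: weighted sum of the inputs passed through an
# activation function (a logistic sigmoid is assumed here).
import math

def neuron(inputs, weights, bias=0.0):
    """Compute f(sum_j w_j * x_j + bias) with a sigmoid activation f."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-s))

print(neuron([1.0, 0.5], [0.4, -0.2]))  # sigmoid(0.4*1.0 - 0.2*0.5)
```

Modifying the weights w is what "learning" means for such a unit, as the back-propagation slides below make concrete.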
Neural Networks
A Neural Network is usually structured into an input layer of neurons, one or more hidden layers and one output layer. Neurons belonging to adjacent layers are usually fully connected, and the various types and architectures are identified both by the different topologies adopted for the connections and by the choice of the activation function. The values of the functions associated with the connections are called "weights".
The whole game of using NNs lies in the fact that, in order for the network to yield appropriate outputs for given inputs, the weights must be set to suitable values.
The way this is obtained allows a further distinction among modes of operation.
Neural Networks: types
• Feedforward: Single-Layer Perceptron, MLP, ADALINE (Adaptive Linear Neuron), RBF.
• Self-Organized: SOM (Kohonen Maps).
• Recurrent: Simple Recurrent Network, Hopfield Network.
• Stochastic: Boltzmann machines, RBM.
• Modular: Committee of Machines, ASNN (Associative Neural Networks), Ensembles.
• Others: Instantaneously Trained, Spiking (SNN), Dynamic, Cascades, Neuro-Fuzzy, PPS, GTM.
Multi-Layer Perceptron
• The MLP is one of the most used supervised models: it consists of multiple layers of computational units, usually interconnected in a feed-forward way.
• Each neuron in one layer has direct connections to all the neurons of the subsequent layer.
Learning Process
• Back Propagation:
– the output values are compared with the target to compute the value of some predefined error function;
– the error is then fed back through the network;
– using this information, the algorithm adjusts the weights of each connection in order to reduce the value of the error function.
After repeating this process for a sufficiently large number of training cycles, the network will usually converge.
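The loop above can be sketched for the simplest possible case: a single linear unit trained by gradient descent on the squared error. The toy data and the learning rate are assumptions made for the example.

```python
# The compare / feed back / adjust cycle for one linear unit, y = w*x + b,
# trained by gradient descent on the squared error (toy data).

def train_unit(samples, epochs=200, lr=0.05):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            y = w * x + b         # forward pass: compute the output
            error = y - target    # compare the output with the target
            w -= lr * error * x   # feed the error back: adjust the weight...
            b -= lr * error       # ...and the bias to reduce the error function
    return w, b

w, b = train_unit([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])  # data from y = 2x + 1
print(round(w, 2), round(b, 2))  # converges to about 2.0 and 1.0
```

In a real MLP the same error signal is propagated backwards through the hidden layers via the chain rule, which is what "back propagation" adds to plain gradient descent.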
Hidden Units
• The best number of hidden units depends on:
– the number of inputs and outputs;
– the number of training cases;
– the amount of noise in the targets;
– the complexity of the function to be learned;
– the activation function.
• Too few hidden units => high training and generalization error, due to underfitting and high statistical bias.
• Too many hidden units => low training error but high generalization error, due to overfitting and high variance.
• Rules of thumb don't usually work.
Results: completeness and contamination
Exercise: compute completeness and contamination for the previous confusion matrix (test set).
Decision Trees
• Decision trees are another classification method.
• A decision tree is a set of simple rules, such as "if the sepal length is less than 5.45, classify the specimen as setosa."
• Decision trees are also nonparametric, because they do not require any assumptions about the distribution of the variables in each class.
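The quoted rule can be written directly as code. Only the 5.45 sepal-length split comes from the slide; the second split on petal length and its 4.75 threshold are hypothetical, added only to complete the example into a small tree.

```python
# A hand-written two-rule decision "tree" for the IRIS classes.
# The 5.45 threshold is from the slide; the petal-length split is hypothetical.

def classify(sepal_length, petal_length):
    if sepal_length < 5.45:
        return "setosa"
    elif petal_length < 4.75:   # hypothetical second split
        return "versicolor"
    else:
        return "virginica"

print(classify(5.0, 1.4))  # "setosa": sepal length below 5.45
print(classify(6.3, 5.1))  # "virginica"
```

A learned tree is exactly this kind of nested if/else structure, with the attributes and thresholds chosen automatically from the training set.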
Unsupervised Learning
Types of Clustering
• HIERARCHICAL: finds successive clusters using previously established clusters.
– agglomerative (bottom-up): start with each element in a separate cluster and merge them according to a given property;
– divisive (top-down).
• PARTITIONAL: usually determines all clusters at once.
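K-means (listed in the summary) is the classic partitional method: all K clusters are determined at once by alternating an assignment step and a centroid-update step. A minimal one-dimensional sketch, with illustrative data and starting centroids:

```python
# 1-D K-means: alternate assignment and update steps (illustrative data).

def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            clusters[i].append(p)
        # update step: each centroid moves to the mean of its cluster
        centroids = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, centroids)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 5.0])
print(centroids)  # roughly [1.0, 9.0]
```

With vectors instead of scalars, the absolute difference becomes whichever distance (and normalization) the following slides discuss, which is why the choice of metric matters so much.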
In case of strings…
• The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.
– It measures the minimum number of substitutions required to change one string into the other.
• The Levenshtein (edit) distance is a metric for measuring the amount of difference between two sequences.
– It is defined as the minimum number of edits needed to transform one string into the other.

Examples:
HD(1001001, 1000100) = 3
LD(BIOLOGY, BIOLOGIA) = 2: BIOLOGY -> BIOLOGI (substitution), BIOLOGI -> BIOLOGIA (insertion)
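Both distances are short to implement; a sketch reproducing the two examples, assuming plain Python strings:

```python
# Hamming and Levenshtein distances between strings.

def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution (free if equal)
        prev = cur
    return prev[-1]

print(hamming("1001001", "1000100"))       # 3
print(levenshtein("BIOLOGY", "BIOLOGIA"))  # 2
```

Note that the Hamming distance is only defined for equal lengths, while the Levenshtein distance also covers insertions and deletions, which is why it handles BIOLOGY vs BIOLOGIA.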
Normalization
• VAR: the mean of each attribute of the transformed set of data points is reduced to zero by subtracting the mean of each attribute from the values of the attribute and dividing the result by the standard deviation of the attribute.
• RANGE (Min-Max Normalization): subtracts the minimum value of an attribute from each value of the attribute and then divides the difference by the range of the attribute. It has the advantage of preserving exactly all relationships in the data, without adding any bias.
• SOFTMAX: a way of reducing the influence of extreme values or outliers in the data without removing them from the dataset. It is useful when you have outlier data that you wish to include in the dataset while still preserving the significance of data within a standard deviation of the mean.
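The three rescalings can be sketched per attribute as follows; the logistic-of-z-score form used for the softmax scaling is one common definition, assumed here.

```python
# The three normalizations above, applied to one attribute (a list of floats).
import math

def var_norm(xs):
    """VAR: zero mean, unit standard deviation (z-scores)."""
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / sd for x in xs]

def range_norm(xs):
    """RANGE: min-max scaling to [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def softmax_norm(xs):
    """SOFTMAX: squash the z-scores through a logistic (one common definition)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in var_norm(xs)]

print(range_norm([0.0, 5.0, 10.0]))  # [0.0, 0.5, 1.0]
```

Note how the logistic squashing keeps extreme outliers inside (0, 1) without discarding them, while values near the mean stay in the nearly linear central region.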
Learning K
• Find a balance between two variables: the number of clusters (K) and the average variance of the clusters.
• Minimize both values.
• As the number of clusters increases, the average variance decreases (up to the trivial case of k = n and variance = 0).
• Some criteria:
– BIC (Bayesian Information Criterion)
– AIC (Akaike Information Criterion)
– Davies-Bouldin Index
– Confusion Matrix