
Machine Learning in the Wild

Dealing with Messy Data

Rajmonda S. Caceres

SDS293 – Smith College, October 30, 2017

Analytical Chain: From Data to Actions

What this lecture is about

• What are some of the data quality issues in real applications?
• Why should we care about data quality?
• How can we leverage statistical techniques to improve the quality of real data?

Data Collection → Data Cleaning/Preparation → Analysis → Visual Representation → Decision Making

Data Quality Issues in Real Applications

• Missing data
• Outliers
• Noisy data
• Data mismatch
• Curse of dimensionality

• Data quality greatly affects the analysis and the insights we draw from the data:
  • Introduces bias
  • Causes information loss

Effects of Missing Data on Regression

• Which regression model should we pick?

Figures are from Chris Bishop's book "Pattern Recognition and Machine Learning"

Figure panels: sufficient data points to perform regression vs. insufficient data points to perform regression

Effects of Outliers & Missing Data on PCA

• Poor quality data greatly affects dimensionality reduction methods


Missing Data

• How much missing data is too much?
• Rough guideline: if less than 5-10% of the data is missing, it is likely inconsequential; if more, you need to impute

Sample   X1   X2   X3   X4   Y
1                        ?
2                             ?
3
4        ?    ?    ?    ?    ?
5        ?    ?    ?    ?    ?

Samples 1–3 form the observed subpopulation (missingness at the feature level); samples 4–5 form the unobserved subpopulation (missingness at the sample level).
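Missingness at the feature and sample level can be quantified directly. A minimal sketch in pandas, using hypothetical numbers that mirror the pattern in the table (the exact columns holding each "?" are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: X4 and Y each have a missing value in the
# observed rows, and samples 4 and 5 are entirely unobserved.
df = pd.DataFrame({
    "X1": [1.0, 2.0, 3.0, np.nan, np.nan],
    "X2": [0.5, 1.5, 2.5, np.nan, np.nan],
    "X3": [4.0, 5.0, 6.0, np.nan, np.nan],
    "X4": [np.nan, 7.0, 8.0, np.nan, np.nan],
    "Y":  [0.0, np.nan, 1.0, np.nan, np.nan],
})

# Feature-level missingness: fraction of missing values per column
feature_missing = df.isna().mean()

# Sample-level missingness: fraction of missing values per row
sample_missing = df.isna().mean(axis=1)

# Flag columns above the rough 5-10% guideline (10% threshold here)
needs_imputation = feature_missing[feature_missing > 0.10]
```

Per the guideline above, any column in `needs_imputation` is a candidate for one of the imputation methods that follow.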

Mechanisms of Missing Data (Rubin 1976)

• Missing Completely At Random (MCAR): the missingness does not depend on observed or unobserved data
  • P(missing | complete data) = P(missing)
• Missing At Random (MAR): the missingness depends only on observed data
  • P(missing | complete data) = P(missing | observed data)
• Missing Not At Random (MNAR): the hardest mechanism to address
  • P(missing | complete data) ≠ P(missing | observed data)

Mean Imputation

1. Assume the distribution of the missing data is the same as that of the observed data
2. Replace missing values with the mean, median, or another point estimate

• Advantages:
  • Works well if the data is MCAR
  • Convenient, easy to implement
• Disadvantages:
  • Introduces bias by smoothing out the variance
  • Changes the magnitude of correlations between variables

Figure (imputed data points) taken from: https://www.iriseekhout.com/missing-data/missing-data-methods/imputation-methods/
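A minimal NumPy sketch of mean imputation on a toy feature (hypothetical values), which also makes the main disadvantage visible: the imputed series is less spread out than the observed values alone:

```python
import numpy as np

# Hypothetical feature with two missing entries
x = np.array([2.0, 4.0, np.nan, 6.0, np.nan, 8.0])

observed = x[~np.isnan(x)]

# Replace every missing entry with the mean of the observed values
x_imputed = np.where(np.isnan(x), observed.mean(), x)

# Smoothing effect: the variance after imputation is lower than the
# variance among the observed values, since imputed points sit exactly
# at the mean.
```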

Regression Imputation

• Better imputation: leverage relationships among attributes
• Suited to monotone missing patterns
• Replace missing values with predicted scores from a regression equation
• Stochastic regression: add an error term to the predicted score
• Advantage:
  • Uses information from the observed data
• Disadvantage:
  • Overestimates the model fit and understates the variance

y_i = a·x_i + b

Figure (imputed data points) taken from: https://www.iriseekhout.com/missing-data/missing-data-methods/imputation-methods/
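The regression equation can be turned into both deterministic and stochastic imputation. A sketch with hypothetical (x, y) pairs, where y is partly missing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: x fully observed, y missing for some samples
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, np.nan, 8.2, np.nan, 12.1])

obs = ~np.isnan(y)

# Fit y = a*x + b on the observed pairs
a, b = np.polyfit(x[obs], y[obs], 1)

# Deterministic regression imputation: fill with the predicted score
y_det = np.where(obs, y, a * x + b)

# Stochastic regression imputation: add residual-scale noise so the
# variance of the imputed variable is not understated
resid_std = np.std(y[obs] - (a * x[obs] + b))
noise = rng.normal(0.0, resid_std, size=x.shape)
y_stoch = np.where(obs, y, a * x + b + noise)
```

The deterministic fill places every imputed point exactly on the regression line, which is what inflates the apparent model fit; the stochastic variant counteracts that.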

k-Nearest Neighbor (kNN) Imputation

• Leverage similarities among different samples in the data
• Advantages:
  • Doesn't require a model to predict the missing values
  • Simple to implement
  • Can capture the variation in the data due to its locality
• Disadvantages:
  • Sensitive to how we define what "similar" means

Figure: kNN imputation with k = 3 (the "?" marks the value to be imputed)
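A self-contained sketch of kNN imputation (the helper `knn_impute` and the data are ours, not from the slides): distances are computed on the target row's observed features, and the missing entry is filled with the mean of the k nearest fully observed rows:

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill missing entries with the mean of the k nearest complete rows.

    A simplification: only fully observed rows can serve as neighbours,
    and distances use just the coordinates observed in the target row.
    """
    X = X.astype(float).copy()
    complete = ~np.isnan(X).any(axis=1)
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])
        # Euclidean distance on the observed coordinates only
        d = np.linalg.norm(X[complete][:, obs] - X[i, obs], axis=1)
        nearest = X[complete][np.argsort(d)[:k]]
        X[i, ~obs] = nearest[:, ~obs].mean(axis=0)
    return X

# Toy example: the last row is missing its second feature; its three
# nearest neighbours on the first feature are the first three rows.
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 0.8], [5.0, 5.0], [1.0, np.nan]])
X_filled = knn_impute(X, k=3)
```

The sensitivity noted above shows up directly here: swapping the Euclidean distance for another metric can change which rows count as neighbours, and hence the imputed value.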

Multiple Imputation

Incomplete data → m imputed data sets (Set 1, Set 2, …, Set m) → m analysis results (R1, R2, …, Rm) → pooled results

• Treat the missing value as a random variable and impute it m times
• Minimize the bias introduced by data imputation through averaging
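A bare-bones sketch of that pipeline: impute m times (treating each missing value as a draw from the observed distribution), run the analysis on each completed set, then pool by averaging. The data and the "analysis" step (a sample mean) are placeholders:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical feature with missing values
x = np.array([3.0, np.nan, 5.0, 7.0, np.nan, 9.0])
obs = x[~np.isnan(x)]

m = 20
estimates = []
for _ in range(m):
    # Imputation step: draw each missing entry from a normal fitted
    # to the observed values (treating it as a random variable)
    draws = rng.normal(obs.mean(), obs.std(ddof=1), size=np.isnan(x).sum())
    x_i = x.copy()
    x_i[np.isnan(x_i)] = draws
    # Analysis step: here, just the sample mean of the completed data
    estimates.append(x_i.mean())

# Pooling step: average the m analysis results
pooled = np.mean(estimates)
```

Averaging over the m completed data sets is what keeps any single imputation's randomness from dominating the final estimate.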

Outliers

• Outlier: "an observation which deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism", Hawkins (1980)
• Outliers vs. noise
• Outlier detection vs. novelty detection

Outlier Detection Techniques

• Grouped by the assumptions made about normal data and outliers:
  • Statistical
  • Proximity-based
  • Clustering-based methods
• Grouped by access to labeled outlier examples:
  • Supervised
  • Semi-supervised
  • Unsupervised

Types of Outliers

• Universal vs. contextual
  • O1, O2: local outliers (relative to cluster C1)
  • O3: global outlier
• Singular vs. collective

Statistical Methods

• Assume "normal" observations follow some statistical model
  • Example: assume the normal points come from a Gaussian distribution
• Learn the parameters of the model from the data
• Data points not following the model are considered outliers
  • Example 1: throw away points that fall at the tails
  • Example 2: throw away low-probability points
• Disadvantage: the method depends on the model assumption
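The Gaussian example can be sketched with a z-score rule (the 3-standard-deviation threshold is a common convention, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "normal" data with one injected outlier at 8.0
data = np.concatenate([rng.normal(0.0, 1.0, 200), [8.0]])

# Learn the model parameters from the data
mu, sigma = data.mean(), data.std()

# Points far in the tails (|z| > 3) are flagged as outliers
z = np.abs(data - mu) / sigma
outliers = data[z > 3]
```

The disadvantage noted above applies here too: if the normal points were not Gaussian (say, heavy-tailed), this threshold would flag many legitimate observations.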


Classification-Based Methods

• If labels for outlier data points exist, we can treat the problem as a classification problem
• Train a classifier to separate the two classes, "normal" and "outlier"
  • Usually heavily biased toward the normal class: an "unbalanced classification problem"
  • Cannot detect "unseen" outliers
  • Often unrealistic to assume we have labels for outlier points
• One-class classification: learn the boundary of the normal class; points outside the boundary are considered outliers
  • Can detect new outliers
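A deliberately simple stand-in for one-class classification (not a one-class SVM): learn a spherical boundary around the normal training data and flag anything outside it. The 95th-percentile radius is an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Training data contains only the "normal" class
normal_train = rng.normal(0.0, 1.0, size=(500, 2))

# Boundary: a sphere around the centroid whose radius covers 95%
# of the training points
center = normal_train.mean(axis=0)
radius = np.percentile(np.linalg.norm(normal_train - center, axis=1), 95)

def is_outlier(p):
    # Points outside the learned boundary are considered outliers
    return np.linalg.norm(p - center) > radius

flag_far = is_outlier(np.array([6.0, 6.0]))    # well outside the boundary
flag_near = is_outlier(np.array([0.1, -0.2]))  # well inside the boundary
```

Because only the normal class is modeled, a point of a previously unseen outlier type is still flagged, which is the "can detect new outliers" advantage above.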

Proximity-Based Methods

• If the nearby points are far away, consider the data point an outlier
• No assumptions about labels or about the model of the "normal" distribution
• No free lunch though: we rely on the robustness of the proximity measure
• Distance-based methods: an observation is an outlier if its neighborhood does not contain enough other observations
• Density-based methods: an observation is an outlier if its density is much lower than that of its neighbors
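The distance-based rule can be sketched directly: count the neighbours within a radius eps and flag points whose neighbourhood is too sparse (the eps and min_pts values are illustrative):

```python
import numpy as np

def distance_outliers(X, eps=1.0, min_pts=3):
    """Flag points with fewer than min_pts other points within eps."""
    # Pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Count neighbours within eps, excluding the point itself
    neighbours = (d <= eps).sum(axis=1) - 1
    return neighbours < min_pts

# Toy example: a tight cluster of four points plus one far-away point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.2, 0.1], [10.0, 10.0]])
flags = distance_outliers(X)
```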

Density-Based Outlier Detection

• Local Outlier Factor (LOF) algorithm (Breunig 2000)
• For each point i, compute its k nearest neighbors N(i)
• Compute the point density:

  f(i) = k / Σ_{j ∈ N(i)} d(i, j)

• Compute the local outlier score:

  LOF(i) = (1/k) Σ_{j ∈ N(i)} f(j) / f(i)

• As the name suggests, LOF is robust in detecting local outliers

Figure taken from: https://commons.wikimedia.org/wiki/File:LOF-idea.svg
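A direct transcription of the two formulas above; note this simplified density uses plain distances, not the reachability distances of the full Breunig et al. (2000) algorithm:

```python
import numpy as np

def lof_scores(X, k=3):
    """Simplified LOF: scores well above 1 indicate local outliers."""
    # Pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # k nearest neighbours of each point (column 0 is the point itself)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    # Density: f(i) = k / sum_{j in N(i)} d(i, j)
    f = k / np.take_along_axis(d, nn, axis=1).sum(axis=1)
    # Score: LOF(i) = (1/k) * sum_{j in N(i)} f(j) / f(i)
    return f[nn].mean(axis=1) / f

# Toy example: a dense cluster near the origin plus one distant point
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)), [[5.0, 5.0]]])
scores = lof_scores(X, k=3)
```

Cluster members have neighbours of similar density, so their scores sit near 1; the distant point's density is far below its neighbours', giving it a much larger score.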

Main Takeaways

• Explore and understand as much as possible the quality of your data
• Identify appropriate techniques that can mitigate some of the data quality issues
• Some of the same techniques you are learning in this course can be leveraged to improve the quality of the training data
• Know when you need more data
• Understand the biases introduced by data imputation techniques
• Approach data science as an iterative process: all the components are connected

Your analysis is only as good as your data.

References

• Rubin, D. B. (1976). Inference and missing data. Biometrika 63(3): 581-592.
• Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: J. Wiley & Sons.
• Little, R. J. and D. B. Rubin (2002). Statistical Analysis with Missing Data. Hoboken, NJ: John Wiley & Sons.
• Breunig, M. M., Kriegel, H.-P., Ng, R. T., Sander, J. (2000). LOF: Identifying Density-Based Local Outliers. In Proc. ACM SIGMOD 2000.