Machine Learning in the Wild - Clark Science Center
Machine Learning in the Wild
Dealing with Messy Data
Rajmonda S. Caceres
SDS293 - Smith College, October 30, 2017
Analytical Chain: From Data to Actions
What is this lecture about?
• What are some data quality issues in real applications?
• Why should we care about data quality?
• How can we leverage statistical techniques to improve the quality of real data?
Data Collection → Data Cleaning/Preparation → Analysis → Visual Representation → Decision Making
Data Quality Issues in Real Applications
• Missing data
• Outliers
• Noisy data
• Data mismatch
• Curse of dimensionality
• Data quality greatly affects the analysis and insights we draw from data
  • Introduces bias
  • Causes information loss
Effects of Missing Data on Regression
• Which regression model should we pick?
Figures are from Chris Bishop's book "Pattern Recognition and Machine Learning"
[Figure panels: sufficient data points to perform regression vs. insufficient data points to perform regression]
Effects of Outliers & Missing Data on PCA
• Poor-quality data greatly affects dimensionality reduction methods
[Figure panels: Case 1 vs. Case 2]
Missing Data
• How much missing data is too much?
• Rough guideline: if less than 5-10%, likely inconsequential; if greater, need to impute
[Table: samples 1-5 over features X1-X4 and response Y. Samples 1 and 2 are each missing a single value (feature-level missingness); samples 4 and 5 are missing every value (sample-level missingness). The fully missing samples form an unobserved subpopulation; the remaining samples form the observed subpopulation.]
Mechanisms of Missing Data (Rubin 1976)
• Missing Completely At Random (MCAR): missingness does not depend on observed or unobserved data
  • P(missing | complete data) = P(missing)
• Missing At Random (MAR): missingness depends only on observed data
  • P(missing | complete data) = P(missing | observed data)
• Missing Not At Random (MNAR): harder to address
  • P(missing | complete data) ≠ P(missing | observed data)
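The distinction between MCAR and MAR can be made concrete with a small simulation. The sketch below (variable names and the 30%/50%/10% missingness rates are illustrative choices, not from the slides) shows that dropping MCAR values leaves the observed mean unbiased, while MAR missingness driven by an observed covariate shifts it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)           # fully observed covariate
y = 2 * x + rng.normal(size=n)   # variable whose values will go missing

# MCAR: the chance of being missing ignores both x and y
mcar_mask = rng.random(n) < 0.3

# MAR: the chance of being missing depends only on the observed x
mar_mask = rng.random(n) < np.where(x > 0, 0.5, 0.1)

# (MNAR would make the mask depend on the unobserved y itself)

print(y.mean())              # full-data mean, near 0
print(y[~mcar_mask].mean())  # MCAR: observed mean stays close to the full mean
print(y[~mar_mask].mean())   # MAR: observed mean is biased downward here
```

Because the MAR mask preferentially removes rows with large x (and hence large y), a naive complete-case analysis underestimates the mean of y.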
Mean Imputation
• Assume the distribution of missing data is the same as that of observed data
• Replace with the mean, median, or another point estimate
• Advantages:
  • Works well if MCAR
  • Convenient, easy to implement
• Disadvantages:
  • Introduces bias by smoothing out the variance
  • Changes the magnitude of correlations between variables
[Figure: imputed data point]
Figure taken from: https://www.iriseekhout.com/missing-data/missing-data-methods/imputation-methods/
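Mean imputation is a few lines of numpy. A minimal sketch (the helper name `mean_impute` is mine, not from the slides); note how the filled column's variance shrinks, which is exactly the bias the slide warns about:

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN with the observed mean of its column."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)        # per-column mean, ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

X = np.array([[1.0,    2.0],
              [np.nan, 4.0],
              [3.0,    np.nan]])
X_imp = mean_impute(X)
# X_imp[1, 0] is 2.0 (mean of 1 and 3); X_imp[2, 1] is 3.0 (mean of 2 and 4)
```

Every imputed value sits exactly at the column mean, so the imputed column is more tightly concentrated than the underlying data really is.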
Regression Imputation
• Better imputation - leverage attribute relationships
• Monotone missing patterns
• Replace missing values with predicted scores from a regression equation: y_i = a x_i + b
• Stochastic regression: add an error term
• Advantage:
  • Uses information from observed data
• Disadvantage:
  • Overestimates model fit, weakens variance
[Figure: imputed data point]
Figure taken from: https://www.iriseekhout.com/missing-data/missing-data-methods/imputation-methods/
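A sketch of both variants, deterministic and stochastic, using the slide's y_i = a x_i + b form (the function name and the use of `np.polyfit` for the fit are my choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def regression_impute(x, y, stochastic=True):
    """Fill NaNs in y with a*x + b fit on the complete cases; optionally add
    residual noise (stochastic regression imputation)."""
    obs = ~np.isnan(y)
    a, b = np.polyfit(x[obs], y[obs], deg=1)   # least-squares line on observed rows
    y = y.copy()
    miss = ~obs
    y[miss] = a * x[miss] + b                  # deterministic predicted scores
    if stochastic:
        resid_sd = np.std(y[obs] - (a * x[obs] + b))
        y[miss] += rng.normal(scale=resid_sd, size=miss.sum())
    return y

x = np.arange(10.0)
y = 2 * x + 1
y[3] = np.nan
y_det = regression_impute(x, y, stochastic=False)  # fills position 3 with 2*3 + 1 = 7
```

The deterministic version places every imputed point exactly on the fitted line, which is why it overstates the model fit; adding the residual noise term restores some of the lost variance.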
k-Nearest Neighbor (kNN) Imputation
• Leverage similarities among different samples in the data
• Advantages:
  • Doesn't require a model to predict the missing values
  • Simple to implement
  • Can capture the variation in the data due to its locality
• Disadvantages:
  • Sensitive to how we define what "similar" means
[Figure: kNN imputation, k = 3]
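A minimal numpy sketch of the idea, assuming Euclidean distance on the commonly observed features and complete rows as the donor pool (both are my simplifying choices; the similarity definition is exactly the sensitive part the slide flags):

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill each missing cell with the mean of that feature over the k most
    similar fully observed rows."""
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]        # donor pool: rows with no NaNs
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        # distance to each donor, computed only on this row's observed features
        d = np.linalg.norm(complete[:, ~miss] - X[i, ~miss], axis=1)
        nearest = complete[np.argsort(d)[:k]]
        X[i, miss] = nearest[:, miss].mean(axis=0)
    return X

X = np.array([[1.0,  1.0],
              [2.0,  2.0],
              [3.0,  3.0],
              [10.0, 10.0],
              [2.0,  np.nan]])
X_imp = knn_impute(X, k=3)   # donors for the last row are rows 0-2, not the far row 3
```

Because the fill comes from nearby samples rather than a global mean, local structure in the data is preserved, matching the "captures variation due to locality" advantage above.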
Multiple Imputation
[Diagram: an incomplete data set is imputed m times (Set 1, ..., Set m); each completed set is analyzed separately (results R1, ..., Rm); the m results are pooled. Incomplete Data → Imputed Data → Analysis Results → Pooled Results]
• Treat the missing value as a random variable and impute it m times
• Minimize the bias introduced by data imputation through averaging
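The pipeline in the diagram can be sketched in a few lines. This toy version (my own simplification: each imputation draws missing values from the observed empirical distribution, rather than using a full posterior model as in Rubin 1987) shows the impute-m-times / analyze / pool structure:

```python
import numpy as np

rng = np.random.default_rng(2)

def stochastic_impute(y):
    """One completed data set: draw each missing value from the observed values."""
    y = y.copy()
    obs = y[~np.isnan(y)]
    miss = np.isnan(y)
    y[miss] = rng.choice(obs, size=miss.sum())
    return y

def multiple_impute(y, estimator, m=20):
    """Impute m times, run the analysis on each completed set, pool by averaging."""
    estimates = [estimator(stochastic_impute(y)) for _ in range(m)]
    return np.mean(estimates), np.var(estimates)   # pooled estimate, between-imputation variance

y = np.array([1.0, 2.0, 3.0, 4.0, np.nan, np.nan])
pooled, between_var = multiple_impute(y, np.mean)  # pooled mean lands near 2.5
```

Unlike single imputation, the spread of the m estimates (`between_var`) gives a direct handle on how much uncertainty the missing values contribute.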
Outliers
• Outlier: "an observation which deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism", Hawkins (1980)
• Outliers vs. noise
• Outlier detection vs. novelty detection
Outlier Detection Techniques
• By assumptions made about normal data and outliers:
  • Statistical
  • Proximity-based
  • Clustering-based methods
• By access to labels of outlier examples:
  • Supervised
  • Semi-supervised
  • Unsupervised
Types of Outliers
• Universal vs. contextual
  • O1, O2: local outliers (relative to cluster C1)
  • O3: global outlier
• Singular vs. collective
Statistical Methods
• Assume "normal" observations follow some statistical model
  • Example: assume the normal points come from a Gaussian distribution
• Learn the parameters of the model from the data
• Data points not following the model are considered outliers
  • Example 1: throw away points that fall at the tails
  • Example 2: throw away low-probability points
• Disadvantage: the method depends on the model assumption
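The Gaussian example above reduces to a z-score test. A minimal sketch (the function name and the conventional 3-sigma cutoff are my choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def gaussian_outliers(x, z_thresh=3.0):
    """Fit a Gaussian (mean, std) to the data and flag points whose z-score
    exceeds z_thresh, i.e. points that fall in the tails of the fitted model."""
    mu, sigma = x.mean(), x.std()
    return np.abs(x - mu) / sigma > z_thresh

x = np.concatenate([rng.normal(size=1000), [15.0]])  # 1000 inliers + one planted outlier
flags = gaussian_outliers(x)                          # the planted point is flagged
```

The model-assumption caveat bites immediately: if the normal data were heavy-tailed or multimodal rather than Gaussian, the same threshold would flag legitimate points or miss real outliers.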
Classification-Based Methods
• If labels of outlier data points exist, we can treat the problem as a classification problem
• Train a classifier to separate the two classes, "normal" and "outlier"
  • Usually heavily biased toward the normal class: an "unbalanced classification problem"
  • Cannot detect "unseen" outliers
  • Often unrealistic to assume we have labels for outlier points
• One-class classification: learn the boundary for the normal class; points outside the boundary are considered outliers
  • Can detect new outliers
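A toy illustration of the one-class idea (this spherical centroid-plus-radius boundary is my deliberately simple stand-in; practical one-class classifiers such as one-class SVMs learn far more flexible boundaries):

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_one_class(X_normal, quantile=0.95):
    """Learn a spherical boundary around the normal class: its centroid plus
    the radius that encloses `quantile` of the normal training points."""
    center = X_normal.mean(axis=0)
    radius = np.quantile(np.linalg.norm(X_normal - center, axis=1), quantile)
    return center, radius

def is_outlier(X, center, radius):
    """Points outside the learned boundary are flagged as outliers."""
    return np.linalg.norm(X - center, axis=1) > radius

X_train = rng.normal(size=(500, 2))            # training uses ONLY normal points
center, radius = fit_one_class(X_train)
flags = is_outlier(np.array([[0.1, 0.1],       # inside the boundary
                             [8.0, 8.0]]),     # far outside it
                   center, radius)
```

Because training never sees outlier labels, the boundary flags any new point that falls outside the normal region, which is why one-class methods can detect previously unseen outliers.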
Proximity-Based Methods
• If the nearby points are far away, consider the data point an outlier
• No assumption on labels or models of the "normal" distribution
• No free lunch, though: we rely on the robustness of the proximity measure
• Distance-based methods:
  • An observation is an outlier if its neighborhood does not have enough other observations
• Density-based methods:
  • An observation is an outlier if its density is relatively much lower than that of its neighbors
Density-Based Outlier Detection
• Local Outlier Factor (LOF) algorithm (Breunig 2000)
• For each point i, compute its k nearest neighbors N(i)
• Compute the point density:
  f(i) = k / Σ_{j ∈ N(i)} d(i, j)
• Compute the local outlier score:
  LOF(i) = (1/k) Σ_{j ∈ N(i)} f(j) / f(i)
• As the name suggests, LOF is robust in detecting local outliers
Figure taken from: https://commons.wikimedia.org/wiki/File:LOF-idea.svg
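The two formulas above translate directly into numpy. This sketch implements the slide's simplified scores, not the full reachability-distance version of Breunig et al. (2000):

```python
import numpy as np

rng = np.random.default_rng(5)

def lof_scores(X, k=3):
    """Simplified LOF per the slide: density f(i) = k / sum_{j in N(i)} d(i, j),
    score LOF(i) = (1/k) * sum_{j in N(i)} f(j) / f(i)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(D, np.inf)             # a point is not its own neighbor
    knn = np.argsort(D, axis=1)[:, :k]      # N(i): indices of the k nearest neighbors
    f = k / np.take_along_axis(D, knn, axis=1).sum(axis=1)      # local density
    return f[knn].mean(axis=1) / f          # scores well above 1 mark local outliers

X = np.vstack([rng.normal(scale=0.1, size=(20, 2)),  # one tight cluster
               [[5.0, 5.0]]])                         # one far-away point
scores = lof_scores(X)   # the far point's score dwarfs the cluster points' scores
```

Points inside the cluster have neighbors of similar density, so their scores hover near 1; the isolated point's density is far below its neighbors', giving it a large score.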
Main Takeaways
• Explore and understand as much as possible the quality of your data
• Identify appropriate techniques that can mitigate some of the data quality issues
• Some of the same techniques you are learning in this course can be leveraged to improve the quality of the training data
• Know when you need more data
• Understand the biases introduced by data imputation techniques
• Approach data science as an iterative process - all the components are connected
Your analysis is only as good as your data.
References
• Rubin, D. B. (1976). Inference and missing data. Biometrika 63(3): 581-592.
• Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: J. Wiley & Sons.
• Little, R. J. and D. B. Rubin (2002). Statistical Analysis with Missing Data. Hoboken, NJ: John Wiley & Sons.
• Breunig, M. M., Kriegel, H.-P., Ng, R. T., Sander, J. (2000). LOF: Identifying Density-Based Local Outliers. In Proc. ACM SIGMOD 2000.
• Hawkins, D. M. (1980). Identification of Outliers. London: Chapman and Hall.