
Machine Learning in the Wild

Dealing with Messy Data

Rajmonda S. Caceres

SDS293 – Smith College, October 30, 2017

Analytical Chain: From Data to Actions

What this lecture is about

• What are some of the data quality issues in real applications?
• Why should we care about data quality?
• How can we leverage statistical techniques to improve the quality of real data?

Data Collection → Data Cleaning/Preparation → Analysis → Visual Representation → Decision Making

Data Quality Issues in Real Applications

• Missing data
• Outliers
• Noisy data
• Data mismatch
• Curse of dimensionality

• Data quality greatly affects the analysis and the insights we draw from the data:
  • Introduces bias
  • Causes information loss

Effects of Missing Data on Regression

• Which regression model should we pick?

Figures are from Chris Bishop's book "Pattern Recognition and Machine Learning"

Figure panels: sufficient data points to perform regression vs. insufficient data points to perform regression

Effects of Outliers & Missing Data on PCA

• Poor quality data greatly affects dimensionality reduction methods


Missing Data

• How much missing data is too much?
• Rough guideline: if less than 5-10% of the data is missing, it is likely inconsequential; if more, you need to impute

Sample   X1   X2   X3   X4   Y
1                        ?
2                             ?
3
4        ?    ?    ?    ?    ?
5        ?    ?    ?    ?    ?

Samples 1–3 form the observed subpopulation (missingness at the feature level); samples 4–5 form the unobserved subpopulation (missingness at the sample level).
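Missingness at the feature and sample level can be quantified directly. A minimal sketch in pandas, using hypothetical numbers that mirror the pattern in the table (the exact columns holding each "?" are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: X4 and Y each have a missing value in the
# observed rows, and samples 4 and 5 are entirely unobserved.
df = pd.DataFrame({
    "X1": [1.0, 2.0, 3.0, np.nan, np.nan],
    "X2": [0.5, 1.5, 2.5, np.nan, np.nan],
    "X3": [4.0, 5.0, 6.0, np.nan, np.nan],
    "X4": [np.nan, 7.0, 8.0, np.nan, np.nan],
    "Y":  [0.0, np.nan, 1.0, np.nan, np.nan],
})

# Feature-level missingness: fraction of missing values per column
feature_missing = df.isna().mean()

# Sample-level missingness: fraction of missing values per row
sample_missing = df.isna().mean(axis=1)

# Flag columns above the rough 5-10% guideline (10% threshold here)
needs_imputation = feature_missing[feature_missing > 0.10]
```

Per the guideline above, any column in `needs_imputation` is a candidate for one of the imputation methods that follow.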

Mechanisms of Missing Data (Rubin 1976)

• Missing Completely At Random (MCAR): the missingness does not depend on observed or unobserved data
  • P(missing | complete data) = P(missing)
• Missing At Random (MAR): the missingness depends only on observed data
  • P(missing | complete data) = P(missing | observed data)
• Missing Not At Random (MNAR): the hardest mechanism to address
  • P(missing | complete data) ≠ P(missing | observed data)

Mean Imputation

1. Assume the distribution of the missing data is the same as that of the observed data
2. Replace missing values with the mean, median, or another point estimate

• Advantages:
  • Works well if the data is MCAR
  • Convenient, easy to implement
• Disadvantages:
  • Introduces bias by smoothing out the variance
  • Changes the magnitude of correlations between variables

Figure (imputed data points) taken from: https://www.iriseekhout.com/missing-data/missing-data-methods/imputation-methods/
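A minimal NumPy sketch of mean imputation on a toy feature (hypothetical values), which also makes the main disadvantage visible: the imputed series is less spread out than the observed values alone:

```python
import numpy as np

# Hypothetical feature with two missing entries
x = np.array([2.0, 4.0, np.nan, 6.0, np.nan, 8.0])

observed = x[~np.isnan(x)]

# Replace every missing entry with the mean of the observed values
x_imputed = np.where(np.isnan(x), observed.mean(), x)

# Smoothing effect: the variance after imputation is lower than the
# variance among the observed values, since imputed points sit exactly
# at the mean.
```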

Regression Imputation

• Better imputation: leverage relationships among attributes
• Suited to monotone missing patterns
• Replace missing values with predicted scores from a regression equation
• Stochastic regression: add an error term to the predicted score
• Advantage:
  • Uses information from the observed data
• Disadvantage:
  • Overestimates the model fit and understates the variance

y_i = a·x_i + b

Figure (imputed data points) taken from: https://www.iriseekhout.com/missing-data/missing-data-methods/imputation-methods/
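The regression equation can be turned into both deterministic and stochastic imputation. A sketch with hypothetical (x, y) pairs, where y is partly missing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: x fully observed, y missing for some samples
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, np.nan, 8.2, np.nan, 12.1])

obs = ~np.isnan(y)

# Fit y = a*x + b on the observed pairs
a, b = np.polyfit(x[obs], y[obs], 1)

# Deterministic regression imputation: fill with the predicted score
y_det = np.where(obs, y, a * x + b)

# Stochastic regression imputation: add residual-scale noise so the
# variance of the imputed variable is not understated
resid_std = np.std(y[obs] - (a * x[obs] + b))
noise = rng.normal(0.0, resid_std, size=x.shape)
y_stoch = np.where(obs, y, a * x + b + noise)
```

The deterministic fill places every imputed point exactly on the regression line, which is what inflates the apparent model fit; the stochastic variant counteracts that.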

k-Nearest Neighbor (kNN) Imputation

• Leverage similarities among different samples in the data
• Advantages:
  • Doesn't require a model to predict the missing values
  • Simple to implement
  • Can capture the variation in the data due to its locality
• Disadvantages:
  • Sensitive to how we define what "similar" means

Figure: kNN imputation with k = 3 (the "?" marks the value to be imputed)
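A self-contained sketch of kNN imputation (the helper `knn_impute` and the data are ours, not from the slides): distances are computed on the target row's observed features, and the missing entry is filled with the mean of the k nearest fully observed rows:

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill missing entries with the mean of the k nearest complete rows.

    A simplification: only fully observed rows can serve as neighbours,
    and distances use just the coordinates observed in the target row.
    """
    X = X.astype(float).copy()
    complete = ~np.isnan(X).any(axis=1)
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])
        # Euclidean distance on the observed coordinates only
        d = np.linalg.norm(X[complete][:, obs] - X[i, obs], axis=1)
        nearest = X[complete][np.argsort(d)[:k]]
        X[i, ~obs] = nearest[:, ~obs].mean(axis=0)
    return X

# Toy example: the last row is missing its second feature; its three
# nearest neighbours on the first feature are the first three rows.
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 0.8], [5.0, 5.0], [1.0, np.nan]])
X_filled = knn_impute(X, k=3)
```

The sensitivity noted above shows up directly here: swapping the Euclidean distance for another metric can change which rows count as neighbours, and hence the imputed value.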

Multiple Imputation

Incomplete data → m imputed data sets (Set 1, Set 2, …, Set m) → m analysis results (R1, R2, …, Rm) → pooled results

• Treat the missing value as a random variable and impute it m times
• Minimize the bias introduced by data imputation through averaging
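A bare-bones sketch of that pipeline: impute m times (treating each missing value as a draw from the observed distribution), run the analysis on each completed set, then pool by averaging. The data and the "analysis" step (a sample mean) are placeholders:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical feature with missing values
x = np.array([3.0, np.nan, 5.0, 7.0, np.nan, 9.0])
obs = x[~np.isnan(x)]

m = 20
estimates = []
for _ in range(m):
    # Imputation step: draw each missing entry from a normal fitted
    # to the observed values (treating it as a random variable)
    draws = rng.normal(obs.mean(), obs.std(ddof=1), size=np.isnan(x).sum())
    x_i = x.copy()
    x_i[np.isnan(x_i)] = draws
    # Analysis step: here, just the sample mean of the completed data
    estimates.append(x_i.mean())

# Pooling step: average the m analysis results
pooled = np.mean(estimates)
```

Averaging over the m completed data sets is what keeps any single imputation's randomness from dominating the final estimate.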

Outliers

• Outlier: "an observation which deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism", Hawkins (1980)
• Outliers vs. noise
• Outlier detection vs. novelty detection

Outlier Detection Techniques

• Grouped by the assumptions made about normal data and outliers:
  • Statistical
  • Proximity-based
  • Clustering-based methods
• Grouped by access to labeled outlier examples:
  • Supervised
  • Semi-supervised
  • Unsupervised

Types of Outliers

• Universal vs. contextual
  • O1, O2: local outliers (relative to cluster C1)
  • O3: global outlier
• Singular vs. collective

Statistical Methods

• Assume "normal" observations follow some statistical model
  • Example: assume the normal points come from a Gaussian distribution
• Learn the parameters of the model from the data
• Data points not following the model are considered outliers
  • Example 1: throw away points that fall at the tails
  • Example 2: throw away low-probability points
• Disadvantage: the method depends on the model assumption
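The Gaussian example can be sketched with a z-score rule (the 3-standard-deviation threshold is a common convention, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "normal" data with one injected outlier at 8.0
data = np.concatenate([rng.normal(0.0, 1.0, 200), [8.0]])

# Learn the model parameters from the data
mu, sigma = data.mean(), data.std()

# Points far in the tails (|z| > 3) are flagged as outliers
z = np.abs(data - mu) / sigma
outliers = data[z > 3]
```

The disadvantage noted above applies here too: if the normal points were not Gaussian (say, heavy-tailed), this threshold would flag many legitimate observations.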


Classification-Based Methods

• If labels for outlier data points exist, we can treat the problem as a classification problem
• Train a classifier to separate the two classes, "normal" and "outlier"
  • Usually heavily biased toward the normal class: an "unbalanced classification problem"
  • Cannot detect "unseen" outliers
  • Often unrealistic to assume we have labels for outlier points
• One-class classification: learn the boundary of the normal class; points outside the boundary are considered outliers
  • Can detect new outliers
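A deliberately simple stand-in for one-class classification (not a one-class SVM): learn a spherical boundary around the normal training data and flag anything outside it. The 95th-percentile radius is an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Training data contains only the "normal" class
normal_train = rng.normal(0.0, 1.0, size=(500, 2))

# Boundary: a sphere around the centroid whose radius covers 95%
# of the training points
center = normal_train.mean(axis=0)
radius = np.percentile(np.linalg.norm(normal_train - center, axis=1), 95)

def is_outlier(p):
    # Points outside the learned boundary are considered outliers
    return np.linalg.norm(p - center) > radius

flag_far = is_outlier(np.array([6.0, 6.0]))    # well outside the boundary
flag_near = is_outlier(np.array([0.1, -0.2]))  # well inside the boundary
```

Because only the normal class is modeled, a point of a previously unseen outlier type is still flagged, which is the "can detect new outliers" advantage above.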

Proximity-Based Methods

• If the nearby points are far away, consider the data point an outlier
• No assumptions about labels or about the model of the "normal" distribution
• No free lunch though: we rely on the robustness of the proximity measure
• Distance-based methods: an observation is an outlier if its neighborhood does not contain enough other observations
• Density-based methods: an observation is an outlier if its density is much lower than that of its neighbors
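The distance-based rule can be sketched directly: count the neighbours within a radius eps and flag points whose neighbourhood is too sparse (the eps and min_pts values are illustrative):

```python
import numpy as np

def distance_outliers(X, eps=1.0, min_pts=3):
    """Flag points with fewer than min_pts other points within eps."""
    # Pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Count neighbours within eps, excluding the point itself
    neighbours = (d <= eps).sum(axis=1) - 1
    return neighbours < min_pts

# Toy example: a tight cluster of four points plus one far-away point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.2, 0.1], [10.0, 10.0]])
flags = distance_outliers(X)
```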

Density-Based Outlier Detection

• Local Outlier Factor (LOF) algorithm (Breunig 2000)
• For each point i, compute its k nearest neighbors N(i)
• Compute the point density:

  f(i) = k / Σ_{j ∈ N(i)} d(i, j)

• Compute the local outlier score:

  LOF(i) = (1/k) Σ_{j ∈ N(i)} f(j) / f(i)

• As the name suggests, LOF is robust in detecting local outliers

Figure taken from: https://commons.wikimedia.org/wiki/File:LOF-idea.svg
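A direct transcription of the two formulas above; note this simplified density uses plain distances, not the reachability distances of the full Breunig et al. (2000) algorithm:

```python
import numpy as np

def lof_scores(X, k=3):
    """Simplified LOF: scores well above 1 indicate local outliers."""
    # Pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # k nearest neighbours of each point (column 0 is the point itself)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    # Density: f(i) = k / sum_{j in N(i)} d(i, j)
    f = k / np.take_along_axis(d, nn, axis=1).sum(axis=1)
    # Score: LOF(i) = (1/k) * sum_{j in N(i)} f(j) / f(i)
    return f[nn].mean(axis=1) / f

# Toy example: a dense cluster near the origin plus one distant point
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)), [[5.0, 5.0]]])
scores = lof_scores(X, k=3)
```

Cluster members have neighbours of similar density, so their scores sit near 1; the distant point's density is far below its neighbours', giving it a much larger score.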

Main Takeaways

• Explore and understand as much as possible the quality of your data
• Identify appropriate techniques that can mitigate some of the data quality issues
• Some of the same techniques you are learning in this course can be leveraged to improve the quality of the training data
• Know when you need more data
• Understand the biases introduced by data imputation techniques
• Approach data science as an iterative process: all the components are connected

Your analysis is only as good as your data.

References

• Rubin, D. B. (1976). Inference and missing data. Biometrika 63(3): 581-592.
• Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: J. Wiley & Sons.
• Little, R. J. and D. B. Rubin (2002). Statistical Analysis with Missing Data. Hoboken, NJ: John Wiley & Sons.
• Breunig, M. M., Kriegel, H.-P., Ng, R. T., Sander, J. (2000). LOF: Identifying Density-Based Local Outliers. In Proc. ACM SIGMOD 2000.