exploratory data analysis - grape.ics.uci.edu · Stats 170A: Project in Data Science Data...
Stats 170A: Project in Data Science
Data Visualization and Exploratory Data Analysis
Padhraic Smyth, Department of Computer Science, Bren School of Information and Computer Sciences, University of California, Irvine
Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018

Overview
• Lectures/homeworks up to this point
  – Data management (relational DBs, query languages, PostgreSQL)
  – Data manipulation in Python (Pandas)
  – Data formats (JSON, XML)
  – Practical experience with Twitter data, IMDB data
• Next 2 weeks
  – Review of data visualization and exploration
  – Basic principles of machine learning (and some statistics)
  – Machine learning with text data

How this Course will Work
• Q1: Weeks 1 to 6: Lectures and Assignments
  – Review general principles of data science
  – Weeks 1 to 3: databases, data extraction, data cleaning
  – Weeks 4 to 6: text analysis, data exploration, machine learning
  – Combination of lectures, assignments, and background reading
• Q1: Weeks 7 to 10: Project Proposals
  – Project proposals from student teams
  – Feedback from instructors, refine proposal, oral presentation at end of quarter
• Q2: Work on Projects
  – Build and use a prototype system/pipeline
  – Develop ideas, implement algorithms, make use of libraries and packages
  – Conduct experiments with real data sets
  – Test and evaluate your system in a systematic manner
  – Communicate your results (presentations and reports)

Assignment 5
Refer to the Wiki page
Due noon on Monday February 12th to EEE dropbox
Note change: due before class (by 2pm)

Types of Data

Types of Data for a Single Variable
• Real-valued, continuous
  – e.g., a person’s weight or income
  – values may be discretized and bounded, but we will think of them as lying on the real line
• Integer
  – e.g., year of birth, number of years in college
  – Could be a real-valued variable that is quantized (age in years)
• Ordinal
  – e.g., education level = {kindergarten, high school, college, grad school, …}
• Categorical
  – e.g., {red, blue, yellow} or {CA, MA, NY, AZ, …} or text strings
(Note that many visualization and machine learning techniques implicitly assume real-valued data, and other data types are converted to reals or rep…)

Multiple Variables
• More than 1 variable, often referred to as multivariate or multidimensional
• Often interested in relationships between variables and in the geometric structure of the data (for real-valued data), e.g., is it clustered?
• For small numbers of variables, we can plot the data and look at relationships
• For large numbers we use exploratory techniques
  – e.g., clustering and dimension reduction
• Note that many visualization and machine learning techniques implicitly assume real-valued data; categorical data types are often converted to reals (e.g., binary) or represented via grouping, colors, or icons

Data with Context
• Time-series data
  – A variable whose values are indexed by time
  – We can also have multidimensional time-series
• Sequence data
  – A variable indexed by position
  – e.g., words (categorical) in text, or DNA sequences
• Spatial data
  – Data whose values are indexed spatially, e.g., by lat/lon or by city
  – Can also have multidimensional spatial data
• Spatio-temporal
  – Indexed by both space and time, e.g., storm tracks, vehicle trajectories, etc.

Stock Market Indices last week

Night Lights from North and South Korea
From https://www.vox.com

Where People Run
From: https://flowingdata.com/2014/02/05/where-people-run/#jp-carousel-33695

Relational Data
• N entities, i = 1, …, N
• N x N relations:
  – can be represented as an array y(i,j) = 1 if i is connected to j, 0 otherwise
  – Example: a social network
• Can combine with other data, e.g.,
  – Each relation could have metadata, e.g., text
  – Each relation could be time-dependent, y(i,j,t) is a time series over time t

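The array representation y(i,j) of relational data can be sketched in a few lines of Python. This is only an illustration: the entity names and edges are hypothetical, and the relation is assumed to be symmetric (an undirected network), which the slide does not require.

```python
# Hypothetical entities and relations, invented purely for illustration.
entities = ["ann", "bob", "cat", "dan"]
edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("ann", "cat")]

index = {name: i for i, name in enumerate(entities)}
n = len(entities)

# y[i][j] = 1 if i is connected to j, 0 otherwise.
y = [[0] * n for _ in range(n)]
for a, b in edges:
    i, j = index[a], index[b]
    y[i][j] = 1
    y[j][i] = 1  # assumption: the relation is symmetric (undirected)

# Each row sum is the number of connections (the degree) of that entity.
degree = {name: sum(y[index[name]]) for name in entities}
print(degree)
```

A time-dependent relation y(i,j,t) could be stored the same way with one such array per time step.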
Visualization of an email network using 2-dimensional graph drawing or “embedding”
Data from 500 researchers at Hewlett-Packard over approximately 1 year.
Various structural elements of the network are apparent

Philosophy behind this Class
• Provide an experience of how data science works in the real world
  – Defining a problem
  – Identifying, understanding, exploring relevant data
  – Extracting, cleaning, management of data
  – Exploration and analysis of data
  – Building models from data (e.g., via machine learning)
  – Evaluating models: how well do they predict
  – Communicating your results to others
• Tie together ideas from different courses you have taken and give you experience in applying these ideas to real-world data
  – Databases, software, algorithms, machine learning, statistics

Data Science: from Data to Actions
[Figure: diagram with the labels Raw Data, Data Management, Data Wrangling, Exploratory Data Analysis, Predictive Modeling; users: Consumers, External Business Customers, Internal Business Customers, Scientists, Government; supporting areas: Databases, Algorithms, Software Engineering; Machine Learning, Statistics; Domain knowledge, Business knowledge]

Why Visualization and Exploration?
• People are good at pattern recognition
  – At spotting clusters, trends, outliers, structure… that computers may miss
• Usually two types of users
  1. The data scientist who wants to explore/analyze/understand
     - For the data scientist, visualization and exploration are part of an iterative process
  2. The person who needs a quick summary to make a decision
     - For the consumer we want to communicate information quickly and clearly
     - e.g., for a medical doctor, for a policy-maker, for a consumer
• For data scientists… it’s always a good idea to look at your data
  – Helps to understand the semantics of the data… what the measurements actually mean

What is Exploratory Data Analysis?
• Broader than just visualization
• EDA = {visualization, clustering, dimension reduction, …}
• For small numbers of variables, EDA = visualization
• For large numbers of variables, we need to be cleverer
  – Clustering, dimension reduction, embedding algorithms
  – These are techniques that essentially reduce high-dimensional data to something we can look at
• Pioneered by John Tukey (statistician at Bell Labs, Princeton) in the 1960s
  – “let the data speak”

Exploratory Data Analysis: Single Variables

Summary Statistics
Mean: “center of data”
Mode: location of highest data density
Variance: “spread of data”
Skew: indication of non-symmetry
Range: max - min
Median: 50% of values below, 50% above
Quantiles: e.g., values such that 25%, 50%, 75% are smaller
Note that some of these statistics can be misleading
E.g., mean for data with 2 clusters may be in a region with zero data

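The warning about the mean of two-cluster data can be checked with a small simulation. The sketch below uses only the Python standard library; the cluster centers (0 and 10), the sample sizes, and the seed are assumptions chosen for illustration, not values from the slides.

```python
import random
import statistics

random.seed(0)

# Assumption: two well-separated clusters, centered at 0 and at 10.
data = ([random.gauss(0, 1) for _ in range(500)]
        + [random.gauss(10, 1) for _ in range(500)])

mean = statistics.mean(data)
median = statistics.median(data)
variance = statistics.pvariance(data)
data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # 25%, 50%, 75% quantiles

# Count how many data points lie within 1 unit of the mean.
near_mean = sum(1 for v in data if abs(v - mean) < 1.0)

print(f"mean={mean:.2f} median={median:.2f} variance={variance:.2f}")
print(f"range={data_range:.2f} quartiles=({q1:.2f}, {q2:.2f}, {q3:.2f})")
print(f"points within 1 unit of the mean: {near_mean} of {len(data)}")
```

With clusters this far apart, the mean falls between them, where very few (possibly zero) data points lie.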
Histogram of Unimodal Data
[Figure: histogram of 1000 data points simulated from a Normal distribution, mean 10, variance 1, 30 bins]

Histograms: Unimodal Data
[Figure: three histograms of 100 data points from a Normal, mean 10, variance 1, with 5, 10, 30 bins]

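The effect of the number of bins can be reproduced without any plotting library. The sketch below is a minimal, hand-written equal-width histogram (the helper name `histogram` and the seed are assumptions for illustration); in practice a library routine such as a NumPy or Matplotlib histogram function would normally be used instead.

```python
import random

random.seed(1)
data = [random.gauss(10, 1) for _ in range(100)]  # mean 10, variance 1

def histogram(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value is assigned to the last bin.
        k = min(int((v - lo) / width), num_bins - 1)
        counts[k] += 1
    return counts

for num_bins in (5, 10, 30):
    print(num_bins, "bins:", histogram(data, num_bins))
```

With only 100 points, 5 bins give a coarse but smooth shape, while 30 bins leave few points per bin and the counts look noisy.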
Histogram of Multimodal Data
[Figure: histogram of 15000 data points simulated from a mixture of 3 Normal distributions, 300 bins]

Histogram of Multimodal Data
[Figure: four histogram panels of the 15000 data points simulated from a mixture of 3 Normal distributions; caption: 300 bins]

Skewed Data
[Figure: histogram of 5000 data points simulated from an exponential distribution, 100 bins]

Another Skewed Data Set
[Figure: histogram of 10000 data points simulated from a mixture of 2 exponentials, 100 bins]

Same Skewed Data after taking Logs (base 10)
[Figure: histogram of the base-10 logs of the 10000 data points simulated from a mixture of 2 exponentials, 100 bins]

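The effect of taking logs on skewed data can be sketched numerically. The slides do not give the parameters of the mixture, so the two exponential means (1 and 50), the equal mixing weights, and the seed below are assumptions for illustration only.

```python
import math
import random
import statistics

random.seed(2)

# Assumption: an equal mixture of two exponentials with means 1 and 50.
raw = [random.expovariate(1.0) if random.random() < 0.5
       else random.expovariate(1.0 / 50.0)
       for _ in range(10000)]
logged = [math.log10(v) for v in raw]

def skewness(values):
    """Third standardized moment: near 0 if symmetric, > 0 for a long right tail."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

for name, values in (("raw", raw), ("log10", logged)):
    print(f"{name:6s} mean={statistics.mean(values):8.2f} "
          f"median={statistics.median(values):8.2f} "
          f"skew={skewness(values):6.2f}")
```

On the raw scale the mean is pulled far above the median by the long right tail; after the log transform the distribution is much closer to symmetric.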
What will the mean or median tell us about this data?
[Figure: histogram; x-axis from 9 to 16]

Histogram with Outliers
[Figure: histogram; x-axis: X values; y-axis: Number of Individuals. Pima Indians Diabetes Data, from the UC Irvine Machine Learning Repository]

Histogram with Outliers
[Figure: histogram; x-axis: Diastolic Blood Pressure; y-axis: Number of Individuals; annotation: “blood pressure = 0?”. Pima Indians Diabetes Data, from the UC Irvine Machine Learning Repository]

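A blood pressure of 0 suggests that 0 is being used as a code for "not recorded". The sketch below shows how such values distort a summary statistic. The readings are hypothetical numbers made up for illustration (they are not taken from the Pima Indians data), and treating 0 as missing is an assumption.

```python
import statistics

# Hypothetical diastolic blood pressure readings (mm Hg), made up for illustration.
readings = [72, 66, 0, 64, 80, 0, 74, 70, 90, 68]

# Assumption: a reading of 0 is physically implausible and is treated as missing.
valid = [r for r in readings if r > 0]
num_missing = len(readings) - len(valid)

mean_with_zeros = statistics.mean(readings)
mean_without_zeros = statistics.mean(valid)

print("mean with zeros:   ", mean_with_zeros)
print("mean without zeros:", mean_without_zeros)
print("missing values:    ", num_missing)
```

A histogram makes the spike at 0 obvious, whereas the mean alone would only look somewhat low.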
Box Plots: Diabetes Data
[Figure: two side-by-side box plots of Body Mass Index for healthy individuals and diabetic individuals from the Pima Indians Diabetes Data Set]
Note: significant overplotting here that could easily be missed

Box Plots: Diabetes Data
[Figure: two side-by-side box plots of Body Mass Index for healthy individuals and diabetic individuals from the Diabetes Data Set, with the annotations below]
• Box = middle 50% of data, from Q1 to Q3, with Q2 (median) marked inside
• Upper whisker and lower whisker: 1.5 x (Q3 - Q1)
• Plots all data points outside the “whiskers”

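The quantities drawn in a box plot can be computed directly. The sketch below uses hypothetical body-mass-index values made up for illustration, and it assumes one common convention: each whisker extends to the most extreme data point within 1.5 x (Q3 - Q1) of the box. Plotting libraries may compute quartiles slightly differently.

```python
import statistics

# Hypothetical body-mass-index values, made up for illustration.
bmi = [21.0, 23.5, 24.1, 25.0, 26.3, 27.8, 28.4, 30.1, 31.6, 33.0, 52.0]

q1, q2, q3 = statistics.quantiles(bmi, n=4, method="inclusive")
iqr = q3 - q1  # height of the box: the middle 50% of the data

# Assumed convention: whiskers reach the most extreme points within 1.5 x IQR.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
lower_whisker = min(v for v in bmi if v >= lower_limit)
upper_whisker = max(v for v in bmi if v <= upper_limit)

# Points beyond the whiskers are plotted individually.
outside = [v for v in bmi if v < lower_limit or v > upper_limit]

print(f"Q1={q1:.2f} Q2={q2:.2f} Q3={q3:.2f} IQR={iqr:.2f}")
print(f"whiskers: {lower_whisker} to {upper_whisker}; outside: {outside}")
```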
Multiple Box Plots: Diabetes Data
[Figure: side-by-side box plots for healthy and diabetic individuals; panels: Diastolic Blood Pressure, 24-hour Serum Insulin, Plasma Glucose Concentration, Body Mass Index]

Horizontal Box Plot for Planet Data
From: https://seaborn.pydata.org/examples/horizontal_boxplot.html

Exploring Pairs of Variables

Relationships between Pairs of Variables
• Say we have a variable Y we want to predict and many variables X that we could use to predict Y
• In exploratory data analysis we may be interested in quickly finding out if a particular X variable is potentially useful at predicting Y
• Options?
  – Linear correlation
  – Scatter plot: plot Y values versus X values

Linear Dependence between Pairs of Variables
• Covariance and correlation measure linear dependence
• Assume we have two variables or attributes X and Y and n objects taking values x(1), …, x(n) and y(1), …, y(n). The sample covariance of X and Y is:

$$\mathrm{Cov}(X,Y) = \frac{1}{n} \sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)\bigl(y(i) - \bar{y}\bigr)$$

  where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y.
• The covariance is a measure of how X and Y vary together.
  – large and positive if large values of X are associated with large values of Y and small X ⇒ small Y
• (Pearson Linear) Correlation = scaled covariance, varies between -1 and 1

$$\rho(X,Y) = \frac{\sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)\bigl(y(i) - \bar{y}\bigr)}{\left( \sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)^{2} \, \sum_{i=1}^{n} \bigl(y(i) - \bar{y}\bigr)^{2} \right)^{1/2}}$$

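The sample covariance and Pearson correlation can be written out directly from their definitions. This is a minimal sketch using plain Python lists and the 1/n normalization; the data values are hypothetical, and library routines (for example in NumPy or Pandas) may use a 1/(n-1) normalization for the covariance instead.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance with 1/n normalization."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)

def correlation(x, y):
    """Pearson correlation: covariance scaled to lie between -1 and 1."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

print(f"Cov(X, Y) = {covariance(x, y):.3f}")
print(f"rho(X, Y) = {correlation(x, y):.3f}")
```

Note that the 1/n factors cancel in the correlation, so it does not depend on which normalization is used for the covariance.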
Data Set on Housing Prices in Boston
 1  CRIM     per capita crime rate by town
 2  ZN       proportion of residential land zoned for lots over 25,000 sq. ft.
 3  INDUS    proportion of non-retail business acres per town
 4  NOX      nitrogen oxide concentration (parts per 10 million)
 5  RM       average number of rooms per dwelling
 6  AGE      proportion of owner-occupied units built prior to 1940
 7  DIS      weighted distances to five Boston employment centres
 8  RAD      index of accessibility to radial highways
 9  TAX      full-value property-tax rate per $10,000
10  PTRATIO  pupil-teacher ratio by town
11  MEDV     median value of owner-occupied homes in $1000's
(widely used data set in research on regression (prediction))

Matrix of Pairwise Linear Correlations
[Figure: matrix of pairwise linear correlations for data on characteristics of Boston housing; color scale from -1 through 0 to +1. Row and column labels: Crime Rate, Percentage of large residential lots, Industry, Nitrous oxide, Average # rooms, Proportion of old houses, Distance to employment centers, Highway accessibility, Property tax rate, Student-teacher ratio, Median house value]

Examples of X-Y plots and linear correlation values
[Figures: two slides of X-Y plots with their linear correlation values]

[Figure: two panels, Linear Dependence and Non-Linear Dependence]
Lack of linear correlation does not imply lack of dependence

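A quadratic relationship is a simple way to see that zero linear correlation does not mean independence. In the sketch below (the grid of x values and the two functions are assumptions chosen for illustration), Y is completely determined by X in both cases, yet the correlation is essentially zero in the quadratic case.

```python
import math

def correlation(x, y):
    """Pearson linear correlation between two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [i / 10 for i in range(-10, 11)]  # symmetric grid from -1 to 1
y_linear = [2 * a + 1 for a in x]     # linear dependence
y_quadratic = [a ** 2 for a in x]     # non-linear dependence

r_linear = correlation(x, y_linear)
r_quadratic = correlation(x, y_quadratic)

print(f"linear:    r = {r_linear:.3f}")
print(f"quadratic: r = {r_quadratic:.3f}")
```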
Summary Statistics for Anscombe’s 4 Data Sets
Data Set   N    Mean of X   Mean of Y
1          11   9.0         7.5
2          11   9.0         7.5
3          11   9.0         7.5
4          11   9.0         7.5
4 data sets, each with 2 variables X and Y, with the same summary statistics (imagine that Python reports these summaries and we have not yet looked at the data)
Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, pp. 195-199.

Anscombe’s 4 Data Sets
[Figure: scatter plots of the 4 data sets]
Guess the Linear Correlation Values for each Data Set

Summary Statistics for each Data Set
Data Set   N    Mean of X   Mean of Y   Intercept   Slope   Correlation
1          11   9.0         7.5         3           0.5     0.82
2          11   9.0         7.5         3           0.5     0.82
3          11   9.0         7.5         3           0.5     0.82
4          11   9.0         7.5         3           0.5     0.82
Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, pp. 195-199.
Lesson: summary statistics can be misleading

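These summaries can be recomputed from the raw numbers. The sketch below uses the Anscombe quartet values as they are commonly reproduced from the 1973 paper, and fits the line by ordinary least squares (assumed here to be how the intercept and slope on the slide were obtained).

```python
import math

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Means, least-squares line, and Pearson correlation for one data set."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    slope = sxy / sxx
    return {
        "n": n,
        "mean_x": x_bar,
        "mean_y": y_bar,
        "intercept": y_bar - slope * x_bar,
        "slope": slope,
        "correlation": sxy / math.sqrt(sxx * syy),
    }

summary = {k: summarize(x, y) for k, (x, y) in datasets.items()}
for k, s in summary.items():
    print(k, {name: round(value, 2) for name, value in s.items()})
```

The four rows of output agree to two decimal places even though scatter plots of the four data sets look completely different.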
Dangers of searching for correlations in high-dimensional data
Simulated 50 random Gaussian/normal data vectors, each with 100 variables
Results in a 50 x 100 data matrix
[Figure: histogram of the correlation coefficients for the “100 choose 2” = 4950 pairs of variables]
Even if data are entirely random (no dependence) there is a very high probability some variables will appear dependent just by chance.

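This experiment is easy to repeat. The sketch below simulates a 50 x 100 matrix of independent standard normal values with the standard library and computes all 4950 pairwise correlations; the seed and the 0.3 threshold are arbitrary choices for illustration.

```python
import itertools
import math
import random

random.seed(4)
n_rows, n_vars = 50, 100

# columns[j] holds the 50 simulated values of variable j; all are independent.
columns = [[random.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_vars)]

def correlation(x, y):
    """Pearson linear correlation between two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# One correlation for each of the "100 choose 2" pairs of variables.
corrs = [correlation(columns[a], columns[b])
         for a, b in itertools.combinations(range(n_vars), 2)]

max_abs = max(abs(r) for r in corrs)
num_large = sum(1 for r in corrs if abs(r) > 0.3)

print("number of pairs:", len(corrs))
print(f"largest |correlation|: {max_abs:.2f}")
print("pairs with |correlation| > 0.3:", num_large)
```

Although every variable is generated independently, some pairs show sizeable correlations simply because so many pairs are examined with only 50 observations each.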
Correlations in a Large Random Data Set
From: https://seaborn.pydata.org/examples/many_pairwise_correlations.html

Conclusions so far?
• Summary statistics are useful… up to a point
• Linear correlation measures can be misleading
• There really is no substitute for plotting/visualizing the data

Scatter Plots
• Plot the value of one variable against the other
• Simple… but can be very informative, can reveal more than summary statistics
• For example, we can…
  – See if variables are dependent on each other (beyond linear dependence)
  – Detect if outliers are present
  – Can color-code to overlay group information (e.g., color points by class label for classification problems)

[Figure: scatter plot of MEDIAN PER CAPITA INCOME and MEDIAN HOUSEHOLD INCOME, units = dollars, with axis ranges up to 14 x 10^4 and 2.5 x 10^5 (from US Zip code data: each point = 1 Zip code)]

Constant Variance versus Changing Variance
[Figure: two scatter plots]
• Constant variance: variation in Y does not depend on X
• Changing variance: variation in Y changes with the value of X, e.g., Y = annual tax paid, X = income

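The difference between the two cases can be simulated. In the sketch below, the line Y = 2X, the noise levels, the form of the changing variance (noise standard deviation proportional to X), and the seed are all assumptions made for illustration.

```python
import random
import statistics

random.seed(3)
x = [random.uniform(1, 10) for _ in range(2000)]

# Constant variance: the noise standard deviation does not depend on x.
y_constant = [2 * xi + random.gauss(0, 1) for xi in x]
# Changing variance: the noise standard deviation grows with x (assumed form).
y_changing = [2 * xi + random.gauss(0, 0.3 * xi) for xi in x]

def spread(x, y, lo, hi):
    """Standard deviation of the deviations y - 2x for points with lo <= x < hi."""
    return statistics.pstdev(
        [yi - 2 * xi for xi, yi in zip(x, y) if lo <= xi < hi])

for name, y in (("constant", y_constant), ("changing", y_changing)):
    low, high = spread(x, y, 1, 4), spread(x, y, 7, 10)
    print(f"{name}: spread for small x = {low:.2f}, for large x = {high:.2f}")
```

In a scatter plot the changing-variance case shows up as a fan shape that widens as X increases.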
Scatter-Plot Matrices: Example for Diabetes Data

Using Color to Show Group Information in Scatter Plots
Iris classification data set, 3 classes
Figure from www.originlab.com

Another Example with Grouping by Color
Figure from hci.stanford.edu

Outlier Detection
• Definition of an outlier?
  – No precise definition
  – Generally… “A data point that is significantly different to the rest of the data”
  – But how do we define “significantly different”? (many answers to this…)
  – Typically assumed to mean that the point was measured in error, or is not a true measurement in some sense
[Figure: two panels, “Outliers in 1 dimension” and “Outlier in 2 dimensions”; the second is a scatter plot of X VALUES versus Y VALUES]

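A point can look unremarkable in each variable separately and still be an outlier when both are viewed together. The sketch below illustrates this with hypothetical points made up for this example, and uses distance to the nearest other point as one of many possible ways to quantify "significantly different"; it is not a method prescribed by the slides.

```python
import math

# Hypothetical points lying close to the line y = x, plus one unusual point.
points = [(1.0, 2.1), (2.0, 2.9), (3.0, 4.2), (4.0, 4.8),
          (5.0, 6.1), (6.0, 7.0), (7.0, 7.9), (8.0, 9.1)]
candidate = (3.0, 8.0)

xs = [p[0] for p in points]
ys = [p[1] for p in points]

# In each single dimension the candidate lies inside the range of the others.
inside_x_range = min(xs) <= candidate[0] <= max(xs)
inside_y_range = min(ys) <= candidate[1] <= max(ys)

def nearest_distance(p, others):
    """Euclidean distance from p to the closest point in others."""
    return min(math.dist(p, q) for q in others)

# Nearest-neighbor distance of each regular point, and of the candidate.
typical = [nearest_distance(p, [q for q in points if q != p]) for p in points]
candidate_distance = nearest_distance(candidate, points)

print("inside x range:", inside_x_range, "| inside y range:", inside_y_range)
print(f"largest nearest-neighbor distance among regular points: {max(typical):.2f}")
print(f"nearest-neighbor distance of candidate: {candidate_distance:.2f}")
```

Histograms of X and Y alone would not flag this point, but a scatter plot makes it stand out immediately.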