exploratory data analysis - grape.ics.uci.edu · Stats 170A: Project in Data Science Data...

Stats 170A: Project in Data Science Data Visualization and Exploratory Data Analysis Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California, Irvine


Stats 170A: Project in Data Science

Data Visualization and Exploratory Data Analysis

Padhraic Smyth
Department of Computer Science
Bren School of Information and Computer Sciences
University of California, Irvine

Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018

Overview

• Lectures/Homeworks up to this point
  – Data management (relational DBs, query languages, PostgreSQL)
  – Data manipulation in Python (Pandas)
  – Data formats (JSON, XML)
  – Practical experience with Twitter data, IMDB data

• Next 2 weeks
  – Review of data visualization and exploration
  – Basic principles of machine learning (and some statistics)
  – Machine learning with text data

How this Course will work

• Q1: Weeks 1 to 6: Lectures and Assignments
  – Review general principles of data science
  – Weeks 1 to 3: databases, data extraction, data cleaning
  – Weeks 4 to 6: text analysis, data exploration, machine learning
  – Combination of lectures, assignments, and background reading

• Q1: Weeks 7 to 10: Project Proposals
  – Project proposals from student teams
  – Feedback from instructors, refine proposal, oral presentation at end of quarter

• Q2: Work on Projects
  – Build and use a prototype system/pipeline
  – Develop ideas, implement algorithms, make use of libraries and packages
  – Conduct experiments with real data sets
  – Test and evaluate your system in a systematic manner
  – Communicate your results (presentations and reports)

Assignment 5

Refer to the Wiki page

Due noon on Monday February 12th to EEE dropbox

Note change: due before class (by 2pm)

Types of Data

Types of Data for a Single Variable

• Real-valued, continuous
  – e.g., a person’s weight or income
  – values may be discretized and bounded, but we will think of them as lying on the real line

• Integer
  – e.g., year of birth, number of years in college
  – Could also be a real-valued variable that is quantized (age in years)

• Ordinal
  – e.g., education level = {kindergarten, high school, college, grad school, …}

• Categorical
  – e.g., {red, blue, yellow} or {CA, MA, NY, AZ, …} or text strings

(Note that many visualization and machine learning techniques implicitly assume real-valued data, and other data types are converted to reals or represented via grouping, colors, or icons.)

Multiple Variables

• More than 1 variable, often referred to as multivariate or multidimensional

• Often interested in relationships between variables and geometric structure of the data (for real-valued data), e.g., is it clustered?

• For small numbers of variables we can plot the data and look at relationships

• For large numbers we use exploratory techniques
  – e.g., clustering and dimension reduction

• Note that many visualization and machine learning techniques implicitly assume real-valued data… categorical data types are often converted to reals (e.g., binary) or represented via grouping, colors, or icons
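One common way to convert a categorical variable to binary reals is one-hot (indicator) encoding. The sketch below is only an illustration using the standard library; the `states` list and the `one_hot` helper are hypothetical and not part of the course material.

```python
# Hypothetical categorical column (state abbreviations, as in the slide's example).
states = ["CA", "MA", "NY", "CA", "AZ"]

# Fix an ordering of the distinct categories so each one gets its own column.
categories = sorted(set(states))


def one_hot(value, categories):
    """Return a list of 0/1 indicators, one per category."""
    return [1 if value == c else 0 for c in categories]


# Each row now contains exactly one 1, in the column of its category.
encoded = [one_hot(s, categories) for s in states]

for s, row in zip(states, encoded):
    print(s, row)
```

In practice a library routine (for example in Pandas) would usually do this conversion; the point here is only that each category becomes its own binary variable.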

Data with Context

• Time-series data
  – A variable whose values are indexed by time
  – We can also have multidimensional time-series

• Sequence data
  – A variable indexed by position
  – e.g., words (categorical) in text, or DNA sequences

• Spatial data
  – Data whose values are indexed spatially, e.g., by lat/lon or by city
  – Can also have multidimensional spatial data

• Spatio-temporal
  – Indexed by both space and time, e.g., storm tracks, vehicle trajectories, etc.

Stock Market Indices last week

Night Lights from North and South Korea

From https://www.vox.com

Where People Run

From: https://flowingdata.com/2014/02/05/where-people-run/#jp-carousel-33695

Relational Data

• N entities, i = 1, …, N

• N x N relations:
  – can be represented as an array y(i,j) = 1 if i is connected to j, 0 otherwise
  – Example: a social network

• Can combine with other data, e.g.,
  – Each relation could have metadata, e.g., text
  – Each relation could be time-dependent, y(i,j,t) is a time series over time t
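The array representation y(i,j) can be sketched in a few lines. This is a minimal illustration with a made-up network of 4 entities; the edge list is hypothetical, and indices start at 0 here rather than 1 as on the slide.

```python
# Hypothetical network: N = 4 entities and a list of directed connections (i, j).
N = 4
edges = [(0, 1), (1, 2), (2, 0), (3, 0)]

# y[i][j] = 1 if i is connected to j, 0 otherwise.
y = [[0] * N for _ in range(N)]
for i, j in edges:
    y[i][j] = 1

# Simple summaries can be read straight off the array.
out_degree = [sum(row) for row in y]  # connections leaving each entity
in_degree = [sum(y[i][j] for i in range(N)) for j in range(N)]  # connections arriving

for row in y:
    print(row)
```

For large, sparse networks one would normally store only the edge list (or a sparse matrix) rather than the full N x N array.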

Visualization of an email network using 2-dimensional graph drawing or “embedding”

Data from 500 researchers at Hewlett-Packard over approximately 1 year.

Various structural elements of the network are apparent

Philosophy behind this Class

• Provide an experience of how data science works in the real world
  – Defining a problem
  – Identifying, understanding, exploring relevant data
  – Extracting, cleaning, management of data
  – Exploration and analysis of data
  – Building models from data (e.g., via machine learning)
  – Evaluating models: how well do they predict
  – Communicating your results to others

• Tie together ideas from different courses you have taken and give you experience in applying these ideas to real-world data
  – Databases, software, algorithms, machine learning, statistics

Data Science: from Data to Actions

[Diagram labels. Stages: Raw Data, Data Management, Data Wrangling, Exploratory Data Analysis, Predictive Modeling. Audiences: Consumers, External Business Customers, Internal Business Customers, Scientists, Government. Skills: Databases, Algorithms, Software Engineering; Machine Learning, Statistics; Domain knowledge, Business knowledge.]

Why Visualization and Exploration?

• People are good at pattern recognition
  – At spotting clusters, trends, outliers, structure… that computers may miss

• Usually two types of users
  1. The data scientist who wants to explore/analyze/understand
     - For the data scientist, visualization and exploration are part of an iterative process
  2. The person who needs a quick summary to make a decision
     - For the consumer we want to communicate information quickly and clearly
     - e.g., for a medical doctor, for a policy-maker, for a consumer

- For data scientists… it's always a good idea to look at your data
- Helps to understand the semantics of the data… what the measurements actually mean

What is Exploratory Data Analysis?

• Broader than just visualization

• EDA = {visualization, clustering, dimension reduction, …}

• For small numbers of variables, EDA = visualization

• For large numbers of variables, we need to be cleverer
  – Clustering, dimension reduction, embedding algorithms
  – These are techniques that essentially reduce high-dimensional data to something we can look at

• Pioneered by John Tukey (statistician at Bell Labs, Princeton) in the 1960’s
  – “let the data speak”

Exploratory Data Analysis: Single Variables

Summary Statistics

Mean: “center of data”
Mode: location of highest data density
Variance: “spread of data”
Skew: indication of non-symmetry
Range: max - min
Median: 50% of values below, 50% above
Quantiles: e.g., values such that 25%, 50%, 75% are smaller

Note that some of these statistics can be misleading
E.g., mean for data with 2 clusters may be in a region with zero data
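The warning about two clusters can be checked with a small simulation. This is a sketch under assumed settings: the two cluster centers (0 and 20), the sample sizes, and the seed are all made up for illustration, and only Python's standard `random` and `statistics` modules are used.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical data with 2 well-separated clusters, centered at 0 and at 20.
data = ([random.gauss(0, 1) for _ in range(500)] +
        [random.gauss(20, 1) for _ in range(500)])

mean = statistics.mean(data)
median = statistics.median(data)
variance = statistics.pvariance(data)
data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # 25%, 50%, 75% cut points

# How many data points actually lie within 1 unit of the mean?
near_mean = sum(1 for x in data if abs(x - mean) < 1.0)

print("mean:", round(mean, 2), "median:", round(median, 2))
print("variance:", round(variance, 2), "range:", round(data_range, 2))
print("quartiles:", round(q1, 2), round(q2, 2), round(q3, 2))
print("points within 1 unit of the mean:", near_mean)
```

The mean lands between the two clusters, where there is essentially no data, which is exactly the situation the note describes.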

Histogram of Unimodal Data

[Figure: histogram of 1000 data points simulated from a Normal distribution, mean 10, variance 1, 30 bins]

Histograms: Unimodal Data

[Figure: three histograms of 100 data points from a Normal, mean 10, variance 1, with 5, 10, and 30 bins]
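To see how the number of bins changes the picture for a small sample, the counts can be computed directly. The sketch below assumes equal-width bins spanning the range of the data; the `histogram` helper and the seed are illustrative choices, not the code used to make the slide.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# 100 points from a Normal with mean 10 and variance 1 (standard deviation 1).
data = [random.gauss(10, 1) for _ in range(100)]


def histogram(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so that the maximum value falls in the last bin.
        k = min(int((v - lo) / width), n_bins - 1)
        counts[k] += 1
    return counts


for n_bins in (5, 10, 30):
    print(n_bins, "bins:", histogram(data, n_bins))
```

With only 100 points, the 30-bin counts are small and noisy, while 5 bins hide most of the shape; neither extreme is a faithful summary on its own.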

Histogram of Multimodal Data

[Figure: histogram of 15000 data points simulated from a mixture of 3 Normal distributions, 300 bins]

[Figure: the 300-bin histogram shown together with three further histogram panels over the same x range (5 to 14)]

Skewed Data

[Figure: histogram of 5000 data points simulated from an exponential distribution, 100 bins]

Another Skewed Data Set

[Figure: histogram of 10000 data points simulated from a mixture of 2 exponentials, 100 bins]

Same Skewed Data after taking Logs (base 10)

[Figure: histogram of the base-10 logs of the 10000 data points simulated from a mixture of 2 exponentials, 100 bins]
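A log transform of this kind is a one-liner. The sketch below simulates a mixture of 2 exponentials with assumed parameters (equal weights, means 1 and 50), since the slide does not state the ones actually used, and then takes base-10 logs.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

# Hypothetical mixture of 2 exponentials: half the points have mean 1, half have mean 50.
data = ([random.expovariate(1.0) for _ in range(5000)] +
        [random.expovariate(1.0 / 50.0) for _ in range(5000)])

# Logs are only defined for positive values, so drop any exact zeros defensively.
logs = [math.log10(x) for x in data if x > 0]

# For right-skewed data the mean sits well above the median.
raw_mean = statistics.mean(data)
raw_median = statistics.median(data)

print("raw mean:", round(raw_mean, 2), "raw median:", round(raw_median, 2))
print("log10 range:", round(min(logs), 2), "to", round(max(logs), 2))
```

On the log scale the long right tail is compressed, so structure that is squashed into the first few bins of the raw histogram becomes visible.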

What will the mean or median tell us about this data?

[Figure: histogram, x axis from 9 to 16]

Histogram with Outliers

[Figure: histogram; x axis: X values; y axis: Number of Individuals. Pima Indians Diabetes Data, from UC Irvine Machine Learning Repository]

Histogram with Outliers

[Figure: histogram; x axis: Diastolic Blood Pressure; y axis: Number of Individuals; annotation: blood pressure = 0? Pima Indians Diabetes Data, from UC Irvine Machine Learning Repository]
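A blood pressure of 0 is not a plausible measurement, so one reasonable reading is that the zeros stand in for missing values. The sketch below shows how such values might be flagged before computing summaries; the readings are made up for illustration and are not taken from the Pima data set.

```python
import statistics

# Hypothetical diastolic blood pressure readings; the zeros mimic the
# suspicious "blood pressure = 0?" bar in the histogram.
readings = [72, 66, 0, 64, 80, 0, 74, 70, 90, 68]

# Assumption: a value of 0 means "not recorded", so mark it as missing (None).
cleaned = [r if r > 0 else None for r in readings]
observed = [r for r in cleaned if r is not None]

n_missing = cleaned.count(None)
mean_with_zeros = statistics.mean(readings)
mean_without_zeros = statistics.mean(observed)

print("missing:", n_missing)
print("mean with zeros:", mean_with_zeros)
print("mean without zeros:", mean_without_zeros)
```

Leaving the zeros in pulls the mean down noticeably, which is one reason to look at a histogram before trusting summary statistics.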

Box Plots: Diabetes Data

[Figure: two side-by-side box plots of Body Mass Index for Healthy Individuals and Diabetic Individuals from the Pima Indians Diabetes Data Set]

Note: significant overplotting here that could easily be missed

Box Plots: Diabetes Data

[Figure: two side-by-side box plots of Body Mass Index for Healthy Individuals and Diabetic Individuals from the Diabetes Data Set, with the annotations below]

• Box = middle 50% of data, from Q1 to Q3
• Q2 (median)
• Upper Whisker and Lower Whisker: 1.5 x (Q3 - Q1)
• Plots all data points outside “whiskers”
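The quantities drawn in a box plot can be computed by hand. The sketch below assumes the common convention that each whisker extends to the most extreme data point within 1.5 x (Q3 - Q1) of the box; plotting libraries differ in how they compute quartiles, and the BMI values are hypothetical.

```python
import statistics

# Hypothetical body mass index values (sorted for readability).
bmi = [22.1, 24.5, 25.0, 26.3, 27.8, 28.4, 29.9, 31.2, 33.6, 35.0, 52.3]

# Quartiles: Q1, Q2 (median), Q3. The box spans Q1 to Q3.
q1, q2, q3 = statistics.quantiles(bmi, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 x IQR of the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
lower_whisker = min(x for x in bmi if x >= lower_limit)
upper_whisker = max(x for x in bmi if x <= upper_limit)

# Points beyond the whiskers are plotted individually.
outside = [x for x in bmi if x < lower_limit or x > upper_limit]

print("Q1, Q2, Q3:", q1, q2, q3)
print("whiskers:", lower_whisker, upper_whisker)
print("outside whiskers:", outside)
```

Only the value 52.3 falls outside the whiskers here, so it is the one point a box plot of these numbers would draw separately.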

Multiple Box Plots: Diabetes Data

[Figure: box plots for healthy and diabetic individuals, one panel each for Diastolic Blood Pressure, 24-hour Serum Insulin, Plasma Glucose Concentration, and Body Mass Index]

Horizontal Box Plot for Planet Data

From: https://seaborn.pydata.org/examples/horizontal_boxplot.html

Exploring Pairs of Variables

Relationships between Pairs of Variables

• Say we have a variable Y we want to predict and many variables X that we could use to predict Y

• In exploratory data analysis we may be interested in quickly finding out if a particular X variable is potentially useful at predicting Y

• Options?
  – Linear correlation
  – Scatter plot: plot Y values versus X values

Linear Dependence between Pairs of Variables

• Covariance and correlation measure linear dependence

• Assume we have two variables or attributes X and Y and n objects taking values x(1), …, x(n) and y(1), …, y(n). The sample covariance of X and Y is:

$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)\bigl(y(i) - \bar{y}\bigr)$$

  where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y.

• The covariance is a measure of how X and Y vary together.
  – large and positive if large values of X are associated with large values of Y and small X ⇒ small Y

• (Pearson Linear) Correlation = scaled covariance, varies between -1 and 1

$$\rho(X, Y) = \frac{\sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)\bigl(y(i) - \bar{y}\bigr)}{\left( \sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)^{2} \, \sum_{i=1}^{n} \bigl(y(i) - \bar{y}\bigr)^{2} \right)^{1/2}}$$
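The covariance and correlation definitions translate directly into code. This is a minimal sketch using the 1/n normalisation from the slide; the helper names and the small `x`, `y` lists are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance with the 1/n normalisation."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n


def correlation(x, y):
    """Pearson linear correlation: covariance scaled to lie between -1 and 1."""
    x_bar, y_bar = mean(x), mean(y)
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) *
                    sum((yi - y_bar) ** 2 for yi in y))
    return num / den


# Hypothetical data: y rises roughly linearly with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

print("covariance:", covariance(x, y))
print("correlation:", correlation(x, y))
```

Note that the covariance depends on the units of X and Y, while the correlation does not, which is why the correlation is the quantity usually reported.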

Data Set on Housing Prices in Boston

 #   Name      Description
 1   CRIM      per capita crime rate by town
 2   ZN        proportion of residential land zoned for lots over 25,000 sq. ft.
 3   INDUS     proportion of non-retail business acres per town
 4   NOX       Nitrogen oxide concentration (parts per 10 million)
 5   RM        average number of rooms per dwelling
 6   AGE       proportion of owner-occupied units built prior to 1940
 7   DIS       weighted distances to five Boston employment centres
 8   RAD       index of accessibility to radial highways
 9   TAX       full-value property-tax rate per $10,000
 10  PTRATIO   pupil-teacher ratio by town
 11  MEDV      Median value of owner-occupied homes in $1000's

(widely used data set in regression (prediction) research)

Matrix of Pairwise Linear Correlations

[Figure: color-coded matrix of pairwise linear correlations, with a color scale running from -1 through 0 to +1, for data on characteristics of Boston housing. Variables shown: Crime Rate, Percentage of large residential lots, Industry, Nitrous oxide, Average # rooms, Proportion of old houses, Distance to employment centers, Highway accessibility, Property tax rate, Student-teacher ratio, Median house value]

Examples of X-Y plots and linear correlation values

[Figure: two panels, Linear Dependence and Non-Linear Dependence]

Lack of linear correlation does not imply lack of dependence
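A standard illustration of this point is Y = X squared with X symmetric around zero: Y is completely determined by X, yet the linear correlation is zero. The sketch below is an assumed example, not the data behind the slide's figure.

```python
import math


def correlation(x, y):
    """Pearson linear correlation of two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) *
                    sum((yi - y_bar) ** 2 for yi in y))
    return num / den


# X values symmetric around 0; Y depends on X exactly, but not linearly.
x = [i / 10 for i in range(-50, 51)]
y = [xi ** 2 for xi in x]

r = correlation(x, y)
print("linear correlation of x and x^2:", r)
```

The positive and negative halves of X contribute equal and opposite terms to the covariance, so they cancel even though the dependence is perfect.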

Summary Statistics for Anscombe’s 4 Data Sets

4 data sets, each with 2 variables X and Y, with the same summary statistics (imagine that Python reports these summaries and we have not yet looked at the data)

 Data Set   N    Mean of X   Mean of Y
 1          11   9.0         7.5
 2          11   9.0         7.5
 3          11   9.0         7.5
 4          11   9.0         7.5

Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, 27(1), pp. 17-21.

Anscombe’s 4 Data Sets

Guess the Linear Correlation Values for each Data Set

Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, 27(1), pp. 17-21.

Summary Statistics for each Data Set

 Data Set   N    Mean of X   Mean of Y   Intercept   Slope   Correlation
 1          11   9.0         7.5         3           0.5     0.82
 2          11   9.0         7.5         3           0.5     0.82
 3          11   9.0         7.5         3           0.5     0.82
 4          11   9.0         7.5         3           0.5     0.82

Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, 27(1), pp. 17-21.

Lesson: summary statistics can be misleading
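These summaries can be reproduced from the quartet itself. The sketch below uses the data values as they are commonly reproduced (for example in seaborn's bundled `anscombe` dataset); the `summarize` helper is a hypothetical name, and the intercept and slope come from an ordinary least-squares line fitted by hand.

```python
import math

# Anscombe's quartet as commonly reproduced; data sets 1 to 3 share the same X values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def summarize(x, y):
    """Return N, means, least-squares intercept and slope, and correlation."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    slope = sxy / sxx
    return {
        "n": n,
        "mean_x": x_bar,
        "mean_y": y_bar,
        "intercept": y_bar - slope * x_bar,
        "slope": slope,
        "correlation": sxy / math.sqrt(sxx * syy),
    }


summaries = [summarize(x, y) for x, y in
             [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]]

for k, s in enumerate(summaries, start=1):
    print(k, {name: round(value, 2) for name, value in s.items()})
```

All four rows print essentially the same numbers, even though scatter plots of the four data sets look completely different.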

Dangers of searching for correlations in high-dimensional data

Simulated 50 random Gaussian/normal data vectors, each with 100 variables
Results in a 50 x 100 data matrix

Below is a histogram of the correlation coefficients for all (100 choose 2) = 4950 pairs of variables

Even if data are entirely random (no dependence) there is a very high probability some variables will appear dependent just by chance.
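The experiment is easy to repeat. The sketch below generates an independent 50 x 100 Gaussian data matrix and computes the correlation for every pair of variables; the seed and the 0.3 threshold are arbitrary illustrative choices, and the exact numbers will differ from the slide's histogram.

```python
import itertools
import math
import random

random.seed(3)  # fixed seed so the run is repeatable

n_rows, n_vars = 50, 100

# columns[j] holds the 50 values of variable j; every entry is independent N(0, 1).
columns = [[random.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_vars)]


def correlation(x, y):
    """Pearson linear correlation of two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) *
                    sum((yi - y_bar) ** 2 for yi in y))
    return num / den


# One coefficient for each of the (100 choose 2) pairs of variables.
corrs = [correlation(columns[a], columns[b])
         for a, b in itertools.combinations(range(n_vars), 2)]

largest = max(abs(r) for r in corrs)
n_large = sum(1 for r in corrs if abs(r) > 0.3)

print("number of pairs:", len(corrs))
print("largest |correlation|:", round(largest, 3))
print("pairs with |correlation| > 0.3:", n_large)
```

With only 50 observations per variable, a noticeable number of the pairs show moderate correlations purely by chance, so the largest coefficient found by searching should not be taken at face value.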

Correlations in a Large Random Data Set

From: https://seaborn.pydata.org/examples/many_pairwise_correlations.html

Conclusions so far?

• Summary statistics are useful… up to a point

• Linear correlation measures can be misleading

• There really is no substitute for plotting/visualizing the data

Scatter Plots

• Plot the value of one variable against the other

• Simple… but can be very informative, can reveal more than summary statistics

• For example, we can…
  – See if variables are dependent on each other (beyond linear dependence)
  – Detect if outliers are present
  – Can color-code to overlay group information (e.g., color points by class label for classification problems)

[Figure: scatter plot from US Zip code data, each point = 1 Zip code. Axes: MEDIAN PER CAPITA INCOME and MEDIAN HOUSEHOLD INCOME; units = dollars]

Constant Variance versus Changing Variance

[Figure: two panels. (1) Variation in Y does not depend on X. (2) Variation in Y changes with the value of X, e.g., Y = annual tax paid, X = income]

Scatter-Plot Matrices: Example for Diabetes Data

Using Color to Show Group Information in Scatter Plots

Figure from www.originlab.com

Iris classification data set, 3 classes

Another Example with Grouping by Color

Figure from hci.stanford.edu

Outlier Detection

• Definition of an outlier?
  – No precise definition
  – Generally… “A data point that is significantly different from the rest of the data”
  – But how do we define “significantly different”? (many answers to this…)
  – Typically assumed to mean that the point was measured in error, or is not a true measurement in some sense

[Figure: two panels, Outliers in 1 dimension and Outlier in 2 dimensions; axes: X VALUES and Y VALUES]
