File 498 Doc 27 03dm Exploratorydataanalysis
1
Assistant Professor Jiratta Phuboon-ob (ผู้ช่วยศาสตราจารย์จิรัฎฐา ภูบุญอบ) ([email protected], 08-9275-9797)
EXPLORATORY DATA ANALYSIS
2
1. Hypothesis Testing Versus Exploratory Data Analysis
For example, has an increasing fee structure led to decreasing market share?
Hypothesis testing: test the hypothesis that market share has decreased
Many statistical hypothesis test procedures are available:
  Z-test for the population mean
  t-test for the population mean
  Z-test for the population proportion
  Z-test for the difference of two population means
  t-test for the difference of two population means
  Z-test for the difference of two population proportions
  Chi-square test for independence among categorical variables
  Analysis of variance F-test
  t-test for the slope of a regression line
And many others, including tests for time-series analysis, quality control tests, and nonparametric tests
3
1. Hypothesis Testing Versus Exploratory Data Analysis (cont’d)
However, analysts do not always have a priori notions about the data
In this case, use Exploratory Data Analysis (EDA)
Approach useful for:
  Delving into data
  Examining important interrelationships between attributes
  Identifying interesting subsets or patterns
  Discovering possible relationships between predictors and target variable
4
2. Getting to Know the Data Set
Graphs, plots, and tables often uncover important relationships in data
The 3,333 records and 20 variables in the churn data set are explored
Clementine from SPSS, Inc. shows first 10 records from data set in Figure 3.1
Simple approach looks at field values of records
5
2. Getting to Know the Data Set (cont’d)
“churn” attribute indicates customers leaving one company in favor of another company’s products or services
6
3. Dealing with Correlated Variables
Using correlated variables in data model:
  Should be avoided!
  Incorrectly emphasizes one or more data inputs
  Creates model instability and produces unreliable results
Matrix plot of Day Minutes, Day Calls, and Day Charge
7
3. Dealing with Correlated Variables (cont’d)
Estimated regression equation shown in Figure 3.3 (Minitab) expresses relationship “Day Charge equals 0.000613 plus 0.17 times Day Minutes”
Company uses flat-rate billing model of 17 cents/minute
R-squared statistic = 1.0 indicates perfect linear relationship
Therefore, Day Charge and Day Minutes are correlated

Regression Analysis: Day Charge versus Day Mins

The regression equation is
Day Charge = 0.000613 + 0.170 Day Mins

Predictor   Coef        SE Coef     T           P
Constant    0.0006134   0.0001711   3.59        0.000
Day Mins    0.170000    0.000001    186644.31   0.000

S = 0.002864   R-Sq = 100.0%   R-Sq(adj) = 100.0%
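The perfect linear relationship can be checked numerically. The following is a minimal sketch, assuming a handful of hypothetical Day Minutes values (not taken from the churn data set) billed at the flat rate of 17 cents per minute; the helper name `fit_line` is illustrative and is not part of Minitab or Clementine.

```python
import math


def fit_line(x, y):
    """Return (intercept, slope, r) for a simple least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return intercept, slope, r


# Hypothetical minutes (assumption), charged at 0.17 per minute, rounded to cents.
day_mins = [100.0, 150.5, 200.0, 265.1, 300.3]
day_charge = [round(0.17 * m, 2) for m in day_mins]

intercept, slope, r = fit_line(day_mins, day_charge)
print(f"slope={slope:.4f}  intercept={intercept:.4f}  r={r:.6f}")
```

A slope close to 0.17 together with r close to 1 signals that one of the two variables carries no additional information for a model.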
8
3. Dealing with Correlated Variables (cont’d)
One of two variables should be eliminated from model
Day Charge arbitrarily chosen for removal
Evening, Night, and International variable pairs reflect similar results
Therefore, Evening Charge, Night Charge, and International Charge also removed
Number of attributes reduced from 20 to 16
9
4. Exploring Categorical Variables
Goals: Exploratory Data Analysis
Investigate variables as part of the Data Understanding Phase
  Numeric: analyze histograms, scatter plots, statistics
  Categorical: examine distributions, cross-tabulations, web graphs
Become familiar with data
Explore relationships among variable sets
While performing EDA, remain focused on objective
10
4. Exploring Categorical Variables (cont’d)
International Plan
Figure 3.4 shows proportion of customers in International Plan with churn overlay
International Plan: yes = 9.69%, no = 90.31%
Possibly, greater proportion of those in International Plan are churners?
11
4. Exploring Categorical Variables (cont’d)
Again, proportion of customers in International Plan with churn overlay
This time, same-sized bars used for each category (normalized)
Graphically, proportion of “churners” in each category more apparent
Those selecting International Plan more likely to churn
However, relationship not quantified
12
4. Exploring Categorical Variables (cont’d)
Cross-tabulation quantifies relationship between Churn and International Plan
International Plan and Churn variables are both categorical

            International Plan
Churn       no        yes
False.      2,664     186
True.       346       137

First column: total International Plan = “no”
Second column: total International Plan = “yes”
First row: total Churn = “False”
Second row: total Churn = “True”
Data set contains 346 + 137 = 483 churners, and 2,664 + 186 = 2,850 non-churners
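The row and column totals of the cross-tabulation can be recomputed directly from its four cell counts. A minimal sketch; the dictionary layout and variable names are illustrative assumptions, and only the counts come from the slide.

```python
# Counts from the cross-tabulation: rows are Churn, columns are International Plan.
counts = {
    "False": {"no": 2664, "yes": 186},
    "True": {"no": 346, "yes": 137},
}

# Row totals: number of non-churners and churners.
row_totals = {churn: sum(cols.values()) for churn, cols in counts.items()}

# Column totals: customers without and with the International Plan.
col_totals = {
    plan: sum(counts[churn][plan] for churn in counts)
    for plan in ("no", "yes")
}

grand_total = sum(row_totals.values())

print(row_totals)   # {'False': 2850, 'True': 483}
print(col_totals)   # {'no': 3010, 'yes': 323}
print(grand_total)  # 3333
```

The grand total matches the 3,333 records in the churn data set, which is a quick consistency check on the table.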
13
4. Exploring Categorical Variables (cont’d)
Therefore, quantifying the relationship:
42.4% of customers in International Plan churned (137 / (137 + 186))
11.5% of customers not in International Plan churned (346 / (346 + 2,664))
Customers selecting International Plan are more than 3X as likely to leave company as those not in plan
Why does International Plan apparently cause customers to leave?
Data models predicting churn will likely include International Plan as predictor
14
4. Exploring Categorical Variables (cont’d)
Voice Mail Plan
Figure 3.7 shows proportion of customers in Voice Mail Plan with churn overlay (normalized)
Voice Mail Plan: yes = 27.66%, no = 72.34%
Those not participating in Voice Mail Plan appear more likely to churn
15
4. Exploring Categorical Variables (cont’d)
Cross-tabulation quantifies relationship between Churn and Voice Mail Plan
            Voice Mail Plan
Churn       no        yes
False.      2,008     842
True.       403       80

First column: total Voice Mail Plan = “no”
Second column: total Voice Mail Plan = “yes”
First row: total Churn = “False”
Second row: total Churn = “True”
Voice Mail Plan has 842 + 80 = 922 customers
Remaining 2,008 + 403 = 2,411 customers not in plan
16
4. Exploring Categorical Variables (cont’d)
Only 8.7% = 80/922 of those in plan are churners
Of those not in plan, 16.7% = 403/2,411 are churners
Therefore, those not participating in plan are ~2X as likely to churn as those in plan
Perhaps customer loyalty can be increased by simplifying enrollment into Voice Mail Plan?
Data models predicting churn likely to include Voice Mail Plan as predictor
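The gap between the two churn rates can also be checked with one of the hypothesis tests listed on the first slide, the Z-test for the difference of two population proportions. A minimal sketch using the cross-tabulation counts; the pooled-variance form of the test and the function name `two_proportion_z` are choices made here for illustration, not something the slides specify.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion Z statistic and two-sided p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Churners / customers: in Voice Mail Plan (80 of 922), not in plan (403 of 2,411).
z, p_value = two_proportion_z(80, 922, 403, 2411)
print(f"z = {z:.2f}, p-value = {p_value:.2e}")
```

A large negative Z statistic supports the graphical impression that customers in the plan churn at a lower rate than those outside it.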
17
4. Exploring Categorical Variables (cont’d)
Two-way Interactions between Voice Mail Plan and International Plan, with respect to churn shown
Voice Mail Plan = no (constant)
Many customers have neither plan: 1,878 + 302 = 2,180
Of those, 302/2,180 = 14% are churners
Customers in International Plan and not in Voice Mail Plan churn at rate 101/231 = 44%
18
4. Exploring Categorical Variables (cont’d)
Here, Voice Mail Plan = yes (constant)
Many customers have Voice Mail Plan only: 786 + 44 = 830
Those in both plans: 56 + 36 = 92
Churn rate only 44/830 = 5% when customers participate in Voice Mail Plan only
However, those enrolled in both plans churn at 36/92 = 39%
Customers in International Plan churning at higher rate, regardless of Voice Mail Plan participation
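The four combinations of the two plans can be laid out side by side. A minimal sketch using the counts quoted on the two interaction slides; the 130 non-churners in the International Plan only group are derived here as 231 - 101, and the dictionary layout is an illustrative assumption.

```python
# (voice_mail_plan, international_plan) -> (non_churners, churners)
cells = {
    ("no", "no"): (1878, 302),
    ("no", "yes"): (231 - 101, 101),  # non-churners derived from the 231 total
    ("yes", "no"): (786, 44),
    ("yes", "yes"): (56, 36),
}

rates = {
    plans: churners / (non_churners + churners)
    for plans, (non_churners, churners) in cells.items()
}

for (voice_mail, international), rate in rates.items():
    print(f"Voice Mail={voice_mail:<3} International={international:<3} churn rate={rate:.0%}")

total_customers = sum(a + b for a, b in cells.values())
```

Within each Voice Mail Plan setting, the International Plan = yes row shows the higher churn rate, which is the pattern the slides describe.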
19
4. Exploring Categorical Variables (cont’d)
Directed Web Graph shows relationships between International Plan, Voice Mail Plan, and Churn attributes (Clementine)
Examine connections from Voice Mail Plan = yes node to Churn = True and Churn = False
Heavier line connecting to Churn = False indicates greater proportion of those in plan are not churners
20
5. Exploring Numeric Variables
Numeric summary measures for several variables shown
Includes min and max, mean, median, and standard deviation
For example, Account Length has min = 1 and max = 243
Mean and median both ~101, which suggests symmetry
Voice Mail Messages not symmetric; mean = 8.1 and median = 0
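Comparing mean and median is a quick, rough symmetry check. A minimal sketch, assuming a small hypothetical sample of voice mail message counts (invented for illustration, not drawn from the churn data set) in which more than half of the customers have no messages.

```python
import statistics

# Hypothetical message counts (assumption): six of ten customers have none.
messages = [0, 0, 0, 0, 0, 0, 25, 30, 33, 41]

mean = statistics.mean(messages)
median = statistics.median(messages)

# A mean well above the median hints at a right-skewed distribution.
right_skew_hint = mean > median

print(f"mean={mean:.1f}  median={median}  right-skew hint={right_skew_hint}")
```

The comparison is only a hint: a histogram is still needed to see the actual shape of the distribution.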
21
5. Exploring Numeric Variables (cont’d)
Median = 0 indicates at least half of customers had no voice mail messages
Recall use of correlated variables should be avoided
Correlations of Customer Service Calls and Day Charge with other numeric variables shown
All correlations are “Weak” except for Day Charge and Day Minutes, where r = 1.0
Indicates perfect linear relationship
22
5. Exploring Numeric Variables (cont’d)
Histogram for Customer Service Calls attribute shown
Increases understanding of attribute’s distribution
Distribution is right-skewed and has mode = 1
However, relationship to Churn not indicated (Left)
Figure (Right) shows identical histogram including Churn overlay
Whether Churn proportion varies across number of Customer Service Calls is difficult to discern
23
5. Exploring Numeric Variables (cont’d)
Again, histogram of Customer Service Calls shown
Normalized values enhance pattern of churn
Customers calling customer service 3 or fewer times are far less likely to churn
Results:
  Carefully track number of customer service calls made by customers
  Offer incentives to retain those making higher number of calls
Data mining model will probably include Customer Service Calls as predictor
24
5. Exploring Numeric Variables (cont’d)
Normalized histogram of Day Minutes shown with Churn overlay (Top)
Indicates high usage customers churn at significantly greater rate
Results:
  Carefully track customer Day Minutes as total exceeds 200
  Investigate why those with high usage tend to leave
Normalized histogram of Evening Minutes shown with Churn overlay (Bottom)
Higher usage customers churn slightly more
Results: Based on graphical evidence, no specific conclusions drawn
25
5. Exploring Numeric Variables (cont’d)
26
6. Exploring Multivariate Relationships
Possible multivariate relationships examined
Two- and three-dimensional scatter plots used
Figure 3.23 shows scatter plot of Customer Service Calls versus Day Minutes
Upper-left quadrant indicates high-churn area
Identifies customers with high number of customer service calls, combined with low day minute usage
27
6. Exploring Multivariate Relationships (cont’d)
This relationship not detected using univariate analysis
Note, interaction between two variables makes association apparent
Univariate analysis determined customers with high number of Customer Service Calls churn at higher rates
Figure 3.23 shows those with higher day minutes somewhat “protected” from higher churn rate
28
6. Exploring Multivariate Relationships (cont’d)
High-churn rate also shown on far right (Figure 3.23)
Those with high day usage churn at higher rate, regardless of their number of customer service calls
Three-dimensional scatter plots sometimes enhance analysis
For example, figure shows Day Minutes versus Evening Minutes versus Customer Service Calls
29
7. Selecting Interesting Subsets of the Data for Further Investigation
Scatter plots or histograms identify interesting subsets of data
Top figure shows selection of customers with high day and evening minutes
Clementine enables selection of records for quantification (upper right)
Distribution of churn for this subset shown (bottom)
43.5% (192/441) of customers having both high day and evening minutes are churners
This is ~3X churn rate of entire data set
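Selecting a subset and quantifying its churn rate can be sketched in a few lines. This is a minimal sketch under loud assumptions: the five records, the field names, and both cutoffs are hypothetical (the slides do not state the thresholds used in Clementine); only the 192/441 and 483/3,333 figures come from the slides.

```python
# Hypothetical records and field names (assumptions for illustration only).
records = [
    {"day_mins": 290.0, "eve_mins": 260.0, "churn": True},
    {"day_mins": 275.5, "eve_mins": 240.2, "churn": False},
    {"day_mins": 120.0, "eve_mins": 250.0, "churn": False},
    {"day_mins": 180.3, "eve_mins": 150.1, "churn": False},
    {"day_mins": 300.1, "eve_mins": 100.0, "churn": True},
]

# Hypothetical cutoffs for "high" usage (not given in the slides).
DAY_CUTOFF = 250.0
EVE_CUTOFF = 220.0

subset = [
    r for r in records
    if r["day_mins"] > DAY_CUTOFF and r["eve_mins"] > EVE_CUTOFF
]
subset_rate = sum(r["churn"] for r in subset) / len(subset)
print(f"{len(subset)} records selected, churn rate {subset_rate:.0%}")

# Figures quoted on the slide: subset versus entire data set.
slide_subset_rate = 192 / 441
overall_rate = 483 / 3333
ratio = slide_subset_rate / overall_rate
print(f"slide subset rate {slide_subset_rate:.1%}, overall {overall_rate:.1%}, ratio {ratio:.1f}")
```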
30
8. Binning
Binning categorizes an attribute’s numeric (or categorical) values into reduced set of classes
Makes analysis more convenient
For example, number of Day Minutes could be binned into “Low”, “Medium”, and “High” categories
For example, State values may be binned into regions
  California, Oregon, Washington, Alaska, and Hawaii are categorized as “Pacific”
Binning defined as both data preparation and data exploration activity
Various strategies exist for binning numeric variables
  One approach equalizes number of records in each class
  Another partitions values into groups, with respect to target
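Both kinds of binning mentioned above can be sketched briefly. A minimal sketch, assuming hypothetical Day Minutes values and illustrative helper names (`equal_frequency_bins`, `region`); the “Other” label is a placeholder, since the slides name only the Pacific region.

```python
def equal_frequency_bins(values, labels):
    """Label values so each class holds (nearly) the same number of records.

    Values are ranked, and the ranks are cut into len(labels) equal groups.
    Tied values may land in different classes in this simple version.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    binned = [None] * len(values)
    for rank, i in enumerate(order):
        binned[i] = labels[rank * len(labels) // len(values)]
    return binned


# Hypothetical Day Minutes values (assumption, not from the churn data set).
day_mins = [95.0, 310.2, 180.5, 142.7, 260.0, 205.3]
binned = equal_frequency_bins(day_mins, ["Low", "Medium", "High"])
print(binned)

# Categorical binning: states named on the slide map to "Pacific".
PACIFIC = {"CA", "OR", "WA", "AK", "HI"}


def region(state):
    """Return the region for a state code; "Other" is a placeholder label."""
    return "Pacific" if state in PACIFIC else "Other"


print(region("OR"), region("TX"))
```

Equal-frequency classes keep every bin populated, at the cost of cut points that depend on the particular sample.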
31
8. Binning (cont’d)
Recall those with fewer Customer Service Calls have lower churn rate
For example, bin number of Customer Service Calls into “low” and “high” categories
Figure shows churn rate for “low” class is 11.25% (Top)
However, those within “high” group have 51.69% churn rate (Bottom)
Churn rate more than 4X higher