Data Analytics Project
IOE 373: Data Processing
Final Project Fall 2014
Professor Luis Garcia-Guzman
Authors:
Benjamin Bennet
Kevin Dulic
Maria Renee Simon
INTRODUCTION
The purpose of this analysis is to model customer behaviors that affect the
likelihood of mediation/arbitration lawsuits, to determine the most significant factors
that lead to these cases, and finally to give a recommendation based on the findings.
The tasks required to complete this analysis were: create an Access database
from the Excel data; create new input variables within the data; create SQL queries
to summarize the data for analysis; cleanse the data, omitting any records
that seem invalid (no valid Customer ID or over 50 vehicle purchases); link the query
to an Excel spreadsheet using VBA; generate pivot tables to summarize the frequency
of Mediation/Arbitration cases; partition the data into training and validation sets with
a 50-50 split; perform a logistic regression on the training set to identify significant
factors; validate the results using the validation set; and interpret the results and make a
recommendation to reduce the number of mediation/arbitration lawsuits. The tools
used were Microsoft Access SQL to query the data, Excel VBA to
organize the data in the form of pivot tables, and Minitab to partition the data, perform
stepwise logistic regression, and create contingency tables.
METHODOLOGY
The purpose of this analysis is to model customer behaviors that affect the
likelihood of mediation/arbitration lawsuits. To do so, the data went through the data
mining procedure of SEMMA: Sample, Explore, Modify, Model, and Assess.
First, the given data was prepared and cleansed for it to be later loaded into an
analytical program. Utilizing Microsoft Office Access, all data was aggregated by
customer ID and new variables that seemed significant to the analysis were created.
These new variables include: recency, longevity, number of vehicles purchased,
number of passenger cars, number of purchases, number of leases, number of cases,
maximum and average case duration, number of complaints, “goodwill” indicator,
and average dealer score. All variables were grouped by customer. Because the analysis
focuses on individual customers and small businesses, it was
limited to customers that did not exceed 50 vehicle purchases. The data was
examined further to remove errors and invalid customer IDs.
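The aggregation and cleansing rules described above can be illustrated with a minimal Python sketch. This is not the Access SQL the authors ran; the row layout (a customer ID paired with a vehicle count), the sample rows, and the names `aggregate_customers` and `MAX_VEHICLES` are assumptions made for illustration only.

```python
from collections import defaultdict

MAX_VEHICLES = 50  # cutoff stated in the report


def aggregate_customers(rows):
    """Sum vehicle purchases per customer and drop invalid records.

    Each row is a (customer_id, vehicle_count) pair; this layout is
    hypothetical and stands in for the report's Access tables.
    """
    totals = defaultdict(int)
    for customer_id, vehicle_count in rows:
        # Skip rows that have no usable customer ID.
        if customer_id is None or str(customer_id).strip() == "":
            continue
        totals[str(customer_id).strip()] += vehicle_count
    # Keep only customers that do not exceed the purchase cutoff.
    return {cid: n for cid, n in totals.items() if n <= MAX_VEHICLES}


# Hypothetical sample rows, not taken from the project data.
sample_rows = [("C001", 2), ("C001", 1), ("", 4), (None, 1), ("C002", 60), ("C003", 50)]
customers = aggregate_customers(sample_rows)
print(customers)
```

In this sketch the blank and missing IDs are discarded before aggregation, and the 50-vehicle cutoff is applied to the per-customer total rather than to individual rows.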
After completing the exploration process and modifying data that would lead
to errors, the final data table was achieved. Next, the data was transferred to Microsoft
Office Excel where the information could be visually analyzed using pivot tables,
which clearly summarized the frequency of Mediation/Arbitration cases.
The last stage was the modeling of the data, in which the analytical tool
Minitab was used. The data was split 50-50 into two sets, one for training and
one for validation, and loaded from Excel into Minitab.
Afterwards, a stepwise logistic regression model was run on the training set to
determine which factors were significant for the analysis. Insignificant variables were
eliminated as a result of the stepwise regression. The logistic regression model
obtained was then run on the validation set in order to observe the accuracy of the model.
Contingency tables were created to summarize the data and draw conclusions about
the analysis.
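A 50-50 partition of this kind can be sketched in a few lines of Python. The report does not say how records were assigned to each half, so the seeded random shuffle and the name `split_half` below are assumptions, not the authors' procedure.

```python
import random


def split_half(records, seed=0):
    """Shuffle records reproducibly and split them into two halves."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split can be repeated
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]


# Hypothetical record IDs standing in for the per-customer rows.
training, validation = split_half(range(10), seed=42)
print(len(training), len(validation))
```

Fixing the seed means the same training and validation sets are produced on every run, which makes the validation results repeatable.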
RESULTS
Below is a summary of the results of the logistic regression performed in
Minitab. The continuous variables listed in Table 1 were
determined to be significant, with the exception of Num_VLK_Veh
and Avg_DealerScore.
Table 1: P-Value Summary for Variables in Logistic Model
The regression produced a regression equation given in Equation 1. The Y’
represents the linear input for P(1), where P(1) is the probability of a lawsuit, given
the continuous input variables.
Equation 1: Logistic Regression Equation with P(1) = Probability of Lawsuit
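The equation itself is not reproduced in this transcript. For orientation, the standard form of a binary logistic model is shown here, using the report's symbols P(1) and Y’; the coefficient symbols \beta_0 and \beta_i and the inputs x_i are generic placeholders, and the fitted values (Table 2) are not assumed.

```latex
P(1) = \frac{e^{Y'}}{1 + e^{Y'}},
\qquad
Y' = \beta_0 + \sum_{i} \beta_i x_i
```

Under this form, Y’ = 0 corresponds to P(1) = 0.5, and larger values of Y’ give probabilities closer to 1.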
Table 2 gives the coefficients of the linear equation Y’ in the regression equation.
Table 2: Coefficients for Continuous Variables in Equation 1
Table 3 gives the odds ratios for the continuous predictor variables.
Table 3: Calculated Odds Ratio for Continuous Predictors
Table 4 shows the results of the contingency table for predicting a lawsuit within the
training set.
Table 4: Contingency Table for Training Set
Table 5 shows the results of the contingency table for predicting a lawsuit within the
validation set, using the model fitted on the training set to predict the number of lawsuits.
Table 5: Contingency Table for Validation Set
To visualize the data from the final Access data table, a pivot table in Excel
was created using VBA. This table is given below in Table 6.
Table 6: Pivot Table showing frequency of Arbitration / Mediation Cases
CONCLUSIONS
After performing the stepwise logistic regression analysis, the significant
factors were determined to be Recency_Months, Longevity_Months, Percent_VLK,
Percent_PassCar, Num_New, Num_Purchases, Num_Complaints, and Goodwill. This
can be seen from the P-Value of each variable: variables with P-Values of
0.05 or greater can be excluded. In this case, Num_VLK_Veh and Avg_DealerScore are
the only variables in the model that are not significant, with P-Values of 0.147 and
0.082, respectively. The P-Value for each variable in the model is
summarized in Table 1.
Additional statistical information from Minitab, such as the odds ratio, gives
further insight into the data. The odds ratio measures how much a one-unit increase in
an input multiplies the odds of event 1 (a lawsuit occurs). If the odds ratio is above 1,
an increase of one unit in the input of interest increases the
likelihood of the event occurring. Similarly, if the odds ratio is below 1,
the likelihood of the event decreases. Minitab calculates the odds ratio for
each variable in the model, and these are summarized in Table 3. From this table, it can
be determined that increases in Recency_Months, Longevity_Months,
Num_VLK_Veh, and Avg_DealerScore have very little marginal effect on the
probability of a lawsuit, because the odds ratios for those variables are close to
1. A unit increase in Num_New or Num_Purchases multiplies the odds of a lawsuit
by 1.146 or 1.117, respectively (increases of about 14.6% and 11.7%). A unit increase in
Percent_VLK, Percent_PassCar, Num_Complaints, or Goodwill multiplies the odds of a
lawsuit by 0.68, 0.82, 0.76, or 0.87, respectively (decreases of about 32%, 18%, 24%, and 13%).
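As a rough illustration of how odds ratios are read, the sketch below converts the odds ratios quoted in the text into percent changes in the odds per unit increase. The function names are hypothetical; the relationship used (an odds ratio is the exponential of the logistic coefficient) is the standard one, and no coefficients from Table 2 are assumed.

```python
import math

# Odds ratios quoted in the text (Table 3 is not reproduced in the transcript).
reported_odds_ratios = {
    "Num_New": 1.146,
    "Num_Purchases": 1.117,
    "Percent_VLK": 0.68,
    "Percent_PassCar": 0.82,
    "Num_Complaints": 0.76,
    "Goodwill": 0.87,
}


def percent_change_in_odds(odds_ratio):
    """Percent change in the odds for a one-unit increase in the input."""
    return (odds_ratio - 1.0) * 100.0


def odds_ratio_from_coefficient(coefficient):
    """Standard logistic-regression relation: odds ratio = exp(coefficient)."""
    return math.exp(coefficient)


for name, ratio in reported_odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio} ({percent_change_in_odds(ratio):+.1f}%)")
```

An odds ratio describes a multiplicative change in the odds, not an additive change in the probability, so the same odds ratio moves the probability by different amounts depending on the starting point.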
Finally, Minitab provides the regression equation for the training set, as seen in
Equation 1. The probability of an event (lawsuit) is given by P(1), where Y’ is the
linear equation with input variables Recency_Months, Longevity_Months,
Num_VLK_Veh, Percent_VLK, Percent_PassCar, Num_New, Num_Purchases,
Num_Complaints, Goodwill, and Avg_DealerScore. This equation can be used to
predict the probability of a lawsuit given these inputs. In most cases, the
probability threshold for an event occurring is P = 0.5: when a
probability is calculated for given inputs, P > 0.5 indicates that the event is
likely to happen. In this analysis, if P(1) > 0.5 then it is likely that a
lawsuit will occur.
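The thresholding step can be sketched as follows. The intercept, coefficients, and input values below are hypothetical placeholders, not the fitted values from the report, and the function names are introduced only for this example.

```python
import math


def lawsuit_probability(intercept, coefficients, inputs):
    """Compute P(1) from the linear input Y' using the logistic function."""
    linear_input = intercept + sum(
        coefficients[name] * inputs[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear_input))


def predicts_lawsuit(probability, threshold=0.5):
    """Classify as a lawsuit only when the probability exceeds the threshold."""
    return probability > threshold


# Hypothetical coefficients and customer inputs, for illustration only.
example_coefficients = {"Num_New": 0.1, "Goodwill": -0.2}
example_inputs = {"Num_New": 3, "Goodwill": 1}
p = lawsuit_probability(-1.0, example_coefficients, example_inputs)
print(round(p, 4), predicts_lawsuit(p))
```

When lawsuits are rare in the data, few customers may reach P(1) > 0.5, so a fixed 0.5 cutoff can predict far fewer lawsuits than actually occur.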
Using the regression equation to calculate the probabilities for the validation
set helps determine how well the model fitted on the training set predicts the
validation set. As seen
in Tables 4 and 5, the contingency tables for the training set and validation set are
very similar. A predictive model is said to be good if the outcomes calculated from the
training-set model approximately match the actual events in the validation set. This is a
common data mining practice and should always be done to verify the model’s
effectiveness. The similarity of Tables 4 and 5 indicates that the regression
techniques used were valid. However, the number of predicted lawsuits differs
significantly from the number of actual lawsuits, indicating an inaccurate predictive
model.
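A contingency table of actual versus predicted outcomes, and the comparison of predicted and actual lawsuit counts, can be sketched as below. The outcome lists are made-up examples, not the project's data, and the function names are hypothetical.

```python
from collections import Counter


def contingency_table(actual, predicted):
    """Count (actual, predicted) pairs for a binary outcome coded 0/1."""
    counts = Counter(zip(actual, predicted))
    return {(a, p): counts[(a, p)] for a in (0, 1) for p in (0, 1)}


def accuracy(table):
    """Share of records where the prediction matches the actual outcome."""
    total = sum(table.values())
    return (table[(0, 0)] + table[(1, 1)]) / total


# Hypothetical outcomes: 1 = lawsuit, 0 = no lawsuit.
actual_outcomes = [1, 0, 0, 1, 0, 0, 0, 1]
predicted_outcomes = [0, 0, 0, 1, 0, 0, 0, 0]

table = contingency_table(actual_outcomes, predicted_outcomes)
predicted_lawsuits = table[(0, 1)] + table[(1, 1)]
actual_lawsuits = table[(1, 0)] + table[(1, 1)]
print(table, predicted_lawsuits, actual_lawsuits, accuracy(table))
```

In this made-up example the overall accuracy looks reasonable even though only one of three lawsuits is predicted, which is why the predicted and actual lawsuit counts are worth comparing directly.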
From this analysis it is determined that the most significant factors that are
positively correlated with a lawsuit are Num_New and Num_Purchases, while the
most significant factors that are negatively correlated with a lawsuit are
Percent_VLK, Percent_PassCar, Num_Complaints, and Goodwill. From a practical
perspective, this means that the least likely customer to file a lawsuit is one with low
Num_New and Num_Purchases values, and high Percent_VLK, Percent_PassCar,
Num_Complaints, and Goodwill values. It is recommended that the company focus
on doing business with this type of customer profile to limit the number of
lawsuits.
During our analysis, some unexpected results were produced by the Minitab
logistic regression. Further inspection of the data could lead to a more accurate
predictive model. The current model does not accurately predict the likelihood of a
lawsuit, possibly because of errors in our aggregate functions. The regression techniques
used in the analysis are still considered valid, and the analysis should be repeated
with the corrected data.