Predicting Cab Booking Cancellations
Project Report
Team 5: Eche Victor Ogah, Keon Grey, Deepak Vijayakumar, Lokesh Shanmuganandam
Table of Contents

Executive Summary
Problem Description
Visualization Analysis
Data Cleaning
Data Mining
Takeaway
Conclusion
Appendix A – Variable Description
Appendix B – Ada Boost R Code
Appendix C – Neural Network R Code
Appendix D – Random Forest R Code
Executive Summary

Over the past half decade, the rise of Uber and similar companies that create and operate mobile ride-sharing apps to connect riders with drivers has threatened the existence of the traditional taxi and cab industry around the world. The business model of simply connecting passengers with drivers and handling the payment from customer to driver, while taking a percentage of each fare, has been very profitable, and it threatens to send the traditional taxi and cab industry the way the dinosaurs went decades ago. For the traditional taxi and cab industry to remain profitable, with its fleets of vehicles, employees on payroll, and overhead costs, it will need to use data to better determine which bookings will be profitable.
Kaggle.com is an online platform for predictive modeling and analytics where companies and organizations around the world can sponsor data mining competitions for data miners to solve. The Indian School of Business (ISB) and Yourcabs.com, based in Bangalore, India, sponsored a contest for students to develop predictive models that classify whether a cab booking will be cancelled due to unavailability of cabs. Yourcabs.com was founded by Sree Harsha and Rajesh Kedilaya in 2011; the company operates technology that aggregates fleet owners and cabs to manage the supply and demand of cabs.
Being able to accurately classify whether cab bookings will be cancelled due to unavailability of cabs would be a great advantage to a taxi or cab company, because it could better plan how many vehicles to have on the road and direct drivers to the customers most likely to use the service. Cancelled cab bookings cost companies financially, as was evident recently when Uber was caught placing orders with its competitors and then cancelling them. A classification algorithm would help companies avoid these losses.
This report details the problem description, visualization techniques, data mining techniques, and conclusions from the analysis of the dataset.
Problem Description

The problem description comes from the Kaggle.com posting of the competition titled “Predicting cab booking cancelations”: we are given the task of creating a predictive model to classify whether new bookings will be cancelled due to unavailability of cabs.
Visualization Analysis
Figure 1- Treemap showing the Total Booking by Area ID
The treemap illustrates the total bookings by area ID at Yourcabs.com. The five areas with the highest volume of bookings are:
1. Area ID 393
2. Area ID 571
3. Area ID 293
4. Area ID 1010
5. Area ID 142
Figure 2-Treemap showing the Cancellations Based on zero Booking Date Difference and the Count of Successful booking versus Cancelled Bookings
The treemap illustrates the number of cancellations for bookings with a booking date difference of zero. There were 18,286 successful bookings compared to 2,032 cancelled bookings: a cancellation percentage of 10.00% versus a 90.00% success percentage for a zero booking date difference.
Figure 3-Treemap showing the Cancellations Based on one Booking Date Difference and the Count of Successful booking versus Cancelled Bookings
The treemap illustrates the number of cancellations for bookings with a booking date difference of one. There were 15,591 successful bookings compared to 773 cancelled bookings: a cancellation percentage of 4.72% versus a 95.28% success percentage for a one-day booking date difference.
Figure 4-Treemap showing the Cancellations Based on two Booking Date Difference and the Count of Successful booking versus Cancelled Bookings
The treemap illustrates the number of cancellations for bookings with a booking date difference of two. There were 2,049 successful bookings compared to 47 cancelled bookings: a cancellation percentage of 2.24% versus a 97.76% success percentage for a two-day booking date difference.
Figure 5-Treemap showing the Cancellations Based on three Booking Date Difference and the Count of Successful booking versus Cancelled Bookings
The treemap illustrates the number of cancellations for bookings with a booking date difference of three. There were 935 successful bookings compared to 23 cancelled bookings: a cancellation percentage of 2.40% versus a 97.60% success percentage for a three-day booking date difference.
Figure 6-Treemap showing the Cancellations Based on four Booking Date Difference and the Count of Successful booking versus Cancelled Bookings
The treemap illustrates the number of cancellations for bookings with a booking date difference of four. There were 935 successful bookings compared to 23 cancelled bookings: a cancellation percentage of 2% versus a 98% success percentage for a four-day booking date difference.
Takeaway

After analyzing the cancellations for booking date differences of zero through four and the counts of successful versus cancelled bookings, it may be concluded that the booking success rate is directly proportional to the booking date difference: the greater the booking date difference, the higher the rate of successful bookings.
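These rates can be recomputed directly from the counts quoted above; a minimal base R sketch, using only the figures cited in Figures 2 through 5:

```r
# Successful and cancelled booking counts by booking date difference (0-3),
# taken from the treemap figures above.
success   <- c(18286, 15591, 2049, 935)
cancelled <- c(2032, 773, 47, 23)
total     <- success + cancelled

# Cancellation percentage for each booking date difference.
cancel_pct <- round(100 * cancelled / total, 2)
names(cancel_pct) <- paste0("diff_", 0:3)
print(cancel_pct)  # 10.00 4.72 2.24 2.40
```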
Figure 7-Bar Chart and Pie Chart of the Overall Booking Representation
The bar chart of successful versus cancelled bookings shows 40,299 successful bookings against 3,132 cancelled bookings: a cancellation percentage of 7.21% versus a success rate of 92.79%. The pie chart shows that vehicle model 12 has the highest percentage of successful bookings at 72.4%, followed by vehicle model 1 at 6.2% and vehicle model 10 at 6.0%.
Figure 8-Bar Chart and Pie Chart of the Overall Booking Representation
The pie chart in this figure shows that vehicle model 12 has the highest percentage of cancelled bookings at 86.2%, followed by vehicle model 89 at 9.5%.
Figure 9-Total Number of Successful Bookings per Weekday
The bar chart shows the number of successful bookings throughout the week. Friday registered the highest number of successful bookings at 6,806, Saturday was second at 6,224, and Thursday third at 5,949; Sunday, Wednesday, Monday, and Tuesday followed at 5,475, 5,455, 5,199, and 5,191 respectively.
Figure 10-Total Number of Cancellations per Weekday
The bar chart shows the number of cancelled bookings throughout the week. Sunday registered the highest number of cancellations at 595, Friday came in second at 545, and Thursday third at 527; Saturday, Monday, Wednesday, and Tuesday followed at 467, 367, 353, and 278 respectively.
Takeaway

Combining the successful and cancelled bookings per weekday for Yourcabs.com gives the following results: the cancellation percentage for Monday is 6.59% (93.41% success rate), Tuesday 5.08% (94.92%), Wednesday 6.08% (93.92%), Thursday 8.14% (91.86%), Friday 7.41% (92.59%), Saturday 6.98% (93.02%), and Sunday 9.80% (90.20%). It may be concluded that Sunday has the highest cancellation rate, with Thursday second and Friday third.
Figure 11-Cross Table and Bar graph showing the number of successful booking/cancellations per month and weekday
The cross table illustrates the number of successful bookings and cancellations per month. The highlighted cells represent the successful bookings per month for the calendar year. The top three weekdays are Friday, Saturday, and Thursday, at 6,806, 6,224, and 5,949 successful bookings respectively. The bar graph illustrates the weekday success totals:

1. Friday has the highest total at 6,806, with August, September, and July having the most successful bookings on Fridays at 1,082, 756, and 714 respectively.
2. Saturday has the second highest total at 6,224, with August, July, and June having the most successful bookings on Saturdays at 926, 754, and 719 respectively.
3. Thursday has the third highest total at 5,949, with August, October, and July having the most successful bookings on Thursdays at 919, 823, and 645 respectively.
4. Sunday has the fourth highest total at 5,475, with September, June, and August having the most successful bookings on Sundays at 717, 730, and 616 respectively.
5. Wednesday has the fifth highest total at 5,455, with July, October, and May having the most successful bookings on Wednesdays at 717, 730, and 616 respectively.
6. Monday has the sixth highest total at 5,199, with July, September, and August having the most successful bookings on Mondays at 700, 699, and 592 respectively.
7. Tuesday has the seventh highest total at 5,191, with July, August, and October having the most successful bookings on Tuesdays at 700, 629, and 584 respectively.
Figure 12-Cross Table and Bar graph showing the number of successful booking/cancellations per month and weekday
The cross table illustrates the number of successful bookings and cancellations per month. The highlighted cells represent the cancelled bookings per month for the calendar year. The top three weekdays are Sunday, Friday, and Thursday, at 595, 545, and 527 cancellations respectively. The bar graph illustrates the weekday cancellation totals:

1. Sunday has the highest cancellation total at 595, with October, May, and June having the most cancellations on Sundays at 149, 110, and 80 respectively.
2. Friday has the second highest total at 545, with May, September, and October having the most cancellations on Fridays at 154, 88, and 68 respectively.
3. Thursday has the third highest total at 527, with October, May, and November having the most cancellations on Thursdays at 290, 55, and 46 respectively.
4. Saturday has the fourth highest total at 467, with May, June, and October having the most cancellations on Saturdays at 111, 90, and 75 respectively.
5. Monday has the fifth highest total at 367, with September, October, and November having the most cancellations on Mondays at 86, 86, and 53 respectively.
6. Wednesday has the sixth highest total at 353, with October, November, and August having the most cancellations on Wednesdays at 89, 68, and 55 respectively.
7. Tuesday has the seventh highest total at 278, with October, November, and September having the most cancellations on Tuesdays at 68, 63, and 44 respectively.
Data Cleaning

A record was deleted from the data because, during data exploration, we observed a booking that was made after the scheduled trip, which skewed the data. We also deleted and created variables; refer to Table 1 for the variables that were deleted and the reasons why, and to Table 2 for the variables that were created.
Variable Name Reason
ID Not relevant to analysis.
USER_ID Not relevant to analysis.
FROM_AREA_ID Not relevant to analysis; most cancellations originate at area ID 393.
TO_AREA_ID Not relevant to analysis.
TO_CITY_ID Not relevant to analysis.
FROM_DATE Replaced by new variables; refer to Table 2.
TO_DATE Not relevant to analysis.
FROM_LAT Coordinates are not relevant to availability of cabs.
FROM_LONG Coordinates are not relevant to availability of cabs.
BOOKING_CREATED Not relevant to analysis.
COST_OF_ERROR Will be calculated as a result of cancellation.
Table 1-Variables Removed
Variable Name Reason
BOOKING_MONTH To perform analysis by month.
BOOKING_DAY To perform analysis by day.
LEAD_TIME_DAYS Difference between the date of booking and the travel date.
FROM_MONTH Created from the FROM_DATE variable.
FROM_WEEK Created from the FROM_DATE variable.
FROM_TIME Created from the FROM_DATE variable.
Table 2-Variables Created
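As a rough sketch of how the variables in Table 2 can be derived from the raw timestamps, here is a base R example; the sample rows and the timestamp format "%m/%d/%Y %H:%M" are assumptions, not taken from the actual CSV:

```r
# Hypothetical sample with the two timestamp columns used to derive the
# new variables; the format string below is an assumption.
bookings <- data.frame(
  booking_created = c("1/10/2013 09:15", "1/12/2013 18:40"),
  from_date       = c("1/12/2013 08:00", "1/12/2013 19:30")
)

created <- as.POSIXct(bookings$booking_created, format = "%m/%d/%Y %H:%M", tz = "UTC")
trip    <- as.POSIXct(bookings$from_date,       format = "%m/%d/%Y %H:%M", tz = "UTC")

bookings$BOOKING_MONTH  <- months(created)                               # month of booking
bookings$BOOKING_DAY    <- weekdays(created)                             # weekday of booking
bookings$LEAD_TIME_DAYS <- as.numeric(difftime(trip, created, units = "days"))
bookings$FROM_MONTH     <- months(trip)                                  # month of trip start
bookings$FROM_WEEK      <- weekdays(trip)                                # weekday of trip start
bookings$FROM_TIME      <- format(trip, "%H:%M")                         # time of trip start
```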
Data Mining

We took the Kaggle_YourCabs_training.csv file and partitioned it 70% for training and 30% for validation. Since we were given a classification task, we used the Random Forest, Ada Boost, and Neural Network algorithms. For each algorithm we generated error matrices, risk charts, lift charts, ROC curves, precision charts, sensitivity vs. specificity charts, and precision vs. recall charts for the validation data, and error matrices and ROC curves for the training data.
ADA Boost Validation Data:
Figure 13-Ada Boost Validation Data Error Matrix
Figure 14-Ada Boost Validation Data Risk Plot
Figure 15-Ada Boost Validation Data Lift Chart
Figure 16-Ada Boost Validation Data ROC Curve
Figure 17- Ada Boost Validation Data Precision Vs. Recall Plot
Figure 18-Ada Boost Validation Data Sensitivity Vs. Specificity Plot
ADA Boost Training Data:
Figure 19-Ada Boost Training Data Error Matrix
Figure 20-Ada Boost Training Data ROC Curve
Neural Network Validation Data:
Figure 21-Neural Network Validation Data Error Matrix
Figure 22-Neural Network Validation Data Risk Plot
Figure 23-Neural Network Validation Data Lift Chart
Figure 24 - Neural Network Validation Data ROC Curve
Figure 25 - Neural Network Validation Data Precision Vs. Recall Plot
Figure 26-Neural Network Validation Data Sensitivity Vs. Specificity Plot
Neural Network Training Data:
Figure 27-Neural Network Training Data Error Matrix
Figure 28-Neural Network Training Data ROC Curve
Random Forest Validation Data:
Figure 29-Random Forest Validation Data Error Matrix
Figure 30-Random Forest Validation Data Risk Plot
Figure 31-Random Forest Validation Data Lift Chart
Figure 32-Random Forest Validation Data ROC Curve
Figure 33-Random Forest Validation Data Precision Vs. Recall Plot
Figure 34-Random Forest Validation Data Sensitivity Vs. Specificity Plot
Random Forest Training Data:
Figure 35-Random Forest Training Data Error Matrix
Figure 36-Random Forest Training Data ROC Curve
Takeaway
Classification Matrix (confusion matrix or error matrix):
This matrix summarizes the correct and incorrect classifications that a classifier produced for a given dataset. Rows and columns of the classification matrix correspond to the true and predicted classes respectively. The two diagonal cells (upper left, lower right) give the number of correct classifications, where the predicted class coincides with the actual class of the observation. The off-diagonal cells give the counts of misclassifications. The classification matrix therefore provides estimates of the true classification and misclassification rates.
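A minimal base R sketch of this computation, with made-up actual and predicted labels standing in for a model's output:

```r
# Hypothetical actual and predicted cancellation labels (1 = cancelled).
actual    <- c(0, 0, 0, 0, 1, 1, 0, 1, 0, 0)
predicted <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0)

# Rows are the true classes, columns the predicted classes.
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

# Overall error rate: off-diagonal counts over the total.
err <- (sum(cm) - sum(diag(cm))) / sum(cm)
print(err)  # 0.2
```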
Lift Chart:
A Lift Chart graphically represents the improvement that a mining model provides when
compared against a random guess, and measures the change in terms of a lift score. By
comparing the lift scores for various portions of your data set and for different models, you can
determine which model is best, and which percentage of the cases in the data set would benefit
from applying the model’s predictions.
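A rough sketch of a lift computation under this definition, with made-up scores and labels: rank the cases by predicted score, then compare the cancellation rate in the top portion against the overall rate.

```r
# Hypothetical predicted cancellation probabilities and true labels.
score <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)
label <- c(1, 1, 0, 1, 0, 0, 0, 0, 0, 0)

# Lift in the top 30% of cases ranked by score, relative to a random guess.
ord     <- order(score, decreasing = TRUE)
top     <- label[ord][1:3]
lift_30 <- mean(top) / mean(label)
print(lift_30)
```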
ROC Curve (Receiver Operating Characteristic Curve):
A more popular method for plotting the two measures is the ROC curve, which plots the pairs {sensitivity, 1 - specificity} as the cutoff value increases from 0 to 1.
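A base R sketch of those pairs at a few cutoffs, again with made-up scores and labels:

```r
# Hypothetical predicted probabilities and true labels (1 = cancelled).
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
label <- c(1, 1, 0, 1, 0, 1, 0, 0)

# For each cutoff, classify score >= cutoff as positive, then compute
# sensitivity (true positive rate) and 1 - specificity (false positive rate).
roc_point <- function(cutoff) {
  pred <- as.integer(score >= cutoff)
  sens <- sum(pred == 1 & label == 1) / sum(label == 1)
  fpr  <- sum(pred == 1 & label == 0) / sum(label == 0)
  c(cutoff = cutoff, sensitivity = sens, one_minus_specificity = fpr)
}

print(t(sapply(c(0.25, 0.5, 0.75), roc_point)))
```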
Conclusion

The weekday analysis for Yourcabs.com gives the following results: the cancellation percentage for Monday is 6.59% (93.41% success rate), Tuesday 5.08% (94.92%), Wednesday 6.08% (93.92%), Thursday 8.14% (91.86%), Friday 7.41% (92.59%), Saturday 6.98% (93.02%), and Sunday 9.80% (90.20%). Sunday therefore has the highest cancellation rate, with Thursday second and Friday third.
After analyzing the cancellations for booking date differences of zero through four and the counts of successful versus cancelled bookings, it may be concluded that the booking success rate is directly proportional to the booking date difference: the greater the booking date difference, the higher the rate of successful bookings.
To handle the unbalanced nature of the data, we used a stratified sampling technique. Once the data are split into training, validation, and test sets, an already skewed dataset is very likely to become even more unbalanced in at least one of the three resulting sets. Stratified sampling overcomes this by ensuring that the training, validation, and test sets are well balanced. Team 5 used the sampsize parameter of the random forest algorithm to achieve this. Because we did not use stratified sampling with the Ada Boost and neural network algorithms, their error values were as low as 0.06 and 0.068 respectively. We concluded that the best algorithm was the random forest model, with an error value of 0.21.
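The idea can be sketched in base R with a toy class distribution (in the randomForest package the same effect is obtained through the strata and sampsize arguments); the class counts below are made up for illustration:

```r
set.seed(42)

# Hypothetical unbalanced target: roughly 7% cancellations, as in the report.
y <- c(rep(0, 930), rep(1, 70))

# Draw the same number of cases from each class so the sample is balanced.
n_per_class <- 50
idx <- unlist(lapply(split(seq_along(y), y), function(i) sample(i, n_per_class)))

table(y[idx])  # 50 of each class
```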
Appendix A – Variable Description

This table contains the descriptions of the variables in the Kaggle_YourCabs_training.csv file as listed on the competition page.
Data Field Description
id booking ID
user_id the ID of the customer (based on mobile number)
vehicle_model_id vehicle model type
package_id type of package (1=4hrs & 40kms, 2=8hrs & 80kms, 3=6hrs & 60kms, 4= 10hrs & 100kms, 5=5hrs & 50kms, 6=3hrs & 30kms, 7=12hrs & 120kms)
travel_type_id type of travel (1=long distance, 2= point to point, 3= hourly rental)
from_area_id unique identifier of area. Applicable only for point-to-point travel and packages
to_area_id unique identifier of area. Applicable only for point-to-point travel
from_city_id unique identifier of city
to_city_id unique identifier of city (only for intercity)
from_date time stamp of requested trip start
to_date time stamp of trip end
online_booking if booking was done on desktop website
mobile_site_booking if booking was done on mobile website
booking_created time stamp of booking
from_lat latitude of from area
from_long longitude of from area
to_lat latitude of to area
to_long longitude of to area
car_cancelation (available only in training data) - whether the booking was cancelled (1) or not (0) due to unavailability of a car
cost_of_error (available only in training data) - the cost incurred if the booking is misclassified. For an un-cancelled booking, the cost of misclassification is 1. For a cancelled booking, the cost is a function of the cancellation time relative to the trip start time (see Evaluation Page)
Table 3-Description of variables
Appendix B – Ada Boost R Code
# Load the data.
crs$dataset <- read.csv("file:///M:/Lab & Assginments/ADS/Final Project/Transformed Training Data with binned columns.csv", na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")
#============================================================
# Rattle timestamp: 2015-05-01 20:39:14 i386-w64-mingw32
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 6514 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 6515 observations
# The following variable selections have been noted.
crs$input <- c("user_id", "vehicle_model_id_red", "package_id", "from_area_id",
"to_area_id", "from_city_id", "to_city_id", "from_date",
"to_date", "online_booking", "mobile_site_booking", "booking_created",
"from_lat", "from_long", "to_lat", "to_long",
"Car_Cancellation", "Cost_of_error", "from_month", "from_weekday",
"from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("user_id", "vehicle_model_id_red", "from_date", "online_booking",
"mobile_site_booking", "booking_created", "Car_Cancellation", "Cost_of_error",
"from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_area_id", "to_area_id", "from_city_id",
"to_city_id", "to_date", "from_lat", "from_long",
"to_lat", "to_long", "from_month", "from_weekday",
"booking_month", "booking_weekday")
crs$target <- "travel_type_id"
crs$risk <- NULL
crs$ident <- "id"
crs$ignore <- NULL
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2015-05-01 20:39:37 i386-w64-mingw32
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 6514 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 6515 observations
# The following variable selections have been noted.
crs$input <- c("vehicle_model_id_red", "package_id", "travel_type_id", "from_city_id",
"online_booking", "mobile_site_booking", "from_month", "from_weekday",
"from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("vehicle_model_id_red", "travel_type_id", "online_booking", "mobile_site_booking",
"from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_city_id", "from_month", "from_weekday",
"booking_month", "booking_weekday")
crs$target <- "Car_Cancellation"
crs$risk <- NULL
crs$ident <- NULL
crs$ignore <- c("id", "user_id", "from_area_id", "to_area_id", "to_city_id", "from_date", "to_date",
"booking_created", "from_lat", "from_long", "to_lat", "to_long", "Cost_of_error")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2015-05-01 20:39:55 i386-w64-mingw32
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 6514 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 6515 observations
# The following variable selections have been noted.
crs$input <- c("vehicle_model_id_red", "package_id", "travel_type_id", "from_city_id",
"online_booking", "mobile_site_booking", "from_month", "from_weekday",
"from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("vehicle_model_id_red", "travel_type_id", "online_booking", "mobile_site_booking",
"from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_city_id", "from_month", "from_weekday",
"booking_month", "booking_weekday")
crs$target <- "Car_Cancellation"
crs$risk <- NULL
crs$ident <- NULL
crs$ignore <- c("id", "user_id", "from_area_id", "to_area_id", "to_city_id", "from_date", "to_date",
"booking_created", "from_lat", "from_long", "to_lat", "to_long", "Cost_of_error")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2015-05-01 20:41:47 i386-w64-mingw32
# Ada Boost
# The `ada' package implements the boost algorithm.
require(ada, quietly=TRUE)
# Build the Ada Boost model.
set.seed(crv$seed)
crs$ada <- ada(Car_Cancellation ~ .,
data=crs$dataset[crs$train,c(crs$input, crs$target)],
control=rpart.control(maxdepth=30,
cp=0.010000,
minsplit=20,
xval=10),
iter=60)
# Print the results of the modelling.
print(crs$ada)
round(crs$ada$model$errs[crs$ada$iter,], 2)
cat('Variables actually used in tree construction:\n')
print(sort(names(listAdaVarsUsed(crs$ada))))
cat('\nFrequency of variables actually used:\n')
print(listAdaVarsUsed(crs$ada))
# Time taken: 54.03 secs
#============================================================
# Rattle timestamp: 2015-05-01 20:42:47 i386-w64-mingw32
# Evaluate model performance.
# Generate an Error Matrix for the Ada Boost model.
# Obtain the response from the Ada Boost model.
crs$pr <- predict(crs$ada, newdata=crs$dataset[crs$validate, c(crs$input, crs$target)])
# Generate the confusion matrix showing counts.
table(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr,
dnn=c("Actual", "Predicted"))
# Generate the confusion matrix showing proportions.
pcme <- function(actual, cl)
{
x <- table(actual, cl)
tbl <- cbind(round(x/length(actual), 2),
Error=round(c(x[1,2]/sum(x[1,]),
x[2,1]/sum(x[2,])), 2))
names(attr(tbl, "dimnames")) <- c("Actual", "Predicted")
return(tbl)
};
pcme(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr)
# Calculate the overall error percentage.
overall <- function(x)
{
if (nrow(x) == 2)
cat((x[1,2] + x[2,1]) / sum(x))
else
cat(1 - (x[1,rownames(x)]) / sum(x))
}
overall(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
dnn=c("Predicted", "Actual")))
# Calculate the averaged class error percentage.
avgerr <- function(x)
cat(mean(c(x[1,2], x[2,1]) / apply(x, 1, sum)))
avgerr(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
dnn=c("Predicted", "Actual")))
#============================================================
# Rattle timestamp: 2015-05-01 20:43:00 i386-w64-mingw32
# Ada Boost
# The `ada' package implements the boost algorithm.
require(ada, quietly=TRUE)
# Build the Ada Boost model.
set.seed(crv$seed)
crs$ada <- ada(Car_Cancellation ~ .,
data=crs$dataset[crs$train,c(crs$input, crs$target)],
control=rpart.control(maxdepth=30,
cp=0.010000,
minsplit=20,
xval=10),
iter=50)
# Print the results of the modelling.
print(crs$ada)
round(crs$ada$model$errs[crs$ada$iter,], 2)
cat('Variables actually used in tree construction:\n')
print(sort(names(listAdaVarsUsed(crs$ada))))
cat('\nFrequency of variables actually used:\n')
print(listAdaVarsUsed(crs$ada))
# Time taken: 43.89 secs
#============================================================
# Rattle timestamp: 2015-05-01 20:43:49 i386-w64-mingw32
# Evaluate model performance.
# Generate an Error Matrix for the Ada Boost model.
# Obtain the response from the Ada Boost model.
crs$pr <- predict(crs$ada, newdata=crs$dataset[crs$validate, c(crs$input, crs$target)])
# Generate the confusion matrix showing counts.
table(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr,
dnn=c("Actual", "Predicted"))
# Generate the confusion matrix showing proportions.
pcme <- function(actual, cl)
{
x <- table(actual, cl)
tbl <- cbind(round(x/length(actual), 2),
Error=round(c(x[1,2]/sum(x[1,]),
x[2,1]/sum(x[2,])), 2))
names(attr(tbl, "dimnames")) <- c("Actual", "Predicted")
return(tbl)
};
pcme(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr)
# Calculate the overall error percentage.
overall <- function(x)
{
if (nrow(x) == 2)
cat((x[1,2] + x[2,1]) / sum(x))
else
cat(1 - (x[1,rownames(x)]) / sum(x))
}
overall(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
dnn=c("Predicted", "Actual")))
# Calculate the averaged class error percentage.
avgerr <- function(x)
cat(mean(c(x[1,2], x[2,1]) / apply(x, 1, sum)))
avgerr(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
dnn=c("Predicted", "Actual")))
#============================================================
# Rattle timestamp: 2015-05-01 20:44:02 i386-w64-mingw32
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.3*crs$nobs) # 13029 observations
crs$test <- NULL
# The following variable selections have been noted.
crs$input <- c("vehicle_model_id_red", "package_id", "travel_type_id", "from_city_id",
"online_booking", "mobile_site_booking", "from_month", "from_weekday",
"from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("vehicle_model_id_red", "travel_type_id", "online_booking", "mobile_site_booking",
"from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_city_id", "from_month", "from_weekday",
"booking_month", "booking_weekday")
crs$target <- "Car_Cancellation"
crs$risk <- NULL
crs$ident <- NULL
crs$ignore <- c("id", "user_id", "from_area_id", "to_area_id", "to_city_id", "from_date", "to_date",
"booking_created", "from_lat", "from_long", "to_lat", "to_long", "Cost_of_error")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2015-05-01 20:44:10 i386-w64-mingw32
# Ada Boost
# The `ada' package implements the boost algorithm.
require(ada, quietly=TRUE)
# Build the Ada Boost model.
set.seed(crv$seed)
crs$ada <- ada(Car_Cancellation ~ .,
data=crs$dataset[crs$train,c(crs$input, crs$target)],
control=rpart.control(maxdepth=30,
cp=0.010000,
minsplit=20,
xval=10),
iter=50)
# Print the results of the modelling.
print(crs$ada)
round(crs$ada$model$errs[crs$ada$iter,], 2)
cat('Variables actually used in tree construction:\n')
print(sort(names(listAdaVarsUsed(crs$ada))))
cat('\nFrequency of variables actually used:\n')
print(listAdaVarsUsed(crs$ada))
# Time taken: 43.95 secs
#============================================================
# Rattle timestamp: 2015-05-01 20:45:06 i386-w64-mingw32
# Evaluate model performance.
# Generate an Error Matrix for the Ada Boost model.
# Obtain the response from the Ada Boost model.
crs$pr <- predict(crs$ada, newdata=crs$dataset[crs$validate, c(crs$input, crs$target)])
# Generate the confusion matrix showing counts.
table(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr,
dnn=c("Actual", "Predicted"))
# Generate the confusion matrix showing proportions.
pcme <- function(actual, cl)
{
x <- table(actual, cl)
tbl <- cbind(round(x/length(actual), 2),
Error=round(c(x[1,2]/sum(x[1,]),
x[2,1]/sum(x[2,])), 2))
names(attr(tbl, "dimnames")) <- c("Actual", "Predicted")
return(tbl)
};
pcme(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr)
# Calculate the overall error percentage.
overall <- function(x)
{
if (nrow(x) == 2)
cat((x[1,2] + x[2,1]) / sum(x))
else
cat(1 - (x[1,rownames(x)]) / sum(x))
}
overall(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
dnn=c("Predicted", "Actual")))
# Calculate the averaged class error percentage.
avgerr <- function(x)
cat(mean(c(x[1,2], x[2,1]) / apply(x, 1, sum)))
avgerr(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
dnn=c("Predicted", "Actual")))
#============================================================
# Rattle timestamp: 2015-05-01 20:46:09 i386-w64-mingw32
# Score a dataset.
# Obtain probability scores for the Ada Boost model on Transformed Training Data with binned columns.csv [validate].
crs$pr <- predict(crs$ada, newdata=crs$dataset[crs$validate, c(crs$input)], type="prob")[,2]
# Extract the relevant variables from the dataset.
sdata <- crs$dataset[crs$validate,]
# Output the combined data.
write.csv(cbind(sdata, crs$pr),
          file="M:/Lab & Assginments/ADS/Final Project/AdaBoost/Ada Boost_Validation_Score.csv",
          row.names=FALSE)
Appendix C – Neural Network R Code

# Load the data.
crs$dataset <- read.csv("file:///M:/Lab & Assginments/ADS/Final Project/Transformed Training Data with binned columns.csv",
                        na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")

#============================================================
# Rattle timestamp: 2015-05-01 20:48:43 i386-w64-mingw32

# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 6514 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 6515 observations

# The following variable selections have been noted.
crs$input <- c("user_id", "vehicle_model_id_red", "package_id", "from_area_id",
               "to_area_id", "from_city_id", "to_city_id", "from_date",
               "to_date", "online_booking", "mobile_site_booking", "booking_created",
               "from_lat", "from_long", "to_lat", "to_long",
               "Car_Cancellation", "Cost_of_error", "from_month", "from_weekday",
               "from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("user_id", "vehicle_model_id_red", "from_date", "online_booking",
                 "mobile_site_booking", "booking_created", "Car_Cancellation", "Cost_of_error",
                 "from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_area_id", "to_area_id", "from_city_id",
                   "to_city_id", "to_date", "from_lat", "from_long",
                   "to_lat", "to_long", "from_month", "from_weekday",
                   "booking_month", "booking_weekday")
crs$target <- "travel_type_id"
crs$risk <- NULL
crs$ident <- "id"
crs$ignore <- NULL
crs$weights <- NULL

#============================================================
# Rattle timestamp: 2015-05-01 20:51:27 i386-w64-mingw32
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 6514 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 6515 observations

# The following variable selections have been noted.
crs$input <- c("vehicle_model_id_red", "package_id", "travel_type_id", "from_city_id",
               "online_booking", "mobile_site_booking", "from_month", "from_weekday",
               "from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("vehicle_model_id_red", "travel_type_id", "online_booking", "mobile_site_booking",
                 "from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_city_id", "from_month", "from_weekday",
                   "booking_month", "booking_weekday")
crs$target <- "Car_Cancellation"
crs$risk <- NULL
crs$ident <- NULL
crs$ignore <- c("id", "user_id", "from_area_id", "to_area_id", "to_city_id", "from_date", "to_date",
                "booking_created", "from_lat", "from_long", "to_lat", "to_long", "Cost_of_error")
crs$weights <- NULL

#============================================================
# Rattle timestamp: 2015-05-01 20:51:36 i386-w64-mingw32

# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.3*crs$nobs) # 13029 observations
crs$test <- NULL

# The following variable selections have been noted.
crs$input <- c("vehicle_model_id_red", "package_id", "travel_type_id", "from_city_id",
               "online_booking", "mobile_site_booking", "from_month", "from_weekday",
               "from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("vehicle_model_id_red", "travel_type_id", "online_booking", "mobile_site_booking",
                 "from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_city_id", "from_month", "from_weekday",
                   "booking_month", "booking_weekday")
crs$target <- "Car_Cancellation"
crs$risk <- NULL
crs$ident <- NULL
crs$ignore <- c("id", "user_id", "from_area_id", "to_area_id", "to_city_id", "from_date", "to_date",
                "booking_created", "from_lat", "from_long", "to_lat", "to_long", "Cost_of_error")
crs$weights <- NULL

#============================================================
# Rattle timestamp: 2015-05-01 20:51:40 i386-w64-mingw32

# Neural Network
# Build a neural network model using the nnet package.
require(nnet, quietly=TRUE)

# Build the NNet model.
set.seed(199)
crs$nnet <- nnet(as.factor(Car_Cancellation) ~ .,
                 data=crs$dataset[crs$sample, c(crs$input, crs$target)],
                 size=10, skip=TRUE, MaxNWts=10000, trace=FALSE, maxit=100)

# Print the results of the modelling.
cat(sprintf("A %s network with %d weights.\n",
            paste(crs$nnet$n, collapse="-"),
            length(crs$nnet$wts)))
cat(sprintf("Inputs: %s.\n",
            paste(crs$nnet$coefnames, collapse=", ")))
cat(sprintf("Output: %s.\n",
            names(attr(crs$nnet$terms, "dataClasses"))[1]))
cat(sprintf("Sum of Squares Residuals: %.4f.\n",
            sum(residuals(crs$nnet) ^ 2)))
cat("\n")
print(summary(crs$nnet))
cat('\n')

# Time taken: 27.31 secs
#============================================================
# Rattle timestamp: 2015-05-01 20:54:16 i386-w64-mingw32

# Evaluate model performance.
# Generate an Error Matrix for the Neural Net model.
# Obtain the response from the Neural Net model.
crs$pr <- predict(crs$nnet, newdata=crs$dataset[crs$validate, c(crs$input, crs$target)], type="class")

# Generate the confusion matrix showing counts.
table(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr,
      dnn=c("Actual", "Predicted"))

# Generate the confusion matrix showing proportions.
pcme <- function(actual, cl)
{
  x <- table(actual, cl)
  tbl <- cbind(round(x/length(actual), 2),
               Error=round(c(x[1,2]/sum(x[1,]),
                             x[2,1]/sum(x[2,])), 2))
  names(attr(tbl, "dimnames")) <- c("Actual", "Predicted")
  return(tbl)
};
pcme(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr)

# Calculate the overall error percentage.
overall <- function(x)
{
  if (nrow(x) == 2)
    cat((x[1,2] + x[2,1]) / sum(x))
  else
    cat(1 - (x[1,rownames(x)]) / sum(x))
}
overall(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
              dnn=c("Predicted", "Actual")))

# Calculate the averaged class error percentage.
avgerr <- function(x)
  cat(mean(c(x[1,2], x[2,1]) / apply(x, 1, sum)))
avgerr(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
             dnn=c("Predicted", "Actual")))
#============================================================
# Rattle timestamp: 2015-05-01 21:02:49 i386-w64-mingw32

# Score a dataset.
# Obtain probability scores for the Neural Net model on Transformed Training Data with binned columns.csv [validate].
crs$pr <- predict(crs$nnet, newdata=crs$dataset[crs$validate, c(crs$input)])

# Extract the relevant variables from the dataset.
sdata <- crs$dataset[crs$validate,]

# Output the combined data.
write.csv(cbind(sdata, crs$pr),
          file="C:/Users/Deepak/Documents/Neural Network_Validation_Score.csv",
          row.names=FALSE)

#============================================================
# Rattle timestamp: 2015-05-01 21:03:18 i386-w64-mingw32

# Score a dataset.
Appendix D – Random Forest R Code

# Load the data.
crs$dataset <- read.csv("file:///M:/Lab & Assginments/ADS/Final Project/Transformed Training Data with binned columns.csv",
                        na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")
#============================================================
# Rattle timestamp: 2015-05-01 20:30:41 i386-w64-mingw32
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 6514 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 6515 observations
# The following variable selections have been noted.
crs$input <- c("user_id", "vehicle_model_id_red", "package_id", "from_area_id",
"to_area_id", "from_city_id", "to_city_id", "from_date",
"to_date", "online_booking", "mobile_site_booking", "booking_created",
"from_lat", "from_long", "to_lat", "to_long",
"Car_Cancellation", "Cost_of_error", "from_month", "from_weekday",
"from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("user_id", "vehicle_model_id_red", "from_date", "online_booking",
"mobile_site_booking", "booking_created", "Car_Cancellation", "Cost_of_error",
"from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_area_id", "to_area_id", "from_city_id",
"to_city_id", "to_date", "from_lat", "from_long",
"to_lat", "to_long", "from_month", "from_weekday",
"booking_month", "booking_weekday")
crs$target <- "travel_type_id"
crs$risk <- NULL
crs$ident <- "id"
crs$ignore <- NULL
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2015-05-01 20:31:08 i386-w64-mingw32
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 6514 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 6515 observations
# The following variable selections have been noted.
crs$input <- c("vehicle_model_id_red", "package_id", "travel_type_id", "from_city_id",
"online_booking", "mobile_site_booking", "from_month", "from_weekday",
"from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("vehicle_model_id_red", "travel_type_id", "online_booking", "mobile_site_booking",
"from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_city_id", "from_month", "from_weekday",
"booking_month", "booking_weekday")
crs$target <- "Car_Cancellation"
crs$risk <- NULL
crs$ident <- NULL
crs$ignore <- c("id", "user_id", "from_area_id", "to_area_id", "to_city_id", "from_date", "to_date",
"booking_created", "from_lat", "from_long", "to_lat", "to_long", "Cost_of_error")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2015-05-01 20:31:32 i386-w64-mingw32
# Random Forest
# The 'randomForest' package provides the 'randomForest' function.
require(randomForest, quietly=TRUE)
# Build the Random Forest model.
set.seed(crv$seed)
crs$rf <- randomForest(as.factor(Car_Cancellation) ~ .,
data=crs$dataset[crs$sample,c(crs$input, crs$target)],
ntree=130,
mtry=4,
sampsize=c(1200,1200),
importance=TRUE,
na.action=na.roughfix,
replace=FALSE)
# Generate textual output of 'Random Forest' model.
crs$rf
# The `pROC' package implements various AUC functions.
require(pROC, quietly=TRUE)
# Calculate the Area Under the Curve (AUC).
roc(crs$rf$y, as.numeric(crs$rf$predicted))
# Calculate the AUC Confidence Interval.
ci.auc(crs$rf$y, as.numeric(crs$rf$predicted))
# List the importance of the variables.
rn <- round(importance(crs$rf), 2)
rn[order(rn[,3], decreasing=TRUE),]
# Time taken: 12.28 secs
#============================================================
# Rattle timestamp: 2015-05-01 20:31:48 i386-w64-mingw32
# Evaluate model performance.
# Generate an Error Matrix for the Random Forest model.
# Obtain the response from the Random Forest model.
crs$pr <- predict(crs$rf, newdata=na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)]))
# Generate the confusion matrix showing counts.
table(na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation, crs$pr,
dnn=c("Actual", "Predicted"))
# Generate the confusion matrix showing proportions.
pcme <- function(actual, cl)
{
x <- table(actual, cl)
tbl <- cbind(round(x/length(actual), 2),
Error=round(c(x[1,2]/sum(x[1,]),
x[2,1]/sum(x[2,])), 2))
names(attr(tbl, "dimnames")) <- c("Actual", "Predicted")
return(tbl)
};
pcme(na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation, crs$pr)
# Calculate the overall error percentage.
overall <- function(x)
{
if (nrow(x) == 2)
cat((x[1,2] + x[2,1]) / sum(x))
else
cat(1 - (x[1,rownames(x)]) / sum(x))
}
overall(table(crs$pr, na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation,
dnn=c("Predicted", "Actual")))
# Calculate the averaged class error percentage.
avgerr <- function(x)
cat(mean(c(x[1,2], x[2,1]) / apply(x, 1, sum)))
avgerr(table(crs$pr, na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation,
dnn=c("Predicted", "Actual")))
#============================================================
# Rattle timestamp: 2015-05-01 20:31:58 i386-w64-mingw32
# Score a dataset.
# Read a dataset from file for testing the model.
crs$testset <- read.csv("C:/Users/Deepak/Documents", na.strings=c(".", "NA", "", "?"), header=TRUE,
sep=",", encoding="UTF-8", strip.white=TRUE)
#============================================================
# Rattle timestamp: 2015-05-01 20:32:28 i386-w64-mingw32
# Score a dataset.
# Obtain probability scores for the Random Forest model on Transformed Training Data with binned columns.csv [validate].
crs$pr <- predict(crs$rf, newdata=na.omit(crs$dataset[crs$validate, c(crs$input)]), type="prob")[,2]
# Extract the relevant variables from the dataset.
sdata <- crs$dataset[crs$validate,]
# Output the combined data.
write.csv(cbind(sdata, crs$pr),
          file="M:/Lab & Assginments/ADS/Final Project/Random Forest/Random_Forest_Stratified_Sampling_Valdation_Score.csv",
          row.names=FALSE)
#============================================================
# Rattle timestamp: 2015-05-01 20:33:49 i386-w64-mingw32
# Random Forest
# The 'randomForest' package provides the 'randomForest' function.
require(randomForest, quietly=TRUE)
# Build the Random Forest model.
set.seed(crv$seed)
crs$rf <- randomForest(as.factor(Car_Cancellation) ~ .,
data=crs$dataset[crs$sample,c(crs$input, crs$target)],
ntree=130,
mtry=4,
importance=TRUE,
na.action=na.roughfix,
replace=FALSE)
# Generate textual output of 'Random Forest' model.
crs$rf
# The `pROC' package implements various AUC functions.
require(pROC, quietly=TRUE)
# Calculate the Area Under the Curve (AUC).
roc(crs$rf$y, as.numeric(crs$rf$predicted))
# Calculate the AUC Confidence Interval.
ci.auc(crs$rf$y, as.numeric(crs$rf$predicted))
# List the importance of the variables.
rn <- round(importance(crs$rf), 2)
rn[order(rn[,3], decreasing=TRUE),]
# Time taken: 19.51 secs
#============================================================
# Rattle timestamp: 2015-05-01 20:34:57 i386-w64-mingw32
# Evaluate model performance.
# Generate an Error Matrix for the Random Forest model.
# Obtain the response from the Random Forest model.
crs$pr <- predict(crs$rf, newdata=na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)]))
# Generate the confusion matrix showing counts.
table(na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation, crs$pr,
dnn=c("Actual", "Predicted"))
# Generate the confusion matrix showing proportions.
pcme <- function(actual, cl)
{
x <- table(actual, cl)
tbl <- cbind(round(x/length(actual), 2),
Error=round(c(x[1,2]/sum(x[1,]),
x[2,1]/sum(x[2,])), 2))
names(attr(tbl, "dimnames")) <- c("Actual", "Predicted")
return(tbl)
};
pcme(na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation, crs$pr)
# Calculate the overall error percentage.
overall <- function(x)
{
if (nrow(x) == 2)
cat((x[1,2] + x[2,1]) / sum(x))
else
cat(1 - (x[1,rownames(x)]) / sum(x))
}
overall(table(crs$pr, na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation,
dnn=c("Predicted", "Actual")))
# Calculate the averaged class error percentage.
avgerr <- function(x)
cat(mean(c(x[1,2], x[2,1]) / apply(x, 1, sum)))
avgerr(table(crs$pr, na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation,
dnn=c("Predicted", "Actual")))
#============================================================
# Rattle timestamp: 2015-05-01 20:35:05 i386-w64-mingw32
# Score a dataset.
# Obtain probability scores for the Random Forest model on Transformed Training Data with binned columns.csv [validate].
crs$pr <- predict(crs$rf, newdata=na.omit(crs$dataset[crs$validate, c(crs$input)]), type="prob")[,2]
# Extract the relevant variables from the dataset.
sdata <- crs$dataset[crs$validate,]
# Output the combined data.
write.csv(cbind(sdata, crs$pr),
          file="M:/Lab & Assginments/ADS/Final Project/Random Forest/Random_Forest_Unsampled_Data_Validation_Score.csv",
          row.names=FALSE)