Predicting Cab Booking Cancellations
Project Report
Team 5: Eche Victor Ogah, Keon Grey, Deepak Vijayakumar, Lokesh Shanmuganandam
Table of Contents

Executive Summary
Problem Description
Visualization Analysis
Data Cleaning
Data Mining
Takeaway
Conclusion
Appendix A – Variable Description
Appendix B – Ada Boost R Code
Appendix C – Neural Network R Code
Appendix D – Random Forest R Code
Executive Summary

Over the past half decade, the rise of Uber and similar companies that create and operate mobile ride-sharing apps to connect riders with drivers has threatened the existence of the traditional taxi and cab industry around the world. The business model of simply connecting passengers with drivers and handling the payment from customer to driver, while taking a percentage of each fare, has been very profitable, and it threatens to send the traditional taxi and cab industry the way the dinosaurs went decades ago. For the traditional taxi and cab industry to remain profitable, with its fleets of vehicles, employees on payroll, and overhead costs, it will need to use data to better determine which bookings will be profitable.
Kaggle.com is an online platform for predictive modeling and analytics where companies and organizations around the world can sponsor data mining competitions for data miners to solve. The Indian School of Business (ISB) and Yourcabs.com, based in Bangalore, India, sponsored a contest for students to develop predictive models that classify whether a cab booking will be cancelled due to unavailability of cabs. Yourcabs.com was founded by Sree Harsha and Rajesh Kedilaya in 2011; the company operates technology that aggregates fleet owners and cabs to manage the supply and demand of cabs.
Being able to accurately classify whether cab bookings will be cancelled due to unavailability of cabs would be a great advantage to a taxi or cab company, because it could better plan how many vehicles to have on the road and direct drivers to the customers most likely to use the service. Cancelled cab bookings cost companies financially, as was evident recently when Uber was caught placing orders with its competitors and then cancelling them. A classification algorithm would help companies avoid these losses.
This report details the problem description, visualization techniques, data mining techniques, and conclusions from the analysis of the dataset.
Problem Description

The problem description comes from the Kaggle.com posting of the competition titled “Predicting cab booking cancelations”: we are given the task of creating a predictive model to classify whether new bookings will be cancelled due to unavailability of cabs.
Visualization Analysis
Figure 1- Treemap showing the Total Booking by Area ID
The treemap illustrates the total bookings by area ID at Yourcabs.com. The five areas with the highest volume of bookings are:
1. Area ID 393
2. Area ID 571
3. Area ID 293
4. Area ID 1010
5. Area ID 142
Figure 2-Treemap showing the Cancellations Based on zero Booking Date Difference and the Count of Successful booking versus Cancelled Bookings
The treemap illustrates the number of cancellations for bookings with a booking date difference of zero. There were 18,286 successful bookings compared to 2,032 cancelled bookings: a cancellation percentage of 10.00% versus a 90.00% success percentage for a zero booking date difference.
Figure 3-Treemap showing the Cancellations Based on one Booking Date Difference and the Count of Successful booking versus Cancelled Bookings
The treemap illustrates the number of cancellations for bookings with a booking date difference of one. There were 15,591 successful bookings compared to 773 cancelled bookings: a cancellation percentage of 4.72% versus a 95.28% success percentage for a one-day booking date difference.
Figure 4-Treemap showing the Cancellations Based on two Booking Date Difference and the Count of Successful booking versus Cancelled Bookings
The treemap illustrates the number of cancellations for bookings with a booking date difference of two. There were 2,049 successful bookings compared to 47 cancelled bookings: a cancellation percentage of 2.24% versus a 97.76% success percentage for a two-day booking date difference.
Figure 5-Treemap showing the Cancellations Based on three Booking Date Difference and the Count of Successful booking versus Cancelled Bookings
The treemap illustrates the number of cancellations for bookings with a booking date difference of three. There were 935 successful bookings compared to 23 cancelled bookings: a cancellation percentage of 2.40% versus a 97.60% success percentage for a three-day booking date difference.
Figure 6-Treemap showing the Cancellations Based on four Booking Date Difference and the Count of Successful booking versus Cancelled Bookings
The treemap illustrates the number of cancellations for bookings with a booking date difference of four. There were 935 successful bookings compared to 23 cancelled bookings: a cancellation percentage of 2% versus a 98% success percentage for a four-day booking date difference.
Takeaway

After analyzing the cancellations for booking date differences of zero through four and the counts of successful versus cancelled bookings, it may be concluded that the booking success rate is directly proportional to the booking date difference: the greater the booking date difference, the higher the rate of successful bookings.
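These rates can be recomputed directly from the counts quoted above; a minimal base R sketch, using only the figures cited in Figures 2 through 5:

```r
# Successful and cancelled booking counts by booking date difference (0-3),
# taken from the treemap figures above.
success   <- c(18286, 15591, 2049, 935)
cancelled <- c(2032, 773, 47, 23)
total     <- success + cancelled

# Cancellation percentage for each booking date difference.
cancel_pct <- round(100 * cancelled / total, 2)
names(cancel_pct) <- paste0("diff_", 0:3)
print(cancel_pct)  # 10.00 4.72 2.24 2.40
```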
Figure 7-Bar Chart and Pie Chart of the Overall Booking Representation
The bar chart of successful versus cancelled bookings shows 40,299 successful bookings against 3,132 cancelled bookings: a cancellation percentage of 7.21% versus a success rate of 92.79%. The pie chart shows that vehicle model 12 has the highest percentage of successful bookings at 72.4%, followed by vehicle model 1 at 6.2% and vehicle model 10 at 6.0%.
Figure 8-Bar Chart and Pie Chart of the Overall Booking Representation
The pie chart in this figure shows that vehicle model 12 has the highest percentage of cancelled bookings at 86.2%, followed by vehicle model 89 at 9.5%.
Figure 9-Total Number of Successful Bookings per Weekday
The bar chart shows the number of successful bookings throughout the week. Friday registered the highest number of successful bookings at 6,806, Saturday was second at 6,224, and Thursday third at 5,949; Sunday, Wednesday, Monday, and Tuesday followed at 5,475, 5,455, 5,199, and 5,191 respectively.
Figure 10-Total Number of Cancellations per Weekday
The bar chart shows the number of cancelled bookings throughout the week. Sunday registered the highest number of cancellations at 595, Friday came in second at 545, and Thursday third at 527; Saturday, Monday, Wednesday, and Tuesday followed at 467, 367, 353, and 278 respectively.
Takeaway

Combining the successful and cancelled bookings per weekday for Yourcabs.com gives the following results: the cancellation percentage for Monday is 6.59% (93.41% success rate), Tuesday 5.08% (94.92%), Wednesday 6.08% (93.92%), Thursday 8.14% (91.86%), Friday 7.41% (92.59%), Saturday 6.98% (93.02%), and Sunday 9.80% (90.20%). It may be concluded that Sunday has the highest cancellation rate, with Thursday second and Friday third.
Figure 11-Cross Table and Bar graph showing the number of successful booking/cancellations per month and weekday
The cross table illustrates the number of successful bookings and cancellations per month. The highlighted cells represent the successful bookings per month for the calendar year. The top three weekdays are Friday, Saturday, and Thursday, at 6,806, 6,224, and 5,949 successful bookings respectively. The bar graph illustrates the weekday success totals:

1. Friday has the highest total at 6,806, with August, September, and July having the most successful bookings on Fridays at 1,082, 756, and 714 respectively.
2. Saturday has the second highest total at 6,224, with August, July, and June having the most successful bookings on Saturdays at 926, 754, and 719 respectively.
3. Thursday has the third highest total at 5,949, with August, October, and July having the most successful bookings on Thursdays at 919, 823, and 645 respectively.
4. Sunday has the fourth highest total at 5,475, with September, June, and August having the most successful bookings on Sundays at 717, 730, and 616 respectively.
5. Wednesday has the fifth highest total at 5,455, with July, October, and May having the most successful bookings on Wednesdays at 717, 730, and 616 respectively.
6. Monday has the sixth highest total at 5,199, with July, September, and August having the most successful bookings on Mondays at 700, 699, and 592 respectively.
7. Tuesday has the seventh highest total at 5,191, with July, August, and October having the most successful bookings on Tuesdays at 700, 629, and 584 respectively.
Figure 12-Cross Table and Bar graph showing the number of successful booking/cancellations per month and weekday
The cross table illustrates the number of successful bookings and cancellations per month. The highlighted cells represent the cancelled bookings per month for the calendar year. The top three weekdays are Sunday, Friday, and Thursday, at 595, 545, and 527 cancellations respectively. The bar graph illustrates the weekday cancellation totals:

1. Sunday has the highest cancellation total at 595, with October, May, and June having the most cancellations on Sundays at 149, 110, and 80 respectively.
2. Friday has the second highest total at 545, with May, September, and October having the most cancellations on Fridays at 154, 88, and 68 respectively.
3. Thursday has the third highest total at 527, with October, May, and November having the most cancellations on Thursdays at 290, 55, and 46 respectively.
4. Saturday has the fourth highest total at 467, with May, June, and October having the most cancellations on Saturdays at 111, 90, and 75 respectively.
5. Monday has the fifth highest total at 367, with September, October, and November having the most cancellations on Mondays at 86, 86, and 53 respectively.
6. Wednesday has the sixth highest total at 353, with October, November, and August having the most cancellations on Wednesdays at 89, 68, and 55 respectively.
7. Tuesday has the seventh highest total at 278, with October, November, and September having the most cancellations on Tuesdays at 68, 63, and 44 respectively.
Data Cleaning

A record was deleted from the data because, during data exploration, we observed a booking that was made after the scheduled trip, which skewed the data. We also deleted and created variables; refer to Table 1 for the variables that were deleted and the reasons why, and to Table 2 for the variables that were created.
Variable Name Reason
ID Not relevant to analysis.
USER_ID Not relevant to analysis.
FROM_AREA_ID Not relevant to analysis; most cancellations originate at area ID 393.
TO_AREA_ID Not relevant to analysis.
TO_CITY_ID Not relevant to analysis.
FROM_DATE Replaced by new variables; refer to Table 2.
TO_DATE Not relevant to analysis.
FROM_LAT Coordinates are not relevant to availability of cabs.
FROM_LONG Coordinates are not relevant to availability of cabs.
BOOKING_CREATED Not relevant to analysis.
COST_OF_ERROR Will be calculated as a result of cancellation.
Table 1-Variables Removed
Variable Name Reason
BOOKING_MONTH To perform analysis by month.
BOOKING_DAY To perform analysis by day.
LEAD_TIME_DAYS Difference between the date of booking and the travel date.
FROM_MONTH Created from the FROM_DATE variable.
FROM_WEEK Created from the FROM_DATE variable.
FROM_TIME Created from the FROM_DATE variable.
Table 2-Variables Created
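As a rough sketch of how the variables in Table 2 can be derived from the raw timestamps, here is a base R example; the sample rows and the timestamp format "%m/%d/%Y %H:%M" are assumptions, not taken from the actual CSV:

```r
# Hypothetical sample with the two timestamp columns used to derive the
# new variables; the format string below is an assumption.
bookings <- data.frame(
  booking_created = c("1/10/2013 09:15", "1/12/2013 18:40"),
  from_date       = c("1/12/2013 08:00", "1/12/2013 19:30")
)

created <- as.POSIXct(bookings$booking_created, format = "%m/%d/%Y %H:%M", tz = "UTC")
trip    <- as.POSIXct(bookings$from_date,       format = "%m/%d/%Y %H:%M", tz = "UTC")

bookings$BOOKING_MONTH  <- months(created)                               # month of booking
bookings$BOOKING_DAY    <- weekdays(created)                             # weekday of booking
bookings$LEAD_TIME_DAYS <- as.numeric(difftime(trip, created, units = "days"))
bookings$FROM_MONTH     <- months(trip)                                  # month of trip start
bookings$FROM_WEEK      <- weekdays(trip)                                # weekday of trip start
bookings$FROM_TIME      <- format(trip, "%H:%M")                         # time of trip start
```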
Data Mining

We took the Kaggle_YourCabs_training.csv file and partitioned it 70% for training and 30% for validation. Since we were given a classification task, we used the Random Forest, Ada Boost, and Neural Network algorithms. For each algorithm we generated error matrices, risk charts, lift charts, ROC curves, precision charts, sensitivity vs. specificity charts, and precision vs. recall charts for the validation data, and error matrices and ROC curves for the training data.
ADA Boost Validation Data:
Figure 13-Ada Boost Validation Data Error Matrix
Figure 14-Ada Boost Validation Data Risk Plot
Figure 15-Ada Boost Validation Data Lift Chart
Figure 16-Ada Boost Validation Data ROC Curve
Figure 17- Ada Boost Validation Data Precision Vs. Recall Plot
Figure 18-Ada Boost Validation Data Sensitivity Vs. Specificity Plot
ADA Boost Training Data:
Figure 19-Ada Boost Training Data Error Matrix
Figure 20-Ada Boost Training Data ROC Curve
Neural Network Validation Data:
Figure 21-Neural Network Validation Data Error Matrix
Figure 22-Neural Network Validation Data Risk Plot
Figure 23-Neural Network Validation Data Lift Chart
Figure 24 - Neural Network Validation Data ROC Curve
Figure 25 - Neural Network Validation Data Precision Vs. Recall Plot
Figure 26-Neural Network Validation Data Sensitivity Vs. Specificity Plot
Neural Network Training Data:
Figure 27-Neural Network Training Data Error Matrix
Figure 28-Neural Network Training Data ROC Curve
Random Forest Validation Data:
Figure 29-Random Forest Validation Data Error Matrix
Figure 30-Random Forest Validation Data Risk Plot
Figure 31-Random Forest Validation Data Lift Chart
Figure 32-Random Forest Validation Data ROC Curve
Figure 33-Random Forest Validation Data Precision Vs. Recall Plot
Figure 34-Random Forest Validation Data Sensitivity Vs. Specificity Plot
Random Forest Training Data:
Figure 35-Random Forest Training Data Error Matrix
Figure 36-Random Forest Training Data ROC Curve
Takeaway
Classification Matrix (confusion matrix or error matrix):
This matrix summarizes the correct and incorrect classifications that a classifier produced for a given dataset. Rows and columns of the classification matrix correspond to the true and predicted classes respectively. The two diagonal cells (upper left, lower right) give the number of correct classifications, where the predicted class coincides with the actual class of the observation. The off-diagonal cells give the counts of misclassifications. The classification matrix therefore provides estimates of the true classification and misclassification rates.
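A minimal base R sketch of this computation, with made-up actual and predicted labels standing in for a model's output:

```r
# Hypothetical actual and predicted cancellation labels (1 = cancelled).
actual    <- c(0, 0, 0, 0, 1, 1, 0, 1, 0, 0)
predicted <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0)

# Rows are the true classes, columns the predicted classes.
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

# Overall error rate: off-diagonal counts over the total.
err <- (sum(cm) - sum(diag(cm))) / sum(cm)
print(err)  # 0.2
```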
Lift Chart:
A Lift Chart graphically represents the improvement that a mining model provides when
compared against a random guess, and measures the change in terms of a lift score. By
comparing the lift scores for various portions of your data set and for different models, you can
determine which model is best, and which percentage of the cases in the data set would benefit
from applying the model’s predictions.
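A rough sketch of a lift computation under this definition, with made-up scores and labels: rank the cases by predicted score, then compare the cancellation rate in the top portion against the overall rate.

```r
# Hypothetical predicted cancellation probabilities and true labels.
score <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)
label <- c(1, 1, 0, 1, 0, 0, 0, 0, 0, 0)

# Lift in the top 30% of cases ranked by score, relative to a random guess.
ord     <- order(score, decreasing = TRUE)
top     <- label[ord][1:3]
lift_30 <- mean(top) / mean(label)
print(lift_30)
```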
ROC Curve (Receiver Operating Characteristic Curve):
A more popular method for plotting the two measures is the ROC curve, which plots the pairs {sensitivity, 1 - specificity} as the cutoff value increases from 0 to 1.
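A base R sketch of those pairs at a few cutoffs, again with made-up scores and labels:

```r
# Hypothetical predicted probabilities and true labels (1 = cancelled).
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
label <- c(1, 1, 0, 1, 0, 1, 0, 0)

# For each cutoff, classify score >= cutoff as positive, then compute
# sensitivity (true positive rate) and 1 - specificity (false positive rate).
roc_point <- function(cutoff) {
  pred <- as.integer(score >= cutoff)
  sens <- sum(pred == 1 & label == 1) / sum(label == 1)
  fpr  <- sum(pred == 1 & label == 0) / sum(label == 0)
  c(cutoff = cutoff, sensitivity = sens, one_minus_specificity = fpr)
}

print(t(sapply(c(0.25, 0.5, 0.75), roc_point)))
```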
Conclusion

The weekday analysis for Yourcabs.com gives the following results: the cancellation percentage for Monday is 6.59% (93.41% success rate), Tuesday 5.08% (94.92%), Wednesday 6.08% (93.92%), Thursday 8.14% (91.86%), Friday 7.41% (92.59%), Saturday 6.98% (93.02%), and Sunday 9.80% (90.20%). Sunday therefore has the highest cancellation rate, with Thursday second and Friday third.
After analyzing the cancellations for booking date differences of zero through four and the counts of successful versus cancelled bookings, it may be concluded that the booking success rate is directly proportional to the booking date difference: the greater the booking date difference, the higher the rate of successful bookings.
To handle the unbalanced nature of the data, we used a stratified sampling technique. Once the data are split into training, validation, and test sets, an already skewed dataset is very likely to become even more unbalanced in at least one of the three resulting sets. Stratified sampling overcomes this by ensuring that the training, validation, and test sets are well balanced. Team 5 used the sampsize parameter of the random forest algorithm to achieve this. Because we did not use stratified sampling with the Ada Boost and neural network algorithms, their error values were as low as 0.06 and 0.068 respectively. We concluded that the best algorithm was the random forest model, with an error value of 0.21.
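The idea can be sketched in base R with a toy class distribution (in the randomForest package the same effect is obtained through the strata and sampsize arguments); the class counts below are made up for illustration:

```r
set.seed(42)

# Hypothetical unbalanced target: roughly 7% cancellations, as in the report.
y <- c(rep(0, 930), rep(1, 70))

# Draw the same number of cases from each class so the sample is balanced.
n_per_class <- 50
idx <- unlist(lapply(split(seq_along(y), y), function(i) sample(i, n_per_class)))

table(y[idx])  # 50 of each class
```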
Appendix A – Variable Description

This table contains the descriptions of the variables in the Kaggle_YourCabs_training.csv file as listed on the competition page.
Data Field Description
id booking ID
user_id the ID of the customer (based on mobile number)
vehicle_model_id vehicle model type
package_id type of package (1=4hrs & 40kms, 2=8hrs & 80kms, 3=6hrs & 60kms, 4= 10hrs & 100kms, 5=5hrs & 50kms, 6=3hrs & 30kms, 7=12hrs & 120kms)
travel_type_id type of travel (1=long distance, 2= point to point, 3= hourly rental)
from_area_id unique identifier of area. Applicable only for point-to-point travel and packages
to_area_id unique identifier of area. Applicable only for point-to-point travel
from_city_id unique identifier of city
to_city_id unique identifier of city (only for intercity)
from_date time stamp of requested trip start
to_date time stamp of trip end
online_booking if booking was done on desktop website
mobile_site_booking if booking was done on mobile website
booking_created time stamp of booking
from_lat latitude of from area
from_long longitude of from area
to_lat latitude of to area
to_long longitude of to area
car_cancelation (available only in training data) - whether the booking was cancelled (1) or not (0) due to unavailability of a car
cost_of_error (available only in training data) - the cost incurred if the booking is misclassified. For an un-cancelled booking, the cost of misclassification is 1. For a cancelled booking, the cost is a function of the cancellation time relative to the trip start time (see Evaluation Page)
Table 3-Description of variables
Appendix B – Ada Boost R Code
# Load the data.
crs$dataset <- read.csv("file:///M:/Lab & Assginments/ADS/Final Project/Transformed Training Data with binned columns.csv", na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")
#============================================================
# Rattle timestamp: 2015-05-01 20:39:14 i386-w64-mingw32
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 6514 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 6515 observations
# The following variable selections have been noted.
crs$input <- c("user_id", "vehicle_model_id_red", "package_id", "from_area_id",
"to_area_id", "from_city_id", "to_city_id", "from_date",
"to_date", "online_booking", "mobile_site_booking", "booking_created",
"from_lat", "from_long", "to_lat", "to_long",
"Car_Cancellation", "Cost_of_error", "from_month", "from_weekday",
"from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("user_id", "vehicle_model_id_red", "from_date", "online_booking",
"mobile_site_booking", "booking_created", "Car_Cancellation", "Cost_of_error",
"from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_area_id", "to_area_id", "from_city_id",
"to_city_id", "to_date", "from_lat", "from_long",
"to_lat", "to_long", "from_month", "from_weekday",
"booking_month", "booking_weekday")
crs$target <- "travel_type_id"
crs$risk <- NULL
crs$ident <- "id"
crs$ignore <- NULL
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2015-05-01 20:39:37 i386-w64-mingw32
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 6514 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 6515 observations
# The following variable selections have been noted.
crs$input <- c("vehicle_model_id_red", "package_id", "travel_type_id", "from_city_id",
"online_booking", "mobile_site_booking", "from_month", "from_weekday",
"from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("vehicle_model_id_red", "travel_type_id", "online_booking", "mobile_site_booking",
"from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_city_id", "from_month", "from_weekday",
"booking_month", "booking_weekday")
crs$target <- "Car_Cancellation"
crs$risk <- NULL
crs$ident <- NULL
crs$ignore <- c("id", "user_id", "from_area_id", "to_area_id", "to_city_id", "from_date", "to_date",
"booking_created", "from_lat", "from_long", "to_lat", "to_long", "Cost_of_error")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2015-05-01 20:39:55 i386-w64-mingw32
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 6514 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 6515 observations
# The following variable selections have been noted.
crs$input <- c("vehicle_model_id_red", "package_id", "travel_type_id", "from_city_id",
"online_booking", "mobile_site_booking", "from_month", "from_weekday",
"from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("vehicle_model_id_red", "travel_type_id", "online_booking", "mobile_site_booking",
"from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_city_id", "from_month", "from_weekday",
"booking_month", "booking_weekday")
crs$target <- "Car_Cancellation"
crs$risk <- NULL
crs$ident <- NULL
crs$ignore <- c("id", "user_id", "from_area_id", "to_area_id", "to_city_id", "from_date", "to_date",
"booking_created", "from_lat", "from_long", "to_lat", "to_long", "Cost_of_error")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2015-05-01 20:41:47 i386-w64-mingw32
# Ada Boost
# The `ada' package implements the boost algorithm.
require(ada, quietly=TRUE)
# Build the Ada Boost model.
set.seed(crv$seed)
crs$ada <- ada(Car_Cancellation ~ .,
data=crs$dataset[crs$train,c(crs$input, crs$target)],
control=rpart.control(maxdepth=30,
cp=0.010000,
minsplit=20,
xval=10),
iter=60)
# Print the results of the modelling.
print(crs$ada)
round(crs$ada$model$errs[crs$ada$iter,], 2)
cat('Variables actually used in tree construction:\n')
print(sort(names(listAdaVarsUsed(crs$ada))))
cat('\nFrequency of variables actually used:\n')
print(listAdaVarsUsed(crs$ada))
# Time taken: 54.03 secs
#============================================================
# Rattle timestamp: 2015-05-01 20:42:47 i386-w64-mingw32
# Evaluate model performance.
# Generate an Error Matrix for the Ada Boost model.
# Obtain the response from the Ada Boost model.
crs$pr <- predict(crs$ada, newdata=crs$dataset[crs$validate, c(crs$input, crs$target)])
# Generate the confusion matrix showing counts.
table(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr,
dnn=c("Actual", "Predicted"))
# Generate the confusion matrix showing proportions.
pcme <- function(actual, cl)
{
x <- table(actual, cl)
tbl <- cbind(round(x/length(actual), 2),
Error=round(c(x[1,2]/sum(x[1,]),
x[2,1]/sum(x[2,])), 2))
names(attr(tbl, "dimnames")) <- c("Actual", "Predicted")
return(tbl)
};
pcme(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr)
# Calculate the overall error percentage.
overall <- function(x)
{
if (nrow(x) == 2)
cat((x[1,2] + x[2,1]) / sum(x))
else
cat(1 - (x[1,rownames(x)]) / sum(x))
}
overall(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
dnn=c("Predicted", "Actual")))
# Calculate the averaged class error percentage.
avgerr <- function(x)
cat(mean(c(x[1,2], x[2,1]) / apply(x, 1, sum)))
avgerr(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
dnn=c("Predicted", "Actual")))
#============================================================
# Rattle timestamp: 2015-05-01 20:43:00 i386-w64-mingw32
# Ada Boost
# The `ada' package implements the boost algorithm.
require(ada, quietly=TRUE)
# Build the Ada Boost model.
set.seed(crv$seed)
crs$ada <- ada(Car_Cancellation ~ .,
data=crs$dataset[crs$train,c(crs$input, crs$target)],
control=rpart.control(maxdepth=30,
cp=0.010000,
minsplit=20,
xval=10),
iter=50)
# Print the results of the modelling.
print(crs$ada)
round(crs$ada$model$errs[crs$ada$iter,], 2)
cat('Variables actually used in tree construction:\n')
print(sort(names(listAdaVarsUsed(crs$ada))))
cat('\nFrequency of variables actually used:\n')
print(listAdaVarsUsed(crs$ada))
# Time taken: 43.89 secs
#============================================================
# Rattle timestamp: 2015-05-01 20:43:49 i386-w64-mingw32
# Evaluate model performance.
# Generate an Error Matrix for the Ada Boost model.
# Obtain the response from the Ada Boost model.
crs$pr <- predict(crs$ada, newdata=crs$dataset[crs$validate, c(crs$input, crs$target)])
# Generate the confusion matrix showing counts.
table(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr,
dnn=c("Actual", "Predicted"))
# Generate the confusion matrix showing proportions.
pcme <- function(actual, cl)
{
x <- table(actual, cl)
tbl <- cbind(round(x/length(actual), 2),
Error=round(c(x[1,2]/sum(x[1,]),
x[2,1]/sum(x[2,])), 2))
names(attr(tbl, "dimnames")) <- c("Actual", "Predicted")
return(tbl)
};
pcme(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr)
# Calculate the overall error percentage.
overall <- function(x)
{
if (nrow(x) == 2)
cat((x[1,2] + x[2,1]) / sum(x))
else
cat(1 - (x[1,rownames(x)]) / sum(x))
}
overall(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
dnn=c("Predicted", "Actual")))
# Calculate the averaged class error percentage.
avgerr <- function(x)
cat(mean(c(x[1,2], x[2,1]) / apply(x, 1, sum)))
avgerr(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
dnn=c("Predicted", "Actual")))
#============================================================
# Rattle timestamp: 2015-05-01 20:44:02 i386-w64-mingw32
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.3*crs$nobs) # 13029 observations
crs$test <- NULL
# The following variable selections have been noted.
crs$input <- c("vehicle_model_id_red", "package_id", "travel_type_id", "from_city_id",
"online_booking", "mobile_site_booking", "from_month", "from_weekday",
"from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("vehicle_model_id_red", "travel_type_id", "online_booking", "mobile_site_booking",
"from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_city_id", "from_month", "from_weekday",
"booking_month", "booking_weekday")
crs$target <- "Car_Cancellation"
crs$risk <- NULL
crs$ident <- NULL
crs$ignore <- c("id", "user_id", "from_area_id", "to_area_id", "to_city_id", "from_date", "to_date",
"booking_created", "from_lat", "from_long", "to_lat", "to_long", "Cost_of_error")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2015-05-01 20:44:10 i386-w64-mingw32
# Ada Boost
# The `ada' package implements the boost algorithm.
require(ada, quietly=TRUE)
# Build the Ada Boost model.
set.seed(crv$seed)
crs$ada <- ada(Car_Cancellation ~ .,
data=crs$dataset[crs$train,c(crs$input, crs$target)],
control=rpart.control(maxdepth=30,
cp=0.010000,
minsplit=20,
xval=10),
iter=50)
# Print the results of the modelling.
print(crs$ada)
round(crs$ada$model$errs[crs$ada$iter,], 2)
cat('Variables actually used in tree construction:\n')
print(sort(names(listAdaVarsUsed(crs$ada))))
cat('\nFrequency of variables actually used:\n')
print(listAdaVarsUsed(crs$ada))
# Time taken: 43.95 secs
#============================================================
# Rattle timestamp: 2015-05-01 20:45:06 i386-w64-mingw32
# Evaluate model performance.
# Generate an Error Matrix for the Ada Boost model.
# Obtain the response from the Ada Boost model.
crs$pr <- predict(crs$ada, newdata=crs$dataset[crs$validate, c(crs$input, crs$target)])
# Generate the confusion matrix showing counts.
table(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr,
dnn=c("Actual", "Predicted"))
# Generate the confusion matrix showing proportions.
pcme <- function(actual, cl)
{
x <- table(actual, cl)
tbl <- cbind(round(x/length(actual), 2),
Error=round(c(x[1,2]/sum(x[1,]),
x[2,1]/sum(x[2,])), 2))
names(attr(tbl, "dimnames")) <- c("Actual", "Predicted")
return(tbl)
};
pcme(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr)
# Calculate the overall error percentage.
overall <- function(x)
{
if (nrow(x) == 2)
cat((x[1,2] + x[2,1]) / sum(x))
else
cat(1 - (x[1,rownames(x)]) / sum(x))
}
overall(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
dnn=c("Predicted", "Actual")))
# Calculate the averaged class error percentage.
avgerr <- function(x)
cat(mean(c(x[1,2], x[2,1]) / apply(x, 1, sum)))
avgerr(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
dnn=c("Predicted", "Actual")))
#============================================================
# Rattle timestamp: 2015-05-01 20:46:09 i386-w64-mingw32
# Score a dataset.
# Obtain probability scores for the Ada Boost model on Transformed Training Data with binned columns.csv [validate].
crs$pr <- predict(crs$ada, newdata=crs$dataset[crs$validate, c(crs$input)], type="prob")[,2]
# Extract the relevant variables from the dataset.
sdata <- crs$dataset[crs$validate,]
# Output the combined data.
write.csv(cbind(sdata, crs$pr),
          file="M:/Lab & Assginments/ADS/Final Project/AdaBoost/Ada Boost_Validation_Score.csv",
          row.names=FALSE)
Appendix C – Neural Network R Code

# Load the data.
crs$dataset <- read.csv("file:///M:/Lab & Assginments/ADS/Final Project/Transformed Training Data with binned columns.csv",
                        na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")

#============================================================
# Rattle timestamp: 2015-05-01 20:48:43 i386-w64-mingw32

# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 6514 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 6515 observations

# The following variable selections have been noted.
crs$input <- c("user_id", "vehicle_model_id_red", "package_id", "from_area_id",
               "to_area_id", "from_city_id", "to_city_id", "from_date",
               "to_date", "online_booking", "mobile_site_booking", "booking_created",
               "from_lat", "from_long", "to_lat", "to_long",
               "Car_Cancellation", "Cost_of_error", "from_month", "from_weekday",
               "from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("user_id", "vehicle_model_id_red", "from_date", "online_booking",
                 "mobile_site_booking", "booking_created", "Car_Cancellation", "Cost_of_error",
                 "from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_area_id", "to_area_id", "from_city_id",
                   "to_city_id", "to_date", "from_lat", "from_long",
                   "to_lat", "to_long", "from_month", "from_weekday",
                   "booking_month", "booking_weekday")
crs$target <- "travel_type_id"
crs$risk <- NULL
crs$ident <- "id"
crs$ignore <- NULL
crs$weights <- NULL

#============================================================
# Rattle timestamp: 2015-05-01 20:51:27 i386-w64-mingw32
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 6514 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 6515 observations

# The following variable selections have been noted.
crs$input <- c("vehicle_model_id_red", "package_id", "travel_type_id", "from_city_id",
               "online_booking", "mobile_site_booking", "from_month", "from_weekday",
               "from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("vehicle_model_id_red", "travel_type_id", "online_booking", "mobile_site_booking",
                 "from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_city_id", "from_month", "from_weekday",
                   "booking_month", "booking_weekday")
crs$target <- "Car_Cancellation"
crs$risk <- NULL
crs$ident <- NULL
crs$ignore <- c("id", "user_id", "from_area_id", "to_area_id", "to_city_id", "from_date", "to_date",
                "booking_created", "from_lat", "from_long", "to_lat", "to_long", "Cost_of_error")
crs$weights <- NULL

#============================================================
# Rattle timestamp: 2015-05-01 20:51:36 i386-w64-mingw32

# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.3*crs$nobs) # 13029 observations
crs$test <- NULL

# The following variable selections have been noted.
crs$input <- c("vehicle_model_id_red", "package_id", "travel_type_id", "from_city_id",
               "online_booking", "mobile_site_booking", "from_month", "from_weekday",
               "from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("vehicle_model_id_red", "travel_type_id", "online_booking", "mobile_site_booking",
                 "from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_city_id", "from_month", "from_weekday",
                   "booking_month", "booking_weekday")
crs$target <- "Car_Cancellation"
crs$risk <- NULL
crs$ident <- NULL
crs$ignore <- c("id", "user_id", "from_area_id", "to_area_id", "to_city_id", "from_date", "to_date",
                "booking_created", "from_lat", "from_long", "to_lat", "to_long", "Cost_of_error")
crs$weights <- NULL

#============================================================
# Rattle timestamp: 2015-05-01 20:51:40 i386-w64-mingw32

# Neural Network
# Build a neural network model using the nnet package.
require(nnet, quietly=TRUE)

# Build the NNet model.
set.seed(199)
crs$nnet <- nnet(as.factor(Car_Cancellation) ~ .,
                 data=crs$dataset[crs$sample, c(crs$input, crs$target)],
                 size=10, skip=TRUE, MaxNWts=10000, trace=FALSE, maxit=100)

# Print the results of the modelling.
cat(sprintf("A %s network with %d weights.\n",
            paste(crs$nnet$n, collapse="-"),
            length(crs$nnet$wts)))
cat(sprintf("Inputs: %s.\n",
            paste(crs$nnet$coefnames, collapse=", ")))
cat(sprintf("Output: %s.\n",
            names(attr(crs$nnet$terms, "dataClasses"))[1]))
cat(sprintf("Sum of Squares Residuals: %.4f.\n",
            sum(residuals(crs$nnet) ^ 2)))
cat("\n")
print(summary(crs$nnet))
cat('\n')

# Time taken: 27.31 secs
#============================================================
# Rattle timestamp: 2015-05-01 20:54:16 i386-w64-mingw32

# Evaluate model performance.
# Generate an Error Matrix for the Neural Net model.
# Obtain the response from the Neural Net model.
crs$pr <- predict(crs$nnet, newdata=crs$dataset[crs$validate, c(crs$input, crs$target)], type="class")

# Generate the confusion matrix showing counts.
table(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr,
      dnn=c("Actual", "Predicted"))

# Generate the confusion matrix showing proportions.
pcme <- function(actual, cl)
{
  x <- table(actual, cl)
  tbl <- cbind(round(x/length(actual), 2),
               Error=round(c(x[1,2]/sum(x[1,]),
                             x[2,1]/sum(x[2,])), 2))
  names(attr(tbl, "dimnames")) <- c("Actual", "Predicted")
  return(tbl)
};
pcme(crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation, crs$pr)

# Calculate the overall error percentage.
overall <- function(x)
{
  if (nrow(x) == 2)
    cat((x[1,2] + x[2,1]) / sum(x))
  else
    cat(1 - (x[1,rownames(x)]) / sum(x))
}
overall(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
              dnn=c("Predicted", "Actual")))

# Calculate the averaged class error percentage.
avgerr <- function(x)
  cat(mean(c(x[1,2], x[2,1]) / apply(x, 1, sum)))
avgerr(table(crs$pr, crs$dataset[crs$validate, c(crs$input, crs$target)]$Car_Cancellation,
             dnn=c("Predicted", "Actual")))
#============================================================
# Rattle timestamp: 2015-05-01 21:02:49 i386-w64-mingw32

# Score a dataset.
# Obtain probability scores for the Neural Net model on Transformed Training Data with binned columns.csv [validate].
crs$pr <- predict(crs$nnet, newdata=crs$dataset[crs$validate, c(crs$input)])

# Extract the relevant variables from the dataset.
sdata <- crs$dataset[crs$validate,]

# Output the combined data.
write.csv(cbind(sdata, crs$pr),
          file="C:/Users/Deepak/Documents/Neural Network_Validation_Score.csv",
          row.names=FALSE)

#============================================================
# Rattle timestamp: 2015-05-01 21:03:18 i386-w64-mingw32

# Score a dataset.
Appendix D – Random Forest R Code

# Load the data.
crs$dataset <- read.csv("file:///M:/Lab & Assginments/ADS/Final Project/Transformed Training Data with binned columns.csv",
                        na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")
#============================================================
# Rattle timestamp: 2015-05-01 20:30:41 i386-w64-mingw32
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 6514 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 6515 observations
# The following variable selections have been noted.
crs$input <- c("user_id", "vehicle_model_id_red", "package_id", "from_area_id",
"to_area_id", "from_city_id", "to_city_id", "from_date",
"to_date", "online_booking", "mobile_site_booking", "booking_created",
"from_lat", "from_long", "to_lat", "to_long",
"Car_Cancellation", "Cost_of_error", "from_month", "from_weekday",
"from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("user_id", "vehicle_model_id_red", "from_date", "online_booking",
"mobile_site_booking", "booking_created", "Car_Cancellation", "Cost_of_error",
"from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_area_id", "to_area_id", "from_city_id",
"to_city_id", "to_date", "from_lat", "from_long",
"to_lat", "to_long", "from_month", "from_weekday",
"booking_month", "booking_weekday")
crs$target <- "travel_type_id"
crs$risk <- NULL
crs$ident <- "id"
crs$ignore <- NULL
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2015-05-01 20:31:08 i386-w64-mingw32
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 43430 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 30401 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 6514 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 6515 observations
# The following variable selections have been noted.
crs$input <- c("vehicle_model_id_red", "package_id", "travel_type_id", "from_city_id",
"online_booking", "mobile_site_booking", "from_month", "from_weekday",
"from_time", "booking_month", "booking_weekday", "lead_time_days_red")
crs$numeric <- c("vehicle_model_id_red", "travel_type_id", "online_booking", "mobile_site_booking",
"from_time", "lead_time_days_red")
crs$categoric <- c("package_id", "from_city_id", "from_month", "from_weekday",
"booking_month", "booking_weekday")
crs$target <- "Car_Cancellation"
crs$risk <- NULL
crs$ident <- NULL
crs$ignore <- c("id", "user_id", "from_area_id", "to_area_id", "to_city_id", "from_date", "to_date",
"booking_created", "from_lat", "from_long", "to_lat", "to_long", "Cost_of_error")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2015-05-01 20:31:32 i386-w64-mingw32
# Random Forest
# The 'randomForest' package provides the 'randomForest' function.
require(randomForest, quietly=TRUE)
# Build the Random Forest model.
set.seed(crv$seed)
crs$rf <- randomForest(as.factor(Car_Cancellation) ~ .,
data=crs$dataset[crs$sample,c(crs$input, crs$target)],
ntree=130,
mtry=4,
sampsize=c(1200,1200),
importance=TRUE,
na.action=na.roughfix,
replace=FALSE)
# Generate textual output of 'Random Forest' model.
crs$rf
# The `pROC' package implements various AUC functions.
require(pROC, quietly=TRUE)
# Calculate the Area Under the Curve (AUC).
roc(crs$rf$y, as.numeric(crs$rf$predicted))
# Calculate the AUC Confidence Interval.
ci.auc(crs$rf$y, as.numeric(crs$rf$predicted))
# List the importance of the variables.
rn <- round(importance(crs$rf), 2)
rn[order(rn[,3], decreasing=TRUE),]
# Time taken: 12.28 secs
#============================================================
# Rattle timestamp: 2015-05-01 20:31:48 i386-w64-mingw32
# Evaluate model performance.
# Generate an Error Matrix for the Random Forest model.
# Obtain the response from the Random Forest model.
crs$pr <- predict(crs$rf, newdata=na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)]))
# Generate the confusion matrix showing counts.
table(na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation, crs$pr,
dnn=c("Actual", "Predicted"))
# Generate the confusion matrix showing proportions.
pcme <- function(actual, cl)
{
x <- table(actual, cl)
tbl <- cbind(round(x/length(actual), 2),
Error=round(c(x[1,2]/sum(x[1,]),
x[2,1]/sum(x[2,])), 2))
names(attr(tbl, "dimnames")) <- c("Actual", "Predicted")
return(tbl)
};
pcme(na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation, crs$pr)
# Calculate the overall error percentage.
overall <- function(x)
{
if (nrow(x) == 2)
cat((x[1,2] + x[2,1]) / sum(x))
else
cat(1 - (x[1,rownames(x)]) / sum(x))
}
overall(table(crs$pr, na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation,
dnn=c("Predicted", "Actual")))
# Calculate the averaged class error percentage.
avgerr <- function(x)
cat(mean(c(x[1,2], x[2,1]) / apply(x, 1, sum)))
avgerr(table(crs$pr, na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation,
dnn=c("Predicted", "Actual")))
#============================================================
# Rattle timestamp: 2015-05-01 20:31:58 i386-w64-mingw32
# Score a dataset.
# Read a dataset from file for testing the model.
crs$testset <- read.csv("C:/Users/Deepak/Documents", na.strings=c(".", "NA", "", "?"), header=TRUE,
sep=",", encoding="UTF-8", strip.white=TRUE)
#============================================================
# Rattle timestamp: 2015-05-01 20:32:28 i386-w64-mingw32
# Score a dataset.
# Obtain probability scores for the Random Forest model on Transformed Training Data with binned columns.csv [validate].
crs$pr <- predict(crs$rf, newdata=na.omit(crs$dataset[crs$validate, c(crs$input)]), type="prob")[,2]
# Extract the relevant variables from the dataset.
sdata <- crs$dataset[crs$validate,]
# Output the combined data.
write.csv(cbind(sdata, crs$pr),
          file="M:/Lab & Assginments/ADS/Final Project/Random Forest/Random_Forest_Stratified_Sampling_Valdation_Score.csv",
          row.names=FALSE)
#============================================================
# Rattle timestamp: 2015-05-01 20:33:49 i386-w64-mingw32
# Random Forest
# The 'randomForest' package provides the 'randomForest' function.
require(randomForest, quietly=TRUE)
# Build the Random Forest model.
set.seed(crv$seed)
crs$rf <- randomForest(as.factor(Car_Cancellation) ~ .,
data=crs$dataset[crs$sample,c(crs$input, crs$target)],
ntree=130,
mtry=4,
importance=TRUE,
na.action=na.roughfix,
replace=FALSE)
# Generate textual output of 'Random Forest' model.
crs$rf
# The `pROC' package implements various AUC functions.
require(pROC, quietly=TRUE)
# Calculate the Area Under the Curve (AUC).
roc(crs$rf$y, as.numeric(crs$rf$predicted))
# Calculate the AUC Confidence Interval.
ci.auc(crs$rf$y, as.numeric(crs$rf$predicted))
# List the importance of the variables.
rn <- round(importance(crs$rf), 2)
rn[order(rn[,3], decreasing=TRUE),]
# Time taken: 19.51 secs
#============================================================
# Rattle timestamp: 2015-05-01 20:34:57 i386-w64-mingw32
# Evaluate model performance.
# Generate an Error Matrix for the Random Forest model.
# Obtain the response from the Random Forest model.
crs$pr <- predict(crs$rf, newdata=na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)]))
# Generate the confusion matrix showing counts.
table(na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation, crs$pr,
dnn=c("Actual", "Predicted"))
# Generate the confusion matrix showing proportions.
pcme <- function(actual, cl)
{
x <- table(actual, cl)
tbl <- cbind(round(x/length(actual), 2),
Error=round(c(x[1,2]/sum(x[1,]),
x[2,1]/sum(x[2,])), 2))
names(attr(tbl, "dimnames")) <- c("Actual", "Predicted")
return(tbl)
};
pcme(na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation, crs$pr)
# Calculate the overall error percentage.
overall <- function(x)
{
if (nrow(x) == 2)
cat((x[1,2] + x[2,1]) / sum(x))
else
cat(1 - (x[1,rownames(x)]) / sum(x))
}
overall(table(crs$pr, na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation,
dnn=c("Predicted", "Actual")))
# Calculate the averaged class error percentage.
avgerr <- function(x)
cat(mean(c(x[1,2], x[2,1]) / apply(x, 1, sum)))
avgerr(table(crs$pr, na.omit(crs$dataset[crs$validate, c(crs$input, crs$target)])$Car_Cancellation,
dnn=c("Predicted", "Actual")))
#============================================================
# Rattle timestamp: 2015-05-01 20:35:05 i386-w64-mingw32
# Score a dataset.
# Obtain probability scores for the Random Forest model on Transformed Training Data with binned columns.csv [validate].
crs$pr <- predict(crs$rf, newdata=na.omit(crs$dataset[crs$validate, c(crs$input)]), type="prob")[,2]
# Extract the relevant variables from the dataset.
sdata <- crs$dataset[crs$validate,]
# Output the combined data.
write.csv(cbind(sdata, crs$pr),
          file="M:/Lab & Assginments/ADS/Final Project/Random Forest/Random_Forest_Unsampled_Data_Validation_Score.csv",
          row.names=FALSE)