
IEOR 265 Final Project

Application of Machine Learning Techniques to Forecast

Bike Rental Demand in the Capital Bikeshare Program in

Washington, D.C.

by

Minchao Lin

May 8, 2015


Abstract

Demand forecasting is a crucial part of efficient resource management, and machine learning techniques can help build and refine models that learn from observed data and make predictions. Specifically, supervised learning models the relationship between a set of predictor variables and one or more response variables on the basis of a finite set of observations.

The objective of this project is to combine historical usage patterns with weather data in order to forecast the total number of bikes rented during each hour in the bike sharing system in Washington, D.C. In this paper, multiple machine learning techniques, including ordinary least squares regression, lasso regression, elastic net, ensemble learning methods, neural networks, and local linear regression, are discussed, and their effectiveness in predicting the response variable is evaluated and compared.


1 Introduction

1.1 Background

A bicycle sharing system provides bicycles for shared, short-term use by individuals. These systems are becoming increasingly popular in major cities as a convenient means of transportation. As of June 2014, public bicycle sharing systems were available in 712 cities across five continents, operating approximately 806,200 bicycles at 37,500 stations. In these systems, bicycle rental is fully automated via a network of kiosk locations throughout a city, and people can rent a bike from one location and return it to a different location. Historical usage data is a valuable resource for analyzing demand and determining the right number of bicycles to meet it.

1.2 Data Description

Hourly rental data spanning two years from 2011 to 2012 are provided for this project, with

variables including date & time, season, holiday, working day, weather, temperature, humidity,

wind speed, number of registered and non-registered user rentals initiated, and number of total

rentals. To test the effectiveness of a model, the historical data is split into three sets: the training set comprises the first 15 days of each month, the test set comprises days 16 through 19 of each month, and the validation set covers day 20 through the end of each month. Details on the predictor variables and response variables are listed in the Appendix. Each method in the following sections is first trained on the training set and evaluated on the test set, with the mean squared error calculated each time. The methods with the lowest mean squared errors are then refit on the combined training and test sets and evaluated on the validation set to obtain the root mean squared logarithmic error.
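As an illustration, this split can be performed in Matlab roughly as follows. This is a minimal sketch; the table name "data" and its datetime column "timestamp" are placeholder names, not the names used in the project code.

d = day(data.timestamp);                        % day of month (1-31) for each hourly record
trainSet      = data(d <= 15, :);               % days 1-15: training set
testSet       = data(d >= 16 & d <= 19, :);     % days 16-19: test set
validationSet = data(d >= 20, :);               % day 20 to month end: validation set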


1.2.1 Convert Categorical Variables to Dummy Variables

A standard way to avoid the dummy variable trap is to drop one of the dummy variables: if a variable has m categories, only m-1 dummies are used in the model. The category left out serves as the reference level, and the fitted coefficients of the remaining categories represent changes relative to this reference. For the bike sharing demand data, year, month, hour, weekday, season, holiday, working day, and weather are categorical variables and are converted to dummy variables.
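For example, one categorical variable can be converted in Matlab as sketched below; "hourOfDay" is a placeholder vector of hour labels used only for illustration.

H = dummyvar(categorical(hourOfDay));   % one indicator column per hour category
H = H(:, 2:end);                        % drop the first column to avoid the dummy variable trap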

1.2.2 Relationship between numerical predictor variables

Figure 1. Scatter plot matrix of each numerical variable plotted against the others.

The scatter plot matrix above shows the relationships among the numerical predictor variables and the response variable; from top to bottom (and left to right) the variables are temperature, “feels like” temperature, relative humidity, wind speed, and total number of rentals. The plot shows largely independent relationships between the variables, except between temperature and “feels like” temperature, which is reasonable as


these two variables are generally very close to each other. Because multicollinearity can increase

the variance of the coefficient estimates and make the estimates very sensitive to minor changes,

we will apply regularization to the methods to counteract this tendency.

1.3 Performance Metrics

For regression problems, model performance is quantified by measuring the distance between the estimated outputs and the actual outputs. The Mean Squared Error penalizes larger differences more heavily because of the squaring. If we instead want to reduce the penalty on large differences, we can log-transform the quantities first; the logarithm balances the emphasis on small and large prediction errors. For this project, the effectiveness of the models is evaluated based on the Mean Squared Error (MSE) and the Root Mean Squared Logarithmic Error (RMSLE):

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i + 1) - \log(a_i + 1)\bigr)^{2}}$$

where:

$n$ is the number of hours in the test set,

$p_i$ is the predicted count,

$a_i$ is the actual count, and

$\log(x)$ is the natural logarithm.
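Given vectors of predicted counts p and actual counts a, the two metrics can be computed in Matlab as follows (a simple sketch of the definitions above):

mse   = mean((p - a).^2);                          % Mean Squared Error
rmsle = sqrt(mean((log(p + 1) - log(a + 1)).^2));  % Root Mean Squared Logarithmic Error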


2 Ordinary Least Squares Regression

2.1 Method Description

Ordinary least squares (OLS) is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the sum of squared differences between the observed and predicted responses.

Let X be an n × p training data input matrix, where n is the total number of observations and p is the number of features per observation, let Y be the n × 1 vector of training response values, and let β be the p × 1 vector of unknown parameters. The OLS estimate of β for the linear model is then defined as

$$\hat{\beta} = (X'X)^{-1}X'Y$$
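A minimal Matlab sketch of this estimate and its test-set MSE is shown below, assuming Xtrain, Ytrain, Xtest, and Ytest hold the dummy-coded predictors and rental counts (placeholder names):

betaHat = (Xtrain' * Xtrain) \ (Xtrain' * Ytrain);  % OLS estimate via the normal equations
Yhat    = Xtest * betaHat;                          % predicted counts on the test set
mseOLS  = mean((Ytest - Yhat).^2);                  % Mean Squared Error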

2.2 Performance Metric

Mean Squared Error = 10015

2.3 Result Analysis

The mean squared error is rather high: the relationship between bike rental demand and its exogenous factors appears to be complex and nonlinear, making it difficult to model with traditional linear regression.


3 Lasso Regularization and Elastic Net

3.1 Method Description

3.1.1 Lasso Regularization

Lasso regression is a regularized version of linear regression that minimizes the sum of squared errors subject to an L1-norm constraint on the coefficients. In this paper, a 5-fold cross-validated sequence of lasso models is fitted in order to produce shrinkage estimates with potentially lower predictive error than ordinary least squares.
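A sketch of such a fit using the Statistics Toolbox lasso function is shown below; the variable names are placeholders, and selecting the minimum-MSE lambda is one reasonable choice rather than necessarily the one used in the project.

[B, FitInfo] = lasso(Xtrain, Ytrain, 'CV', 5);         % 5-fold cross-validated lasso path
idx      = FitInfo.IndexMinMSE;                        % lambda with the lowest CV MSE
Yhat     = Xtest * B(:, idx) + FitInfo.Intercept(idx); % predictions at that lambda
mseLasso = mean((Ytest - Yhat).^2);
lassoPlot(B, FitInfo, 'PlotType', 'CV');               % lambda vs. MSE, cf. Figure 2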

3.1.2 Elastic Net

Elastic net is a combination of ridge regression and lasso regularization. Similar to lasso, elastic

net can also generate zero-valued coefficients. Empirical studies suggest that elastic net can outperform lasso on data with highly correlated predictors.
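In Matlab, the elastic net can be fitted with the same lasso function by choosing an Alpha value strictly between 0 (ridge-like) and 1 (pure lasso); the 0.5 below is an illustrative choice, not necessarily the value used in this project.

[B, FitInfo] = lasso(Xtrain, Ytrain, 'Alpha', 0.5, 'CV', 5);  % 5-fold cross-validated elastic net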

3.2 Performance Metric

Figure 2. Lambda vs. MSE for Lasso fit

Figure 3. Lambda vs. MSE for Elastic Net fit


Mean Squared Error of Lasso = 10101

Mean Squared Error of Elastic Net = 10118

3.3 Result Analysis

The large mean squared errors of both lasso and elastic net indicate that even regularized linear

regression is not a good approach to forecast the bike sharing demand. In the following sections,

we will explore multiple nonlinear regression techniques.

4 Ensemble Learning and Ensemble Regularization

4.1 Method Description

Ensemble methods use multiple learning algorithms to obtain better predictive performance. An

ensemble is a technique for combining many weak learners in order to produce a strong learner.

4.1.1 Least Squares Boosting

Least Squares Boosting is a type of ensemble learning which fits regression ensembles in order

to minimize mean squared error. At every step, the ensemble fits a new learner to the difference

between the observed response and the aggregated prediction of all learners grown previously.
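A minimal Matlab sketch of a least squares boosted tree ensemble is given below; the ensemble size of 200 trees is illustrative.

lsb    = fitensemble(Xtrain, Ytrain, 'LSBoost', 200, 'Tree');  % boosted regression trees
mseLSB = mean((Ytest - predict(lsb, Xtest)).^2);               % test-set MSE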

4.1.2 Bagging

Bagging is another type of ensemble learning which works by training learners on resampled

versions of the data. The resampling is done by bootstrapping observations in the training set.

Although the flexibility of ensembles makes them prone to over-fitting the training data, bagging tends to reduce this problem.
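A corresponding sketch for a bagged tree ensemble is shown below; for bagging, fitensemble needs the problem type stated explicitly when the response is numeric.

bag    = fitensemble(Xtrain, Ytrain, 'Bag', 200, 'Tree', 'Type', 'regression');  % bagged trees
mseBag = mean((Ytest - predict(bag, Xtest)).^2);                                 % test-set MSE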


4.1.3 Ensemble Regularization

Ensemble regularization helps choose fewer weak learners in a way that does not diminish

predictive performance. Specifically, it finds an optimal set of learner weights by tuning the lasso

parameter to minimize the ensemble resubstitution error.
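Continuing the bagged-ensemble sketch above, regularization and pruning can be applied roughly as follows; the lambda grid and chosen weight column are illustrative. Here regularize computes lasso learner weights for each lambda, and shrink keeps only the learners with nonzero weight.

lambdaGrid = [0.001 0.01 0.1 1 10 100 1000];          % illustrative lasso parameter grid
bag    = regularize(bag, 'Lambda', lambdaGrid);       % lasso learner weights for each lambda
cmp    = shrink(bag, 'WeightColumn', 5);              % keep learners with nonzero weight at the 5th lambda
mseReg = mean((Ytest - predict(cmp, Xtest)).^2);      % test-set MSE of the reduced ensemble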

4.2 Performance Metric

Mean Squared Error of Least Squares Boosting = 10030

Mean Squared Error of Bagging = 3656.5

Mean Squared Error of Regularized Bagging = 3473.4

Root Mean Squared Logarithmic Error of Regularized Bagging for Validation data set = 0.63302

4.3 Result Analysis

4.3.1 Least Squares Boosting

1) Figure 4 estimates the generalization error by cross-validation. The curve shows that satisfactory performance can be obtained from a smaller ensemble, perhaps one containing 100 to 120 trees; a sketch of this cross-validation is given after Figure 4.

Figure 4. Number of trees vs. Cross-validated MSE
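This cross-validated curve can be produced roughly as follows, continuing the boosted-ensemble sketch lsb from Section 4.1.1 (a sketch, not the project's exact code):

cvens  = crossval(lsb, 'KFold', 5);                  % 5-fold cross-validated copy of the ensemble
cvLoss = kfoldLoss(cvens, 'Mode', 'Cumulative');     % CV MSE as a function of the number of trees
plot(cvLoss); xlabel('Number of trees'); ylabel('Cross-validated MSE');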

2) Importance of the variables used for generating the model, with higher values representing greater importance:


We see that hour, month, atemp, temp, humidity, season, and year have greater importance.
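These importance values can be obtained from the trained boosted ensemble roughly as follows (a sketch using the lsb ensemble from the earlier example):

imp = predictorImportance(lsb);              % one importance estimate per predictor
bar(imp); xlabel('Predictor'); ylabel('Importance estimate');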

3) Let the errors be the differences between the predicted and the actual counts. The normal probability plot of the errors shows that the residuals are close to normally distributed in the center of the data, while skewing away from normality above and below the mean.

Figure 5. Normal Probability Plot

4) We group the errors by the different categorical variables to see whether the error distribution in any particular period differs significantly from the others.

Figure 6. Breakdown of Errors by hour

Figure 7. Breakdown of Errors by month


Figure 8. Breakdown of Errors by weekday

Figure 9. Breakdown of Errors by season

We observe that the errors during hours 7, 8, 17, and 18 differ markedly from those in the hours immediately before and after. In the breakdown of errors by weekday, the Saturday and Sunday patterns appear to differ from those of the workdays, where the variance of the errors tends to be smaller.

4.3.2 Bagging

1) Importance of Variables:

Compared to Least Squares Boosting, “hour” is now the only variable that stands out in importance.

4.3.3 Regularized Bagging

1) Comparing regularized and unregularized ensembles:


Figure 10. Lasso parameter vs. Resubstitution MSE

Figure 11. Lasso parameter vs. number of learners with nonzero weights

('x' denotes the value at lambda = 0 on the logarithmic scale; same for all five figures)

Figure 12. Lambda vs. MSE for resubstitution and cross-validation

Figure 13. Lambda vs. Number of learners for resubstitution and cross-validation

From Figure 11, we can see that the number of learners with nonzero weights is reduced by more than one third in the regularized ensemble. Because the resubstitution MSE values are likely to be overly optimistic, we cross-validate the ensemble for different values of lambda.

Figure 14. Number of trees


The cross-validated error in Figure 12 shows that the cross-validation MSE is almost flat for lambda up to a bit over $10^3$. With regularization, there are only 42 trees in the new ensemble, notably reduced from the 200 in the unregularized ensemble. The reduced ensemble is about 19.8% the size of the original while giving a lower loss.

2) Figure 15 suggests that our model has difficulty predicting higher counts, where all of the residuals are biased in the same direction; this indicates that some effect occurring at high counts is not captured well by the model.

Figure 15. Predicted values vs. Residuals

3) Use a simple chart to show predicted versus actual count for 6 months of data in 2011:

Figure 16. True Count vs. Regularized Bag Ensemble for days 16 to 19 of January to June in 2011


The blue line represents the real counts and the red line the predicted counts. The data covers days 16 to 19 of January through June 2011. As illustrated by the graph, the model is not very effective at capturing the peak values of the real data.

5 Neural Network (NN)

According to the DARPA Neural Network Study (1988, AFCEA International Press, p. 60), “a

neural network is a system composed of many simple processing elements operating in parallel

whose function is determined by network structure, connection strengths, and the processing

performed at computing elements or nodes.” Generally, a neural network consists of many

processing units connected by communication channels that carry numeric data. The processing units operate only on their local data and on the inputs they receive via the connections.

5.1 Method Description

In order to fit a neural network to the bike sharing demand data, parameters to configure include

the type of neural network, the number of layers for the neural network, the number of neurons

in each layer, the transfer functions between layers, the performance metric and the training

function. After multiple attempts, the best network structure found is a cascade-forward network with three hidden layers containing 10, 15, and 10 neurons, respectively. The transfer functions are tangent sigmoid for the three hidden layers and linear for the output layer. The performance metric is mean squared error, and the training function is Bayesian regularization backpropagation, which updates the weight and bias values according to Levenberg-Marquardt optimization to minimize a combination of squared errors and weights.
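A minimal Matlab sketch of this configuration is shown below; it assumes the predictors and targets are arranged with one column per observation, as the Neural Network Toolbox expects, and is illustrative rather than the exact project code.

net = cascadeforwardnet([10 15 10], 'trainbr');   % three hidden layers, Bayesian regularization training
net.layers{1}.transferFcn = 'tansig';             % tangent sigmoid transfer functions
net.layers{2}.transferFcn = 'tansig';
net.layers{3}.transferFcn = 'tansig';
net.layers{4}.transferFcn = 'purelin';            % linear output layer
net.performFcn = 'mse';                           % mean squared error performance metric
net  = train(net, Xtrain', Ytrain');              % inputs/targets given as columns per observation
Yhat = net(Xtest')';                              % predicted hourly counts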


5.2 Best Performance Metric

Mean Squared Error = 2582.7

Root Mean Squared Logarithmic Error for Validation data set = 0.79834

5.3 Result Analysis

Figure 17. Cascade-forward network with three layers each with neurons 10, 15 and 10

Figure 18. Epochs vs. MSE

Figure 19. Target value vs. Output

Figure 17 displays a view of the cascade-forward network that generates the best result. In Figure 18, the MSE for the training data kept dropping while the MSE for the test data stopped improving after around 34 epochs, which suggests the model would begin to over-fit beyond that point, so training was stopped there. The R-squared value in Figure 19 indicates that the model is a good fit.


6 Local Linear Regression (LLR)

For a regression model that is highly nonlinear and of unknown structure, local linear regression may be applied to fit the model. Local linear regression performs weighted local averaging, with the weights determined by a kernel function. For each observation x0 in the training data, local parameters β are determined by weighted ordinary least squares; a new input x within a radius of bandwidth h of x0 then uses the parameters fitted at x0 to generate its predicted response.

6.1 Method Description

The local weighted least squares estimate at an observation $x_0$ takes the form

$$\hat{\beta}_{x_0} = (X_0' W_h X_0)^{-1} X_0' W_h Y,$$

with

$$W_h = \mathrm{diag}\!\left(K\!\left(\frac{\|x_1 - x_0\|}{h}\right), \ldots, K\!\left(\frac{\|x_n - x_0\|}{h}\right)\right), \qquad X_0 = X - 1_n x_0',$$

where X and Y are defined as in the previous description for ordinary least squares regression.

For the bike sharing demand data, bandwidths h = 0.5, 5, 10, and 30 are each applied, and a Gaussian kernel function is used.
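A minimal sketch of this procedure for a single query point is given below; it is illustrative Matlab with a Gaussian kernel, and an intercept column is added so that the fitted value at x0 is the local intercept.

function yhat = llrPredict(Xtrain, Ytrain, x0, h)
% Local linear regression prediction at the query point x0 (1-by-p row vector).
K    = @(u) exp(-u.^2 / 2);                         % Gaussian kernel
D    = bsxfun(@minus, Xtrain, x0);                  % x_i - x0 for every training observation
d    = sqrt(sum(D.^2, 2));                          % distances ||x_i - x0||
Wh   = diag(K(d / h));                              % kernel weight matrix W_h
X0   = [ones(size(Xtrain, 1), 1), D];               % local design matrix with intercept
beta = (X0' * Wh * X0) \ (X0' * Wh * Ytrain);       % weighted least squares estimate
yhat = beta(1);                                     % local intercept = fitted value at x0
end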

6.2 Best Performance Metric

Mean Squared Error > 10000


6.3 Result Analysis

The high mean squared errors for each bandwidth h indicate that local linear regression might not be an ideal method for predicting the bike sharing demand, although there may still be some room to improve the performance metric by further tuning the bandwidth.

7 Conclusion

To conclude, among all the methods considered, ensemble learning and the neural network produce the best mean squared errors. For these two methods, cross-validation is performed to ensure model quality, and multiple attempts are necessary to find the best fit. For ensemble learning, regularizing the bagged ensemble significantly improves the performance metric. For the neural network, architectures with one to four layers and a range of neuron counts were tested. As further steps, additional tuning of the current methods might improve the predictions, and deeper analysis of the data and its features could identify any missing relationships. For each method, Matlab code is provided in the Appendix for further explanation.

References

"Neural Network Toolbox." Http://www.mathworks.com/help/nnet/. Web.

S, Warren, and Cary Sarle, Cary. Ftp://ftp.sas.com/pub/neural/FAQ.html. Web.


Appendix

(i) Predictor Variables

Categorical Variables:

·Year: 2011 to 2012

·Month: January to December

·Hour: 01 to 24 hours

·Weekday: Sunday to Saturday

·Season: spring to winter

·holiday: whether the day is considered a holiday

·working day: whether the day is neither a weekend nor holiday

·weather:

1: Clear, Few clouds, Partly cloudy, Partly cloudy

2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

Numerical Variables:

·temperature in Celsius

· “feels like” temperature in Celsius

·relative humidity

·wind speed

(ii) Response Variables

·number of non-registered user rentals initiated

·number of registered user rentals initiated

·number of total rentals

(iii) Matlab code

The Matlab code for each method performed above is attached in a separate file named

“matlabcode_IEOR265paper_MinchaoLin.html”