Benchmark Dataset for Timetable Optimization of Bus Routes ...Benchmark Dataset for Timetable...

Benchmark Dataset for Timetable Optimization ofBus Routes in the City of New Delhi

Anubhav Jain, Avdesh Kumar, Saumya Balodi, Pravesh BiyaniIIIT Delhi, India

{anubhav15129, avdesh15135, saumya15172, praveshb}@iiitd.ac.in

Abstract—Public transport is one of the major forms oftransportation in the world. This makes it vital to ensure thatpublic transport is efficient. This research presents a novelreal-time GPS bus transit data for over 500 routes of busesoperating in New Delhi. The data can be used for modelingvarious timetable optimization tasks as well as in other domainssuch as traffic management, travel time estimation, etc. Thepaper also presents an approach to reduce the waiting timeof Delhi buses by analyzing the traffic behavior and proposinga timetable. This algorithm serves as a benchmark for thedataset. The algorithm uses a constrained clustering algorithmfor classification of trips. It further analyses the data statisticallyto provide a timetable which is efficient in learning the inter-and intra-month variations.

Index Terms—Timetable Optimization, Bus Scheduling, DataAnalytics, Bus Transit Dataset

I. INTRODUCTION

Public transport is one of the most popular means oftransportation in various metropolitan cities across the globe.According to the Economic Survey of Delhi 2005-06, busesaccount for nearly 60% of the total demand. While it iswell understood that the public transport helps in combatingair pollution and congestion caused due to single-occupancyvehicles, the usage of buses in Delhi and other cities in Indiahas seen a nominal decline while the overall travel demandhas simultaneously increased.

One of the main reasons for this decline in the city ofDelhi (and other cities in India) is the lack of reliability ofthe bus routes. The timetable is often not made by the transitauthorities. Moreover, it is often outdated soon due to the rapidchange in infrastructure and the traffic conditions resulting indegradation in the reliability of buses. Finally, this decreasein reliability leads to unknown waiting times at the bus stopsfor the passengers. Due to the absence of a timetable, mostbus trips are operated in an ad-hoc fashion making it extremelydifficult for the passengers to trust the public transport networkleading to a decrease in passenger trips.

Interestingly, the various trips in a given bus-route stillfollow a certain pattern in a given day thanks to the patternin the traffic conditions throughout the day. In other words,even when the transit operators do not follow an “explicit”timetable, there is an “implicit” timetable that is followedwhich is not completely random. The main goal of this workis to unearth this pattern and develop an operational timetableof the bus routes in the city of Delhi. The efficacy of this“suggested” timetable is measured in terms of the average

waiting time a passenger has to endure at various stops in thegiven route assuming she follows the timetable. This waitingtime should ideally be lower than when the passenger doesnot follow the suggested timetable arrives at the same stop ina random fashion.

The paper presents a novel database containing real-timedata for over 500 routes of buses operating in Delhi. Toarrive at the timetable for benchmarking the dataset, thepaper samples two routes operated by the Delhi transportcorporation. Out of the two routes, one is frequent, while theother operates at an average frequency of thirty minutes. Thepaper uses the collected GPS feed of the buses.

The paper presents an approach which simplifies the prob-lem statement to propose an efficient algorithm for definingthe timetable based on GPS bus data collected over a periodof two months.

II. RELATED WORK

Researchers have been exploring the problem of reducingwaiting time as well as reducing bunching in buses. Patnaik etal. [1] applied data mining on data collected using AutomaticPassenger Counters. The paper uses clustering to divide datapoints into different headways based on a decision tree. Themethod informs whether the existing headway would workfor the buses. Wang et al. [2] used a sequential clusteringmethod to ensure a fixed order for buses so that time divisioncould be properly applied. They calculated the travel timeby incorporating bus dwell time at stops which depends onthe number of riders and also the road and intersection timedepending on the traffic on the road at that time. Kornfeld etal. [3] proposed an approach to minimize waiting time usingtwo models which are based on the assumption of how peoplearrive at the bus stop.

Yang et al. [4] proposed a method to optimize the timetablefor the subway system by proposing an integer programmingmodel to maximize the overlap time with the headway time.Wihartiko et al. [5] also proposed an integer programmingmodel which uses a modified generic algorithm for timetablingof buses. Chakroborty et al. [6] and Deb et al. [7], amixed integer nonlinear programming model was proposedfor timetabling of buses where the objective was to minimizethe total waiting time of passengers. Saargunawathy et al.[8] studied Open traffic platform which analysis traffic datalinked to open street map. They used it to analyze the trafficconditions in Kuala Lumpur. They listed different roads and

arX

iv:1

910.

0890

3v1

[cs

.OH

] 2

0 O

ct 2

019

junctions with heavy traffic flow at different times. Sunil et al.[9] proposed a dynamic GPS based time-tabling algorithm forpublic transport that can predict the waiting time consideringthe real-time location of the user and the bus, and thuspredicting estimated time of arrival.

The major issue with current research in the domain oftimetable optimization for buses is that there is not a standarddataset which is being used by researchers. Various algorithmsare applied on various datasets, which does not provideconclusive evidence of improvement in the state of the art.

The proposed algorithm used for benchmarking the pro-posed dataset goes beyond the ones in literature in two ways.Firstly, literature does not take the simplicity of the algorithminto consideration. This is vital primarily as the IT Departmentof most transportation corporations are not educated enoughto work with complex algorithms. The proposed algorithmclosely revolves around the traffic behavior to provide the mostoptimal timetable. This algorithm is being deployed to updatethe timetable of over a hundred buses in New Delhi, Indiawhich would affect the lives of millions of daily bus users.

Fig. 1. Route for bus no. 425

III. PROPOSED DATABASE

The proposed bus transit database 1 contains (i) Static and(ii) Dynamic data for buses operating in Delhi. The dynamicdata for three months has been stored separately which canbe used for applications which require previous data. Thisis the first of its kind database which provides access toreal-time transit data. This dataset would provide means forstandardization of timetable optimization algorithms whichwas not possible earlier as researchers were using classifieddatasets which weren’t available publicly. This informationcan easily be used for building real-time applications whichspan from building timetable optimization algorithms, which

1https://bus-data.github.io/

Fig. 2. Route for bus no. 534

Seconds

Pro

babi

lity

Den

sity

Fig. 3. Probability distribution of the arrival time at stop 17 for route no.425 at different times of the day.

has been benchmarked in this paper along with various otherapplications such as live GPS tracked monitoring of traffic,tracking anomalies in live traffic data for security purposes,travel time estimation and optimization. As the live GPSdata is sampled at every 10s for all the routes, it providessubstantial information about different roads on in the cityand the expected travel speed possible on those routes.

For creating the database, the paper has pre-processed thedata using static information about the bus stops for thatparticular route. Through the initial step of pre-processing, thedata was divided into relevant bus stops which were consideredto be nodes. To do so, every raw data point which lied withina 50-meter radius of the coordinate of a bus stop was mappedto the stop. The 50 meter threshold is chosen keeping in mind

situations where the bus traveling at a high speed might notstore a data point close to the bus stop. By doing so, theraw data was narrowed down to the stop based data where allthe nodes were the stops. This brought uniformity for furtherprocessing of the data. The bus routes are represented in figure1 and figure 2 for routes 425 and 534 respectively. The routefrom the first stop to the last is considered as the up route,and the opposite direction is referred to as the down route.

A. Dataset Statistics

The dataset captures the following two categories, men-tioned below.

1) Dynamic Data: The data for all 543 routes and 3464stops are provided real-time. The data was in the form ofProtocol Buffers and consists of information about all thebuses operating at that time. The data provided has a frequencyof 10s, where each data packet contains the date, time, latitudeand longitude values for each bus route along with bus-specificdetails like number plate, route number, direction of route (up/down). This can easily be accessed for real-time applicationsand can be stored and used as statistical data. The data tracksall the buses throughout the time they are active.

2) Static Data: The static data provides information aboutthe routes, stops, trips and stops times for all the buses whichare being modeled dynamically. The information containsspecific information to use dynamic data and extract therequired information. The dynamic data is coded using theinformation provided in the static data. This compresses thedynamic data packets and makes it more efficient for real-timeapplications.

To use this data for optimizing timetable, the dynamicdata has been stored as for over three months in a databasewhich can be directly used for the application of timetableoptimization algorithms. The data has been decoded and isprovided in a directly usable format. This has been done forroutes 425 and 534. These datasets have also been benchmarksusing the proposed approach.

IV. PROPOSED APPROACH

Current bus timetables are formulated without consideringthe traffic behavior based on time of the day and the day ofthe week. It has been observed from the data that these aresignificant components which affect the arrival time of the busat a particular stop which needs to be carefully scrutinized toprovide a more efficient timetable.

The paper provides benchmarks for two routes- 425 and534. These buses were specifically chosen as they differ quitesignificantly in their frequency and reliability. To model anefficient timetable, the paper looks at these two buses where534 is extremely frequent and has an average waiting time ofaround 5 minutes while buses on route 425 are more unreliableand have an average waiting time close to 22 minutes.

The variables are defined as:• Total bus stops N are present on the route.• The data has been taken for d days.• i represents the trip variable, ∈ (0, xk).

• j denotes the stop variable, ∈ (0, N).• k represents the day variable ∈ (0, d).• xk represents the number of trips on the kth day.• tki,j represents the arrival time of the bus at the jth stop

for the ith trip on the kth day.To reduce the effect of noise in the data, the algorithm

samples the data at every 3rd stop. The algorithm can besummarized in two stages, definition of the starting time atthe first stop and calculation of arrival time for the followingstops based on the first stop.

The optimization problem finds the most optimal t̂i,j suchthat,

t̂i,j = min V ar[tki,j ] (1)

A. Defining the starting time

As the timetable depends on the departure time of the busfrom the first stop, the algorithm uses K-means clustering [10]for this step. Unlike the standard clustering algorithm, whichminimizes the inter-class variance, this approach minimizesthe variances while keeping a minimum distance betweenthe clusters. The minimum distance between two clusters isequivalent to the standard frequency of the buses as suggestedby the Transportation Department.

Each of the set of data points in each cluster created usingthis approach is represented by C(n) at iteration n whichcontains M (n) data points. The centroid of the lth set ofcluster is denoted by µ(n)

l at iteration n and the total number ofclusters are c. Each of these cluster sets can be mathematicallyrepresented using equation 2, where the new data point beingclassified is tp.

C(n)l = {tp : ||tp− µ(n)

l ||2 < ||tp− µ(n)

m ||2,∀m; 1 < m < c}(2)

The definition of new clusters is conditioned on equation 3,where T1 is the frequency of the bus.

{||µ(n)l − µ(n)

m || > T1,∀l,m; 1 < l,m < c} (3)

The mean is updated using the equation 4.

µ(n+1)l =

∑tk,(n)i,j ∈C(n)

l

t(n),ki,j

|C(n)l |

=µ(n)l ∗M (n) + tp(n+1)

M (n) + 1(4)

In the first step, the average departure time for a bus iscalculated. While considering each new bus, the nearest meanwas searched, and if the difference between the departure timeand the nearest mean is less than the threshold T1, then themean of that cluster was updated by including this data pointas well. Otherwise, this would be considered as a new cluster.Finally, only if the total number of points in a cluster isgreater than a threshold T2, the starting time is consideredvalid. Otherwise, the data point is considered to be an outliercase and is not considered for further calculations. Due tonoise in the data and inconsistency in the starting time of the

first bus, the paper grid searches for the best threshold to findthe appropriate starting time which reduces the waiting timeas well as the bunching of buses. The threshold T2 is takenas ten days which is one-third of the total number of days forwhich the data is used.

B. Adjacent Stops

For every subsequent bus stop, the average time to reachthat bus stop is calculated from the first node for every 15-minute interval. The timetable for these nodes is defined as thestart time plus the average time taken to reach that stop (inthe particular time range). The optimal time for every adjacentstop can be written as,

t̂i,j = t̂i,1 +

∑dk=1(t

ki,j − t̂i,1)d

∀j ∈ (2, N) (5)

Figure 3 shows that the arrival time of the bus at each stopis in the form of a Gaussian distribution which has highervariance as the day progresses. By taking the mean in theabove approach, the algorithm minimizes the waiting time,which is defined as the variance of this distribution thus givingthe most optimal timetable.

The data is first divided into 15-minute time slots for moreaccurate calculation and estimation of traffic behavior. Thisdynamic time-based timetable takes into consideration theregular changes which come in the traffic behavior every 15minutes.

C. Calculating Waiting Time

The pre-timetable waiting time (preWT) is defined as theexpected time a person will wait if he/she arrives at a bus stopat any random time. If the next bus arrives after N0 minutesfrom the previous bus then the expected waiting time for thenext bus at that stop will be:

preWT =

N0∑n=1

n/N0 = (N0 + 1)/2 (6)

The algorithm averages this over all the trips for a bus routeto get the mean waiting time for different stops.

For calculating the post-timetable waiting time, equation7 has been used. Where the mentioned instantaneous post-timetable waiting time (ipostWT(i,j,k)) has been averaged overall the days/ trips for various experiments.

ipostWT (i, j, k) = minx

(tki,j − t̂x,j) s.t. tki,j > t̂x,j (7)

V. RESULTS

The paper proposes an approach to create a timetable for thetwo buses using a relatively direct procedure which resultedin a significant decrease in the waiting time which has beenapplied to the two bus routes - 425 and 534. It has been appliedseparately to the up and down routes of the buses.

The following experiments have been performed to showthe efficiency of the algorithm.

Bus Route Number

Wai

ting

Tim

e (M

inut

es)

0

5

10

15

20

25

425-Up 425-Down 534-Up 534-Down

Pre-Timetable Waiting Time Post-Timetable Waiting Time

Fig. 4. Comparison of Waiting Time at the 1st Node for all the Buses.

Time

Wai

ting

Tim

e (M

inut

es)

0

10

20

30

4-8 8-12 12-16 16-20 20-24

Pre-Timetable Waiting Time Post_Timetable Waiting Time

Fig. 5. Comparison of waiting time for bus no. 425 Down Route before andafter the timetable when training and testing on alternate days.

Time

Wai

ting

Tim

e (M

inut

es)

0

10

20

30

4-8 8-12 12-16 16-20 20-24


Fig. 6. Comparison of waiting time for bus no. 425 Up Route before andafter the timetable when training and testing on alternate days.

• The first experiment shows the performance of using K-

Time

Wai

ting

Tim

e (M

inut

es)

0

2

4

6

8

4-8 8-12 12-16 16-20 20-24


Fig. 7. Comparison of waiting time for bus no. 534 down route when trainingand testing on alternate days.

Time

Wai

ting

Tim

e (M

inut

es)

0

2

4

6

8

4-8 8-12 12-16 16-20

Pre-TimeTable Waiting Time Post-TimeTable Waiting Time

Fig. 8. Comparison of waiting time for bus no. 534 up route when trainingand testing on alternate days.

Time

Wai

ting

Tim

e (M

inut

es)

0

5

10

15

20

25

4-8 8-12 12-16 16-20 20-24


Fig. 9. Comparison of waiting time for bus no. 425 Down Route before andafter the timetable when testing on second month data.

Time

Wai

ting

Tim

e (M

inut

es)

0

10

20

30

4-8 8-12 12-16 16-20 20-24


Fig. 10. Comparison of waiting time for bus no. 425 Up Route before andafter the timetable when testing on second month data.

Time

Wai

ting

Tim

e (M

inut

es)

0

2

4

6

8

4-8 8-12 12-16 16-20 20-24


Fig. 11. Comparison of waiting time for bus no. 534 down route when testingon second month data.

means clustering for classification of the data into trips.This experiment is specific to the first node, i.e., thestarting point of the bus. The algorithm has been trainedon the first-month data, and it has been tested on second-month data.

• To judge the model for the formation of the timetableof the remaining stops as well as learning intra-monthvariations, the model was trained and tested on alternatedays of the first and second month. The data was trainedon all the odd days of the two months, and it was testedon even days.

• For learning inter-month variations in a data whichalready contains high randomness, the algorithm wastrained on first-month data and tested on the second-month data.

The results have been presented in the form of bar graphsshowcasing the difference brought by the introduction of theproposed timetable. The impact of clustering for the initialnode, by following the first protocol, has been presentedin figure 4. This shows that there a significant amount ofrandomness in the starting time of these buses, as the currentset timetable is not followed. The proposed approach has

Time

Wai

ting

Tim

e (M

inut

es)

0

2

4

6

8

10

4-8 8-12 12-16 16-20


Fig. 12. Comparison of waiting time for bus no. 534 up route when testingon second month data.

been able to reduce this randomness of the starting time. Ifthis is strictly followed it would yield much more promisingresults for the subsequent stops as well. As the resultantincrease in waiting time of the following bus stops is theaddition of the randomness in the 1st stop along with therandomness in the traffic behavior. The first can be controlledby proper implementation of the timetable by public/ privatetransportation departments.

For the second experiment, the paper looks at the formationof a timetable by looking at the inter-month variations in thedata. In doing so, the model shows its efficiency in learningthe randomness which is present within the constraints of thesame month. Figure 7 shows a comparison of the waitingtime without any timetable with the waiting time post the newtimetable for bus no. 534. Figure 8 shows the graph for the theup-route while figure 7 shows the graph for the down-route.Both the graphs compare the waiting time at different timeintervals.

Lastly, the model has been tested on a completely unseendata after being trained on the first-month data. The resultsupon being tested on the second-month data have been pre-sented in figure 9 and figure 10 for the up and down route ofbus number 425 respectively.

The impact of a timetable is more in the case of bus route425 due to higher irregularity. Due to the high frequency of busnumber 534 introducing large changes is not possible. Duringthe calculation of final waiting time, the randomness in thestarting time due to human constraints have been taken intoconsideration. As the final waiting time is calculated usingthe same probabilistic distribution of the starting stop. Thus,showing that even if the timetable is not followed closely,it would still impact the waiting time considerably. Thesesituations make this essential for real-world applications wherehuman constraints are significant.

VI. CONCLUSION AND FUTURE WORK

The paper proposes a novel and first of its kind freelyavailable benchmark dataset for analyzing real-time bus transit.The data captures real-time variations in traffic, and moreover,

it can be used for standardization of timetable optimizationalgorithms, which are currently applied on a variety of datasetswhich aren’t publicly available. For benchmarking the dataset,the paper presents an approach to propose a timetable takinginto consideration traffic variations across time. The papershows the efficiency of this algorithm in reducing the waitingtime of passengers on Delhi bus routes. The algorithm isapplied to the Delhi Transportation Corporation buses number425 and 534.

While this approach looks at the efficiency with Delhi basedbuses, it would be interesting to see the implementation ofbuses from different cities. The paper assumes that the trafficbehavior does not vary to a large extent across days. This is an-other factor which can be added while defining the timetable.Another exciting area of research would be incorporating alimit on the number of passengers. The timetable can beestablished to keep a check on such factors which influencepassenger satisfaction.

REFERENCES

[1] J. Patnaik, S. Chien, and A. Bladikas, “Estimation of bus arrival timesusing APC data,” Journal of public transportation, vol. 7, no. 1, p. 1,2004.

[2] J. Wang and Y. Cao, “Operating time division for a bus route based onthe recovery of GPS data,” Journal of Sensors, vol. 2017, 2017.

[3] S. Kornfeld, W. Ma, and A. Resnikoff, “Optimizing bus schedules tominimize waiting time,” Operations Research II, pp. 21–392, 2014.

[4] X. Yang, X. Li, Z. Gao, H. Wang, and T. Tang, “A cooperativescheduling model for timetable optimization in subway systems,” IEEETransactions on Intelligent Transportation Systems, vol. 14, pp. 438–447, March 2013.

[5] F. Wihartiko, A. Buono, and B. Silalahi, “Integer programming modelfor optimizing bus timetable using genetic algorithm,” IOP ConferenceSeries: Materials Science and Engineering, vol. 166, no. 1, p. 012016,2017.

[6] P. Chakroborty, K. Deb, and B. Srinivas, “Network wide optimalscheduling of transit systems using genetic algorithms,” Computer AidedCivil and Infrastructure Engineering, vol. 13, pp. 363 – 376, 12 2002.

[7] K. Deb and P. Chakroborty, “Time scheduling of transit systems withtransfer considerations using genetic algorithms,” Evolutionary compu-tation, vol. 6, pp. 1–24, 02 1998.

[8] S. Manogaran, M. Ali, K. M. Yusof, and R. Suhaili, “Analysis ofvehicular traffic flow in the major areas of Kuala Lumpur utilizingopen-traffic,” in AIP Conference Proceedings, vol. 1883, p. 020013, AIPPublishing, 2017.

[9] N. Gunjal Sunil, V. Joshi Ajinkya, C. Gosavi Swapnil, and B. Kshir-sagar Vyanktesh, “Dynamic bus timetable using GPS,” InternationalJournal of Advanced Research in Computer Engineering & Technology(IJARCET) Vol, vol. 3, no. 3, pp. 775–778, 2014.

[10] J. A. Hartigan and M. A. Wong, “Algorithm as 136: A k-meansclustering algorithm,” Journal of the Royal Statistical Society. SeriesC (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.

Benchmark Dataset for Timetable Optimization of Bus Routes ...Benchmark Dataset for Timetable...

Documents

Transcript of Benchmark Dataset for Timetable Optimization of Bus Routes ...Benchmark Dataset for Timetable...