Predicting Flight Delays
Uploaded by: mihai-enescu
Category: Data & Analytics
1
Data Mining Project
By Mihai, Shruti, Jasdeep, Ankita
How long would I have to wait for my flight?
2
Objective
Predict flight delays: Flight delays are a major problem in the airline industry because they lead to financial losses and damage business reputation.
Examine causes of flight delays: This project explores which factors influence the occurrence of flight delays.
Model flight delay to help airline companies: predict delays, optimize operations, reduce further loss.
3
Project flow map
Understanding the Data > Variable Selection > Sub-setting data > Replacing Missing Values > Remove Outliers > Re-creating variables > KNN > C5.0 > Misclassification > Summary
4
7 million Records
303 Origins
303 Destinations
20 Carriers
29 Variables
5
Understanding the Data
Summary Statistics
Arrival Delay (ArrDelay)
NAS Delay
Distance
Correlation
6
Variable Selection
29 Variables
Flight details: Arrival time, Departure time, Origin, Destination, Month, Week, Day, Unique Carrier, Flight Num, Tail Num, Year
Other information: Delay minutes, Weather delay, NAS delay, Unique Carrier Delay, Late Aircraft Delay, Security Delay, Actual arrival time, Taxi In, Taxi Out, Departure Delay, CRS Elapsed Time
We took expert advice from Prof. Dehnad.
7
TAKING CHARGE OF DATA
Step 1 - Subsetting
Busiest Origins
Top Carriers by Origin
Carrier Hubs
American Airlines: Dallas
Delta: Atlanta
United: Chicago
[Figure: Origin (color) and count of Origin (size); labeled counts range from 91,265 to 414,513.]
[Figure: Count of Unique Carrier for each Unique Carrier (AA, MQ, DL, EV, UA, FL, OO, YV, US, CO, NW, OH, 9E, XE, F9, AS, B6); y-axis from 0K to 250K.]
8
Step 2 – Replacing Missing Values
Missing values are present when flights are diverted or cancelled.
% of Flights Diverted: 99.75% not diverted, 0.25% diverted
[Figure: Diverted (group) (color) and % of Total Count of Diverted (size). Percents are based on each column of the table.]
Dropped all records where Diverted = 1.
Cancelled    No. of flights
0            6,872,294
1            137,434
Dropped all records where Cancelled = 1.
Replaced NAs with 0 in the following columns: CarrierDelay, SecurityDelay, LateAircraftDelay, NASDelay, WeatherDelay
Final Record Count = 479,281
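The cleaning in this step can be sketched as follows. The project itself was done in R; this is only an illustrative Python equivalent. The column names follow the Data Expo 2009 file, while the function name and the sample rows are made up for the example.

```python
# Illustrative sketch, not the authors' R code.
# Column names follow the Data Expo 2009 CSV; the sample rows are invented.
DELAY_COLUMNS = ["CarrierDelay", "WeatherDelay", "NASDelay",
                 "SecurityDelay", "LateAircraftDelay"]

def clean_flights(rows):
    """Drop diverted/cancelled flights, then replace missing delay causes with 0."""
    cleaned = []
    for row in rows:
        if row["Diverted"] == 1 or row["Cancelled"] == 1:
            continue  # these flights have no usable arrival information
        fixed = dict(row)
        for col in DELAY_COLUMNS:
            if fixed.get(col) in (None, "NA"):
                fixed[col] = 0
        cleaned.append(fixed)
    return cleaned

sample = [
    {"Diverted": 0, "Cancelled": 0, "ArrDelay": 34, "CarrierDelay": 2,
     "WeatherDelay": "NA", "NASDelay": 32, "SecurityDelay": "NA",
     "LateAircraftDelay": 0},
    {"Diverted": 1, "Cancelled": 0, "ArrDelay": "NA", "CarrierDelay": "NA",
     "WeatherDelay": "NA", "NASDelay": "NA", "SecurityDelay": "NA",
     "LateAircraftDelay": "NA"},
    {"Diverted": 0, "Cancelled": 1, "ArrDelay": "NA", "CarrierDelay": "NA",
     "WeatherDelay": "NA", "NASDelay": "NA", "SecurityDelay": "NA",
     "LateAircraftDelay": "NA"},
]

cleaned_sample = clean_flights(sample)
print(len(cleaned_sample), "flight(s) kept")
```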
9
Step 3 – Outliers
Distance
Box plot from R: Distance based on Origin
Destinations with large distance: HNL = Honolulu, Hawaii; OGG = Kahului, Hawaii; ANC = Anchorage, Alaska
[Figure: box plot of Distance for each Origin (ATL, DFW, ORD); labeled distances range from 1547 to 4502.]
10
Arrival Delay
Box plot from R: Delays by Month
[Figure: box plot of Month vs. Arr Delay. The data is filtered on Unique Carrier and Origin. The Unique Carrier filter keeps AA, DL and UA. The Origin filter keeps ATL, DFW and ORD.]
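A box plot flags points that lie more than 1.5 times the interquartile range beyond the quartiles. A minimal Python sketch of that rule is below. The function name and the sample distances are assumptions for illustration, and R's boxplot computes its hinges slightly differently from the quantile method used here.

```python
import statistics

def boxplot_outliers(values, whisker=1.5):
    """Return the values outside the usual 1.5 * IQR box-plot fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

# Invented distances in miles: seven ordinary routes and one very long one.
distances = [300, 400, 500, 600, 700, 800, 900, 4502]
print(boxplot_outliers(distances))
```

Whether a flagged point should be removed is a judgment call: a long flight to Honolulu or Anchorage is unusual but not an error.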
11
Misclassifications:
If ArrDelay < 15, then the delay in the last five columns (CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay) = 0.
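One way to apply this rule is sketched below in Python (the project used R). The function name and the sample row are hypothetical; the column names follow the Data Expo 2009 file.

```python
# Illustrative sketch; the sample row is invented.
DELAY_CAUSE_COLUMNS = ["CarrierDelay", "WeatherDelay", "NASDelay",
                       "SecurityDelay", "LateAircraftDelay"]

def zero_causes_below_threshold(row, threshold=15):
    """Set every delay-cause column to 0 when the arrival delay is under the threshold."""
    fixed = dict(row)
    if fixed["ArrDelay"] < threshold:
        for col in DELAY_CAUSE_COLUMNS:
            fixed[col] = 0
    return fixed

row = {"ArrDelay": 9, "CarrierDelay": 4, "WeatherDelay": 0, "NASDelay": 5,
       "SecurityDelay": 0, "LateAircraftDelay": 0}
fixed_row = zero_causes_below_threshold(row)
```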
12
Step 4 – Output Column
Generated based on ArrDelay.
70% of the data consists of flights arriving on time.
Found the quantiles for ArrDelay.
Classes: On time, Low-Delay, High-Delay
And we thought we were done with the data treatment part of the project...
...Let's start with KNN
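A three-class output column can be derived from ArrDelay quantiles roughly as follows. The slides do not state the exact cut points, so the 70th and 85th percentiles used here are assumptions, as are the function name and the synthetic delays.

```python
import statistics

def label_delays(arr_delays, on_time_q=0.70, high_q=0.85):
    """Label each delay as On time, Low-Delay, or High-Delay using quantile cut points.

    The 0.70 and 0.85 quantiles are assumed values, not taken from the slides.
    """
    cuts = statistics.quantiles(arr_delays, n=100, method="inclusive")
    on_time_cut = cuts[round(on_time_q * 100) - 1]
    high_cut = cuts[round(high_q * 100) - 1]
    labels = []
    for delay in arr_delays:
        if delay <= on_time_cut:
            labels.append("On time")
        elif delay <= high_cut:
            labels.append("Low-Delay")
        else:
            labels.append("High-Delay")
    return labels

# Synthetic delays of 1..100 minutes, only to show the resulting class sizes.
delay_labels = label_delays(list(range(1, 101)))
print({name: delay_labels.count(name) for name in ("On time", "Low-Delay", "High-Delay")})
```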
13
K-Nearest Neighbor
14
Categorical Variables
Carrier – AA, UA & DL
Origin – ATL, DFW, ORD
Day of the Week = 1 to 7
Month = 1 to 12
Day of the Month = 1 to 31
Destination = 117 distinct values
Create two dummy variables for each column
The number of dummy variables is large. Is there a better way to divide the data?
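To see why the count grows quickly, here is a small Python sketch of dummy coding with one reference level per variable, so a column with k levels needs k - 1 dummies (two for the three-level Carrier and Origin columns). The function name is made up; the level counts are the ones listed on this slide.

```python
def dummy_encode(values, levels):
    """Encode values as k-1 dummy columns; the first level is the reference."""
    return [[1 if value == level else 0 for level in levels[1:]] for value in values]

# Carrier has three levels, so each flight gets two dummy columns (DL, UA).
carrier_dummies = dummy_encode(["AA", "UA", "DL"], ["AA", "DL", "UA"])

# Level counts from the slide; the total is the number of dummy columns needed
# if every categorical variable is encoded this way.
level_counts = {"Carrier": 3, "Origin": 3, "DayOfWeek": 7,
                "Month": 12, "DayOfMonth": 31, "Destination": 117}
total_dummies = sum(k - 1 for k in level_counts.values())
print(carrier_dummies, total_dummies)
```

Destination alone accounts for 116 of the columns, which is the motivation for grouping levels before running KNN.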
15
Day of Week, Month
Grouped based on weather delays.
Day of Month
Made a list of holidays in 2008.
People travel 2-3 days before and after holidays.
Divided into holidays and non-holidays.
Destination
Found the busiest destinations.
Grouped into: 1) Less Busy 2) Medium Busy 3) High Busy
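The regrouping above might look like the following Python sketch. The holiday list is a small illustrative subset of 2008 US holidays, and the 3-day window, the tier cut-offs, and the function names are all assumptions; the slides do not give the actual values.

```python
from datetime import date

# Illustrative subset of 2008 US holidays; the authors' full list is not in the slides.
HOLIDAYS_2008 = [date(2008, 1, 1), date(2008, 7, 4),
                 date(2008, 11, 27), date(2008, 12, 25)]

def near_holiday(day, holidays=HOLIDAYS_2008, window_days=3):
    """True if the day falls within window_days before or after any holiday."""
    return any(abs((day - holiday).days) <= window_days for holiday in holidays)

def busy_tier(flight_count, medium_cut, high_cut):
    """Map a destination's flight count to one of three tiers (cut-offs are hypothetical)."""
    if flight_count >= high_cut:
        return "High Busy"
    if flight_count >= medium_cut:
        return "Medium Busy"
    return "Less Busy"

print(near_holiday(date(2008, 12, 23)), near_holiday(date(2008, 3, 15)))
print(busy_tier(12000, medium_cut=2000, high_cut=8000))
```

This replaces 31 day-of-month levels with one binary flag and 117 destinations with three tiers.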
16
Running KNN
Standardized the data.
Created training and test datasets in a 70-30 ratio using random sampling.
Used the knn function.
Used the “train” function to find the optimal value of K; it ran for more than 6 hours.
Used a for loop over k = 1:20, which ran overnight.
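The procedure above (standardize, split 70-30, loop over k) can be sketched in plain Python as follows. This is not the authors' R code: the data is synthetic, the helper names are made up, scaling uses the training set's statistics (a common choice the slides do not confirm), and ties are broken by first occurrence rather than at random.

```python
import math
import random
from collections import Counter

def standardize(train, test):
    """Scale every feature with the training set's mean and standard deviation."""
    columns = list(zip(*train))
    means = [sum(col) / len(col) for col in columns]
    sds = [math.sqrt(sum((x - m) ** 2 for x in col) / (len(col) - 1)) or 1.0
           for col, m in zip(columns, means)]

    def scale(rows):
        return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

    return scale(train), scale(test)

def knn_predict(train_x, train_y, point, k):
    """Majority vote among the k training points closest to `point`."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], point))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def error_rate(train_x, train_y, test_x, test_y, k):
    """Share of test points whose predicted class differs from the true class."""
    wrong = sum(knn_predict(train_x, train_y, p, k) != y
                for p, y in zip(test_x, test_y))
    return wrong / len(test_y)

# Synthetic two-feature data: 70 "On time" points and 30 "Delayed" points.
random.seed(0)
data = [([random.gauss(0, 1), random.gauss(0, 1)], "On time") for _ in range(70)]
data += [([random.gauss(3, 1), random.gauss(3, 1)], "Delayed") for _ in range(30)]

random.shuffle(data)                      # random sampling
split = int(0.7 * len(data))              # 70-30 train/test split
train, test = data[:split], data[split:]
train_x, test_x = standardize([x for x, _ in train], [x for x, _ in test])
train_y = [y for _, y in train]
test_y = [y for _, y in test]

# Loop over k = 1..20 and keep the k with the lowest test error.
errors = {k: error_rate(train_x, train_y, test_x, test_y, k) for k in range(1, 21)}
best_k = min(errors, key=errors.get)
print(best_k, errors[best_k])
```

Each prediction scans the whole training set, so the cost grows with training size times test size times the number of k values tried, which is consistent with the long run times reported on this slide.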
17
KNN Result
For loop: lowest error rate at k = 17.
Error rate decreased from 11.75% (k = 1) to 10.69% (k = 17).
It then increased from 10.69% to 10.76% (k = 20).
18
[Figure: misclassification rate vs. K value for K = 1 to 20; y-axis from 0.1 to 0.12; minimum labeled 0.1069235.]
19
Classification Tree
• Which factors impact delays the most?
• Is KNN the best method to predict delays?
20
Classification Tree
C5.0: Split is not binary. The tree created is more “bushy”.
CART: Splits categorical variables in binary splits. The tree is less “bushy”.
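To make the contrast concrete, the Python sketch below scores a three-way split on Origin with entropy-based information gain (the measure behind C4.5/C5.0-style trees, which normalize it into a gain ratio) and a binary split with the Gini index (the usual CART criterion). The class counts and function names are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_drop(groups, impurity):
    """Parent impurity minus the size-weighted impurity of the child groups."""
    parent = [label for group in groups for label in group]
    n = len(parent)
    return impurity(parent) - sum(len(g) / n * impurity(g) for g in groups)

# Invented class counts per origin, only for illustration.
by_origin = {
    "ATL": ["On time"] * 8 + ["Delayed"] * 2,
    "DFW": ["On time"] * 7 + ["Delayed"] * 3,
    "ORD": ["On time"] * 5 + ["Delayed"] * 5,
}

multiway_split = list(by_origin.values())                             # one branch per origin
binary_split = [by_origin["ORD"], by_origin["ATL"] + by_origin["DFW"]]  # ORD vs. the rest

print("information gain, 3-way split:", impurity_drop(multiway_split, entropy))
print("Gini decrease, binary split:", impurity_drop(binary_split, gini))
```

A multiway split creates one branch per level at a single node, which is why the resulting tree looks bushier than one built from repeated binary splits.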
21
First output of decision tree
• 159 nodes
• Error rate: 9.7%
• Attribute usage:
100.00% Weather Delay
100.00% NAS Delay
99.88% Late Aircraft Delay
93.29% Carrier Delay
18.25% Unique Carrier
15.79% Origin
10.90% Distance
10.29% CRS Arr Time
7.89% CRS Dep Time
3.88% Weather
3.80% Dest
3.11% holiday
22
Tree after pruning
Pruning methodology used: C5.0Control(minCases = 3000)
Error rate increases from 9.7% to 10.8%.
Number of nodes falls from 159 to 12.
What variables are important:
Variable               Usage
NAS Delay              100.00%
Late Aircraft Delay    98.28%
Carrier Delay          81.47%
Weather Delay          7.49%
Unique Carrier         7.39%
Origin                 6.56%
CRS Dep Time           5.29%
CRS Arr Time           3.89%
23
Tree after pruning
24
Our learnings from the tree
1. Flights tend to face longer delays in the evening (after 4:30 PM).
2. Carrier delays at UA are likely to cause high delays.
3. Flights from Chicago (ORD) are more likely to face long delays due to late aircraft and NAS delays.
25
Misclassification
Method   Other details                     Classification error   Prediction error
KNN      K = 1                             NA                     11.75%
KNN      K = 17                            NA                     10.69%
KNN      K = 20                            NA                     10.76%
C5.0     All variables, without pruning    9.7%                   10.0%
C5.0     All variables, with pruning       10.8%                  10.7%
26
Re-run KNN and C5.0 without delay columns
C5.0
• 1 node
• Misclassification rate: 30.6%
KNN
• K up to 20
• Misclassification rate: 29.18%
[Figure: chart of prediction error; axis from 0 to 80.]
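A tree with a single node predicts the same class for every flight, so its error is the share of flights outside the majority class. With roughly 70% of flights on time, that baseline is about 30%, close to the rates above. A small Python check is below; the 15/15 split between the two delay classes is an assumption.

```python
from collections import Counter

def majority_baseline_error(labels):
    """Error rate of always predicting the most common class."""
    counts = Counter(labels)
    return 1 - counts.most_common(1)[0][1] / len(labels)

# About 70% on time as stated in the slides; the 15/15 delay split is assumed.
class_labels = ["On time"] * 70 + ["Low-Delay"] * 15 + ["High-Delay"] * 15
baseline = majority_baseline_error(class_labels)
print(round(baseline, 3))
```

Read against this baseline, the models without delay columns add little beyond always predicting "On time".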
27
Management Summary
FOR PASSENGERS
Fly during the first half of the day to avoid delays
Shorter flights ~ Longer delays
Higher delays on Friday
Highest delays in summer
FOR CARRIERS
[Figure: percentage of flights by carrier (AA, DL, UA) and origin (ATL, DFW, ORD).]
[Figure: High Delay vs. Low Delay for AA and UA at ORD and DFW.]
UA should shift its hub from Chicago to other origins to reduce the number of delays
28
Issues / Next Steps
Try Random Forest to improve predictions.
Make predictions without the reason-for-delay variables.
Data is biased: almost 70% of flights arrive on time.
Running on a large dataset is limited by the processing capability of our laptops, hence the need to divide the data.
For the large dataset, we increased the heap size in R.
29
Appendix - Data, packages and software
Dataset: http://stat-computing.org/dataexpo/2009/the-data.html
Software: RStudio, Tableau
Packages: corrplot, class, C50, rpart
30
Appendix – R codes
conclusion.R
Initial Processing-Part1.R
Tree_final.R
knn.R
Initial Processing- Part2.R
31