Predicting Flight Delays


Page 1: Predicting Flight Delays

Data Mining Project

By Mihai, Shruti, Jasdeep, Ankita

How long would I have to wait for my flight?

Page 2: Predicting Flight Delays

Objective

Predict flight delays: Flight delays are a major problem within the airline industry, as they lead to financial losses and damage business reputation.

Examine causes for flight delays: This project explores which factors influence the occurrence of flight delays.

[Diagram labels: Model Flight Delay; Predict; Help; Airline Companies; Optimize Operations; Reduce further loss]

Page 3: Predicting Flight Delays

Project flow map

Understanding the Data

Variable Selection

Sub-Setting Data

Replacing Missing Values

Remove Outliers

Re-creating Variables

KNN

C5.0

Misclassification

Summary

Page 4: Predicting Flight Delays

7 million Records

303 Origins

303 Destinations

20 Carriers

29 Variables

Page 5: Predicting Flight Delays

Understanding the Data: Summary Statistics

Arrival Delay (ArrDelay)

NAS Delay

Distance

Correlation

Page 6: Predicting Flight Delays

Variable Selection

29 Variables

Flight Details

> Arrival time > Departure time > Origin > Destination > Month, Week, Day > Unique Carrier

> Flight Num > Tail Num > Year

Other information

> Delay minutes > Weather delay > NAS delay > Unique Carrier Delay > Late Aircraft Delay > Security Delay

> Actual arrival time > Taxi In > Taxi Out > Departure Delay > CRS Elapsed Time

We took expert advice from Prof. Dehnad

Page 7: Predicting Flight Delays


TAKING CHARGE OF DATA

Page 8: Predicting Flight Delays

Step 1 - Subsetting

Busiest Origins

Top Carriers by Origin

Carrier Hubs:

American Airlines: Dallas

Delta: Atlanta

United: Chicago

[Figure: busiest origins. Origin (color) and count of Origin (size).]

[Figure: Count of Unique Carrier for each Unique Carrier (AA, MQ, DL, EV, UA, FL, OO, YV, US, CO, NW, OH, 9E, XE, F9, AS, B6), on an axis from 0K to 250K.]
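The subsetting step can be sketched as follows. This is a minimal Python illustration, not the project's R code: the sample rows are hypothetical, and the column names `Origin` and `UniqueCarrier` are assumed to match the dataset. It counts departures per origin, keeps the busiest origins, and then keeps only the chosen carriers.

```python
from collections import Counter

# Hypothetical miniature of the flight table; the real data has about 7 million rows.
flights = [
    {"Origin": "ATL", "UniqueCarrier": "DL"},
    {"Origin": "ATL", "UniqueCarrier": "DL"},
    {"Origin": "ATL", "UniqueCarrier": "FL"},
    {"Origin": "ORD", "UniqueCarrier": "UA"},
    {"Origin": "ORD", "UniqueCarrier": "AA"},
    {"Origin": "DFW", "UniqueCarrier": "AA"},
    {"Origin": "DFW", "UniqueCarrier": "AA"},
    {"Origin": "BOS", "UniqueCarrier": "B6"},
]

def busiest_origins(rows, n):
    """Return the n origins with the most departures."""
    counts = Counter(row["Origin"] for row in rows)
    return [origin for origin, _ in counts.most_common(n)]

def subset(rows, origins, carriers):
    """Keep only rows whose origin and carrier are both in the chosen sets."""
    return [r for r in rows if r["Origin"] in origins and r["UniqueCarrier"] in carriers]

top = busiest_origins(flights, 3)
kept = subset(flights, set(top), {"AA", "DL", "UA"})
print(top, len(kept))
```

Filtering early keeps the later steps (KNN in particular) tractable on a laptop, at the cost of conclusions that apply only to the selected hubs and carriers.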

Page 9: Predicting Flight Delays

Step 2 – Replacing Missing Values

Missing values are present when flights are diverted or cancelled.

% of Flights Diverted: 99.75% not diverted, 0.25% diverted

[Figure: Diverted (group) (color) and % of Total Count of Diverted (size). Percents are based on each column of the table.]

Dropped all records where Diverted = 1

Cancelled    No. of flights
0            6,872,294
1            137,434

Dropped all records where Cancelled = 1

Replaced NAs with 0 in the columns below:
- CarrierDelay
- SecurityDelay
- LateAircraftDelay
- NASDelay
- WeatherDelay

Final Record Count = 479,281
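A minimal Python sketch of this cleaning step (the project's own scripts are in R). The rows are hypothetical, `None` stands in for R's NA, and the column names are assumed to match the dataset, including `LateAircraftDelay` for the late-aircraft column.

```python
DELAY_COLUMNS = ["CarrierDelay", "SecurityDelay", "LateAircraftDelay",
                 "NASDelay", "WeatherDelay"]

def clean(rows):
    """Drop diverted and cancelled flights, then replace missing delay causes with 0."""
    cleaned = []
    for row in rows:
        if row["Diverted"] == 1 or row["Cancelled"] == 1:
            continue  # these records carry the missing arrival information
        fixed = dict(row)  # copy so the input rows stay untouched
        for col in DELAY_COLUMNS:
            if fixed.get(col) is None:
                fixed[col] = 0
        cleaned.append(fixed)
    return cleaned

def make_row(diverted, cancelled, carrier_delay=None):
    """Build a hypothetical record with every other delay cause missing."""
    row = {col: None for col in DELAY_COLUMNS}
    row.update({"Diverted": diverted, "Cancelled": cancelled,
                "CarrierDelay": carrier_delay})
    return row

rows = [make_row(0, 0, carrier_delay=25), make_row(1, 0), make_row(0, 1)]
cleaned = clean(rows)
print(len(cleaned), cleaned[0])
```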

Page 10: Predicting Flight Delays

Step 3 – Outliers

Box Plot from R: Distance based on Origin

Destinations with large distance:

HNL = Honolulu, Hawaii

OGG = Kahului, Hawaii

ANC = Anchorage, Alaska

[Figure: box plot of Distance for each Origin (ATL, DFW, ORD).]
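The slide does not state the exact rule used to flag outliers. A common box plot convention marks points more than 1.5 times the interquartile range beyond the quartiles; the Python sketch below applies that convention to hypothetical distances. R's `boxplot` computes its hinges slightly differently, so the cut-offs may not match exactly.

```python
import statistics

def boxplot_outliers(values, whisker=1.5):
    """Return the values lying beyond whisker * IQR from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical distances in miles from one origin; the last one is a long-haul route.
distances = [226, 403, 545, 594, 689, 731, 802, 925, 1200, 4502]
print(boxplot_outliers(distances))
```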

Page 11: Predicting Flight Delays

Arrival Delay

Box Plot from R: Delays by Month

[Figure: Month vs. Arr Delay. The data is filtered on Unique Carrier and Origin. The Unique Carrier filter keeps AA, DL and UA. The Origin filter keeps ATL, DFW and ORD.]

Page 12: Predicting Flight Delays

Misclassifications:

If ArrDelay < 15, then Delay in the last 5 columns = 0
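Read as a correction rule, this can be sketched in Python as below. Two assumptions are made here: that "the last 5 columns" are the five delay-cause columns of the dataset, and that the rule overwrites any non-zero cause recorded for a flight that arrived less than 15 minutes late.

```python
DELAY_COLUMNS = ["CarrierDelay", "WeatherDelay", "NASDelay",
                 "SecurityDelay", "LateAircraftDelay"]

def zero_causes_when_on_time(row, threshold=15):
    """Set every delay-cause column to 0 when ArrDelay is below the threshold."""
    fixed = dict(row)  # copy so the original record is not modified
    if fixed["ArrDelay"] < threshold:
        for col in DELAY_COLUMNS:
            fixed[col] = 0
    return fixed

# Hypothetical record: an early arrival that still lists a NAS delay.
record = {"ArrDelay": -3, "CarrierDelay": 0, "WeatherDelay": 0,
          "NASDelay": 12, "SecurityDelay": 0, "LateAircraftDelay": 0}
print(zero_causes_when_on_time(record))
```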

Page 13: Predicting Flight Delays

Step 4 – Output Column

Generated based on ArrDelay.

70% of the data consists of flights arriving on time.

Found the quantiles of ArrDelay.

Classes: On time, Low-Delay, High-Delay

And we thought we were done with the data treatment part of the project...

...Let's start with KNN
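A Python sketch of how such an output column could be built. The slides do not give the cut points, so both are assumptions: flights under 15 minutes late count as on time, and the remaining flights are split into Low-Delay and High-Delay at the median of the delayed flights.

```python
import statistics

ON_TIME_LIMIT = 15  # assumed threshold, in minutes

def make_labeler(arr_delays):
    """Build a labelling function whose low/high cut is the median of delayed flights."""
    delayed = [d for d in arr_delays if d >= ON_TIME_LIMIT]
    cut = statistics.median(delayed)  # assumed quantile; the slides do not name one

    def label(delay):
        if delay < ON_TIME_LIMIT:
            return "On time"
        return "Low-Delay" if delay <= cut else "High-Delay"

    return label

# Hypothetical arrival delays in minutes.
arr_delays = [-5, 0, 3, 8, 10, 12, 14, 20, 45, 120]
label = make_labeler(arr_delays)
labels = [label(d) for d in arr_delays]
print(labels)
```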

Page 14: Predicting Flight Delays

K-Nearest Neighbor

Page 15: Predicting Flight Delays

Categorical Variables

Carrier – AA, UA & DL

Origin – ATL, DFW, ORD

Day of the Week = 1 to 7

Month = 1 to 12

Day of the Month = 1 to 31

Destination = 117 distinct values

Create two dummy variables for each column

The number of dummy variables is large. Is there a better way to divide the data?
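Dummy coding can be illustrated with a short Python sketch (not the project's R code). It assumes the usual convention of one 0/1 column per level except a baseline level, which gives two columns for a three-level variable such as Carrier or Origin.

```python
def dummy_encode(values):
    """Return the kept levels and one 0/1 column per level except the first (baseline)."""
    levels = sorted(set(values))
    kept = levels[1:]  # the first level is the baseline and gets no column
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows

columns, encoded = dummy_encode(["AA", "UA", "DL", "AA"])
print(columns)
print(encoded)
```

Under the same convention a variable with 117 levels, such as Destination, would need 116 columns, which is why grouping the levels first is attractive.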

Page 16: Predicting Flight Delays

Day of Week

Month

Grouped based on weather delays.

Made a list of holidays in 2008.

People travel 2-3 days before and after holidays.

Divided into holidays and non-holidays.

Day of Month

Destination

Found busiest destinations.

Grouped into: 1) Less Busy 2) Medium Busy 3) High Busy
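The holiday grouping could look like the Python sketch below. The holiday list is a short illustrative subset rather than the project's list, and the 3-day window is an assumption based on the "2-3 days before and after" remark.

```python
from datetime import date

# Illustrative subset of 2008 US holidays; the project's actual list is not shown.
HOLIDAYS_2008 = [
    date(2008, 1, 1),    # New Year's Day
    date(2008, 7, 4),    # Independence Day
    date(2008, 11, 27),  # Thanksgiving
    date(2008, 12, 25),  # Christmas Day
]
WINDOW_DAYS = 3  # assumed width of the travel window around each holiday

def is_holiday_period(day):
    """True when the day falls within WINDOW_DAYS of any listed holiday."""
    return any(abs((day - holiday).days) <= WINDOW_DAYS for holiday in HOLIDAYS_2008)

print(is_holiday_period(date(2008, 12, 23)), is_holiday_period(date(2008, 3, 12)))
```

This replaces the 31-level Day of Month variable with a single two-level holiday flag.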

Page 17: Predicting Flight Delays

Running KNN

Standardized the data.

Created training and test datasets in a 70-30 ratio using random sampling.

Used the knn function.

Used the “train” function to find the optimal value of K: it ran for more than 6 hours.

Used a for loop over k = 1:20 and ran it overnight.
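The steps above can be sketched in plain Python. This is an illustration of the technique, not the project's R script: the data are two hypothetical, well-separated clusters, the helper names are made up, and the loop covers only k = 1 to 5 to keep it quick.

```python
import math
import random
import statistics
from collections import Counter

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def knn_predict(train_x, train_y, point, k):
    """Majority vote among the k training points closest to `point`."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], point))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def error_rate(train_x, train_y, test_x, test_y, k):
    """Share of test points whose predicted label differs from the true label."""
    wrong = sum(knn_predict(train_x, train_y, x, k) != y for x, y in zip(test_x, test_y))
    return wrong / len(test_y)

# Hypothetical data: two clusters of 30 points each.
rng = random.Random(0)
features, labels = [], []
for center, name in [(0.0, "On time"), (10.0, "High-Delay")]:
    for _ in range(30):
        features.append([center + rng.uniform(-1, 1), center + rng.uniform(-1, 1)])
        labels.append(name)

features = standardize(features)

# 70-30 split by random sampling of row indices.
indices = list(range(len(features)))
rng.shuffle(indices)
cut = len(indices) * 7 // 10
train_idx, test_idx = indices[:cut], indices[cut:]
train_x = [features[i] for i in train_idx]
train_y = [labels[i] for i in train_idx]
test_x = [features[i] for i in test_idx]
test_y = [labels[i] for i in test_idx]

# Try each k and keep the one with the lowest test error.
errors = {k: error_rate(train_x, train_y, test_x, test_y, k) for k in range(1, 6)}
best_k = min(errors, key=errors.get)
print(errors, best_k)
```

Each prediction compares the test point against every training row, so the cost grows with the product of the two set sizes. That is consistent with the long run times reported on the full subset.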

Page 18: Predicting Flight Delays

KNN Result

For loop: lowest error rate at k = 17.

Error rate decreased from 11.75% (k = 1) to 10.69% (k = 17).

It then increased from 10.69% to 10.76% (k = 20).

[Figure: misclassification rate (axis 0.1 to 0.12) vs. K value (1 to 20); the labelled minimum is 0.1069235.]

Page 19: Predicting Flight Delays

Classification Tree

• Which factors impact the delays the most?

• Is KNN the best method to predict delays?

Page 20: Predicting Flight Delays

Classification Tree

C5.0: Split is not binary.

The tree created is more “bushy”.

CART: Splits categorical variables in binary splits.

Tree is less “bushy”.
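The difference between a multiway and a binary split can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the rows are hypothetical, and entropy-based information gain is used for both kinds of split, whereas CART implementations commonly use the Gini index and C5.0's exact criterion is not reproduced here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(rows, groups):
    """Information gain from partitioning rows by the given groups of levels."""
    remainder = 0.0
    for group in groups:
        subset = [label for level, label in rows if level in group]
        if subset:
            remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([label for _, label in rows]) - remainder

# Hypothetical (Origin, outcome) pairs.
rows = [("ATL", "On time"), ("ATL", "On time"), ("ATL", "Delayed"),
        ("DFW", "On time"), ("DFW", "On time"), ("DFW", "On time"),
        ("ORD", "Delayed"), ("ORD", "Delayed"), ("ORD", "On time")]
levels = sorted({level for level, _ in rows})

# Multiway split: one branch per level.
multiway_gain = gain(rows, [{level} for level in levels])

# Binary splits: one level against all the others.
binary_gains = {level: gain(rows, [{level}, set(levels) - {level}]) for level in levels}
best_binary = max(binary_gains, key=binary_gains.get)
print(round(multiway_gain, 4), best_binary, round(binary_gains[best_binary], 4))
```

A multiway split separates all levels in one node, which produces wide trees. A binary splitter needs several successive nodes to reach the same partition, so each node has only two branches.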

Page 21: Predicting Flight Delays

First output of decision tree

• 159 nodes

• Error rate: 9.7%

• Attribute usage:

100.00% Weather Delay
100.00% NAS Delay
99.88% Late Aircraft Delay
93.29% Carrier Delay
18.25% Unique Carrier
15.79% Origin
10.90% Distance
10.29% CRS Arr Time
7.89% CRS Dep Time
3.88% Weather
3.80% Dest
3.11% holiday

Page 22: Predicting Flight Delays

Tree after pruning

Pruning methodology used: C5.0Control(minCases = 3000)

Error rate increases from 9.7% to 10.8%

Number of nodes falls from 159 to 12

What variables are important:

Variable               Usage
NAS Delay              100.00%
Late Aircraft Delay    98.28%
Carrier Delay          81.47%
Weather Delay          7.49%
Unique Carrier         7.39%
Origin                 6.56%
CRS Dep Time           5.29%
CRS Arr Time           3.89%

Page 23: Predicting Flight Delays

Tree after pruning

Page 24: Predicting Flight Delays

Our learnings from the tree

1. Flights tend to face longer delays in the evening (after 4:30 PM)

2. Carrier delays at UA are likely to cause high delays

3. Flights from Chicago (ORD) are more likely to face long delays due to late aircraft and NAS delays

Page 25: Predicting Flight Delays

Misclassification

Method   Other details                     Classification error   Prediction error
KNN      K = 1                             NA                     11.75%
KNN      K = 17                            NA                     10.69%
KNN      K = 20                            NA                     10.76%
C5.0     All variables, without pruning    9.7%                   10.0%
C5.0     All variables, with pruning       10.8%                  10.7%

Page 26: Predicting Flight Delays

Re-run KNN and C5.0 without delay columns

C5.0

• 1 node

• Misclassification rate: 30.6%

KNN

• At K up to 20

• Misclassification rate: 29.18%

[Figure: prediction error for knn, on an axis from 0 to 80.]
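A tree with a single node predicts one class for every flight. Assuming that class is the majority class (on time, roughly 70% of flights), its error should be close to 30%, in line with the rate reported above. The Python sketch below computes this baseline for hypothetical labels with the same class balance.

```python
from collections import Counter

def majority_baseline_error(labels):
    """Error rate of always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return 1 - most_common_count / len(labels)

# Hypothetical labels with about 70% on-time flights.
labels = ["On time"] * 7 + ["Low-Delay"] * 2 + ["High-Delay"]
baseline = majority_baseline_error(labels)
print(round(baseline, 2))
```

Comparing every model against this baseline shows how much of the roughly 10% error achieved earlier depended on the delay-cause columns.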

Page 27: Predicting Flight Delays

Management Summary

FOR PASSENGERS

Fly during the first half of the day to avoid delays

Shorter flights ~ Longer delays

Higher delays on Friday

Highest delays in summer

FOR CARRIERS

[Figure: percentage of flights by carrier (AA, DL, UA) and origin (ATL, DFW, ORD).]

[Figure: High Delay and Low Delay for AA and UA at ORD and DFW.]

UA should shift its hub from Chicago to other origins to reduce the number of delays

Page 28: Predicting Flight Delays

Issues / Next Steps

Try Random Forest to improve predictions

Make predictions without the reason-for-delay variables

Data is biased: almost 70% of flights arrive on time

Running on a large dataset is limited by the processing capability of our laptops, so the data needs to be divided.

For the large dataset, we increased the heap size in R

Page 29: Predicting Flight Delays

Appendix – Data, packages and software

Dataset:

http://stat-computing.org/dataexpo/2009/the-data.html

Software: RStudio, Tableau

Packages: corrplot, class, C50, rpart

Page 30: Predicting Flight Delays

Appendix – R codes

conclusion.R

Initial Processing-Part1.R

Tree_final.R

knn.R

Initial Processing- Part2.R
