Informed Traveler

Seda Davtyan Informed Traveler


Insight Data Engineering Project

Transcript of Informed Traveler

Page 1: Informed Traveler

Seda Davtyan

Informed Traveler

Page 2: Informed Traveler

Problem

Page 3: Informed Traveler

Datasets

• RITA (Research and Innovative Technology Administration) On-Time Performance dataset (October 1987 through July 2014)
  o Updated quarterly

• After unzipping all the files (one file per year and month), the data is about 65 GB

Page 4: Informed Traveler

UI Demo

• Informed Traveler

Page 5: Informed Traveler

Some Insight

Airline              Period               On Time    Carrier
JetBlue              01/2012-01/2013      67.39%     12.33%
JetBlue              01/2013-01/2014      61.36%     14.81%
Southwest            01/2012-01/2013      80.76%      6.41%
Southwest            01/2013-01/2014      72.17%      9.56%
American Airlines    01/2012-01/2013      75.48%      8.00%
American Airlines    01/2013-01/2014      68.70%      9.74%
Delta                01/2012-01/2013      80.67%      5.61%
Delta                01/2013-01/2014      81.81%      5.76%

Security + Weather Related Delays < 2%
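The year-over-year movement in these figures can be read off with a small calculation. The sketch below uses only the on-time percentages quoted on this slide; the function name `on_time_change` is illustrative and is not part of the project.

```python
# On-time percentages copied from the slide:
# (01/2012-01/2013, 01/2013-01/2014) for each airline.
ON_TIME = {
    "JetBlue": (67.39, 61.36),
    "Southwest": (80.76, 72.17),
    "American Airlines": (75.48, 68.70),
    "Delta": (80.67, 81.81),
}


def on_time_change(figures):
    """Return the change in on-time share, in percentage points, per airline."""
    return {
        airline: round(after - before, 2)
        for airline, (before, after) in figures.items()
    }


if __name__ == "__main__":
    for airline, change in on_time_change(ON_TIME).items():
        print(f"{airline}: {change:+.2f} percentage points")
```

On these numbers, Delta is the only one of the four airlines whose on-time share did not fall between the two periods.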

Page 6: Informed Traveler

Screenshots

• JetBlue 01/2013 – 01/2014
• Southwest 01/2013 – 01/2014
• AA 01/2013 – 01/2014
• Delta 01/2013 – 01/2014

Page 7: Informed Traveler

Data Pipeline

[Pipeline diagram; surviving labels: RITA, Data Collection, Flask API]

Page 8: Informed Traveler

Tradeoffs

Calculate the number of delays of each type for flights with Delay > 15 (minutes):

• Carrier
• Weather
• NAS
• Security
• Late Aircraft
• Unclassified (new field)
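The counting step on this slide can be sketched in Python as follows. This is a minimal illustration, not the project's Pig code. The column names (`ArrDelay`, `CarrierDelay`, and so on) are assumed to follow the RITA CSV headers. Two further choices are assumptions: a flight counts once toward every cause with nonzero minutes, and a delayed flight with no reported cause falls into "Unclassified".

```python
from collections import Counter

# Assumed column names, modeled on the RITA On-Time Performance CSV headers.
CAUSE_FIELDS = {
    "Carrier": "CarrierDelay",
    "Weather": "WeatherDelay",
    "NAS": "NASDelay",
    "Security": "SecurityDelay",
    "Late Aircraft": "LateAircraftDelay",
}
DELAY_THRESHOLD_MIN = 15  # "Delay > 15" from the slide


def _minutes(value):
    """Parse a CSV cell as minutes; blank or malformed cells count as 0."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0


def classify(record):
    """Return the delay types for one flight record (a dict of CSV cells)."""
    if _minutes(record.get("ArrDelay")) <= DELAY_THRESHOLD_MIN:
        return []  # treated as on time
    causes = [
        name
        for name, field in CAUSE_FIELDS.items()
        if _minutes(record.get(field)) > 0
    ]
    # Assumption: a delayed flight with no reported cause is "Unclassified".
    return causes or ["Unclassified"]


def count_delays(records):
    """Count delays of each type across an iterable of flight records."""
    counts = Counter()
    for record in records:
        counts.update(classify(record))
    return counts
```

For example, a record with `ArrDelay` of 40, `CarrierDelay` of 30, and `WeatherDelay` of 10 adds one to both the Carrier and Weather counts, while a record with `ArrDelay` of 20 and empty cause columns adds one to Unclassified.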

Page 9: Informed Traveler

Tradeoffs

• Pig to clean up and transform the data

• UDF from piggybank to handle commas in a .csv file (CSVExcelStorage)

• HBaseStorage to transfer the data into HBase

• Construct composite row keys for fast HBase querying
  o Airline_Year-Month-Day
  o Airline_Year-Month
  o Airline_Year
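The three key granularities listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's code: the airline is represented by a carrier code such as "B6", and months and days are zero-padded so that keys sort in date order. Neither detail is confirmed by the slides, and the helper names `row_keys` and `monthly_keys` are hypothetical.

```python
from datetime import date


def row_keys(airline, day):
    """Build the day, month, and year keys for one airline and calendar day."""
    return [
        f"{airline}_{day:%Y-%m-%d}",  # Airline_Year-Month-Day
        f"{airline}_{day:%Y-%m}",     # Airline_Year-Month
        f"{airline}_{day:%Y}",        # Airline_Year
    ]


def monthly_keys(airline, start, end):
    """Yield Airline_Year-Month keys for each month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield f"{airline}_{year:04d}-{month:02d}"
        month += 1
        if month > 12:
            year, month = year + 1, 1


if __name__ == "__main__":
    print(row_keys("B6", date(2013, 1, 15)))
    print(list(monthly_keys("B6", date(2013, 1, 1), date(2014, 1, 1))))
```

With counts stored at each granularity, a query such as 01/2013 to 01/2014 would need only the monthly keys for that range, each of which can be fetched by exact key, instead of a scan over every daily row.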

Page 10: Informed Traveler

Seda Davtyan

• PhD in CS&E, 2014

• Love Coffee and Camping

• Greatly Enjoy Baking