New York Metro Data Analysis
Analyzing New York Metro
Image credits
"NYC subway-4D" by CountZ at English Wikipedia. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:NYC_subway-4D.svg#/media/File:NYC_subway-4D.svg
https://c1.staticflickr.com/3/2188/2363525822_9285c44cd7_b.jpg
*proud member of
Udacity Intro to Data Science Final Project
© Kevin Hung 2015
by Kevin Hung
Question of Interest | How to Model Hourly Ridership Entries?
Image credits
http://i.dailymail.co.uk/i/pix/2012/11/05/article-2227401-15DC2645000005DC-20_634x407.jpg
sec3: Clue #1 | Hourly Schedule
• Do people follow a predictable timetable or itinerary in ridership?
• Peak hours seem intuitive for plausible reasons
sec3: Clue #2 | Work Week
• Whopping 10 Million Difference in our subset!
• Found a Great Feature
sec3: Clue #3 | Is it Raining?
• Do people tend to ride less on rainy days, or is it because there are more non-rainy days than rainy ones in our particular
dataset?
• Next: Let’s test this!
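A quick way to see why the question above matters: totals depend on how many observations each group has, while means and medians do not. The sketch below uses made-up toy numbers (not values from the MTA data), and the names `no_rain`, `rain`, and `summarize` are hypothetical.

```python
from statistics import mean, median

# Hypothetical hourly entries (toy numbers, NOT taken from the MTA data).
no_rain = [120, 300, 80, 450, 210, 500, 90, 310]  # 8 observations
rain = [150, 320, 100, 400]                       # 4 observations

def summarize(entries):
    """Return count, total, mean, and median of a list of hourly entries."""
    return {
        "n": len(entries),
        "total": sum(entries),
        "mean": mean(entries),
        "median": median(entries),
    }

dry = summarize(no_rain)
wet = summarize(rain)

# The totals differ mostly because the groups differ in size;
# the per-observation averages are much closer together.
print("non-rainy:", dry)
print("rainy:    ", wet)
```

In this toy case the non-rainy total is more than double the rainy total, yet the means differ by only 15 entries, which is why a formal test on the distributions is the next step.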
sec1 | Statistical Test
Q: Why use a statistical significance test?
A1: Draw valid inferences!
A2: Formal framework to compare & evaluate data
A3: Tell us whether effects perceived in the sample reflect the population as a whole
Single-Sided Mann-Whitney U-Test[1]: Are 2 populations the same?
H0: “The distributions of rainy and non-rainy ridership populations are equal!”
HA: “No! Ridership of one population tends to be bigger than the other”
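To make the test concrete, here is a minimal pure-Python sketch of a one-sided Mann-Whitney U-test using the normal approximation. It is an illustration under stated assumptions, not the code behind these slides: it omits the tie and continuity corrections, the function names are hypothetical, and the sample values are toy numbers. In practice a library routine such as `scipy.stats.mannwhitneyu(x, y, alternative="greater")` would be the usual choice.

```python
import math

def rank_with_ties(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def mann_whitney_u_greater(x, y):
    """One-sided test of H_A: values in x tend to be larger than values in y.

    Returns (U, p) where U counts (x, y) pairs with x > y (ties count 0.5)
    and p comes from the normal approximation without tie or continuity
    correction, so it is only rough for small samples.
    """
    n1, n2 = len(x), len(y)
    ranks = rank_with_ties(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability P(Z >= z)
    return u, p

# Toy samples (hypothetical, not MTA data).
u_stat, p_value = mann_whitney_u_greater([5, 6, 7, 8], [1, 2, 3, 4])
print(u_stat, p_value)
```

A small p-value here would favor HA over H0; with real ridership data the two inputs would be the non-rainy and rainy hourly entries.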
sec1 | Mann-Whitney U-Test Result
⇒
Reject the Null Hypothesis!
H0: “The distributions of rainy and non-rainy ridership populations are equal!”
HA: “No! Ridership of one population tends to be bigger than the other”
Rain may be a good feature…
sec2 | Building Our Model
We’ll use the Normal Equation to Find our Solution!
Easy as 1-2-3:
1. Design (Data → Features → Matrix)
2. Target (Ridership Entries as Integer Vector)
3. Parameters (Solution Vector that Minimizes Squared Error)
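The three steps above (design matrix X, target vector y, parameter vector θ) can be sketched with NumPy. The normal equation gives θ = (XᵀX)⁻¹Xᵀy, the least-squares solution described in [2]. This is a minimal sketch under assumptions: the feature columns, the toy values, and the name `fit_normal_equation` are hypothetical, and the code solves the linear system directly rather than forming the inverse.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Return theta that minimizes squared error by solving (X^T X) theta = X^T y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Solving the system is numerically safer than computing an explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# 1. Design: toy matrix with columns [intercept, hour, rain] (hypothetical).
X = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 2, 1],
    [1, 3, 1],
    [1, 4, 0],
])

# 2. Target: toy ridership entries generated from known parameters,
#    so the fit can be checked against them.
theta_true = np.array([100.0, 20.0, -5.0])
y = X @ theta_true

# 3. Parameters: recover theta from X and y.
theta = fit_normal_equation(X, y)
print(theta)
```

If XᵀX is singular (for example, when dummy-variable columns are perfectly collinear), `np.linalg.solve` raises an error; `np.linalg.lstsq` is a common fallback in that case.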
sec2 | Linear Regression
Our Model
Coefficient of Determination
Interpretation
• ~ 53% of the variation in ridership entries is
explained by our model
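For reference, the coefficient of determination is R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the observed values from their mean. A minimal sketch with toy numbers (the function name `r_squared` and the values are hypothetical, not the project's data):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0, 4.0]    # toy observed values
predicted = [1.5, 2.0, 3.0, 3.5]   # toy model predictions

score = r_squared(observed, predicted)
print(score)
```

A model that always predicts the mean scores 0 and a perfect fit scores 1, so a value near 0.53 means the model accounts for roughly half of the variance in the entries.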
sec2 | Model Appropriateness
• Residual plots show that our model often underpredicts ridership when entries exceed 2000
• Using only Hour, Weekday, UNIT, and Rain as features may not be adequate!
• Suggestions: High Bias Model → Find more Features
sec4 | Conclusion
• Mann-Whitney U-Test & Paired Histogram Show Possibility of People
Tending to Ride the Metro More on Non-Rainy days
• Rainy Feature Contributes to 1% Increase in R²
• Need More Features: Incorporate Weather Factors? Foggy? Thunder? Temperature Values?
Image credits
http://pix.avaxnews.com/avaxnews/6a/ab/0001ab6a_medium.jpeg
sec0 | References
• [1] "Mann–Whitney U Test." Wikipedia. Wikimedia Foundation, n.d. Web.
• [2] "CS229 Lecture Notes." Andrew Ng. http://cs229.stanford.edu/notes/cs229notes1.pdf
• [3] Frost, Jim. "Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?" Minitab, 30 May 2013. Web. 15 Sept. 2015.
• [4] NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm
• [5] "GraphPad Curve Fitting Guide." GraphPad Curve Fitting Guide. GraphPad, n.d. Web. 16 Sept. 2015. <http://www.graphpad.com/guides/prism/6/curvefitting/index.htm?reg_analysischeck_linearreg.htm>
* data sources: <http://web.mta.info/developers/developer-data-terms.html#data>