New York Metro Data Analysis

Analyzing New York Metro

Image credits

"NYC subway-4D" by CountZ at English Wikipedia. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:NYC_subway-4D.svg#/media/File:NYC_subway-4D.svg

https://c1.staticflickr.com/3/2188/2363525822_9285c44cd7_b.jpg

*proud member of

Udacity Intro to Data Science Final Project

© Kevin Hung 2015

by Kevin Hung

Question of Interest | How to Model Hourly Ridership Entries?

Image credits

http://i.dailymail.co.uk/i/pix/2012/11/05/article-2227401-15DC2645000005DC-20_634x407.jpg

sec3: Clue #1 | Hourly Schedule

• Do people follow a predictable timetable or itinerary in ridership?

• Peak hours seem intuitive for plausible reasons

sec3: Clue #2 | Work Week

• Whopping 10 Million Difference in our subset!

• Found a Great Feature

sec3: Clue #3 | Is it Raining?

• Do people ride less on rainy days, or does it only look that way because our particular dataset has more non-rainy days than rainy ones?

• Next: Let’s test this!
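One quick check before any formal test is to compare average entries per day rather than raw totals, since totals favor whichever group has more days. The sketch below is illustrative only: the `(date, rain, entries)` record layout, the function name, and the toy numbers are assumptions, not the project's actual schema or data, and it assumes the rain flag is constant within a day.

```python
from collections import defaultdict

def mean_entries_per_day(records):
    """Average total entries per calendar day, split by rain flag.

    `records` is an iterable of (date, rain, entries) tuples. This layout is
    a hypothetical one chosen for illustration.
    """
    daily_totals = defaultdict(int)  # (date, rain) -> total entries that day
    for date, rain, entries in records:
        daily_totals[(date, rain)] += entries

    sums = defaultdict(int)  # rain flag -> summed daily totals
    days = defaultdict(int)  # rain flag -> number of days
    for (_date, rain), total in daily_totals.items():
        sums[rain] += total
        days[rain] += 1
    return {rain: sums[rain] / days[rain] for rain in sums}

# Made-up toy data: three dry days and one rainy day.
toy = [
    ("05-01", 0, 100), ("05-01", 0, 300),
    ("05-02", 0, 200), ("05-02", 0, 200),
    ("05-03", 0, 150), ("05-03", 0, 250),
    ("05-04", 1, 180), ("05-04", 1, 260),
]
averages = mean_entries_per_day(toy)
```

In this toy data the dry days have the larger total simply because there are more of them, while the per-day average is higher for the rainy day, which is exactly the ambiguity the slide raises.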

sec1 | Statistical Test

Q: Why use a statistical significance test?

A1: Draw valid inferences!

A2: Formal framework to compare & evaluate data

A3: Tell us whether the effects we perceive in the sample reflect the population as a whole

Single-Sided Mann-Whitney U-Test[1]: Are 2 populations the same?

H0: “The distributions of rainy and non-rainy ridership populations are equal!”

HA: “No! Ridership of one population tends to be bigger than the other”
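For readers who want to see the mechanics, here is a minimal sketch of the U statistic using only the Python standard library. It is not the project's code: the function names are made up for illustration, the p-value uses the large-sample normal approximation without tie or continuity corrections, and it reports a two-sided p-value because HA as worded above does not name a direction. In practice a maintained implementation such as `scipy.stats.mannwhitneyu` is the safer choice.

```python
from math import sqrt
from statistics import NormalDist

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_x, p_two_sided

# Made-up samples, purely to exercise the function.
u, p = mann_whitney_u([1, 2, 3], [4, 5, 6])
```

A small U for one sample means its values tend to rank below the other sample's. The normal approximation is only reasonable for large samples, which hourly ridership data would typically provide.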

sec1 | Mann-Whitney U-Test Result

⇒

Reject the Null Hypothesis!

H0: “The distributions of rainy and non-rainy ridership populations are equal!”

HA: “No! Ridership of one population tends to be bigger than the other”

Rain may be a good feature…

sec2 | Building Our Model

We’ll use the Normal Equation to Find our Solution!

Easy as 1, 2, 3:

1. Design (Data → Features → Matrix)

2. Target (Ridership Entries as Integer Vector)

3. Parameters (Solution Vector that Minimizes Squared Error)
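The three pieces above fit together through the normal equation: with design matrix X and target vector y, the parameter vector θ that minimizes squared error solves (XᵀX)θ = Xᵀy. The sketch below solves that system with plain Gaussian elimination so it runs without third-party libraries. The function name and the toy data are assumptions for illustration, not the project's implementation, and it says nothing about how the actual features were encoded.

```python
def solve_normal_equation(X, y):
    """Solve (X^T X) theta = X^T y by Gauss-Jordan elimination with pivoting.

    X is a list of rows (lists of floats); y is a list of target values.
    """
    n_features = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(n_features)]
        + [sum(row[i] * t for row, t in zip(X, y))]
        for i in range(n_features)
    ]
    for col in range(n_features):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n_features), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; features are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        # Eliminate this column from every other row.
        for r in range(n_features):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n_features] / A[i][i] for i in range(n_features)]

# Toy design matrix: an intercept column plus one made-up feature.
X = [[1.0, 0.0], [1.0, 4.0], [1.0, 8.0], [1.0, 12.0]]
y = [10.0, 18.0, 26.0, 34.0]  # constructed as exactly 10 + 2 * feature
theta = solve_normal_equation(X, y)
```

Because the toy targets lie exactly on a line, the recovered parameters match the intercept and slope used to build them. With many features, a numerical library's least-squares solver would usually be preferred over hand-rolled elimination.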

sec2 | Linear Regression: Our Model

Coefficient of Determination

Interpretation

• ~53% of the variation in ridership entries is explained by our model
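As a reminder of where that percentage comes from, the coefficient of determination is R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the actual values from their mean. A minimal sketch follows; the function name and numbers are made up for illustration and are unrelated to the project's data.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Toy values chosen so the arithmetic is easy to follow by hand:
# SS_res = 4 * 0.25 = 1.0 and SS_tot = 5.0, so R^2 = 0.8.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
score = r_squared(actual, predicted)
```

A value near 0.53 would therefore mean the residual sum of squares is a bit under half of the total sum of squares.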

sec2 | Model Appropriateness

• Residual Plots Show that our model often underpredicts ridership for entries 2000+

• Using Hour, Weekday, UNIT, Rain may not be adequate!

• Suggestions: High Bias Model → Find more Features
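The under-prediction seen in a residual plot can also be checked numerically by averaging the residuals (actual minus predicted) over the high-ridership observations. The sketch below is a hypothetical helper with made-up numbers, and it assumes "2000+" refers to the actual entry counts.

```python
def mean_residual_above(actual, predicted, threshold):
    """Mean of (actual - predicted) over observations with actual >= threshold.

    A positive result means the model under-predicts in that range on average.
    Returns None when no observation reaches the threshold.
    """
    residuals = [a - p for a, p in zip(actual, predicted) if a >= threshold]
    if not residuals:
        return None
    return sum(residuals) / len(residuals)

# Made-up values for illustration only.
actual = [500.0, 1500.0, 2500.0, 4000.0]
predicted = [600.0, 1400.0, 2000.0, 3000.0]
high_range_bias = mean_residual_above(actual, predicted, 2000)
```

A clearly positive mean in the upper range, alongside residuals near zero elsewhere, is the pattern that points toward missing features rather than noise.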

sec4 | Conclusion

• Mann-Whitney U-Test & Paired Histogram Show Possibility of People Tending to Ride the Metro More on Non-Rainy Days

• Rainy Feature Contributes to 1% Increase in R²

• Need More Features: Incorporate Weather Factors? Foggy? Thunder? Temperature Values?

Image credits

http://pix.avaxnews.com/avaxnews/6a/ab/0001ab6a_medium.jpeg

sec0 | References

• [1] "Mann–Whitney U Test." Wikipedia. Wikimedia Foundation, n.d. Web.

• [2] "CS229 Lecture Notes." Andrew Ng. http://cs229.stanford.edu/notes/cs229notes1.pdf

• [3] Frost, Jim. "Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?" Minitab, 30 May 2013. Web. 15 Sept. 2015.

• [4] NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm

• [5] "GraphPad Curve Fitting Guide." GraphPad Curve Fitting Guide. GraphPad, n.d. Web. 16 Sept. 2015. <http://www.graphpad.com/guides/prism/6/curvefitting/index.htm?reg_analysischeck_linearreg.htm>

* data sources: <http://web.mta.info/developers/developer-data-terms.html#data>
