Data Analysis, Statistics, Machine Learningwilkinson... · Comment on Emanuel Parzen [Nonparametric...
Transcript of Data Analysis, Statistics, Machine Learningwilkinson... · Comment on Emanuel Parzen [Nonparametric...
Data Analysis, Statistics, Machine Learning
Leland Wilkinson Adjunct Professor UIC Computer Science Chief Scien<st H2O.ai [email protected]
2
Exploring o Exploratory Data Analysis (John W. Tukey , EDA)
Summaries Transforma<ons Smoothing Robustness Interac<vity What EDA is not … LeQng the data speak for itself Fishing expedi<ons Null hypothesis tes<ng
Qualita<ve Data Analysis Mixed methods Old wine in new boVles
Copyright © 2016 Leland Wilkinson
Exploring
3
“Probability modelers seem to want to believe that their models are en<rely correct. Data analysts regard their models as a basis from which to measure devia<on, as a convenient benchmark in the wilderness, expec<ng liVle truth and relying on less.”
Tukey (1979)
Copyright © 2016 Leland Wilkinson
4
Exploring o Summaries
o LeVer values o M (median) sort and split batch o H (hinges) split each half as if it were a new batch o E (eighths) split again, and so on … o Medians and hinges yield a 5-‐number summary
o 1. lower extreme o 2. lower hinge o 3. median o 4. upper hinge o 5. upper extreme H-‐spread is (upper hinge – lower hinge) Range is (upper extreme – lower extreme)
extreme median hinge extreme hinge
Copyright © 2016 Leland Wilkinson
5
Exploring o Summaries
o What leVer values reveal o Symmetry
o Outliers o A Step is 1.5 <mes H-‐spread o Inner fences are 1 step outside hinges o Outer fences are 2 steps outside hinges o Adjacent values are those at each end closest to, but s<ll inside inner fences o Outside values are between inner fence and neighboring outer fence o Far out values are beyond outer fences toward extremes
outside far out
Copyright © 2016 Leland Wilkinson
6
Exploring o Transforma<ons
o Tukey Ladder of Powers (re-‐expressions) o Assume data are posi<ve, or use X + 1 if non-‐nega<ve o Tukey formula
o X ⟼ Xp
o Box & Cox formula (derived from Tukey’s idea) o X ⟼ (Xp – 1) / p
o Values of p o p = 2 yields X2 o p = 1 yields X o p = .5 yields sqrt(X) o p = 0 yields log(X) o p = -‐1 yields 1 / X
o For Box & Cox formula o p = 0 yields log(X) because limp→0 (Xp – 1) / p = log(X) o Also, dividing by p in Box & Cox formula preserves polarity of X
o Ascending the ladder (p > 1) spreads out large values and compresses small values. Descending the ladder (p < 1) compresses large values and spreads out small values.
Copyright © 2016 Leland Wilkinson
7
Exploring o Transforma<ons
o Dealing with skewness o Posi<ve skew: descend the ladder (p < 1) o Nega<ve skew: ascend the ladder (p > 1)
p = 2 p = 1 p = .5 p = 0 p = -‐.5 p = -‐1
0 1000 2000 3000 4000 5000 6000BRAINWEIGHT
-1 0 1 2 3 4LOG10(BRAINWEIGHT)
Copyright © 2016 Leland Wilkinson
8
Exploring o Transforma<ons
Wilkinson, Blank, & Gruber (1996)
Copyright © 2016 Leland Wilkinson
9
Exploring o Transforma<ons
o Spread-‐level plot o Divide batch into quin<les and plot H-‐spread against median o 1 – slope of line is es<mate of p o In this case, p = 0 is best choice
Copyright © 2016 Leland Wilkinson
10
Exploring o Smoothing
o See any paVern here?
0 20 40 60 80Day
30405060708090100
Temperature
Copyright © 2016 Leland Wilkinson
11
Exploring o Smoothing
o Give yourself a medal if you saw this
0 20 40 60 80Day
30405060708090100
Temperature
Velleman & Hoaglin (1981)
Copyright © 2016 Leland Wilkinson
12
Exploring o Smoothing
o Data = smooth + rough o Data = fit + residuals
o Fit a model o Compute residuals o Examine residuals for systema<c varia<on o If residuals look nonrandom, fit a model to the residuals o Iterate
Copyright © 2016 Leland Wilkinson
13
Exploring o Robustness
Tukey was skep<cal regarding Gaussian assump<on Inspired a search for sta<s<cal es<mators that are robust against outliers and other forms of contamina<on Simple loca<on es<mators involved trimming outliers Median Winsorizing Trimmed mean Others (Tukey, Hampel, …) involved weigh<ng func<ons Peter Huber developed maximum-‐likelihood-‐like methods
Copyright © 2016 Leland Wilkinson
14
Exploring o Interac<vity
o Linking o Brushing o Projec<ng
o Tukey, Friedman, Fisherkeller: Prim9 o https://www.youtube.com/watch?v=B7XoW2qiFUA
o Tukey and Friedman: Projec<on pursuit o https://www.youtube.com/watch?v=n5i9RLCe1rQ
Copyright © 2016 Leland Wilkinson
15
Exploring o Qualita<ve Data Analysis (QDA)
o The QDA movement is a reac<on against… o Quan<ta<ve analysis (mathema<cs in general, sta<s<cs in par<cular) o Scien<fic objec<vism, realism, and posi<vism o Peer review (controversial within QDA community) o Educa<onal tes<ng
o Subjec<vity o Hermeneu<cs
o Transla<onal o Postmodernist
o The researcher constructs own reality that others may not share o Reliance on “trustworthiness” instead of formal measures of validity
o credibility, dependability, auditability, confirmability, corrobora<on o Focus on symbolic interpreta<ons of icons (text, videos, …) leads to “mixed methods”
o Fluidity o No predefined measures or hypotheses o Progressive data collec<on and coding leads to “grounded theory”
o Poli<cs o Peculiar QDA journals o Ac<vism in academic departments
Copyright © 2016 Leland Wilkinson
16
Exploring o Qualita<ve Data Analysis
o Nothing new here o Introspec<on (Wundt, …) o Clinical observa<on (Freud, Piaget, …) o Personal knowledge (Polanyi, …) o Par<cipant observa<on (Malinowski, Mead, …) o Community psychology interviewing (Sarason, Levine, Kelly, …) o Group dynamics (Lewin, Bales, Slater, …)
o BoVom line: o If you can’t quan<fy or qualify something, you don’t understand it
o In science, understanding means being able to communicate to a ra<onal person o In religion, understanding is a non-‐cogni<ve experience of the transcendent o In aesthe<cs, understanding is a judgment of taste (Kant) o But you can’t build a science on subjec<ve or non-‐cogni<ve founda<ons
o Quan<fica<on doesn’t mean simply assigning a number o It can mean “these two things are not comparable” o Or, “this is greater than that” o Or, “these two things are related”
Copyright © 2016 Leland Wilkinson
17
Exploring Qualita<ve Data Analysis Alterna<ves
Text analysis (Shepard, Rosenberg, …) Collect the data through simple comparisons (no numbers) Scale them by exploi<ng distance and ordering constraints
Copyright © 2016 Leland Wilkinson
18
Exploring Qualita<ve Data Analysis Alterna<ves
Sequence analysis (Agrawal A priori algorithm) Associa<on rules
Copyright © 2016 Leland Wilkinson
19
Exploring Qualita<ve Data Analysis Alterna<ves
Network analysis No numbers here
Andris C, Lee D, Hamilton MJ, Mar<no M, Gunning CE, Selden JA (2015) The Rise of Par<sanship and Super-‐Cooperators in the U.S. House of Representa<ves. PLoS ONE 10(4): e0123507.
Copyright © 2016 Leland Wilkinson
20
Exploring Qualita<ve Data Analysis Alterna<ves
Innova<ve experimental paradigms No numbers used here
Color vision and hue categoriza<on in young human infants. Bornstein, Marc H.; Kessen, William; Weiskopf, Sally Journal of Experimental Psychology: Human Percep>on and Performance, Vol 2(1), Feb 1976, 115-‐129
"Infant looking at shiny object" by Mehregan Javanmard, Wikipedia
Copyright © 2016 Leland Wilkinson
21
Exploring o References
o Andrews, D., P. Bickel, F. Hampel, P. Huber, W. Rogers, and J. W. Tukey (1972). Robust Estimates of Location: Survey and Advances. Princeton University Press.
o Box, G. E. P. and Cox, D. R. (1964). An Analysis of Transformations, Journal of the Royal Statistical Society, pp. 211-243, discussion pp. 244-252.
o Friedman, J.H., and Stuetzle, W. (2002). John W. Tukey’s Work on Interactive Graphics. Annals of Statistics 30.6: 1629–39.
o Hampel, F.R., Ronchetti, E.M., Rousseeuw,P.J. and Stahel, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley.
o Huber, P.J., and Ronchetti, E.M. (2009), Robust Statistics, 2nd ed., Wiley. o Mosteller, F. and Tukey, J.W. (1977). Data Analysis and Regression. Addison-Wesley. o Stigler, S.M. (2010), ”The Changing History of Robustness”, The American Statistician, 64,
277-281. o Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley. o Tukey, J.W. (1979). Comment on Emanuel Parzen [Nonparametric statistical data
modeling], Journal of the American Statistical Association, 74, 121-122. o Velleman, P. and Hoaglin, D. (1981). The ABC’s of EDA: Applications, Basics, and
Computing of Exploratory Data Analysis, Duxbury.
Copyright © 2016 Leland Wilkinson