Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data,...
Transcript of Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data,...
![Page 2: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/2.jpg)
Julia Stoyanovich
Analyzing user profiles
2
Data analysis: What patterns are we looking for? How do we look for patterns? Why do we look for them? !
age edu inc sign
20 BS 27K Aries
35 MS 102K Taurus
62 PhD 200K Virgo
70 BS 80K Leo
… … … …
![Page 3: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/3.jpg)
Julia Stoyanovich
The scientific methodA method or procedure that has characterized natural science since the 17th century, consisting in systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses. !
The Oxford English Dictionary!
!
• hypothesis is formulated based on observation!
• hypothesis is tested under controlled conditions using sound reasoning!
• avoids confirmation bias - people tend to observe what they expect to observe !
• the process is reproducible
3
![Page 4: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/4.jpg)
Julia Stoyanovich
Scientific method in data analysis
4
http://bioserv.fiu.edu/~walterm/human_online/labs/scientific_meth/sci_meth1/scientific_method_files/image001.gif
![Page 5: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/5.jpg)
Julia Stoyanovich
FalsifiabilityA crucial component of the scientific method - every hypothesis must be falsifiable (refutable, testable), i.e., it must be possible to prove that the statement in question is false
5
![Page 6: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/6.jpg)
Julia Stoyanovich
Is Astrology a Science?
6
http://www.bodymemory.com/uploads/2/3/2/8/23288628/s266135201567006809_p25_i2_w1000.jpeg
![Page 7: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/7.jpg)
Julia Stoyanovich
Probability and Statistics• Experiments have outcomes!
• An event is an occurrence of a set of outcomes!
• Probability of an event measures how likely this event is to occur
7
Tossing a fair coin, on Earth, subject to the usual gravity laws!
Experiment: toss the coin once!
Outcomes: {H, T}!
Events: {H}, {T}, {} - neither heads nor tails, {H,T} - either heads or tails!
Probabilities of events: P({H}) = 0.5 P({T}) = 0.5, P{{}} = 0, P{{H,T}} = 1!
Probabilities of disjoint events: P({H}) = 0.5 P({T}) = 0.5
What if our experiment is to toss the coin twice?!
What is the coin is biased?
![Page 8: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/8.jpg)
Julia Stoyanovich
Independent Events
8
Experiment: toss 2 fair coins, call them A and B. !
We write A=1 if A lands heads, A=0 if A lands tails.!
Outcomes: {A=0, B=0}, {A=0, B=1}, {A=1, B=0}, {A=1, B=1}
P(A) = 1/2 ! P(B) = 1/2!
P(A, B) = 1/4 !
P(A, B) = P(A) P(B)
Independent events: events A and B are independent if the fact that A occurs does not affect the probability of B occurring.
More than 2 independent events?
![Page 9: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/9.jpg)
Julia Stoyanovich
Conditional Probability
9
Conditional probability measures the probability of an event given that another event has occurred.
Example: A - probability that a person smokes; B - probability that there is an ashtray in the person’s home!
U
S A
P(A | B) = P(AI B)P(B)
| U | = 100 people!
| S | = 15 smokers!
| A | = 20 have ashtrays in their home!
! | A and S | = 10
![Page 10: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/10.jpg)
Julia Stoyanovich
Conditional Probability
10
Conditional probability measures the probability of an event given that another event has occurred.
Example: A - probability that a person smokes; B - probability that there is an ashtray in the person’s home!
U
S A
P(A | B) = P(AI B)P(B)
P(S) = 0.15 !P(A) = 0.2!
P(S and A) = 0.1 !
P(S | A) = 0.1 / 0.2 = 0.5!
P(A | S) = 0.1 / 0.15 = 0.67
Baye’s rule: P(A | B) =P(B | A)P(A)
P(B)
![Page 11: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/11.jpg)
Julia Stoyanovich
Co-occurrence
11
age edu inc sign
20 BS 27K Aries
35 MS 102K Taurus
62 PhD 200K Virgo
70 BS 80K Leo
… … … …
An event is an assignment of a value to a variable (age, edu, income, sign)!
An experiment is a particular tuple in the database!
Outcomes: { domain(A) x domain(E) x domain (I) x domain (S) }!
!P(sign = Taurus AND edu = BS) = 5%!
P(sign = Taurus AND edu = HS) = 2%
is 5% frequent?!
is 2% frequent?
Co-occurence: events A and B frequently occur together
![Page 12: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/12.jpg)
Julia Stoyanovich
Co-occurrenceCo-occurence: events A and B frequently occur together
12
P(sign = Taurus AND edu = BS) = 5%!
P(sign = Taurus AND edu = HS) = 2%
is 5% frequent?!
is 2% frequent?
![Page 13: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/13.jpg)
Julia Stoyanovich
Co-occurrence vs. CorrelationCo-occurence: events A and B frequently occur together
13
P(sign = Taurus) = 8%
Correlation: events A and B occur together more frequently than by random chance
Considerations of model and sample size!
![Page 14: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/14.jpg)
Julia Stoyanovich
Co-occurrence vs. CorrelationCo-occurence: events A and B frequently occur together
14
Correlation: events A and B occur together more frequently than by random chance
P(sign = Cancer AND edu = PhD) = 2%
P(sign = Libra AND edu = PhD) = 0
Considerations of model and sample size!
![Page 15: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/15.jpg)
Julia Stoyanovich
Correlation vs. CausationCo-occurence: events A and B frequently occur together
15
Correlation: events A and B occur together more frequently than by random chance
Causation: if event A occurs, then event B is more likely to occur
![Page 16: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/16.jpg)
Julia Stoyanovich
Correlation vs. CausationFor correlated events A and B, the following relationships are possible!
1. A causes B (direct causation)!
2. B causes A (reverse causation)!
3. A and B are consequences of a common cause, but do not cause each other!
4. A causes B and B causes A (cyclic causation)
16
![Page 17: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/17.jpg)
Julia Stoyanovich
Causation vs. Coincidence
17
https://en.wikipedia.org/wiki/File%3aPiratesVsTemp%28en%29.svg
![Page 18: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/18.jpg)
Julia Stoyanovich
Ockham’s razor
If multiple hypotheses explain an observation, the simplest one should be preferred.
18
William of Ockham!(1285-1347)
Lex parsimoniae!
Used as a heuristic to help identify a promising hypothesis to test
Ockham’s motivation: can one prove the existence of God?Many applications today: biology, probability theory, ethics
![Page 19: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/19.jpg)
Julia Stoyanovich
An example: the art of medical diagnosis
19
"It's your ribs. I'm afraid they're delicious." - New Yorker Cartoon By: Paul Noth Item #: 10684223
![Page 20: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/20.jpg)
Julia Stoyanovich
An example: overfitting
20
vs.
![Page 21: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/21.jpg)
Julia Stoyanovich
Bonferroni PrincipleIn massive datasets, unusual events may appear, by coincidence, to be more frequent than expected. If the expected frequency of an event is lower than what is expected by random chance - the event cannot be reliably detected!!
Example (sec 1.2.3 of LRU): “evil-doers” gather at hotels to plan evil deeds. A pair of people who is at the same hotel on 2 or more occasions are potentially evil-doers. !
There are 1012 people (1 billion). Everyone decides with probability 0.01 to stay at a random hotel on any given day. There are 105 hotels. We examine 1000 days worth of hotel records.
21
What is the probability that 2 people will visit the same hotel on 2 separate occasions?
![Page 22: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/22.jpg)
Julia Stoyanovich
Incentives
22
Data analysis: What patterns are we looking for? How do we look for patterns? Why do we look for them? !
![Page 23: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/23.jpg)
Julia Stoyanovich
Visualizing data
23
http://www.businessinsider.com/fox-news-obamacare-chart-2014-3
![Page 24: Data, Responsibly - Computer Science Departmentjulia/cs500/documents/... · 2015-08-17 · Data, Responsibly!!!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Analyzing](https://reader034.fdocuments.net/reader034/viewer/2022042416/5f31f63f2490a563767f56ae/html5/thumbnails/24.jpg)
Julia Stoyanovich 24