TODAY’S LECTURE - UMD


Page 1: TODAY’S LECTURE - UMD

TODAY’S LECTURE

Data collection → Data processing → Exploratory analysis & Data viz → Analysis, hypothesis testing, & ML → Insight & Policy Decision

1

BIG THANKS: Zico Kolter (CMU) & Amol Deshpande (UMD)

Page 2: TODAY’S LECTURE - UMD

STATISTICAL INFERENCE

Statistical inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes.

• Process of going from the world to the data, and then back to the world

• Often the goal is to develop a statistical model of the world from observed data

The conclusion is typically:
• an estimate;
• or a confidence interval;
• or rejection of a hypothesis;
• or a clustering or classification of data points into groups

2

Page 3: TODAY’S LECTURE - UMD

BASIC PROBABILITY I

Probability is concerned with the outcome of a trial (also called experiment or observation)

Sample Space: Set of all possible outcomes of a trial
• Probability of Sample Space = 1

Event is the specification of the outcome of a trial

• For example: Trial = Tossing a coin; Sample Space = {Heads, Tails}; Event = Heads

If two events E and F are independent, then:
• The probability of E does not change if F has already happened, i.e., P(E | F) = P(E)
• Also: P(E AND F) = P(E) * P(F)

If two events E and F are mutually exclusive, then:
• P(E UNION F) = P(E) + P(F)

3
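These identities are easy to check by simulation. The sketch below (an editorial addition, not from the slides) rolls two fair dice and verifies that, for the independent events "first die shows 6" and "second die shows 6", the empirical P(E AND F) is close to P(E) * P(F):

```python
import random

random.seed(0)
n = 100_000
count_e = count_f = count_both = 0

for _ in range(n):
    die1, die2 = random.randint(1, 6), random.randint(1, 6)
    e = (die1 == 6)          # event E: first die shows 6
    f = (die2 == 6)          # event F: second die shows 6 (independent of E)
    count_e += e
    count_f += f
    count_both += (e and f)

p_e, p_f, p_both = count_e / n, count_f / n, count_both / n
print(p_both, p_e * p_f)     # both close to 1/36 ≈ 0.0278
```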

Page 4: TODAY’S LECTURE - UMD

BASIC PROBABILITY II

Bayes Theorem: P(A | B) = P(B | A) * P(A) / P(B)
• Simple equation, but fundamental to Bayesian inference

Conditional Independence: A and B are conditionally independent given C if Pr(A AND B | C) = Pr(A | C) * Pr(B | C)
• Powerful in reducing the computational effort in storing and manipulating large joint probability distributions

Entropy: A measure of the uncertainty in a probability distribution (Wikipedia Article)

4
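A small worked sketch of Bayes' theorem (the numbers are hypothetical, chosen only for illustration): suppose 10% of tweets are bot generated, a detector flags 90% of bot tweets, and it also flags 20% of human tweets. Bayes' theorem gives the probability that a flagged tweet is actually a bot:

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem
p_bot = 0.10              # P(A): prior probability a tweet is bot generated
p_flag_given_bot = 0.90   # P(B | A): detector flags a bot tweet
p_flag_given_human = 0.20 # P(B | not A): detector flags a human tweet

# Total probability of being flagged: P(B)
p_flag = p_flag_given_bot * p_bot + p_flag_given_human * (1 - p_bot)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_bot_given_flag = p_flag_given_bot * p_bot / p_flag
print(round(p_bot_given_flag, 3))  # ~0.333
```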

Page 5: TODAY’S LECTURE - UMD

5

Page 6: TODAY’S LECTURE - UMD

6

Page 7: TODAY’S LECTURE - UMD

SAMPLE STATISTICS

To be more consistent with standard statistics notation, we'll introduce the notions of a population and a sample

            Population              Sample
Mean        μ = E[X]                x̄ = (1/m) Σ_{i=1..m} x⁽ⁱ⁾
Variance    σ² = E[(X − μ)²]        s² = (1/(m−1)) Σ_{i=1..m} (x⁽ⁱ⁾ − x̄)²

7
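A minimal sketch of these sample statistics in NumPy, on a small made-up sample (note that ddof=1 gives the m − 1 denominator of the sample variance):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # a small made-up sample

x_bar = x.mean()       # sample mean: (1/m) * sum of x_i
s2 = x.var(ddof=1)     # sample variance: 1/(m-1) * sum of (x_i - x_bar)^2
print(x_bar, s2)
```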

Page 8: TODAY’S LECTURE - UMD

8

Page 9: TODAY’S LECTURE - UMD

RECALL: NORMAL DISTRIBUTION

99.7% of values will fall within 3 standard deviations (around the mean)

• 95% for 2 standard deviations; 68% for 1

Central Limit Theorem: As sample size approaches infinity, distribution of sample means will follow a normal distribution irrespective of the original distribution

9

Page 10: TODAY’S LECTURE - UMD

RANDOM VARIABLE

• We want to determine if a tweet was generated by a bot.

• Sample a random tweet and have a human decide.

• Denote the outcome as a binary random variable X ∈ {0, 1}:

    X = 1 if the tweet is bot generated, 0 otherwise

10

Page 11: TODAY’S LECTURE - UMD

PROBABILITY DISTRIBUTION (DISCRETE)

• It is a function P : D → [0, 1]

• The probability that a randomly sampled tweet is bot generated is given by p(X = 1)

• Suppose, somehow, we know that p(X = 1) = 0.7; then p(X = 0) = 0.3

11
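A quick sketch of sampling from this discrete distribution (assuming, as above, p(X = 1) = 0.7) and checking that the empirical frequency matches:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.7                                   # assumed P(X = 1): tweet is bot generated
x = rng.binomial(n=1, p=p, size=10_000)   # 10,000 Bernoulli(0.7) draws
print(x.mean())                           # close to 0.7
```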

Page 12: TODAY’S LECTURE - UMD

EXPECTATION

• Now sample n = 100 random tweets.

• How many of those would we expect to be bot generated?

• The expectation of X is E(X) = Σ_{x∈D} x · p(X = x)

• For a single sample:

    E(X) = 0 × p(X = 0) + 1 × p(X = 1) = 0 × 0.3 + 1 × 0.7 = 0.7

• For 100 random tweets, the expected value of Y = X₁ + X₂ + … + X₁₀₀ is

    E(Y) = E(X₁ + X₂ + … + X₁₀₀)
         = E(X₁) + E(X₂) + … + E(X₁₀₀)
         = 0.7 + 0.7 + … + 0.7 = 100 × 0.7 = 70

12
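The same expectation can be computed directly from the distribution; a short sketch:

```python
p = {0: 0.3, 1: 0.7}                      # the assumed distribution p(X = x)

# E(X) = sum over x of x * p(X = x)
e_x = sum(x * px for x, px in p.items())
print(e_x)                                # 0.7

# Expected number of bot tweets among n = 100 samples: E(Y) = 100 * E(X)
print(100 * e_x)                          # 70.0
```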

Page 13: TODAY’S LECTURE - UMD

ESTIMATION

13

Page 14: TODAY’S LECTURE - UMD

LAW OF LARGE NUMBERS

• Given independently sampled random variables X₁, X₂, …, Xₙ with E(Xᵢ) = μ for all i, the law of large numbers states that the sample mean

    (1/n) Σᵢ Xᵢ → μ

  tends to the population mean as n grows

14
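A simulation sketch of the law of large numbers for the bot-tweet variable (assuming p = 0.7 as before): as n grows, the running sample mean settles near μ.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.7                                     # population mean mu = E(X) = p
draws = rng.binomial(1, p, size=100_000)    # a long stream of Bernoulli(0.7) draws

for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, draws[:n].mean())              # sample mean approaches 0.7
```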

Page 15: TODAY’S LECTURE - UMD

CENTRAL LIMIT THEOREM

• LLN: estimates built using the sample mean tend to the correct answer

• The CLT describes how those estimates are spread around it

• Now the variance of our random tweet example:

    var(X) = E[(X − E(X))²]

    var(X) = Σ_D (x − E(X))² p(X = x)
           = (0 − p)² × (1 − p) + (1 − p)² × p
           = p²(1 − p) + (1 − p)² p
           = p(1 − p)(p + (1 − p)) = p(1 − p)

15

Page 16: TODAY’S LECTURE - UMD

CENTRAL LIMIT THEOREM

• The distribution of the sample mean tends towards a normal distribution as n → ∞

• As the sample size increases, the distribution of sample means is well approximated by a normal distribution

16

Page 17: TODAY’S LECTURE - UMD

THE NORMAL DISTRIBUTION

• We write "Y is normally distributed with mean μ and standard deviation σ" as Y ∼ N(μ, σ)

• The probability density function is

    p(Y = y) = (1 / (√(2π) σ)) · e^(−(1/2)((y − μ)/σ)²)

• It satisfies the conditions:

    p(Y = y) ≥ 0 for all y ∈ (−∞, ∞)

    ∫_{−∞}^{∞} p(Y = y) dy = 1

17
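A small check of this density and the 68/95/99.7 rule using scipy.stats (an editorial sketch, not part of the slides):

```python
from scipy import stats

mu, sigma = 0.0, 1.0
y = stats.norm(loc=mu, scale=sigma)

print(y.pdf(0.0))                      # density at the mean: 1/sqrt(2*pi) ~ 0.3989
for k in (1, 2, 3):                    # probability mass within k standard deviations
    print(k, y.cdf(mu + k * sigma) - y.cdf(mu - k * sigma))   # ~0.68, 0.95, 0.997
```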

Page 18: TODAY’S LECTURE - UMD

THE NORMAL DISTRIBUTION

18

Page 19: TODAY’S LECTURE - UMD

THE NORMAL DISTRIBUTION - STANDARDIZED

• We write "Y is normally distributed with mean μ and standard deviation σ" as Y ∼ N(μ, σ)

• Standardizing, Z = (Y − μ)/σ follows Z ∼ N(0, 1), with probability density function

    p(Z = z) = (1 / √(2π)) · e^(−z²/2)

19

Page 20: TODAY’S LECTURE - UMD

CONCLUDING CLT

• Given independently sampled random variables X₁, X₂, …, Xₙ with E(Xᵢ) = μ and sd(Xᵢ) = σ for all i, and their sample mean

    Y = (1/n) Σᵢ Xᵢ

• The standard deviation of Y is called the standard error:

    se(Y) = σ / √n

20

Page 21: TODAY’S LECTURE - UMD

CONCLUDING CLT

• The distribution of Y tends towards N(μ, σ/√n) as n → ∞

• Thus, as the sample size increases, the distribution of sample means is well approximated by a normal distribution, and its spread goes to zero at the rate of the square root of the sample size.

• Assumptions:
  • X₁, X₂, …, Xₙ are iid (independent and identically distributed) random variables
  • var(X) < ∞

21
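A simulation sketch of the CLT for the bot-tweet example (assuming p = 0.7): repeatedly draw n tweets, compute each sample mean, and compare the spread of those means with σ/√n.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, trials = 0.7, 100, 20_000
sigma = np.sqrt(p * (1 - p))                     # sd of a single Bernoulli(p) draw

# each entry is the sample mean of one batch of n tweets
sample_means = rng.binomial(n, p, size=trials) / n

print(sample_means.mean())                       # ~ mu = 0.7
print(sample_means.std(), sigma / np.sqrt(n))    # empirical spread ~ standard error
```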

Page 22: TODAY’S LECTURE - UMD

CONCLUDING CLT

22

Page 23: TODAY’S LECTURE - UMD

HYPOTHESIS TESTING

Accepting or rejecting a statistical hypothesis about a population

H0: null hypothesis, and H1: the alternative hypothesis
• Mutually exclusive and exhaustive
• H0 can never be proven to be true, but it can be rejected
• Sometimes there is no H1 at all (Fisher's test)

Statistical significance: probability that the result is not due to chance

Example: deciding if a coin is fair

• http://20bits.com/article/hypothesis-testing-the-basics

23

Page 24: TODAY’S LECTURE - UMD

HYPOTHESIS TESTING

H0: null hypothesis, and H1: the alternative hypothesis
• Mutually exclusive and exhaustive
• H0 can never be proven to be true

Statistical significance: probability that the result is not due to chance

Process:
• Decide on H0 and H1
• Decide which test statistic is appropriate
  • Roughly, how well does my sample agree with the null hypothesis?
  • Key question: what is the distribution of the test statistic over samples?
• Select a significance level (alpha), a probability threshold below which the null hypothesis will be rejected -- typically 5% or 1%
• Compute the observed value of the test statistic t_obs from the sample
• Compute the p-value: the probability, under the null hypothesis, that the test statistic takes a value at least as extreme as the one observed
  • Use the distribution above to compute the p-value
• Reject the null hypothesis if the p-value < alpha

24

Page 25: TODAY’S LECTURE - UMD

HYPOTHESIS TESTING

• Hypothesis: more than 50% of tweets are bot generated.

• One approach: set this up as the alternative and try to reject the opposite (null) hypothesis.

• Null hypothesis: 50% or less of tweets are bot generated
• Alternative hypothesis: more than 50% of tweets are bot generated

    H0: p ≤ .5 (null)
    H1: p > .5 (alternative)

25

Page 26: TODAY’S LECTURE - UMD

HYPOTHESIS TESTING

• According to the CLT, estimates of p from n samples would be distributed, under the null hypothesis, as

    N(.5, √(.5 × (1 − .5) / n))

• If the estimate p̂ is too far above p = .5, reject the null hypothesis: reject when P(Y ≥ p̂) < 0.05 under H0 (equivalently, when P(Y ≤ p̂) ≥ 0.95); a sketch of this test follows below

26
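A sketch of this one-sided test using the normal approximation (the sample here is simulated, just to show the mechanics): compute p̂, the z statistic under H0: p = .5, and the p-value, then reject if the p-value is below α = 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 1_000
sample = rng.binomial(1, 0.7, size=n)     # pretend bot/human labels for n sampled tweets

p_hat = sample.mean()
se = np.sqrt(0.5 * (1 - 0.5) / n)         # standard error under H0: p = .5
z = (p_hat - 0.5) / se
p_value = 1 - stats.norm.cdf(z)           # one-sided: P(Y >= p_hat) under H0

alpha = 0.05
print(p_hat, z, p_value, p_value < alpha) # reject H0 if p-value < alpha
```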

Page 27: TODAY’S LECTURE - UMD

HYPOTHESIS TESTING

27

Page 28: TODAY’S LECTURE - UMD

HYPOTHESIS TESTING

28

Page 29: TODAY’S LECTURE - UMD

HYPOTHESIS TESTING

29

Page 30: TODAY’S LECTURE - UMD

HYPOTHESIS TESTING ERRORS

30

Page 31: TODAY’S LECTURE - UMD

SCIENTIFIC METHOD: STATISTICAL ERRORS

Nature article: P values are not as reliable as many scientists assume

p-hacking: cherry-picking data points, etc., to get the desired p-values; repeating experiments that fail until you get the result

Much discussion/debate about this issue in recent years

31

Page 32: TODAY’S LECTURE - UMD

SAMPLING BIASES

Sampling is effective at reducing the amount of data you need to analyze

Ideally you want a random sample
• Otherwise you need to account for bias, which can be tricky

Bias in sampling: need to be very careful when generalizing inferences drawn from a sample
• Even for random samples

Questions to ask:
• How was the sample selected? Was it truly random? Potential biases?
• How were questions worded?
• How is missing data/attrition handled?
• Was the sample size large enough?

32

Page 33: TODAY’S LECTURE - UMD

SOME POTENTIAL SOURCES OF BIASES

Sample bias
• Selection bias: some subjects are more likely to be selected
• Volunteer bias: people who volunteer are not representative
• Nonresponse bias: people who decline to be interviewed may differ from those who respond

Survey/response bias
• Interviewer bias
• Acquiescence bias: tendency to agree with all questions
• Social desirability bias: people are not going to admit to embarrassing things

Also watch out for:
• Confirmation bias
• Anchoring bias

33

Page 34: TODAY’S LECTURE - UMD

SOME POTENTIAL SOURCES OF BIASES

Gold standard: Randomized Clinical Trials
• Some people receive the "treatment", others are in a "control" group
• Assignment is random, to take care of all confounding factors
• Problems:
  • Ethically feasible only under clinical equipoise
    • Can't ask some people to smoke to figure out the effects of smoking
  • Very expensive and cumbersome
  • Impossible in many cases

Recall: recent Facebook experiment on emotions

"A true state of equipoise exists when one has no good basis for a choice between two or more care options." - NIH

34

Page 36: TODAY’S LECTURE - UMD

NEWSPAPERS EVEN MORE

Source: A Washington Post article says: "In the first study of its kind, researchers from Washington State University and elsewhere found a 14 percent greater risk of head injuries to cyclists associated with cities that have bike share programs. In fact, when they compared raw head injury data for cyclists in five cities before and after they added bike share programs, the researchers found a 7.8 percent increase in the number of head injuries to cyclists."

Actually: head injuries declined from 319 to 273, and overall injuries declined from 757 to 545

• So the proportion of head injuries went up!

36

Page 37: TODAY’S LECTURE - UMD

MACHINE LEARNING

37

Page 38: TODAY’S LECTURE - UMD

38

K-NEAREST NEIGHBORS

Page 39: TODAY’S LECTURE - UMD

NEAREST NEIGHBOR CLASSIFIERS

Basic idea:
• If it walks like a duck and quacks like a duck, then it's probably a duck

(Figure: given a set of training records and a test record, compute the distance from the test record to the training records and choose k of the "nearest" records)

39

Page 40: TODAY’S LECTURE - UMD

NEAREST-NEIGHBOR CLASSIFIERS

40

Requires three things:
• The set of stored records
• A distance metric to compute the distance between records
• The value of k, the number of nearest neighbors to retrieve

To classify an unknown record (see the sketch below):
• Compute its distance to the training records
• Identify the k nearest neighbors
• Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
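A minimal sketch of this procedure with scikit-learn (synthetic two-class data, default Euclidean distance, majority vote over the k neighbors):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
# Two synthetic classes: points around (0, 0) labeled 0, points around (3, 3) labeled 1
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5, Euclidean distance by default
knn.fit(X_train, y_train)

unknown = np.array([[2.5, 2.0]])            # the "unknown record"
print(knn.predict(unknown))                 # majority vote of its 5 nearest neighbors
```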

Page 41: TODAY’S LECTURE - UMD

DEFINITION OF NEAREST NEIGHBOR


(a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor

K-nearest neighbors of a record x are data points that have the k smallest distances to x

41

Page 42: TODAY’S LECTURE - UMD

1-NEAREST NEIGHBOR

The Voronoi diagram defines the classification boundary

The area takes the class of the green point

42

Page 43: TODAY’S LECTURE - UMD

NEAREST NEIGHBOR CLASSIFICATION

Compute the distance between two points:
• Euclidean distance: d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )

Determine the class from the nearest neighbor list:
• Take the majority vote of class labels among the k nearest neighbors
• Or weight the vote according to distance
  • E.g., weight factor w = 1/d²

43

Page 44: TODAY’S LECTURE - UMD

NEAREST NEIGHBOR CLASSIFICATION…

Choosing the value of k:
• If k is too small, sensitive to noise points
• If k is too large, the neighborhood may include points from other classes


44

Page 45: TODAY’S LECTURE - UMD

NEAREST NEIGHBOR CLASSIFICATION…

Scaling issues
• Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
• Example:
  • height of a person may vary from 1.5m to 1.8m
  • weight of a person may vary from 90lb to 300lb
  • income of a person may vary from $10K to $1M

Standardize variables, like in Mini-Project #2 (see the sketch below).

45
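A sketch of the standardization step with scikit-learn's StandardScaler, on made-up height/weight/income rows, so that no single attribute dominates the Euclidean distance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# columns: height (m), weight (lb), income ($) -- wildly different scales (made-up rows)
X = np.array([[1.6, 120.0, 25_000.0],
              [1.8, 200.0, 900_000.0],
              [1.7, 150.0, 60_000.0]])

X_scaled = StandardScaler().fit_transform(X)   # each column: zero mean, unit variance
print(X_scaled.round(2))
```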

Page 46: TODAY’S LECTURE - UMD

NEAREST NEIGHBOR CLASSIFICATION…

Problem with the Euclidean measure:
• High-dimensional data
  • The curse of dimensionality: data becomes sparse relative to the total volume of the space, and distance metrics "lose meaning"
• Can produce counter-intuitive results:

    1 1 1 1 1 1 1 1 1 1 1 0        1 0 0 0 0 0 0 0 0 0 0 0
                              vs
    0 1 1 1 1 1 1 1 1 1 1 1        0 0 0 0 0 0 0 0 0 0 0 1

    d = 1.4142                     d = 1.4142

Solution: normalize the vectors to unit length

46
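A sketch of the unit-length fix on the binary-vector example above: after normalizing each vector to length 1, the nearly identical pair ends up much closer than the nearly disjoint pair, even though their raw Euclidean distances are equal.

```python
import numpy as np

a = np.array([1]*11 + [0], dtype=float)   # 1 1 1 1 1 1 1 1 1 1 1 0
b = np.array([0] + [1]*11, dtype=float)   # 0 1 1 1 1 1 1 1 1 1 1 1
c = np.array([1] + [0]*11, dtype=float)   # 1 0 0 0 0 0 0 0 0 0 0 0
d = np.array([0]*11 + [1], dtype=float)   # 0 0 0 0 0 0 0 0 0 0 0 1

def unit(v):
    return v / np.linalg.norm(v)          # scale the vector to unit length

# raw Euclidean distances: both pairs are sqrt(2) ~ 1.4142
print(np.linalg.norm(a - b), np.linalg.norm(c - d))
# after unit-length normalization: ~0.43 for the similar pair vs ~1.41 for the disjoint pair
print(np.linalg.norm(unit(a) - unit(b)), np.linalg.norm(unit(c) - unit(d)))
```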

Page 47: TODAY’S LECTURE - UMD

NEAREST NEIGHBOR CLASSIFICATION…

k-NN classifiers are lazy learners
• They do not build models explicitly
• Unlike eager learners such as decision tree induction and rule-based systems

Classifying unknown records is relatively expensive
• Naïve algorithm: O(n) per query
• Need data structures to retrieve nearest neighbors fast
  • The Nearest Neighbor Search problem
  • CMSC420 covers spatial data structures extensively

47

Page 48: TODAY’S LECTURE - UMD

PICKING THE NUMBER OF NEIGHBORS

48