Page 1

Statistical Techniques

Chapter 10

Page 2

10.1 Linear Regression Analysis

Simple Linear Regression

y = ax + b

The least-squares estimates of the slope a and intercept b are:

a = (Σxy − (Σx)(Σy)/n) / (Σx² − (Σx)²/n)

b = (Σy − aΣx)/n

The general linear model extends to n independent variables:

f(x1, x2, x3, ..., xn) = a1x1 + a2x2 + a3x3 + ... + anxn + c
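A minimal Python sketch of these least-squares computations, using made-up (x, y) pairs purely for illustration:

```python
# Least-squares fit of y = ax + b, computed from the raw sums.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]   # illustrative data, roughly y = 2x

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Slope: a = (Sxy - Sx*Sy/n) / (Sx2 - Sx^2/n)
a = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
# Intercept: b = (Sy - a*Sx) / n
b = (sum_y - a * sum_x) / n
```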

Page 3

Table 10.1 • District Office Building Data

Space  Offices  Entrances  Age  Value
2310   2        2          20   $142,000
2333   2        2          12   $144,000
2356   3        1.5        33   $151,000
2379   3        2          43   $150,000
2402   2        3          53   $139,000
2425   4        2          23   $169,000
2448   2        1.5        99   $126,000
2471   2        2          34   $142,900
2494   3        3          23   $163,000
2517   4        4          55   $169,000
2540   2        3          22   $149,000

Multiple Linear Regression with Excel

Page 4

A Regression Equation for the District Office Building Data

Value = 27.64(Space) + 12529.77(Offices) + 2553.21(Entrances) − 234.24(Age) + 52317.83

Table 10.2 • Regression Statistics for the Office Building Data

Coefficients: –234.2371645 (Age), 2553.211 (Entrances), 12529.77 (Offices), 27.64139 (Space), 52317.83 (intercept)
Standard errors: 13.26801148, 530.6692, 400.0668, 5.429374, 12237.36
R² = 0.996747993; standard error of the estimate = 970.5785
F statistic = 459.7536742; residual degrees of freedom = 6
Regression sum of squares = 1732393319; residual sum of squares = 5652135
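As a quick sanity check (not part of the original slides), the coefficients from Table 10.2 reproduce the Table 10.1 values closely; a pure-Python sketch:

```python
# Apply the fitted equation to the Table 10.1 rows and measure R^2.
rows = [  # (Space, Offices, Entrances, Age, Value)
    (2310, 2, 2.0, 20, 142000), (2333, 2, 2.0, 12, 144000),
    (2356, 3, 1.5, 33, 151000), (2379, 3, 2.0, 43, 150000),
    (2402, 2, 3.0, 53, 139000), (2425, 4, 2.0, 23, 169000),
    (2448, 2, 1.5, 99, 126000), (2471, 2, 2.0, 34, 142900),
    (2494, 3, 3.0, 23, 163000), (2517, 4, 4.0, 55, 169000),
    (2540, 2, 3.0, 22, 149000),
]

def predict(space, offices, entrances, age):
    # Rounded coefficients from the first row of Table 10.2.
    return (27.64 * space + 12529.77 * offices + 2553.21 * entrances
            - 234.24 * age + 52317.83)

actual = [v for *_, v in rows]
preds = [predict(s, o, e, a) for s, o, e, a, _ in rows]
mean_y = sum(actual) / len(actual)
ss_res = sum((y - p) ** 2 for y, p in zip(actual, preds))
ss_tot = sum((y - mean_y) ** 2 for y in actual)
r_squared = 1 - ss_res / ss_tot   # Table 10.2 reports about 0.9967
```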

Page 5

[Figure: scatter plot of Assessed Value ($0–$180,000) versus Floor Space (2200–2600 square feet) for the office building data.]

Page 6

Regression Trees

[Figure: a generic regression tree. Internal nodes Test 1 through Test 4 split instances using < and >= comparisons; the five leaf nodes contain linear regression models LRM1 through LRM5.]

Page 7

[Figure: a regression tree with nine leaf nodes, LRM1 through LRM9. Internal nodes test the attributes Amt, TotCost, and Trips against thresholds (TotCost <= 246 at the root, with further splits at values such as 7.5, 39, 136, 171, 178, 309, and 390).]

Page 8

10.2 Logistic Regression

Transforming the Linear Regression Model

Logistic regression is a nonlinear regression technique that associates a conditional probability with each data instance.

Page 9

The Logistic Regression Model

p(y = 1 | x) = e^(ax + c) / (1 + e^(ax + c))

where e is the base of natural logarithms, often denoted as exp.
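The S-shaped curve below follows directly from this definition; a small sketch (the parameter values a = 1, c = 0 are arbitrary):

```python
import math

def logistic(x, a, c):
    """p(y = 1 | x) = e^(ax + c) / (1 + e^(ax + c))."""
    z = a * x + c
    return math.exp(z) / (1 + math.exp(z))

# With a = 1, c = 0 the curve passes through 0.5 at x = 0 and
# approaches 0 and 1 at the extremes of the plotted range.
lo, mid, hi = logistic(-6, 1, 0), logistic(0, 1, 0), logistic(6, 1, 0)
```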

[Figure: the logistic function P(y = 1 | x) plotted for x between −6 and 6; the curve rises in an S shape from near 0.0 to near 1.0.]

Page 10

Logistic Regression: An Example

ax + c = 0.0001(Income) + 19.827(CreditCardIns) − 8.314(Sex) − 0.415(Age) + 17.691

Table 10.3 • Logistic Regression: Dependent Variable = Life Insurance Promotion

          Credit Card                    Life Insurance  Computed
Instance  Income  Insurance  Sex  Age   Promotion       Probability
1         40K     0          1    45    0               0.007
2         30K     0          0    40    1               0.987
3         40K     0          1    42    0               0.024
4         30K     1          1    43    1               1.000
5         50K     0          0    38    1               0.999
6         20K     0          0    55    0               0.049
7         30K     1          1    35    1               1.000
8         20K     0          1    27    0               0.584
9         30K     0          1    43    0               0.005
10        30K     0          0    41    1               0.981
11        40K     0          0    43    1               0.985
12        20K     0          1    29    1               0.380
13        50K     0          0    39    1               0.999
14        40K     0          1    55    0               0.000
15        20K     1          0    19    1               1.000
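A sketch that applies the model to Table 10.3; the coefficient signs here were inferred by matching the table's computed probabilities (the slide's rendering lost them), so treat them as approximate. Sex is assumed coded 1 = male, 0 = female:

```python
import math

def life_ins_prob(income, cc_ins, sex, age):
    # ax + c, with signs inferred from the computed probabilities.
    z = (0.0001 * income + 19.827 * cc_ins
         - 8.314 * sex - 0.415 * age + 17.691)
    return math.exp(z) / (1 + math.exp(z))

p1 = life_ins_prob(40000, 0, 1, 45)   # instance 1: table lists 0.007
p2 = life_ins_prob(30000, 0, 0, 40)   # instance 2: table lists 0.987
```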

Page 11

10.3 Bayes Classifier

P(H | E) = P(E | H) P(H) / P(E)

where H is the hypothesis to be tested and E is the evidence associated with H.

Page 12

Table 10.4 • Data for Bayes Classifier

Magazine   Watch      Life Insurance  Credit Card
Promotion  Promotion  Promotion       Insurance    Sex
Yes        No         No              No           Male
Yes        Yes        Yes             Yes          Female
No         No         No              No           Male
Yes        Yes        Yes             Yes          Male
Yes        No         Yes             No           Female
No         No         No              No           Female
Yes        Yes        Yes             Yes          Male
No         No         No              No           Male
Yes        No         No              No           Male
Yes        Yes        Yes             No           Female

Bayes Classifier: An Example

Page 13

The Instance to be Classified

Magazine Promotion = Yes

Watch Promotion = Yes

Life Insurance Promotion = No

Credit Card Insurance = No

Sex = ?

Table 10.5 • Counts and Probabilities for Attribute Sex

                  Magazine        Watch           Life Insurance  Credit Card
                  Promotion       Promotion       Promotion       Insurance
Sex               Male  Female    Male  Female    Male  Female    Male  Female
Yes               4     3         2     2         2     3         2     1
No                2     1         4     2         4     1         4     3
Ratio: yes/total  4/6   3/4       2/6   2/4       2/6   3/4       2/6   1/4
Ratio: no/total   2/6   1/4       4/6   2/4       4/6   1/4       4/6   3/4

Page 14

Computing The Probability For Sex = Male

P(sex = male | E) = P(E | sex = male) P(sex = male) / P(E)

Page 15

Conditional Probabilities for Sex = Male

P(magazine promotion = yes | sex = male) = 4/6

P(watch promotion = yes | sex = male) = 2/6

P(life insurance promotion = no | sex = male) = 4/6

P(credit card insurance = no | sex = male) = 4/6

P(E | sex = male) = (4/6) (2/6) (4/6) (4/6) = 8/81

Page 16

The Probability for Sex=Male Given Evidence E

P(sex = male | E) ≈ 0.0593 / P(E)

The Probability for Sex=Female Given Evidence E

P(sex = female | E) ≈ 0.0281 / P(E)
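Both numerators come straight from the Table 10.5 ratios; a short sketch:

```python
# Naive Bayes comparison for the instance to be classified:
# magazine = yes, watch = yes, life insurance = no, credit card = no.
p_e_male = (4/6) * (2/6) * (4/6) * (4/6)        # = 8/81
p_e_female = (3/4) * (2/4) * (1/4) * (3/4)

p_male, p_female = 6/10, 4/10   # 6 males, 4 females in Table 10.4

num_male = p_e_male * p_male         # about 0.0593
num_female = p_e_female * p_female   # about 0.0281

# P(E) is common to both, so the larger numerator decides the class.
prediction = "male" if num_male > num_female else "female"
```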

Page 17

Zero-Valued Attribute Counts

When an attribute count of n out of d is zero, the ratio n/d is replaced by:

(n + kp) / (d + k)

where

k = a value between 0 and 1 (usually 1)

p = an equal fractional part of the total number of possible values for the attribute
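A minimal sketch of the adjustment (the function name is mine, not the book's):

```python
def adjusted_ratio(n, d, k=1.0, p=0.5):
    """Replace the ratio n/d with (n + kp) / (d + k) so that a zero
    attribute count never yields a zero probability. Here p = 0.5
    corresponds to a two-valued attribute."""
    return (n + k * p) / (d + k)

zero_fixed = adjusted_ratio(0, 6)   # 0/6 becomes 0.5/7
```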

Page 18

Missing Data

With the Bayes classifier, missing data items are simply ignored.

Page 19

Numeric Data

f(x) = [1 / (√(2π) σ)] e^(−(x − μ)² / (2σ²))

where

e = the exponential function

μ = the class mean for the given numerical attribute

σ = the class standard deviation for the attribute

x = the attribute value
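A direct transcription of the density function (the example mean and standard deviation are made up):

```python
import math

def normal_pdf(x, mu, sigma):
    """f(x) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    coef = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coef * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# E.g., the likelihood of Age = 45 under a hypothetical class with
# mean 40 and standard deviation 5 (values chosen for illustration):
likelihood = normal_pdf(45, 40, 5)
```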

Page 20

Table 10.6 • Addition of Attribute Age to the Bayes Classifier Dataset

Magazine   Watch      Life Insurance  Credit Card
Promotion  Promotion  Promotion       Insurance    Age  Sex
Yes        No         No              No           45   Male
Yes        Yes        Yes             Yes          40   Female
No         No         No              No           42   Male
Yes        Yes        Yes             Yes          30   Male
Yes        No         Yes             No           38   Female
No         No         No              No           55   Female
Yes        Yes        Yes             Yes          35   Male
No         No         No              No           27   Male
Yes        No         No              No           43   Male
Yes        Yes        Yes             No           41   Female

Page 21

10.4 Clustering Algorithms

Agglomerative Clustering

1. Place each instance into a separate partition.

2. Until all instances are part of a single cluster:

a. Determine the two most similar clusters.

b. Merge the clusters chosen into a single cluster.

3. Choose a clustering formed by one of the step 2 iterations as a final result.

Page 22

Table 10.7 • Five Instances from the Credit Card Promotion Database

          Income   Magazine   Watch      Life Insurance
Instance  Range    Promotion  Promotion  Promotion       Sex
I1        40–50K   Yes        No         No              Male
I2        25–35K   Yes        Yes        Yes             Female
I3        40–50K   No         No         No              Male
I4        25–35K   Yes        Yes        Yes             Male
I5        50–60K   Yes        No         Yes             Female

Agglomerative Clustering: An Example

Page 23

Table 10.8 • Agglomerative Clustering: First Iteration

     I1    I2    I3    I4    I5
I1   1.00
I2   0.20  1.00
I3   0.80  0.00  1.00
I4   0.40  0.80  0.20  1.00
I5   0.40  0.60  0.20  0.40  1.00

Table 10.9 • Agglomerative Clustering: Second Iteration

        I1 I3   I2    I4    I5
I1 I3   0.80
I2      0.33   1.00
I4      0.47   0.80  1.00
I5      0.47   0.60  0.40  1.00
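The scores in Table 10.8 are consistent with a simple matching measure (the fraction of attribute values two instances share); a sketch that recomputes them under that assumption:

```python
# Recompute Table 10.8 similarities by simple attribute matching.
instances = {
    "I1": ("40-50K", "Yes", "No",  "No",  "Male"),
    "I2": ("25-35K", "Yes", "Yes", "Yes", "Female"),
    "I3": ("40-50K", "No",  "No",  "No",  "Male"),
    "I4": ("25-35K", "Yes", "Yes", "Yes", "Male"),
    "I5": ("50-60K", "Yes", "No",  "Yes", "Female"),
}

def similarity(a, b):
    pairs = list(zip(instances[a], instances[b]))
    return sum(x == y for x, y in pairs) / len(pairs)
```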

Page 24

A final clustering

• Compare the average within-cluster similarity to the overall similarity

• Compare the similarity within each cluster to the similarity between clusters

• Examine the rule sets generated by each saved clustering

Page 25

Conceptual Clustering

1. Create a cluster with the first instance as its only member.

2. For each remaining instance, take one of two actions at each tree level.

a. Place the new instance into an existing cluster.

b. Create a new concept cluster having the new instance as its only member.

Page 26

Data for Conceptual Clustering

Table 10.10 • Data for Conceptual Clustering

     Tails  Color  Nuclei
I1   One    Light  One
I2   Two    Light  Two
I3   Two    Dark   Two
I4   One    Dark   Three
I5   One    Light  Two
I6   One    Light  Two
I7   One    Light  Three

Page 27

[Figure: the concept hierarchy formed from the Table 10.10 data. The root node N contains all seven instances (P(N) = 7/7). Its children are N1 = {I1, I5, I6} with P(N1) = 3/7, N2 = {I2, I3} with P(N2) = 2/7, and N4 = {I4, I7} with P(N4) = 2/7. Within N1, node N3 = {I1} has P(N3) = 1/3 and node N5 = {I5, I6} has P(N5) = 2/3. Each node stores the conditional probabilities P(V | C) and P(C | V) for the values of the attributes Tails (One, Two), Color (Light, Dark), and Nuclei (One, Two, Three).]

Page 28

COBWEB (Fisher, 1987)

Heuristic measure of partition quality

Category utility

Page 29

Expectation Maximization

1. Similar to the K-Means procedure

2. Makes use of the finite Gaussian mixtures model

3. The mixture model assigns each data instance a probability of membership in each cluster.

Page 30

3.3 The K-Means Algorithm

1. Choose a value for K, the total number of clusters.

2. Randomly choose K points as cluster centers.

3. Assign the remaining instances to their closest cluster center.

4. Calculate a new cluster center for each cluster.

5. Repeat steps 3 and 4 until the cluster centers do not change.

Page 31

Table 3.6 • K-Means Input Values

Instance   X     Y
1          1.0   1.5
2          1.0   4.5
3          2.0   1.5
4          2.0   3.5
5          3.0   2.5
6          5.0   6.0
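The steps above can be sketched on this data; the two initial centers below are an arbitrary choice (the algorithm says to pick them randomly):

```python
# K-Means with K = 2 on the Table 3.6 instances.
points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5),
          (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]
centers = [(1.0, 1.5), (2.0, 1.5)]   # step 2: initial cluster centers

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

while True:
    # Step 3: assign each instance to its closest cluster center.
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda j: dist2(p, centers[j]))
        clusters[nearest].append(p)
    # Step 4: recompute each center as the mean of its cluster
    # (assumes no cluster goes empty, which holds for this data).
    new_centers = [(sum(x for x, _ in c) / len(c),
                    sum(y for _, y in c) / len(c)) for c in clusters]
    # Step 5: stop when the centers no longer change.
    if new_centers == centers:
        break
    centers = new_centers
```

With these starting centers the run converges with instance 6, the outlier at (5.0, 6.0), in a cluster of its own.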

[Figure: scatter plot of the six input instances, with x ranging from 0 to 6 and f(x) from 0 to 7.]

Page 32

Expectation Maximization

1. Guess initial values for the parameters.

2. Until a termination criterion is achieved:

a. Use the probability density function for normal distributions to compute the cluster probability for each instance.

b. Use the probability scores assigned to each instance in step 2(a) to re-estimate the parameters.
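A minimal one-dimensional, two-Gaussian sketch of this loop on synthetic data (an illustration only, not the chapter's gamma-ray burst model):

```python
import math
import random

random.seed(7)
data = ([random.gauss(0.0, 1.0) for _ in range(200)] +
        [random.gauss(10.0, 1.0) for _ in range(200)])

def pdf(x, mu, sigma):
    return (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
            / (math.sqrt(2 * math.pi) * sigma))

# Step 1: guess initial values for the parameters.
mus = [min(data), max(data)]
sigmas = [1.0, 1.0]
weights = [0.5, 0.5]

for _ in range(50):   # step 2: iterate until effectively converged
    # Step 2a: cluster probability of each instance under each Gaussian.
    resp = []
    for x in data:
        scores = [w * pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
        total = sum(scores)
        resp.append([s / total for s in scores])
    # Step 2b: re-estimate the parameters from the probability scores.
    for j in range(2):
        nj = sum(r[j] for r in resp)
        mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
        sigmas[j] = math.sqrt(
            sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj)
        weights[j] = nj / len(data)
```

With the two components initialized at the data extremes, the estimated means settle near the true values of 0 and 10.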

Page 33

Table 10.11 • An EM Clustering of Gamma-Ray Burst Data

               Cluster 0   Cluster 1   Cluster 2

# Instances    518         340         321

Log Fluence
  Mean         –5.6670     –4.8131     –6.3657
  SD           0.4088      0.5301      0.5812

Log HR321
  Mean         0.0538      0.2949      0.5478
  SD           0.3018      0.1939      0.2766

Log T90
  Mean         1.2709      1.7159      –0.3794
  SD           0.4906      0.3793      0.4825

Page 34

10.5 Heuristics or Statistics?

Inductive problem-solving methods:

• Query and visualization techniques

• Machine learning techniques

• Statistical techniques

Page 35

Query and Visualization Techniques

• Query tools and OLAP tools
  – Unable to find hidden patterns

• Visualization tools
  – Decision trees, bar and pie charts, histograms, maps, surface plot diagrams
  – Applied after a data mining process to help us understand what has been discovered

Page 36

Machine Learning and Statistical Techniques

1. Statistical techniques typically assume an underlying distribution for the data whereas machine learning techniques do not.

2. Machine learning techniques tend to have a human flavor: the models they build, such as rules and decision trees, are easy for people to interpret.

3. Machine learning techniques are better able to deal with missing and noisy data.

4. Most machine learning techniques are able to explain their behavior.

5. Statistical techniques tend to perform poorly with large-sized data sets.