And Here We Go … Get ready to study for the AP Stats test! Only 1050 minutes of class time until the big day … Friday, MAY 10!


How much studying will you do for $521.04?

plus book…

The Exam Itself

To maximize your score on the AP Statistics Exam, you first need to know how the exam is organized and how it will be scored.

The AP Statistics Exam consists of two separate sections:

Section I: 40 Multiple-Choice questions
• 90 minutes
• counts 50 percent of exam score

Section II: Free-Response questions
• 90 minutes
• counts 50 percent of exam score

Questions are designed to test your statistical reasoning and your communication skills.

SCORING:
• Five open-ended problems @ 13 minutes; each counts 15 percent of free-response score
• One investigative task @ 25 minutes; counts 25 percent of free-response score

Each free-response question is scored on a 0 to 4 scale. Your work is graded holistically, meaning that your entire response to a problem is considered before a score is assigned. General descriptors for each of the scores are:

4 Complete Response: NO statistical errors and clear communication

3 Substantial Response: Minor statistical error/omission or fuzzy communication

2 Developing Response: Important statistical error/omission or lousy communication

1 Minimal Response: A "glimmer" of statistical knowledge related to the problem

0 Inadequate Response: No glimmer; statistically dangerous to himself and others

Calculator Policy

Each student is expected to bring to the exam a graphing calculator with statistical capabilities. The computational capabilities should include standard statistical univariate and bivariate summaries, through linear regression. The graphical capabilities should include common univariate and bivariate displays such as histograms, boxplots, and scatterplots.

• You can bring two calculators to the exam.

• The calculator memory will not be cleared but you may only use the memory to store programs, not notes.

• For the exam, you're not allowed to access any information in your graphing calculators or elsewhere if it's not directly related to upgrading the statistical functionality of older graphing calculators to make them comparable to statistical features found on newer models. The only acceptable upgrades are those that improve the computational functionalities and/or graphical functionalities for data you key into the calculator while taking the examination. Unacceptable enhancements include, but aren't limited to, keying or scanning text or response templates into the calculator.

• During the exam, you can't use minicomputers, pocket organizers, electronic writing pads, or calculators with QWERTY (i.e., typewriter) keyboards.

2008-09 List of Graphing Calculators

Graphing calculators having the expected built-in capabilities listed above are indicated with an asterisk (*). However, students may bring any calculator on the list to the exam; any model within each series is acceptable.

Casio: FX-6000 series, FX-6200 series, FX-6300 series, FX-6500 series, FX-7000 series, FX-7300 series, FX-7400 series, FX-7500 series, FX-7700 series, FX-7800 series, FX-8000 series, FX-8500 series, FX-8700 series, FX-8800 series, FX-9700 series *, FX-9750 series *, FX-9860 series *, CFX-9800 series *, CFX-9850 series *, CFX-9950 series *, CFX-9970 series *, FX 1.0 series *, Algebra FX 2.0 series *

Hewlett-Packard: HP-9G, HP-28 series *, HP-38G *, HP-39 series *, HP-40 series *, HP-48 series *, HP-49 series *, HP-50 series *

Radio Shack: EC-4033, EC-4034, EC-4037

Sharp: EL-5200, EL-9200 series *, EL-9300 series *, EL-9600 series *†, EL-9900 series *

Texas Instruments: TI-73, TI-80, TI-81, TI-82 *, TI-83/TI-83 Plus *, TI-83 Plus Silver *, TI-84 Plus *, TI-84 Plus Silver *, TI-85 *, TI-86 *, TI-89 *, TI-89 Titanium *, TI-Nspire *, TI-Nspire CAS *

Other: Datexx DS-883, Micronta, Smart2

Exam grade            2008 Statistics        Goins 2008
5                     14,009    12.8%        3     12%
4                     24,528    22.6%        7     28%
3                     25,707    23.8%        8     32%
2                     20,403    18.8%        4     16%
1                     23,637    21.9%        3     12%
Number of students    108,284                25
3 or higher / %       64,244    59.2%        18    72%
Mean grade            2.86                   3.12
Standard deviation    1.34

1st AP Statistics test: 1997 ~ 7500 students

2008 AP Stat test: ~ 100,000 students
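The mean grade and standard deviation in the 2008 table can be recomputed from the national counts. The sketch below is only an illustration: the function name `grade_summary` is made up here, and it assumes the standard deviation is the population form (dividing by n), which is indistinguishable from the sample form at this sample size.

```python
import math

# 2008 national counts per exam grade, copied from the table above
counts_2008 = {5: 14009, 4: 24528, 3: 25707, 2: 20403, 1: 23637}

def grade_summary(counts):
    """Return (n, mean, standard deviation) for a {grade: count} table."""
    n = sum(counts.values())
    mean = sum(grade * k for grade, k in counts.items()) / n
    # Weighted average of squared deviations from the mean (population form)
    variance = sum(k * (grade - mean) ** 2 for grade, k in counts.items()) / n
    return n, mean, math.sqrt(variance)

n, mean_grade, sd_grade = grade_summary(counts_2008)
print(n, round(mean_grade, 2), round(sd_grade, 2))
```

The rounded results agree with the "Number of students", "Mean grade", and "Standard deviation" rows for 2008 Statistics.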

Exam grade            2009 Statistics        Goins 2009
5                               12.3%        2     4.3%
4                               22.3%        6     12.8%
3                               24.2%        17    36.2%
2                               19.1%        12    25.5%
1                               22.2%        10    21.3%
Number of students    116,876                47
3 or higher / %       68,679    58.8%        25    53.3%
Mean grade            2.83                   2.56
Standard deviation    1.33

2009 AP Stat test: 116,876 students

Exam grade            2010 Statistics        Goins 2010
5                               12.8%        5     13.9%
4                               22.4%        10    27.8%
3                               23.5%        11    30.6%
2                               18.2%        6     16.7%
1                               23.1%        4     11.1%
Number of students    129,899                36
3 or higher / %                 58.7%              72.3%
Mean grade            2.84                   3.167
Standard deviation    1.35                   1.2

2010 AP Stat test: ~ 109,609 students

Exam grade            2011 Statistics        Goins 2011
5                               12.1%        8     16.0%
4                               21.3%        18    36.0%
3                               25.0%        14    28.0%
2                               17.8%        7     14.0%
1                               23.9%        3     6.0%
Number of students    142,910                50
3 or higher / %                 58.8%              80.0%
Mean grade            2.82                   3.42
Standard deviation    1.34                   1.1

2011 AP Stat test: ~ 137,498 students

Exam grade            2012 Statistics        Goins 2012
5                               12.5%        5     8.2%
4                               21.1%        13    21.3%
3                               25.6%        17    27.9%
2                               18.0%        16    26.2%
1                               22.8%        10    16.4%
Number of students    153,859                61
3 or higher / %                 59.2%              57.4%
Mean grade            2.83                   2.62
Standard deviation    1.33

2012 AP Stat test: ~ 143,554 students

The AP Statistics Exam covers material in these areas:

I. Exploring data: describing patterns and departures from patterns (20–30%)
• Analyze data using graphical and numerical techniques
• Emphasis on interpreting info from graphical and numerical displays and summaries

II. Sampling and experimentation: planning and conducting a study (10–15%)
• Collecting data with a well developed plan
• Clarifying the question and deciding on a method of data collection and analysis

III. Anticipating patterns: Exploring random phenomena using probability and simulations (20–30%)
• Anticipating what the distribution of data should look like under a given model

IV. Statistical inference: Estimating population parameters and testing hypotheses (30–40%)
• Selecting appropriate models for statistical inferences

So. . . Let’s get started!

What do you call data that has only ONE variable?

UNIVARIATE DATA

What are the two types of univariate data sets?

Categorical: qualitative (brand)
• Type of computer you use
• Car you drive
• Area codes

Numerical: quantitative (numerical in nature)
• Height
• Price of textbook
• Amount of cola in can

What are the two types of numerical data?

Discrete: possible values are isolated points on a number line
• Number of AP classes

Continuous: possible values form an interval (measurements are usually continuous)
• Distance lives from school

What are appropriate graphical displays for categorical data?

Bar Graphs
• Bars do not touch
• Categorical variable is typically on the horizontal axis
• To describe – comment on which occurred the most often or least often
• May make a double bar graph or segmented bar graph for bivariate categorical data sets

[Figure: bar graph "Subject Preference" with categories History, Math, Science, English, Business, Foreign language]

[Figure: double bar graph "Subject preference by gender" (Male, Female) with the same categories]

What are appropriate graphical displays for categorical data?

Pie Charts
• To make:
– Proportion × 360°
– Using a protractor, mark off each part
• To describe – comment on which occurred the most often or least often

[Figure: pie chart "Subject Preference": History 6%, Math 44%, Science 27%, English 13%, Business 2%, Foreign language 8%]

What are appropriate graphical displays for numerical data?

Dot Plot

• Used with numerical data (either discrete or continuous)

• Made by putting dots (or X’s) on a number line

• Can make comparative dotplots by using the same axis for multiple groups

Stem (and leaf) Plot

• Used with univariate, numerical data

• Must have key so that we know how to read numbers

• Can split stems when you have long list of leaves

• Can have a comparative stemplot with two groups (back to back)

What are appropriate graphical displays for numerical data?

Histograms
• Used with numerical data
• Bars touch on histograms
• Two types
– Discrete: bars are centered over discrete values
– Continuous: bars cover a class (interval) of values
• For comparative histograms – use two separate graphs with the same scale on the horizontal axis
• Use no fewer than 5 classes (bars)
• Check to see if scale is misleading
• Look for symmetry and skewness

What are appropriate graphical displays for numerical data?

Cumulative Relative Frequency Plot (Ogive)
• . . . is used to answer questions about percentiles.
• Percentiles are the percent of individuals that are at or below a certain value.
• Quartiles are located every 25% of the data. The first quartile (Q1) is the 25th percentile, while the third quartile (Q3) is the 75th percentile. What is the special name for Q2?
• Interquartile Range (IQR) is the range of the middle half (50%) of the data.

IQR = Q3 – Q1

What are appropriate graphical displays for numerical data?

Boxplot (and whisker)

• Used with numerical data (either discrete or continuous)

• Modified shows outliers

• Can make comparative by showing side-by-side on same scale

• Good for comparing quartiles, medians, and spread

Why use boxplots?
• ease of construction
• convenient handling of outliers
• construction is not subjective (like histograms)
• Used with medium or large size data sets (n > 10)
• useful for comparative displays

Why not use boxplots?
• does not retain the individual observations
• should not be used with small data sets (n < 10)

How to construct
• find five-number summary: Min Q1 Med Q3 Max
• draw box from Q1 to Q3
• draw median as center line in the box
• extend whiskers to min & max

Modified boxplots
• display outliers
• fences mark off mild & extreme outliers
• whiskers extend to largest (smallest) data value inside the fence

ALWAYS use modified boxplots in this class!!!

Modified Boxplot . . .

Interquartile Range (IQR) – is the range (length) of the box: IQR = Q3 – Q1

Inner fence: Q1 – 1.5IQR and Q3 + 1.5IQR

Any observation outside this fence is an outlier! Put a dot for the outliers.

Draw the “whisker” from the quartiles to the observation that is within the fence!

Outer fence: Q1 – 3IQR and Q3 + 3IQR

Any observation outside this fence is an extreme outlier!

Any observation between the fences is considered a mild outlier.
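The fence rules can be sketched in a few lines of Python. This is only an illustration: the data set is hypothetical, the function names are made up here, and quartiles are taken as the medians of the lower and upper halves of the ordered data (calculators and software may use slightly different quartile conventions).

```python
from statistics import median

def quartiles(data):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data."""
    xs = sorted(data)
    half = len(xs) // 2          # the middle value is left out when n is odd
    return median(xs[:half]), median(xs[-half:])

def classify_outliers(data):
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    # Extreme outliers lie beyond the outer fence
    extreme = [x for x in data if x < outer_low or x > outer_high]
    # Mild outliers lie between the inner and outer fences
    mild = [x for x in data
            if (outer_low <= x < inner_low) or (inner_high < x <= outer_high)]
    # Whiskers stop at the most extreme observations inside the inner fence
    inside = [x for x in data if inner_low <= x <= inner_high]
    return {"iqr": iqr, "mild": mild, "extreme": extreme,
            "whiskers": (min(inside), max(inside))}

# Hypothetical data set, not from the slides
sample = [1, 10, 11, 12, 13, 14, 15, 16, 17, 30, 40]
result = classify_outliers(sample)
print(result)
```

For this made-up sample, Q1 = 11 and Q3 = 17, so the inner fences sit at 2 and 26 and the outer fences at -7 and 35.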

[Figures: symmetrical boxplots; approximately symmetrical boxplot; skewed boxplot]

For each variable, identify the type of variable and an appropriate graph.

Variable                                                           Type of variable
the average number of texts sent per month                         Discrete numerical
the Math SAT Score for students at your school                     Discrete numerical
the area code of an individual                                     Categorical
the favorite movie type of AP Stat students by gender              Categorical
the birth weights of female babies born at a large hospital        Continuous numerical
the number of speeding tickets each student in AP Stat received    Discrete numerical
the number of TV’s in the homes of AP Stat students                Discrete numerical
the color of M&M candies selected at random from a bag             Categorical
the income of adults in your city                                  Continuous numerical
the heights of male students in your school                        Continuous numerical

Graphs to choose from: bar graph; bar graph – segmented or double; dot plot; stem plot; histogram; cumulative frequency plot (ogive)

How do you describe univariate data?

Just CUSS and BS!

Center: “the typical value”
• Median
• Mean

Unusual Features
• Gaps
• Outliers

Shape
• single vs. multiple modes (unimodal, bimodal)
• symmetry vs. skewness

[Figure: Illustrated Distribution Shapes: unimodal, bimodal, multimodal; symmetric, skewed positively (right), skewed negatively (left)]

Spread: “how tightly values cluster around the center”
• Range
• Standard deviation
• 5-number summary
• IQR

And Be Specific!

Measures of Central Tendency

• Median - the middle of the data; 50th percentile
–Observations must be in numerical order
–Is the middle single value if n is odd
–The average of the middle two values if n is even

NOTE: n denotes the sample size

Measures of Central Tendency

• Mean - the arithmetic average

–Use $\mu$ to represent a population mean (parameter)

–Use $\bar{x}$ to represent a sample mean (statistic)

Formula: $\bar{x} = \dfrac{\sum x}{n}$

$\Sigma$ is the capital Greek letter sigma – it means to sum the values that follow

Measures of Central Tendency

• Mode – the observation that occurs the most often

–Can be more than one mode

–If all values occur only once – there is no mode

–Not used as often as mean & median

Suppose we are interested in the number of lollipops that are bought at a certain store. A sample of 5 customers buys the following number of lollipops. Find the median.

2 3 4 8 12

The numbers are in order & n is odd – so find the middle observation.

The median is 4 lollipops!

Suppose we have a sample of 6 customers that buy the following number of lollipops. The median is …

2 3 4 6 8 12

The numbers are in order & n is even – so find the middle two observations (4 and 6).

Now, average these two values.

The median is 5 lollipops!

Suppose we have a sample of 6 customers that buy the following number of lollipops. Find the mean.

2 3 4 6 8 12

To find the mean number of lollipops add the observations and divide by n.

$\bar{x} = \dfrac{2 + 3 + 4 + 6 + 8 + 12}{6} = 5.833$

What would happen to the median & mean if the 12 lollipops were 20?

2 3 4 6 8 20

The median is . . . 5

The mean is . . . $\bar{x} = \dfrac{2 + 3 + 4 + 6 + 8 + 20}{6} = 7.17$

What happened?

What would happen to the median & mean if the 20 lollipops were 50?

2 3 4 6 8 50

The median is . . . 5

The mean is . . . $\bar{x} = \dfrac{2 + 3 + 4 + 6 + 8 + 50}{6} = 12.17$

What happened?

Resistant -

• Statistics that are not affected by outliers

• Is the median resistant? YES

• Is the mean resistant? NO
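The lollipop example can be replayed in Python to watch the mean move while the median stays put. This is a small sketch using the standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean, median

# Five smallest lollipop counts from the example; only the largest value changes
base = [2, 3, 4, 6, 8]

summary = {}
for largest in (12, 20, 50):
    sample = base + [largest]
    summary[largest] = (median(sample), round(mean(sample), 2))
    print(f"largest = {largest}: median = {summary[largest][0]}, mean = {summary[largest][1]}")
```

The median is 5 in all three cases, while the mean climbs from 5.83 to 7.17 to 12.17 as the largest value grows.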

Look at the following data set. Find the mean.

22 23 24 25 25 26 29 30

$\bar{x} = 25.5$

Now find how each observation deviates from the mean.

$x - \bar{x}$ (This is the deviation from the mean.)

What is the sum of the deviations from the mean?

$\sum (x - \bar{x}) = 0$

Will this sum always equal zero? YES
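A quick check of the deviations for this data set, written as a short sketch with illustrative variable names:

```python
data = [22, 23, 24, 25, 25, 26, 29, 30]

x_bar = sum(data) / len(data)
deviations = [x - x_bar for x in data]   # each observation minus the mean

print(x_bar)            # the mean
print(deviations)       # negative below the mean, positive above it
print(sum(deviations))  # the positive and negative deviations cancel
```

Because the deviations always cancel, their plain sum cannot measure spread, which is why the deviations are squared before averaging when computing a variance.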

Look at the following data set. Find the mean & median.

21 23 23 24 25 25 26 26 26 27
27 27 27 28 30 30 30 31 32 32

Create a histogram with the data (use x-scale of 2). Then find the mean and median.

Mean = 27

Median = 27

Look at the placement of the mean and median in this symmetrical distribution.

Look at the following data set. Find the mean & median.

22 29 28 22 24 25 28 21 25
23 24 23 26 36 38 62 23

Create a histogram with the data (use x-scale of 8). Then find the mean and median.

Mean = 28.176

Median = 25

Look at the placement of the mean and median in this right skewed distribution.

Look at the following data set. Find the mean & median.

21 46 54 47 53 60 55 55 60
56 58 58 58 58 62 63 64

Create a histogram with the data. Then find the mean and median.

Mean = 54.588

Median = 58

Look at the placement of the mean and median in this skewed left distribution.

Recap:

• In a symmetrical distribution, the mean and median are equal.

• In a skewed distribution, the mean is pulled in the direction of the skewness.

• In a symmetrical distribution, you should report the mean!

• In a skewed distribution, the median should be reported as the measure of center!
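The three data sets used above can be summarized side by side to see the mean being pulled toward the tail. This is a sketch; the dictionary labels are just descriptive names chosen here.

```python
from statistics import mean, median

datasets = {
    "symmetric": [21, 23, 23, 24, 25, 25, 26, 26, 26, 27,
                  27, 27, 27, 28, 30, 30, 30, 31, 32, 32],
    "skewed right": [22, 29, 28, 22, 24, 25, 28, 21, 25,
                     23, 24, 23, 26, 36, 38, 62, 23],
    "skewed left": [21, 46, 54, 47, 53, 60, 55, 55, 60,
                    56, 58, 58, 58, 58, 62, 63, 64],
}

results = {}
for name, values in datasets.items():
    results[name] = (round(mean(values), 3), median(values))
    print(f"{name}: mean = {results[name][0]}, median = {results[name][1]}")
```

In the right-skewed set the mean lands above the median, and in the left-skewed set it lands below.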

Trimmed mean:

Purpose is to remove outliers from a data set

To calculate a trimmed mean:
• Multiply the % to trim by n
• Truncate that many observations from BOTH ends of the distribution (when listed in order)
• Calculate the mean with the shortened data set

Find a 10% trimmed mean with the following data.

12 14 19 20 22 24 25 26 26 35

10%(10) = 1

So remove one observation from each side!

$\bar{x}_T = \dfrac{14 + 19 + 20 + 22 + 24 + 25 + 26 + 26}{8} = 22$
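The trimming steps translate directly into a small function. This is a minimal sketch; the name `trimmed_mean` is chosen here, and it assumes the number of observations to drop is truncated to a whole number.

```python
def trimmed_mean(data, proportion):
    """Drop proportion * n observations from each end, then average the rest."""
    xs = sorted(data)
    k = int(proportion * len(xs))        # observations to remove from EACH end
    trimmed = xs[k:len(xs) - k]
    return sum(trimmed) / len(trimmed)

data = [12, 14, 19, 20, 22, 24, 25, 26, 26, 35]
result = trimmed_mean(data, 0.10)
print(result)
```

With a 10% trim on ten observations, the 12 and the 35 are dropped and the remaining eight values are averaged.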

Why is the study of variability important?

• Allows us to distinguish between usual & unusual values

• In some situations, want more/less variability
–scores on standardized tests
–time bombs
–medicine

Range:
• Single number – not an interval
• Sensitive to outliers
• Midrange – average of the max and min values - VERY sensitive to outliers

Interquartile Range (IQR): $IQR = Q_3 - Q_1$

Quartiles:

The first quartile (Q1) is the value for which 25% of the observations are less than. It is the median of the first half of the set of observations (the 25th percentile).

The third quartile (Q3) is the value for which 75% of the observations are less than. It is the median of the second half of the set of observations (the 75th percentile).

IQR is insensitive to outliers.

The average of the deviations squared is called the variance.

Population: $\sigma^2$ (parameter)
Sample: $s^2$ (statistic)

A standard deviation is a measure of the average deviation from the mean.

Population: $\sigma$ (parameter)
Sample: $s$ (statistic)

Suppose that we have this population:

24 34 26 30 37 16 28 21 35 29

Find the mean ($\mu$).

Find the deviations: $x - \mu$

What is the sum of the deviations from the mean?

Square the deviations: $(x - \mu)^2$

Find the average of the squared deviations:

$\sigma^2 = \dfrac{\sum (x - \mu)^2}{n}$

Calculation of variance of a sample

$s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$, where $n - 1$ is the degrees of freedom (df)

Degrees of Freedom (df)

• n deviations contain (n - 1) independent pieces of information about variability

Calculation of standard deviation of a sample

$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
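The population and sample formulas differ only in the divisor (n versus n - 1). The sketch below works through the ten-value data set from the population example step by step, then compares the result with the functions in Python's `statistics` module. Treating the same ten values as a sample is done here purely to show the effect of the divisor.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

values = [24, 34, 26, 30, 37, 16, 28, 21, 35, 29]

mu = mean(values)
deviations = [x - mu for x in values]
squared = [d ** 2 for d in deviations]

# Population variance: average of the squared deviations (divide by n)
sigma_sq = sum(squared) / len(values)
# Sample variance: divide by the degrees of freedom, n - 1
s_sq = sum(squared) / (len(values) - 1)

print(mu, sum(deviations))                        # mean and sum of deviations
print(sigma_sq, round(sigma_sq ** 0.5, 3))        # population variance and SD
print(round(s_sq, 3), round(s_sq ** 0.5, 3))      # sample variance and SD
print(pvariance(values), pstdev(values))          # library versions (population)
print(variance(values), stdev(values))            # library versions (sample)
```

The sample versions come out slightly larger, since the same sum of squared deviations is divided by 9 instead of 10.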

When to use what?

Note: Variance and Standard Deviation are used to measure spread when the mean is used to describe center.

Note: IQR is typically used to describe spread when Median is used to describe center.

Note: When the distribution is approximately symmetric, the mean and standard deviation are generally used to summarize the distribution. If the distribution is skewed, a five number summary is generally used.

Which measure(s) of variability is/are resistant?

Linear transformation rule

• When adding a constant to a random variable, the mean changes but not the standard deviation.

• When multiplying a random variable by a constant, both the mean and the standard deviation change.

An appliance repair shop charges a $30 service call to go to a home for a repair. It also charges $25 per hour for labor. From past history, the average length of repairs is 1 hour 15 minutes (1.25 hours) with standard deviation of 20 minutes (1/3 hour). Including the charge for the service call, what is the mean and standard deviation for the charges for labor?

$\mu = 30 + 25(1.25) = \$61.25$

$\sigma = 25\left(\tfrac{1}{3}\right) = \$8.33$
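The repair-shop calculation follows the pattern charge = a + b × hours, so it can be sketched as a tiny helper. The function name `linear_transform` is made up for this illustration.

```python
def linear_transform(mean, sd, a, b):
    """Mean and SD of a + b*X, given the mean and SD of X."""
    new_mean = a + b * mean      # adding a shifts the mean; multiplying by b scales it
    new_sd = abs(b) * sd         # only the multiplier affects the spread
    return new_mean, new_sd

# Repair time: mean 1.25 hours, SD 1/3 hour; $30 service call plus $25 per hour
mean_charge, sd_charge = linear_transform(1.25, 1 / 3, a=30, b=25)
print(round(mean_charge, 2), round(sd_charge, 2))
```

The $30 service call shows up in the mean but not in the standard deviation.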

Rules for Combining two variables

• To find the mean for the sum (or difference), add (or subtract) the two means

• To find the standard deviation of the sum (or difference), ALWAYS add the variances, then take the square root.

• Formulas:

$\mu_{a+b} = \mu_a + \mu_b$

$\mu_{a-b} = \mu_a - \mu_b$

$\sigma_{a \pm b} = \sqrt{\sigma_a^2 + \sigma_b^2}$ (if variables are independent)

Bicycles arrive at a bike shop in boxes. Before they can be sold, they must be unpacked, assembled, and tuned (lubricated, adjusted, etc.). Based on past experience, the times for each setup phase are independent with the following means & standard deviations (in minutes). What are the mean and standard deviation for the total bicycle setup times?

Phase Mean SD

Unpacking 3.5 0.7

Assembly 21.8 2.4

Tuning 12.3 2.7

$\mu_T = 3.5 + 21.8 + 12.3 = 37.6$ minutes

$\sigma_T = \sqrt{0.7^2 + 2.4^2 + 2.7^2} = 3.680$ minutes
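The bicycle setup totals can be checked with a short script. This sketch relies on the independence of the three phases stated in the problem; the dictionary layout is just one convenient way to hold the table.

```python
import math

# Phase: (mean, standard deviation) in minutes, from the table above
phases = {
    "Unpacking": (3.5, 0.7),
    "Assembly": (21.8, 2.4),
    "Tuning": (12.3, 2.7),
}

# Means add directly
total_mean = sum(m for m, _ in phases.values())
# Standard deviations do not add; variances do (for independent variables)
total_sd = math.sqrt(sum(sd ** 2 for _, sd in phases.values()))

print(round(total_mean, 1), round(total_sd, 3))
```

Note that simply adding the three standard deviations would give 5.8 minutes, which overstates the spread of the total.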

Normal Distributions

• Symmetrical bell-shaped (unimodal) density curve
• Above the horizontal axis
• N($\mu$, $\sigma$)
• The transition points occur at $\mu \pm \sigma$
• Probability is calculated by finding the area under the curve (How is this done mathematically?)
• As $\sigma$ increases, the curve flattens & spreads out
• As $\sigma$ decreases, the curve gets taller and thinner

Normal distributions occur frequently.

• Length of newborn child
• Height
• Weight
• ACT or SAT scores
• Intelligence
• Number of typing errors
• Chemical processes

[Figure: two normal curves, labeled A and B]

Do these two normal curves have the same mean? If so, what is it? YES, 6

Which normal curve has a standard deviation of 3? B

Which normal curve has a standard deviation of 1? A

Empirical Rule
• Approximately 68% of the observations fall within $\sigma$ of $\mu$
• Approximately 95% of the observations fall within $2\sigma$ of $\mu$
• Approximately 99.7% of the observations fall within $3\sigma$ of $\mu$

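Suppose that the height of male students at SHS is normally distributed with a mean of 71 inches and standard deviation of 2.5 inches. What is the probability that the height of a randomly selected male student is more than 73.5 inches?

73.5 inches is one standard deviation above the mean of 71. About 68% of heights fall within one standard deviation of the mean, so 1 - .68 = .32 fall outside it, split evenly between the two tails.

P(X > 73.5) = .32 / 2 = 0.16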
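The Empirical Rule percentages and the height example can be checked numerically with `statistics.NormalDist` from the Python standard library. This is a sketch for verification, not part of the exam method, which uses the rule itself or a table.

```python
from statistics import NormalDist

heights = NormalDist(mu=71, sigma=2.5)   # male student heights, in inches

# Proportion within 1, 2, and 3 standard deviations of the mean
within = {k: heights.cdf(71 + k * 2.5) - heights.cdf(71 - k * 2.5)
          for k in (1, 2, 3)}
for k, p in within.items():
    print(f"within {k} SD: {p:.4f}")

# Upper tail beyond one standard deviation above the mean
p_taller = 1 - heights.cdf(73.5)
print(f"P(X > 73.5) = {p_taller:.4f}")
```

The computed values round to the 68%, 95%, and 99.7% figures, and the tail probability rounds to 0.16.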

Standard Normal Density Curves

Always has $\mu = 0$ & $\sigma = 1$

To standardize:

$z = \dfrac{x - \mu}{\sigma}$

Must have this memorized!

Strategies for finding probabilities or proportions in normal distributions

1. State the probability statement
2. Draw a picture
3. Calculate the z-score
4. Look up the probability (proportion) in the table

The lifetime of a certain type of battery is normally distributed with a mean of 200 hours and a standard deviation of 15 hours. What proportion of these batteries can be expected to last less than 220 hours?

1. Write the probability statement: P(X < 220)
2. Draw & shade the curve
3. Calculate z-score: $z = \dfrac{220 - 200}{15} = 1.33$
4. Look up z-score in table: P(X < 220) = .9082

The lifetime of a certain type of battery is normally distributed with a mean of 200 hours and a standard deviation of 15 hours. What proportion of these batteries can be expected to last more than 220 hours?

$z = \dfrac{220 - 200}{15} = 1.33$

P(X > 220) = 1 - .9082 = .0918

The lifetime of a certain type of battery is normally distributed with a mean of 200 hours and a standard deviation of 15 hours. How long must a battery last to be in the top 5%?

P(X > ?) = .05

Look up 0.95 in the table to find the z-score: z = 1.645

$1.645 = \dfrac{x - 200}{15}$

$x = 224.675$
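The three battery questions can be cross-checked with `statistics.NormalDist`. In this sketch, `cdf` plays the role of looking up a probability from a z-score and `inv_cdf` the role of reading the table backwards. Small differences from the table answers are expected, since the table method rounds z to two decimal places.

```python
from statistics import NormalDist

battery = NormalDist(mu=200, sigma=15)   # lifetime in hours

z = (220 - 200) / 15                     # standardize by hand
p_less = battery.cdf(220)                # P(X < 220)
p_more = 1 - p_less                      # P(X > 220)

z_cutoff = NormalDist().inv_cdf(0.95)    # z-score with 95% of the area below it
cutoff = battery.inv_cdf(0.95)           # lifetime that marks the top 5%

print(round(z, 2), round(p_less, 4), round(p_more, 4))
print(round(z_cutoff, 3), round(cutoff, 2))
```

Using the unrounded z-score gives probabilities close to .9082 and .0918, and a cutoff close to 224.675 hours.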

The heights of the female students at SHS are normally distributed with a mean of 65 inches. What is the standard deviation of this distribution if 18.5% of the female students are shorter than 63 inches?

P(X < 63) = .185

What is the z-score for the 63? -0.9

$-0.9 = \dfrac{63 - 65}{\sigma}$

$\sigma = \dfrac{-2}{-0.9} = 2.22$
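Solving for an unknown standard deviation reverses the usual z-score calculation. The sketch below uses `NormalDist().inv_cdf` in place of the table lookup; because it keeps more decimal places in z than the table value of -0.9, the resulting standard deviation differs slightly from 2.22.

```python
from statistics import NormalDist

mean_height = 65      # inches
x = 63                # 18.5% of students are shorter than this
p_below = 0.185

z = NormalDist().inv_cdf(p_below)   # z-score with 18.5% of the area below it
sigma = (x - mean_height) / z       # rearranged from z = (x - mu) / sigma

print(round(z, 4), round(sigma, 2))
```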

The heights of female teachers at SHS are normally distributed with mean of 65.5 inches and standard deviation of 2.25 inches. The heights of male teachers are normally distributed with mean of 70 inches and standard deviation of 2.5 inches.

• Describe the distribution of differences of heights (male – female) teachers.

Normal distribution with $\mu = 4.5$ & $\sigma = 3.3634$

• What is the probability that a randomly selected male teacher is shorter than a randomly selected female teacher?

P(X < 0) = .0901

$z = \dfrac{0 - 4.5}{3.3634} = -1.34$
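Python's `statistics.NormalDist` supports subtracting two independent normal variables, so the teacher-height difference can be checked directly. This sketch assumes, as the problem does, that the two heights are independent.

```python
from statistics import NormalDist

male = NormalDist(mu=70, sigma=2.5)
female = NormalDist(mu=65.5, sigma=2.25)

# Difference (male - female): means subtract, variances add
diff = male - female
p_male_shorter = diff.cdf(0)             # P(male - female < 0)

print(diff.mean, round(diff.stdev, 4), round(p_male_shorter, 4))
```

The probability comes out near .0905; the table answer of .0901 reflects rounding z to -1.34.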

Will my calculator do any of this normal stuff?

• Normalpdf – use for graphing ONLY

• Normalcdf – will find probability of area from lower bound to upper bound

• Invnorm (inverse normal) – will find z-score for probability

Bivariate data

• x – variable: is the independent or explanatory variable

• y- variable: is the dependent or response variable

• Use x to predict y

$\hat{y} = a + bx$

$\hat{y}$ (y-hat) means the predicted y. Be sure to put the hat on the y.

b – is the slope
– it is the approximate amount by which y increases when x increases by 1 unit

a – is the y-intercept
– it is the approximate height of the line when x = 0
– in some situations, the y-intercept has no meaning

Least Squares Regression Line (LSRL)

• The line that gives the best fit to the data set

• The line that minimizes the sum of the squares of the deviations from the line

Interpretations

Slope: For each unit increase in x, there is an approximate increase/decrease of b in y.

Correlation coefficient: There is a direction, strength, linear association between x and y.

Identify as having a positive association, a negative association, or no association.

1. Heights of mothers & heights of their adult daughters: +

2. Age of a car in years and its current value: –

3. Weight of a person and calories consumed: +

4. Height of a person and the person’s birth month: NO

5. Number of hours spent in safety training and the number of accidents that occur: –

Correlation Coefficient (r) -

• A quantitative assessment of the strength & direction of the linear relationship between bivariate, quantitative data

• Pearson’s sample correlation is used most

• parameter - $\rho$ (rho)

• statistic - r

$r = \dfrac{1}{n - 1} \sum \left(\dfrac{x_i - \bar{x}}{s_x}\right)\left(\dfrac{y_i - \bar{y}}{s_y}\right)$
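The correlation formula is an average of products of z-scores, which can be written out directly. The data below are hypothetical and the function name `correlation` is chosen for this sketch.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson's r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)      # sample standard deviations
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Hypothetical bivariate data, not from the slides
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_swapped = correlation(ys, xs)                      # swap the roles of x and y
r_rescaled = correlation([2.54 * x for x in xs], ys) # change the units of x

print(round(r, 4), round(r_swapped, 4), round(r_rescaled, 4))
```

All three values agree: r does not depend on which variable is labeled x, and it is unchanged when the units of measurement change.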

Properties of r (correlation coefficient)

• legitimate values of r are in [-1, 1]

[Figure: number line for r marked at -1, -.8, -.5, 0, .5, .8, 1, with regions labeled No Correlation, Weak correlation, Moderate Correlation, Strong correlation]

Properties of r (correlation coefficient)

• value of r is not changed by linear transformations such as changes of units

• value of r does not depend on which of the two variables is labeled x

• value of r is non-resistant

• value of r is a measure of the extent to which x & y are linearly related

The correlation coefficient and the LSRL are both non-resistant measures.

Correlation does not imply causation

Interpolation (good):
• Using a regression line for estimating predicted values between known values.

Extrapolation (bad):
• It is unknown whether the pattern observed in the scatterplot continues outside this range. The LSRL should not be used to predict y for values of x outside the data set.

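Formulas – on chart

$\hat{y} = b_0 + b_1 x$

$b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$

$b_0 = \bar{y} - b_1 \bar{x}$

$b_1 = r \dfrac{s_y}{s_x}$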

The following statistics are found for the variables posted speed limit and the average number of accidents.

$\bar{x} = 40,\ s_x = 11.6$

$\bar{y} = 18,\ s_y = 8.4,\ r = .9981$

Find the LSRL & predict the number of accidents for a posted speed limit of 50 mph.

$\hat{y} = .723x - 10.92$

$\hat{y} = 25.23$ accidents
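The slope and intercept in this example come from the chart formulas $b_1 = r\,s_y/s_x$ and $b_0 = \bar{y} - b_1\bar{x}$. The sketch below assumes the summary statistics read as x̄ = 40, s_x = 11.6, ȳ = 18, s_y = 8.4, and r = .9981, with posted speed limit as x and accidents as y; the function name is illustrative.

```python
def lsrl_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Slope and intercept of the LSRL from summary statistics."""
    b1 = r * s_y / s_x           # slope
    b0 = y_bar - b1 * x_bar      # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Assumed reading of the summary statistics (speed limit = x, accidents = y)
b0, b1 = lsrl_from_summary(x_bar=40, s_x=11.6, y_bar=18, s_y=8.4, r=0.9981)
predicted = b0 + b1 * 50         # predicted accidents at 50 mph

print(round(b1, 3), round(b0, 2), round(predicted, 2))
```

Carrying full precision gives an intercept near -10.91; rounding the slope to .723 first, as in the worked answer, gives -10.92. Both versions predict about 25.23 accidents at 50 mph.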

Residuals (error) -

• The vertical deviation between the observations & the LSRL

• the sum of the residuals is always zero

• error = observed - expected

residual $= y - \hat{y}$

Residual plot

• A scatterplot of the (x, residual) pairs.

• Residuals can be graphed against other statistics besides x

• Purpose is to tell if a linear association exists between the x & y variables

• If no pattern exists between the points in the residual plot, then the association is linear.

[Figure: two residual plots (residuals vs. x), one labeled Linear and one labeled Not linear]

Coefficient of determination -

• $r^2$

• gives the approximate proportion of variation in y that can be attributed to a linear relationship between x & y

• remains the same no matter which variable is labeled x

Interpretation of $r^2$

Approximately $r^2$% of the variation in y can be explained by the LSRL of x & y.
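Residuals and r² can be computed by hand for a small data set. The data here are hypothetical; the slope and intercept use the least-squares formulas from the formula chart, and r² is computed as the proportion of variation in y accounted for by the line.

```python
from statistics import mean

# Hypothetical bivariate data, not from the slides
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)

# Least-squares slope and intercept
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]   # observed - expected

sse = sum(e ** 2 for e in residuals)          # variation left over after the fit
sst = sum((y - y_bar) ** 2 for y in ys)       # total variation in y
r_squared = 1 - sse / sst

print(round(b0, 2), round(b1, 2))
print([round(e, 2) for e in residuals], round(sum(residuals), 10))
print(round(r_squared, 2))
```

For this made-up data the residuals sum to zero (up to floating-point rounding) and about 60% of the variation in y is explained by the line.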

Outlier –
• In a regression setting, an outlier is a data point with a large residual

Influential point –
• A point that influences where the LSRL is located
• If removed, it will significantly change the slope of the LSRL

(189,30) could be influential. Remove & recalculate LSRL

(189,30) was influential since it moved the LSRL

Which of these measures are resistant?

• LSRL

• Correlation coefficient

• Coefficient of determination

NONE – all are affected by outliers

What to do if the data is not linear…

1. Calculate the LSRL

2. Is the residual plot scattered?

YES: Appropriate model

NO: Transform data (for example, x & log y, or log x & log y, or another transformation of x or y), then calculate the LSRL again