How can you best represent statistical information and draw conclusions from it?

Post on 17-Jan-2016


What is statistics?
Statistics is the branch of mathematics that is concerned with the collection, organization, display and interpretation of data.

S.1 Organizing Data
How can data be shown in a table or in a graph, and how can you read such data?

• What is categorical data?
• When should you use a pie chart, and how are pie charts made?
• How do you organize a frequency distribution?

Data types: categorical and numeric

Categorical: any non-numeric data. Use:
• Frequency distributions
• Bar charts
• Pie charts

Numeric: anything that can be measured and listed by number. Use:
• Dotplots
• Stem and leaf plots
• Frequency distributions
• Histograms

Example: Leisure time activities
Does this data mean anything to you, and can you answer questions about it in its current form?

W T A W G T W W C W
T W A T T W G W W C
A W A W W W T W W T

W = walking, T = weight training, C = cycling, G = gardening, A = aerobics

Displaying Categorical Data
How can you display and interpret categorical data?

Categorical: anything that can't be measured and listed by number.
• Frequency distributions
• Bar charts
• Pie charts

Frequency Distribution
• Displays all categories and a tally for each
• Relative frequency: the fraction of the data that falls in a category, written as a decimal (frequency ÷ total)

Leisure time activities

W T A W G T W W C W
T W A T T W G W W C
A W A W W W T W W T

Category          Tally                 Frequency   Relative Frequency
Walking           ||||| ||||| |||||     15          0.500
Weight training   ||||| ||              7           0.233
Cycling           ||                    2           0.067
Gardening         ||                    2           0.067
Aerobics          ||||                  4           0.133
Total                                   30          1.000
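The tallying on this slide can also be checked with a few lines of Python. This is only a sketch: the variable names (`responses`, `labels`, `table`) are illustrative and not part of the original slides.

```python
from collections import Counter

# The 30 survey responses from the slide, using its letter codes.
responses = "W T A W G T W W C W T W A T T W G W W C A W A W W W T W W T".split()
labels = {"W": "Walking", "T": "Weight training", "C": "Cycling",
          "G": "Gardening", "A": "Aerobics"}

counts = Counter(responses)   # tally of each code
n = len(responses)            # total number of responses

# Frequency and relative frequency (frequency / total) for each category.
table = {labels[code]: (counts[code], counts[code] / n) for code in labels}

for name, (freq, rel) in table.items():
    print(f"{name:<16}{freq:>4}{rel:>8.3f}")
```

The relative frequencies should always add up to 1, which is a quick check that no response was missed.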

Bar Chart
• Graphs the frequency of categorical data
• Bars DO NOT touch
• Categories are on the x-axis
• Frequencies are on the y-axis

[Bar chart of the leisure activity frequencies; x-axis categories: Walking, Wt Training, Cycling, Gardening, Aerobics]

Pie Charts (circle graphs)
• Used when there are not too many categories (rule of thumb: 8 or fewer)
• Each "slice" is determined by the relative frequency
• Degrees in slice = relative frequency × 360
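As a quick illustration of the slice rule, the sketch below turns the leisure activity frequencies (15, 7, 2, 2, 4) into slice angles. The names `frequencies` and `degrees` are made up for this example.

```python
# Frequencies from the leisure time activities example.
frequencies = {"Walking": 15, "Weight training": 7, "Cycling": 2,
               "Gardening": 2, "Aerobics": 4}
total = sum(frequencies.values())

# Degrees in slice = relative frequency x 360 = frequency x 360 / total.
degrees = {name: freq * 360 / total for name, freq in frequencies.items()}

for name, angle in degrees.items():
    print(f"{name:<16}{angle:>6.1f} degrees")
```

The angles should add up to 360, the full circle.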

Homework: Worksheet 1

S-2 Displaying Numeric Data
EQ: How do you construct and read stem and leaf plots, dotplots, frequency distributions and histograms?

Numeric: anything that can be measured and listed by number
• Dotplots
• Stem and leaf plots
• Frequency distributions
• Histograms

Dotplots
• Simple way to represent small amounts of data
• Each piece of data has its own dot
• Dots stack vertically above the position on the x-axis
• Depending on the data set, you may lose the exact value for each piece

Data:
512 615 524 632 645
575 592 716 618 521
682 675 549 523 651

[Dotplot of the data above; x-axis labeled 5, 6, 7]

Stem Plot
• Works for a small to moderate set of data
• Stems go in a vertical column
• Stems may be split low and high (0-4 and 5-9)
• Comparative or double stemplot: shows multiple data sets

First data set:
51 61 52 63 64
57 59 71 61 52
68 67 54 52 65

5 | 1 2 2 2 4 7 9
6 | 1 1 3 4 5 7 8
7 | 1

Second data set:
51 61 52 73 54
57 59 71 61 52
68 67 74 52 65

5 | 1 2 2 2 4 7 9
6 | 1 1 5 7 8
7 | 1 3 4

Histograms
• A bar chart for numeric data
• Center the rectangle over the indicated value on the x-axis; the bars touch
• Can be drawn from the frequency or the relative frequency distribution

# of partners in local law firms   Frequency   Relative frequency
1                                  2           0.10
2                                  3           0.15
3                                  6           0.30
4                                  6           0.30
5                                  3           0.15
Totals                             20          1.00

Shapes of Histograms
• Unimodal: has one peak
• Bimodal: has two peaks
• Multimodal: has more than two peaks

Types of Unimodal Curves
• Symmetric
• Normal or bell shaped
• Heavy tailed: having long tails (larger standard deviation)
• Light tailed: having short tails (smaller standard deviation)

Skewed Curves

[Figure: a curve with its lower (left) tail and upper (right) tail labeled]

When there is an outlier to the right, the curve is skewed right

When there is an outlier to the left, the curve is skewed left

Skewness is judged by the tail not where the majority of the data lies.


Frequency Distributions: Continuous and Discrete Data

Discrete data
• Individual data points
• The values always come from the set of integers or whole numbers

Continuous data
• Data that may include decimals

Frequency Distributions
• There are no natural breaks for continuous data, so we create our own.
• Ex.: The fuel efficiency of a particular car ranges from 25.3 to 29.8 mpg. We decide to use an interval width of 0.5.
• Note: Always start at an even increment lower than the lowest piece of data and go to an even increment higher than the highest piece of data.

Interval #   Low    High
1            25.0   25.5
2            25.5   26.0
3            26.0   26.5
4            26.5   27.0
5            27.0   27.5
6            27.5   28.0
7            28.0   28.5
8            28.5   29.0
9            29.0   29.5
10           29.5   30.0

In which interval would you place 27.5 mpg? Use the rule $L \le x_i < H$: a value equal to a boundary goes in the interval that starts there.
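The interval table for the fuel efficiency example can be turned into a small lookup function. This sketch assumes the usual convention that a value belongs to the interval whose low end it reaches but whose high end it does not (low ≤ x < high); the function name `interval_number` and its parameters are illustrative.

```python
import math

def interval_number(x, start=25.0, width=0.5):
    """Return the 1-based interval that holds x.

    Assumes intervals start at `start`, are `width` wide, and follow
    the convention low <= x < high (a boundary value goes up).
    """
    return math.floor((x - start) / width) + 1

print(interval_number(27.5))  # boundary value, placed in interval 6 (27.5 to 28.0)
print(interval_number(25.3))  # lowest value in the example, interval 1
print(interval_number(29.8))  # highest value in the example, interval 10
```

Under the opposite convention (low < x ≤ high), 27.5 would land in interval 5 instead, so the rule has to be fixed before any data is sorted into the table.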

Homework: Numeric Data, Worksheet 2

Density Graphs
• When data is unevenly distributed, you may want to use unequal groups or intervals.
• This may only be done if you graph the density.

$\text{density} = \dfrac{\text{relative frequency of class}}{\text{class width}}$

Interval   Low    High   Frequency   Relative frequency   Density
1          1      10     2           0.09091              0.01010
2          10     20     3           0.13636              0.01364
3          20     30     4           0.18182              0.01818
4          30     40     3           0.13636              0.01364
5          40     50     6           0.27273              0.02727
6          50     100    1           0.04545              0.00091
7          100    200    2           0.09091              0.00091
8          200    1000   1           0.04545              0.00006
Total                    22
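The density calculation can be sketched in Python as follows. It assumes that class width means high minus low for each interval; the names `classes` and `rows` are illustrative.

```python
# (low, high, frequency) for each class in the density table.
classes = [(1, 10, 2), (10, 20, 3), (20, 30, 4), (30, 40, 3),
           (40, 50, 6), (50, 100, 1), (100, 200, 2), (200, 1000, 1)]
total = sum(freq for _, _, freq in classes)

rows = []
for low, high, freq in classes:
    rel_freq = freq / total      # relative frequency of the class
    width = high - low           # class width (assumed: high - low)
    density = rel_freq / width   # height of the bar in a density graph
    rows.append((low, high, freq, rel_freq, density))

for low, high, freq, rel_freq, density in rows:
    print(f"{low:>5}{high:>6}{freq:>4}{rel_freq:>10.5f}{density:>10.5f}")
```

Because each bar's area is density × width, the areas of all the bars add up to 1 even though the intervals are unequal.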

S-3 Describing the Center of a Data Set

EQ: What are the measures of central tendency and how can they be determined?

Center and Spread
• Two of the most critical descriptors of a data set
• Graphical methods such as those in the last chapter give a general impression of both
• Numerical methods give precise values that can be compared in detail

The three M's
• Mean: also known as the average
• Median: also called the middle
• Mode: the most frequent value

The Mean
Formula for the sample mean:

$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$

• $x$ = each piece of data
• $x_i$ = the $i$ indicates the position of the data within the original data set
• $n$ = number of pieces of data in the data set
• $\sum$ = Greek letter sigma; means to add what follows

Always use more accuracy (more decimals) than any one piece of data has.

µ is used for the population mean. Greek letters are always used for population values.

The Median

The middle value in a list of ordered values

Median has no symbol but is often abbreviated Med

If n is odd then the median is the exact middle number

If n is even then the median is the mean of the two middle numbers
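The rules for the mean and the median can be written out directly. The two small data sets below are made-up values chosen only to show the odd and even cases; they do not come from the slides.

```python
def mean(values):
    """Sample mean: add every piece of data, then divide by n."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the ordered list (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # n odd: exact middle
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even: mean of the two middle values

odd_data = [3, 6, 8, 2, 9]    # illustrative, n = 5
even_data = [3, 6, 8, 2]      # illustrative, n = 4

print(mean(odd_data), median(odd_data))    # 5.6 6
print(mean(even_data), median(even_data))  # 4.75 4.5
```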

Comparison and Contrast of the Mean and Median

• The median divides the data into two equal parts: 50% of the data is on either side of the median.
• The mean is where the fulcrum would cause the "data scale" to balance if the values had weight. It is very sensitive to outliers.

[Figure: balancing the "data scale"; panels: Normal/Bell curve (mean and median marked), Skewed Left, Skewed Right]

Trimmed Mean
Makes the mean less susceptible to outliers.
• Order the data
• Remove the same number of pieces of data from each end
• Recalculate the mean

Trim percentage × n = number of pieces to be removed from EACH end

A small to moderate trim is 5% to 25%.

Trimmed Mean Example
Find the 15% trimmed mean of: 3, 6, 8, 2, 9, 10, 7, 15, 4, 12, 20, 36, 15, 5, 3, 7, 10, 16, 17, 12

Order the numbers: 2, 3, 3, 4, 5, 6, 7, 7, 8, 9, 10, 10, 12, 12, 15, 15, 16, 17, 20, 36

20 items × 0.15 = 3, so remove 3 values from each end.

Remaining: 4, 5, 6, 7, 7, 8, 9, 10, 10, 12, 12, 15, 15, 16

Trimmed mean = 136 / 14 ≈ 9.71
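The trimmed mean steps can be sketched as a short function. The name `trimmed_mean` is illustrative, and rounding the trim count to the nearest whole number is an assumption for cases where the percentage does not give an integer; in this example it gives exactly 3.

```python
def trimmed_mean(values, trim_fraction):
    """Order the data, drop the same number of values from each end, then average."""
    ordered = sorted(values)
    # Number of pieces removed from EACH end (rounding is an assumption).
    k = round(trim_fraction * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [3, 6, 8, 2, 9, 10, 7, 15, 4, 12, 20, 36, 15, 5, 3, 7, 10, 16, 17, 12]

result = trimmed_mean(data, 0.15)   # 136 / 14
plain = sum(data) / len(data)       # ordinary mean, for comparison

print(round(result, 2), round(plain, 2))
```

Comparing the two values shows how much the large value 36 pulls the ordinary mean upward.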

Weighted Mean
A weighted mean is similar to an arithmetic mean (the most common type of average), but instead of each data point contributing equally to the final average, some data points contribute more than others.

Weighted Mean Example

             # of students   Class average
1st period   20              75
2nd period   35              79

$\text{weighted ave.} = \dfrac{20(75) + 35(79)}{55} = \dfrac{4265}{55} \approx 77.5$
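The same weighted average can be computed with a small helper. The function name `weighted_mean` and the variable names are illustrative.

```python
def weighted_mean(values, weights):
    """Each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

class_averages = [75, 79]   # 1st period, 2nd period
class_sizes = [20, 35]      # number of students in each class

overall = weighted_mean(class_averages, class_sizes)   # 4265 / 55
unweighted = sum(class_averages) / len(class_averages)

print(round(overall, 2), unweighted)
```

The weighted result sits closer to 79 than the unweighted 77.0 because the second class has more students.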

Homework worksheet 3

S-4 Spread
What are the quartiles, percentiles, and box plots?

Range
Range = High − Low

Interquartile Range (IQR)
• IQR = upper quartile (Q3) − lower quartile (Q1)
• Lower quartile: the median of the lower half
• Upper quartile: the median of the upper half
• If n is odd, the exact median is excluded from both halves when finding the quartiles
• Used because it is resistant to outliers
• There is no special name for the population IQR

Boxplot
• Can be used for many types of summarizations
• IQR = Q3 − Q1
• Outlier = data more than 1.5 × IQR from the end of the box
• Extreme outlier = data more than 3 × IQR from the end of the box

[Figure: modified boxplot. Each of the four sections (two whiskers and two halves of the box) holds 25% of the data. An outlier is drawn as a closed circle and an extreme outlier as an open circle.]
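The quartile and outlier rules can be sketched in Python. The ten values are borrowed from the S-5 example in this deck (12, 15, 15, 16, 17, 17, 18, 18, 20, 25); the function names are illustrative, and the quartiles use the median-of-halves method described for the IQR.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves.

    When n is odd, the middle value is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[len(ordered) - half:])

def classify(x, q1, q3):
    """Label a value by how far it lies outside the box."""
    iqr = q3 - q1
    distance = max(q1 - x, x - q3, 0)   # 0 if x is inside the box
    if distance > 3 * iqr:
        return "extreme outlier"
    if distance > 1.5 * iqr:
        return "outlier"
    return "not an outlier"

data = [12, 15, 15, 16, 17, 17, 18, 18, 20, 25]
q1, q3 = quartiles(data)
labels = {x: classify(x, q1, q3) for x in data}

print(q1, q3, labels[25], labels[12])
```

With Q1 = 15 and Q3 = 18 the IQR is 3, so anything above 22.5 or below 10.5 is an outlier and anything above 27 or below 6 is extreme; only 25 is flagged here.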

Percentages and percentiles:

Percentage = (the score ÷ total possible points) × 100

Percentile = (the position of the score within an ordered list ÷ the total number of items) × 100

EX: 10 students took a 90 point test.

Score:    60  65  68  74  75  80  81  81  84  90   (note: an ordered list)
Position:  1   2   3   4   5   6   7   8   9  10

What is the percent and the percentile for a score of 81?

Percent: 81/90 × 100 = 90%

Percentile: 7/10 × 100 = 70th percentile
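The two calculations for the test score example can be sketched as follows. The function names are illustrative, and for tied scores this sketch uses the first position of the score in the ordered list, which matches the position 7 used in the example; other tie conventions exist.

```python
scores = [60, 65, 68, 74, 75, 80, 81, 81, 84, 90]   # already ordered
possible_points = 90

def percent(score, possible):
    """Score as a percentage of the points possible."""
    return score * 100 / possible

def percentile(score, ordered):
    """Position of the score in the ordered list, as a percentage of the list length.

    Assumption: ties use the first (lowest) position.
    """
    position = ordered.index(score) + 1   # 1-based position
    return position * 100 / len(ordered)

print(percent(81, possible_points))   # 90.0
print(percentile(81, scores))         # 70.0
```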

EXAMPLE: Given a stem and leaf plot

10 | 2 5 7
20 | 1 6
30 | 5 8 9 9
40 | 2 3 5 7 8
50 | 2
60 | 3 6

FIND:
• the median
• the first quartile
• the third quartile
• the interquartile range
• the mode
• the percentile for .271
• the value closest to the 60th percentile

S-5 Measures of Variability
How do the measures of variability help us to better understand what our data set might look like?

Range = high − low

Deviation from the mean = $x_i - \bar{x}$
• If positive, then $x_i$ is larger than the mean
• If negative, then $x_i$ is smaller than the mean

Mean deviation is the average of the absolute values of the deviations (the signed deviations always sum to zero).

Sample Variance

$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Sample Standard Deviation
The "average distance" the items fall from the mean.

$s = \sqrt{s^2}$

• A small $s$ or $s^2$ indicates low variability
• A high $s$ or $s^2$ indicates large variability

Population Variance (knowing all the data)

$\sigma^2 = \dfrac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$

Population Standard Deviation

$\sigma = \sqrt{\sigma^2}$

Compute to the same accuracy as the population.

Uses of the IQR
• For roughly normal data, the standard deviation can be approximated by SD ≈ IQR/1.35
• If SD > IQR/1.35, it suggests heavier or longer tails than the normal curve

Example
20, 15, 12, 18, 17, 15, 17, 16, 18, 25

Reorder
12, 15, 15, 16, 17, 17, 18, 18, 20, 25

Median = 17, Q1 = 15, Q3 = 18

Find:
range =
IQR =
sd =
$\bar{x}$ =

Example (continued)
Find the mean deviation and the standard deviation by hand.

i        $x_i$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$
1        12
2        15
3        15
4        16
5        17
6        17
7        18
8        18
9        20
10       25
totals

Given 12, 15, 15, 16, 17, 17, 18, 18, 20, 25, find the SD:
• By IQR
• By calculator
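For checking the by-hand table and the two SD estimates, a short Python sketch using the standard library may help. It follows the sample formulas (dividing by n − 1) and takes Q1 = 15 and Q3 = 18 from the example; the variable names are illustrative.

```python
import math
import statistics

data = [12, 15, 15, 16, 17, 17, 18, 18, 20, 25]
n = len(data)

mean = statistics.mean(data)              # 173 / 10
deviations = [x - mean for x in data]     # x_i minus the mean

# Mean deviation: average of the absolute deviations.
mean_abs_dev = sum(abs(d) for d in deviations) / n

# Sample variance and standard deviation (divide by n - 1).
sample_var = sum(d ** 2 for d in deviations) / (n - 1)
sample_sd = math.sqrt(sample_var)

# Approximation from the IQR, using Q1 = 15 and Q3 = 18 from the example.
sd_from_iqr = (18 - 15) / 1.35

print(round(mean, 2), round(mean_abs_dev, 2))
print(round(sample_sd, 2), round(sd_from_iqr, 2))
```

The computed SD comes out larger than IQR/1.35, which by the rule above suggests a longer tail than a normal curve; the value 25 is the likely cause.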

Homework worksheet 5

S-6 Translation and Scale
What is the difference in the impact of translation and scale change on data?

In class project:

Hints for review #1
How many intervals should be used for a set of data?
The book recommends

$\sqrt{\text{\# of pieces of data}}$

Homework

TEST 1

S-7 Data Collection
How do you know which method of data collection is most appropriate?

Random Samples
What methods of data collection constitute collecting a random sample?

Sampling
Since time and money usually do not permit a scientist to collect the opinion of, or measure the effect on, every person in the population, scientists take samples. A sample should include all groups so that accurate statements can be made about the entire population.

Simple Random Sample
• Each object in the population has an equal chance of being selected for the sample
• Each object in the sample is chosen independently of any other object in the sample
• Independent: choosing one has no bearing on the choice of the next object

Independent example: All names are placed in a hat and 10 are chosen.

Dependent example: Two names are drawn and they each ask 4 people to participate with them.
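The "names in a hat" example can be imitated with Python's `random` module. The list of 30 student names is hypothetical, and the optional `seed` parameter is only there so a run can be repeated.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct members; every member has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, k)

# Hypothetical population: 30 names placed "in the hat".
names = [f"Student {i}" for i in range(1, 31)]

chosen = simple_random_sample(names, 10, seed=1)
print(chosen)
```

No name can be drawn twice, and no draw depends on who a previously chosen person would pick, unlike the dependent example above.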

Bias
When one group is over-represented in the sample.
Causes:
• Basis of selection
• Who responds
• Who asks the questions or how they are asked

Stratified Sample
The population is divided into groups and a specified number are chosen from each group.
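A stratified sample can be sketched the same way: sample separately inside each group. The grade-level groups and their sizes below are hypothetical, as are the function and parameter names.

```python
import random

def stratified_sample(groups, per_group, seed=None):
    """Choose a specified number of members at random from every group."""
    rng = random.Random(seed)
    return {name: rng.sample(members, per_group)
            for name, members in groups.items()}

# Hypothetical population split into groups (strata) by grade.
groups = {
    "Grade 9": [f"9-{i}" for i in range(1, 41)],
    "Grade 10": [f"10-{i}" for i in range(1, 31)],
    "Grade 11": [f"11-{i}" for i in range(1, 21)],
}

sample = stratified_sample(groups, per_group=5, seed=2)
for grade, members in sample.items():
    print(grade, members)
```

Every group is guaranteed to appear in the sample, which a simple random sample of the whole population does not guarantee.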

River Project

The Normal Distribution
How does normally distributed data begin to relate statistics to probability?

The Normal Distribution
Data is normally distributed when most of the data falls close to the average and only a few pieces of data fall at a distance from the mean. This configuration is often called a bell shaped or normal curve. When data is normally distributed:
• 68% of the data lies within one standard deviation of the mean
• 95% of the data lies within two standard deviations (13.5% lies in each one to two SD range)
• 99.7% of the data lies within three standard deviations (2.35% lies in each two to three SD range)
• 0.15% of the data lies beyond three standard deviations on each side

[Figure: normal curve with the x-axis marked at the mean and at 1, 2 and 3 standard deviations below and above it]
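The 68%, 95% and 99.7% figures can be checked against the standard normal curve using `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The helper name `within` is illustrative.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)   # mean 0, standard deviation 1

def within(k):
    """Share of the data within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {within(k):.4f}")

# Share in the band between one and two SD on one side of the mean.
one_to_two = (within(2) - within(1)) / 2
print(f"one to two SD, one side: {one_to_two:.4f}")
```

The exact values are slightly different from the rounded rule (for example about 95.45% rather than 95% within two SD), which is why the band percentages on the slide are approximations.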

Normal curves are symmetric about the mean. Some are narrow and some are wide; this is determined by the value of one standard deviation.

The area under a normal curve represents all the data: 100% or 1. The area under any section represents the percentage, and therefore the probability, that a given piece of data will fall in that region of the curve.

Normal distributions have a direct link to probability through something called z-scores. The z-score tells exactly how many full and partial standard deviations a particular piece of data falls from the mean. A negative number means the data is to the left of the mean; a positive number tells you the data is to the right of the mean.

The formula for z-scores is

$z = \dfrac{x - \bar{x}}{\sigma}$

The attached table gives the probability that a given value has a z-score less than a given value (that is, falls to the left of a particular spot on the normal curve).


Examples: Find the z-score for each of the following:

a) 45 when $\bar{x} = 50$ and $\sigma = 4$

b) 56 when $\bar{x} = 60$ and $\sigma = 10$

c) between 20 and 60 when $\bar{x} = 50$ and $\sigma = 10$
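The three examples can be worked with a short sketch. It assumes the blanks in the examples stand for the mean and the standard deviation, and it uses `statistics.NormalDist` in place of the printed z-chart; the function name `z_score` is illustrative, and the probability for part c goes one step beyond what the example asks.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

z_a = z_score(45, 50, 4)      # a) 1.25 SD below the mean
z_b = z_score(56, 60, 10)     # b) 0.4 SD below the mean
z_low = z_score(20, 50, 10)   # c) lower end
z_high = z_score(60, 50, 10)  # c) upper end

# Area to the left of each z-score, as a z-chart would give it.
chart = NormalDist()          # standard normal curve
p_between = chart.cdf(z_high) - chart.cdf(z_low)

print(z_a, z_b, z_low, z_high)
print(round(p_between, 4))
```

For part c, the area between the two z-scores is the area to the left of the upper one minus the area to the left of the lower one.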