
Over the Skittles Rainbow

A Statistical Analysis of 14 Bags of Candy

Cheryl L. Casazza Salt Lake Community College

Math 1040

“Somewhere over the rainbow, way up high…” -Dorothy

The bell-shaped curve illustrates a normal distribution as it rises “way up high” before falling

to create a symmetrical plot showing where data falls. There are many formulas for analyzing

and interpreting patterns in data. Learning to understand and apply these formulas will

improve one’s critical thinking skills.

“I could think of things I’d never thought before, if I only had a brain...” -Scarecrow

I can think of things I’ve never thought before, now that I have taken a statistics course. The

crowning achievement of the course is a project in which each member of the class analyzes a

2.17 oz. bag of Skittles candy, then contributes the data to create a simple random sample

representative of the entire population of bags of Skittles. Each member of the class goes

through the process of organizing categorical and quantitative data, generating charts and

graphs, creating confidence interval estimates and performing hypothesis tests using this collected

data. This low-cost, real-world application progresses from the basic skill of counting to

high-level synthesis: performing hypothesis tests.

“Because, because, because, because, because–Because of the wonderful things…”

– Wizard of Oz Lyrics

The most challenging part of a quantitative literacy project is using data to justify decisions.

For professional statisticians, applications rest on “because, because, because, because,

because…” Because life, limb and liability may be at stake. For beginners, it is preferable to run

a low-stakes game counting candy colors. No matter how far off the results may be, no one

gets hurt. Besides, who doesn’t enjoy working with candy? (This is a rhetorical question, not

begging a statistical response!)

THE RAINBOW COLORS: ORGANIZING AND DISPLAYING CATEGORICAL DATA

[Pie chart: Proportions of Skittles Colors from Class Sample. n = 840 Skittles (14 bags of candy). Red 0.192, Orange 0.204, Yellow 0.212, Green 0.189, Purple 0.204.]

[Pie chart: Proportions of Skittles Colors from My Sample. n = 63 Skittles (1 bag of candy). Red 0.175, Orange 0.254, Yellow 0.127, Green 0.159, Purple 0.286.]

[Pareto chart: Proportions of Skittles Colors from Class Sample. n = 840 (14 bags of candy). In descending order: Yellow 0.212, Orange 0.204, Purple 0.204, Red 0.192, Green 0.189. Horizontal axis: Skittles Colors. Vertical axis: Proportion of Skittles Candies, scaled from 0.175 to 0.215.]

[Pareto chart: Proportions of Skittles Colors from My Sample. n = 63 Skittles (1 bag of candy). In descending order: Purple 0.286, Orange 0.254, Red 0.175, Green 0.159, Yellow 0.127. Horizontal axis: Skittles Colors. Vertical axis: Proportions of Skittles Candies, scaled from 0 to 0.3.]

SAMPLE PROPORTIONS

TABLE Proportions of Skittles Colors from My Sample

Color     Proportion
Red       0.175
Orange    0.254
Yellow    0.127
Green     0.159
Purple    0.286
TOTAL     1.001 (exceeds 1 because of rounding)

TABLE Proportions of Skittles Colors from Class Sample

Color     Proportion
Red       0.192
Orange    0.204
Yellow    0.212
Green     0.189
Purple    0.204
TOTAL     1.001 (exceeds 1 because of rounding)

I observed inconsistency in the proportions. I was not surprised that the data from one bag

did not match the data from 14 bags of candy. For example, yellow candies took the lead in the

class sample, yet they were last in my sample. After viewing these colorful charts, I decided to

assess the standard deviations in the colors.

Standard Deviation of the Proportions

Color                 Red     Orange   Yellow   Green    Purple
Standard Deviation    1.787   2.547    4.479    2.620    3.167

The mean of these five values is 2.920, so the typical standard deviation is almost 3. This is an average

amount of spread, not a distance from the mean. A casual inference might be that proportions of Skittles colors vary

greatly, but somehow each bag ends up with all of the promised colors. Wm. Wrigley Jr.

Company, the maker of Skittles, markets a wide variety of flavors, including Banana Berry,

Mango Tangelo, Passion Fruit, and Cherry Lemonade. Labels such as “Tropical”, “Wild Berry”,

“Tart’n’Tangy”, “Crazy Cores”, “Skittles Confused” and “Smoothie Mix” adorn the small bags of

coveted candy. The simplistic approach to Skittles belongs to the past. It is logical to assume

that the Wrigley Company has a way of sorting these millions of individual candies. I tried to

research how they do it, but most online sources refer to amateurs making candy sorting

machines for a hobby.

My conclusion is that it is not a priority for the company to closely control the exact

proportions of Skittles colors in the millions of bags of candy they sell. It is a very profitable business

and they are not monitored by government agencies on this point, so why would they bother to put

money into more accurate sorting methods? This comes under the category of “practical

significance” as opposed to “statistical significance”.

SAMPLE STATISTICS (number of candies per bag, 14 bags)

Mean: 60.0

Standard Deviation: 2.69

5-Number Summary: Min 54.0, Q1 59.0, Med 61.0, Q3 61.0, Max 63.0

ORGANIZING & DISPLAYING QUANTITATIVE DATA: THE NUMBER OF CANDIES PER BAG

In quantitative literacy, analyzing a five-number summary often leads to creating a boxplot.

It is easy to spot potential outliers on a boxplot as shown here.

IQR = 2.0

IQR*1.5 = 3.0

Q1=59.0, and 59.0-3.0=56.0, so anything < 56 is an outlier
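
The fence arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, assuming only the quartiles from the five-number summary (Q1 = 59.0, Q3 = 61.0); the function names are hypothetical and the upper fence is shown for completeness even though the paper only uses the lower one.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences computed from the quartiles."""
    iqr = q3 - q1                      # interquartile range
    return q1 - k * iqr, q3 + k * iqr  # values beyond these are flagged


def is_outlier(value, q1, q3):
    """True when a value falls strictly outside the fences."""
    lower, upper = tukey_fences(q1, q3)
    return value < lower or value > upper


lower, upper = tukey_fences(59.0, 61.0)
print(lower, upper)                    # fences for the class sample
print(is_outlier(54, 59.0, 61.0))      # the minimum bag count
print(is_outlier(56, 59.0, 61.0))      # the borderline bag count
```

With a strict "less than" rule, a bag of exactly 56 candies sits on the fence and is not flagged.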

[Histogram: Frequency of Number of Candies per Bag in Class Sample. Horizontal axis: Number of Candies per Bag (54 to 63). Vertical axis: Frequency (0 to 6).]

“And my head I’d be scratchin’ while my thoughts were busy hatchin’ if I only had a brain…”

Scarecrow

QUANTITATIVE DATA

Fortunately for us, we have calculators and computers to “hatch” our numbers as we

perform analyses of data. One of the first values calculated is usually the mean, a measure of

center and an unbiased estimator of where data falls. The mean takes into account all of the

data, but it is sensitive to outliers. The lay term, “average”, calls to mind synonyms such as

typical, usual, likely, moderate, regular, normal, and middle of the road.

“[Dorothy’s house]…landed on the Wicked Witch in the middle of the road…”

-Munchkins

With a sample mean of 60.0 and a sample median of 61.0, the spread of the distribution

appears relatively narrow. The data is skewed to the left since ten of the fourteen values are

greater than or equal to 60.0, the mean. The presence of two low values, 54 and 56, has

influenced the mean here. The interquartile range, 2.0, times 1.5 equals 3.0. The Q1

value of 59 minus 3.0 equals 56, thus 54 is definitely an outlier. The value of 56 is just on the

border. Technically, it is not an outlier but it is interesting to see what happens to the data

when an outlier and a borderline value are removed. After removing these two values, the

mean shifts to 60.83 and the standard deviation decreases to equal 1.75. The shape of the

distribution is not so skewed. In a sample of only 14 bags, removing two values can make a

discernible difference in the statistics.

This project calls for analyzing the entire sample of 14 bags of candy. My expectation

before doing the number crunching was that there would be some level of variety within

certain reasonable parameters. In terms of practical significance, the difference between 54

and 63 pieces of candy in a bag of Skittles would probably not disturb many customers. I have

never seen anyone weighing bags at the store prior to purchasing candy. I have been known to

weigh prepackaged bags of vegetables, such as celery, to find the heaviest one. Counting pieces

of candy, such as M&M’s, Skittles and other small-piece products, is typically a casual source of

entertainment pursued by people who are interested in numbers, comparisons and consuming

the candy immediately after counting it! That’s exactly what I did. My bag of candy was on the

high end, containing the maximum value of 63 pieces of candy. Therefore, I was one of the

“lucky” ones. Based on our relatively small sample of 14 bags, there was only a 0.214

probability of getting 63 candies in one’s bag.

“I’d be clever as a gizzard if the wizard is a wizard…” - Scarecrow

It makes no sense to aim to be as “clever as a gizzard.” Making sense is very important

in statistics. In order to make sense, one must have a clear understanding of the differences

between categorical and quantitative data. Categorical data encompasses groups or categories,

such as political affiliations, colors, professions, and pets. Other than in terms of frequency,

these groups do not translate into quantities—numbers—thus they cannot be analyzed in

terms of mathematical relationships such as mean, standard deviation and variance. The mode

can still be identified as the category appearing most frequently. The best way to display categorical data

is in a well-planned pie chart or bar graph that illustrates the frequency. A Pareto chart is

especially useful since its descending order immediately focuses the viewer on the most

frequent event, perhaps the most prominent or important part of something. One has to be

careful about misleading representations such as distorted 3-dimensional charts and graphs as

well as non-zero axis graphs that exaggerate differences in values.

Quantitative data deals with numerical values; at the ratio level of measurement it also has a true zero. It can be compared in

terms of proportion, mean, standard deviation and variance. Obviously, to apply formulas

taught in statistics, one needs numbers. Quantitative data may include any number of variables.

In Math 1040 we have studied bivariate data in terms of the x and y axes and finding the

equation of the line of regression. Many types of graphs fit quantitative data, including

frequency polygons, line graphs, stem and leaf plots, box plots, scatter diagrams, bar charts,

histograms and Pareto charts. Pie charts may be used, but they are usually not the most

informative choice for displaying this type of data. Quantitative data can include units of

measurement such as centimeters, yards, hours, dollars, etc. Numbers that substitute for

names, such as those on the jerseys of athletes, qualify as categorical data because the

numbers do not represent a mathematical relationship.

“I could change my habits, never more be scared of rabbits if I only had the nerve!” -Cowardly Lion

Confidence is a wonderful thing; a confidence interval is a wonderful construct carefully

calculated using formulas involving probability and proportional relationships. A confidence

interval defines a range of values aiming to include the true but unknown value of a population

parameter, such as the mean height of all women in the United States. It is built around a point

estimate taken from a sample value. The level of confidence equals 1 minus alpha, where alpha is the amount

of area in the uncertain part of the range of values, the tails. It is possible for a true population

parameter to fall outside of the range, but depending on the level of confidence, it is relatively

unlikely for that to happen. Confidence levels are often fixed at 90, 95 or 99%. The higher the

confidence level, the wider the range of values. Outside of the confidence interval there can be

a left tail for “less than” tests, a right tail for “greater than” tests or one tail on each extreme

for “not equal to” tests. The tails contain the alpha value, the amount of uncertainty for a

particular test. The value of alpha chosen depends on the consequences of an error.

Confidence intervals can be used for making decisions ranging from marketing level significance

to life and death situations.

DISCUSSION OF THREE CONFIDENCE INTERVALS

One of the most common applications of statistics is using sample statistics to construct

confidence intervals that establish lower and upper limits for population parameters. The

degree of certainty about the accuracy of these limits is quantified by a percentage, such as

95% or 99%. The true value of the population parameter, perhaps for the proportion, mean or

standard deviation, may be impossible to ascertain. At best, it is not practical to do so. Even

with the best data and experienced professionals working, there is always the slight possibility

of error, but experienced statisticians know how to set confidence levels for specific real-world

applications.

The first confidence interval defines lower and upper limits for the true proportion of purple

candies in the population of Skittles. At the 95% confidence level, the interval runs from 0.177 to

0.231 (0.204 minus and plus E = 0.027); about 5% of intervals built this way would miss the true

population proportion of purple candies. The sample statistic will always be exactly in the middle of this interval because the

interval is created by subtracting and adding E to the sample statistic.
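
The proportion interval can be recomputed as a quick check. This is a minimal sketch, assuming the rounded class proportion of purple candies (0.204), n = 840 and the usual normal-approximation formula E = z * sqrt(p(1 - p)/n); the function name is hypothetical and the critical value comes from Python's standard library rather than a printed table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(p_hat, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)     # two-tailed critical value
    e = z * sqrt(p_hat * (1 - p_hat) / n)       # margin of error E
    return p_hat - e, p_hat + e


low, high = proportion_ci(0.204, 840)
print(round(low, 3), round(high, 3))
```

Because E is both subtracted and added, the sample proportion is always the midpoint of the two limits.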

The second confidence interval sets lower and upper limits for the true mean number of

candies per bag. This is a statistic that could hold meaning for true lovers of Skittles, who want

to make sure they get their fair share in each bag. It was determined that with 99% confidence,

the true population parameter for the mean number of candies per bag lies within the interval

from 57.835 to 62.165. Of course, Skittles in the real world arrive in whole numbers, so

approximately 58 to 62 Skittles is a reasonable estimate for the mean number of candies per bag.

The sample mean, 60 per bag, is exactly in the middle of these whole numbers.
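
The interval for the mean can be sketched the same way. This is a minimal illustration, assuming the sample statistics reported earlier (mean 60.0, standard deviation 2.69, n = 14) and a t critical value of 3.012 for 13 degrees of freedom at 99% confidence; that critical value is taken from a standard t table and is not quoted in the paper, and the function name is hypothetical.

```python
from math import sqrt

# Assumption: two-tailed t critical value for 99% confidence, 13 degrees of freedom.
T_CRIT_99_DF13 = 3.012


def mean_ci(mean, s, n, t_crit):
    """t-based confidence interval for a population mean (sigma unknown)."""
    e = t_crit * s / sqrt(n)    # margin of error E
    return mean - e, mean + e


low, high = mean_ci(60.0, 2.69, 14, T_CRIT_99_DF13)
print(round(low, 3), round(high, 3))
```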

STANDARD DEVIATION CONFIDENCE INTERVALS

“Which way do we go?” -Dorothy

“People do go both ways.” -Scarecrow

The third confidence interval addresses the question of variety, or standard deviation of the

number of candies per bag. Symbolically, there is variety inherent in the calculation of

variety. To mathematicians, this is perfectly logical, but for students it requires extra thinking.

This interval is not constructed by subtracting and adding an E (margin of error) value. The fun

part of this calculation is in using the chi-square distribution table. From this table, the left-tail critical value

is placed in the denominator of the upper limit and the right-tail critical value in the denominator of the lower limit. People, and numbers, do go

both ways, as the Scarecrow stated.

The sample standard deviation of 2.69 is not exactly in the middle of the confidence interval

because the chi-square distribution is not symmetrical. Here, the chi-square critical values for 13 degrees of freedom are 4.107 and

27.688, which give a lower limit of about 1.84 and an upper limit of about 4.79. This is quite a wide interval, but the confidence level is very

high: 98%. One can say with 98% confidence that the true population parameter for the

standard deviation of the number of candies per bag lies within the interval from about 1.84 to

4.79.
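
The chi-square step can be sketched in Python. This is a minimal illustration, assuming the table values 4.107 and 27.688 (13 degrees of freedom, 98% confidence), the sample standard deviation 2.69 and the standard formula sqrt((n - 1)s^2 / chi-square); the function name is hypothetical.

```python
from math import sqrt


def sigma_ci(s, n, chi2_left, chi2_right):
    """Confidence interval for a population standard deviation.

    The right-tail critical value goes under the lower limit and the
    left-tail critical value goes under the upper limit.
    """
    df = n - 1
    lower = sqrt(df * s ** 2 / chi2_right)
    upper = sqrt(df * s ** 2 / chi2_left)
    return lower, upper


low, high = sigma_ci(2.69, 14, chi2_left=4.107, chi2_right=27.688)
print(round(low, 2), round(high, 2))
```

The limits land on either side of 2.69 but not at equal distances from it, which is the asymmetry described above.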

“Somewhere over the rainbow skies are blue, and the dreams that you dare to dream really do

come true…” -Dorothy

HYPOTHESIS TESTS

The purpose of a hypothesis test is to use quantitative analysis to weigh evidence, then make

decisions. These decisions include the limit for the number of people allowed in a particular

room due to fire safety considerations, whether or not to purchase a specific math program for

a school district and how much to charge for a movie ticket in a certain city. The applications

are practically unlimited! Without the use of numerical data and proper formulas, these

decisions would be made in an imprecise, inconsistent and unsafe manner.

There are several conditions for doing interval estimates and hypothesis tests for population

proportions:

1. The sample is a simple random sample

2. The conditions for a binomial distribution are satisfied. There are two mutually exclusive

outcomes possible (yes/no), a fixed number of independent trials and probability is

consistent throughout.

3. There are at least 5 successes and 5 failures.

These conditions are met by our sample although it is a small sample—only 14 bags of

candy. In the case of proportion of one color, yellow is success and not yellow is failure. The

bags were purchased at various places in at least two counties in Utah. There is really no

way to know if geographical location of purchase affects randomness here, but in general a

variety of locations improves randomness. It was not a convenience sample, with one

person buying all 14 bags at one time in one place.

THE EMERALD SKITTLES

Testing the Claim that 20% of all Skittles Candies are Green

The results of the test show a test statistic of z = -0.776 (from technology). At alpha = 0.01,

the critical values are z = ±2.575. Since the test statistic is in the fail-to-reject region, much

less extreme than the critical values, we fail to reject the null hypothesis. There is not

sufficient evidence to warrant rejection of the claim that the true proportion of green Skittles

is 20%.
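
The green-candy test can be reproduced with a short sketch. It assumes a count of 159 green candies out of 840, a figure inferred from the rounded class proportion of 0.189 and not stated in the paper; the function name is hypothetical. The sketch also checks the success and failure counts expected under the claimed 20%.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(successes, n, p0):
    """z test statistic for a claim about a population proportion."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


n, p0, alpha = 840, 0.20, 0.01
green = 159                                  # assumption: 0.189 * 840, rounded

# Requirement check: at least 5 expected successes and 5 expected failures.
assert n * p0 >= 5 and n * (1 - p0) >= 5

z = one_proportion_z(green, n, p0)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
reject = abs(z) > z_crit
print(round(z, 4), round(z_crit, 3), reject)
```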

Testing the Claim that the mean number of candies = 56 per bag

Since the population standard deviation (sigma) is unknown, we will use the Student t distribution table.

Alpha = 0.05 puts us at a confidence level of 95%. The test statistic, t = 5.57 (rounded), is

more extreme than the critical value of t = 2.160. Thus, we reject the null hypothesis. There

is sufficient evidence to warrant rejection of the claim that the mean number of candies per

bag is 56. This could be good or bad: If we are getting less than the mean, we could use

Skittles as comfort food. If we’re getting more, we could gather samples of the many

different flavors of Skittles, have a taste test party and then start another delicious

statistical analysis…
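
The test of the claim that the mean is 56 candies per bag can be sketched as follows. This is a minimal illustration, assuming the rounded sample statistics (mean 60.0, standard deviation 2.69, n = 14) and the table critical value 2.160 for 13 degrees of freedom; the function name is hypothetical. With these rounded inputs the statistic comes out near 5.56, slightly below the reported 5.57, presumably because the calculator carried more decimal places.

```python
from math import sqrt

# Two-tailed t critical value for alpha = 0.05, 13 degrees of freedom (from the text).
T_CRIT_95_DF13 = 2.160


def one_sample_t(mean, s, n, mu0):
    """t test statistic for a claim about a population mean (sigma unknown)."""
    return (mean - mu0) / (s / sqrt(n))


t = one_sample_t(60.0, 2.69, 14, 56.0)
reject = abs(t) > T_CRIT_95_DF13
print(round(t, 2), reject)
```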

“Skittles, taste the rainbow”™

REFLECTION

What does the word “statistics” mean to you? Is it a heinous subject, full of confusing,

difficult tasks that must be completed in order to achieve an important goal? Or does the

sound of “statistics” conjure up a beautiful image of a kingdom in which the magical

language of mathematics rules and laws are based on true principles determined by precise,

quantitative calculations? A confidence interval estimate of my own positive feelings about

“statistics” would fall somewhere in the middle, higher than the mean but not high enough

to pursue a PhD in this challenging, fascinating subject!

There are several very practical reasons for educated people to be literate about statistics.

We live in a world full of studies. Every day we hear statistics quoted as advertisers,

politicians, healthcare professionals and many other people try to persuade us to believe

their claims are true. With the skills learned in Math 1040, one is much better equipped to

evaluate and accept or reject these claims, if one wants to do the research and analysis.

Improved critical thinking skills always increase the quality of decision making.

The very practical skill of using the TI-84 Graphing Calculator may change my life for the

better. I am also grateful that I will be able to apply these skills in future classes, such as

Chemistry and Physiology. In the future, the knowledge I have gained will support my

chosen profession of nursing, as I will be reading and analyzing medical journal reports of

various studies. In the healthcare field, life and death are at stake. In this situation, more

in-depth knowledge is required than in other fields. Quantitative literacy applies more here

than almost anywhere else.

Credits

Wizard of Oz Lyrics

www.lyricsmode.com