Page 1: Introduction to Statistics Lecture Notes

Chapters 3-5

Please sign in (SIGNATURES) as you come in to class. It will save my voice instead of my taking attendance (this is only to settle the class roster).

Page 2: What’s up with the PowerPoint?

I don’t usually use slides, but I am going to try using these to save my voice somewhat.

Notes:

I am still working on getting the class roster settled. There has been some movement on the waitlist, and I will keep in touch as things develop. Be sure you’ve signed in!

The first homework is posted on our course website, but it isn’t due until next Friday (the 4th). The additional problem is NOT optional; “additional” just means it is not a book problem.

Page 3: Handouts for Today

There is one handout on graphs/descriptive statistics going around. Save this to use tomorrow in class.

There is a second handout: the anonymous survey largely designed by the class on Monday. Please take a few minutes to fill it out (no names!) and get it back to me. We’ll take a look at these data next week in lab.

If you missed class Monday, I have extra course syllabuses at the front as well.

Page 4: The “W”s of a Data Set

Who: the observations. The population is the set of all objects for which you are interested in obtaining the value of some parameter. Since we usually can’t observe all objects, we take a sample: a subset of the overall population that we actually observe.

Note: There is NO such thing as a population sample or sample population.

What: the variables

Why: why the data were collected

How: how the data were collected (related to design/sampling in chapters 12-13)

When/Where: more information that could be relevant

Page 5: Chapters 3-5 Overview

Covers basic graphs and descriptive statistics for both categorical and quantitative variables.

This is what you would do as a “preliminary analysis” for a variable.

Recall: a data set can have multiple variables in it.

These chapters focus mostly on univariate (single-variable) analyses. There is one comparative graph: a side-by-side boxplot in Chapter 5.

Page 6: 3 Rules of Data Analysis

Rule 1: Make a picture

Rule 2: Make a picture (really, before you do anything else)

Rule 3: Make a picture (really, we mean a well-chosen picture for your variables)

Page 7: Categorical Variable Prelim Analysis

Frequency tables (one variable): summarize counts by category

Contingency tables (2 or more variables): summarize counts by category for multiple variables

Bar charts

Pie charts

Page 8: Frequency

What is frequency? Frequency is the number of objects/cases per category.

You can also look at relative frequency. Relative frequency is the number of objects/cases per category divided by the total number of objects.

Hence it gives the proportion of the total that falls in each category.

It is often converted to %.
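To make the definitions on this slide concrete, here is a minimal Python sketch. The `responses` list is made up purely for illustration; it is not class data.

```python
from collections import Counter

# Hypothetical categorical responses (invented for illustration)
responses = ["cat", "dog", "dog", "fish", "dog", "cat", "dog", "cat"]

# Frequency: number of cases per category
frequency = Counter(responses)

# Relative frequency: count per category divided by the total number of cases
total = sum(frequency.values())
relative_frequency = {cat: n / total for cat, n in frequency.items()}

for cat, n in frequency.items():
    pct = 100 * relative_frequency[cat]
    print(f"{cat}: frequency={n}, relative frequency={relative_frequency[cat]:.3f} ({pct:.1f}%)")
```

The relative frequencies always add up to 1 (100%), which is a quick check that no cases were dropped.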

Page 9: Bar Charts

One bar per category; height is determined by frequency or relative frequency.

Order of categories is arbitrary.

A bar chart does NOT let you talk about the shape of a distribution.

“Area” principle: the area of each part of a graph is supposed to be proportional to the value it represents. This is often violated when people try to make graphs “cool” and go 3-D, etc. (see Example passed around).

Page 10: Pie Charts

Take 100% of cases and divide up the 360 degrees of the circle based on relative frequencies.

We will use bar charts in preference to pie charts.

Note that for bar charts you do not need to create bars for 100% of the cases. You could look at the top three risk factors for a disease, etc. However, we usually do have 100% of cases shown.
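As a small sketch of the pie chart construction described on this slide, each category’s slice angle is its relative frequency times 360 degrees. The counts below are hypothetical.

```python
# Hypothetical counts per category (invented for illustration)
counts = {"A": 10, "B": 30, "C": 60}

total = sum(counts.values())

# Slice angle in degrees = relative frequency * 360
angles = {cat: 360 * n / total for cat, n in counts.items()}

for cat, angle in angles.items():
    print(f"{cat}: {angle:.1f} degrees")
```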

Page 11: Contingency Tables - Example

See first page of Handout.

Totals for rows/columns give marginal distributions for each variable.

You can also look at conditional distributions. Fix a row or column and work solely within that row or column.

Concept of independence (will formalize later): If the distribution of one variable is the same for all categories of another variable, then the two variables are independent.
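The following Python sketch illustrates marginal and conditional distributions on a small contingency table. The table is hypothetical (it is not the one on the handout), and its counts were chosen so that the conditional distributions match across rows.

```python
# Hypothetical contingency table: rows = class year, columns = drink preference
table = {
    "first-year": {"tea": 20, "coffee": 30},
    "senior": {"tea": 40, "coffee": 60},
}

# Row and column totals
row_totals = {row: sum(cols.values()) for row, cols in table.items()}
col_totals = {}
for cols in table.values():
    for col, n in cols.items():
        col_totals[col] = col_totals.get(col, 0) + n
grand_total = sum(row_totals.values())

# Marginal distributions: totals divided by the grand total
marginal_rows = {row: n / grand_total for row, n in row_totals.items()}
marginal_cols = {col: n / grand_total for col, n in col_totals.items()}

# Conditional distributions: fix a row and work only within that row
conditional_by_row = {
    row: {col: n / row_totals[row] for col, n in cols.items()}
    for row, cols in table.items()
}

print("Marginal (drink):", marginal_cols)
for row, dist in conditional_by_row.items():
    print(f"Drink given {row}:", dist)
```

In this made-up table the drink distribution is the same in both rows and equals the marginal distribution, which is the pattern the slide describes as independence. With real data the conditional distributions will rarely match exactly.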

Page 12: On Your Own

The text has some discussion of segmented and side-by-side bar charts (feel free to read or skip).

Page 13: Simpson’s Paradox

Something that can happen when you aggregate categorical data.

Looking at overall averages or percentages can be misleading.

You can get different results when looking at the breakdown.

Berkeley Discrimination Data Example (see bottom of page one of the handout)

Claims of sexual discrimination in 1973 graduate school admissions.

Overall, 44.28% of males who applied were admitted, while only 34.58% of females were admitted.

Look what happens when you break down by the 6 largest departments though! (Try this on your own or with a partner.) Is there evidence of discrimination against females at the dept. level? What is going on?
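Here is a minimal Python sketch of how aggregation can reverse a comparison. The numbers are invented to make the effect obvious; they are NOT the Berkeley data from the handout, so you can still work through that example yourself.

```python
# Invented data: (number applied, number admitted) per department and group
data = {
    "Dept A": {"male": (100, 80), "female": (20, 18)},
    "Dept B": {"male": (20, 2), "female": (100, 20)},
}

def rate(applied, admitted):
    """Admission rate as a proportion."""
    return admitted / applied

# Admission rates within each department
dept_rates = {
    dept: {group: rate(*counts) for group, counts in groups.items()}
    for dept, groups in data.items()
}

# Overall admission rates after aggregating over departments
overall = {}
for group in ("male", "female"):
    applied = sum(data[dept][group][0] for dept in data)
    admitted = sum(data[dept][group][1] for dept in data)
    overall[group] = rate(applied, admitted)

for dept, rates in dept_rates.items():
    print(dept, {g: f"{100 * r:.1f}%" for g, r in rates.items()})
print("Overall", {g: f"{100 * r:.1f}%" for g, r in overall.items()})
```

In this made-up example females have the higher admission rate in each department, yet the lower rate overall, because most females applied to the department that admits few applicants.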

Page 14: Quantitative Variables Preliminary Analysis

Graphs:

Dot plot (won’t use much; read about it on your own)

Stem and leaf (won’t use much; read about it on your own)

Histogram

Boxplot (chapter 5)

QQ plot (Friday or next week)

Time plot (Friday or next week)

Descriptive statistics:

Measures of center: mean, median

Measures of spread: standard deviation, IQR, range
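As a rough sketch, the descriptive statistics listed on this slide can be computed with Python’s standard `statistics` module. The `values` list is hypothetical. Note that quartiles (and therefore the IQR) can differ slightly between textbooks and software because several conventions exist; this sketch uses the module’s default method.

```python
import statistics

# Hypothetical quantitative data (invented for illustration)
values = [2, 4, 4, 5, 7, 9, 11]

# Measures of center
mean = statistics.mean(values)
median = statistics.median(values)

# Measures of spread
std_dev = statistics.stdev(values)              # sample standard deviation (divides by n - 1)
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartiles, default "exclusive" method
iqr = q3 - q1
data_range = max(values) - min(values)

print(f"mean={mean}, median={median}")
print(f"sd={std_dev:.3f}, IQR={iqr}, range={data_range}")
```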

Page 15: Describing the distribution of a quantitative variable

You should focus on three things when describing the distribution of a quantitative variable:

Shape: unimodal (one peak), bimodal (two peaks), multimodal (many peaks), bell-shaped, skewed left (tail to the left), skewed right (tail to the right), symmetric, uniform (no peaks, basically flat)

Center: estimate the center (or use a descriptive statistic). If there are multiple peaks, report the peak locations.

Spread: estimate the spread (can use a descriptive statistic)

Page 16: Dot Plot - On Your Own

Most basic quantitative graph.

Use for a low number of observations (<50).

Basically, use a number line and place a dot above it for each value you have observed.

Example from Wikipedia (figure not reproduced in this transcript).

Page 17: Stem and Leaf - On Your Own

Your book discusses lots of options for these, including split leaves (which is something R/Rcmdr will do).

Basics: You take your values and set a stem, maybe the tens place. Then the leaves are the ones place. For each stem, you list the leaves that belong to it in numeric order.

Usually works decently for fewer than 100 observations.

Try it. Suppose you have scores on a pre-test for an at-risk youth group as follows:

5, 11, 13, 21, 34, 36, 45, 47, 48, 48, 49
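If you want to check your hand-drawn display, the basic recipe above (tens as stems, ones as leaves) can be sketched in Python as follows. This is only the plain version; it does not do split leaves or any of the other options in the book.

```python
from collections import defaultdict

# Pre-test scores from the slide
scores = [5, 11, 13, 21, 34, 36, 45, 47, 48, 48, 49]

# Stem = tens digit, leaf = ones digit; sorting first keeps leaves in numeric order
stems = defaultdict(list)
for score in sorted(scores):
    stems[score // 10].append(score % 10)

# Print every stem from smallest to largest, including any stems with no leaves
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")
```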

Page 18: Histogram

Take the quantitative variable and break it up into “piles” or “bins” (usually the same width).

Count the number of observations in each bin or pile.

Plot the frequencies per bin.

Usually there are no spaces between bins (if there is a space, it is a gap in the data, NOT a separator like in a bar chart).

You DO need to know the boundaries. (5,10], (10,15] as bins IS different from [5,10), [10,15): a value of exactly 10 falls in the first bin under the first scheme and in the second bin under the second. (If anyone needs me to explain open/closed brackets, please ask.)

Technology lets us vary the width of bins (effectively the number of bins).

You can also use unequal bin widths, but then you need to plot something called density, not frequency.
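The sketch below illustrates two points from this slide with made-up data: the bin counts change depending on which end of each bin is closed, and with unequal bin widths you plot density, taken here as the count divided by (number of observations times bin width), so that the bar areas add up to 1. The helper names `bin_counts` and `densities` are hypothetical, written just for this example.

```python
def bin_counts(values, edges, right_closed=True):
    """Count values per bin; bins are (lo, hi] if right_closed, else [lo, hi)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            lo, hi = edges[i], edges[i + 1]
            inside = (lo < v <= hi) if right_closed else (lo <= v < hi)
            if inside:
                counts[i] += 1
                break
    return counts

def densities(counts, edges):
    """Density per bin = count / (total count * bin width)."""
    n = sum(counts)
    return [c / (n * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]

# Hypothetical data with values sitting exactly on the bin boundaries
values = [5, 6, 10, 10, 12, 15]
edges = [5, 10, 15]
print(bin_counts(values, edges, right_closed=True))   # bins (5,10], (10,15]
print(bin_counts(values, edges, right_closed=False))  # bins [5,10), [10,15)

# Hypothetical data binned with unequal widths: [0,5) and [5,20)
values2 = [1, 2, 3, 4, 6, 12, 18, 19]
edges2 = [0, 5, 20]
counts2 = bin_counts(values2, edges2, right_closed=False)
print(counts2, densities(counts2, edges2))
```

In the unequal-width example both bins hold the same number of observations, but the wider bin gets a lower density, so its bar is shorter and its area still reflects its share of the data.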

Page 19: Examples

See page 2 of the handout. Try to describe the shape of each histogram.

Then see page 3 of the handout. We’re going to create a histogram by hand if there is time. If there is no time, you can do this on your own.

Page 20: Cookie Lab

Time permitting (otherwise, Friday).

The last page (to turn in) is not due until the end of class tomorrow. So don’t worry if we don’t get to it today. You can look at it tonight or tomorrow in class (I’ll give the last five minutes of class for you to work on it).