STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION.
Uploaded by: phyllis-lee
Recap: Module 1
In Module 1, we learned several techniques for describing data using graphs / charts.
Is it effective?
Graphs / charts are effective at giving the overall view of a situation or a population.
HOWEVER
Graphs / charts cannot give precise information for inferential purposes (note: to infer means to draw conclusions).
IN FACT
Graphs / charts may not be suitable for all cases (e.g. how would you describe a student's result for this semester?).
Describing Data with Numerical Measures
Numerical measures can be created for both populations and samples.
- A parameter is a numerical descriptive measure calculated for a population.
- A statistic is a numerical descriptive measure calculated for a sample.
It is best to describe data by using both numerical and graphical representations whenever possible.
Arithmetic Mean or Average
The mean of a set of measurements is the sum of the measurements divided by the total number of measurements (i.e. the average).
$\bar{x} = \frac{\sum x_i}{n}$
where n = number of measurements
∑ xi = sum of all measurements
Example
The set: 2, 9, 11, 5, 6
$\bar{x} = \frac{\sum x_i}{n} = \frac{2 + 9 + 11 + 5 + 6}{5} = \frac{33}{5} = 6.6$
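The mean calculation can be checked with a few lines of Python. This is a minimal sketch; the function name `mean` and the variable names are illustrative choices, not part of the module.

```python
def mean(values):
    """Return the arithmetic mean: the sum of the measurements divided by their count."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


data = [2, 9, 11, 5, 6]   # the set from the example
x_bar = mean(data)        # 33 / 5
print(x_bar)
```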
When is the mean often used? When the measurements in the overall population follow a normal distribution, e.g. height, weight, income, etc.
Median
The median of a set of measurements is the middle measurement when the measurements are ranked from smallest to largest.
The position of the median is .5(n + 1) once the measurements have been ordered.
Example
The set: 2, 4, 9, 8, 6, 5, 3
n = 7
Sort: 2, 3, 4, 5, 6, 8, 9
Position: .5(n + 1) = .5(7 + 1) = 4th
Median = 5 (i.e. the 4th measurement in the sorted list)
The set: 2, 4, 9, 8, 6, 5
n = 6
Sort: 2, 4, 5, 6, 8, 9
Position: .5(n + 1) = .5(6 + 1) = 3.5th
Median = (5 + 6)/2 = 5.5 (i.e. the average of the 3rd and 4th measurements)
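The position rule .5(n + 1) translates directly into code. The sketch below is one possible way to write it; the function name `median` is an illustrative choice, and the position is treated as 1-based, as in the examples.

```python
def median(values):
    """Return the median using the .5(n + 1) position rule on the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    position = 0.5 * (n + 1)            # 1-based position of the median
    lower = ordered[int(position) - 1]  # measurement at the whole-number part
    if position == int(position):       # odd n: the position is a whole number
        return lower
    upper = ordered[int(position)]      # even n: average the two neighbours
    return (lower + upper) / 2


print(median([2, 4, 9, 8, 6, 5, 3]))  # n = 7, position 4
print(median([2, 4, 9, 8, 6, 5]))     # n = 6, position 3.5
```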
Mode
The mode is the measurement which occurs most frequently.
The set: 2, 4, 9, 8, 8, 5, 3
The mode is 8, which occurs twice.
The set: 2, 2, 9, 8, 8, 5, 3
There are two modes, 8 and 2 (bimodal).
The set: 2, 4, 9, 8, 5, 3
There is no mode (each value is unique).
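The three cases above (one mode, two modes, no mode) can be reproduced by counting occurrences. This is a sketch only; the function name `modes` is hypothetical, and returning an empty list when every value is unique follows the convention used in these slides rather than any universal rule.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # slide convention: every value unique means "no mode"
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([2, 4, 9, 8, 8, 5, 3]))  # one mode
print(modes([2, 2, 9, 8, 8, 5, 3]))  # bimodal
print(modes([2, 4, 9, 8, 5, 3]))     # no mode
```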
Example
Calculate the mean, median and mode for the number of quarts of milk purchased by the following 25 households:
0 0 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 5
Mean? $\bar{x} = \frac{\sum x_i}{n} = \frac{55}{25} = 2.2$
Median? m = 2
Mode? (Highest peak) mode = 2
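The milk example can also be checked with Python's standard `statistics` module, which provides `mean`, `median` and `mode`. The variable names below are illustrative.

```python
import statistics

# Quarts of milk purchased by the 25 households
quarts = [0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
          3, 3, 3, 3, 3, 4, 4, 4, 5]

milk_mean = statistics.mean(quarts)      # sum 55 divided by n = 25
milk_median = statistics.median(quarts)  # 13th value of the sorted data
milk_mode = statistics.mode(quarts)      # the value with the highest frequency

print(len(quarts), sum(quarts))
print(milk_mean, milk_median, milk_mode)
```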
Measures of Center
A measure along the horizontal axis of the data distribution that locates the center of the distribution.
What do you use as a measure of centre? (a) Mean? (b) Median? (c) Mode?
Extreme Values
The mean is more easily affected by extremely large or small values than the median.
The median is often used as a measure of center when the distribution is skewed.
Extreme Values (cont.)
Skewed left: Mean < Median
Skewed right: Mean > Median
Symmetric: Mean = Median
Skewed Right (Positively Skewed)
Skewed right: long tail to the right.
A few high numbers pull the mean above the median.
The set:
Num.   Frequency
1      3
2      5
3      3
4      1
The graph: [bar chart of Frequency (vertical axis, 0 to 5) against Num. (horizontal axis, 1 to 4)]
Mean = [1(3) + 2(5) + 3(3) + 4(1)] / 12 = 2.17
Median = 2
Mean > Median
Skewed Left (Negatively Skewed)
Skewed left: long tail to the left.
A few low numbers pull the mean below the median.
The set:
Num.   Frequency
1      1
2      3
3      5
4      3
The graph: [bar chart of Frequency (vertical axis, 0 to 5) against Num. (horizontal axis, 1 to 4)]
Mean = [1(1) + 2(3) + 3(5) + 4(3)] / 12 = 2.83
Median = 3
Mean < Median
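Both skewness examples can be reproduced from their frequency tables. The sketch below expands each table into raw data and compares the mean with the median; the helper names `expand` and `describe` are hypothetical, and the mean-versus-median comparison is only the rule of thumb used in these slides, not a formal test of skewness.

```python
import statistics


def expand(freq_table):
    """Turn a {value: frequency} table into a flat list of measurements."""
    return [value for value, count in freq_table.items() for _ in range(count)]


def describe(freq_table):
    """Return (mean, median, shape) using the mean-vs-median rule of thumb."""
    data = expand(freq_table)
    m = statistics.mean(data)
    med = statistics.median(data)
    if m > med:
        shape = "skewed right"
    elif m < med:
        shape = "skewed left"
    else:
        shape = "symmetric"
    return m, med, shape


right_table = {1: 3, 2: 5, 3: 3, 4: 1}  # Skewed Right example
left_table = {1: 1, 2: 3, 3: 5, 4: 3}   # Skewed Left example

print(describe(right_table))
print(describe(left_table))
```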
Measures of Centre Vs. Variability
"I was told that the average height of plants here is only 1 foot."
"But this tree is 10 feet high!!!"
Often, a measure of centre alone does not give the true picture. We also need to know the measure of variability about the centre.
Measures of Variability
A measure along the horizontal axis of the data distribution that describes the spread of the distribution from the center.
The Range
The range, R, of a set of n measurements is the difference between the largest and smallest measurements.
Example: A botanist records the number of petals on 5 flowers: 5, 12, 6, 8, 14
The range is R = 14 - 5 = 9
The Variance
The variance is a measure of variability that uses all the measurements (as opposed to the range R, which uses only 2 measurements, the maximum and the minimum).
It measures the average squared deviation of the measurements from their mean.
Flower petals: 5, 12, 6, 8, 14
$\bar{x} = \frac{45}{5} = 9$
[Dot plot of the five petal counts on an axis from 4 to 14]
The variance of a population of N measurements is the average of the squared deviations of the measurements about their mean μ.
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
The variance of a sample of n measurements is the sum of the squared deviations of the measurements about their mean, divided by (n - 1).
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
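The only difference between the two definitions is the divisor: N for a population, n - 1 for a sample. A minimal sketch, applied to the flower petal data (5, 12, 6, 8, 14); the function names are illustrative.

```python
def population_variance(values):
    """Average of the squared deviations about the mean (divide by N)."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one measurement")
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of the squared deviations about the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


petals = [5, 12, 6, 8, 14]
print(population_variance(petals))  # 60 / 5, treating the data as a population
print(sample_variance(petals))      # 60 / 4, treating the data as a sample
```

For the same data, the sample variance is always the larger of the two, because the same sum of squared deviations is divided by a smaller number.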
The Standard Deviation
In calculating the variance, we squared all of the deviations, and in doing so changed the scale of the measurements.
To return this measure of variability to the original units of measure, we calculate the standard deviation, the positive square root of the variance.
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
2 Ways to Calculate the Sample Variance
Use the Definition Formula:
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
       x_i    x_i - x̄    (x_i - x̄)^2
       5      -4          16
       12     3           9
       6      -3          9
       8      -1          1
       14     5           25
Sum    45     0           60
$s^2 = \frac{60}{4} = 15$
$s = \sqrt{15} = 3.87$
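2 Ways to Calculate the Sample Variance
Use the Calculational Formula:
$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$
       x_i    x_i^2
       5      25
       12     144
       6      36
       8      64
       14     196
Sum    45     465
$s^2 = \frac{465 - \frac{45^2}{5}}{4} = \frac{465 - 405}{4} = \frac{60}{4} = 15$
$s = \sqrt{15} = 3.87$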
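The two routes to the sample variance can be compared side by side in code. This is a sketch with hypothetical function names; both functions should agree on the petal data, and the standard deviation is the square root of either result.

```python
import math


def variance_by_definition(values):
    """s^2 = sum((x_i - x_bar)^2) / (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def variance_by_calculational_formula(values):
    """s^2 = (sum(x_i^2) - (sum(x_i))^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)                   # 45 for the petal data
    sum_x2 = sum(x * x for x in values)   # 465 for the petal data
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


petals = [5, 12, 6, 8, 14]
s2_definition = variance_by_definition(petals)
s2_calculational = variance_by_calculational_formula(petals)
s = math.sqrt(s2_definition)

print(s2_definition, s2_calculational, round(s, 2))
```

The calculational formula is convenient by hand because it avoids computing each deviation. In floating-point code it can lose precision when the mean is large relative to the spread, so the definition formula is usually the safer choice there.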
Some Notes
The value of s is never negative (it is 0 only when all the measurements are equal).
The larger the value of $s^2$ or s, the larger the variability of the data set.
Why divide by n - 1?
The sample standard deviation s is often used to estimate the population standard deviation σ. Dividing by n - 1 gives us a better estimate of σ.
1. Question: Find the mean, median and mode of: 5, 7, 3, 5, 6, 8, 5, 6, 4, 6, 25
Solution: First, arrange the data: 3, 4, 5, 5, 5, 6, 6, 6, 7, 8, 25
median = 6; mean = 80/11 = 7.27; modes = 5 and 6
2. Question: Eliminate the last observation x = 25 and then find the mean, median and mode. How do these values compare with those found using the full data set?
Solution: median = 5.5; mean = 55/10 = 5.5; modes = 5 and 6. The mean is noticeably smaller (5.5 instead of 7.27), the median drops only slightly (5.5 instead of 6), and the modes are unchanged.
3. Question: How do possible outliers (such as 25) affect these values?
Solution: The mean is strongly affected by the outlier, while the median and mode are barely affected.
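The effect of the outlier can be seen by summarizing the data with and without the value 25. The sketch below uses the standard `statistics` module; `multimode` requires Python 3.8 or later, and the helper name `summarize` is hypothetical.

```python
import statistics

full = [5, 7, 3, 5, 6, 8, 5, 6, 4, 6, 25]
without_outlier = [x for x in full if x != 25]  # drop the single outlier


def summarize(values):
    """Return the three measures of center for one data set."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": sorted(statistics.multimode(values)),
    }


before = summarize(full)
after = summarize(without_outlier)
print(before)
print(after)
print("change in mean:", round(before["mean"] - after["mean"], 2))
print("change in median:", before["median"] - after["median"])
```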
Exercise 1
Given the observations 7, 9, 10, 6, 8, 7, 8, 9, 8, calculate:
1. the range
Solution: R = 10 - 6 = 4
2. the mean
Solution: Mean = 72 / 9 = 8
3. the variance
Solution: Variance = [588 - (72^2 / 9)] / 8 = 12 / 8 = 1.5
4. the standard deviation
Solution: Standard Deviation = √1.5 = 1.225
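The four answers to Exercise 1 can be verified with the standard `statistics` module, whose `variance` and `stdev` functions use the sample (n - 1) divisor. The variable names are illustrative.

```python
import statistics

observations = [7, 9, 10, 6, 8, 7, 8, 9, 8]

obs_range = max(observations) - min(observations)  # largest minus smallest
obs_mean = statistics.mean(observations)           # 72 / 9
obs_variance = statistics.variance(observations)   # sample variance, n - 1 divisor
obs_stdev = statistics.stdev(observations)         # square root of the variance

print(obs_range, obs_mean, obs_variance, round(obs_stdev, 3))
```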
Exercise 2
--END--