Descriptive statistics

Lecture by Dr Zahid Khan, King Faisal University, KSA.


Descriptive Statistics

04/11/2023


Types of Statistical Analyses

Five Types of Statistical Analysis:

1. Descriptive analysis: used to describe the data set

2. Inferential analysis: used to generate conclusions about the population’s characteristics based on the sample data

Descriptive Statistics

Tools for summarising, organising, simplifying data.

1. Tables & Graphs
2. Measures of Central Tendency
3. Measures of Variability

Understanding Data Via Descriptive Analysis

Two sets of descriptive measures:
– Measures of central tendency: used to report a single piece of information that describes the most typical response to a question
– Measures of variability: used to reveal the typical difference between the values in a set of values

Understanding Data Via Descriptive Analysis

Measures of Central Tendency:

– Mode: the value in a string of numbers that occurs most often

– Median: the value whose occurrence lies in the middle of a set of ordered values

– Mean: sometimes referred to as the “arithmetic mean”; the average value characterizing a set of numbers
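The three measures can be illustrated with Python's standard `statistics` module. This is a minimal sketch; the `responses` list is a hypothetical sample made up for illustration and is not data from the lecture.

```python
import statistics

# Hypothetical responses, used only for illustration.
responses = [2, 3, 3, 5, 7, 10]

mode = statistics.mode(responses)      # value that occurs most often
median = statistics.median(responses)  # middle of the ordered values
mean = statistics.mean(responses)      # arithmetic mean

print(f"mode={mode}, median={median}, mean={mean}")
```

With an even number of values, the median is taken as the average of the two middle values.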


Understanding Data Via Descriptive Analysis

Measures of Variability:

– Frequency distribution reveals the number (percent) of occurrences of each number or set of numbers

– Range identifies the maximum and minimum values in a set of numbers

– Standard deviation indicates the degree of variation in a way that can be translated into a bell-shaped curve distribution
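As a rough illustration of the range and the standard deviation, the sketch below uses a hypothetical list of values (an assumption, not lecture data). It reports both the sample standard deviation (divisor n - 1) and the population standard deviation (divisor n), since the slides do not say which is meant.

```python
import statistics

# Hypothetical values, used only for illustration.
values = [2, 3, 3, 5, 7, 10]

lowest, highest = min(values), max(values)
value_range = highest - lowest              # spread between the extremes

sample_sd = statistics.stdev(values)        # divides by n - 1
population_sd = statistics.pstdev(values)   # divides by n

print(f"min={lowest}, max={highest}, range={value_range}")
print(f"sample SD={sample_sd:.3f}, population SD={population_sd:.3f}")
```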

Descriptive statistics

If we wanted to characterize the students in a first-year medical class, we would find that they are:
– Young
– Fit
– Males and females

How young? How fit is this class? What is the distribution of males and females?


Frequency distribution

The frequency with which observations are assigned to each category or point on a measurement scale.
– Most basic form of descriptive statistics
– May be expressed as a percentage of the total sample found in each category

Frequency distribution

The distribution is “read” differently depending upon the measurement level
– Nominal scales are read as discrete measurements at each level
– Ordinal measures show tendencies, but categories should not be compared
– Interval and ratio scales allow for comparison among categories

SPSS Output for Frequency Distribution

IQ

Valid     Frequency  Percent  Valid Percent  Cumulative Percent
82.00             1      4.2            4.2                 4.2
87.00             1      4.2            4.2                 8.3
89.00             1      4.2            4.2                12.5
93.00             2      8.3            8.3                20.8
96.00             1      4.2            4.2                25.0
97.00             1      4.2            4.2                29.2
98.00             1      4.2            4.2                33.3
102.00            1      4.2            4.2                37.5
103.00            1      4.2            4.2                41.7
105.00            1      4.2            4.2                45.8
106.00            1      4.2            4.2                50.0
107.00            1      4.2            4.2                54.2
109.00            1      4.2            4.2                58.3
111.00            1      4.2            4.2                62.5
115.00            1      4.2            4.2                66.7
119.00            1      4.2            4.2                70.8
120.00            1      4.2            4.2                75.0
127.00            1      4.2            4.2                79.2
128.00            1      4.2            4.2                83.3
131.00            2      8.3            8.3                91.7
140.00            1      4.2            4.2                95.8
162.00            1      4.2            4.2               100.0
Total            24    100.0          100.0
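A frequency table of this kind can be sketched in plain Python. The `iq` list below is read off the SPSS output (24 scores, with 93 and 131 each occurring twice); the function name `frequency_table` is a hypothetical helper, not an SPSS feature.

```python
from collections import Counter

# IQ scores as listed in the SPSS output (24 observations).
iq = [82, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106,
      107, 109, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

def frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    n = len(values)
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value],
                     100 * counts[value] / n, 100 * running / n))
    return rows

for value, freq, pct, cum in frequency_table(iq):
    print(f"{value:>6} {freq:>3} {pct:>6.1f} {cum:>6.1f}")
```

Because there are no missing values in this sample, the Percent and Valid Percent columns coincide, so the sketch computes only one of them.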


Grouped Relative Frequency Distribution

Relative Frequency Distribution of IQ for Two Classes

IQ            Frequency  Percent  Cumulative Percent
80 – 89               3     12.5                12.5
90 – 99               5     20.8                33.3
100 – 109             6     25.0                58.3
110 – 119             3     12.5                70.8
120 – 129             3     12.5                83.3
130 – 139             2      8.3                91.7
140 – 149             1      4.2                95.8
150 and over          1      4.2               100.0
Total                24    100.0               100.0
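Grouping into class intervals can be sketched as follows, assuming classes of width 10 and an open-ended top class starting at 150, as in the table. The `iq` list is the same 24 scores from the SPSS output; `grouped_frequencies` and its parameters are hypothetical names. Cumulative percentages are computed from running counts, so rounding error does not accumulate from one class to the next.

```python
from collections import Counter

# IQ scores as listed in the SPSS output (24 observations).
iq = [82, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106,
      107, 109, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

def grouped_frequencies(values, width=10, open_ended_from=150):
    """Return rows of (class lower bound, frequency, percent, cumulative percent)."""
    # Each score maps to the lower bound of its class; scores at or above
    # open_ended_from all fall into the final open-ended class.
    counts = Counter(min((v // width) * width, open_ended_from) for v in values)
    n = len(values)
    rows, running = [], 0
    for lower in sorted(counts):
        running += counts[lower]
        rows.append((lower, counts[lower],
                     round(100 * counts[lower] / n, 1),
                     round(100 * running / n, 1)))
    return rows

for row in grouped_frequencies(iq):
    print(row)
```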


Discrete and Continuous data

Data consisting of numerical (quantitative) variables can be further divided into two groups: discrete and continuous.

1. Discrete: the set of all possible values, when pictured on the number line, consists only of isolated points.

2. Continuous: the set of all possible values, when pictured on the number line, consists of intervals.

The most common type of discrete variable we will encounter is a counting variable.

Accuracy & Precision

Accuracy: the closeness of the measurements to the “actual” or “real” value of the physical quantity.
– Statistically this is estimated using the standard error of the mean

Precision: the closeness with which the measurements agree with one another.
– Statistically the precision is estimated by the standard deviation of the mean

Precision is related to random errors, which can be dealt with using statistics.

Accuracy is related to systematic errors, which are difficult to deal with using statistics.
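The quantities mentioned above can be computed for a set of repeated readings. This is only a sketch: the readings and the reference value of 10.0 are hypothetical, and `summarize_measurements` is an illustrative helper. It reports the offset of the mean from the reference value, the sample standard deviation of the readings, and the standard error of the mean (standard deviation divided by the square root of n).

```python
import math
import statistics

def summarize_measurements(readings, reference_value):
    """Summarize repeated readings against an assumed reference value."""
    mean = statistics.mean(readings)
    sd = statistics.stdev(readings)            # spread of the readings
    sem = sd / math.sqrt(len(readings))        # standard error of the mean
    return {"mean": mean, "offset": mean - reference_value,
            "sd": sd, "sem": sem}

# Hypothetical readings of a quantity whose reference value is taken as 10.0.
readings = [10.1, 10.3, 9.9, 10.2, 10.0]
summary = summarize_measurements(readings, reference_value=10.0)

for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The standard error shrinks as more readings are taken, while a consistent offset from the reference value does not.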


Accuracy and precision: The target analogy

High accuracy but low precision

High precision but low accuracy

What do high accuracy and high precision look like?

http://en.wikipedia.org/wiki/Image:Accuracy_and_precision-highaccuracylowprecision.gif

http://en.wikipedia.org/wiki/Image:Accuracy_and_precision-highprecisionlowaccuracy.gif

Accuracy and precision: The target analogy

High accuracy and high precision


Two types of error

Systematic error
– Poor accuracy
– Definite causes
– Reproducible

Random error
– Poor precision
– Non-specific causes
– Not reproducible



Systematic error

Diagnosis
– Errors have consistent signs
– Errors have consistent magnitude

Treatment
– Calibration
– Correcting procedural flaws
– Checking with a different procedure

Random error

Diagnosis
– Errors have random sign
– Small errors more likely than large errors

Treatment
– Take more measurements
– Improve technique
– Higher instrumental precision

Statistical graphs of data

A picture is worth a thousand words!

Graphs for numerical data:

Histograms

Frequency polygons

Pie

Graphs for categorical data

Bar graphs

Pie

Box-Plots

A way to graphically portray almost all the descriptive statistics at once is the box-plot.

A box-plot shows:
– Upper and lower quartiles
– Mean
– Median
– Range
– Outliers (1.5 IQR)

Box-Plots

[Figure: box-plot of IQ; axis from 80.00 to 180.00. Minimum = 82, lower quartile = 96.5, median = 106.5, upper quartile = 123.5, maximum = 162; mean M = 110.5.]

IQR = 27; there is no outlier.
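The box-plot figures for the IQ data can be reproduced with a short sketch. It assumes the quartiles are taken as the medians of the lower and upper halves of the ordered data (one common convention; other quartile definitions give slightly different values) and flags as outliers any points more than 1.5 IQR beyond the quartiles. The `iq` list is the 24 scores from the SPSS output; `box_plot_summary` is a hypothetical helper.

```python
import statistics

# IQ scores as listed in the SPSS output (24 observations).
iq = [82, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106,
      107, 109, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

def box_plot_summary(values):
    """Five-number summary, IQR, and 1.5*IQR outliers (needs at least 2 values)."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])    # median of the lower half
    q3 = statistics.median(data[-half:])   # median of the upper half
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "min": data[0], "q1": q1, "median": statistics.median(data),
        "q3": q3, "max": data[-1], "iqr": iqr,
        "fences": (low_fence, high_fence),
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

summary = box_plot_summary(iq)
print(summary)
print("mean:", round(statistics.mean(iq), 1))
```

The highest score, 162, lies just inside the upper fence of 123.5 + 1.5 × 27 = 164, which is why no point is flagged as an outlier.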


Bar Graphs

For categorical data
– Like a histogram, but with gaps between bars to show that each bar is a separate group
– Useful for showing two samples side-by-side

Bar Graph

[Figure: bar graph, “Marada Inn Quality Ratings”. x-axis: Rating (Poor, Below Average, Average, Above Average, Excellent); y-axis: Frequency (1 to 10).]

Good? Bad?

Histograms

– f on y axis (could also plot p or %)
– X values (or midpoints of class intervals) on x axis
– Plot each f with a bar, equal size, touching
– No gaps between bars

Pie Chart

The pie chart is a commonly used graphical device for presenting relative frequency distributions for qualitative data.

First draw a circle; then use the relative frequencies to subdivide the circle into sectors that correspond to the relative frequency for each class.

Pie Chart

[Figure: pie chart, “Toyota Quality Ratings”. Poor 10%, Below Average 15%, Average 25%, Above Average 45%, Excellent 5%.]
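Subdividing the circle amounts to giving each class a sector angle of 360° times its relative frequency. The sketch below applies this to the percentages shown in the quality-ratings pie chart; `sector_angles` is a hypothetical helper name.

```python
# Percentages from the quality-ratings pie chart.
ratings = {
    "Poor": 10,
    "Below Average": 15,
    "Average": 25,
    "Above Average": 45,
    "Excellent": 5,
}

def sector_angles(percentages):
    """Map each class to its sector angle in degrees."""
    total = sum(percentages.values())
    return {label: 360 * pct / total for label, pct in percentages.items()}

angles = sector_angles(ratings)
for label, angle in angles.items():
    print(f"{label}: {angle:.0f} degrees")
```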

Skewness of distributions
– Measures look at how lopsided distributions are: how far from the ideal of the normal curve they are
– When the median and the mean are different, the distribution is skewed. The greater the difference, the greater the skew.
– Distributions that trail away to the left are negatively skewed and those that trail away to the right are positively skewed
– If the skewness is extreme, the researcher should either transform the data to make them better resemble a normal curve or else use a different set of statistics (nonparametric statistics) to carry out the analysis
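One common way to put a number on skewness is the moment coefficient, the third central moment divided by the 1.5 power of the second. The sketch below uses that definition as an assumption (the slides do not name a formula) and applies it to the 24 IQ scores from the SPSS output, whose mean exceeds the median.

```python
import statistics

# IQ scores as listed in the SPSS output (24 observations).
iq = [82, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106,
      107, 109, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

print("mean:", round(statistics.mean(iq), 1), "median:", statistics.median(iq))
print("skewness:", round(skewness(iq), 2))
```

A positive result indicates a longer tail to the right, a negative result a longer tail to the left, and a value near zero a roughly symmetric distribution.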

Histogram (Common categories)

Symmetric
– Left tail is the mirror image of the right tail
– Examples: heights and weights of people

[Figure: symmetric histogram; y-axis: Relative Frequency, 0 to .35]

Histogram

Moderately Skewed Left
– A longer tail to the left
– Example: exam scores

[Figure: histogram moderately skewed left; y-axis: Relative Frequency, 0 to .35]

Histogram

Moderately Right Skewed
– A longer tail to the right
– Example: housing values

[Figure: histogram moderately skewed right; y-axis: Relative Frequency, 0 to .35]

[Figure: Positively Skewed]

[Figure: Negatively Skewed]

Symmetry: Kurtosis

A high kurtosis distribution has a sharper "peak" and fatter "tails", while a low kurtosis distribution has a more rounded peak with wider "shoulders".
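Kurtosis can be quantified as the fourth central moment divided by the square of the second; subtracting 3 gives "excess" kurtosis, which is zero for a normal curve. The sketch below assumes that definition (the slides do not give a formula) and uses two small hypothetical samples.

```python
def excess_kurtosis(values):
    """Excess kurtosis: m4 / m2**2 - 3 (zero for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical samples: one with a sharp peak and two extreme values,
# one spread evenly with no pronounced peak.
peaked = [0] * 8 + [-10, 10]
flat = [1, 2, 3, 4, 5]

print("peaked:", excess_kurtosis(peaked))
print("flat:", excess_kurtosis(flat))
```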

5. Measures of relationship…
– Spearman rho
– Pearson r

Correlations
– Determine whether and to what degree a relationship exists between two or more quantifiable variables
– The degree of the relationship is expressed as a coefficient of correlation
– The presence of a correlation does not indicate a cause-effect relationship, primarily because of the possibility of multiple confounding factors

Correlation coefficient…
-1.00: strong negative
 0.00: no relationship
+1.00: strong positive

Spearman rho…
…a measure of correlation used for rank and ordinal data

Pearson r…
…a measure of correlation used for data of interval or ratio scales
…assumes that the relationship between the variables being correlated is linear
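The two coefficients can be compared on a small example. The sketch below is a plain-Python illustration under stated assumptions: the data are hypothetical, Spearman's rho is computed as Pearson's r applied to the ranks, and tied values receive the average of their ranks. The helper names are illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: y rises with x, but along a curve rather than a line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print("Pearson r:", round(pearson_r(x, y), 3))
print("Spearman rho:", round(spearman_rho(x, y), 3))
```

Because the relationship here is perfectly monotonic but not linear, the rank-based coefficient reaches 1 while Pearson r falls slightly short of it.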

So

Descriptive statistics are used to summarize data from individual respondents, etc.
– They help to make sense of large numbers of individual responses, to communicate the essence of those responses to others

They focus on typical or average scores, the dispersion of scores over the available responses, and the shape of the response curve

Chi-square (χ²) test

Used to:
– Test for goodness of fit
– Test for independence of attributes
– Test for homogeneity
– Test a given population variance

Introduction

The test we use to measure the differences between what is observed and what is expected according to an assumed hypothesis is called the chi-square test.

Important

The chi-square test can only be used on data that have the following characteristics:
– The data must be in the form of frequencies
– The frequency data must have a precise numerical value and must be organised into categories or groups
– The total number of observations must be greater than 30
– The expected frequency in any one cell of the table must be greater than 5

Formula

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

χ² = the value of chi square
O = the observed value
E = the expected value
∑ (O – E)² / E = every value of (O – E) squared and divided by its E, then all added together
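The formula translates directly into a few lines of code. The sketch below applies it to a goodness-of-fit check on hypothetical data: 60 rolls of a die, with 10 rolls expected per face if the die is fair. The counts are made up for illustration.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for faces 1 to 6 over 60 rolls of a die.
observed = [8, 12, 10, 9, 11, 10]
expected = [10] * 6          # a fair die: 60 rolls / 6 faces

statistic = chi_square(observed, expected)
print("chi-square:", statistic)
```

The closer the observed counts are to the expected ones, the smaller the statistic; identical counts give zero.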


Construct a table with the information you have observed or obtained.

Observed Frequencies (O)

              Money  Health  Love  Row total
Men              82     446   355        883
Women            46     574   273        893
Column total    128    1020   628       1776

Work out the expected frequency.

$$E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$$

Expected Frequencies (E)

              Money  Health    Love  Row total
Men           63.64  507.13  312.23        883
Women         64.36  512.87  315.77        893
Column total    128    1020     628       1776
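The expected frequencies follow mechanically from the row totals, column totals, and grand total of the observed table. A minimal sketch, using the observed counts from the lecture's table (rows: men, women; columns: money, health, love); `expected_frequencies` is a hypothetical helper name.

```python
# Observed counts: rows are men and women; columns are money, health, love.
observed = [
    [82, 446, 355],
    [46, 574, 273],
]

def expected_frequencies(table):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]

expected = expected_frequencies(observed)
for row in expected:
    print([round(cell, 2) for cell in row])
```

Each row of expected values sums to the same row total as the observed table, which is a quick way to check the arithmetic.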

For each of the cells calculate (O – E)² / E.

              Money  Health  Love
Men            5.30    7.37  5.86
Women          5.24    7.29  5.79

χ² (calculated) = sum of all (O – E)² / E values in the cells.

Here χ² (calculated) = 36.84.

Find χ² (critical) from the table with 2 degrees of freedom and a level of significance of 0.05:

χ² (critical) = 5.99

χ² table

Conclusion

– Compare χ² (calculated) with χ² (critical) obtained from the table.
– If χ² (calculated) is larger than χ² (critical), reject the null hypothesis and accept the alternative.
– Here, since χ² (calculated) is much greater than χ² (critical), we can easily reject the null hypothesis; that is, there is a relationship between gender and the choice made.
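The whole procedure, from observed table to decision, can be sketched end to end. The observed counts and the critical value of 5.99 (2 degrees of freedom, 0.05 significance) are taken from the lecture; the function and variable names are hypothetical, and the critical value is hard-coded rather than looked up, so this is an illustration and not a general-purpose test.

```python
# Observed counts: rows are men and women; columns are money, health, love.
observed = [
    [82, 446, 355],
    [46, 574, 273],
]

def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * column_totals[j] / grand_total  # expected count
            statistic += (o - e) ** 2 / e
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom

CRITICAL_VALUE = 5.99   # from the chi-square table: df = 2, significance 0.05

statistic, df = chi_square_independence(observed)
reject_null = statistic > CRITICAL_VALUE
print(f"chi-square = {statistic:.2f}, df = {df}, reject null: {reject_null}")
```

The degrees of freedom are (rows - 1) × (columns - 1), which is 2 for this 2 × 3 table and matches the row of the chi-square table used in the lecture.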


Any questions?

Thank you.
