Statistics I
Chapter 2: Analysis of univariate data
Contents
1. Representations and graphs
   - Frequency tables.
   - Bar and pie charts, pictograms, histograms, frequency polygons. Other graphs. Lying with graphs.
2. Numerical summary:
   - Central tendency (mean, median, mode)
   - Location (quartiles and percentiles). Box plots.
   - Spread (variance, standard deviation, range, IQR, coefficient of variation)
   - Shape (coefficients of asymmetry and kurtosis)
Recommended reading
- Peña, D., Romo, J. Introducción a la Estadística para las Ciencias Sociales (1997).
  - Chapters 2, 3, 4 and 5.
- Newbold, P. Statistics for Business and Economics (2008).
  - Chapters 1 and 2.
PID: Mini-videos for self-study over the Internet and mobile phones
- Descriptive statistics: one qualitative variable
- Descriptive statistics: one quantitative variable (except Chebyshev's inequality)
Description of qualitative variables
- Sample: 46 professionals of a computer company in the United States.
- Variable: EDUC: education level (1=High School; 2=College; 3=Advanced Degree)
- Variable: MGT: position of responsibility (1=yes; 0=no)
In order to obtain information:
How can primary data be summarized in a more useful way that allows a quick visual interpretation?
Description of qualitative variables: frequency tables and bar charts
Education level    Number of employees    Proportion of employees
High School        14                     0.304
College            19                     0.413
Advanced Degree    13                     0.283
Total              46                     1
Description of qualitative variables: frequency tables and bar charts
Position of responsibility    Number of employees    Proportion of employees
No                            26                     0.565
Yes                           20                     0.435
Total                         46                     1
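As a minimal sketch of how such a table can be produced, the Python snippet below tabulates absolute and relative frequencies from raw observations. The raw list is rebuilt here from the counts for MGT (26 "No", 20 "Yes"), and the function name `frequency_table` is hypothetical. Note that 26/46 rounds to 0.565.

```python
from collections import Counter

def frequency_table(values):
    """Return {class: (absolute frequency, relative frequency)}."""
    counts = Counter(values)
    n = len(values)
    return {cls: (n_i, n_i / n) for cls, n_i in counts.items()}

# Assumption: raw observations rebuilt from the counts in the table
# (the order of the observations does not matter for a frequency table).
mgt = ["No"] * 26 + ["Yes"] * 20

table = frequency_table(mgt)
for cls, (n_i, f_i) in table.items():
    print(f"{cls:<6} {n_i:>3} {f_i:.3f}")

total_f = sum(f_i for _, f_i in table.values())
print(f"{'Total':<6} {len(mgt):>3} {total_f:.3f}")
```

The relative frequencies always add up to 1, which gives a quick check on the table.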
Description of qualitative variables: general outline of a frequency table
Class, $x_i$    Absolute freq., $n_i$    Relative freq., $f_i$
$x_1$           $n_1$                    $f_1 = n_1/n$
$x_2$           $n_2$                    $f_2 = n_2/n$
...             ...                      ...
$x_k$           $n_k$                    $f_k = n_k/n$
Total           $n$                      1
Note:
- $n_i$ = number of $x_i$ in the sample, $f_i = n_i/n$ = (number of $x_i$)/$n$
- $0 \le f_i \le 1$
Description of qualitative variables: the bar chart
- Bars are of the same width and equally spaced, with the heights corresponding to the frequencies
- There are gaps between the bars
- Bars are labeled with class names
Other graphics: the pie chart
- Sample: among the 1100 visitors to the art exhibition "Turner and the Masters" (Prado Museum, June 22 to September 19, 2010), those who bought their tickets online (20.3%). Source: Institute for Tourism Studies
- Variable: main reason for buying the ticket online
Source: Instituto de Estudios Turísticos, "Turner y los Maestros"

Table 7. Visitors by whether they had to wait to enter the exhibition
Filter: bought the ticket at the ticket office
                                                              %
Had to wait                                                   12.1
Did not have to wait                                          87.9
Total                                                         100.0

Table 8. Visitors by activities carried out while waiting to enter the exhibition
Filter: bought the ticket at the ticket office and had to wait between buying the ticket and entering the exhibition
                                                              %
Visiting the Museum collection                                16.6
Visiting or staying in the Museum café                        7.7
Visiting the Museum shop                                      28.1
Staying in or visiting other Museum spaces with no collection 33.0
Waiting outside the Museum                                    27.5

Table 9. Visitors by main reason for buying the ticket online
Filter: bought the ticket online
                                                              %
Convenience                                                   60.5
Speed                                                         10.1
I can choose the day and time of the visit                    14.0
I do not have to wait at the ticket office                    9.5
Because the ticket is cheaper                                 4.3
Because of the 24-hour availability                           1.2
I had heard good things about the service                     0.4
Total                                                         100.0
Other graphics: the pie chart
- Each slice is a fraction of the total size of the pie
- Many software programs rank slices alphabetically
- Although 'pretty', pie charts are harder to read than bar charts
- Avoid 3D pie charts: in those, the area in the background seems to be smaller than the area in the foreground
Other graphics: the Pareto chart
- Bar chart in which the categories of the variable are ranked in order of frequency.
- Applies only to nominal qualitative variables.
- Useful in the detection of the most significant "reasons" (a few options account for almost all the purchasing frequency)
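The ranking behind a Pareto chart can be sketched in a few lines of Python. The percentages are those of Table 9 (main reason for buying the ticket online); the shortened English labels are our own translations, and the cumulative column is added only to show how quickly a few categories account for most of the total.

```python
# Percentages from Table 9; the English labels are translations (assumption).
reasons = {
    "Convenience": 60.5,
    "Speed": 10.1,
    "Can choose day and time": 14.0,
    "No wait at the ticket office": 9.5,
    "Cheaper ticket": 4.3,
    "24-hour availability": 1.2,
    "Had heard good things about the service": 0.4,
}

# Pareto ordering: categories sorted by decreasing frequency.
pareto = sorted(reasons.items(), key=lambda item: item[1], reverse=True)

rows = []
cumulative = 0.0
for reason, pct in pareto:
    cumulative += pct
    rows.append((reason, pct, round(cumulative, 1)))

for reason, pct, cum in rows:
    print(f"{reason:<40} {pct:>5.1f} {cum:>6.1f}")
```

Reordering the bars in this way is only meaningful for nominal variables, since an ordinal variable already has a natural order that should be kept.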
Other graphics: pictograms
- Sample: 70 university students from Madrid
- Variable: Preferred political party
Preferred political party    Number of students    Proportion of students
PSOE                         23                    0.33
PP                           15                    0.21
Unidos Podemos               20                    0.29
Ciudadanos                   7                     0.10
Others                       5                     0.07
Total                        70                    1
The area of the graph is proportional to the frequency.
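The following Python sketch illustrates what "area proportional to frequency" implies for the size of each icon. It assumes that an icon is scaled by the same factor in width and height, so that its area grows with the square of that factor; the choice of the largest class as the reference is also an assumption.

```python
import math

# Absolute frequencies from the table ("Others" is the "Otros" row).
students = {"PSOE": 23, "PP": 15, "Unidos Podemos": 20, "Ciudadanos": 7, "Others": 5}

largest = max(students.values())

# Scaling an icon by s in both directions multiplies its area by s**2,
# so s must be the square root of the frequency ratio.
scale = {party: math.sqrt(n_i / largest) for party, n_i in students.items()}

for party, s in scale.items():
    print(f"{party:<15} linear scale {s:.2f}   relative area {s ** 2:.2f}")
```

Scaling the height directly by the frequency ratio would instead make the small classes look much smaller than they are.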
Exercise
Results from a survey conducted among 15-20 year-olds about their favorite leisure activity
- What is the variable under study and who are the individuals?
- For what percentage of young people is reading the preferred leisure activity?
Exercise
From a test taken by a group of students, graded between 1 and 8, the following table has been obtained (some entries are left blank):
Grade, $x_i$    $n_i$    $f_i$
1               4        0.08
2               4
3                        0.16
4               7        0.14
5               5
6               10
7               7        0.14
8
- How many students have taken the test?
- What percentage of students have obtained a grade greater than or equal to 6?
Exercise
In a survey about health habits, 30 randomly chosen students have been asked about the sport they usually practice. The results are shown in the following table:
Sport, $x_i$    $n_i$    $f_i$
Basketball      12       0.4
Swimming        3        0.1
Football        9        0.3
None            6        0.2
Total           30       1
To which of the following bar charts do the data correspond?
Exercise
[Figure: four candidate bar charts, panels a) to d), each with horizontal axis "Sport" (Basketball, Swimming, Football, No sport) and vertical axis from 0 to 14.]
Description of discrete quantitative variables: the table of frequencies
- Sample: 50 plants treated with the new pesticide
- Variable: The number of leaves attacked by a pest.
$x_i$    Absolute frequency $n_i$    Relative frequency $f_i$    Cumulative absolute frequency $N_i$    Cumulative relative frequency $F_i$
0        6                           0.12                        6                                      0.12
1        10                          0.20                        16                                     0.32
2        12                          0.24                        28                                     0.56
3        8                           0.16                        36                                     0.72
4        5                           0.10                        41                                     0.82
5        4                           0.08                        45                                     0.90
6        3                           0.06                        48                                     0.96
8        1                           0.02                        49                                     0.98
10       1                           0.02                        50                                     1
Total    50                          1
Description of discrete quantitative variables: the table of frequencies
- What percentage of the sampled plants had exactly 4 leaves attacked?
- How many plants had no more than 3 leaves attacked?
- How many plants had at least 5 leaves attacked?
- What percentage of plants had between 3 and 5 leaves attacked?
- What percentage of plants had at least 8 leaves attacked?
- What percentage of plants had at most 3 leaves attacked?
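Questions of this kind can be checked against the frequency table with a short Python sketch. The helper names `count` and `percent` are hypothetical, and "between 3 and 5" is read here as including both end values, which is an assumption.

```python
# Absolute frequencies from the table: leaves attacked -> number of plants.
n_i = {0: 6, 1: 10, 2: 12, 3: 8, 4: 5, 5: 4, 6: 3, 8: 1, 10: 1}
n = sum(n_i.values())  # sample size

def count(condition):
    """Number of plants whose value x satisfies the condition."""
    return sum(freq for x, freq in n_i.items() if condition(x))

def percent(condition):
    """Percentage of plants whose value x satisfies the condition."""
    return 100 * count(condition) / n

exactly_4 = percent(lambda x: x == 4)
at_most_3 = count(lambda x: x <= 3)             # same as N_i at x = 3
at_least_5 = count(lambda x: x >= 5)            # n minus N_i at x = 4
between_3_and_5 = percent(lambda x: 3 <= x <= 5)
at_least_8 = percent(lambda x: x >= 8)
pct_at_most_3 = percent(lambda x: x <= 3)       # same as 100 * F_i at x = 3

print(exactly_4, at_most_3, at_least_5, between_3_and_5, at_least_8, pct_at_most_3)
```

The "at most" questions can be read directly from the cumulative columns, while the "at least" questions are obtained by subtracting a cumulative frequency from the total.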
Description of discrete quantitative variables: the bar chart
Bar charts can also be created for discrete data if there are not too many different values.
Description of discrete quantitative variables: the general format of the table
Class, $x_i$    Absolute freq., $n_i$    Relative freq., $f_i$    Cumulative absolute freq., $N_i$    Cumulative relative freq., $F_i$
$x_1$           $n_1$                    $f_1 = n_1/n$            $N_1 = n_1$                         $F_1 = f_1$
$x_2$           $n_2$                    $f_2 = n_2/n$            $N_2 = N_1 + n_2$                   $F_2 = F_1 + f_2$
...             ...                      ...                      ...                                 ...
$x_k$           $n_k$                    $f_k = n_k/n$            $N_k = n$                           $F_k = 1$
Total           $n$                      1
Note:
- $x_1 < x_2 < \ldots < x_k$
- $n_i$ = number of $x_i$ in the sample, $f_i = n_i/n$ = (number of $x_i$)/$n$
- $N_i = N_{i-1} + n_i$, $F_i = F_{i-1} + f_i$
- $0 \le f_i, F_i \le 1$
- $F_i$ and $N_i$ also make sense for qualitative ordinal variables
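The recurrence for the cumulative columns can be sketched in Python as follows. The function name `cumulative_table` is hypothetical, and the example call reuses the leaves-attacked data from the earlier slide. Here $F_i$ is computed as $N_i/n$, which is equivalent to $F_{i-1} + f_i$ but avoids accumulating rounding errors.

```python
from itertools import accumulate

def cumulative_table(classes, abs_freqs):
    """Rows (x_i, n_i, f_i, N_i, F_i); classes must be in increasing order."""
    n = sum(abs_freqs)
    cum_abs = list(accumulate(abs_freqs))  # N_i = N_{i-1} + n_i
    return [
        (x, n_i, n_i / n, N_i, N_i / n)    # F_i computed as N_i / n
        for x, n_i, N_i in zip(classes, abs_freqs, cum_abs)
    ]

rows = cumulative_table([0, 1, 2, 3, 4, 5, 6, 8, 10],
                        [6, 10, 12, 8, 5, 4, 3, 1, 1])

for x, n_i, f_i, N_i, F_i in rows:
    print(f"{x:>3} {n_i:>3} {f_i:.2f} {N_i:>3} {F_i:.2f}")
```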
Qualitative ordinal variables: cumulative frequencies
We can also include cumulative frequencies in the table.
- Sample: 901 employees.
- Variable: level of satisfaction (VU=very unsatisfied, U=unsatisfied, S=satisfied, VS=very satisfied)
Class    Absolute frequency    Relative frequency    Cumulative absolute frequency    Cumulative relative frequency
VU       62                    0.07                  62                               0.07
U        108                   0.12                  170                              0.19
S        319                   0.35                  489                              0.54
VS       412                   0.46                  901                              1
Total    901                   1
Qualitative ordinal variables: bar charts with cumulative frequencies
Attention! Many software programs rank in alphabetical order when the variable is qualitative. If it is an ordinal variable, it must be ranked in ascending order.
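A small Python sketch of the problem, using the satisfaction data: sorting the class labels alphabetically breaks the ordinal scale, so the order has to be stated explicitly. The list `ORDER` is our own encoding of the scale (an assumption about how one might store it).

```python
# Absolute frequencies from the satisfaction table.
satisfaction = {"S": 319, "U": 108, "VS": 412, "VU": 62}

# Alphabetical order, as many programs would display the classes by default.
alphabetical = sorted(satisfaction)

# Explicit ordinal scale, from least to most satisfied.
ORDER = ["VU", "U", "S", "VS"]
ordered = sorted(satisfaction, key=ORDER.index)

n = sum(satisfaction.values())
running = 0
for level in ordered:
    running += satisfaction[level]  # cumulative absolute frequency
    print(f"{level:<3} {satisfaction[level]:>4} {running:>4} {running / n:.2f}")
```

Cumulative frequencies are only meaningful in the second ordering.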
Bar charts for discrete data
- Sample: 46 professionals of a computer company in the United States.
- Variable: EXPRNC: number of years working in the company
Experience, $x_i$    Absolute freq., $n_i$    Relative freq., $f_i$
1                    5                        0.109
2                    4                        0.087
3                    4                        0.087
4                    4                        0.087
5                    3                        0.065
6                    4                        0.087
7                    1                        0.022
8                    4                        0.087
10                   4                        0.087
11                   2                        0.043
12                   2                        0.043
13                   2                        0.043
14                   1                        0.022
15                   1                        0.022
16                   3                        0.065
17                   1                        0.022
20                   1                        0.022
Total                46                       1
Description of discrete quantitative variables: the bar chart
Too many different values.
Description of continuous quantitative variables
- Sample: 46 professionals of a computer company in the United States.
- Variable: EXPRNC: years of experience
- Variable: SALARY: annual gross income (in US dollars)
Grouping by class intervals: continuous (or discrete) data
Class interval      Midpoint                              $n_i$    $f_i$    $N_i$    $F_i$
$[l_0, l_1]$        $x_1 = \frac{l_0 + l_1}{2}$           $n_1$    $f_1$    $N_1$    $F_1$
$(l_1, l_2]$        $x_2 = \frac{l_1 + l_2}{2}$           $n_2$    $f_2$    $N_2$    $F_2$
...                 ...                                   ...      ...      ...      ...
$(l_{k-1}, l_k]$    $x_k = \frac{l_{k-1} + l_k}{2}$       $n_k$    $f_k$    $n$      1
Total                                                     $n$      1
Note:
- In Excel the left end-point is excluded and the right end-point is included (this is a convention)
- The reverse end-point convention can also be applied: check your software for its definition
- Useful for tabulating discrete data if $X$ takes many values
Grouping by class intervals
- Very often class intervals have the same width
- Determine the width $w$ of each interval by
  $w = \frac{\text{largest number} - \text{smallest number}}{\text{number of desired intervals}}$
- How many intervals? Roughly between 5 and 20. More specifically:
  - $k \approx \sqrt{n}$ if $n$ is 'small'
  - $k \approx 1 + 3.32 \log_{10}(n)$ if $n$ is 'large'
- Intervals never overlap
- Round up the interval width to get desirable interval endpoints
Grouping by class intervals: the histogram and the frequency polygon
- Find range: $20 - 1 = 19$
- Select number of classes: say $k = \sqrt{46} \approx 6.78 \approx 7$
- Compute interval width: $19/7 \approx 2.71 \Rightarrow 3$.
- Determine the end-points (beginning before the smallest value and ending after the largest one): $[0, 3], (3, 6], \ldots, (18, 21]$
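The steps above can be reproduced with a short Python sketch on the EXPRNC data. The starting point 0 is taken from the slide, and the indexing trick used to assign each value to a right-closed interval is our own choice rather than a prescribed method.

```python
import math

# EXPRNC data: years in the company -> number of employees.
exprnc = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4,
          11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}

n = sum(exprnc.values())                 # 46 observations
data_range = max(exprnc) - min(exprnc)   # 20 - 1 = 19
k = round(math.sqrt(n))                  # square-root rule for 'small' n
width = math.ceil(data_range / k)        # round the width up
start = 0                                # assumption: just below the minimum
edges = [start + i * width for i in range(k + 1)]

# Right-closed intervals: [0, 3], (3, 6], ..., (18, 21].
counts = [0] * k
for x, freq in exprnc.items():
    index = max(0, math.ceil((x - start) / width) - 1)
    counts[index] += freq

for left, right, n_i in zip(edges, edges[1:], counts):
    print(f"({left:>2}, {right:>2}]  {n_i:>2}  {n_i / n:.3f}")
```

Rounding the width up rather than down guarantees that the $k$ intervals cover the whole range of the data.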
Description of quantitative variables: the histogram and the frequency polygon
- There are no gaps between the bars/bins
- Bin widths = widths of class intervals (identical); class boundaries are marked on the horizontal axis
- Bin heights = frequencies (here, absolute)
- Bin areas are proportional to the frequencies
Quantitative variables: the histogram
Description of quantitative variables: the histogram and the frequency polygon
Other graphics: cartograms (INE, Encuesta de Turismo de Residentes)
Average travel expenses per person (third quarter, 2016)
Average expenditure on excursions per person (third quarter, 2016)
Other graphics: pictograms
Other graphics: time series
INE, Encuesta de Población Activa
How to lie with pictograms
Published in "La Voz de Galicia" on October 24, 2010.
- Making the height proportional to the frequency gives a false impression, because the eye compares areas rather than heights.
- Is there anything else you don't like?
Lying with graphs
Improper use of scales: the coordinate origin is not 0
Lying with graphs
Lying with graphs
The scale is upside down