8/2/2019 Data Analysis1
1/60
Data Analysis
Kulwant Singh Kapoor
Data Structure
The process of arranging data in groups or classes according to resemblances and similarities is technically called classification.
Types of Classification:
Geographical
Chronological
Qualitative
Quantitative
Geographical Data
In geographical classification, data are classified on the basis of place.
Example: geographical distribution of national income
COUNTRY         INCOME IN US DOLLARS
Canada          7950
USA             7880
West Germany    7510
France          6730
USSR            2800
India           500
Chronological Data
When data are classified on the basis of time, the classification is chronological; such data are also known as a time series.
Example: production of polio vaccine by a company X.
YEAR    No. of Vaccines
2005    12,800
2006    15,600
2007    18,200
2008    16,600
2009    20,000
2010    20,800
Qualitative Data
When data are classified on the basis of descriptive characteristics or attributes.
Examples:
Male/Female
Strongly Agree/Agree/Disagree/Strongly Disagree
Low/Medium/High
Diabetic/Non-Diabetic
Hypertensive/Mildly Hypertensive/Non-Hypertensive
Quantitative Classification
When classification is based on characteristics which are capable of quantitative measurement.
Examples:
Height/Weight
Income/Expenditure
Blood Pressure
Body Temperature
Blood Count
Quantitative Data
Ungrouped: Raw data
Grouped: Discrete data, Continuous data
MEASURES OF CENTRAL TENDENCY
Mean
Median
Mode
Quartile
Percentile
MEAN
The arithmetic mean of a given set of observations is their sum divided by the number of observations. For example, if X_1, X_2, X_3, ..., X_n are the given n observations, then their arithmetic mean, denoted by \bar{X}, is
$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
EXAMPLE 1: MARKS OF 24 STUDENTS
12 43 54 67 87 98 65 43
54 67 89 90 98 76 54 56
54 98 89 78 90 98 99 87
TOTAL 1746
# OF OBSERVATIONS 24
MEAN 72.75
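The arithmetic in Example 1 can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of the slides.

```python
# Marks of the 24 students from Example 1.
marks = [
    12, 43, 54, 67, 87, 98, 65, 43,
    54, 67, 89, 90, 98, 76, 54, 56,
    54, 98, 89, 78, 90, 98, 99, 87,
]

total = sum(marks)   # sum of all observations
n = len(marks)       # number of observations
mean = total / n     # arithmetic mean = sum / count

print(total, n, mean)
```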
Arithmetic Mean for Ungrouped Series
Employee    Income (X)    X-A
1           1000          -
2           1500          -
3           800           -
4           1200          -
5           900           -
For discrete data, the mean is calculated with respect to frequencies. In the case of continuous data, the value of X is taken as the mid-value of the corresponding class.
$$\bar{X} = \frac{f_1 X_1 + f_2 X_2 + \dots + f_n X_n}{f_1 + f_2 + \dots + f_n} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i}$$
EXAMPLE 2: NUMBER OF STUDENTS ABSENT IN A YEAR
X        f      Xf
1        8      8
2        9      18
3        21     63
4        32     128
5        12     60
6        22     132
7        24     168
8        37     296
9        15     135
10       20     200
TOTAL    200    1208
MEAN     6.04
Marks    Students    X-40
X        f           d       f*d
20       8           -       -
30       12          -       -
40       20          -       -
50       10          -       -
60       6           -       -
70       4           -       -
Total    60
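The blank d and f*d columns suggest the short-cut (assumed mean) method. The slides do not state the formula, so the sketch below assumes the usual one, mean = A + (sum of f*d)/N, with A = 40 taken from the "X-40" heading. The variable names are illustrative.

```python
# Marks (X) and number of students (f) from the table above.
marks = [20, 30, 40, 50, 60, 70]
students = [8, 12, 20, 10, 6, 4]

A = 40  # assumed mean, taken from the "X-40" column heading

# d = X - A and f*d for every row of the table.
d = [x - A for x in marks]
fd = [f * di for f, di in zip(students, d)]

N = sum(students)
mean = A + sum(fd) / N  # assumed short-cut formula: A + (sum of f*d) / N

# Direct calculation, for comparison with the short-cut result.
direct_mean = sum(f * x for f, x in zip(students, marks)) / N

print(d, fd, mean, direct_mean)
```

Both routes should agree, which is the point of the method: the deviations from A are smaller numbers and easier to work with by hand.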
EXAMPLE 3: DISTRIBUTION OF NUMBER OF PROCESSED ARTICLES PER DAY PER PERSON
LIMITS     f      X      fX
80-100     7      90     630
100-120    50     110    5500
120-140    80     130    10400
140-160    60     150    9000
160-180    3      170    510
TOTAL      200           26040
MEAN       130.2
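For continuous data the same frequency-weighted mean is used, with each class represented by its mid-value. The sketch below reproduces Example 3 under that convention; the names are illustrative.

```python
# Class limits and frequencies from Example 3.
classes = [(80, 100), (100, 120), (120, 140), (140, 160), (160, 180)]
freqs = [7, 50, 80, 60, 3]

# For continuous data, X is taken as the mid-value of each class.
mids = [(lower + upper) / 2 for lower, upper in classes]

N = sum(freqs)                                        # total frequency
total_fx = sum(f * x for f, x in zip(freqs, mids))    # sum of f*X
mean = total_fx / N

print(mids, N, total_fx, mean)
```

For a discrete distribution such as Example 2, the same code applies with the observed values of X used directly in place of the mid-values.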
Mathematical Properties of Arithmetic Mean
Property 1: The algebraic sum of the deviations of a given set of observations from their arithmetic mean is zero.
Property 2: If the sizes and the means of two component series are known, then the mean of the resultant series obtained by combining the given series can be found.
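Property 1 follows directly from the definition of the mean. A short derivation, writing \bar{X} for the mean of the n observations X_1, ..., X_n:

```latex
\sum_{i=1}^{n} (X_i - \bar{X})
  = \sum_{i=1}^{n} X_i - n\bar{X}
  = n\bar{X} - n\bar{X}
  = 0
```

The middle step uses \sum X_i = n\bar{X}, which is the definition of the mean rearranged.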
Merits and Demerits of Arithmetic Mean
Merits:
i. It is rigidly defined.
ii. It is easy to calculate and understand.
iii. It is based on all the observations.
iv. It is suitable for further mathematical treatment.
v. Of all the averages, the arithmetic mean is affected least by fluctuations of sampling; in other words, it is a stable average.
Demerits:
i. It is affected by extreme observations.
ii. It cannot be used in the case of open-end classes such as "less than 10" and "more than 70".
iii. It cannot be determined by inspection, nor can it be located graphically.
iv. It cannot be used in dealing with qualitative characteristics.
v. It cannot be obtained if a single observation is missing or lost.
vi. For extremely skewed distributions it may not be representative of the distribution, and hence is not a suitable measure of location.
vii. It may lead to wrong conclusions if the details of the data from which it is obtained are not available.
viii. The arithmetic mean may not be one of the values which the variable actually takes; in that case it is termed a fictitious mean.
Mean For Combined Data
If \bar{X}_1 is the mean of n_1 observations and \bar{X}_2 is the mean of n_2 observations, the combined mean is given by
$$\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$$
Example
The mean height of 25 male workers in a factory is 61 inches and the mean height of 35 female workers in the same factory is 58 inches. Find the combined mean height of the 60 workers.
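The slides leave this example unsolved. A minimal sketch of the calculation, using the combined-mean formula from the previous slide; the function name `combined_mean` is illustrative.

```python
def combined_mean(n1, mean1, n2, mean2):
    """Mean of two groups pooled together, from their sizes and means."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


# 25 male workers with mean height 61 in, 35 female workers with mean height 58 in.
result = combined_mean(25, 61, 35, 58)
print(result)  # (25*61 + 35*58) / 60 = 3555 / 60 = 59.25
```

The combined mean lies closer to 58 than to 61 because the second group is larger.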
Median
Median is that value of the variable which divides the group into two equal parts, one part comprising all the values greater than the median and the other all the values less than the median.
Median is only a positional average, i.e., its value depends on the position occupied by a value in the frequency distribution.
Calculation of Median
Case I: Ungrouped data: If the number of observations is odd, the median is the middle value after the observations have been arranged in ascending or descending order of magnitude. If the number of observations is even, the median is taken as the arithmetic mean of the two middle values.
Case II: Discrete distribution: In the case of a frequency distribution where the variable takes the values X_1, X_2, ..., X_n with respective frequencies f_1, f_2, ..., f_n and \sum f_i = N (the total frequency), the median is the size of the (N+1)/2-th item or observation. In this case the use of the cumulative frequency (c.f.) distribution facilitates the calculations.
EXAMPLE 4
MARKS OF 10 STUDENTS ARE 4 7 6 8 9 4 3 2 7 8
IN ORDER 2 3 4 4 6 7 7 8 8 9
MEDIAN 6.5
MARKS OF 11 STUDENTS ARE 4 7 6 8 9 4 3 2 7 8 4
IN ORDER 2 3 4 4 4 6 7 7 8 8 9
MEDIAN 6
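Both halves of Example 4 can be reproduced with a short function. This sketch assumes the usual convention for an even number of observations (the mean of the two middle values), which matches the 6.5 shown above.

```python
def median(values):
    """Median of ungrouped data: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the two middle values


marks_10 = [4, 7, 6, 8, 9, 4, 3, 2, 7, 8]
marks_11 = marks_10 + [4]

print(median(marks_10), median(marks_11))
```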
EXAMPLE
... AND THE NUMBER OF HEADS IS NOTED.
THE EXPERIMENT IS REPEATED 256 TIMES.
# HEADS    FREQUENCY
X          f      CF     Xf
0          1      1      0
1          9      10     9
2          26     36     52
3          59     95     177
4          72     167    288
5          52     219    260
6          29     248    174
7          7      255    49
8          1      256    8
TOTAL      256           1017
N/2        128
MEDIAN     4
MEAN       3.972656
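The cumulative-frequency lookup for a discrete distribution can be sketched as below. The code takes the median to be the first value of X whose cumulative frequency reaches the (N+1)/2-th item; it does not handle the case where the two middle items fall on different values of X, which does not arise here.

```python
from itertools import accumulate

# Number of heads (X) and frequency (f) from the table above.
heads = [0, 1, 2, 3, 4, 5, 6, 7, 8]
freqs = [1, 9, 26, 59, 72, 52, 29, 7, 1]

cum = list(accumulate(freqs))   # cumulative frequencies (CF column)
N = cum[-1]                     # total frequency
position = (N + 1) / 2          # position of the median item

# First X whose cumulative frequency reaches the median position.
median = next(x for x, cf in zip(heads, cum) if cf >= position)

mean = sum(x * f for x, f in zip(heads, freqs)) / N

print(cum, N, median, mean)
```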
Case III: Continuous distribution:
Compute the cumulative frequency (cf).
Find N/2.
Find the cf just greater than N/2.
The corresponding class contains the median value and is called the median class.
$$\text{Median} = l + \frac{h}{f}\left(\frac{N}{2} - C\right)$$
where
l is the lower limit of the median class,
f is the frequency of the median class,
h is the magnitude (width) of the median class,
N is the total frequency,
C is the cf of the class preceding the median class.
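As an illustration, the steps can be applied to the data of Example 3 (classes 80-100, 100-120, 120-140, 140-160, 160-180 with frequencies 7, 50, 80, 60, 3). This worked calculation is not in the slides and is offered only as a check on the formula.

```latex
N = 200, \qquad \frac{N}{2} = 100, \qquad
\text{cf} = 7,\ 57,\ 137,\ 197,\ 200
\;\Rightarrow\; \text{median class is } 120\text{-}140

l = 120, \qquad h = 20, \qquad f = 80, \qquad C = 57

\text{Median} = 120 + \frac{20}{80}\,(100 - 57) = 120 + 10.75 = 130.75
```

The result is close to the mean of 130.2 found for the same data, as expected for a roughly symmetric distribution.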
Merits And Demerits
Merits:
i. It is rigidly defined.
ii. It is easy to understand and calculate, even for a non-medical person.
iii. It is not affected by extreme observations and as such is very useful in the case of skewed distributions.
iv. It can be computed for distributions with open-end classes.
v. It can sometimes be located by simple inspection and can also be computed graphically.
vi. It is the only average that can be used when dealing with qualitative characteristics which cannot be measured quantitatively but can still be arranged in ascending or descending order of magnitude.
Demerits:
i. In the case of an even number of observations of ungrouped data, it cannot be determined exactly.
ii. It is not based on each and every item of the distribution.
iii. It is not suitable for further mathematical treatment.
iv. It is relatively less stable than the mean, particularly for small samples.
Quartile
The values which divide the given data into four equal parts are known as quartiles. Therefore, there will be only three such points, Q1, Q2 and Q3, such that Q1 ≤ Q2 ≤ Q3, termed the three quartiles.
Q1, known as the lower or first quartile, is the value which has 25% of the items of the distribution below it and consequently 75% of the items greater than it. Q2, the second quartile, coincides with the median and has an equal number of observations above and below it. Q3, the upper or third quartile, has 75% of the observations below it and consequently 25% of the observations above it.
$$Q_1 = l + \frac{h}{f}\left(\frac{N}{4} - C\right)$$
$$Q_3 = l + \frac{h}{f}\left(\frac{3N}{4} - C\right)$$
Percentile
Percentiles are the values which divide the series into 100 equal parts. So, there are 99 percentiles P1, P2, ..., P99 such that P1 ≤ P2 ≤ ... ≤ P99. The i-th percentile value is:
$$P_i = l + \frac{h}{f}\left(\frac{iN}{100} - C\right)$$
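The median, quartile and percentile formulas share one pattern: locate the class whose cumulative frequency first reaches the target position, then interpolate within it. The sketch below assumes that l, h, f and C always refer to that class, as in the median case. The function name and the sample distribution (ten classes of width 10) are illustrative.

```python
def grouped_percentile(classes, freqs, i):
    """i-th percentile (0 < i < 100) of a continuous frequency distribution."""
    N = sum(freqs)
    target = i * N / 100          # position iN/100 in the cumulative frequency
    C = 0                         # cumulative frequency of the preceding classes
    for (lower, upper), f in zip(classes, freqs):
        if C + f >= target:       # this class contains the percentile
            h = upper - lower     # class width
            return lower + (h / f) * (target - C)
        C += f
    raise ValueError("i must lie strictly between 0 and 100")


classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60),
           (60, 70), (70, 80), (80, 90), (90, 100), (100, 110)]
freqs = [4, 6, 5, 10, 20, 22, 21, 6, 2, 1]

q1 = grouped_percentile(classes, freqs, 25)      # first quartile, P25
median = grouped_percentile(classes, freqs, 50)  # median, P50
q3 = grouped_percentile(classes, freqs, 75)      # third quartile, P75
print(q1, median, q3)
```

Setting i to 25, 50 and 75 recovers Q1, the median and Q3 respectively, so a single routine covers all three formulas.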
MODE
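Mode is the value which has the greatest frequency density.
The mode for a continuous distribution is given by
$$\text{Mode} = l + \frac{h\,(f_1 - f_0)}{(f_1 - f_0) - (f_2 - f_1)}$$
where l is the lower limit of the modal class, h is its width, f_1 is its frequency, and f_0 and f_2 are the frequencies of the classes preceding and succeeding it.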
EXAMPLE 7
CLASS      f      x      xf
10-20      4      15     60
20-30      6      25     150
30-40      5      35     175
40-50      10     45     450
50-60      20     55     1100
60-70      22     65     1430
70-80      21     75     1575
80-90      6      85     510
90-100     2      95     190
100-110    1      105    105
TOTAL      97            5745
MEAN       59.2268
l = 60, h = 10, f1 = 22, f0 = 20, f2 = 21
MODE       66.6666667
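The mode in Example 7 can be reproduced as follows. The sketch picks the modal class as the class with the highest frequency, which is valid here because all classes have the same width; the names are illustrative.

```python
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60),
           (60, 70), (70, 80), (80, 90), (90, 100), (100, 110)]
freqs = [4, 6, 5, 10, 20, 22, 21, 6, 2, 1]

# Modal class: the class with the greatest frequency (equal widths assumed).
k = freqs.index(max(freqs))
l, upper = classes[k]
h = upper - l

f1 = freqs[k]                                    # frequency of the modal class
f0 = freqs[k - 1] if k > 0 else 0                # frequency of the preceding class
f2 = freqs[k + 1] if k < len(freqs) - 1 else 0   # frequency of the succeeding class

mode = l + h * (f1 - f0) / ((f1 - f0) - (f2 - f1))
print(l, h, f0, f1, f2, mode)
```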
Measures of Dispersion
Range
Quartile deviation
Mean Deviation
Variance
Standard deviation
RANGE
$$\text{Range} = X_{\max} - X_{\min}$$
Range is the difference between the two extreme observations of the distribution, that is, the difference between the greatest (maximum) and the smallest (minimum) observation.
It is the simplest but a crude measure of dispersion. It is rigidly defined, readily comprehensible and the easiest to compute, requiring very little calculation.
EXAMPLE
MARKS OF STUDENTS
ROLL NO.    MARKS    SORTED
123         98       52
125         95       56
126         96       56
127         87       66
128         56       78
134         52       87
135         89       89
136         78       95
137         56       96
138         66       98
RANGE = 98 - 52 = 46
Merits and Demerits of Range
It is not based on the entire set of data.
Its value varies very widely from sample to sample.
If Xmax and Xmin remain unaltered and all the other values are replaced by another set of observations, the range of the distribution remains the same.
It cannot be used when dealing with open-end classes.
It is not suitable for mathematical treatment.
It is very sensitive to the size of the sample.
It is too indefinite to be used as a practical measure of dispersion.
QUARTILE DEVIATION
$$\text{Quartile Deviation} = \frac{Q_3 - Q_1}{2}$$
It is a measure of dispersion based on the upper quartile Q3 and the lower quartile Q1.
Inter-quartile Range = Q3 - Q1
Quartile Deviation is obtained by dividing the inter-quartile range by 2.
Merits and Demerits of Quartile Deviation
Merits:
It is quite easy to understand and calculate.
It makes use of 50% of the data and as such is a better measure than the range.
As it ignores 25% of the data from the beginning and 25% from the top end, it is not affected at all by extreme observations.
It can be computed from a frequency distribution with open-end classes.
Demerits:
It is not based on all observations.
It is affected considerably by fluctuations of sampling.
It is not suitable for further mathematical treatment.
EXAMPLE
DISTRIBUTION OF MONTHLY EARNING
MONTH    EARNING
1        10239
2        10250
3        10251
4        10251
5        10257
6        10258
7        10260
8        10261
9        10262
10       10262
11       10273
12       10275
Q1                    10251
Q3                    10262
QUARTILE DEVIATION    5.5
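The slides do not say which rule was used to locate Q1 and Q3 in ungrouped data. The sketch below assumes the (N+1)/4 position rule with linear interpolation, which is what Python's `statistics.quantiles` does with its default "exclusive" method; it reproduces the values shown above.

```python
import statistics

# Monthly earnings from the table above (already in ascending order).
earnings = [10239, 10250, 10251, 10251, 10257, 10258,
            10260, 10261, 10262, 10262, 10273, 10275]

# n=4 returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(earnings, n=4)

iqr = q3 - q1                  # inter-quartile range
quartile_deviation = iqr / 2   # half the inter-quartile range

print(q1, q3, iqr, quartile_deviation)
```

Other quartile conventions can give slightly different cut points on small samples, so results from other software may not match to the last digit.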
MEAN DEVIATION
Average or mean deviation is the average amount of scatter of the items in a distribution from either the mean or the median, ignoring the signs of the deviations. The average that is taken of the scatter is an arithmetic mean, which accounts for the fact that this measure is often called the mean deviation.
For ungrouped data:
$$\text{Mean Deviation} = \frac{1}{n}\sum_{i=1}^{n} \left|X_i - \bar{X}\right|$$
For grouped data:
$$\text{Mean Deviation} = \frac{1}{N}\sum_{i} f_i \left|X_i - \bar{X}\right|$$
EXAMPLE
DISTRIBUTION OF SERIES OF DAILY RENTS
HOUSE    RENT (X)    |X - MEAN|
1        3000        1819
2        3000        1819
3        3000        1819
4        3750        1069
5        4000        819.4
6        4000        819.4
7        4000        819.4
8        4500        319.4
9        4750        69.44
10       5000        180.6
11       5000        180.6
12       5000        180.6
13       5250        430.6
14       5250        430.6
15       5500        680.6
16       6250        1431
17       6500        1681
18       9000        4181
TOTAL    86750       18750
MEAN     4819.4
EXAMPLE
DISTRIBUTION OF HEIGHTS OF STUDENTS
HEIGHT    # OF STUDENTS
X         f      fX       f|X - MEAN|
158       15     2370     49.1667
159       20     3180     45.5556
160       32     5120     40.8889
161       35     5635     9.72222
162       33     5346     23.8333
163       21     3423     36.1667
164       10     1640     27.2222
165       8      1320     29.7778
166       6      996      28.3333
TOTAL     180    29030    290.667
MEAN      161.278
MD        1.61481
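Both mean-deviation examples can be checked with one function, since ungrouped data is the special case in which every frequency equals 1. This is a sketch with illustrative names; deviations are taken from the mean, as in the tables.

```python
def mean_deviation(values, freqs=None):
    """Mean absolute deviation about the mean; freqs defaults to 1 per value."""
    if freqs is None:
        freqs = [1] * len(values)
    N = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / N
    return sum(f * abs(x - mean) for x, f in zip(values, freqs)) / N


# Grouped data: heights of students and their frequencies.
heights = [158, 159, 160, 161, 162, 163, 164, 165, 166]
students = [15, 20, 32, 35, 33, 21, 10, 8, 6]
md_heights = mean_deviation(heights, students)

# Ungrouped data: the 18 rents from the earlier example.
rents = [3000, 3000, 3000, 3750, 4000, 4000, 4000, 4500, 4750,
         5000, 5000, 5000, 5250, 5250, 5500, 6250, 6500, 9000]
md_rents = mean_deviation(rents)

print(md_heights, md_rents)
```

For the rents, the result corresponds to the tabulated total of absolute deviations, 18750, divided by the 18 observations.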
STANDARD DEVIATION
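It is defined as the positive square root of the mean of the squares of the deviations of the given observations from their mean.
For ungrouped data:
$$\text{Standard Deviation} = \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2}$$
For grouped data:
$$\text{Standard Deviation} = \sigma = \sqrt{\frac{1}{N}\sum_{i} f_i \left(X_i - \bar{X}\right)^2}$$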
VARIANCE
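It is the square of the standard deviation and is denoted by \sigma^2.
For ungrouped data:
$$\text{Variance} = \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2$$
For grouped data:
$$\text{Variance} = \sigma^2 = \frac{1}{N}\sum_{i} f_i \left(X_i - \bar{X}\right)^2$$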
PROPERTIES OF STANDARD DEVIATION
PROPERTY 1
SD is independent of change of origin but not of scale.
PROPERTY 2
SD is the minimum value of the root mean square deviation; the minimum is attained when the deviations are taken from the mean.
PROPERTY 3
SD is suitable for further mathematical treatment.
PROPERTY 4
SD < Range
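Property 1 can be seen from a short derivation. Suppose every observation is transformed as Y_i = a X_i + b, where b shifts the origin and a changes the scale; the symbols a, b and Y are introduced here for illustration only.

```latex
\bar{Y} = a\bar{X} + b
\quad\Rightarrow\quad
Y_i - \bar{Y} = a\,(X_i - \bar{X})

\sigma_Y
  = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})^2}
  = \sqrt{\frac{a^2}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2}
  = |a|\,\sigma_X
```

The shift b cancels out, so SD does not depend on the origin, while the factor |a| shows that it does depend on the scale.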
MERITS AND DEMERITS OF SD
It is the most important and widely used measure of dispersion.
It is based on all the observations.
The squaring of the deviations removes the drawback of ignoring the signs of the deviations, as is done in computing the mean deviation.
It is affected least by fluctuations of sampling.
EXAMPLE
X           (X-MEAN)^2
12          13.69
15          0.49
24          68.89
12          13.69
13          7.29
15          0.49
14          2.89
12          13.69
16          0.09
24          68.89
TOTAL       157       190.1
MEAN        15.7
VARIANCE    19.01
SD          4.36
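This example divides by n, so it corresponds to the population variance. A minimal check using Python's standard `statistics` module, where `pvariance` and `pstdev` are the divide-by-n versions:

```python
import statistics

data = [12, 15, 24, 12, 13, 15, 14, 12, 16, 24]

mean = statistics.mean(data)
variance = statistics.pvariance(data)   # mean of squared deviations, divisor n
sd = statistics.pstdev(data)            # positive square root of the variance

print(mean, variance, round(sd, 2))
```

`statistics.variance` and `statistics.stdev` divide by n - 1 instead and would give slightly larger values for the same data.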
EXAMPLE
# LETTERS IN WORD    FREQUENCY           X-MEAN
X           f      fX     d         fd^2
1           3      3      -3.277    32.208
2           8      16     -2.277    41.463
3           9      27     -1.277    14.667
4           10     40     -0.277    0.765
5           5      25     0.723     2.617
6           4      24     1.723     11.880
7           3      21     2.723     22.251
8           1      8      3.723     13.864
9           3      27     4.723     66.932
10          1      10     5.723     32.757
TOTAL       47     201              239.404
MEAN        4.277
VARIANCE    5.094
EXAMPLE
LIMITS      f     BOUNDARIES    x       xf        d^2        fd^2
30-39       1     29.5-39.5     34.5    34.5      1128.96    1128.96
40-49       4     39.5-49.5     44.5    178       556.96     2227.84
50-59       14    49.5-59.5     54.5    763       184.96     2589.44
60-69       20    59.5-69.5     64.5    1290      12.96      259.2
70-79       22    69.5-79.5     74.5    1639      40.96      901.12
80-89       12    79.5-89.5     84.5    1014      268.96     3227.52
90-99       2     89.5-99.5     94.5    189       696.96     1393.92
TOTAL       75                          5107.5               11728
MEAN        68.1
VARIANCE    156
SD          12.5
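The grouped-data variance can be reproduced from the class limits and frequencies alone. The sketch takes each class mid-value as the average of its limits, which equals the mid-point of the class boundaries; the names are illustrative.

```python
from math import sqrt

# Class limits and frequencies from the table above.
limits = [(30, 39), (40, 49), (50, 59), (60, 69), (70, 79), (80, 89), (90, 99)]
freqs = [1, 4, 14, 20, 22, 12, 2]

mids = [(lower + upper) / 2 for lower, upper in limits]   # 34.5, 44.5, ...

N = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, mids)) / N

# Sum of f * d^2 with d = x - mean, then divide by the total frequency.
sum_fd2 = sum(f * (x - mean) ** 2 for f, x in zip(freqs, mids))
variance = sum_fd2 / N
sd = sqrt(variance)

print(N, mean, round(sum_fd2, 2), round(variance, 2), round(sd, 1))
```

The unrounded variance is 11728 / 75, a little above the 156 shown in the table.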
CORRELATION
When the relationship is of a quantitative nature, the appropriate statistical tool for discovering and measuring the relationship and expressing it in a brief formula is known as correlation.
It is defined as an analysis of the co-variation between two or more variables.
Types of Correlation
a) Positive and negative correlation
b) Linear and non-linear correlation
METHODS OF STUDYING CORRELATION
1. Scatter diagram
2. Karl Pearson's coefficient of correlation
3. Bi-variate correlation method
4. Rank correlation
Scatter Diagram
Karl Pearson's Coefficient of Correlation
It is a numerical measure of the linear relationship between two variables X and Y, and is defined as the ratio of the covariance between X and Y to the product of their standard deviations:
$$r = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y}$$
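$$r = \frac{\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\frac{1}{n}\sum (x - \bar{x})^2}\,\sqrt{\frac{1}{n}\sum (y - \bar{y})^2}}$$
$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$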
EXAMPLE
ADVERTISING EXPENSES (x) AND SALES (y)
x      y      x-mx    y-my    dx^2    dy^2    dxdy
39     47     -26     -19     676     361     494
65     53     0       -13     0       169     0
62     58     -3      -8      9       64      24
90     86     25      20      625     400     500
82     62     17      -4      289     16      -68
75     68     10      2       100     4       20
25     60     -40     -6      1600    36      240
98     91     33      25      1089    625     825
36     51     -29     -15     841     225     435
78     84     13      18      169     324     234
650    660    0       0       5398    2224    2704
mx = 65
my = 66
r = 0.78
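The table can be verified by computing r with both forms of the formula given earlier, the deviation form and the raw-sums form. This is a sketch with illustrative variable names.

```python
from math import sqrt

x = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]   # advertising expenses
y = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]   # sales
n = len(x)

mx = sum(x) / n
my = sum(y) / n

# Deviation form: sums of dx*dy, dx^2 and dy^2 (the 1/n factors cancel).
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / sqrt(sxx * syy)

# Raw-sums form: no means needed.
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
r_raw = (n * sum_xy - sum(x) * sum(y)) / sqrt(
    (n * sum_x2 - sum(x) ** 2) * (n * sum_y2 - sum(y) ** 2)
)

print(mx, my, sxy, sxx, syy, round(r, 2), round(r_raw, 2))
```

Both forms should give the same value of r; the raw-sums form is convenient when the means are not whole numbers.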