
8/2/2019 Data Analysis1


Data Analysis

Kulwant Singh Kapoor


Data Structure

The process of arranging data in groups or classes according to resemblances and similarities is technically called classification.

Types of Classification:

Geographical

Chronological

Qualitative

Quantitative


Geographical Data

In geographical classification, data are classified on the basis of place.

Example: geographical distribution of national income

COUNTRY         INCOME IN US DOLLARS
Canada          7950
USA             7880
West Germany    7510
France          6730
USSR            2800
India           500


Chronological Data

When data are classified on the basis of time, the classification is chronological; such data are also known as a time series.

Example: production of polio vaccine by a company X.

YEAR    No. of Vaccines
2005    12,800
2006    15,600
2007    18,200
2008    16,600
2009    20,000
2010    20,800


Qualitative Data

When data are classified on the basis of descriptive characteristics or attributes.

Examples:

Male/Female

Strongly agree/Agree/Disagree/Strongly Disagree

Low/Medium/High

Diabetic/Non-Diabetic

Hypertensive/Mildly Hypertensive/Non-Hypertensive


Quantitative Classification

When classification is based on characteristics which are capable of quantitative measurement.

Example:

Height/Weight

Income/Expenditure

Blood Pressure

Body Temperature

Blood Count


Quantitative Data

Ungrouped: raw data

Grouped: discrete data, continuous data


MEASURES OF CENTRAL TENDENCY

Mean

Median

Mode

Quartile

Percentile


MEAN

The arithmetic mean of a given set of observations is their sum divided by the number of observations. For example, if X1, X2, X3, ..., Xn are the given n observations, then their arithmetic mean, denoted by $\bar{X}$, is

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$$


EXAMPLE 1: MARKS OF 24 STUDENTS

12 43 54 67 87 98 65 43

54 67 89 90 98 76 54 56

54 98 89 78 90 98 99 87

TOTAL 1746

# OF OBSERVATIONS 24

MEAN 72.75
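The arithmetic in Example 1 can be checked with a few lines of Python. This is a minimal sketch; the variable names are illustrative and are not part of the original slides.

```python
# Marks of the 24 students from Example 1.
marks = [
    12, 43, 54, 67, 87, 98, 65, 43,
    54, 67, 89, 90, 98, 76, 54, 56,
    54, 98, 89, 78, 90, 98, 99, 87,
]

total = sum(marks)   # sum of all observations
n = len(marks)       # number of observations
mean = total / n     # arithmetic mean = sum / count

print(total, n, mean)  # 1746 24 72.75
```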


Arithmetic Mean for Un-Grouped Series

Employee    Income    X-A
1           1000      -
2           1500      -
3           800       -
4           1200      -
5           900       -


For discrete data, the mean is calculated with respect to frequencies. In the case of continuous data, the value of X is taken as the mid value of the corresponding class.

$$\bar{X} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_n X_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i}$$


EXAMPLE 2: NUMBER OF STUDENTS ABSENT IN A YEAR

X        f      Xf
1        8      8
2        9      18
3        21     63
4        32     128
5        12     60
6        22     132
7        24     168
8        37     296
9        15     135
10       20     200
TOTAL    200    1208

MEAN 6.04
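The frequency-weighted mean of Example 2 can be reproduced as follows. This is a sketch only; the function name `frequency_mean` and the (X, f) pair layout are assumptions made for illustration.

```python
# (X, f) pairs from Example 2: value of X and its frequency f.
absences = [(1, 8), (2, 9), (3, 21), (4, 32), (5, 12),
            (6, 22), (7, 24), (8, 37), (9, 15), (10, 20)]


def frequency_mean(pairs):
    """Return sum(f * x) / sum(f) for an iterable of (x, f) pairs."""
    total_f = sum(f for _, f in pairs)        # N = total frequency
    total_fx = sum(x * f for x, f in pairs)   # sum of f * X
    return total_fx / total_f


print(frequency_mean(absences))  # 6.04
```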


Marks    Students    X-40
X        f           d       f*d
20       8           -       -
30       12          -       -
40       20          -       -
50       10          -       -
60       6           -       -
70       4           -       -
Total    60


EXAMPLE 3: DISTRIBUTION OF NUMBER OF PROCESSED ARTICLES PER DAY PER PERSON

LIMITS     f      X      fX
80-100     7      90     630
100-120    50     110    5500
120-140    80     130    10400
140-160    60     150    9000
160-180    3      170    510
TOTAL      200           26040

MEAN 130.2
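For a continuous distribution, each class is represented by its mid value. A minimal sketch of the Example 3 calculation, assuming classes are stored as hypothetical (lower, upper, f) tuples:

```python
# (lower limit, upper limit, frequency) for each class in Example 3.
classes = [(80, 100, 7), (100, 120, 50), (120, 140, 80),
           (140, 160, 60), (160, 180, 3)]


def grouped_mean(rows):
    """Mean of grouped data, using the class mid value as X."""
    total_f = 0
    total_fx = 0.0
    for lower, upper, f in rows:
        mid = (lower + upper) / 2   # mid value of the class
        total_f += f
        total_fx += f * mid
    return total_fx / total_f


print(grouped_mean(classes))  # 130.2
```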


Mathematical Properties of Arithmetic Mean

Property 1: The algebraic sum of the deviations of the given set of observations from their arithmetic mean is zero.

Property 2: If the sizes and the means of two component series are known, then the mean of the resultant series obtained on combining the given series can be found.


Merits and demerits of Arithmetic Mean

Merits:

i. It is rigidly defined.

ii. It is easy to calculate and understand.

iii. It is based on all the observations.

iv. It is suitable for further mathematical treatment.

v. Of all the averages, the arithmetic mean is affected least by fluctuations of sampling, i.e. the arithmetic mean is a stable average.


Demerits:

i. It is affected by extreme observations.

ii. It cannot be used in the case of open end classes such as less than 10 and more than 70, etc.

iii. It cannot be determined by inspection, nor can it be located graphically.

iv. It cannot be used in dealing with qualitative characteristics.

v. It cannot be obtained if a single observation is missing or lost.

vi. In an extremely skewed distribution it is not representative of the distribution and hence is not a suitable measure of location.

vii. It may lead to a wrong conclusion if the details of the data from which it is obtained are not available.

viii. The arithmetic mean may not be one of the values which the variable actually takes, and is then termed a fictitious mean.


Mean For Combined Data

If $\bar{X}_1$ is the mean for $n_1$ observations and $\bar{X}_2$ is the mean for $n_2$ observations, the combined mean is given by

$$\bar{X} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2}$$


Example

The mean height of 25 male workers in a factory is 61 inches and the mean height of 35 female workers in the same factory is 58 inches. Find the combined mean of the 60 workers.
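One way to work this example is to weight each group mean by its group size. The sketch below assumes a hypothetical helper `combined_mean`; the slides pose the question without stating the answer, so the printed value is the result of this calculation rather than a figure from the source.

```python
def combined_mean(n1, mean1, n2, mean2):
    """Combined mean of two groups: (n1*mean1 + n2*mean2) / (n1 + n2)."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


# 25 male workers with mean 61 inches, 35 female workers with mean 58 inches.
print(combined_mean(25, 61, 35, 58))  # 59.25
```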


Median

Median is that value of the variable which divides the group into two equal parts, one part comprising all the values greater than the median and the other all the values less than the median.

Median is only a positional average, i.e. its value depends on the position occupied by a value in the frequency distribution.


Calculation of Median

Case I: Ungrouped data: If the number of observations is odd, then the median is the middle value after the observations have been arranged in ascending or descending order of magnitude. If the number of observations is even, the median is taken as the mean of the two middle values.

Case II: Discrete distribution: In the case of a frequency distribution where the variable takes the values X1, X2, ..., Xn with respective frequencies f1, f2, ..., fn and Σf = N, the total frequency, the median is the size of the (N+1)/2th item or observation. In this case the use of the cumulative frequency (c.f.) distribution facilitates the calculations.


EXAMPLE 4

MARKS OF 10 STUDENTS ARE    4 7 6 8 9 4 3 2 7 8

IN ORDER                    2 3 4 4 6 7 7 8 8 9

MEDIAN                      6.5

MARKS OF 11 STUDENTS ARE    4 7 6 8 9 4 3 2 7 8 4

IN ORDER                    2 3 4 4 4 6 7 7 8 8 9

MEDIAN                      6
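Both halves of Example 4 can be reproduced with a short function. This is a sketch; averaging the two middle values for an even count is the convention that yields the 6.5 shown above, and the function name is illustrative.

```python
def median(values):
    """Median of ungrouped data: middle value, or mean of the two middle values."""
    ordered = sorted(values)   # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:             # odd count: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count


print(median([4, 7, 6, 8, 9, 4, 3, 2, 7, 8]))     # 6.5
print(median([4, 7, 6, 8, 9, 4, 3, 2, 7, 8, 4]))  # 6
```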


... AND THE NUMBER OF HEADS IS NOTED

THE EXPERIMENT IS REPEATED 256 TIMES

# HEADS    FREQUENCY
X          f      CF     xf
0          1      1      0
1          9      10     9
2          26     36     52
3          59     95     177
4          72     167    288
5          52     219    260
6          29     248    174
7          7      255    49
8          1      256    8
N/2               128    1017

MEDIAN 4

MEAN 3.972656
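The cumulative-frequency lookup in this table can be sketched in code. The function below follows the rule used in the table (take the value whose cumulative frequency first exceeds N/2); the name `discrete_median` and the data layout are assumptions for illustration.

```python
# (number of heads X, frequency f) from the table above.
heads = [(0, 1), (1, 9), (2, 26), (3, 59), (4, 72),
         (5, 52), (6, 29), (7, 7), (8, 1)]


def discrete_median(pairs):
    """Value whose cumulative frequency is the first to exceed N/2."""
    total = sum(f for _, f in pairs)
    cumulative = 0
    for x, f in sorted(pairs):
        cumulative += f
        if cumulative > total / 2:
            return x
    raise ValueError("empty distribution")


total_f = sum(f for _, f in heads)
mean = sum(x * f for x, f in heads) / total_f

print(discrete_median(heads))  # 4
print(mean)                    # 3.97265625
```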


Case III: Continuous distribution:

Compute the cumulative frequency (cf).

Find N/2.

Find the cf just greater than N/2.

The corresponding class contains the median value and is called the median class.

$$\text{Median} = l + \frac{h}{f}\left(\frac{N}{2} - C\right)$$

where
l is the lower limit of the median class
f is the frequency of the median class
h is the magnitude (width) of the median class
N is the total frequency
C is the cf of the class preceding the median class
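The steps above can be sketched as a small function. As an illustration it is applied to the Example 3 distribution; the slides do not compute a median for that table, so the printed value is only the output of this sketch, and the (lower, upper, f) layout is an assumption.

```python
def grouped_median(rows):
    """Median = l + (h / f) * (N/2 - C) for classes given in ascending order."""
    total = sum(f for _, _, f in rows)   # N
    half = total / 2                     # N/2
    cum_before = 0                       # C: cf of the preceding classes
    for lower, upper, f in rows:
        if cum_before + f > half:        # first cf greater than N/2
            h = upper - lower            # width of the median class
            return lower + h * (half - cum_before) / f
        cum_before += f
    raise ValueError("empty distribution")


# Example 3 classes as (lower limit, upper limit, frequency).
classes = [(80, 100, 7), (100, 120, 50), (120, 140, 80),
           (140, 160, 60), (160, 180, 3)]

print(grouped_median(classes))  # 130.75
```

Here N/2 = 100, the median class is 120-140 (cf 137), and C = 57.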




Merits And Demerits

Merits:

i. It is rigidly defined.

ii. It is easy to understand and calculate for a non-medical person.

iii. It is not affected by extreme observations and as such is very useful in the case of skewed distributions.

iv. It can be computed when dealing with a distribution with open end classes.

v. It can sometimes be located by simple inspection and can also be computed graphically.

vi. It is the only average to be used while dealing with qualitative characteristics which cannot be measured quantitatively but can still be arranged in ascending or descending order of magnitude.


Demerits:

i. In the case of an even number of observations of ungrouped data it cannot be determined exactly.

ii. It is not based on each and every item of the distribution.

iii. It is not suitable for further mathematical treatment.

iv. It is relatively less stable than the mean, particularly for small samples.




Quartile

The values which divide the given data into four equal parts are known as quartiles. Therefore, there will be only three such points Q1, Q2 and Q3, such that Q1 ≤ Q2 ≤ Q3, termed the three quartiles.

Q1, known as the lower or first quartile, is the value which has 25% of the items of the distribution below it and consequently 75% of the items greater than it. Q2, the second quartile, coincides with the median and has an equal number of observations above and below it. Q3, the upper or third quartile, has 75% of the observations below it and consequently 25% of the observations above it.


$$Q_1 = l + \frac{h}{f}\left(\frac{N}{4} - C\right)$$

$$Q_3 = l + \frac{h}{f}\left(\frac{3N}{4} - C\right)$$


Percentile

Percentiles are the values which divide the series into 100 equal parts. So, there are 99 percentiles P1, P2, ..., P99 such that P1 ≤ P2 ≤ ... ≤ P99. The ith percentile value is:

$$P_i = l + \frac{h}{f}\left(\frac{iN}{100} - C\right)$$
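Quartiles and percentiles of grouped data share one pattern: locate the class whose cumulative frequency first exceeds iN/k and interpolate inside it. The sketch below assumes a hypothetical helper `partition_value(rows, i, parts)` and reuses the Example 3 distribution purely as illustration; the printed values are outputs of this sketch, not figures from the slides.

```python
def partition_value(rows, i, parts):
    """i-th of `parts` partition values: l + (h / f) * (i*N/parts - C).

    rows holds (lower, upper, f) classes in ascending order; 0 < i < parts.
    """
    total = sum(f for _, _, f in rows)   # N
    target = i * total / parts           # iN/4 for quartiles, iN/100 for percentiles
    cum_before = 0                       # C
    for lower, upper, f in rows:
        if cum_before + f > target:
            return lower + (upper - lower) * (target - cum_before) / f
        cum_before += f
    raise ValueError("i must satisfy 0 < i < parts")


classes = [(80, 100, 7), (100, 120, 50), (120, 140, 80),
           (140, 160, 60), (160, 180, 3)]

print(round(partition_value(classes, 1, 4), 2))     # Q1  -> 117.2
print(round(partition_value(classes, 3, 4), 2))     # Q3  -> 144.33
print(round(partition_value(classes, 90, 100), 2))  # P90 -> 154.33
```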


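MODE

Mode is the value which has the greatest frequency density.

Mode for a continuous distribution is given by

$$\text{Mode} = l + \frac{h\,(f_1 - f_0)}{(f_1 - f_0) - (f_2 - f_1)}$$

where l is the lower limit of the modal class, h is its magnitude (width), f1 is the frequency of the modal class, and f0 and f2 are the frequencies of the classes preceding and succeeding it.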


EXAMPLE 7

CLASS      f     x      xf
10-20      4     15     60
20-30      6     25     150
30-40      5     35     175
40-50      10    45     450
50-60      20    55     1100
60-70      22    65     1430
70-80      21    75     1575
80-90      6     85     510
90-100     2     95     190
100-110    1     105    105
TOTAL      97           5745

f1 = 22, f0 = 20, f2 = 21, l = 60, h = 10

MEAN = 59.2268

MODE = 66.6666667
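The mode in Example 7 can be reproduced as follows. This sketch assumes the modal class is simply the class with the largest frequency and treats a missing neighbouring class as frequency 0; both choices, and the function name, are illustrative assumptions.

```python
def grouped_mode(rows):
    """Mode = l + h*(f1 - f0) / ((f1 - f0) - (f2 - f1)) for (lower, upper, f) rows."""
    freqs = [f for _, _, f in rows]
    k = freqs.index(max(freqs))                      # index of the modal class
    lower, upper, f1 = rows[k]
    f0 = freqs[k - 1] if k > 0 else 0                # preceding class frequency
    f2 = freqs[k + 1] if k + 1 < len(freqs) else 0   # succeeding class frequency
    h = upper - lower
    # Note: the denominator is zero when f0 == f1 == f2; not handled here.
    return lower + h * (f1 - f0) / ((f1 - f0) - (f2 - f1))


example7 = [(10, 20, 4), (20, 30, 6), (30, 40, 5), (40, 50, 10), (50, 60, 20),
            (60, 70, 22), (70, 80, 21), (80, 90, 6), (90, 100, 2), (100, 110, 1)]

print(round(grouped_mode(example7), 4))  # 66.6667
```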



Measures of Dispersion

Range

Quartile deviation

Mean Deviation

Variance

Standard deviation


RANGE

$$\text{Range} = X_{\max} - X_{\min}$$

Range is the difference between the two extreme observations of the distribution

OR

It is the difference between the greatest (maximum) and the smallest (minimum) observation of the distribution.

It is the simplest but a crude measure of dispersion. It is rigidly defined, readily comprehensible and the easiest to compute, requiring very little calculation.


EXAMPLE

MARKS OF STUDENTS

ROLL NO.    MARKS    SORTED
123         98       52
125         95       56
126         96       56
127         87       66
128         56       78
134         52       87
135         89       89
136         78       95
137         56       96
138         66       98

RANGE = 98 - 52 = 46


Merits and Demerits of Range

It is not based on the entire set of data.

Its value varies very widely from sample to sample.

If Xmax and Xmin remain unaltered and all the other values are replaced by any other set of observations lying between them, the range of the distribution remains the same.

It cannot be used when dealing with open end classes.

It is not suitable for mathematical treatment.

It is very sensitive to the size of the sample.

It is too indefinite to be used as a practical measure of dispersion.


QUARTILE DEVIATION

$$\text{Quartile Deviation} = \frac{Q_3 - Q_1}{2}$$

It is a measure of dispersion based on the upper quartile Q3 and the lower quartile Q1.

Inter-quartile Range = Q3 - Q1

Quartile deviation is obtained from the inter-quartile range on dividing by 2.


Merits and Demerits of Quartile Deviation

Merits:

It is quite easy to understand and calculate.

It makes use of 50% of the data and as such is a better measure than the range.

As it ignores 25% of the data from the beginning and 25% from the top end, it is not affected at all by extreme observations.

It can be computed from a frequency distribution with open end classes.


Demerits:

It is not based on all observations.

It is affected considerably by fluctuations of sampling.

It is not suitable for further mathematical treatment.

EXAMPLE


DISTRIBUTION OF MONTHLY EARNING

MONTH    EARNING
1        10239
2        10250
3        10251
4        10251
5        10257
6        10258
7        10260
8        10261
9        10262
10       10262
11       10273
12       10275

Q1 = 10251

Q3 = 10262

QUARTILE DEVIATION = 5.5
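The quartiles in this table can be reproduced with Python's standard library (3.8 or later). The slides do not say which quartile convention was used; the sketch below assumes the (n+1)p position rule, which `statistics.quantiles` calls `method="exclusive"`, because it gives the same Q1 and Q3 as the table.

```python
import statistics

# Monthly earnings from the table above (already in ascending order).
earnings = [10239, 10250, 10251, 10251, 10257, 10258,
            10260, 10261, 10262, 10262, 10273, 10275]

# n=4 returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(earnings, n=4, method="exclusive")

inter_quartile_range = q3 - q1
quartile_deviation = inter_quartile_range / 2

print(q1, q3, quartile_deviation)  # 10251.0 10262.0 5.5
```

Other conventions (for example `method="inclusive"`) can give slightly different quartiles on small samples.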


MEAN DEVIATION

Average or mean deviation is the average amount of scatter of the items in a distribution from either the mean or the median, ignoring the signs of the deviations. The average that is taken of the scatter is an arithmetic mean, which accounts for the fact that this measure is often called the mean deviation.

For ungrouped data

$$\text{Mean Deviation} = \frac{1}{n}\sum_{i} \left|X_i - \bar{X}\right|$$

For grouped data

$$\text{Mean Deviation} = \frac{1}{N}\sum_{i} f_i \left|X_i - \bar{X}\right|$$


EXAMPLE

DISTRIBUTION OF SERIES OF DAILY RENTS

HOUSE    RENT     |RENT - MEAN|
1        3000     1819
2        3000     1819
3        3000     1819
4        3750     1069
5        4000     819.4
6        4000     819.4
7        4000     819.4
8        4500     319.4
9        4750     69.44
10       5000     180.6
11       5000     180.6
12       5000     180.6
13       5250     430.6
14       5250     430.6
15       5500     680.6
16       6250     1431
17       6500     1681
18       9000     4181
TOTAL    86750    18750

MEAN 4819.4
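The table stops at the totals; dividing the total absolute deviation by the number of houses gives the mean deviation. A minimal sketch, with an illustrative function name (the final figure is computed here and is not stated on the slide):

```python
# Rents of the 18 houses from the table above.
rents = [3000, 3000, 3000, 3750, 4000, 4000, 4000, 4500, 4750,
         5000, 5000, 5000, 5250, 5250, 5500, 6250, 6500, 9000]


def mean_deviation(values):
    """Mean of the absolute deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


print(round(sum(rents) / len(rents), 1))  # 4819.4
print(round(mean_deviation(rents), 2))    # 1041.67, i.e. 18750 / 18
```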


EXAMPLE

DISTRIBUTION OF HEIGHTS OF STUDENTS

HEIGHT    # OF STUDENTS
X         f      fX       f|X - MEAN|
158       15     2370     49.1667
159       20     3180     45.5556
160       32     5120     40.8889
161       35     5635     9.72222
162       33     5346     23.8333
163       21     3423     36.1667
164       10     1640     27.2222
165       8      1320     29.7778
166       6      996      28.3333
TOTAL     180    29030    290.667

MEAN 161.278

MD 1.61481
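The grouped mean deviation in this table can be checked with a short sketch; the (X, f) pair layout and the function name are assumptions for illustration.

```python
# (height X, number of students f) from the table above.
heights = [(158, 15), (159, 20), (160, 32), (161, 35), (162, 33),
           (163, 21), (164, 10), (165, 8), (166, 6)]


def grouped_mean_deviation(pairs):
    """(1/N) * sum(f * |X - mean|) for (X, f) pairs."""
    n = sum(f for _, f in pairs)                # N = total frequency
    mean = sum(x * f for x, f in pairs) / n     # frequency-weighted mean
    return sum(f * abs(x - mean) for x, f in pairs) / n


print(round(grouped_mean_deviation(heights), 5))  # 1.61481
```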


STANDARD DEVIATION

It is defined as the positive square root of the mean of the squares of the deviations of the given observations from their mean.

For un-grouped data

$$\text{Standard Deviation} = \sqrt{\frac{1}{n}\sum_{i} \left(X_i - \bar{X}\right)^2}$$

For grouped data

$$\text{Standard Deviation} = \sqrt{\frac{1}{N}\sum_{i} f_i \left(X_i - \bar{X}\right)^2}$$


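VARIANCE

It is the square of the standard deviation and is denoted by $\sigma^2$.

For un-grouped data

$$\text{Variance} = \sigma^2 = \frac{1}{n}\sum_{i} \left(X_i - \bar{X}\right)^2$$

For grouped data

$$\text{Variance} = \sigma^2 = \frac{1}{N}\sum_{i} f_i \left(X_i - \bar{X}\right)^2$$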


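PROPERTIES OF STANDARD DEVIATION

PROPERTY 1

SD is independent of change of origin but not of scale.

PROPERTY 2

SD is the minimum value of the root mean square deviation, i.e. the root mean square deviation is least when the deviations are taken from the mean.

PROPERTY 3

SD is suitable for further mathematical treatment.

PROPERTY 4

SD ≤ Range, with equality only when all observations are equal.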


MERITS AND DEMERITS OF SD

It is the most important and widely used measure of dispersion.

It is based on all the observations.

The squaring of the deviations removes the drawback of ignoring the signs of deviations in computing the mean deviation.

It is affected least by fluctuations of sampling.


EXAMPLE

X        (X-MEAN)^2
12       13.69
15       0.49
24       68.89
12       13.69
13       7.29
15       0.49
14       2.89
12       13.69
16       0.09
24       68.89
TOTAL 157    190.1

MEAN 15.7

VARIANCE 19.01

SD 4.36
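This example divides the sum of squared deviations by n (not n - 1). A minimal sketch that reproduces the figures under that convention; the variable names are illustrative.

```python
import math

values = [12, 15, 24, 12, 13, 15, 14, 12, 16, 24]

n = len(values)
mean = sum(values) / n
# Mean of the squared deviations from the mean (divisor n, as in the slides).
variance = sum((x - mean) ** 2 for x in values) / n
sd = math.sqrt(variance)   # positive square root of the variance

print(mean, round(variance, 2), round(sd, 2))  # 15.7 19.01 4.36
```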


EXAMPLE

# LETTERS IN WORD    FREQUENCY            X-MEAN
X                    f          fX        d         fd^2
1                    3          3         -3.277    32.208
2                    8          16        -2.277    41.463
3                    9          27        -1.277    14.667
4                    10         40        -0.277    0.765
5                    5          25        0.723     2.617
6                    4          24        1.723     11.880
7                    3          21        2.723     22.251
8                    1          8         3.723     13.864
9                    3          27        4.723     66.932
10                   1          10        5.723     32.757
TOTAL                47         201                 239.404

MEAN 4.277

VARIANCE 5.094


EXAMPLE

CLASS    f     BOUNDARIES    x       xf        d^2        fd^2
30-39    1     29.5-39.5     34.5    34.5      1128.96    1128.96
40-49    4     39.5-49.5     44.5    178       556.96     2227.84
50-59    14    49.5-59.5     54.5    763       184.96     2589.44
60-69    20    59.5-69.5     64.5    1290      12.96      259.2
70-79    22    69.5-79.5     74.5    1639      40.96      901.12
80-89    12    79.5-89.5     84.5    1014      268.96     3227.52
90-99    2     89.5-99.5     94.5    189       696.96     1393.92
TOTAL    75                          5107.5               11728

MEAN 68.1

VARIANCE 156

SD 12.5
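The grouped variance and standard deviation in this table can be reproduced as sketched below. The upper class boundaries are cut off in the extracted table and are assumed here to be 39.5, 49.5, and so on, which is consistent with the mid values 34.5, 44.5, ... shown in the x column; the function name is illustrative.

```python
import math

# (lower boundary, upper boundary, frequency); upper boundaries are assumed.
rows = [(29.5, 39.5, 1), (39.5, 49.5, 4), (49.5, 59.5, 14), (59.5, 69.5, 20),
        (69.5, 79.5, 22), (79.5, 89.5, 12), (89.5, 99.5, 2)]


def grouped_mean_and_variance(classes):
    """Return (mean, variance) using class mid values and divisor N."""
    n = sum(f for _, _, f in classes)
    mids = [((lo + hi) / 2, f) for lo, hi, f in classes]
    mean = sum(x * f for x, f in mids) / n
    variance = sum(f * (x - mean) ** 2 for x, f in mids) / n
    return mean, variance


mean, variance = grouped_mean_and_variance(rows)
print(round(mean, 1), round(variance, 2), round(math.sqrt(variance), 2))
# 68.1 156.37 12.5
```

The table rounds the variance to 156.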


CORRELATION

When the relationship between variables is of a quantitative nature, the appropriate statistical tool for discovering and measuring the relationship and expressing it in a brief formula is known as correlation.

It is defined as an analysis of the co-variation between two or more variables.


Types of Correlation

a) Positive and negative correlation

b) Linear and non-linear correlation


METHODS OF STUDYING CORRELATION

1. Scatter diagram

2. Karl Pearson's coefficient of correlation

3. Bi-variate correlation method

4. Rank correlation


Scatter Diagram


Karl Pearson's Coefficient of Correlation

It is a numerical measure of the linear relationship between two variables X and Y, and is defined as the ratio of the covariance between X and Y to the product of their standard deviations.

$$r = \frac{\operatorname{Cov}(x, y)}{\sigma_x \sigma_y}$$


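$$r = \frac{\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\frac{1}{n}\sum (x - \bar{x})^2 \cdot \frac{1}{n}\sum (y - \bar{y})^2}}$$

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$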


EXAMPLE

x = ADVERTISING EXPENSES, y = SALES

x      y      x-mx    y-my    dx^2    dy^2    dxdy
39     47     -26     -19     676     361     494
65     53     0       -13     0       169     0
62     58     -3      -8      9       64      24
90     86     25      20      625     400     500
82     62     17      -4      289     16      -68
75     68     10      2       100     4       20
25     60     -40     -6      1600    36      240
98     91     33      25      1089    625     825
36     51     -29     -15     841     225     435
78     84     13      18      169     324     234
650    660    0       0       5398    2224    2704

mx = 65

my = 66

r = 0.78
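The coefficient in this example can be checked from the deviation columns: r = Σdxdy / sqrt(Σdx² · Σdy²) = 2704 / sqrt(5398 × 2224). A minimal sketch, with an illustrative function name:

```python
import math

x = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]   # advertising expenses
y = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]   # sales


def pearson_r(xs, ys):
    """Karl Pearson's r from deviations about the two means."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))   # sum of dx*dy
    sxx = sum((a - mx) ** 2 for a in xs)                     # sum of dx^2
    syy = sum((b - my) ** 2 for b in ys)                     # sum of dy^2
    return sxy / math.sqrt(sxx * syy)


print(round(pearson_r(x, y), 2))  # 0.78
```

The 1/n factors in the covariance and in the two standard deviations cancel, which is why only the column totals are needed.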

