Sociology Name of Paper: Methodology of Research in Sociology
Name of Module: Processing and Analyzing Quantitative Data
Module Detail and its Structure
Subject Name Sociology
Paper Name Methodology of Research in Sociology
Module Name/Title Processing and Analyzing Quantitative Data
Module Id RMS 20
Pre-requisites Some knowledge of social statistics
Objectives This module will deal with the issues involved in handling, managing and
interpreting quantitative data collected in the course of research. It will also discuss the
basic statistical tools with the help of which we analyse social phenomena.
Keywords Coding, editing, statistics, quantitative research, measures of central tendency,
dispersion, correlation coefficient and regression.
Role in Content
Development
Name Affiliation
Principal Investigator Prof. Sujata Patel Dept. of Sociology,
University of Hyderabad
Paper Co-ordinator Prof. Biswajit Ghosh Professor, Department of Sociology, The
University of Burdwan, Burdwan 713104
Email: [email protected]
Ph. M +91 9002769014
Content Writer Dr. Udita Mitra Assistant Professor, Department of Sociology,
Shri Shikshayatan College, Kolkata-700095
Email: [email protected]
Ph. M +91 9433213816
Ph. L (O) 033-24140594
Content Reviewer (CR)
& Language Editor
Prof. Biswajit Ghosh Professor, Department of Sociology, The
University of Burdwan, Burdwan 713104
Contents
1. Objective
2. Introduction
3. Learning Outcome
4. Data Processing
   4.1 Editing
   4.2 Coding
   4.3 Classification
   4.4 Tabulation
   Self-check Exercise – 1
5. Data Analysis
6. Statistics in Social Research
   Self-check Exercise – 2
   6.1 Measures of Central Tendency
   6.2 Measures of Dispersion
   6.3 Chi-Square Test
   6.4 T-test
   6.5 Measures of Relationship
   Self-check Exercise – 3
7. Limitations of Statistics in Sociology
8. Summary
9. References
1. Objective
This module will deal with the issues involved in handling, managing and interpreting quantitative
data collected in the course of research. It will also discuss the basic statistical tools with the help of
which we analyse social phenomena.
2. Introduction
Quantitative research can be construed as a research strategy that emphasizes quantification in the
collection and analysis of data. It entails a deductive approach to the relationship between theory and
research in which the accent is placed on testing the theories. Quantitative research usually
incorporates the practices and norms of the natural scientific model and of positivism in particular and
it also embodies a view of social reality as an external, objective reality (Bryman 2004: 19). It also
has a preoccupation with measurement and involves collecting a large amount of data. These data
may be collected in various ways, such as through surveys and field research. The data, after
collection, have to be processed in order to ensure their proper analysis and interpretation. According
to Kothari (2004), technically, processing implies editing, coding, classification and tabulation of
collected data so that they are amenable to analysis. These endeavours help us to search for patterns of
relationship that exist among data-groups (Ibid.: 122).
3. Learning Outcome
This module will help you to understand different issues involved in processing and analysing
quantitative data. It will also help you to grasp the essential steps of applying various statistical
measures in order to interpret data collected through social research.
4. Data Processing
Data reduction or processing mainly involves various steps necessary for preparing the data for
analysis. These steps involve editing, categorising the open-ended questions, coding, computerization
and preparation of tables (Ahuja 2007: 304). The processing of data is an essential step before
analysis because it enables us to overcome the errors at the stage of data collection.
4.1. Editing
According to Majumdar (2005), error can creep in at any stage of social research, especially during
data collection. These errors have to be kept to a minimum to avoid errors in the results of the
research. Editing, or checking the completed questionnaires for errors, is a laborious exercise and
needs to be done meticulously. Interviewers tend to commit mistakes: some questions are missed out,
and some answers remain unrecorded or are recorded in the wrong places. The questionnaires
therefore need to be checked for completeness, accuracy and uniformity (Ibid.: 310).
4.2. Coding
Coding is the process of assigning numbers or other symbols to answers so that they can be
categorized into specific classes. Such classes should be appropriate to the research problem under
consideration (Kothari 2004: 123). Care should be taken not to leave any response uncoded.
According to Majumdar (2005: 313), a set of categories is referred to as a “coding
frame” or “code book”. Code book explains how to assign numerical codes for response categories
received in the questionnaire/schedule. It also indicates the location of a variable on computer cards.
Ahuja (2007: 306) provides an example to illustrate how variables can be coded. In a question
regarding the religion of the respondent the answer categories of Hindu, Muslim, Sikh, and Christian
can be coded as 1, 2, 3, and 4 respectively. In such cases, the counting of frequencies will not be
according to Hindus, Muslims etc., but as 1, 2 and so on. Coding can be done manually or with the
help of computers.
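The coding step above can be sketched in a few lines. The code book below follows Ahuja's religion example; the particular responses are made up for illustration.

```python
from collections import Counter

# Code book following Ahuja's example: religion categories assigned codes 1-4.
code_book = {"Hindu": 1, "Muslim": 2, "Sikh": 3, "Christian": 4}

# Illustrative (hypothetical) responses from a questionnaire.
responses = ["Hindu", "Muslim", "Hindu", "Christian", "Sikh", "Hindu"]

# Coding: replace each answer by its numerical code.
coded = [code_book[r] for r in responses]
print(coded)            # [1, 2, 1, 4, 3, 1]

# Frequencies are then counted over the codes, not the labels.
print(Counter(coded))
```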
4.3. Classification
Besides editing and coding of data, classification is another important method to process data.
Classification has been defined as the process of arranging data into groups and classes on the basis of
some common characteristics (Kothari 2004: 123). Classification can be of two types:
i. classification according to attributes, i.e. common characteristics like gender, literacy etc., and
ii. classification according to class intervals, whereby the entire range of data is divided into a
number of classes or class intervals.
4.4. Tabulation
Tabulation is the process of summarising raw data and displaying the same in compact form for
further analysis (Kothari 2004: 127). Tabulating raw data is necessary because:
i. it conserves space and reduces explanatory and descriptive statements to a minimum, and
ii. it provides a basis for various statistical computations.
Tabulation can be done manually as well as with electronic and mechanical devices like computers.
When the data are not large in number, tabulation can be done by hand with the help of tally marks.
Self-check Exercise – 1
Question 1. Tabulate the following examination grades for 80 students.
72, 49, 81, 52, 31,38,81, 58,68, 73, 43, 56, 45, 54, 40, 81, 60, 52, 52, 38, 79, 83, 63, 58, 59, 71, 89, 73,
77, 60, 65, 60, 69, 88, 75, 59, 52, 75, 70, 93, 90, 62, 91, 61, 53, 83, 32, 49, 39, 57, 39, 28, 67, 74, 61,
42, 39, 76, 68, 65, 58, 49, 72, 29, 70, 56, 48, 60, 36, 79, 72, 65, 40, 49, 37, 63, 72, 58, 62, 46 (Levin
and Fox 2006).
Procedures for Tabulation/Grouping of Data
The above is an array of scores which otherwise would not be very handy to use. In order to make the
data meaningful and useful it must be organized and classified into frequency tables. There are certain
easy steps to be followed in order to convert the raw scores into frequency tables.
i. We must first find the difference between the highest and the lowest score in the series. In
the above case the difference is 65 (93-28). To it we must add 1 to bring in the entire range
of scores. So it becomes 66.
ii. Next, we would have to assume the number of class intervals that would best summarise the
entire range of scores. In this case we assume the number of intervals as 10.
iii. Now we would divide the range of scores by the number of class intervals to obtain the
width (denoted as i) of the class interval. Here it would be 66/10 = 6.6, which we round off
to a whole number; we take i = 6.
iv. To the lowest score in the series we add (i – 1) to get the first class interval. In this case it
would be 28 + (6 – 1) = 33, giving the interval 28 to 33.
v. We take the higher integer from the upper limit of the class interval and repeat step iv to get
the next class interval. In this way we would obtain the class intervals and put the
frequencies in the respective class intervals (Elifson 1997).
Answer: The complete frequency distribution of examination grades for the 80 students is the following:
Class Interval Frequencies
28-33 4
34-39 7
40-45 5
46-51 6
52-57 9
58-63 16
64-69 7
70-75 12
76-81 7
82-87 2
88-93 5
N = 80
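The grouping procedure above can be sketched in a short program, assuming width-6 intervals starting from the lowest score as in the worked answer.

```python
from collections import Counter

# The 80 examination grades from the exercise.
scores = [72, 49, 81, 52, 31, 38, 81, 58, 68, 73, 43, 56, 45, 54, 40, 81, 60,
          52, 52, 38, 79, 83, 63, 58, 59, 71, 89, 73, 77, 60, 65, 60, 69, 88,
          75, 59, 52, 75, 70, 93, 90, 62, 91, 61, 53, 83, 32, 49, 39, 57, 39,
          28, 67, 74, 61, 42, 39, 76, 68, 65, 58, 49, 72, 29, 70, 56, 48, 60,
          36, 79, 72, 65, 40, 49, 37, 63, 72, 58, 62, 46]

width = 6                      # class-interval width i
low = min(scores)              # lowest score, 28
# Assign each score to a width-6 interval and count the frequencies.
counts = Counter((s - low) // width for s in scores)
for k in sorted(counts):
    lower = low + k * width
    print(f"{lower}-{lower + width - 1}: {counts[k]}")
```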
5. Data Analysis
The term ‘data analysis’ refers to the computation of certain indices or measures along with searching
for patterns of relationship that exist among the data groups. Analysis, particularly in case of survey
or experimental data (quantitative data), involves estimating the values of unknown parameters of the
population and testing of hypothesis for drawing inferences (Kothari 2004: 130). Quantitative data
analysis occurs typically at a late stage in the research process. But this does not mean that the
researchers should not be considering how they will analyse their data at the beginning of the
research. During the designing phase of the questionnaire or observation schedule, the researchers
should be fully aware of the techniques of data analysis. In other words, the kinds of data the
researchers will collect and the size of the sample have implications for the sorts of analysis that can
be applied (Bryman 2004).
6. Statistics in Social Research
The task of analysing quantitative data in research is carried out by social statistics. Social statistics
has two major areas of function in research: the descriptive and the inferential. Descriptive statistics
is concerned with organizing raw data obtained in the process of research. Tabulation and
classification of data are instances of descriptive statistics. Inferential statistics is concerned with
making inferences or conclusions from the data collected from the sample and drawing
generalisations on the entire population (Elifson 1997). Inferential statistics is also known as sampling
statistics and it is concerned with two major types of problems:
the estimation of population parameters, and
the testing of statistical hypotheses (Kothari 2004: 131)
Some of the most important and useful statistical measures that would be taken up for discussion in
the present module are:
measures of central tendency or statistical averages
measures of dispersion
chi-square test
t-test
measures of relationship
From the next section we are going to take up each for discussion.
Self-check Exercise – 2
1. How does descriptive statistics work?
Descriptive statistics tries to describe and summarize the mass of data that is obtained in the
process of conducting research. It tries to do so with the help of some specific measures. The
very first step of organizing data would be to arrange the raw scores into a number of categories
known as frequency tables. After it is done, the next step would be to represent the data through
various graphs and figures. Some of these would be bar graph, pie chart, frequency polygon etc.
2. What is inferential statistics?
Inferential statistics deals with the task of drawing inferences on the population by studying the
sample drawn from that population. The reasons why we infer on the findings of a sample can be
many. Insufficient resources in terms of money and man power can force a researcher to draw a
sample from the population. Time available for a research may also be short and inadequate to
study an entire population. Statistics can be of great help in generalizing findings. It needs to be
mentioned here that errors inevitably appear in the process of sampling, but researchers may
adopt various methods to minimize them. The prefix ‘social’ is attached to statistics due to its
application to interpret social phenomena.
6.1. Measures of Central Tendency
When the scores have been tabulated into a frequency distribution, the next task is to calculate a
measure of central tendency or central position. The measure of central tendency defines a value
around which items have a tendency to cluster. The importance of the Measure of Central Tendency is
twofold. First, it is an “average” which represents all the scores in a distribution and gives a precise
picture of the entire distribution. Second, it enables us to compare two or more groups in terms of
typical performance. Three “averages” or measures of central tendency are commonly used:
Arithmetic Mean, Median and Mode (Garrett 1981: 27).
i) Arithmetic Mean: Mean is known as arithmetic average and is the most stable measure of central
tendency. It is defined as the summation of all the values given in the series of numbers divided by the
number of values. Mean can be calculated through different methods:
a) Calculation of the Mean from Ungrouped Scores: This can be computed by the following equation:
X̄ = (X₁ + X₂ + X₃ + … + Xₙ)/n, where the X values are the individual scores and ‘n’ is the number
of scores.
In the case of the following scores, the mean can be found out by the above formula (Garrett 1981):
8, 5, 4, 7, 9, 10
X̄ = (8 + 5 + 4 + 7 + 9 + 10)/6 = 43/6 ≈ 7.17
b) Calculation of the Mean from Grouped Scores: In case of computing the mean from a grouped
frequency distribution, the mean is calculated by a slightly different method from that given above.
Thus, it can be computed by the following formula:
X̄ = ∑fX/N, where ‘X’ is the midpoint of the class interval, ‘f’ is the frequency assigned to each class
interval, ∑ is the summation operator and ‘N’ is the total frequency. The calculation is shown in the
table below (see Garrett 1981, for details):
Class Intervals Frequencies Midpoint (X) fX
140-144 1 142 142
145-149 3 147 441
150-154 2 152 304
155-159 4 157 628
160-164 4 162 648
165-169 6 167 1002
170-174 10 172 1720
175-179 8 177 1416
180-184 5 182 910
185-189 4 187 748
190-194 2 192 384
195-199 1 197 197
N = 50 ∑fX = 8540
The Mean will be ∑fX/N = 8540/50 = 170.8.
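The grouped-mean calculation above can be sketched as:

```python
# Midpoints and frequencies from the table above.
midpoints   = [142, 147, 152, 157, 162, 167, 172, 177, 182, 187, 192, 197]
frequencies = [1, 3, 2, 4, 4, 6, 10, 8, 5, 4, 2, 1]

n = sum(frequencies)                                   # N = 50
# Weight each midpoint by its frequency, then divide by the total frequency.
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
print(mean)  # 170.8
```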
ii) Median: Median is the middle most value in the entire distribution of data. It divides the
distribution into two equal parts: one half of the distribution falls below the median value and the
other half falls above it. Before calculating the median we have to arrange the values in either
ascending or descending order. It is a positional average. It is shown by the following formula:
M = value of the ((n+1)/2)th item
It should be mentioned in this context that the median is usually used to describe qualitative
phenomena like intelligence. It is not often used in sampling statistics (Kothari 2004: 133).
a) Computation of the Median when data are Ungrouped: Two situations arise in the computation of
the Median from ungrouped data: a) when N is odd, and b) when N is even. To consider the first case
where N is odd, suppose we have the following numbers: 7, 10, 8, 12, 9, 11, 7. First we have to
arrange these data in an ascending order like 7, 7, 8, 9, 10, 11, 12. Then we apply the above equation
to compute the median.
M = value of the ((n+1)/2)th item, where ‘n’ is the number of scores
= value of the ((7+1)/2)th, that is, the 4th item
M = 9.
When the total number of scores is even like 7, 8, 9, 10, 11, 12, the median is the average of the two
middlemost numbers. In the above numbers, the two middlemost numbers are 9 and 10. The average
of these numbers is 19/2 or 9.5.
b) Computation of the Median when data are Grouped: When the scores are arranged into a
frequency distribution, the median by definition is the 50% point in the distribution. We calculate the
cumulative frequency of the distribution and divide N by 2 to locate the class interval in which the
median falls. The following equation would help us to compute the median from a grouped frequency
distribution:
Mdn = l + ((N/2 − F)/fm) × i, where l is the exact lower limit of the class interval upon which the
median lies, N/2 is one half of the total number of scores, F is the sum of the frequencies on all
intervals below l, fm is the frequency within the interval upon which the median falls, and i is the
width of the class interval.
The computation of the median is shown in the following table:
Class Intervals Frequencies Cumulative Frequencies
140-144 1 1
145-149 3 4
150-154 2 6
155-159 4 10
160-164 4 14
165-169 6 20
170-174 10 30
175-179 8 38
180-184 5 43
185-189 4 47
190-194 2 49
195-199 1 50
N = 50
When we divide N, or 50, by 2 we get 25. With its help we locate the class interval in which the
median lies as 170-174 (since its cumulative frequency of 30 includes 25). Next we compute the
median with the help of the equation above:
Here ‘l’ would be (170 − 0.5) = 169.5
Mdn = 169.5 + ((25 − 20)/10) × 5
= 169.5 + 2.5 = 172.
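Both median computations can be sketched as follows (the helper function names are mine):

```python
def median_ungrouped(scores):
    """Middle value of the sorted scores; the average of the two middle
    values when the number of scores is even."""
    s = sorted(scores)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

print(median_ungrouped([7, 10, 8, 12, 9, 11, 7]))       # 9
print(median_ungrouped([7, 8, 9, 10, 11, 12]))          # 9.5

def median_grouped(l, N, F, fm, i):
    """Mdn = l + ((N/2 - F) / fm) * i, with the symbols defined above."""
    return l + ((N / 2 - F) / fm) * i

print(median_grouped(l=169.5, N=50, F=20, fm=10, i=5))  # 172.0
```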
iii) Mode: When a rough and quick estimate of central tendency is wanted, mode is usually the most
preferred measure. Mode is that value which has the greatest frequency in the given series of scores.
Like median, mode is also a positional average and is therefore unaffected by extreme scores in the
series of numbers. It is useful in all situations where we want to eliminate the effect of extreme
variations (Kothari 2004: 133).
a) Calculating Mode from Ungrouped Data: In a simple ungrouped data, the mode is that single
measure or score which occurs most frequently. For instance, in the series of numbers 10, 11, 11,
12, 12, 13, 13, 13, 14, 14, the crude mode is 13 (the most frequently occurring value).
b) Calculating the Mode from Grouped Data: When the data are grouped into a frequency
distribution, the crude mode is found out by the midpoint of the interval which contains the highest
frequency. In the case of the above table, the value of the mode would be 172, the midpoint of the
class interval 170-174 (Garrett 1981). We can also calculate the true mode from a grouped frequency
distribution. The formula for calculating the true mode in a normal or symmetrical distribution is:
Mode = 3 Mdn – 2 Mean (ibid).
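The crude mode and Garrett's true-mode formula can be sketched as:

```python
from collections import Counter

# Crude mode of ungrouped scores: the value occurring most frequently.
scores = [10, 11, 11, 12, 12, 13, 13, 13, 14, 14]
crude_mode = Counter(scores).most_common(1)[0][0]
print(crude_mode)  # 13

# True mode of a roughly symmetrical grouped distribution, using the
# median (172) and mean (170.8) computed earlier in this module.
true_mode = 3 * 172 - 2 * 170.8
print(round(true_mode, 1))  # 174.4
```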
iv) When to Use the Various Measures of Central Tendency: The situations in which the three
measures are used are stated below:
a) The Mean is used when
The scores are distributed symmetrically around a central point
The central tendency having the greatest stability is wanted
Other statistics like standard deviation and correlation coefficient are to be computed later.
b) The Median is used when
The exact midpoint of the distribution is all that is wanted
There are extreme scores which would distort the mean but do not affect the median.
c) The Mode is used when
A rough and quick estimate of central tendency is all that is wanted
The measure of central tendency should be the most typical value (Garrett 1981).
The choice of average depends on the researcher and the objectives of the study. Only then will the
statistical computation of averages be effective and useful in the interpretation of data.
6.2. Measures of Dispersion (Range, Interquartile Range, Mean Deviation or Average Deviation
and Standard Deviation)
Measures of central tendency like the mean, median and mode can only be representative of the entire
series of scores. But they cannot fully describe the nature of a frequency distribution. For instance, they
cannot state how far a given score in a series deviates from the average. In other words, how much a
score is lower or higher than the average? Therefore, in order to measure this spread of score from the
central tendency, we calculate the measures of dispersion or variability. There are different measures
of dispersion. They are the range, mean deviation and standard deviation.
i) Range: Range is the simplest and the easiest measure of variability. It is usually calculated by
subtracting the lowest score from the highest score in the given series of data. The value of the range
depends on only two values and this is its main limitation. It ignores the remaining values in the
distribution and therefore it fails to provide an accurate and stable picture of the dispersed scores.
a) Range for Ungrouped Data: In a distribution of ungrouped scores, if the scores are arranged in an
array, the range is defined as the largest score minus the smallest score plus one.
Range = (Highest value of an item in a series) ─ (Lowest value of an item in a series) +1
In a distribution that has 103 as the highest score and 30 as the lowest score, the range is computed as
range = (103- 30)+1 = 74 (Leonard 1996).
b) Range for Grouped Data: In case of grouped data, the range is the difference between the upper
true limit of the highest class interval and the lower true limit of the lowest class interval. Let us look
into the following data:
Class Interval Frequency
31-33 3
34-36 0
37-39 1
40-42 5
43-45 7
46-48 6
49-51 24
52-54 18
55-57 14
58-60 15
61-63 16
64-66 7
In case of the above data, the upper true limit of the highest class interval is 66.5 (64-66) and the
lower true limit of the lowest class interval is 30.5 (31-33). Therefore, the range would be 66.5-
30.5=36. Here, 1 is not added because the difference is between the two true limits (Leonard 1996).
Please note that range does not represent the entire series of scores as its computation requires only
the two extreme values.
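Both range computations can be sketched as follows (the function names are mine):

```python
def range_ungrouped(scores):
    """Largest score minus smallest score plus one."""
    return max(scores) - min(scores) + 1

def range_grouped(lowest_stated_limit, highest_stated_limit):
    """Difference between the true limits, which extend each stated limit
    by 0.5 for integer-valued data; no +1 is added here."""
    return (highest_stated_limit + 0.5) - (lowest_stated_limit - 0.5)

# Distribution with highest score 103 and lowest score 30; the scores in
# between (here a single hypothetical 55) do not affect the range.
print(range_ungrouped([103, 55, 30]))  # 74

# Grouped data running from interval 31-33 up to interval 64-66.
print(range_grouped(31, 66))           # 36.0
```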
ii) Mean Deviation or Average Deviation: It is the average of the differences of the values of items from
some average of the series (Kothari 2004: 135). It is based on the absolute deviations of scores from the
centre (Leonard 1996). This procedure is designed to avoid the algebraic sum of deviations from the
mean equalling zero, in which case it would be impossible to compute indices of variability.
a) Average Deviation for Ungrouped Scores:
Mean Deviation from the Mean = ∑|X − X̄| / n, where X denotes a particular score, X̄ the mean of the
scores, and n stands for the total number of scores. Let us look into the calculation for the following
scores:
Observation No. X X̄ |X − X̄| or x
1 26 16 10
2 24 16 8
3 22 16 6
4 20 16 4
5 18 16 2
6 16 16 0
7 14 16 2
8 10 16 6
9 6 16 10
10 4 16 12
N = 10 ∑X = 160 ∑|X − X̄| = 60
For the above scores, we first calculated the mean which is 16 (160/10). Then, we have subtracted the
mean from the scores in order to know their deviation and ignored the sign of the scores. After this,
the absolute deviations have been summed up (60). To find the average deviation, we divided 60 by n,
or 10, and obtained 6. Here 6 is our mean deviation (Ibid.).
b) Average Deviation for Grouped Data: The formula for calculating the average deviation is
A.D. = ∑f|X − X̄| / N
The average deviation or mean deviation from the grouped data is calculated below:
Class Intervals Midpoints (m) Frequencies (f) mf x = |X − X̄| fx
140-144 142 1 142 28.8 28.8
145-149 147 3 441 23.8 71.4
150-154 152 2 304 18.8 37.6
155-159 157 4 628 13.8 55.2
160-164 162 4 648 8.8 35.2
165-169 167 6 1002 3.8 22.8
170-174 172 10 1720 1.2 12.0
175-179 177 8 1416 6.2 49.6
180-184 182 5 910 11.2 56.0
185-189 187 4 748 16.2 64.8
190-194 192 2 384 21.2 42.4
195-199 197 1 197 26.2 26.2
N = 50 ∑mf = 8540 ∑fx = 502
Mean or 𝑋 of the above group of scores is 8540/50= 170.8. The rest of the calculations have been
shown in the table. Therefore A.D. would be 502/50 or 10.04.
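Both average-deviation computations can be sketched as follows (the function names are mine):

```python
def mean_deviation(scores):
    """Average of the absolute deviations of the scores from their mean."""
    m = sum(scores) / len(scores)
    return sum(abs(x - m) for x in scores) / len(scores)

# Ungrouped example from above: mean 16, sum of absolute deviations 60.
print(mean_deviation([26, 24, 22, 20, 18, 16, 14, 10, 6, 4]))  # 6.0

def mean_deviation_grouped(midpoints, frequencies):
    """A.D. = sum(f * |X - mean|) / N over the class midpoints."""
    n = sum(frequencies)
    m = sum(f * x for f, x in zip(frequencies, midpoints)) / n
    return sum(f * abs(x - m) for f, x in zip(frequencies, midpoints)) / n

midpoints   = [142, 147, 152, 157, 162, 167, 172, 177, 182, 187, 192, 197]
frequencies = [1, 3, 2, 4, 4, 6, 10, 8, 5, 4, 2, 1]
print(round(mean_deviation_grouped(midpoints, frequencies), 2))  # 10.04
```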
iii) Standard Deviation: Standard Deviation (S.D) is the most stable measure of dispersion or
variability. It is defined as the square root of the average of the squares of deviations when such
deviations for the values of individual items in a series are obtained from the arithmetic average. In
finding the S.D, we avoid the difficulty of signs by squaring the separate deviations (Garrett 1981).
a) Standard Deviation for Ungrouped Scores: The formula for computing S.D. from ungrouped
scores is σ (S.D.) = √(∑x²/N), where ‘x’ is the deviation of a score from the mean and ‘N’ is the
total number of scores.
We can calculate the standard deviation from the scores below in the following manner (Leonard 1996):
X x = X − X̄ x²
2 −6 36
2 −6 36
4 −4 16
6 −2 4
8 0 0
14 6 36
20 12 144
N = 7 ∑X = 56 ∑x² = 272
The mean of the scores is 56/7 = 8, and the deviations x are taken from it. Dividing ∑x² by N gives
272/7 or 38.86. If we find the square root of 38.86, we will get the standard deviation. So √38.86 or
6.23 is the S.D.
b) Standard Deviation for Grouped Data: The following is the formula for computing standard
deviation for grouped data.
Standard deviation for grouped data: σ = √(∑fx²/N), where f stands for the individual frequency, ‘x’ is
the value of the deviation of the individual score from the mean and N stands for the total frequency
(Garrett 1981). The calculation is shown in the table below:
Class Interval Midpoint (X) Frequency (f) fX x = X − X̄ x² fx²
140-144 142 1 142 −28.8 829.44 829.44
145-149 147 3 441 −23.8 566.44 1699.32
150-154 152 2 304 −18.8 353.44 706.88
155-159 157 4 628 −13.8 190.44 761.76
160-164 162 4 648 −8.8 77.44 309.76
165-169 167 6 1002 −3.8 14.44 86.64
170-174 172 10 1720 1.2 1.44 14.40
175-179 177 8 1416 6.2 38.44 307.52
180-184 182 5 910 11.2 125.44 627.20
185-189 187 4 748 16.2 262.44 1049.76
190-194 192 2 384 21.2 449.44 898.88
195-199 197 1 197 26.2 686.44 686.44
N = 50 ∑fX = 8540 ∑fx² = 7978
The mean score of the above distribution of scores is 8540/50 or 170.8.
The computed value of σ is √(∑fx²/N) = √(7978/50) = √159.56 ≈ 12.63.
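The grouped standard deviation can be sketched as:

```python
from math import sqrt

# Midpoints and frequencies from the table above.
midpoints   = [142, 147, 152, 157, 162, 167, 172, 177, 182, 187, 192, 197]
frequencies = [1, 3, 2, 4, 4, 6, 10, 8, 5, 4, 2, 1]

n = sum(frequencies)                                            # N = 50
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n   # 170.8
# sigma = sqrt(sum(f * x^2) / N), with x the deviation of each midpoint
# from the mean.
variance = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, midpoints)) / n
print(round(sqrt(variance), 2))  # 12.63
```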
iv) When to Use the Various Measures of Variability: The rules for using the measures of
dispersion are as follows:
a) Range can be used when
the scores are scanty in number or are too dispersed
a knowledge of the extreme scores or of the total spread of scores is wanted.
b) Average Deviation can be computed when
it is desirable to weigh all deviations from the mean according to their size
extreme deviations would influence the S.D. unduly.
c) S.D. is to be used when
the statistic having the greatest stability is wanted
coefficient of correlation and other statistics are subsequently to be computed (Garrett 1981).
6.3. Chi-square Test
The chi-square test is an important one among several tests of significance developed by statisticians.
It is symbolically written as χ² and can be used to determine whether categorical data show dependency
or whether two classifications are independent. It can be used to make comparisons between theoretical
populations and actual data when categories are used. The test is, in fact, a technique by the use of
which it is possible for researchers to test a) goodness of fit, and b) the significance of association
between two attributes (Kothari 2008).
a) Test of Goodness of Fit: As a test of goodness of fit, chi-square enables us to see how well the
theoretical distribution fits the observed data. If the calculated value of χ² is less than its table value
at a certain level of significance, the fit is considered to be a good one. When the calculated value of
χ² is greater than the table value, we do not consider the fit to be a good one (Kothari op. cit.).
Illustrative Problem
Given below is the data on the number of students entering the University from each school.
School 1 – 22, School 2 – 25, school 3 – 26, School 4 – 28, School 5 – 33.
Is there a difference in the quality of school? N=50
In the case of the above data the most suitable technique of statistical application would be chi-square
goodness of fit test because the data are at the nominal level and the hypothesis is to be tested on one
variable, that is, the quality of schools on the basis of the prospect of entering the University from
each school.
The steps for calculating the chi-square are shown below.
14
Sociology Name of Paper: Methodology of Research in Sociology
Name of Module: Processing and Analyzing Quantitative Data
1. Stating the Null and the Alternative Hypothesis: The null hypothesis assumes that there is no
difference in the quality of the schools. Whereas the alternative hypothesis would state that
there is a difference in the quality of the schools.
2. Choice of a Statistical Test: As has been stated above, the appropriate statistical test
applicable here would be Chi-square goodness of fit test.
3. Level of Significance and Sample Size: Here the level of significance would be 0.5, that
means only 5 times in 100. The sample size is 50.
4. One versus the two tailed test: It is a two tailed test because no direction is indicated in the
alternative hypothesis. It only suggests that there is a difference in the number of students
entering the University from each school.
5. The Sampling Distribution: The sampling distribution is a function of the degrees of freedom
which are quantities that are free to vary. Here it can be computed by (k-1) where ‘k’ is the
number of categories into which observations are divided. Here there are 5 categories, that
means degrees of freedom (df) = (5-1) = 4.
6. The Region of Rejection: The point of intersection of the ‘df’ and the level of significance
gives the critical value of x², which is 9.488. The computed value of the chi-square has to be
greater than the table value in order to reject the null hypothesis. It is computed by the formula:

x² = ∑(Of − Ef)²/Ef

where Of is the observed frequency in each category and Ef is the expected frequency. In an
ideal situation, ten students from each school would enter the University; therefore our
expected frequency in each case would be 50/5 = 10. The computation of x² is shown in the
table below:
School   Of   Ef   Of − Ef   (Of − Ef)²   (Of − Ef)²/Ef
1        22   10   12        144          14.4
2        25   10   15        225          22.5
3        26   10   16        256          25.6
4        28   10   18        324          32.4
5        33   10   23        529          52.9
                             x² = ∑(Of − Ef)²/Ef = 147.8
Since the computed value of chi-square is 147.8, which is greater than its table value of 9.488, the
alternative hypothesis is upheld that there are differences in the quality of schools. This is understood
from the different number of students entering the University from each school.
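The arithmetic of the goodness-of-fit test above can be sketched in a few lines of Python (a minimal illustration, not part of the original module; the numbers follow the worked example):

```python
# observed admissions from the five schools in the worked example
observed = [22, 25, 26, 28, 33]
expected = 10  # Ef = 50/5, as fixed in the example

# chi-square = sum of (Of - Ef)^2 / Ef over all categories
chi_square = sum((of - expected) ** 2 / expected for of in observed)
# chi_square -> 147.8, well above the critical value of 9.488 at df = 4
```

The loop reproduces the last column of the table above cell by cell; with real data the critical value would be read from a chi-square table for the chosen level of significance and degrees of freedom.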
b) Chi-square Test of Independence: As a test of independence, chi square test enables us to explain
whether or not two attributes are associated. If the table value of chi is greater than its computed
value, we can conclude that there is no association between the attributes, that is, the null hypothesis
is upheld. But if the computed value of chi is greater than its table value, we uphold that the two
attributes are associated and the association is not due to chance factors but it exists in reality (Kothari
2008). For the test of association, the formula for computing the chi-square remains the same as
above.
Illustrative Problem
Let us look into the following data:
                    Level of Job Satisfaction
Union Membership   Not Satisfied   Satisfied   Total
No                 75 (A)          125 (B)     200
Yes                65 (C)          135 (D)     200
Total              140             260         400
From the above data, we have to find out if a relation exists between the two variables.
Here, we will apply a chi square test of independence because the data are at the nominal level and
there are two variables in the data, namely job satisfaction and union membership. The steps 1 to 6 are
to be written in the same manner as above. Only the sample size is 400. The degrees of freedom will
be computed by (c-1)(r-1)where ‘c’ is the number of columns and ‘r’ means the number of rows into
which observations are divided. Here the degrees of freedom (df) = (2-1)(2-1) = 1. The point of
intersection between the ‘df’ and the level of significance (0.05) gives the critical or table value of x²,
which is 3.841. The computed value of the chi-square would have to be more than its table value in
order to reject the null hypothesis.
Next, we calculate the expected frequencies against each observed frequency by the following
formula:
Cell A = (A+B)(A+C)/N = (200 × 140)/400 = 70
Cell B = (A+B)(B+D)/N = (200 × 260)/400 = 130
Cell C = (C+D)(A+C)/N = (200 × 140)/400 = 70
Cell D = (C+D)(B+D)/N = (200 × 260)/400 = 130
Now we would compute the value of Chi-square in the following table:
Cell   Of    Ef    Of − Ef   (Of − Ef)²   (Of − Ef)²/Ef
A      75    70    5         25           0.35
B      125   130   −5        25           0.19
C      65    70    −5        25           0.35
D      135   130   5         25           0.19
                             x² = 1.08
Since the computed value of Chi-square (1.08) is less than its table value (3.841), the null
hypothesis is upheld. It may hence be argued that there is no significant association between job
satisfaction and union membership. The chi-square test is one of the most frequently used tests, but
it should be applied correctly, in situations where the individual observations of the sample are
independent (Kothari 2008: 295).
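As an illustration (not part of the original module), the test-of-independence arithmetic can be sketched in Python; the expected frequency of each cell is the product of its row and column totals divided by N:

```python
# 2x2 table from the example: rows = union membership (no, yes),
# columns = job satisfaction (not satisfied, satisfied)
table = [[75, 125],
         [65, 135]]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(table):
    for j, of in enumerate(row):
        ef = row_totals[i] * col_totals[j] / n  # expected frequency for the cell
        chi_square += (of - ef) ** 2 / ef
# chi_square is about 1.099; the module, rounding each cell to two decimals,
# reports 1.08 -- either way it falls below the table value of 3.841
```

Because the computed value stays below the critical value, the null hypothesis of independence is retained.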
6.4. T-test
The Central Limit Theorem states that, if the sample size N is large, the sample statistic approaches
the Z distribution (explained above). When a sample is taken from a normally distributed population
with a known mean (µ) and standard deviation (σ), and a z-score is computed for each observation,
the resulting scores will have a z-distribution, that is, a normal distribution with mean = 0 and
standard deviation = 1. The problem is that in most cases the population standard deviation is
unknown, and since the Central Limit Theorem involves the use of the standard deviation, it cannot
be ignored. One solution here is to substitute the sample standard deviation (s1) for the population
standard deviation (Vito and Latessa 1989). To test samples of small size, we have the “t” statistic.
The t-test can be of two types, namely the two sample t-test and the related sample t-test. The type
of test chosen will depend upon whether the two samples are independent or related. Related
samples occur when -
both samples have been matched according to some trait like race or gender, or
repeated measurements of the same sample are taken (before-after or time series design)
(Ibid. 1989).
a) Two Sample t – test: When two samples are to be tested on any trait or variable, then we apply for a
two sample t- test. The formula for computation of the value of t is as follows:
t = (X̄1 − X̄2) / √[((n1s1² + n2s2²)/(n1 + n2 − 2)) × ((n1 + n2)/(n1n2))]

where X̄1 is the mean of the first sample, X̄2 is the mean of the second sample, s1 and s2 are the
standard deviations of the first and the second samples respectively, and n1 and n2 are the sizes of
the two samples respectively.
Illustrative Problem
The data for two schools have been provided below:
State Funded School: N1 = 20, X̄1 = 64, S1 = 18.5
Private Schools: N2 = 24, X̄2 = 46, S2 = 18.5 (C.U. 2001)
The steps for computing the value of t would be summarized below.
1. Stating the Null and the Alternative Hypothesis: The null hypothesis assumes that there would
be no differences in the samples. The alternative hypothesis assumes a difference between
two samples.
2. Choice of Statistical Test: The statistical test chosen is the two sample t-test.
3. Level of Significance and Sample Size: The level of significance is .05 which means that 5
times in 100, we can reject the null hypothesis incorrectly or 5 times in 100 our result can be
due to chance. The sample sizes are 20 and 24 respectively.
4. One Versus Two Tailed Test: It is a two tailed test because no direction is implied in the
alternative hypothesis. It only suggests a difference between two sample means.
5. The Sampling Distribution: It is a function of the degrees of freedom, that is, the quantities
which are free to vary. It can be calculated by the formula (N1 + N2 − 2), which here would
be (20 + 24 − 2) = 42.
6. The Region of Rejection: The point of intersection between the degrees of freedom and the
level of significance gives the table value of ‘t’. Here the critical or table value of “t” would
be 1.684. The computed value of “t” has to be more than this in order to reject the null
hypothesis. The computed value of “t” can be found out from the formula given above. We
just substitute the values in it.
t = (X̄1 − X̄2) / √[((n1s1² + n2s2²)/(n1 + n2 − 2)) × ((n1 + n2)/(n1n2))]
  = (64 − 46) / √[((20(342.25) + 24(342.25))/(20 + 24 − 2)) × ((20 + 24)/(20 × 24))]
  = 3.16.
Since the computed value of “t” is 3.16 and is greater than its critical value which is 1.684, the
alternative hypothesis is upheld. In other words, there are significant differences between the two
school systems.
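The substitution above can be reproduced in Python (a minimal sketch using the module's pooled-variance formula; the summary statistics are those of the example):

```python
import math

# summary statistics from the example (C.U. 2001)
n1, mean1, s1 = 20, 64, 18.5  # state funded school
n2, mean2, s2 = 24, 46, 18.5  # private schools

# pooled variance term: (n1*s1^2 + n2*s2^2) / (n1 + n2 - 2)
pooled = (n1 * s1 ** 2 + n2 * s2 ** 2) / (n1 + n2 - 2)
t = (mean1 - mean2) / math.sqrt(pooled * (n1 + n2) / (n1 * n2))
# t comes out to about 3.14 (the module reports 3.16 after intermediate
# rounding); either value exceeds the critical value of 1.684
```

The small discrepancy with the module's 3.16 comes from rounding at intermediate steps; the conclusion is unchanged.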
b) T-test for Related Samples: This is applicable when there are repeated measurements of the same
sample (time series design). The formula for computing the value of “t” for related samples is:
t = (X̄1 − X̄2)/SD̄

where X̄1 and X̄2 are again the means of the two sets of scores respectively and SD̄ is the estimate
of the standard error of the mean difference scores. The standard error is calculated by the formula:

SD̄ = √(SD²/N)

where SD² is the pooled variance of the difference scores and N is the total number of scores. The
pooled variance is computed by the formula:

SD² = [N∑D² − (∑D)²]/[N(N − 1)]

where D is the difference between the two measurements for each case in the related sample
(Vito and Latessa 1989).
Illustrative Problem
The governor of Florida wants a report on the effects of the death penalty. Homicide rates (per
100,000 population) in Florida cities, two weeks before and two weeks after an execution are noted
below (Vito and Latessa 1989):
City            Rate Before (Test 1)   Rate After (Test 2)   D = Test 2 − Test 1   D²
Pompano Beach   23                     19                    −4                    16
Tallahassee     15                     16                    1                     1
Tampa           12                     18                    6                     36
Miami           20                     17                    −3                    9
Orlando         13                     11                    −2                    4
Total           83                     81                    −2                    66
At first we would calculate the mean of the two tests:
X̄1 = 83/5 = 16.6
X̄2 = 81/5 = 16.2

Now, we calculate the pooled variance of the difference scores. The formula is:

SD² = [N∑D² − (∑D)²]/[N(N − 1)] = [5(66) − (−2)²]/[5(5 − 1)] = 326/20 = 16.3

Next, we calculate the standard error of the mean difference scores, the formula for which is:

SD̄ = √(SD²/N) = √(16.3/5) = √3.26 = 1.80

From the above values we calculate the value of “t” as:

t = (X̄1 − X̄2)/SD̄ = (16.6 − 16.2)/1.80 = 0.22
The steps for computing the value of ‘t’ are the same as above. Here the sample size is 5. The
sampling distribution would be calculated by (N-1) or (5-1) or 4. The point of intersection between
the degrees of freedom and the level of significance (0.05) gives the table value of ‘t’. Here the
critical or table value of “t” would be 2.776. The computed value of “t” has to be more than this in
order to reject the null hypothesis. The computed value of “t” is found out from the above formula.
We have just substituted the values in it and found out “t” to be 0.22. Since the computed value of t is
less than its table value, the null hypothesis is upheld. Our findings can be due to chance factors.
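The related-samples computation above can be written out in Python (a minimal sketch following the module's formulas, with the homicide-rate data of the example):

```python
import math

before = [23, 15, 12, 20, 13]  # homicide rates two weeks before an execution
after = [19, 16, 18, 17, 11]   # rates two weeks after
n = len(before)

d = [a - b for a, b in zip(after, before)]  # D = Test 2 - Test 1 for each city
# pooled variance of the difference scores: [N*sum(D^2) - (sum D)^2] / [N(N-1)]
sd2 = (n * sum(x * x for x in d) - sum(d) ** 2) / (n * (n - 1))
se = math.sqrt(sd2 / n)                     # standard error, about 1.80
t = (sum(before) / n - sum(after) / n) / se  # (16.6 - 16.2) / 1.80, about 0.22
```

With t ≈ 0.22 against a critical value of 2.776 at df = 4, the null hypothesis is retained, matching the module's conclusion.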
6.5. Measures of Relationship (Correlation co-efficient, Simple Linear Regression and Bivariate
Contingency Tables)
The statistical measures discussed so far have dealt with univariate populations, that is, populations
which have one variable as their characteristic feature. Observations based on two variables are
known as bivariate relationships. If for every measurement of a variable X we have a corresponding
value of a second variable Y, the resulting pairs of values are called a bivariate population. We have
to answer two types of questions about a bivariate population:
Does there exist an association or correlation between the two variables? If yes, to what degree?
Is there any cause and effect relation between the two variables? (Kothari 2004: 138)
i) Coefficient of correlation or simple correlation: It is the most widely used method of measuring
the degree of relationship between two variables. At times we want to know if there is a relation
between the variables incidence of child labour and broken homes or that between drug addiction and
involvement into criminal activities. In all such cases, it would be appropriate to use coefficient
correlation. The Pearson correlation coefficient or Pearson’s ‘r’ (also known as Pearson product-
moment coefficient correlation) is a measure of the straight line relationship between two interval-
level variables (Elifson 1997). To employ Pearson’s correlation coefficient correctly as a measure of
association between X and Y variables, the following requirements must be taken into account:
Interval data: Both X and Y variables must be measured at the interval level so that
scores may be assigned to the respondents
Normally distributed characteristics: Testing the significance of Pearson’s ‘r’ requires
both X and Y variables to be normally distributed in the population (Levin and Fox
2006: 357).
Computation of the Pearson’s ‘r’ by Mean Deviation Method: The mean deviation computational
equation for ‘r’ is:
‘r’ = ∑(X − X̄)(Y − Ȳ) / √[∑(X − X̄)²∑(Y − Ȳ)²]

where X and Y stand for the scores on the two variables, X̄ and Ȳ are their respective means, and
(X − X̄) and (Y − Ȳ) are the deviations of the scores from the means. The calculation is shown in the
following table. An effort is made here to find out the nature and strength of the relationship
between the variables mothers’ education and daughters’ education (Elifson 1997).
Respondent   Mother’s education (X)   (X − X̄)   (X − X̄)²   Daughter’s education (Y)   (Y − Ȳ)   (Y − Ȳ)²   (X − X̄)(Y − Ȳ)
A            1                        −6         36          7                          −6         36         36
B            3                        −4         16          4                          −9         81         36
C            5                        −2         4           13                         0          0          0
D            7                        0          0           16                         3          9          0
E            9                        2          4           10                         −3         9          −6
F            11                       4          16          22                         9          81         36
G            13                       6          36          19                         6          36         36
                                                 ∑ = 112                                           ∑ = 252    ∑ = 138
Now we substitute the values into the above equation and compute Pearson’s ‘r’:

‘r’ = 138/√[(112)(252)] = 138/√28224 = 138/168 = 0.82
The value of ‘r’ lies in between (+1) and (-1). The direction of a relationship is indicated by the sign
of the correlation coefficient. A positive relationship (or direct relationship) indicates that high scores
on one variable tend to be associated with high scores on a second variable and conversely low scores
on one variable tend to be associated with low scores on the second variable. A negative relationship
(also referred to as an inverse or indirect relationship) indicates that low scores on one variable tend to
be associated with high scores on a second variable. Conversely high scores on one variable tend to be
associated with low scores on the second variable (Elifson 1997: 201). In the above example there is
found to be a strong positive correlation between mothers’ education and their daughters’ education.
In a concluding note it can be said that although there is no established rule so as to specify what
constitutes a weak, moderate or strong relationship, yet there are certain guidelines to follow. A weak
relationship is one where the score varies between ± 0.01 to ± 0.30, moderate when the scores vary
between ± 0.31 to ± 0.70, and strong relationship between ± 0.71 to ± 0.99. A perfect relationship is ±
1.00 and no relationship is indicated when ‘r’ = 0 (Elifson 1997: 208).
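The mean-deviation computation of Pearson's 'r' above translates directly into Python (a minimal sketch, not part of the original module; the data are the mother/daughter education scores):

```python
import math

x = [1, 3, 5, 7, 9, 11, 13]     # mother's education
y = [7, 4, 13, 16, 10, 22, 19]  # daughter's education
mx, my = sum(x) / len(x), sum(y) / len(y)  # means: 7 and 13

# numerator: sum of the cross-products of deviations (138 in the example)
num = sum((a - mx) * (b - my) for a, b in zip(x, y))
# denominator: sqrt of the product of the two sums of squared deviations
den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
r = num / den  # about 0.82: a strong positive correlation
```

The value agrees with the worked example: 138/√(112 × 252) = 138/168 ≈ 0.82.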
ii) Simple Regression Analysis: Regression analysis is very closely related to correlation. It is the
statistical determination of a relationship between two or more variables (Kothari 2004). When we
use regression analysis, we are essentially interested in the description of a predictive relationship
(Vito and Latessa 1989). The independent variable in the relationship is known as the cause and the
dependent variable is the effect. In regression analysis, we can state accurately the degree of change in
the two variables. In other words, how much each unit change in X produces a change Y (Kothari
2004). The basic equation of simple linear regression is as follows:
Ŷ = a + bX

where Ŷ is the predicted score of the dependent variable, X is the score of the independent variable,
‘a’ is the Y intercept (the point at which the regression line crosses the Y axis, representing the
predicted value of Y when X = 0), and ‘b’ is the regression coefficient, the slope of the regression
line, which indicates the expected change in Y with a change of one unit in X (Vito and Latessa
1989).
Vito and Latessa (1989) state the example of the theory of prisonization in correction. According to
the theory, the longer a person is incarcerated, the more ‘prisonized’ the person will become and their
readjustment to society will be hampered. The hypothesis was tested with a random sample of inmates
using a scale designed to test the degree of prisonization, where 0 indicates no prisonization and 10
equals a high degree of prisonization. Here prisonization is the dependent variable (Y) whereas the
time served in prison (in years) is the independent variable (X). The computation is shown in the
table below:

Prisoner   X    x = (X − X̄)   x²      Y    y = (Y − Ȳ)   y²      xy
A          0    −3.4           11.56   1    −3.6           12.96   12.24
B          2    −1.4           1.96    3    −1.6           2.56    2.24
C          5    1.6            2.56    4    −0.6           0.36    −0.96
D          4    0.6            0.36    6    1.4            1.96    0.84
E          6    2.6            6.76    9    4.4            19.36   11.44
N = 5      ∑X = 17             ∑x² = 23.2   ∑Y = 23        ∑y² = 37.2   ∑xy = 25.8
The mean value of X is X̄ = 17/5 = 3.4
The mean value of Y is Ȳ = 23/5 = 4.6

The value of the regression coefficient ‘b’ can be found from the formula:
b = ∑xy/∑x² = 25.8/23.2 = 1.11

Now we can find the value of ‘a’ from the formula:
a = Ȳ − b(X̄) = 4.6 − 1.11(3.4) = 0.83

‘b’ is the slope of the regression line, or the ratio of the change in Y corresponding to a change in X.
Therefore when X changes by 1 unit, Y will change by 1.11 units. ‘a’ is the Y-intercept, or the value
of Y when X = 0 (Vito and Latessa 1989).
When the value of X (time spent in prison) is 2, the predicted value Ŷ (the degree of prisonization)
would be:

Ŷ = a + bX = 0.83 + 1.11(2) = 0.83 + 2.22 = 3.05.

In this way we can calculate the value of the dependent variable from the regression equation and
infer what amount of change in X will lead to what amount of change in Y.
To conclude, we can state that the regression analysis is a statistical method to deal with the
formulation of mathematical model depicting relationship amongst variables which can be used for
the purpose of prediction of the values of the dependent variable, given the values of the independent
variable (Kothari 2004: 142).
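The prisonization example above can be sketched in Python (a minimal illustration of the least-squares formulas used in the module; `predict` is a hypothetical helper, not from the source):

```python
x = [0, 2, 5, 4, 6]  # years served (X)
y = [1, 3, 4, 6, 9]  # prisonization score (Y)
mx, my = sum(x) / len(x), sum(y) / len(y)  # 3.4 and 4.6

# slope b = sum(xy) / sum(x^2) over mean deviations: 25.8 / 23.2
b = (sum((a - mx) * (c - my) for a, c in zip(x, y))
     / sum((a - mx) ** 2 for a in x))
a_intercept = my - b * mx  # a = Y-bar - b * X-bar, about 0.83

def predict(time_served):
    """Predicted prisonization score for a given number of years served."""
    return a_intercept + b * time_served
# predict(2) is about 3.04; the module, carrying rounded coefficients
# (b = 1.11, a = 0.83), reports 3.05
```

The tiny gap between 3.04 and 3.05 is rounding in the module's hand computation; the regression line itself is the same.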
iii) Contingency Tables: Contingency Tables are another way of explaining and interpreting
relationship between variables. In the present module, we would be concerned only with the bivariate
contingency tables where the focus of discussion would be on two variables – one an independent
variable or the predictor variable (symbolized by X) and the other a dependent variable (symbolized
by Y). Here we would discuss a relationship between marital status (X) and employment status (Y) of
women. The hypothesis is that marital status exerts an influence on the employment status of women.
The study has been carried out on 200 respondents (Elifson 1997). The data have been presented in
the table below:
                          Marital Status (X)
Employment Status (Y)   Never Married   Married   Divorced   Widowed   Total
Employed                21              60        11         6         98
Not Employed            14              65        4          19        102
Total                   35              125       15         25        N = 200
Contingency tables can be interpreted by percentaging it in three ways as follows.
Percentaging Down: This is one of the most common ways of calculating percentages. Here the
column marginals (35, 125, 15 and 25) are taken as the base on which the percentages are calculated.
Percentaging down is also referred to as percentaging on the independent variable when it is the
column variable. Percentaging down allows us to determine the effect of the independent variable by
comparing across the percentages within a row that is by comparing people in different categories of
the independent variable (Elifson 1997: 172). The method will be shown below:
                          Marital Status (X)
Employment Status (Y)   Never Married   Married   Divorced   Widowed
Employed                60%             48%       73.3%      24%
Not Employed            40%             52%       26.7%      76%
Total                   100%            100%      100%       100%
While interpreting from the above table, we say 60% (21/35x100) of the never married respondents
are employed, 48% (60/125x100) of the married respondents are employed, 73.3% (11/15x100) of the
divorced respondents are employed and 24% (6/25x100) of the widowed respondents are employed. If
we interpret it in this way we get a logical relationship between marital status and employment status
of women.
Percentaging Across: When we are percentaging across we are taking row marginal as the base and
calculating percentages. Here we are percentaging across and comparing up and down. An advantage
of doing this is that a profile of the employed versus those who are not employed can be established in
terms of their marital status (Elifson 1997: 172). This is also shown in the table below:
                          Marital Status (X)
Employment Status (Y)   Never Married   Married   Divorced   Widowed   Total
Employed                21.4%           61.2%     11.2%      6.1%      99.9%
Not Employed            13.7%           63.7%     3.9%       18.6%     99.9%
From the above table, we can say that 21.4% (21/98×100) of the employed respondents have never
married, and 13.7% (14/102×100) of the not-employed respondents have never married. Moreover,
61.2% of the employed respondents are married, whereas 63.7% of the not-employed respondents
are married; 11.2% of the employed respondents are divorced, while 3.9% of the not-employed
respondents are divorced; and 6.1% of the employed respondents are widowed, against 18.6% of the
not-employed respondents. In the above table, the totals do not come to exactly 100% due to
rounding (Elifson 1997).
Percentaging on the total number of cases: This is another method of interpreting bivariate
contingency tables. Here the percentages are calculated on the total number of cases (N). The
following table shows this:
                          Marital Status (X)
Employment Status (Y)   Never Married   Married   Divorced   Widowed
Employed                10.5%           30%       5.5%       3%
Not Employed            7%              32.5%     2%         9.5%
(All cells together total 100% of N = 200)
From the above table we infer that 10.5% (21/200×100) of the respondents have never married and
are employed, whereas 7% (14/200×100) of the respondents have never married and are not
employed. Like the second method (percentaging across), this way of percentaging does not allow
us to see the influence of the independent variable on the dependent one and is rarely used, though it
is useful in certain instances (Elifson 1997: 172).
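As an illustration (not part of the original module), the percentaging-down step can be sketched in Python; the counts are those of the marital-status example:

```python
# observed counts (employed, not employed) for each marital status
counts = {
    "Never Married": (21, 14),
    "Married": (60, 65),
    "Divorced": (11, 4),
    "Widowed": (6, 19),
}

# percentaging down: each column (marital-status) total is the base,
# so we compare categories of the independent variable across a row
down = {status: round(100 * employed / (employed + unemployed), 1)
        for status, (employed, unemployed) in counts.items()}
# down reproduces the first percentage table, e.g. 73.3% of the divorced
# respondents are employed, against 24% of the widowed
```

Percentaging across or on N would simply change the denominator to the row total (98 or 102) or to N = 200 respectively.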
Self-Check Exercise – 3
1. What is measurement?
Measurement is the assignment of numbers to objects or events according to some
predetermined (or arbitrary) rules. The different levels of measurement represent different
levels of numerical information contained in a set of observations.
2. What are the levels of measurement that are used by the social scientists?
There are four levels of measurement namely – nominal, ordinal, interval and ratio. The
characteristics of each will decide the kind of statistical application we can use.
The nominal level does not involve highly complex measurement but rather involves rules for
placing individuals or objects into categories.
The ordinal scales possess all the characteristics of the nominal and in addition the categories
represent a rank-ordered series of relationships like poorer, healthier, greater than etc.
The interval and ratio scales are the highest level of measurement in science and employ
numbers. The numerical values associated with these scales permit the use of mathematical
operations such as adding, subtracting, multiplying and dividing. The only difference between
the two is that the ratio level has a true zero point which the interval does not have. With both
these levels we can state the exact differences between categories (Elifson 1997).
7. Limitations of Statistics in Sociology
Statistics plays a role in Sociology, especially in Applied Sociology. There is a debate which has been
going on since the middle of the twentieth century between researchers who are committed to the use
of quantitative methods and computer application and those who believe in qualitative approach in
sociology. The latter group argues that statistics, if its importance is overemphasized, will become a
substitute for sociology. They argue that it is not always appropriate to conduct research with
quantitative variables that can be handled by statistical analysis. The decision to apply statistics to the
research would depend on factors like the nature of the problem, the subjects of study and the
availability of previously collected data, to name a few (Weinstein 2011). Researchers nowadays
increasingly depend on the use of mixed methods. In general, mixed methods combine both
qualitative and quantitative techniques so that their respective weaknesses cancel out. Triangulation
is a particular application of mixed methods (Guthrie 2010). One way in which a qualitative research
approach is introduced into quantitative research is through ethnostatistics, which involves the study
of the construction, interpretation and display of statistics in quantitative social research. The idea of
ethnostatistics can be applied in many ways, but one predominant way is to treat statistics as
rhetoric. More specifically, this implies examining the language used in persuading audiences about
the validity of the research (Bryman 2004: 446). To conclude, we can say that statistics will be a
necessary tool for effective research but can never be a substitute for sociological reasoning. It can
give the data some precision and make it manageable and smart for presentation (Weinstein 2011).
8. Summary
The present module has tried to analyse the processes and methods to examine quantitative data that
is, data that can be reduced to numbers. This process comes at a time when the researcher is through
with the process of data collection. The data are first processed through various methods of coding,
tabulation and classification. These help to reduce the data to manageable proportions and make
them ready for interpretation. After the data are processed, different methods of
statistics like measures of central tendency, dispersion, chi-square, t-test, coefficient correlation,
simple regression and contingency tables are used to interpret data. The choice of the use of statistical
application depends on the nature of the research and the availability of the levels of data. But it has to
be remembered that statistical analysis is only a helping tool of research. It can never be a substitute
for the efforts of the researcher and the quality of the data collected. A combination of quantitative
and qualitative methods of analysis is essential for the interpretation of data in social research.
9. References
Ahuja, Ram. Research Methods. Jaipur: Rawat Publications, 2007.
Bryman, Alan. Social Research Methods. New York: Oxford University Press, 2004.
Elifson, Kirk W., Richard P. Runyon and Audrey Haber. Fundamentals of Social Statistics. United
States: McGraw-Hill, 1997.
Garrett, Henry E. Statistics in Psychology and Education. New York: David McKay Company, Inc.,
1981.
Guthrie, Gerard. Basic Research Methods: An Entry to Social Science Research. New Delhi: Sage
Publications India Private Limited, 2010.
Kothari, C.R. Research Methodology: Methods and Techniques. New Delhi: New Age International
(P) Limited, Publishers, 2008.
Leonard, Wilbert Marcellus. Basic Social Statistics. Illinois: Stipes Publishing L.L.C., 1996.
Levin, Jack and James Alan Fox. Elementary Statistics in Social Research. New Delhi: Dorling
Kindersley (India) Pvt. Ltd., 2006.
Majumdar, P. K. Research Methods in Social Science. New Delhi: Viva Books Private Limited,
2005.
Morrison, Ken. Marx, Durkheim, Weber. London: Sage Publications, 1995.
Vito, Gennaro and Edward Latessa. Statistical Applications in Criminal Justice. London: Sage
Publications, 1989.
Weinstein, Jay Alan. Applying Social Statistics. United Kingdom: Rowman and Littlefield
Publishers Inc., 2011.