
BIOSTATISTICS

School of Pharmacy (COMH 607)

1. RESEARCH METHODS

1.1. Introduction to Research

What is Research?

• A scientific study to seek hidden knowledge
• A scientific study to answer a question
• A scientific study of causes and effects
• A scientific attempt towards new discoveries
• A systematic method of inquiry
• A logical attempt to find answers to problems
• A systematic approach to a (medical) problem

Statistical Concept of Research

• Research is the systematic collection, analysis and interpretation of data in order to answer a research question.

• It is classified as:
– Basic research: necessary to generate new knowledge and technologies.
– Applied research: necessary to identify priority problems and to design and evaluate policies and programs for optimal health care and delivery.

1.2. Types of Epidemiological Design

A. Descriptive studies

• Mainly concerned with the distribution of diseases with respect to time, place and person.

• Useful for health managers to allocate resources and to plan effective prevention programmes.

• Useful for generating epidemiological hypotheses, an important first step in the search for disease determinants or risk factors.

• Can use routinely collected information that is readily available in many places, so descriptive studies are generally less expensive and less time-consuming than analytic studies.

• It is the most common type of epidemiological design strategy in the medical literature.

• There are three main types:
– Correlational
– Case report or case series
– Cross-sectional

A.1. Correlational or Ecological

• Uses data from entire populations to compare disease frequencies between different groups during the same period of time, or in the same population at different points in time.

• Does not provide individual data; rather, it presents the average exposure level in the community.

• Cause cannot be ascertained.

• The correlation coefficient is the measure of association in correlational studies. It is important to note that a positive association does not necessarily imply a valid statistical association.

E.g.

• Hypertension rates and average per capita salt consumption compared between two communities.
• Average per capita fat consumption and breast cancer rates compared between two communities.
• Incidence of dental caries in relation to the fluoride content of the water, compared among towns in the Rift Valley.
• Mortality from CHD in relation to per capita cigarette sales among the regions of Ethiopia.
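As a rough illustration of the correlation coefficient used in correlational studies, the sketch below computes Pearson's r from community-level averages. The five data points and the function name `pearson_r` are invented for illustration only; they are not taken from any real study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical community-level averages (one value per community)
salt_g_per_day = [6.0, 8.0, 10.0, 12.0, 14.0]
hypertension_pct = [12.0, 15.0, 19.0, 22.0, 27.0]

r = pearson_r(salt_g_per_day, hypertension_pct)
print(round(r, 3))
```

A value of r close to 1 describes the communities as groups. It does not show that the individuals who consume more salt are the ones who develop hypertension, which is the main limitation of this design.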


• Strength: Can be done quickly and inexpensively, often using available data.

• Limitations:
– Inability to link exposure with disease in individuals.
– Inability to control for the effects of potential confounding factors; something else may be the true cause.
– It may mask a non-linear relationship between exposure and disease. For example, alcohol consumption and mortality from CHD have a non-linear relationship (the curve is “J” shaped).

A.2. Case Report and Case Series

• Describes the experience of a single patient or a group of patients with a similar diagnosis. Has limited value, but is occasionally revolutionary.

• E.g. Five young homosexual men with Pneumocystis carinii pneumonia (PCP), seen between Oct. 1980 and May 1981 in Los Angeles, aroused concern among physicians. Later, with further follow-up and thorough investigation of this strange occurrence of the disease, the diagnosis of AIDS was established for the first time.

• Strength:
– Very useful for hypothesis generation.

• Limitations:
– The report is based on a single patient or a few patients, so the findings could have occurred just by coincidence.
– Lack of an appropriate comparison group.

A.3. Cross-Sectional Studies (Survey)

• Information about the status of an individual with respect to the presence or absence of exposure and disease is assessed at the same point in time. Easy to do; many surveys are like this.

• For factors that remain unaltered over time, such as sex, race or blood group, the cross-sectional survey can provide evidence of a valid statistical association.

• Useful for raising the question of the presence of an association rather than for testing a hypothesis.

B. ANALYTIC STUDIES

• Focus on the determinants of a disease by testing the hypotheses formulated from descriptive studies, with the ultimate goal of judging whether a particular exposure causes or prevents disease.

• Broadly classified into two:
– Observational and interventional studies.
– Both types use “controls”. The use of controls is the main distinguishing feature of analytic studies.

B.1. Observational studies

• Information is obtained by observation of events. No intervention is done. Cohort and case-control studies are in this category.

i. Cohort

• Subjects are selected by exposure, or determinant of interest, and followed to see if they develop the disease or outcome of interest.

• E.g. Follow 100 children who received BCG vaccination and another 100 who did not, and see how many in each group get tuberculosis.

ii. Case Control

• Subjects are selected with respect to the presence or absence of disease, or outcome of interest, and then inquiries are made about past exposure to the factor(s) of interest.

• E.g. Take people with and without TB and ask them if they ever had BCG vaccination.

B.2. Interventional / Experimental

• The researcher does something about the disease or exposure and observes the changes.

• The investigator has control over who gets the exposure and who does not. The key is that the investigator assigns subjects to either group, whether it is done randomly or not.

• Always prospective.

• E.g. Assign children randomly to get chloroquine or not, and see how many develop symptomatic malaria.

Description of common terms

Statistics: The process of scientifically collecting, organizing, summarizing and interpreting data, and of drawing inferences about a body of data when only part of the data is observed.

Biostatistics: A branch of statistics in which the data being analyzed are derived from the biological and medical sciences.

Descriptive statistics: A statistical method that is concerned with the collection, organization, summarization, and analysis of data from a sample of a population.

Inferential statistics: A statistical method that is concerned with drawing inferences or conclusions about a particular population by selecting and measuring a random sample from the population.

Population: The largest collection of entities (or values of a random variable) in which we have an interest at a particular time. A population may be finite or infinite. For example, we can take all the students in a given class (e.g. 100 students) as a population.

• Target population: A collection of items that have something in common and about which we wish to draw conclusions at a particular time.

• Study population: The specific population from which data are collected.

Sample: A part (subset) of the population of interest. In the above example, if we randomly select 25 students from the 100, those 25 students are a sample of the class.

Hence, generalizability is a two-stage procedure: we want to generalize from the sample to the study population, and then from the study population to the target population.

E.g. In a study of the prevalence of HIV among orphan children in Ethiopia, a random sample of orphan children in Lideta Kifle Ketema was included.

Target population: All orphan children in Ethiopia
Study population: All orphan children in Addis Ababa
Sample: Orphan children in Lideta Kifle Ketema

Statistical inference: The procedure by which we reach a conclusion about a population on the basis of the information contained in a sample drawn from that population.

Parameter: A descriptive measure computed from the data of a population, e.g. population mean (µ), population variance, population standard deviation.

Statistic: A descriptive measure computed from the data of a sample.

Statistical data: Information that is systematically collected, tabulated and analyzed, and whose results are interpreted to draw conclusions.
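The difference between a parameter and a statistic can be sketched with the class of 100 students mentioned earlier. The scores below (simply the integers 1 to 100) and the seed are made up for illustration; only the sizes 100 and 25 come from the text.

```python
import random
import statistics

# Hypothetical population: one score for each of 100 students
population = list(range(1, 101))

# Parameter: computed from every unit in the population
mu = statistics.mean(population)

# Statistic: computed from a random sample of 25 students
rng = random.Random(607)          # fixed seed, an arbitrary choice
sample = rng.sample(population, 25)
x_bar = statistics.mean(sample)

print("population mean (parameter):", mu)
print("sample mean (statistic):", x_bar)
```

The parameter is fixed for a given population, while the statistic changes from one random sample to the next; statistical inference uses the second to draw conclusions about the first.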


• Data: An aggregate of values of variables obtained by measurement or counting.

• Variable: A characteristic that takes on different values in different persons, places, or things.
– Dependent variable (response): the variable(s) we measure as the outcome of interest.
– Independent variable (predictor): the variable(s) that determine the outcome.

Categorical variable: The notion of magnitude is absent or implicit.

– Nominal: has distinct levels that have no inherent ordering.
  – With only two categories, it is called binary or dichotomous, e.g. sex (male or female).
  – With more than two categories, it is called polytomous, e.g. color.

– Ordinal: has levels that follow a distinct ordering, e.g. severity of pain (mild, moderate, severe).

Quantitative (numeric) variable: A variable that has magnitude.

• Discrete data: numbers represent actual measurable quantities rather than mere labels, but are restricted to taking only specified values, often integers or counts that differ by fixed amounts.
– e.g. number of new AIDS cases reported during a one-year period, number of beds available in a particular hospital

• Continuous data: represent measurable quantities that are not restricted to taking on certain specific values, i.e. fractional values are possible. Can be measured on an interval scale (no true zero value) or a ratio scale (begins at a true zero).
– e.g. weight, cholesterol level, time, temperature

1.3. Sampling Methods

Sampling

• The process of selecting a portion of the population to represent the entire population.

• A main concern in sampling is to ensure that the sample represents the population, so that the findings can be generalized.

Advantages of sampling:

• Feasibility: Sampling may be the only feasible method of collecting information.

• Reduced cost: Sampling reduces demands on resources such as finance, personnel, and material.

• Greater accuracy: Sampling may lead to more accurate data collection.

• Sampling error: Precise allowance can be made for sampling error.

• Greater speed: Data can be collected and summarized more quickly.

Disadvantages of sampling:

• There is always a sampling error.
• Sampling may create a feeling of discrimination within the population.
• Sampling may be inadvisable where every unit in the population is legally required to have a record.

Errors in sampling

1) Sampling error: Errors introduced due to selection of a sample.
– They cannot be avoided or totally eliminated.

2) Non-sampling error:
– Observational error
– Respondent error
– Lack of preciseness of definition
– Errors in editing and tabulation of data

Divisions of Sampling Methods

Two broad divisions:

A. Probability sampling methods

B. Non-probability sampling methods

1.3.1. Probability sampling

• Involves random selection of a sample.

• A sample is obtained in a way that ensures every member of the population has a known, non-zero probability of being included in the sample.

• Involves the selection of a sample from a population, based on chance.

• Probability sampling is:
– more complex,
– more time-consuming and
– usually more costly than non-probability sampling.

• However, because study samples are randomly selected and their probability of inclusion can be calculated:
– reliable estimates can be produced, and
– inferences can be made about the population.

• There are several different ways in which a probability sample can be selected.

• The method chosen depends on a number of factors, such as:
– the available sampling frame,
– how spread out the population is,
– how costly it is to survey members of the population.

Most common probability sampling methods

1. Simple random sampling
2. Systematic random sampling
3. Stratified random sampling
4. Cluster sampling
5. Multi-stage sampling

1. Simple random sampling (SRS)

• Involves random selection.

• Each member of a population has an equal chance of being included in the sample.

• To use the SRS method:
– Make a numbered list of all the units in the population.
– Each unit should be numbered from 1 to N (where N is the size of the population).
– Select the required number of units at random.

• The randomness of the sample is ensured by:
– use of “lottery” methods,
– a table of random numbers, or
– computer programs.

• Example
– Suppose your school has 500 students and you need to conduct a short survey on the quality of the food served in the cafeteria.
– You decide that a sample of 10 students should be sufficient for your purposes.
– In order to get your sample, you assign a number from 1 to 500 to each student in your school.

• To select the sample, you use a table of randomly generated numbers.

• Pick a starting point in the table (a row and column number) and look at the random numbers that appear there. In this case, since the student numbers run to three digits, the random numbers need to contain three digits as well.

• Ignore all random numbers greater than 500 (and 000), because they do not correspond to any of the students in the school.

• Remember that sampling is without replacement, so if a number recurs, skip over it and use the next random number.

• The first 10 different numbers between 001 and 500 make up your sample.
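The random-number-table procedure above can be imitated in a few lines of code. This is only a sketch: the pseudo-random generator stands in for the printed table, and the function name `sample_from_random_digits` and the seed are assumptions made for illustration. It reads three-digit entries, so it assumes a population of at most 999 units.

```python
import random

def sample_from_random_digits(population_size, sample_size, seed=None):
    """Mimic reading three-digit entries from a random number table."""
    rng = random.Random(seed)
    chosen = []
    while len(chosen) < sample_size:
        number = rng.randint(0, 999)      # one three-digit entry, 000 to 999
        if number < 1 or number > population_size:
            continue                      # no student has this number: ignore it
        if number in chosen:
            continue                      # without replacement: skip repeats
        chosen.append(number)
    return chosen

# 10 students out of 500, as in the cafeteria survey example
students = sample_from_random_digits(500, 10, seed=1)
print(students)
```

In practice the same result is obtained more directly with `random.sample(range(1, 501), 10)`, which also samples without replacement.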


• SRS has certain limitations:
– Requires a sampling frame.
– Difficult if the reference population is dispersed.
– Minority subgroups of interest may not be selected.

2. Systematic random sampling

• Sometimes called interval sampling, systematic sampling means that there is a gap, or interval, between each selected unit in the sample.

• The selection is systematic rather than random.
– Individuals are chosen at regular intervals from the sampling frame. Ideally, we randomly select a number to tell us where to start selecting individuals from the list.

• Important if the reference population is arranged in some order:
– Order of registration of patients
– Numerical order of house numbers
– Students' registration books
– Individuals are taken at fixed intervals (every kth) based on the sampling fraction, e.g. if the sample includes 20%, then every fifth.

Steps in systematic random sampling

1. Number the units on your frame from 1 to N (where N is the total population size).

2. Determine the sampling interval (K) by dividing the number of units in the population by the desired sample size.


Steps…

In a survey, in order to find one study unit it is important to figure out how many houses must be visited, usually by doing a pilot study.

• Example: Assume you are doing a study involving children under 5. There are 1500 households in all, and you have a required sample size of 100 children. From a preliminary study you have done, there is one child for every 2.5 households. If there were a child in every household, you would visit 100 households. But because not every household includes a child, you will need to visit 100 x 2.5 = 250 households to find the required 100 children.

• The sampling interval will therefore be 1500/250 = 6, i.e. every 6th household.

3. Select a number between one and K at random. This number is called the random start and is the first number included in your sample.

4. Select every Kth unit after that first number.

Note: Systematic sampling should not be used when a cyclic repetition is inherent in the sampling frame.
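The household adjustment in the under-5 example (step 2 when not every household contains a study unit) is simple arithmetic, sketched below. The function and parameter names are hypothetical; the figures 1500, 100 and 2.5 are the ones given in the example.

```python
def household_sampling_interval(total_households, required_units, households_per_unit):
    """Return (households to visit, sampling interval K).

    households_per_unit is the pilot-study estimate of how many
    households must be visited to find one study unit.
    """
    households_to_visit = required_units * households_per_unit
    interval = total_households / households_to_visit
    return households_to_visit, interval

visits, k = household_sampling_interval(1500, 100, 2.5)
print(visits, k)   # 250.0 households to visit, interval 6.0 (every 6th household)
```

If the division does not give a whole number, the interval has to be rounded, which is a choice the text does not specify.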


Example

To select a sample of 100 from a population of 400, you would need a sampling interval of 400 ÷ 100 = 4. Therefore, K = 4.

You will need to select one unit out of every four units to end up with a total of 100 units in your sample.

Select a number between 1 and 4 from a table of random numbers.

• If you choose 3, the third unit on your frame would be the first unit included in your sample.

• The sample would then consist of the following 100 units: 3 (the random start), 7, 11, 15, 19, ..., 395, 399 (up to N, which is 400 in this case).

The main difference from SRS: with SRS, any combination of 100 units would have a chance of making up the sample, while with systematic sampling there are only four possible samples.
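The worked example can be checked with a short sketch. The function name `systematic_sample` is an assumption for illustration, and the sketch assumes the population size is an exact multiple of the sample size, as it is here (400 and 100).

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every K-th unit after a random start between 1 and K."""
    k = population_size // sample_size          # sampling interval K
    if start is None:
        start = random.Random(seed).randint(1, k)
    if not 1 <= start <= k:
        raise ValueError("random start must lie between 1 and K")
    return [start + i * k for i in range(sample_size)]

units = systematic_sample(400, 100, start=3)
print(units[:5], "...", units[-2:])

# Only K different samples are possible, one for each random start
possible_samples = {tuple(systematic_sample(400, 100, start=s)) for s in range(1, 5)}
print(len(possible_samples))
```

With a random start of 3, the list begins 3, 7, 11, 15, 19 and ends 395, 399, matching the example, and the set of possible samples has exactly four members.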


Advantages

• Systematic sampling is usually less time-consuming and easier to perform than SRS.

• It provides a good approximation to SRS (i.e. it has high precision).

• Unlike SRS, systematic sampling can be conducted without a sampling frame, so it is useful when a sampling frame is not readily available.
– E.g. patients attending a health center, where it is not possible to predict in advance who will be attending.

Disadvantage

• If there is any sort of cyclic pattern in the ordering of the subjects that coincides with the sampling interval, the sample will not be representative of the population.
– May result in systematic error.

3. Stratified random sampling

• It is done when the population is known to be heterogeneous with regard to some factors, and those factors are used for stratification.

• Using stratified sampling, the population is divided into homogeneous, mutually exclusive groups called strata.
– A population can be stratified by any variable that is available for all units prior to sampling (e.g., age, sex, province of residence, income, etc.).

• A separate sample is taken independently from each stratum.

• Any of the sampling methods mentioned in this section (and others that exist) can be used to sample within each stratum.

http://www.statcan.ca/english/edu/power/ch13/probability/probability.htm#Top

Why do we need to create strata?

• It can make the sampling strategy more efficient.

• A larger sample is required to get a more accurate estimate if a characteristic varies greatly from one unit to the other.

• For example, if every person in a population had the same salary, then a sample of one individual would be enough to get a precise estimate of the average salary.

• This is the idea behind the efficiency gain obtained with stratification.
– If you create strata within which units share similar characteristics (e.g., income) and are considerably different from units in other strata (e.g., occupation, type of dwelling), then you would only need a small sample from each stratum to get a precise estimate of total income for that stratum.

– Then you could combine these estimates to get a precise estimate of total income for the whole population.

• If you use a SRS approach in the whole population without stratification, the sample would need to be larger than the total of all stratum samples to get an estimate with the same level of precision.


• Stratified sampling ensures an adequate sample size for sub-groups in the population of interest.

• When a population is stratified, each stratum becomes an independent population and you will need to decide the sample size for each stratum.

• Equal allocation:
– Allocate equal sample size to each stratum.

• Proportionate allocation: $n_j = \frac{n}{N} N_j$, j = 1, 2, ..., k, where k is the number of strata and
– $n_j$ is the sample size of the jth stratum
– $N_j$ is the population size of the jth stratum
– $n = n_1 + n_2 + ... + n_k$ is the total sample size
– $N = N_1 + N_2 + ... + N_k$ is the total population size
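Proportionate allocation gives each stratum the same sampling fraction n/N. A minimal sketch follows; the two strata and their sizes are hypothetical, and the function name `proportionate_allocation` is an assumption for illustration.

```python
def proportionate_allocation(stratum_sizes, total_sample):
    """Allocate n_j = (n / N) * N_j to each stratum j."""
    total_population = sum(stratum_sizes.values())            # N
    return {
        name: total_sample * size / total_population           # n_j
        for name, size in stratum_sizes.items()
    }

# Hypothetical strata: N = 1000 units, total sample n = 100
strata = {"urban": 600, "rural": 400}
allocation = proportionate_allocation(strata, 100)
print(allocation)
```

Here the sampling fraction is 100/1000 = 10%, so 60 units come from the larger stratum and 40 from the smaller. When the result is not a whole number, the stratum sizes must be rounded, and the rounded values may no longer add up exactly to n.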


4. Cluster sampling

• Sometimes it is too expensive to spread a sample across the population as a whole.

• Travel costs can become expensive if interviewers have to survey people from one end of the country to the other.

• To reduce costs, researchers may choose a cluster sampling technique.

• The clusters should be similar to one another, each ideally reflecting the variety found in the population, unlike stratified sampling, where the strata are designed to differ from one another.

Steps in cluster sampling

• Cluster sampling divides the population into groups or clusters.

• A number of clusters are selected randomly to represent the total population, and then all units within selected clusters are included in the sample.

• No units from non-selected clusters are included in the sample—they are represented by those from selected clusters.

• This differs from stratified sampling, where some units are selected from each group.


Example

• In a school-based study, we assume that students of the same school are homogeneous, so that any section is comparable to the others.

• We can randomly select sections and include all students of the selected sections only.
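The school example can be sketched as single-stage cluster sampling: sections are the clusters, and every student in a selected section enters the sample. The section names, student labels, seed and the function name `cluster_sample` are all hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and keep every unit in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # select clusters, not units
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical sections of unequal size
sections = {
    "Section A": ["A1", "A2", "A3"],
    "Section B": ["B1", "B2"],
    "Section C": ["C1", "C2", "C3", "C4"],
    "Section D": ["D1", "D2", "D3"],
}

selected = cluster_sample(sections, 2, seed=7)
sample_size = sum(len(students) for students in selected.values())
print(selected, sample_size)
```

Because the sections differ in size, the final number of students depends on which sections happen to be drawn.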


• As mentioned, cost reduction is a reason for using cluster sampling.

• It creates 'pockets' of sampled units instead of spreading the sample over the whole territory.

• Another reason is that sometimes a list of all units in the population is not available, while a list of all clusters is either available or easy to create.


• In most cases, the main drawback is a loss of efficiency when compared with SRS.

• It is usually better to survey a large number of small clusters instead of a small number of large clusters.
– This is because neighboring units tend to be more alike, so a few large clusters give a sample that does not represent the whole spectrum of opinions or situations present in the overall population.

• Another drawback to cluster sampling is that you do not have total control over the final sample size.

• For example, not all schools have the same number of (say, Grade 11) students, and city blocks do not all have the same number of households. Since you must interview every student or household in the selected clusters, the final sample size may be larger or smaller than you expected.

5. Multi-stage sampling

• Similar to the cluster sampling, except that it involves picking a sample from within each chosen cluster, rather than including all units in the cluster.

• This type of sampling requires at least two stages.


• In the first stage, large groups or clusters are identified and selected. These clusters contain more population units than are needed for the final sample.

• In the second stage, population units are picked from within the selected clusters (using any of the possible probability sampling methods) for a final sample.


• If more than two stages are used, the process of choosing population units within clusters continues until there is a final sample.

• With multi-stage sampling, you still have the benefit of a more concentrated sample for cost reduction.

• However, the sample is not as concentrated as in single-stage cluster sampling, and the sample size is still bigger than for a simple random sample.

• Also, you do not need to have a list of all of the units in the population. All you need is a list of clusters and list of the units in the selected clusters.

• Admittedly, more information is needed in this type of sample than what is required in cluster sampling. However, multi-stage sampling still saves a great amount of time and effort by not having to create a list of all the units in a population.
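A two-stage version of the procedure described above might look like the sketch below. The schools, student labels, seed and the function name `two_stage_sample` are hypothetical, and simple random sampling is used at both stages, although any probability method could be used within the selected clusters.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: select clusters at random. Stage 2: select units within each."""
    rng = random.Random(seed)
    stage_one = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for name in stage_one:
        units = clusters[name]
        take = min(n_per_cluster, len(units))   # a cluster may be smaller than requested
        sample[name] = rng.sample(units, take)
    return sample

# Hypothetical frame: 6 schools with 30 students each
schools = {
    f"School {i}": [f"S{i}-{j}" for j in range(1, 31)]
    for i in range(1, 7)
}

chosen = two_stage_sample(schools, n_clusters=3, n_per_cluster=5, seed=11)
print(chosen)
```

Only the list of schools and the student lists of the three selected schools are needed, which is the saving in frame preparation that the text describes.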


1.3.2. Non-probability sampling

• The difference between probability and non-probability sampling has to do with a basic assumption about the nature of the population under study.

• In probability sampling, every item has a known chance of being selected.

• In non-probability sampling, there is an assumption that there is an even distribution of a characteristic of interest within the population.


• This is what makes the researcher believe that any sample would be representative and that, because of this, the results will be accurate.

• For probability sampling, randomness is a feature of the selection process, rather than an assumption about the structure of the population.

• In non-probability sampling, since elements are chosen arbitrarily, there is no way to estimate the probability of any one element being included in the sample.

• Also, no assurance is given that each item has a chance of being included, making it impossible either to estimate sampling variability or to identify possible bias.

• Reliability cannot be measured in non-probability sampling; the only way to address data quality is to compare some of the survey results with available information about the population.

• Still, there is no assurance that the estimates will meet an acceptable level of error.

• Researchers are reluctant to use these methods because there is no way to measure the precision of the resulting sample.


• Despite these drawbacks, non-probability sampling methods can be useful when descriptive comments about the sample itself are desired.

• Secondly, they are quick, inexpensive and convenient.

• There are also other circumstances, in some kinds of research, when it is unfeasible or impractical to conduct probability sampling.

Common types of non-probability sampling

1. Convenience or haphazard sampling
2. Volunteer sampling
3. Judgment sampling
4. Quota sampling
5. Snowball sampling technique

1.4. Scales of measurement

• Measurement: the assignment of numbers or names to objects or events according to a set of rules.

• Clearly, not all measurements are the same.

• Measuring an individual's weight is qualitatively different from measuring their response to some treatment on a three-category scale: “improved”, “stable”, “not improved”.

• Measuring scales differ according to the degree of precision involved.

• There are four types of scales of measurement.

Scales…

1. Nominal scale: uses names, labels, or symbols to assign each measurement to one of a limited number of categories that cannot be ordered.
– Examples: blood type, sex, race, marital status

2. Ordinal scale: assigns each measurement to one of a limited number of categories that are ranked in terms of a graded order.
– Examples: patient status, cancer stages

Scales…

3. Interval scale: assigns each measurement to one of an unlimited number of categories that are equally spaced. It has no true zero point.
– Example: temperature measured in Celsius or Fahrenheit

4. Ratio scale: measurement begins at a true zero point and the scale has equal spacing.
– Examples: height, weight, blood pressure
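One practical consequence of the missing true zero on an interval scale is that ratios of values are not meaningful, because they change with the unit. The sketch below illustrates this with made-up temperatures and weights; the kilogram-to-pound factor is an approximate conversion constant.

```python
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

KG_TO_LB = 2.20462  # approximate conversion factor

# Interval scale: the ratio of two temperatures depends on the unit used
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio scale: the ratio of two weights is the same in any unit
ratio_kg = 80 / 40
ratio_lb = (80 * KG_TO_LB) / (40 * KG_TO_LB)

print(ratio_celsius, ratio_fahrenheit)   # 2.0 versus 1.36
print(ratio_kg, ratio_lb)                # both about 2.0
```

So 20 °C is not "twice as hot" as 10 °C, whereas 80 kg is twice as heavy as 40 kg whatever unit is used.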


1.5. Validity and reliability

Validity and reliability are two major requirements for any measurement.
– Validity pertains to the correctness of the measure; a valid tool measures what it is supposed to measure.
– Reliability pertains to the consistency of the tool across different contexts.

• Validity is often described as internal or external.

1.6. Sources and methods of data collection and its handling

Sources

Two major sources:

Primary sources: data collected by the investigator himself/herself for the purpose of a specific inquiry or study. Such data are original in character and are mostly generated by surveys conducted by individuals or research institutions.

First-hand information obtained by the investigator is more reliable and accurate, since the investigator can extract the correct information by removing doubts, if any, in the minds of the respondents regarding certain questions. High response rates may be obtained, since the answers to the various questions are obtained on the spot. It also permits explanation of questions concerning difficult subject matter.

Secondary data

When an investigator uses data that have already been collected by others, such data are called “secondary data”. Such data are primary data for the agency that collected them, and become secondary for someone else who uses them for his/her own purposes.

Secondary data can be obtained from journals, reports of different institutions, government publications, and publications of professionals and research organizations. These data are less expensive and can be collected in a short time.

Data collection methods

1. Observation

• A technique that involves systematically selecting, watching and recording the behaviours of people or other phenomena, and aspects of the setting in which they occur, for the purpose of getting specified information.

• Includes all methods, from simple visual observation to the use of high-level machines and measurements and sophisticated equipment or facilities, such as radiographic and biochemical equipment, X-ray machines, microscopes, clinical examinations, and microbiological examinations.

Observation…

• Advantages: Gives relatively more accurate data on behaviour and activities.

• Disadvantages:
– The investigator's or observer's own biases, prejudices, desires, etc. may affect the data.
– Needs more resources and skilled human power when high-level machines are used.

2. Documentary sources

• Include clinical records and other personal records, published mortality statistics, census publications, etc.

• Advantages:
a) Documents can provide ready-made information relatively easily.
b) They are the best means of studying past events.

• Disadvantages:
a) Problems of reliability and validity (because the information is collected by a number of different persons who may have used different definitions or methods of obtaining data).
b) There is a possibility that errors may occur when the information is extracted from the records.

3. Interviews and self-administered questionnaires

a) Interviews: may be less or more structured.

A public health worker conducting interviews may be armed with a checklist of topics, but may not decide in advance precisely what questions he/she will ask.

• This approach is flexible; the content, wording and order of the questions are relatively unstructured.
– The content, wording and order of the questions vary from interview to interview.

Interviews…

In other situations, a more standardized technique may be used, with the wording and order of the questions decided in advance.

This may take the form of a highly structured interview (interviewing using a questionnaire):

• The investigator appoints persons (enumerators), who go to the respondents personally with the questionnaire, ask them questions and record their replies.
– This can be done using telephone or face-to-face interviews.

Interviews…

• Questions may take two general forms:
– “open-ended” questions, which the subject answers in his/her own words, or
– “closed” questions, which are answered by choosing from a number of fixed alternative responses.

Advantages of interviews

• A good interviewer can stimulate and maintain the respondent's interest. This leads to the frank answering of questions.

• If anxiety is aroused (e.g., why am I being asked these questions?), the interviewer can allay it.

An interviewer:
• can repeat questions which are not understood, and give standardized explanations where necessary.
• can ask “follow-up” or “probing” questions to clarify a response.
• can make observations during the interview, i.e., note is taken not only of what the subject says but also of how he/she says it.

b) Self-administered questionnaire

• The respondent reads the questions and fills in the answers by himself/herself (sometimes in the presence of an interviewer who “stands by” to give assistance if necessary).

• The use of self-administered questionnaires is simpler and cheaper.

• They can be administered to many persons simultaneously (e.g. to a class of school children).

• They can be sent by post. However, they demand a certain level of education on the part of the respondent.

• Quantitative data are commonly collected using structured interviews (where standard questionnaires are common and the collected data can be processed relatively easily), whereas qualitative data are usually collected using unstructured interviews.

• Unstructured interviews are undertaken with the help of checklists, key informant interviews, focus group discussions, etc.

Qualitative…

Checklist: a list of questions prepared ahead of time to facilitate the interviews or discussions. It is not exhaustive. It helps the facilitator not to miss any of the important topics under consideration.

Key informant interviews: interviews done with influential individuals (such as community elders, priests, etc.).

Focus group discussions: discussions held with a group of respondents.

• The group contains 6 to 12 people who are more or less similar with respect to level of education, marital status, age, sex, etc. (this composition helps each respondent to talk freely without being dominated by the others).

Steps in Questionnaire Design

1. Before beginning to construct, make sure that the

questionnaire is the best method of collecting data

for your objectives

– To know before hand what information is needed and

what is going to be done with this information

2. While drafting the questions, one has to know why each question is asked and what will be done with the information (to avoid wasting resources)

83

Steps in…
3. To get valid and reliable information:
• the wording and sequence of questions should help respondents to recall the information
• prevent forgetfulness of the respondents
• avoid difficult, time-consuming, embarrassing or too personal questions
• the flow of questions should be from simple to complex, from general to specific, and from impersonal to personal
• care should be taken to keep the respondent's information confidential
• include a cover letter (if by mail)
• identify respondents by ID (rather than by name)

84

Data Collection and handling Process

85

Data collection

A plan for data collection can be made in two steps:

1. Listing the tasks that have to be carried out and who should be involved, making a rough estimate of the time needed for the different parts of the study, and identifying the most appropriate period in which to carry out the research

2. Actually scheduling the different activities that have to be carried out each week in a work plan

86

Why should you develop a plan for data collection?

A plan for data collection should be developed so that:
– you will have a clear overview of what tasks have to be carried out, who should perform them, and the duration of these tasks;

– you can organize both human and material resources for data collection in the most efficient way; and

– you can minimize errors and delays which may result from lack of planning (for example, the population not being available or data forms being misplaced).

87

Data collection process

Stages

• Stage 1: Permission to proceed
– Obtaining consent from the relevant authorities, individuals and the community in which the project is to be carried out

88

Data collection process
Stage 2: Data collection
• Logistics
– who will collect what,
– when, and
– with what resources

• Quality control
– Prepare a field work manual
– Select your research assistants
– Train research assistants
– Supervision
– Check the collected data for completeness and accuracy

89

Data collection process

• How long will it take to collect the data for each component of the study?
– Step 1: Consider the time required to reach the study area; to locate the study units; the number of visits required per study unit and for follow-up of non-respondents

– Step 2: Calculate the number of interviews that can be carried out per person per day

– Step 3: Calculate the number of days needed to carry out the interviews.

90

Ensuring data quality
Measures to help ensure good quality of data:
• Prepare a field work manual for the research team as a whole
• Select your research assistants, if required, with care
• Train research assistants carefully in all topics covered in the field work manual as well as in interview techniques
• Pre-test research instruments and research procedures with the whole research team, including research assistants.

91

Ensuring data quality

Take care that research assistants are not placed under too much stress

Arrange for on-going supervision of research assistants and guidelines should be developed for supervisory tasks.

Devise methods to assure the quality of data collected by all members of the research team.

92

Data Collection Process

Stage 3: Data handling
• Once the data have been collected and checked for completeness and accuracy, a clear procedure should be developed for handling and storing them
• Number all questionnaires
• Identify the person responsible for storing data and the place where they will be stored
• Decide how data should be stored. Record forms should be kept in the sequence in which they have been numbered.

93

Research Assistants
• This includes data collectors, supervisors and possibly local guides
• Selection – during selection one should consider similarities in educational level and possibly sex composition
• Training – all research assistants and team members should be trained together

94

Pre-test and pilot study

A pre-test usually refers to a small-scale trial of particular research components.

A pilot study is the process of carrying out a preliminary study, going through the entire research procedure with a small sample.

Why do we carry out a pre-test or pilot study?

A pre-test or pilot study serves as a trial run that allows us to identify potential problems in the proposed study.

95

Pre-test and pilot study
What aspects of your research methodology can be evaluated during pre-testing?
1. Reactions of the respondents to the research procedures can be observed in the pre-test – availability and willingness
2. The data-collection tools can be pre-tested
3. Sampling procedures can be checked
4. Staffing and activities of the research team can be checked, while all are involved in the pre-test
5. Procedures for data processing and analysis can be evaluated during the pre-test
6. The proposed work plan and budget for research activities can be assessed during the pre-test.

96

Plan for data processing & analysis

• Data processing and analysis should start in the field, with checking for completeness of the data and

• Performing quality control checks, while sorting the data by instrument used and by group of informants

• Data of small samples may even be processed and analyzed as soon as it is collected.

97


• The plan for data processing and analysis must be made after careful consideration of the objectives of the study as well as of the tools developed to meet the objectives.

• The procedures for the analysis of data collected through qualitative and quantitative techniques are quite different.– For quantitative data the starting point in

analysis is usually a description of the data for each variable

– For qualitative data it is more a matter of describing, summarizing and interpreting the data obtained for each study unit

98


• When making a plan for data processing and analysis the following issues should be considered:
– Sorting data,
– Performing quality-control checks,
– Data processing, and
– Data analysis.

99

Data processing and analysis

• Sorting data
– Into groups of different study populations or comparison groups

• Quality control checks
– Check again for completeness and internal consistency
– Missing data: if many items are missing, exclude the questionnaire
– Inconsistency: correct, return or exclude

100

Data processing

• Decide whether to process and analyse the data from questionnaires:
– manually, using data master sheets or manual compilation of the questionnaires, or
– by computer, for example, using a micro-computer and existing software or self-written programmes for data analysis.

• Data processing in both cases involves:
• categorising the data,
• coding, and
• summarising the data in data master sheets, manual compilation without master sheets, or
• data entry and verification by computer.

101

2.Descriptive statistics

(Data summarization)

102

2.Data summarization(Descriptive statistics)

2.1. Describing variables
The methods of describing variables differ depending on the type of data: categorical or numerical.
Sometimes we transform numeric data into categorical data (e.g., age)
– when a lesser degree of detail is required

• This is achieved by dividing the range of values which the numeric variable takes into intervals.

103

Describing…

Categorical variables
• Table of frequency distributions
– Frequency
– Relative frequency
– Cumulative frequencies

• Charts
– Bar charts
– Pie charts

104

Describing …

105

In summary,
• There are three ways we can summarize and present data:
• Tabular representation - summarizing data by making a table of the data called a frequency distribution.

• Graphical representation of data - we can make a graph of the data.

• Numerical representation of data - we can use a single number to represent many numbers.
– Measures of central tendency.
– Measures of variability.

106

2.2. Frequency Distribution
• A frequency distribution shows the number of observations falling into each of several ranges of values.
• There are four different types of frequency distributions:
– Simple frequency distribution (or it can be just called a frequency distribution).
– Cumulative frequency distribution.
– Grouped frequency distribution.
– Cumulative grouped frequency distribution.

• They are portrayed as frequency tables, histograms, or polygons.

• They can show either the actual number of observations falling in each range or the percentage of observations. In the latter instance, the distribution is called a relative frequency distribution.

107

Simple frequency distribution

Consider the following set of data, which are the high temperatures recorded for 30 consecutive days. We wish to summarize these data by creating a frequency distribution of the temperatures.

Data Set - High Temperatures for 30 Days

50 45 49 50 43

49 50 49 45 49

47 47 44 51 51

44 47 46 50 44

51 49 43 43 49

45 46 45 51 46

108

Simple frequency distribution…

.

To create a frequency distribution from this data proceed as follows:

1. Identify the highest and lowest values in the data set. For our temperatures the highest temperature is 51 and the lowest temperature is 43.

2. Create a column with the title of the variable we are using, in this case temperature. Enter the highest score at the top, and include all values within the range from the highest score to the lowest score.

109

Simple frequency…

3. Create a tally column to keep track of the scores as you enter them into the frequency distribution. Once the frequency distribution is completed you can omit this column

4. Create a frequency column, with the frequency of each value, as shown in the tally column, recorded.

5. At the bottom of the frequency column record the total frequency for the distribution, preceded by N =

6. Enter the name of the frequency distribution at the top of the table.

110

Simple frequency…

If we apply these steps to the temperature data above, we obtain the following frequency distribution.

Frequency Distribution for High Temperatures

Temperature   Tally     Frequency
51            ////      4
50            ////      4
49            //// /    6
48                      0
47            ///       3
46            ///       3
45            ////      4
44            ///       3
43            ///       3
N = 30
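The tallying steps can also be sketched in code. The following is a minimal illustration, not part of the original slides; it assumes Python's standard `collections.Counter`, and the names `temps` and `freq_table` are chosen here for illustration only.

```python
from collections import Counter

# High temperatures for 30 consecutive days (data set from the slide)
temps = [50, 45, 49, 50, 43, 49, 50, 49, 45, 49,
         47, 47, 44, 51, 51, 44, 47, 46, 50, 44,
         51, 49, 43, 43, 49, 45, 46, 45, 51, 46]

counts = Counter(temps)

# List every value from the highest to the lowest, including values
# that never occur (frequency 0), as in step 2 of the procedure.
freq_table = [(value, counts.get(value, 0))
              for value in range(max(temps), min(temps) - 1, -1)]

print("Temperature  Frequency")
for value, freq in freq_table:
    print(f"{value:>11}  {freq:>9}")
print(f"N = {sum(freq for _, freq in freq_table)}")
```

Using `range` from the maximum down to the minimum, rather than only the observed values, is what keeps the empty row for 48 in the table.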

111

Cumulative frequency distribution
To create a cumulative frequency distribution:
• Create a frequency distribution
• Add a column entitled cumulative frequency
• The cumulative frequency for each score is the frequency up to and including the frequency for that score
• The highest cumulative frequency should equal N (the total of the frequency column)

112

Cumulative frequency…

113

Cumulative Frequency Distribution for High Temperatures

Temperature   Tally     Frequency   Cumulative Frequency
51            ////      4           30
50            ////      4           26
49            //// /    6           22
48                      0           16
47            ///       3           16
46            ///       3           13
45            ////      4           10
44            ///       3           6
43            ///       3           3
N = 30
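As a rough sketch of the cumulative column, the running totals can be computed with `itertools.accumulate` from the standard library. The variable names below are illustrative and not from the slides.

```python
from collections import Counter
from itertools import accumulate

temps = [50, 45, 49, 50, 43, 49, 50, 49, 45, 49,
         47, 47, 44, 51, 51, 44, 47, 46, 50, 44,
         51, 49, 43, 43, 49, 45, 46, 45, 51, 46]

counts = Counter(temps)
values = list(range(min(temps), max(temps) + 1))   # 43 .. 51, ascending
freqs = [counts.get(v, 0) for v in values]
cum_freqs = list(accumulate(freqs))                # running total from the lowest value

# Print with the highest value first, matching the layout of the table
print("Temperature  Frequency  Cumulative")
for v, f, cf in reversed(list(zip(values, freqs, cum_freqs))):
    print(f"{v:>11}  {f:>9}  {cf:>10}")
```

The last running total equals N, which is a quick check that no observation was lost while tallying.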

Grouped frequency distribution
To create a grouped frequency distribution:
• select an interval size so that you have 7-20 class intervals; alternatively, choose the number of classes k using Sturges' rule, k = 1 + 3.322 log10(n)

• create a class interval column and list each of the class intervals

• each interval must be the same size, the intervals must not overlap, and there may be no gaps within the range of class intervals

• create a tally column (optional)
• create a midpoint column for interval midpoints
• create a frequency column
• enter N = some value at the bottom of the frequency column
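The grouping steps can be sketched as follows, using the 30 high temperatures from the earlier slide. This is only an illustration under stated assumptions: Sturges' rule is taken in its common form k = 1 + 3.322 log10(n), the result is rounded up, and the class width is rounded up to a whole number because the data are integers. For such a small data set the rule suggests fewer classes than the 7-20 guideline.

```python
import math
from collections import Counter

temps = [50, 45, 49, 50, 43, 49, 50, 49, 45, 49,
         47, 47, 44, 51, 51, 44, 47, 46, 50, 44,
         51, 49, 43, 43, 49, 45, 46, 45, 51, 46]

n = len(temps)
# Sturges' rule: suggested number of classes (rounding up is an assumption here)
k = math.ceil(1 + 3.322 * math.log10(n))
# Equal integer class width that covers the whole range of the data
width = math.ceil((max(temps) - min(temps) + 1) / k)

counts = Counter(temps)
classes = []          # (lower limit, upper limit, midpoint, frequency)
lower = min(temps)
while lower <= max(temps):
    upper = lower + width - 1
    freq = sum(counts[v] for v in range(lower, upper + 1))
    classes.append((lower, upper, (lower + upper) / 2, freq))
    lower = upper + 1   # next class starts right after this one: no gaps, no overlap

print("Class  Midpoint  Frequency")
for lo, hi, mid, freq in reversed(classes):
    print(f"{lo}-{hi}  {mid:>8}  {freq:>9}")
print(f"N = {sum(c[3] for c in classes)}")
```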

114

Grouped frequency for the temperature data

Grouped Frequency Distribution for High Temperatures

Class Interval Tally Interval Midpoint Frequency

57-59 ////// 58 6

54-56 /////// 55 7

51-53 /////////// 52 11

48-50 ///////// 49 9

45-47 /////// 46 7

42-44 ////// 43 6

39-41 //// 40 4

N = 50

115

Cumulative grouped frequency distribution

We just add a cumulative frequency column to the grouped frequency distribution and we have a cumulative grouped frequency distribution, as shown in the following table.

Cumulative Grouped Frequency Distribution for High Temperatures

Class Interval   Tally         Interval Midpoint   Frequency   Cumulative Frequency
57-59            //////        58                  6           50
54-56            ///////       55                  7           44
51-53            ///////////   52                  11          37
48-50            /////////     49                  9           26
45-47            ///////       46                  7           17
42-44            //////        43                  6           10
39-41            ////          40                  4           4
N = 50

116

Relative Frequency
• Sometimes it is useful to compute the proportion, or percentage, of observations in each category.

• The relative frequency of a particular category is the proportion (fraction) of observations that fall into that category.

• The cumulative relative frequency is the sum of the relative frequencies of all categories up to and including a particular category.

– It is the relative frequency of items less than or equal to the upper class limit of each class.

• Cumulative frequencies can be used for quantitative data and for categorical (qualitative) data (but only if the latter are ordinal).

117

Characteristics and guidelines of table construction

Characteristics

• Table must be explanatory

• Title should describe the content of the table and should answer the questions: what, where and when were the data collected?

• Percentages in each category should add up to 100

• Foot notes should be placed at the bottom of the

table

118

Guidelines
• The shape and size of the table should contain the required number of rows and columns to accommodate the whole data

• If a quantity is zero, it should be entered as zero; leaving a blank space or putting a dash in place of zero is confusing and undesirable

• In case two or more figures are the same, ditto marks should not be used in a table in place of the original numerals

• If any figure in a table has to be specified for a particular purpose, it should be marked with an asterisk

119

2.3. Diagrammatic Representation

2.3.1. Importance of diagrammatic representation:

1.Diagrams have greater attraction than mere figures. They give delight to the eye, add a spark of interest and as such catch the attention as much as the figures dispel it.

2.They help in deriving the required information in less time and without any mental strain.

3.They have greater memorizing value than mere figures, because the impression left by a diagram is of a lasting nature.

4.They facilitate comparison

120

Importance….

Well designed graphs can be an incredibly powerful means of communicating a great deal of information using visual techniques

When graphs are poorly designed, they not only do not effectively convey your message, they often mislead and confuse.

121

2.3.2. Types
1. Bar graph

•Bar diagram is the easiest and most adaptable general purpose chart.

•Though this type of chart can be used for any type of series, it is especially satisfactory for nominal and ordinal data.

•The categories are represented on the base line (X-axis) at regular intervals and the corresponding frequencies or relative frequencies are represented on the Y-axis (ordinate) in the case of a vertical bar diagram, and vice versa in the case of a horizontal bar diagram.

122

Method of constructing bar graph
•All bars drawn in any single study should be of the same width
•The different bars should be separated by equal distances
•All the bars should rest on the same line called the base
•It is better to construct a diagram on graph paper

Types of bar graph
• 1. Simple bar graph: It is a one-dimensional diagram in which the bar represents the whole of the magnitude. The height/length of each bar indicates the frequency of the figure represented.

Example: Construct a bar graph for the following data

123

Table__, Distribution of pediatric patients in X hospital ward by type of admitting diagnosis, Jan 2000

Diagnosis          Number of patients   Relative freq (%)
Pneumonia          487                  48.7
Malaria            200                  20.0
Cardiac problems   168                  16.8
Malnutrition       80                   8.0
Others             65                   6.5
Total              1000                 100
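As a small sketch, the relative frequencies in this table can be reproduced from the counts, and a rough text-only bar graph can be printed without any plotting library. The dictionary name `diagnoses` and the bar scaling (one `#` per 2 percentage points) are illustrative choices, not part of the slides.

```python
# Admitting diagnoses of pediatric patients (counts from the table)
diagnoses = {
    "Pneumonia": 487,
    "Malaria": 200,
    "Cardiac problems": 168,
    "Malnutrition": 80,
    "Others": 65,
}

total = sum(diagnoses.values())
# Relative frequency (%) = 100 * category count / total count
rel_freq = {name: 100 * count / total for name, count in diagnoses.items()}

for name, pct in rel_freq.items():
    bar = "#" * round(pct / 2)          # simple horizontal bar, equal "width" per row
    print(f"{name:<17}{pct:5.1f}%  {bar}")
```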

124

1. Simple bar graph…

.

125

2.Sub-divided (component) bar graph

• It is also called segmented bar graph. If a given

magnitude can be split up into subdivisions, or if there are different quantities forming the subdivisions of the totals, simple bars may be subdivided in the ratio of the various subdivisions to exhibit the relationship of the parts to the whole.

• The order in which the components are shown in a "bar" is followed in all bars used in the diagram.

126

2.Sub-divided…

127

3. Multiple bar graph

Multiple Bar diagrams can be used to represent the relationships among more than two variables.

The following figure shows the relationship between children’s reports of breathlessness and cigarette smoking by themselves and their parents.

128

3. Multiple bar graph…

129

3. Multiple bar graph…

• We can see quickly from the graph that the prevalence of the symptom increases both with the child's smoking and with that of their parents.

130

2. Pie chart

Pie chart shows the relative frequency for each category by dividing a circle into sectors, the angles of which are proportional to the relative frequency.

Steps to construct a pie-chart
• Construct a frequency table
• Change the frequencies into percentages (P)
• Change the percentages into degrees, where: degree = (Percentage / 100) × 360°
• Draw a circle and divide it accordingly

131

2. Pie chart…

Example: Distribution of death for females, in England and Wales, 1989.


132

Cause of death           Number (%) of deaths
Circulatory system (C)   100,000
Neoplasm (N)             70,000
Respiratory system (R)   30,000
Injury & poisoning (I)   6,000
Digestive system (D)     10,000
Others (O)               20,000
Total                    236,000
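To illustrate the pie-chart steps, the sector angle for each cause of death can be computed as the category's share of the total multiplied by 360 degrees. This is a sketch only; the dictionary name `deaths` is an illustrative choice.

```python
# Deaths among females, England and Wales, 1989 (counts from the table)
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury & poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}

total = sum(deaths.values())
# Percentage share and sector angle: angle = (percentage / 100) * 360
percentages = {cause: 100 * n / total for cause, n in deaths.items()}
angles = {cause: pct / 100 * 360 for cause, pct in percentages.items()}

for cause in deaths:
    print(f"{cause:<20}{percentages[cause]:6.1f}%  {angles[cause]:6.1f} degrees")
```

The angles add up to 360 degrees, just as the percentages add up to 100.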

2. Pie chart…

133

3.Histogram

Histograms are frequency distributions with continuous class interval that have been turned into graphs.

To construct a histogram, we draw the interval boundaries on a horizontal line and the frequencies on a vertical line.

Non-overlapping intervals that cover all of the data values must be used.

Bars are then drawn over the intervals in such a way that the areas of the bars are all proportional in the same way to their interval frequencies.

134

Example: Distribution of the RBC cholinesterase values (µmol/min/ml) obtained from 35 workers Exposed to Pesticides


135

RBC cholinesterase (µmol/min/ml)   Frequency, n (%)   Cumulative frequency (%)
5.95-7.95                          1 (2.9)            2.9
7.95-9.95                          8 (22.9)           25.8
9.95-11.95                         14 (40)            65.8
11.95-13.95                        9 (25.7)           91.5
13.95-15.95                        2 (5.7)            97.2
15.95-17.95                        1 (2.9)            100
Total                              35 (100)

Source: Knapp RG, Miller MC III: Clinical Epidemiology and biostatistics
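The percentage columns of such a table can be recomputed from the counts, as in the sketch below (variable names are illustrative). Note that the tabulated cumulative percentages appear to be running sums of the already rounded percentages, so they can differ by about 0.1 from values computed directly from the cumulative counts.

```python
from itertools import accumulate

# Class boundaries and frequencies from the RBC cholinesterase table
boundaries = [5.95, 7.95, 9.95, 11.95, 13.95, 15.95, 17.95]
counts = [1, 8, 14, 9, 2, 1]

n = sum(counts)
pct = [100 * c / n for c in counts]                    # relative frequency (%)
cum_pct = [100 * c / n for c in accumulate(counts)]    # cumulative relative frequency (%)
midpoints = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]

for mid, c, p, cp in zip(midpoints, counts, pct, cum_pct):
    print(f"midpoint {mid:5.2f}  n={c:2d}  {p:5.1f}%  cumulative {cp:5.1f}%")
```

The interval midpoints computed here (6.95, 8.95, and so on) are the values marked on the horizontal axis of the histogram.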

3.Histogram…

[Figure: Histogram of the RBC cholinesterase values of 35 pesticide exposed workers. X-axis: RBC cholinesterase (µmol/min/ml), interval midpoints 6.95 to 16.95. Y-axis: number of pesticide exposed workers, 0 to 16.]

136

4. Frequency polygon
A frequency distribution can be portrayed graphically in yet another way by means of a frequency polygon.
•To draw a frequency polygon we connect the mid-points of the tops of the bars of the histogram by straight lines.
•It can also be drawn without erecting rectangles, as follows:

– The horizontal scale should be marked with the numerical values of the mid-points of the intervals.

– Erect ordinates at the mid-point of each interval, the height of an ordinate representing the frequency of the class at whose mid-point it is erected.

– Join the tops of the ordinates and extend the connecting line to the horizontal scale.

137

4.Frequency polygon…

138

5.Cumulative frequency polygon (ogive curve)

Sometimes it may become necessary to know the number of items whose values are more or less than a certain amount.

•We may, for example, be interested in knowing the number of patients whose weight is less than 50 Kg or more than, say, 60 Kg.

•To get this information it is necessary to change the form of the frequency distribution from a ‘simple’ to a ‘cumulative’ distribution.

•An ogive curve turns a cumulative frequency distribution into a graph.

139

5.Cumulative frequency polygon (ogive curve)…

Example: Heart rate of patients admitted to Hospital B, 2000

140

Heart rate (beats/min)   No. of patients   Cumulative freq., less than method   Cumulative freq., greater than method
54.5-59.5                1                 1                                    54
59.5-64.5                5                 6                                    53
64.5-69.5                3                 9                                    48
69.5-74.5                5                 14                                   45
74.5-79.5                11                25                                   40
79.5-84.5                16                41                                   29
84.5-89.5                5                 46                                   13
89.5-94.5                5                 51                                   8
94.5-99.5                2                 53                                   3
99.5-104.5               1                 54                                   1
Total                    54
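Both cumulative columns of the heart-rate table can be sketched with `itertools.accumulate`: the "less than" column accumulates from the lowest class upward, and the "greater than" column accumulates from the highest class downward. The names below are illustrative, and the first class is assumed to start at 54.5 so that all classes have width 5.

```python
from itertools import accumulate

# Lower class boundaries (assumed 54.5, 59.5, ..., 99.5) and class frequencies
lower_bounds = [54.5 + 5 * i for i in range(10)]
counts = [1, 5, 3, 5, 11, 16, 5, 5, 2, 1]

# "Less than" method: patients with heart rate below each upper boundary
less_than = list(accumulate(counts))
# "Greater than" method: patients with heart rate above each lower boundary
greater_than = list(accumulate(reversed(counts)))[::-1]

for lo, c, lt, gt in zip(lower_bounds, counts, less_than, greater_than):
    print(f"{lo:5.1f}-{lo + 5:5.1f}  n={c:2d}  less-than={lt:2d}  greater-than={gt:2d}")
```

Plotting `less_than` against the upper boundaries (or `greater_than` against the lower boundaries) and joining the points gives the ogive.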

5.Cumulative frequency polygon (ogive curve)…

141

6. Box-and-whisker plot
It is another way to display information when the objective is to illustrate certain locations in the distribution.

A box is drawn with the top of the box at the third quartile and the bottom at the first quartile.

The location of the midpoint of the distribution (the median) is indicated with a horizontal line in the box.

Finally, straight lines or whiskers are drawn from the center of the top of the box to the largest observation and from the center of the bottom of the box to the smallest observation.

It is useful when one of the characteristics is qualitative and the other is quantitative.

142

Eg: percentage super saturation of bile by sex of patients

Men                                    Women
Subject   Age   % Super saturation     Subject   Age   % Super saturation
1         23    40                     1         40    65
2         31    86                     2         33    86
3         58    11                     3         49    76
4         25    86                     4         44    89
5         63    106                    5         63    142
6         43    66                     6         27    58
7         67    123                    7         23    98
8         48    90                     8         56    146
9         29    112                    9         41    80
10        26    52                     10        30    66
11        64    88                     11        38    52
12        55    137                    12        23    35
13        31    88                     13        35    55
14        20    80                     14        50    127
15        23    65                     15        47    77
16        43    79                     16        36    91
17        27    87                     17        74    128
18        63    56                     18        53    75
19        59    110                    19        41    82
20        53    106                    20        25    89
21        66    110                    21        57    84
22        48    78                     22        42    116
23        27    80                     23        49    73
24        32    47                     24        60    87
25        62    74                     25        23    76
26        36    58                     26        48    107
27        29    88                     27        44    84
28        27    73                     28        37    120
29        65    118                    29        57    123
30        42    67
31        60    57

143

Box-and-whisker plot…

144

Box-and-whisker plot
• The graphs indicate the similarity of the distributions of the percentage saturation of bile in men and women.

•Again, we see that the percentage saturation of bile is a bit more spread out among women, with a range of 35 to 146, but we also see that the mid-points of the distributions are almost the same and that most of the spread in values in women occurs in the upper half of the distribution.
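The five numbers that a box-and-whisker plot displays (smallest observation, first quartile, median, third quartile, largest observation) can be sketched for the women's bile data as below. This is only an illustration: it assumes Python's `statistics.quantiles` with its default "exclusive" method, which places the quartiles at ranks ¼(n+1), ½(n+1) and ¾(n+1) and interpolates between neighbouring values; other software may use slightly different quartile rules.

```python
import statistics

# Percentage supersaturation of bile for the 29 women in the table
women = [65, 86, 76, 89, 142, 58, 98, 146, 80, 66, 52, 35, 55, 127, 77,
         91, 128, 75, 82, 89, 84, 116, 73, 87, 76, 107, 84, 120, 123]

# Quartiles at ranks (n+1)/4, 2(n+1)/4 and 3(n+1)/4 of the ordered data
q1, median, q3 = statistics.quantiles(women, n=4)
minimum, maximum = min(women), max(women)

print(f"min={minimum}, Q1={q1}, median={median}, Q3={q3}, max={maximum}")
print(f"box height (IQR) = {q3 - q1}")
```

The box spans Q1 to Q3, the line inside the box sits at the median, and the whiskers reach to the minimum and maximum.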

145

7. Scatter plot
Most studies in medicine involve measuring more than one characteristic, and graphs displaying the relationship between two characteristics are common in the literature.

• To illustrate the relationship between two characteristics when both are quantitative variables we use bivariate plots (also called scatter plots or scatter diagrams).

A scatter diagram is constructed by drawing X- and Y-axes.

•Each observation is represented by a point or dot (•).
•In the same study on percentage saturation of bile, information was collected on the age of each patient to see whether a relationship existed between the two measures; the following plot was displayed.

146

7.Scatter plot…

147

The graph suggests the possibility of a positive relationship between age and percentage saturation of bile in women.

8. Line graph
In this type of graph, we have two variables under consideration, as in a scatter diagram.

•One variable is taken along the X-axis and the other along the Y-axis.

•The points are plotted and joined by line segments in order.

•These graphs depict the trend or variability occurring in the data.

•Sometimes two or more graphs are drawn on the same graph paper using the same scale so that the plotted graphs are comparable.

Example: The following graph shows the level of zidovudine (AZT) in the blood of AIDS patients at several times after administration of the drug, with normal fat absorption and with fat malabsorption.

148

Response to administration of zidovudine in two groups of AIDS patients in hospital X, 1999.

149

Data Summarization (Numeric Summary)

150

Measures of central tendency

On the scale of values of a variable there is a certain stage at which the largest number of items tend to cluster.

Since this stage is usually in the centre of distribution, the tendency of the statistical data to get concentrated at certain values is called “central tendency”

The various methods of determining the actual value at which the data tends to concentrate are called measures of central tendency.

151

Measures of central tendency…
The most important objective of calculating a measure of central tendency is to determine a single figure which may be used to represent a whole series involving magnitudes of the same variable.

In that sense it is an even more compact description of the statistical data than the frequency distribution.

•Since a measure of central tendency represents the entire data, it facilitates comparison within one group or between groups of data.

152

Measures of central tendency…
Characteristics of a good measure of central tendency
A measure of central tendency is good or satisfactory if it possesses the following characteristics.

1. It should be based on all the observations
2. It should not be affected by the extreme values
3. It should be as close to the maximum number of values as possible
4. It should have a definite value
5. It should not be subjected to complicated and tedious calculations
6. It should be capable of further algebraic treatment
7. It should be stable with regard to sampling

153

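Arithmetic mean ($\bar{x}$)
The most familiar measure of central tendency (MCT) is the arithmetic mean (AM). It is also popularly known as the average.
a) Ungrouped data
If $x_1, x_2, \ldots, x_n$ are n observed values, then:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$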

154

Arithmetic mean…
b) Grouped data. In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the mid-point of the interval. It is calculated as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$

155

where,
k = the number of class intervals
m_i = the mid-point of the i-th class interval
f_i = the frequency of the i-th class interval
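As a worked sketch of the grouped-data mean, mean = (sum of m_i × f_i) / (sum of f_i), the code below uses the midpoints and frequencies of the grouped high-temperature table shown earlier in these notes (N = 50). The function name `grouped_mean` is an illustrative choice.

```python
def grouped_mean(midpoints, freqs):
    """Mean of grouped data: every value in a class is assumed to sit at the class midpoint."""
    total_f = sum(freqs)
    return sum(m * f for m, f in zip(midpoints, freqs)) / total_f

# Grouped frequency distribution for high temperatures (classes 39-41 ... 57-59)
midpoints = [40, 43, 46, 49, 52, 55, 58]
freqs = [4, 6, 7, 9, 11, 7, 6]

mean_temp = grouped_mean(midpoints, freqs)
print(f"sum(f) = {sum(freqs)}, sum(m*f) = {sum(m * f for m, f in zip(midpoints, freqs))}")
print(f"grouped mean = {mean_temp:.2f}")
```

Because the raw values are replaced by midpoints, the result is an approximation to the mean of the original observations.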

Arithmetic mean…Example.

156

Mean = 2630/100 = 26.3

Arithmetic mean…
• The arithmetic mean possesses the following properties.
• Uniqueness: For a given set of data there is one and only one arithmetic mean.
• Simplicity: The arithmetic mean is easily understood and easy to compute.
• Center of gravity: The algebraic sum of the deviations of the given values from their arithmetic mean is always zero.

• Sensitivity: The arithmetic mean possesses all the characteristics of a good measure of central tendency except No. 2 (it is greatly affected by extreme values).

• In the case of grouped data, if any class interval is open-ended, the arithmetic mean cannot be calculated

157

The Median ($\tilde{x}$)

• a) Ungrouped data
•The median of a finite set of values is that value which divides the set of values into two equal parts such that the number of values greater than the median is equal to the number of values less than the median.

•If the number of values is odd, the median will be the middle value when all values have been arranged in order of magnitude.

•When the number of observations is even, there is no single middle observation but two middle observations.
•In this case the median is taken to be the mean of these two middle observations, when all observations have been arranged in order of magnitude

158

The Median…

b) Grouped data
• In calculating the median from grouped data, we assume that the values within a class interval are evenly distributed through the interval.

• The first step is to locate the class interval in which the median is located. We use the following procedure.

• Find n/2 and identify the first class interval whose cumulative frequency is at least n/2; this is the median class.

• To find a unique median value, use the following interpolation formula.

159

Median…

$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$

160

Where,
Lm = lower true class boundary of the interval containing the median
Fc = cumulative frequency of the interval just before (below) the median class interval
fm = frequency of the interval containing the median
W = class interval width
n = total number of observations

Median… Example

161

n/2 = 75/2 = 37.5
Median class interval = 35-44
Lm = 34.5, Fc = 35, W = 10, n = 75, fm = 22
•Median = 34.5 + ((37.5 − 35)/22) × 10 = 35.64
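The interpolation used in this example, Median = Lm + ((n/2 − Fc) / fm) × W, can be sketched as a small function. The function and parameter names are illustrative; the numbers are those given in the example.

```python
def grouped_median(lower_boundary, n, cum_freq_before, median_class_freq, width):
    """Median of grouped data by linear interpolation within the median class.

    lower_boundary    : lower true class boundary of the median class (Lm)
    n                 : total number of observations
    cum_freq_before   : cumulative frequency of the class before the median class (Fc)
    median_class_freq : frequency of the median class (fm)
    width             : class interval width (W)
    """
    return lower_boundary + (n / 2 - cum_freq_before) / median_class_freq * width

# Values from the worked example: Lm = 34.5, n = 75, Fc = 35, fm = 22, W = 10
median = grouped_median(34.5, 75, 35, 22, 10)
print(f"median = {median:.2f}")
```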

Properties of the median
• There is only one median for a given set of data
• The median is easy to calculate
• The median is a positional average and hence it is not drastically affected by extreme values
• The median can be calculated even in the case of open-ended intervals
• It is not a good representative of data if the number of items is small

162

Mode ($\hat{x}$)
a) Ungrouped data
•It is the value which occurs most frequently in a set of values.
•If all the values are different there is no mode; on the other hand, a set of values may have more than one mode.

b) Grouped data
• In designating the mode of grouped data, we usually refer to the modal class, where the modal class is the class interval with the highest frequency.

• If a single value for the mode of grouped data must be specified, it is taken as the mid-point of the modal class interval.
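A brief sketch of both cases follows, using data from earlier slides. It assumes Python 3.8 or later for `statistics.multimode`, which returns every value that attains the highest count. Note one difference from the slide's convention: when all values are different, `multimode` returns all of them, whereas the slide says there is no mode.

```python
import statistics

# a) Ungrouped data: the 30 high temperatures
temps = [50, 45, 49, 50, 43, 49, 50, 49, 45, 49,
         47, 47, 44, 51, 51, 44, 47, 46, 50, 44,
         51, 49, 43, 43, 49, 45, 46, 45, 51, 46]
modes = statistics.multimode(temps)              # values with the highest frequency
two_modes = statistics.multimode([1, 1, 2, 2, 3])  # a small bimodal illustration

# b) Grouped data: modal class of the grouped temperature table (N = 50)
classes = [("39-41", 40, 4), ("42-44", 43, 6), ("45-47", 46, 7), ("48-50", 49, 9),
           ("51-53", 52, 11), ("54-56", 55, 7), ("57-59", 58, 6)]
modal_class, modal_midpoint, modal_freq = max(classes, key=lambda c: c[2])

print(f"mode(s) of ungrouped temperatures: {modes}")
print(f"modes of [1, 1, 2, 2, 3]: {two_modes}")
print(f"modal class: {modal_class}, taken as {modal_midpoint} if a single value is needed")
```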

163

Properties of mode

• It is not affected by extreme values
• It can be calculated for distributions with open-ended classes
• Often its value is not unique
• The main drawback of the mode is that often it does not exist

164

MEASURES OF POSITION
Quartiles

• Divide the distribution into four equal parts. The 25th percentile demarcates the first quartile (Q1),

• the median or 50th percentile demarcates the second quartile (Q2),

• the 75th percentile demarcates the third quartile (Q3),
• and the 100th percentile demarcates the fourth quartile (Q4), which is the maximum observation.

Q1 is the ¼(n+1)th measurement in the ranked data, i.e., 25% of all the ranked observations are less than Q1.

Q2 is the 2/4 (n+1)th = ((n+1)/2)th measurement, i.e., 50% of all ranked observations are less than Q2. Its rank is twice the rank of Q1 (this does not mean that the value of Q2 is 2 × Q1).

Q3 is the ¾(n+1)th measurement; its rank is three times the rank of Q1. It indicates that 75% of all the ranked observations are less than Q3.
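The rank rule for quartiles can be sketched as follows: the q-th quartile sits at rank q(n+1)/4 in the ordered data, and when that rank is not a whole number the value is interpolated between the two neighbouring observations. The function name `quartile`, the interpolation step and the small sample are illustrative assumptions, not taken from the slides.

```python
def quartile(data, q):
    """q-th quartile (q = 1, 2 or 3) located at rank q*(n+1)/4 of the ordered data."""
    xs = sorted(data)
    n = len(xs)
    rank = q * (n + 1) / 4          # 1-based position in the ordered data
    lower = int(rank)               # whole part of the rank
    frac = rank - lower             # fractional part, used for interpolation
    if lower < 1:
        return xs[0]
    if lower >= n:
        return xs[-1]
    # value at the whole rank plus a fraction of the gap to the next value
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

sample = [8, 2, 14, 6, 10, 4, 12]   # hypothetical sample, n = 7
q1, q2, q3 = (quartile(sample, q) for q in (1, 2, 3))
print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
```

With n = 7 the ranks are 2, 4 and 6, so the rank of Q2 is twice that of Q1 and the rank of Q3 is three times that of Q1, while the values themselves (4, 8, 12 here) need not be in that ratio in general.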

165

Percentile
• Percentiles divide the ranked data into 100 equal parts.
• The maximum is the value in a set of data that has 100% of the observations at or below it. When we consider it in this way, we call it the 100th percentile.

• From this same perspective, the median, which has 50% of the observations at or below it, is the 50th percentile.

• The pth percentile of a distribution is the value such that p percent of the observations are less than or equal to it.

The pth percentile value depends on whether np/100 is an integer or not:

– If np/100 is not an integer, it is the (k+1)th observation in ascending order, where k is the largest integer less than np/100.

– If np/100 is an integer, it is the average of the (np/100)th and (np/100 + 1)th observations in ascending order.

166

Percentiles…

Example: The following data are a sample of birth weights (grams) of live births at a hospital during a one-week period.

3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759, 3609, 2069, 3260, 3314, 3541, 3649, 3484, 2834, 2841, 3031.

Calculate the 10th and 90th percentiles.

Solution: n = 20; p = 10 and 90. First put the data in ascending order:

2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146.

10th percentile: np/100 = 20 × 10/100 = 2, which is an integer. So the 10th percentile is the average of the 2nd and 3rd ordered observations: (2581 + 2759)/2 = 2670 grams.

90th percentile: np/100 = 20 × 90/100 = 18, which is an integer. So the 90th percentile is the average of the 18th and 19th ordered observations: (3609 + 3649)/2 = 3629 grams.
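The percentile rule applied in this example can be sketched as a short function. The name `percentile` and the restriction to 0 < p < 100 are illustrative choices; the rule itself is the one stated on the slide (average two neighbouring ordered values when np/100 is an integer, otherwise take the (k+1)th ordered value).

```python
import math

def percentile(data, p):
    """p-th percentile (0 < p < 100) using the np/100 rule from the slide."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    xs = sorted(data)
    pos = len(xs) * p / 100
    if pos == int(pos):
        k = int(pos)
        # np/100 is an integer: average the k-th and (k+1)-th ordered values
        return (xs[k - 1] + xs[k]) / 2
    k = math.floor(pos)
    # np/100 is not an integer: take the (k+1)-th ordered value
    return xs[k]

birth_weights = [3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759,
                 3609, 2069, 3260, 3314, 3541, 3649, 3484, 2834, 2841, 3031]

p10 = percentile(birth_weights, 10)
p90 = percentile(birth_weights, 90)
print(f"10th percentile = {p10} g, 90th percentile = {p90} g")
```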

167

Percentiles…

• Therefore, we would say that 80 percent of the birth weights fall between 2670 g and 3629 g, which gives us an overall feel for the spread of the distribution.

• The most commonly used percentiles other than

the median (50th percentile) are the 25th percentile and the 75th percentile.

168

Measures of variability

• The measure of central tendency alone is not enough to have a clear idea about the distribution of the data.

• Moreover, two or more sets may have the same mean and/or median but they may be quite different.

• Thus, to have a clear picture of the data, one needs a measure of dispersion or variability (scatter) among the observations in the set.

169

Range (R)

R = XL − XS,

where XL is the largest value and XS is the smallest value.

Properties

• It is the simplest measure and can be easily understood.

• It takes into account only two values, which makes it a poor measure of dispersion.

170

Interquartile range (IQR)

IQR = Q3 − Q1,

where Q3 is the third quartile and Q1 is the first quartile.

Example: Suppose the first and third quartiles for weights of girls 12 months of age are 8.8 kg and 10.2 kg, respectively. The interquartile range is therefore

IQR = 10.2 kg − 8.8 kg = 1.4 kg,

i.e., 50% of infant girls at 12 months weigh between 8.8 and 10.2 kg.

171

Interquartile range…

172

Interquartile range…

• Generally, we use the interquartile range to describe variability when we use the median as the measure of central location. We use the standard deviation, which is described in the next section, when we use the mean.

Properties

• It is a simple and versatile measure.

• It encloses the central 50% of the observations.

• It is not based on all observations but only on two specific values.

• It is important in selecting cut-off points in the formulation of clinical standards.

• Since it excludes the lowest and highest 25% of values, it is not affected by extreme values.

• It is not capable of further algebraic treatment.

173

Quartile deviation (QD)

QD = (Q3 − Q1)/2, i.e., half of the interquartile range.

174

Coefficient of quartile deviation (CQD)

CQD = (Q3 − Q1)/(Q3 + Q1)

CQD is a unitless quantity and is useful for comparing the variability among the middle 50% of observations.
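As a small numerical illustration, the quartile-based measures can be computed for the infant-weight quartiles quoted earlier (Q1 = 8.8 kg, Q3 = 10.2 kg). The sketch assumes the usual textbook definitions QD = (Q3 − Q1)/2 and CQD = (Q3 − Q1)/(Q3 + Q1); the variable names are chosen here for illustration only.

```python
q1, q3 = 8.8, 10.2            # first and third quartiles in kg (from the IQR example)

iqr = q3 - q1                 # interquartile range, in kg
qd = iqr / 2                  # quartile deviation, in kg
cqd = (q3 - q1) / (q3 + q1)   # coefficient of quartile deviation, unitless

print(round(iqr, 2), round(qd, 2), round(cqd, 3))  # 1.4 0.7 0.074
```

Because the kilograms cancel in the CQD ratio, the same value would be obtained if the weights were recorded in grams.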

Mean deviation (MD)

• Mean deviation is the average of the absolute deviations taken from a central value, generally the mean or median.

• Consider a set of n observations x1, x2, ..., xn.

Then,

MD = (|x1 − A| + |x2 − A| + ... + |xn − A|)/n

175

Where, A is a central value (arithmetic mean or median).

Mean deviation…

Properties

• MD removes one main objection to the earlier measures, in that it involves every value.

• It is not affected much by extreme values.

• Its main drawback is that the algebraic negative signs of the deviations are ignored, which is mathematically unsound.

• MD is minimum when the deviations are taken from the median.
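The last property can be checked numerically. The sketch below computes the mean deviation of the birth-weight sample from the percentile example, once about the mean and once about the median; the helper name `mean_deviation` is an assumption introduced for illustration.

```python
from statistics import mean, median


def mean_deviation(data, center):
    """Average absolute deviation of the observations from a chosen central value."""
    return sum(abs(x - center) for x in data) / len(data)


weights = [3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759,
           3609, 2069, 3260, 3314, 3541, 3649, 3484, 2834, 2841, 3031]

md_about_mean = mean_deviation(weights, mean(weights))
md_about_median = mean_deviation(weights, median(weights))

print(md_about_mean, md_about_median)
# The deviation about the median is never larger than the deviation about the mean.
print(md_about_median <= md_about_mean)
```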

176

The Variance (σ², S²)

• The main objection to the mean deviation, that the negative signs are ignored, is removed by taking the square of the deviations from the mean.

• The variance is the average of the squares of the deviations taken from the mean.

177

Variance…

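a) Ungrouped data

Let X1, X2, ..., XN be the measurements on N population units; then the population variance is

σ² = [(X1 − μ)² + (X2 − μ)² + ... + (XN − μ)²]/N, where μ is the population mean.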

178

Variance…

The sample variance of the set x1, x2, ..., xn of n observations is:

S² = [(x1 − x̄)² + (x2 − x̄)² + ... + (xn − x̄)²]/(n − 1), where x̄ is the sample mean.

179

Variance…

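b) Grouped data

S² = [f1(m1 − x̄)² + f2(m2 − x̄)² + ... + fk(mk − x̄)²]/(n − 1), where mi is the midpoint and fi the frequency of the ith class, and n is the sum of the frequencies.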

180

Variance…

Properties

• The main demerit of the variance is that its unit is the square of the unit of measurement of the variate values.

• The variance gives more weight to extreme values than to those near the mean, because the differences are squared.

• The drawbacks of the variance are overcome by the standard deviation.

181

Standard deviation (σ, S)

It is the positive square root of the variance.
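To make the definitions concrete, the sketch below computes the variance and standard deviation of the birth-weight sample from the percentile example by hand and compares the results with Python's standard `statistics` module. It assumes the usual divisors: N when the data are treated as a whole population and n − 1 when they are treated as a sample.

```python
from math import sqrt
from statistics import mean, pvariance, variance

weights = [3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759,
           3609, 2069, 3260, 3314, 3541, 3649, 3484, 2834, 2841, 3031]

n = len(weights)
x_bar = mean(weights)
sum_sq_dev = sum((x - x_bar) ** 2 for x in weights)  # sum of squared deviations

pop_var = sum_sq_dev / n          # divisor N: data treated as a population
samp_var = sum_sq_dev / (n - 1)   # divisor n - 1: data treated as a sample
samp_sd = sqrt(samp_var)          # standard deviation, back in grams

print(pop_var, samp_var, samp_sd)
# The library functions use the same definitions.
print(pvariance(weights), variance(weights))
```

The variance is in grams squared, whereas the standard deviation is in grams, which is why the standard deviation is easier to interpret.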

182

Properties

• Standard deviation is considered to be the best measure of dispersion and is widely used because of the properties of the theoretical normal curve.

• There is, however, one difficulty with it. If the units of measurement of the variables in two series are not the same, then their variability cannot be compared by comparing the values of the standard deviation.

Coefficient of variation

• When we want to compare the variability in two sets of data, the standard deviation, which measures absolute variation, may lead to false results.

• The coefficient of variation gives relative variation and is the best measure for comparing the variability in two sets of data. Do not use the SD to compare variability between groups measured in different units or with very different means.

• CV = standard deviation / mean (often multiplied by 100 and expressed as a percentage)
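The following sketch illustrates why the CV is preferred for comparisons. It expresses the birth-weight sample from the percentile example in grams and in kilograms: the standard deviation changes with the unit, while the CV does not. The function name `cv` is an assumption introduced for illustration, and the sample standard deviation (divisor n − 1) is used.

```python
from statistics import mean, stdev


def cv(data):
    """Coefficient of variation: sample standard deviation divided by the mean."""
    return stdev(data) / mean(data)


grams = [3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759,
         3609, 2069, 3260, 3314, 3541, 3649, 3484, 2834, 2841, 3031]
kilograms = [g / 1000 for g in grams]

print(stdev(grams), stdev(kilograms))  # differ by a factor of 1000
print(cv(grams), cv(kilograms))        # essentially identical
```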

183

4.Basic Probability and probability distributions

• Probability is a mathematical technique for predicting outcomes. It predicts how likely it is that specific events will occur.

• An understanding of probability is fundamental for quantifying the uncertainty that is inherent in the decision-making process.

• Probability theory also allows us to draw conclusions about a population of patients based on known information about a sample of patients drawn from that population.

184

Basic Probability…

• Mutually exclusive events: Events that cannot occur together.

– For example, event A = “Male” and event B = “Pregnant” are two mutually exclusive events (as no males can be pregnant).

• Independent events: The presence or absence of one does not alter the chance of the other being present.

– One event happens regardless of the other, and its outcome is not related to the other.

• Probability: If an event can occur in N mutually exclusive and equally likely ways, and if m of these possess a characteristic E, the probability of the occurrence of E is P(E) = m/N.

185

4.1.Properties of probability

1. A probability value must lie between 0 and 1: 0 ≤ P(E) ≤ 1. A probability can never be more than 1.0, nor can it be negative.

• A value of 0 means the event cannot occur.

• A value of 1 means the event definitely will occur.

• A value of 0.5 means that the probability that the event will occur is the same as the probability that it will not occur.

• Probability is measured on a scale from 0 to 1.0, as shown in the following figure of the probability scale.

186

Properties…

187

Fig.___ Probability scale from 0 (the event cannot occur) to 1.0 (the event definitely will occur).

Properties…

2. The sum of the probabilities of all mutually exclusive and exhaustive outcomes is equal to 1:

P(E1) + P(E2) + ... + P(En) = 1

3. For any two events A and B (addition rule):

P(A or B) = P(A) + P(B) − P(A and B)

For two mutually exclusive events A and B, P(A or B) = P(A) + P(B).

4. For any two independent events A and B (multiplication rule):

P(A and B) = P(A) × P(B)

188

Properties…

• To calculate the probability of event A and event B both happening (independent events): for example, if you have two identical packs of cards (pack A and pack B), what is the probability of drawing the ace of spades from both packs?

• Formula: P(A) × P(B)

P(pack A) = 1 card from a pack of 52 cards = 1/52 = 0.0192

P(pack B) = 1 card from a pack of 52 cards = 1/52 = 0.0192

P(A) × P(B) = 0.0192 × 0.0192 = 0.00037

5. If A’ is the complementary event of the event A, then P(A’) = 1 − P(A).
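The card example and the rules above can be reproduced with exact fractions. The sketch below uses Python's standard `fractions` module; the variable names are chosen here for illustration only.

```python
from fractions import Fraction

p_a = Fraction(1, 52)   # ace of spades from pack A
p_b = Fraction(1, 52)   # ace of spades from pack B

# Multiplication rule for independent events
p_both = p_a * p_b
print(p_both, round(float(p_both), 5))   # 1/2704 0.00037

# Addition rule: ace of spades from at least one of the two packs
p_either = p_a + p_b - p_both
print(p_either)                          # 103/2704

# Complement rule: not drawing the ace of spades from pack A
p_not_a = 1 - p_a
print(p_not_a)                           # 51/52
```

Working with fractions avoids the small rounding error introduced by multiplying the rounded value 0.0192 by itself.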

189

Example

• A study investigated the effect of prolonged exposure to bright light on retinal damage in premature infants. Eighteen of 21 premature infants exposed to bright light developed retinopathy, while 21 of 39 premature infants exposed to a reduced light level developed retinopathy. For this sample, the probability of developing retinopathy is:

P(Retinopathy) = No. of infants with retinopathy / Total no. of infants = (18 + 21)/(21 + 39) = 39/60 = 0.65

190

Example…

• The following data are the results of electrocardiograms (ECGs) and radionuclide angiocardiograms (RAs) for 19 patients with post-traumatic myocardial contusions. A “+” indicates abnormal results and a “−” indicates normal results.

• 1. Calculate the probability that both the ECG and the RA are abnormal.

• 2. Calculate the probability that either the ECG or the RA is abnormal.

191

Example

192

Example

Solutions

1. P(ECG abnormal and RA abnormal) = 7/19 = 0.37

2. P(ECG abnormal or RA abnormal) = P(ECG abnormal) + P(RA abnormal) − P(both ECG and RA abnormal) = 17/19 + 9/19 − 7/19 = 19/19 = 1

• NB: We cannot calculate the above probability by simply adding the number of patients with abnormal ECGs to the number with abnormal RAs, i.e., (17 + 9)/19 = 1.37, which exceeds 1.

• The problem is that the 7 patients whose ECGs and RAs are both abnormal are counted twice.
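The double-counting point can be seen by working from the counts quoted in the solution (19 patients, 17 abnormal ECGs, 9 abnormal RAs, 7 abnormal on both). This is a minimal sketch; the variable names are assumptions made for illustration.

```python
from fractions import Fraction

n_patients = 19
n_ecg = 17    # abnormal ECG
n_ra = 9      # abnormal RA
n_both = 7    # abnormal on both tests

p_both = Fraction(n_both, n_patients)

# Naive sum counts the 7 patients with both abnormalities twice
naive = Fraction(n_ecg + n_ra, n_patients)

# Addition rule subtracts the overlap once
p_either = Fraction(n_ecg, n_patients) + Fraction(n_ra, n_patients) - p_both

print(round(float(p_both), 2))  # 0.37
print(round(float(naive), 2))   # 1.37, not a valid probability
print(p_either)                 # 1
```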

193

4.2.Conditional probability

• Conditional probabilities are probabilities based on the knowledge that some event has occurred.

• In the retinopathy study described in the above example, the primary concern is the comparison of the bright-light infants with the reduced-light infants. We want to know whether the probability of retinopathy for the bright-light infants differs from that for the reduced-light infants.

• We want to compare the probability of retinopathy given that the infant was exposed to bright light with the probability given that the infant was exposed to reduced light.

– Exposure to bright light and exposure to reduced light are conditioning events: events we want to take into account when calculating conditional probabilities.

194

Conditional…

• Conditional probabilities are denoted by P(A | B), also written P(A/B) (read as “probability of A given B”), or P(Event | Conditioning event). The formula for calculating a sample conditional probability is:

P(Event | Conditioning event) = (No. of observations for which the event and the conditioning event both occur) / (No. of observations for which the conditioning event occurs)

P(A | B) = P(A ∩ B)/P(B), if P(B) > 0

195

Conditional…

• Example: For the retinopathy data, the conditional probability of retinopathy, given exposure to bright light, is

P(Retinopathy | exposure to bright light) = (No. of infants with retinopathy exposed to bright light) / (No. of infants exposed to bright light) = 18/21 = 0.86

P(Retinopathy | exposure to reduced light) = (No. of infants with retinopathy exposed to reduced light) / (No. of infants exposed to reduced light) = 21/39 = 0.54

• The conditional probabilities suggest that premature infants exposed to bright light have a higher risk of retinopathy than premature infants exposed to reduced light.
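The retinopathy calculations can be written as a short Python sketch. The dictionary layout and the function name `p_retinopathy_given` are assumptions made for illustration; the counts are those given in the example.

```python
# Counts from the retinopathy study
counts = {
    "bright":  {"retinopathy": 18, "total": 21},
    "reduced": {"retinopathy": 21, "total": 39},
}


def p_retinopathy_given(exposure):
    """Sample conditional probability of retinopathy given the exposure group."""
    group = counts[exposure]
    return group["retinopathy"] / group["total"]


# Unconditional (overall) probability of retinopathy
overall = (sum(g["retinopathy"] for g in counts.values())
           / sum(g["total"] for g in counts.values()))

print(round(overall, 2))                          # 0.65
print(round(p_retinopathy_given("bright"), 2))    # 0.86
print(round(p_retinopathy_given("reduced"), 2))   # 0.54
```

The denominator changes from all 60 infants to the size of the conditioning group, which is what distinguishes a conditional probability from the overall one.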

196

Summary of formulas for calculating probability


More exercises

197

Calculating probability of an event

Table ---: Lifetime frequency of cocaine use by gender among adult cocaine users

Lifetime frequency of cocaine use    Male    Female    Total
--------------------------------------------------------------
1-19 times                             32         7       39
20-99 times                            18        20       38
More than 100 times                    25         9       34
--------------------------------------------------------------
Total                                  75        36      111

198

Questions

1. What is the probability that a randomly picked person is male?

2. What is the probability that a randomly picked person has used cocaine more than 100 times?

3. Given that the selected person is male, what is the probability that he has used cocaine more than 100 times?

4. Given that the person has used cocaine less than 100 times, what is the probability that the person is female?

5. What is the probability that a randomly picked person is male and has used cocaine more than 100 times?

199

Answers

1. Pr(m) = Total adult males / Total adult cocaine users = 75/111 = 0.68.

2. Pr(c>100) = Adults who used cocaine more than 100 times / Total adult cocaine users = 34/111 = 0.31.

3. Pr(c>100 | m) = 25/75 = 0.33.

4. Pr(f | c<100) = (7 + 20)/(39 + 38) = 27/77 = 0.35.

5. Pr(m ∩ c>100) = Pr(m) × Pr(c>100 | m) = 75/111 × 25/75 = 25/111 = 0.23.
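The five answers can be checked directly from the table counts. The sketch below is one way to do it in Python; the dictionary layout and variable names are assumptions made for illustration. For a conditional probability the denominator is the size of the conditioning group, for example the 39 + 38 = 77 users with fewer than 100 uses in question 4.

```python
# Counts from the cocaine-use table: frequency category -> counts by gender
table = {
    "1-19":  {"male": 32, "female": 7},
    "20-99": {"male": 18, "female": 20},
    ">100":  {"male": 25, "female": 9},
}

total = sum(sum(row.values()) for row in table.values())      # 111
males = sum(row["male"] for row in table.values())            # 75
over_100 = sum(table[">100"].values())                        # 34
under_100 = sum(table["1-19"].values()) + sum(table["20-99"].values())   # 77
females_under_100 = table["1-19"]["female"] + table["20-99"]["female"]   # 27

p_male = males / total                                  # question 1
p_over_100 = over_100 / total                           # question 2
p_over_100_given_male = table[">100"]["male"] / males   # question 3
p_female_given_under_100 = females_under_100 / under_100   # question 4
p_male_and_over_100 = table[">100"]["male"] / total     # question 5

for p in (p_male, p_over_100, p_over_100_given_male,
          p_female_given_under_100, p_male_and_over_100):
    print(round(p, 2))   # 0.68, 0.31, 0.33, 0.35, 0.23
```

Question 5 can equally be obtained as Pr(m) × Pr(c>100 | m), which gives the same 25/111.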

200

4.3.Normal distribution

• If we take a large sample of men or women, measure their heights, and plot them as a frequency distribution, the distribution will almost certainly show a symmetrical bell-shaped pattern known as the normal distribution (also called the Gaussian distribution). See the following figure.

• The least frequently recorded heights lie at the two extremes of the curve. From the figure, it can be seen that very few women are extremely short or extremely tall.

201

Normal distribution…

Figure 3.4.1. Distribution of a sample of values of women's heights.

202

Normal distribution…

• In practice, many biological measurements follow this pattern, making it possible to use the normal distribution to describe many features of a population.

• It must be emphasized that some measurements do not follow the symmetrical shape of the normal distribution, and can be positively skewed or negatively skewed.

• For example, more of the populations of developed Western countries are becoming obese. If a large sample of such a population's weights were plotted on a graph similar to that in Figure 3.4.1 above, there would be an excess of heavier weights, which might form a shape similar to the 'negatively skewed' example in Figure 3.4.2 below.

• The distribution would therefore not fit the symmetrical pattern of the normal distribution.

203


Figure 3.4.2. Examples of positive and negative skew.

204


Figure 3.4.3. The normal distribution.

205

Normal distribution…

• From the normal distribution shown in Figure 3.4.3 above, you can see that it is split into two equal and identically shaped halves by the mean.

• The standard deviation indicates the size of the spread of the data. It can also help us to determine how likely it is that a given value will be observed in the population being studied. We know this because the proportion of the population that is covered by any number of standard deviations can be calculated.

• The proportions of values below and above a specified value (e.g. the mean) can be calculated, and are known as tails.

• The normal distribution is useful in a number of applications, including confidence intervals and hypothesis testing.

206

Properties of the normal distribution

1. It is symmetrical about its mean, μ.

2. The mean, the median and the mode are all equal.

3. The total area under the curve above the x-axis is one square unit.

4. The curve never touches the x-axis.

5. As the value of σ increases, the curve becomes flatter and more spread out, and vice versa.

6. About 68% of the values of X fall within one standard deviation of the mean, about 95% within two standard deviations of the mean, and about 99.7% within three standard deviations of the mean.

7. The distribution is completely determined by the parameters μ and σ.

8. The mean is μ and the variance is σ².

207
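The percentages quoted for one, two and three standard deviations in the properties of the normal distribution can be verified with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a small check rather than a derivation.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area under the curve within k standard deviations of the mean
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(k, round(area, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The same areas apply to any normal distribution, because they depend only on the distance from the mean measured in standard deviations.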

Standard normal distribution

• It is a normal distribution that has a mean equal to 0 and a standard deviation equal to 1.

• Z-transformation: If a random variable X ~ N(μ, σ), then we can transform it to a standard normal distribution with the help of the Z-transformation:

Z = (X − μ)/σ

208

Example 1

• In 1932, scores on the Stanford-Binet IQ test were roughly normally distributed with μ = 100 and σ = 15.

• Over time IQs have increased (better nutrition or more experience taking tests?), so the average IQ for present-day American children taking the 1932 test would be 120, but with the same σ.

• “Very superior” is an IQ above 130.

– (a) What % of 1932 children were “very superior”?

– (b) What % of present-day children would be “very superior” on the 1932 test?

210

Solution

Let X be the 1932 IQ scores and let Y be the scores of present-day children on the 1932 test.

X ~ N(100, 15) and Y ~ N(120, 15)

(a) P(X > 130) = P(Z > (130 − 100)/15) = P(Z > 2.0) = 0.0228 (from the Z table). So 2.28% of 1932 children were “very superior”.

(b) P(Y > 130) = P(Z > (130 − 120)/15) = P(Z > 0.67) = 0.2514. So 25.14% of present-day children are “very superior”.
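The same two tail areas can be obtained without a Z table. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are chosen here for illustration. The small difference in part (b) arises because the table lookup rounds z = 0.6667 to 0.67.

```python
from statistics import NormalDist

iq_1932 = NormalDist(mu=100, sigma=15)   # X: 1932 scores
iq_now = NormalDist(mu=120, sigma=15)    # Y: present-day scores on the 1932 test

# Z-transformation for the cut-off of 130 in each distribution
z_a = (130 - 100) / 15   # 2.0
z_b = (130 - 120) / 15   # about 0.667

p_a = 1 - iq_1932.cdf(130)   # upper tail, about 0.0228
p_b = 1 - iq_now.cdf(130)    # upper tail, about 0.252 (0.2514 with z rounded to 0.67)

print(z_a, round(z_b, 2))
print(round(p_a, 4), round(p_b, 3))
```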

211

Example 2

• Data collected on systolic blood pressure in normal healthy individuals are normally distributed with μ = 120 and σ = 10 mm Hg.

1)What proportion of normal healthy individuals have a systolic blood pressure above 130 mm Hg?

2)What proportion of normal healthy individuals have a systolic blood pressure between 100 and 140 mm Hg?

3)What level of systolic blood pressure cuts off the lower 95% of normal healthy individuals?

212

Solutions
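One way to work the three questions of Example 2 numerically is sketched below, assuming X ~ N(120, 10) as stated and using `statistics.NormalDist` from the Python standard library. It is an illustration rather than the original worked solution, and the variable names are chosen here.

```python
from statistics import NormalDist

bp = NormalDist(mu=120, sigma=10)   # systolic blood pressure, mm Hg

# 1) Proportion above 130 mm Hg: P(Z > 1)
p_above_130 = 1 - bp.cdf(130)

# 2) Proportion between 100 and 140 mm Hg: P(-2 < Z < 2)
p_100_to_140 = bp.cdf(140) - bp.cdf(100)

# 3) Level that cuts off the lower 95%: mean + 1.645 standard deviations
cutoff_95 = bp.inv_cdf(0.95)

print(round(p_above_130, 4))    # 0.1587
print(round(p_100_to_140, 4))   # 0.9545
print(round(cutoff_95, 2))      # 136.45
```

So about 16% of healthy individuals would be above 130 mm Hg, about 95% between 100 and 140 mm Hg, and 95% would fall below roughly 136 mm Hg.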

213

4.4.The Binomial distribution

It is one of the most widely encountered discrete distributions.

• The origin of the binomial distribution lies in Bernoulli trials. When a single trial of some experiment can result in only one of two mutually exclusive outcomes (success or failure; dead or alive; sick or well; male or female), the trial is called a Bernoulli trial.

• Suppose an event can have only binary outcomes A and B. Let the probability of A be π and that of B be 1 − π. The probability π stays the same each time the event occurs.

• If an experiment is repeated n times and the outcome is independent from one trial to another, the probability that outcome A occurs exactly x times is:

P(X = x) = [n! / (x!(n − x)!)] π^x (1 − π)^(n − x), for x = 0, 1, ..., n

216

The Binomial distribution…

217

Characteristics of a Binomial Distribution

• The experiment consists of n identical trials.

• There are only two possible outcomes on each trial.

• The probability of A remains the same from trial to trial. This probability is denoted by p (written π above), and the probability of B is denoted by q. Note that q = 1 − p.

• The trials are independent.

• The binomial random variable X is the number of A’s in n trials.

• n and π are the parameters of the binomial distribution.

• The mean is nπ and the variance is nπ(1 − π).
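The binomial probabilities and the stated mean and variance can be checked with a few lines of Python. The sketch assumes the standard binomial formula P(X = x) = C(n, x) π^x (1 − π)^(n − x); the function name `binomial_pmf` and the values n = 10 and π = 0.5 are hypothetical, chosen only for illustration.

```python
from math import comb


def binomial_pmf(x, n, pi):
    """Probability of exactly x occurrences of outcome A in n independent trials."""
    return comb(n, x) * pi ** x * (1 - pi) ** (n - x)


n, pi = 10, 0.5   # hypothetical example values

pmf = [binomial_pmf(x, n, pi) for x in range(n + 1)]

total = sum(pmf)                                               # probabilities sum to 1
mean = sum(x * p for x, p in enumerate(pmf))                   # equals n * pi
var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))      # equals n * pi * (1 - pi)

print(binomial_pmf(5, n, pi))   # 0.24609375
print(total, mean, var)

# "At most 2" is a cumulative probability: add the terms for x = 0, 1, 2
p_at_most_2 = sum(pmf[:3])
print(p_at_most_2)
```

Questions phrased as "at most" or "at least" are answered by summing the relevant individual probabilities in the same way.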

218

Exercise (Home work)

• Each child born to a particular set of parents has a probability of 0.25 of having blood type O. Suppose these parents have 5 children.

• What is the probability that:

a. Exactly two of them have blood type O?

b. At most 2 have blood type O?

c. At least 4 have blood type O?

d. 2 do not have blood type O?

219
